High-Performance Computing at the NIH

GATK on Helix

GATK (Genome Analysis Toolkit) is two things: first, a structured software library that makes it easy to write efficient analysis tools for next-generation sequencing data; and second, a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth-of-coverage analyzers, a quality score recalibrator, a SNP/indel caller, and a local realigner.

GATK is developed and maintained at the Broad Institute. GATK website

The GATK Resource Bundle (a collection of standard files for working with human resequencing data with the GATK) is available on Helix/Biowulf in /fdb/GATK_resource_bundle.

Running a GATK job on Helix

To set up the environment for GATK, load the GATK module with the command 'module load GATK'. This loads the latest version of GATK, sets the environment variable GATK_HOME, and sets up an alias, 'GATK', for 'java -jar /usr/local/GATK/GenomeAnalysisTK.jar'.
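Before launching a longer job, it can be worth confirming that the module actually set things up. The following is a minimal sketch, assuming a Bourne-style shell such as bash; check_gatk_env is a hypothetical helper name, not something the GATK module provides, and it checks only that GATK_HOME is set and that GenomeAnalysisTK.jar is present under it.

```shell
# Hypothetical helper (not part of the GATK module): confirm that GATK_HOME
# is set and that the GATK jar is present under it.
check_gatk_env() {
    if [ -z "${GATK_HOME:-}" ]; then
        echo "GATK_HOME is not set; run 'module load GATK' first" >&2
        return 1
    fi
    if [ ! -f "${GATK_HOME}/GenomeAnalysisTK.jar" ]; then
        echo "No GenomeAnalysisTK.jar found under ${GATK_HOME}" >&2
        return 1
    fi
    echo "Using ${GATK_HOME}/GenomeAnalysisTK.jar"
}
```

After 'module load GATK', running check_gatk_env should print the path of the jar that will be used, or return a non-zero status with a short message if something is missing.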

If you need a particular version of GATK, use the module commands to find and load that version. For example:

[user@helix]$ module avail GATK

------------- /usr/local/Modules/3.2.9/modulefiles ------------------
 GATK/1.5-11   GATK/1.6.13   GATK/2.0.36    GATK/2.1-11

[user@helix]$ module load  GATK/1.6.13

[user@helix]$ module list
Currently Loaded Modulefiles:
  1) GATK/1.6.13

The alias 'GATK' is set up to use 1 GB of memory. If you need more than that, set up your own GATK modulefile or use the full command

java -Xmx####m -jar $GATK_HOME/GenomeAnalysisTK.jar
instead of the GATK alias, where #### is the maximum Java heap size in MB that you want to use.
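If you use the full command often, a small wrapper can save typing. This is only a sketch, assuming a Bourne-style shell such as bash and that GATK_HOME has been set by 'module load GATK'; gatk_mem is a hypothetical function name, and the heap size is passed through Java's standard -Xmx option.

```shell
# Hypothetical wrapper (not provided by the GATK module): run GATK with a
# caller-chosen maximum Java heap size, given in MB as the first argument.
gatk_mem() {
    gatk_mem_mb="$1"   # heap size in MB, e.g. 4096
    shift              # remaining arguments are passed to GATK unchanged
    java "-Xmx${gatk_mem_mb}m" -jar "${GATK_HOME}/GenomeAnalysisTK.jar" "$@"
}
```

For example, 'gatk_mem 4096 -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads' would run the CountReads analysis with a 4 GB heap; 4096 is an arbitrary illustrative value.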

To get a brief list of the available options for GATK:

helix% module load GATK

helix% GATK -help
---------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v1.0.4418, Compiled 2010/10/03 21:55:47
Copyright (c) 2010 The Broad Institute
Please view our documentation at http://www.broadinstitute.org/gsa/wiki
For support, please view our support site at http://getsatisfaction.com/gsa
---------------------------------------------------------------------------
---------------------------------------------------------------------------
usage: java -jar GenomeAnalysisTK.jar -T  [-I ] [-im ] [-rbs ] 
       [-U ] [-SM ] [-rf ] [-B ] [-rgbl ] [-log 
       ] [-l ] [-L ] [-BTIMR ] [-debug] [-dfrac 
       ] [-D ] [-nt ] [-quiet] [-BTI ] [-h] [-S 
       ] [-dcov ] [-XL ] [-R ] [-OQ] [-et 
       ] [-dt ]

 -T,--analysis_type          Type of analysis to run
 -I,--input_file             SAM or BAM file(s)
 -im,--interval_merging      What interval merging rule should we use.
                             (ALL|OVERLAPPING_ONLY)
 -rbs,--read_buffer_size     Number of reads per SAM file to buffer in memory
 -U,--unsafe                 If set, enables unsafe operations: nothing will be checked

[...etc...]

You will also see an error message that no input was supplied; this can safely be ignored.

A set of example files from the Broad is in /usr/local/GATK/exampleFiles. The following example uses these files to count the number of reads in the BAM file:

helix% cd /data/user/gatk-test
helix% cp -r /usr/local/GATK/exampleFiles .
helix% cd exampleFiles
helix% GATK  -R exampleFASTA.fasta -I  exampleBAM.bam  -T CountReads 
INFO  11:10:23,367 HelpFormatter - --------------------------------------------------------------------------- 
INFO  11:10:23,370 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.0.4418, Compiled 2010/10/03 21:55:47 
INFO  11:10:23,370 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  11:10:23,370 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki 
INFO  11:10:23,370 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa 
INFO  11:10:23,370 HelpFormatter - Program Args: -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads -et NO_ET  
INFO  11:10:23,371 HelpFormatter - Date/Time: 2010/11/17 11:10:23 
INFO  11:10:23,371 HelpFormatter - --------------------------------------------------------------------------- 
INFO  11:10:23,371 HelpFormatter - --------------------------------------------------------------------------- 
INFO  11:10:23,372 AbstractGenomeAnalysisEngine - Strictness is SILENT 
INFO  11:10:23,507 TraversalEngine - [PROGRESS] Traversed to chr1:200, processing 1 reads in 0.04 secs (39000.00 secs per 1M reads) 
INFO  11:10:23,508 Walker - [REDUCE RESULT] Traversal result is: 33 
INFO  11:10:23,509 TraversalEngine - [PROGRESS] Traversed 33 reads in 0.04 secs (1333.33 secs per 1M reads) 
INFO  11:10:23,510 TraversalEngine - Total runtime 0.04 secs, 0.00 min, 0.00 hours 
INFO  11:10:23,514 TraversalEngine - 0 reads were filtered out during traversal out of 66 total (0.00%) 
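In a script, it can be handy to pull just the read count out of this log. The following is a minimal sketch, assuming a Bourne-style shell and that the log text has been captured to a file or pipe; traversal_result is a hypothetical helper name, and it relies only on the '[REDUCE RESULT]' line format shown in the output above.

```shell
# Hypothetical helper: read GATK log text on stdin and print the last field
# of the "[REDUCE RESULT]" line (for CountReads, the number of reads counted).
traversal_result() {
    awk '/\[REDUCE RESULT\]/ { print $NF }'
}
```

For example, 'traversal_result < countreads.log' would print 33 for the run above, where countreads.log is a hypothetical file holding that output.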

GATK supports multithreading. To enable it, add the -nt x argument to your command line, where x is the number of threads (cores) you want to use. On Helix, please use a maximum of 4 threads. On Biowulf, you can use all the cores on the allocated node.
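One way to avoid exceeding the Helix limit by accident is to cap the thread count in a script before passing it to -nt. This is a minimal sketch, assuming a Bourne-style shell; helix_threads is a hypothetical helper name, and the cap of 4 is simply the Helix limit stated above.

```shell
# Hypothetical helper: print the requested thread count, capped at the
# 4-thread maximum that applies on Helix.
helix_threads() {
    requested="$1"
    max=4
    if [ "$requested" -gt "$max" ]; then
        echo "$max"
    else
        echo "$requested"
    fi
}
```

The result can then be used as the value of -nt, for example -nt "$(helix_threads 8)", which passes 4. The cap is specific to Helix; on Biowulf the thread count would instead follow the cores of the allocated node.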

Documentation

GATK website