GATK (Genome Analysis Toolkit) is two things: first, a structured software library that makes it easy to write efficient analysis tools for next-generation sequencing data, and second, a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner.
GATK is developed and maintained at the Broad Institute. See the GATK website for details.
The GATK Resource Bundle (a collection of standard files for working with human resequencing data with the GATK) is available on Helix/Biowulf in /fdb/GATK_resource_bundle.
To set up the environment for GATK, load the GATK module with the command 'module load GATK'. This loads the latest version of GATK, sets the environment variable GATK_HOME, and defines an alias 'GATK' for 'java -jar /usr/local/GATK/GenomeAnalysisTK.jar'.
If you need a particular version of GATK, use the module commands to find and load that version. For example:
[user@helix]$ module avail GATK
------------- /usr/local/Modules/3.2.9/modulefiles ------------------
GATK/1.5-11  GATK/1.6.13  GATK/2.0.36  GATK/2.1-11
[user@helix]$ module load GATK/1.6.13
[user@helix]$ module list
Currently Loaded Modulefiles:
  1) GATK/1.6.13
The alias 'GATK' is set up to use 1 GB of memory. If you need more than that, set up your own GATK modulefile or use the full command below, where #### is the maximum Java heap size in MB:
java -Xmx####m -jar $GATK_HOME/GenomeAnalysisTK.jar
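As a minimal sketch of the full-command approach, the shell function below assembles a GATK command line with a chosen heap size, using Java's -Xmx (maximum heap) option. The function name gatk_cmd is hypothetical and not part of the GATK module. The sketch assumes $GATK_HOME has been set by 'module load GATK' and falls back to /usr/local/GATK, the jar location mentioned above, if it has not.

```shell
# Hypothetical helper (not provided by the GATK module): print the full
# GATK command line for a given maximum heap size in MB.
# Assumes $GATK_HOME is set by 'module load GATK'; otherwise falls back
# to /usr/local/GATK.
gatk_cmd() {
    mem_mb="${1:-}"
    # Reject a missing or non-numeric heap size before building the command
    case "$mem_mb" in
        ''|*[!0-9]*)
            echo "usage: gatk_cmd <heap_MB> [GATK arguments...]" >&2
            return 1
            ;;
    esac
    shift
    echo "java -Xmx${mem_mb}m -jar ${GATK_HOME:-/usr/local/GATK}/GenomeAnalysisTK.jar $*"
}

# Print a command that would count reads with a 4 GB heap
gatk_cmd 4096 -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads
```

The function only prints the command, so the line can be inspected, or pasted into a batch script, before anything is run.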
To get a brief list of the available options for GATK:
helix% module load GATK
helix% GATK -help
---------------------------------------------------------------------------
The Genome Analysis Toolkit (GATK) v1.0.4418, Compiled 2010/10/03 21:55:47
Copyright (c) 2010 The Broad Institute
Please view our documentation at http://www.broadinstitute.org/gsa/wiki
For support, please view our support site at http://getsatisfaction.com/gsa
---------------------------------------------------------------------------
---------------------------------------------------------------------------
usage: java -jar GenomeAnalysisTK.jar -T [-I ] [-im ] [-rbs ] [-U ] [-SM ] [-rf ]
       [-B ] [-rgbl ] [-log ] [-l ] [-L ] [-BTIMR ] [-debug] [-dfrac ] [-D ]
       [-nt ] [-quiet] [-BTI ] [-h] [-S ] [-dcov ] [-XL ] [-R ] [-OQ] [-et ] [-dt ]

 -T,--analysis_type       Type of analysis to run
 -I,--input_file          SAM or BAM file(s)
 -im,--interval_merging   What interval merging rule should we use. (ALL|OVERLAPPING_ONLY)
 -rbs,--read_buffer_size  Number of reads per SAM file to buffer in memory
 -U,--unsafe              If set, enables unsafe operations: nothing will be checked
[...etc...]
You will also see an error message that no input was supplied; this can safely be ignored.
A set of example files from the Broad is in /usr/local/GATK/exampleFiles. The following example uses these files to count the number of reads in the BAM file:
helix% cd /data/user/gatk-test
helix% cp -r /usr/local/GATK/exampleFiles .
helix% cd exampleFiles
helix% GATK -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads
INFO 11:10:23,367 HelpFormatter - ---------------------------------------------------------------------------
INFO 11:10:23,370 HelpFormatter - The Genome Analysis Toolkit (GATK) v1.0.4418, Compiled 2010/10/03 21:55:47
INFO 11:10:23,370 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 11:10:23,370 HelpFormatter - Please view our documentation at http://www.broadinstitute.org/gsa/wiki
INFO 11:10:23,370 HelpFormatter - For support, please view our support site at http://getsatisfaction.com/gsa
INFO 11:10:23,370 HelpFormatter - Program Args: -R exampleFASTA.fasta -I exampleBAM.bam -T CountReads -et NO_ET
INFO 11:10:23,371 HelpFormatter - Date/Time: 2010/11/17 11:10:23
INFO 11:10:23,371 HelpFormatter - ---------------------------------------------------------------------------
INFO 11:10:23,371 HelpFormatter - ---------------------------------------------------------------------------
INFO 11:10:23,372 AbstractGenomeAnalysisEngine - Strictness is SILENT
INFO 11:10:23,507 TraversalEngine - [PROGRESS] Traversed to chr1:200, processing 1 reads in 0.04 secs (39000.00 secs per 1M reads)
INFO 11:10:23,508 Walker - [REDUCE RESULT] Traversal result is: 33
INFO 11:10:23,509 TraversalEngine - [PROGRESS] Traversed 33 reads in 0.04 secs (1333.33 secs per 1M reads)
INFO 11:10:23,510 TraversalEngine - Total runtime 0.04 secs, 0.00 min, 0.00 hours
INFO 11:10:23,514 TraversalEngine - 0 reads were filtered out during traversal out of 66 total (0.00%)
GATK supports multithreading. To enable it, add the -nt x argument to your command line, where x is the number of threads (cores) you want to use. On Helix, please use at most 4 threads. On Biowulf, you can use all the cores on the allocated node.
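The thread limits above can be sketched as a small shell helper that chooses the value passed to -nt. This is only an illustration under stated assumptions: the function name gatk_threads is hypothetical, matching the host name against helix* is an assumed way to detect Helix, and CountLoci is used purely as an example analysis type. Whether a given analysis type honours -nt should be checked in its own documentation.

```shell
# Hypothetical helper: choose a value for -nt.
#   $1 = number of threads requested
#   $2 = host name (optional; defaults to the output of `hostname`)
# Assumption: Helix login hosts have names beginning with "helix".
gatk_threads() {
    requested="$1"
    host="${2:-$(hostname)}"
    case "$host" in
        helix*)
            # Helix is a shared system: cap the request at 4 threads
            if [ "$requested" -gt 4 ]; then
                requested=4
            fi
            ;;
    esac
    echo "$requested"
}

# Requesting 8 threads on Helix is reduced to the 4-thread limit
nt=$(gatk_threads 8 helix)
# CountLoci is an illustrative analysis type; substitute your own
echo "GATK -R exampleFASTA.fasta -I exampleBAM.bam -T CountLoci -nt $nt"
```

On a Biowulf node the helper returns the requested value unchanged, so the caller is responsible for requesting no more threads than the allocated node has cores.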

