Table 7 GATK steps

From: Challenges in exome analysis by LifeScope and its alternative computational pipelines

Step

Command

Description

Input BAM

java -jar MarkDuplicates.jar INPUT=your_bam_file OUTPUT=step1.bam METRICS_FILE=Fmetrics_step1.bam ASSUME_SORTED=true

Marking duplicates

Step 1.

java -jar AddOrReplaceReadGroups.jar INPUT=step1.bam OUTPUT=step2.bam RGID=Read_Group_ID RGLB=Read_Group_Library RGPL=platform RGPU=platform_unit RGSM=sample_name RGDS=Read_Group_Description RGDT=Read_Group_Run_Date

Replacing all read groups in the INPUT file with a new read group

Step 2.

java -jar ReorderSam.jar INPUT=step2.bam OUTPUT=step3.bam REFERENCE=ucsc.hg19.fasta

Reordering reads in the BAM file to match the contig ordering in the provided reference file

Step 3.

java -jar SortSam.jar INPUT=step3.bam OUTPUT=step4.bam SORT_ORDER=coordinate

Sorting the aligned reads by coordinate order

Step 4.

java -jar BuildBamIndex.jar INPUT=step4.bam

Generating BAM index

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ucsc.hg19.fasta -S STRICT -I step4.bam -o indels.intervals -allowPotentiallyMisencodedQuals

Indel Realignment I (Creating a target list of intervals to be realigned)

java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ucsc.hg19.fasta -S STRICT -I step4.bam -targetIntervals indels.intervals -o step5.bam -known Mills_and_1000G_gold_standard.indels.hg19.vcf -known 1000G_phase1.indels.hg19.vcf -allowPotentiallyMisencodedQuals

Indel Realignment II (Performing realignment of the target intervals)

Step 5.

java -jar SortSam.jar INPUT=step5.bam OUTPUT=step6.bam SORT_ORDER=coordinate

Sorting the aligned reads by coordinate order

Step 6.

java -jar BuildBamIndex.jar INPUT=step6.bam

Generating BAM index

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I step6.bam -R ucsc.hg19.fasta -S STRICT -knownSites dbsnp_138.hg19.vcf -o recal.grp --covariate QualityScoreCovariate --covariate ReadGroupCovariate --covariate ContextCovariate --covariate CycleCovariate --solid_nocall_strategy PURGE_READ --solid_recal_mode SET_Q_ZERO_BASE_N -allowPotentiallyMisencodedQuals

Base quality score recalibration I (data-driven adjustment of base quality scores)

java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -S STRICT -I step6.bam -T PrintReads -o step7.bam -BQSR recal.grp -allowPotentiallyMisencodedQuals

Base quality score recalibration II (Applying the recalibration to sequence data)

Step 7.

java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T HaplotypeCaller -I step7.bam -S STRICT --dbsnp dbsnp_138.hg19.vcf -minPruning 3 -o step8.vcf -stand_call_conf 50 -stand_emit_conf 30

Calling variants in sequence data

Step 8.

java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T SelectVariants --variant step8.vcf -o step9_SNP.vcf -selectType SNP -S STRICT

Selecting SNPs from the input file

Step 9.

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator --input step9_SNP.vcf -R ucsc.hg19.fasta -S STRICT -resource:1000G,known=false,training=true,truth=false,prior=10 1000G_phase1.snps.high_confidence.hg19.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg19.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.hg19.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.hg19.vcf -an QD -an MQRankSum -an ReadPosRankSum -an FS -an MQ --maxGaussians 4 -mode SNP -recalFile recal -tranchesFile tranches

Building SNP recalibration model

java -jar GenomeAnalysisTK.jar -R ucsc.hg19.fasta -T ApplyRecalibration -S STRICT --input step9_SNP.vcf -ts_filter_level 99.5 -mode SNP -tranchesFile tranches -recalFile recal -o step10_final.vcf

Applying SNP recalibration model
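
Because each row of the table consumes the file written by the previous row, the Picard rows (MarkDuplicates through the first BuildBamIndex) can be chained in a small wrapper. The following is a minimal sketch, not the authors' script: the helper names `run` and `preprocess` and the `DRY_RUN` switch are assumptions introduced for illustration, the jars are assumed to sit in the working directory, and the read-group values are the table's placeholders rather than real values.

```shell
#!/bin/sh
# Sketch only: chains the five Picard rows of the table (MarkDuplicates
# through the first BuildBamIndex). `run`, `preprocess` and DRY_RUN are
# hypothetical helpers, not part of Picard or of the source table.
set -eu   # stop at the first failing step or unset variable

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# $1: input BAM (the table's your_bam_file)
# $2: reference FASTA, defaulting to the file named in the table
preprocess() {
    in_bam="$1"
    ref="${2:-ucsc.hg19.fasta}"

    run java -jar MarkDuplicates.jar INPUT="$in_bam" OUTPUT=step1.bam \
        METRICS_FILE=Fmetrics_step1.bam ASSUME_SORTED=true

    # Read-group values are the table's placeholders; replace them with
    # the real values for your run before executing.
    run java -jar AddOrReplaceReadGroups.jar INPUT=step1.bam OUTPUT=step2.bam \
        RGID=Read_Group_ID RGLB=Read_Group_Library RGPL=platform \
        RGPU=platform_unit RGSM=sample_name \
        RGDS=Read_Group_Description RGDT=Read_Group_Run_Date

    run java -jar ReorderSam.jar INPUT=step2.bam OUTPUT=step3.bam REFERENCE="$ref"
    run java -jar SortSam.jar INPUT=step3.bam OUTPUT=step4.bam SORT_ORDER=coordinate
    run java -jar BuildBamIndex.jar INPUT=step4.bam
}
```

After sourcing the sketch, `DRY_RUN=1 preprocess your_bam_file` prints the five commands without executing them, which is a cheap way to check file names before a long run. Without `DRY_RUN`, `set -e` stops the chain at the first command that fails, so a later step never reads a missing or partial BAM.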