GATK pipline to processing raw read data

4 minute read

Published:

I’m sharing with you a sketch of comprehensive pipline for processing raw sequencing data. The pipeline mainly employs Bash and Python. The bioinformatic tools I’m using including samtools, Picard, GATK and bwa. I’m outlining the steps below:

Bioinformatics Pipeline for Variant Calling

This pipeline accepts raw sequencing data (in FASTQ format) and processes it to produce annotated variant calls. It’s a basic yet comprehensive genomic variant discovery workflow.

Tools Used

  1. BWA (Burrows-Wheeler Aligner)
  2. SAMtools
  3. Picard
  4. GATK (Genome Analysis Toolkit)
  5. Python

Pipeline Steps

  1. Map the reads to the reference genome using BWA-MEM.
  2. Convert the output SAM file to a BAM file and sort it using SAMtools.
  3. Mark duplicates in the sorted BAM file using Picard.
  4. Recalibrate the base quality scores of the BAM file using GATK’s BaseRecalibrator and ApplyBQSR.
  5. Call variants using GATK’s HaplotypeCaller in GVCF mode.
  6. Consolidate GVCFs using GATK’s CombineGVCFs.
  7. Perform joint genotyping on the cohort using GATK’s GenotypeGVCFs to produce raw SNPs and indels.
  8. Filter variants using GATK’s VariantFiltration.
  9. Refine the genotypes using GATK’s VariantRecalibrator and ApplyVQSR.
  10. Annotate the variants using GATK’s VariantAnnotator.

Processing raw data to variant calling

#!/bin/bash

# Define input files
R1="r1.fastq"
R2="r2.fastq"
REF="ref.fasta"

# 1. Map reads to reference genome
bwa mem $REF $R1 $R2 > aln.sam

# 2. Sort the mapping
samtools sort aln.sam -o sorted.bam

# 3. Mark duplicates
picard MarkDuplicates I=sorted.bam O=marked.bam M=marked_dup_metrics.txt

# 4. Recalibrate base quality scores
gatk BaseRecalibrator -I marked.bam -R $REF --known-sites dbsnp.vcf -O recal_data.table
gatk ApplyBQSR -R $REF -I marked.bam --bqsr-recal-file recal_data.table -O recalibrated.bam

# 5. Call variants using HaplotypeCaller in GVCF mode
gatk HaplotypeCaller -R $REF -I recalibrated.bam -O output.g.vcf.gz -ERC GVCF

# 6. Consolidate GCVFs
gatk CombineGVCFs -R $REF -V output.g.vcf.gz -O combined.g.vcf.gz

# 7. Joint-call the cohort
gatk GenotypeGVCFs -R $REF -V combined.g.vcf.gz -O raw_variants.vcf

# 8. Filter the variants
gatk VariantFiltration -R $REF -V raw_variants.vcf -O filtered_variants.vcf --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" --filter-name "my_snp_filter"

# 9. Refine genotypes
gatk VariantRecalibrator -R $REF -V filtered_variants.vcf --trust-all-polymorphic -O recalibration_report.grp
gatk ApplyVQSR -R $REF -V filtered_variants.vcf --ts-filter-level 99.0 -O recalibrated_variants.vcf --recal-file recalibration_report.grp

# 10. Annotate variants
gatk VariantAnnotator -R $REF -V recalibrated_variants.vcf -O annotated_variants.vcf

Calculate Average QUAL score in Python

import vcf

vcf_reader = vcf.Reader(open('annotated_variants.vcf', 'r'))
qual_scores = [record.QUAL for record in vcf_reader if record.QUAL is not None]
average_qual = sum(qual_scores) / len(qual_scores) if qual_scores else 0

print(f"Average QUAL score: {average_qual}")

My thoughts:

The pipeline outlined here is a simplified yet comprehensive genomic variant discovery workflow, starting from raw sequencing data and ending up with annotated variants. This pipeline combines several industry-standard tools like BWA, SAMtools, Picard, and GATK, reflecting common practices in bioinformatics.

Here are some thoughts:

Flexibility: This pipeline is fairly adaptable to a variety of genome analysis tasks. However, the parameter settings and certain steps might need to be tailored based on the specifics of the dataset and research question.

Robustness: Although it is simplified, the pipeline includes key steps such as duplicate marking, base quality score recalibration, and variant quality score recalibration, which can help to control for common sources of error and bias in high-throughput sequencing data.

Ease of Use: By encapsulating the pipeline in a script, it can be executed with minimal user intervention. This not only reduces manual labor and potential for error, but it also makes the pipeline easily repeatable and shareable.

Improvement: This pipeline could be enhanced in several ways. For instance, adding parallelization could significantly speed up computation. Incorporating more extensive logging and error checking would make the pipeline more robust and easier to debug. Including steps for visualizing the data and results would make it more user-friendly.

Interpretability: The final output includes annotated variants, which provide additional context that is invaluable for interpreting the biological significance of the variants.

Limitation: The pipeline is designed for diploid organisms. For polyploid organisms, modifications will be needed, especially at the variant calling stage.