Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.