Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.


Filtering and QC

FreeBayes, and perhaps other tools, tend to be quite promiscuous in calling variants, leaving it to the analyst to perform downstream filtration suited to their specific goals. That freedom has given rise to a variety of techniques, from which it can be difficult to distinguish best practices. Serendipitously, just as I got to this step, Heng Li posted a preprint exploring variant filtration for deep WGS like mine, with a clever benchmark design and more than a few invaluable nuggets of wisdom. 

Based on that manuscript, I used bedtools to remove variants overlapping regions of the reference assembly with low sequence complexity, which tend to attract spurious read mappings leading to erroneous variant calls. Then I used vcffilter on the attributes of each variant, which are specified in the lengthy "info" field:
  • $\operatorname{DP} < 59$, excluding regions with excessive read depth, probably also associated with spurious mappings. (59 is two Poisson standard deviations above my data's 40X coverage.)
  • $\operatorname{SAF} > 1$, $\operatorname{SAR} > 1$, requiring multiple observations of the alternate (non-reference) allele in reads from both DNA strands
  • $\operatorname{AB} \leq 0.8$, requiring reasonable overall balance between the strands of the observed reads
  • $\operatorname{QUAL} \geq 30$, a threshold on the FreeBayes-calculated variant quality score
As always, my workflow is on DNAnexus (free account required) and the underlying source code is on GitHub

Overall, the filtering removed 1.2 million out of 4.9 million raw variant calls, including about 800,000 out of 4.1 million biallelic SNVs. Here are a couple plots showing the before-and-after effects:

Before filteringAfter filtering
They show the distribution of QUAL versus the number of reads supporting the alternate allele (Alternate Observations, AO). The filters remove a lot of calls with low quality scores, as expected, and many others as well. Two remaining modes in the distribution reflect heterozygous and homozygous SNVs, the overwhelming majority far above the QUAL threshold.

One way to quantify the benefits of the filtration is to look at the ratio of transitions and transversions observed in the called SNVs. These two types of SNVs are expected to occur in a certain rough proportion, owing to the chemical structure of DNA and mechanisms of its mutation and repair. The Ts/Tv ratio would come out to 0.5 in random garbage sampled from the four nucleotides uniformly, whereas we like to see it north of 2.0 in high-quality samples. The filtration steps improved the Ts/Tv of my calls from 1.9 to 2.2.

Another aspect we can look at is the distribution of variants across the chromosomes. Here are a few:
We expect more X variants than the other chromosomes shown here, since 19-22 are the smallest autosomes. The SNV calls on the autosomes tend to be about 56% heterozygous and 44% homozygous after filtering. So it's notable that X is vastly depleted of heterozygous calls; we should very well hope so, since I've only got one X! The same should be true for the Y, but we don't see that reflected proportionately. This is an early indication of problems current WGS technologies have with the Y - a story for another time.

One more sanity check

Let's try a slightly more involved sanity check, by comparing my variants to data from the 1000 Genomes Project. Their VCF files contain genome-wide variant calls for 1,092 individuals (close enough!), sampled from a variety of global populations. For each variant site, they've also taken the helpful step of inferring the ancestral allele - the allele that was likely prevalent in the primate lineage leading to modern humans - based on the genomes of other primates that've been sequenced. The ancestral allele information is helpful to reduce the degree to which analyses are skewed by the particular humans' genomes that went into making the hs37d5 reference.

Let $G$ be a matrix whose rows represent the 1,092 individuals and columns correspond to biallelic SNV sites across the genome, where each entry $G_{ij} \in \left\{0,1,2\right\}$ is the number of copies of the non-ancestral (derived) allele individual $i$ possesses at site $j$. I computed $G$ for about 115,000 autosomal sites - a tiny sliver of the available data, but quite adequate for this exercise - and subjected it to principal component analysis, which reduces the 115,000 observations for each individual into a chosen few dimensions that maximally capture the variance across the dataset $G$.

The following plot shows the projection of each of the 1,092 individuals (rows of $G$) onto the first two principal components (code).

pca <- prcomp(G,scale=FALSE); qplot(pca\$x[,1], pca\$x[,2])

We seem to have three distinct clusters of genetic variation, plus a number of individuals showing mixture between them. The meaning becomes quite clear when we label each individual's ethnicity:
In this single plot, we can see reflections of the native peopling of the Americas by migration across Beringia from northeastern Asia, 15,000 years ago; European colonization of the Caribbean and Central America, starting 600 years ago; and the rapidly ensuing arrival of Africans in North America - mostly under the worst of circumstances - with humans then doing as humans do to create genetic admixture.

It's now pretty straightforward to take my VCF file, create a row of $G$ representing myself, and project it onto the previously-determined principal components:

Stop the presses!!

So, another sanity check on my filtered VCF seems to look good, insofar as the called genotypes generally show concordance with other individuals of my ethnicity. As far as genomic ancestry analysis though, this has been at best a "prosumer" approach, if not completely amateurish. That's because, while it's treated all the variant sites as more-or-less independent observations, those nearby on the same chromosome actually tend to inherit certain joint configurations or haplotypes. That process can be modeled in great detail, which I'll get to, but first there's a lot more to do with my newly-validated variant calls!