Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.

Filtering and QC

FreeBayes, and perhaps other tools, tend to be quite promiscuous in calling variants, leaving it to the analyst to perform downstream filtration suited to their specific goals. That freedom has given rise to a variety of techniques, from which it can be difficult to distinguish best practices. Serendipitously, just as I got to this step, Heng Li posted a preprint exploring variant filtration for deep WGS like mine, with a clever benchmark design and more than a few invaluable nuggets of wisdom. 

Based on that manuscript, I used bedtools to remove variants overlapping regions of the reference assembly with low sequence complexity, which tend to attract spurious read mappings leading to erroneous variant calls. Then I used vcffilter on the attributes of each variant, which are specified in the lengthy "info" field:
  • $\operatorname{DP} < 59$, excluding regions with excessive read depth, probably also associated with spurious mappings. (59 is two Poisson standard deviations above my data's 40X coverage.)
  • $\operatorname{SAF} > 1$, $\operatorname{SAR} > 1$, requiring multiple observations of the alternate (non-reference) allele in reads from both DNA strands
  • $\operatorname{AB} \leq 0.8$, requiring reasonable overall balance between the strands of the observed reads
  • $\operatorname{QUAL} \geq 30$, a threshold on the FreeBayes-calculated variant quality score
As always, my workflow is on DNAnexus (free account required) and the underlying source code is on GitHub

Overall, the filtering removed 1.2 million out of 4.9 million raw variant calls, including about 800,000 out of 4.1 million biallelic SNVs. Here are a couple plots showing the before-and-after effects:

Before filteringAfter filtering
They show the distribution of QUAL versus the number of reads supporting the alternate allele (Alternate Observations, AO). The filters remove a lot of calls with low quality scores, as expected, and many others as well. Two remaining modes in the distribution reflect heterozygous and homozygous SNVs, the overwhelming majority far above the QUAL threshold.

One way to quantify the benefits of the filtration is to look at the ratio of transitions and transversions observed in the called SNVs. These two types of SNVs are expected to occur in a certain rough proportion, owing to the chemical structure of DNA and mechanisms of its mutation and repair. The Ts/Tv ratio would come out to 0.5 in random garbage sampled from the four nucleotides uniformly, whereas we like to see it north of 2.0 in high-quality samples. The filtration steps improved the Ts/Tv of my calls from 1.9 to 2.2.

Another aspect we can look at is the distribution of variants across the chromosomes. Here are a few:
We expect more X variants than the other chromosomes shown here, since 19-22 are the smallest autosomes. The SNV calls on the autosomes tend to be about 56% heterozygous and 44% homozygous after filtering. So it's notable that X is vastly depleted of heterozygous calls; we should very well hope so, since I've only got one X! The same should be true for the Y, but we don't see that reflected proportionately. This is an early indication of problems current WGS technologies have with the Y - a story for another time.

One more sanity check

Let's try a slightly more involved sanity check, by comparing my variants to data from the 1000 Genomes Project. Their VCF files contain genome-wide variant calls for 1,092 individuals (close enough!), sampled from a variety of global populations. For each variant site, they've also taken the helpful step of inferring the ancestral allele - the allele that was likely prevalent in the primate lineage leading to modern humans - based on the genomes of other primates that've been sequenced. The ancestral allele information is helpful to reduce the degree to which analyses are skewed by the particular humans' genomes that went into making the hs37d5 reference.

Let $G$ be a matrix whose rows represent the 1,092 individuals and columns correspond to biallelic SNV sites across the genome, where each entry $G_{ij} \in \left\{0,1,2\right\}$ is the number of copies of the non-ancestral (derived) allele individual $i$ possesses at site $j$. I computed $G$ for about 115,000 autosomal sites - a tiny sliver of the available data, but quite adequate for this exercise - and subjected it to principal component analysis, which reduces the 115,000 observations for each individual into a chosen few dimensions that maximally capture the variance across the dataset $G$.

The following plot shows the projection of each of the 1,092 individuals (rows of $G$) onto the first two principal components (code).

pca <- prcomp(G,scale=FALSE); qplot(pca\$x[,1], pca\$x[,2])

We seem to have three distinct clusters of genetic variation, plus a number of individuals showing mixture between them. The meaning becomes quite clear when we label each individual's ethnicity:
In this single plot, we can see reflections of the native peopling of the Americas by migration across Beringia from northeastern Asia, 15,000 years ago; European colonization of the Caribbean and Central America, starting 600 years ago; and the rapidly ensuing arrival of Africans in North America - mostly under the worst of circumstances - with humans then doing as humans do to create genetic admixture.

It's now pretty straightforward to take my VCF file, create a row of $G$ representing myself, and project it onto the previously-determined principal components:

Stop the presses!!

So, another sanity check on my filtered VCF seems to look good, insofar as the called genotypes generally show concordance with other individuals of my ethnicity. As far as genomic ancestry analysis though, this has been at best a "prosumer" approach, if not completely amateurish. That's because, while it's treated all the variant sites as more-or-less independent observations, those nearby on the same chromosome actually tend to inherit certain joint configurations or haplotypes. That process can be modeled in great detail, which I'll get to, but first there's a lot more to do with my newly-validated variant calls!


  1. Now a large number of young and educated people create very creative and innovative services that help people. An example is the essay writing service They are very responsible in writing each of their essays, so I trust them completely. If you also need help like me, write to them as soon as you can.

  2. It takes two weeks for a new company registration or shelf company transfer, in most cases. However, the set-up or transfer process will take up to 15 business days, if the shareholder of Romanian company is an overseas corporate entity
    VAT registration requires up to 1 month in most cases (Romanian tax authorities may request additional information)
    For a ready-to-use bank account a minimum of 15-20 days is required. (Account activation is possible only upon receipt of the required information and hand-signed application forms)

  3. Students find Speech Writing Services as being of great assistance since they are able to seek our professional speech writing services and online custom speech writing help on time.

  4. Genetic studies also enabled genome sequencing data to be collected by several individuals in the next year. dissertation help uk

  5. filmstarlook is now providing best celebrity leather jackets. at our website you can get all the new leahter jackes of celebrities and can get all the custom designs. visit to our store to get the best services.
    Aviator B3 sheepskin leather jacket

  6. If you are experiencing stress from your academic trials and you have little time to complete such a difficult task, you can choose an essay to help the service to find quick and easy solutions to your problems. You have access to a large catalog where you can choose from a list of hundreds of essay authors who specialize in the art of writing website.

  7. Wow. It's so incredible. I had read all your series of blog posts about your genome. And I even have no words to say how detailed you write about it. But I can only use professional term paper help because I have no ideas how to write so good like you.

  8. By delivering our services, we do not undervalue students' strengths or talents; instead, we provide a roadmap for high-quality assignments that can help you achieve your main objectives. We can provide Free reference generator to ensure that your work, essays, research papers, and assignments are properly completed.