Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.


Filtering and QC

FreeBayes, and perhaps other tools, tend to be quite promiscuous in calling variants, leaving it to the analyst to perform downstream filtration suited to their specific goals. That freedom has given rise to a variety of techniques, from which it can be difficult to distinguish best practices. Serendipitously, just as I got to this step, Heng Li posted a preprint exploring variant filtration for deep WGS like mine, with a clever benchmark design and more than a few invaluable nuggets of wisdom. 

Based on that manuscript, I used bedtools to remove variants overlapping regions of the reference assembly with low sequence complexity, which tend to attract spurious read mappings leading to erroneous variant calls. Then I used vcffilter on the attributes of each variant, which are specified in the lengthy "info" field:
  • $\operatorname{DP} < 59$, excluding regions with excessive read depth, probably also associated with spurious mappings. (59 is two Poisson standard deviations above my data's 40X coverage.)
  • $\operatorname{SAF} > 1$, $\operatorname{SAR} > 1$, requiring multiple observations of the alternate (non-reference) allele in reads from both DNA strands
  • $\operatorname{AB} \leq 0.8$, requiring reasonable overall balance between the strands of the observed reads
  • $\operatorname{QUAL} \geq 30$, a threshold on the FreeBayes-calculated variant quality score
As always, my workflow is on DNAnexus (free account required) and the underlying source code is on GitHub

Overall, the filtering removed 1.2 million out of 4.9 million raw variant calls, including about 800,000 out of 4.1 million biallelic SNVs. Here are a couple plots showing the before-and-after effects:

Before filteringAfter filtering
They show the distribution of QUAL versus the number of reads supporting the alternate allele (Alternate Observations, AO). The filters remove a lot of calls with low quality scores, as expected, and many others as well. Two remaining modes in the distribution reflect heterozygous and homozygous SNVs, the overwhelming majority far above the QUAL threshold.

One way to quantify the benefits of the filtration is to look at the ratio of transitions and transversions observed in the called SNVs. These two types of SNVs are expected to occur in a certain rough proportion, owing to the chemical structure of DNA and mechanisms of its mutation and repair. The Ts/Tv ratio would come out to 0.5 in random garbage sampled from the four nucleotides uniformly, whereas we like to see it north of 2.0 in high-quality samples. The filtration steps improved the Ts/Tv of my calls from 1.9 to 2.2.

Another aspect we can look at is the distribution of variants across the chromosomes. Here are a few:
We expect more X variants than the other chromosomes shown here, since 19-22 are the smallest autosomes. The SNV calls on the autosomes tend to be about 56% heterozygous and 44% homozygous after filtering. So it's notable that X is vastly depleted of heterozygous calls; we should very well hope so, since I've only got one X! The same should be true for the Y, but we don't see that reflected proportionately. This is an early indication of problems current WGS technologies have with the Y - a story for another time.

One more sanity check

Let's try a slightly more involved sanity check, by comparing my variants to data from the 1000 Genomes Project. Their VCF files contain genome-wide variant calls for 1,092 individuals (close enough!), sampled from a variety of global populations. For each variant site, they've also taken the helpful step of inferring the ancestral allele - the allele that was likely prevalent in the primate lineage leading to modern humans - based on the genomes of other primates that've been sequenced. The ancestral allele information is helpful to reduce the degree to which analyses are skewed by the particular humans' genomes that went into making the hs37d5 reference.

Let $G$ be a matrix whose rows represent the 1,092 individuals and columns correspond to biallelic SNV sites across the genome, where each entry $G_{ij} \in \left\{0,1,2\right\}$ is the number of copies of the non-ancestral (derived) allele individual $i$ possesses at site $j$. I computed $G$ for about 115,000 autosomal sites - a tiny sliver of the available data, but quite adequate for this exercise - and subjected it to principal component analysis, which reduces the 115,000 observations for each individual into a chosen few dimensions that maximally capture the variance across the dataset $G$.

The following plot shows the projection of each of the 1,092 individuals (rows of $G$) onto the first two principal components (code).

pca <- prcomp(G,scale=FALSE); qplot(pca\$x[,1], pca\$x[,2])

We seem to have three distinct clusters of genetic variation, plus a number of individuals showing mixture between them. The meaning becomes quite clear when we label each individual's ethnicity:
In this single plot, we can see reflections of the native peopling of the Americas by migration across Beringia from northeastern Asia, 15,000 years ago; European colonization of the Caribbean and Central America, starting 600 years ago; and the rapidly ensuing arrival of Africans in North America - mostly under the worst of circumstances - with humans then doing as humans do to create genetic admixture.

It's now pretty straightforward to take my VCF file, create a row of $G$ representing myself, and project it onto the previously-determined principal components:

Stop the presses!!

So, another sanity check on my filtered VCF seems to look good, insofar as the called genotypes generally show concordance with other individuals of my ethnicity. As far as genomic ancestry analysis though, this has been at best a "prosumer" approach, if not completely amateurish. That's because, while it's treated all the variant sites as more-or-less independent observations, those nearby on the same chromosome actually tend to inherit certain joint configurations or haplotypes. That process can be modeled in great detail, which I'll get to, but first there's a lot more to do with my newly-validated variant calls!

19 comments:

  1. Now a large number of young and educated people create very creative and innovative services that help people. An example is the essay writing service https://essaywriter.org/ They are very responsible in writing each of their essays, so I trust them completely. If you also need help like me, write to them as soon as you can.

    ReplyDelete
  2. It takes two weeks for a new company registration or shelf company transfer, in most cases. However, the set-up or transfer process will take up to 15 business days, if the shareholder of Romanian company is an overseas corporate entity
    VAT registration requires up to 1 month in most cases (Romanian tax authorities may request additional information)
    For a ready-to-use bank account a minimum of 15-20 days is required. (Account activation is possible only upon receipt of the required information and hand-signed application forms) http://www.confiduss.com/en/jurisdictions/romania/business/company-formation/

    ReplyDelete
  3. Students find Speech Writing Services as being of great assistance since they are able to seek our professional speech writing services and online custom speech writing help on time.

    ReplyDelete
  4. Genetic studies also enabled genome sequencing data to be collected by several individuals in the next year. dissertation help uk

    ReplyDelete
  5. filmstarlook is now providing best celebrity leather jackets. at our website you can get all the new leahter jackes of celebrities and can get all the custom designs. visit to our store to get the best services.
    Aviator B3 sheepskin leather jacket

    ReplyDelete
  6. If you are experiencing stress from your academic trials and you have little time to complete such a difficult task, you can choose an essay to help the service to find quick and easy solutions to your problems. You have access to a large catalog where you can choose from a list of hundreds of essay authors who specialize in the art of writing website.

    ReplyDelete
  7. Wow. It's so incredible. I had read all your series of blog posts about your genome. And I even have no words to say how detailed you write about it. But I can only use professional term paper help because I have no ideas how to write so good like you.

    ReplyDelete
  8. By delivering our services, we do not undervalue students' strengths or talents; instead, we provide a roadmap for high-quality assignments that can help you achieve your main objectives. We can provide Free reference generator to ensure that your work, essays, research papers, and assignments are properly completed.

    ReplyDelete
  9. Your work will be done up to par and within an acceptable period of time. Best Available writers GradeMiners are great when you need an essay or a similar type of paper to be done in a couple of days. But if your paper has a burning deadline…

    ReplyDelete
  10. Nowadays, students enjoy a plethora of advantages. One of them is an assignment writing service. As a student, you may simply ask for assistance to make even the most difficult tasks easy. For example, if you believe that essays are difficult to complete, remain cool and contact an essay helper. There are many brilliant professionals willing to assist students like you in their academic endeavours. You can count on their dedication, hard effort, and knowledge. Find a legitimate and dependable essay writing service. It's a decision you'll never regret in your life.

    ReplyDelete
  11. Your writing is perfect and complete. However, I think it will be more wonderful if your post includes additional topics that I am thinking of. I have a lot of posts on my site similar to your topic. help in writing dissertation

    ReplyDelete
  12. Australian students can now relax and get expert help with Assignment Help Australia. Students can enjoy some extra time along with best and timely assignment submission. GAH ensures that students should get A+ grades with our experts.

    ReplyDelete
  13. A great year and great academic achievements both are incomplete without the college assignments. Hilarious may it sound but it's true, the wonderful troubles might have ebbed away after a long time but I still remember taking joy in teasing my friends when deadline was nearing and they were still to complete their assignments. Little did they knew that it was the greatassinmenthelper.com and their team of commited assignment help who made all of it a piece of cake for me. Now that my academic days are over I think that it would be of great help to you now.

    ReplyDelete
  14. Assignments have become a new norm in modern education, necessitating the use of Assignment helper and help with assignments Websites. We are one of the organisations that has been assisting students for many years. Our trustworthiness is a major factor in our success. Once you've trusted us with the most crucial aspect of your academic life, discover how we make it shine.

    ReplyDelete
  15. Yo! Our grademiners service believe in quality and commitment which makes us so different from our competitors and thus always keeps our customers satisfied.We deal in avariety of essay topics such asScience,History,Literature, Sociology and etc.

    ReplyDelete
  16. The astounding Ashley Massis, a previous design purchaser, beautician, and expert, told me, "A quality cowhide coat can go somewhere in the range of $300 to men's fashion jackets for sale upwards of $1000 or more. While you're taking a gander at your spending plan, a couple of things should be thought about for the ideal piece."

    ReplyDelete
  17. In today’s world people are very much stucked in their personal works, same is happening with the students that’s why we have strated the services of the IT Assignment Help . Connect with us to avail this premium offer.

    ReplyDelete
  18. Thank You for Providing Such insightful information. If someone is looking for the QuickBooks Customer Service in US.

    ReplyDelete