Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.


Filtering and QC

FreeBayes, and perhaps other tools, tend to be quite promiscuous in calling variants, leaving it to the analyst to perform downstream filtration suited to their specific goals. That freedom has given rise to a variety of techniques, from which it can be difficult to distinguish best practices. Serendipitously, just as I got to this step, Heng Li posted a preprint exploring variant filtration for deep WGS like mine, with a clever benchmark design and more than a few invaluable nuggets of wisdom. 

Based on that manuscript, I used bedtools to remove variants overlapping regions of the reference assembly with low sequence complexity, which tend to attract spurious read mappings leading to erroneous variant calls. Then I used vcffilter on the attributes of each variant, which are specified in the lengthy "info" field:
  • $\operatorname{DP} < 59$, excluding regions with excessive read depth, probably also associated with spurious mappings. (59 is two Poisson standard deviations above my data's 40X coverage.)
  • $\operatorname{SAF} > 1$, $\operatorname{SAR} > 1$, requiring multiple observations of the alternate (non-reference) allele in reads from both DNA strands
  • $\operatorname{AB} \leq 0.8$, requiring reasonable overall balance between the strands of the observed reads
  • $\operatorname{QUAL} \geq 30$, a threshold on the FreeBayes-calculated variant quality score
As always, my workflow is on DNAnexus (free account required) and the underlying source code is on GitHub

Overall, the filtering removed 1.2 million out of 4.9 million raw variant calls, including about 800,000 out of 4.1 million biallelic SNVs. Here are a couple plots showing the before-and-after effects:

Before filteringAfter filtering
They show the distribution of QUAL versus the number of reads supporting the alternate allele (Alternate Observations, AO). The filters remove a lot of calls with low quality scores, as expected, and many others as well. Two remaining modes in the distribution reflect heterozygous and homozygous SNVs, the overwhelming majority far above the QUAL threshold.

One way to quantify the benefits of the filtration is to look at the ratio of transitions and transversions observed in the called SNVs. These two types of SNVs are expected to occur in a certain rough proportion, owing to the chemical structure of DNA and mechanisms of its mutation and repair. The Ts/Tv ratio would come out to 0.5 in random garbage sampled from the four nucleotides uniformly, whereas we like to see it north of 2.0 in high-quality samples. The filtration steps improved the Ts/Tv of my calls from 1.9 to 2.2.

Another aspect we can look at is the distribution of variants across the chromosomes. Here are a few:
We expect more X variants than the other chromosomes shown here, since 19-22 are the smallest autosomes. The SNV calls on the autosomes tend to be about 56% heterozygous and 44% homozygous after filtering. So it's notable that X is vastly depleted of heterozygous calls; we should very well hope so, since I've only got one X! The same should be true for the Y, but we don't see that reflected proportionately. This is an early indication of problems current WGS technologies have with the Y - a story for another time.

One more sanity check

Let's try a slightly more involved sanity check, by comparing my variants to data from the 1000 Genomes Project. Their VCF files contain genome-wide variant calls for 1,092 individuals (close enough!), sampled from a variety of global populations. For each variant site, they've also taken the helpful step of inferring the ancestral allele - the allele that was likely prevalent in the primate lineage leading to modern humans - based on the genomes of other primates that've been sequenced. The ancestral allele information is helpful to reduce the degree to which analyses are skewed by the particular humans' genomes that went into making the hs37d5 reference.

Let $G$ be a matrix whose rows represent the 1,092 individuals and columns correspond to biallelic SNV sites across the genome, where each entry $G_{ij} \in \left\{0,1,2\right\}$ is the number of copies of the non-ancestral (derived) allele individual $i$ possesses at site $j$. I computed $G$ for about 115,000 autosomal sites - a tiny sliver of the available data, but quite adequate for this exercise - and subjected it to principal component analysis, which reduces the 115,000 observations for each individual into a chosen few dimensions that maximally capture the variance across the dataset $G$.

The following plot shows the projection of each of the 1,092 individuals (rows of $G$) onto the first two principal components (code).

pca <- prcomp(G,scale=FALSE); qplot(pca\$x[,1], pca\$x[,2])

We seem to have three distinct clusters of genetic variation, plus a number of individuals showing mixture between them. The meaning becomes quite clear when we label each individual's ethnicity:
In this single plot, we can see reflections of the native peopling of the Americas by migration across Beringia from northeastern Asia, 15,000 years ago; European colonization of the Caribbean and Central America, starting 600 years ago; and the rapidly ensuing arrival of Africans in North America - mostly under the worst of circumstances - with humans then doing as humans do to create genetic admixture.

It's now pretty straightforward to take my VCF file, create a row of $G$ representing myself, and project it onto the previously-determined principal components:

Stop the presses!!

So, another sanity check on my filtered VCF seems to look good, insofar as the called genotypes generally show concordance with other individuals of my ethnicity. As far as genomic ancestry analysis though, this has been at best a "prosumer" approach, if not completely amateurish. That's because, while it's treated all the variant sites as more-or-less independent observations, those nearby on the same chromosome actually tend to inherit certain joint configurations or haplotypes. That process can be modeled in great detail, which I'll get to, but first there's a lot more to do with my newly-validated variant calls!

30 comments:

  1. Now a large number of young and educated people create very creative and innovative services that help people. An example is the essay writing service https://essaywriter.org/ They are very responsible in writing each of their essays, so I trust them completely. If you also need help like me, write to them as soon as you can.

    ReplyDelete
  2. It takes two weeks for a new company registration or shelf company transfer, in most cases. However, the set-up or transfer process will take up to 15 business days, if the shareholder of Romanian company is an overseas corporate entity
    VAT registration requires up to 1 month in most cases (Romanian tax authorities may request additional information)
    For a ready-to-use bank account a minimum of 15-20 days is required. (Account activation is possible only upon receipt of the required information and hand-signed application forms) http://www.confiduss.com/en/jurisdictions/romania/business/company-formation/

    ReplyDelete
  3. Genetic studies also enabled genome sequencing data to be collected by several individuals in the next year. dissertation help uk

    ReplyDelete
  4. filmstarlook is now providing best celebrity leather jackets. at our website you can get all the new leahter jackes of celebrities and can get all the custom designs. visit to our store to get the best services.
    Aviator B3 sheepskin leather jacket

    ReplyDelete
  5. If you are experiencing stress from your academic trials and you have little time to complete such a difficult task, you can choose an essay to help the service to find quick and easy solutions to your problems. You have access to a large catalog where you can choose from a list of hundreds of essay authors who specialize in the art of writing website.

    ReplyDelete
  6. Wow. It's so incredible. I had read all your series of blog posts about your genome. And I even have no words to say how detailed you write about it. But I can only use professional term paper help because I have no ideas how to write so good like you.

    ReplyDelete
  7. By delivering our services, we do not undervalue students' strengths or talents; instead, we provide a roadmap for high-quality assignments that can help you achieve your main objectives. We can provide Free reference generator to ensure that your work, essays, research papers, and assignments are properly completed.

    ReplyDelete
  8. Your work will be done up to par and within an acceptable period of time. Best Available writers GradeMiners are great when you need an essay or a similar type of paper to be done in a couple of days. But if your paper has a burning deadline…

    ReplyDelete
  9. Nowadays, students enjoy a plethora of advantages. One of them is an assignment writing service. As a student, you may simply ask for assistance to make even the most difficult tasks easy. For example, if you believe that essays are difficult to complete, remain cool and contact an essay helper. There are many brilliant professionals willing to assist students like you in their academic endeavours. You can count on their dedication, hard effort, and knowledge. Find a legitimate and dependable essay writing service. It's a decision you'll never regret in your life.

    ReplyDelete
  10. Your writing is perfect and complete. However, I think it will be more wonderful if your post includes additional topics that I am thinking of. I have a lot of posts on my site similar to your topic. help in writing dissertation

    ReplyDelete
  11. Yo! Our grademiners service believe in quality and commitment which makes us so different from our competitors and thus always keeps our customers satisfied.We deal in avariety of essay topics such asScience,History,Literature, Sociology and etc.

    ReplyDelete
  12. The astounding Ashley Massis, a previous design purchaser, beautician, and expert, told me, "A quality cowhide coat can go somewhere in the range of $300 to men's fashion jackets for sale upwards of $1000 or more. While you're taking a gander at your spending plan, a couple of things should be thought about for the ideal piece."

    ReplyDelete
  13. You nailed it. Thank you for taking the time. I'll check again to find out more and recommend my coworkers about your website. 토토

    ReplyDelete
  14. Thanks for sharing your info. I truly appreciate your efforts and I am waiting for your further write ups thank you once again. 스포츠토토

    ReplyDelete
  15. Allow your dragons to participate in battle in Dragon City. The Dragon League Tourney and the Dragon Stadium are the two fighting systems in this game(descargo). To open and begin the dragon combat in Dragon Stadium, you'll need three buddies. Choose three dragons from your dragon collection for each tournament once you've opened a stadium. The chosen dragons for battle, however, must be at least level 4 to fight. Furthermore, each fight you win increases the complexity of the subsequent ones. The game's PVP mode is the Dragon League Tourney. You can battle here for a maximum of three times per six hours.

    Choose the best wireless gaming mouse for you!

    ReplyDelete
  16. Thank you so much for the post you do. I like your post and all you share with us is up to date and quite informative, i would like to bookmark the page so i can come here again to read you, as you have done a wonderful job.nursing assignment writing service uk

    ReplyDelete
  17. This episode of Blogging My Genome takes viewers through the process of variant calling, filtering, and QC. With the help of business assignment help, viewers can ensure they understand the process clearly, and can use this knowledge to get their project done accurately. By properly understanding the concepts of variant calling, filtering, and QC, viewers can make sure that their project is of the highest quality.

    ReplyDelete
  18. I admire this article for the well-researched content

    ReplyDelete
  19. Appreciate you spending some time and effort to put this wonderful article. Goodjob!!

    ReplyDelete
  20. It’s very informative blog in this area. Continue writing man! Keep it up

    ReplyDelete
  21. Its opportunity are so fantastic and working style so speedy.

    ReplyDelete
  22. Having a hard time looking for good and trusted site?

    ReplyDelete
  23. I am really happy to say it’s an interesting post to read .

    ReplyDelete
  24. This comment has been removed by the author.

    ReplyDelete
  25. Fascinating article, read with interest.
    I can also recommend one web resource with interesting articles - [url=https://www.esterdigital.de/webdesign-agentur-frankfurt] homepage erstellen frankfurt [/url]

    ReplyDelete
  26. At Native Assignment Help, we take pride in delivering exceptional Assignment Editing Services UK. Our dedicated team of editors pays meticulous attention to detail, refining your assignments to perfection. With a keen eye for detail and a commitment to quality, we ensure your work meets the highest standards, leaving you with polished, refined assignments that stand out academically.

    ReplyDelete