Mike's Fourth Try: Blogging My Genome, episode 6: variant calling, filtering, and QC

Sunday, May 4, 2014

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.
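To make the layout of that record concrete, here is a minimal Python sketch that splits one VCF data line into its fixed columns, its INFO key/value pairs, and its per-sample fields. The function name `parse_vcf_line` is hypothetical, the input is an abbreviated copy of the record above with most INFO keys dropped, and the sketch assumes a single-sample VCF line; it is not a substitute for a real VCF parser.

```python
# Minimal sketch: split one single-sample VCF data line into its parts.
def parse_vcf_line(line):
    cols = line.split()  # VCF is tab-delimited; split() also tolerates spaces
    chrom, pos, _id, ref, alt, qual, _filter, info, fmt, sample = cols[:10]
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value  # flag-style entries get an empty string
    # FORMAT names the per-sample fields; pair them with the sample values.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    return {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
            "qual": float(qual), "info": info_map, "sample": sample_map}

# Abbreviated version of the record shown above (most INFO keys omitted).
record = parse_vcf_line(
    "12 112241766 . G A 1264 . AO=42;DP=42;SAF=27;SAR=15;TYPE=snp "
    "GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0"
)
genotype = record["sample"]["GT"]
alleles = genotype.replace("|", "/").split("/")
is_hom_alt = all(a == "1" for a in alleles)  # both copies carry the alt allele
print(record["chrom"], record["pos"], record["ref"], record["alt"], genotype, is_hom_alt)
```

The genotype lives in the sample column under the `GT` key named by the FORMAT column, which is why "1/1" appears at the front of the last string.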

Filtering and QC

FreeBayes, and perhaps other tools, tend to be quite promiscuous in calling variants, leaving it to the analyst to perform downstream filtration suited to their specific goals. That freedom has given rise to a variety of techniques, from which it can be difficult to distinguish best practices. Serendipitously, just as I got to this step, Heng Li posted a preprint exploring variant filtration for deep WGS like mine, with a clever benchmark design and more than a few invaluable nuggets of wisdom.

Based on that manuscript, I used bedtools to remove variants overlapping regions of the reference assembly with low sequence complexity, which tend to attract spurious read mappings leading to erroneous variant calls. Then I used vcffilter on the attributes of each variant, which are specified in the lengthy "info" field:

$\operatorname{DP} < 59$, excluding regions with excessive read depth, probably also associated with spurious mappings. (59 is three Poisson standard deviations above my data's 40X coverage: $40 + 3\sqrt{40} \approx 59$.)
$\operatorname{SAF} > 1$, $\operatorname{SAR} > 1$, requiring multiple observations of the alternate (non-reference) allele in reads from both DNA strands
$\operatorname{AB} \leq 0.8$, requiring reasonable overall balance between reference and alternate allele observations at heterozygous calls
$\operatorname{QUAL} \geq 30$, a threshold on the FreeBayes-calculated variant quality score
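The four thresholds above can be sketched as a single predicate. This is an illustration in Python, not the vcffilter invocation actually used; the constant names and `passes_filters` are hypothetical, and the sketch assumes biallelic records where each INFO value is a single number (multi-allelic records carry comma-separated values and would need extra handling).

```python
MAX_DEPTH = 59       # exclusive upper bound on DP, from the list above
MIN_STRAND_OBS = 1   # SAF and SAR must each exceed this
MAX_AB = 0.8         # inclusive upper bound on AB
MIN_QUAL = 30.0      # inclusive lower bound on QUAL

def passes_filters(qual, info):
    """Return True if a biallelic call satisfies all four thresholds.

    `info` maps INFO keys to strings, e.g. {"DP": "42", "SAF": "27", ...}.
    """
    return (
        float(info["DP"]) < MAX_DEPTH
        and int(info["SAF"]) > MIN_STRAND_OBS
        and int(info["SAR"]) > MIN_STRAND_OBS
        and float(info["AB"]) <= MAX_AB
        and qual >= MIN_QUAL
    )

# Values taken from the ALDH2*2 record shown earlier.
aldh2 = {"DP": "42", "SAF": "27", "SAR": "15", "AB": "0"}
keep_aldh2 = passes_filters(1264.0, aldh2)

# Made-up record with excessive depth, for contrast.
too_deep = {"DP": "75", "SAF": "20", "SAR": "22", "AB": "0.5"}
keep_too_deep = passes_filters(500.0, too_deep)
print(keep_aldh2, keep_too_deep)
```

The ALDH2*2 call clears every threshold, while the made-up high-depth record is rejected on DP alone.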

As always, my workflow is on DNAnexus (free account required) and the underlying source code is on GitHub.

Overall, the filtering removed 1.2 million out of 4.9 million raw variant calls, including about 800,000 out of 4.1 million biallelic SNVs. Here are a couple plots showing the before-and-after effects:

Before filtering	After filtering

They show the distribution of QUAL versus the number of reads supporting the alternate allele (Alternate Observations, AO). The filters remove a lot of calls with low quality scores, as expected, and many others as well. Two remaining modes in the distribution reflect heterozygous and homozygous SNVs, the overwhelming majority far above the QUAL threshold.

One way to quantify the benefits of the filtration is to look at the ratio of transitions and transversions observed in the called SNVs. These two types of SNVs are expected to occur in a certain rough proportion, owing to the chemical structure of DNA and mechanisms of its mutation and repair. The Ts/Tv ratio would come out to 0.5 in random garbage sampled from the four nucleotides uniformly, whereas we like to see it north of 2.0 in high-quality samples. The filtration steps improved the Ts/Tv of my calls from 1.9 to 2.2.
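The 0.5 figure for random noise follows from counting: of the 12 possible single-base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. The short Python sketch below checks this by enumeration; `is_transition` and `ts_tv_ratio` are hypothetical helper names, and the function assumes at least one transversion is present.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """A transition swaps a purine for a purine or a pyrimidine for a pyrimidine."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ts_tv_ratio(snvs):
    """Ts/Tv for an iterable of (ref, alt) single-base pairs."""
    ts = tv = 0
    for ref, alt in snvs:
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    return ts / tv  # assumes tv > 0

# Every ordered pair of distinct bases, each equally likely under uniform noise.
all_substitutions = list(permutations("ACGT", 2))
uniform_ratio = ts_tv_ratio(all_substitutions)
print(len(all_substitutions), uniform_ratio)
```

Applied to the (ref, alt) pairs of real SNV calls instead of the enumeration, the same function gives the Ts/Tv statistic discussed above.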

Another aspect we can look at is the distribution of variants across the chromosomes. Here are a few:

We expect more X variants than the other chromosomes shown here, since 19-22 are the smallest autosomes. The SNV calls on the autosomes tend to be about 56% heterozygous and 44% homozygous after filtering. So it's notable that X is vastly depleted of heterozygous calls; we should very well hope so, since I've only got one X! The same should be true for the Y, but we don't see that reflected proportionately. This is an early indication of problems current WGS technologies have with the Y - a story for another time.
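A per-chromosome heterozygosity tally of this kind can be sketched in a few lines of Python. The function name `het_fraction_by_chrom` and the toy input are made up for illustration; real input would be the (chromosome, GT) pairs read from the filtered VCF.

```python
from collections import defaultdict

def het_fraction_by_chrom(calls):
    """calls: iterable of (chrom, GT) pairs; returns {chrom: fraction heterozygous}."""
    counts = defaultdict(lambda: [0, 0])  # chrom -> [het, total]
    for chrom, gt in calls:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid genotypes
        counts[chrom][1] += 1
        if alleles[0] != alleles[1]:
            counts[chrom][0] += 1
    return {chrom: het / total for chrom, (het, total) in counts.items()}

# Toy data only: two of three chr19 calls are heterozygous, none on X.
toy = [("19", "0/1"), ("19", "1/1"), ("19", "0/1"), ("X", "1/1"), ("X", "1/1")]
fractions = het_fraction_by_chrom(toy)
print(fractions)
```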

One more sanity check

Let's try a slightly more involved sanity check, by comparing my variants to data from the 1000 Genomes Project. Their VCF files contain genome-wide variant calls for 1,092 individuals (close enough!), sampled from a variety of global populations. For each variant site, they've also taken the helpful step of inferring the ancestral allele - the allele that was likely prevalent in the primate lineage leading to modern humans - based on the genomes of other primates that've been sequenced. The ancestral allele information is helpful to reduce the degree to which analyses are skewed by the particular humans' genomes that went into making the hs37d5 reference.

Let $G$ be a matrix whose rows represent the 1,092 individuals and columns correspond to biallelic SNV sites across the genome, where each entry $G_{ij} \in \left\{0,1,2\right\}$ is the number of copies of the non-ancestral (derived) allele individual $i$ possesses at site $j$. I computed $G$ for about 115,000 autosomal sites - a tiny sliver of the available data, but quite adequate for this exercise - and subjected it to principal component analysis, which reduces the 115,000 observations for each individual into a chosen few dimensions that maximally capture the variance across the dataset $G$.
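Each entry $G_{ij}$ can be derived from a VCF genotype plus the site's ancestral allele. Here is a minimal Python sketch of that encoding. The function name is hypothetical, and it assumes diploid biallelic genotypes and an ancestral-allele annotation that may be lowercase (which the sketch normalizes); sites where the ancestral allele matches neither the reference nor the alternate allele are skipped by returning `None`.

```python
def derived_allele_count(genotype, ref, alt, ancestral):
    """Count copies of the derived (non-ancestral) allele in a diploid genotype.

    genotype: VCF GT string such as "0/1" or "1|1" (0 = ref, 1 = alt).
    Returns None when the ancestral allele matches neither ref nor alt,
    or when the genotype is missing or not a biallelic diploid call.
    """
    calls = genotype.replace("|", "/").split("/")
    if len(calls) != 2 or any(c not in ("0", "1") for c in calls):
        return None
    alt_copies = calls.count("1")
    ancestral = ancestral.upper()
    if ancestral == ref:
        return alt_copies        # alt is the derived allele
    if ancestral == alt:
        return 2 - alt_copies    # ref is the derived allele
    return None

print(derived_allele_count("1/1", "G", "A", "G"))  # ancestral G, two copies of A
print(derived_allele_count("0/1", "G", "A", "A"))  # ancestral A, one copy of G
```

When the reference allele is itself the derived one, a "0/0" genotype contributes 2 rather than 0, which is how the ancestral annotation keeps the matrix from simply mirroring the reference assembly.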

The following plot shows the projection of each of the 1,092 individuals (rows of $G$) onto the first two principal components (code).

library(ggplot2)  # provides qplot
pca <- prcomp(G,scale=FALSE); qplot(pca$x[,1], pca$x[,2])

We seem to have three distinct clusters of genetic variation, plus a number of individuals showing mixture between them. The meaning becomes quite clear when we label each individual's ethnicity:

In this single plot, we can see reflections of the native peopling of the Americas by migration across Beringia from northeastern Asia, 15,000 years ago; European colonization of the Caribbean and Central America, starting about 500 years ago; and the rapidly ensuing arrival of Africans in North America - mostly under the worst of circumstances - with humans then doing as humans do to create genetic admixture.

It's now pretty straightforward to take my VCF file, create a row of $G$ representing myself, and project it onto the previously-determined principal components:
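Projecting a new individual amounts to centering the new row with the reference panel's per-site means and taking dot products with each component's loading vector. The Python sketch below shows that arithmetic on made-up numbers; `project` is a hypothetical helper, and in terms of R's `prcomp` output its inputs would correspond to `pca$center` and the columns of `pca$rotation`.

```python
def project(row, column_means, components):
    """Project one genotype row onto precomputed principal components.

    row:          derived-allele counts for one individual, one per site
    column_means: per-site means of the reference panel (used to center)
    components:   list of loading vectors, each with one weight per site
    """
    centered = [g - m for g, m in zip(row, column_means)]
    return [sum(c * w for c, w in zip(centered, pc)) for pc in components]

# Toy example with three sites and two components (made-up numbers).
my_row = [2, 0, 1]
panel_means = [1.0, 0.5, 1.0]
loadings = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
coords = project(my_row, panel_means, loadings)
print(coords)
```

The sites in the new row must be the same sites, in the same order, as the columns of $G$; any site missing from the new VCF needs an explicit decision (for example, treating it as homozygous reference) before projecting.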

Stop the presses!!

So, another sanity check on my filtered VCF seems to look good, insofar as the called genotypes generally show concordance with other individuals of my ethnicity. As far as genomic ancestry analysis goes, though, this has been at best a "prosumer" approach, if not completely amateurish. That's because, while it's treated all the variant sites as more-or-less independent observations, those nearby on the same chromosome actually tend to be inherited together in certain joint configurations, or haplotypes. That process can be modeled in great detail, which I'll get to, but first there's a lot more to do with my newly-validated variant calls!
