This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.
I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.
FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:
$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0
This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.
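To make the FORMAT and sample columns concrete, here's a minimal sketch (in Python, with the two strings hard-coded from the line above) of splitting out the genotype, depth, and the log10-scaled genotype likelihoods. The GL ordering for a biallelic diploid site (0/0, 0/1, 1/1) follows the VCF specification.

```python
# Minimal parser for the FORMAT and sample columns of the VCF line above.
# FORMAT names the colon-separated fields; the sample column carries the values.
fmt = "GT:DP:RO:QR:AO:QA:GL"
sample = "1/1:42:0:0:42:1467:-10,-10,0"

fields = dict(zip(fmt.split(":"), sample.split(":")))

genotype = fields["GT"]    # "1/1" = homozygous for the first ALT allele
depth = int(fields["DP"])  # total read depth at the site
alt_obs = int(fields["AO"])  # reads supporting the ALT allele

# GL holds log10 genotype likelihoods, ordered 0/0, 0/1, 1/1 for a
# biallelic site; the best-supported genotype is simply the argmax.
gl = [float(x) for x in fields["GL"].split(",")]
best = max(range(len(gl)), key=lambda i: gl[i])  # index 2, i.e. 1/1
```

Here all 42 reads carry the A, so the 1/1 likelihood dominates by ten orders of magnitude over either genotype involving a reference allele.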
Sunday, May 4, 2014
Wednesday, April 2, 2014
Blogging My Genome, episode 5: the homozygous designated driver
This is the fifth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.
Last time, we manually examined some of my read mappings and called one A/A homozygous variant. I didn't choose this example at random, of course, but rather because it's an interesting variant with a fairly life-altering phenotype.
Tuesday, March 18, 2014
Blogging My Genome, episode 4: read mapping
This is the fourth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.
Last week's data wrangling produced eight FASTQ files containing the sequencing reads for my genome ($8=4 \times 2$, four lanes' worth of paired-end reads). The next step in making sense of these 1.3 billion reads is to map their positions of origin in the human reference genome assembly. This post will continue somewhat down in the weeds technically, but we'll end up in position to look at some interesting genetics next time.
If you're not interested in the technical minutiae - and that would be fair enough - you could skip to the last section.
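As a rough sanity check on those numbers, the expected sequencing depth is just total bases divided by genome size. A back-of-the-envelope sketch, assuming ~100 bp reads and a ~3.1 Gbp genome (both round-number assumptions; the post doesn't state the read length):

```python
# Back-of-the-envelope expected depth: total bases sequenced / genome size.
# Read length and genome size are assumed round figures, not from the post.
reads = 1.3e9        # total paired-end reads, from the post
read_length = 100    # bp, an assumed typical Illumina read length
genome_size = 3.1e9  # bp, approximate haploid human genome

depth = reads * read_length / genome_size  # roughly 40x, typical "deep" WGS
```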
Reads quality control with FastQC
Trust, but verify: since my data came from Illumina's CLIA lab, I wasn't terribly worried about gross quality problems - but some QC is always prudent before proceeding further. We have an app on DNAnexus to run reads through FastQC, which runs some essential sanity checks. I ran these on my four pairs of FASTQ files to get four FastQC reports. (It's worth examining the four lanes separately, because they could conceivably have independent quality issues.) Here's one representative plot:
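The plot in question is the classic per-base quality profile. As a toy illustration of what FastQC computes for it, here's a sketch deriving mean Phred quality per read position from FASTQ quality strings, using made-up 4 bp records and the standard Illumina 1.8+ Phred+33 ASCII encoding:

```python
# Toy version of FastQC's per-base quality plot: mean Phred score at each
# read position, computed from FASTQ quality strings (Phred+33 encoding).
def mean_quality_by_position(quality_strings):
    totals, counts = {}, {}
    for q in quality_strings:
        for i, ch in enumerate(q):
            totals[i] = totals.get(i, 0) + (ord(ch) - 33)
            counts[i] = counts.get(i, 0) + 1
    return [totals[i] / counts[i] for i in sorted(totals)]

# Two made-up quality strings: 'I' encodes Phred 40, '#' encodes Phred 2.
profile = mean_quality_by_position(["IIII", "II##"])
```

A real run aggregates this over millions of reads per lane; a droop at the right-hand end of the profile is the usual signature of quality decaying toward the end of each read.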
Tuesday, March 11, 2014
Blogging My Genome, episode 3: data wrangling
After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the third in a series of blog posts I'm writing about my own adventure in very personal genomics.
Along with my clinical report, Illumina delivered a portable hard drive (pictured) containing the underlying next-generation sequencing data. Given my computational genomics background, this was the product I'd most been looking forward to receiving. In fact, after picking it up, I found myself hastily biking five miles through the rain to get it to a computer!
The drive is protected using TrueCrypt, with the encryption key e-mailed separately. I installed TrueCrypt on my Linux workstation, mounted the drive, and started to look around:
Thursday, March 6, 2014
Blogging My Genome, episode 2: scratching the surface
After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the second in a series of blog posts I'll write about my own adventure in very personal genomics!
Three months after shipping my blood sample off to the lab for whole-genome sequencing (WGS), I got the long-awaited message to come in and go over my results. And so on a rainy Friday afternoon I biked over to Stanford for genetic counseling. I was very excited, and yet not without awareness of the ~1% chance I could see one of the known pathogenic findings on the American College of Medical Genetics list for genome sequencing reports, and perhaps up to ~5% chance of some other medically actionable finding.
Fortunately, nothing like that came up. In fact, my report is quite unremarkable, which is of course a good thing:
Saturday, March 1, 2014
Blogging My Genome, episode 1: parting with my blood (and treasure)
After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to go ahead and sign up to get my genome sequenced. This is the first in a series of blog posts I'll write about my own adventure in very personal genomics!
UYG gets you:
- "Deep" whole-genome sequencing (WGS) from a blood sample
- Bioinformatics and clinical interpretation through Illumina's CLIA lab
- Report sent to your clinician
- Raw data on a portable hard drive
- Day-long workshop with other participants
- iPad with the MyGenome app
...for $5,000, which isn't too bad given what's included. The pricing probably reflects that the program is mainly an outreach effort aimed at subject matter experts.
Sunday, July 28, 2013
The human population harbors 172 mutations per non-lethal genome position. What'll happen to them?
A recent Panda's Thumb post highlighted that, given the size of the human genome, the rate of de novo point mutations, and the total size of the population, every non-lethal position can be expected to vary - meaning that, for every genome position or site, there's very likely at least one person (and usually dozens or more) with a new mutation there, so long as it's non-lethal. It's a trivial calculation and, while we could refine it in various ways, the essential point is clear.
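For the record, the arithmetic behind the headline number is just the per-site, per-generation mutation rate times the number of genome copies in the population. A sketch, assuming μ ≈ 1.2×10⁻⁸ per site per haploid genome per generation and ~7.16 billion people (plausible round figures which reproduce the titular 172; the post above doesn't restate them):

```python
# Expected new mutations arising at any one genome site, per generation:
# (mutation rate per site per haploid genome) x (haploid genomes alive).
mu = 1.2e-8          # per site per generation, an assumed round figure
population = 7.16e9  # people, circa 2013; each carries 2 genome copies

new_mutations_per_site = mu * 2 * population  # about 172
```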
Still, let's try to understand this a bit further. First, an equally simple, entirely compatible fact which might attenuate our surprise: the existence of a couple hundred people with new mutations in a certain site leaves about seven billion without a new mutation there. Indeed, at the vast majority of sites, almost all people are homozygous for the same allele - identical by descent from the hominid lineage.
In that light, here's a deep question one can ask about all those hundreds of billions of de novo mutations: what will be their ultimate fate? Will they all shuffle through the future human population, making our genome's future evolution look like the reels on a slot machine? Or is it going to be rather more like the pitch drop experiment?
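For intuition on the "pitch drop" side of that question, classical neutral theory (Kimura) says a brand-new neutral mutation eventually fixes with probability equal to its initial frequency: 1/(2N) in a diploid population of N individuals. A quick sketch (using census size for N, purely for illustration; the relevant quantity is really the much smaller effective population size):

```python
# Neutral-theory fixation probability for a brand-new mutation: its
# initial frequency, 1/(2N) for a diploid population of N individuals.
def neutral_fixation_probability(n_individuals):
    return 1.0 / (2 * n_individuals)

p = neutral_fixation_probability(7e9)  # on the order of 1e-10
```

In other words, at anything like today's population size, almost every new mutation is destined to be lost again.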
[Image caption: "We are all, regardless of race, genetically 99.9% the same." Right or wrong?]
Sunday, May 26, 2013
A taste of molecular phylogenetics in Julia
I've been meaning for some time to try out Julia, the up-and-coming scientific computing language/environment that might eventually give R, MATLAB, Mathematica, and SciPy all a run for their money. Julia feels familiar if you've used those systems, but it has a lot of more modern language features, an LLVM back-end that produces performant machine code, and integrated support for parallel and distributed computing. The project incubated from 2009-2012 and, with a strong push from applied math groups at MIT, has been gaining steam quickly in the last year.
As far as I could tell via Google, no phylogenetic sequence analysis code has been written in Julia, so this seemed like an interesting place to start. In this post, I'll build up some toy code for commonly-used models of molecular sequence evolution as continuous-time Markov processes on a phylogenetic tree. This will enable a little implementation of Felsenstein's algorithm.
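To give a flavor of the ingredients (in Python here rather than Julia, for brevity, and using the simplest such model, Jukes-Cantor, whose transition probabilities have a closed form): Felsenstein's pruning algorithm computes, for each tree node, the likelihood of the observed leaf data below it conditional on each possible base at that node. A toy sketch for a two-leaf "cherry" tree:

```python
import math

# Jukes-Cantor (JC69) transition probabilities in closed form: with
# off-diagonal substitution rate a, P(stay) = 1/4 + 3/4*e^(-4at) and
# P(change to any one other base) = 1/4 - 1/4*e^(-4at).
def jc69_matrix(a, t):
    e = math.exp(-4 * a * t)
    stay, change = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[stay if i == j else change for j in range(4)] for i in range(4)]

BASES = "ACGT"

# Felsenstein pruning on a two-leaf tree: each leaf's conditional
# likelihood is an indicator on its observed base; the root combines the
# two branches and averages over the uniform stationary distribution.
def cherry_likelihood(leaf1, leaf2, t1, t2, a=1.0):
    p1, p2 = jc69_matrix(a, t1), jc69_matrix(a, t2)
    i1, i2 = BASES.index(leaf1), BASES.index(leaf2)
    return sum(0.25 * p1[x][i1] * p2[x][i2] for x in range(4))

lik_same = cherry_likelihood("A", "A", 0.1, 0.1)  # identical leaves
lik_diff = cherry_likelihood("A", "C", 0.1, 0.1)  # one substitution
```

On short branches, identical leaves are (as you'd hope) much more likely than differing ones, and the sixteen possible leaf pairs form a proper probability distribution, summing to one.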
Thursday, May 23, 2013
Working with cross-species genome alignments on the DNAnexus platform
I've recently been resurrecting some comparative genomics methods I developed in my last year of grad school, but never got to publish. These build on previous work to locate what we called Synonymous Constraint Elements (SCEs) in the human genome: short stretches within protein-coding ORFs that also encode additional, overlapping functional elements - evidenced by a markedly reduced apparent rate of synonymous substitutions in cross-species alignments. The first step in this analysis, and the subject of this post, involves extracting the cross-species sequence alignments of protein-coding ORFs from raw, whole-genome alignments. I hope to write a series of blog posts as I get various other parts of the pipeline going. I'm not exactly sure where it'll go from there, but it's pretty neat stuff I would eventually like to get peer-reviewed!
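To unpack "synonymous substitution" with a concrete check: a substitution within a protein-coding ORF is synonymous when the mutated codon still encodes the same amino acid. A toy sketch (the table below covers only the codons used, for illustration; a real implementation would use the full standard genetic code):

```python
# Toy check for whether a codon substitution is synonymous (i.e. leaves
# the encoded amino acid unchanged). Only a handful of codons included.
CODON_TABLE = {
    "CTT": "Leu", "CTC": "Leu", "CTA": "Leu", "CTG": "Leu",
    "ATG": "Met", "TGG": "Trp",
}

def is_synonymous(codon_before, codon_after):
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]

# CTT -> CTC changes the DNA but not the protein (both leucine).
# CTG -> ATG changes leucine to methionine: non-synonymous.
```

SCEs are regions where even these supposedly "silent" changes are depleted across species, hinting at overlapping function beyond the protein sequence.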
Wednesday, February 27, 2013
A true story about Big Science
Once, I decided to consult the literature for details about how to perform a certain selection test using PAML. I turned to my officemate Matt, and asked if he knew of any papers using it. He suggested three relevant papers, which indeed described details of that test, at least in their supplements. I was an author on two of those papers!
Sunday, February 24, 2013
My thoughts on the immortality of television sets
There's a new GB&E manuscript sensationally blasting a certain widely-reported claim of the 2012 ENCODE Consortium paper, namely that the data generated in that project "enabled us to assign biochemical functions for 80% of the genome." I'm one of 400+ authors on that paper, but I was a bit player - not at all involved in the consortium machinations that resulted in that particular wording, which has proven quite controversial, and has already been discussed/clarified by other authors big and small.
The first author of the new criticism, Dan Graur, is an authority on molecular evolution and authored a popular textbook on that topic (one I own!). The manuscript stridently argues that ENCODE erred in using a definition of "functional element" in the human genome based on certain reproducible biochemical activities, rather than a definition based on natural selection and evolutionary conservation. Interestingly, while the consortium was mostly focused on high-throughput experimental assays to identify the biochemical activities, my modest contributions to ENCODE were entirely based on examining evolutionary evidence, through sequence-level comparative genomics. So, a few comments by a former rogue evolutionary ENCODE-insider: