Mike's Fourth Try

Sunday, April 26, 2015

Blogging My Genome, episode 7: sifting for bad news

This is the seventh in a series of blog posts about my genome, which I had sequenced through Illumina's Understand Your Genome program.

Hey, it's been awhile! We've been unbelievably busy at my company, but I've been plugging away on my genome analysis slowly. When I last blogged, I'd completed the process of identifying small variants in my genome (affecting just one or a few DNA nucleotides). This takes us into an interesting new analysis phase - interpreting the consequences of those variants in the context of existing knowledge of human genetics. I previously went into depth on a certain variant I'd known to look for, but now we'll sift through the others in my VCF file - nearly four million of them!

I began with Ensembl's Variant Effect Predictor (VEP), one of several available tools that annotates VCF variants with their likely consequences for known genes and other genomic features. VEP produces a new VCF file with this additional information crammed into each entry, like so:

1 871215 . C G 1357.81 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=45;CIGAR=1X;DP=
45;DPB=45;DPRA=0;EPP=4.21667;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=63.5608;P
AIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1543;QR=0;RO=0;RPP=3.44459;RPPR=0;RUN=1;SAF=19;SAP=5.37
479;SAR=26;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1;CSQ=G|ENSG00000187634|ENST00000341065|Transcript|synonymous_variant|140|141|47|P|ccC/ccG|rs28419
423|0.0261008|0.00232558|3/12|||||||1|||SAMD11|HGNC|||G:0.0629|protein_coding|ENSP00000349216|||0.03
|0.08|0.16|0.0026|,G||ENSR00000528855|RegulatoryFeature|regulatory_region_variant||||||rs28419423|0.
0261008|0.00232558|||||||||||||||G:0.0629|||||0.03|0.08|0.16|0.0026| GT:DP:RO:QR:AO:QA:GL
1/1:45:0:0:45:1543:-10,-10,0

That's really not pretty, but VEP also produces a nice series of summary charts. For example, it breaks down putative consequences of my variants in protein-coding sequences.

Blogging My Genome, episode 6: variant calling, filtering, and QC

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.

Blogging My Genome, episode 5: the homozygous designated driver

This is the fifth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

Last time, we manually examined some of my read mappings and called one A/A homozygous variant. I didn't choose this example at random, of course, but rather because it's an interesting variant with a fairly life-altering phenotype.

Blogging My Genome, episode 4: read mapping

This is the fourth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

Last week's data wrangling produced eight FASTQ files containing the sequencing reads for my genome ($8=4 \times 2$, four lanes' worth of paired-end reads). The next step in making sense of these 1.3 billion reads is to map their positions of origin in the human reference genome assembly. This post will continue somewhat down in the weeds technically, but we'll end up in position to look at some interesting genetics next time.

If you're not interested in the technical minutiae - and that would be fair enough - you could skip to the last section.

Reads quality control with FastQC

Trust, but verify: since my data came from Illumina's CLIA lab, I wasn't terribly worried about gross quality problems - but some QC is always prudent before proceeding further. We have an app on DNAnexus to run reads through FastQC, which runs some essential sanity checks. I ran these on my four pairs of FASTQ files to get four FastQC reports. (It's worth examining the four lanes separately, because they could conceivably have independent quality issues.) Here's one representative plot:

MH370: the accidental π-day wisdom of NPR

Please find this exchange six minutes into the show:

ANCHOR: "How big is the search area, or areas, Tom?"
CORRESPONDENT: "Well the defined search area is about 31,000 square miles. But if you take into the possibility that this plane may have flown another four or five hours, then you're looking at a potential distance of 2500 miles, and then you've got to do...you've got to get some mathematician from MIT to figure out OK...in every single direction from the point off South Vietnam, all the way in every possible direction, you know...it seems to me the possibilities are endless there..."

But seriously. Where the plane could have gone depends not only on its heading, airspeed, and endurance, but also on the wind. And at an airliner's cruising altitude, the wind commonly blows at 50-100 knots - faster than you drive down the freeway, and a severe hurricane at ground level. Not only that:

The wind vector field is actually four-dimensional, varying with altitude and time. By the way, the ground speed and endurance also vary with altitude. Integrating over a multidimensional infinity of possible flight paths, we'd need to find the maximum range when they're projected down to the surface of the Earth - accounting for its curvature and coriolis effect - and calculate some kind of polar integral over the resulting surface. If we're searching for debris, what about those ocean currents?

In conclusion, contingent on a certain understanding of "endless", the correspondent is entirely correct. Godspeed to all those searching for MH370.

Mike's Fourth Try

Sunday, April 26, 2015

Blogging My Genome, episode 7: sifting for bad news

Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

Wednesday, April 2, 2014

Blogging My Genome, episode 5: the homozygous designated driver

Tuesday, March 18, 2014

Blogging My Genome, episode 4: read mapping

Reads quality control with FastQC

Friday, March 14, 2014

MH370: the accidental π-day wisdom of NPR