Tuesday, March 18, 2014

Blogging My Genome, episode 4: read mapping

This is the fourth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

Last week's data wrangling produced eight FASTQ files containing the sequencing reads for my genome ($8=4 \times 2$, four lanes' worth of paired-end reads). The next step in making sense of these 1.3 billion reads is to map their positions of origin in the human reference genome assembly. This post will continue somewhat down in the weeds technically, but we'll end up in position to look at some interesting genetics next time.
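For the curious, here is a minimal sketch of how those eight files could be grouped into four per-lane pairs before any per-lane processing. The file names are hypothetical: they assume the common Illumina-style `_L00<lane>_R<read>_001.fastq.gz` suffix, which may not match the names on the actual drive, and `pair_by_lane` is an illustrative helper rather than part of any tool mentioned here.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <sample>_L00<lane>_R<read>_001.fastq.gz
# (hypothetical; the real file names may differ).
FASTQ_NAME = re.compile(r"_L(?P<lane>\d{3})_R(?P<read>[12])_\d{3}\.fastq(\.gz)?$")


def pair_by_lane(filenames):
    """Group FASTQ file names into {lane: (R1 file, R2 file)}."""
    lanes = defaultdict(dict)
    for name in filenames:
        match = FASTQ_NAME.search(name)
        if match is None:
            raise ValueError(f"unrecognized FASTQ name: {name}")
        lanes[int(match.group("lane"))][int(match.group("read"))] = name

    pairs = {}
    for lane, reads in sorted(lanes.items()):
        # Paired-end data needs both mates for every lane.
        if set(reads) != {1, 2}:
            raise ValueError(f"lane {lane} is missing a mate file")
        pairs[lane] = (reads[1], reads[2])
    return pairs


# Eight made-up file names: four lanes, two mate files each.
files = [
    f"sample_L00{lane}_R{read}_001.fastq.gz"
    for lane in range(1, 5)
    for read in (1, 2)
]
pairs = pair_by_lane(files)
for lane, (r1, r2) in pairs.items():
    print(lane, r1, r2)
```

Keeping the lanes separate at this stage makes it easy to run the same check on each pair independently.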

If you're not interested in the technical minutiae - and that would be fair enough - you could skip to the last section.

Reads quality control with FastQC

Trust, but verify: since my data came from Illumina's CLIA lab, I wasn't terribly worried about gross quality problems - but some QC is always prudent before proceeding further. We have an app on DNAnexus to run reads through FastQC, which runs some essential sanity checks. I ran these on my four pairs of FASTQ files to get four FastQC reports. (It's worth examining the four lanes separately, because they could conceivably have independent quality issues.) Here's one representative plot:

Friday, March 14, 2014

MH370: the accidental π-day wisdom of NPR

Please find this exchange six minutes into the show:
ANCHOR: "How big is the search area, or areas, Tom?"
CORRESPONDENT: "Well the defined search area is about 31,000 square miles. But if you take into the possibility that this plane may have flown another four or five hours, then you're looking at a potential distance of 2500 miles, and then you've got to do...you've got to get some mathematician from MIT to figure out OK...in every single direction from the point off South Vietnam, all the way in every possible direction, you know...it seems to me the possibilities are endless there..."

But seriously. Where the plane could have gone depends not only on its heading, airspeed, and endurance, but also on the wind. And at an airliner's cruising altitude, the wind commonly blows at 50-100 knots - faster than you drive down the freeway, and equivalent to a severe hurricane at ground level. Not only that:

The wind vector field is actually four-dimensional, varying with altitude and time. By the way, the ground speed and endurance also vary with altitude. Integrating over a multidimensional infinity of possible flight paths, we'd need to find the maximum range when they're projected down to the surface of the Earth - accounting for its curvature and the Coriolis effect - and calculate some kind of polar integral over the resulting surface. If we're searching for debris, what about those ocean currents?
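For scale, the zeroth-order version of the problem ignores all of that and just asks for the area within 2500 miles of the last known position. The sketch below computes it two ways: as a flat circle, and as a spherical cap to account for curvature. The Earth radius of 3959 statute miles is my assumption, as is treating the correspondent's "miles" as statute miles; wind, altitude, and fuel burn are deliberately left out.

```python
import math

EARTH_RADIUS_MI = 3959.0   # assumed mean Earth radius, statute miles
RANGE_MI = 2500.0          # "potential distance" quoted on air
DEFINED_AREA_SQ_MI = 31_000.0  # "defined search area" quoted on air


def flat_circle_area(radius):
    """Area of a circle on a flat plane: pi * r^2."""
    return math.pi * radius ** 2


def spherical_cap_area(surface_distance, sphere_radius):
    """Area of the cap within a given surface distance on a sphere.

    A = 2 * pi * R^2 * (1 - cos(d / R)), with d measured along the surface.
    """
    angle = surface_distance / sphere_radius
    return 2.0 * math.pi * sphere_radius ** 2 * (1.0 - math.cos(angle))


flat = flat_circle_area(RANGE_MI)
cap = spherical_cap_area(RANGE_MI, EARTH_RADIUS_MI)

print(f"flat circle:   {flat:,.0f} sq mi")
print(f"spherical cap: {cap:,.0f} sq mi")
print(f"flat circle vs defined search area: {flat / DEFINED_AREA_SQ_MI:,.0f}x")
```

Either way the answer comes out in the tens of millions of square miles, hundreds of times the defined search area, and that is before the wind gets a vote.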

In conclusion, contingent on a certain understanding of "endless", the correspondent is entirely correct. Godspeed to all those searching for MH370.

Tuesday, March 11, 2014

Blogging My Genome, episode 3: data wrangling

After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the third in a series of blog posts I'm writing about my own adventure in very personal genomics.

Along with my clinical report, Illumina delivered a portable hard drive (pictured) containing the underlying next-generation sequencing data. Given my computational genomics background, this was the product I'd most been looking forward to receiving. In fact, after picking it up, I found myself hastily biking five miles through the rain to get it to a computer!

The drive is protected using TrueCrypt, with the encryption key e-mailed separately. I installed TrueCrypt on my Linux workstation, mounted the drive, and started to look around:

Thursday, March 6, 2014

Blogging My Genome, episode 2: scratching the surface

After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the second in a series of blog posts I'll write about my own adventure in very personal genomics!

Three months after shipping my blood sample off to the lab for whole-genome sequencing (WGS), I got the long-awaited message to come in and go over my results. And so on a rainy Friday afternoon I biked over to Stanford for genetic counseling. I was very excited, and yet not without awareness of the ~1% chance I could see one of the known pathogenic findings on the American College of Medical Genetics list for genome sequencing reports, and perhaps up to ~5% chance of some other medically actionable finding.

Fortunately, nothing like that came up. In fact, my report is quite unremarkable, which is of course a good thing:

Saturday, March 1, 2014

Blogging My Genome, episode 1: parting with my blood (and treasure)

After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to go ahead and sign up to get my genome sequenced. This is the first in a series of blog posts I'll write about my own adventure in very personal genomics!

UYG gets you:
  • "Deep" whole-genome sequencing (WGS) from a blood sample
  • Bioinformatics and clinical interpretation through Illumina's CLIA lab
  • Report sent to your clinician
  • Raw data on a portable hard drive
  • Day-long workshop with other participants
  • iPad with the MyGenome app
...for $5,000, which isn't too bad given what's included. The pricing probably reflects that the program is mainly an outreach effort aimed at subject matter experts.