Tuesday, March 11, 2014

Blogging My Genome, episode 3: data wrangling

After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the third in a series of blog posts I'm writing about my own adventure in very personal genomics.

Along with my clinical report, Illumina delivered a portable hard drive (pictured) containing the underlying next-generation sequencing data. Given my computational genomics background, this was the product I'd most been looking forward to receiving. In fact, after picking it up, I found myself hastily biking five miles through the rain to get it to a computer!

The drive is protected using TrueCrypt, with the encryption key e-mailed separately. I installed TrueCrypt on my Linux workstation, mounted the drive, and started to look around:

PG0001312-BLD is the identifier Illumina's lab assigned to my case. In the root folder we have copies of my PDF clinical report and a TSV file with the 5,390 variants in my genome that were interpreted by Illumina's CLIA lab. (It's an irritating truth that the field of bioinformatics is largely based on tab-delimited file formats - but that's another blog post.) As mentioned last time, this only includes single-nucleotide variants in the exons of 1,600 genes with currently-known medical significance.

In the Variations folder we have files containing all of the variants detected across my genome in VCF format (also tab-delimited). This is the main product of Illumina's bioinformatics pipeline, Isaac, which takes the jumble of reads from the sequencing instrument through to variant calls, preceding clinical interpretation.
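To give a flavor of the format: each VCF data line is a tab-separated record with fixed columns for the variant itself, plus optional genotype columns. Here's a quick sketch - the variant shown is made up purely for illustration, not taken from my file:

```shell
# VCF data columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
# (+ optional genotype columns). This example variant is fabricated
# for illustration only.
printf 'chr1\t10177\t.\tA\tAC\t100\tPASS\tDP=42\n' |
  awk -F'\t' '{print $1":"$2, $4">"$5}'
# prints: chr1:10177 A>AC
```

That awk one-liner pulls out just the position and the REF>ALT alleles - handy for skimming a file with millions of lines.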

The Assembly folder contains the mother lode: a 66 GiB BAM file containing all of the reads, associated quality information, and their estimated positions of origin, or mappings, with respect to the human reference genome assembly. While this file is really big, it's actually just half the size I'd been expecting, and I'll get to why in a future post.
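BAM is a binary format, but samtools can print its records as text, one read per line with its name, mapped position, sequence, and qualities. The standard way to take a peek (assuming a local copy of the file) would be something like:

~$ samtools view Assembly/PG0001312-BLD.bam | head -n 1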

The BAM file is an intermediate product of the Isaac bioinformatics pipeline, from which the input data is still recoverable. It'll be my goal to get that raw data out and redo much of the bioinformatics from scratch. Not because there's anything wrong with Isaac, but rather because that'll be the best way for me - and, I hope, readers of this blog - to learn about the process. Analyzing this raw data would typically be a daunting challenge, because of the technical expertise required, the sheer quantity of the data, and the computational horsepower needed to work with it in an exploratory fashion. Fortunately, however, I happen to be a bridge officer on the world's most powerful genome informatics platform.

To the cloud!

I created a DNAnexus project called "My Genome", and then used our command-line client to begin uploading the entire contents of the portable hard drive.
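From memory, the whole thing boiled down to a recursive upload along these lines (the local mount point shown is illustrative):

~$ dx select "My Genome"
~$ dx upload --recursive /media/truecrypt1/*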

I set this running just before going to dinner, and it'd completed by the time I woke up the next morning. As a result, all my genome data now lives safely in the cloud.

FASTQ regeneration

To begin the bioinformatics, I'd like to get all my sequencing reads into FASTQ format, which contains the sequence and quality information, but not the mappings also encoded in the BAM file. FASTQ isn't truly the raw data that comes straight from the sequencing instrument, but the steps leading to it are relatively uncontroversial. Picard has a utility to extract FASTQ data from the BAM file, which I wrapped to run on DNAnexus. I can upload this to my project and execute it like so:

~/src/blogging-my-genome$ dx build -f bam_to_fastq
~/src/blogging-my-genome$ dx run bam_to_fastq -i bam=:/Assembly/PG0001312-BLD.bam
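For the record, the wrapper essentially just invokes Picard's SamToFastq tool; run directly on a workstation, it'd look roughly like this (jar name and option spelling from memory - check your Picard version's docs):

~$ java -jar SamToFastq.jar INPUT=PG0001312-BLD.bam OUTPUT_PER_RG=true OUTPUT_DIR=fastq/

OUTPUT_PER_RG=true emits a separate FASTQ pair for each read group recorded in the BAM header, rather than one giant pair of files.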


The result is eight FASTQ files, which break down into four pairs. The files come in pairs owing to the use of paired-end sequencing, a neat trick to squeeze more information from a given quantity of reads. There are four such pairs corresponding to the four lanes on two HiSeq flowcells used to sequence my sample. Here are the first eight lines of C2KC2ACXX_1_6_none_1.fastq:

~$ dx cat C2KC2ACXX_1_6_none_1.fastq.gz | gunzip -c | head -n 8 | nl
   1  @C2KC2ACXX_1:6:1101:4:0/1
   2  [100 bp of sequence, elided here]
   3  +
   4  [100 quality characters, elided here]
   5  @C2KC2ACXX_1:6:1101:6:0/1
   6  [100 bp of sequence, elided here]
   7  +
   8  BBBF!0<BB0FBBFIIIFFIFFFBBF!0<!!00<B'<BB!00!!!!!!!0000'70<<'7<BB!!!!!!!!!!!!!!!!!!!!!!!!!!!!'!!!!!!!!

Line 1 is a read identifier, line 2 is the 100 base pair (bp) DNA sequence of the read, line 3 marks the end of the sequence, and line 4 presents the Phred quality scores for each position in the read. Lines 5-8 repeat this for a second read, and so on for the remainder of this 35 GiB (uncompressed) FASTQ file. You can see that the second read wasn't very good, with a large proportion of no-call positions (N) and the lowest quality scores (!).
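Those quality characters are just printable ASCII: under the Phred+33 encoding used here, each score is the character's ASCII code minus 33, where a score of Q means an estimated error probability of 10^(-Q/10). A quick sketch of the decoding:

```shell
# Phred+33: quality score = ASCII code of the character minus 33.
# '!' (ASCII 33) is the worst possible score, Q0; 'I' (ASCII 73) is Q40,
# i.e. an estimated error probability of 1 in 10,000.
for c in '!' 'B' 'F' 'I'; do
  printf '%s -> Q%d\n' "$c" $(( $(printf '%d' "'$c") - 33 ))
done
# prints:
# ! -> Q0
# B -> Q33
# F -> Q37
# I -> Q40
```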

Each read in C2KC2ACXX_1_6_none_1.fastq has a corresponding mate pair in C2KC2ACXX_1_6_none_2.fastq, and so on for the three remaining groups. The eight FASTQ files contain a grand total of 1,306,309,136 reads, or 131 Gbp - around 40 times the (haploid) human reference genome size of 3.2 Gbp. Such high coverage is typical for clinical-grade genome sequencing.
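That coverage figure is just arithmetic on the read count and read length, which you can check for yourself:

```shell
# Average coverage = total bases sequenced / genome length.
reads=1306309136     # total reads across the eight FASTQ files
read_len=100         # each read is 100 bp
genome=3200000000    # haploid human reference, ~3.2 Gbp
awk -v r="$reads" -v l="$read_len" -v g="$genome" \
    'BEGIN { printf "%.1fx average coverage\n", r * l / g }'
# prints: 40.8x average coverage
```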

Next time: I'll generate my own mappings of these reads back onto the reference genome.