Tuesday, March 18, 2014

Blogging My Genome, episode 4: read mapping

This is the fourth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

Last week's data wrangling produced eight FASTQ files containing the sequencing reads for my genome (8 = 4 lanes × 2 files of paired-end reads). The next step in making sense of these 1.3 billion reads is to map each one to its most likely position of origin in the human reference genome assembly. This post will continue somewhat down in the weeds technically, but we'll end up in position to look at some interesting genetics next time.

If you're not interested in the technical minutiae - and that would be fair enough - you could skip to the last section.

Read quality control with FastQC

Trust, but verify: since my data came from Illumina's CLIA lab, I wasn't terribly worried about gross quality problems - but some QC is always prudent before proceeding further. We have an app on DNAnexus to run reads through FastQC, which performs some essential sanity checks. I ran it on my four pairs of FASTQ files to get four FastQC reports. (It's worth examining the four lanes separately, because they could conceivably have independent quality issues.) Here's one representative plot:

This shows that, across all the 100bp reads in one pair of my FASTQ files, the Phred quality scores assigned by the sequencer tend to be high across most of the read, but degrade toward either end. That's typical with current technology. My other FastQC results also looked as expected.
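As an aside on what those Phred scores mean: each one is a log-scaled error probability, Q = -10·log10 P(error). Here's a minimal sketch of decoding them, assuming the standard Phred+33 encoding used in modern FASTQ files:

```python
# Decode a Phred+33 quality character (as found in FASTQ) into the
# probability that the base call is wrong: P(error) = 10^(-Q/10).
def phred_to_error_prob(qual_char, offset=33):
    q = ord(qual_char) - offset  # recover the integer Phred score Q
    return 10 ** (-q / 10)

# A high-quality base call ('I' = Q40) vs. a degraded one ('#' = Q2):
print(phred_to_error_prob('I'))  # 0.0001, i.e. 1 error in 10,000
print(phred_to_error_prob('#'))  # ~0.63, nearly uninformative
```

So the "degradation" in the plot above is a shift from roughly one error per ten thousand bases to odds not much better than a coin flip.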

Mapping with BWA-MEM

Mapping is the process of taking each 100bp sequencing read and determining the most likely corresponding position in the human genome reference assembly, enabling us to locate variants throughout my genome. Mapping is challenging for many reasons, such as: (1) the reads contain outright sequencing errors, in addition to bona fide differences between my genome and the reference; (2) the human genome contains extensive repetition, making 100bp reads sometimes difficult to uniquely place; (3) the paired reads convey implicit proximity information that can be tricky to use to maximum effect; (4) with 1.3 billion reads to map, it has to be done efficiently.

Befitting such an important and challenging algorithmic problem, there's a diversity of mapping software available for read data like mine, utilizing a variety of cool string processing algorithms including Bloom filters, suffix arrays, and the Burrows-Wheeler transform. I'm going to use BWA-MEM, a modern and popular mapper we already have available on DNAnexus, to map my reads onto hs37d5, the reference assembly used in the 1000 Genomes Project. (A colleague wrote a great explanation of the several different reference assemblies in common use today, although we expect a transition to GRCh38 over the next year.)
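To give a rough flavor of what mappers do under the hood, here's a toy sketch of the "seeding" stage: index the reference's k-mers, then look up exact seed matches from a read as candidate alignment positions. (This is an illustration only - BWA-MEM actually finds super-maximal exact matches using an FM-index built on the Burrows-Wheeler transform, then extends the best candidates with a banded alignment.)

```python
# Toy seed-and-lookup sketch of read mapping (illustrative, not BWA-MEM).
def build_index(ref, k=5):
    # Map every k-mer in the reference to the positions where it occurs.
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i+k], []).append(i)
    return index

def seed_hits(read, index, k=5):
    # Each exact k-mer match at read offset j votes for reference
    # position (match position - j) as the read's candidate origin.
    hits = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j+k], []):
            hits.add(pos - j)
    return sorted(hits)

ref  = "ACGTACGTTAGCCGATTACA"
read = "TAGCCGAT"  # occurs in ref starting at position 8
print(seed_hits(read, build_index(ref)))  # [8]
```

A real mapper then scores and extends these candidates, handles mismatches and indels, and uses the paired-end constraint to break ties - the hard parts this sketch skips.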

Aside: it would be great to produce a de novo assembly of my genome from the reads, rather than just mapping them onto a generic reference assembly. One day this will be standard, and I'll explore it a little bit in a future post - but with current technology, it's just far more difficult than mapping.

BAM finishing

BWA-MEM will take my FASTQ reads and produce a BAM file encoding the mappings, which we'll eventually feed to a variant calling algorithm. Best practices call for a few additional "finishing" steps on the mappings beforehand, however. One is deduplication, flagging or removing reads that are redundant in a certain sense - arising from PCR amplification prior to sequencing - which could otherwise bias downstream analyses. I'll perform deduplication using Picard MarkDuplicates, which is pretty standard.
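To illustrate the concept, here's a toy sketch - not Picard's actual algorithm, which also considers library, mate coordinates, and unclipped alignment start - of why duplicates can be flagged at all: PCR copies of one original DNA fragment map to the same position and strand, so among such a pile we keep only the best-quality representative.

```python
# Toy duplicate-marking sketch (illustrative; real tools like Picard
# MarkDuplicates use a richer key and handle read pairs).
def mark_duplicates(reads):
    best = {}  # (chrom, pos, strand) -> highest-quality read seen so far
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key not in best or read["qual_sum"] > best[key]["qual_sum"]:
            best[key] = read
    kept = {id(r) for r in best.values()}
    # Flag everything except the best representative of each position.
    return [dict(r, duplicate=(id(r) not in kept)) for r in reads]

reads = [
    {"name": "r1", "chrom": "12", "pos": 1000, "strand": "+", "qual_sum": 3500},
    {"name": "r2", "chrom": "12", "pos": 1000, "strand": "+", "qual_sum": 3900},
    {"name": "r3", "chrom": "12", "pos": 2000, "strand": "-", "qual_sum": 3700},
]
print([r["name"] for r in mark_duplicates(reads) if r["duplicate"]])  # ['r1']
```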

There are additional finishing steps in common use today, including disposal of low-quality reads or mappings, recalibration of base quality scores, and local realignment around indels. I'll save myself the trouble because it seems the very latest generation of variant callers don't benefit much from these steps.

The mapping workflow

Here's a screenshot of the DNAnexus workflow I created for my mapping and BAM finishing:

Bear with me - there are a few additional things going on here:
  • I'm running BWA-MEM separately on each of my four pairs of FASTQ files. It's possible to run it on all of the FASTQ files at once, but it'll be useful later on to know which lane each read arose from, and mapping the four "read groups" separately is the easiest way to preserve that information.
  • Then I'm using bamtools to merge the four intermediate BAM files into one, preserving the read group information, prior to deduplication.
  • I'm creating a BAM index, which is useful for various downstream programs.
  • I wrote an applet to report various statistics about the BAM files, useful for sanity checking the results.
  • Lastly, I'm importing the mappings into a format suitable for our genome browser.
This workflow, as well as those from previous and future episodes, is available in this DNAnexus public project (free account required).
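As a tiny illustration of the read-group bookkeeping in the first two steps (the lane names here are made up; in a real BAM, each read carries an RG tag pointing to an @RG header line):

```python
# Toy sketch: tag each read with the lane (read group) it came from before
# merging, so lane-specific effects stay traceable after the merge.
lanes = {
    "lane1": ["read_a", "read_b"],
    "lane2": ["read_c"],
}
merged = [{"name": name, "RG": lane}
          for lane, reads in lanes.items() for name in reads]
print([r["RG"] for r in merged])  # ['lane1', 'lane1', 'lane2']
```

If one lane later turns out to have a systematic quality problem, its reads can be identified and handled without redoing the whole merge.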

Did it work?

My workflow produced a final, deduplicated BAM file about 63 GiB in size. Here are a few of the mappings, displayed using samtools view, which converts the BAM data into an equivalent text format:

You can find the definition of each column in the SAM format specification. Briefly, each row includes a read identifier, its sequence and quality scores, a position in the reference genome assembly, and a "CIGAR string", a compact encoding of how the read differs from the reference at the specified position.
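CIGAR strings are easy to decode mechanically. Here's a minimal sketch that parses one and computes how many reference bases the alignment spans:

```python
import re

# Parse a CIGAR string into (length, operation) pairs.
def parse_cigar(cigar):
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

# Operations M, D, N, =, X consume reference bases; I, S, H, P do not.
def reference_span(cigar):
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

# A read with a 1bp deletion relative to the reference:
print(reference_span("50M1D49M"))  # 100: 99 aligned bases plus the deleted base
# A read whose first 10 bases were soft-clipped (not aligned):
print(reference_span("10S90M"))    # 90
```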

Looks good so far, but as with the original FASTQ reads, it's important to do some overall sanity checks before proceeding further. The workflow logs and reports show that deduplication removed less than 2% of my 1.3 billion reads; and of my 1,286,633,504 remaining reads, 98% mapped onto the hs37d5 reference, 94% with a positive mapping quality score. This is great, and I'll take a look at the unmapped leftovers in a future post. We can also look at how the mappings cover the reference genome; I collected some statistics on this using bedtools (code):

This plot shows a (complementary) cumulative distribution function for the basewise coverage of the mappings - that is, how many mapped reads cover each nucleotide position in the reference assembly. The black curve represents the complete hs37d5 reference assembly, which is about 3.2 Gbp, while the blue curve represents only positions within Consensus CDS (CCDS), a conservative annotation of protein-coding exons, totaling about 32 Mbp.

According to this plot, the typical (median) position is covered by about 40 mapped reads, a depth sufficient to enable confident variant detection despite occasional errors and noise in the sequencing and mapping. About 7% of the hs37d5 reference has no coverage at all - largely accounted for by centromeres and telomeres, but including other difficult regions as well - while almost all CCDS exon positions are covered by at least several reads. At the other extreme, very few positions have more than 70-fold coverage.

It's interesting that the curves cross, and the typical CCDS position has slightly lower coverage than the genome-wide median. This could have to do with the relative GC-richness of protein-coding regions, which might make sequencing or mapping to them a little less efficient. In any case, the difference is slight and not too concerning.
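For the curious, a complementary CDF like the one plotted is simple to compute once you have per-base depths (which bedtools genomecov can report). Here's a toy sketch using a made-up depth list in place of the real 3.2-billion-row table:

```python
from collections import Counter

# Compute the complementary CDF of coverage: for each depth d, the
# fraction of positions covered by at least d reads.
def coverage_ccdf(depths):
    counts = Counter(depths)
    total = len(depths)
    ccdf = []
    at_least = total  # positions with depth >= d, updated as d increases
    for d in range(0, max(counts) + 1):
        ccdf.append((d, at_least / total))
        at_least -= counts.get(d, 0)
    return ccdf

# Four toy positions with depths 0, 2, 2, and 3:
print(coverage_ccdf([0, 2, 2, 3]))
# [(0, 1.0), (1, 0.75), (2, 0.75), (3, 0.25)]
```

Reading my plot off this way: the fraction at depth 1 is about 0.93 for the whole genome (hence the ~7% uncovered), and the curve passes 0.5 around depth 40.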

One variant call

So, after a long slog through some technical minutiae, where are we? Here's a visualization of some of my mappings in our genome browser:
(click to zoom)

We're looking at a 150bp stretch of the reference chromosome 12. Each of the light blue and green bars represents the mapping of one of my 100bp reads. (The color reflects the strand orientation of the read with respect to the reference, which is basically random and not very interesting in whole-genome sequencing, except to account for the lower reliability of the 3' ends of the reads.) We can see that my reads cover the genome many times over, which is typical, as shown previously. The reference DNA sequence is displayed just above the mappings.

Each black box indicates a mismatch between the read from my genome and the corresponding position in the reference. For the most part, these mismatches are infrequent and isolated to individual reads - consistent with errors and noise. That shows my genome has largely the same sequence as the reference, as expected. But we can see one position where all of my reads contain an A instead of the reference assembly's G. Based on this unanimous evidence, we can conclude that I'm homozygous for the variant A allele at this position. (If I were heterozygous, we would instead see some mixture of reads with A and G at that position.)
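The eyeball logic we just applied can be caricatured in a few lines - a naive frequency threshold over made-up pileup counts. Real variant callers use principled probabilistic models rather than this, but the intuition is the same:

```python
from collections import Counter

# Naive genotype call from the bases observed at one position
# (illustrative only; not how production variant callers work).
def naive_genotype(ref_base, pileup_bases, min_frac=0.8):
    counts = Counter(pileup_bases)
    (top, n), = counts.most_common(1)
    frac = n / len(pileup_bases)
    if top == ref_base and frac >= min_frac:
        return f"homozygous reference ({ref_base}/{ref_base})"
    if frac >= min_frac:
        return f"homozygous variant ({top}/{top})"
    return "heterozygous (mixed evidence)"

# 38 reads showing A and 2 showing G at a reference-G position:
print(naive_genotype("G", "A" * 38 + "G" * 2))  # homozygous variant (A/A)
```

The couple of discordant reads get outvoted, just as we discounted the scattered black boxes by eye.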

The next algorithmic analysis step will use the mappings to detect variants across my whole genome - similar, to a first approximation, to the logic we just applied to this one locus. First, however, I'll have much more to say about this particular homozygous variant; it wasn't mentioned in Illumina's clinical interpretation of my genome, but it's actually had a significant effect on my adult life.