Mike's Fourth Try: Code

Showing posts with label Code. Show all posts

Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

This is the sixth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

I'd previously generated mappings of my sequence reads to the hs37d5 reference assembly. The next step is variant calling, to systematically identify the differences between my genome and the reference. There's a variety of popular tools for this, from which I selected FreeBayes to try first. FreeBayes is an example of a relatively new generation of algorithms capable of considering many hypothetical combinations of genetic variation and sequence alignments to explain the short read data mapped to a genomic region - all within a Bayesian statistical model, per the name.

FreeBayes took my BAM file and the hs37d5 reference as input, and produced a VCF (Variant Call Format) file with genome-wide calls. This took about 32 core-hours (parallelized on an 8-core cloud instance) and produced a VCF file of 377 MiB compressed - tiny compared to my 63 GiB BAM file, as it doesn't preserve the individual reads. Here's the one-line entry of my VCF file corresponding to my ALDH2*2 variant from last time:

$ dx cat "My Genome Analysis:/C2K.deduplicated.vcf.gz" | zcat | grep "12[[:space:]]112241766"
12 112241766 . G A 1264 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=42;CIGAR=1X;DP=42;DPB=42;DPRA=0;EPP=23.691;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=58.5772;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1467;QR=0;RO=0;RPP=4.87156;RPPR=0;RUN=1;SAF=27;SAP=10.4553;SAR=15;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1 GT:DP:RO:QR:AO:QA:GL 1/1:42:0:0:42:1467:-10,-10,0

This line indicates that, at position #112,241,766 on chromosome 12, the reference assembly has a G but my reads indicate I've got an A. Further, the "1/1" prefixing the last string indicates I'm homozygous for the A, as all of my reads show it. Most of the nearly five million calls in my VCF are biallelic single-nucleotide variants (SNVs) like this one, but FreeBayes also calls more complex variation like indels and regions with two different non-reference alleles. Unfortunately, not all of them can be trusted immediately.

Blogging My Genome, episode 5: the homozygous designated driver

This is the fifth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

Last time, we manually examined some of my read mappings and called one A/A homozygous variant. I didn't choose this example at random, of course, but rather because it's an interesting variant with a fairly life-altering phenotype.

Blogging My Genome, episode 4: read mapping

This is the fourth in a series of blog posts about my genome, which I recently had sequenced through Illumina's Understand Your Genome program.

Last week's data wrangling produced eight FASTQ files containing the sequencing reads for my genome ($8=4 \times 2$, four lanes' worth of paired-end reads). The next step in making sense of these 1.3 billion reads is to map their positions of origin in the human reference genome assembly. This post will continue somewhat down in the weeds technically, but we'll end up in position to look at some interesting genetics next time.

If you're not interested in the technical minutiae - and that would be fair enough - you could skip to the last section.

Reads quality control with FastQC

Trust, but verify: since my data came from Illumina's CLIA lab, I wasn't terribly worried about gross quality problems - but some QC is always prudent before proceeding further. We have an app on DNAnexus to run reads through FastQC, which runs some essential sanity checks. I ran these on my four pairs of FASTQ files to get four FastQC reports. (It's worth examining the four lanes separately, because they could conceivably have independent quality issues.) Here's one representative plot:

Blogging My Genome, episode 3: data wrangling

After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the third in a series of blog posts I'm writing about my own adventure in very personal genomics.

Along with my clinical report, Illumina delivered a portable hard drive (pictured) containing the underlying next-generation sequencing data. Given my computational genomics background, this was the product I'd most been looking forward to receiving. In fact, after picking it up, I found myself hastily biking five miles through the rain to get it to a computer!

The drive is protected using TrueCrypt, with the encryption key e-mailed separately. I installed the TrueCrypt on my Linux workstation, mounted the drive, and started to look around:

A taste of molecular phylogenetics in Julia

I've been meaning for some time to try out Julia, the up-and-coming scientific computing language/environment that might eventually give R, MATLAB, Mathematica, and SciPy all a run for their money. Julia feels familiar if you've used those systems, but it has a lot of more modern language features, an LLVM back-end that produces performant machine code, and integrated support for parallel and distributed computing. The project incubated from 2009-2012 and, with a strong push from applied math groups at MIT, has been gaining steam quickly in the last year.

As far as I could tell via Google, no phylogenetic sequence analysis code has been written in Julia, so this seemed like an interesting place to start. In this post, I'll build up some toy code for commonly-used models of molecular sequence evolution as continuous-time Markov processes on a phylogenetic tree. This will enable a little implementation of Felsenstein's algorithm.

assert-type: concise runtime type assertions for Node.js

I recently published my first npm package: assert-type, a library to help with writing concise runtime type assertions in Node.js programs.

Background: An OCaml hacker's year with Node.js

The new DNAnexus platform uses Node.js for several back-end components, so I've had to write a fair amount of JavaScript in the year since I joined. Considering I wrote the majority of my grad school code in OCaml, a language found at the opposite end of Steve Yegge's liberal/conservative axis, this has been quite a large adjustment. Indeed, I frequently find myself encountering certain kinds of silly runtime bugs, and writing especially tedious kinds of unit tests, that are both largely obviated in a language like OCaml.

So, I still count myself a hardcore conservative. But there's certainly a lot I've enjoyed about Node.js. When requirements evolve, as they always do, JavaScript and Node's "module system" (those are air quotes) will usually offer quick hacks instead of the careful refactoring that might be demanded by a type-safe language. This incurs technical debt, but a lot of times that's a fine tradeoff, especially at a startup. More generally, Node's rapid code/test/deploy cycle is a lot of fun, without all the build process and binary dependency headaches. The vibrancy of the developer community is amazing, as is the speed at which the runtime itself is improving. (There was a period a few years ago when I feared OCaml was dying out entirely, but there's some real momentum building now.)

Testing OCaml projects on Travis CI

Update (Oct 2013): Anil Madhavapeddy has fleshed this out further.

This evening I spent some time getting unit tests for my OCaml projects to run on Travis CI, a free service for continuous integration on public GitHub projects. Although Travis has no built-in OCaml environment, it's straightforward to hijack its C environment to install OCaml and OPAM, then build an OCaml project and run its tests.

1. Perform the initial setup to get Travis CI watching your GitHub repo (up to and including step two of that guide).

2. Add a .travis.yml file to the root of your repo, with these contents:

language: c
script: bash -ex travis-ci.sh

3. Fill in travis-ci.sh, also in the repo root, with something like this:

# OPAM version to install
export OPAM_VERSION=0.9.1
# OPAM packages needed to build tests
export OPAM_PACKAGES='ocamlfind ounit'

# install ocaml from apt
sudo apt-get update -qq
sudo apt-get install -qq ocaml

# install opam
curl -L https://github.com/OCamlPro/opam/archive/${OPAM_VERSION}.tar.gz | tar xz -C /tmp
pushd /tmp/opam-${OPAM_VERSION}
./configure
make
sudo make install
opam init
eval `opam config -env`
popd

# install packages from opam
opam install -q -y ${OPAM_PACKAGES}

# compile & run tests (here assuming OASIS DevFiles)
./configure --enable-tests
make test

4. Add and commit these two new files, and push to GitHub. Travis CI will then execute the tests.

Working examples: ForkWork, yajl-ocaml

Installing OCaml and OPAM add less than two minutes of overhead, leaving plenty of room for your tests within the stated 15-20 minute time limit for open-source builds. I'm sure the above steps could be used as the basis for an eventual OCaml+OPAM environment built-in to Travis CI.

Mike's Fourth Try

Sunday, May 4, 2014

Blogging My Genome, episode 6: variant calling, filtering, and QC

Wednesday, April 2, 2014

Blogging My Genome, episode 5: the homozygous designated driver

Tuesday, March 18, 2014

Blogging My Genome, episode 4: read mapping

Reads quality control with FastQC

Tuesday, March 11, 2014

Blogging My Genome, episode 3: data wrangling

Sunday, May 26, 2013

A taste of molecular phylogenetics in Julia

Tuesday, February 19, 2013

assert-type: concise runtime type assertions for Node.js

Background: An OCaml hacker's year with Node.js

Sunday, February 10, 2013

Testing OCaml projects on Travis CI