Thursday, May 23, 2013

Working with cross-species genome alignments on the DNAnexus platform

I've recently been resurrecting some comparative genomics methods I developed in my last year of grad school, but never got to publish. These build on previous work to locate what we called Synonymous Constraint Elements (SCEs) in the human genome: short stretches within protein-coding ORFs that also encode additional, overlapping functional elements - evidenced by a markedly reduced apparent rate of synonymous substitutions in cross-species alignments. The first step in this analysis, and the subject of this post, involves extracting the cross-species sequence alignments of protein-coding ORFs from raw, whole-genome alignments. I hope to write a series of blog posts as I get various other parts of the pipeline going. I'm not exactly sure where it'll go from there, but it's pretty neat stuff I would eventually like to get peer-reviewed!


I was previously hampered from making progress on this by lack of access to a compute cluster, but that's no longer a problem now that the new DNAnexus platform is up and running. While I also work under its hood every day, the personal side project I'll describe here relies only on the public feature set available to all platform users.

Downloading the alignments

The genome alignments we'll be working with are generated by the UCSC Genome Bioinformatics group using MULTIZ. These are the same alignments used in the fly and mammalian comparative genomics projects I worked on in the past. The first step was to get the alignments in multiple alignment format (MAF) from the UCSC FTP site onto the platform  (e.g. the 46-way alignment for human). One way would have been to download them to my computer, then upload them to the platform, but that's not very efficient, as they're several dozen GB compressed. A better way is to use the  "URL Fetcher" app, which causes the platform to download a file from a given URL. For example, we can use the platform's command-line client to launch a job that downloads chr21.maf.gz into the current project:

dx run url_fetcher -i url=ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/chr21.maf.gz

Building on this, I came up with the following one-liner to download all the MAF files in that FTP directory:

$ curl -iv ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/ | grep -o chr.*\\.maf\\.gz | xargs -n 1 printf "dx run url_fetcher -y --wait -i url=ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/maf/%s\n" | xargs -n 1 -d \\n -P 8 bash -c

The curl command lists the contents of the FTP directory, from which we grep out the chr*.maf.gz filenames. Then we use printf to formulate a dx run url_fetcher command for each file, and lastly execute those commands, up to eight at a time in parallel. (A simpler first try at this incantation launched all of the url_fetcher jobs at once instead of eight at a time, but that overloaded UCSC's FTP server!) We can then see the files on the platform:



Indexing the MAF blocks

The MAF format is good for transfer and archival of the whole genome-alignments, but it's not very convenient for day-to-day comparative genomics analysis: it divides the reference genome up into many small alignment blocks, not indexed for random access, and the individual blocks only include rows for those species that actually align to that region. So, we need some tools to help us access and reformat the alignments. (Similar tools are available in Galaxy, and a command-line utility is provided in PHAST.)

We first need the ability to extract any MAF blocks overlapping desired regions (e.g. the constituent exons of known ORFs). To make this efficient, I wrote a maf_importer applet to load all the MAF blocks into a GenomicTable (GTable for short), which is basically a database table provided by the platform that can be queried by genomic coordinates and other indices. For example, dx head can run such queries on a GTable produced by maf_importer:


$ dx head gtable-B4PB620X8Pv0GqQ1Ybp00F2b --gri chr7 117227697 117235331
┌────────┬────┬─────────┬─────────┬────────────────────────────────┐
│Row     │chr │lo       │hi       │maf_block                       │
├────────┼────┼─────────┼─────────┼────────────────────────────────┤
│28533480│chr7│117227595│117227701│a score=1206415.000000\ns hg19.…│
│28533481│chr7│117227701│117227726│a score=259963.000000\ns hg19.c…│
│28533482│chr7│117227726│117227764│a score=374335.000000\ns hg19.c…│
│28533483│chr7│117227764│117227779│a score=-15948.000000\ns hg19.c…│
│28533484│chr7│117227779│117227855│a score=3327291.000000\ns hg19.…│
│28533485│chr7│117227855│117227889│a score=1900087.000000\ns hg19.…│
│28533486│chr7│117227889│117227927│a score=109284.000000\ns hg19.c…│
│28533487│chr7│117227927│117227947│a score=-137786.000000\ns hg19.…│
│28533488│chr7│117227947│117228001│a score=659519.000000\ns hg19.c…│
│28533489│chr7│117228001│117228031│a score=330424.000000\ns hg19.c…│


The GTable is identified here by its permanent unique ID; it can also be referenced by its human-readable project name and path, in this case "MAF whole-genome alignments:/hg19 multiz46way" but unlike the permanent ID, that path can change in the future, if we rename something. So I usually prefer the ID, but I'm a nerd. The GTable resides in a "public" project I prepared, meaning that any platform user can view it. It consists simply of the reference genomic coordinates each block covers, and the text of the individual MAF block. Since this GTable conforms to the Spans type convention, the Genome Browser can also illustrate how the MAF blocks tile the human genome:


Stitching ORF alignments

Given a bunch of MAF blocks that overlap a region of interest, the next step is to "stitch" them together into a single alignment of the region, in a more convenient format. Enter the maf_stitcher app, which takes a list of gene annotations or other genomic intervals, and produces a GTable containing an alignment for each annotation - stitching MAF blocks and "splicing" annotated exons as necessary. The app has the 46-vertebrate alignments and the 15-insect alignments built-in, and lets you select from a menu of common subsets thereof (e.g. the 29 mammals analyzed in Lindblad-Toh et al. [2011]). It's a published app on the platform, and I also keep the source on GitHub.

For example, I used maf_stitcher to extract alignments of all human CCDS ORFs across 29 mammals:

$ dx run maf_stitcher -i species_set=hg19_29mammals -i regions="Reference Genomes:/hg19/hg19 CCDS Genes (20130406)" -i exon_type=CDS -i transcripts_per_job=5000

One can, of course, run the app by pressing buttons instead of entering shell commands:

The app parallelizes the stitching operation with the transcripts_per_job parameter, and GTable makes it straightforward to parallelize by splitting up the input and then collecting all the results. All-in-all, it took a few hours to extract and stitch the alignments of all 27,465 CCDS ORFs.  The resulting GTable has one row for each CCDS ORF, and a column with the alignment consensus sequence for each species:

$ dx head -w 12 gtable-B6GZbJQfZ7JgB60v5vjQ4363

┌───┬───────────┬────────────┬────────────┬────────────┬────────────┬───────────
│Row│_name      │_loci       │Human       │Chimp       │Rhesus      │Tarsier    
├───┼───────────┼────────────┼────────────┼────────────┼────────────┼───────────
│0  │CCDS10.1   │chr1:114176…│ATGGCACAGCA…│ATGGCACGGCA…│GTGGCACGGCA…│-----------
│1  │CCDS100.2  │chr1:918870…│ATGCAGCC---…│-----------…│ATGCCGCA---…│-----------
│2  │CCDS1000.1 │chr1:151512…│ATGAACGGGAC…│ATGAACGGGAC…│ATGAACGGGAC…│-----------
│3  │CCDS10000.1│chr14:10564…│ATGGAGCGCAT…│ATGGAGCGCAT…│-----------…│-----------
│4  │CCDS10001.1│chr14:10576…│ATGACGGGCCG…│ATGACGGGCCG…│ATGACGGGCCG…│-----------
│5  │CCDS10002.2│chr14:10571…│ATGGCGGCGGA…│ATGGCGGCGGA…│-----------…│-----------
│6  │CCDS10003.1│chr14:10594…│ATGGCCTCCAA…│---------AA…│-----------…│-----------
│7  │CCDS10004.1│chr14:10595…│ATGCCCAAGTG…│-----------…│ATGCCCAAGTG…│ATGTCCAAGTG
│8  │CCDS10006.1│chr14:10599…│ATGGTGCTGCC…│ATGGTGCTGCC…│ATGGTGCTGCC…│-----------
│9  │CCDS10008.1│chr15:22833…│ATGGCGCGG--…│ATGGCGCGG--…│ATGGCGCGG--…│ATGGCCCGG--
└───┴───────────┴────────────┴────────────┴────────────┴────────────┴───────────
27455 more rows


Lastly, I wrote a small script to post-process the alignments table by choosing the splicing isoform of each gene that has the longest ORF, which is useful to avoid double-counting in genome-wide analyses; the resulting table represents 18,354 CCDS ORFs. The final workflow, and the final GTables with CCDS ORF alignments, is saved in this "public" project.

A brief word about this strategy of sourcing alignments of protein-coding ORFs by extracting them from whole-genome alignments: it could be improved upon! Current whole-genome alignment methods aren't too clever when it comes to resolving complex orthology/paralogy relationships, and several groups have called out resultant alignment errors as a source of bias and error in phylogenetic analyses. For a variety of reasons, though, extraction from whole-genome alignments remains a popular approach. For example, it does not demand accurate protein-coding gene annotations in the informant genomes, and it permits relatively unbiased comparisons of evolutionary rates between coding and non-coding regions.

Sanity-checking the alignments with PhyloCSF

The unit tests for maf_stitcher verify that the human reference sequence in each alignment exactly matches the coding sequence annotated by CCDS. As an additional sanity-check on the alignments, I ran the (still in-progress) PhyloCSF app on the CCDS longest-isoform ORF alignments table.

$ dx run PhyloCSF -i alignments=gtable-B6Gfg20fZ7Jzj69f7GjQ13KG -i alignments_per_job=250

PhyloCSF looks at evolutionary signatures across the aligned species, and produces a score that's positive if the aligned region seems to represent a conserved protein-coding sequence, and negative otherwise. It does a lot of phylogenetic calculations and parameter optimization, which takes time - the above job drove 150 cores for 4-6 hours each to analyze all 18,354 ORFs. The result is the following simple GTable:


$ dx head gtable-B6K5qgXb4JPY9Pz3811Q01Y5
┌───┬───────────┬─────────────┐
│Row│name       │score        │
├───┼───────────┼─────────────┤
│0  │CCDS100.2  │3512.2578125 │
│1  │CCDS1000.1 │3658.95336914│
│2  │CCDS10000.1│2348.91552734│
│3  │CCDS10001.1│10929.1523438│
│4  │CCDS10002.2│7041.265625  │
│5  │CCDS10003.1│2454.08935547│
│6  │CCDS10004.1│821.016906738│
│7  │CCDS10006.1│2693.30834961│
│8  │CCDS10008.1│11091.0869141│
│9  │CCDS10009.1│22539.1035156│
└───┴───────────┴─────────────┘
18344 more rows


We can now use the platform's R bindings to plot the distribution of scores:


require(dxR)
scores <- getRows(DXGTable("gtable-B6K5qgXb4JPY9Pz3811Q01Y5"),columns='score')[,1]
plot(ecdf(scores),xlim=c(-5000,20000),xlab="decibans",ylab="CDF",main="PhyloCSF/29mammals on CCDS ORFs",col='red')
lines(c(0,0),c(-1,2),lty=3,col='blue')
lines(c(-1e9,1e9),c(0.088,0.088),lty=3,col='blue')



As expected, the vast majority of CCDS ORFs have positive PhyloCSF scores. The small fraction with negative scores (9%) is consistent with my previous experience - some human genes are not conserved across mammals, and others are difficult to align for various reasons. We could increase the sensitivity by having PhyloCSF search for high-scoring windows within each ORF, rather than looking only at its entirety - but this plot, combined with other smaller-scale tests of the maf_stitcher app, suffices to give us reasonable confidence that we're getting the desired alignments.

Where to now?

Now that I can slice and dice cross-species alignments on the platform, I next need to resurrect some code for estimating the phylogenetic codon models underlying the SCE methodology. This is the mathematically fanciest part of the pipeline, involving continuous-time Markov process models, expectation-maximization, L-BFGS, and some spectral approximations. Should be fun!

Separately, I'm continuing to flesh out the platform app for PhyloCSF. Although we've always provided the PhyloCSF code on GitHub, it's quite difficult to use properly, because it doesn't provide any help preparing the input alignments, because it needs to run at scale, and because it's hard to even compile. The platform makes these problems easy to solve.