Thursday, May 23, 2013

Working with cross-species genome alignments on the DNAnexus platform

I've recently been resurrecting some comparative genomics methods I developed in my last year of grad school, but never got to publish. These build on previous work to locate what we called Synonymous Constraint Elements (SCEs) in the human genome: short stretches within protein-coding ORFs that also encode additional, overlapping functional elements - evidenced by a markedly reduced apparent rate of synonymous substitutions in cross-species alignments. The first step in this analysis, and the subject of this post, involves extracting the cross-species sequence alignments of protein-coding ORFs from raw, whole-genome alignments. I hope to write a series of blog posts as I get various other parts of the pipeline going. I'm not exactly sure where it'll go from there, but it's pretty neat stuff I would eventually like to get peer-reviewed!

I was previously hampered from making progress on this by lack of access to a compute cluster, but that's no longer a problem now that the new DNAnexus platform is up and running. While I also work under its hood every day, the personal side project I'll describe here relies only on the public feature set available to all platform users.

Downloading the alignments

The genome alignments we'll be working with are generated by the UCSC Genome Bioinformatics group using MULTIZ. These are the same alignments used in the fly and mammalian comparative genomics projects I worked on in the past. The first step was to get the alignments in multiple alignment format (MAF) from the UCSC FTP site onto the platform  (e.g. the 46-way alignment for human). One way would have been to download them to my computer, then upload them to the platform, but that's not very efficient, as they're several dozen GB compressed. A better way is to use the  "URL Fetcher" app, which causes the platform to download a file from a given URL. For example, we can use the platform's command-line client to launch a job that downloads chr21.maf.gz into the current project:

dx run url_fetcher -i url=

Building on this, I came up with the following one-liner to download all the MAF files in that FTP directory:

$ curl -iv | grep -o chr.*\\.maf\\.gz | xargs -n 1 printf "dx run url_fetcher -y --wait -i url=\n" | xargs -n 1 -d \\n -P 8 bash -c

The curl command lists the contents of the FTP directory, from which we grep out the chr*.maf.gz filenames. Then we use printf to formulate a dx run url_fetcher command for each file, and lastly execute those commands, up to eight at a time in parallel. (A simpler first try at this incantation launched all of the url_fetcher jobs at once instead of eight at a time, but that overloaded UCSC's FTP server!) We can then see the files on the platform:

Indexing the MAF blocks

The MAF format is good for transfer and archival of the whole genome-alignments, but it's not very convenient for day-to-day comparative genomics analysis: it divides the reference genome up into many small alignment blocks, not indexed for random access, and the individual blocks only include rows for those species that actually align to that region. So, we need some tools to help us access and reformat the alignments. (Similar tools are available in Galaxy, and a command-line utility is provided in PHAST.)

We first need the ability to extract any MAF blocks overlapping desired regions (e.g. the constituent exons of known ORFs). To make this efficient, I wrote a maf_importer applet to load all the MAF blocks into a GenomicTable (GTable for short), which is basically a database table provided by the platform that can be queried by genomic coordinates and other indices. For example, dx head can run such queries on a GTable produced by maf_importer:

$ dx head gtable-B4PB620X8Pv0GqQ1Ybp00F2b --gri chr7 117227697 117235331
│Row     │chr │lo       │hi       │maf_block                       │
│28533480│chr7│117227595│117227701│a score=1206415.000000\ns hg19.…│
│28533481│chr7│117227701│117227726│a score=259963.000000\ns hg19.c…│
│28533482│chr7│117227726│117227764│a score=374335.000000\ns hg19.c…│
│28533483│chr7│117227764│117227779│a score=-15948.000000\ns hg19.c…│
│28533484│chr7│117227779│117227855│a score=3327291.000000\ns hg19.…│
│28533485│chr7│117227855│117227889│a score=1900087.000000\ns hg19.…│
│28533486│chr7│117227889│117227927│a score=109284.000000\ns hg19.c…│
│28533487│chr7│117227927│117227947│a score=-137786.000000\ns hg19.…│
│28533488│chr7│117227947│117228001│a score=659519.000000\ns hg19.c…│
│28533489│chr7│117228001│117228031│a score=330424.000000\ns hg19.c…│

The GTable is identified here by its permanent unique ID; it can also be referenced by its human-readable project name and path, in this case "MAF whole-genome alignments:/hg19 multiz46way" but unlike the permanent ID, that path can change in the future, if we rename something. So I usually prefer the ID, but I'm a nerd. The GTable resides in a "public" project I prepared, meaning that any platform user can view it. It consists simply of the reference genomic coordinates each block covers, and the text of the individual MAF block. Since this GTable conforms to the Spans type convention, the Genome Browser can also illustrate how the MAF blocks tile the human genome:

Stitching ORF alignments

Given a bunch of MAF blocks that overlap a region of interest, the next step is to "stitch" them together into a single alignment of the region, in a more convenient format. Enter the maf_stitcher app, which takes a list of gene annotations or other genomic intervals, and produces a GTable containing an alignment for each annotation - stitching MAF blocks and "splicing" annotated exons as necessary. The app has the 46-vertebrate alignments and the 15-insect alignments built-in, and lets you select from a menu of common subsets thereof (e.g. the 29 mammals analyzed in Lindblad-Toh et al. [2011]). It's a published app on the platform, and I also keep the source on GitHub.

For example, I used maf_stitcher to extract alignments of all human CCDS ORFs across 29 mammals:

$ dx run maf_stitcher -i species_set=hg19_29mammals -i regions="Reference Genomes:/hg19/hg19 CCDS Genes (20130406)" -i exon_type=CDS -i transcripts_per_job=5000

One can, of course, run the app by pressing buttons instead of entering shell commands:

The app parallelizes the stitching operation with the transcripts_per_job parameter, and GTable makes it straightforward to parallelize by splitting up the input and then collecting all the results. All-in-all, it took a few hours to extract and stitch the alignments of all 27,465 CCDS ORFs.  The resulting GTable has one row for each CCDS ORF, and a column with the alignment consensus sequence for each species:

$ dx head -w 12 gtable-B6GZbJQfZ7JgB60v5vjQ4363

│Row│_name      │_loci       │Human       │Chimp       │Rhesus      │Tarsier    
│0  │CCDS10.1   │chr1:114176…│ATGGCACAGCA…│ATGGCACGGCA…│GTGGCACGGCA…│-----------
│1  │CCDS100.2  │chr1:918870…│ATGCAGCC---…│-----------…│ATGCCGCA---…│-----------
│2  │CCDS1000.1 │chr1:151512…│ATGAACGGGAC…│ATGAACGGGAC…│ATGAACGGGAC…│-----------
│3  │CCDS10000.1│chr14:10564…│ATGGAGCGCAT…│ATGGAGCGCAT…│-----------…│-----------
│4  │CCDS10001.1│chr14:10576…│ATGACGGGCCG…│ATGACGGGCCG…│ATGACGGGCCG…│-----------
│5  │CCDS10002.2│chr14:10571…│ATGGCGGCGGA…│ATGGCGGCGGA…│-----------…│-----------
│6  │CCDS10003.1│chr14:10594…│ATGGCCTCCAA…│---------AA…│-----------…│-----------
│7  │CCDS10004.1│chr14:10595…│ATGCCCAAGTG…│-----------…│ATGCCCAAGTG…│ATGTCCAAGTG
│8  │CCDS10006.1│chr14:10599…│ATGGTGCTGCC…│ATGGTGCTGCC…│ATGGTGCTGCC…│-----------
│9  │CCDS10008.1│chr15:22833…│ATGGCGCGG--…│ATGGCGCGG--…│ATGGCGCGG--…│ATGGCCCGG--
27455 more rows

Lastly, I wrote a small script to post-process the alignments table by choosing the splicing isoform of each gene that has the longest ORF, which is useful to avoid double-counting in genome-wide analyses; the resulting table represents 18,354 CCDS ORFs. The final workflow, and the final GTables with CCDS ORF alignments, is saved in this "public" project.

A brief word about this strategy of sourcing alignments of protein-coding ORFs by extracting them from whole-genome alignments: it could be improved upon! Current whole-genome alignment methods aren't too clever when it comes to resolving complex orthology/paralogy relationships, and several groups have called out resultant alignment errors as a source of bias and error in phylogenetic analyses. For a variety of reasons, though, extraction from whole-genome alignments remains a popular approach. For example, it does not demand accurate protein-coding gene annotations in the informant genomes, and it permits relatively unbiased comparisons of evolutionary rates between coding and non-coding regions.

Sanity-checking the alignments with PhyloCSF

The unit tests for maf_stitcher verify that the human reference sequence in each alignment exactly matches the coding sequence annotated by CCDS. As an additional sanity-check on the alignments, I ran the (still in-progress) PhyloCSF app on the CCDS longest-isoform ORF alignments table.

$ dx run PhyloCSF -i alignments=gtable-B6Gfg20fZ7Jzj69f7GjQ13KG -i alignments_per_job=250

PhyloCSF looks at evolutionary signatures across the aligned species, and produces a score that's positive if the aligned region seems to represent a conserved protein-coding sequence, and negative otherwise. It does a lot of phylogenetic calculations and parameter optimization, which takes time - the above job drove 150 cores for 4-6 hours each to analyze all 18,354 ORFs. The result is the following simple GTable:

$ dx head gtable-B6K5qgXb4JPY9Pz3811Q01Y5
│Row│name       │score        │
│0  │CCDS100.2  │3512.2578125 │
│1  │CCDS1000.1 │3658.95336914│
│2  │CCDS10000.1│2348.91552734│
│3  │CCDS10001.1│10929.1523438│
│4  │CCDS10002.2│7041.265625  │
│5  │CCDS10003.1│2454.08935547│
│6  │CCDS10004.1│821.016906738│
│7  │CCDS10006.1│2693.30834961│
│8  │CCDS10008.1│11091.0869141│
│9  │CCDS10009.1│22539.1035156│
18344 more rows

We can now use the platform's R bindings to plot the distribution of scores:

scores <- getRows(DXGTable("gtable-B6K5qgXb4JPY9Pz3811Q01Y5"),columns='score')[,1]
plot(ecdf(scores),xlim=c(-5000,20000),xlab="decibans",ylab="CDF",main="PhyloCSF/29mammals on CCDS ORFs",col='red')

As expected, the vast majority of CCDS ORFs have positive PhyloCSF scores. The small fraction with negative scores (9%) is consistent with my previous experience - some human genes are not conserved across mammals, and others are difficult to align for various reasons. We could increase the sensitivity by having PhyloCSF search for high-scoring windows within each ORF, rather than looking only at its entirety - but this plot, combined with other smaller-scale tests of the maf_stitcher app, suffices to give us reasonable confidence that we're getting the desired alignments.

Where to now?

Now that I can slice and dice cross-species alignments on the platform, I next need to resurrect some code for estimating the phylogenetic codon models underlying the SCE methodology. This is the mathematically fanciest part of the pipeline, involving continuous-time Markov process models, expectation-maximization, L-BFGS, and some spectral approximations. Should be fun!

Separately, I'm continuing to flesh out the platform app for PhyloCSF. Although we've always provided the PhyloCSF code on GitHub, it's quite difficult to use properly, because it doesn't provide any help preparing the input alignments, because it needs to run at scale, and because it's hard to even compile. The platform makes these problems easy to solve.


  1. Good day everybody! Our online writing service that works around 6 years uses only top rated and qualified professional writers. They are all experts in their field and their reliable services ensure you will get the points you deserve and receive all of the credits you need to have to be the smartest student at the university.

  2. Hello people from all over the world! Our service helps pupils with their assignments like hometask, report writing, essay writing, business plan, dissertation writing help and with theses as well. Our great papers come without any plagiarism and with the best citations.

  3. Would help me write an essay and always pay attention and also I look at the reviews. I found this site I can advise you and I think you will definitely be satisfied with other users for the quality of work Expert Writers

  4. Hi guys! If you are looking for a reliable and inexpensive writing servise - go to service. These guys are real professionals in writing and proofreading. You can order any type of academic writing on their website and be sure that it will be done with the highest quality possible and on time.

  5. If it is true that education is the main foundation of any society, it follows that the state of race in today’s America mirrors its education system. Therefore, America’s education needs serious examination and even remaking. It is a system that uses Blacks (and other marginalized people) as mere tokens. You see a meager quota of Black people (as employees or students) here and to give the false impression of equity.

  6. Flash files are used to upgrade or reinstall the operating system in your device. If you want to install infocus m430 flashing, you can visit the link.

  7. This comment has been removed by the author.

  8. The more years a company operates in the market, the better. This means that it has gained a good reputation and trust from its customers. has been helping students for many years. We try to inspire our clients and give them additional motivation to perform their tasks faster and better

  9. At there are professionally qualified English-language essay writers who are free from grammatical error and who are also decorated with great vocabulary, linguistics and stylistic instruments. Their essay writing services are therefore well recognized to have no grammatical problems.

  10. I learned absolutely everything about my question when I read this post, thanks to the author for the detailed description. I wrote my review on the paperwriter reviews you can go in and read. Thank you very much for your attention and your time.

  11. Hello! In our company work a definitely best college papers writers in the business. If you need some support with any intellectual text including article, study research, dissertation or thesis, we have enough qualified masters available to help you to reach the best result as possible.

  12. Good evening! A professional expert with relevant expertise in your respective discipline. Your order will be assigned manually to a subject-matter writer with the required degree and time availability. Our company offers custom writing services across a broad range of academic subjects. Almost any type of work is doable. Proceed to the order form to see the full list of our services and disciplines.

  13. Not everyone knows how to start writing an essay. It happens that a person knows how to tell a story, expresses his thoughts and feelings clearly and confidently, but it can be difficult for him to put it on paper, to write an essay or other text. How to start an essay? The beginning is sometimes the hardest thing to give, and a prolonged reflection on "where to start," can ruin the whole process of creative work. So, with what words to start the essay:
    Before you start, think about the topic you will write about, find different sources for inspiration and work.

  14. This comment has been removed by the author.

  15. Canvas On Demand provides a wide range of personalized photo products and services. From transforming your favorite photos into stunning canvas prints to creating unique photo gifts, they offer a variety of options to help you showcase your cherished memories and add a personal touch to your home decor. With high-quality craftsmanship and a commitment to customer satisfaction, Canvas on demand promotion Codes is a trusted choice for turning your photos into art and preserving your special moments