Tuesday, March 11, 2014

Blogging My Genome, episode 3: data wrangling

After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the third in a series of blog posts I'm writing about my own adventure in very personal genomics.

Along with my clinical report, Illumina delivered a portable hard drive (pictured) containing the underlying next-generation sequencing data. Given my computational genomics background, this was the product I'd most been looking forward to receiving. In fact, after picking it up, I found myself hastily biking five miles through the rain to get it to a computer!

The drive is protected using TrueCrypt, with the encryption key e-mailed separately. I installed the TrueCrypt on my Linux workstation, mounted the drive, and started to look around:



PG0001312-BLD is the identifier Illumina's lab assigned to my case. In the root folder we have copies of my PDF clinical report and a file with the 5,390 variants in my genome that were interpreted by Illumina's CLIA lab, in TSV format. (It's an irritating truth that the field of bioinformatics is largely based on tab-delimited file formats - but that's another blog post.) As mentioned last time, this only includes single-nucleotide variants in the exons of 1,600 genes with currently-known medical significance.

In the Variations folder we have files containing all of the variants detected across my genome in VCF format (also tab-delimited). This is the main product of Illumina's bioinformatics pipeline, Isaac, which takes the jumble of reads from the sequencing instrument through to variant calls, preceding clinical interpretation.

The Assembly folder contains the mother lode: a 66 GiB BAM file containing all of the reads, associated quality information, and their estimated positions of origin, or mappings, with respect to the human reference genome assembly. While this file is really big, it's actually just half the size I'd been expecting, and I'll get to why in a future post.

The BAM file is an intermediate product of the Isaac bioinformatics pipeline, from which the input data is still recoverable. It'll be my goal to get that raw data out and redo much of the bioinformatics from scratch. Not because there's anything wrong with Isaac, but rather because that'll be the best way for me - and, I hope, readers of this blog - to learn about the process. Analyzing this raw data would typically be a daunting challenge, because of the technical expertise required, the sheer quantity of the data, and the computational horsepower needed to work with it in an exploratory fashion. Fortunately however, I happen to be a bridge officer on the world's most powerful genome informatics platform.

To the cloud!

I created a DNAnexus project called "My Genome", and then used our command-line client to begin uploading the entire contents of the portable hard drive.


I set this running just before going to dinner, and it'd completed by the time I woke up the next morning. As a result, all my genome data now lives safely in the cloud.


FASTQ regeneration

To begin the bioinformatics, I'd like to get all my sequencing reads into FASTQ format, which contains the sequence and quality information, but not the mappings also encoded in the BAM file. FASTQ isn't truly the raw data that comes straight from the sequencing instrument, but the steps leading to it are relatively uncontroversial. Picard has a utility to extract FASTQ data from the BAM file, which I wrapped to run on DNAnexus. I can upload this to my project and execute it like so:

~/src/blogging-my-genome$ dx build -f bam_to_fastq
~/src/blogging-my-genome$ dx run bam_to_fastq -i bam=:/Assembly/PG0001312-BLD.bam

Producing:


These eight FASTQ files break down into four pairs. Each group is a pair owing to the use of paired-end sequencing, a neat trick to squeeze more information from a given quantity of reads. There are four such pairs corresponding to four lanes on two HiSeq flowcells  used to sequence my sample. Here are the first eight lines of C2KC2ACXX_1_6_none_1.fastq:

~$ dx cat C2KC2ACXX_1_6_none_1.fastq.gz | gunzip -c | head -n 8 | nl
   1  @C2KC2ACXX_1:6:1101:4:0/1
   2  AGTTNCCATATTAATTTTAGGATGAATTTTATTATGTCTGCAACAAAAACAACAACAAAATGTGGCCAGGCATGGTGGCTCATGCCTGTAATCCCAGCAC
   3  +
   4  BBBF!0BFFFFFFIIFIIIIIIIIIIIIIIFIIFIFFFIIIFIIIIFIFFFFFI<FFBF7FFBBFIIIIIFFFFFFBFFBF<BBBFBBB<<<BBB<<B'<
   5  @C2KC2ACXX_1:6:1101:6:0/1
   6  TTTCNAGCTTCCAGTATGTTGTCGCCNCCNNTCCTTTCTNGCNNNNNNNAGGATTTGTATTTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNN
   7  +
   8  BBBF!0<BB0FBBFIIIFFIFFFBBF!0<!!00<B'<BB!00!!!!!!!0000'70<<'7<BB!!!!!!!!!!!!!!!!!!!!!!!!!!!!'!!!!!!!!

Line 1 is a read identifier, line 2 is the 100 base pair (bp) DNA sequence of the read, line 3 marks the end of the sequence, and line 4 presents the Phred quality scores for each position in the read. Lines 5-8 repeat this for a second read, and so on for the remainder of this 35 GiB (uncompressed) FASTQ file. You can see that the second read wasn't very good, with a large proportion of no-call positions (N) and the lowest quality scores (!).

Each read in C2KC2ACXX_1_6_none_1.fastq has a corresponding mate pair in C2KC2ACXX_1_6_none_2.fastq, and so on for the three remaining groups. The eight FASTQ files contain a grand total of 1,306,309,136 reads, or 131 Gbp - around 40 times the (haploid) human reference genome size of 3.2 Gbp. Such high coverage is typical for clinical-grade genome sequencing.

Next time: I'll generate my own mappings of these reads back onto the reference genome.

11 comments:

  1. The index of ethnic fractionalization in United Arab Emirates is 0.6252. This means that there is a relatively high number of unique ethnic groups in United Arab Emirates. EF is usually measured as 1 minus the Herfindahl concentration index of ethnolinguistic group shares, which reproduces the probability that two randomly drawn individuals from the population belong to different groups. The theoretical maximum of EF of 1 means that each person belongs to a different group. Read below for statistics of United Arab Emirates on median age and gender distribution at various ages. http://www.confiduss.com/en/jurisdictions/united-arab-emirates/demographics/

    ReplyDelete
  2. Following this, the FORCE11 Software Citation Implementation Working Group developed checklists for software writers and developers. nursing essay help uk

    ReplyDelete
  3. Let us leave aside the essays of publicists, philosophers, doctors of sciences and writers. We still need to help me write my essay grow to these heights. On the student's agenda is an essay as a test at the university and an alternative to an interview.

    ReplyDelete
  4. I used to prefer to do various written works on my own, but the time came and we were asked to write an excellent essay, and here the writing services helped me a lot to write my essayz, they write well-written works

    ReplyDelete
  5. Hello readers of this blog! I fully agree with every word of this topic, so thanks for sharing this awesome information! At present days pupils should buy their essays to get high grades without unnecessary pressure and efforts. If you will choose this writing nursingpaper.com web agency to order task online, then your essays would be written by first-rate professional writer and with low price. At the end of the result you will get the highest A+ mark.

    ReplyDelete
  6. Now there are a lot of abstract companies - like mushrooms. But I ran into a problem - I needed a printout. And only in cheap essay writing service I was physically able to carry everything out of the office. All others are online only. It was safer for me too - I saw what I paid for. And I'm satisfied with the quality. It can be seen that they did not cheat and the uniqueness is high. Thanks for your cooperation and help! I will apply!

    ReplyDelete
  7. I know an excellent service for helping students, proven more than once through personal experience. I advise calmly as if I am sure of them. We have been in touch for several years now, and essays, abstracts, and essays for them https://writemypapers.company/ are not a problem. And what is important is prompt. Qualitatively, competently. I am satisfied. You guys are cool. Need help? Have no doubt.

    ReplyDelete
  8. Redot is a simple, logical cryptocurrency website that provides information and tools to help you learn about cryptocurrency and create smart investments. This website https://redot.com/staking/ is for anyone who is interested in cryptocurrencies and wants to learn more about them.

    ReplyDelete
  9. Hello, I wanted to say thank you for sharing this useful material. I'm not quite good at technology and the stuff related to it but this article will be useful to one of my friends who works at https://mid-terms.com/article-review-writing-help/

    ReplyDelete
  10. What is cryptocurrency trade script? Cryptocurrency exchange software script could be a computer program arrangement that gives the vital foundation for bitcoin exchanging stage without extra advancement. The code grasps irreplaceable components: trader’s stock, wallet mechanics, client dashboard, director board.

    ReplyDelete
  11. I just had to share my recent experience with bet999 promotions – it's been an absolute game-changer! ๐ŸŒŸ The bonuses and rewards are beyond amazing. ๐ŸŽ Whether you're into slots, sports betting, or live casino action, bet999 has got you covered. The promos not only boost your gameplay but also add an extra layer of thrill to every bet.

    I've never felt more valued as a player. The variety of promotions keeps things fresh, making every visit to bet999 an adventure. Trust me, you don't want to miss out on these fantastic offers! ๐ŸŽฎ๐Ÿ’ฐ

    ReplyDelete