Tuesday, March 11, 2014

Blogging My Genome, episode 3: data wrangling

After learning about Illumina's Understand Your Genome (UYG) program at ASHG 2013, I decided to sign up to get my genome sequenced. This is the third in a series of blog posts I'm writing about my own adventure in very personal genomics.

Along with my clinical report, Illumina delivered a portable hard drive (pictured) containing the underlying next-generation sequencing data. Given my computational genomics background, this was the product I'd most been looking forward to receiving. In fact, after picking it up, I found myself hastily biking five miles through the rain to get it to a computer!

The drive is protected using TrueCrypt, with the encryption key e-mailed separately. I installed TrueCrypt on my Linux workstation, mounted the drive, and started to look around:

PG0001312-BLD is the identifier Illumina's lab assigned to my case. In the root folder we have copies of my PDF clinical report and a TSV file with the 5,390 variants in my genome that were interpreted by Illumina's CLIA lab. (It's an irritating truth that the field of bioinformatics is largely based on tab-delimited file formats - but that's another blog post.) As mentioned last time, this only includes single-nucleotide variants in the exons of 1,600 genes with currently-known medical significance.

In the Variations folder we have files containing all of the variants detected across my genome in VCF format (also tab-delimited). This is the main product of Illumina's bioinformatics pipeline, Isaac, which takes the jumble of reads from the sequencing instrument through to variant calls, preceding clinical interpretation.
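To give a flavor of the format: each VCF data line is a tab-separated record with fixed columns for the variant itself, plus optional genotype columns. Here's a quick sketch - the variant shown is made up purely for illustration, not taken from my file:

```shell
# VCF data columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
# (+ optional genotype columns). This example variant is fabricated
# for illustration only.
printf 'chr1\t10177\t.\tA\tAC\t100\tPASS\tDP=42\n' |
  awk -F'\t' '{print $1":"$2, $4">"$5}'
# prints: chr1:10177 A>AC
```

That awk one-liner pulls out just the position and the REF>ALT alleles - handy for skimming a file with millions of lines.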

The Assembly folder contains the mother lode: a 66 GiB BAM file containing all of the reads, associated quality information, and their estimated positions of origin, or mappings, with respect to the human reference genome assembly. While this file is really big, it's actually just half the size I'd been expecting, and I'll get to why in a future post.
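BAM is a binary format, but samtools can print its records as text, one read per line with its name, mapped position, sequence, and qualities. The standard way to take a peek (assuming a local copy of the file) would be something like:

~$ samtools view Assembly/PG0001312-BLD.bam | head -n 1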

The BAM file is an intermediate product of the Isaac bioinformatics pipeline, from which the input data is still recoverable. It'll be my goal to get that raw data out and redo much of the bioinformatics from scratch. Not because there's anything wrong with Isaac, but rather because that'll be the best way for me - and, I hope, readers of this blog - to learn about the process. Analyzing this raw data would typically be a daunting challenge, because of the technical expertise required, the sheer quantity of the data, and the computational horsepower needed to work with it in an exploratory fashion. Fortunately, however, I happen to be a bridge officer on the world's most powerful genome informatics platform.

To the cloud!

I created a DNAnexus project called "My Genome", and then used our command-line client to begin uploading the entire contents of the portable hard drive.
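From memory, the whole thing boiled down to a recursive upload along these lines (the local mount point shown is illustrative):

~$ dx select "My Genome"
~$ dx upload --recursive /media/truecrypt1/*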

I set this running just before going to dinner, and it'd completed by the time I woke up the next morning. As a result, all my genome data now lives safely in the cloud.

FASTQ regeneration

To begin the bioinformatics, I'd like to get all my sequencing reads into FASTQ format, which contains the sequence and quality information, but not the mappings also encoded in the BAM file. FASTQ isn't truly the raw data that comes straight from the sequencing instrument, but the steps leading to it are relatively uncontroversial. Picard has a utility to extract FASTQ data from the BAM file, which I wrapped to run on DNAnexus. I can upload this to my project and execute it like so:

~/src/blogging-my-genome$ dx build -f bam_to_fastq
~/src/blogging-my-genome$ dx run bam_to_fastq -i bam=:/Assembly/PG0001312-BLD.bam
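For the record, the wrapper essentially just invokes Picard's SamToFastq tool; run directly on a workstation, it'd look roughly like this (jar name and option spelling from memory - check your Picard version's docs):

~$ java -jar SamToFastq.jar INPUT=PG0001312-BLD.bam OUTPUT_PER_RG=true OUTPUT_DIR=fastq/

OUTPUT_PER_RG=true emits a separate FASTQ pair for each read group recorded in the BAM header, rather than one giant pair of files.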


The result is eight FASTQ files, which break down into four pairs. The files come in pairs owing to the use of paired-end sequencing, a neat trick to squeeze more information from a given quantity of reads. There are four such pairs corresponding to the four lanes on two HiSeq flowcells used to sequence my sample. Here are the first eight lines of C2KC2ACXX_1_6_none_1.fastq:

~$ dx cat C2KC2ACXX_1_6_none_1.fastq.gz | gunzip -c | head -n 8 | nl
   1  @C2KC2ACXX_1:6:1101:4:0/1
   2  [100 bp of sequence, elided here]
   3  +
   4  [100 quality characters, elided here]
   5  @C2KC2ACXX_1:6:1101:6:0/1
   6  [100 bp of sequence, elided here]
   7  +
   8  BBBF!0<BB0FBBFIIIFFIFFFBBF!0<!!00<B'<BB!00!!!!!!!0000'70<<'7<BB!!!!!!!!!!!!!!!!!!!!!!!!!!!!'!!!!!!!!

Line 1 is a read identifier, line 2 is the 100 base pair (bp) DNA sequence of the read, line 3 marks the end of the sequence, and line 4 presents the Phred quality scores for each position in the read. Lines 5-8 repeat this for a second read, and so on for the remainder of this 35 GiB (uncompressed) FASTQ file. You can see that the second read wasn't very good, with a large proportion of no-call positions (N) and the lowest quality scores (!).
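Those quality characters are just printable ASCII: under the Phred+33 encoding used here, each score is the character's ASCII code minus 33, where a score of Q means an estimated error probability of 10^(-Q/10). A quick sketch of the decoding:

```shell
# Phred+33: quality score = ASCII code of the character minus 33.
# '!' (ASCII 33) is the worst possible score, Q0; 'I' (ASCII 73) is Q40,
# i.e. an estimated error probability of 1 in 10,000.
for c in '!' 'B' 'F' 'I'; do
  printf '%s -> Q%d\n' "$c" $(( $(printf '%d' "'$c") - 33 ))
done
# prints:
# ! -> Q0
# B -> Q33
# F -> Q37
# I -> Q40
```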

Each read in C2KC2ACXX_1_6_none_1.fastq has a corresponding mate pair in C2KC2ACXX_1_6_none_2.fastq, and so on for the three remaining groups. The eight FASTQ files contain a grand total of 1,306,309,136 reads, or 131 Gbp - around 40 times the (haploid) human reference genome size of 3.2 Gbp. Such high coverage is typical for clinical-grade genome sequencing.
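That coverage figure is just arithmetic on the read count and read length, which you can check for yourself:

```shell
# Average coverage = total bases sequenced / genome length.
reads=1306309136     # total reads across the eight FASTQ files
read_len=100         # each read is 100 bp
genome=3200000000    # haploid human reference, ~3.2 Gbp
awk -v r="$reads" -v l="$read_len" -v g="$genome" \
    'BEGIN { printf "%.1fx average coverage\n", r * l / g }'
# prints: 40.8x average coverage
```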

Next time: I'll generate my own mappings of these reads back onto the reference genome.