Sunday, April 26, 2015

Blogging My Genome, episode 7: sifting for bad news

This is the seventh in a series of blog posts about my genome, which I had sequenced through Illumina's Understand Your Genome program.

Hey, it's been a while! We've been unbelievably busy at my company, but I've been slowly plugging away on my genome analysis. When I last blogged, I'd completed the process of identifying small variants in my genome (those affecting just one or a few DNA nucleotides). This takes us into an interesting new analysis phase - interpreting the consequences of those variants in the context of existing knowledge of human genetics. I previously went into depth on a certain variant I'd known to look for, but now we'll sift through the others in my VCF file - nearly four million of them!

I began with Ensembl's Variant Effect Predictor (VEP), one of several available tools that annotate VCF variants with their likely consequences for known genes and other genomic features. VEP produces a new VCF file with this additional information crammed into each entry, like so:

1       871215  .       C       G       1357.81 .       AB=0;ABP=0;AC=2;AF=1;AN=2;AO=45;CIGAR=1X;DP=45;DPB=45;DPRA=0;EPP=4.21667;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=63.5608;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1543;QR=0;RO=0;RPP=3.44459;RPPR=0;RUN=1;SAF=19;SAP=5.37479;SAR=26;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1;CSQ=G|ENSG00000187634|ENST00000341065|Transcript|synonymous_variant|140|141|47|P|ccC/ccG|rs28419423|0.0261008|0.00232558|3/12|||||||1|||SAMD11|HGNC|||G:0.0629|protein_coding|ENSP00000349216|||0.03|0.08|0.16|0.0026|,G||ENSR00000528855|RegulatoryFeature|regulatory_region_variant||||||rs28419423|0.0261008|0.00232558|||||||||||||||G:0.0629|||||0.03|0.08|0.16|0.0026|        GT:DP:RO:QR:AO:QA:GL    1/1:45:0:0:45:1543:-10,-10,0

That's really not pretty, but VEP also produces a nice series of summary charts. For example, it breaks down putative consequences of my variants in protein-coding sequences.
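For anyone who does want to pick apart a raw record like the one above, here is a minimal Python sketch of unpacking the CSQ entry. It assumes that annotations are separated by commas and fields by pipes, as in the example, and that the first five fields are Allele, Gene, Feature, Feature_type and Consequence. That field order is my assumption for illustration; the order that actually applies is declared in the CSQ description line of the VCF header. The helper name `parse_csq` and the shortened INFO string are made up for this example.

```python
def parse_csq(info, field_names):
    """Split the CSQ entry of a VCF INFO string into one dict per annotation."""
    for entry in info.split(";"):
        if entry.startswith("CSQ="):
            csq = entry[len("CSQ="):]
            break
    else:
        return []  # this record carries no CSQ entry

    annotations = []
    for block in csq.split(","):       # one block per transcript or regulatory feature
        values = block.split("|")      # empty strings mark missing values
        # zip stops at the shorter list, so trailing fields we have not named are ignored
        annotations.append(dict(zip(field_names, values)))
    return annotations


# Assumed names for the leading fields; check the VCF header for the real order.
FIELDS = ["Allele", "Gene", "Feature", "Feature_type", "Consequence"]

# Shortened INFO string modeled on the record above.
info = (
    "DP=45;TYPE=snp;"
    "CSQ=G|ENSG00000187634|ENST00000341065|Transcript|synonymous_variant|140,"
    "G||ENSR00000528855|RegulatoryFeature|regulatory_region_variant|"
)

annotations = parse_csq(info, FIELDS)
for ann in annotations:
    print(ann["Feature"], ann["Feature_type"], ann["Consequence"])
```

Tallying the `Consequence` values across every record would give roughly the kind of breakdown that the summary charts show.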


Practical efforts to interpret an individual's genome today focus on variants within such protein-coding sequences, because they're much better understood than RNA genes, introns, regulatory elements, and other non-coding regions. Of my 3.7 million genome-wide variants, only a few tens of thousands lie within coding regions, which reflects how small a fraction of the human genome is protein-coding. Moreover, more than half of those are synonymous, and therefore rather unlikely to cause important consequences (although I could tell you about some exceptions). That leaves about 30,000 non-synonymous coding variants - still quite a lot to plow through on my own!


To sieve them further, I ran my VEP VCF through GEMINI, which adds further annotations from existing knowledge databases and other tools, and loads everything nicely into a local SQLite database. This made it easy to query my variants and pull out interesting subsets, like so:

gemini query --show-samples --header -q "select * from variants where is_lof=1 or (clinvar_sig is not null and clinvar_sig not in ('non-pathogenic','untested','unknown'))" mlin.db

I thus selected my small variants that are either likely to wreck the encoded protein (loss of function, LOF, usually a stop gain, frameshift, or splice site mutation), or that match an interesting record in ClinVar, a public database of variants of known significance in clinical (medical) settings. This produced a table of 885 variants, discarding many thousands of non-synonymous variants that don't clearly have drastic effects on proteins and aren't known to be clinically significant. Many of those may well be interesting in various ways, but right now they're just very difficult to interpret in isolation.
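To make the logic of that filter concrete, here is a small sketch that runs the same WHERE clause with Python's built-in sqlite3 module against a toy in-memory table. Only `is_lof` and `clinvar_sig` are taken from the query above; the `variant_id` and `gene` columns, the gene names, and all the rows are invented for illustration, and a real GEMINI database has many more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table variants "
    "(variant_id integer, gene text, is_lof integer, clinvar_sig text)"
)

# Made-up rows covering each branch of the filter.
rows = [
    (1, "GENE_A", 1, None),              # LOF, no ClinVar record      -> kept
    (2, "GENE_B", 0, "pathogenic"),      # interesting ClinVar record  -> kept
    (3, "GENE_C", 0, "non-pathogenic"),  # excluded ClinVar label      -> dropped
    (4, "GENE_D", 0, None),              # neither LOF nor in ClinVar  -> dropped
    (5, "GENE_E", 0, "untested"),        # excluded ClinVar label      -> dropped
]
conn.executemany("insert into variants values (?, ?, ?, ?)", rows)

query = """
select variant_id, gene from variants
where is_lof = 1
   or (clinvar_sig is not null
       and clinvar_sig not in ('non-pathogenic', 'untested', 'unknown'))
order by variant_id
"""
selected = conn.execute(query).fetchall()
print(selected)
```

The explicit `is not null` check spells out that variants without any ClinVar record are kept only when they are LOF.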

The 885 remaining are finally a reasonable number to load into that near-universal attractor of bioinformatics data curation, the office spreadsheet.


Notice my ALDH2 variant annotated with "mixed" significance in "acute alcohol sensitivity" and "alcohol dependence" - makes sense. But since I don't seem to suffer from 884 other notable genetic conditions, what to make of the rest? Well, there's quite a lot more to consider at this stage:
  • I have two copies of nearly all genes, from my mother and father. If only one copy of a given gene is busted, there might be reduced or no consequence.
  • Even having two broken copies of a given gene isn't necessarily that bad; many genes exist in gene families, or participate in elaborate regulatory networks, which probably have some degree of robustness. Some genes have rather trivial functions or do nothing really useful at all.
  • Many of the LOF variants are widespread among people of my ethnicity. That makes it unlikely they're really terrible, else they'd have been selected out - although one could dispute this for adult-onset diseases.
  • The evidence and reports underlying ClinVar records are sometimes mixed or conflicting, and frequently revised. 
  • The annotations of gene structures, which underlie the LOF annotations, are similarly never quite complete.
  • Some errors in the sequencing and variant calling pipeline will have slipped through.
Based on these and other considerations, I can make the practical decision to bypass much of the spreadsheet. For example, I found just fourteen variants marked "pathogenic" in ClinVar and shared by <25% of Asians, a dozen or so homozygous LOF single-nucleotide variants, a bunch of small indels that are tricky to annotate, and a handful of possible compound heterozygosity situations (though current technology usually leaves these ambiguous). I'll dive into some examples next time, but lest anyone get too worried, let me add now that the overall amount of broken stuff in my genome is really not unusual; we've all got a bunch of it. And, while filtering and sorting my spreadsheet has served my casual purposes, there are numerous great products and services nowadays that can organize this triage more professionally.
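The triage described above can be sketched in a few lines of Python. This is only an illustration of the idea, not what I actually did in the spreadsheet: the record keys (`id`, `gene`, `is_lof`, `clinvar_sig`, `pop_freq`, `genotype`), the `triage` function, and the toy variants are all hypothetical. Two or more heterozygous LOF variants in the same gene are flagged only as candidates for compound heterozygosity, since without phasing we can't tell whether they sit on different copies of the gene.

```python
from collections import defaultdict


def triage(variants, max_pop_freq=0.25):
    """Split annotated variants into three shortlists worth a closer look."""
    # ClinVar "pathogenic" and not too common in the reference population.
    rare_pathogenic = [
        v for v in variants
        if v["clinvar_sig"] == "pathogenic" and v["pop_freq"] < max_pop_freq
    ]
    # Both copies of the gene carry the same loss-of-function variant.
    hom_lof = [v for v in variants if v["is_lof"] and v["genotype"] == "hom_alt"]

    # Genes with two or more heterozygous LOF variants: possible compound hets.
    het_lof_by_gene = defaultdict(list)
    for v in variants:
        if v["is_lof"] and v["genotype"] == "het":
            het_lof_by_gene[v["gene"]].append(v)
    compound_het = {g: vs for g, vs in het_lof_by_gene.items() if len(vs) >= 2}

    return rare_pathogenic, hom_lof, compound_het


# Made-up variants, one or two per category.
toy = [
    {"id": "v1", "gene": "GENE_A", "is_lof": False, "clinvar_sig": "pathogenic",
     "pop_freq": 0.02, "genotype": "het"},
    {"id": "v2", "gene": "GENE_B", "is_lof": False, "clinvar_sig": "pathogenic",
     "pop_freq": 0.40, "genotype": "het"},      # too common, skipped
    {"id": "v3", "gene": "GENE_C", "is_lof": True, "clinvar_sig": None,
     "pop_freq": 0.10, "genotype": "hom_alt"},
    {"id": "v4", "gene": "GENE_D", "is_lof": True, "clinvar_sig": None,
     "pop_freq": 0.01, "genotype": "het"},
    {"id": "v5", "gene": "GENE_D", "is_lof": True, "clinvar_sig": None,
     "pop_freq": 0.03, "genotype": "het"},
]

rare_pathogenic, hom_lof, compound_het = triage(toy)
print([v["id"] for v in rare_pathogenic])
print([v["id"] for v in hom_lof])
print(sorted(compound_het))
```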

Stepping back: why sequence someone's whole genome - billions of DNA nucleotides - and then look at only a few dozen carefully? Well, this approach makes sense if we aren't looking for any specific known genetic condition or risk factor. So we collect everything, and then look at what seems most interesting with what knowledge and resources we have. All those millions of variants I skipped over ruthlessly, because we're not good at making sense of them yet, will still be there in the future! The field is advancing so quickly that there will be much more to learn if I revisit them in a few years. Furthermore, DNA sequencing technology advancements will hopefully give me a better look at structural variants in addition to the small variants we're now pretty good at detecting.

Next time, I'll write a few interesting vignettes from the individual curation of my small variants.
