Sunday, April 26, 2015

Blogging My Genome, episode 7: sifting for bad news

This is the seventh in a series of blog posts about my genome, which I had sequenced through Illumina's Understand Your Genome program.

Hey, it's been a while! We've been unbelievably busy at my company, but I've been slowly plugging away at my genome analysis. When I last blogged, I'd completed the process of identifying small variants in my genome (affecting just one or a few DNA nucleotides). This takes us into an interesting new analysis phase - interpreting the consequences of those variants in the context of existing knowledge of human genetics. I previously went into depth on a certain variant I'd known to look for, but now we'll sift through the others in my VCF file - nearly four million of them!

I began with Ensembl's Variant Effect Predictor (VEP), one of several available tools that annotate VCF variants with their likely consequences for known genes and other genomic features. VEP produces a new VCF file with this additional information crammed into each entry, like so:

1       871215  .       C       G       1357.81 .       AB=0;ABP=0;AC=2;AF=1;AN=2;AO=45;CIGAR=1X;DP=45;DPB=45;DPRA=0;EPP=4.21667;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=63.5608;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1543;QR=0;RO=0;RPP=3.44459;RPPR=0;RUN=1;SAF=19;SAP=5.37479;SAR=26;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1;CSQ=G|ENSG00000187634|ENST00000341065|Transcript|synonymous_variant|140|141|47|P|ccC/ccG|rs28419423|0.0261008|0.00232558|3/12|||||||1|||SAMD11|HGNC|||G:0.0629|protein_coding|ENSP00000349216|||0.03|0.08|0.16|0.0026|,G||ENSR00000528855|RegulatoryFeature|regulatory_region_variant||||||rs28419423|0.0261008|0.00232558|||||||||||||||G:0.0629|||||0.03|0.08|0.16|0.0026|        GT:DP:RO:QR:AO:QA:GL    1/1:45:0:0:45:1543:-10,-10,0
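
For the record, running VEP itself is essentially a one-liner. Here's a sketch of the sort of invocation involved - the script name and flags vary by VEP release, and this assumes the offline annotation cache has been downloaded locally:

perl variant_effect_predictor.pl --cache --offline --vcf --everything -i mlin.vcf -o mlin.vep.vcf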

That's really not pretty, but VEP also produces a nice series of summary charts. For example, it breaks down putative consequences of my variants in protein-coding sequences.


Practical efforts to interpret an individual's genome today focus on variants within such protein-coding sequences, because they're much better understood than RNA genes, introns, regulatory elements, and other non-coding regions. Of my 3.7 million genome-wide variants, only a few tens of thousands lie within coding regions - which reflects the overall composition of the human genome, of which protein-coding sequence makes up only a percent or two. Moreover, more than half of those are synonymous, leaving the encoded amino acid sequence unchanged, and therefore rather unlikely to cause important consequences (although I could tell you about some exceptions). That leaves about 30,000 non-synonymous coding variants - still quite a lot to plow through on my own!
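
As a rough sanity check on such counts, VEP's bundled filter_vep script (filter_vep.pl in older releases) can tally categories straight from the annotated VCF. A sketch, using missense variants as one representative non-synonymous category - note this counts VCF records, so multiple transcript effects per variant are collapsed:

grep -vc '^#' mlin.vep.vcf    # total variant records, excluding header lines
filter_vep -i mlin.vep.vcf --filter "Consequence is missense_variant" | grep -vc '^#'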


To sieve them further, I ran my VEP VCF through GEMINI, which layers on further annotations from existing knowledge databases and other tools, and loads everything nicely into a local SQLite database. This made it easy to query my variants and pull out interesting subsets, like so:

gemini query --show-samples --header -q "select * from variants where is_lof=1 or (clinvar_sig is not null and clinvar_sig not in ('non-pathogenic','untested','unknown'))" mlin.db
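
For completeness: the mlin.db database is built beforehand by loading the VEP-annotated VCF, roughly as follows. The -t VEP flag tells GEMINI which tool's consequence annotations to parse.

gemini load -v mlin.vep.vcf -t VEP --cores 4 mlin.db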

I thus selected my small variants that are either likely to wreck the encoded protein (loss of function, LOF - usually a stop gain, frameshift, or splice site mutation), or that match an interesting record in ClinVar, a public database of variants with known significance in clinical (medical) settings. This produced a table of 885 variants, discarding many thousands of non-synonymous variants that neither clearly have drastic effects on proteins nor are known to be clinically significant. Many of those may well be interesting in various ways, but right now they're just very difficult to interpret in isolation.

The remaining 885 are finally a reasonable number to load into that near-universal attractor of bioinformatics data curation, the office spreadsheet.


Notice my ALDH2 variant annotated with "mixed" significance in "acute alcohol sensitivity" and "alcohol dependence" - makes sense. But since I don't seem to suffer from 884 other notable genetic conditions, what to make of the rest? Well, there's quite a lot more to consider at this stage:
  • I have two copies of nearly all genes, from my mother and father. If only one copy of a given gene is busted, there might be reduced or no consequence.
  • Even two broken copies of a given gene aren't necessarily that bad; many genes exist in gene families, or participate in elaborate regulatory networks, which probably have some degree of robustness. Some genes have rather trivial functions or do nothing really useful at all.
  • Many of the LOF variants are widespread among people of my ethnicity. That makes it unlikely they're really terrible, else they'd have been selected out - although one could dispute this for adult-onset diseases, which strike after reproductive age and thus largely escape selection.
  • The evidence and reports underlying ClinVar records are sometimes mixed or conflicting, and frequently revised. 
  • The annotations of gene structures, which underlie the LOF annotations, are similarly never quite complete.
  • Some errors in the sequencing and variant calling pipeline will have slipped through.
Based on these and other considerations, I made the practical decision to bypass much of the spreadsheet. For example, I found just fourteen variants marked "pathogenic" in ClinVar and shared by <25% of Asians, a dozen or so homozygous LOF single-nucleotide variants, a bunch of small indels that are tricky to annotate, and a handful of possible compound heterozygosity situations (though current sequencing technology usually leaves these ambiguous, since it can't tell which parent each variant came from). I'll dive into some examples next time, but lest anyone get too worried, let me add now that the overall amount of broken stuff in my genome is really not unusual; we've all got a bunch of it. And, while filtering and sorting my spreadsheet has served my casual purposes, there are numerous great products and services nowadays that can organize this triage more professionally.
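
To give a flavor, the first of those cuts came from a GEMINI query along these lines - with the caveat that column names come from GEMINI's schema and shift between releases (the 1000 Genomes East Asian allele frequency, for instance, may be called aaf_1kg_eas or aaf_1kg_asn):

gemini query --header -q "select chrom, start, ref, alt, gene, impact, clinvar_sig, aaf_1kg_eas from variants where clinvar_sig = 'pathogenic' and (aaf_1kg_eas is null or aaf_1kg_eas < 0.25)" mlin.db

And the homozygous LOF cut would look something like the following, where the sample name in the genotype filter (hypothetically mlin here) has to match the VCF header:

gemini query -q "select chrom, start, ref, alt, gene from variants where is_lof=1 and type='snp'" --gt-filter "gt_types.mlin == HOM_ALT" mlin.db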

Stepping back: why sequence someone's whole genome - billions of DNA nucleotides - and then look carefully at only a few dozen variants? Well, this approach makes sense when we aren't looking for any specific known genetic condition or risk factor. We collect everything, and then look at whatever seems most interesting given the knowledge and resources at hand. All those millions of variants I ruthlessly skipped over, because we're not yet good at making sense of them, will still be there in the future! The field is advancing so quickly that there will be much more to learn if I revisit them in a few years. Furthermore, advances in DNA sequencing technology will hopefully give me a better look at structural variants, in addition to the small variants we're now pretty good at detecting.

Next time, I'll write up a few interesting vignettes from individually curating my small variants.