Sunday, April 26, 2015

Blogging My Genome, episode 7: sifting for bad news

This is the seventh in a series of blog posts about my genome, which I had sequenced through Illumina's Understand Your Genome program.

Hey, it's been a while! We've been unbelievably busy at my company, but I've been plugging away slowly on my genome analysis. When I last blogged, I'd completed the process of identifying small variants in my genome (those affecting just one or a few DNA nucleotides). This takes us into an interesting new analysis phase - interpreting the consequences of those variants in the context of existing knowledge of human genetics. I previously went into depth on a certain variant I'd known to look for, but now we'll sift through the others in my VCF file - nearly four million of them!

I began with Ensembl's Variant Effect Predictor (VEP), one of several available tools that annotate VCF variants with their likely consequences for known genes and other genomic features. VEP produces a new VCF file with this additional information crammed into the INFO column of each entry as a CSQ field, like so:

1       871215  .       C       G       1357.81 .       AB=0;ABP=0;AC=2;AF=1;AN=2;AO=45;CIGAR=1X;DP=
45;DPB=45;DPRA=0;EPP=4.21667;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=63.5608;P
AIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=1543;QR=0;RO=0;RPP=3.44459;RPPR=0;RUN=1;SAF=19;SAP=5.37
479;SAR=26;SRF=0;SRP=0;SRR=0;TYPE=snp;technology.ILLUMINA=1;CSQ=G|ENSG00000187634|ENST00000341065|Transcript|synonymous_variant|140|141|47|P|ccC/ccG|rs28419
423|0.0261008|0.00232558|3/12|||||||1|||SAMD11|HGNC|||G:0.0629|protein_coding|ENSP00000349216|||0.03
|0.08|0.16|0.0026|,G||ENSR00000528855|RegulatoryFeature|regulatory_region_variant||||||rs28419423|0.
0261008|0.00232558|||||||||||||||G:0.0629|||||0.03|0.08|0.16|0.0026|        GT:DP:RO:QR:AO:QA:GL    
1/1:45:0:0:45:1543:-10,-10,0
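
If you ever want to pick that CSQ field apart in a script rather than by eye, here is a minimal Python sketch. It assumes the annotated VCF header carries an `##INFO=<ID=CSQ,...>` line whose description ends with `Format: ` followed by the pipe-separated sub-field names, and that each comma-separated block in CSQ describes one affected feature. The helper names and the trimmed five-field header and INFO strings are mine, made up for illustration; a real VEP header lists many more sub-fields.

```python
def csq_field_names(header_line):
    """Pull the CSQ sub-field names out of the ##INFO=<ID=CSQ,...> header line.

    Assumes the description ends with 'Format: name1|name2|...'.
    """
    description = header_line.split("Format: ", 1)[1]
    return description.rstrip('">\n').split("|")


def parse_csq(info, field_names):
    """Return one dict per CSQ block (one block per affected feature)."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            blocks = item[len("CSQ="):].split(",")
            return [dict(zip(field_names, block.split("|"))) for block in blocks]
    return []  # entry carries no CSQ annotation


# Hypothetical, shortened header and INFO column (illustration only).
EXAMPLE_HEADER = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations. '
    'Format: Allele|Gene|Feature|Feature_type|Consequence">'
)
EXAMPLE_INFO = (
    "DP=45;TYPE=snp;"
    "CSQ=G|ENSG00000187634|ENST00000341065|Transcript|synonymous_variant,"
    "G||ENSR00000528855|RegulatoryFeature|regulatory_region_variant"
)

names = csq_field_names(EXAMPLE_HEADER)
for annotation in parse_csq(EXAMPLE_INFO, names):
    print(annotation["Feature"], annotation["Consequence"])
```

Reading the sub-field names from the header, instead of hard-coding them, matters because the set and order of sub-fields depend on which VEP options were switched on for the run.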

The raw annotated entry is really not pretty, but VEP also produces a nice series of summary charts. For example, it breaks down putative consequences of my variants in protein-coding sequences.