Sunday, February 24, 2013

My thoughts on the immortality of television sets

There's a new GB&E manuscript sensationally blasting a certain widely-reported claim of the 2012 ENCODE Consortium paper, namely that the data generated in that project "enabled us to assign biochemical functions for 80% of the genome." I'm one of 400+ authors on that paper, but I was a bit player - not at all involved in the consortium machinations that resulted in that particular wording, which has proven quite controversial, and has already been discussed/clarified by other authors big and small.

The first author of the new criticism, Dan Graur, is an authority on molecular evolution and authored a popular textbook on that topic (one I own!). The manuscript stridently argues that ENCODE erred in using a definition of "functional element" in the human genome based on certain reproducible biochemical activities, rather than a definition based on natural selection and evolutionary conservation. Interestingly, while the consortium was mostly focused on high-throughput experimental assays to identify the biochemical activities, my modest contributions to ENCODE were entirely based on examining evolutionary evidence, through sequence-level comparative genomics. So, a few comments by a former rogue evolutionary ENCODE-insider:



Definitions of function. One practical difficulty with a selection-based definition of biological function is that selection can be very difficult to detect - as Graur et al. discuss. They should also have noted that it's actually difficult for selection to even act on many traits. They are surely well aware that significant phenotypic variants can nonetheless have essentially no effect on reproductive fitness: a disease that manifests only at advanced age, for example. Thus the evolutionary definition, taken too far, also leads to "bizarre outcomes": either calling genetic loci with causal roles in such neutral traits non-functional, or else abandoning hope of identifying their functions through association with those traits, since selection would be the only criterion for inferring function. (Josh Whitten touches on this as well.)

Graur et al. present a similarly bad strawman about ENCODE's definition:
"The ENCODE Incongruity implies that a biological function can be maintained without selection, which in turn implies that no deleterious mutations can occur in those genomic sequences described by ENCODE as functional."
This is wrong, because ENCODE didn't claim that biochemical signatures lacking evidence of selection have been, or necessarily will be, maintained over evolutionary timescales. But they may nonetheless prove highly consequential over human timescales, and to human values.

Of course, my expertise being what it is, I sat in the back of conference rooms at ENCODE meetings and thought to myself, all this non-conserved stuff, is it not crap? But the above is a humbling truth I've slowly come to accept.

Affirming the consequent, that is, applying the scientific method. One early section of the Graur et al. manuscript presents a step-by-step walkthrough of affirming the consequent, an error in deductive reasoning. They accuse ENCODE of committing this grave error by inferring function based on indirect biochemical readouts.

As an admirer of Bayesian methodologies, I'm baffled to see an appeal to deductive logic in a paper about data interpretation in the natural sciences. Graur and coauthors are surely acquainted with Bayes' theorem, but their writing here suggests they've yet to grasp it as the essence of reasoning under uncertainty - that is, of practically all reasoning performed in the natural sciences. For Bayes' theorem provides the precise justification for "affirming the consequent" in the presence of uncertainty - which is not erroneous at all, but instead permits inference to the best explanation. A hypothesis is confirmed by any body of data that its truth renders probable. This is the essence of the scientific method, and no progress in our field can be made without it!

(Dear reader, I expect that on this point you're either totally with me, or else it sounds like metaphysical gibberish. If the latter, I strongly urge you to read this book. It may change your life!)
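
To make the Bayesian point concrete, here is a minimal sketch in Python. The function name and every number in it are illustrative assumptions of mine, not estimates from ENCODE or from Graur et al. Take A to be "the hypothesis is true" and B to be "the predicted consequence is observed." Observing B raises the probability of A exactly when B is more probable under A than under not-A, and how much it raises it depends on how often B turns up when A is false.

```python
def posterior(prior, p_b_given_a, p_b_given_not_a):
    """Return P(A | B) by Bayes' theorem.

    prior            -- P(A), plausibility of the hypothesis before seeing B
    p_b_given_a      -- P(B | A), how often B is observed when A is true
    p_b_given_not_a  -- P(B | not A), how often B is observed when A is false
    """
    # Total probability of observing B, summed over both hypotheses.
    evidence = p_b_given_a * prior + p_b_given_not_a * (1.0 - prior)
    return p_b_given_a * prior / evidence


# Hypothetical inputs, chosen only to show the shape of the argument.
PRIOR = 0.10        # assumed P(A)
P_B_GIVEN_A = 0.90  # assumed P(B | A)

for p_b_given_not_a in (0.90, 0.50, 0.10, 0.01):
    p = posterior(PRIOR, P_B_GIVEN_A, p_b_given_not_a)
    print(f"P(B | not A) = {p_b_given_not_a:.2f}  ->  P(A | B) = {p:.3f}")
```

When B is just as likely without A as with it, the posterior equals the prior and the observation tells us nothing. As B becomes rarer in the absence of A, the same observation supports A more and more strongly. Disagreement over the inputs is a disagreement about priors and likelihoods, not about whether the inference is logically permitted.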

Graur et al.'s overly conservative definition of biological function, and their flawed view that it's necessary to deductively refute any alternative explanations before even claiming evidence for a hypothesis, generally undermine the several ensuing sections that individually attack the specific assays ENCODE used and the putative associated functions. These sections will probably be addressed by other consortium members far better than I'd be able to - but the two problems I've discussed are the biggest by far, in my opinion. And in fairness, Graur et al. do present numerous specific points on which I'd agree.

On the presentation: I honestly don't mind too much the style in which the Graur et al. manuscript is written. It certainly grabs your attention, and I think the polemics generally serve a rhetorical purpose, albeit over-the-top here and there. (E.T. Jaynes wrote the aforementioned book, one of my all-time favorites, in a similar style.) Were the writing watered down with weasel-words and footnotes and apologies everywhere, maybe it would be more precise, but reading it would be like chewing glass. And we certainly wouldn't be talking about it.

The shoe fits the other foot too, by the way: a lot of ENCODE participants would be more than willing to write a treatise about how the "biochemical function[s]" investigated in the project have only limited immediate implications, how the data is really quite noisy, that cutoff selection is profoundly arbitrary even if you dress it up, that - out of necessity - they invented new statistical methods as they went along, and so on. Who's gonna print that - or read it?

You can be sure there were plenty of heated discussions about those topics in the consortium, on many hours of conference calls and over beers at the Hilton in Rockville, MD. But in the end, publishing a paper in Nature is brutal even after it's accepted: you're lucky if you end up getting to say even 50% of what you'd like to. Reading the text of the consortium paper again, I think the appropriate definitions and qualifications are all strictly present and proper, but frankly there was no extra space to spend bawling over them further. (The subsequent presentation to the general public may be another matter - that was debated pretty thoroughly last fall, so I won't go into it here.)

Lastly, Graur et al. should have shown more appreciation for the fact that a paper cannot be written by unanimous consent of 400+ people (cf. the US Congress). As a participant, if you disagree with some aspect of the wording negotiated by the PIs and editors, one option is to berate the other consortium members, remove yourself from the author list, and decline further funding - I have seen this happen (well, certainly the beratement part, with pledges to do the others; I don't know if they actually followed through). Another is to contribute to the best of your abilities, publish your take on it separately, and trust that this will not be seen as an inconsistency to be mocked, but rather diversity to be appreciated.

10 comments:

  1. Not associated with Dan or with Encode.

    If I'm reading you right, you seem to assert that affirming the consequent is really just the Bayesian essence of the scientific method. Don't you have to observe both that A implies B and that ~A implies ~B in order to conclude that B implies A? Taking A to be function, and B to be a biochemical marker, seeing ~A implies ~B would mean seeing that non-functional sequence rarely has some biochemical marker. If it is not the case that ~A implies ~B (i.e. we often see non-functional sequence that has biochemical markers used to infer functionality), isn't Dan's complaint justified? If we often see non-functional sequence with a biochemical marker, then isn't it incorrect to infer that a sequence with the biochemical marker is functional?

    Graur et al. have strong support for their claim that changes in only 10% of the genome can, even in theory, have negative consequences on an organism's phenotype. The mutation rate is simply too high for a large fraction of the genome to serve an important function. The heart of the issue is precisely whether Graur et al.'s definition is overly conservative, or Encode's definition is overly broad. A definition of function which includes an enormous amount of sequence that can be removed from the genome without any deleterious consequences seems overly broad to me. A definition of function that includes only sequences that affect an organism's phenotype (i.e. 10% or so of the genome) seems more reasonable to me.

    Replies
    1. The excellent question your comment raises is about how often one sees non-functional sequences with certain biochemical activities - which in turn depends on your definition of "functional." It's fine to debate and even disagree over those aspects, which, if you'd allow, I'd describe as prior beliefs about the relative plausibility of alternative hypotheses. By framing this in terms of degrees of plausibility - that is, in more Bayesian terms - your comments help to clarify the actual source of disagreement. (But you don't have to surround this with additional deductive propositions!)

      Graur et al. made a criticism on starkly different grounds: they accused ENCODE of a black-and-white logical error. And not by subtle insinuation that can be mistaken for a more nuanced view like yours; it's paraded out, step-by-step, as such. Taken by itself, that section of their manuscript is just a distraction, easily dispatched because the "affirming the consequent" criticism would substantially invalidate the natural sciences. Unfortunately, the attitude that the mere existence of alternative hypotheses refutes an abductive argument is quite evident in the ensuing sections of the paper.

      Conspicuously, I haven't claimed that strong criticisms of ENCODE's interpretations cannot be written; rather, I've merely pointed out that Graur et al. framed theirs really poorly, with a backfiring appeal to deductive logic. As to what percentage of the genome is most reasonable to call "functional," I don't yet have much to add to the posts by Ewan Birney and Max Libbrecht I linked above.

    2. Thanks for your answer and sorry for phrasing my response in terms of deductive propositions. I guess I just got swept up by nostalgia for my college courses. It isn't often I have to remember what 'affirm the consequent' means. I'm sure you are right that this boils down to a disagreement about priors, hopefully stemming from a disagreement about what is appropriately called functional (or biochemical activity or whatever). And, like most of us, I don't find semantic arguments particularly interesting, nor do perceived semantical transgressions inflame my passions to such an extent that I write long missives to GBE (although I admit I enjoyed reading Graur et al., I still think it was in poor taste).

      I'm still not sure why you, John S. Wilkins or others claim that "the affirming the consequent criticism would substantially invalidate the natural sciences". It seems to me that negative controls in experiments specifically exist to avoid affirming the consequent. But don't feel obligated to enlighten me; I have a feeling I will simply have to read the book you mentioned. (The most recent book I've read on the philosophy of statistics and science was Popper, so I'm really behind the times on this one.)

    3. Thanks. I'm going to prepare another blog post re: affirming the consequent. I probably won't do a better job than Jaynes, but hopefully I can at least pique people's interest :)

  2. "This is wrong, because ENCODE didn't claim that biochemical signatures lacking evidence of selection have been, or necessarily will be, maintained over evolutionary timescales. But they may nonetheless prove highly consequential over human timescales, and to human values."

    What percentage of the genome would you estimate to lack evidence of selection AND that will prove to be highly consequential over human timescales, and to human values?

    "For Bayes' theorem provides the precise justification for "affirming the consequent" in the presence of uncertainty - which is not erroneous at all, but instead permits inference to the best explanation."

    So you consider the following argument to be valid?
    'We have examples of transcription factor binding having an influence on transcription, therefore all transcription factor binding sites are functional.'

    Replies
    1. To approach your first question, it seems one first has to establish a position on the relative importance of neutrality and selection in the evolution of visible phenotypes, then somehow try to map that onto the molecular/genomic level. On this matter, I shall humbly decline to put my opinions out there amongst those of the foremost titans of evolutionary biology. But there's an outstanding (testy!) conversation between Larry Moran and Richard Dawkins in the comments here, and that post refers to many other great sources.

      To your second question, as you present an argument which is fallacious because it permits no degree of uncertainty, no - I do not consider it valid.

  3. "To your second question, as you present an argument which is fallacious because it permits no degree of uncertainty, no - I do not consider it valid."

    I asked that question because it is exactly the kind of argument that Dr. Graur was attacking in his paper. And that's precisely how the ENCODE results were presented to the general public (http://www.genomicron.evolverzone.com/2012/09/the-encode-media-hype-machine/)! Even Pennisi wrote a piece called "ENCODE Project Writes Eulogy for Junk DNA" in Science.

    Replies
    1. I don't see what this adds. What I've written above explains the main reasons why the Graur et al. attacks fail. I can't vouch for how ENCODE was presented to the general public; I had zero involvement in that. As you can see above, I only just barely, and with a lot of reservations, vouch for the actual Nature paper!

  4. From the introduction of Graur's article:
    "We shall only deal with a single article (The ENCODE Project Consortium 2012) out of more than 30 that have been published since the 6 September 2012 release. We shall also refer to two commentaries, one written by a scientist and one written by a Science journalist (Pennisi 2012a,b), both trumpeting the death of “junk DNA.”"

    The whole point of Graur's critique of ENCODE revolves around the 80% claim made in the main ENCODE article (see also the first sentence of the abstract). I don't think you can argue that Graur's attacks failed unless you put them in their context.

  5. There is an error of omission in the GBE preprint. It should read: "We shall only deal with a single article (The ENCODE Project Consortium 2012) out of more than 30 that have been published since the 6 September 2012 release. We shall also refer to three commentaries, one written by a scientist (Ecker 2012) and two written by a Science journalist (Pennisi 2012a,b), all trumpeting the death of “junk DNA.”"
    Dan Graur
    Will be fixed in the final proof.
