Sunday, March 3, 2013

Why the natural sciences rely on affirming the consequent (and that's OK!)

Last weekend, I discussed a paper by Graur et al., which (among many other criticisms and insults) accused the ENCODE consortium of basing certain interpretations on a fallacy in deductive logic called "affirming the consequent." I pointed out how bizarre that criticism was, because "affirming the consequent" is actually a necessary and justified part of reasoning in the natural sciences.

Many readers seemed to be surprised by and skeptical of this claim, and some probably thought it proof of my insanity. I must, first and foremost, once again beg such skeptics to read Jaynes' outstanding book. The first few chapters are actually available as a free pdf, but the whole book is really worthwhile. If you're an academic, you can probably find the book in your library system.

Understanding, however, that the urging of an apparent madman may not be adequate motivation, I thought I'd try to explain a bit more why this is, actually, the case.

tl;dr "affirming the consequent" is an error in deductive reasoning. It's spurious to raise this as an objection to findings in the natural sciences, because such findings are generally not based on deductive arguments. In fact, substantially all inferences in the natural sciences result from "affirming the consequent." This is because the natural sciences rely on confirmation and induction, epistemological processes under which "affirming the consequent" is permissible in a certain well-defined sense. This sense is nicely modeled by Bayesian probability theory, which can be viewed as a generalization of deductive logic to enable extrapolation and reasoning under uncertainty. When we attempt to restate Graur et al.'s objection using the proper vocabulary, we explicate the actual source of disagreement and avoid the need to level accusations of irrationality.

This lengthy essay will mainly discuss logic and inference in science. This is a really loose tangent from the whole ENCODE controversy, relevant only insofar as Graur et al. decided to make a big, unnecessary stink about it. So, you may well decide you have better things to do than read this, let alone Jaynes. But if it's not perfectly obvious to you why the "affirming the consequent" criticism is bogus, it's my sincere belief that you will benefit your scientific career by exploring this topic further.

Before diving in, let me also acknowledge some informed pushback on the other major problem I called out in my previous post about Graur et al., regarding definitions of "function" and the role of neutrally-evolving sequence elements in the genome. I'm going to write another blog post on that topic, but it may take me another few weekends! (Update: blogs by GENCODE and Nicolas Le Novère, and ironically Ford Doolittle's wonderful ENCODE critique, cover some similar ideas.)

Why the alarm?

You're on vacation, and you get an automated text message informing you that the burglar alarm at your home has been triggered. Has there been a burglary at your house? Should you call the cops?

Suppose that a burglary logically implies triggering of the alarm: Burglary ⇒ Alarm.

The problem with turning this around and inferring Burglary based on Alarm is that there are other possible causes of Alarm. Here in the San Francisco Bay Area, for example, small earthquakes happen once or twice a year, and the resulting vibration also causes Alarm.

Given Alarm, it's logically erroneous to infer Burglary, because Earthquake could also explain Alarm. Calling the cops when you see Alarm would make you guilty of the fallacy of "affirming the consequent" by the laws of deductive reasoning. Why are we paying for this stupid alarm again!?

This example is actually pretty tricky because I set it in the Bay Area, where small earthquakes happen more or less as often as burglaries of an individual home. Suppose instead that home is inner-city Detroit. Crime is high in that area, and earthquakes are quite rare. How, then, does our analysis change?

Actually, it doesn't change at all. Earthquakes are rare, but small ones do happen from time to time. Given Alarm, we cannot absolutely rule out Earthquake. So inferring Burglary from Alarm is still absolutely fallacious - affirming the consequent. That inference is simply illegal under the laws of deductive reasoning.
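If you'd like to see the deductive verdict checked mechanically, here's a minimal sketch in Python. The helper names (implies, entails) are my own, made up for illustration. It enumerates every truth assignment for Burglary and Alarm, keeps the ones consistent with the premises, and asks whether the conclusion holds in all of them.

```python
from itertools import product

def implies(p, q):
    """Material implication: p -> q is false only when p is true and q is false."""
    return (not p) or q

def entails(premises, conclusion):
    """True if the conclusion holds in every (burglary, alarm) world that satisfies all premises."""
    worlds = [w for w in product([False, True], repeat=2)
              if all(premise(*w) for premise in premises)]
    return all(conclusion(*w) for w in worlds)

# Each statement is a function of the world (burglary, alarm).
burglary_implies_alarm = lambda b, a: implies(b, a)
alarm = lambda b, a: a
burglary = lambda b, a: b
no_alarm = lambda b, a: not a
no_burglary = lambda b, a: not b

# Affirming the consequent: from (Burglary -> Alarm) and Alarm, conclude Burglary.
print(entails([burglary_implies_alarm, alarm], burglary))        # False (invalid)
# Modus ponens: from (Burglary -> Alarm) and Burglary, conclude Alarm.
print(entails([burglary_implies_alarm, burglary], alarm))        # True (valid)
# Modus tollens: from (Burglary -> Alarm) and no Alarm, conclude no Burglary.
print(entails([burglary_implies_alarm, no_alarm], no_burglary))  # True (valid)
```

The first check fails because of the single world where the alarm sounds with no burglary. Deductive logic doesn't care how improbable that world is, only that it's possible.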

Of course, we have a gut feeling that deductive logic's answer is too rigid; there must be something more to the story. Indeed there is, and I'll get to that. But first, what does this have to do with science?

126 GeV? Pfffft.

In the natural sciences we attempt to learn general principles of the universe based on some observations we've collected. In particular, we seek to judge hypotheses and theories confirmed or falsified based on available data.

Occasionally we might find a datum that directly contradicts a hypothesis. That's rather useful, because then we can immediately declare that hypothesis falsified, and move on to others. There are difficulties with this in practice, though. For example, it's often trivial to surreptitiously fiddle with the original hypothesis just enough to resolve any contradiction. Also, when our data are obtained from elaborate instrumentation like particle colliders or microarrays, an apparent contradiction might well be due to an erroneous observation. These are actually crippling problems with "falsificationism" as an operational description of scientific inference, despite its enduring presence in the popular perception. (Not that falsifiability isn't an important, useful concept - it just ain't the whole story.)

A lot of other times, the data neither support nor flatly contradict a hypothesis. There's not much we can do then, although there are some journals for negative results, which can help save others' time and money.

What we most like to see, of course, is data that positively support our favored hypothesis - meaning that the hypothesis predicts or otherwise nicely explains the data. That's what gets papers published and grants funded! But, here's the rub: just as we can't logically infer Burglary from Alarm, no finite amount of data permits a deductive inference that a scientific hypothesis is generally true. No matter how many cases we've examined or what controls we've performed, there's always some possibility of an alternative explanation, or an instrumentation error, or a sampling error. So that $9B supercollider can never prove the existence of the Higgs boson. Similarly, that RNA-seq peak doesn't prove transcription, and transcription doesn't prove a "selected-effect" biological function; neither does conservation in a genome alignment, by the way.

Under the rules of deductive reasoning, we can never conclude a hypothesis is true unless we commit the fallacy of "affirming the consequent." Deductively speaking, progress in science implies "affirming the consequent." And clearly, science does in fact make progress. Ergo...

Don't just take my word for it

Now before you nitpick that argument - or call up MIT to demand the revocation of my PhD - take a look at the following quotes, all mined from credible sources on the interwebs:

Ha ha, I slipped a little Darwin in there on ya.

So, inference in the natural sciences seems to require that we commit the logical fallacy of "affirming the consequent." But as Darwin alluded to, and as I'll try to explain further below, this is nothing to worry about; in fact, it's just common sense!

One last thing: it certainly sounds pretty bad to be guilty of a logical fallacy, and in fact there are those who believe that "affirming the consequent" is a devastating criticism to level at scientists. Since I've just quoted Darwin, I'll bet you can guess who. That's right: creationists. Also, climate science deniers. Et cetera. This is a bread-and-butter argument for all of them.

Dear skeptical reader: is the horror beginning to set in?

About that burglar alarm...

Okay, if your house is in Detroit and your burglar alarm goes off, you'd better call the police. But this is logically fallacious - so how can we justify it?

Our common sense about this situation goes something like this: since Earthquakes happen so rarely, an Alarm is usually due to a Burglary.

One way to interpret how this works is to say that we're implicitly rounding the possibility of Earthquake down to nothing, so that we can pretend the alternative doesn't exist - thereby making the deductive inference of Burglary permissible. By rounding premises up to true, or down to false, we make the rules of deductive logic applicable.

But, doesn't it seem rather a shame to use the perfect laws of deductive logic on a mere approximation? Has it ever bothered you that such arguments never quite capture all the available information? What if it's not "often" or "rarely", but truly a toss-up, like in the Bay Area? Which way will we round then? Is it even possible to reason about that case in a rigorous way?

The answer is: yes, absolutely! Because there's a vastly richer interpretation of how common sense works in our example. In this interpretation, rather than rounding off the premises, we actually relax the laws of deductive logic to enable us to reason about degrees of belief in the two hypotheses. The system of reasoning we exercise by doing so is called Bayesian inference.

In our little example, applying Bayesian inference mainly entails dispensing with slippery words like "rarely" and "usually", and instead quantifying exactly how much more likely Burglary is than Earthquake as an explanation for Alarm in Detroit. In particular, Bayes' theorem provides a precise way to account for both the a priori plausibility of each hypothesis (Burglary and Earthquake) and also its power to explain the data (Alarm), in order to quantify the evidence favoring one hypothesis over the other. Having done so, we might conclude that we believe Burglary to be a better explanation than Earthquake for the Alarm by, perhaps, 100-fold. It's on this basis - that we believe there's only a very small, yet quantifiable risk of being wrong - that we can justify calling the cops.
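To make that concrete, here's a rough sketch of the calculation in Python. All the numbers are hypothetical, picked only so that the Detroit case comes out at the 100-fold figure mentioned above. I'm also assuming that Burglary and Earthquake are the only two explanations on the table and that either one sets off the alarm with certainty.

```python
from fractions import Fraction

def posterior_odds(prior_h1, prior_h2, lik_h1, lik_h2):
    """Posterior odds of H1 over H2 given the data: prior ratio times likelihood ratio."""
    return (prior_h1 * lik_h1) / (prior_h2 * lik_h2)

def posterior_prob(odds):
    """Probability of H1 when H1 and H2 are the only explanations considered."""
    return odds / (1 + odds)

# Assumption: both causes trigger the alarm with certainty.
lik_burglary = Fraction(1)
lik_earthquake = Fraction(1)

# "Detroit" (made-up priors): burglary 100x more probable a priori than an earthquake.
detroit = posterior_odds(Fraction(1, 1000), Fraction(1, 100000),
                         lik_burglary, lik_earthquake)

# "Bay Area" (made-up priors): the two are equally probable a priori.
bay_area = posterior_odds(Fraction(1, 1000), Fraction(1, 1000),
                          lik_burglary, lik_earthquake)

print(detroit, float(posterior_prob(detroit)))    # odds of 100, probability about 0.99
print(bay_area, float(posterior_prob(bay_area)))  # odds of 1, probability 0.5
```

Because both likelihoods are equal here, the priors do all the work. With a less reliable alarm, or an earthquake too small to trip it every time, the likelihood ratio would shift the odds as well.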

There's tremendous depth I can't cover in this already-too-long essay, but here's the key insight: Bayesian inference is a generalization of deductive logic, able to account precisely for any degree of uncertainty. Both are sound mathematical theories of reasoning, which are related in an elegant way: when you plug zeroes and ones into the probability rules of Bayesian inference, some terms drop out and you get exactly the laws of deductive reasoning! In this strictly more powerful form of logic, some inferences that are absolutely illegal under the deductive rules become permissible in a certain well-defined sense. Namely, while Bayesian inference cannot prove any hypothesis that can't also be proven by deductive logic, unlike the latter it justifies increasing our degree of belief in a plausible hypothesis that explains the evidence at hand.
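Here's a small sketch of what "plugging in zeroes and ones" looks like, again in Python with made-up numbers. The function name and the particular prior are my own choices for illustration. The premise "Burglary implies Alarm" is encoded as P(Alarm|Burglary) = 1.

```python
from fractions import Fraction

def update(prior_b, p_a_given_b, p_a_given_not_b):
    """Return (P(B|A), P(B|not A)) via Bayes' theorem; None where the condition has probability 0."""
    p_a = p_a_given_b * prior_b + p_a_given_not_b * (1 - prior_b)
    p_b_given_a = p_a_given_b * prior_b / p_a if p_a > 0 else None
    p_b_given_not_a = (1 - p_a_given_b) * prior_b / (1 - p_a) if p_a < 1 else None
    return p_b_given_a, p_b_given_not_a

prior = Fraction(1, 100)  # hypothetical prior P(Burglary)

# The alarm also has other causes: P(Alarm | no Burglary) = 1/50 (made-up value).
given_alarm, given_no_alarm = update(prior, Fraction(1), Fraction(1, 50))
print(given_no_alarm)        # 0: modus tollens is recovered exactly
print(given_alarm > prior)   # True: Alarm raises belief in Burglary without proving it

# If nothing else could ever cause the alarm, the inference becomes certain.
certain, _ = update(prior, Fraction(1), Fraction(0))
print(certain)               # 1
```

The middle case is the interesting one. Observing Alarm makes Burglary more plausible than it was before, which is exactly the "well-defined sense" in which affirming the consequent is permissible, but the probability stays short of 1 as long as any alternative cause remains.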

As our burglar alarm example illustrates, this approach to inference often captures "common sense" much more accurately than strict deductive logic - for rarely in everyday life do we stop to think about utter certainties. There are also other interpretations of "common sense" as it applies to our example, which are more precise than fudging the premises, but don't claim to be sound mathematical theories. Some terms you may have heard of include abductive reasoning, inference to the best explanation, Occam's razor, maximum a posteriori estimation, etc. These are arguably the most accurate explanations of "common sense" per se, since few of us actually go around calculating conditional probabilities in our heads. But Spock and Data are surely Bayesians!

Bayesian inference as a model of scientific reasoning

The Bayesian interpretation is a very powerful way to understand how scientific inference actually works - which is largely by common sense, supplemented with some best practices. Briefly, we evaluate our belief in a hypothesis based on our informed prior beliefs about its plausibility, on the one hand, and its ability to explain the available data, on the other. The data can never prove the hypothesis - rather, they may provide evidence in its favor. And while a hypothesis may thereby become preferred over any known alternatives, our degree of belief in it is always subject to future revision if we happen to come upon new information, or conceive of a new alternative. To give credit to falsificationism where due, it's exactly on such occasions that science makes the most progress.
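As a toy illustration of that last point, here's a sketch in Python of a degree of belief being revised as data come in. The starting belief and the likelihood ratios are invented for the example, and real scientific evidence is of course far messier than a string of identical observations.

```python
from fractions import Fraction

def update_belief(prior, likelihood_ratio):
    """One Bayesian update, where likelihood_ratio = P(datum | H) / P(datum | not H)."""
    odds = prior / (1 - prior) * likelihood_ratio
    return odds / (1 + odds)

belief = Fraction(1, 2)   # hypothetical starting degree of belief in hypothesis H
history = [belief]
for _ in range(20):       # twenty observations, each 3x more probable under H (made-up ratio)
    belief = update_belief(belief, Fraction(3))
    history.append(belief)

print(float(history[-1]))  # very close to 1 ...
print(history[-1] < 1)     # True: ... but never exactly 1

# One surprising datum, vastly more probable if H is false, forces a revision.
revised = update_belief(belief, Fraction(1, 10**12))
print(float(revised))      # belief in H collapses to well under 1%
```

Exact fractions are used so that the "never exactly 1" check isn't an artifact of floating-point rounding.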

Because Bayesian reasoning is closely aligned with common sense, most scientists write very well-reasoned papers without use of Bayesian statistical methods, without reference to Bayesian terminology, and even without Bayesian concepts in mind at all. Even if the author does happen to be an ardent Bayesian, style and space considerations usually do not permit formulating every sentence in a manuscript with the precise incantations of that system of reasoning. Occasionally, a word like "shows," "proves," or "demonstrates" will appear without a legalese qualifier like "beyond a reasonable doubt." Each such occurrence incurs a small bit of poetic license - and unambiguously commits the logical fallacy of "affirming the consequent." Usually, there's nothing terribly wrong with this.

To conclude this long exploration into logic and inference, I should acknowledge that Bayesian inference is certainly not all there is to scientific reasoning, but rather an excellent model of some of its main aspects. There's a lot of other creative stuff going on in the rational mind of a scientist, in the conceiving of hypotheses, the setup of experiments, the design of statistical models, and so on. Much of this has proven difficult to formalize so far, and some may never be. But insofar as it provides a sound, coherent solution to the key problems of confirmation and induction, it's not unfair to think of Bayesian inference as the essence of the scientific method.

Is that all Graur et al. really meant?

Let's (finally!) come back to the original motivation for this whole philosophical adventure. Graur et al. accused ENCODE of committing the logical fallacy of "affirming the consequent" by claiming certain genomic regions have biochemical functions based on data indicating specific biochemical signatures, such as transcription, transcription factor (TF) binding, chromatin modifications, etc. In the case of TF binding, they point out that it's logically erroneous to infer a function in regulating transcription, since TFs might bind without such effect. As we've discussed, this criticism is actually quite correct under the deductive rules called for by the terminology they chose, "affirming the consequent." The trouble is that if you accept it on those grounds, you also logically commit to invalidate substantially all inferences ever made in the natural sciences.

Have I just been way too anal this whole time? Maybe Graur et al. didn't intend for us to interpret their "affirming the consequent" criticism literally in terms of deductive logic, despite that term being completely specific to that theory. Perhaps all they really meant was that, given the evidence and their informed prior beliefs and definitions, there are alternatives at least as probable as the interpretations advanced by ENCODE. In this line of argument, ENCODE can't necessarily be accused of irrationality per se, but possibly of using a poor definition, or neglecting to account for alternative interpretations that others find highly plausible. Other commentators have raised such criticisms of the ENCODE paper, and frankly I think there is merit in them. In contrast, Graur et al. unmistakably chose to go further than this, and made the claim that even if you willingly grant ENCODE's definitions and premises, the conclusions are still wrong, because we committed the logical fallacy of "affirming the consequent."

Deductive reasoning is a rigorous mathematical theory. If you're going to attack others explicitly in the terms of that theory, your argument had better be defensible on those same grounds. In my opinion, anyone with a modern understanding of scientific inference just shouldn't level the criticism of "affirming the consequent" in those terms, because that criticism is itself definitely bogus. It seems valid, if you don't think about it too carefully - which is why it's popular among creationists and other science-deniers - but it logically commits you to absurd consequents. And by the way, Graur et al. don't just mention this in passing: it's fully explicated in a dedicated little section of the manuscript. Ouch - but at least it's not in the title.

Had they been expressed using the proper vocabulary as I sketched above, their objections to ENCODE's interpretations could have been stated formally, precisely, and coherently. Moreover, the actual source of disagreement - definitions, alternative hypotheses, and degrees of informed prior belief in the plausibility thereof - would have been obvious. Accusing ENCODE of a black-and-white logical error was a backfiring distraction from a more productive discussion about those stubbornly grey aspects.

Final words

I've gotta be perfectly honest - it took me about 10 years to clearly understand the principles underlying this essay. Probability was always a little bit mysterious to me as an undergraduate; I could do the calculations okay, but the meaning of it was elusive. I also took a few courses in philosophy of science, but there was limited rigor and quantification at that level. And once I got into research, the statistical methods prevalent in my field didn't exactly help.

I'm sure many others "get it" much more quickly, but Jaynes' book was the key for me personally, and that's why I've been harping on it so much. The interpretation of probability as a generalized form of deductive logic is not an ancient or widely-appreciated insight, by the way; Jaynes was one of the first to recognize its importance, in the mid-20th century. Some debate continues to this day over a few of the strongest claims advanced in his writing, such as probability theory being the only coherent system of reasoning based on continuous measures of uncertainty. So while the book should not necessarily be taken as gospel in every detail, its explanations of Bayesian ways of thinking are unmatched in my experience.

18 comments:

  1. Maybe this formulation of Graur's argument, in Bayesian/likelihood terms, will satisfy you:

    Definitions:
    E=1 expression of DNA sequence observed
    E=0 expression of DNA sequence not observed

    F=1 DNA is functional
    F=0 DNA is not functional


    ENCODE says that:

    P(E=1 | F=1) = ~1

    This is a statement that the likelihood of the observation (E=1) is high on the hypothesis that F=1. Which I would agree with.

    They observe E=1, and thus infer F=1 (for many many sequences).

    The problem is that they fail to assess:

    P(E=1 | F=0)

    On the junk DNA hypothesis of promiscuous/sloppy transcription, i.e. "junk RNA",

    P(E=1 | F=0) = ~1

    This is a statement that the likelihood of the observation (E=1) is also high on the hypothesis that F=0.

    So, here, the likelihoods don't allow us to distinguish the hypotheses. We could distinguish the hypotheses based on the priors, but that won't resolve anything to anyone's satisfaction. But we can compare the likelihoods of lots of other data on the two. E.g.:

    Likelihood of a neutrally evolving sequence under F=1 = ~0 (at least as a first approximation, for most sequences)

    Likelihood of a neutrally evolving sequence under F=0 = ~1

    There are several other lines of evidence like this (e.g. genome size) that the "80% functional" ENCODE folks (which is not all ENCODE folks) completely ignored.

    Replies
    1. Hi,

      Thanks for the comment. Where I disagree with your analysis is in choosing to disregard prior information, because the definition of "functional" is part of that. As Sean Eddy crystallized, ENCODE's definition of "functional" includes junk. So ENCODE's definition entails that P(E=1|F=0) is essentially zero, a priori.

      Which comes back to my point that the actual source of disagreement is the definitions and prior beliefs, which is a productive area of debate/discussion, while it was a spurious distraction to accuse ENCODE of a logical error. Please note the above post is highly focused toward that point (because epistemology happens to be a personal interest of mine) and does not attempt to defend ENCODE's definition.

      Mike

  2. This comment is also relevant to Mike's reply to Nick (this is also something I've shared with Nick before posting this -- we are on the same floor).

    In fact, the whole debate centers very strongly on a refusal to define P(E = 1|F = 0) = 0 a priori. This is, to my mind, tendentious, question begging, self serving, and in a very real sense bordering on disingenuous. In other words, it is at the very heart of why we're even having this conversation to begin with. So, to facilely wave your hand to dismiss this is to misunderstand (or worse, to try to subvert discussion of) this very real and relevant disagreement.

    I'm going to simplify here by focusing only on expression and not other assays. I'll simplify further and charitably allow that the probability of expression given function is large.

    F is for function, E is Expression. (In fact, E can stand in for any "specific biochemical activity" assayed by ENCODE, I only picked expression as a short-hand because transcription is by far the assay that tags the largest proportion of the genome compared to other assays.)

    We can write down a simple version of Bayes' Theorem for reference later:
    P(F|E) = P(E|F)*[P(F)/P(E)]

    Here is a simple version of Affirming the Consequent:
    1) If F then E†
    2) E
    3) Therefore F

    To put affirming the consequent into terms that are analogous to Bayesian terms, let's consider the following, which I take to be closely related to Graur's point on affirming the consequent:

    1) P(E|F) is large, ie P(E|F) >> 0, perhaps approaching 1†
    2-3) Therefore P(F|E) is large, ie P(F|E) >> 0, perhaps approaching 1

    Both the strict "Affirming the consequent" and its Bayesian analog are preposterous statements to make. First of all, it is a formal logical fallacy for ENCODE to apply a standard version of "Affirming the Consequent". Mike sidesteps that issue by saying in effect "Scientists aren't really using that kind of logic, but are instead using Bayesian thinking". Well, even then, Mike's framing of the ENCODE position is either simply wrong or a case of self-serving question-begging.

    Firstly, in our simplified world, ENCODE is merely measuring P(E). If ENCODE is defining P(F) = P(E), this is clearly question begging. The main objection that many of us had with the ENCODE summaries is that we don't accept that the biochemical activities that ENCODE assayed are good proxies for function. ENCODE may disagree with this, but simply saying "Our assays ARE good proxies for function!" is simply question begging, full stop, and doesn't address our fundamental objections. It should (but probably doesn't) go without saying that simply redefining "functional" by fiat to be what ENCODE assayed is a non-starter.

    Next, to address the issue of Bayesian thinking, I'll be charitable and allow that ENCODE is not simply defining P(F) = P(E) (i.e. question begging). If this is the case, then P(E) (which is what ENCODE measures) is unequivocally incomplete regarding P(F|E) even if we accept that P(E|F) is close to 1. The prior probability P(F) MUST figure prominently in any discussion.

    Given that many of us question whether ENCODE's assays are direct proxies of function in the first place, then to my mind ENCODE must commit to one or more of the following:

    1) they were wrong in their broad characterization of the results of the ENCODE project. ENCODE is only one preliminary part of a larger picture that will also require more direct measurements of functionality that are entirely outside of the scope of the ENCODE project;
    2) they have chosen a tendentious definition of function that they could have predicted a priori would strike a large part of the community as self-serving sophistry in service of PR. They did it anyway, and owe us and the funders an explanation of why;
    3) our definition of functional needs overturning.

    † This is a simplification that I trust is either neutral or charitable towards the perspective that Mike is defending

    Replies
    1. Hi, JJ. Thanks for posting. I don't think there's actually a lot of daylight between us. This post (and my last) are mainly pointing out problems with Graur et al.'s arguments as written, and not undertaking strident defenses of ENCODE. You can go back to my last post and you'll find that I repeatedly expressed mixed feelings about ENCODE's own claims. In this post, I explain why one of Graur et al.'s arguments is spurious, attempt to rephrase it in a non-spurious way, and then conspicuously do not dispute that rephrased version.

      To some specific points from your comment:

      When I pointed out to Nick that ENCODE's definition of function entails P(E=1|F=0) is zero, I did not try to defend that definition. On the contrary, I wrote that that definition is the true source of disagreement and described discussion/debate about that as "productive" - unlike the argument of Graur et al. I'm discussing in this post. So I fully agree with you that there's a "very real and relevant disagreement" over definitions. I also fully agree with you that the prior P(F) should figure importantly in interpretations of ENCODE's results. These are just not the topics I'm addressing in this blog post, which I wrote mainly because epistemology and philosophy of science are topics that personally interest me (and which I feel most actual scientists are not well-acquainted with).

      I did not 'sidestep' the issue of affirming the consequent. On the contrary, I pointed out that it's something we actually do all the time in the natural sciences; ENCODE commits that fallacy, just like everyone else. But it's spurious to raise this as an objection against findings in our field, for reasons I explained rather exhaustively, and as illustrated by the fact that this objection is frequently advanced by creationists and other science-deniers. So when you write 'it is a formal logical fallacy for ENCODE to apply a standard version of "Affirming the Consequent",' it leads me to believe that I have not succeeded in convincing you that this is a spurious accusation. What can I clarify further?

      To your three saving constructions for ENCODE's position, if you're expecting me to dispute them then you will hear only crickets! :)

  3. One further reply after re-reading your comments, guys: Graur et al. accused ENCODE of a logical error. That is a severe accusation with a specific meaning - it means that even if you grant all of ENCODE's starting definitions, premises, and beliefs, the conclusion is still wrong because we were irrational, or worse. This blog post explains why that specific accusation made by Graur et al. is spurious - that's all. I'm not sure if you realized that, because your comments seem to get mixed up in "ENCODE's definition sucks" arguments. That is largely irrelevant to what I've written here.

    The topic of "affirming the consequent" only came up because Graur et al. stepped into it - badly. But I think this post is worth reading (if it's not all obvious to you) because a better understanding of scientific inference will make you a better scientist.

    Maybe some of your comments are really directed at my previous post. But even there, I'm not the imaginary stalwart ENCODE-defender you seem itching to pick a fight with. It mainly points out problems with the Graur et al. manuscript, including this one.

  4. "I'm not sure if you realized that, because your comments seem to get mixed up in "ENCODE's definition sucks" arguments."

    They don't get mixed up. When talking with people about this, the reply "well, we've defined it this way, therefore we're immune to criticism" is so common that I feel it necessary to head it off at the pass in order to get down to discussing what assaying biochemical activity really tells us about how a certain social mammal works. I think ENCODE is essential and useful, but is clearly inadequate by itself.

    In short, I'm replying to your criticism of Graur's reasoning because this argument of Dan's gets to the heart of what many of us object to. Given what we know about biology, any version of the argument "Functional elements are often expressed. We observe expression, which adds to our confidence that they are functional" is either wrong or misleading. I can't see how you look at the terms of Bayes' theorem, acknowledge the importance of the prior (not to mention the likelihood, which we give you for free!) and say that ENCODE didn't make either a logical error or a sneaky switcheroo with the definition. And indeed, I don't think the blame for that problem goes to the 400+ authors, and certainly not to you. Had I been involved in ENCODE, I would have remained an author even though part of the message would have been something I disagreed with. That happens in collaborations, especially big ones. And that's a good thing. Collaboration and dissent are both healthy. It is nice that they can coexist, so I don't criticize any author of ENCODE in particular.

    But you are here picking apart what I think is one of Dan's reasonable criticisms (namely that observing "specific biochemical activity" tells us very little about functionality) based on what appear to me to be grounds that miss the general point. I'm not itching to pick a fight. I am, however, itching to see people who don't accept criticisms like Graur's defend their perspectives, in the hopes I might learn something. I'm unsatisfied by your foray into that arena.

    When you say that " 'affirming the consequent' is permissible in a certain well-defined sense" and give an example about burglar alarms, you are strongly implying that we know about the right hand side of Bayes' theorem (either the Likelihood is very informative in favor of one outcome or we have a strong prior). In other words, we aren't reasoning in a vacuum. The whole point about ENCODE is that we know very little. In effect, we are reasoning in something that is very close to a vacuum if the justification for the funding is to be believed. And, what little information we do have militates against the sort of conclusions that ENCODE drew (arguments from classical genetics, mouse deletions, genome size variation, and evolutionary constraint, for just a few). So to get back to your burglar alarm analogy, yes, when we know the prior (burglaries are common) and the likelihood discriminates between the two possibilities fairly well (the door must open or the glass must break before the alarm sounds), sure we can say "Burglary therefore alarm. Alarm, therefore (probably) burglary." Now instead of a well-tuned house alarm, hook your smart phone up to your run-of-the-mill stock cheap-o car alarm and tell me how willing you'd be to engage in that same line of reasoning. Or hook your smart phone up to Device X, which has a function you don't understand. The argument that "affirming the consequent" is valid in the natural sciences breaks down because, to paraphrase the complement of your sentiment, "affirming the consequent is impermissible outside a certain well-defined sense". Fine, I understand your well-defined sense. I don't think ENCODE was working in that domain, and as a result, I think the thrust of Dan Graur's criticism, as stinging as it is, remains basically valid.

    ReplyDelete
    Replies
    1. Graur et al. went well out of their way to state their criticism explicitly in terms of the propositional calculus, which does not admit interpretation as to whether or not its "thrust" is "basically valid." They could easily have written it in some other, more flexible way, as we've both suggested, and as many others have; had they done so, this blog post would not exist. I have the impression that you may now see why their argument is spurious strictly as-written, but that fact simply isn't interesting to you - you're happy to reinterpret it in a more flexible sense that also aligns with your own problems with ENCODE. No objection from me.

      Other readers, however, may take away from that manuscript an impression that ENCODE not only used a lousy definition of 'functional', but also that we've been proven irrational (or worse). This blog post tries to correct that, which isn't easy for the same reason that the "affirming the consequent" argument is so popular amongst science-deniers - it looks pretty strong at first glance and you really have to think rather carefully about subtle concepts in epistemology to see the problem with it. While I'm sorry you didn't find my blog post interesting, I feel one reason for that is that you continue to mistake my identification of problems in Graur et al.'s argument for an attempt to mount a muscular defense of certain claims made by some ENCODE participants.

      Delete
    2. JJ,

      I think I may have figured out why we seem to be talking past each other.

      In my original post, I expressed my surprise to see Graur et al.'s use of the term "affirming the consequent" in criticizing ENCODE's findings, and mentioned Bayesianism as the basis of the scientific method. I mentioned the latter because considering scientific inference in Bayesian terms is a particularly helpful lens through which to see why Graur et al.'s criticism is erroneous (as written in the terms they chose). That, and only that, is the topic of the current post.

      You may have been under the impression that my mentioning Bayesianism was a prelude to some larger defense of ENCODE's inferences of 'biochemical function', in terms of conditional probabilities. This was never the case - no such argument is present in what I've written, and TBH I don't think it's between the lines either. Indeed, I think that getting into that would confuse the issue further (much like the "affirming the consequent" argument) when the real problem is a profound disagreement over the basic terms involved. 

      Does this help clarify why a lot of what you're saying seems off-topic to me and vice versa?

      Delete
    3. Sorry for the initial silence. This is your blog and I thought it appropriate that I allow you the last word.

      But since you asked, I don't think your characterization is far from my reading of you. I can see how this might seem tangential or largely beside the point to you. I evaluate it differently and actually see this as an important issue rather than a tangent. To my mind, this difference in how we evaluate relevance to the larger issue almost entirely explains why it appears that we're talking past each other.

      It would have been nice for Dan to have couched his "affirming the consequent" argument in a way that didn't invite the type of rebuttal that you've offered. And your acknowledging that you wouldn't have objected had he proffered a more precise evaluation of ENCODE's use of biochemical activity assays certainly does make it seem that we're not that far apart, though resolving the remaining issues seems like something best done over beer rather than in blog post comment threads.

      Delete
  5. Seeing as J.J.E. and others badly want to determine whether you are on the side of the angels or the devils, would you mind briefly giving an upper limit for the fraction of the genome you would be willing to assume contributes to an organism's fitness? You can simply indicate general agreement with Graur's number of ~10%, or you can give your own number with your own rationale.

    In my opinion Graur's number is a bit generous, since silent sites aren't quite as functional as non-degenerate sites (i.e. I would say something more like 6%). Even that 6% may be a bit generous. It's been known for some time now that the human mutation rate is quite high (~75 SNPs per genome per generation, plus a large number of INDELs and various other mutations, though the exact number doesn't readily come to mind), and Haldane showed that the average fitness of a population is approximately e^-U, where U is the number of deleterious mutations per genome per generation. Importantly, this formula does not depend on the fitness effects of individual mutations, which can have any distribution you like. Because of this, only one or two mutations per generation can have non-zero fitness effects (assuming that historically roughly 2/3 of individuals die due to selection). This means that at most 1/75 nucleotides have a selective function based on their identity (we haven't ruled out that their mere existence is in some sense functional, since I don't recall how many new INDELs we get per generation). This leads me to believe that at most ~2% of the genome has a sequence-dependent function, and that the vast majority of sequence which shows 'biochemical activity' in ENCODE's data set has no sequence-dependent biological function. We can inch towards 6% with epistasis, but there is simply no way for a biological system to maintain itself if it suffers some 60 (80% of 75) deleterious mutations per generation. You just can't make that work.
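
The arithmetic in this paragraph can be sketched as follows. This is a rough illustration only: it takes the e^-U approximation and the figures quoted in the comment (~75 new mutations per generation, roughly 2/3 of individuals lost to selection) at face value, and it adds the simplifying assumptions that mutations land uniformly across the genome and that every mutation at a functional site is deleterious. The function names are hypothetical.

```python
import math


def mean_fitness(u):
    """Mean population fitness under the e^-U approximation."""
    return math.exp(-u)


def max_deleterious_rate(surviving_fraction):
    """Largest U for which mean fitness stays at or above the given fraction."""
    return -math.log(surviving_fraction)


# Figures quoted in the comment above.
MUTATIONS_PER_GENERATION = 75
SURVIVING_FRACTION = 1.0 / 3.0  # roughly 2/3 of individuals die due to selection

# Tolerable deleterious mutations per genome per generation: ln(3), about 1.1.
u_max = max_deleterious_rate(SURVIVING_FRACTION)

# Assumption: uniform mutation, every hit on a functional site is deleterious,
# so the functional fraction is bounded by u_max / 75.
max_functional_fraction = u_max / MUTATIONS_PER_GENERATION

# Mean fitness if 80% of the 75 new mutations were deleterious (U = 60).
fitness_at_80_percent = mean_fitness(0.8 * MUTATIONS_PER_GENERATION)

print(f"U_max = {u_max:.2f}")
print(f"max functional fraction = {max_functional_fraction:.1%}")
print(f"mean fitness at U = 60: {fitness_at_80_percent:.1e}")
```

Under these assumptions the bound comes out between 1% and 2% of the genome, in line with the "at most ~2%" figure in the comment, while U = 60 gives a mean fitness that is indistinguishable from zero.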

    So, in other words, is it fair to say that the human genome is mostly junk DNA? Why or why not? It isn't important to your post (which I really appreciate BTW), but it would help me understand your views.

    ReplyDelete
    Replies
    1. For the record, while I'd be happy for Mike to "pick a side" (he should join us angels, natch), I'm most interested in seeing how this function dispute works itself out. And I'd hope that the fact that we in the genomics community are even having this conversation at all does trickle down to the funding agencies and the lay audience, though this is somewhat less likely.

      Delete
    2. On the % question I'm going to refer again to the blog posts by Ewan Birney and Max Libbrecht which reflect my views very well, but I know that won't make anyone happy so I will volunteer some more color:

      - I'm pretty sure that at least 5.5% is involved in core mammalian biology
      - I believe that transposons, repeats, and intronic sequences are mostly non-functional
      - I believe mutational load and the onion test are strong arguments
      - I can point you to quite a few synonymous sites that do really interesting things!
      - I started out in the gene finding business but I frankly failed to see the scope and importance of lncRNAs coming. I take admonishment from that experience, not to be complacent about my knowledge of the genome
      - Lastly, I think we evolutionary genomicists are too easily inclined to leap from 'neutrally evolving' to 'non-functional'. That's served us pretty well as a first approximation since the HGP, but we're past that now. We need to remember how little is understood about the extent to which visible phenotypic traits (as well as developmental & cellular complexity) evolve by entirely neutral processes. We need to keep in mind controversial concepts like spandrels and constructive neutral evolution when we use neutrality as a "null" hypothesis. We also need to remember that reproductive fitness is a peculiar objective function - only partially aligned with our day-to-day concerns about our own physiology and health. A lot of this is probably under pleiotropic control of loci under selection for other reasons, but it's also plausible there's a significant role for neutrally-evolving genomic loci in all of that. I claim no ability to quantify this role - that's an x-factor which greatly increases the posterior variance of my beliefs on the % question. And, I will add, that needle was moved considerably by my experiences in ENCODE and the results it produced - showing a whole genome alive with activity, much of it probably neutral over evolutionary time which is NOT the same as inconsequential today!

      As I alluded to above, I'm going to expand the last point into another long blog post (probably not as long as this one...)

      Delete
    3. Neither Ewan Birney nor Max Libbrecht gives an estimate of the MAXIMUM fraction of the genome which could have meaningful function.

      I want to know the minimum amount of the genome that can reasonably be considered junk.

      How would you respond to the specific argument that I made? I think we can put a strong upper limit on the fraction of the genome that can be functional, since a) humans are not currently extinct, and b) any species with a deleterious mutation rate in excess of about four deleterious mutations per genome per generation must go extinct. If you want to say (as Ewan does) that 20% of the genome is functional in some traditional sense, you are already beyond the maximum functional genome size (I usually call it the effective genome size) allowable by our high mutation rate, and we should all be dead.

      If 20%, 30%, or 40% of the genome is functional, why are we not dead?
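
As a rough worked version of this argument, assume (as the thread does) about 75 new mutations per genome per generation and mean fitness of approximately e^{-U}, and additionally assume that mutations fall uniformly across the genome and that every mutation in a functional site is deleterious. Writing f for the functional fraction (a symbol introduced here for illustration):

```latex
U = 75 f, \qquad \bar{w} \approx e^{-U}

% Threshold of about four deleterious mutations per generation:
U \le 4 \;\Rightarrow\; f \le \tfrac{4}{75} \approx 5.3\%, \qquad e^{-4} \approx 0.018

% If 20% of the genome were functional in this sense:
f = 0.20 \;\Rightarrow\; U = 15, \qquad \bar{w} \approx e^{-15} \approx 3 \times 10^{-7}
```

The bound only applies to function in the sense of sequence-dependent fitness effects; it says nothing about loci that are active but selectively neutral.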

      Delete
    4. Matt,
      If we're talking % with fitness effect, I agree mutational load is a strong argument. The point I was making is that we ought to be open to looking beyond the assumption that 'functional' loci necessarily affect reproductive fitness. That's a fine first approximation that's served us well so far, but the current state of evolutionary theory doesn't justify complete adherence to it. We need to be cognizant that even macroscopic phenotypes can evolve neutrally (or at least that there is serious debate on this point), and we need to learn more about the extent to which those are controlled by neutral genomic loci.
      This is why I don't have strong beliefs about an upper bound on % that is 'functional', except to say that I do think transposons, repeats, etc. are mostly non-functional, given what we know about how they proliferate.

      Delete