Are replication efforts pointless?

A couple of people have tweeted/blogged (EDIT: additional posts from Neuroskeptic, Drugmonkey, Jan Moren, Chris Said, Micah Allen EDIT 2: more, by Sanjay Srivastava, Pete Etchells, Neuropolarbear EDIT 3: more, by Sean Mackinnon) about a recent essay by Jason Mitchell, Professor of Psychology at Harvard, titled On the emptiness of failed replications. Its primary thesis is that efforts to replicate experimental results are pointless, “because unsuccessful experiments have no meaningful scientific value”. This is, of course, counter to the recent replication drive in social psychology – and to how I understand experimental science should be done (caveat: I am not an experimental scientist).

I disagree with the above quotation, and thought I would counter a couple of his arguments that stuck out to me as wrong or misleading:

…A replication attempt starts with good reasons to run an experiment: some theory predicts positive findings, and such findings have been reported in the literature, often more than once. Nevertheless, the experiment fails. In the normal course of science, the presumption would be that the researcher flubbed something important (perhaps something quite subtle) in carrying out the experiment, because that is far-and-away the most likely cause of a scientific failure.

In the case of a very well established result, the most likely cause of a failed experiment would certainly be experimental error. But for most hypotheses and theories this is surely not true. The likelihood of each possible cause of a negative result depends on two things: the prior likelihood of the hypothesis, and potentially unidentified variables. Consider homeopathy: any number of “positive” results are much better explained by experimental error, bad design or bad analysis than by the hypothesis that homeopathy is effective at curing X. (Indeed, later in the essay Mitchell acknowledges that spurious positive results can and do come about through bad statistical practices.) In much “frontier science” the likelihood of the theory is not well known (or even slightly known), and the unidentified variables can be many because the theory is incomplete. We’re getting into experimenter’s regress territory.
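To make the dependence on the prior concrete, here is a rough sketch using Bayes' rule. Everything in it is an assumption of mine for illustration: the function name, the priors, and the "power" figure, which is meant to fold in the chance that a fallible replicator botches the experiment. None of these numbers come from Mitchell's essay or from any real study.

```python
def p_effect_real_given_failure(prior, power, alpha=0.05):
    """P(effect is real | the replication found nothing), via Bayes' rule.

    prior: assumed probability that the effect is real before the replication
    power: assumed P(replication detects the effect | effect is real);
           experimenter error lowers this
    alpha: assumed P(replication "detects" the effect | effect is not real)
    """
    miss_if_real = prior * (1 - power)          # real effect, replication missed it
    null_if_absent = (1 - prior) * (1 - alpha)  # no effect, replication correctly null
    return miss_if_real / (miss_if_real + null_if_absent)


# Hypothetical priors: a well established result, a coin flip, a long shot.
for prior in (0.99, 0.5, 0.1):
    print(prior, round(p_effect_real_given_failure(prior, power=0.8), 3))
```

With these made-up inputs, a failed replication of a near-certain effect still leaves it very probably real (so the failure is most plausibly a miss or an error), while for the coin-flip and long-shot hypotheses the effect being absent becomes the more likely explanation. The "most likely cause" is a function of the prior, not a constant.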

… if a replication effort were to be capable of identifying empirically questionable results, it would have to employ flawless experimenters. Otherwise, how do we identify replications that fail simply because of undetected experimenter error?

Through adversarial collaboration. Alas, this happens infrequently, and apparently not in many current replication efforts. This is a legitimate criticism of the replication movement: collaboration is essential to avoid experimenter’s regress.

…And here is the rub: if the most likely explanation for a failed experiment is simply a mundane slip-up, and the replicators are themselves not immune to making such mistakes, then the replication efforts have no meaningful evidentiary value outside of the very local (and uninteresting) fact that Professor So-and-So’s lab was incapable of producing an effect.

*Why* they weren’t able to produce the result should be very interesting, but it can only really be investigated through collaboration, data sharing, etc.

There are three standard rejoinders to these points. The first is to argue that because the replicator is closely copying the method set out in an earlier experiment, the original description must in some way be insufficient or otherwise defective…

…there is more to being a successful experimenter than merely following what’s printed in a method section…Collecting meaningful neuroimaging data, for example, requires that participants remain near-motionless during scanning, and thus in my lab, we go through great lengths to encourage participants to keep still. We whine about how we will have spent a lot of money for nothing if they move, we plead with them not to sneeze or cough or wiggle their foot while in the scanner, and we deliver frequent pep talks and reminders throughout the session. These experimental events, and countless more like them, go unreported in our method section for the simple fact that they are part of the shared, tacit know-how of competent researchers in my field; we also fail to report that the experimenters wore clothes and refrained from smoking throughout the session…

But I can conceive of scenarios where all this pleading and pressure causes participants to move differently in the scanner than they would under other labs’ approaches to dealing with movement, or to perform differently on tasks because they are so distracted by trying not to move. Wearing clothes and not smoking indoors, by contrast, are common in many societies and so can safely go unreported. If the participants were naked, that should definitely be reported, as nakedness in front of strangers is often considered socially uncomfortable (perhaps the participants at Harvard have transcended cultural norms around nudity).

A second common rejoinder is to argue that if other professional scientists cannot reproduce an effect, then it is unlikely to be “real.”…

This is a slightly more seductive argument, but it, too, falls short. Many of the most robust and central phenomena in psychology started life as flimsy and capricious effects, their importance only emerging after researchers developed more powerful methods with which to study them.

I agree with this, but I would again suggest that if competent scientists are producing contradictory results, they should collaborate and run experiments together using protocols they both agree on.

A third rejoinder argues that the replication effort ought to be considered a counterweight to our publication bias in favor of positive results… if an effect has been reported twice, but hundreds of other studies have failed to obtain it, isn’t it important to publicize that fact?

No, it isn’t.


Although the notion that negative findings deserve equal treatment may hold intuitive appeal, the very foundation of science rests on a profound asymmetry between positive and negative claims. Suppose I assert the existence of some phenomenon, and you deny it; for example, I claim that some non-white swans exist, and you claim that none do (i.e., that no swans exist that are any color other than white). Whatever our a priori beliefs about the phenomenon, from an inductive standpoint, your negative claim (of nonexistence) is infinitely more tenuous than mine. A single positive example is sufficient to falsify the assertion that something does not exist; one colorful swan is all it takes to rule out the impossibility that swans come in more than one color. In contrast, negative examples can never establish the nonexistence of a phenomenon, because the next instance might always turn up a counterexample…Thus, negative findings—such as failed replications—cannot bear against positive evidence for a phenomenon…Positive scientific assertion cannot be reversed solely on the basis of null observations.

But most experiments do not give us a “positive” result in this sense. They tell us the probability of obtaining a result at least as extreme as the one observed if the data were generated by a null distribution; they do not tell us directly whether our hypothesis is true. “Positive” experimental studies cannot be reasoned about in the same way as this illustration of the limits of induction.
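A small sketch of why a statistically significant result is not a sighted black swan. It computes the probability that a hypothesis is true given a significant result, under assumed values for the prior, the power and the significance threshold. The function name and all the numbers are hypothetical and chosen only for illustration.

```python
def p_hypothesis_true_given_significant(prior, power=0.8, alpha=0.05):
    """P(hypothesis is true | study reported a significant result).

    prior: assumed probability that the hypothesis is true beforehand
    power: assumed P(significant result | hypothesis true)
    alpha: assumed P(significant result | null is true)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Same significance threshold, different (hypothetical) priors.
for prior in (0.5, 0.1, 0.01):
    print(prior, round(p_hypothesis_true_given_significant(prior), 3))
```

Under these assumptions the same "positive" result supports very different levels of belief depending on the prior, and for an implausible hypothesis a significant result is more likely to be a false positive than a true one. A single colourful swan settles the question; a single p < 0.05 does not.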

Replications are not futile, but they are perhaps being conducted sub-optimally (and certainly ruffling some feathers). Adversarial collaboration and data sharing would maximise the benefit of replication experiments.

Says the non-experimentalist.