Excess Success – Research Practices in Psychology

A new paper adds to the continuing discussion of research practices in psychology. The paper (citation below), in press at Psychonomic Bulletin & Review by Gregory Francis, analyzes the last several years of published papers in Psychological Science, a premier outlet in psychology, and in essence asks whether there is “too much success” in the reported studies.

The analysis uses the “test for excess significance” (TES) (Ioannidis & Trikalinos, 2007). The intuition is that if you run some number of experiments – N – measuring an effect of a certain size, then it is possible to compute how likely it is that all N experiments reject the null. So, if the odds of getting an effect are, say, one in three, given the power to find the effect, then the chance of getting two such effects is the product of these probabilities, or one in nine. If one finds more successful rejections of the null than one would expect, given the power to reject the null, this suggests that something is amiss. From the analysis alone one can’t say where the excess success comes from, only that there is a bias in favor of positive results. According to Francis, the cutoff value for the TES is 0.1. As he puts it: “A power probability below this criterion suggests that the experiment set should be considered biased, and the results and conclusions are treated with skepticism.”
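The core calculation can be illustrated in a few lines of Python. This is a sketch of the intuition only; Francis’s actual analysis estimates each experiment’s power from the reported statistics:

```python
# Sketch of the "excess success" intuition: if experiments are independent,
# the probability that all of them reject the null is the product of their
# individual powers. (Illustration only; the real TES estimates each power
# from the reported effect sizes and sample sizes.)

def prob_all_significant(powers):
    """Probability that every experiment in the set succeeds."""
    p = 1.0
    for power in powers:
        p *= power
    return p

# The example from the text: two experiments, each with a one-in-three
# chance of success, jointly succeed about one time in nine.
print(prob_all_significant([1/3, 1/3]))  # ≈ 0.111
```

If the observed run of successes is much longer than this product makes plausible, the TES flags the set as suspect.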

He ran the TES on published papers in Psychological Science between the years 2009 and 2012 (inclusive) that reported four studies or more – the minimum number for the TES analysis – and found that 82% of the 44 papers that met the inclusion criteria had TES values below the 0.1 cutoff, suggesting a substantial degree of “excess success” in the journal.

I’m confident that the paper will stimulate a great deal of discussion. My interest for the remainder of this post is in a possible pattern in the TES data. When I first read the paper, my eye was caught (slightly ironically) by the short title of one of the papers investigated, The Wolfpack Effect, by my friend Brian Scholl and colleagues, which I wrote a little post about around the time it came out. This paper was one of the eight that surpassed the .1 threshold.

I looked a bit more closely into some of the others that similarly had TES values above .1. The largest TES value, .426, was also in the area of perception, looking at how people can quickly assign stimuli to categories (e.g., “animal”). The next largest TES value, .348, was another perceptual study, having to do with the way that objects are represented in the visual system. Two other papers had to do with, first, another effect in vision – how the color of light affects visual processing of fear-inducing stimuli – and, second, an effect in audition.

So five of the eight successes, as indexed by TES, are from the field of perception. The other three were not, having to do with predictors of subjective well-being, reducing prejudice, and appreciation of others’ help. One paper in the area of perception – about visual rivalry – didn’t fare as well. Neither did a paper looking at the possibility that people see objects they want as being closer to them.

So perception didn’t run the table, but, still, without looking very closely at all the papers in question, it seemed to me that the low-to-medium level perception work distinguished itself in the analyses. (I might add that another paper I posted about didn’t do as well as the Scholl work.) The balance of the papers covered a fairly wide range of topics. To take just two to illustrate, one paper (TES = .041) presented six studies that purported to show that “[h]andling money (compared with handling paper) reduced distress over social exclusion and diminished the physical pain of immersion in hot water.” A second paper (TES = .036) purported to show that when “religious themes were made implicitly salient, people exercised greater self-control, which, in turn, augmented their ability to make decisions in a number of behavioral domains that are theoretically relevant to both major religions and humans’ evolutionary success.”

In any case, from the results that Francis reports, I don’t think any strong inferences can be drawn. To my eye, it looks like perceptual work does better than the other areas, but more systematic work will need to be done.

It seems to me that it’s worth knowing if some subfields score better in this respect because it speaks to the explanation for the problems. As Francis puts it: “Unless there is a flaw in the TES analysis…there are two broad explanations for how there could be such a high rate of apparent bias among the articles in Psychological Science: malfeasance or ignorance.” It doesn’t seem to me that there’s any reason to think that people in perception are any more ethical than people in other areas. If that’s true – though of course it might not be – then the place to look for the source of the problem is not in malfeasance.

Are there other candidate explanations? Could there be fewer opportunities for researcher degrees of freedom in perception? Could it have to do with the nature of theories in perception, compared to other areas?

I’m not really sure. But finding patterns across different areas of psychology might be useful for determining the sorts of best practices that would ameliorate these sorts of issues. My guess is that this paper will stimulate many profitable conversations.


Francis, G. (in press). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review.

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253.

06. March 2014 by kurzbanepblog
Categories: Blog | 9 comments

Comments (9)

  1. Interesting post. Unfortunately, I wasn’t able to find the paper, even in their section of articles published ahead of print. But, from your description, it sounds like this is another (important!) demonstration that, as a general rule, psychologists’ studies are severely underpowered. So a study with a small sample that happens to reach the magic threshold of significance is published; if it isn’t significant, it’s not published. The result is severe publication bias: “positive” results, especially those with large effects, are published, but “negative” results or those with small effects are not. This story should be old news for most of us by now, but the problem doesn’t seem to be going away.

    The solution doesn’t seem complicated to me. Reviewers and editors must insist that only adequately powered papers be published. If someone submits a study with a cool result based on a small sample but a large effect, the reviewers and editors should say “Great, we’ll publish this result as soon as you conduct an exact replication and demonstrate its reliability.” Yes, I realize that there are some kinds of datasets that are so difficult to obtain that low power cannot be helped. But this is rarely the case in psychology. Another way to think of it is that researchers shouldn’t undertake a study unless they will have sufficient statistical power (and a sufficiently interesting hypothesis) that both “negative” and “positive” results would be interpretable and likely publishable. Again, there are times when this won’t be possible, but I think it could be a good rule of thumb.

    Anyhow, let’s get back to your question, i.e., why might the TES values vary by sub-field? I think the answer is that in fields that are theoretically weak, a negative result is not interesting because no strong hypothesis can be considered refuted or weakened by a negative finding. So researchers develop a culture (probably without explicitly realizing it) of running many underpowered studies; they publish their occasional lucky hits (i.e., positive results) and immediately stick the negative results in the file drawer.

  2. Hi Rob. Thanks for highlighting the Wolfpack project and for asking these interesting questions! Here’s a rambling, inconclusive response…

    I have to admit that I find both the nature and (especially) the applications of the TES to be confusing, and I’m not sure how to think about it. But I also didn’t want to pass up a chance to comment on anything suggesting that perception research is outstanding.

    In fact I do think that perception work probably is “special” vis-a-vis this kind of discussion, for a few reasons (and beyond its special awesomeness in general, its perfect balance of psychophysical rigor and big-picture theorizing, its relative success at joining brain and behavior, its sheer cortical real estate that dwarfs any other similar process, its … sorry, let’s get back on topic).

    When assessing the likelihood of a “real” effect or phenomenon, both scientists (when doing the work) and journal readers (when learning about it) often have more evidence to weigh than just the reported analyses. They also have The Demo. I capitalize here just to underscore how important demonstrations are to work on perception (where here I’m using the term “perception” to refer only to conscious seeing, as opposed to possibly unconscious “visual processing”). The reason why many topics in perception have generated lots of attention and work — e.g. motion-induced blindness, illusory contours, the perception of animacy, the perception of faces where there are none, structure from motion, binocular rivalry, subjective time dilation, after-effects of all kinds, and on and on — isn’t because the papers reporting those phenomena have especially compelling statistics attached to them. Rather, it’s because “Wow: just look at that!”

    I think this very clearly has an influence on the conduct and evaluation of research in perception. Some of these influences are probably positive. For example, I wouldn’t be surprised to learn that there is just less outright fraud and also fewer published false positives in perception research. (Suppose you heard that Kanizsa had just blatantly made up the phenomenon of illusory contours in his papers, and in fact they’re not real after all. Nobody would believe that for a moment, because in effect the “experiment” is replicated every time anybody looks at one of the relevant figures!)

    With the wolfpack effect, for example, we knew we had a winner before we ran the first subject, because we were wowed by the demos. (Indeed, for our perception research, the actual scientific discoveries often feel tied to the demos, whereas the actual experiments can seem like mopping up — verifying, replicating, generalizing, quantifying….)

    The demo-infused nature of perception research might also be dangerous in this context, though. I bet that perception research is also home to many more never-reported “stillborn” experiments — i.e. pilot studies that are never fully run, and could never really be reported. Indeed, our lab often has stillborn “experiments” involving a total of zero subjects: we have a great idea for a study, code it, excitedly fire it up for the first time … and then sigh, because we don’t see any effect at all. And then we move on. Of course there’s really no feasible way to report such “studies”, nor, I hope, would anyone advocate that.

    I’m uncertain about how to apply these thoughts to the TES data, though. On one hand, perception researchers tend to focus on awesome effects, and part of being awesome is being huge and robust. As a result, many perception experiments report hugely significant effects, and are probably more likely to be overpowered compared to some other research areas. That should reduce the likelihood of “excess significance” just due to chance, at least. On the other hand, I can also imagine that many null effects might be less likely to be reported in perception research (and so increase excessive significance in published papers), simply because they’re ruled out at the “demo” stage, before an actual experiment is conducted.

    Two other comments on the TES paper/data:

    1. If I were going to run this sort of test, “Psychological Science” is pretty much the last journal I’d pick. Sure, it’s great (because it’s home to an especially large number of especially interesting and important results), but fewer than 5% of the papers report enough experiments to even meet Francis’ criteria for evaluability in the first place. Why? Because of the severe length restrictions, of course.

    2. Indeed, one thing that I find … tone-deaf … about many of the current crop of discussions of research practices is the seeming rush to damnation. This seems especially true of Francis’ (rather large number of) papers on this topic; he seems to have a hair-trigger for suspecting malfeasance and for ruling papers utterly worthless on the basis of such tests — whereas I suspect that many cases of excess significance are due to factors that, while still not unproblematic, are at least less damning. Or at least I hope they’re less damning, since I’m not always innocent. Take the “Wolfpack Effect” paper that occasioned this conversation. Guess what: there was at least one other experiment that we ran as part of that project that we didn’t report in that paper. Why? Because of those god-awful word limits at Psych Science! They force us to kill so many darlings — not only turns of phrase, but sometimes entire experiments too. (At least Rob’s blog-commenting system doesn’t impose such restrictions.) What to do in this situation? Should the fact that we can’t possibly squeeze so many experiments into so few words thus require us to submit only to less constraining journals? Surely not. Should we have mentioned that we also ran this other unreported experiment? Probably. But we didn’t. (Actually, it was in the initial submission, but then of course some reviewer made us devote those zero-sum-game words instead to a discussion of their very wise comments.) Upon learning this horrible news, should you be concerned for the reality of the phenomenon? I hope not. (The unreported experiment worked too. Quite nicely, in fact — and even better than some of the ones in the paper!) Anyway, I bet that this is the sort of explanation that is responsible for many cases of “excessive significance” in practice. But you wouldn’t ever suspect this from reading Francis’ paper, which talks only about fraud, malfeasance, criminal ignorance, et al.

    • I tend to agree with you and Rob about the properties of visual perception that give it an advantage compared to other areas of psychology. However, I don’t think we have enough data to draw a firm conclusion that our intuitions are correct.

      Regarding your thoughts about the zero-subject pilot studies. There is no need to report all experiments (pilot or not) to avoid failing the TES. What is required is that the experiments that are reported be sufficiently highly powered (or otherwise convincing if null results are what define success).

      Regarding using Psych Science for the TES analysis. The decision to analyze articles in Psych Science, and some of the particulars of the analyses (such as investigating only articles with 4 or more studies), came about because the editor-in-chief, Eric Eich, invited me to do the analyses in that way. He said that he would publish an article describing the analyses, invite authors whose work was critiqued to write comments, and allow me to reply. After Eich saw the article, he decided to reject it, first citing that it was not of sufficient interest and then (after I appealed) citing that the validity of the TES analysis was uncertain.

      Regarding your second point, can you please identify where any of my articles accuse authors of malfeasance? I tried to be very careful to not make such claims. Indeed, one of my main arguments is that the problems are the result of fundamental misunderstandings about how to make scientific arguments across multiple experiments. Thus, ignorance (not “criminal” ignorance) is a reasonable interpretation for the source of the problems rather than malfeasance/fraud. I am disturbed that readers are somehow drawing the opposite conclusion.

      The conclusion regarding an article that fails the TES is that something is wrong, but the TES cannot specify exactly what is wrong. This wrongness means that at least some of the theoretical claims are not properly supported by the presented empirical data. We don’t necessarily know if it is just one theoretical claim or all of them. Sometimes we can take a closer look at the details of the claims and the empirical data and make some pretty good interpretations, but that’s for subject matter experts. We all agree that fraud is worse than adding subjects to get from p=.06 to p=.04. But if we do the latter many times, then the TES is going to tell us (correctly) that the reported findings are improbable and that something (again it doesn’t tell us what) is wrong. I think we have to aim for scientific evidence that is better than being not as bad as fraud.

      Regarding some of the details you provided about the Wolfpack Effect paper. You probably should be grateful that the additional successful experiment was not included in the article. The P_TES value for the published article is 0.115. An additional experiment, even with power of around .8, would give a P_TES value below the 0.1 threshold.

      In your following comment you wrote, “What criteria determine the input to such a test, in principle and in practice?” which is a good question. In principle, the input to such a test has to be determined by subject matter experts. This could be done within an article or across articles. In practice, I usually use the definitions of obvious subject matter experts: the authors of a single article or a meta-analysis. Let’s take the Wolfpack effect paper as an example.

      You correctly note that the article measures several different effects. Consider Experiment 1. The results section is pretty clear about what effects matter for the theoretical claim(s):

      “The wolfpack effect impaired detection of chasing: Performance was significantly worse on wolfpack trials (62%, SD = 7.8%) than on perpendicular (72%, SD = 9.0%), match (75%,SD = 9.1%), or disc (72%, SD = 9.4%) trials, and nearly all observers showed this pattern (for statistical tests, see Table 1). Thus, the wolfpack effect is strong enough to trump actual chasing—impairing observers’ ability to detect actual pursuit, even when orientation is not task relevant.”

      Three t-tests (first row in Table 1) provide the empirical support for these claims, and that’s what I used for estimating the success probability for Experiment 1. The calculation is rather tricky because of the within-subjects design, but I was able to estimate the correlations between conditions with the given statistical information. If you were to repeat this experiment with the same sample sizes, the estimated probability of getting these three tests to all produce significant outcomes is 0.418. There were three other (non-significant) tests reported in Table 1, but they do not seem to be related to the theoretical claims. If I misunderstood something, and these non-significant outcomes are also important to the theoretical claims, then the probability of success must be lower than 0.418.
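      The per-test success probability here is essentially post-hoc power. For a single one-sample t-test it can be approximated with a stdlib-only Monte Carlo sketch (my illustration, not Francis’s procedure, which for Experiment 1 also has to handle the within-subjects correlations), using the t(9) = 4.18 result from Experiment 3a and taking the observed effect size at face value:

      ```python
      import random
      import statistics

      def replication_power(t_obs, n, t_crit, trials=20000, seed=1):
          """Monte Carlo estimate of the chance that an exact replication of a
          one-sample t-test comes out significant, taking the observed effect
          size (d = t_obs / sqrt(n)) at face value."""
          random.seed(seed)
          d = t_obs / n ** 0.5
          hits = 0
          for _ in range(trials):
              sample = [random.gauss(d, 1.0) for _ in range(n)]
              t = statistics.fmean(sample) / (statistics.stdev(sample) / n ** 0.5)
              if abs(t) > t_crit:
                  hits += 1
          return hits / trials

      # Experiment 3a: t(9) = 4.18, n = 10; the two-tailed .05 critical t for
      # df = 9 is about 2.262. The estimate comes out high (around 0.95),
      # consistent with the "above 0.9" values for Experiments 3a-3d.
      print(replication_power(4.18, 10, 2.262))
      ```

      Taking the observed effect size at face value is the standard caveat here; if the published estimate is inflated by selection, the true replication probability is lower.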

      And that is how the TES analysis goes for each experiment. We look for the theoretical claim(s) and the empirical results that were used to support those claims.

      Exp 2.: “The wolfpack effect impaired participants’ ability to detect and evade the real wolf:”

      Exp. 3a: “observers spent less time (7.99 s, SD = 0.39) in wolfpack quadrants (and more time in perpendicular quadrants; 9.01 s) than would be predicted by chance (8.5 s), t(9) = 4.18, p = .002”

      Exp. 3b: “Observers again spent less time (7.97 s, SD = 0.26) in wolfpack quadrants (and more time in perpendicular quadrants; 9.03 s) than would be predicted by chance (8.5 s), t(6) = 5.291, p = .002, and this pattern held true for all participants individually. This replication is consistent with the idea that the wolfpack effect is a social cue and not just a physical cue,”

      Sometimes a successful finding is a predicted non-significant outcome:

      Exp. 3c: “Observers spent no more time in the nonflashing quadrants (8.58 s, SD = 0.47) than would be predicted by chance (8.5 s), t(13) = 0.67, p = 0.51. This finding suggests that the avoidance of wolfpack quadrants in the previous experiments was not due to attentional capture.”

      Exp. 3d: “Observers spent no less time in wolfpack quadrants (8.63 s, SD = 0.55) than would be predicted by chance (8.5 s), t(8) = 0.708, p = 0.499, and spent no more time in the target quadrant (3.98 s, SD = 0.76) than in the other nonwolfpack quadrant (4.39 s, SD = 0.59), t(8) = 1.001, p = 0.356—and in fact the latter numerical difference trended in the opposite direction. This suggests that the wolfpack effect in the Leave Me Alone task is truly a social effect and is not simply mediated by some new form of attentional direction or grouping by the wolfpack items.”

      Note that across experiments, there is a coherent set of theoretical claims being made about when the wolf pack effect happens and when it does not, along with claims about mechanisms. Experiment 4 adds to the claims:

      Exp. 4: “performance in the wolfpack-to-wolf condition (69.4%, SD = 5.6%) was both significantly better than performance in the wolfpack-to-sheep condition, F(1, 9) = 20.857, p = .001, and significantly worse than performance in the perpendicular-to-sheep condition, F(1, 9) = 7.078, p = .026. This finding that the wolfpack’s target makes a difference is another indication that the wolfpack effect is a type of social cue. In addition, it again demonstrates that the influence of the wolfpack effect cannot simply stem from any general sort of grouping or attention capture, as these factors were always equated in this experiment; rather, it seems to matter in such situations whether the wolfpack is facing you or a third party.”

      Looking just at the statistics, the evidence used to support the claims in Experiments 3a-d seem pretty strong (estimated probabilities of success are above 0.9). Experiments 1, 2 and 4 are relatively weak (estimated probabilities of success of 0.418, 0.639, and 0.583). Overall, the probability of the entire set is around 0.115, which is not below the standard TES threshold, but is low enough that I think you should be concerned. Surely you would want your findings to have a better than 12% chance of being replicated if someone repeated all of your experiments.
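      Multiplying these per-experiment probabilities reproduces the overall figure, give or take. The exact values for Experiments 3a-3d aren’t stated beyond “above 0.9”, so the 0.93 entries below are placeholders:

      ```python
      # Rough reconstruction of the overall P_TES as the product of the
      # per-experiment success probabilities. The 0.93 entries for Experiments
      # 3a-3d are placeholders (only "above 0.9" is stated); the other three
      # are the values given for Experiments 1, 2, and 4.
      from math import prod

      per_experiment = [0.418, 0.639, 0.583, 0.93, 0.93, 0.93, 0.93]
      p_tes = prod(per_experiment)
      print(round(p_tes, 3))  # near the reported 0.115 under these assumptions
      ```

      Note how the three relatively weak experiments dominate the product: a set of individually significant results can still be jointly improbable.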

      Just to wrap up, I have seen the demos of the wolf pack effect, and I find them convincing; but I don’t think the reported experiments provide especially good evidence for the claims (the P_TES is above the standard threshold, but I would personally want more). I am a modeler, and I would be reluctant to start modeling the phenomena reported in this paper because I suspect some of the findings are not measured with enough precision to allow me to properly constrain the model. I don’t doubt the wolf pack effect is real, but the measurements reported in this paper are not good enough, at least for how I might want to use them. Many of the other articles in Psych Science are much worse.

  3. By the way, here’s another thing that I don’t really understand about the TES:

    What criteria determine the input to such a test, in principle and in practice?

    In principle, Francis talks about “experiments that are presented as empirical support for a theory”, and Rob’s post talks about tests of “an effect”. But what counts as “a theory” or “an effect”?

    In practice (though I admit that I haven’t studied his paper super-carefully), Francis’ criterion seems to be that the experiments were all reported in the same paper. But surely that can’t be right: it’s perfectly possible for a single paper to test two very different “effects” or “theories”.

    When I saw that the Wolfpack paper had been included in this analysis, my first thought was that this was inappropriate, since the several experiments in that paper report (what I think of as) several very different effects that might stand or fall independently — and that might have independent and wide-ranging effect sizes, etc. So it doesn’t seem fair to lump them all together for such an analysis, unless the “theory” being tested is just that “this perceptual phenomenon exists, and has awesome and wide-ranging effects”. (Of course, mea culpa: we gave the paper a singular title, rather than calling it “Several Wolfpack Effects That Don’t Necessarily Entail Each Other”. I wonder why…)

    So what should the criteria for inclusion in a TES be? And how many of the other papers in Francis’ list were similarly testing different effects in different experiments?

  4. Pingback: Weekend reads: “Too much success” in psychology, why hoaxes aren’t the real problem in science | Retraction Watch

  5. Am I the only one to think that this PBR paper sounds a bit like revenge for getting rejected by Psych Sci? 🙂

    • I suspect there is an interesting story about why PS did not ultimately decide to publish this work, but I think that is beside the point. Instead, I think there are interesting discussions to be had about whether to use a fixed- or random-effects procedure with the TES, but I doubt that it would matter too much. As a model of openness, this is pretty impressive work. You can walk through all of his calculations, as he has made them publicly available.

      I have followed the Francis work with some interest and read many of the rejoinders. From what I have seen, authors sometimes acknowledge having stuff in the file drawer but then argue that the effect is still reliably different from zero when all information is included. I often think this kind of reply misses the point. The concern is that publication bias leads to inflated effect size estimates. This is bad because it distorts the literature and might lead people (e.g., grad students) to invest resources in areas that might end up being weaker than they would have anticipated from reading the papers.

  6. Many years ago, I submitted a paper to a special issue on emotions in decision making at a leading psych journal (Cognition & Emotion). In the paper, I was exploring how dysphoria might affect decision-making processes within a sequential choice paradigm. If memory serves me well, I obtained null effects for 16 out of 17 metrics when comparing dysphorics to non-dysphorics. In other words, the null effect was robust across a wide range of measures. Given the confused nature of the relevant literature, I thought that the paper would contribute to the debate regarding the effects of dysphoria on decision making. The editor of the special issue felt that it was a very strong paper but that he could not publish it in light of the null effects. So, the paper has sat in my proverbial pile of “to do” works that are yet to be published.

  7. Pingback: 115: L'amour de la répétition • Neuromonaco
