Selective reporting: The abuse of statistics

Posted by Andre M. Bellve (MSc Candidate)

As a quick preface – what I am talking about in this post isn’t new or revolutionary. My intent with this blog is to package some of the facts in a more digestible form and hopefully steer my fellow biologists to better practice.

When we carry out statistical significance tests (e.g. ANOVAs, t-tests, etc.) there are two types of error associated with them: false positives (type I errors), where we reject the null hypothesis, H₀, even though it is true; and false negatives (type II errors), where we fail to reject H₀ even though it is false. Every test carries both kinds of error, and for a given sample size they trade off against each other: tightening the cut-off to reduce one increases the other. A significance level of 0.05 means that, when the null hypothesis is true, we will incorrectly reject it 5% of the time. This error rate is what selective reporting, a.k.a. p-hacking, exploits. Collectively, we have (for the most part) deemed a 5% cut-off acceptable, at least when we aren’t betting on human lives, and I tend to agree.
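If you would rather see the 5% than take my word for it, here is a minimal simulation sketch. It is not from any particular study: it assumes a simple two-sample z-test with a known standard deviation of 1 (chosen only so it runs without extra libraries), and the function names, sample sizes and seed are all made up for illustration. Both groups are drawn from the same distribution, so the null hypothesis is true by construction and every "significant" result is a false positive.

```python
import math
import random


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_false_positive_rate(n_experiments=5000, n=20, alpha=0.05, seed=1):
    """Fraction of experiments that reject H0 when H0 is actually true.

    Assumption: both groups come from Normal(0, 1) and the standard
    deviation is treated as known, so a z-test applies.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        group_a = [rng.gauss(0, 1) for _ in range(n)]
        group_b = [rng.gauss(0, 1) for _ in range(n)]
        mean_diff = sum(group_a) / n - sum(group_b) / n
        # Standard error of the difference between two means, each with sd = 1
        z = mean_diff / math.sqrt(2 / n)
        if two_sided_p_from_z(z) < alpha:
            false_positives += 1
    return false_positives / n_experiments


rate = simulate_false_positive_rate()
print(f"False positive rate under a true null: {rate:.3f}")
```

The printed rate should land close to 0.05, with a little wobble from the random draws.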

What if the chance was higher though? If there was a 10% or 25% chance of a false positive, would you still trust your results? You might be thinking: “Well that’s silly Andre – who cuts off their p-values at anything higher than 0.05?” Well here’s the catch: if you carry out two significance tests, and both have a cut-off for significance at 5%, then the chance that at least one of them is a false positive is no longer 5%. If the tests are independent and both null hypotheses are true, it is 1 − 0.95² ≈ 9.75%, and it keeps climbing with every extra test. This inflation of the overall error rate is the basis of p-hacking. The same thing happens when making pairwise/multiple comparisons!
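To put numbers on that inflation, here is a small sketch of the arithmetic. It assumes the tests are independent and that every null hypothesis is true, in which case the chance of at least one false positive across m tests is 1 − (1 − α)^m. The function name is mine, not from any library.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests tests.

    Assumption: the tests are independent and every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_tests


for m in (1, 2, 5, 10, 20):
    print(f"{m:>2} tests at alpha = 0.05: {familywise_error_rate(0.05, m):.2%}")
```

Under those assumptions the overall chance of a false positive is roughly 23% by five tests, 40% by ten and 64% by twenty. Real tests on the same dataset are rarely fully independent, so treat these as a rough guide to how fast things get out of hand.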

[Image: type1error]

Credit: Statistical Statistic Memes via Facebook

Now this isn’t the end of the world – you can correct for it. The significance level that you set as your cut-off should define the overall false positive (type I) error rate of your experiment, not just that of each individual test. If you are doing several tests, then you have to correct for the increased chance of a false positive. Fortunately, it’s relatively easy to do! There is the extremely conservative Bonferroni correction, where you divide your significance level by the number of tests you are doing, which is appropriate in some cases (e.g. when it’s a case of life or death). This isn’t popular among biologists as we often deal with very noisy data and it can be hard enough to pick up a signal as it is. For this reason, the less conservative False Discovery Rate (FDR) correction works well when lives aren’t on the line: rather than guarding against any false positive at all, it limits the expected proportion of false positives among the results you call significant, so it gives up less power without sacrificing the integrity of your analysis. This is not an exhaustive list and you can read more about different methods for correction here.
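Here is a rough sketch of what the two corrections do to a set of p-values. A couple of assumptions to flag: the post only says "FDR correction", and I am assuming the Benjamini-Hochberg procedure, which is the most common way of doing it; the function names are mine; and the p-values are made up purely for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (assumed FDR method).

    Sort the p-values, find the largest rank k whose p-value is at most
    (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            largest_k = rank
    reject = [False] * m
    for index in order[:largest_k]:
        reject[index] = True
    return reject


# Hypothetical p-values from five tests (not real data)
p_values = [0.001, 0.012, 0.021, 0.045, 0.30]

print("Uncorrected:       ", [p <= 0.05 for p in p_values])
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

With these made-up numbers, four of the five tests look significant before correction, three survive Benjamini-Hochberg and only one survives Bonferroni, which is the conservative-versus-less-conservative trade-off in miniature. In practice you would probably reach for an existing implementation (R's `p.adjust`, for example) rather than rolling your own.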

Selective reporting happens a lot. One recent example came out of the sensory science food labs at Cornell University: Brian Wansink, director of the Food and Brand Lab at Cornell, boasted online that a volunteer research assistant of his was able to take a “failed study which had null results” and produce ‘significant’ results from it (Science of Us, 2016). However, careful investigation of the published papers turned up multiple errors and inconsistencies suggesting that the research assistant had ‘p-hacked’ the data to produce these results, although it seemed she had done so unwittingly. It is this kind of behaviour that Ioannidis (2005) had in mind when he argued that most published research findings are actually false. The incorrect use of these methods can put the objectivity of the traditional scientific method at risk and create issues of credibility for the scientific community. With more “fake news” and anti-science rhetoric being thrown around than ever before, the last thing we need is a blow to our credibility or any more fuel being added to the tangerine tire fire that is my president.

[Image: tangerinetyrefire]