🚨 preprint alert 🚨

Initial evidence that using #statcheck in peer review may reduce errors.

Results from a preregistered observational study of 7000+ psychology articles.

w/ @JelteWicherts

🧵

https://psyarxiv.com/bxau9

Background: ~50% of published psych articles that report statistics contain at least one p-value that is inconsistent with its reported test statistic and degrees of freedom, and in ~12.5% of articles this may affect conclusions about statistical significance.

See: https://link.springer.com/article/10.3758/s13428-015-0664-2

The prevalence of statistical reporting errors in psychology (1985–2013) - Behavior Research Methods

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion. In contrast to earlier findings, we found that the average prevalence of inconsistent p-values has been stable over the years or has declined. The prevalence of gross inconsistencies was higher in p-values reported as significant than in p-values reported as nonsignificant. This could indicate a systematic bias in favor of significant results. Possible solutions for the high prevalence of reporting inconsistencies could be to encourage sharing data, to let co-authors check results in a so-called “co-pilot model,” and to use statcheck to flag possible inconsistencies in one’s own manuscript or during the review process.

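To make concrete what an "inconsistent" p-value means, here is a minimal R sketch (made-up numbers, not our analysis code): recompute the p-value from the reported test statistic and degrees of freedom, and compare it to the reported p.

```r
# Hypothetical reported result: "t(28) = 2.20, p = .07"
reported_t  <- 2.20
reported_df <- 28
reported_p  <- 0.07

# Two-sided p-value implied by the reported t and df
computed_p <- 2 * pt(abs(reported_t), df = reported_df, lower.tail = FALSE)
computed_p  # ~.036

# The reported p doesn't match the recomputed one, and the mismatch even
# crosses the .05 threshold: a (gross) decision inconsistency.
# (statcheck itself also allows for rounding of the reported numbers.)
round(computed_p, 2) == round(reported_p, 2)
```
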
Several journals have started implementing #statcheck in their peer review process, hoping that requiring a "clean" statcheck report will substantially reduce statistical reporting inconsistencies in the articles they publish. BUT! Whether it actually does remains an empirical question.

We compared statistical inconsistencies in 2 journals that implemented #statcheck in peer review and 2 matched controls, before and after statcheck implementation:

- Psych Science (🤖) vs. Journal of Exp Psych: General

- Journal of Exp Soc Psych (🤖) vs. Journal of Pers & Soc Psy

We used preregistered multilevel logistic regression analyses to predict (decision) inconsistencies and found a significant journal type × time interaction in the expected direction: a steeper decline in (decision) inconsistencies in the statcheck journals than in the matched control journals.
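
For intuition, this is roughly what such a model looks like in R with lme4 (a sketch with hypothetical variable names; the actual preregistered analysis code is on the OSF, linked below):

```r
library(lme4)

# Sketch of a multilevel logistic regression with a journal type x time
# interaction (hypothetical variable names, not the preregistered code):
#   inconsistent : 0/1 flag for each reported NHST result
#   journal_type : statcheck journal vs. matched control
#   time         : before vs. after statcheck implementation
#   article_id   : random intercept, results nested within articles
fit <- glmer(
  inconsistent ~ journal_type * time + (1 | article_id),
  family = binomial,
  data   = results
)
summary(fit)  # the journal_type:time coefficient is the effect of interest
```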

These results provide initial evidence that using #statcheck in peer review may be a successful intervention to decrease statistical reporting inconsistencies. 🤖✅

but... >>

An important limitation of our study is that it is observational: by including matched control journals we aimed to reduce confounding, but we can't fully rule out alternative explanations (e.g., maybe authors who are more careful in their statistical reporting prefer a "statcheck" journal).

Plans for exploratory analyses: we noted that even after statcheck implementation, some articles still contained statistical inconsistencies 🤔. We will dive deeper into the full text of a subsample of these cases to see what is going on. Suggestions for additional analyses are welcome! 🧮

🔢 You can find the preregistration, data, and R code here: https://osf.io/q84jn/

🤖 Interested in #statcheck? Check out the latest version on http://statcheck.io.
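
A minimal example of running it from R (hypothetical snippet; output details may differ between versions):

```r
# install.packages("statcheck")
library(statcheck)

# Run statcheck on a piece of text containing an APA-style result
# (hypothetical example): it extracts the NHST result, recomputes the
# p-value, and flags inconsistencies.
statcheck("We found an effect, t(28) = 2.20, p = .07.")

# There are also helpers for whole files and folders, e.g. checkPDF()
# and checkdir() -- see the package documentation.
```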

Estimating the effectivity of statcheck in peer review to reduce statistical reporting inconsistencies

We compare the prevalence of statistical reporting inconsistencies between journals that implemented statcheck in their peer review process and matched control journals, before and after statcheck implementation.

@MicheleNuijten I know that I once "failed" a statcheck report because of some oddity of rounding (e.g., we reported that the test stat was 1.24, which was not above the criterion, but the real value was 1.244, which was...). I'm not sure if there's a way to account for that as a source of inconsistency (e.g., if journals have open data requirements, you could rerun the 'inconsistent' results?)

@regretlab I'm sorry to hear that! #statcheck is supposed to take correct rounding of the test stat into account.

I did recently fix a bug where the correct rounding wasn't taken into account for negative test stats; maybe that was the case for you?
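
For anyone curious, the rounding logic roughly works like this (hypothetical numbers, not the actual statcheck implementation): a test statistic reported as 1.24 could have been anything from 1.235 up to, but not including, 1.245, so the reported p-value is compared against the whole range of p-values implied by that interval.

```r
# Hypothetical numbers: a test statistic reported as 1.24 could be any
# value in [1.235, 1.245), so the implied two-sided p-value is a range.
reported_t  <- 1.24
reported_df <- 30

t_low  <- reported_t - 0.005
t_high <- reported_t + 0.005

p_upper <- 2 * pt(t_low,  df = reported_df, lower.tail = FALSE)
p_lower <- 2 * pt(t_high, df = reported_df, lower.tail = FALSE)
c(p_lower, p_upper)

# A reported p-value would only be flagged as inconsistent if it falls
# outside this whole range (again, a sketch, not the actual code).
```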

And this is also why #statcheck results should not be followed blindly, and why I hope editors, reviewers, and authors collaborate on reducing errors together.

P.S. If you do run into a bug like that again, plz let me know! 🙏

@MicheleNuijten This was many years ago, so I assume the issue is long since resolved! (And the details are thus fuzzy and may have been more of a "we truncated when we should have rounded" issue?) I actually run statcheck on my submissions regardless of journal policy bc it's such a helpful tool. ❤