BLOG POST: Invalid Conclusions Built on Statistical Errors

If you see a small p-value or a large separation between confidence intervals, you might assume the effect is reliable. But overly simple reporting can hide serious, conclusion-altering errors.

Here are some example scenarios:

http://steveharoz.com/blog/2024/wrong-conclusions-built-on-statistical-errors/

1/🧵 #stats #ieeevis #chi #hci

False positive: Zero effect but a strong result.

A tiny p-value and well-separated confidence intervals may seem like a clear-cut effect. But if the analysis involves a type of error called "pseudoreplication", the real effect may be a fraction of what's reported, if it exists at all.

2/

Rerunning that simulation many times shows that even though the true effect is 0, more than half of the runs yielded a false positive when analyzed incorrectly.
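
A minimal sketch of that scenario in Python (subject counts, trial counts, and noise levels are my assumptions, not the values behind the blog's figures):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_trials, n_sims = 5, 100, 1000  # assumed sizes, not from the post
fp_pooled = fp_means = 0

for _ in range(n_sims):
    # True condition effect is 0; subjects differ only in personal baseline,
    # so each subject's trials are correlated with each other.
    a = rng.normal(0, 1, n_subjects)[:, None] + rng.normal(0, 1, (n_subjects, n_trials))
    b = rng.normal(0, 1, n_subjects)[:, None] + rng.normal(0, 1, (n_subjects, n_trials))
    # Pseudoreplication: pool all trials as if each were an independent subject.
    fp_pooled += stats.ttest_ind(a.ravel(), b.ravel()).pvalue < 0.05
    # Correct unit of analysis: one mean per subject.
    fp_means += stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue < 0.05

print(f"pooled trials: {fp_pooled / n_sims:.0%} false positives")  # well over half
print(f"subject means: {fp_means / n_sims:.0%} false positives")   # roughly 5%
```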

3/

False negative: Failing to detect a very reliable effect

Here, a within-subject experiment is analyzed as if it were between-subject data. Even though the effect is consistent for 95% of subjects, the inappropriate analysis is overwhelmed by individual baseline differences and fails to detect any effect.
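
A rough sketch of this failure mode (the effect size, baseline spread, and sample size are assumed for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subjects = 20                            # assumed
baseline = rng.normal(50, 20, n_subjects)  # large individual differences
effect = 2.0                               # small but consistent improvement

cond_a = baseline + rng.normal(0, 1, n_subjects)
cond_b = baseline + effect + rng.normal(0, 1, n_subjects)

# Between-subject analysis of within-subject data: baselines drown the effect.
print(f"unpaired p = {stats.ttest_ind(cond_a, cond_b).pvalue:.3f}")  # likely n.s.
# Paired analysis removes each subject's baseline and recovers the effect.
print(f"paired   p = {stats.ttest_rel(cond_a, cond_b).pvalue:.6f}")  # tiny
```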

4/

False negative: Experiment design lacks sensitivity

Sometimes, the problem is in the design of the experiment itself. When baseline differences between subjects are large, a between-subject experiment is unlikely to find even a very reliable effect. In this example, a between-subject experiment with 200 subjects can be less effective than a more sensitive within-subject experiment with only 5 subjects.
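
A quick power sketch of that comparison (all effect and variance numbers are assumptions, not the blog's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
effect, base_sd, noise_sd, n_sims = 1.0, 10.0, 0.3, 2000  # assumed values

def power_between(n):
    # Different subjects per condition: baseline variance stays in the way.
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0, base_sd, n) + rng.normal(0, noise_sd, n)
        b = rng.normal(0, base_sd, n) + effect + rng.normal(0, noise_sd, n)
        hits += stats.ttest_ind(a, b).pvalue < 0.05
    return hits / n_sims

def power_within(n):
    # Same subjects in both conditions: the baseline cancels out.
    hits = 0
    for _ in range(n_sims):
        base = rng.normal(0, base_sd, n)
        a = base + rng.normal(0, noise_sd, n)
        b = base + effect + rng.normal(0, noise_sd, n)
        hits += stats.ttest_rel(a, b).pvalue < 0.05
    return hits / n_sims

print(f"between-subject, n=200: {power_between(200):.0%} power")  # low
print(f"within-subject,  n=5:   {power_within(5):.0%} power")     # much higher
```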

5/

Some red flags for spotting sketchy results:
🚩 Incomplete reporting. P-values aren't enough. Look for the full APA- or AMA-style report: F(_, _) = _, p = _, 95% CI [_, _], effect size = _.
🚩 Unreported model.
🚩 Hidden data or code.
🚩 Claims about the analysis approach being immune from common statistical concerns.

6/

Long-term improvements:
* More widespread stats education in applied fields.
* Reporting standards that make errors easier to spot.
* Mandatory open data & code (or an explicit reason why it can't be shared). Publication venues that don't mandate open practices don't warrant credibility.

7/7