Mastodawn

Excellent article on the dangers of dichotomisation of continuous variables

“Cake causes herpes?” - promiscuous dichotomisation induces false positives
https://link.springer.com/article/10.1186/s12874-025-02712-0

#dataanalys #statistics #stats

“Cake causes herpes?” - promiscuous dichotomisation induces false positives - BMC Medical Research Methodology

Background: Continuous biomedical data is often dichotomized into two or more groups for analysis, despite long-standing warnings from statisticians that this constitutes bad practice. This dichotomisation is typically discouraged because it reduces statistical power and may obscure important trends. This paper considers another reason to discourage this practice: that dichotomisation is a powerful tool to manipulate data, as dichotomising at an arbitrary yet flexible threshold (which we term ’promiscuous dichotomisation’) represents a powerful researcher degree of freedom.

Methods: The motivating question is how probable is it that, given a set of uniformly distributed data, a threshold can be engineered to produce the illusion of a true effect when none exists? To estimate this, we employed both analytical approaches and Monte-Carlo simulation approaches to quantify the expected number of spurious findings that could arise from manipulating a dichotomous threshold for an arbitrary data set. We also illustrate an example of this with NHANES data, showing how a spurious relationship between blood glucose and herpes status could be engineered.

Results: For even a relatively small sample of $n=100$, a false positive rate of $\approx 38\%$ can be observed, rising to over $66\%$ if low-count scenarios are not excluded. With larger samples, even with low-count exclusion, false positive rates in excess of $66\%$ for $n=1000$ and $83\%$ for $n=10,000$ are possible, climbing to in excess of $81\%$ and $89\%$ respectively if low-count scenarios were not excluded. For most configurations, manipulation of thresholds was a highly viable method of crafting a false positive result.

Conclusions: It is likely that manipulating cut-off points in measured variables represents a significant source of data manipulation in published science, and the ease of access of larger health databases means this is an issue that is likely to grow in severity. We discuss implications of this, and means of identifying potential promiscuous dichotomisation.
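
The threshold scan described in the Methods can be illustrated with a rough Monte-Carlo sketch. This is not the authors' code. The abstract does not say which significance test or low-count rule the paper uses, so the sketch assumes Pearson's chi-square test on a 2x2 table (no continuity correction), a binary outcome with 50% prevalence that is independent of the continuous variable, and a hypothetical `min_cell` parameter standing in for "low-count exclusion". All function names are made up for illustration, and the rates it produces should not be expected to match the paper's figures.

```python
import math
import random


def chi2_p_value(a, b, c, d):
    """P-value of Pearson's chi-square test (1 d.o.f., no continuity
    correction) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0  # degenerate table: no evidence of association
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of chi-square with 1 d.o.f. is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def min_p_over_thresholds(values, outcomes, min_cell=5):
    """Try every cut-point between sorted values and return the smallest
    p-value found. Tables with any cell below min_cell are skipped
    (an assumed stand-in for the paper's low-count exclusion)."""
    pairs = sorted(zip(values, outcomes))
    n = len(pairs)
    total_pos = sum(1 for _, o in pairs if o)
    best = 1.0
    pos_below = 0
    for k in range(1, n):  # cut between pairs[k - 1] and pairs[k]
        pos_below += 1 if pairs[k - 1][1] else 0
        a = pos_below               # below threshold, outcome present
        b = k - pos_below           # below threshold, outcome absent
        c = total_pos - pos_below   # above threshold, outcome present
        d = (n - k) - c             # above threshold, outcome absent
        if min(a, b, c, d) < min_cell:
            continue
        best = min(best, chi2_p_value(a, b, c, d))
    return best


def false_positive_rate(n=100, trials=2000, alpha=0.05, min_cell=5, seed=0):
    """Fraction of null data sets in which at least one threshold gives
    p < alpha, even though outcome and variable are independent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        values = [rng.random() for _ in range(n)]          # uniform data
        outcomes = [rng.random() < 0.5 for _ in range(n)]  # independent outcome
        if min_p_over_thresholds(values, outcomes, min_cell) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(false_positive_rate(n=100, trials=500))
```

Since the data are generated with no real association, a single pre-specified threshold would be flagged about 5% of the time at `alpha=0.05`. Anything well above that in the output is purely a result of being free to choose the cut-point after looking at the data.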

SpringerLink