#statstab #363 p-checker: The one-for-all p-value analyzer

Thoughts: Easy way to check for publication bias using some current tools.

#shiny #pvalue #phacking #QRPs #zcurve #bias #pcurve #rindex

https://shinyapps.org/apps/p-checker/

Experience Statistics
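
Not from the post itself, just a minimal sketch of the right-skew idea behind p-curve style checks like the ones p-checker runs, using made-up p-values. Among significant results, a genuine effect produces more very small p-values than values hovering just under .05; the real p-curve test is more refined, this is only a crude binomial version.

```r
# Crude p-curve style check on a hypothetical set of reported p-values
# (illustrative numbers only). A genuine effect should push significant
# p-values below .025 more often than into the .025-.05 band.
reported_p <- c(0.003, 0.011, 0.041, 0.017, 0.049, 0.008, 0.032, 0.021)

sig    <- reported_p[reported_p < .05]
n_low  <- sum(sig < .025)    # strongly significant half
n_high <- sum(sig >= .025)   # "just significant" half

# Under selection on pure noise the two halves would be roughly 50/50.
binom.test(n_low, n_low + n_high, p = 0.5, alternative = "greater")
```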

#statstab #358 What are some of the problems with stepwise regression?

Thoughts: Model selection is not an easy task, but maybe don't naively try stepwise regression.

#stepwise #regression #QRPs #issues #phacking #modelselection #bias

https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/

Stata | FAQ: Problems with stepwise regression

What are some of the problems with stepwise regression?
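
None of this is in the Stata FAQ itself; it is just a quick base-R simulation (my own toy setup) of the headline problem: run stepwise selection on pure noise and it will still hand back a model with apparently significant predictors.

```r
# Stepwise selection on pure noise (simulated data, nothing real).
# 30 candidate predictors, none related to the outcome.
set.seed(1)
n <- 100; k <- 30
dat <- as.data.frame(matrix(rnorm(n * k), n, k))
names(dat) <- paste0("x", 1:k)
dat$y <- rnorm(n)                  # outcome is independent of every predictor

full   <- lm(y ~ ., data = dat)
picked <- step(full, direction = "backward", trace = 0)

summary(picked)   # retained predictors typically show p < .05 anyway
```

The p-values in the final model are computed as if that model had been specified in advance, which is exactly the kind of complaint the FAQ lists.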

P-hacking is the practice of tweaking the analysis or the data until you get a statistically significant result.
You go looking for a desirable result and report only the analyses that produced it, ignoring all the times nothing came out.
It may buy a publication in the short term, but p-hacking feeds the reproducibility and replicability crisis in science, filling the scientific literature with dubious or unfounded conclusions.
#science #statistic #phacking
https://www.nature.com/articles/d41586-025-01246-1
P hacking — Five ways it could happen to you

Some data practices can lead to statistically dubious findings. Here’s how to avoid them.
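
A toy simulation (mine, not from the Nature piece) of the "report only the analyses that worked" mechanism described above: test several unrelated outcomes per study and publish whenever any of them crosses p < .05.

```r
# Selective reporting across multiple outcomes (simulation, no real data).
# Each "study" measures 5 unrelated outcomes on two equal groups and is
# counted as a success if any single outcome reaches p < .05.
set.seed(42)
one_study <- function(n = 30, n_outcomes = 5) {
  p <- replicate(n_outcomes, t.test(rnorm(n), rnorm(n))$p.value)
  any(p < .05)
}
mean(replicate(2000, one_study()))   # about 0.23 instead of the nominal 0.05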

#phacking #LLM performance:

"The researchers found that big players like Meta, Google, OpenAI, and Amazon are given special privileges to privately test multiple versions of their models and only publish the best results. This hidden practice allows them to inflate their rankings by cherry picking data, making their models appear stronger than they actually are."

https://arxiv.org/abs/2504.20879?utm_source=beehiiv&utm_medium=newsletter&utm_campaign=mediamobilize&_bhlid=045dede5ad6eedcb96ec953f2fbac11a166a2243

via Sabine Hossenfelder's newsletter

The Leaderboard Illusion

Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field
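
The selection effect described in the abstract is easy to reproduce numerically. A rough sketch with invented numbers (true skill, noise level, and the Elo-like scale are all assumptions of mine): submit many private variants, keep only the best score, and the published number drifts well above the model's true level.

```r
# Best-of-N score selection (illustrative numbers, not the paper's data).
# A variant's measured score = true skill + evaluation noise.
set.seed(7)
true_skill     <- 1200
measurement_sd <- 15

score_once    <- function() true_skill + rnorm(1, mean = 0, sd = measurement_sd)
score_best_of <- function(n) max(replicate(n, score_once()))

mean(replicate(5000, score_once()))       # ~1200: a single honest submission
mean(replicate(5000, score_best_of(27)))  # ~1230: best of 27 private variants
```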


#statstab #330 Encourage Playing with Data and Discourage Questionable Reporting Practices

Thoughts: What are and aren't "Questionable Research Practices"? Where is the "grey area"? Interesting opinion piece.

#QRPs #exploratory #EDA #posthoc #phacking

https://link.springer.com/article/10.1007/s11336-015-9445-1

Encourage Playing with Data and Discourage Questionable Reporting Practices - Psychometrika

> 2011: Joseph Simmons, Leif Nelson, and Uri Simonsohn publish a paper, “False-positive psychology,” in Psychological Science introducing the useful term “researcher degrees of freedom.” Later they come up with the term p-hacking, and Eric Loken and I speak of the garden of forking paths to describe the processes by which researcher degrees of freedom are employed to attain statistical significance.
https://statmodeling.stat.columbia.edu/2016/09/21/what-has-happened-down-here-is-the-winds-have-changed/
#PHacking #DegreesOfFreedom
@bsmall2
What has happened down here is the winds have changed « Statistical Modeling, Causal Inference, and Social Science

🧵
> ... unethical behaviour during the report of results is.. P hacking... frequent in research.. [of a] clinical nature... two main reasons.. First, scientists are often evaluated by the number and quality of publications, and sometimes this pressure to get significant results makes some scientists cherry-pick their results. Second (and more frequent), some inexperienced analysts are unaware of the importance of #MultipleTesting and think this is OK. But it is not! #PHacking
@bsmall2
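
Since the quoted complaint is about analysts who do not correct for multiple testing, here is a minimal base-R illustration (the p-values are invented) of what a correction does:

```r
# Multiple testing in one line (made-up p-values for illustration).
# With 10 true-null comparisons at alpha = .05, the chance of at least one
# false positive is 1 - 0.95^10, roughly 0.40.
p_raw <- c(0.004, 0.031, 0.046, 0.09, 0.12, 0.25, 0.33, 0.41, 0.62, 0.88)

p.adjust(p_raw, method = "bonferroni")  # only 0.004 survives (adjusted to 0.04)
p.adjust(p_raw, method = "BH")          # gentler false-discovery-rate control
```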

#statstab #168 Large P-Values Cannot be Explained by Power Analysis

Thoughts: "Researchers cannot “aim” for p = .05, not even with a careful, perfectly accurate, power analysis."

#research #nhst #pvalues #power #QRPs #phacking

https://quentinandre.net/post/large-pvalues-and-power-analysis/

Large P-Values Cannot be Explained by Power Analysis | Quentin André

Can p-values be close to .05 because researchers ran careful power analysis, and collected 'just enough' participants to detect their effect? In this blog post, I evaluate the merits of this argument.

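
André's point is easy to check by simulation; here is a sketch with parameters of my own choosing, where the power analysis is perfectly accurate and yet the observed p-values rarely land just under .05.

```r
# Even a perfectly accurate power analysis doesn't make p-values cluster near .05.
# Two-group design, true effect d = 0.5, n per group chosen for 80% power.
set.seed(123)
n <- ceiling(power.t.test(delta = 0.5, sd = 1, power = 0.80)$n)   # ~64 per group

p <- replicate(5000, t.test(rnorm(n, 0.5), rnorm(n, 0))$p.value)

mean(p < .05)              # ~0.80, exactly as designed
mean(p > .025 & p < .05)   # only ~0.09 of studies end up "just under" .05
```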

#statstab #135 {p-checker} The one-for-all p-value analyzer

Thoughts: Why not try your hand at some tools for detecting publication bias (mileage may vary)? Useful teaching demo tools.

#shinyapp #r #rstats #phacking #pvalues #NHST #education #edu

https://shinyapps.org/apps/p-checker/

Experience Statistics

Fun class survey of other undergrads, current N=55. I'm doing #irresponsible #DataAnalysis bc not actual #research.

Still #WTF?

ghost_recv_log = How many times have you been ghosted? (log-transformed)

mosi = Misperception of others' sexual interest

bjw = Belief in a just world

swls = Satisfaction with life

csei = College self-efficacy

High self-esteem assholery? IDK.

OH WAIT. Gender!

Shit. We didn't ask.

#NotResearch #Datafishing #phacking but it's #OK I'm a #professional #oops