In a recent @TrendsEcolEvo paper, Muff et al. suggested using the wording "weak/moderate/strong/very strong evidence" instead of binary yes/no p-values. https://www.sciencedirect.com/science/article/pii/S0169534721002846
This article was met with some controversy - see the replies by @lakens https://www.sciencedirect.com/science/article/abs/pii/S0169534721003414?via%3Dihub , by Hartig & Barraquand https://www.sciencedirect.com/science/article/abs/pii/S0169534722000489?via%3Dihub , and by Amrhein & Greenland https://www.sciencedirect.com/science/article/abs/pii/S0169534722000246?via%3Dihub . In my opinion, Dushoff et al. present a nice alternative, referring to "statistical clarity": https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13159 #Rstats #statistics
@DenisMeuthen @TrendsEcolEvo I regret to say I don't think this last contribution is useful. The correct statement after a non-significant effect is simply 'we cannot reject the null'. That is formally correct, and clearer than 'statistically unclear'. The reason I don't like that term is that null effects *can* be statistically clear - just not if you only test against zero. But add an equivalence test, and they can be very clear evidence of the absence of meaningful effects.
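The equivalence-testing idea above can be sketched concretely. Below is a minimal two one-sided tests (TOST) check in Python - a stdlib normal approximation rather than a proper t-based TOST, with invented data and invented equivalence bounds, so read it as an illustration of the idea and not a reference implementation:

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def tost_equivalence(sample, low, high):
    """Two one-sided z-tests: is the true mean inside (low, high)?
    Normal approximation - reasonable only for large-ish samples."""
    n = len(sample)
    m = mean(sample)
    se = stdev(sample) / sqrt(n)
    # One-sided test against the lower bound (H0: mean <= low)
    p_lower = 1 - NormalDist().cdf((m - low) / se)
    # One-sided test against the upper bound (H0: mean >= high)
    p_upper = NormalDist().cdf((m - high) / se)
    # Equivalence is declared only if BOTH one-sided tests reject,
    # so the overall p-value is the larger of the two
    return max(p_lower, p_upper)

# Invented sample with a mean near zero: a standard test can't reject 0,
# but TOST can reject all effects outside the bounds (-0.5, 0.5)
sample = [0.1, -0.2, 0.05, 0.0, 0.15, -0.1, 0.2, -0.05, 0.1, 0.0,
          0.05, -0.15, 0.1, 0.0, -0.05, 0.2, -0.1, 0.05, 0.0, 0.1]
p = tost_equivalence(sample, -0.5, 0.5)
```

If both one-sided tests reject (p below alpha), the data are statistically clear evidence that any effect is too small to matter - even though a test against zero alone would just come back 'non-significant'.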
@DenisMeuthen @TrendsEcolEvo I do agree (and have made the point myself) that 'not significant' is not a useful label. But 'we cannot reject' is the way to think about tests - after all, that is what they do: they reject values; they don't add 'clarity'. In general, coming up with new language is often not the best approach. Most of this was dealt with adequately more than half a century ago - let's just follow those suggestions, not reinvent the wheel (especially not badly).
@lakens Thank you for your input. And what wording would you recommend for different p-values, so as not to impose a standardized binary cut-off of "we can reject the null" at all values of p < 0.05?
@DenisMeuthen I would not make such a recommendation. I am a methodological falsificationist (Popper, Lakatos) and falsifiability means dichotomous conclusions, and long run interpretations (always do replication studies!). And yes, that feels silly if you get a p = 0.049 vs 0.051 but that is how it is. It even has some uses (see my recent blog posts http://daniellakens.blogspot.com/2022/07/irwin-bross-justifying-005-alpha-level.html). If you want gradual interpretations, I would recommend likelihoods, not p-values.
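The suggestion of likelihoods for gradual interpretations can be illustrated with a toy binomial example (all numbers invented). The likelihood ratio quantifies how strongly the data favor one hypothesized proportion over another - a graded measure with no cut-off:

```python
from math import comb

def binom_lik(k, n, p):
    """Binomial likelihood of k successes in n trials, given proportion p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Invented data: 62 successes in 100 trials.
# Compare support for H1: p = 0.6 against H0: p = 0.5
k, n = 62, 100
lr = binom_lik(k, n, 0.6) / binom_lik(k, n, 0.5)
# lr > 1 means the data support 0.6 over 0.5; the magnitude
# (here roughly 17) is the graded strength of that support
```

Unlike a p-value thresholded at 0.05, nothing here forces a dichotomous verdict - which is exactly why it suits "gradual interpretations" while leaving Neyman-Pearson testing to do the accept/reject work.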
@TrendsEcolEvo @DenisMeuthen @lakens This seems reasonable to me - perhaps especially if combined with open data. In many fields, users of published research involving statistical testing are mainly asking "will it work for my patients?" So how big the *difference* is (and in what direction), and whether the group effect appears consistently across individuals, matters as much as whether there is a clear group difference.
Quantifying Treatment Effects in Trials with Multiple Event-Time Outcomes | NEJM Evidence

@rspfau good references, thanks!
@DenisMeuthen @TrendsEcolEvo @lakens Muff et al. still want to categorize data, which I think sells results short. I agree with Amrhein and Greenland (2022) that it might help get more "non-significant" results out there, but there are better ways (because every study/result can be valuable). I like their suggestion about compatibility, but hope we can move away from testing point null hypotheses and towards testing sensible scientific hypotheses.
@raoulvanoosten @DenisMeuthen @TrendsEcolEvo I discuss testing interval hypotheses as the correct way forward here: https://journals.sagepub.com/doi/10.1177/1745691620958012 Note that *every* test is a categorization into a prediction that is corroborated or falsified. If you want to stop that practice, you will need to develop a new epistemology not built on methodological falsificationism. I have not seen anyone try it 😉
@raoulvanoosten @DenisMeuthen @TrendsEcolEvo Many people are trained in the Fisherian 'use' of p-values, which is actually an incoherent approach. Instead, the Neyman-Pearson school offers the only (as far as I know) coherent (that is, linked to an epistemology) use of p-values. Fine not to use it - use likelihoods if you want continuous quantifications - as long as you are coherent!
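The Neyman-Pearson reading is about long-run error control: reject whenever p < alpha and, when the null is really true, you will falsely reject at most alpha of the time. A toy simulation of that long run (stdlib only; z-test with known sigma, seed and sample sizes invented for the demo):

```python
import random
from statistics import NormalDist, mean
from math import sqrt

random.seed(42)
ALPHA = 0.05
N, SIMS = 30, 4000  # sample size per study, number of simulated studies

rejections = 0
for _ in range(SIMS):
    # The null is true: every sample is drawn from N(0, 1)
    sample = [random.gauss(0, 1) for _ in range(N)]
    z = mean(sample) / (1 / sqrt(N))          # known sigma = 1
    p = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    rejections += p < ALPHA

# Long-run Type I error rate: should hover around ALPHA
type_i_rate = rejections / SIMS
```

The individual p = 0.049 vs 0.051 verdicts may feel arbitrary, but across many studies the procedure keeps its promised error rate - which is the property the Neyman-Pearson framework actually guarantees.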
@lakens @DenisMeuthen @TrendsEcolEvo that's why including p-values is so problematic. Many scientists are conditioned to hyperfocus on them.
@raoulvanoosten @lakens In one of my publications, I once tried to leave p-values out, citing only average (±sd) values for each treatment, estimated effect sizes (from the model), confidence intervals, and other summary statistics. Editor's and reviewers' comments: "Results section is unintelligible as it is crowded by too many numbers, and p-values are missing".
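For what it's worth, the kind of p-value-free summary described in that tweet takes only a few lines to produce. A sketch with invented data, using a normal-approximation 95% CI rather than a t-based one:

```python
from statistics import mean, stdev, NormalDist
from math import sqrt

def summarize(a, b):
    """Group means ± sd, Cohen's d, and a 95% normal-approximation CI
    for the mean difference - no p-value reported."""
    ma, mb = mean(a), mean(b)
    sa, sb = stdev(a), stdev(b)
    na, nb = len(a), len(b)
    # Pooled sd for Cohen's d
    sp = sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2) / (na + nb - 2))
    d = (ma - mb) / sp
    # Standard error of the mean difference, and a 95% z-interval
    se = sqrt(sa**2 / na + sb**2 / nb)
    z = NormalDist().inv_cdf(0.975)
    ci = (ma - mb - z * se, ma - mb + z * se)
    return {"mean_a": ma, "sd_a": sa, "mean_b": mb, "sd_b": sb,
            "cohens_d": d, "diff_ci95": ci}

# Invented treatment/control measurements
treatment = [5.1, 4.8, 5.5, 5.0, 5.3, 4.9, 5.2, 5.4]
control = [4.5, 4.2, 4.8, 4.4, 4.6, 4.3, 4.7, 4.5]
report = summarize(treatment, control)
```

The CI carries the same information a test would (a 95% interval excluding zero corresponds to rejection at alpha = 0.05) while also showing the size and direction of the difference - the part readers asking "will it work for my patients?" actually need.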
@DenisMeuthen @raoulvanoosten Yeah, it can get busy. I think crowded often means 'I don't understand this as well as I should' ;)
@lakens I agree with your view; we should teach people to ask the right questions. And indeed, data are either compatible with the hypothesis or not.