«The "most accurate" #LLM silently discarded 63% of the relevant papers when screening for a #SystematicReview»

Lech Madeyski
The "most accurate" LLM silently discarded 63% of the relevant papers when screening for a Systematic Review.

That is what we found when we re-analysed a 9,695-article systematic review screening study: the LLM ranked best by Accuracy lost 63.3% of the relevant evidence. The one ranked best by MCC still lost 43.9%. The one ranked best by WMCC, the cost-sensitive Weighted Matthews Correlation Coefficient we propose in the paper, lost just 5.8%.

LLMs are increasingly used to screen papers for systematic reviews, but the standard metrics used to evaluate them can badly mislead under the extreme class imbalance and asymmetric error costs of screening.

Across 29 papers we reviewed:
🔸 only 24% reported the full confusion matrix
🔸 only 10% reported MCC
🔸 none of the 5 papers claiming "workload savings" priced the cost of a wrongly excluded study

Our new open-access paper in Information and Software Technology, LLM4SCREENLIT, turns this into actionable recommendations:
✅ Report Lost Evidence (1 − Recall) as a headline metric
✅ Use Weighted MCC (WMCC): chance-corrected AND cost-sensitive, validated on 9 LLMs × 24 SE secondary studies (34,528 articles)
✅ Always report the full confusion matrix; treat unclassifiable outputs as positives requiring human review
✅ Distinct guidance for benchmarking vs deployment studies, plus a ready-to-use compliance checklist for editors and reviewers

Joint work with Barbara Kitchenham (Keele University) and Martin Shepperd (Brunel University of London). I am affiliated with the Faculty of Information and Communication Technology (Wydział Informatyki i Telekomunikacji) at Wrocław University of Science and Technology (Politechnika Wrocławska).

I recently had the pleasure of giving an invited talk on this work at the AI Engineering lab at Chalmers University of Technology and the University of Gothenburg (https://lnkd.in/de3iAbas). Thank you again, Miroslaw Staron, for hosting that discussion.
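
To make the metric argument concrete, here is a minimal Python sketch of how Accuracy and MCC can both favour a screener that loses most of the relevant evidence. Everything in it is an assumption for illustration: the two confusion matrices are invented (1,000 articles, 30 relevant) and are not data from the paper, and the function names `screening_metrics` and `to_include_flag` are hypothetical. WMCC is deliberately left out, because its formula is defined in the paper and not in this post.

```python
import math


def screening_metrics(tp, fp, fn, tn):
    """Compute headline screening metrics from a full confusion matrix.

    Positive class = "relevant, include"; negative class = "exclude".
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Recall is undefined when there are no relevant articles at all.
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    lost_evidence = 1 - recall  # share of relevant articles wrongly excluded
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report MCC as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "lost_evidence": lost_evidence,
        "mcc": mcc,
    }


def to_include_flag(raw_output):
    """Map a raw LLM answer to an include decision.

    Anything that is not a clear "exclude" (including empty or garbled
    output) is treated as a positive, i.e. sent to human review.
    """
    return raw_output.strip().lower() != "exclude"


# Hypothetical screener A: conservative, excludes almost everything.
screener_a = screening_metrics(tp=12, fp=5, fn=18, tn=965)
# Hypothetical screener B: inclusive, keeps most relevant articles.
screener_b = screening_metrics(tp=28, fp=150, fn=2, tn=820)

for name, m in (("A", screener_a), ("B", screener_b)):
    print(
        f"Screener {name}: accuracy={m['accuracy']:.3f} "
        f"mcc={m['mcc']:.3f} lost_evidence={m['lost_evidence']:.1%}"
    )
```

In this invented example, screener A wins on both Accuracy and MCC while losing 60% of the relevant articles, and screener B loses under 7%. That is the kind of ranking reversal that reporting Lost Evidence next to the full confusion matrix is meant to expose.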
📄 Paper (open access, CC-BY): https://lnkd.in/dpVkt6QK
🧰 Replication package (R/Python scripts + fillable reviewer/editor checklist): https://lnkd.in/d8c-ZReU

#SystematicReview #LLM #EvidenceSynthesis #SoftwareEngineering #OpenAccess #MetaResearch


