AVeriTeC (NeurIPS 2023): 4,568 real-world fact-checked claims, web-retrieved evidence, four-way labels, temporal-leak-free split.
Two structural gaps. First, gold answers are frozen but the retrieval surface isn't: two systems run a year apart query a different Google. Second, the not-enough-evidence (NEI) class rewards weak retrievers, because predicting NEI when retrieval fails matches the gold label by coincidence.
https://benjaminhan.net/posts/20260507-averitec/?utm_source=mastodon&utm_medium=social
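
A toy sketch of the NEI point, in Python. Everything here is an assumption made for illustration: the `Claim` type, the `retrieval_ok` flag, the five made-up claims, and the fallback policy are hypothetical, and this is not the benchmark's official scorer. It only separates NEI label matches that happened with usable evidence from those that happened because retrieval came back empty.

```python
from dataclasses import dataclass

NEI = "Not Enough Evidence"


@dataclass
class Claim:
    gold: str           # gold verdict label
    retrieval_ok: bool  # hypothetical flag: did the retriever return usable evidence?


def fallback_predict(claim: Claim) -> str:
    # Toy system: assumed to be right whenever retrieval works,
    # and to fall back to NEI whenever retrieval fails.
    return claim.gold if claim.retrieval_ok else NEI


def nei_hits_by_cause(claims):
    # Split correct NEI predictions into "earned" (evidence was retrieved)
    # and "coincidental" (retrieval failed, fallback happened to match gold).
    earned = coincidental = 0
    for c in claims:
        if c.gold == NEI and fallback_predict(c) == NEI:
            if c.retrieval_ok:
                earned += 1
            else:
                coincidental += 1
    return earned, coincidental


def label_accuracy(claims):
    # Plain label match rate, ignoring evidence quality entirely.
    return sum(fallback_predict(c) == c.gold for c in claims) / len(claims)


# Made-up claims, not drawn from the dataset.
TOY_CLAIMS = [
    Claim("Supported", True),
    Claim("Refuted", False),  # retrieval fails, fallback NEI is wrong
    Claim(NEI, True),         # NEI reached with evidence in hand
    Claim(NEI, False),        # NEI matched only because retrieval failed
    Claim(NEI, False),        # same
]

if __name__ == "__main__":
    print(nei_hits_by_cause(TOY_CLAIMS))
    print(label_accuracy(TOY_CLAIMS))
```

On this toy set, two of the three correct NEI predictions come from failed retrieval, which is the coincidence described above: a label-only score cannot tell them apart from the one that was earned.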






