Real or Slop? — PL Papers Edition

@koronkebitch Got 10/10 by downloading the papers and asking Opus 4.6 to decide (GPT-5.4 got 9/10) 🤪
@koronkebitch Sadly, this means that one could make the slop papers even more convincing by letting Claude recursively improve and judge its own output. But I don’t believe this process would converge to a convincing paper:
@koronkebitch Both models missed obvious signs. The slop papers have a carelessly put-together layout, with lots of whitespace, equations running into the margins, and no polished diagrams or figures, yet neither model noticed this. That alone makes it possible to identify AI-generated papers purely visually. Neither model commented on mistakes in the AI-generated calculi, lemmas, or proofs either, even though such mistakes likely exist and could be found by a human reviewer.
@koronkebitch More generally, recent experiments with such feedback loops (like Anthropic's C compiler) have spent tens of thousands of dollars in API credits and still produced output that is not at the level of novel research. While the first few iterations are always impressive, LLMs do not seem to be good at long-context tasks that require deep thought about a system as a whole.
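@koronkebitch For concreteness, here is a minimal sketch of the kind of improve-and-judge loop I mean (the model ID, prompts, and stopping rule are placeholders I made up, not anything that was actually run):

```python
# Hypothetical self-improvement loop: the model critiques its own draft,
# then revises the draft to address the critique, and repeats.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"        # placeholder model ID for "Opus 4.6"

def ask(prompt: str, max_tokens: int = 4096) -> str:
    """Send a single-turn prompt and return the reply text."""
    reply = client.messages.create(
        model=MODEL,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text

draft = open("slop_paper.tex").read()   # the AI-generated paper to be polished

for _ in range(5):                       # arbitrary iteration cap
    critique = ask(
        "You are a PL reviewer playing 'real or slop?'. List every tell "
        "that suggests the following paper is AI-generated:\n\n" + draft
    )
    if "no tells" in critique.lower():   # naive convergence check
        break
    draft = ask(
        "Revise the paper so that none of these tells remain.\n\n"
        "Tells:\n" + critique + "\n\nPaper:\n" + draft,
        max_tokens=8192,
    )
```

@koronkebitch The obvious limit: the loop can only fix what the judge can see, and as noted above, the judge misses exactly the tells a human reviewer catches.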