Real or Slop? — PL Papers Edition

7/10, avg time 31s, total 5m21s. But I also made the mistake of seeing an incorrectly generated citation and immediately assuming that meant it was AI; another paper I felt should have cited Bowman but did not, so I thought it might be AI, but alas.
@koronkebitch yeesh, I did badly on this. 6/10 avg time 1m22s.
@rntz @koronkebitch 5/10 😭
@regehr @rntz @koronkebitch 5/10 😓 time pressure does not help...

@burakemir @regehr @rntz @koronkebitch 7/10 avg 51s tot 8m38s (and I would have gotten 8/10 if I had not mistakenly rejected the first real paper because its authors were anonymized)

Still. Bad human. No cookie.

@koronkebitch 7/10 26s I got baited

@koronkebitch somehow I got 10/10 in 1m38s

I think this says more about my propensity for slop than it does about PL knowledge

welp, back to my fraught OOPSLA all-nighter

@stschaef good luck! excited to read whatever you come up with, no matter the state <3
@koronkebitch 10/10 2s average and nobody knows how I did it
@koronkebitch FYI, on my Android device, the browser displays the PDF as a file name and a download button, and the file name gives away the answer.
@koronkebitch this is so scary
@jonmsterling yup we were having an existential crisis in the lab yesterday (will surely continue today)
@koronkebitch I did poorly with an avg time of 43s, but that was partly due to some faulty assumptions.
There were definitely human papers that surprised me with how shallow they were, or with weird page counts. The only obvious tells for AI were silly explanations and ideas that obviously don't work.

@koronkebitch 9 / 10 correct, 15s avg time, 2m 38s total time. not too bad.

I very quickly skimmed the papers and looked for any sign of life or humor. If they mentioned a language in the abstract, I checked whether the paper was consistent with that. My mistake was a real paper that I thought was AI. Tough test though.

@joomy I wonder if we are all failing on the same one 😬
@koronkebitch a heuristic that has worked well for me is whether the paper has a reference section at the end. Though I guess it’s only a matter of time till that heuristic stops working.
@d10c yup...
@d10c recently reviewed a paper with 100% hallucinated refs
@koronkebitch No thanks, 4/10, 1m14s avg. Help welcome.
@koronkebitch 9/10 in 7 minutes (I had one paper I falsely accused of being AI); a key indicator seemed to be that slop papers claim eight or nine main contributions. I'm no PL person, but I am a CS academic, and I feel like it's hard to write one paper that honestly does three or more things.
@koronkebitch @hallasurvivor This was super fun! I got 7/10 and made errors both ways. PL is not at all my field, I wonder if I'd do better or worse for algebraic topology papers (I'd hope better, but who knows).

@koronkebitch I've read almost no PL papers before but I got 8/10 with an avg time of 1m 37s

I was looking for obvious tells like broken LaTeX, over-explanation, and overly commented code snippets, but that made me miss a few because I misidentified notation I didn't understand as broken math.

@koronkebitch 10/10 by downloading the papers and asking Opus 4.6 to decide (GPT-5.4 got 9/10) 🤪
@koronkebitch Sadly, this means that one could make the slop papers even more convincing by letting Claude recursively improve and judge its own output. But I don’t believe this process would converge to a convincing paper:
@koronkebitch Both models missed obvious signs. The slop papers have a very carelessly put-together layout, with plenty of whitespace, equations running into the margin, and no beautiful diagrams or figures, but neither model noticed this. That makes it possible to identify AI-generated papers purely visually. Additionally, neither model commented on mistakes in the AI-generated calculi, lemmas, or proofs, even though such mistakes likely exist and could be found by a human reviewer.
@koronkebitch More generally, recent experiments using such feedback loops (like Anthropic's C compiler) have spent tens of thousands of dollars in API credits and still yielded output that is not at the level of novel research. While the first few iterations are always impressive, LLMs do not seem to be good at long-context tasks that require deep thought about a system as a whole.