I can't believe that we live in a timeline where the thing people go most apeshit for in the world is a repository that literally consists of 77 lines of markdown that literally just say "don't write code that is pointless to write" in 6 bullet points
these skills always crack me up, i will never think it is not funny that the correct way to declare a command in a programming interface is to beg something to consider some string as calling you and also beg it to like "wait when you call us that means you should keep listening to the text in this markdown file please" and nothing actually is ever anything except a mash of fucking vibes that exist in a universe designed for the appearance of things working

i love it when my program's execution conditions are "still active if unsure"

what in the fuck kind of world have we arrived at where in the optimal conditions where the "program" works fully as intended the program is "RUNNING" as long as the execution environment is "NOT SURE" (????????) if the program is "RUNNING"

this is what counts as benchmarking, with code links because this shit makes literally no sense and boils your brain if you try and read it:

there is an as-yet unmerged PR to "fix the correctness benchmarks" and a "robustness audit" that is wonderful:

https://github.com/DietrichGebert/ponytail/pull/83

  • someone raised an issue like "hey this makes the models worse"
  • the LLM self-diagnosed the problem as being that all the tests are based off extracting code from fenced code blocks (true)
  • the prior prompt text failed because all the LLMs just made up function names for one of the prompts, so now the prompt text just says the name of the function and what it should accept and return. (the models by default just copy/paste the most common response on stack overflow)
  • two whole new sets of tests, defined differently than all the other tests, are added. one is just more of the tests testing the tests and the other is sweet mother of mercy what the hell is that
  • the robustness audit passes if every test fails, the only thing that matters is if ponytail fails more than the baseline. therefore ponytail is good. untested is whether the test output is meaningful or possible to fail.
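
to make that last bullet concrete, here is a minimal sketch of that kind of gate. it is not the actual ponytail benchmark code: `extract_code`, `count_failures` and `relative_gate` are hypothetical names, and the "skill must not fail more than baseline" comparison is my reading of the PR, so treat it as an assumption.

```python
import re
from typing import Callable, Iterable, Optional

# Built from a single backtick so this example contains no literal fence.
FENCE = "`" * 3
BLOCK_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> Optional[str]:
    """Return the body of the first fenced code block in a reply, or None."""
    match = BLOCK_RE.search(reply)
    return match.group(1) if match else None


def count_failures(replies: Iterable[str], check: Callable[[str], bool]) -> int:
    """Count replies that have no fenced block or whose code fails `check`."""
    failures = 0
    for reply in replies:
        code = extract_code(reply)
        if code is None or not check(code):
            failures += 1
    return failures


def relative_gate(baseline_failures: int, skill_failures: int) -> bool:
    """Hypothetical gate: pass unless the skill fails MORE than the baseline."""
    return skill_failures <= baseline_failures


# Neither set of replies contains a fenced block, so every test fails on both
# sides, and the gate still reports a pass.
prose_only = ["sure!", "done", "ok"]
baseline = count_failures(prose_only, lambda code: True)
skill = count_failures(prose_only, lambda code: True)
print(baseline, skill, relative_gate(baseline, skill))
```

nothing in a comparison like this checks that any test can pass at all, which is the "untested" part above.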

The office of internal LLM affairs has done a full self-evaluation and concluded that the LLM did nothing wrong. The LLM resists any changes to its prompt text because the prompt text says to resist any changes. The prompt text manifestly causes the models to produce baffling code in the very PR that audits the skill, but that just shows that the skill is good.
@jonny the readme for the project has very bad performative gender vibes.
@jonny you are a gifted writer jonny. You really capture the sensation of the cooked brain exiting my cranium, already the texture of a light frappé
@jonny you are truly brave Jonny. Are you still working on that piece that documents your dive into this wreck?
@ink i have been thinking i want to save it and collect more samples over time and make the argument "it's not getting better" rather than "claude code is a trainwreck" because that's easily dismissed as "it used to be bad but now it's a totally rewritten codebase catch up with the velocity of the times"
@jonny that is a smart move, especially because of how the code is being written.
@jonny I spent/wasted some time recently putting together some skills & scripts to summarize my rss feeds and found most of the time claude just wanted to burn tokens writing the same scripts it could have simply called. It was a frustrating experience, and one that you probably wouldn't notice if you just granted permission to run arbitrary Python code, and let it loose. https://inkdroid.org/2026/06/10/inside-out/

@jonny

if switches to give a name to the prompt, rather than idk labeling the prompts themselves

You don't understand. This original prompt has been brought into being by Tibetan throat singers during a 48-hour "vibing" session with the late Sir Ferdinand von Codeschreiber.
You can't simply change it, because that would cause things to fail in a way that can't be properly tested because it's all snake oil anyway...

@jonny I struggle to imagine a more effective way to violate every tenet of good design than this project

truly, I wonder if someone can do worse.