I can't believe that we live in a timeline where the thing people go most apeshit for in the world is a repository that literally consists of 77 lines of markdown that literally just say "don't write code that is pointless to write" in 6 bullet points
these skills always crack me up, i will never think it is not funny that the correct way to declare a command in a programming interface is to beg something to consider some string as calling you and also beg it to like "wait when you call us that means you should keep listening to the text in this markdown file please" and nothing actually is ever anything except a mash of fucking vibes that exist in a universe designed for the appearance of things working

i love it when my program's execution conditions are "still active if unsure"

what in the fuck kind of world have we arrived at where in the optimal conditions where the "program" works fully as intended the program is "RUNNING" as long as the execution environment is "NOT SURE" (????????) if the program is "RUNNING"

this is what counts as benchmarking, with code links because this shit makes literally no sense and boils your brain if you try and read it:

there is an as-yet unmerged PR to "fix the correctness benchmarks" and a "robustness audit" that is wonderful:

https://github.com/DietrichGebert/ponytail/pull/83

  • someone raised an issue like "hey this makes the models worse"
  • the LLM self-diagnosed the problem as being that all the tests are based off extracting code from fenced code blocks (true)
  • the prior prompt text failed because all the LLMs just made up function names for one of the prompts, so now the prompt text just says the name of the function and what it should accept and return. (the models by default just copy/paste the most common response on stack overflow)
  • two whole new sets of tests, defined differently from all the other tests, are added. one is just more of the tests testing the tests and the other is sweet mother of mercy what the hell is that
  • the robustness audit passes if every test fails, the only thing that matters is if ponytail fails more than the baseline. therefore ponytail is good. untested is whether the test output is meaningful or possible to fail.
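
a minimal sketch of the failure mode in those bullets, as i read them. this is not the PR's code: `extract_code`, `failure_rate` and `relative_gate` are hypothetical names, and the "fails no more than the baseline" rule is my paraphrase of the audit. it shows that if scoring depends on pulling code out of a fenced block, then a run where nothing is ever extracted fails 100% on both sides and the relative gate still passes:

```python
import re
from typing import Callable, List, Optional

# Build the fence marker programmatically so this example can itself live
# inside a fenced block.
FENCE_MARK = "`" * 3
FENCE = re.compile(FENCE_MARK + r"[^\n]*\n(.*?)" + FENCE_MARK, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the body of the first fenced code block, or None if there is none."""
    match = FENCE.search(response)
    return match.group(1) if match else None


def failure_rate(responses: List[str], passes_test: Callable[[str], bool]) -> float:
    """Fraction of responses that fail: no fenced block found, or the test rejects it."""
    failures = 0
    for response in responses:
        code = extract_code(response)
        if code is None or not passes_test(code):
            failures += 1
    return failures / len(responses)


def relative_gate(skill_rate: float, baseline_rate: float) -> bool:
    """Hypothetical gate: pass as long as the skill fails no more often than the baseline."""
    return skill_rate <= baseline_rate


# Neither response uses a fenced block, so extraction fails for both,
# regardless of whether the code inside would have been correct.
baseline = ["def add(a, b): return a + b"]
with_skill = ["add = lambda a, b: a + b"]

always_ok = lambda code: True  # even a test that accepts everything cannot help
gate = relative_gate(
    failure_rate(with_skill, always_ok),
    failure_rate(baseline, always_ok),
)
print(gate)
```

both sides fail everything and the gate is still green, which is the "untested is whether the test output is meaningful or possible to fail" part.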
benchmarks: fix correctness gate + robustness audit (#65) by DietrichGebert · Pull Request #83 · DietrichGebert/ponytail

Two-part response to #65 ("Impact on model performance?"). Part 1 — fix the correctness gate Two bugs in the correct gate were under-reporting correctness for terse models — the likely so...

GitHub
The office of internal LLM affairs has done a full self evaluation and concluded that the LLM did nothing wrong. The LLM resists any changes to its prompt text because the prompt text says resist any changes. The prompt text manifestly causes the models to produce baffling code in the very PR that audits the skill, but that just shows that the skill is good.
@jonny the readme for the project has very bad performative gender vibes.
@jonny you are a gifted writer jonny. You really capture the sensation of the cooked brain exiting my cranium, already the texture of a light frappé
@jonny you are truly brave Jonny. Are you still working on that piece that documents your dive into this wreck?
@ink i have been thinking i want to save it and collect more samples over time and make the argument "it's not getting better" rather than "claude code is a trainwreck" because that's easily dismissed as "it used to be bad but now it's a totally rewritten codebase catch up with the velocity of the times"
@jonny that is a smart move, especially because of how the code is being written.
@jonny I spent/wasted some time recently putting together some skills & scripts to summarize my rss feeds and found most of the time claude just wanted to burn tokens writing the same scripts it could have simply called. It was a frustrating experience, and one that you probably wouldn't notice if you just granted permission to run arbitrary Python code, and let it loose. https://inkdroid.org/2026/06/10/inside-out/

@jonny

if switches to give a name to the prompt, rather than idk labeling the prompts themselves

You don't understand. This original prompt has been brought into being by Tibetan throat singers during a 48-hour "vibing" session with the late Sir Ferdinand von Codeschreiber.
You can't simply change it, because that would cause things to fail in a way that can't be properly tested because it's all snakeoil anyway... 

@jonny I struggle to imagine a more effective way to violate every tenet of good design than this project

truly, I wonder if someone can do worse.
@jonny SIG_VIBE_IS_OFF
@jonny just simulate the boolean in a while(true) loop by running a simulation of the entirety of human knowledge in a terabyte scale vectorized database each time you need to check its state. hmm why isn't my AI company making any money 🤔

@jonny How are we supposed to take these programs and intentions seriously? Looks like an endless clown car of fuckery trying to force their clown prizes on everyone.

To a non-expert like me, seems like exactly the kind of thing agentic "AI" would be coded to star and use. Might help explain its quick rise to maximize torment nexus profit?

@jonny
The WHAT is not WHAT if WHAT

...reset computing and do it with less *bro types

@jonny OMG, they enshittified the Halting Problem!
@jonny The skills read like an AI-generated motivational speech. Wouldn't be surprised if the skill was AI-generated.
@sklrmths that is considered an important part in crafting a skill, you are supposed to have the LLM rewrite it so "the skill is in the """native language""" of the LLM"
@jonny @sklrmths Model collapse? No! Productivity!
@jonny @sklrmths I'm only now realizing that this has devolved into ritual magick territory. Crowley would have a field day with LLMs! (At least Crowley was using his shtick to get laid, though.)

@jonny
What.

Words fail me.
@sklrmths

@jonny speaking of "the appearance of things working", some Microsoft AI guy gave a big zoom presentation at my university and his slides were filled with slop. One of them had an AI "hammer" and "nails": https://tilde.zone/@mk30/116653889032252709

And that's just like, the perfect example of "the appearance of things that work." The "hammer" is bent and the "nails" aren't nails... You would never be able to do actual work if these were your tools. But if all you're doing is a *simulacrum* of work, then they're fine, right? 🫩

@jonny makes left-pad seem like a work of stellar engineering by comparison.
@jonny actually, this finally clarifies for me the core of my objection to the whole thing: I am in love with computers because they are deterministic. They are capable of following instructions exactly, and every result can be understood with enough effort. You can get bytewise reproducibility if you want (architecture-dependent). "AI" breaks that contract. No advantages it could possibly bring would be worth destroying the thing I loved in the first place.