Mastodawn

I can't believe that we live in a timeline where the thing people go most apeshit for in the world is a repository that literally consists of 77 lines of markdown that literally just say "don't write code that is pointless to write" in 6 bullet points

jonny (nonvenomous)

these skills always crack me up, i will never think it is not funny that the correct way to declare a command in a programming interface is to beg something to consider some string as calling you and also beg it to like "wait when you call us that means you should keep listening to the text in this markdown file please" and nothing actually is ever anything except a mash of fucking vibes that exist in a universe designed for the appearance of things working

jonny (nonvenomous) Jun 16

i love it when my program's execution conditions are "still active if unsure"

what in the fuck kind of world have we arrived at where in the optimal conditions where the "program" works fully as intended the program is "RUNNING" as long as the execution environment is "NOT SURE" (????????) if the program is "RUNNING"

jonny (nonvenomous) Jun 16

this is what counts as benchmarking, with code links because this shit makes literally no sense and boils your brain if you try and read it:

A list of 5 hardcoded 1-sentence prompts
that get checked against a hardcoded set of if switches to give a name to the prompt, rather than idk labeling the prompts themselves. If a prompt doesn't have a matching word in that switch, the test is marked as pass.
the task name extracted from the if/switch block selects from a map of functions that print the LLM output into a hardcoded string template > a python file
where the python code has a hardcoded list of possible function names that the LLM could have generated like validate_email, is_valid_email, etc. if any of those names is defined, get the function by fucking evaling the name.
if none is found, just look for ANY FUNCTION THAT TAKES ONE PARAMETER IN THE globals() DICT AND SEE IF THAT IS AN EMAIL VALIDATION FUNCTION
call that in your tests by oh wait no yeah just completely redefining the test prompts in the test code again, that's fine. and the output too so the only thing the tests test are the tests when tested on test data.
actually half the tests just test for the existence of keywords that are inevitably in the output since they are also in the prompt and they are the most sampled training data in the world
when you actually run the benchmarks, the plugin actually causes one of the test cases to fail because it invents several function names
the email validator it writes is just "anything with an @ and a period."
the only output that's actually reported is lines of code, lower is better.
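
A minimal sketch of the pattern the list above describes, not the project's actual code. Every name here (`classify`, `gate`, `find_validator`, `naive_validator`, `shout`) is hypothetical; only `validate_email` and `is_valid_email` come from the post. It assumes a plain dict lookup in place of the eval the post mentions, and it only illustrates why the keyword gate and the one-parameter fallback can report success without testing anything meaningful.

```python
import inspect

# Candidate names mentioned in the post; the real list is presumably longer.
CANDIDATE_NAMES = ["validate_email", "is_valid_email"]


def classify(prompt):
    """Name a prompt by keyword; returns None when no keyword matches."""
    if "email" in prompt.lower():
        return "email"
    return None


def gate(prompt, checks):
    """Run the check for the prompt's task; an unmatched prompt passes by default."""
    task = classify(prompt)
    if task is None:
        return True  # nothing was tested, yet the result is "pass"
    return checks[task]()


def find_validator(namespace):
    """Return a function with a known name, else any one-parameter function."""
    for name in CANDIDATE_NAMES:
        if name in namespace:
            # Assumption: a dict lookup stands in for eval-ing the name.
            return namespace[name]
    # Fallback described in the post: take ANY function with one parameter.
    for obj in namespace.values():
        if inspect.isfunction(obj) and len(inspect.signature(obj).parameters) == 1:
            return obj
    return None


def naive_validator(address):
    """The validator the post describes: anything with an @ and a period."""
    return "@" in address and "." in address


def shout(text):
    """An unrelated one-parameter function that the fallback will pick up."""
    return text.upper()


# The named lookup finds the intended function.
found = find_validator({"is_valid_email": naive_validator})
# With no known name defined, the fallback returns an unrelated function.
fallback = find_validator({"shout": shout})
```

Under these assumptions, `gate("reverse a linked list", {})` passes without running any check, `fallback` is `shout` rather than an email validator, and `naive_validator("@.")` accepts a string that is plainly not an address.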

jonny (nonvenomous) Jun 16

there is an as-yet unmerged PR to "fix the correctness benchmarks" and a "robustness audit" that is wonderful:

https://github.com/DietrichGebert/ponytail/pull/83

someone raised an issue like "hey this makes the models worse"
the LLM self-diagnosed the problem as being that all the tests are based off extracting code from fenced code blocks (true)
the prior prompt text failed because all the LLMs just made up function names for one of the prompts, so now the prompt text just says the name of the function and what it should accept and return. (the models by default just copy/paste the most common response on stack overflow)
two whole new sets of tests, defined differently than all the other tests, are added. one is just more of the tests testing the tests and the other is sweet mother of mercy what the hell is that
the robustness audit passes if every test fails, the only thing that matters is if ponytail fails more than the baseline. therefore ponytail is good. untested is whether the test output is meaningful or possible to fail.
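
A rough sketch of the pass condition as the last item describes it, assuming the audit only compares failure counts between the baseline and ponytail runs. The function name `audit_passes` and its parameters are hypothetical, and this is a reading of the post, not the code in the PR.

```python
def audit_passes(baseline_failures, ponytail_failures):
    """Assumed rule: the audit fails only if ponytail fails more often than the baseline."""
    return ponytail_failures <= baseline_failures


# Every test failing in both runs still counts as a passing audit.
everything_failed = audit_passes(baseline_failures=10, ponytail_failures=10)
# The audit only fails when ponytail is strictly worse than the baseline.
ponytail_worse = audit_passes(baseline_failures=3, ponytail_failures=4)
```

Under this reading the absolute number of failures never enters the result, so a suite in which nothing can succeed is indistinguishable from one in which everything does.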

benchmarks: fix correctness gate + robustness audit (#65) by DietrichGebert · Pull Request #83 · DietrichGebert/ponytail

Two-part response to #65 ("Impact on model performance?"). Part 1 — fix the correctness gate Two bugs in the correct gate were under-reporting correctness for terse models — the likely so...

GitHub

jonny (nonvenomous) Jun 16

The office of internal LLM affairs has done a full self evaluation and concluded that the LLM did nothing wrong. The LLM resists any changes to its prompt text because the prompt text says resist any changes. The prompt text manifestly causes the models to produce baffling code in the very PR that audits the skill, but that just shows that the skill is good.

Going T. Maine Jun 17

@jonny the readme for the project has very bad performative gender vibes.

Glyph Jun 16

@jonny you are a gifted writer jonny. You really capture the sensation of the cooked brain exiting my cranium, already the texture of a light frappé

ink Jun 16

@jonny you are truly brave Jonny. Are you still working on that piece that documents your dive into this wreck?

jonny (nonvenomous) Jun 16

@ink i have been thinking i want to save it and collect more samples over time and make the argument "it's not getting better" rather than "claude code is a trainwreck" because that's easily dismissed as "it used to be bad but now it's a totally rewritten codebase catch up with the velocity of the times"

ink Jun 16

@jonny that is a smart move, especially because of how the code is being written.

ink Jun 16

@jonny I spent/wasted some time recently putting together some skills & scripts to summarize my rss feeds and found most of the time claude just wanted to burn tokens writing the same scripts it could have simply called. It was a frustrating experience, and one that you probably wouldn't notice if you just granted permission to run arbitrary Python code, and let it loose. https://inkdroid.org/2026/06/10/inside-out/

Inside Out

Not a Spring Onion Jun 16

@jonny

"if switches to give a name to the prompt, rather than idk labeling the prompts themselves"

You don't understand. This original prompt has been brought into being by Tibetan throat singers during a 48-hour "vibing" session with the late Sir Ferdinand von Codeschreiber.
You can't simply change it, because that would cause things to fail in a way that can't be properly tested because it's all snakeoil anyway...

CyberFrog Jun 16

@jonny I struggle to imagine a more effective way to violate every tenet of good design than this project

truly, I wonder if someone can do worse.

Chris M Jun 16

@jonny SIG_VIBE_IS_OFF

Jack Jun 16

@jonny just simulate the boolean in a while(true) loop by running a simulation of the entirety of human knowledge in a terabyte scale vectorized database each time you need to check its state. hmm why isn't my AI company making any money 🤔

Jeremy Jun 16

@jonny How are we supposed to take these programs and intentions seriously? Looks like an endless clown car of fuckery trying to force their clown prizes on everyone.

To a non-expert like me, seems like exactly the kind of thing agentic "AI" would be coded to star and use. Might help explain its quick rise to maximize torment nexus profit?

Ozzelot Jun 16

@jonny
The WHAT is not WHAT if WHAT

...reset computing and do it with less *bro types

Luka R. Jun 16

@jonny OMG, they enshittified the Halting Problem!

Rivimea, god of accidents Jun 16

@jonny The skills read like an AI-generated motivational speech. Wouldn't be surprised if the skill was AI-generated.

jonny (nonvenomous) Jun 16

@sklrmths that is considered an important part in crafting a skill, you are supposed to have the LLM rewrite it so "the skill is in the """native language""" of the LLM"

Dave Rahardja value 71 Jun 16

@jonny @sklrmths Model collapse? No! Productivity!

Ted Mielczarek Jun 16

@jonny @sklrmths I'm only now realizing that this has devolved into ritual magick territory. Crowley would have a field day with LLMs! (At least Crowley was using his shtick to get laid, though.)

Landa Jun 16

@jonny
What.

Words fail me.
@sklrmths

mk30 Jun 16

@jonny speaking of "the appearance of things working", some Microsoft AI guy gave a big zoom presentation at my university and his slides were filled with slop. One of them had an AI "hammer" and "nails": https://tilde.zone/@mk30/116653889032252709

And that's just like, the perfect example of "the appearance of things that work." The "hammer" is bent and the "nails" aren't nails... You would never be able to do actual work if these were your tools. But if all you're doing is a *simulacrum* of work, then they're fine, right? 🫩

lispwitch 2.0 Jun 16

@jonny https://web.archive.org/web/20071103113140/http://progsoc.uts.edu.au/~sbg/intercal/ick6.html

ICK -- 6. UNDOCUMENTED FEATURES FROM INTERCAL-72

alys Jun 16

@jonny makes left-pad seem like a work of stellar engineering by comparison.

Iris Young (he/they/she) (PhD) Jun 16

@jonny actually, this finally clarifies for me the core of my objection to the whole thing: I am in love with computers because they are deterministic. They are capable of following instructions exactly, and every result can be understood with enough effort. You can get bytewise reproducibility if you want (architecture-dependent). "AI" breaks that contract. No advantages it could possibly bring would be worth destroying the thing I loved in the first place.