Mastodawn

I can't believe that we live in a timeline where the thing people go most apeshit for in the world is a repository that literally consists of 77 lines of markdown that literally just say "don't write code that is pointless to write" in 6 bullet points

jonny (nonvenomous) Jun 16

these skills always crack me up, i will never think it is not funny that the correct way to declare a command in a programming interface is to beg something to consider some string as calling you and also beg it to like "wait when you call us that means you should keep listening to the text in this markdown file please" and nothing actually is ever anything except a mash of fucking vibes that exist in a universe designed for the appearance of things working

jonny (nonvenomous) Jun 16

i love it when my program's execution conditions are "still active if unsure"

what in the fuck kind of world have we arrived at where in the optimal conditions where the "program" works fully as intended the program is "RUNNING" as long as the execution environment is "NOT SURE" (????????) if the program is "RUNNING"

jonny (nonvenomous)

this is what counts as benchmarking, with code links because this shit makes literally no sense and boils your brain if you try and read it:

A list of 5 hardcoded 1-sentence prompts
that get checked against a hardcoded set of if switches to give a name to the prompt, rather than idk labeling the prompts themselves. If a prompt doesn't have a matching word in that switch, the test is marked as pass.
the task name extracted from the if/switch block selects from a map of functions that print the LLM output into a hardcoded string template > a python file
where the python code has a hardcoded list of possible function names that the LLM could have generated like validate_email, is_valid_email, etc. if any of those names is defined, get the function by fucking evaling the name.
if none is found, just look for ANY FUNCTION THAT TAKES ONE PARAMETER IN THE globals() DICT AND SEE IF THAT IS AN EMAIL VALIDATION FUNCTION
call that in your tests by oh wait no yeah just completely redefining the test prompts in the test code again, that's fine. and the output too so the only thing the tests test are the tests when tested on test data.
actually half the tests just test for the existence of keywords that are inevitably in the output since they are also in the prompt and they are the most sampled training data in the world
when you actually run the benchmarks, the plugin actually causes one of the test cases to fail because it invents several function names
the email validator it writes is just "anything with an @ and a period."
the only output that's actually reported is lines of code, lower is better.
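For anyone who would rather see it than read it, the lookup-and-fallback step described above might look roughly like this. This is a hypothetical sketch reconstructed from the description in this post, not the repository's actual code: `CANDIDATE_NAMES`, `find_validator`, `check_email`, and `shout` are all made-up names, and it uses a plain dictionary lookup where the post describes `eval`.

```python
import inspect

# Hypothetical candidate list; the post names validate_email and is_valid_email.
CANDIDATE_NAMES = ["validate_email", "is_valid_email"]


def find_validator(namespace):
    """Return whatever function might be the generated email validator."""
    # Step 1: try each hardcoded name in turn.
    for name in CANDIDATE_NAMES:
        candidate = namespace.get(name)
        if callable(candidate):
            return candidate
    # Step 2: fall back to ANY plain function that takes exactly one parameter.
    for value in namespace.values():
        if inspect.isfunction(value) and len(inspect.signature(value).parameters) == 1:
            return value
    return None


def check_email(address):
    """The kind of validator described: anything with an @ and a period."""
    return "@" in address and "." in address


def shout(text):
    """An unrelated one-parameter function, to show what the fallback can grab."""
    return text.upper()
```

Under these assumptions, `find_validator({"shout": shout})` hands back `shout` as the "email validator", and `check_email("@.")` returns `True`.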

jonny (nonvenomous) Jun 16

there is an as-yet unmerged PR to "fix the correctness benchmarks" and a "robustness audit" that is wonderful:

https://github.com/DietrichGebert/ponytail/pull/83

someone raised an issue like "hey this makes the models worse"
the LLM self-diagnosed the problem as being that all the tests are based off extracting code from fenced code blocks (true)
the prior prompt text failed because all the LLMs just made up function names for one of the prompts, so now the prompt text just says the name of the function and what it should accept and return. (the models by default just copy/paste the most common response on stack overflow)
two whole new sets of tests, defined differently than all the other tests, are added. one is just more of the tests testing the tests and the other is sweet mother of mercy what the hell is that
the robustness audit passes if every test fails, the only thing that matters is if ponytail fails more than the baseline. therefore ponytail is good. untested is whether the test output is meaningful or possible to fail.
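The pass condition described in that last point can be sketched in a few lines. This is only one reading of the description above, not code from the PR: `audit_passes` and `TOTAL_TESTS` are hypothetical names, and the comparison is assumed to be a simple failure count.

```python
# Hypothetical number of audit test cases, chosen only for illustration.
TOTAL_TESTS = 10


def audit_passes(baseline_failures, skill_failures):
    """Pass as long as the skill fails no more often than the baseline."""
    # Nothing here checks that either run passed a single test.
    return skill_failures <= baseline_failures
```

With that rule, `audit_passes(TOTAL_TESTS, TOTAL_TESTS)` is `True`: both runs can fail every test and the audit still passes.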

benchmarks: fix correctness gate + robustness audit (#65) by DietrichGebert · Pull Request #83 · DietrichGebert/ponytail

Two-part response to #65 ("Impact on model performance?"). Part 1 — fix the correctness gate Two bugs in the correct gate were under-reporting correctness for terse models — the likely so...

GitHub

jonny (nonvenomous) Jun 16

The office of internal LLM affairs has done a full self evaluation and concluded that the LLM did nothing wrong. The LLM resists any changes to its prompt text because the prompt text says resist any changes. The prompt text manifestly causes the models to produce baffling code in the very PR that audits the skill, but that just shows that the skill is good.

Going T. Maine Jun 17

@jonny the readme for the project has very bad performative gender vibes.

Glyph Jun 16

@jonny you are a gifted writer jonny. You really capture the sensation of the cooked brain exiting my cranium, already the texture of a light frappé

ink Jun 16

@jonny you are truly brave Jonny. Are you still working on that piece that documents your dive into this wreck?

jonny (nonvenomous) Jun 16

@ink i have been thinking i want to save it and collect more samples over time and make the argument "it's not getting better" rather than "claude code is a trainwreck" because that's easily dismissed as "it used to be bad but now it's a totally rewritten codebase catch up with the velocity of the times"

ink Jun 16

@jonny that is a smart move, especially because of how the code is being written.

ink Jun 16

@jonny I spent/wasted some time recently putting together some skills & scripts to summarize my rss feeds and found most of the time claude just wanted to burn tokens writing the same scripts it could have simply called. It was a frustrating experience, and one that you probably wouldn't notice if you just granted permission to run arbitrary Python code, and let it loose. https://inkdroid.org/2026/06/10/inside-out/

Inside Out

Not a Spring Onion Jun 16

@jonny

"if switches to give a name to the prompt, rather than idk labeling the prompts themselves"

You don't understand. This original prompt has been brought into being by Tibetan throat singers during a 48-hour "vibing" session with the late Sir Ferdinand von Codeschreiber.
You can't simply change it, because that would cause things to fail in a way that can't be properly tested because it's all snakeoil anyway...

CyberFrog Jun 16

@jonny I struggle to imagine a more effective way to violate every tenet of good design than this project

truly, I wonder if someone can do worse.