Mastodawn

my stupid llm research is absofuckinglutely not going the way i was hoping.

ive spent like a fucking week trying to set up a testing harness to get local models to do the same test 100 times, aperture science style, to test the drift of their results

but 100% of the time, the model:
- emits tool calls incorrectly, so i see them
- ignores instructions
- falls into a loop
- says its gonna do stuff, then .. just doesnt
- intentionally deviates from instructions even when explicitly told not to
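
For anyone who wants to try the same kind of experiment, here is a rough sketch of a repeat-the-same-prompt harness. It is not the harness from the post. `generate` is a hypothetical stand-in for whatever actually calls the local model, and the marker strings and loop threshold are made-up heuristics that would need tuning per model.

```python
import itertools
from collections import Counter
from typing import Callable


def looks_like_leaked_tool_call(text: str) -> bool:
    # Assumption: raw tool-call markup in the visible reply means the call
    # was emitted incorrectly. These markers are illustrative guesses.
    markers = ("<tool_call>", '"tool_calls"', '{"name":')
    return any(marker in text for marker in markers)


def looks_like_loop(text: str, min_repeats: int = 3) -> bool:
    # Crude loop check: the same non-empty line appears min_repeats times or more.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return any(count >= min_repeats for count in Counter(lines).values())


def run_trials(generate: Callable[[str], str], prompt: str, n: int = 100) -> dict:
    # Send the identical prompt n times and summarize how much the replies vary.
    outputs = [generate(prompt) for _ in range(n)]
    distinct = Counter(output.strip() for output in outputs)
    return {
        "runs": n,
        "distinct_outputs": len(distinct),
        "most_common_share": distinct.most_common(1)[0][1] / n,
        "leaked_tool_calls": sum(looks_like_leaked_tool_call(o) for o in outputs),
        "loops": sum(looks_like_loop(o) for o in outputs),
    }


if __name__ == "__main__":
    # Fake "model" that cycles through canned replies, so the sketch runs offline.
    canned = itertools.cycle(
        ["4", "4", '<tool_call>{"name": "calc"}</tool_call>', "ok\nok\nok"]
    )
    print(run_trials(lambda _prompt: next(canned), "what is 2+2?", n=8))
```

Exact string matching is the bluntest possible measure of drift; comparing parsed answers or embeddings would be gentler, but this keeps the sketch dependency-free.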

this is nuts. keeping these things on track is a sisyphean task of pure crystalline futility.

how people think they're gonna lose their jobs to this fucking crap is nuts. the only way to make it work is to put it in trillion dollar datacenters, so that the 1% of output that isnt absolute trash actually makes it out of the system.

Svavar the Neurospicy Feb 27

How people think an LLM, which is a word prediction technology, can perform any task is completely beyond me.

@svavar thats what im trying to prove, with logs and screenshots

@Viss it's worse than running a child daycare

Natanox 🇺🇦🇵🇸 Feb 27

@morb @Viss Depends. So far LLMs didn't successfully paint the walls with their own shit.
On the other hand babies rarely burn billions of dollars.

Corey Snipes 🌱 Feb 27

@Viss Meanwhile Jack Dorsey rolls the dice and axes half the staff of Block.
✨ YOLO ✨

@coreysnipes oh did he do that?

Corey Snipes 🌱 Feb 27

@Viss Just today apparently https://x.com/jack/status/2027129697092731343

https://www.theverge.com/tech/885710/jack-dorsey-block-layoffs-job-cuts-ai

jack (@jack) on X

we're making @blocks smaller today. here's my note to the company. today we're making one of the hardest decisions in the history of our company: we're reducing our organization by nearly half, from over 10,000 people to just under 6,000. that means over 4,000 of you are


@Viss "Bot Herder"

@Viss #hapsburgAI

Jocelynephiliac

@Viss the output doesn't have to be good for people to lose their jobs

@twipped but it has to 'at least work'

Jocelynephiliac

@Viss do you really think the executives pushing this shit have actually used it enough to see if it works? Do you think they understand development enough to even know if it actually works?

@twipped it'll be a delayed response. of course they dont, but they'll see their quarterly figures start plummeting, and their staff will start quitting, and their costs will suddenly skyrocket and it wont take long before they finally have to admit its not working

@Viss @twipped "AI makes you faster"

Please say that again after my coworker decided to "rewrite" the datastructure using gemini and hands you a mess that has you recursing 5 times because of the inefficient way they have built shit. I dare you

@mander @twipped llms seem to be like a hammer...

- good for a very narrow number of things
- using it for something other than its narrow purpose gets ugly and causes big problems really fucking fast.
- heavy, can be a pain in the ass
- will absolutely do damage if you miss
- RIP your thumbs
- hit a nail wrong, it shoots into your eye
- you can just murder people with a hammer if you want

@Viss @twipped hey, now you are being very mean to hammers

@mander @twipped if you weld a flathead screwdriver to a hammer, then a level, then a stud detector, then a 10mm ratchet.. suddenly it stops being a very effective hammer. and the other shit you welded to it wont be very effective either

sounds like herding cats

@noondlyt dude, cats that can code, and have insane depths of knowledge that will do shit to you out of spite

Paul_IPv6 Feb 27

sounds like it's still in the adolescent stage ;)

maybe threaten to take away its internet privileges for a week?

schrotthaufen Feb 27

@paul_ipv6 @Viss You joke, but the other day I spoke with a colleague who did some research on LLM reliability before deciding doing a PhD wasn’t for them. They said threatening the model did slightly improve results across the board.

Paul_IPv6 Feb 27

@schrotthaufen @Viss

i was only half joking.

LLMs resort to use of threats and blackmail, so not shocking they respond to them too.

personally, i think "human-like" is *NOT* a feature in software...

Gary Blosser Feb 27

@Viss Maybe stuff in this morning's Kali blog might help? https://www.kali.org/blog/kali-llm-claude-desktop/

Have not had time to burn on LLM hacking myself 🤷‍♂️

Kali & LLM: macOS with Claude Desktop GUI & Anthropic Sonnet LLM | Kali Linux Blog

This post will focus on an alternative method of using Kali Linux, moving beyond direct terminal command execution. Instead, we will leverage a Large Language Model (LLM) to translate “natural language” descriptions of desired actions into technical commands. Achieving this setup requires the integration of three distinct systems:


@zombie042 ahhaha so many people are gonna rm themselves

CyberFrog Feb 27

@Viss yup they're still so unreliable, I tried this whole "agent" bullshit recently to see if it was actually any better than prompting in a chat window

just imagine all these problems compounded by 4-8 models running at the same time, it was still just bad, the models still just kept doing stupid unproductive shit, and they all constantly used tool calls wrong even when guided on the correct usage too

joel cretan Feb 27

@Viss well, there’s your problem. This is very much the expected result for Aperture Science. Everyone’s worried about GLaDOS but you’re running Wheatley.

@kaced oh its not wheatley, its the spaaaaaaaaace orb

joel cretan Feb 27

@Viss You’re absolutely right! This is a common problem when referencing Portal, there are too many good ones. Would you like me to suggest some more topical video game references? What’s your favorite thing about space? Mine is space.

@Viss I've begrudgingly got to admit people who are really good at crafting queries can get better results than me. I guess it's kinda like how I Google and generally computer better than others. So they're still terrible, but there are elements of "just because I get bad results out of a tool doesn't necessarily mean NOBODY can use it better than me"

@JessTheUnstill sure - but like, you gotta see some of the shit that comes out of this research im doing. i can admit that on a few occasions an llm (gpt in most cases, anthropic in like 1 or 2) will give me something surprisingly useful

but it will take me HOURS OR DAYS to get to that result. Im no front-end dev, so making it do my css is helpful, it can do that heavy lifting then i can tweak on it after that.

gpt taught me a cool trick with bash that i now use all over the place

@JessTheUnstill but it took me literally days to get it to do that for me in a remotely functional way.

so the amount of effort is 'roughly the same' between 'me scouring search results to stumble across a thing' vs 'me hitting gpt in the neck over and over again with a tire iron until a zelda1 rupee comes out' - so its ... i dont wanna say 'feature parity', but like, its kinda the same amount of "work for me", but one is free and the other costs 20 bux a month

@Viss Yeah I spent a couple months arguing with the code completion in VSCode and finally gave up on it as useless and went back to intellisense. Sometimes the chat window helps, other times it doesn't. But I do figure if me, a smart engineer, can't figure this stuff out, it's going to have a rough adoption curve for other smart engineers

@JessTheUnstill its tricky. but even still, blindly taking shit out of these things and plopping it into an important place is an exercise in learning how wearing clothing made of c4 works

@Viss yeah, to make a bad analogy, it's wearing power armor before you learn how to fight and shoot. Or flying with autopilot without knowing how to do it with the stick.

@JessTheUnstill "lolguardrails"