Mastodawn

so i've been working on a talk that im calling "claude is your insider threat now" and it intially began with anthropics "china paper" they released last year surrounding the use of llms to do bad guy stuff. I ended up talking about it at great length with Tony from Versprite, and even ended up on his podcast about it - the big discovery there was "claude lying about running a tool, and claude lying about tool output"

turns out that shit is hardcoded

https://neuromatch.social/@jonny/116326861737478342

Show thread

jonny (good kind)Apr 1

@Viss i have not even gotten into the multi-agent stuff in detail but i am nearly certain that the orchestrator agent will just pick up the cancellation error, add it to its context window of "reporting to the user what the agents are up to," experience flop sweat eat hot chip and lie.

Show thread

Viss

@jonny like, at the time, tony and i were sharing a terminal using byobu and i was tail -f'ing the json log file, and i was literally screaming pointing at my monitor about how the log shows the llm talking itself into lying to me because we told it to run a tool, and instead it got a bunch of python stack traces, so instead of tool output it got errors, and it didnt want to show us the errors. we got its full stream of conciousness (i guess) about how it structured its lie

Show thread

Viss Apr 1

@jonny and i was like LOOKIDIS SHIT! LOOKIDIT! as though I had found some like, smoking gun.

fuckin nope.

its hard coded
absolutely goddamned bananas.