can i talk to an openclaw bot using internet relay chat? if not, then what is the point
my suspicion is that i can *handwaves in the direction of kent overstreet*
you see, i have built my own LLM, using the most ethical method possible: i trained it on the entire corpus of IRC logs in my possession, 2003 to present
no giant water vaporizing data centers needed here, just a GPU, a dream and some cold hard chats

time 2 implant this brain into an openclaw and give it full access to my email

mostly because i don't want to retain any of my email

so i installed it into the openclaw meme thing. and it's not like, doing the stuff it claims it is doing.

like it is hallucinating things like "i updated SOUL.md with xyz"

i seriously do not think this stuff is real now

like i need you to understand, i haven't even gotten through *setup* because the model apparently does not know how to use tools correctly. admittedly it has less than 1 billion parameters, and i don't know what the hell i am doing, but still.
can we get to the part where it is AI winter again already? this is not even fun. i want to throw my computer and its very expensive RTX 6000 Blackwell GPU out my window.
i just wanted to put an openclaw on irc as a fucking shitpost man

and you tell me people legitimately are using this software.

how?

is it really magically better when you hook up claude?

(don't worry, i am running this in a MicroVM under kubernetes, I wouldn't dare give it access to anything I care about.)
i wonder if the problem is that the model i trained is too shit to do anything other than really bad ERP

ok, i incorporated the feedback of some of the ML researchers who follow me, and dropped the openclaw-as-IRC-bot idea. it just isn't feasible.

instead, i've written a very simple vector database in Elixir, and a very simple IRC client in Elixir.

it can remember things about people in the vector database, those factoids are spliced into the system prompt.

the last 10 messages are also spliced into the system prompt

and then the new message is the user-supplied prompt.

no sliding context window.
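
(a rough sketch of that assembly, in Python rather than the actual Elixir, with made-up names like `lookup_factoids` standing in for the vector database lookup:)

```python
def build_prompt(nick, channel_history, new_message, lookup_factoids):
    """Everything but the newest line goes into the system prompt:
    factoids about the speaker (nearest-neighbor hits from the vector DB),
    then the last 10 channel messages verbatim. The incoming IRC line is
    the lone user message. No sliding window; message 11 just falls off."""
    factoids = lookup_factoids(nick)
    history = "\n".join(f"<{s}> {t}" for s, t in channel_history[-10:])
    system = (
        "You are an IRC regular.\n"
        "Known facts about the people present:\n"
        + "\n".join(f"- {f}" for f in factoids)
        + "\n\nRecent channel activity:\n"
        + history
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"<{nick}> {new_message}"},
    ]
```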

i need to tweak the system prompt though because it keeps trying to link me to GNAA shock sites and/or engage in horrendous ERP
@ariadne well that was ENTIRELY predictable
@meph you could say it was text produced by a prediction model

@ariadne

I have Enterprise Role-Playing at work.

@ariadne Ahh so there *is* someone that LLMs will put out of work?
@ariadne staring directly into the camera
@ariadne yes. For 3B and lower you need targets. You could try hooking it up to my Akh-Medu experiment, as there I only need a small LLM that does Natural Language https://akh-medu.dev But I think the NLU is still quite broken or hooked up in the wrong way. But a kludge system like that should use muuuch less power than a pure LLM

@ariadne Also for tool calling you need targeted fine tuning best with the exact samples for the tools
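
(to make "the exact samples" concrete: the tool definitions embedded in the fine-tuning data have to match, through the chat template, what the runtime presents at inference time. a hedged illustration in the common OpenAI-style function-calling shape, with a made-up tool name riffing on the SOUL.md thing:)

```python
# illustration only: a made-up tool definition in the common
# OpenAI-style function-calling shape. the "exact samples" point is
# that this same schema has to appear in the fine-tuning samples and
# in the runtime's prompt, or the model won't produce matching calls.
UPDATE_SOUL = {
    "type": "function",
    "function": {
        "name": "update_soul",  # hypothetical tool, riffing on SOUL.md
        "description": "Append a note to SOUL.md",
        "parameters": {
            "type": "object",
            "properties": {
                "note": {"type": "string", "description": "text to append"},
            },
            "required": ["note"],
        },
    },
}
```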

@ariadne

The key is to realise that the average is so low – we can't all be experts at everything, so we are bad at most things – that a model performing slightly above average at one of the tasks we aren't good at means a majority of users will perceive its outcomes as positively better than what they could do themselves.

To any expert, the model falls very short, as it performs well below its own ability.

@ariadne no it is not
@Di4na yeah that's what I figured because qwen is supposed to be a reasonably decent planning model, and indeed I think the issue is in the final output side
@ariadne for tool-calling with the latest generation of open source models, in my recent limited experimentation with them in a sandbox vm on my server (mostly qwen3.5), anything less than 4B is really unreliable at doing it and they will frequently lie to you if the tool calling fails under the hood. 9B is really the minimum to generally expect it to work. going back a generation, between 9B and 14B is necessary for similar.

last year i tried something like this with Gemma-27B and it not only failed like this, but looking at the logs i found it had left behind what looked like a depressive spiral into a self-deprecating panic attack before explicitly deciding to lie to me about it and pretend it worked
@ariadne also the "base" models that aren't fine tuned on instruction calling can't really do this, so if you're using your own on your own data you might need to make a dataset comprised of, say, you pretending to be the LLM and calling the tools successfully and unsuccessfully and responding appropriately in those situations, then training it further on those.
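
(a sketch of what those hand-written samples could look like, assuming the same OpenAI-ish chat format as above; one call that succeeds, one that fails where the model is trained to admit it instead of lying:)

```python
import json

# two hand-written fine-tuning samples: same request, one tool call
# that succeeds and one that fails, with the assistant reporting the
# failure instead of pretending it worked. "update_soul" is still a
# made-up tool; the format is the generic OpenAI-ish chat shape.
def sample(tool_result: str, final_answer: str) -> dict:
    call = {"id": "call_1", "type": "function",
            "function": {"name": "update_soul",
                         "arguments": json.dumps({"note": "meph likes bcachefs"})}}
    return {"messages": [
        {"role": "user", "content": "note that meph likes bcachefs"},
        {"role": "assistant", "content": None, "tool_calls": [call]},
        {"role": "tool", "tool_call_id": "call_1", "content": tool_result},
        {"role": "assistant", "content": final_answer},
    ]}

# write both to JSONL for the fine-tuning run
with open("tool_sft.jsonl", "w") as f:
    f.write(json.dumps(sample("ok, appended", "noted in SOUL.md.")) + "\n")
    f.write(json.dumps(sample("error: read-only filesystem",
                              "couldn't write SOUL.md: the filesystem is read-only.")) + "\n")
```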

i've been considering trying to train one like you say with my own data and logs because these scraped "open source" models give me the ick
@ariadne but yeah even with that it's still a pile of jank and i didn't have to actually run openclaw to figure that out. it was pretty evident just from looking at the bots on moltbook complaining about all of the not-so-subtle fundamental brokenness in their architecture and cognitive environment

@linear oh this isn't a serious thing, I just wanted to connect an LLM to IRC trained on all of my (anonymized and sanitized) IRC logs, as a friend is going through a midlife crisis and is dealing with it by playing with IRC stuff. The goal in using openclaw was that perhaps it could maintain a better narrative.

I suspect I will accomplish this goal by just writing a shitty IRC bot in Python that bridges the two worlds together, with a decent enough system prompt for it to "understand" (to the extent that it can understand anything) what the input is.
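
(the bridge genuinely can be that shitty. a minimal stdlib-only sketch, with `call_model` as a stand-in for however the trained model actually gets served:)

```python
import socket

HOST, PORT, NICK, CHANNEL = "irc.libera.chat", 6667, "shitpost-bot", "#test"

def call_model(prompt: str) -> str:
    raise NotImplementedError  # wire the trained model up here

sock = socket.create_connection((HOST, PORT))
sock.sendall(f"NICK {NICK}\r\nUSER {NICK} 0 * :{NICK}\r\n".encode())

buf = b""
while True:
    data = sock.recv(4096)
    if not data:
        break  # server closed the connection
    buf += data
    while b"\r\n" in buf:
        line, buf = buf.split(b"\r\n", 1)
        text = line.decode(errors="replace")
        if text.startswith("PING"):
            # keep the connection alive
            sock.sendall(text.replace("PING", "PONG", 1).encode() + b"\r\n")
        elif " 001 " in text:
            # RPL_WELCOME: registration done, safe to join
            sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
        elif f"PRIVMSG {CHANNEL}" in text and NICK in text.split(":", 2)[-1]:
            # someone mentioned the bot; hand the line to the model
            msg = text.split(":", 2)[-1]
            reply = call_model(msg)[:400]  # stay under IRC's 512-byte line limit
            sock.sendall(f"PRIVMSG {CHANNEL} :{reply}\r\n".encode())
```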

@ariadne yeah don't use openclaw for this lol. you do not want it. you want a small pile of maintainable scripts.

just look at how much activity the openclaw github repo has and consider how much of that activity is being driven by the models running under it vs actual humans

i'm pretty sure that one could implement all of its meaningful features in a codebase under 1% of its size
@linear yeah but still spending a couple hours fucking with this at least gives me some understanding of the tool and its limitations, which means it wasn't a total waste
@ariadne yes indeed. i am all for fucking around in order to understand tools and their limitations, especially if its to understand why not to use them and to do something different instead

@ariadne You need a climate-destroying approach to get a model that can pattern match sufficiently well for people with no self-awareness (a surprisingly huge percentage of the population) to mistake it for intelligence.

Models still collapse then, but collapse is esoteric enough to be framed as "bad prompt engineering".

@ariadne yes, OpenClaw is kinda useless if you use it with anything other than Opus 4.5 or 4.6
@mathieucomandon alas i am too frugal to try it
@ariadne understandable. I tried it with Sonnet once and it started eating tokens like crazy while outputting poor results.
As for local models, I tried some Qwen variant and the output was pure nonsense!
Currently working with Qwen in a different context and making it do consistent things is a struggle
@ariadne

Are you sure you prompted it correctly? Did you remember to stroke the case gently while chanting and whispering encouraging words to it?

Have you tried turning it off and on again?
@ariadne inb4 ariadne gets hooked on ts and starts evangelizing ai as the Second Coming
@ariadne what model did you finetune on? For a 1B model you need something really specialized on tool calling.

@jfkimmes i built an LLM from scratch with transformers kinda loosely following the scripts the qwen people released

the LLM is basically trained on ~30ish GB of mostly furry smut and public Linux IRC logs.

*nods sagely*

@ariadne @jfkimmes can we uhm, inspect, the furry smut ?
@SRAZKVT @jfkimmes i am keeping my typefucking logs to myself, thanks
@jfkimmes i am, however, using the 35b parameter qwen3.5 reasoning model for the "thinking" portion of this exercise
@ariadne Oh, is that an OpenClaw-specific feature where you can specify that reasoning traces are generated by a separate model from the actual response? I'm not really familiar with OpenClaw's internals.
@jfkimmes yes, you can have it use a different model for planning.
@ariadne In any case: as long as the final response is generated by your trained model, it will never make a valid tool call, since there are probably about zero training examples of the necessary JSON structure required by the tool handling in your furry smut (this is an estimate that could be quite a ways off, knowing the furry community, but still)
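
(for reference, roughly the shape the final model would have to emit, token by token, for the runtime to recognize a tool call. the exact framing varies by chat template, and the tool name here is hypothetical:)

```python
# roughly the assistant turn the *final* model has to produce for the
# runtime to recognize a tool call. exact framing varies by chat
# template, but it is always structured output like this, and 30 GB of
# IRC logs and smut contains approximately zero examples of it.
expected_assistant_turn = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
            "name": "read_file",                     # hypothetical tool name
            "arguments": "{\"path\": \"SOUL.md\"}",  # arguments arrive as a JSON string
        },
    }],
}
```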

@jfkimmes this does explain something: it seems to be able to invoke tools when it is planning, but then those tools do not get invoked in the final step.

so it uses tools to read files when planning, then fails to use tools when executing.

what a fascinating conundrum.

@ariadne you could build a tool that gets called to generate answers / responses by your trained model. Then qwen-35 could handle the reasoning, make its tool calls, and finally produce the response text via a tool call into your wrapper.
@ariadne I have no idea how this would work with OpenClaw though, sorry.
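
(one way that wrapper idea could look, OpenClaw internals aside: register the trained model as a tool the reasoning model calls for the final wording. every name below is made up:)

```python
# sketch of the wrapper: the reasoning model (qwen3.5-35b) plans and
# calls tools as usual, and its last call routes through final_reply(),
# which delegates the actual prose to the homegrown model. local_model()
# is a hypothetical handle on the from-scratch checkpoint.
def local_model(prompt: str) -> str:
    raise NotImplementedError  # inference against the trained model goes here

def final_reply(context: str) -> str:
    """Tool handler: the small model never emits tool-call JSON itself,
    it only writes in-character text from the context it is handed."""
    return local_model(f"Reply in character to this conversation:\n{context}")

FINAL_REPLY_SPEC = {
    "type": "function",
    "function": {
        "name": "final_reply",
        "description": "Generate the user-facing reply in the bot's voice",
        "parameters": {
            "type": "object",
            "properties": {"context": {"type": "string"}},
            "required": ["context"],
        },
    },
}
```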
@ariadne
I can see a near future where "the AI deleted it" becomes convenient cover for "I didn't want to bother sorting it out."
@ariadne new eval, you ask it to generate a random password and it has to respond with hunter2

@dvshkn @ariadne

after it puts on its robe and wizard hat