RE: https://social.treehouse.systems/@ariadne/116213132813239860

Read what Ariadne is writing about LLMs. It all tracks with my intuition that OpenAI et al. are running a big grift.

You categorically do NOT need millions or billions to train a useful LLM that can communicate in human language. LLMs are good at language, it's in the name!

The reason these companies are burning massive amounts of money on ever-larger models is that they've taken "look, this tech makes for a cute chatbot that can do useful stuff" and turned it into "if we make it bigger it'll be SMARTER!"

And the thing is, that's true... to a point. When you stop treating the LLM as a language model and start trying to turn it into an all-knowing entity that has memorized the entirety of human knowledge and can do anything you prompt it to, all with the same model (or a few collaborating models), you quickly hit diminishing returns. You end up with a thing that's kind of smart (not really) and kind of knows everything (not really), and that convinces everyone to throw insane amounts of money at you, because you're fundamentally using the technology for something it wasn't intended for.

The way we fight back is with small home-grown "LLMs" (SLMs?) that run on a MacBook, train on a few GPUs, and are fine-tuned for specific purposes.

The whole AIBro approach of relying on prompting and in-context learning with a single all-powerful model is patently absurd.

@lina I've always maintained this stance: LLMs are actually really cool when you think about it. We got overgrown Markov chains literally solving programming problems!
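To ground the "overgrown Markov chains" quip: a classic word-level Markov chain just maps each short run of words to the words that have followed it, then samples from that table. This toy sketch (pure Python stdlib, all names mine, not from any real model) shows the core idea that LLMs scale up into learned next-token prediction:

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each sequence of `order` words to the words seen following it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length=20, seed=0):
    """Walk the chain: repeatedly sample a next word given the last `order` words."""
    rng = random.Random(seed)
    key = rng.choice(list(chain))
    out = list(key)
    for _ in range(length):
        candidates = chain.get(tuple(out[-len(key):]))
        if not candidates:
            break  # dead end: this context never appeared in the corpus
        out.append(rng.choice(candidates))
    return " ".join(out)

corpus = (
    "the model predicts the next word the model samples the next word "
    "the chain predicts the next word from the last few words"
)
print(generate(build_chain(corpus)))
```

An LLM replaces the lookup table with a neural network that generalizes across contexts it never saw verbatim, but the generation loop (condition on recent tokens, sample the next one, repeat) is recognizably the same.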

I've been thinking about this a lot lately: I really hate the people running this. I hate "AI", the marketing around it, and the way they've normalized using it both as a service and as a cultural thing.

This is not good for any of us!

To do this right, I think we gotta:
1. Ethically source the data. I really like the logs approach; I bet we could find more "good" ways to do this.
2. Figure out licensing. The AI bros don't understand copyright at all. Neither do we, in all honesty, but at least we care. Having licensing locks with a proper attribution system would at least try to inspire trust rather than make it clear to everyone that we're just blatantly ripping off the internet.
3. Fine-tune and build custom little models for things in an appropriate manner, as you were discussing earlier.

As I said earlier, we care. Because we care, we have a chance to do it right. Will it stop them from blatantly disregarding rights, privacy, ethics, ownership, etc? Obviously not... But that shouldn't stop us from caring about it.

@sounddrill The one LLM I've been looking at as a base was ethically trained on legal codes, patents, and the like (which are public domain). There's also Wikipedia, which comes with a license you have to follow, but it's an easy one. And of course there's the whole "stuff old enough to be PD" corpus. There's plenty of ethical data going around.
@lina @sounddrill Is permissively-licensed software also an acceptable source, or is it too hard to handle all the copyright notices?