Leaked Google document: “We Have No Moat, And Neither Does OpenAI”

The most interesting thing I've read recently about LLMs - a purportedly leaked document from a researcher at Google talking about the huge strategic impact open source models are having
https://simonwillison.net/2023/May/4/no-moat/

Leaked Google document: “We Have No Moat, And Neither Does OpenAI”

SemiAnalysis published something of a bombshell leaked document this morning: Google “We Have No Moat, And Neither Does OpenAI”. The source of the document is vague: The text below is …

@simon wow, open source wins again. Thanks for excerpting!
@simon Does all of the work on top of LLaMA actually count? After all, that model was leaked out of Facebook.
@matt it proved that all of this is possible to run on end-user hardware - and openly licensed trained-from-scratch LLaMA alternatives are already starting to emerge https://simonwillison.net/2023/May/3/openllama/
OpenLLaMA

The first openly licensed model I've seen trained on the RedPajama dataset. This initial release is a 7B model trained on 200 billion tokens, but the team behind it are …

@simon Oh damn, I hadn't seen that post yet. Things are definitely heating up.
@simon After thinking about this a little more, I wonder if OpenAI still has a moat in GPT-4's ability to work with image inputs. The applications of that for accessibility sound really promising, though most of us don't actually have access to that feature yet, so I suppose it could turn out to be smoke and mirrors.
@matt they still haven't shipped that! Meanwhile there are already open models that can do that surprisingly well: https://simonwillison.net/2023/Apr/19/llava-large-language-and-vision-assistant/
LLaVA: Large Language and Vision Assistant

Yet another multi-modal model combining a vision model (pre-trained CLIP ViT-L/14) and a LLaMA derivative model (Vicuna). The results I get from their demo are even more impressive than MiniGPT-4. …
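(For context on the architecture: LLaVA bridges the two models with a learned linear projection that maps the frozen vision encoder's patch features into the language model's token-embedding space. A minimal PyTorch sketch - the class name and dimensions here are illustrative, not the real model's:)

```python
import torch
import torch.nn as nn

# Sketch of the LLaVA-style bridge: a frozen vision encoder (e.g. CLIP
# ViT-L/14) produces patch features, a learned linear projection maps them
# into the LLM's embedding space, and the result is prepended to the text
# embeddings as pseudo "word" tokens. All dimensions are illustrative.

class VisionToLLMProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

projector = VisionToLLMProjector()
fake_patches = torch.randn(1, 256, 1024)   # stand-in for CLIP output
image_tokens = projector(fake_patches)     # image patches as pseudo tokens
text_embeds = torch.randn(1, 32, 4096)     # stand-in for an embedded prompt
llm_input = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_input.shape)  # torch.Size([1, 288, 4096])
```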

@simon Wow, yeah, that _is_ impressive. Can't wait to see what could be done with a model like that but fine-tuned for accessibility (e.g. render the UI in this image as something like an accessibility tree).
@matt @simon That one is okay but still makes up a lot of its responses, and so does MiniGPT-4. I can't tell which one is worse at the moment. But I'm completely impressed by the speed of innovation and progress here. I have so many tabs pinned for comparison purposes that I'm beginning to lose track.
@matt @simon it felt like that itself was also addressed pretty directly in "Individuals are not constrained by licenses to the same degree as corporations". Not just LLaMA, but things like using GPT output to train other models. Those individuals are benefiting greatly from big companies' work in a way the companies themselves can't benefit from each other
@simon thank you for highlighting this and summarizing some interesting points. I really appreciate the insight you're giving into current AI developments.
@simon I’m not understanding why this is a surprise: the larger companies are milking the models they have, since that’s clearly providing an ROI, while the open source communities are getting excited to innovate on the underlying components
@frijolito until recently I thought that the cost involved in training a model would mean the open source community would always be several steps behind OpenAI and Google - apparently at least one person inside Google doesn't think that's true

@simon This actually solves one of my fundamental problems with current LLM tools like ChatGPT and Copilot: that you have to basically stream all of your content/code to Microsoft to use their tool. This seems to indicate that running a local open source server would be entirely feasible.

If the models are also trained using only correctly licensed material (rather than Microsoft buying GitHub and ignoring the licenses for the model) then we have a full house

@simon The reading list alone is gold.
@simon
LoRA is clearly a great tool but, to use an open source analogy, it feels like applying a kernel patch downstream: it gets the job done, but at some point, if it is generic enough, it needs to be upstreamed. And that part is not possible with LoRA: to integrate the modification into the model, a full retraining is inevitable.
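(For readers who haven't used it: LoRA freezes the pretrained weight matrix and trains only two small low-rank matrices, so everything learned lives in a separate delta sitting on top of the base weights. A minimal PyTorch sketch with illustrative dimensions - the merge() helper here is hypothetical and just folds the delta into the frozen weight for inference, which is exactly the "downstream patch" being described above:)

```python
import torch
import torch.nn as nn

# Minimal LoRA sketch: the base weight W is frozen; only the low-rank
# matrices A and B are trained, so the adapter is the product B @ A.

class LoRALinear(nn.Module):
    def __init__(self, in_dim=4096, out_dim=4096, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)          # frozen pretrained W
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * x A^T B^T  (base output plus low-rank delta)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merge(self):
        # Fold the delta into W for inference. It remains a downstream
        # patch: the base model's own weights never "learn" it.
        self.base.weight.data += self.scale * (self.B @ self.A)

layer = LoRALinear()
x = torch.randn(2, 4096)
print(layer(x).shape)  # torch.Size([2, 4096])
```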
@simon and yet Google is instead tightening their grip harder!
@simon bazaar strikes back! 💥
@simon tbh it's nice to see groups of researchers taking the lead on AI. it's not fun to imagine what the world would have been like had the Internet been the product of a race between two corporations
@simon oh wow, this is incredible, thanks!
@simon
For an anonymous doc, isn't "Having read through it, it looks real to me" a point in favor of it being LLM-written? (Not quite a "tell" but a cause to go Hmmmm.)
@simon That document was the best reading of this week by far.
@simon Hey Simon, I’ve been holding off on using ChatGPT, Bard, etc., even though I think they could be useful. This is because I can see (especially with ChatGPT) the horribly unethical behaviour these companies are engaging in, in their arms race to deploy deploy deploy. With all the talk in this leaked doc about open source alternatives, do you know of any LLMs that are “ethically sourced” and available for the average punter to use? I don’t want to be left behind :/

@jimgar the ethics of this stuff is incredibly complicated

I'm very optimistic about the models being trained on the RedPajama data - there's one out already and evidently more to follow very shortly https://simonwillison.net/tags/redpajama/

Simon Willison on redpajama

Claude is one of the most promising closed alternatives to ChatGPT - they have an interesting approach to AI safety which they call "constitutional AI" https://www.anthropic.com/index/introducing-claude
Introducing Claude

After working with key partners for the past few months, we’re opening up access to Claude, our AI assistant.

Anthropic
@simon thank you so much, l’ll give these a look. Everywhere I look in tech it’s one ethical nightmare after another 😵‍💫
@simon what's your take on the copyrighted material included in RedPajama through CommonCrawl? It seems to me that one could train a model on only text that has been shared freely and that might be more ethical. cc @jimgar

@resing @jimgar I'm not convinced it's possible to train a usable LLM without including copyrighted material in the raw pretraining data

As such, I personally think it's a necessary evil to avoid a monopoly on LLM technology belonging to organizations that are willing to train against crawler data

@simon @jimgar not sure I follow. Are you saying that crawler data, which includes copyrighted material, shouldn’t be used by commercial companies, and that LLMs are inherently flawed because of that? If so, I’m not saying you’re wrong, just trying to understand.

@resing @jimgar I'm saying I'm not sure it's possible to build a useful LLM without including copyrighted data in the training set

The ethics of this entire field are incredibly murky - I wrote about that last year https://simonwillison.net/2022/Aug/29/stable-diffusion/#ai-vegan

Stable Diffusion is a really big deal

If you haven’t been paying attention to what’s going on with Stable Diffusion, you really should be. Stable Diffusion is a new “text-to-image diffusion model” that was released to the …

@simon @resing it *all* feels fundamentally wrong, so long as the results rely on indiscriminate harvesting of people’s work without permission. Literally the only compelling argument I have heard is the “necessary evil” Simon mentions - doing it anyway but making it open source. I just find it sad that this is the position we’re in at all, and worse, how little the majority of people seem to care about provenance and permissions, full stop.

@jimgar @resing search engines work by indiscriminately harvesting people's work without their permission, and have done for decades

What's different here isn't how the things are built, it's what they can be used for

People mostly tolerated search engines because they saw them as useful - they helped people's work be found, they didn't (appear to) threaten their livelihoods

@jimgar @resing note that I'm not saying that search engines were morally/ethically pure here either!

The ethics around this are deeply complicated - there are no easy or obvious answers

@simon @jimgar the legal issue might be resolved soon. If @[email protected] is right, Stable Diffusion could lose the copyright lawsuits against them - I buy his argument. If that's the case, LLMs trained on sets that only allow that use might really take off https://arstechnica.com/tech-policy/2023/04/stable-diffusion-copyright-lawsuits-could-be-a-legal-earthquake-for-ai/
Stable Diffusion copyright lawsuits could be a legal earthquake for AI

Experts say generative AI is in uncharted legal waters.

Ars Technica
@simon enjoyed this and your blog generally. Keep it up.

@simon We don't talk enough about how one of the big bugbears at the start of the ML explosion was the assumption that these models would be stuck under corporate control forever because the tech would be proprietary and expensive to run.

That has no bearing on the likelihood of the other risks, but I admit I was on board with that one, and it didn't quite materialize.

@simon Excellent doc there. I keep thinking Google should respond to Meta’s stroke of luck with LLaMA by shipping an LLM browser API and a local model with Chrome.
@simon very interesting post Simon, thanks for sharing
@simon Not so strange that OpenAI doesn't have a proprietary strategy, as they weren't originally supposed to be proprietary.
@simon OpenAI could get a moat if they were willing to invest more in the ChatGPT plugin ecosystem, especially if they added some kind of (embeddings-based) long-term memory.
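(A hedged sketch of what "embeddings-based long-term memory" could look like: store past exchanges as vectors, retrieve the nearest ones by cosine similarity, and stitch them into the next prompt. The embed callable is a hypothetical stand-in for any embedding API, not a specific library:)

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class Memory:
    def __init__(self, embed):
        self.embed = embed   # hypothetical callable: text -> list[float]
        self.items = []      # list of (vector, text) pairs

    def add(self, text):
        # Store the raw text alongside its embedding.
        self.items.append((self.embed(text), text))

    def recall(self, query, k=3):
        # Return the k stored texts most similar to the query.
        qv = self.embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(it[0], qv),
                        reverse=True)
        return [text for _, text in ranked[:k]]

# Usage: relevant past turns get retrieved and prepended to the next prompt.
# memory = Memory(embed=my_embedding_fn)
# memory.add("User prefers metric units.")
# context = memory.recall("What temperature should I set the oven to?")
```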
The AI power struggle: a tsunami is rolling toward us

Images, videos, audio: all of this can be generated ever more convincingly by artificial intelligence. There is much to suggest that open source systems will soon put the industry giants under pressure. How should this be assessed?

DER SPIEGEL