One thing about "AI" is with the technology OpenAI has (large neural network plus manual tagging) you could've made the best search engine ever. There could be a Copilot where you describe what you wanted and it finds an example of it in the corpus of open source software. You could go from a fuzzy image description to a stock image. These would be better than buggy code and fucked up images. But they wouldn't do that because the *service* OpenAI provides is obscuring that the content is stolen.
If you describe a situation and OpenAI finds for you an existing image, you know what image database the image came from, and you know whether you're allowed to use it, and you know you're committing a crime, and OpenAI wants to relieve you of this final burden.
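To make the search idea above concrete, here is a minimal sketch, not anything OpenAI or anyone else actually ships. It assumes a toy corpus whose entries, URLs, and licenses are all made up, and it uses plain bag-of-words cosine similarity as a stand-in for a large neural network. The point is only that a retrieval result is an existing item, so its source and license come back with it.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    text: str      # description of the item
    source: str    # where the item lives (hypothetical URL)
    license: str   # license it was published under (hypothetical)


# Hypothetical toy corpus; a real index would hold images or code.
CORPUS = [
    Entry("red fox jumping over snow in winter forest",
          "https://example.org/images/1", "CC-BY-4.0"),
    Entry("city skyline at night with light trails",
          "https://example.org/images/2", "CC0-1.0"),
    Entry("function that reverses a linked list in place",
          "https://example.org/repos/a", "MIT"),
]


def tokens(text):
    """Lowercase bag of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def search(query, corpus=CORPUS, top_k=1):
    """Return the best-matching existing entries, provenance included."""
    q = tokens(query)
    scored = [(cosine(q, tokens(e.text)), e) for e in corpus]
    scored = [(s, e) for s, e in scored if s > 0]  # drop non-matches
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:top_k]]


if __name__ == "__main__":
    for hit in search("a fox in the snow"):
        print(hit.source, hit.license)
```

Nothing is generated here: a fuzzy description goes in, and an existing item comes out along with where it came from and what you are allowed to do with it.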
@mcc makes sense, the former sounds like an actual business model, not a way to accumulate so much wealth you can use it to grasp the levers of power
@mcc not sure. LLMs are very bad at keeping information on their sources, it seems
@galibert they'd have to have trained it to do something different than they in fact trained it to do
@galibert @mcc LLMs are not inherently bad at that, it's just the way they're trained.

@mcc oh that's a really interesting perspective

OpenAI
We commit crimes so you don't have to

@mcc As the kids say: talk your shit, mcc. 🤘🏽

@mcc actually that just sounds so much better, like... some sort of stack overflow for finding code in github specifically, you just ask it what you want and it'd give you a trillion billion examples from a lot of different repositories

and it wouldn't even go against the spirit of open source unless you steal entire systems without proper acknowledgement

@mcc tbh, a search tool like that would even be 100% allowed as courts have decided that smth like that counts as being transformative enough to fall under fair use

@mcc 100% agree.

I have always thought that:

- LLMs are lossy compressed hyperbooks.
- Companies misleadingly slapped an « oracle » UX over it.
- It has more potential as a navigational artifact than a knowledge artifact.

@0gust1 The way I usually frame it is that machine learning can work-with-heavy-caveats for identification and categorization, but generation is a completely different problem and it does not work* for that.

* Except for certain problems of pure aesthetics, and the corporate LLMs/image models fail at those aesthetics.

@mcc likewise, they could also build into the LLMs the ability to cite sources reliably and trace the origins of facts and text. Some of them try to do that, but in my experience they are terribly unreliable (as in the majority of cited sources either don't exist or are irrelevant to the topic at hand). "Citation needed" is a wikipedia cliché, but crucial for building human knowledge and broken (by intent as you say) for LLMs.
@kajord i think generation plus citation *together* is probably much harder to do reliably than either separately, but also, i believe their business incentives are way, way against it

@mcc

Not sure when you've used #AI 👉properly👈.

In my experience the more vocal an opponent of AI is, the further back in time their use (or lack of use) goes.
With the most ardent opponents having never used the models, yet having the most emphatic (and increasingly inaccurate) opinions.

Attached media, a public query from today, with sources dropdown at the bottom.

Approx 30% of web searches come from the engines nowadays.

(Edit: Hahaha, insta blocked by poster, I guess folks don't like to be called out on saying patently provable falsehoods 🤡

The poster made a comment exposing their ignorance of features of existing AI. This one has 33,000 followers, question is "How many others like them have zero idea about the systems they critique"?)
#llm #ai #luddites

@n_dimension @mcc have you perhaps considered that:
1. you are very much mansplaining, mcc knows far more about this than you do
2. maybe fedi is not the right social media for you. go set up an account on farcaster or whatever the grifty techbros are using nowadays
3. even IF AI provided reputable content and sources as described by mcc, that still doesn’t solve all the ethical and environmental issues

that is all.

@GroupNebula563 @mcc

1. She literally said an untruth.
The exact definition of "not knowing more than I do"

2. Thanks for gatekeeping. Keep it up.

3. The post wasn't about ethics, it was about exposing ignorance of how the system evolved.

Thanks for engaging

@n_dimension @GroupNebula563 Your post was so incoherent that it was not possible to know what it was about, except that it legitimised wholesale infringement in a locked product.

@ahltorp @GroupNebula563

What are you talking about?

The folk hero mcc made a statement demonstrating she has not seen an LLM for at least 6 months.

Then I demonstrated she made an error.
WTF is incoherent about it?
Maybe it's the folks who DON'T use AI who are losing thinking skills.

Where are you confused?
I'll walk you through

@n_dimension @ahltorp magnus, and anyone else reading this post: I checked their profile and they’re a UFO conspiracy theorist. I don’t think there’s any winning this argument. block, report, and move on

@GroupNebula563 @ahltorp

WTF are you talking about.

What are you going to report me for?
Pointing out another user outright misrepresented a technology feature?

As to the UFO conspiracy theory.
It's you who is the conspiracy theorist.

@n_dimension @ahltorp all right I think we’re well and truly done here

@n_dimension That's just an LLM googling. It doesn't have the sources, it uses tool calls to use search engines and scrape web pages.

An LLM using a search engine under the hood is not proof that an LLM can replace a search engine.

And it doesn't solve fundamental problems (that can only be solved with a very different kind of training and different tools) such as making shit up and not giving credit to the source material of the training data (except for very well known things and only when you ask explicitly).

@starsider

The sources are at the bottom of the dialogue.
You click it shows sources.
You need a computer to see it.
A computer is like a slate tablet only it uses electricity.
Your library has one.

@n_dimension Those are not sources from the training data. Those are sources extracted from a literal google search made by the LLM, with keywords chosen by the LLM. That's not what mcc is talking about. That's just tool calling. Do you know what tool calling is?
@n_dimension @mcc

but those citations are generated, no?

@n_dimension

This is neither an image search nor an example of open source software.

@n_dimension you're a huge asshole and you will continue to be hated and blocked by many for this type of behavior

@mcc I doubt that

I think we are all forgetting the years and years of general enshittification of the web - all the crap, out of date examples; pages written more for clicks than helpfulness; transition to video for everything; etc, etc

I feel this was the correct technology path to follow, but everything they did about how they went about it is an immoral mess

- with a focus on enshittifying it straight out of the box

@mcc haven’t you described perplexity.ai ?
@david01928 @mcc nope, perplexity is just another LLM masquerading as a “search engine” that barely does anything useful
@david01928 That's just an LLM using a regular search engine and some other tricks, and pretty much all LLMs nowadays can do that through tool calling... but that's a very poor (and extremely limited) imitation of what mcc is actually talking about.
@mcc I still don't understand how the content is stolen.
@vader @mcc basically, the LLM is using content without consent. not obeying licenses, not attributing (or misattributing) the garbage it spits out, and actively avoiding attempts to curtail this behavior. I suggest you read this amazing article: https://aworkinglibrary.com/writing/toolmen
Yeah it's completely ignoring use licenses that humans would have to comply with.
@vader @ct @mcc go ask ChatGPT or another one of those bullshit generators you’re so fond of, they’re going to be more willing to waste energy on you than we are
@GroupNebula563 @ct @mcc Deflections are a sign of the unintelligent who can't have a knowledgeable discourse. I'm sorry you're unequipped to have this conversation, but if you can't, maybe let ceets here explain their point of view. Potentially they actually have data and information that would be good for discussion.
@GroupNebula563 @mcc That's not how LLMs work. They learn from ingesting materials, creating tokens and learning patterns. Then they create their own "garbage" based off of all that they have learned. I work in the industry. I don't need to read that article. By your logic, every author ever would need to attribute every single book they've read that could have ever influenced them.
@vader @mcc oop, we got a mansplainer. the problem here is that humans are… well… humans. they are capable of transformative thought and can come up with original ideas. LLMs, as you said previously, cannot. all they do is stitch things together based on what’s in their database. mcc ALSO works in the industry (the industry of *actual computing*, not bubbles that will burst in a matter of years), and (no offense) probably knows far more about it than you do. (1/2)
@vader @mcc (2/2) if you write a song with a violin in it, you do not have to *credit* the creator of the instrument (this is what humans do). if you stitch a bunch of parts of existing songs together without the consent or knowledge of the original writers or record labels and call it your own song, then you absolutely have to give credit and in some cases even that isn’t enough. anyways, maybe fedi isn’t the best fit for you. maybe go back to X (formerly Twitter) or whatever

@mcc

The links at the end of each line in the AI summary at the top of Google search results are often better than the search results themselves (and always better than the AI summary since they are an authoritative source)

@gbargoud maybe they should have just incorporated that engine into the search results. as it is i'll never see it because i switched away from google completely solely in order to avoid the AI summary box

@mcc

Yeah the UX for that is horrible and easy to miss but it shows just how great they could be if they were used as an index instead of a weird regurgitator like you suggested.

@mcc
Resonates: “… the *service* OpenAI provides is obscuring that the content is stolen.“ 😐

@mcc the *one thing* I keep coming back to with these tools, watching students use them, is that the inference ability of these tools to do things like translate or refine interpretation of the student's intention is great

and then instead of being a search engine it's kinda useless

@mcc You absolutely *can do that*.

One of my AI use cases is "tip of the tongue" searches, things like "find me a movie where the dog dies, shortly followed by...." or "find me that comment from Hackernews on a story about Google buying some fitness startup that linked to books about x"

Modern LLMs will search and link to sources.

@mcc Kagi does it. In a non open source way, sadly, but as a proof of concept it is here, working. Now we need an open source alternative.
@mcc the sad thing is, this is exactly what language models were invented for in the first place, in the field of information retrieval. Ironically, one of the best models to map image descriptions to existing images is OpenAI's CLIP model. The technology is there, and it's crazy good, but instead of making human knowledge more accessible than ever, we poison the Internet with nonsense, making actual information even harder to find.

@mcc "the *service* OpenAI provides is obscuring that the content is stolen."

👆👆👆

This. So much this. The folks bamboozled by the output of the automated plagiarism machines just don't grasp how utterly vast the corpus of stuff out there is, and how getting what they hoped out of the machine was just the result of something recognizably similar already existing.

THE service is sufficiently obscuring that similarity to create plausible deniability of plagiarism.

@mcc This reminds me of my cousin who loves AI, but only uses it for finding the poorly named programs in his obscure work server OS, and he's right, that is what LLMs are good for.

@mcc Yup. It's an accountability firewall. They provide two advantages to customers (as distinct from users) -

"We couldn't do this without a lot of stolen data and obviously we weren't going to take on the liability but we can just pay OpenAI to do it for us!"

AND -

"We can't be blamed for the decisions made at our request on our behalf by the LLM!"