"AI is built on the collective knowledge of humankind."

No. Nononononono. It is not built on _knowledge_, it is built on _data_. And not everyone's experiences are available as data; many communities are excluded. Also: "Collective" implies some sort of collaboration and shared activity. But "AI" is just accumulation by a powerful few.

So No. It's not collective but extractive, not knowledge but data, not humankind but the hegemonic western view. Everything in that statement is wrong.

@tante this is a crucial distinction.

@tante Knowledge is something curated.

Data is not curated. For every correct thing in there, there will be five well-meaning contradictions of it and a hundred deliberate lies.

@tante TIL: a lot of the NLP models were initially trained on the Enron Corpus. So we basically trained AI on criminal evidence of corporate fraud, and now we are wondering why the world is shit.
@Nfoonf
That sounds interesting! Why the Enron corpus? Because it was so huge and also in the public domain?

@musevg basically, yes: https://en.wikipedia.org/wiki/Enron_Corpus

i wonder if we will see the epstein files soon :D


@musevg
I know someone who wrote a linguistics thesis based on the Enron corpus. They used it yes because it was large and public, but also because, unlike most corpora, the people in it didn't know they were making a corpus—minimal observer effect.
@Nfoonf
@thrilway @musevg this does not refute my statement. :)

@Nfoonf @tante Google trained theirs on a corpus that included 4chan and Stormfront.

https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

@tante you probably mean many communities are _EX_cluded? I wonder if that is going to play out as an advantage in the future. Thinking e.g. of indigenous knowledge not available as data...
@bicipoiesis yeah "exclusion" is better than "non inclusion"
@tante And not even all data, but specifically MEDIA. Specific media. Pigeonholed by the socioeconomic constraints of its time. LLMs are trained on documents, which means they cannot be trained on anything that cannot be conveyed in documents, like non-verbal insights or spatial reasoning. Diffusion image genAI is trained on raster (pixel) images, which means it lacks, for example, the spatial cues of vector graphics and 3D engines (which is why diffusion AI videogames suck).
@tante Only a person who believes that all human knowledge in the universe can be transcribed into documents could understand how LLMs work and still believe that a machine trained only on papers can attain general intelligence.
@tante not to mention the "knowledge" (data) was basically scraped without permission
@nnnilabs @tante
Data behind paywalls (valuable data) is not scraped, is it?

@Freedman @nnnilabs @tante want to bet? Meta publicly announced it downloaded umpteen terabytes of copyrighted material such as novels from an illegal warez site in Russia and used it to train its LLMs. No legal fallout from that whatsoever.

Altman has stated that "AI" development will die if companies are prevented from stealing copyrighted prior art.

It's extractive theft. Plain and simple

@tante narcissists are always grand masters of marketing some grand vision, which later on turns out to be a bunch of psychological abuse to those who get too close. The psychology was always sitting there in plain view.

Great essay exploring the patterns of psychology at play in the #AI industry, BTW: "The Possessed Machines: Dostoevsky's Demons and the Coming AGI Catastrophe":
https://possessedmachines.com

#psychology


@d1 @tante

You might like Paul Kingsnorth's new book, Against The Machine. He is a very religious person and directly relates the coming of "AI" to the coming of the Antichrist from the biblical Book of Revelation. The rapture of the nerds is the biblical rapture, but inverted so the devil wins.

I am not a religious person whatsoever. In my view it's a metaphor but it's a good one. A lot of religious language is used on both sides of the "AI" debate after all. Cultists etc.

@tante and then the output is manipulated even further by those few.

@tante

sewage is built on the collective kitchen of humankind. every possible taste is mixed

@tante

It's a perfect example of dishonest AI propaganda.

@tante AI is built on the hubris of wealthy morons.

AI is an oxymoron. Everything in the universe is real. If intelligence exists in this universe, it will be real just like ours, and not artificial. We have complicated LLMs, nothing more.

@tante

Sadly true. And some points would be so easily changed if we wanted. Not the technical limitations, of course.

@tante not built, but more like "grown", not intelligence but capability...
@tante
Also data scraped without consent.

@tante cherry on top: the AI isn't AI either.
So:

«The AI (which it isn't, it's just a Large Language Model) is built (which it isn't, it's trained) on the collective (which it isn't, it's extractive) knowledge (which it isn't, it's data) of humankind (not all of it, just the colonialists)»

@oblomov @tante

Thanks for saving me the bother of saying that!

@tante And only the base neural network is "built", in the sense of a software implementation. For the rest, data is "trained" into the models. The models make use of the "trained" or "learned" data in very opaque ways by means of pseudo-stochastic processes. No one can determine which data influences the behaviour of the model in which way.
@tante The correction is easy: "AI is built on the THEFT OF the collective knowledge of humankind."

@einfachnurRoland @tante Inserting that would still not make it correct. It would imply some sort of completeness, and that is very far from the truth.

And even if all text (in a broad sense) ever produced was input into these machines, it would still not be knowledge, much in the same way as fully sequencing the human DNA didn’t make us magically know everything there is to know about humans.

@einfachnurRoland @tante Nope. Knowledge is what's in people's heads. You mean information and data.
@samueljohnson @tante
Nope. The thing in the head of most people is the opposite of AI: Natural Stupidity.

@einfachnurRoland @tante Hah. Indeed, it's not a given that what people know is true. 😉

AI doesn't "know" anything, whether it gets anything right or not, because knowledge is in people's heads.

@tante and if the data is wrong......then....it's not collective knowledge.
@tante eh, hegemonic western view... since deepseek, wan and z image turbo, that is no longer really true. now it's just various forms of hegemony ;)
@lritter The hegemonic western view comes from the fact that that constitutes most of the data that is available on the planet. The west is just overly datafied.

@tante assuming you're objecting to the line from Abundance, it's instructive to look at the rest of the sentence

> AI is built on the collective knowledge of humanity, and so its profits are shared

The second part is obviously false; the profits have been going to Jeff Bezos and Jensen Huang. So what does that imply about the first part? Interesting.

@aburka @tante To be fair, that bit of the book seems to be describing an Abundance-pilled utopian future, not the present state of things. It is grounded in as much reality as any utopic vision has ever been, which is to say none at all.
@nick @tante yeah and they also think factories in space will make medicine free lmao

@tante

With everything that’s been zapped, and all that continues to be, what is the definition of knowledge today? Is it the whole of everything learned by a single human being or a gigantic data store to explore the depths of, guided by silicon shamans?

@tante if the statement "ai is the collective knowledge of mankind" is true then the finest food in the world is the stinking, rotten, rancid juice at the bottom of a dumpster behind a food delivery ghost kitchen as it is all the cuisines of the world melded together and distilled.
@tante not built, just stealing everything

@tante "It is not built on _knowledge_, it is built on _data_." Oh, it's so much worse than that. It's not built on data, it's built on *text*. Including massive amounts of intentional fiction (lots of paranormal romance), unintentional fiction (all those flat earthers and young earth creationists), racism and trolling (4chan!), and psychosis creations (Time Cube ftw!).

Can't tell you how thrilled I am that my robot surgeon might've been trained on Dr. Bronner's soap bottle text.

Which actually reminds me, have any of the SlopMachines started talking about the best way to build space cars for vampires yet? Asking for a friend, who is mostly watching the decline of western civilization with some bemusement.
@tante YES!!! I’ve been trying to express this point for months and you expressed it just right!! Thanks.
I’m pretty sure that AI will “average” our culture into a mediocre mess

@tante

Hot Damn

so well said and then the additions below just make this statement better and better.

@tante It's a categorically wrong statement, because "knowledge" implies semantic comprehension, which "AI" doesn't do.

It doesn't matter if you encoded it as data or not, the "AI" does not work with semantic knowledge and logic.

(Unlike some prior attempts a few decades ago, which also failed, for reasons such as that approach being pretty difficult.)

@tante

Data that can be subtly molded over time by very big actors or small crafty actors, in ways that would be hard to detect and very bad for you.

How many GitHub projects or Stack Overflow posts would have to be planted, targeting smaller but vital areas of code, to inject remotely exploitable vulnerabilities? We already see folks squatting on hallucinated package names.

How would we even know it is not already happening?

@tante I completely agree. This isn’t “collective knowledge,” but massive datasets collected and filtered by those who have the resources. Many voices simply don’t make it into that sample. So calling it something neutral and “from all of humanity” is, to put it mildly, an exaggeration.
@tante "built on the extractive data of hegemonic western view." Got it. ;)
@tante All true. And as Theodore Sturgeon said: 90% of everything is rubbish. (So much rubbish 😢)
@tante It's not a jupiter brain made of all knowledge, it's a black hole all knowledge was shoved into.
@tante I have been reliably informed, since the 80s, that knowledge is data with parentheses around it.
@tante AI is built on stealing as much data from everywhere they can. Once they're a big company, they claim that it's their right to steal because it's their business model. They should all be closed down, and the executives arrested.
@tante It's like that time in Star Trek Voyager when they wanted to program a new doctor and fed the holomatrix with all available medical knowledge. That didn't work either, because the matrix then knew everything but had no idea how the knowledge fits together and how it has to be applied. AI is exactly like that: knows everything and can't do anything.

@tante assimilation is hegemony

#aislop #fascism #theft #lies #data

(I need to rewatch the Borg episodes of TNG)

@tante “collected writings of the internet”, not “collective knowledge of humanity”