@joytrek @quotesofnote @mdekstrand
Yeah, agreed, that's why I personally don't have any smart devices...
I don't disagree, but what he leaves out is the data. LLMs are trained on all the data they can find, not just factually correct statements. Any model that relies on vast amounts of data has this annotation problem; it doesn't matter whether it's autoregressive or not.
Good article, but I don't understand his hedging in your quote (Do they understand in this constructive sense? Probably not).
There is absolutely no math behind transformers that maps onto generating understanding; they are just word generators. I think it's dangerous to interview non-ML people about what they think LLMs do: they simply don't have the background, and it won't help the discussion.
@joytrek @quotesofnote @mdekstrand
But wouldn't that mostly be actors? Or do you mean based on Alexa/Google Assistant/Siri? I think for the latter they would have to significantly expand their data recording; right now they only save ~30-second snippets. Scary thought.
LeCun's arguments about why LLMs are limited seem fine to me. However, his last slide is just plain wrong. He claims that "almost nothing is learned through supervision or imitation". Babies and toddlers learn almost everything by imitation, and later of course comes a lot of supervision (school). Animals also learn through imitation and some light supervision.
https://drive.google.com/file/d/1BU5bV3X5w65DwSMapKcsr0ZvrMRU_Nbi/view?usp=drivesdk
I highly doubt it. Eventually the web-scraping-based datasets will be so degenerate that it leads to new innovations, probably as one step further towards AGI (i.e. training without huge datasets, as is the case for all living creatures).
Ah no, it's worse than that! Structured text with decent grammar will quickly become suspect. It used to be a sign that somebody was putting some effort into their writing; now it will increasingly make me think the writer was lazy.
I also wonder what it will mean for the development of better language models/bots. I actually think this could lead to the next breakthrough in AI: if you cannot train on webcrawl-based data anymore (because it will mostly have been created by AI), people will need to invent language-generation methods other than simply relying on huge datasets.
Sounds like another instance of Doctorow's enshittification theory (https://pluralistic.net/2023/01/21/potemkin-ai/#hey-guys).
I am already mostly reading a curated list of webpages that I trust; I expect this will become even more necessary in the future.