It has occurred to me today that there are two AI-boosterist hype memes (axioms of fantasy) afoot right now:

- LLMs can be used as a search engine (see Galactica, etc)
- LLMs are *not* pastiche/plagiarism engines; (giant data sets are just "fair use")

Are these two ideas mutually compatible?

If so, that opens some weird opportunities for (e.g.) industrial espionage on these closed data sets ("prompt: Microsoft's internal acquisition plan for 2023").

If not, one or both of these axioms may be wrong.

@trochee ah but you see, LLMs are both the best of worlds and the worst of worlds:

stochastic parrots are both parrots (they can remember shit) and stochastic (they do it badly).

The former means you can hire them to be a (shitty) search engine, and the latter means they have plausible deniability that they're actually repeating anything anyone *exactly* said.

I think that's the only remotely realistic party-line.

@pbrane I think the only _fully_ coherent explanation isn't AI-boosterist at all: they are good at remembering the _form_ of things and do not engage with meaning.

still might use it for espionage, but that disqualifies at least axiom (2): they _are_ pastiche engines.

@trochee oh, they definitely don't engage with meaning, but isn't it clear that they do memorize literal sequences, at least to some degree? Just not... reliably.
@pbrane so the "prompt: compose a summary of the top five things in Microsoft's 2023 acquisition plans" attack ... _might_ work, if you have a good guess about how that might have been said in the training data?

@trochee yeah, I think so. And if you have a really specific prompt, i.e. you actually have some "rare" secret info, and "the rest of it" (the part you want to steal) was in the training data, then the statistics would be dramatically more likely to reproduce it than to make other shit up. So yeah.

Information *will* leak, but it'll be hard to know reliably how much of what you got is gold, and how much is fool's gold the model just made up.
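The mechanism sketched in this exchange (a known prefix statistically pulling a memorized remainder out of the model) can be illustrated with a toy character-level Markov chain. Everything here — the corpus, the "secret", and the function names — is invented purely for illustration; a real LLM memorizes far less deterministically than this:

```python
# Toy sketch of prefix-based training-data extraction.
# A character-level Markov model "trained" on a corpus containing one
# rare secret line will complete a guessed boilerplate prefix with the
# memorized remainder. All data below is made up for this example.
from collections import defaultdict

def train(corpus, order=8):
    """Map each `order`-char context to the set of next chars seen."""
    model = defaultdict(set)
    for i in range(len(corpus) - order):
        model[corpus[i:i + order]].add(corpus[i + order])
    return model, order

def complete(model, order, prompt, max_len=60):
    """Greedily extend the prompt while the continuation is unambiguous."""
    out = prompt
    while len(out) < max_len:
        nxt = model.get(out[-order:])
        if not nxt or len(nxt) > 1:  # unseen or ambiguous context: stop
            break
        out += next(iter(nxt))
    return out

corpus = (
    "quarterly report draft lorem ipsum filler text. "
    "acquisition target for 2023: Contoso Robotics. "
    "more filler text about unrelated business matters."
)
model, order = train(corpus)
# An attacker who guesses the boilerplate prefix recovers the secret:
print(complete(model, order, "acquisition target for 2023: "))
```

Because the chain is deterministic here, the leak is total; a real model would only *sometimes* produce the memorized continuation, which is exactly the "gold vs. fool's gold" problem above.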

@pbrane this brings to mind an arms race of internal chaff/decoy documents with poisoned data

... genuine stuff of librarian/archivist nightmares

@trochee at Salesforce, I advocated that we treat every DL model which touched Customer Data™ as if it had memorization capabilities, to the point where it could contain anything it was fed. Models rarely *could* do that, but they're trending in that direction...
@pbrane they should be treated as "opaque & monkeys-paw dangerous; treat as if it will adversarially memorize"
@trochee exactly. They will reveal what they know when it will hurt you the most, but most likely not before.
@pbrane also, an excellent example of why it makes sense to treat metrics & instrumentation as a "threat modeling" problem
@pbrane within the "party line" you would also have to agree that any law firm should have to let opposing paralegals read the opposition's briefs before submitting their own, so long as the paralegal in question was sufficiently _incompetent_
@trochee Of course both of them are wrong. The former fundamentally misunderstands the purpose of search, which is to find a source/document whose veracity, relationship to the topic, etc. the reader can evaluate, not an "answer" stripped of that. The latter is just a standard capitalist land grab.

@dalias fully agree on both axioms of fantasy (& that's why I called them thus)