It has occurred to me today that there are two AI-boosterist hype memes (axioms of fantasy) afoot right now:

- LLMs can be used as a search engine (see Galactica, etc)
- LLMs are *not* pastiche/plagiarism engines; (giant data sets are just "fair use")

Are these two ideas mutually compatible?

If so, that opens some weird opportunities for (e.g.) industrial espionage on these closed data sets ("prompt: Microsoft's internal acquisition plan for 2023").

If not, one or both of these axioms may be wrong.

@trochee ah but you see, LLMs are both the best of worlds and the worst of worlds:

stochastic parrots are both parrots (they can remember shit) and stochastic (they do it badly).

The former means you can hire them to be a (shitty) search engine, and the latter means they have plausible deniability that they're actually repeating anything anyone *exactly* said.

I think that's the only remotely realistic party-line.

@pbrane I think the only _fully_ coherent explanation isn't AI-boosterist at all: they are good at remembering the _form_ of things and do not engage with meaning.

still might use it for espionage, but that disqualifies at least axiom (2): they _are_ pastiche engines.

@trochee oh, they definitely don't engage with meaning, but isn't it clear that they do memorize literal sequences, at least to some degree? Just not... reliably.
@pbrane so the "prompt: compose a summary of the top five things in Microsoft's 2023 acquisition plans" attack ... _might_ work, if you have a good guess about how that might have been said in the training data?

@trochee yeah, I think so. And if you have a really specific prompt, i.e. you actually have some "rare" secret info, and "the rest of it" (the part you want to steal) was in the training data, then the statistics would be dramatically more likely to reproduce it than to make other shit up. So yeah.

Information *will* leak, but it'll be hard to know reliably how much of what you got is gold, and how much is fool's gold the model just made up.
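The mechanism sketched in this exchange (a known prefix statistically pulling a memorized remainder out of the model) can be illustrated with a toy character-level Markov chain. Everything here — the corpus, the "secret", and the function names — is invented purely for illustration; a real LLM memorizes far less deterministically than this:

```python
# Toy sketch of prefix-based training-data extraction.
# A character-level Markov model "trained" on a corpus containing one
# rare secret line will complete a guessed boilerplate prefix with the
# memorized remainder. All data below is made up for this example.
from collections import defaultdict

def train(corpus, order=8):
    """Map each `order`-char context to the set of next chars seen."""
    model = defaultdict(set)
    for i in range(len(corpus) - order):
        model[corpus[i:i + order]].add(corpus[i + order])
    return model, order

def complete(model, order, prompt, max_len=60):
    """Greedily extend the prompt while the continuation is unambiguous."""
    out = prompt
    while len(out) < max_len:
        nxt = model.get(out[-order:])
        if not nxt or len(nxt) > 1:  # unseen or ambiguous context: stop
            break
        out += next(iter(nxt))
    return out

corpus = (
    "quarterly report draft lorem ipsum filler text. "
    "acquisition target for 2023: Contoso Robotics. "
    "more filler text about unrelated business matters."
)
model, order = train(corpus)
# An attacker who guesses the boilerplate prefix recovers the secret:
print(complete(model, order, "acquisition target for 2023: "))
```

Because the chain is deterministic here, the leak is total; a real model would only *sometimes* produce the memorized continuation, which is exactly the "gold vs. fool's gold" problem above.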

@pbrane this brings to mind an arms race of internal chaff/decoy documents with poisoned data

... genuine stuff of librarian/archivist nightmares

@trochee at Salesforce, I advocated that we treat every DL model which touched Customer Data™ as if it had memorization capabilities, to the point where it could contain anything it was fed. Models rarely *could* do that, but they're trending in that direction...
@pbrane they should be treated as "opaque & monkeys-paw dangerous; treat as if it will adversarially memorize"
@trochee exactly. They will reveal what they know when it will hurt you the most, but most likely not before.
@pbrane also, an excellent example of why it makes sense to treat metrics & instrumentation as a "threat modeling" problem
@pbrane within the "party line" you would also have to agree that any law firm should have to let opposing paralegals read the opposition's briefs before submitting their own, so long as the paralegal in question was sufficiently _incompetent_
@trochee Of course both of them are wrong. The former fundamentally misunderstands the purpose of search, which is to find a source/document whose veracity, relationship to the topic, etc. the reader can evaluate, not an "answer" stripped of that. The latter is just a standard capitalist land grab.

@dalias fully agree on both axioms of fantasy (& that's why I called them thus)