New paper 🚨

Can we solely rely on LLMs’ memories (eg replace search w ChatGPT)? Probably not.
Is retrieval a silver bullet? Probably not either.

Our analysis reveals that LLMs' memorization of factual knowledge is still limited, and scaling won't help much on long-tail distributions.
We show that adaptively incorporating non-parametric memories (eg retrieved chunks) can improve performance as well as efficiency.

📜 http://tinyurl.com/2sdeuupn 💻 http://github.com/AlexTMallen/adaptive-retrieval

#PaperThread #newpaper
[1/N]

LLMs store a lot of factual knowledge in their parameters (parametric factual knowledge), but recent work shows that they struggle to learn less frequent facts and can often hallucinate when they don't know. How much do they memorize and what affects their memorization? [2/N]
To answer these questions, we construct a new large open-domain QA dataset, PopQA, whose questions are grounded in Wikidata and are sampled from long-tail popularity distributions of Wikipedia to enable fine-grained analysis. We then test 10 LLMs in a zero/few-shot manner [3/N]
We found strong correlations between subject entity popularity and accuracy, indicating that LLMs memorize popular factual knowledge well while failing to memorize less popular facts. [4/N]
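The analysis boils down to binning questions by subject popularity and checking per-bin accuracy. A minimal sketch of that idea, assuming a hypothetical record format of (popularity, correct) pairs where popularity is something like monthly Wikipedia page views:

```python
import math
from collections import defaultdict

def accuracy_by_popularity(records, num_buckets=5):
    """Bucket QA records by order of magnitude of subject popularity
    and compute per-bucket accuracy. Each record is (popularity, correct);
    the data format here is a made-up stand-in, not PopQA's actual schema."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [num_correct, total]
    for pop, correct in records:
        b = min(int(math.log10(max(pop, 1))), num_buckets - 1)
        buckets[b][0] += int(correct)
        buckets[b][1] += 1
    return {b: c / t for b, (c, t) in sorted(buckets.items())}

# Toy illustration (fabricated numbers): accuracy rises with popularity.
toy = [(50, False), (80, True), (5_000, True), (7_000, False),
       (900_000, True), (1_200_000, True)]
print(accuracy_by_popularity(toy))
```

A correlation statistic (e.g. Spearman's rank correlation) over the raw pairs tells the same story; bucketing just makes the trend easy to plot.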
Surprisingly, in long-tail distributions, scaling LLMs may not be as helpful as we believed: GPT-3 003 performs nearly as poorly as GPT-Neo 2B on less popular entities 🟦
Prior analyses of knowledge learning often use NQ/TriviaQA 🟥, which may inflate the apparent effectiveness of scaling [5/N]
We show that augmenting LMs with non-parametric memories (retrieved text chunks) largely helps: GPT-Neo 1.3B assisted by retrieved context outperforms vanilla GPT-3 003! Even for GPT-3, retrieval gives up to 10% accuracy gains. Why are they so effective? [6/N]
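The augmentation itself is simple: prepend retrieved text chunks to the QA prompt so the model can read the answer instead of recalling it. A minimal sketch (the template below is a simplified stand-in, not the paper's exact prompt):

```python
def build_prompt(question, retrieved_chunks=None):
    """Build a QA prompt, optionally prepending retrieved passages
    (non-parametric memory) as context. Hypothetical template for
    illustration only."""
    context = ""
    if retrieved_chunks:
        context = "\n".join(f"Context: {c}" for c in retrieved_chunks) + "\n"
    return f"{context}Q: {question}\nA:"

# Without retrieval the model must rely on parametric memory alone;
# with retrieval the answer is right there in the context.
print(build_prompt("What is Joe Hisaishi's occupation?",
                   ["Joe Hisaishi is a Japanese composer."]))
```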
We found that retrieval-augmented LMs (red & green lines) are particularly helpful for questions about less popular entities, where LMs struggle. Conversely, larger models (eg GPT-3) can even outperform retrieval-augmented models on well-known facts, due to retrieval errors. [7/N]
In summary, LLMs indeed memorize a lot now, but they are still not good enough to completely replace non-parametric memories, esp. in domains with long-tail distributions. Can we get the best of both worlds? We introduce a simple-yet-effective method: Adaptive Retrieval [8/N]
Adaptive Retrieval decides when *not* to retrieve based on the subject popularity & relationship type. This approach not only gives performance improvements (up to 5%) but also largely reduces inference latency & API costs (e.g., halves GPT-3 API costs!). [9/N]
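The gist is a popularity threshold per relationship type: skip retrieval for popular subjects, where parametric memory is reliable, and retrieve only for long-tail ones. A minimal sketch, with made-up cutoff values (the real thresholds are tuned on dev data):

```python
def should_retrieve(subject_popularity, relation, thresholds, default=1e4):
    """Adaptive Retrieval gist: retrieve only when the subject is below
    the popularity cutoff for its relationship type. `thresholds` maps
    relation -> cutoff; all numbers here are hypothetical, not the
    paper's tuned values."""
    return subject_popularity < thresholds.get(relation, default)

thresholds = {"occupation": 5e4, "director": 1e3}  # hypothetical cutoffs

print(should_retrieve(120, "director", thresholds))          # long-tail entity
print(should_retrieve(2_000_000, "occupation", thresholds))  # popular entity
```

Skipping retrieval for popular entities is where the latency and API savings come from: fewer retrieved chunks means shorter prompts and fewer tokens billed.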
More interesting results & discussions in our paper!
📝 https://tinyurl.com/2sdeuupn
‍💻 https://github.com/AlexTMallen/adaptive-retrieval
Work done by
Alex (a junior undergrad at UW!)
@AkariAsai
@v
Rajarshi Das
Hanna Hajishirzi
Daniel Khashabi
@AkariAsai
great paper, really interesting to read! I'm wondering why there is no clear correlation for the 'country' category? Do you have an explanation or a suspicion?

@soerenarlt
Thanks a lot! We found that in some relationship types (e.g., country, nationality, etc.) LMs often exploit surface-level cues (e.g., entity names; for example, a person named Akari is likely from Japan).

As a result, even if a model doesn't really know the answer, it can still answer correctly, which weakens the correlation.
More discussion can be found in Sec. 3.2, in the "Subject entity popularity predicts memorization" paragraph!