New paper 🚨

Can we solely rely on LLMs’ memories (eg replace search w ChatGPT)? Probably not.
Is retrieval a silver bullet? Probably not either.

Our analysis reveals that LLMs' memorization of factual knowledge is still limited, and scaling won't help much on long-tail distributions.
We show that adaptively incorporating non-parametric memories (eg retrieved chunks) can improve performance as well as efficiency.

📜 http://tinyurl.com/2sdeuupn 💻 http://github.com/AlexTMallen/adaptive-retrieval

#PaperThread #newpaper
[1/N]

LLMs store a lot of factual knowledge in their parameters (parametric factual knowledge), but recent work shows that they struggle to learn less frequent facts and can often hallucinate when they don't know. How much do they memorize and what affects their memorization? [2/N]
To answer these questions, we construct a new large open-domain QA dataset, PopQA, whose questions are grounded in Wikidata and are sampled from long-tail popularity distributions of Wikipedia to enable fine-grained analysis. We then test 10 LLMs in a zero/few-shot manner [3/N]
We found strong correlations between subject entity popularity and accuracy, indicating that LLMs memorize popular factual knowledge well while failing to memorize less popular facts. [4/N]
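The analysis boils down to binning questions by subject popularity and checking per-bin accuracy. A minimal sketch of that idea, assuming a hypothetical record format of (popularity, correct) pairs where popularity is something like monthly Wikipedia page views:

```python
import math
from collections import defaultdict

def accuracy_by_popularity(records, num_buckets=5):
    """Bucket QA records by order of magnitude of subject popularity
    and compute per-bucket accuracy. Each record is (popularity, correct);
    the data format here is a made-up stand-in, not PopQA's actual schema."""
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [num_correct, total]
    for pop, correct in records:
        b = min(int(math.log10(max(pop, 1))), num_buckets - 1)
        buckets[b][0] += int(correct)
        buckets[b][1] += 1
    return {b: c / t for b, (c, t) in sorted(buckets.items())}

# Toy illustration (fabricated numbers): accuracy rises with popularity.
toy = [(50, False), (80, True), (5_000, True), (7_000, False),
       (900_000, True), (1_200_000, True)]
print(accuracy_by_popularity(toy))
```

A correlation statistic (e.g. Spearman's rank correlation) over the raw pairs tells the same story; bucketing just makes the trend easy to plot.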
Surprisingly, in long-tail distributions, scaling LLMs may not be as helpful as we believed: GPT-3 003 performs nearly as poorly as GPT-Neo 2B on less popular entities 🟦
Prior analyses of knowledge learning often use NQ/TriviaQA 🟥, which may inflate the apparent effectiveness of scaling [5/N]
We show that augmenting LMs with non-parametric memories (retrieved text chunks) largely helps: GPT-Neo 1.3B assisted by retrieved context outperforms vanilla GPT-3 003! Even for GPT-3, retrieval gives up to 10% accuracy gains. Why are they so effective? [6/N]
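The augmentation itself is simple: prepend retrieved text chunks to the QA prompt so the model can read the answer instead of recalling it. A minimal sketch (the template below is a simplified stand-in, not the paper's exact prompt):

```python
def build_prompt(question, retrieved_chunks=None):
    """Build a QA prompt, optionally prepending retrieved passages
    (non-parametric memory) as context. Hypothetical template for
    illustration only."""
    context = ""
    if retrieved_chunks:
        context = "\n".join(f"Context: {c}" for c in retrieved_chunks) + "\n"
    return f"{context}Q: {question}\nA:"

# Without retrieval the model must rely on parametric memory alone;
# with retrieval the answer is right there in the context.
print(build_prompt("What is Joe Hisaishi's occupation?",
                   ["Joe Hisaishi is a Japanese composer."]))
```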
We found that retrieval-augmented LMs (red & green lines) are particularly helpful for questions about less popular entities, where LMs struggle. Conversely, larger models (eg GPT-3) can even outperform retrieval-augmented models on well-known facts, due to retrieval errors. [7/N]
In summary, LLMs indeed memorize a lot now, but they are still not good enough to completely replace non-parametric memories, esp. in domains with long-tail distributions. Can we get the best of both worlds? We introduce a simple-yet-effective method: Adaptive Retrieval [8/N]
Adaptive Retrieval decides when *not* to retrieve based on the subject popularity & relationship type. This approach not only gives performance improvements (up to 5%) but also largely reduces inference latency & API costs (e.g., halves GPT-3 API costs!). [9/N]
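The gist is a popularity threshold per relationship type: skip retrieval for popular subjects, where parametric memory is reliable, and retrieve only for long-tail ones. A minimal sketch, with made-up cutoff values (the real thresholds are tuned on dev data):

```python
def should_retrieve(subject_popularity, relation, thresholds, default=1e4):
    """Adaptive Retrieval gist: retrieve only when the subject is below
    the popularity cutoff for its relationship type. `thresholds` maps
    relation -> cutoff; all numbers here are hypothetical, not the
    paper's tuned values."""
    return subject_popularity < thresholds.get(relation, default)

thresholds = {"occupation": 5e4, "director": 1e3}  # hypothetical cutoffs

print(should_retrieve(120, "director", thresholds))          # long-tail entity
print(should_retrieve(2_000_000, "occupation", thresholds))  # popular entity
```

Skipping retrieval for popular entities is where the latency and API savings come from: fewer retrieved chunks means shorter prompts and fewer tokens billed.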
More interesting results & discussions in our paper!
📝 https://tinyurl.com/2sdeuupn
‍💻 https://github.com/AlexTMallen/adaptive-retrieval
Work done by
Alex (a junior undergrad at UW!)
@AkariAsai
@v
Rajarshi Das
Hanna Hajishirzi
Daniel Khashabi
@AkariAsai
great paper, really interesting to read! I'm wondering why there is no clear correlation for the 'country' category? Do you have an explanation or a suspicion?

@soerenarlt
Thanks a lot! We found that in some relationship types (e.g., country, nationality, etc.) LMs often exploit surface-level cues (e.g., entity names; for example, a person named Akari is likely from Japan).

As a result, even if a model doesn't really know the answer, it can still answer correctly, which weakens the correlation.
More discussion can be found in Sec. 3.2, in the "Subject entity popularity predicts memorization" paragraph!