@soerenarlt
Thanks a lot! We found that for some relationship types (e.g., country, nationality), LMs often exploit surface-level cues such as entity names; for example, a person named Akari is likely from Japan.
As a result, even if a model doesn't really know the answer, it can still guess correctly, which weakens the correlations.
More discussion can be found in Sec 3.2, in the "Subject entity popularity predicts memorization" paragraph!
@jacobeisenstein
Thank you so much for the feedback!!
Yes, we totally agree that retrieval augmentation is quite effective and addresses many issues of relying on LMs trained on static text. We tried to fit many findings from the paper into a single post, which may have made it misleading...
Regarding calibration, we didn't try other methods and focused on the simple popularity-based approach as a first step. We're interested in trying more sophisticated (e.g., learned) approaches though!
More interesting results & discussions in our paper!
📝 https://tinyurl.com/2sdeuupn
💻 https://github.com/AlexTMallen/adaptive-retrieval

Work done by Alex (a junior undergrad at UW!), @AkariAsai, Rajarshi Das, Hanna Hajishirzi, and Daniel Khashabi.
Adaptive Retrieval decides when *not* to retrieve based on subject popularity & relationship type. This approach not only improves performance (by up to 5%) but also greatly reduces inference-time latency & API costs (e.g., it halves GPT-3 API costs!). [9/N]
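A minimal sketch of the decision rule described above. All names (`should_retrieve`, `thresholds`, the `lm`/`retriever` callables) are illustrative assumptions, not the paper's actual code; the core idea is just a per-relationship popularity threshold:

```python
def should_retrieve(subject_popularity: int, threshold: int) -> bool:
    """Retrieve only for less popular subjects, where the LM's
    parametric memory is unreliable; skip retrieval otherwise."""
    return subject_popularity < threshold

def answer(question, subject_popularity, relationship,
           thresholds, lm, retriever):
    # One threshold per relationship type, tuned on held-out data
    # (an assumption about how the thresholds would be chosen).
    t = thresholds[relationship]
    if should_retrieve(subject_popularity, t):
        # Long-tail entity: augment the prompt with retrieved text.
        context = retriever(question)
        return lm(question, context=context)
    # Popular entity: trust parametric knowledge, saving the
    # retrieval call and the longer (more expensive) prompt.
    return lm(question, context=None)
```

Skipping retrieval for popular entities is where the latency and API savings come from: the prompt stays short and no retriever call is made.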
In summary, LLMs do memorize a lot now, but they are still not good enough to completely replace non-parametric memories, especially in domains with long-tail distributions. Can we get the best of both worlds? We introduce a simple yet effective method: Adaptive Retrieval. [8/N]
We found that retrieval-augmented LMs (red & green lines) are particularly helpful for questions about less popular entities, where LMs struggle. Conversely, larger models (e.g., GPT-3) even outperform retrieval-augmented models on well-known facts, due to retrieval errors. [7/N]
We show that augmenting LMs with non-parametric memories (retrieved text chunks) helps substantially: GPT-Neo 1.3B assisted by retrieved context outperforms vanilla GPT-3 003! Even for GPT-3, retrieval gives up to 10% accuracy gains. Why is it so effective? [6/N]
We found strong correlations between subject entity popularity and accuracy, indicating that LLMs memorize popular factual knowledge well but fail to memorize less popular knowledge. [4/N]
Surprisingly, in long-tail distributions, scaling LLMs may not be as helpful as we believed: GPT-3 003 performs nearly as poorly as GPT-Neo 2B on less popular entities 🟦
Prior analyses of knowledge learning often use NQ/TriviaQA 🟥, which may inflate the apparent effectiveness of scaling. [5/N]
To answer these questions, we construct a new large open-domain QA dataset, PopQA, whose questions are grounded in Wikidata and sampled from the long-tail popularity distribution of Wikipedia, enabling fine-grained analysis. We then test 10 LLMs in a zero/few-shot manner. [3/N]