๐Ÿ“ข ๐—จ๐—ž๐—ฃ ๐—Ÿ๐—ฎ๐—ฏ ๐—ฎ๐˜ ๐—œ๐—–๐—Ÿ๐—ฅ๐Ÿฎ๐Ÿฌ๐Ÿฎ๐Ÿฒ ๐Ÿ“ข
Happy to share that our paper has been accepted to #ICLR2026 🎉

๐Ÿ“œ ๐š๐šŽ๐šŸ๐šŽ๐š•๐šŠ: ๐——๐—ฒ๐—ป๐˜€๐—ฒ ๐—ฅ๐—ฒ๐˜๐—ฟ๐—ถ๐—ฒ๐˜ƒ๐—ฒ๐—ฟ ๐—Ÿ๐—ฒ๐—ฎ๐—ฟ๐—ป๐—ถ๐—ป๐—ด ๐˜ƒ๐—ถ๐—ฎ ๐—Ÿ๐—ฎ๐—ป๐—ด๐˜‚๐—ฎ๐—ด๐—ฒ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น๐—ถ๐—ป๐—ด

👥 Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Sherry Tongshuang Wu, Iryna Gurevych, Heinz Koeppl

Dense retrievers are typically trained with costly query–document supervision. Revela opens a different path: it learns retrieval by language modeling, using an in-batch attention mechanism where cross-document attention is guided by the retriever's similarity scores.

Why this is exciting:
• No need for annotated or synthetic query–document pairs
• Strong results on CoIR (code), BRIGHT (reasoning-intensive), and BEIR (general retrieval) benchmarks
• Competitive performance with substantially less training data and compute, and it scales with batch/model size

🔗 Paper: https://arxiv.org/abs/2506.16552
🔗 Code: https://github.com/TRUMANCFY/Revela
🔗 Model: https://huggingface.co/trumancai/Revela-3b

Revela: Dense Retriever Learning via Language Modeling

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers? To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on CoIR and matches them on BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
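To make the core idea concrete, here is a minimal numpy sketch of the in-batch wiring described above: each document in a batch scores every other document with the retriever, and those similarity scores (softmax-normalized) weight how much cross-document context each document sees. In the actual method these weights gate cross-document attention inside the LM during next-token prediction; this toy version simply mixes pooled per-document states. All names, shapes, and the pooled-state simplification are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical batch: B documents, d-dimensional retriever embeddings.
B, d = 4, 8
rng = np.random.default_rng(0)

# One retriever embedding per document, L2-normalized so dot products
# are cosine similarities.
doc_emb = rng.normal(size=(B, d))
doc_emb /= np.linalg.norm(doc_emb, axis=1, keepdims=True)

# In-batch similarity matrix: each document scores every other document.
sim = doc_emb @ doc_emb.T            # (B, B)
np.fill_diagonal(sim, -np.inf)       # a document does not attend to itself

# Retriever-computed weights over the rest of the batch.
weights = softmax(sim, axis=1)       # (B, B), each row sums to 1

# Cross-document context: a similarity-weighted mix of the other
# documents' (stand-in) pooled LM hidden states. Because the LM loss is
# computed on context built through `weights`, gradients flow back into
# the retriever embeddings -- no query-document labels needed.
doc_states = rng.normal(size=(B, d))
cross_ctx = weights @ doc_states     # (B, d)
```

The key design point this sketch illustrates is that the retriever is trained purely as a side effect of language modeling: the similarity scores are the only path by which one document's content can help predict another's tokens, so lowering LM loss pushes the retriever toward useful similarities.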
