Beyond Semantic Similarity
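This paper proposes direct corpus interaction (DCI), in which an agent works directly on the raw corpus with general-purpose terminal tools, moving beyond retrieval through a fixed semantic-similarity interface. DCI needs no embedding model or vector index, adapts flexibly to multi-step reasoning and compound-constraint search, and outperforms existing sparse and dense retrieval methods on several benchmarks. This suggests that in agentic search, retrieval quality depends not only on reasoning ability but also on the resolution of the interface to the corpus. The work points to a new interface-design direction for building AI agents and designing retrieval systems.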
https://arxiv.org/abs/2605.05242
#informationretrieval #agenticsearch #directcorpusinteraction #semanticsearch #llm

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus. In this respect, DCI opens a broader interface-design space for agentic search.
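
To make the idea concrete, here is a minimal sketch of the kind of operations the abstract lists (exact lexical constraints, clue conjunctions, local context checks), written in plain Python rather than shell. It is not the authors' implementation. The function names (`grep_corpus`, `files_matching_all`), the three-file demo corpus, and the two-step search are all hypothetical and chosen only for illustration.

```python
import re
import tempfile
from pathlib import Path


def grep_corpus(root, pattern, context=1):
    """Scan every .txt file under root and return (file_name, line_no, snippet) hits.

    Roughly mimics `grep -n -i -C <context>`: a case-insensitive regex match
    plus surrounding lines, so the caller can inspect local context.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(lines):
            if regex.search(line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                hits.append((path.name, i + 1, "\n".join(lines[lo:hi])))
    return hits


def files_matching_all(root, patterns):
    """Return names of files that contain every pattern (a clue conjunction)."""
    surviving = None
    for pattern in patterns:
        matched = {name for name, _, _ in grep_corpus(root, pattern, context=0)}
        surviving = matched if surviving is None else surviving & matched
    return sorted(surviving or [])


def build_demo_corpus(root):
    """Write a tiny hypothetical corpus used only for this illustration."""
    docs = {
        "a.txt": "The bridge opened in 1932.\nIt was designed by J. Bradfield.",
        "b.txt": "The tower opened in 1932.\nIts architect is unknown.",
        "c.txt": "J. Bradfield also planned a railway.\nWork began in 1915.",
    }
    for name, text in docs.items():
        (Path(root) / name).write_text(text, encoding="utf-8")


def demo():
    with tempfile.TemporaryDirectory() as root:
        build_demo_corpus(root)
        # Step 1: combine two weak clues; neither alone identifies the document.
        candidates = files_matching_all(root, [r"\b1932\b", r"Bradfield"])
        # Step 2: read local context around the match to verify the hypothesis.
        evidence = [h for h in grep_corpus(root, r"Bradfield") if h[0] in candidates]
        return candidates, evidence
```

Calling `demo()` narrows three files to the one that satisfies both clues and returns the lines around the match for verification. No index or embedding is built at any point: every query is a fresh scan of the raw files, so an edited corpus needs no re-indexing. The trade-off is that each query costs time linear in corpus size.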







