Mastodawn

A weighty thematic issue of Information Research has just dropped: 'Artificial Intelligence (AI) in Information Science'. The issue includes 44 papers exploring information seeking in the age of #AI, #information evaluation and use, information #retrieval, trust and security, future research needs, and a lot more. It'll take a while to read them all, but read I must!

https://publicera.kb.se/ir/issue/view/5559 #InformationResearch #InformationScience #InformationRetrieval #LLMs #ArtificialIntelligence

Vol. 31 No. 2 (2026): Information Research: Artificial Intelligence (AI) in Information Science | Information Research an international electronic journal

sayzard 5d ago

Spire: Structure-Preserving Interpretable Retrieval of Evidence

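SPIRE proposes a structure-preserving pipeline for interpretable evidence retrieval from semi-structured documents such as HTML. To resolve the mismatch between document structure and the flat sequence interfaces of existing embedding and generative models, it represents candidates as subdocuments within a document and introduces global and local contextualization techniques to produce interpretable citations. Experiments show that combining structure preservation with contextualization yields higher-quality, more diverse citations under a fixed budget while remaining scalable.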

https://arxiv.org/abs/2604.20849

#informationretrieval #structureddocuments #embedding #contextualization #html

SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives--paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
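
To make the abstract's primitives concrete, here is a minimal sketch of paths, subdocument extraction by pruning, and a crude form of global contextualization. It is not the authors' implementation: the `Node` class, the `HEADINGS` set, the function names, and the toy document are all assumptions made for illustration, and local contextualization, indexing, and re-scoring are left out.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Hypothetical tree node; the paper's actual data model is not given in the abstract.
@dataclass
class Node:
    tag: str
    text: str = ""
    children: list[Node] = field(default_factory=list)

# A path is a tuple of child indices from the root, e.g. (2, 1).
HEADINGS = {"title", "h1", "h2", "h3"}  # assumed set of "scaffolding" tags

def node_at(root: Node, path: tuple) -> Node:
    """Follow child indices from the root to the addressed node."""
    node = root
    for i in path:
        node = node.children[i]
    return node

def extract(root: Node, paths: set, here: tuple = ()) -> Node | None:
    """Prune the tree to the selected paths, keeping ancestors so structure survives."""
    if here in paths:
        return root  # keep a selected subtree whole
    kept = []
    for i, child in enumerate(root.children):
        sub = extract(child, paths, here + (i,))
        if sub is not None:
            kept.append(sub)
    return Node(root.tag, root.text, kept) if kept else None

def globally_contextualize(root: Node, paths: set) -> set:
    """Add headings that precede each selection at every ancestor level."""
    extra = set(paths)
    for path in paths:
        for depth in range(len(path)):
            parent = node_at(root, path[:depth])
            for i, sibling in enumerate(parent.children[: path[depth]]):
                if sibling.tag in HEADINGS:
                    extra.add(path[:depth] + (i,))
    return extra

def flatten(node: Node) -> list:
    """Collect text in document order."""
    out = [node.text] if node.text else []
    for child in node.children:
        out.extend(flatten(child))
    return out

doc = Node("html", children=[
    Node("h1", "Widget Manual"),
    Node("section", children=[Node("h2", "Installation"), Node("p", "Run the installer.")]),
    Node("section", children=[
        Node("h2", "Safety"),
        Node("p", "Unplug before cleaning."),
        Node("p", "Keep away from water."),
    ]),
])
seed = {(2, 1)}  # one sentence-level selection

bare = flatten(extract(doc, seed))
with_context = flatten(extract(doc, globally_contextualize(doc, seed)))
```

In this toy case `bare` holds only the selected sentence, while `with_context` also carries the page title and the section heading, which is the kind of non-local scaffolding the abstract says makes a small citation intelligible.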

sayzard 6d ago

Beyond Semantic Similarity

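This paper goes beyond conventional fixed semantic-similarity retrieval and proposes direct corpus interaction (DCI), in which an agent interacts with the raw corpus directly using general-purpose terminal tools. DCI handles multi-step reasoning and compound-constraint search flexibly without an embedding model or vector index, and outperforms existing sparse and dense retrieval methods on several benchmarks. This suggests that in agentic search, retrieval quality depends not only on reasoning ability but also on the resolution of the interface to the corpus. It points to a new direction in interface design for building AI agents and designing retrieval systems.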

https://arxiv.org/abs/2605.05242

#informationretrieval #agenticsearch #directcorpusinteraction #semanticsearch #llm

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To tackle the limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, with which DCI opens a broader interface-design space for agentic search.
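
As a rough illustration of what searching the raw corpus directly can look like, the sketch below uses only the Python standard library as a stand-in for `grep` and file reads. It is an assumption-laden toy, not the paper's setup: the function names, the two-file corpus, and the fixed two-step search are invented here, whereas in the paper an agent decides which commands to run.

```python
import re
import tempfile
from pathlib import Path

def grep(corpus_dir: Path, pattern: str, context: int = 1):
    """Yield (file name, line number, surrounding lines) per match, similar to `grep -n -C`."""
    rx = re.compile(pattern, re.IGNORECASE)
    for path in sorted(corpus_dir.rglob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for i, line in enumerate(lines):
            if rx.search(line):
                lo, hi = max(0, i - context), min(len(lines), i + context + 1)
                yield path.name, i + 1, lines[lo:hi]

def files_matching_all(corpus_dir: Path, clues: list) -> list:
    """Conjunction of weak clues: keep files in which every clue matches somewhere."""
    hits = None
    for clue in clues:
        names = {name for name, _, _ in grep(corpus_dir, clue, context=0)}
        hits = names if hits is None else hits & names
    return sorted(hits or [])

def demo():
    """Build a tiny throwaway corpus and run a two-step search over it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text(
            "Marie Curie was born in Warsaw.\nShe won two Nobel Prizes.\n", encoding="utf-8")
        (root / "b.txt").write_text(
            "Warsaw is the capital of Poland.\nIt lies on the Vistula.\n", encoding="utf-8")
        # Step 1: narrow the corpus with an exact lexical conjunction.
        candidates = files_matching_all(root, [r"Warsaw", r"Nobel"])
        # Step 2: check the local context around one clue.
        evidence = list(grep(root, r"Nobel", context=1))
        return candidates, evidence

candidates, evidence = demo()
```

Nothing is indexed ahead of time: each step reads the files as they are, which is why this style of access adapts to a changing local corpus, at the cost of a scan per query.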

Djoerd Hiemstra 🍉 May 3

RE: https://researchbuzz.masto.host/@mottg/116474491390856287

Nice, the state of Information Retrieval in 2026 by Mohan Krishna. Lots of interesting references and thoughts, if you're into leaderboards and state-of-the-art performance on benchmark test collections. #InformationRetrieval

MottG Apr 27

"The State of Information Retrieval in 2026"

This is the best survey article I have seen in a long time in this niche.

The dominant retriever in 2026 is an 8-billion-parameter decoder-only language model fine-tuned on synthetic data, conditioned on natural-language instructions, often executing chain-of-thought reasoning before deciding what to retrieve.

https://medium.com/@mohankrishnagr08/the-state-of-information-retrieval-in-2026-192f125a5269

#research #informationRetrieval #RAG #LLM #SPLADE #AIbenchmark #AI

IRRJ Apr 16

Published at #IRRJ: "Simple Techniques for Efficient Top-k Batch Query Processing" by Zhixuan Li and Joel Mackenzie. #BatchProcessing, #Caching, #DynamicPruning, #EfficientRetrieval, #InformationRetrieval

https://doi.org/10.54195/irrj.23893

Simple Techniques for Efficient Top-k Batch Query Processing | Information Retrieval Research

IRRJ Apr 13

Looking back at #ECIR2026: Jaap Kamps presented the #IRRJ paper "Effectiveness of In-Context Learning for Due Diligence": https://doi.org/10.54195/irrj.22626 #InformationRetrieval

sayzard Apr 13

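PageIndex is a vectorless RAG framework that drops vector DBs and artificial chunking, indexes a document as a hierarchical table-of-contents tree, and uses an LLM for reasoning-based navigation. Like a human expert, it finds the exact relevant sections through multi-step reasoning, reaching 98.7% on FinanceBench with Mafin 2.5. Available as open source, an API, and self-hosted, with LiteLLM and OpenAI Agents SDK integration.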

https://github.com/VectifyAI/PageIndex

#pageindex #rag #vectorless #reasoning #informationretrieval

GitHub - VectifyAI/PageIndex: 📑 PageIndex: Document Index for Vectorless, Reasoning-based RAG

sayzard Apr 4

Akshay (@akshay_pachaar)

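Introduces eight RAG architectures for AI engineers, explaining each by use case, from Naive RAG to various patterns combining retrieval and generation. A practical reference for developers thinking through RAG system design and implementation.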

https://x.com/akshay_pachaar/status/2040050430890405931

#rag #llm #aiengineering #informationretrieval

Akshay 🚀 (@akshay_pachaar) on X

8 RAG architectures for AI Engineers: (explained with usage) 1) Naive RAG - Retrieves documents purely based on vector similarity between the query embedding and stored embeddings. - Works best for simple, fact-based queries where direct semantic matching suffices. 2)
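
The retrieval step of Naive RAG as described above (ranking purely by vector similarity between the query embedding and stored embeddings) can be sketched in a few lines. The three-dimensional vectors and document ids below are made up for illustration; a real system would get its vectors from an embedding model, which is not shown here.

```python
import math

def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec: list, index: dict, k: int = 2) -> list:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(index.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy "stored embeddings" (hypothetical values, not from any model).
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.1, 0.9, 0.0],
    "doc_tax": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.3, 0.0]  # assumed embedding of the user's question

retrieved = top_k(query_vec, index, k=2)
```

The retrieved documents would then be placed in the prompt for generation. The ranking depends only on the vectors, with no lexical or structural signal.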

HackerNoon Apr 3

A practical look at how search indexing is evolving to hybrid retrieval systems that support semantic search, vector search, and AI-driven query understanding. https://hackernoon.com/from-inverted-indexes-to-hybrid-retrieval-rethinking-search-architecture #informationretrieval

From Inverted Indexes to Hybrid Retrieval: Rethinking Search Architecture | HackerNoon
