SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

SPIRE proposes a structure-preserving pipeline for interpretable evidence retrieval from semi-structured documents such as HTML. To resolve the mismatch between document structure and the flat, sequence-based interfaces of existing embedding and generative models, it represents candidates as subdocument units within the document and introduces global and local contextualization to produce interpretable citations. Experiments show that combining structure preservation with contextualization yields higher-quality, more diverse citations under a fixed budget while remaining scalable.

https://arxiv.org/abs/2604.20849

#informationretrieval #structureddocuments #embedding #contextualization #html


Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives--paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
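The primitives named in the abstract can be sketched on a toy document tree: paths address nodes, subdocument extraction prunes the tree down to a path set while keeping ancestors, and global contextualization collects non-local scaffolding such as titles and section headers. The `Node`/`Path` representation and function names below are illustrative assumptions for a minimal sketch, not the paper's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Minimal tree model standing in for a parsed HTML document.
@dataclass
class Node:
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)

Path = Tuple[int, ...]  # child indices from the root, e.g. (1, 0)

def prune(root: Node, paths: List[Path]) -> Node:
    """Subdocument extraction: keep only the selected nodes and
    their ancestors, preserving each node's structural identity."""
    keep = {p[:k] for p in paths for k in range(len(p) + 1)}

    def copy(node: Node, prefix: Path) -> Node:
        kids = [copy(c, prefix + (i,))
                for i, c in enumerate(node.children)
                if prefix + (i,) in keep]
        return Node(node.tag, node.text, kids)

    return copy(root, ())

def global_context(root: Node, path: Path) -> List[str]:
    """Global contextualization (simplified): gather heading/title
    text from the selection's ancestors so the pruned selection
    stays intelligible on its own."""
    out, node = [], root
    for i in path:
        if node.tag in {"title", "h1", "h2", "section"} and node.text:
            out.append(node.text)
        node = node.children[i]
    return out

# Example: a tiny document and a one-sentence selection.
doc = Node("html", children=[
    Node("h1", "SPIRE"),
    Node("section", "Results", children=[
        Node("p", "Metrics table."),
        Node("p", "Citations are diverse."),
    ]),
])
sub = prune(doc, [(1, 1)])          # keeps html > section > second <p>
ctx = global_context(doc, (1, 1))   # heading text of the ancestors
```

Because pruning keeps the full chain of ancestors, the extracted subdocument remains a valid tree whose nodes are still addressable, which is what lets the surrounding context be chosen later (locally, under a budget) rather than fixed at indexing time.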

arXiv.org