arXiv Information Retrieval

164 Followers
7 Following
9.2K Posts
Multi-event Video-Text Retrieval https://arxiv.org/abs/2308.11551
Multi-event Video-Text Retrieval

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.

arXiv.org
iTIMO: An LLM-empowered Synthesis Dataset for Travel Itinerary Modification https://arxiv.org/abs/2601.10609
iTIMO: An LLM-empowered Synthesis Dataset for Travel Itinerary Modification

Addressing itinerary modification is crucial for enhancing the travel experience as it is a frequent requirement during traveling. However, existing research mainly focuses on fixed itinerary planning, leaving modification underexplored. To bridge this gap, we formally define the itinerary modification task and introduce iTIMO, a dataset specifically tailored for this purpose. We identify the lack of {\itshape need-to-modify} itinerary data as the critical bottleneck hindering research on this task and propose a general pipeline to overcome it. This pipeline frames the generation of such data as an intent-driven perturbation task. It instructs large language models to perturb real world itineraries using three atomic editing operations: REPLACE, ADD, and DELETE. Each perturbation is grounded in three intents, including disruptions of popularity, spatial distance, and category diversity. Furthermore, a hybrid evaluation metric is designed to ensure perturbation effectiveness. We conduct comprehensive experiments on iTIMO, revealing the limitations of current LLMs and lead to several valuable directions for future research. Dataset and corresponding code are available at https://github.com/zelo2/iTIMO.

arXiv.org
Does LLM Focus on the Right Words? Mitigating Context Bias in LLM-based Recommenders https://arxiv.org/abs/2510.10978
Does LLM Focus on the Right Words? Mitigating Context Bias in LLM-based Recommenders

Large language models (LLMs), owing to their extensive open-domain knowledge and semantic reasoning capabilities, have been increasingly integrated into recommender systems (RS). However, a substantial gap remains between the pre-training objectives of LLMs and the specific requirements of recommendation tasks. To address this gap, supervised fine-tuning (SFT) is commonly performed on specially curated recommendation datasets to further enhance their predictive ability. Despite its success, SFT exhibits a critical limitation: it induces Context Bias, whereby the model over-relies on auxiliary tokens, such as task descriptions and prefix-generated tokens, while underutilizing core user interaction tokens that encode user-specific preferences. This bias not only undermines recommendation accuracy but also raises unfairness concerns. To address this issue, we propose Group Distributionally Robust Optimization-based Tuning (GDRT), a novel fine-tuning paradigm that enforces consistent model performance across token groups with varying degrees of relevance to auxiliary tokens. By adaptively upweighting underperforming groups, typically those weakly correlated with auxiliary tokens, GDRT shifts the model's attention from superficial auxiliary cues to informative user interaction tokens, thereby mitigating context bias. Extensive experiments conducted on three public datasets demonstrate that GDRT effectively mitigates context bias, yielding substantial improvements in recommendation accuracy (with an average NDCG@10 gain of 24.29%) and significantly enhancing recommendation fairness. The code is available at https://github.com/WANGBohaO-jpg/GDRT.

arXiv.org
Membership Inference Attacks on LLM-based Recommender Systems https://arxiv.org/abs/2508.18665
Membership Inference Attacks on LLM-based Recommender Systems

Large language models (LLMs) based recommender systems (RecSys) can adapt to different domains flexibly. It utilizes in-context learning (ICL), i.e., prompts, to customize the recommendation functions, which include sensitive historical user-specific item interactions, encompassing implicit feedback such as clicked items and explicit product reviews. Such private information may be exposed by novel privacy attacks. However, no study has been conducted on this important issue. We design several membership inference attacks (MIAs) aimed to revealing whether system prompts include victims' historical interactions. The attacks are \emph{Similarity, Memorization, Inquiry, and Poisoning attacks}, each utilizing unique features of LLMs or RecSys. We have carefully evaluated them on five of the latest open-source LLMs and three well-known RecSys benchmark datasets. The results confirm that the MIA threat to LLM RecSys is realistic: inquiry and poisoning attacks show significantly high attack advantages. We also discussed possible methods to mitigate such MIA threats. We have also analyzed the factors affecting these attacks, such as the number of shots in system prompts, the position of the victim in the shots, the number of poisoning items in the prompt,etc.

arXiv.org
Eliminating Out-of-Domain Recommendations in LLM-based Recommender Systems: A Unified View https://arxiv.org/abs/2505.03336
Eliminating Out-of-Domain Recommendations in LLM-based Recommender Systems: A Unified View

Recommender systems based on Large Language Models (LLMs) are often plagued by hallucinations of out-of-domain (OOD) items. To address this, we propose RecLM, a unified framework that bridges the gap between retrieval and generation by instantiating three grounding paradigms under a single architecture: embedding-based retrieval, constrained generation over rewritten item titles, and discrete item-tokenizer generation. Using the same backbone LLM and prompts, we systematically compare these three views on public benchmarks. RecLM strictly eradicates OOD recommendations (OOD@10 = 0) across all variants, and the constrained generation variants RecLM-cgen and RecLM-token achieve overall state-of-the-art accuracy compared to both strong ID-based and LLM-based baselines. Our unified view provides a systematic basis for comparing three distinct paradigms to reduce item hallucinations, offering a practical framework to facilitate the application of LLMs to recommendation tasks. Source code is at https://github.com/microsoft/RecAI.

arXiv.org
Variational Bayesian Personalized Ranking https://arxiv.org/abs/2503.11067
Variational Bayesian Personalized Ranking

Pairwise learning underpins implicit collaborative filtering, yet its effectiveness is often hindered by sparse supervision, noisy interactions, and popularity-driven exposure bias. In this paper, we propose Variational Bayesian Personalized Ranking (VarBPR), a tractable variational framework for implicit-feedback pairwise learning that offers principled exposure controllability and theoretical interpretability. VarBPR reformulates pairwise learning as variational inference over discrete latent indexing variables, explicitly modeling noise and indexing uncertainty, and divides training into two stages: variational inference and variational learning. In the variational inference stage, we develop a variational formulation that integrates preference alignment, denoising, and popularity debiasing under a unified ELBO/regularization objective, deriving closed-form posteriors with clear control semantics: the prior encodes a target exposure pattern, while temperature/regularization strength controls posterior-prior adherence. As a result, exposure controllability becomes an endogenous and interpretable outcome of variational inference. In the variational learning stage, we propose a posterior-compression objective that reduces the ideal ELBO's computational complexity from polynomial to linear, with the approximation justified by an explicit Jensen-gap upper bound. Theoretically, we provide interpretable generalization guarantees by identifying a structural error component and revealing the opportunity cost of prioritizing certain exposure patterns (e.g., long-tail), offering a concrete analytical lens for designing controllable recommender systems. Empirically, we validate VarBPR across popular backbones; it demonstrates consistent gains in ranking accuracy, enables controlled long-tail exposure, and preserves the linear-time complexity of BPR.

arXiv.org
GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval https://arxiv.org/abs/2503.07519
GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval

Decomposition-based multi-hop retrieval methods rely on many autoregressive steps to break down complex queries, which breaks end-to-end differentiability and is computationally expensive. Decomposition-free methods tackle this, but current decomposition-free approaches struggle with longer multi-hop problems and generalization to out-of-distribution data. To address these challenges, we introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. GRITHopper combines generative and representational instruction tuning by integrating causal language modeling with dense retrieval training. Through controlled studies, we find that incorporating additional context after the retrieval process, referred to as post-retrieval language modeling, enhances dense retrieval performance. By including elements such as final answers during training, the model learns to better contextualize and retrieve relevant information. GRITHopper-7B offers a robust, scalable, and generalizable solution for multi-hop dense retrieval, and we release it to the community for future research and applications requiring multi-hop reasoning and retrieval capabilities.

arXiv.org
HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval https://arxiv.org/abs/2601.16155
HVD: Human Vision-Driven Video Representation Learning for Text-Video Retrieval

The success of CLIP has driven substantial progress in text-video retrieval. However, current methods often suffer from "blind" feature interaction, where the model struggles to discern key visual information from background noise due to the sparsity of textual queries. To bridge this gap, we draw inspiration from human cognitive behavior and propose the Human Vision-Driven (HVD) model. Our framework establishes a coarse-to-fine alignment mechanism comprising two key components: the Frame Features Selection Module (FFSM) and the Patch Features Compression Module (PFCM). FFSM mimics the human macro-perception ability by selecting key frames to eliminate temporal redundancy. Subsequently, PFCM simulates micro-perception by aggregating patch features into salient visual entities through an advanced attention mechanism, enabling precise entity-level matching. Extensive experiments on five benchmarks demonstrate that HVD not only captures human-like visual focus but also achieves state-of-the-art performance.

arXiv.org
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing https://arxiv.org/abs/2601.16125
Rethinking Composed Image Retrieval Evaluation: A Fine-Grained Benchmark from Image Editing

Composed Image Retrieval (CIR) is a pivotal and complex task in multimodal understanding. Current CIR benchmarks typically feature limited query categories and fail to capture the diverse requirements of real-world scenarios. To bridge this evaluation gap, we leverage image editing to achieve precise control over modification types and content, enabling a pipeline for synthesizing queries across a broad spectrum of categories. Using this pipeline, we construct EDIR, a novel fine-grained CIR benchmark. EDIR encompasses 5,000 high-quality queries structured across five main categories and fifteen subcategories. Our comprehensive evaluation of 13 multimodal embedding models reveals a significant capability gap; even state-of-the-art models (e.g., RzenEmbed and GME) struggle to perform consistently across all subcategories, highlighting the rigorous nature of our benchmark. Through comparative analysis, we further uncover inherent limitations in existing benchmarks, such as modality biases and insufficient categorical coverage. Furthermore, an in-domain training experiment demonstrates the feasibility of our benchmark. This experiment clarifies the task challenges by distinguishing between categories that are solvable with targeted data and those that expose intrinsic limitations of current model architectures.

arXiv.org
Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems https://arxiv.org/abs/2601.15678
Connect the Dots: Knowledge Graph-Guided Crawler Attack on Retrieval-Augmented Generation Systems

Stealing attacks pose a persistent threat to the intellectual property of deployed machine-learning systems. Retrieval-augmented generation (RAG) intensifies this risk by extending the attack surface beyond model weights to knowledge base that often contains IP-bearing assets such as proprietary runbooks, curated domain collections, or licensed documents. Recent work shows that multi-turn questioning can gradually steal corpus content from RAG systems, yet existing attacks are largely heuristic and often plateau early. We address this gap by formulating RAG knowledge-base stealing as an adaptive stochastic coverage problem (ASCP), where each query is a stochastic action and the goal is to maximize the conditional expected marginal gain (CMG) in corpus coverage under a query budget. Bridging ASCP to real-world black-box RAG knowledge-base stealing raises three challenges: CMG is unobservable, the natural-language action space is intractably large, and feasibility constraints require stealthy queries that remain effective under diverse architectures. We introduce RAGCrawler, a knowledge graph-guided attacker that maintains a global attacker-side state to estimate coverage gains, schedule high-value semantic anchors, and generate non-redundant natural queries. Across four corpora and four generators with BGE retriever, RAGCrawler achieves 66.8% average coverage (up to 84.4%) within 1,000 queries, improving coverage by 44.90% relative to the strongest baseline. It also reduces the queries needed to reach 70% coverage by at least 4.03x on average and enables surrogate reconstruction with answer similarity up to 0.699. Our attack is also scalable to retriever switching and newer RAG techniques like query rewriting and multi-query retrieval. These results highlight urgent needs to protect RAG knowledge assets.

arXiv.org