Romain Deffayet

PhD candidate Naver Labs Europe x UvA
RL and Counterfactual L2R for Recommender Systems
Location: French Alps
Website: https://www.deffayet.cc
Our code is open-source (https://github.com/philipphager/sigir-cmip), and we release a standalone implementation of CMIP to help learning-to-rank practitioners use it in their own scenarios (https://github.com/philipphager/cmip).
GitHub - philipphager/sigir-cmip: An Offline Metric for the Debiasedness of Click Models - SIGIR 2023
In the paper, we explain CMIP in more detail and systematically assess its reliability for click model evaluation.
CMIP, like nDCG, requires relevance annotations, which can be obtained, e.g., via human labeling or randomized traffic (we leave the latter for future work).
By computing the conditional mutual information between these predicted scores, given relevance, our metric CMIP measures how debiased a model is.
While debiasedness alone is not sufficient (a random click model is debiased), CMIP helps to discard biased models and predict out-of-distribution metrics.
We defined the "debiasedness" of a click model as the independence of its predicted scores from those of the logging policy (LP), conditional on relevance labels. In other words, a model is debiased if one cannot guess its predictions by revealing where the LP placed the document in question.
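In symbols (my notation here, not necessarily the paper's: write $s_M$ for the click model's predicted scores, $s_\pi$ for the logging policy's scores, and $r$ for the relevance label):

```latex
s_M \perp\!\!\!\perp s_\pi \mid r
\quad\Longleftrightarrow\quad
I(s_M ; s_\pi \mid r) = 0
```

CMIP measures how far this conditional mutual information is from zero: the larger $I(s_M; s_\pi \mid r)$, the more the model's scores can be predicted from the logging policy beyond what relevance explains.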
That's the idea of our metric: analyze the correlations between relevance scores predicted by the logging policy (the reference classmate) and by the candidate click model. If they are correlated beyond what the true relevance signal explains, the model has failed to debias the logged data.
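A minimal sketch of that check, assuming discrete relevance grades and a simple equal-width binning estimator of conditional mutual information (variable names are mine; the paper's actual CMIP estimator may bin and normalize differently):

```python
import numpy as np

def conditional_mutual_information(x, y, z, bins=8):
    """Empirical I(X; Y | Z) in nats, binning x and y and treating z as discrete."""
    # Discretize the continuous scores into equal-width bins.
    x = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    y = np.digitize(y, np.histogram_bin_edges(y, bins=bins)[1:-1])
    cmi = 0.0
    for zv in np.unique(z):
        mask = z == zv
        pz = mask.mean()                       # p(z)
        joint, _, _ = np.histogram2d(x[mask], y[mask], bins=(bins, bins))
        joint /= joint.sum()                   # p(x, y | z)
        px = joint.sum(axis=1, keepdims=True)  # p(x | z)
        py = joint.sum(axis=0, keepdims=True)  # p(y | z)
        nzero = joint > 0
        cmi += pz * np.sum(joint[nzero] * np.log(joint[nzero] / (px @ py)[nzero]))
    return cmi

rng = np.random.default_rng(0)
r = rng.integers(0, 3, size=5000)                     # relevance grades
s_policy = r + rng.normal(0, 1.0, size=5000)          # logging-policy scores
s_debiased = r + rng.normal(0, 1.0, size=5000)        # tied to the policy only via relevance
s_biased = s_policy + rng.normal(0, 0.1, size=5000)   # copies the policy's "mistakes"

print(conditional_mutual_information(s_debiased, s_policy, r))  # small: nothing beyond relevance
print(conditional_mutual_information(s_biased, s_policy, r))    # large: correlated beyond relevance
```

Once relevance is known, a debiased model's scores carry almost no extra information about the policy's, so the estimate stays near zero; the "cheating" model's does not.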
Comparing their grades doesn't work, because a cheater could match the classmate just by copying. They could even reach a better grade by combining cheating with their own knowledge!
You must compare the mistakes to spot the cheaters: those making the same mistakes as the classmate cheated.
Picture students taking an exam, where they have the opportunity to cheat by copying from a good classmate. How do you tell apart those who truly understood the course from those who cheated?

Remember when we analyzed lots of click models and found that many of them cannot predict accurate click probabilities on unseen rankings?
Well, we found a way to detect this failure without actually deploying these models to the downstream task.

https://sigmoid.social/@romain/109307116028508208

#PaperThread #ULTR Together with Jean-Michel Renders and Maarten de Rijke, we investigated how click models *actually* perform (spoiler: not so great), and whether our offline metrics capture this (spoiler 2: they don't). Here's what it means for unbiased learning-to-rank researchers and practitioners ⬇️

One can't judge a click model only by how well it ranks documents; we also need to make sure it actively identified and removed the biases hidden in the logged data.

That's what we showed in our recent #SIGIR23 paper with Philipp Hager, Jean-Michel Renders and Maarten de Rijke.

https://arxiv.org/abs/2304.09560

#PaperThread #ULTR #ClickModels #IR

An Offline Metric for the Debiasedness of Click Models

A well-known problem when learning from user clicks is the inherent biases prevalent in the data, such as position or trust bias. Click models are a common method for extracting information from user clicks, such as document relevance in web search, or to estimate click biases for downstream applications such as counterfactual learning-to-rank, ad placement, or fair ranking. Recent work shows that the current evaluation practices in the community fail to guarantee that a well-performing click model generalizes well to downstream tasks in which the ranking distribution differs from the training distribution, i.e., under covariate shift. In this work, we propose an evaluation metric based on conditional independence testing to detect a lack of robustness to covariate shift in click models. We introduce the concept of debiasedness in click modeling and derive a metric for measuring it. In extensive semi-synthetic experiments, we show that our proposed metric helps to predict the downstream performance of click models under covariate shift and is useful in an off-policy model selection setting.

We therefore encourage researchers to use other methods to evaluate RL recommenders. While all of these methods have drawbacks that we detail in the paper, we notably recommend counterfactual off-policy evaluation (OPE) methods such as IPS, semi-synthetic simulators, and uncertainty-aware evaluation.