By computing the conditional mutual information between these quantities, our metric, CMIP, measures how debiased a model is.
While debiasedness alone is not sufficient (a random click model is debiased), CMIP helps discard biased models and predict out-of-distribution metrics.
That's the idea behind our metric: analyze the correlations between the relevance scores predicted by the logging policy (the reference classmate) and those predicted by the candidate click model. If they are correlated beyond what the true relevance signal explains, the model has failed to debias the logged data.
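To make the intuition concrete, here is a minimal plug-in sketch of estimating conditional mutual information I(A; B | C) by discretizing the scores into bins and counting joint frequencies. The function name, binning scheme, and equal-width buckets are illustrative assumptions, not the paper's actual CMIP implementation:

```python
import numpy as np

def conditional_mutual_information(a, b, c, bins=5):
    """Plug-in estimate of I(A; B | C) from samples.

    a: logging-policy relevance scores, b: candidate click-model
    scores, c: true relevance signal (all 1-D arrays of equal
    length). Each variable is binned into `bins` equal-width
    buckets; the estimate is the CMI of the empirical joint
    distribution over the bins.
    """
    def discretize(x):
        x = np.asarray(x, dtype=float)
        edges = np.linspace(x.min(), x.max(), bins + 1)
        return np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)

    a, b, c = map(discretize, (a, b, c))
    n = len(a)
    # Empirical joint distribution p(a, b, c)
    p_abc = np.zeros((bins, bins, bins))
    for i, j, k in zip(a, b, c):
        p_abc[i, j, k] += 1.0 / n
    p_ac = p_abc.sum(axis=1)       # p(a, c)
    p_bc = p_abc.sum(axis=0)       # p(b, c)
    p_c = p_abc.sum(axis=(0, 1))   # p(c)

    # I(A; B | C) = sum p(a,b,c) * log[ p(c) p(a,b,c) / (p(a,c) p(b,c)) ]
    cmi = 0.0
    for i in range(bins):
        for j in range(bins):
            for k in range(bins):
                if p_abc[i, j, k] > 0:
                    cmi += p_abc[i, j, k] * np.log(
                        p_c[k] * p_abc[i, j, k] / (p_ac[i, k] * p_bc[j, k])
                    )
    return cmi
```

A click model whose scores track the logging policy beyond the true relevance signal yields a high value, flagging residual bias; a well-debiased model yields a value near zero.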
Among the 24 papers performing offline evaluation of RL-based RecSys, 22 used this protocol. But we argue that (1) it is myopic and does not account for long-term outcomes, (2) what it treats as ground truth is suboptimal, and (3) it hides deficiencies of RL agents trained offline.