TIL: using next-token log-probs for classification is brittle. Tokenization and token priors skew the comparison: "Yes" and "No" can split into different numbers of tokens and start from different base rates, so their raw scores aren't directly comparable. Fixes: pick labels you've verified are single tokens (e.g. 0/1), or score the full label strings and calibrate with label swaps.
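Rough sketch of the "score full strings + label swap" idea, not a definitive recipe. Everything here is made up for illustration: `score_classes`, `swap_calibrated`, `make_prompt`, and especially `toy_scorer`, which stands in for whatever model call gives you per-token log-probs and is rigged with a prior toward "Yes" while the true class is "negative".

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer signature: (prompt, label tokens) -> per-token log-probs.
Scorer = Callable[[str, Sequence[str]], List[float]]
Assignment = Dict[str, List[str]]  # class name -> tokens of the label string


def score_classes(scorer: Scorer, prompt: str, assignment: Assignment) -> Dict[str, float]:
    """Score each class by summing log-probs over its full label string."""
    return {cls: sum(scorer(prompt, toks)) for cls, toks in assignment.items()}


def swap_calibrated(scorer: Scorer,
                    make_prompt: Callable[[Assignment], str],
                    assignments: Sequence[Assignment]) -> Dict[str, float]:
    """Average each class's score over several label assignments (swaps)."""
    totals: Dict[str, float] = {}
    for assignment in assignments:
        scores = score_classes(scorer, make_prompt(assignment), assignment)
        for cls, s in scores.items():
            totals[cls] = totals.get(cls, 0.0) + s
    return {cls: t / len(assignments) for cls, t in totals.items()}


def make_prompt(assignment: Assignment) -> str:
    """Toy prompt that only encodes which label stands for which class."""
    return ";".join(f"{cls}={''.join(toks)}" for cls, toks in assignment.items())


def toy_scorer(prompt: str, tokens: Sequence[str]) -> List[float]:
    """Stand-in for a model (assumption): prior favors 'Yes', truth is 'negative'."""
    prior = {"Yes": 2.0, "No": 0.0}
    label_to_class = {}
    for pair in prompt.split(";"):
        cls, label = pair.split("=")
        label_to_class[label] = cls
    logits = {lab: p + (1.0 if label_to_class[lab] == "negative" else 0.0)
              for lab, p in prior.items()}
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return [logits["".join(tokens)] - log_z]  # single-token labels in this toy


original = {"positive": ["Yes"], "negative": ["No"]}
swapped = {"positive": ["No"], "negative": ["Yes"]}

plain = score_classes(toy_scorer, make_prompt(original), original)
calibrated = swap_calibrated(toy_scorer, make_prompt, [original, swapped])

print("uncalibrated pick:", max(plain, key=plain.get))
print("swap-calibrated pick:", max(calibrated, key=calibrated.get))
```

In this toy the "Yes" prior is big enough to win the uncalibrated comparison, and averaging over the swap cancels it. With a real model you'd still want to check how each label string actually tokenizes before trusting the sums.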