NEW BIML Bibliography entry
https://arxiv.org/abs/2408.04667v2
LLM Stability: A detailed analysis with some surprises
Berk Atil et al.
This is terrible science (which, ironically, makes it a good example of how not to do it). It walks directly into the baseline bunker: "Benchmarking does not work so we introduce...a benchmark."
