Interesting #INTERSPEECH2023 paper (I think ;-)) on using MOS tests vs. side-by-side preference tests when comparing TTS systems.
How to choose? Is one more robust/sensitive than the other?
Ever wondered about that?
Here is the paper:
http://tomkenter.nl/pdf/camp2023_comparing_tts_systems_reliably.pdf