It is neat to see Meta AI using LaTeXML productively for arXiv preprocessing in their Nougat OCR work.
Good discussion in "5.2 Text modalities": there is indeed a lot of hidden complexity when recovering TeX input strings.
Rather tempting to wish for a way to normalize to "canonical" expressions...
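
To make the "canonical expression" wish concrete, here is a toy sketch of what such a normalizer might look like. It is purely illustrative and is not how LaTeXML or Nougat handle this: the function name `normalize_tex` and its two rules are my own assumptions, and it ignores `\text{...}`, control spaces, macros, and forms like `\frac ab`.

```python
import re

# Hypothetical helper, for illustration only: it covers two easy cases
# (insignificant math-mode whitespace and unbraced single-character scripts).
def normalize_tex(src: str) -> str:
    # Protect the one space that matters: a control word followed by a letter
    # (e.g. "\alpha x"), by swapping it for a sentinel character.
    out = re.sub(
        r"(\\[A-Za-z]+)\s+(?=[A-Za-z])",
        lambda m: m.group(1) + "\0",
        src,
    )
    # All remaining whitespace is assumed to be ignorable (math mode only).
    out = re.sub(r"\s+", "", out)
    # Restore the protected separators as a single space.
    out = out.replace("\0", " ")
    # Brace single-character scripts: x^2 -> x^{2}, a_i -> a_{i}.
    out = re.sub(r"([_^])([A-Za-z0-9])", r"\1{\2}", out)
    return out
```

With these rules, `x^2` and `x ^ { 2 }` both map to `x^{2}`. Anything beyond such surface variation is where the hidden complexity starts.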
project homepage: https://facebookresearch.github.io/nougat/
arXiv preprint: https://arxiv.org/abs/2308.13418