With GPT tokenizers (like the BPE tokenizers OpenAI uses), does the number of tokens needed to represent a word correlate with that word's frequency in the training data?
In other words, could token counts be used to reverse-engineer word frequencies in the otherwise hidden training data?