Are there papers that tell you how to best train LMs on *small data*? Say less than 100k tokens.

RT @[email protected] (https://nitter.net/mariusmosbach/status/1648695154654556161)

Marius Mosbach (@mariusmosbach)

Are there papers that tell you how to best train LMs on *small data*? Say less than 100k tokens.

Nitter

@_dmh The BabyLM challenge has participants training LMs on small data, which seems like a good match (goal is <100M words). It is running this spring/summer, so you could look at the results/papers or participate!

https://babylm.github.io/
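
(If you want to sanity-check a corpus against a track budget, here's a minimal sketch; the directory layout, file pattern, and whitespace word-splitting are my assumptions, not anything from the challenge rules.)

```python
# Rough word count for a plain-text corpus, checked against a
# BabyLM-style budget (100M words for the main track; swap in
# 10_000_000 for the smaller track). Whitespace splitting is a
# crude stand-in for whatever counting the organizers actually use.
from pathlib import Path

BUDGET = 100_000_000  # words

def count_words(corpus_dir: str) -> int:
    total = 0
    for path in Path(corpus_dir).glob("*.txt"):  # hypothetical layout
        with open(path, encoding="utf-8") as f:
            for line in f:
                total += len(line.split())
    return total

n = count_words("corpus/")  # hypothetical directory name
print(f"{n:,} words ({n / BUDGET:.1%} of budget)")
```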

@par Lol, 100M tokens as small data--100 times the size of the Brown Corpus. I love it.

In NLG I'm looking at like... 1 million tokens in a large dataset

@_dmh I believe there is a 10M-token track as well, but in general the lower scale makes it easier to do scaling studies since it's cheaper. I could easily see some cool papers coming out of the workshop.
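
(To make the scaling-studies point concrete, a toy sketch of the kind of fit you could run cheaply at this scale: a power law over validation losses measured at a few token budgets. Every number here is invented for illustration, and the irreducible-loss floor is an assumption.)

```python
# Toy scaling-law fit: loss ~= a * tokens**b + c, fit by linear
# regression in log-log space with the floor c held fixed.
# All values below are made up purely for illustration.
import numpy as np

tokens = np.array([1e6, 3e6, 1e7, 3e7, 1e8])  # training tokens per run
loss = np.array([5.1, 4.6, 4.1, 3.7, 3.4])    # invented val losses
floor = 2.0                                   # assumed irreducible loss

# log(loss - c) = b * log(tokens) + log(a)
b, log_a = np.polyfit(np.log(tokens), np.log(loss - floor), 1)
a = np.exp(log_a)
print(f"loss ~= {a:.2f} * tokens^{b:.3f} + {floor}")

# Extrapolate (cautiously!) to a larger budget:
print(f"predicted loss at 1B tokens: {a * 1e9**b + floor:.2f}")
```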
@par I appreciate the tips for sure. Thank you! :)