Lucas Beyer

still setting things up and figuring stuff out
At day: Researcher at Google Brain in Zürich
At night: Gamer, Hacker, Belgian
Old website: http://lucasb.eyer.be
Twitter: @giffmana

Cool numerics rule of thumb shared by @nan on Twitter today:

Computing inverse p-th root of a matrix "A^{-1/p}" will lose about log2(K/p) bits of precision,

where K is the condition number (largest over smallest eigenvalue).

So, a condition number of 10^6 (not at all uncommon in DL) loses ~19 bits, leaving you with only 4 of fp32's 23 mantissa bits.

That's why higher-order optimizers like Shampoo need very careful implementation (or very strong "regularization", which may help or hurt overall)
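The rule of thumb is easy to check empirically. A minimal sketch (my own, not from the thread): build a symmetric positive-definite matrix with condition number 10^6, compute its inverse square root in fp32 and fp64, and count how many bits of precision survive.

```python
import numpy as np

def inv_p_root(A, p, dtype):
    """Inverse p-th root A^{-1/p} via eigendecomposition, at the given precision."""
    A = A.astype(dtype)
    w, V = np.linalg.eigh(A)
    return (V * w ** (-1.0 / p)) @ V.T  # V diag(w^{-1/p}) V^T

# SPD matrix with condition number K = 1e6 in a random orthogonal basis.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
eigs = np.geomspace(1.0, 1e6, 64)
A = (Q * eigs) @ Q.T

ref = inv_p_root(A, p=2, dtype=np.float64)  # "ground truth" in fp64
est = inv_p_root(A, p=2, dtype=np.float32)  # same computation in fp32

rel_err = np.linalg.norm(est - ref) / np.linalg.norm(ref)
print(f"relative error: {rel_err:.2e}")
print(f"bits of precision left: {-np.log2(rel_err):.1f} of fp32's 23")
```

On this example the surviving precision lands in the single digits of bits, in line with the ~4-bit figure above.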

8/N results

left: overall, it's getting pretty close to original BERT which used 45-136x more total FLOPS (4d on 16 TPUs)
right: and when training for 16x longer (2d on 8 GPUs), the same recipe actually improves on original BERT quite a bit, reaching RoBERTa levels of performance.

7/N data

- Try pile subsets, c4, book+wiki
- dedup (exact substring) not helpful
- remove uncompressible data "t=0.3": keep only if ntokens < 0.3 * nchars
- sort: data with frequent tokens first (think "easy/common text first")
- grow batch-size at end
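The "t=0.3" filter above fits in a few lines. A sketch of my reading of it (the `tokenize` argument stands in for whatever tokenizer the recipe uses): drop documents that cost too many tokens per character, since hard-to-compress text (gibberish, dumps) tokenizes inefficiently.

```python
def keep_document(text: str, tokenize, t: float = 0.3) -> bool:
    """Keep a document only if ntokens < t * nchars, per the thread."""
    ntokens = len(tokenize(text))
    nchars = len(text)
    return nchars > 0 and ntokens < t * nchars

# Toy illustration with whitespace "tokenization": normal prose has roughly
# one token per ~5 characters (ratio ~0.2), so it passes the t=0.3 cut.
prose = "the quick brown fox jumps over the lazy dog"
assert keep_document(prose, str.split)           # ratio ~0.21 -> keep
assert not keep_document("a b c d e", str.split) # ratio ~0.56 -> drop
```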

addendum to 6b: the figure from my screenshot is in the appendix. Other papers have shown even pre-norm needs warmup.

6c training:
- no dropout, tokendrop, or length curriculum.
- micro-batches of 96, accumulated into an effective batch of 1.5k-4k, linearly increased during training. Auto-tuning looks mostly linear.
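One way to read that accumulation scheme (the linear ramp is my interpretation of the tweet, not the paper's code): fixed micro-batches of 96, with the number of accumulated micro-batches growing so the effective batch ramps linearly from ~1.5k to ~4k over training.

```python
MICRO = 96  # micro-batch size from the thread

def accum_steps(step: int, total_steps: int,
                start_batch: int = 1536, end_batch: int = 4096) -> int:
    """How many micro-batches to accumulate into one update at this step."""
    frac = step / max(total_steps - 1, 1)
    target = start_batch + frac * (end_batch - start_batch)  # linear ramp
    return max(1, round(target / MICRO))

print(accum_steps(0, 1000))    # start of training: 1536 / 96 = 16
print(accum_steps(999, 1000))  # end of training: ~4096 / 96 -> 43
```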

6b training: lr schedule!

They tried many, but this is where I disagree with the paper.
Most schedules either don't warmup (-> lower peak lr!) or don't cooldown (-> 0 at the end).

The only two that work clearly better than the rest are the only two with warmup and cooldown!
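The shape the thread endorses is simple to write down. A sketch with placeholder values (peak LR and step counts are illustrative, not the paper's): linear warmup to a peak, then cooldown to exactly 0 at the final step.

```python
def lr(step: int, total: int, warmup: int = 1000, peak: float = 1e-3) -> float:
    """Linear warmup to `peak`, then linear cooldown to 0 at `total`."""
    if step < warmup:
        return peak * step / warmup               # warmup: 0 -> peak
    return peak * (total - step) / (total - warmup)  # cooldown: peak -> 0

assert lr(0, 10_000) == 0.0
assert lr(1_000, 10_000) == 1e-3          # hits the peak after warmup
assert abs(lr(10_000, 10_000)) < 1e-12    # reaches exactly 0 at the end
```

Warmup lets you use a higher peak LR; cooldown to 0 recovers the performance that a constant tail leaves on the table, which is exactly the thread's point.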

6a/N training

- Stick to the simplest MLM objective
- Optimizer: Adam. No win from fancier.
- I want to point out that AdaFactor is meant to save memory but behave like Adam, so no win is a win!
- They mention no win from Shampoo (cc @nan) but aren't confident it's a good impl.
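For reference, the "simplest MLM objective" can be sketched as: mask a fraction of tokens uniformly and predict the originals, with none of BERT's 80/10/10 split. This is my reading of "simplest", not the paper's code; `MASK_ID` and the -100 ignore-label are placeholder conventions.

```python
import numpy as np

MASK_ID = 103  # placeholder [MASK] token id

def mlm_mask(token_ids, rng, rate=0.15):
    """Replace `rate` of tokens with [MASK]; labels hold the originals."""
    ids = np.asarray(token_ids).copy()
    labels = np.full_like(ids, -100)   # -100 = position ignored by the loss
    pick = rng.random(ids.shape) < rate
    labels[pick] = ids[pick]           # predict the original tokens
    ids[pick] = MASK_ID
    return ids, labels

rng = np.random.default_rng(0)
ids, labels = mlm_mask(list(range(10, 30)), rng)
# Invariant: a position is masked in `ids` iff its label is set.
assert ((ids == MASK_ID) == (labels != -100)).all()
```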

5c changes include:

- SA (self-attention): remove biases; many variants tried, none kept.
- MLP: remove biases, make gated, nothing else.
- Scaled sinusoidal embedding + LayerNorm
- pre-norm helps, but only when increasing LR
- In the head, MLP can be dropped (same with ViT).
- Again: gray text interesting!
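The two kept MLP changes (no biases, gated) amount to a GLU-style block. A sketch under my assumptions (GELU as the activation, illustrative shapes and init, not the paper's):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gated_mlp(x, W_gate, W_val, W_out):
    """Gated MLP with no bias terms: W_out @ (gelu(W_gate x) * (W_val x))."""
    return W_out @ (gelu(W_gate @ x) * (W_val @ x))

rng = np.random.default_rng(0)
d, h = 8, 32  # model dim, hidden dim (illustrative)
x = rng.standard_normal(d)
params = [rng.standard_normal(s) * 0.02 for s in [(h, d), (h, d), (d, h)]]
y = gated_mlp(x, *params)
assert y.shape == (d,)
```

Note the gating costs an extra hidden-dim matmul; at matched parameter count the hidden dim is usually shrunk to compensate.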

5b same thing again: restrict model changes to those that keep the same capacity (~same params for Transformer MLMs at fixed seqlen) but speed things up.

It's a shame that the vast majority of papers (sometimes including mine) completely ignore reporting wall-clock speed or slowdowns.

5a Architecture (sub-thread).

Super interesting and echoes our experience in vision: with enough data, all variants reach ~ the same loss in the same wall-clock time. Faster models need to see more tokens. In other words, with good implementations, it's hard to cheat wall-clock.

4/N Data: en
- Short sequence length 128 and packing with <sep>. I like the simplicity!
- <cls> token seems unnecessary; we found the same with ViT
- Use large batches by accumulating gradients across micro-batches
- 1 epoch (cc @aran)
- Grey text isn't an error; it contains all the negative results. The most interesting part!
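The packing scheme in the first bullet can be sketched greedily: concatenate documents separated by a <sep> token and emit fixed 128-token sequences. Token ids are placeholders; the thread doesn't say whether attention is masked across document boundaries, so that's omitted here.

```python
SEP, PAD, SEQ_LEN = 102, 0, 128  # placeholder ids and the thread's seqlen

def pack(docs, seq_len=SEQ_LEN):
    """Pack token-id lists into fixed-length sequences, <sep>-separated."""
    buf, out = [], []
    for doc in docs:
        buf.extend(doc + [SEP])
        while len(buf) >= seq_len:          # emit every full sequence
            out.append(buf[:seq_len])
            buf = buf[seq_len:]
    if buf:                                  # pad the leftover tail
        out.append(buf + [PAD] * (seq_len - len(buf)))
    return out

seqs = pack([[1] * 200, [2] * 50, [3] * 80])
assert all(len(s) == SEQ_LEN for s in seqs)
```

No <cls> token is prepended, matching the second bullet.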