Lucas Beyer

1.9K Followers
105 Following
118 Posts
still setting things up and figuring stuff out
At day: Researcher at Google Brain in Zürich
At night: Gamer, Hacker, Belgian
Old website: http://lucasb.eyer.be
Twitter: @giffmana

Posts on a Mastodon timeline are sorted in chronological order, new to old. I think algorithmic curation (e.g., highlighting popular posts) is acceptable as long as it's open source and testable.

#Fediview, which uses an open source algorithm, can selectively display popular posts from your Mastodon timeline based on posts' boosts and favorites. https://fediview.com/

Kudos to the developer, @adamghill

#OpenSource #Algorithm #Curation


Cool numerics rule of thumb shared by @nan on Twitter today:

Computing inverse p-th root of a matrix "A^{-1/p}" will lose about log2(K/p) bits of precision,

where K is the condition number (largest over smallest eigenvalue).

So, a condition number of 10^6 (not uncommon at all in DL) loses 19 bits, leaving you with only 4 bits in fp32.
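The rule of thumb is easy to sanity-check numerically. A minimal sketch (the function name and the choice of p = 2 are mine, just to reproduce the toot's 19-bit / 4-bit example):

```python
import math

def bits_lost(condition_number: float, p: float) -> float:
    """Rule of thumb: computing A^{-1/p} loses about log2(K/p)
    bits of precision, where K is the condition number."""
    return math.log2(condition_number / p)

# Toot's example: K = 1e6, against fp32's 23 explicit mantissa bits.
loss = round(bits_lost(1e6, p=2))   # ~19 bits lost
remaining = 23 - loss               # ~4 bits left
```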

That's why higher-order optimizers like Shampoo need very careful implementation (or very strong "regularization", which may help or hurt overall).

PS: This thread took me almost as long as a paper review. Looks like I procrastinate my CVPR reviews by making twitter paper reviews instead ¯\_(ツ)_/¯

Meta: I wrote the thread on twitter and copied it over here, hence the short toots. I tried, but the UI for writing threads here (web client) is absolutely abysmal...

9/9 final thoughts.

- I really like the "trend reversal" of seeing how much can be done with limited compute.
- I am a big fan of the gray text passages for things that were tried but didn't work.
- The lr sched part is fishy, but not super important.
- Impressive bibliography!

8/N results

left: overall, it's getting pretty close to original BERT, which used 45-136x more total FLOPS (4d on 16 TPUs)
right: and when training for 16x longer (2d on 8 GPUs), the same recipe actually improves on original BERT quite a bit, reaching RoBERTa-level performance.

7/N data

- Try pile subsets, c4, book+wiki
- dedup (exact substring) not helpful
- remove uncompressible data "t=0.3": keep only if ntokens < 0.3 * nchars
- sort: data with frequent tokens first (think "easy/common text first")
- grow batch-size at end
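The t=0.3 filter above can be sketched in a few lines (my own illustrative code; the toy whitespace tokenizer stands in for a real BPE tokenizer):

```python
def keep_example(text: str, tokenize, threshold: float = 0.3) -> bool:
    """Keep a training example only if it compresses well under the
    tokenizer: ntokens < threshold * nchars. Text that needs many
    tokens per character (random strings, markup junk) is dropped."""
    ntokens = len(tokenize(text))
    return ntokens < threshold * len(text)

# Toy "tokenizer": whitespace split, just for illustration.
toy_tokenize = str.split
keep_example("the quick brown fox jumps over the lazy dog", toy_tokenize)
```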

addendum to 6b: the figure from my screenshot is in the appendix. Other papers have shown even pre-norm needs warmup.

6c training:
- no dropout, tokendrop, or length curric.
- micro-batch 96 accum into 1.5-4k, linearly increased during training. Auto-tuning looks mostly linear.
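The linearly growing accumulation can be sketched like so (a sketch with the toot's numbers; the function name and exact endpoints 1536/4096 are my assumptions for "1.5-4k"):

```python
def accum_steps(step: int, total_steps: int, micro_batch: int = 96,
                start_batch: int = 1536, end_batch: int = 4096) -> int:
    """Gradient-accumulation steps for an effective batch size that
    grows linearly over training (micro-batches of 96 accumulated
    into ~1.5k at the start up to ~4k at the end)."""
    frac = step / max(1, total_steps)
    target = start_batch + frac * (end_batch - start_batch)
    return max(1, round(target / micro_batch))
```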

6b training: lr schedule!

They tried many, but this is where I disagree with the paper.
Most schedules either don't warm up (-> forces a lower peak lr!) or don't cool down to 0 at the end.

The only two that work clearly better than the rest are the only two with warmup and cooldown!
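A minimal schedule with both ingredients, of the kind the toot argues for (a sketch; linear shapes and the warmup fraction are my choices, not the paper's):

```python
def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.06) -> float:
    """Linear warmup to peak_lr, then linear cooldown to exactly 0.
    Warmup permits a higher peak lr; cooldown ends at 0."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # linear decay from peak to 0 at the final step
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - frac)
```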

6a/N training

- Stick to the simplest MLM objective
- Optimizer: Adam. No win from fancier.
- I want to point out that AdaFactor is meant to save memory but behave like Adam, so no win is a win!
- They mention no win from Shampoo (cc @nan) but aren't confident it's a good impl.