But we still use hand-designed heuristics to train our models. Let's replace our optimizers with trained neural nets!
If you are training models with < 5e8 parameters, for < 2e5 training steps, then with high probability this LEARNED OPTIMIZER will beat or match the tuned optimizer you are currently using, out of the box, with no hyperparameter tuning (!).
And the resulting learned optimizer works really well! We reached out to other researchers inside Brain, and had them try it on their tasks, and subject to the scale constraints I mention above it did as well or better than what they were currently using, with no tuning.
Tasks include multiple vision models, multiple language models, decision transformers, distillation tasks, scientific modeling, and more.
Huge thanks to Luke Metz for leading this research direction for the last half dozen years (!!) -- it's wonderful to see it bear fruit.
Also, huge thanks to James Harrison who has completely taken over the project over the last several months, and is responsible for the careful analysis and coherent story in the paper, as well as the exciting ongoing work.