It's hard to beat a plain SGD optimizer that has a finely tuned learning rate schedule.
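A minimal sketch of what a finely tuned schedule might look like with plain SGD in Keras (the boundaries and rates here are illustrative placeholders, not values from the thread):

```python
import tensorflow as tf

# Plain SGD with a hand-tuned piecewise learning rate schedule.
# The step boundaries and rates are hypothetical; in practice they
# would come from careful tuning on the task at hand.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[10_000, 20_000],   # steps at which the rate drops
    values=[0.1, 0.01, 0.001],     # learning rate for each interval
)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```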
@fchollet Do you think this is also true for robustness metrics?
@fchollet Agree with the spirit. I've never seen examples of this for Transformers, though. Adam (and its derivatives) just seems to be better there.
@fchollet I did some weird experiments with a batch size of 170K, and one of the top 10 tuner runs used SGD!
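A rough sketch of that kind of tuner setup, where the optimizer itself is a search hyperparameter so SGD can surface among the top runs (the model, search space, and trial budget are hypothetical, not the actual experiment):

```python
import keras_tuner
import tensorflow as tf

# Hypothetical KerasTuner search that treats the optimizer as a
# hyperparameter alongside the learning rate.
def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    optimizer_name = hp.Choice("optimizer", ["sgd", "adam", "rmsprop"])
    lr = hp.Float("learning_rate", 1e-4, 1e-1, sampling="log")
    if optimizer_name == "sgd":
        opt = tf.keras.optimizers.SGD(learning_rate=lr, momentum=0.9)
    elif optimizer_name == "adam":
        opt = tf.keras.optimizers.Adam(learning_rate=lr)
    else:
        opt = tf.keras.optimizers.RMSprop(learning_rate=lr)
    model.compile(
        optimizer=opt,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

tuner = keras_tuner.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=10,  # placeholder budget
)
```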