i believe i've implemented the optimizer described in: https://arxiv.org/abs/1712.03298
it seems to have comparable performance to Nesterov momentum with gradient clipping, which is my usual go-to when Adam doesn't work.
[1712.03298] Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks

@notwa How do you pick the constants for Nesterov?

@hjkl
momentum is usually high: 0.9, or 0.7 if that doesn't work. with gradient clipping (clipped at 1.0), learning rate can be higher than usual. i usually start at 1.0 and do quick tests down exponentially: 1.0, 0.32, 0.1, 0.032, etc.
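to make that concrete, here's a minimal numpy sketch of the update i mean (the function name and the quadratic example are mine, not from anywhere in particular):

```python
import numpy as np

def nesterov_clipped_step(w, v, grad_fn, lr=0.1, mu=0.9, clip=1.0):
    """one step of Nesterov momentum with global gradient-norm clipping.
    w: parameters, v: velocity buffer, grad_fn: gradient at given params."""
    g = grad_fn(w + mu * v)        # evaluate gradient at the lookahead point
    norm = np.linalg.norm(g)
    if norm > clip:                # rescale so the gradient's norm is at most `clip`
        g = g * (clip / norm)
    v = mu * v - lr * g
    return w + v, v
```

on a toy quadratic (grad_fn is just the identity), a few hundred of these steps will pull the parameters to the minimum even when you start far enough out that every early gradient gets clipped.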

something worth noting is that momentum acts as a boost for learning rate at DC and low frequencies, so you wind up with 1/(1-mu) times more learning rate than you asked for. i believe this is why Adam's default learning rate is usually a tiny 0.001 or 0.002.
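quick numeric check of that 1/(1-mu) figure (toy code, nothing from any paper):

```python
def momentum_dc_gain(mu=0.9, steps=500):
    """feed a constant gradient of 1 into the momentum accumulator
    v <- mu*v + g and watch it settle at the DC gain 1/(1-mu)."""
    v = 0.0
    for _ in range(steps):
        v = mu * v + 1.0
    return v

# momentum_dc_gain(0.9) settles near 10.0, i.e. 1/(1-0.9)
```

so with mu=0.9 a constant gradient effectively gets a 10x boost, which is why the learning rate you ask for isn't the one you get.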

@notwa Thanks! I didn't know about gradient clipping. Of course, if you knew the Lipschitz constant of the gradient of the loss (I think) you could pick the values so that convergence is guaranteed. Obviously impossible with deep learning though.

@hjkl yeah, i'm aware of Lipschitz conditions and the like, but most of my experience is honestly just tweaking numbers, trying ideas, and implementing any paper that interests me. i personally find it easier to try things than to theorize about them.

in my mind, deep learning is more like a highly unstable system trying to settle than anything like convex optimization. a lot of modern techniques seem to be based on simple intuition instead of pages of proof — compare resnets to SELU. just some random thoughts.

@hjkl side note, the momentum-boosting-learning-rate thing is my own idea; i'm not sure how well it holds in practice. but when you consider the momentum equation as an LTI system, you see its magnitude response has a gain of 1/(1-mu) at DC, as i stated.

for fun, i've tried implementing a second-order filter as an optimizer, but i couldn't personally manage anything better than a traditional well-tuned momentum optimizer.
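for reference, the kind of thing i mean: run the gradient through a second-order IIR low-pass (direct form II transposed here) instead of the usual one-pole leaky accumulator. the coefficients below are arbitrary picks of mine, not tuned values:

```python
def second_order_step(w, state, g, lr=0.1,
                      b=(0.25, 0.5, 0.25), a=(1.0, -0.9, 0.2)):
    """one optimizer step where the raw gradient g is smoothed by a
    second-order IIR filter before being applied to the parameters.
    state holds the two filter delay registers."""
    s1, s2 = state
    y = b[0] * g + s1              # filter output = smoothed gradient
    s1 = b[1] * g - a[1] * y + s2  # update delay registers (DF2T form)
    s2 = b[2] * g - a[2] * y
    return w - lr * y, (s1, s2)
```

with these particular coefficients the closed loop is stable on a simple quadratic, but in my experience nothing like this beat plain well-tuned momentum, which is just the first-order special case.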