it seems to have comparable performance to Nesterov momentum with gradient clipping, which is my usual go-to when Adam doesn't work.
@hjkl
momentum is usually high: 0.9, or 0.7 if that doesn't work. with gradient clipping (clipped at a norm of 1.0), the learning rate can be higher than usual. i usually start at 1.0 and do quick tests stepping down exponentially, about half a decade at a time: 1.0, 0.32, 0.1, 0.032, etc.
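for concreteness, here's roughly what that setup looks like on a toy 1-d quadratic (the toy loss, function names, and step counts are all mine, just for illustration; not from any framework):

```python
# toy sketch: nesterov momentum + gradient clipping on loss(w) = w^2.
# momentum 0.9 and clip threshold 1.0 match the settings above;
# everything else is made up for illustration.

def grad(w):
    return 2.0 * w  # gradient of w^2

def clip(g, max_norm=1.0):
    return max(-max_norm, min(max_norm, g))  # 1-d stand-in for norm clipping

def train(lr, mu=0.9, steps=100):
    w, v = 5.0, 0.0
    for _ in range(steps):
        g = clip(grad(w + mu * v))  # nesterov: gradient at the lookahead point
        v = mu * v - lr * g
        w = w + v
    return w

# quick sweep, stepping the learning rate down exponentially
for lr in [1.0, 0.32, 0.1, 0.032]:
    print(lr, train(lr))
```

on this toy problem the smaller learning rates settle near the minimum; the point is the shape of the loop, not the numbers.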
something worth noting is that momentum acts as a boost to the learning rate at DC and low frequencies, so your effective learning rate ends up 1/(1-mu) times what you asked for (10x at mu = 0.9). i believe this is why Adam's default learning rate is such a tiny 0.001 or 0.002.
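quick numeric check of that claim: feed a constant gradient into the accumulator v = mu*v + g and it settles at g/(1-mu), i.e. a 10x effective step for mu = 0.9 (toy snippet, names are mine):

```python
# a constant gradient into the momentum accumulator settles at g/(1-mu)
mu, g = 0.9, 1.0
v = 0.0
for _ in range(200):
    v = mu * v + g
print(v)  # ≈ 1/(1 - 0.9) = 10
```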
@hjkl yeah, i'm aware of Lipschitz constants and the like, but most of my experience honestly comes from tweaking numbers, trying ideas, and implementing any paper that interests me. i personally find it easier to try things than to theorize about them.
in my mind, deep learning is more like a highly unstable system trying to settle than anything like convex optimization. a lot of modern techniques seem to be based on simple intuition instead of pages of proof — compare resnets to SELU. just some random thoughts.
@hjkl side note, the momentum-boosting-the-learning-rate thing is my own idea; i'm not sure how well it holds in practice. but if you treat the momentum update as an LTI system, its magnitude plot has a DC gain of 1/(1-mu), as i stated.
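concretely, the accumulator v[t] = mu*v[t-1] + g[t] has transfer function H(z) = 1/(1 - mu*z^-1), so you can just evaluate its magnitude around the unit circle (a small check in plain python; names mine):

```python
import cmath
import math

mu = 0.9

def mag(omega):
    # |H(e^{j*omega})| for v[t] = mu*v[t-1] + g[t]
    z = cmath.exp(1j * omega)
    return abs(1.0 / (1.0 - mu / z))

print(mag(0.0))      # DC gain, ≈ 1/(1 - 0.9) = 10
print(mag(math.pi))  # nyquist, ≈ 1/(1 + 0.9) ≈ 0.53
```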
for fun, i've tried implementing a second-order filter as an optimizer, but i personally couldn't get it to beat a well-tuned traditional momentum optimizer.
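in case anyone wants to poke at the same idea, this is the shape of what i mean: replace the one-pole accumulator with a two-pole (complex-pole-pair) low-pass on the gradient and step along its output. everything here (pole placement, learning rate, the toy loss) is a hypothetical sketch of mine, not a recommendation:

```python
# second-order (two-pole) gradient filter as an optimizer, on loss(w) = w^2.
# v[t] = a1*v[t-1] + a2*v[t-2] + g[t]; a1 = 2r*cos(theta), a2 = -r^2 puts a
# complex pole pair at radius r (r < 1 keeps the filter itself stable).
import math

def second_order_step(w, v1, v2, g, lr, r=0.5, theta=0.1):
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    v = a1 * v1 + a2 * v2 + g
    return w - lr * v, v, v1

w, v1, v2 = 5.0, 0.0, 0.0
for _ in range(500):
    g = 2.0 * w  # gradient of w^2
    w, v1, v2 = second_order_step(w, v1, v2, g, lr=0.05)
print(w)  # settles toward 0 on this toy problem
```

note the filter's own DC gain, 1/(1 - a1 - a2), boosts the step size the same way plain momentum's 1/(1-mu) does, so the learning rate has to shrink to compensate.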