What happens when you use #autodiff and let your nonsmooth iterative algorithm run to convergence?

With J. Bolte & E. Pauwels, we show that under a contraction assumption, the derivatives of the algorithm converge linearly!
Preprint: https://arxiv.org/abs/2206.00457

I will present this work this week at #NEURIPS2022

Automatic differentiation of nonsmooth iterative algorithms

Differentiation along algorithms, i.e., piggyback propagation of derivatives, is now routinely used to differentiate iterative solvers in differentiable programming. Asymptotics is well understood for many smooth problems but the nondifferentiable case is hardly considered. Is there a limiting object for nonsmooth piggyback automatic differentiation (AD)? Does it have any variational meaning and can it be used effectively in machine learning? Is there a connection with classical derivatives? All these questions are addressed under appropriate nonexpansivity conditions in the framework of conservative derivatives, which has proved useful in understanding nonsmooth AD. We characterize the attractor set of nonsmooth piggyback iterations as a set-valued fixed point which remains in the conservative framework. This has various consequences, in particular almost everywhere convergence of classical derivatives. Our results are illustrated on parametric convex optimization problems with forward-backward, Douglas-Rachford, and Alternating Direction Method of Multipliers algorithms, as well as the Heavy-ball method.


This is not trivial because:

1. We consider *nonsmooth* differentiation, which requires concepts such as conservative Jacobians

2. Convergence of functions does not imply convergence of derivatives in general

3. Even when it holds in theory, such convergence might not be observed in practice

But it also works in practice! For instance, differentiating forward-backward for the Lasso, or Douglas-Rachford for sparse inverse covariance selection, yields a linear rate. In fact, one can prove linear convergence as soon as there is enough regularity.
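To make the forward-backward/Lasso case concrete, here is a toy numpy sketch (my own, not from the paper): alongside the ISTA iterates x_k we piggyback J_k ≈ dx_k/dλ through the soft-thresholding update, choosing 0 at the kink of max (a valid element of the conservative Jacobian). The random data and λ = 2 are arbitrary.

```python
import numpy as np

def soft(u, t):
    """Soft-thresholding, the prox of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)
lam = 2.0                                  # Lasso regularization strength
gamma = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/L

x = np.zeros(10)
J = np.zeros(10)   # piggyback derivative dx_k/dlam, carried with x_k
errs = []          # successive differences ||J_{k+1} - J_k||
for _ in range(500):
    u = x - gamma * A.T @ (A @ x - b)
    # derivative of soft(., gamma*lam) w.r.t. its input: 1 on the
    # active set, 0 elsewhere (picking 0 at the kink |u_i| = gamma*lam)
    active = (np.abs(u) > gamma * lam).astype(float)
    # chain rule through the update: d(u)/dlam = (I - gamma A^T A) J,
    # and d(threshold)/dlam contributes -gamma * sign(u) on the active set
    J_new = active * ((J - gamma * A.T @ (A @ J)) - gamma * np.sign(u))
    x = soft(u, gamma * lam)
    errs.append(np.linalg.norm(J_new - J))
    J = J_new
```

Once the active set stabilizes, the successive differences of J decay geometrically, mirroring the linear rate of the iterates themselves.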

Unfortunately, we also show that for some pathological cases, automatic differentiation of momentum methods might diverge, even in 1D.

The proofs are based on a Banach-Picard fixed-point theorem for set-valued functions, along with some previous results of Jerome and Edouard on conservative Jacobians.

PS: This is *not* a result on implicit differentiation, which is already covered in https://arxiv.org/abs/2106.04350

Nonsmooth Implicit Differentiation for Machine Learning and Optimization

In view of training increasingly complex learning architectures, we establish a nonsmooth implicit function theorem with an operational calculus. Our result applies to most practical problems (i.e., definable problems) provided that a nonsmooth form of the classical invertibility condition is fulfilled. This approach allows for formal subdifferentiation: for instance, replacing derivatives by Clarke Jacobians in the usual differentiation formulas is fully justified for a wide class of nonsmooth problems. Moreover this calculus is entirely compatible with algorithmic differentiation (e.g., backpropagation). We provide several applications such as training deep equilibrium networks, training neural nets with conic optimization layers, or hyperparameter-tuning for nonsmooth Lasso-type models. To show the sharpness of our assumptions, we present numerical experiments showcasing the extremely pathological gradient dynamics one can encounter when applying implicit algorithmic differentiation without any hypothesis.


@vaiter
This is neat!
In practice I don't think I see piggyback being used any more: modern autodiff frameworks often default to IFT. Does it still have major use-cases? I did skim the paper but didn't see a comparison highlighted.

[... not that it needs a use case to be interesting!]

I imagine forward-mode piggyback can be parallelised, unlike IFT?

@PatrickKidger Fully agree.

I think the most appealing part (from a practical POV) is that with forward-mode piggyback you don't have to store all the iterates, but practitioners seem to have moved from unrolling to IFT indeed.
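To illustrate that memory argument with a toy numpy sketch (mine, not from the paper): forward-mode piggyback on the smooth contraction x ↦ tanh(Wx + θ) only ever carries the current pair (x, J), never the history of iterates, and its limit coincides with the implicit-function-theorem Jacobian at the fixed point.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = rng.standard_normal((n, n))
W *= 0.5 / np.linalg.norm(W, 2)    # scale so the map is a contraction
theta = rng.standard_normal(n)

x = np.zeros(n)
J = np.zeros((n, n))               # dx_k/dtheta, carried alongside x_k
for _ in range(200):
    z = W @ x + theta
    d = 1.0 - np.tanh(z) ** 2      # derivative of tanh at z
    # piggyback update: only (x, J) are stored, O(n^2) memory total
    J = d[:, None] * (W @ J + np.eye(n))
    x = np.tanh(z)

# IFT at the fixed point x* = tanh(W x* + theta):
# (I - D W) J* = D, with D = diag(tanh'(W x* + theta))
d_star = 1.0 - np.tanh(W @ x + theta) ** 2
J_ift = np.linalg.solve(np.eye(n) - d_star[:, None] * W, np.diag(d_star))
```

Here J and J_ift agree to machine precision, the smooth-world version of the "same limiting object" statement.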

Regarding the comparison with IFT, we discuss the link between the two in Remark 2 (p. 5): in the differentiable world, the limiting object is the same, whereas for nonsmooth procedures the inclusion may be strict.

1/2

@PatrickKidger

A theoretically interesting aspect of autodiff, in contrast with IFT, is that you do not need the Jacobian to be invertible to make sense of the limit. Even if (using the notation of the paper) I - A is not invertible, you can still study the behavior. But again, from a practical POV, this is not really important.

@vaiter
Hmm, do you mean that you need invertibility to make sense of IFT or of piggyback? At least for IFT I think you can generalise the linsolve to a linear least squares / minimal norm solution, i.e. the pseudoinverse solution to an ill-posed linear solve. And AFAIK this is a reasonable thing to do. (Even if it doesn't seem to get discussed in the literature much?)
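For what it's worth, that generalisation is a one-liner in numpy: replace the exact solve of the IFT system (I - A) J = B by a least-squares solve, which returns the minimal-norm (pseudoinverse) solution when I - A is singular. A toy 2x2 instance with made-up numbers:

```python
import numpy as np

# Hypothetical IFT linearization (I - A) J = B at a fixed point,
# where A = dF/dx happens to make I - A singular.
A = np.array([[1.0, 0.0],
              [0.0, 0.5]])
B = np.array([[1.0],
              [1.0]])
M = np.eye(2) - A   # first row is zero: np.linalg.solve would raise LinAlgError

# lstsq returns the minimal-norm least-squares solution M^+ B instead
J, residuals, rank, sv = np.linalg.lstsq(M, B, rcond=None)
```

The ill-posed direction is simply projected out: the first component of J is 0 (minimal norm) while the well-posed component is solved exactly.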