Just finished uploading “Numerics of Machine Learning”. This final batch is on optimization and deep learning, and more of a list of grievances with the status quo than a collection of solutions.

```
https://www.probabilistic-numerics.org/teaching/2022_Numerics_of_Machine_Learning/
```

Quick thread below.

Frank Schneider started the round with a lecture on training deep neural networks. He argues that deep learning is only superficially an optimization problem in the strict sense, and that it remains a poorly understood problem despite its enormous emerging economic value.

```
https://youtu.be/PBcVZ5jEE5k
```

Thankfully, there are many leads toward a new kind of training, and they are fundamentally probabilistic in nature. The strong stochasticity of mini-batch training should be embraced rather than ignored.
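A toy sketch of that stochasticity (my own illustration, not from the lecture): mini-batch gradients of a least-squares loss scatter around the full-batch gradient, with the noise shrinking as the batch grows.

```python
# Assumed toy setup: linear regression, mean-squared-error loss.
import numpy as np

rng = np.random.default_rng(1)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=N)
w = np.zeros(d)                       # current iterate

def grad(idx):
    """Gradient of the MSE over the samples in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

full = grad(np.arange(N))             # full-batch gradient

def noise(B, reps=200):
    """Average distance between mini-batch and full-batch gradient."""
    return np.mean([np.linalg.norm(grad(rng.choice(N, B, replace=False)) - full)
                    for _ in range(reps)])

small, large = noise(16), noise(1024)
# Larger batches give noticeably less noisy gradient estimates.
assert small > large
```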

Numerics of ML 11 -- Optimization for Deep Learning -- Frank Schneider

With tools like BackPACK (`backpack.pt`) by Felix Dangel and Frederik Kunstner, second-order quantities in the data dimension (batch variances, individual gradients) and in the weight dimension (various curvature estimates) are readily available. Why not make use of them in training?
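To make the "data dimension" concrete, here is a plain-NumPy sketch (deliberately not the BackPACK API) of the per-sample gradients and their batch variance for a least-squares loss on a linear model:

```python
# Hypothetical illustration: individual gradients and batch variance,
# the kind of quantities BackPACK exposes during a backward pass.
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 3                          # batch size, number of weights
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

residual = X @ w - y                  # shape (n,)
# Per-sample gradient of 0.5 * (x_i @ w - y_i)^2 w.r.t. w:
grad_batch = residual[:, None] * X    # shape (n, d), one row per sample
grad_mean = grad_batch.mean(axis=0)   # the usual mini-batch gradient
grad_var = grad_batch.var(axis=0)     # coordinate-wise batch variance

# Averaging the individual gradients recovers the batch gradient:
full_grad = X.T @ residual / n
assert np.allclose(grad_mean, full_grad)
```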
More generally, Frank says, for something human engineers spend months on, the deep learning training toolchain is surprisingly simplistic. We need a richer software engineering stack for deep models. As inspiration for what such “deepbuggers” for differentiable, array-centric programs might look like, he proposes Cockpit, a practical debugging tool for training deep neural networks, by Felix and himself.

```
https://github.com/f-dangel/cockpit
```

Why do we even need these? Lukas Tatzel takes over to make the connection to classic convex optimization. There, second-order and superlinear methods (like BFGS) are of course great. But their advantages are severely diminished in the strongly stochastic setting of most deep learning.
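One way to see why (a toy sketch of my own, not from the lecture): quasi-Newton methods like BFGS build curvature estimates from gradient differences, and gradient noise gets amplified by the small step sizes involved.

```python
# Assumed toy problem: f(x) = 0.5 * a * x^2, whose true curvature is a.
# A secant curvature estimate from exact vs. noisy gradients.
import numpy as np

rng = np.random.default_rng(2)
a = 4.0                               # true curvature
x0, x1 = 0.0, 0.01                    # a small quasi-Newton step
sigma = 0.5                           # gradient noise (mini-batch effect)

def g(x, noisy):
    """Gradient of f, optionally corrupted by mini-batch-style noise."""
    return a * x + (sigma * rng.normal() if noisy else 0.0)

# With exact gradients, the secant estimate recovers the curvature:
exact = (g(x1, False) - g(x0, False)) / (x1 - x0)
assert np.isclose(exact, a)

# With noisy gradients, the estimate is wildly dispersed: its standard
# deviation scales like sigma * sqrt(2) / step, here roughly 70 vs. a = 4.
noisy_estimates = np.array([(g(x1, True) - g(x0, True)) / (x1 - x0)
                            for _ in range(1000)])
assert noisy_estimates.std() > 10 * a
```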