@scolobb The speaker discussed why deep neural networks are overparameterized (the number of parameters often exceeds the number of training examples) yet still generalize remarkably well.
@scolobb My guess would be that deep neural networks use *alternative* parameters on every layer that do not compete for finding the best fit. Does BERT really need 12 layers? Theory suggests it does not...
@scolobb ... but we don't know how to train more concise models yet. Dacheng Tao suggested using stochastic gradient descent instead of (full-batch) gradient descent, but I did not fully understand what that solves.
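@scolobb For context, the difference between full-batch gradient descent and stochastic gradient descent can be sketched on a toy linear-regression problem (my own illustration, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 examples, 3 features, known weights plus a little noise.
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

def grad(w, Xb, yb):
    # Gradient of mean squared error on the given (mini-)batch.
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient descent: one exact gradient over all data per step.
w_gd = np.zeros(3)
for _ in range(200):
    w_gd -= 0.1 * grad(w_gd, X, y)

# Stochastic gradient descent: a noisy gradient from a random mini-batch
# per step; the noise is what is often argued to help generalization.
w_sgd = np.zeros(3)
for _ in range(200):
    idx = rng.choice(100, size=10, replace=False)
    w_sgd -= 0.1 * grad(w_sgd, X[idx], y[idx])
```

Both runs recover the true weights here; the interesting claims are about what the mini-batch noise does in the overparameterized regime.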
@scolobb On Monday, Geoff Hinton said he does not believe that the brain does gradient descent, but it's the best we can do at the moment to train complex models.