@scolobb The speaker discussed why deep neural networks are overparameterized (the number of parameters often exceeds the amount of training data) but still give excellent results.
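A back-of-the-envelope sketch of that claim, using assumed layer sizes and an MNIST-scale training set (both are illustrative choices, not from the talk): even a modest MLP has far more parameters than training examples.

```python
# Hypothetical small MLP: input, two hidden layers, output.
layer_sizes = [784, 512, 512, 10]

# Each dense layer contributes fan_in * fan_out weights plus fan_out biases.
n_params = sum(m * n + n for m, n in zip(layer_sizes, layer_sizes[1:]))

n_train = 60_000  # assumed MNIST-sized training set
print(n_params)            # ~670k parameters
print(n_params > n_train)  # overparameterized by more than 10x
```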
@scolobb My guess would be that deep neural networks use *alternative* parameters on every layer that do not compete to find the best fit. Does BERT really need 12 layers? Theory suggests it does not...
@djoerd I don't really understand the idea of alternative parameters that don't compete. It would mean redundancy, right? That plays well with the observation that some nets don't seem to need all of their layers.