@jadeaffenjaeger Very good talk, thanks. IMO "transfer learning" is not as narrow as you say; it can refer to many ways of transferring knowledge. And there is some theory behind distillation: essentially you're training a regression where the function to approximate is the teacher. A smaller student can (a) learn a smoother function (>>generalise better? sometimes!) and (b) learn the function at any arbitrary point, not just at the training-set points (>>follow the teacher very well)
@danstowell
Thank you for pointing these out! Is the ability to learn points outside the original dataset actually used in practice? I could imagine using an additional corpus of training data labelled by the teacher instead of by humans; however, I'm only aware of techniques that take the original dataset as input and then fit to the teacher model's outputs. Definitely an intriguing idea, though!
@jadeaffenjaeger In theory, you could feed pure noise data in during distillation, since noise spans the full input space and so gives a "good" approximation of the teacher function everywhere. In practice that's a bad idea! -- because most of the training effort is wasted learning the teacher's response to very unlikely inputs (low-density regions of input space). Using the original dataset is common, yes, but good advice is to ALSO add some extra unlabelled but relevant data, and maybe to increase the data augmentation too...
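To make the "distillation as regression" view above concrete, here's a toy sketch in plain NumPy. The "teacher" is just a fixed nonlinear function standing in for a trained model's predict(), and the "student" is a low-degree polynomial -- both are my hypothetical stand-ins, not anything from the talk. The point it illustrates: the student fits the teacher's outputs (not human labels), so extra *unlabelled* in-distribution inputs can be labelled by the teacher for free, exactly as suggested in the thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Teacher": a fixed, more complex function standing in for a trained model.
def teacher(x):
    return np.sin(3 * x) + 0.5 * x

# Original training inputs (where human labels would exist)...
x_train = rng.uniform(-1, 1, size=200)
# ...plus extra *unlabelled* but in-distribution inputs: the teacher
# labels these for us, so no human annotation is needed.
x_extra = rng.uniform(-1, 1, size=200)
x_all = np.concatenate([x_train, x_extra])

# Distillation as regression: fit a smaller "student" (a degree-5
# polynomial) to the teacher's outputs rather than to ground truth.
y_soft = teacher(x_all)
student = np.poly1d(np.polyfit(x_all, y_soft, deg=5))

# The student can be queried at arbitrary points and should track the
# teacher closely inside the data's support.
x_test = np.linspace(-1, 1, 50)
err = np.max(np.abs(student(x_test) - teacher(x_test)))
print(f"max |student - teacher| on [-1, 1]: {err:.4f}")
```

Feeding uniform noise over a huge range instead of in-distribution x values would spend most of the polynomial's limited capacity on regions the teacher is never queried in -- which is the "bad idea" above.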