@jadeaffenjaeger In theory, you could feed pure noise into the teacher during distillation, since noise spans the full input space and so gives a "good" approximation of the teacher function. In practice that's a bad idea, because most of the training effort is wasted learning the teacher's response to very unlikely inputs (low-density regions of input space). Using the original dataset is common, yes, but good advice is to ALSO add some extra unlabelled but relevant data, & maybe to increase the data augmentation too...
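
To make the "unlabelled data is fine" point concrete, here's a rough sketch of a distillation loss. The loss only compares teacher and student outputs, so no ground-truth labels are needed and extra relevant inputs can simply be appended to the transfer set. Everything named here (`toy_teacher`, `toy_student`, the input values, the temperature of 2.0) is a made-up stand-in for illustration, not anyone's actual setup.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs. Note: no label is used."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def batch_loss(teacher, student, inputs, temperature=2.0):
    """Mean distillation loss over a batch of (possibly unlabelled) inputs."""
    losses = [distillation_loss(teacher(x), student(x), temperature) for x in inputs]
    return sum(losses) / len(losses)

# Hypothetical stand-ins for real models: each maps a scalar input to 3 logits.
def toy_teacher(x):
    return [2.0 * x, -1.0 * x, 0.5]

def toy_student(x):
    return [1.5 * x, -0.5 * x, 0.0]

# Assumed data: inputs from the original dataset plus extra unlabelled ones.
original_inputs = [0.2, 0.9, -0.4]
extra_unlabelled_inputs = [0.3, -0.7]
transfer_set = original_inputs + extra_unlabelled_inputs

loss = batch_loss(toy_teacher, toy_student, transfer_set)
print(f"mean distillation loss over {len(transfer_set)} inputs: {loss:.4f}")
```

In a real setup the student would be trained to minimise this (often alongside a normal supervised loss on the labelled part), and augmented copies of the inputs could be added to `transfer_set` in the same way.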