Why and when is synthetic data better than real data for ML training?
It's not only a question of available volume, although in the past that was an important consideration.
In training data we want to have:
1. Knowledge that transfers to the target task, or to tasks in general, captured in high fidelity.
2. Skills that generalize to the target task, or to tasks in general, captured in high fidelity.
3. Both represented in a form that allows instructing or controlling the trained model, typically an instruction-following format.
Can synthetic data actually be better than real-world data? That depends on our models. If a model does not yet understand the skills needed, it cannot practice those skills to improve at them. If it lacks knowledge, it cannot acquire that knowledge on its own without input from the real world, whether from literature or from active experimentation.
For some relatively generalist skills, we already have frontier models that have acquired a bootstrappable level of competence, and a genuine understanding of what those skills are about, sufficient to improve beyond human level through autonomous practice.
The knowledge pool trained into our generalist large language models and large multimodal models is already vast, impressively above human level in most topics.
Of course, in new modalities such as medical imagery and robotic control, vanilla frontier models still lack both the required knowledge and competence in the relevant skills, but both can readily be trained into those models through imitation and self-supervised learning.
Once a model reaches a bootstrappable level of competence in a new domain, it becomes able to self-improve by exercising the related skills and evaluating its own performance. In practice this becomes a process of recursive self-improvement through training-data refinement and synthesis.
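That refine-and-synthesize loop can be sketched in a toy form. Everything below is hypothetical scaffolding, not an API from any real training stack: the "model" is just a (mean, spread) pair, "practice" draws candidate outputs, a simple verifier (distance to a target) scores them, and "retraining" refits the model to the best attempts.

```python
import random

def self_improve(target, rounds=20, samples=50, seed=0):
    """Toy recursive self-improvement loop (illustrative only).

    The model starts uninformed and converges toward `target` purely
    by generating, evaluating, and filtering its own synthetic data.
    """
    rng = random.Random(seed)
    mean, spread = 0.0, 10.0                 # initial, uninformed "model"
    for _ in range(rounds):
        # 1. Exercise the skill: generate candidate outputs.
        attempts = [rng.gauss(mean, spread) for _ in range(samples)]
        # 2. Evaluate own performance with a verifier (distance to target).
        scored = sorted(attempts, key=lambda x: abs(x - target))
        # 3. Refine the training data: keep only the best attempts.
        best = scored[: samples // 5]
        # 4. Retrain: refit the model on the refined synthetic data.
        mean = sum(best) / len(best)
        spread = max(0.1, spread * 0.7)      # grow more confident over time
    return mean

print(self_improve(target=42.0))
```

The key property the sketch illustrates is that no external labels are needed once the evaluator exists: the model's own filtered outputs are the training data for the next round.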
We already have a clear engineering roadmap for surpassing human level in every domain, one by one, and progress won't step backwards: knowledge and skills acquired in other domains transfer to new ones, making the process easier and faster for each novel domain.
Now, consider a world where this process has reached its conclusion.
#RecursiveSelfImprovement #UniversalEmbodiment #LLMs #AI #AGI