Recycling Finetuned models, it works!
Finetuned models lie everywhere,
there must be a way to use the data and compute invested in them.
Apparently averaging their weights is such a method.
3 papers & A🧵
Pretraining consumes tremendous amounts of data and compute
In comparison, a single fine-tuning is quite modest
However, taken together, all the fine-tuning done across the community uses tenfold more data and GPUs
And those models are used only once!
With this in mind, we had the following idea:
Could we reverse the paradigm?
Combine fine-tuned models to improve the pretrained model?
(Spoiler: yes)
https://arxiv.org/pdf/2204.03044.pdf
Twitter handles: @LChoshen @EladVenezian @noamslonim @YoavKatz73
We propose fusing:
Take finetuned models, average their weights, and use the result as a base model.
Then fine-tune on top of it as needed.
Note that this is not an ensemble!
We average the weights, not the outputs, which could, in principle, produce a nonsense network.
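As a minimal sketch of fusing (illustrative code, not the paper's implementation), averaging weights just means element-wise averaging the parameters of models that share the same architecture and originate from the same pretrained checkpoint:

```python
# Sketch of "fusing": element-wise average of the parameters of several
# models fine-tuned from the same pretrained checkpoint.
# Assumes all models share an identical state-dict layout; the function
# name `fuse` is ours, not from the paper's code.

def fuse(state_dicts):
    """Return the element-wise average of a list of model state dicts."""
    fused = {}
    for name in state_dicts[0]:
        # sum() works for both plain floats and framework tensors
        fused[name] = sum(sd[name] for sd in state_dicts) / len(state_dicts)
    return fused
```

With real models you would load each checkpoint's state dict, fuse them, and load the result back into the shared architecture before fine-tuning on your target task.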
We consistently find that doing so beats fine-tuning from the pretrained model, even when the fused models come from an unrelated task or domain (in the Fig., NLI or Twitter).
Also, certain groups of models work better than others.
Fusing just one model is equivalent to intertraining, the common practice of taking a fine-tuned model as a base model. Intertraining often fails unless you choose the right model for your task.
Still, if you choose which models to fuse (Fig.), it often works better
This is strong evidence that fusing could work, but more work remains to be done.
For example, how do we choose which models to fuse?
It is unclear which models to fuse, but in our latest work we do find the best single base model for various architectures, on our continuously updated site:
https://ibm.github.io/model-recycling/
We also show much more in the paper (future thread...), but in short: the best base model is simply the best overall, regardless of your target task.
https://arxiv.org/abs/2211.00107
Another method weighs not only the models but each of their individual weights, using its Fisher information.
Michael Matena & Colin Raffel
https://arxiv.org/abs/2111.09832
In this setting, they assume you have a set of available models trained on different tasks and a specific target task in mind. The goal is to average the models and get a better result on that task.
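A hedged sketch of that idea (Fisher-weighted averaging): each parameter is averaged with a per-parameter weight given by its approximate Fisher information, so parameters a model is "confident" about dominate the merge. The function name `fisher_merge` and the `eps` stabilizer are our illustrative choices, not the paper's code:

```python
# Sketch of Fisher-weighted averaging: per-parameter weighted mean
#   merged = sum_i(F_i * theta_i) / sum_i(F_i)
# where F_i is an approximation of model i's Fisher information for
# that parameter. `eps` (our addition) avoids division by zero where
# all Fisher values vanish.

def fisher_merge(state_dicts, fishers, eps=1e-8):
    """Average state dicts, weighting each parameter by its Fisher value."""
    merged = {}
    for name in state_dicts[0]:
        num = sum(f[name] * sd[name] for sd, f in zip(state_dicts, fishers))
        den = sum(f[name] for f in fishers) + eps
        merged[name] = num / den
    return merged
```

Plain averaging is the special case where every Fisher value is equal.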
@RonRichman on this note, is there any evidence that models fine-tuned from the same parent model lie in the same loss basin?