Recycling fine-tuned models: it works!

Fine-tuned models lie everywhere;
there must be a way to reuse the data and compute invested in them.

Apparently, averaging their weights is one such method.

3 papers & A🧵

https://arxiv.org/pdf/2204.03044.pdf

Pretraining consumes tremendous amounts of data and compute.

In comparison, a single fine-tuning run is quite modest.
However, in aggregate, the fine-tuning done across the community uses tenfold more data and GPUs.

And those models are used only once!

With this in mind, we had the following idea:
Could we reverse the paradigm?

Combine fine-tuned models to improve the pretrained model?
(Spoiler: yes)

Co-authors: @EladVenezian @noamslonim @YoavKatz73

We propose fusing:
take fine-tuned models, average their weights, and use the result as a base model.
Later, fine-tune on top of it as needed.

Note that this is not an ensemble!
We average the weights, not the outputs; in principle, this could produce a nonsense network.
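Concretely, fusing is just a per-parameter mean over checkpoints. A minimal sketch with plain dicts standing in for state dicts (the `fuse` helper and the toy values are illustrative, not the paper's code), assuming all checkpoints share one architecture:

```python
def fuse(state_dicts):
    """Uniform fusing: average each parameter across fine-tuned checkpoints.

    Assumes all checkpoints have the same architecture (same keys/shapes).
    """
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}

# Toy "checkpoints" of the same tiny model, fine-tuned on different tasks.
model_a = {"w": 1.0, "b": 0.0}
model_b = {"w": 3.0, "b": 2.0}

base = fuse([model_a, model_b])
print(base)  # {'w': 2.0, 'b': 1.0}
```

With real models, the same loop runs over `model.state_dict()` tensors instead of floats; the fused dict is then loaded as the new base before fine-tuning.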

We find that, consistently, doing so is better than fine-tuning from the pretrained model, even if the fused models came from an unrelated task or domain (in the figure: NLI or Twitter).

Also, certain groups of models work better than others.

Fusing just one model is equivalent to intertraining, the common practice of taking a fine-tuned model as the base model. Intertraining often fails unless you choose the right model for your task.

Still, if you choose which models to fuse (see Fig.), it often works better.
This is strong evidence that fusing could work, but more work remains to be done.
For example, how should we choose which models to fuse?

It is still unclear which models to fuse, but in our latest work we do find the best single base model for various architectures, listed on our continuously updated site:
https://ibm.github.io/model-recycling/

We show much more in the paper (future thread...), but in short: the best model is simply the best, regardless of your target task.
https://arxiv.org/abs/2211.00107

This leads us to the second paper.
Almost at the same time came "model soups".
They improve not pretraining but a target task:
they propose using the fine-tuned models created during a hyperparameter sweep to improve results on that task.
Since the target task is known, they propose either learning the weight of each model in the average, or greedily adding models while performance keeps improving.
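The greedy variant can be sketched as follows; this is an illustrative reconstruction (the `average` and `greedy_soup` helpers and the toy validation function are mine, not the authors' code), assuming all checkpoints share one architecture:

```python
def average(models):
    """Per-parameter mean of a list of state dicts."""
    return {k: sum(m[k] for m in models) / len(models) for k in models[0]}

def greedy_soup(models, evaluate):
    """Greedy soup: rank checkpoints by validation score, then add each one
    to the soup only if the averaged model's score does not drop."""
    ranked = sorted(models, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(average(soup))
    for m in ranked[1:]:
        score = evaluate(average(soup + [m]))
        if score >= best:
            soup.append(m)
            best = score
    return average(soup)

# Toy sweep: three checkpoints; "validation" prefers w close to 2.
sweep = [{"w": 1.0}, {"w": 3.0}, {"w": 10.0}]
val = lambda m: -(m["w"] - 2.0) ** 2

souped = greedy_soup(sweep, val)
print(souped)  # {'w': 2.0}  (the outlier w=10 is rejected)
```

The greedy check is what protects the soup: a checkpoint that hurts the averaged model on the validation set is simply left out.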

https://arxiv.org/pdf/2203.05482.pdf
And more models means better results.
There are also interesting findings in terms of the loss space, check the paper.
And the followups:
https://arxiv.org/abs/2210.11948

Another method proposes weighing not only the models but also their individual parameters.
By Michael Matena & Colin Raffel:
https://arxiv.org/abs/2111.09832

In this setting, they assume you have models available and a specific task in mind, but the models were fine-tuned on different tasks. The aim is to average the models and get a better result on the target task.
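That idea can be sketched as a per-parameter weighted average. Here `fisher_merge` and the hand-set importance values are illustrative stand-ins; in the paper, the per-parameter weights come from an estimate of the Fisher information (roughly, averaged squared gradients), not hand-picked numbers:

```python
def fisher_merge(state_dicts, importances):
    """Weighted average where each model's vote on a parameter is scaled by
    that model's estimated importance (Fisher information) for the parameter.

    Assumes every parameter has a nonzero total importance.
    """
    merged = {}
    for k in state_dicts[0]:
        num = sum(sd[k] * imp[k] for sd, imp in zip(state_dicts, importances))
        den = sum(imp[k] for imp in importances)
        merged[k] = num / den
    return merged

# Two models disagree on "w"; the second is three times more "certain".
models = [{"w": 0.0}, {"w": 4.0}]
importances = [{"w": 1.0}, {"w": 3.0}]

merged = fisher_merge(models, importances)
print(merged)  # {'w': 3.0}  (pulled toward the more confident model)
```

Uniform fusing is the special case where every importance is 1; the Fisher weights let each model dominate on the parameters it is most certain about.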

@LChoshen it’s wild that fusing works! I wonder if it depends on the extent to which fine-tuning updated the model weights…
@RonRichman It benefits from larger datasets, so in that sense I would say not too much. It is clear that a similar starting point is crucial (although some works claim that aligning the right neurons is enough to allow reasonable fusing even without it).
@LChoshen Super interesting! Re aligning the neurons, I guess this is in the spirit of the git re-basin work from a few weeks ago…

@LChoshen @RonRichman On this note, is there any evidence that models fine-tuned from the same parent model end up in the same basin?

@LChoshen is there a bookmark functionality on this platform? So that I can save this article to my infinitely growing to-do-read list.