🚨Beep beep 🚨 Have you ever wondered how to use human similarity judgments to improve neural network representations? 🧠 We have something for you! We found a linear transform that improves both representational alignment and downstream task performance! 🦾

https://arxiv.org/abs/2306.04507

Improving neural network representations using human similarity judgments

Deep neural networks have reached human-level performance on many computer vision tasks. However, the objectives used to train these networks enforce only that similar images are embedded at similar locations in the representation space, and do not directly constrain the global structure of the resulting space. Here, we explore the impact of supervising this global structure by linearly aligning it with human similarity judgments. We find that a naive approach leads to large changes in local representational structure that harm downstream performance. Thus, we propose a novel method that aligns the global structure of representations while preserving their local structure. This global-local transform considerably improves accuracy across a variety of few-shot learning and anomaly detection tasks. Our results indicate that human visual representations are globally organized in a way that facilitates learning from few examples, and incorporating this global structure into neural network representations improves performance on downstream tasks.

Naively aligning neural network representations with human similarity judgments improves representational alignment but substantially hurts downstream task performance. Maximizing representational alignment while preserving a model's local similarity structure yields a best-of-both-worlds representation! 🧠🤖
While the original representations are locally accurate (that’s what the pretraining objectives shoot for), they are poorly organized globally. Our transform restructures the representation space in a globally more meaningful and human-aligned way while preserving local structure.
Across a wide variety of few-shot learning and anomaly detection tasks, our transform considerably improves performance over the original representations. At the same time, the transform improves representational alignment across several human similarity judgment datasets, on par with the naive approach!
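The core idea can be sketched as toy code. This is not the paper's actual method, data, or hyperparameters; the loss terms, the k-NN neighborhood mask, and all parameter values below are illustrative assumptions: fit a linear map so the transformed similarities match human judgments (global term) while similarities among each item's nearest neighbors stay close to the model's original ones (local term).

```python
import numpy as np

def global_local_transform(X, S_human, n_neighbors=3, lam=0.5, lr=1e-4, steps=300):
    """Toy sketch: fit a linear map W so that Z = X @ W matches a human
    similarity matrix (global structure) while preserving the model's own
    similarities among nearest neighbors (local structure).
    X: (n, d) embeddings; S_human: (n, n) human similarity matrix."""
    n, d = X.shape
    S0 = X @ X.T  # original model similarity structure

    # symmetric k-NN mask marking each item's local neighborhood
    S0_off = S0.copy()
    np.fill_diagonal(S0_off, -np.inf)  # exclude self-similarity
    idx = np.argsort(-S0_off, axis=1)[:, :n_neighbors]
    M = np.zeros((n, n))
    M[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = 1.0
    M = np.maximum(M, M.T)

    W = np.eye(d)  # start from the identity (original representation)
    for _ in range(steps):
        Z = X @ W
        # gradient of ||Z Z^T - S_human||_F^2 w.r.t. Z (constants folded into lr)
        g_global = (Z @ Z.T - S_human) @ Z
        # gradient of the masked local-preservation term ||M * (Z Z^T - S0)||_F^2
        g_local = (M * (Z @ Z.T - S0)) @ Z
        W -= lr * X.T @ (g_global + lam * g_local)
    return W
```

The weight `lam` trades off global human alignment against local fidelity; the paper's point is that keeping the local term is what protects downstream performance.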
With this work, we hope to provide a way forward for understanding the differences between human and neural network representation spaces, and, more broadly, the interaction between the local and global similarity structure of neural net representations.
Side note: all of this only applies to CLIP models! ImageNet models fail to yield a best-of-both-worlds representation, probably because the information encoded in their representations is not rich enough.
This has been a stellar team effort w/ Lorenz Linhardt Jonas Dippel Robert A. Vandermeulen @khermann @lampinen @simonster 🧠🦾🎉