The unreasonable effectiveness of UNet for image segmentation?
I am fixating on segmentation models and the underlying principles I'd expect for computer (and human) vision.
1) I feel like UNet models shouldn't need all those skip-connections: the highest-resolution one should do
2) I feel like the effectiveness comes from the decoder doing too much work: not only receiving skip-connections, but effectively adding many extra convolutions
3) I feel like fancier models that pair transformer encoders with UNet-like convolutional decoders run into (2), and they're not particularly better/different
4) I think the encoder should do the heavy semantic job; then each "pixel" would actually integrate information from its surroundings, and a simple MLP decoder should suffice to classify/segment it
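To make (4) concrete, here is a minimal numpy sketch of the decoder side of that idea: a per-pixel MLP shared across spatial positions (mathematically equivalent to a stack of 1x1 convolutions) applied to encoder features, then nearest-neighbour upsampling. All shapes, sizes, and the scale factor are illustrative assumptions, not a specific published model.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C, HIDDEN, NUM_CLASSES = 8, 8, 64, 32, 3

# Pretend this is the encoder's output: one C-dim feature vector per
# (downsampled) pixel, each assumed to already summarise its surroundings.
features = rng.standard_normal((H, W, C))

# Per-pixel MLP decoder: two linear layers shared across all positions
# (equivalent to two 1x1 convolutions with a ReLU in between).
W1 = rng.standard_normal((C, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, NUM_CLASSES)) * 0.1
b2 = np.zeros(NUM_CLASSES)

hidden = np.maximum(features @ W1 + b1, 0.0)   # ReLU, applied pixel-wise
logits = hidden @ W2 + b2                      # shape (H, W, NUM_CLASSES)

# Upsample logits back to full resolution by nearest-neighbour repetition
# (assumed stride of 4), then argmax gives the predicted segmentation map.
full = np.repeat(np.repeat(logits, 4, axis=0), 4, axis=1)
segmentation = full.argmax(axis=-1)

print(segmentation.shape)  # (32, 32)
```

The point of the sketch: if the claim in (4) holds, everything spatial lives in the encoder, and the decoder reduces to this handful of matrix multiplies per pixel.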
The approaches in the biomedical field and in big industry players may be diverging: Meta can use the Segment Anything Model, with a heavy transformer encoder pretrained on millions of images, while the average biomed researcher may have a few hundred samples and thus works at a scale where convolutional models are simply better, and where SAM may not even be usable.
#imagesegmentation #unet #visiontransformer #deeplearning