Do T2I (text-to-image) and I2T (image-to-text) models understand each other? To a certain extent, yes. The authors analyze how faithfully images and text survive a round trip when BLIP and Stable #Diffusion talk to each other.
A 🧶

Paper: https://arxiv.org/abs/2212.12249

Day 15 #30daysofDiffusion #MachineLearning

Do DALL-E and Flamingo Understand Each Other?

The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether Flamingo and DALL-E understand each other. To study this question, we propose a reconstruction task where Flamingo generates a description for a given image and DALL-E uses this description as input to synthesize a new image. We argue that these models understand each other if the generated image is similar to the given image. Specifically, we study the relationship between the quality of the image reconstruction and that of the text generation. We find that an optimal description of an image is one that gives rise to a generated image similar to the original one. The finding motivates us to propose a unified framework to finetune the text-to-image and image-to-text models. Concretely, the reconstruction part forms a regularization loss to guide the tuning of the models. Extensive experiments on multiple datasets with different image captioning and image generation models validate our findings and demonstrate the effectiveness of our proposed unified framework. As DALL-E and Flamingo are not publicly available, we use Stable Diffusion and BLIP in the remaining work. Project website: https://dalleflamingo.github.io.

First, the title is a bit misleading, since the authors did not actually use DALL-E or Flamingo in the analysis. The paper should've been titled "Do BLIP and Stable Diffusion Understand Each Other?"
Let's say you have an image x. You caption it with BLIP, then feed the caption to Stable Diffusion, which gives you another image x̃. If the two images are very similar, BLIP did a good job of captioning. (It could also just mean the models share the same biases.)
Similarly, we can check text reconstruction: text → SD → BLIP → new_text. This tells us how well SD translated the text into an image.
The authors use the CIDEr metric to measure text fidelity, and the dot product of CLIP image embeddings to measure image fidelity. They evaluate on the captioning datasets MSCOCO and NoCaps.
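The image-fidelity score boils down to a dot product of normalized CLIP image embeddings (i.e., cosine similarity). A minimal NumPy sketch, assuming the embeddings have already been extracted with a CLIP image encoder (how you extract them, e.g. via open_clip or transformers, is up to you):

```python
import numpy as np

def clip_image_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two (precomputed) CLIP image embeddings.

    emb_a / emb_b stand in for features from a CLIP image encoder;
    L2-normalizing first makes the dot product a cosine similarity in [-1, 1].
    """
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)

# Identical embeddings score 1.0; orthogonal embeddings score 0.0.
print(clip_image_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))
```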
But generation is stochastic, and the authors account for this in the following way.
Take the I2T2I task: for a given image x, they sample k BLIP captions and generate an SD image for each of the k captions. They then pick the caption y_i and generation x̃_i with the maximum CLIP score between x and x̃_i.
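That best-of-k selection step can be sketched as follows. The `caption_fn`, `generate_fn`, and `similarity` arguments are stand-ins for BLIP sampling, Stable Diffusion, and the CLIP image score respectively; this is just the selection logic, not the authors' actual pipeline code:

```python
from typing import Callable, Tuple

def best_of_k(
    x,                            # the original image (any representation)
    caption_fn: Callable,         # stand-in for BLIP: image -> sampled caption
    generate_fn: Callable,        # stand-in for Stable Diffusion: caption -> image
    similarity: Callable,         # stand-in for the CLIP image score
    k: int = 5,
) -> Tuple[object, object, float]:
    """Sample k captions y_i, regenerate an image x̃_i for each, and keep
    the (caption, image) pair maximizing similarity(x, x̃_i)."""
    best = None
    for _ in range(k):
        y = caption_fn(x)
        x_tilde = generate_fn(y)
        score = similarity(x, x_tilde)
        if best is None or score > best[2]:
            best = (y, x_tilde, score)
    return best
```

With real models you would plug in BLIP's sampling-based captioner, an SD pipeline, and the CLIP score from earlier; the same loop (swapping the roles of text and image) gives the T2I2T selection.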
They follow a similar process for T2I2T as well. As expected, the better the reconstructed image, the better the caption, and vice versa.
Some examples are below; the first column is the ground truth and the rest are generations. Sometimes they translate well (first row, columns 2 and 3); sometimes they end up looking very different (second row, last column).
The authors hypothesize that the failures can be attributed to biases in the models. The data a T2I model is trained on can be heavily biased (e.g., any mention of "doctor" in the prompt leads to a medical doctor), which can cause the text to drift.
Another source of bias is coarse annotation of the training data (e.g., every instance of a flamingo is captioned simply as "bird"), which limits how detailed the I2T model's captions can be. This can cause the image to drift.
I wish the authors had analyzed a few more T2I and I2T models... It would have been a great analysis to see which models pair well together.