Yesterday I completed 50% of the #30daysofDiffusion goal. I learned a lot, but it's also an information overload. I am going to take a couple of weeks' break and then resume.
Access all the paper & tweet links here - https://go.umd.edu/30daysofdiffusion
#Diffusion #MachineLearning

30 days of diffusion
Part 1
Name,Date,Paper Link,Status,Tweet link
dreambooth,2-Jan-23,https://arxiv.org/abs/2208.12242,Done ✨,https://twitter.com/gowthami_s/status/1609942652971266050?s=20
tex...
Do T2I and I2T models understand each other? The answer is, they do, to a certain extent. The authors analyze the fidelity of image and text tasks when BLIP and Stable #Diffusion talk to each other.
A 🧶
Paper: https://arxiv.org/abs/2212.12249
Day 15 #30daysofDiffusion #MachineLearning


Do DALL-E and Flamingo Understand Each Other?
The field of multimodal research focusing on the comprehension and creation
of both images and text has witnessed significant strides. This progress is
exemplified by the emergence of sophisticated models dedicated to image
captioning at scale, such as the notable Flamingo model and text-to-image
generative models, with DALL-E serving as a prominent example. An interesting
question worth exploring in this domain is whether Flamingo and DALL-E
understand each other. To study this question, we propose a reconstruction task
where Flamingo generates a description for a given image and DALL-E uses this
description as input to synthesize a new image. We argue that these models
understand each other if the generated image is similar to the given image.
Specifically, we study the relationship between the quality of the image
reconstruction and that of the text generation. We find that an optimal
description of an image is one that gives rise to a generated image similar to
the original one. The finding motivates us to propose a unified framework to
finetune the text-to-image and image-to-text models. Concretely, the
reconstruction part forms a regularization loss to guide the tuning of the
models. Extensive experiments on multiple datasets with different image
captioning and image generation models validate our findings and demonstrate
the effectiveness of our proposed unified framework. As DALL-E and Flamingo are
not publicly available, we use Stable Diffusion and BLIP in the remaining work.
Project website: https://dalleflamingo.github.io.
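The reconstruction criterion above is easy to sketch: caption the image, regenerate an image from the caption, and keep the caption whose regeneration is most similar to the original. A minimal sketch with toy stand-ins (the real pipeline uses BLIP for captioning, Stable Diffusion for generation, and an image similarity like CLIP; `toy_t2i` and `cosine` here are placeholders, not the paper's code):

```python
import numpy as np

def reconstruction_score(image, caption, generate_image, similarity):
    """Score a caption by how well an image regenerated from it
    matches the original image (the reconstruction criterion)."""
    return similarity(image, generate_image(caption))

def best_caption(image, captions, generate_image, similarity):
    """Keep the caption whose reconstruction is most similar to the original."""
    scores = [reconstruction_score(image, c, generate_image, similarity)
              for c in captions]
    return captions[int(np.argmax(scores))]

# Toy stand-ins: "images" are 2-d vectors and the T2I model is a lookup table.
toy_t2i = {"a cat": np.array([1.0, 0.0]), "a dog": np.array([0.0, 1.0])}
cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

original = np.array([0.9, 0.1])  # much closer to the "a cat" reconstruction
best = best_caption(original, ["a cat", "a dog"], toy_t2i.__getitem__, cosine)
# best -> "a cat"
```

The same score, made differentiable, is what the paper turns into the regularization loss for joint finetuning.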
For faster inference, distill (in 2 stages) multi-step classifier-free guided #diffusion models into a student model of the same architecture that can generate the same-quality images in fewer steps.
A 🧶
Paper: https://arxiv.org/abs/2210.03142
Day 14 #30daysofDiffusion #MachineLearning


On Distillation of Guided Diffusion Models
Classifier-free guided diffusion models have recently been shown to be highly
effective at high-resolution image generation, and they have been widely used
in large-scale diffusion frameworks including DALLE-2, Stable Diffusion and
Imagen. However, a downside of classifier-free guided diffusion models is that
they are computationally expensive at inference time since they require
evaluating two diffusion models, a class-conditional model and an unconditional
model, tens to hundreds of times. To deal with this limitation, we propose an
approach to distilling classifier-free guided diffusion models into models that
are fast to sample from: Given a pre-trained classifier-free guided model, we
first learn a single model to match the output of the combined conditional and
unconditional models, and then we progressively distill that model to a
diffusion model that requires much fewer sampling steps. For standard diffusion
models trained on the pixel-space, our approach is able to generate images
visually comparable to that of the original model using as few as 4 sampling
steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to
that of the original model while being up to 256 times faster to sample from.
For diffusion models trained on the latent-space (e.g., Stable Diffusion), our
approach is able to generate high-fidelity images using as few as 1 to 4
denoising steps, accelerating inference by at least 10-fold compared to
existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate
the effectiveness of our approach on text-guided image editing and inpainting,
where our distilled model is able to generate high-quality results using as few
as 2-4 denoising steps.
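The two stages can be sketched in a few lines: stage 1 trains a single student to match the classifier-free-guided combination of the conditional and unconditional outputs, and stage 2 progressively halves the sampler's step count. A simplified sketch (function names are mine, not from the paper's code):

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance combination; the stage-1 student is
    trained to reproduce this with a single network evaluation
    instead of two."""
    return (1 + w) * eps_cond - w * eps_uncond

def progressive_schedule(start_steps=256, end_steps=4):
    """Stage 2: repeatedly halve the step count. At each stage a fresh
    student learns to cover two of its teacher's steps in one."""
    schedule, steps = [], start_steps
    while steps >= end_steps:
        schedule.append(steps)
        steps //= 2
    return schedule  # e.g. [256, 128, 64, 32, 16, 8, 4]
```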
Versatile Diffusion: A diffusion model trained with reconstruction objectives on image and text together. It can go text-to-image, image-to-image, image->text->image, and so on.
A 🧶
Paper: https://arxiv.org/abs/2211.08332
Day 13 #30daysofDiffusion #MachineLearning


Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable the crossmodal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.; c) The success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research. Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion.
A contemplative piece on how the current datasets used for training large-scale T2I models might not be right for generating art since they are quite objective.
A tiny paper 🧶
Paper: https://arxiv.org/abs/2210.10578
Day 12 #30daysofDiffusion #Diffusion #MachineLearning

Language Does More Than Describe: On The Lack Of Figurative Speech in Text-To-Image Models
The impressive capacity shown by recent text-to-image diffusion models to
generate high-quality pictures from textual input prompts has leveraged the
debate about the very definition of art. Nonetheless, these models have been
trained using text data collected from content-based labelling protocols that
focus on describing the items and actions in an image but neglect any
subjective appraisal. Consequently, these automatic systems need rigorous
descriptions of the elements and the pictorial style of the image to be
generated, otherwise failing to deliver. As potential indicators of the actual
artistic capabilities of current generative models, we characterise the
sentimentality, objectiveness and degree of abstraction of publicly available
text data used to train current text-to-image diffusion models. Considering the
sharp difference observed between their language style and that typically
employed in artistic contexts, we suggest generative models should incorporate
additional sources of subjective information in their training in order to
overcome (or at least to alleviate) some of their current limitations, thus
effectively unleashing a truly artistic and creative generation.
eDiff-I: A new text-to-image #diffusion model. Uses T5 and both CLIP encoders for conditioning. Instead of using the same denoising model for all steps, they propose using multiple specialized ones. A 🧵
Paper: https://arxiv.org/abs/2211.01324
Day 11 #30daysofDiffusion #MachineLearning

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
Large-scale diffusion-based generative models have led to breakthroughs in
text-conditioned high-resolution image synthesis. Starting from random noise,
such text-to-image diffusion models gradually synthesize images in an iterative
fashion while conditioning on text prompts. We find that their synthesis
behavior qualitatively changes throughout this process: Early in sampling,
generation strongly relies on the text prompt to generate text-aligned content,
while later, the text conditioning is almost entirely ignored. This suggests
that sharing model parameters throughout the entire generation process may not
be ideal. Therefore, in contrast to existing works, we propose to train an
ensemble of text-to-image diffusion models specialized for different synthesis
stages. To maintain training efficiency, we initially train a single model,
which is then split into specialized models that are trained for the specific
stages of the iterative generation process. Our ensemble of diffusion models,
called eDiff-I, results in improved text alignment while maintaining the same
inference computation cost and preserving high visual quality, outperforming
previous large-scale text-to-image diffusion models on the standard benchmark.
In addition, we train our model to exploit a variety of embeddings for
conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We
show that these different embeddings lead to different behaviors. Notably, the
CLIP image embedding allows an intuitive way of transferring the style of a
reference image to the target text-to-image output. Lastly, we show a technique
that enables eDiff-I's "paint-with-words" capability. A user can select the
word in the input text and paint it in a canvas to control the output, which is
very handy for crafting the desired image in mind. The project page is
available at https://deepimagination.cc/eDiff-I/
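At sampling time the ensemble idea reduces to routing each denoising step to the expert responsible for that stage of generation. A toy router, assuming `t` counts down from `num_steps` (pure noise) to 0 (clean image); the hand-picked `boundaries` are illustrative, whereas eDiff-I chooses its intervals by analysing when text conditioning stops influencing generation:

```python
import bisect

def route_to_expert(t, num_steps, boundaries=(0.5,)):
    """Pick which specialized denoiser handles timestep t.
    `boundaries` split sampling progress (0 = start, 1 = done) into
    expert intervals: early, text-heavy steps go to expert 0,
    later refinement steps to the next expert, and so on."""
    progress = 1 - t / num_steps  # fraction of sampling completed
    return bisect.bisect(boundaries, progress)
```

For example, with the default single boundary, step `t=1000` of 1000 is routed to expert 0 and step `t=100` to expert 1; inference cost is unchanged because exactly one expert runs per step.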
Retrieval Augmented #Diffusion (RDM) models: Smaller diffusion models can produce high-quality generations by accessing an external memory to guide the generation. Inspired by DeepMind's RETRO.
A 🧶
Paper: https://arxiv.org/abs/2204.11824
Day 10 #30daysofDiffusion #MachineLearning

Semi-Parametric Neural Image Synthesis
Novel architectures have recently improved generative image synthesis leading
to excellent visual quality in various tasks. Much of this success is due to
the scalability of these architectures and hence caused by a dramatic increase
in model complexity and in the computational resources invested in training
these models. Our work questions the underlying paradigm of compressing large
training data into ever growing parametric representations. We rather present
an orthogonal, semi-parametric approach. We complement comparably small
diffusion or autoregressive models with a separate image database and a
retrieval strategy. During training we retrieve a set of nearest neighbors from
this external database for each training instance and condition the generative
model on these informative samples. While the retrieval approach is providing
the (local) content, the model is focusing on learning the composition of
scenes based on this content. As demonstrated by our experiments, simply
swapping the database for one with different contents transfers a trained model
post-hoc to a novel domain. The evaluation shows competitive performance on
tasks which the generative model has not been trained on, such as
class-conditional synthesis, zero-shot stylization or text-to-image synthesis
without requiring paired text-image data. With negligible memory and
computational overhead for the external database and retrieval we can
significantly reduce the parameter count of the generative model and still
outperform the state-of-the-art.
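The retrieval step itself is plain nearest-neighbour search in an embedding space (RDM retrieves in CLIP image space); the retrieved set then conditions the generative model. A minimal cosine-similarity sketch, with a brute-force search standing in for the scalable index a real system would use:

```python
import numpy as np

def retrieve_neighbors(query, database, k=4):
    """Return the k database entries most similar to the query under
    cosine similarity. In RDM the database holds image embeddings and
    the retrieved neighbours are fed to the diffusion model as
    conditioning for each training instance."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    order = np.argsort(-(db @ q))  # most similar first
    return database[order[:k]]
```

Swapping the database post-hoc, as the abstract notes, needs no retraining: only the conditioning vectors returned here change.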
StructureDiffusion: Improve the compositional generation capabilities of text-to-image #diffusion models by modifying the text guidance using a constituency tree or a scene graph.
A 🧵
Paper: https://arxiv.org/abs/2212.05032
Day 9 #30daysofDiffusion #MachineLearning


Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Large-scale diffusion models have achieved state-of-the-art results on
text-to-image synthesis (T2I) tasks. Despite their ability to generate
high-quality yet creative images, we observe that attribution-binding and
compositional capabilities are still considered major challenging issues,
especially when involving multiple objects. In this work, we improve the
compositional skills of T2I models, specifically more accurate attribute
binding and better image compositions. To do this, we incorporate linguistic
structures with the diffusion guidance process based on the controllable
properties of manipulating cross-attention layers in diffusion-based T2I
models. We observe that keys and values in cross-attention layers have strong
semantic meanings associated with object layouts and content. Therefore, we can
better preserve the compositional semantics in the generated image by
manipulating the cross-attention representations based on linguistic insights.
Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention
design is efficient and requires no additional training samples. We achieve
better compositional skills in qualitative and quantitative results, leading to
a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an
in-depth analysis to reveal potential causes of incorrect image compositions
and justify the properties of cross-attention layers in the generation process.
InstructPix2Pix: Edit an image with text guidance in a single forward pass. Why use inversion or other tricks at inference time? Just create a dataset using inversion techniques and train a new model.
A 🧶
Paper: https://arxiv.org/abs/2211.09800
Day 8 #30daysofDiffusion #Diffusion #MachineLearning


InstructPix2Pix: Learning to Follow Image Editing Instructions
We propose a method for editing images from human instructions: given an
input image and a written instruction that tells the model what to do, our
model follows these instructions to edit the image. To obtain training data for
this problem, we combine the knowledge of two large pretrained models -- a
language model (GPT-3) and a text-to-image model (Stable Diffusion) -- to
generate a large dataset of image editing examples. Our conditional diffusion
model, InstructPix2Pix, is trained on our generated data, and generalizes to
real images and user-written instructions at inference time. Since it performs
edits in the forward pass and does not require per example fine-tuning or
inversion, our model edits images quickly, in a matter of seconds. We show
compelling editing results for a diverse collection of input images and written
instructions.
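At inference, InstructPix2Pix applies classifier-free guidance with two scales, one for the input image and one for the edit instruction, combining three network evaluations. A sketch of that combination (the default scale values below are illustrative, not canonical):

```python
import numpy as np

def dual_cfg(eps_uncond, eps_img, eps_full, s_img=1.5, s_txt=7.5):
    """InstructPix2Pix-style guidance with two scales: eps_img is the
    network output conditioned on the input image only, eps_full on
    image + instruction, eps_uncond on neither. Raising s_img keeps
    the edit closer to the input image; raising s_txt follows the
    instruction more aggressively."""
    return (eps_uncond
            + s_img * (eps_img - eps_uncond)
            + s_txt * (eps_full - eps_img))
```

With both scales set to 1 this collapses to the fully conditioned prediction, which is a quick sanity check on the formula.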
Get the average CLIP image embeddings of an "Aesthetic" dataset, optimize the CLIP text encoder to align with this embedding, and plug it into SD to get better-looking images!
A tiny 🧶
Paper: https://arxiv.org/abs/2209.12330
Day 7 #30daysofDiffusion #Diffusion #MachineLearning
Personalizing Text-to-Image Generation via Aesthetic Gradients
This work proposes aesthetic gradients, a method to personalize a
CLIP-conditioned diffusion model by guiding the generative process towards
custom aesthetics defined by the user from a set of images. The approach is
validated with qualitative and quantitative experiments, using the recent
stable diffusion model and several aesthetically-filtered datasets. Code is
released at https://github.com/vicgalle/stable-diffusion-aesthetic-gradients
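The aesthetic-gradients idea in one step: push the prompt's representation toward the mean CLIP image embedding of the aesthetic set by gradient ascent on cosine similarity. The paper applies such steps to the CLIP text-encoder weights; this sketch updates the embedding directly to stay self-contained:

```python
import numpy as np

def aesthetic_step(text_emb, aesthetic_mean, lr=0.1):
    """One gradient-ascent step on cos(text_emb, aesthetic_mean),
    using the analytic gradient of cosine similarity w.r.t. the
    embedding. Simplified: the real method backpropagates this
    objective into the text encoder's parameters."""
    e, a = text_emb, aesthetic_mean
    ne, na = np.linalg.norm(e), np.linalg.norm(a)
    cos = e @ a / (ne * na)
    grad = a / (ne * na) - cos * e / ne**2  # d cos / d e
    return e + lr * grad
```

A few such steps per prompt are cheap, which is why the personalization can run at generation time rather than as a full finetune.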