Christian Wolf

209 Followers
143 Following
50 Posts

Principal Scientist at @NaverLabsEurope, Lead of "Spatial AI" team. AI for Robotics, Computer Vision, Machine Learning. Austrian living in France.

https://chriswolfvision.github.io/www

This is the Musée des Confluences in Lyon, France, built in 2014, which has an interesting geometric shape. The Austrian architects from Coop Himmelblau missed an opportunity to integrate a Klein bottle ...

https://en.wikipedia.org/wiki/Mus%C3%A9e_des_Confluences

Musée des Confluences - Wikipedia

I admit that I did not know that burning 1 litre of petrol (= 0.79 kg) creates 2.3 kg of CO2. That's almost 3x its own weight. Pretty shocking.
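The number looks impossible at first, but the extra mass comes from the oxygen pulled in from the air during combustion. A back-of-the-envelope check, approximating petrol as pure octane (C8H18) — the molar masses are standard, the 0.79 kg/L density is the one from the post:

```python
# Sanity-check of the "1 litre of petrol -> 2.3 kg CO2" figure,
# approximating petrol as pure octane (C8H18).
# Combustion: 2 C8H18 + 25 O2 -> 16 CO2 + 18 H2O
M_C, M_H, M_O = 12.011, 1.008, 15.999   # g/mol

M_octane = 8 * M_C + 18 * M_H           # ~114.2 g/mol of fuel
M_co2 = M_C + 2 * M_O                   # ~44.0 g/mol of CO2

# Each mole of octane yields 8 moles of CO2:
ratio = 8 * M_co2 / M_octane            # kg CO2 per kg fuel, ~3.08

density = 0.79                          # kg per litre of petrol
co2_per_litre = density * ratio         # ~2.4 kg CO2 per litre

print(f"{ratio:.2f} kg CO2 per kg fuel")
print(f"{co2_per_litre:.2f} kg CO2 per litre of petrol")
```

The ~3x factor is exactly the stoichiometry: the two oxygen atoms in each CO2 molecule weigh far more than the carbon they attach to, and that oxygen mass is not counted in the fuel's weight.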
Reinforcement learning vs. imitation learning.
Saw "Avatar 2" yesterday: I am unsure whether this is an animated or live action movie.
PhD defense of Samuel Felton at Inria Rennes, France, supervised by Elisa Fromont and Eric Marchand, on visual servoing in deep latent space. Great work, congrats to the new Dr.!
🔥

A saga in how many acts?

The CEO of comma.ai publicly proposes on Twitter to ... intern at Twitter to work on some stuff. Which is crazy in its own right. So, Musk answers, he gets hired a couple of days later, and works on Twitter's search functionality. Tweets regularly about it.

Today he started criticising the new "Free Speech" (sic) implementation at Twitter. Let's see where this goes.

Exhibit A + Exhibit B
There is a risk that I am the one not understanding the process (average IQ 100) in the curve below, but in any case I would like to ...

An attempt at drawing a generative model of the process in ("Image-and-Language Understanding from Pixels Only", https://arxiv.org/abs/2212.08045).

Orange = desired relationship. We add additional unobserved variables to the process, need new data augmentation for them, and do not gain any new information.

I do not criticize this work.
I want to understand why it works.

Image-and-Language Understanding from Pixels Only

Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without
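The abstract's key mechanism is the CLIP-style symmetric contrastive loss; CLIPPO's twist is that both "towers" are the same encoder, with text simply rendered as an image first. A minimal numpy sketch of that loss — function name and shapes are illustrative, not from the paper:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss (illustrative sketch).

    img_emb, txt_emb: (batch, dim) L2-normalised embeddings; in CLIPPO
    both would come from a single shared encoder, the "text" side being
    text rendered as an image.
    """
    # Pairwise cosine similarities, scaled by a temperature.
    logits = img_emb @ txt_emb.T / temperature      # (batch, batch)
    labels = np.arange(len(logits))                 # matching pairs on the diagonal

    def xent(l):
        # Numerically stable cross-entropy pushing the diagonal up.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average of image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero; shuffling the pairing drives it up, which is the whole training signal — no word-level loss needed.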

arXiv.org