#57: DALL·E and CLIP, OpenAI's new multimodal neural networks
Hey everyone, welcome to Dynamically Typed #57! In today’s issue I’ve written about DALL·E and CLIP, two new multimodal neural networks by OpenAI that learn to generate and classify images based on textual prompts in a zero-shot way. They’re both very cool so they’ve taken up the bulk of this newsletter, but I’ve also got one quick link each for productized AI, ML research and climate AI, covering the other interesting stuff that happend.
Productized Artificial Intelligence 🔌
- 📸 Creative Commons photo sharing site Unsplash (where I also have a profile!) has launched a new feature: Visual Search, similar to Google’s search by image. If you’ve found a photo you’d like to include in a blog post or presentation, for example, but the image is copyrighted, this new Unsplash feature will help you find similar-looking ones that are free to use. The launch post doesn’t go into detail about how Visual Search works, but I’m guessing some (convolutional) classification model extracts features from all images on Unsplash to create a high-dimensional embedding; the same happens to the image you upload, and the site can then serve you photos that are close together in this embedding space. (Here’s an example of how you’d build that in Keras.)
Machine Learning Research 🎛
Two example prompts and resulting generated images from DALL·E
OpenAI’s new “multimodal” DALL·E and CLIP models combine text and images, and also mark the first time that the lab has presented two separate big pieces of work in conjunction. In a short blog post, which I’ll quote almost in full throughout this story because it also neatly introduces both networks, OpenAI’s chief scientist Ilya Sutskever explains why:
A long-term objective of artificial intelligence is to build “multimodal” neural networks—AI systems that learn about concepts in several modalities, primarily the textual and visual domains, in order to better understand the world. In our latest research announcements, we present two neural networks that bring us closer to this goal.
These two neural networks are DALL·E and CLIP. We’ll take a look at them one by one, starting with DALL·E.
The name DALL·E is a nod to Salvador Dalí, the surrealist artist known for that painting of melting clocks, and to WALL·E, the Pixar science-fiction romance about a waste-cleaning robot. It’s a bit silly to name an energy-hungry image generation AI after a movie in which lazy humans have fled a polluted Earth to float around in space and do nothing but consume content and food, but given how well the portmanteau works and how cute the WALL·E robots are, I probably would’ve done the same. Anyway, beyond what’s in a name, here’s Sutskever’s introduction of what DALL·E actually does:
The first neural network, DALL·E, can successfully turn text into an appropriate image for a wide range of concepts expressible in natural language. DALL·E uses the same approach used for GPT-3, in this case applied to text–image pairs represented as sequences of “tokens” from a certain alphabet.
DALL·E builds on two previous OpenAI models, combining GPT-3’s capability to perform different language tasks without finetuning with Image GPT’s capability to generate coherent image completions and samples. As input it takes a single stream — first text tokens for the prompt sentence, then image tokens for the image — of up to 1280 tokens, and learns to predict the next token given the previous ones. Text tokens take the form of byte-pair encodings of letters, and image tokens are patches from a 32 x 32 grid in the form of latent codes found using a variational autoencoder similar to VGVAE. This relatively simple architecture, combined with a large, carefully designed dataset, gives DALL·E the following laundry list of capabilities, each of which have interactive examples in OpenAI’s blog post:
- Controlling attributes
- Drawing multiple objects
- Visualizing perspective and three-dimensionality
- Visualizing internal and external structure (like asking for a macro or x-ray view!)
- Inferring contextual details
- Combining unrelated concepts
- Zero-shot visual reasoning
- Geographic and temporal knowledge
A lot of people from the community have written about DALL·E or played around with its interactive examples. Some of my favorites include:
- DeepMind researcher Felix’s Hill’s NonCompositional, a blog post on why DALL·E is good at composition without being very systematic (it can draw a hedgehog-shaped lettuce but not a green cube on a red cube on a blue cube )
- Károly Zsolnai-Fehér’s Two Minute Papers video on DALL·E, OpenAI’s previous work that led to it, and lots of generation examples
- Fun generations from Twitter and beyond: Janelle Shane’s dawnings of DALL-E; Oriol Vinyals’ soap dispenser in the shape of a glacier; and Karol Hausman’s snail made of a corkscrew.
I think DALL·E is the more interesting of the two models, but let’s also take a quick look at CLIP.
CLIP’s performance on different image classification benchmarks.
CLIP has the ability to reliably perform a staggering set of visual recognition tasks. Given a set of categories expressed in language, CLIP can instantly classify an image as belonging to one of these categories in a “zero-shot” way, without the need to fine-tune on data specific to these categories, as is required with standard neural networks. Measured against the industry benchmark ImageNet, CLIP outscores the well-known ResNet-50 system and far surpasses ResNet in recognizing unusual images.
Instead of training on a specific benchmark like ImageNet or ObjectNet, CLIP pretrains on a large dataset of text and images scraped from the internet (so without specific human labels for each images). It performs a proxy training task: “given an image, predict which out of a set of 32,768 randomly sampled text snippets, was actually paired with it in our dataset.” To then do actual classification on a benchmark dataset, the labels are transformed to be more descriptive (e.g. a “cat” label becomes “a photo of a cat”), and CLIP calculates for each label how likely it is to be paired with the image. It predicts the most likely one to be the label. As you can see from the image above, this approach is highly effective across datasets. It’s also very efficient because, being a zero-shot model, CLIP doesn’t need to be (re)trained or finetuned for different datasets.
My favorite application so far of CLIP is by Travis Hoppe, who used it to visualize poems using Unsplash photos — worth a click! Another interesting one is actually how it’s used in combination with DALL·E: after DALL·E generates 512 plausible images for a prompt, CLIP ranks their quality, and only the 32 best ones are returned in the interactive viewer. Instead of researchers cherry-picking the best results to show in a paper, a different neural net can actually perform this task!
Quick ML research + resource links 🎛
The FCN-8 semantic segmentation network, drawn using PlotNeuralNet.
- ⚡️ PlotNeuralNet is an open-source LaTeX package for drawing deep neural networks, also featuring a Python wrapper. I’ve spent hours doing this by hand in Figma in the past, so this is a very welcome change. (Thanks for the tip, Tim!)
I’ve also collected all 75+ ML research tools previously featured in Dynamically Typed on a Notion page for quick reference. ⚡️
Artificial Intelligence for the Climate Crisis 🌍
- 🗃 Stephen Rasp of agriculture climate change adaptation platform ClimateAi has launched Pangeo ML Datasets, a website that collects weather and climate datasets for AI research. It includes both raw datasets and datasets already preprocessed specifically for training machine learning models.
Thanks for reading! As usual, you can let me know what you thought of today’s issue using the buttons below or by replying to this email. If you’re new here, check out the Dynamically Typed archives or subscribe below to get a new issues in your inbox every second Sunday.
If you enjoyed this issue of Dynamically Typed, why not forward it to a friend? It’s by far the best thing you can do to help me grow this newsletter. ❄️