DALL·E and CLIP: OpenAI's Multimodal Neural Networks
Two example prompts and resulting generated images from DALL·E
OpenAI’s new “multimodal” DALL·E and CLIP models combine text and images, and also mark the first time that the lab has presented two separate big pieces of work in conjunction. In a short blog post, which I’ll quote almost in full throughout this story because it also neatly introduces both networks, OpenAI’s chief scientist Ilya Sutskever explains why:
A long-term objective of artificial intelligence is to build “multimodal” neural networks—AI systems that learn about concepts in several modalities, primarily the textual and visual domains, in order to better understand the world. In our latest research announcements, we present two neural networks that bring us closer to this goal.
These two neural networks are DALL·E and CLIP. We’ll take a look at them one by one, starting with DALL·E.
The name DALL·E is a nod to Salvador Dalí, the surrealist artist known for that painting of melting clocks, and to WALL·E, the Pixar science-fiction romance about a waste-cleaning robot. It’s a bit silly to name an energy-hungry image generation AI after a movie in which lazy humans have fled a polluted Earth to float around in space and do nothing but consume content and food, but given how well the portmanteau works and how cute the WALL·E robots are, I probably would’ve done the same. Anyway, beyond what’s in a name, here’s Sutskever’s introduction of what DALL·E actually does:
The first neural network, DALL·E, can successfully turn text into an appropriate image for a wide range of concepts expressible in natural language. DALL·E uses the same approach used for GPT-3, in this case applied to text–image pairs represented as sequences of “tokens” from a certain alphabet.
DALL·E builds on two previous OpenAI models, combining GPT-3’s capability to perform different language tasks without finetuning with Image GPT’s capability to generate coherent image completions and samples. As input it takes a single stream of up to 1280 tokens (first text tokens for the prompt sentence, then image tokens for the image) and learns to predict the next token given the previous ones. Text tokens take the form of byte-pair encodings of letters, and image tokens are patches from a 32 x 32 grid in the form of latent codes found using a variational autoencoder similar to VQ-VAE. This relatively simple architecture, combined with a large, carefully designed dataset, gives DALL·E the following laundry list of capabilities, each of which has interactive examples in OpenAI’s blog post:
- Controlling attributes
- Drawing multiple objects
- Visualizing perspective and three-dimensionality
- Visualizing internal and external structure (like asking for a macro or x-ray view!)
- Inferring contextual details
- Combining unrelated concepts
- Zero-shot visual reasoning
- Geographic and temporal knowledge
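To make the single-stream input described above a bit more concrete, here is a minimal sketch of how a prompt and an image could be packed into one 1280-token sequence for next-token prediction. This is my own illustration, not OpenAI’s code: the 256-token text budget is simply 1280 minus the 32 x 32 = 1024 image tokens, and the vocabulary sizes, the padding id, and the function names are assumptions.

```python
import numpy as np

IMAGE_GRID = 32
IMAGE_LEN = IMAGE_GRID * IMAGE_GRID  # 1024 image tokens, one per grid cell
MAX_LEN = 1280                       # total stream length from the text above
TEXT_LEN = MAX_LEN - IMAGE_LEN       # 256 tokens left for the prompt
TEXT_VOCAB = 16384                   # assumed text vocabulary size
PAD_ID = 0                           # hypothetical padding token id


def build_stream(text_tokens, image_codes):
    """Concatenate padded text tokens and flattened image codes into one stream."""
    text = np.full(TEXT_LEN, PAD_ID, dtype=np.int64)
    n = min(len(text_tokens), TEXT_LEN)
    text[:n] = text_tokens[:n]
    image = np.asarray(image_codes, dtype=np.int64).reshape(-1)
    assert image.shape == (IMAGE_LEN,), "expected a 32 x 32 grid of latent codes"
    # Offset the image ids so text and image tokens share one id space.
    return np.concatenate([text, image + TEXT_VOCAB])


def next_token_pairs(stream):
    """Inputs and targets for next-token prediction: predict stream[i + 1] from stream[: i + 1]."""
    return stream[:-1], stream[1:]


# Toy example: a three-token prompt and an all-zero grid of image codes.
stream = build_stream([5, 9, 2], np.zeros((IMAGE_GRID, IMAGE_GRID), dtype=int))
inputs, targets = next_token_pairs(stream)
```

At generation time the same model would be fed only the text part and asked to fill in the 1024 image positions one token at a time.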
A lot of people from the community have written about DALL·E or played around with its interactive examples. Some of my favorites include:
- DeepMind researcher Felix Hill’s NonCompositional, a blog post on why DALL·E is good at composition without being very systematic (it can draw a hedgehog-shaped lettuce but not a green cube on a red cube on a blue cube)
- Károly Zsolnai-Fehér’s Two Minute Papers video on DALL·E, OpenAI’s previous work that led to it, and lots of generation examples
- Fun generations from Twitter and beyond: Janelle Shane’s dawnings of DALL-E; Oriol Vinyals’ soap dispenser in the shape of a glacier; and Karol Hausman’s snail made of a corkscrew.
I think DALL·E is the more interesting of the two models, but let’s also take a quick look at CLIP.
CLIP’s performance on different image classification benchmarks.
CLIP has the ability to reliably perform a staggering set of visual recognition tasks. Given a set of categories expressed in language, CLIP can instantly classify an image as belonging to one of these categories in a “zero-shot” way, without the need to fine-tune on data specific to these categories, as is required with standard neural networks. Measured against the industry benchmark ImageNet, CLIP outscores the well-known ResNet-50 system and far surpasses ResNet in recognizing unusual images.
Instead of training on a specific benchmark like ImageNet or ObjectNet, CLIP pretrains on a large dataset of text and images scraped from the internet (so without specific human labels for each image). It performs a proxy training task: “given an image, predict which out of a set of 32,768 randomly sampled text snippets, was actually paired with it in our dataset.” To then do actual classification on a benchmark dataset, the labels are transformed to be more descriptive (e.g. a “cat” label becomes “a photo of a cat”), and CLIP calculates for each label how likely it is to be paired with the image. It predicts the most likely one to be the label. As you can see from the image above, this approach is highly effective across datasets. It’s also very efficient because, being a zero-shot model, CLIP doesn’t need to be (re)trained or finetuned for different datasets.
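The zero-shot classification step above can be sketched in a few lines. This is a simplified illustration under my own assumptions, not CLIP’s actual code: I assume the image and each descriptive prompt have already been turned into embedding vectors by some encoders (the toy vectors below are made up), that “how likely it is to be paired” is a softmax over scaled cosine similarities, and that the scale is 100.

```python
import numpy as np


def make_prompts(labels):
    """Turn bare class labels into more descriptive prompts."""
    return [f"a photo of a {label}" for label in labels]


def normalize(x):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()


def zero_shot_classify(image_embedding, text_embeddings, labels, scale=100.0):
    """Pick the label whose prompt embedding is most similar to the image embedding."""
    img = normalize(np.asarray(image_embedding, dtype=float))
    txt = normalize(np.asarray(text_embeddings, dtype=float))
    probs = softmax(scale * (txt @ img))  # one pairing probability per label
    return labels[int(np.argmax(probs))], probs


# Toy example with hand-made 3-d embeddings standing in for real encoder outputs.
labels = ["cat", "dog", "car"]
prompts = make_prompts(labels)
text_embeddings = np.eye(3)                 # pretend prompt i embeds to axis i
image_embedding = np.array([0.9, 0.1, 0.0])
prediction, probs = zero_shot_classify(image_embedding, text_embeddings, labels)
```

Switching to a different benchmark only means swapping out the `labels` list; nothing gets retrained.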
My favorite application so far of CLIP is by Travis Hoppe, who used it to visualize poems using Unsplash photos — worth a click! Another interesting one is actually how it’s used in combination with DALL·E: after DALL·E generates 512 plausible images for a prompt, CLIP ranks their quality, and only the 32 best ones are returned in the interactive viewer. Instead of researchers cherry-picking the best results to show in a paper, a different neural net can actually perform this task!
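The generate-then-rank setup is simple to sketch. The function and variable names here are hypothetical, and the random scores stand in for whatever prompt-to-image score CLIP would assign to each candidate; only the numbers 512 and 32 come from the description above.

```python
import numpy as np


def rerank(candidates, scores, keep=32):
    """Return the `keep` candidates with the highest scores, best first."""
    order = np.argsort(scores)[::-1][:keep]
    return [candidates[i] for i in order]


# Toy example: 512 placeholder images with random stand-in scores.
rng = np.random.default_rng(0)
candidates = [f"image_{i}" for i in range(512)]
scores = rng.normal(size=512)
best = rerank(candidates, scores, keep=32)
```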
AlphaFold 2: DeepMind's structural biology breakthrough
AlphaFold’s predictions vs. the experimentally-determined shapes of two CASP14 proteins.
DeepMind’s AlphaFold 2 is a major protein folding breakthrough. Protein folding is a problem in structural biology where, given the one-dimensional amino acid sequence of a protein, a computational model has to predict what three-dimensional structure the protein “folds” itself into. This structure is much more difficult to determine experimentally than the amino acid sequence, but it’s essential for understanding how the protein interacts with other machinery inside cells. In turn, this can give insights into the inner workings of diseases (“including cancer, dementia and even infectious diseases such as COVID-19”) and how to fight them.
Biennially since 1994, the Critical Assessment of Techniques for Protein Structure Prediction (CASP) has determined the state of the art in computational models for protein folding using a blind test. Research groups are presented (only) with the amino acid sequences of about 100 proteins whose shapes have recently been experimentally determined. They blindly predict these shapes using their computational models and submit them to CASP to be evaluated with a Global Distance Test (GDT) score, which roughly corresponds to how far each bit of the protein is from where it’s supposed to be. GDT scores range from 0 to 100, and a model that scores at least 90 across different proteins would be considered good enough to be useful to science (“competitive with results obtained from experimental methods”).
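For intuition about what a GDT score measures, here is a simplified sketch. It assumes the common GDT_TS variant, which averages the percentage of residues that land within 1, 2, 4, and 8 ångström of their experimental positions. It also assumes the two structures are already superimposed; the real procedure searches over superpositions, which this toy version skips. The function name and the example coordinates are made up.

```python
import numpy as np

CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance thresholds in angstroms (GDT_TS variant)


def gdt_ts(predicted, experimental):
    """Simplified GDT_TS for two pre-aligned (n_residues, 3) coordinate arrays."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    # Per-residue distance between predicted and experimental positions.
    distances = np.linalg.norm(predicted - experimental, axis=1)
    # Fraction of residues within each cutoff, averaged over the four cutoffs.
    fractions = [(distances <= cutoff).mean() for cutoff in CUTOFFS]
    return 100.0 * float(np.mean(fractions))


# Toy example: four residues that are off by 0.5, 1.5, 3.0, and 10.0 angstroms.
experimental = np.zeros((4, 3))
predicted = np.array([[0.5, 0, 0], [1.5, 0, 0], [3.0, 0, 0], [10.0, 0, 0]])
score = gdt_ts(predicted, experimental)  # 56.25
```

A perfect prediction scores 100, and a residue that is wildly misplaced drags the score down at every cutoff.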
Before CASP13 in 2018, no model had ever scored significantly above 40 GDT. That year, the first version of AlphaFold came in at nearly 60 GDT — already “stunning” at the time (see DT #21). At CASP14 this year, AlphaFold 2 blew its previous results out of the water and achieved a median score of 92.4 GDT across all targets. This was high enough for CASP to declare the problem as “solved” in their press release and to start talking about new challenges for determining the shape of multi-protein complexes.
I’ve waited a bit to write about AlphaFold 2 until the hype died down because, oh boy, was there a lot of hype. DeepMind released a slick video about the team’s process, their results were covered with glowing features in Nature and The New York Times, and high praise came even from the leaders of DeepMind’s biggest competitors, including OpenAI’s Ilya Sutskever and Stanford HAI’s Fei-Fei Li. It was a pretty exciting few days on ML Twitter.
Columbia University’s Mohammed AlQuraishi, who has been working on protein folding for over a decade, was one of the first people to break the CASP14 news. His blog post about CASP13 and AlphaFold 1 was also widely circulated back in 2018, so a lot of people in the field were interested in what he’d have to say this year. Last week, after the hype died down a bit, AlQuraishi published his perspective on AlphaFold 2. He summarized it by saying “it feels like one’s child has left home”: AF2 got results he did not expect to see until the end of this decade, even when taking into account AF1. That’s bittersweet for someone whose lab has also been working on this same problem for a long time.
AlQuraishi is overall extremely positive about DeepMind’s results here, but he does express disappointment at their “falling short of the standards of academic communication” — the lab has so far been much more secretive about AF2 than it was about AF1 (which is open-source). AlQuraishi’s post is very long and technical, but if you want to know exactly how impressive AlphaFold 2 is, learn the basics of how it works, read about its potential applications in broader biology, or see some of the hot takes against it debunked, the post is definitely worth the ~75 minutes of your time. (I always find it energizing to see someone excitedly explain a big advancement in their field that they did not directly work on; here’s the link again.)
I personally also can’t wait to see the first practical applications of AlphaFold, which I expect we’ll start to see DeepMind talk about in the coming years. (Hopefully!) For one, they’ve already released AlphaFold’s predictions for some proteins associated with COVID-19.
Google AI's ethics crisis
Google AI is in the middle of an ethics crisis. Timnit Gebru, the AI ethics researcher behind Gender Shades (see DT #42), Datasheets for Datasets (#41), and much more, got pushed out of the company after a series of conflicts. Karen Hao for MIT Technology Review:
A series of tweets, leaked emails, and media articles showed that Gebru’s exit was the culmination of a conflict over [a critical] paper she co-authored. Jeff Dean, the head of Google AI, told colleagues in an internal email (which he has since put online) that the paper “didn’t meet our bar for publication” and that Gebru had said she would resign unless Google met a number of conditions, which it was unwilling to meet. Gebru tweeted that she had asked to negotiate “a last date” for her employment after she got back from vacation. She was cut off from her corporate email account before her return.
See Casey Newton’s coverage on his Platformer newsletter for both Gebru’s and Jeff Dean’s emails (and here for his extended statement). This story unfolded over the past week and is probably far from over, but from everything I’ve read so far (which is a lot, hence this email hitting your inbox a bit later than usual), I think Google management made the wrong call here. Their statement on the matter focuses on missing references in Gebru’s paper, but as Google Brain Montreal researcher Nicolas Le Roux points out:
… [The] easiest way to discriminate is to make stringent rules, then to decide when and for whom to enforce them. My submissions were always checked for disclosure of sensitive material, never for the quality of the literature review.
This is echoed by a top comment on Hacker News. From Gebru’s email, it sounds like frustrations had been building up for some time, and that the lack of transparency surrounding the internal rejection of this paper was simply the final straw. I think it would’ve been more productive for management to start a dialog with Gebru here: forcing a retraction, “accepting her resignation” immediately, and then cutting off her email only served to escalate the situation.
Gebru’s research on the biases of large (compute-intensive) vision and language models is much harder to do without the resources of a large company like Google. This is a problem that academic ethics researchers often run into; OpenAI’s Jack Clark, who gave feedback on Gebru’s paper, has also pointed this out. I always found it admirable that Google AI, as a research organization, intellectually had the space for voices like Gebru’s to critically investigate these things. It’s a shame that it was not able to sustain an environment in which this could be fostered.
In the end, besides the ethical issues, I think Google’s handling of this situation was also a big strategic misstep. 1500 Googlers and 2100 others have signed an open letter supporting Gebru. Researchers from UC Berkeley and the University of Washington said this will have “a chilling effect” on the field. Apple and Twitter are publicly poaching Google’s AI ethics researchers. Even mainstream outlets like The Washington Post and The New York Times have picked up the story. In the week leading up to NeurIPS and the Black in AI workshop there, is this a better outcome for Google AI than letting an internal researcher submit a conference paper critical of large language models?
Photoshop's Neural Filters
Light direction is one of many new AI-powered features in Photoshop; in the middle picture, the light source is on the left; in the right picture, it’s moved to the right.
Adobe’s latest Photoshop release is jam-packed with AI-powered features. The pitch, by product manager Pam Clark:
You already rely on artificial intelligence features in Photoshop to speed your work every day like Select Subject, Object Selection Tool, Content-Aware Fill, Curvature Pen Tool, many of the font features, and more. Our goal is to systematically replace time-intensive steps with smart, automated technology wherever possible. With the addition of these five major new breakthroughs, you can free yourself from the mundane, non-creative tasks and focus on what matters most – your creativity.
Adobe is branding the most exciting of these new features as Neural Filters: neural-network-powered image manipulations that are parameterized by sliders in the Photoshop UI. Some of them automate tasks that were previously very labor-intensive, while others enable changes that were previously impossible. Here are a few of both:
- Style transfer: apply one photo’s style to another, like the classic “make this look like a Picasso / Van Gogh / Monet.”
- Smart portraits: subtly change a photo subject’s age, expression, gaze direction, pose, hair thickness, etc.
- Colorize: infer colors for black-and-white photos based on their contents.
- JPEG Artifacts Removal: smooth out the blocky artifacts that occur on patches of JPEG-compressed photos.
These all run on-device and came out of a collaboration between Adobe Research and NVIDIA, implying they’re best suited to machines with beefy GPUs, which is not surprising. However, the blog post is a little vague about the specifics here (“performance is particularly fast on desktops and notebooks with graphics acceleration”), so I wonder whether Neural Filters is also optimized for any other AI accelerator chips that Adobe can’t mention yet. In particular, Apple recently showed off their new A14 chips that feature a much faster Neural Engine. These chips launched in the latest iPhones and iPads but will also be in a new line of non-Intel “Apple Silicon” Macs, rumored to be announced next month. What are the chances that Apple will boast about the performance of Neural Filters on the Neural Engine during the presentation? I’d say pretty big. (Maybe worthy of a Ricky, even?)
Anyway, this Photoshop release is exactly the kind of productized AI that I started DT to cover: advanced machine learning models — that only a few years ago were just cool demos at conferences — wrapped up in intuitive UIs that fit into users’ existing workflows. It’s now just as easy to tweak the intensity of a smile or the direction of a gaze in a portrait photo as it is to manipulate its hue or brightness. That’s pretty amazing.
OpenAI and Microsoft: GPT-3 and beyond
OpenAI is exclusively licensing GPT-3 to Microsoft. What does this mean for their future relationship?
GPT-3 is OpenAI’s latest gargantuan language model (see DT #42) that’s uniquely capable of performing many different “text-in, text-out” tasks — demos range from imitating famous writers to generating code (#44) — without needing to be fine-tuned: its crazy scale makes it a few-shot learner.
In July 2019, OpenAI announced it got a $1 billion investment from Microsoft. Back then, this raised some eyebrows in the (academic) machine learning community, which can sometimes be a bit allergic to the commercialization of AI (#19). The exact terms of the investment were never disclosed, but some key elements of the deal were. Tom Simonite for WIRED:
Most interesting bit of the OpenAI announcement: “we intend to license some of our pre-AGI technologies, with Microsoft becoming our preferred partner.”
Now, a year and a bit later, that’s exactly what happened. From the OpenAI blog:
In addition to offering GPT-3 and future models via the OpenAI API, and as part of a multiyear partnership announced last year, OpenAI has agreed to license GPT-3 to Microsoft for their own products and services.
What does that mean? Nick Statt for The Verge:
A Microsoft spokesperson tells The Verge that its exclusive license gives it unique access to the underlying code of GPT-3, which contains technical advancements it hopes to integrate into its products and services.
In their blog post, Microsoft pitches this as a way to “expand [their] Azure-powered AI platform in a way that democratizes AI technology,” to which the community again reacted negatively: if you want to democratize AI, why not just open-source GPT-3’s code and training data? I agree that “democratizing” is a bit of a stretch, but I think there’s a much more interesting discussion to be had here than the one on a self-congratulatory word choice in a corporate press release. Perhaps ironically, that discussion also starts from overanalyzing another few words in that very same press release.
According to Microsoft’s blog post about the licensing deal, GPT-3 “is trained on Azure’s AI supercomputer.” I wonder if that means OpenAI is now using Microsoft’s open-source DeepSpeed library (#34) to train its GPT models. DeepSpeed is a library for distributed training of enormous ML models that has specific features to support training large Transformers; Microsoft Research claimed in May that it’s capable of training models with up to 170 billion parameters (#40). GPT-3 is a 175-billion-parameter Transformer that was released in June, just one month later. That seems unlikely to be a coincidence, and Microsoft’s latest DeepSpeed update (#49) even includes some experimental work using the GPT-3 architecture.
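Some back-of-the-envelope arithmetic shows why a model of this size needs a distributed training library at all. The numbers below are my own assumptions rather than anything from OpenAI or Microsoft: I assume mixed-precision training with an Adam-style optimizer at roughly 16 bytes of state per parameter (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights and two optimizer moments), I ignore activations entirely, and the 32 GB per-device memory is a hypothetical accelerator.

```python
import math

BYTES_PER_PARAM = 16  # assumed: fp16 weights + fp16 grads + fp32 weights and two Adam moments


def training_state_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Rough memory for weights, gradients, and optimizer state, in gigabytes."""
    return n_params * bytes_per_param / 1e9


def min_devices(n_params, device_memory_gb, bytes_per_param=BYTES_PER_PARAM):
    """Lower bound on devices needed just to hold the training state when it is split evenly."""
    return math.ceil(training_state_gb(n_params, bytes_per_param) / device_memory_gb)


gpt3_params = 175_000_000_000
state_gb = training_state_gb(gpt3_params)               # 2800.0 GB of training state
devices = min_devices(gpt3_params, device_memory_gb=32)  # 88 devices at 32 GB each
```

Even under these generous assumptions the training state alone is far beyond a single device, so it has to be partitioned across many of them, which is the kind of problem DeepSpeed is built for.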
So this suggests that the partnership goes beyond just the exchange of Microsoft’s money and compute for OpenAI’s trained models and ML brand strength (an exchange of cloud for clout, if you will) that we previously expected. Are the companies actually also deeply collaborating on ML and systems engineering research? I’d love to find out.
If so, this could be an early indication that Microsoft — who I’m sure is at least a little bit envious of Google’s ownership of DeepMind — will eventually want to acquire OpenAI. And it could be a great fit. Looking at Microsoft’s recent acquisition history, it has so far let GitHub (which it acquired two years ago) continue to operate largely autonomously. This makes it an attractive potential parent company for OpenAI: the lab probably wouldn’t have to give up too much of its independence under Microsoft’s stewardship. So unless OpenAI actually invents and monetizes some form of artificial general intelligence (AGI) in the next five to ten years — which I don’t think they will — I wouldn’t be surprised if they end up becoming Microsoft’s DeepMind.