#75: OpenAI's book summaries for the alignment problem, Translatotron 2, and AI-generated movie posters
Hey everyone, welcome to Dynamically Typed #75! This was a bit of a quiet two weeks for productized and climate AI, so today I’ve just got three ML research links and one cool-stuff link for you: OpenAI’s GPT-3 book summarization; Google AI’s Wikipedia-based dataset for multimodal models; an update to the Translatotron speech-to-speech translator; and VQGAN + CLIP-generated movie posters.
Machine Learning Research 🎛
- 📚 Wu et al. (2021) at OpenAI used a fine-tuned GPT-3 to recursively summarize books. The model first separately summarizes sections of a book, then concatenates those summaries together and summarizes the result, and continues the process until it converges on a concise summary of the entire book. This works surprisingly well! For Romeo and Juliet, as visualized on the Summarizing Books demo page, this process takes it from 25,433 words (the whole play), to 5,809 words (72 summaries of sections), to 692 words (7 summaries of section summaries), to 116 words (the final summary). The result is usually a bit worse than an “average” human summarizer, but importantly this recursive process allows researchers to trace back how the model constructed the summary: What part of the book was the source of this plot point in the summary? What parts of lower-level summaries did the model not deem important enough to include in a higher level? Constructing models in such a way that these kinds of questions can be answered is part of OpenAI’s larger research effort into the alignment problem: to “ensure that machine learning models act in accordance with human intentions.” (A core part of their mission.)
- 🗃 Big AI labs’ focus on multimodal neural networks — architectures that combine different types of input and output data, like images and text — continues. After OpenAI’s DALL·E and CLIP, Stanford HAI’s Foundation Models, and DeepMind’s Perceiver IO, Google AI has now announced WIT: a Wikipedia-based image-text dataset. Bridging the gap between human-annotated image captions (too labor-intensive) and broad web-scraped ones (too messy and English-centric), Srinivasan et al. (2021) created WIT by “extracting multiple different text selections associated with an image from Wikipedia articles and Wikimedia image links.” This results in a Creative Commons-licensed dataset of 37.5 million image-text examples, across 11.5 million unique images and 108 languages. Until now these big multimodal models have mostly been trained on proprietary datasets by large private labs; this open dataset should help lower the barrier to entry for university labs to research similar models.
- 💱 In 2019, Google AI introduced Translatotron, “the first ever model that was able to directly [end-to-end] translate speech between two languages,” instead of chaining together separate speech recognition, machine translation, and speech synthesis models (see DT #14). Jia et al. (2021) updated the model to create Translatotron 2, which is newly able to do voice transfer — making the translated speech sound like it was spoken by the same voice as the input speech — “even when the input speech contains multiple speakers speaking in turns.” (Check out the blog post for some samples of generated audio.) One significant change from the original Translatotron is that both the voice and content of the input speech are now captured using a single encoder, which the authors claim makes the model less likely to be abused for spoofing arbitrary audio content (making someone’s voice say something they never said). But I’m a bit surprised that this is such a central part of the blog post, since there are plenty of dedicated voice-mimicking speech generation models out there already that would be easier to use for this purpose anyway.
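To make the recursive book summarization from the first item above a bit more concrete, here is a minimal Python sketch of the summarize-then-concatenate loop. It is only an illustration under stated assumptions, not OpenAI's implementation: `toy_summarize` is a hypothetical stand-in that just keeps the first quarter of each section (the real system uses a fine-tuned GPT-3), and the `chunk_size` and `target_words` parameters are made up for the example.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive sections of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def recursive_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_size: int = 400,    # hypothetical section length, in words
    target_words: int = 120,  # hypothetical stopping point, in words
) -> List[str]:
    """Return every level of the process: full text first, final summary last.

    Keeping the intermediate levels is what makes it possible to trace a
    sentence in the final summary back to the lower-level summaries.
    """
    levels = [text]
    current = text
    while len(current.split()) > target_words:
        sections = chunk_words(current, chunk_size)
        combined = " ".join(summarize(section) for section in sections)
        if len(combined.split()) >= len(current.split()):
            break  # the summarizer is no longer shrinking the text
        current = combined
        levels.append(current)
    return levels


def toy_summarize(section: str) -> str:
    """Stand-in for a real model: keep the first quarter of the words."""
    words = section.split()
    return " ".join(words[:max(1, len(words) // 4)])


if __name__ == "__main__":
    # Fake "book" with the same word count as the Romeo and Juliet example.
    book = " ".join(f"word{i}" for i in range(25433))
    for depth, level in enumerate(recursive_summarize(book, toy_summarize)):
        print(f"level {depth}: {len(level.split())} words")
```

Swapping `toy_summarize` for a call to an actual summarization model is the only change the loop itself would need. For comparison, the Romeo and Juliet numbers above work out to roughly a 4x, 8x, and 6x reduction in word count at the three successive levels.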
Cool Things ✨
VQGAN + CLIP poster for The Prestige, one of my favorite movies, which features Tesla coils and disappearing magicians. (Noah Veltman)
- 🎞 Noah Veltman used VQGAN + CLIP to generate movie posters based on short text descriptions of their plot. His AI movie posters website features a couple dozen examples, with the movie titles hidden behind a spoilers banner so that you can guess them based on the poster. Most of them are pretty difficult to guess, but once you reveal the answer you can definitely see what concepts from the plot the model tried to capture in the poster. And they actually look quite good too! Sadly Veltman didn’t publish the prompts he used to generate each poster, but he does link to an explainer of how the model works.
Thanks for reading! As usual, you can let me know what you thought of today’s issue using the buttons below or by replying to this email. If you’re new here, check out the Dynamically Typed archives or subscribe below to get a new issue in your inbox every second Sunday.
If you enjoyed this issue of Dynamically Typed, why not forward it to a friend? It’s by far the best thing you can do to help me grow this newsletter. 🇧🇪