#74: Apple's privacy-focused facial recognition, DeepMind's multimodal Perceiver IO, and sea ice forecasting with IceNet
Hey everyone, welcome to Dynamically Typed #74! Today’s productized AI section includes some updates on the ClipDrop app and a detailed Apple blog post about privacy-preserving facial recognition in the Photos app. I also covered DeepMind’s new general Perceiver IO architecture for ML research, and IceNet for climate AI. And finally, for cool stuff I found Omnimatte, which we’ll probably see integrated into most video editing software a few years from now. Happy reading!
(This issue is a bit later than usual because preseason just started at rowing and the first few practices have been exhausting (but very fun). Anyway, I’ve finally figured out how to upload GIFs in DT so I hope those make up for the tardiness.)
Productized Artificial Intelligence 🔌
ClipDrop demo. (ClipDrop)
- 📱 I first covered Cyril Diagne’s AR cut and paste tool in May 2020 when it was a cool tech demo on Twitter, and then again when he productized it as ClipDrop in October. As a reminder, ClipDrop lets you take a picture of an object which it then segments (“clips”) out of the background so that you can paste (“drop”) it onto a canvas on your laptop in AR. Diagne has kept busy since the initial launch: he did Y Combinator, raised a seed round, and grew the team. ClipDrop now has 11,000 paying customers; it’s also launching a new web app and an API. (Register for access to the private beta here.) It’s another great example of AI enabling creativity software — see also Photoshop’s Neural Filters, Rosebud AI’s GAN photo models, all the Spleeter-powered apps, and of course RunwayML and Descript.
- 🕵️♀️ Apple’s machine learning blog has a detailed new post about their privacy-focused, on-device implementation of facial recognition in the Photos app. Some interesting details, in no particular order: (1) people are identified not only by embeddings of their face, but also by their upper body and metadata from the photo — two photos taken a few minutes apart are relatively likely to contain the same person; (2) an iterative clustering algorithm first groups very certain matches, then groups those groups, etc, and once it’s no longer certain it asks the user whether two clusters are still the same person; (3) constant re-evaluations of bias in the training dataset serve as a guide to what gaps to fill in new rounds of data collection; (4) running on a recent Apple Neural Engine, face embedding generation takes only 4 milliseconds. I’ve recently switched from Google Photos to Apple Photos, and one thing about their person recognition is definitely impressive: Google thinks two of my friends who are twins are the same person, and Apple can keep them apart.
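Apple's post doesn't ship code, but the clustering idea in point (2) — merge automatically only when very confident, and defer borderline pairs to the user — can be sketched in a few lines. Everything here (cosine similarity, the two thresholds, the greedy centroid-merge strategy) is my own illustrative assumption, not Apple's actual implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def iterative_cluster(embeddings, auto_threshold=0.9, ask_threshold=0.7):
    """Greedily merge clusters whose centroids are very similar; once no
    confident merges remain, return the borderline pairs for the user."""
    clusters = [[e] for e in embeddings]  # start from singletons
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine_similarity(np.mean(clusters[i], axis=0),
                                        np.mean(clusters[j], axis=0))
                if sim >= auto_threshold:  # certain enough: merge silently
                    clusters[i] += clusters.pop(j)
                    merged = True
                    break
            if merged:
                break
    # remaining pairs in the uncertain band get surfaced as a question
    to_confirm = [(i, j) for i in range(len(clusters))
                  for j in range(i + 1, len(clusters))
                  if cosine_similarity(np.mean(clusters[i], axis=0),
                                       np.mean(clusters[j], axis=0)) >= ask_threshold]
    return clusters, to_confirm

# toy demo: two people, two photos each
faces = [np.array([1.0, 0.0]), np.array([0.98, 0.2]),   # person A
         np.array([0.0, 1.0]), np.array([0.2, 0.98])]   # person B
clusters, ask_user = iterative_cluster(faces)
print(len(clusters), len(ask_user))  # 2 clusters, nothing to confirm
```

In the real system the "embedding" per observation would combine face, upper body, and photo metadata, per point (1) of the post.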
Machine Learning Research 🎛
- 🔎 Perceiver IO is DeepMind’s new general-purpose architecture for processing a wide variety of input modalities — like images, videos, 3D point clouds, and sounds — into output vectors. The original Perceiver (without the IO) scaled Transformers’ concept of attention to much larger input sizes, “without introducing domain-specific assumptions,” by encoding the inputs to a small fixed-size latent array and attending over that. Now, Perceiver IO (arXiv, GitHub) extends this by also applying attention to the decoding side, so that one input can produce multiple outputs and both the inputs and outputs can be a mix of modalities. “This opens the door for all sorts of applications, like understanding the meaning of a text from each of its characters, tracking the movement of all points in an image, processing the sound, images, and labels that make up a video, and even playing games, all while using a single architecture that’s simpler than the alternatives.” With OpenAI releasing DALL·E and CLIP and Stanford HAI launching the Foundation Models research center, both also this year, these large multimodal networks have become a central focus of leading AI labs.
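The trick that makes Perceiver scale is that the expensive attention happens between a small latent array and the large input, so the cost grows linearly in the input size rather than quadratically. Here's a minimal single-head NumPy sketch of that cross-attention read (random projections stand in for learned weights; all shapes and dimensions are illustrative, not DeepMind's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, inputs, d_k=16, seed=0):
    """One cross-attention read: a small latent array (M x D) queries a
    large input array (N x D). Cost is O(M*N) instead of O(N^2)."""
    rng = np.random.default_rng(seed)
    M, D = latents.shape
    # random projections stand in for learned Q/K/V parameters
    Wq = rng.normal(size=(D, d_k)) / np.sqrt(D)
    Wk = rng.normal(size=(D, d_k)) / np.sqrt(D)
    Wv = rng.normal(size=(D, d_k)) / np.sqrt(D)
    Q = latents @ Wq            # (M, d_k)
    K = inputs @ Wk             # (N, d_k)
    V = inputs @ Wv             # (N, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (M, N) attention weights
    return attn @ V             # (M, d_k): inputs compressed into latents

# 10,000 input elements compressed into 256 latents
inputs = np.random.default_rng(1).normal(size=(10_000, 32))
latents = np.random.default_rng(2).normal(size=(256, 32))
out = cross_attend(latents, inputs)
print(out.shape)  # (256, 16)
```

Perceiver IO's addition is to run the same kind of cross-attention in reverse at decoding time, with output query arrays attending to the latents.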
Artificial Intelligence for the Climate Crisis 🌍
- 🧊 IceNet is a new probabilistic, deep learning sea ice forecasting system “trained on climate simulations and observational data to forecast the next 6 months of monthly-averaged sea ice concentration maps.” It’s a U-Net model that uses 50 climate variables as input, and outputs discrete probability distributions for three different sea ice concentration classes at each grid cell. Coolest (haha) part: “IceNet runs over 2000 times faster on a laptop than SEAS5 running on a supercomputer, taking less than ten seconds on a single graphics processing unit.” Practical use cases are in planning shipping routes and in avoiding conflicts between ships and migrating walruses and whales. Pretty cool.
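To make the output format concrete: for each grid cell and each of the six forecast months, IceNet emits a discrete probability distribution over three sea ice concentration classes. A stand-in sketch of that output head (the grid size is a made-up toy value, and random logits replace the actual U-Net):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# hypothetical toy shapes: a small lat/lon grid, 50 input climate
# variables, 6 monthly forecasts, 3 sea ice concentration classes
H, W, N_VARS, N_MONTHS, N_CLASSES = 64, 64, 50, 6, 3

# stand-in for the U-Net: raw per-class scores for each cell and month
logits = np.random.default_rng(0).normal(size=(H, W, N_MONTHS, N_CLASSES))
probs = softmax(logits, axis=-1)  # discrete distribution per grid cell

print(probs.shape)  # (64, 64, 6, 3), each last axis sums to 1
```

Outputting class probabilities instead of a single concentration value is what makes the forecasts probabilistic: downstream users get calibrated uncertainty per grid cell, not just a point estimate.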
Cool Things ✨
Omnimatte masks both objects (the car) and their effects (the dust), making it possible to add the ML DRIFT logo “behind” the dust. (Lu et al. 2021)
- 💨 Omnimatte is a new matte/mask generation model by Erika Lu, who developed it in collaboration with Google AI researchers during two internships there. Unlike other state-of-the-art segmentation networks, Omnimatte creates masks for both objects and their “effects” like shadows or dust clouds in videos, enabling editors to easily add layers of content between the background and a foreground subject in a realistic way. Forrester Cole and Tali Dekel explain how the model works in detail (with lots of gifs!) in a post on the Google AI Blog.
Thanks for reading! As usual, you can let me know what you thought of today’s issue using the buttons below or by replying to this email. If you’re new here, check out the Dynamically Typed archives or subscribe below to get a new issue in your inbox every second Sunday.
If you enjoyed this issue of Dynamically Typed, why not forward it to a friend? It’s by far the best thing you can do to help me grow this newsletter. 🚣♂️