#37: OpenAI's neural network taxonomy, decoding text from brain implants, and models that don't exist
Hey everyone, welcome to Dynamically Typed #37! I’ve pushed the ML research section to the top of today’s newsletter because OpenAI’s new Distill article is one of the most exciting things I’ve read in a long time: they investigated the early layers of Google’s InceptionV1 vision network to an incredible level of detail, resulting in a first-of-its-kind taxonomy of “neuron groups.” It’s really cool stuff, so I’m covering it in depth.
Beyond that, I’ve got links to neurological work on decoding text from brain implant signals, and to Wayve’s new LIDAR data augmentation tech. For productized AI, I’m covering a startup that’s using GANs to synthesize fake models for ads, as well as links about AR acquisitions and more. Finally, for cool stuff, I found a paper that generates 2.5D-perspective images based on a single photo with depth information.
Machine Learning Research 🎛
The largest neuron groups in the mixed3a layer of InceptionV1. (Olah et al., 2020)
Chris Olah and his OpenAI collaborators published a new Distill article: An Overview of Early Vision in InceptionV1. This work is part of Distill’s Circuits thread, which aims to understand how convolutional neural networks work by investigating individual features and how they interact through the formation of logical circuits (see DT #35). In this new article, Olah et al. explore the first five layers of Google’s InceptionV1 network:
Over the course of these layers, we see the network go from raw pixels up to sophisticated boundary detection, basic shape detection (eg. curves, circles, spirals, triangles), eye detectors, and even crude detectors for very small heads. Along the way, we see a variety of interesting intermediate features, including Complex Gabor detectors (similar to some classic “complex cells” of neuroscience), black and white vs color detectors, and small circle formation from curves.
Each of these five layers contains dozens to hundreds of features (a.k.a. channels or filters) that the authors categorize into human-understandable groups, which consist of features that detect similar things for inputs with slightly different orientations, frequencies, or colors.
This goes from conv2d0, the first layer, where 85% of filters fall into two simple categories (detectors for lines and for contrasting colors, in various orientations), all the way up to mixed3b, the fifth layer, where there are over a dozen complex categories (detectors for small heads, for circles/loops, and much more).
We’ve known that there are line detectors in early network layers for a long time, but this detailed taxonomy of later-layer features is novel—and it must’ve been an enormous amount of work to create.
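If you want to poke at these features yourself, the visualizations in the Circuits thread are built with OpenAI’s open-source Lucid library. Here’s a minimal sketch of rendering a single mixed3a channel, going from my memory of the Lucid tutorial; the channel index is an arbitrary example, and note that Lucid still targets TensorFlow 1.x:

```python
# pip install lucid  (Lucid currently requires TensorFlow 1.x)
import lucid.modelzoo.vision_models as models
import lucid.optvis.render as render

model = models.InceptionV1()
model.load_graphdef()

# Optimize an input image that maximally activates one mixed3a channel;
# the channel index (104) is just an arbitrary example.
_ = render.render_vis(model, "mixed3a_pre_relu:104")
```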
A circuits-based visualization of the black & white detector neuron group in layer mixed3a of InceptionV1. (Olah et al., 2020)
For a few of the categories, like the black & white and small circle detectors in mixed3a, and the boundary and fur detectors in mixed3b, the article also investigates the “circuits” that formed them.
Such circuits show how strongly the presence of an earlier-layer feature in the input positively or negatively influences (“excites” or “inhibits”) different spatial regions of the current feature, as encoded by the learned convolution weights between the two.
One of the most interesting aspects of this research is that some of these circuits—which were learned by the network, not explicitly programmed!—are super intuitive once you think about them for a bit.
The black & white detector above, for example, consists mostly of negative weights that inhibit colorful input features: the more color features in the input, the less likely it is to be black & white.
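To make that concrete, here’s a minimal sketch of how you could read such a circuit straight off a trained network’s weights. The weight tensor below is random stand-in data and the channel indices are arbitrary; in practice you’d pull W from one of InceptionV1’s mixed3a convolutions, pick the black & white detector as the output channel, and a color-contrast detector as the input channel.

```python
import numpy as np

def circuit_summary(W, out_ch, in_ch):
    """Summarize how input channel `in_ch` influences output channel `out_ch`.

    W is a conv weight tensor of shape (out_channels, in_channels, kH, kW).
    Positive weights excite the downstream feature, negative weights inhibit it.
    """
    kernel = W[out_ch, in_ch]              # (kH, kW) spatial weight pattern
    excitation = kernel[kernel > 0].sum()  # total exciting weight
    inhibition = kernel[kernel < 0].sum()  # total inhibiting weight
    return kernel, excitation, inhibition

# Random stand-in for a trained weight tensor: 64 output x 32 input channels, 5x5 kernels.
W = np.random.randn(64, 32, 5, 5)
kernel, exc, inh = circuit_summary(W, out_ch=3, in_ch=7)
print(f"excitation {exc:+.2f}, inhibition {inh:+.2f}")
print(np.sign(kernel))  # this +/- pattern is what the circuit diagrams visualize
```

For a color-inhibited black & white detector, you’d expect the inhibition term to dominate for color-detecting input channels.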
The simplicity of many of these circuits suggests, to me at least, that Olah et al. are currently exploring one of the most promising paths in AI explainability research. (Although there is an alternate possibility, as pointed out by the authors: that they’ve found a “taxonomy that might be helpful to humans but [that] is ultimately somewhat arbitrary.”)
Anyway, An Overview of Early Vision in InceptionV1 is one of the most fascinating machine learning papers I’ve read in a long time, and I spent a solid hour zooming in on different parts of the taxonomy.
The groups for layer mixed3a are probably my favorite.
I’m also curious about how much these early-layer neuron groups generalize to other vision architectures and types of networks—to what extent, for example, do these same neuron categories show up in the first layers of binarized neural networks?
If you read the article and have more thoughts about it that I didn’t cover here, I’d love to hear them. :)
Quick ML research + resource links 🎛 (see all 59 resources)
- 🧠 Neurologists Joseph Makin et al. at UC San Francisco used a 250-electrode brain implant to decode human brain signals into text with techniques from machine translation—at a word error rate of only 3%. The implant technology won’t be widely usable anytime soon, if ever, but you can download the code it runs on anyway: jgmakin/machine_learning.
- 🚘 Shuyang Cheng et al. at self-driving car company Waymo have extended Google’s reinforcement-learned image data augmentation technique, AutoAugment, to work with LIDAR data. Elon Musk still believes LIDAR sensors are useless “appendices” for self-driving cars, but the rest of the industry, Waymo included, evidently isn’t getting any closer to agreeing with that thesis. For a rough sense of what augmenting a LIDAR point cloud even looks like, see the sketch right after these links.
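Here’s that sketch: a couple of simple point-cloud operations (random rotation around the vertical axis, random global scaling) composed into a toy policy. This is not Waymo’s actual code, just an illustration of the kind of operations and magnitudes an AutoAugment-style search would tune.

```python
import numpy as np

def rotate_z(points, max_angle_deg=45.0):
    """Randomly rotate an (N, 3) LIDAR point cloud around the vertical (z) axis."""
    angle = np.radians(np.random.uniform(-max_angle_deg, max_angle_deg))
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T

def random_scale(points, low=0.95, high=1.05):
    """Randomly scale the whole cloud, as if objects were slightly nearer or farther."""
    return points * np.random.uniform(low, high)

# A "policy" is just an ordered list of (operation, probability) pairs; an
# AutoAugment-style search tunes which operations to use and with what magnitudes.
policy = [(rotate_z, 0.5), (random_scale, 0.5)]

points = np.random.randn(1024, 3)  # toy stand-in for one LIDAR sweep
for op, p in policy:
    if np.random.rand() < p:
        points = op(points)
```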
Productized Artificial Intelligence 🔌
None of these models exist. (Rosebud AI)
Rosebud AI uses generative adversarial networks (GANs) to synthesize photos of fake people for ads. We’ve of course seen a lot of GAN face generation in the past (see DT #6, #8, #23), but this is one of the first startups I’ve come across that’s building a product around it. Their pitch to advertisers is simple: take photos from your previous photoshoots, and we’ll automatically swap out the model’s face with one better suited to the demographic you’re targeting. The new face can either be GAN-generated or licensed from real models on the generative.photos platform. But either way, Rosebud AI’s software takes care of inserting the face in a natural-looking way.
This raises some obvious questions: is it OK to advertise using nonexistent people? Do you need models’ explicit consent to reuse their body with a new face? How does copyright work when your model is half real, half generated? I’m sure Rosebud AI’s founders spend a lot of time thinking about these questions; as they do, you can follow along with their thoughts on Twitter and Instagram.
Quick productized AI links 🔌
- 📓 I recently came across Martin Zinkevich’s 24-page Rules of Machine Learning (PDF), “intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google.” Lots of good stuff in here.
- 💸 Acquisitions in the augmented reality space are heating up: in the past two weeks alone, Ikea bought Geomagical Labs (which works on placing virtual furniture in rooms; an obvious fit) and Pokémon Go developer Niantic acquired 6D.ai (which works on indoor mapping; another obvious fit).
- 📸 Cool new paper from Liu et al. (2020): Learning to See Through Obstructions proposes a learning-based approach that can remove things like chain fences and window reflections from photos (see the paper PDF for examples). This isn’t yet productized, but one of the collaborators works at Google—so: how long until this shows up in the camera app for Pixel phones?
Cool Things ✨
Layered depth inpainting. (Shih et al., 2020)
Here’s another cool AI art piece that the static screenshot above can’t do justice to: Shih et al. (2020) published 3D Photography using Context-aware Layered Depth Inpainting at this year’s CVPR conference. Here’s what that means:
We propose a method for converting a single RGB-D input image into a 3D photo, i.e., a multi-layer representation for novel view synthesis that contains hallucinated color and depth structures in regions occluded in the original view.
Based on a single image (plus depth information), they can generate a 2.5-dimensional representation, realistically re-rendering the scene from perspectives slightly different from the one it was originally captured from. Contrast that with recent work on neural radiance fields, which requires on the order of 20 to 50 images to work (see DT #36).
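To make the 2.5D idea concrete, here’s a naive sketch of re-projecting an RGB-D image to a slightly shifted camera: back-project each pixel using its depth and the camera intrinsics, translate the camera, and project again. This is not the authors’ method; the holes this naive warp leaves behind are exactly the occluded regions that Shih et al.’s layered inpainting hallucinates.

```python
import numpy as np

def reproject(rgb, depth, K, t):
    """Naively re-render an RGB-D image from a camera translated by t (3-vector).

    rgb: (H, W, 3) image, depth: (H, W) in meters, K: 3x3 camera intrinsics.
    Returns the new view with holes (zeros) where no source pixel lands.
    No z-buffering: overlapping pixels simply overwrite each other.
    """
    H, W, _ = rgb.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T          # back-project to normalized camera rays
    pts = rays * depth.reshape(-1, 1)        # 3D points in the original camera frame
    pts_new = pts - t                        # express them in the shifted camera frame
    proj = pts_new @ K.T
    uv = (proj[:, :2] / proj[:, 2:3]).round().astype(int)  # project into the new view
    out = np.zeros_like(rgb)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H) & (proj[:, 2] > 0)
    out[uv[valid, 1], uv[valid, 0]] = rgb.reshape(-1, 3)[valid]
    return out

# Toy example: a random image with constant depth, camera shifted 5 cm to the right.
K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
rgb = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)
depth = np.full((240, 320), 2.0)
novel_view = reproject(rgb, depth, K, np.array([0.05, 0.0, 0.0]))
```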
Shih et al. set up a website with some fancy demos, which is definitely worth a look; see these gifs on Twitter too. One of the authors also works at Facebook, so I wonder if we’ll one day see Instagram filters with this effect—or if it’ll be a part of Facebook’s virtual reality ambitions. Since the next generation of iPhones will likely have a depth sensor on the back too, I expect we’ll see a lot of this 2.5D photography stuff in the coming years.
Thanks for reading! As usual, you can let me know what you thought of today’s issue using the buttons below or by replying to this email. If you’re new here, check out the Dynamically Typed archives or subscribe below to get a new issue in your inbox every second Sunday.
If you enjoyed this issue of Dynamically Typed, why not forward it to a friend? It’s by far the best thing you can do to help me grow this newsletter. 🌞