#55: DeepMind's structural biology breakthrough with AlphaFold 2

December 20, 2020

Hey everyone, welcome to Dynamically Typed #55. In today’s issue I’m mostly focusing on DeepMind’s new AlphaFold 2 (AF2) model, which I waited to dive into until last month’s hype settled down a bit; that story is right below this, in the ML research section. It took quite a while to understand and boil down everything around AF2, so I’ve only got a few quick links in the other sections.

This is also the last DT of 2020 — happy holidays everyone! 🎄

Machine Learning Research 🎛

AlphaFold’s predictions vs. the experimentally-determined shapes of two CASP14 proteins.

DeepMind’s AlphaFold 2 is a major protein folding breakthrough. Protein folding is a problem in structural biology where, given the one-dimensional amino acid sequence of a protein, a computational model has to predict what three-dimensional structure the protein “folds” itself into. This structure is much more difficult to determine experimentally than the sequence, but it’s essential for understanding how the protein interacts with other machinery inside cells. In turn, this can give insights into the inner workings of diseases (“including cancer, dementia and even infectious diseases such as COVID-19”) and how to fight them.

Biennially since 1994, the Critical Assessment of Techniques for Protein Structure Prediction (CASP) has determined the state of the art in computational models for protein folding using a blind test. Research groups are presented (only) with the amino acid sequences of about 100 proteins whose shapes have recently been experimentally determined. They blindly predict these shapes using their computational models and submit them to CASP to be evaluated with a Global Distance Test (GDT) score, which roughly measures what share of the protein’s residues end up within a small distance of where they’re supposed to be. GDT scores range from 0 to 100, and a model that scores at least 90 across different proteins would be considered good enough to be useful to science (“competitive with results obtained from experimental methods”).
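To make the GDT score a bit more concrete, here is a minimal Python sketch of its commonly used GDT_TS variant: the average, over cutoffs of 1, 2, 4, and 8 ångströms, of the percentage of residues whose predicted position lies within that cutoff of the experimental one. This is a simplification, not CASP’s actual evaluation code: it assumes one point per residue and that the two structures have already been superimposed, whereas the real test searches over many superpositions to find the best one. The function name `gdt_ts` and the toy coordinates are made up for illustration.

```python
import math

# Distance cutoffs (in angstroms) averaged by the GDT_TS variant of the score.
CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(predicted, reference):
    """Simplified GDT_TS for two already-superimposed lists of (x, y, z) points.

    Assumption: one point per residue, and no search over superpositions
    (the real test optimizes the alignment; this sketch does not).
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty coordinate lists of equal length")

    # Per-residue distance between the predicted and experimental position.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]

    # Fraction of residues within each cutoff, then averaged across cutoffs.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in CUTOFFS
    ]
    return 100.0 * sum(fractions) / len(CUTOFFS)


# Hypothetical four-residue example: the predicted points are off by
# 0.5, 1.5, 3.0, and 9.0 angstroms respectively.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 9.0, 0.0)]

print(gdt_ts(predicted, reference))  # 56.25
```

In this toy case, one residue is within 1 Å, two are within 2 Å, and three are within both 4 Å and 8 Å, so the score is the average of 25, 50, 75, and 75. A perfect prediction scores 100.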

Before CASP13 in 2018, no model had ever scored significantly above 40 GDT. That year, the first version of AlphaFold came in at nearly 60 GDT, which was already “stunning” at the time (see DT #21). At CASP14 this year, AlphaFold 2 blew its previous results out of the water and achieved a median score of 92.4 GDT across all targets. This was high enough for CASP to declare the problem “solved” in their press release and to start talking about new challenges, like determining the shape of multi-protein complexes.

I’ve waited a bit to write about AlphaFold 2 until the hype died down because, oh boy, was there a lot of hype. DeepMind released a slick video about the team’s process, their results were covered with glowing features in Nature and The New York Times, and high praise came even from the leaders of DeepMind’s biggest competitors, including OpenAI’s Ilya Sutskever and Stanford HAI’s Fei-Fei Li. It was a pretty exciting few days on ML Twitter.

Columbia University’s Mohammed AlQuraishi, who has been working on protein folding for over a decade, was one of the first people to break the CASP14 news. His blog post about CASP13 and AlphaFold 1 was also widely circulated back in 2018, so a lot of people in the field were interested in what he’d have to say this year. Last week, after the hype died down a bit, AlQuraishi published his perspective on AlphaFold 2. He summarized it by saying that “it feels like one’s child has left home”: AF2 got results he did not expect to see until the end of this decade, even when taking AF1 into account. That is bittersweet for someone whose lab has also been working on this same problem for a long time.

AlQuraishi is overall extremely positive about DeepMind’s results here, but he does express disappointment at their “falling short of the standards of academic communication” — the lab has so far been much more secretive about AF2 than it was about AF1 (which is open-source). AlQuraishi’s post is very long and technical, but if you want to know exactly how impressive AlphaFold 2 is, learn the basics of how it works, read about its potential applications in broader biology, or see some of the hot takes against it debunked, the post is definitely worth the ~75 minutes of your time. (I always find it energizing to see someone excitedly explain a big advancement in their field that they did not directly work on; here’s the link again.)

I personally also can’t wait to see the first practical applications of AlphaFold, which I expect we’ll start to see DeepMind talk about in the coming years. (Hopefully!) For one, they’ve already released AlphaFold’s predictions for some proteins associated with COVID-19.

Quick ML research + resource links 🎛

📰 There are some new developments in Google’s ongoing AI ethics crisis (see DT #54). CEO Sundar Pichai issued a company-wide memo apologizing for the fact that “Dr. [Timnit] Gebru’s departure … seeded doubts and led some in our community to question their place at Google.” This doesn’t address the central issue, and it did not land well with the community; see the tweets from Gebru and Jack Clark, as well as Khari Johnson’s interview with Gebru and NPR’s coverage of the story. In response to the memo, a group of Google AI researchers sent the executives a list of demands asking for leadership and policy changes. And meanwhile, someone made a fake Twitter account (complete with a GAN-generated profile picture) that opposed Gebru’s side of the story by pretending to be an ex-researcher from the Google ethics team. I don’t think this’ll be resolved anytime soon.
✳️ Chris Olah et al. have a cool new Distill article in the Circuits thread: Naturally Occurring Equivariance in Neural Networks. “We sometimes think of understanding neural nets as being like reverse engineering a regular computer program. In this analogy, equivariance is like finding the same inlined function repeated throughout the code.”
⚡️ The SE4ML lab’s updated Engineering best practices for Machine Learning includes tips for managing data, code, training, deployment, teams, and high-level governance.

I’ve also collected all 75+ ML research tools previously featured in Dynamically Typed on a Notion page for quick reference. ⚡️

Artificial Intelligence for the Climate Crisis 🌍

🌍 Climate Change AI’s NeurIPS 2020 Workshop on Tackling Climate Change with Machine Learning happened last week, featuring nearly 100 papers and proposals. I haven’t had the time to properly read through everything yet, but it looks like there’s a lot of great work to dive into over the holidays!

Cool Things ✨

🎨 Also from NeurIPS 2020: the gallery for this year’s workshop on Machine Learning for Creativity and Design is live! I really enjoyed looking through all the new work in the online gallery, which is available at aiartonline.com.
🎾 Technology-driven design studio Bakken & Bæck wrote a blog post about a recent computer vision project they did that lets tennis players compare their own swing with the pros, to “swing like Serena.” The article explains how their model works and includes some nice visuals.

Thanks for reading! As usual, you can let me know what you thought of today’s issue using the buttons below or by replying to this email. If you’re new here, check out the Dynamically Typed archives or subscribe below to get new issues in your inbox every second Sunday.

If you enjoyed this issue of Dynamically Typed, why not forward it to a friend? It’s by far the best thing you can do to help me grow this newsletter. 🎅
