The climate opportunity of gargantuan AI models
Climate change and the energy transition
Climate change is our generation’s biggest challenge, and the transitions needed to reduce emissions and prevent it from becoming catastrophic will affect almost every part of society in the coming decades. On their excellent Our World in Data page on CO2 and Greenhouse Gas Emissions, Hannah Ritchie and Max Roser write:
To make progress in reducing greenhouse gas emissions, there are two fundamental areas we need to focus on: energy (this encapsulates electricity, heat, transport, and industrial activities) and food and agriculture (which includes agriculture and land use change, since agriculture dominates global land use).
The biggest of these is energy: it’s responsible for almost three quarters of global greenhouse gas emissions. Decarbonizing energy involves two parallel transitions: (1) electrifying sectors powered by fossil fuels, and (2) shifting our electricity generation to low-emissions sources like solar, wind, hydro and nuclear.
Take road transport for example: cars need to be powered by electricity, and that electricity needs to be green. Replacing all internal combustion engine cars with electric ones will take at least two decades, as will replacing all gas- and coal-fired electricity plants with low-carbon power generation — so if we want to be climate-neutral by 2050, neither transition can wait for the other. Road transport is responsible for about 12% of overall emissions, but the same dual transition applies for other energy-intensive sectors like iron and steel production (7%) or lighting and heating in buildings (17.5%).
That’s the energy transition in a nutshell: we need to move energy demand towards electricity, and the electricity supply toward low-carbon sources.
But there’s one less-discussed thing linking these two: how do we move this low-carbon electricity from the supply to the demand? That’s where electrical grids — the focus of this post — come in.
A quick primer on electrical grids
Let’s start with a short, hopefully not too technical, primer on electrical grids — they play a huge role in all our lives, but I personally didn’t really know how they worked until I started working at a renewables optimization software company in January. On the most basic level, grids are very large systems — all of Europe is a single grid, and North America is divided into an eastern and a western grid (plus the smaller Texas and Quebec grids) — consisting of power lines at different voltages (high for long-distance transmission, low for local distribution), electrical substations which step voltage up or down, and electricity producers and consumers.
As opposed to the direct current (DC) in, for example, a battery-LED system — where electrons flow from one pole of the battery through the LED to the other pole — electricity in grids is in the form of alternating current (AC) — where electrons oscillate back and forth on the power line tens of times a second: at 50 Hz (times a second) in Europe and 60 Hz in North America. One of the main jobs of a grid operator is to ensure that this frequency remains constant, because lots of stuff breaks if it is too far from nominal, which can cause grid-wide blackouts in the worst case. An oversupply of electricity (more generation than consumption) causes the frequency to increase, while an undersupply causes it to decrease; so the operator has to make sure that generation and consumption are equal at all times.
Electricity markets and renewable generation
Grid operators keep electricity generation and consumption equal by creating time-based markets for electricity. For this discussion, the day-ahead and intraday markets are the most important. On the day-ahead market, producers and consumers place bids to sell and buy the amount of electricity they want to produce or consume during each hour of the following day. At the end of the day, the operator settles these bids in an optimal way that ensures that, for each hour, the amount sold matches the amount bought. Problem solved, right?
Sadly, as anyone who has ever been outside knows, the weather (and other factors affecting generation and consumption) can’t be perfectly predicted down to the hour a whole day ahead. It could happen, for example, that the afternoon is less sunny than expected, which means that a solar farm will produce less electricity than it sold the previous day. This is where the intraday market — operating on 15-minute intervals instead of hour-long ones — comes in. In this scenario of an expected underproduction, the solar farm can go to the intraday market and place bids to buy the difference between the amount of electricity it sold on the day-ahead market and the amount it’ll actually produce, from someone who is willing to either consume less power than they bought or produce more power than they already sold.
In practice, it’s usually the latter: someone will jump in and produce the extra electricity. This is big business for coal and gas plants, because they can ramp their production up (or down if the scenario is reversed) on-demand, and very quickly. As a larger percentage of electricity on the grid is generated using weather-dependent renewables, this intraday market becomes more valuable — and coal and gas-burning plants can be operated profitably for longer, even as learning effects make wind and solar power cheaper and CO2 emission prices rise.
Beyond fossil fuel-burning power plants ramping their generation up and down to meet consumption, another obvious supplier of flexibility is large batteries. These can be paid to charge when there is an oversupply, and paid again to discharge when there is an undersupply. Another plausible demand-side response comes from climate-controlled (food) distribution centers that need to run their cooling units a number of hours a day, but can be a bit flexible about exactly when those hours are. These are both useful, but they’re not happening at scale (yet).
So it’d be great for the planet if these coal and gas plants had some more competition on the intraday electricity balancing market.
(Any imbalance that is not solved on the day-ahead and intraday markets is handled by the grid operator’s balancing reserves; I won’t go into the details of these FCRs and FRRs here.)
Datacenters and flexible AI training for demand-side response
This is — finally — where datacenters and AI models come in. Here in The Netherlands, there has been some controversy in recent months about how many datacenters are being built (I bike by this imposing-looking one in Amsterdam several times a week) and how much energy they use. But given the above, I actually think datacenters have the potential to play a positive role in the intraday electricity market. Although many tasks of a datacenter, like serving websites, facilitating video calls, or powering Netflix streams, can’t really be shifted around in time at will, AI-related tasks often can be — both in research and production.
In a research setting, gargantuan AI models like DeepMind’s AlphaFold 2 can often take several days or weeks to train on dozens, hundreds or thousands of powerful machines. And labs like OpenAI already use highly-customized versions of tools like Kubernetes to orchestrate these machines. It’s not a stretch to imagine that these tools can be extended to ramp training up or down (in terms of the number of active machines, for example), along with the intraday electricity market. (In fact, I tried building a little tool similar to this myself last year!)
In production settings, machine learning models are often retrained periodically, once for a whole service or even many times for individual (groups of) users. This doesn’t happen exactly when the user queries or interacts with the model, but rather in an “offline” way: training happens on some schedule, and the model is saved to be retrieved for inference whenever the user wants to query it — so there’s potential for flexibility there. Even inference can happen offline: things like tagging photo libraries with the objects present in the photos are not too time-sensitive, and can probably happen flexibly within some period after the photos are uploaded without impacting user experience too much. It’s also not too crazy to imagine syncing this up to the electricity market.
Luckily, I’m not the first person to come up with this idea — see the Boden Tech datacenter in Sweden and Google’s partnership with Electricity Map, for example — but I do think that it’s under-appreciated, and often missed in discussions about Green AI and the climate risks of large AI models. Since these big models can often be scheduled to be trained at any time, perhaps counterintuitively, the more power they use, the more flexibility they can offer to the grid — and the more they can out-compete fossil fuel plants on the intraday electricity market!
I think we have a better shot at getting big tech companies and AI labs to implement ideas like this at scale, than we do at getting them to stop training big AI models. So instead of looking at gargantuan AI (language) models only as a climate problem, let’s give some more attention to their potential as a climate solution.
Same.Energy, a visual search engine
Same.Energy visual search
same.energy is a deep learning-powered visual search engine built by Jason Jackson (the creator of TabNine, see DT #18), who says it works similarly to OpenAI’s CLIP. It’s a really nice website to just browse around for a while and get lost in, which is also exactly what it’s meant for. Jackson, on the site’s about page, says:
We believe that image search should be visual, using only a minimum of words. And we believe it should integrate a rich visual understanding, capturing the artistic style and overall mood of an image, not just the objects in it. We hope Same Energy will help you discover new styles, and perhaps use them as inspiration.“
This may remind you of Google’s similar image search, but I found that same.energy does a much better job of finding diverse results: where Google will usually show me different crops and resolutions of the same photo or different photos of the same object, same.energy manages to capture the “feel” of an image and show more (different!) images with that same “feel.” Some of my favorite searches so far are __blue geometric patterns, glass art, and rusty surfaces.
The AI Incident Database
The Discover app of the AI Incident Database
The Partnership on AI to Benefit People and Society (PAI) is an international coalition of organizations with the mision “to shape best practices, research, and public dialogue about AI’s benefits for people and society.” Its 100+ member organizations cover a broad range of interests and include leading AI research labs (DeepMind, OpenAI); several universities (MIT, Cornell); most big tech companies (Google, Apple, Facebook, Amazon, Microsoft); news media (NYT, BBC); and humanitarian organizations (Unicef, ACLU).
PAI recently launched a new project: the AI Incident Database . The AIID mimics the FAA’s airplane accidents database and is similarly meant to help “future researchers and developers avoid repeated bad outcomes.” It’s launching with a set of 93 incidents, including an autonomous car that killed a pedestrian, a trading algorithm that caused a flash crash, and a facial recognition system that caused an innocent person to be arrested (see DT #43). For each incident, the database includes a set of news articles that reported about it: there are over 1,000 reports in the AIID so far. It’s also open source on GitHub, at PartnershipOnAI/aiid.
Systems like this (and Amsterdam’s AI registry, for example) are a clear sign that productized AI is a quickly starting to mature as a field, and that lots of good work is being done to manage its impact. Most importantly, I hope these projects will help us have more sensible discussions about regulating AI. Benedict Evans’ essays Notes on AI Bias and Face recognition and AI ethics are excellent reads on this; he compares calls to “regulate AI” to wanting to regulate databases — it’s not the right level of abstraction, and we should be thinking about specific policies to address specific problems instead. A dataset of categorized AI incidents, managed by a broad coalition of organizations, sounds like a great step in this direction.
DALL·E and CLIP: OpenAI's Multimodal Neural Networks
Two example prompts and resulting generated images from DALL·E
OpenAI’s new “multimodal” DALL·E and CLIP models combine text and images, and also mark the first time that the lab has presented two separate big pieces of work in conjunction. In a short blog post, which I’ll quote almost in full throughout this story because it also neatly introduces both networks, OpenAI’s chief scientist Ilya Sutskever explains why:
A long-term objective of artificial intelligence is to build “multimodal” neural networks—AI systems that learn about concepts in several modalities, primarily the textual and visual domains, in order to better understand the world. In our latest research announcements, we present two neural networks that bring us closer to this goal.
These two neural networks are DALL·E and CLIP. We’ll take a look at them one by one, starting with DALL·E.
The name DALL·E is a nod to Salvador Dalí, the surrealist artist known for that painting of melting clocks, and to WALL·E, the Pixar science-fiction romance about a waste-cleaning robot. It’s a bit silly to name an energy-hungry image generation AI after a movie in which lazy humans have fled a polluted Earth to float around in space and do nothing but consume content and food, but given how well the portmanteau works and how cute the WALL·E robots are, I probably would’ve done the same. Anyway, beyond what’s in a name, here’s Sutskever’s introduction of what DALL·E actually does:
The first neural network, DALL·E, can successfully turn text into an appropriate image for a wide range of concepts expressible in natural language. DALL·E uses the same approach used for GPT-3, in this case applied to text–image pairs represented as sequences of “tokens” from a certain alphabet.
DALL·E builds on two previous OpenAI models, combining GPT-3’s capability to perform different language tasks without finetuning with Image GPT’s capability to generate coherent image completions and samples. As input it takes a single stream — first text tokens for the prompt sentence, then image tokens for the image — of up to 1280 tokens, and learns to predict the next token given the previous ones. Text tokens take the form of byte-pair encodings of letters, and image tokens are patches from a 32 x 32 grid in the form of latent codes found using a variational autoencoder similar to VGVAE. This relatively simple architecture, combined with a large, carefully designed dataset, gives DALL·E the following laundry list of capabilities, each of which have interactive examples in OpenAI’s blog post:
- Controlling attributes
- Drawing multiple objects
- Visualizing perspective and three-dimensionality
- Visualizing internal and external structure (like asking for a macro or x-ray view!)
- Inferring contextual details
- Combining unrelated concepts
- Zero-shot visual reasoning
- Geographic and temporal knowledge
A lot of people from the community have written about DALL·E or played around with its interactive examples. Some of my favorites include:
- DeepMind researcher Felix’s Hill’s NonCompositional, a blog post on why DALL·E is good at composition without being very systematic (it can draw a hedgehog-shaped lettuce but not a green cube on a red cube on a blue cube )
- Károly Zsolnai-Fehér’s Two Minute Papers video on DALL·E, OpenAI’s previous work that led to it, and lots of generation examples
- Fun generations from Twitter and beyond: Janelle Shane’s dawnings of DALL-E; Oriol Vinyals’ soap dispenser in the shape of a glacier; and Karol Hausman’s snail made of a corkscrew.
I think DALL·E is the more interesting of the two models, but let’s also take a quick look at CLIP.
CLIP’s performance on different image classification benchmarks.
CLIP has the ability to reliably perform a staggering set of visual recognition tasks. Given a set of categories expressed in language, CLIP can instantly classify an image as belonging to one of these categories in a “zero-shot” way, without the need to fine-tune on data specific to these categories, as is required with standard neural networks. Measured against the industry benchmark ImageNet, CLIP outscores the well-known ResNet-50 system and far surpasses ResNet in recognizing unusual images.
Instead of training on a specific benchmark like ImageNet or ObjectNet, CLIP pretrains on a large dataset of text and images scraped from the internet (so without specific human labels for each images). It performs a proxy training task: “given an image, predict which out of a set of 32,768 randomly sampled text snippets, was actually paired with it in our dataset.” To then do actual classification on a benchmark dataset, the labels are transformed to be more descriptive (e.g. a “cat” label becomes “a photo of a cat”), and CLIP calculates for each label how likely it is to be paired with the image. It predicts the most likely one to be the label. As you can see from the image above, this approach is highly effective across datasets. It’s also very efficient because, being a zero-shot model, CLIP doesn’t need to be (re)trained or finetuned for different datasets.
My favorite application so far of CLIP is by Travis Hoppe, who used it to visualize poems using Unsplash photos — worth a click! Another interesting one is actually how it’s used in combination with DALL·E: after DALL·E generates 512 plausible images for a prompt, CLIP ranks their quality, and only the 32 best ones are returned in the interactive viewer. Instead of researchers cherry-picking the best results to show in a paper, a different neural net can actually perform this task!
AlphaFold 2: DeepMind's structural biology breakthrough
AlphaFold’s predictions vs. the experimentally-determined shapes of two CASP14 proteins.
DeepMind’s AlphaFold 2 is a major protein folding breakthrough. Protein folding is a problem in structural biology where, given the one-dimensional RNA sequence of a protein, a computational model has to predict what three-dimensional structure the protein “folds” itself into. This structure is much more difficult to determine experimentally than the RNA sequence, but it’s essential for understanding how the protein interacts with other machinery inside cells. In turn, this can give insights into the inner workings of diseases — “including cancer, dementia and even infectious diseases such as COVID-19” — and how to fight them.
Biennially since 1994, the Critical Assessment of Techniques for Protein Structure Prediction (CASP) has determined the state of the art in computational models for protein folding using a blind test. Research groups are presented (only) with the RNA sequences of about 100 proteins whose shapes have recently been experimentally determined. They blindly predict these shapes using their computational models and submit them to CASP to be evaluated with a Global Distance Test (GDT) score, which roughly corresponds to how far each bit of the protein is from where it’s supposed to be. GDT scores range from 0 to 100, and a model that scores at least 90 across different proteins would be considered good enough to be useful to science (“competitive with results obtained from experimental methods”).
Before CASP13 in 2018, no model had ever scored significantly above 40 GDT. That year, the first version of AlphaFold came in at nearly 60 GDT — already “stunning” at the time (see DT #21). At CASP14 this year, AlphaFold 2 blew its previous results out of the water and achieved a median score of 92.4 GDT across all targets. This was high enough for CASP to declare the problem as “solved” in their press release and to start talking about new challenges for determining the shape of multi-protein complexes.
I’ve waited a bit to write about AlphaFold 2 until the hype died down because, oh boy, was there a lot of hype. DeepMind released a slick video about the team’s process, their results were covered with glowing features in Nature and The New York Times, and high praise came even from the leaders of DeepMind’s biggest competitors, including OpenAI’s Ilya Sutskever and Stanford HAI’s Fei-Fei Li. It was a pretty exciting few days on ML twitter.
Columbia University’s Mohammed AlQuraishi, who has been working on protein folding for over a decade, was one of the first people to break the CASP14 news. His blog post about CASP13 and AlphaFold 1 was also widely circulated back in 2018, so a lot of people in the field were interested in what he’d have to say this year. Last week, after the hype died down a bit, AlQuraishi published his perspective on AlphaFold 2. He summarized it by saying “it feels like one’s child has left home:” AF2 got results he did not expect to see until the end of this decade, even when takin into account AF1 — bittersweet for someone whose lab has also been working on this same problem for a long time.
AlQuraishi is overall extremely positive about DeepMind’s results here, but he does express disappointment at their “falling short of the standards of academic communication” — the lab has so far been much more secretive about AF2 than it was about AF1 (which is open-source). AlQuraishi’s post is very long and technical, but if you want to know exactly how impressive AlphaFold 2 is, learn the basics of how it works, read about its potential applications in broader biology, or see some of the hot takes against it debunked, the post is definitely worth the ~75 minutes of your time. (I always find it energizing to see someone excitedly explain a big advancement in their field that they did not directly work on; here’s the link again.)
I personally also can’t wait to see the first practical applications of AlphaFold, which I expect we’ll start to see DeepMind talk about in the coming years. (Hopefully!) For one, they’ve already released AlphaFold’s predictions for some proteins associated with COVID-19.