#38: Gender bias reductions in Google Translate, Facebook's bot simulation, and ML-based detection of battery degradation
Hey everyone, welcome to Dynamically Typed #38! Since a lot of you are new here, I figured I’d spend a few sentences explaining how DT is set up. I send out this newsletter every second Sunday, covering stories across four categories:
- 🔌 Productized Artificial Intelligence: new services or features in existing products made possible by AI (more the deep learning type, less the linear regression type).
- 🎛 Machine Learning Research: papers and blog posts from the ML research community that caught my eye.
- 🌍 Artificial Intelligence for the Climate Crisis: how people are harnessing AI to tackle climate change.
- ✨ Cool Things: art projects, visuals, and interactives that were enabled using (something loosely related to) ML.
For each of these categories, I cover one story in depth and include quick links to a few more. Depending on how much happened that fortnight, some categories may be larger than the rest, while others may not make the cut at all. My favorite newsletters help me discover one or two pieces of interesting content I otherwise wouldn’t have found, and my aim is to provide that same service to you with Dynamically Typed.
With that out of the way, let’s dive into today’s newsletter.
Productized Artificial Intelligence 🔌
Gender-specific translations from Persian, Finnish, and Hungarian in the new Google Translate.
Google is continuing to reduce gender bias in its Translate service. Previously, it might translate “o bir doktor” in Turkish, a language that does not use gendered pronouns, to “he is a doctor”—assuming doctors are always men—and “o bir hemşire” to “she is a nurse”—assuming that nurses are always women. This is a very common example of ML bias, to the point that it’s covered in introductory machine translation courses like the one I took in Edinburgh last year. That doesn’t mean it’s easy to solve, though.
Back in December 2018, Google took a first step toward reducing these biases by providing gender-specific translations in Translate for Turkish-to-English phrase translations, like the example above, and for single word translations from English to French, Italian, Portuguese, and Spanish (DT #3). But as they worked to expand this into more languages, they ran into scalability issues: only 40% of eligible queries were actually showing gender-specific translations.
Google’s original and new approaches to gender-specific translations.
They’ve now overhauled the system: instead of attempting to detect whether a query is gender-neutral and then generating two gender-specific translations, it now generates a default translation and, if this translation is indeed gendered, also rewrites it to an opposite-gendered alternative. This rewriter uses a custom dataset to “reliably produce the requested masculine or feminine rewrites 99% of the time.” As before, the UI shows both alternatives to the user.
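To make that pipeline concrete, here's a toy Python sketch of the translate-then-rewrite flow (my own illustration, not Google's code; the three helper functions are hypothetical stand-ins for their models, reduced to a single hard-coded example):

```python
# Toy sketch of the new translate-then-rewrite flow. All three helpers are
# hypothetical stand-ins for Google's actual models.
GENDERED_PAIRS = {"he": "she", "she": "he"}

def translate(query: str) -> str:
    # Stand-in for the default translation model.
    return {"o bir doktor": "he is a doctor"}.get(query, query)

def is_gendered(translation: str) -> bool:
    # Stand-in for the check that the default translation made a gender
    # choice that was unspecified in the source.
    return any(word in GENDERED_PAIRS for word in translation.split())

def rewrite_opposite_gender(translation: str) -> str:
    # Stand-in for the rewriter model trained on the custom dataset.
    return " ".join(GENDERED_PAIRS.get(w, w) for w in translation.split())

def gender_specific_translations(query: str) -> list[str]:
    default = translate(query)
    if not is_gendered(default):
        return [default]  # neutral output: show a single translation
    return [default, rewrite_opposite_gender(default)]  # show both in the UI

print(gender_specific_translations("o bir doktor"))
# ['he is a doctor', 'she is a doctor']
```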
Another interesting aspect of this update is how they evaluate the overall system:
We also devised a new method of evaluation, named bias reduction, which measures the relative reduction of bias between the new translation system and the existing system. Here “bias” is defined as making a gender choice in the translation that is unspecified in the source. For example, if the current system is biased 90% of the time and the new system is biased 45% of the time, this results in a 50% relative bias reduction. Using this metric, the new approach results in a bias reduction of ≥90% for translations from Hungarian, Finnish and Persian-to-English. The bias reduction of the existing Turkish-to-English system improved from 60% to 95% with the new approach. Our system triggers gender-specific translations with an average precision of 97% (i.e., when we decide to show gender-specific translations we’re right 97% of the time).
The standard academic metrics (recall and average precision) did not answer the most important question about the two different approaches, so the developers came up with a new metric specifically to evaluate relative bias reduction. Beyond machine translation, this is a nice takeaway for productized AI in general: building the infrastructure and metrics to measure how your ML system behaves in its production environment is at least as important as designing the model itself.
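As a quick illustration, the metric from the quote boils down to a simple ratio (my formulation, inferred from their worked 90%-to-45% example):

```python
def relative_bias_reduction(old: float, new: float) -> float:
    # Fraction by which the new system reduces bias relative to the old one,
    # where each argument is the fraction of queries on which that system
    # makes a gender choice that is unspecified in the source.
    return (old - new) / old

print(relative_bias_reduction(0.90, 0.45))  # 0.5, i.e. a 50% relative reduction
```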
In the December 2018 post announcing gender-specific translations, the authors mention that one next step is also addressing non-binary gender in translations; this update does not mention that, but I hope it’s still on the roadmap. Either way, it’s commendable that Google has continued pushing on this even after the story has been out of the media for a while now.
Quick productized AI links 🔌
- 🤖 What’s the best way to mitigate the damage malicious bots can do on your social media platform? Facebook’s answer: building your own set of reinforcement-learning-based bots and setting them loose on a simulated version of your network. The company is deploying these Web-Enabled Simulations (WESs) to catch bad actors, search for bad content, and figure out how real-world bots could scrape data off the platform and break privacy rules.
- 📏 Michael Schoenberg and Adarsh Kowdle wrote a deep dive on uDepth, the set of neural networks on Google’s Pixel 4 phones that enable some cool computational photography features and a real-time depth sensing API at 30 Hz. Fun bonus: their architecture diagram also highlights whether each step runs on the phone’s CPU, GPU, or Neural Core.
Machine Learning Research 🎛
That’s a lot of authors and institutions.
A large coalition of big-name ML researchers and institutions published Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. The 80-page report recognizes “that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development” and presents a set of recommendations for providing evidence of the “safety, security, fairness, and privacy protection of AI systems.” Specifically, they outline two types of mechanisms:
- Mechanisms for AI developers to substantiate claims about their AI systems—going beyond just saying a system is “privacy-preserving” in the abstract, for example.
- Mechanisms that users, policy makers, and regulators can use to increase the specificity and diversity of demands they make to AI developers—again, going beyond abstract, unenforceable requirements.
The two-page executive summary and single-page list of recommendations (categorized across institutions, software, and hardware) are certainly worth a read for anyone involved in AI development in any capacity, from researchers to regulators:
Recommendations from the report by Brundage et al. (2020).
On the software side, I found the audit trail recommendation especially interesting. The authors state the problem as such:
AI systems lack traceable logs of steps taken in problem-definition, design, development, and operation, leading to a lack of accountability for subsequent claims about those systems’ properties and impacts.
Solving this will have to go far beyond just saving a git commit history that traces the development of a model. For the data collection, testing, deployment, and operational aspects, there are no reporting or verification standards in widespread use yet.
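To give a flavor of what such a trail could capture, here's a hypothetical sketch of a single audit-trail entry; the field names and values are my own illustration, not something the report prescribes:

```python
# Hypothetical audit-trail entry for one decision in an ML system's
# lifecycle. Field names are illustrative, not from the report.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    stage: str        # e.g. "problem-definition", "data-collection", "deployment"
    actor: str        # who made the decision
    description: str  # what was decided and why
    artifacts: list   # hashes/URIs of datasets, code commits, model weights
    timestamp: str

entry = AuditEntry(
    stage="data-collection",
    actor="data-team@example.com",
    description="Excluded records from before 2015 due to a schema change.",
    artifacts=["sha256:3f2a...", "git:a1b2c3d"],
    timestamp=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(entry), indent=2))
```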
I don’t think these standards can be sensibly defined for “AI” at large, so they’ll have to be implemented on an industry-by-industry basis. There are a whole different set of things to think about for self-driving cars than for social media auto-moderation, for example.
(Also see OpenAI’s short write-up of the report.)
Quick ML research + resource links 🎛 (see all 61 resources)
- 💻 Cool progress on automated chip design from Anna Goldie and Azalia Mirhoseini at Google Brain: “[whereas] existing baselines require human experts in the loop and take several weeks to generate, our method can generate placements in under six hours that outperform or match their manually designed counterparts.” Check out their Google AI blog post on the research: Chip Design with Deep Reinforcement Learning.
- 📓 Andrej Karpathy’s weekend project: a 150-line Python implementation of an autograd engine and a PyTorch-like neural net library on top of it, all in Jupyter notebooks on GitHub: karpathy/micrograd. (For the core idea in miniature, see the sketch after these links.)
- ⚡️ TensorFlow Profiler provides a set of tools that you can use to measure the training performance and resource consumption of your TensorFlow models.
- ⚡️ OpenAI Microscope is a collection of visualizations of every significant layer and neuron of eight vision “model organisms” which are often studied in interpretability. (See for example OpenAI’s Distill paper on early vision in InceptionV1; DT #37.)
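Since micrograd is such a neat illustration of how reverse-mode autodiff works, here's my own condensed toy version of the core idea (in the same spirit, but not code from the repository):

```python
# Toy scalar autograd engine in the spirit of micrograd: each Value records
# how to propagate gradients back to its inputs via the chain rule.
class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # d(out)/d(self) = 1
            other.grad += out.grad  # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # product rule
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

x, y = Value(2.0), Value(3.0)
z = x * y + x          # z = xy + x, so dz/dx = y + 1 = 4, dz/dy = x = 2
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```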
Artificial Intelligence for the Climate Crisis 🌍
Quick climate AI links 🌍
- 🔋 In a new paper for Nature, Zhang et al. (2020) used Gaussian processes to forecast the health and remaining useful life (RUL) of lithium-ion batteries, the type of battery used in smartphones and electric cars, based on real-time, non-invasive electrochemical impedance spectroscopy (EIS) measurements. Reliable EIS-based RUL estimation could improve the usability and recyclability of these batteries, which will be critical as we electrify the economy. (For a toy illustration of this kind of GP regression, see the sketch after these links.)
- Climate Change AI’s ICLR 2020 workshop on tackling climate change with machine learning kicked off today. You can watch it live or follow along on Twitter. I’ll try to cover highlights in the next edition of DT.
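As promised above, here's a toy illustration of regressing battery health on EIS-style impedance features with a Gaussian process, using scikit-learn on synthetic data (my sketch only; this is not the paper's model or dataset):

```python
# Toy sketch: fit a Gaussian process mapping EIS-style impedance features
# to remaining battery capacity. Entirely synthetic data for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Pretend features: impedance magnitude at three EIS frequencies per cycle.
X = rng.uniform(0.01, 0.1, size=(50, 3))
# Pretend target: capacity fades roughly linearly with total impedance.
y = 1.0 - 2.0 * X.sum(axis=1) + rng.normal(0, 0.01, size=50)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

# The GP yields a mean prediction *and* an uncertainty estimate, which is
# part of what makes it attractive for forecasting remaining useful life.
mean, std = gp.predict(X[:3], return_std=True)
print(mean, std)
```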
Thanks for reading! As usual, you can let me know what you thought of today’s issue using the buttons below or by replying to this email. If you’re new here, check out the Dynamically Typed archives or subscribe below to get new issues in your inbox every second Sunday.
If you enjoyed this issue of Dynamically Typed, why not forward it to a friend? It’s by far the best thing you can do to help me grow this newsletter. 🏠