Dynamically Typed

#38: Gender bias reductions in Google Translate, Facebook's bot simulation, and ML-based detection of battery degradation

Hey everyone, welcome to Dynamically Typed #38! Since a lot of you are new here, I figured I’d spend a few sentences explaining how DT is set up. I send out this newsletter every second Sunday, covering stories across four categories.

For each of these categories, I cover one story in depth and include quick links to a few more. Depending on how much happened that fortnight, some categories may be larger than the rest, while others may not make the cut at all. My favorite newsletters help me discover one or two pieces of interesting content I otherwise wouldn’t have found, and my aim is to provide that same service to you with Dynamically Typed.

With that out of the way, let’s dive into today’s newsletter.

Productized Artificial Intelligence 🔌

Gender-specific translations from Persian, Finnish, and Hungarian in the new Google Translate.

Google is continuing to reduce gender bias in its Translate service. Previously, it might translate “o bir doktor” in Turkish, a language that does not use gendered pronouns, to “he is a doctor”—assuming doctors are always men—and “o bir hemşire” to “she is a nurse”—assuming that nurses are always women. This is a very common example of ML bias, to the point that it’s covered in introductory machine translation courses like the one I took in Edinburgh last year. That doesn’t mean it’s easy to solve, though.

Back in December 2018, Google took a first step toward reducing these biases by providing gender-specific translations in Translate for Turkish-to-English phrase translations, like the example above, and for single word translations from English to French, Italian, Portuguese, and Spanish (DT #3). But as they worked to expand this into more languages, they ran into scalability issues: only 40% of eligible queries were actually showing gender-specific translations.

Google’s original and new approaches to gender-specific translations.

They’ve now overhauled the system: instead of attempting to detect whether a query is gender-neutral and then generating two gender-specific translations, it now generates a default translation and, if this translation is indeed gendered, also rewrites it to an opposite-gendered alternative. This rewriter uses a custom dataset to “reliably produce the requested masculine or feminine rewrites 99% of the time.” As before, the UI shows both alternatives to the user.
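
To make that pipeline a bit more concrete, here’s a minimal toy sketch in Python of the flow as I understand it from Google’s description; the `translate`, `is_gendered`, and `rewrite_gender` functions are trivial stand-ins of my own, not Google’s actual models or APIs:

```python
# A toy sketch of the rewrite-based pipeline described above.
TOY_TRANSLATIONS = {"o bir doktor": "he is a doctor"}  # stand-in "default" translations

def translate(source_text: str) -> str:
    """Stand-in for the default (possibly gendered) translation model."""
    return TOY_TRANSLATIONS[source_text]

def is_gendered(translation: str) -> bool:
    """Stand-in for the check that the default translation committed to a gender."""
    return any(word in {"he", "she", "his", "her"} for word in translation.split())

def rewrite_gender(translation: str) -> str:
    """Stand-in for the learned rewriter that produces the opposite-gendered alternative."""
    swaps = {"he": "she", "she": "he", "his": "her", "her": "his"}
    return " ".join(swaps.get(word, word) for word in translation.split())

def gender_specific_translations(source_text: str) -> list[str]:
    default = translate(source_text)
    if is_gendered(default):
        # Show both the default and the rewritten alternative, as in the UI.
        return sorted([default, rewrite_gender(default)])
    return [default]

print(gender_specific_translations("o bir doktor"))
# ['he is a doctor', 'she is a doctor']
```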

Another interesting aspect of this update is how they evaluate the overall system:

We also devised a new method of evaluation, named bias reduction, which measures the relative reduction of bias between the new translation system and the existing system. Here “bias” is defined as making a gender choice in the translation that is unspecified in the source. For example, if the current system is biased 90% of the time and the new system is biased 45% of the time, this results in a 50% relative bias reduction. Using this metric, the new approach results in a bias reduction of ≥90% for translations from Hungarian, Finnish and Persian-to-English. The bias reduction of the existing Turkish-to-English system improved from 60% to 95% with the new approach. Our system triggers gender-specific translations with an average precision of 97% (i.e., when we decide to show gender-specific translations we’re right 97% of the time).

The standard academic metrics (recall and average precision) did not answer the most important question about the two different approaches, so the developers came up with a new metric specifically to evaluate relative bias reduction. Beyond machine translation, this is a nice takeaway for productized AI in general: building the infrastructure and metrics to measure how your ML system behaves in its production environment is at least as important as designing the model itself.
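
As a quick illustration of how the bias reduction metric works (using the numbers from the quote above, not Google’s actual evaluation code):

```python
def relative_bias_reduction(old_bias_rate: float, new_bias_rate: float) -> float:
    """Relative reduction of bias between the old and new systems, where "bias"
    means making a gender choice that is unspecified in the source."""
    return (old_bias_rate - new_bias_rate) / old_bias_rate

# The example from the post: biased 90% of the time before and 45% after
# corresponds to a 50% relative bias reduction.
print(relative_bias_reduction(0.90, 0.45))  # 0.5
```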

In the December 2018 post announcing gender-specific translations, the authors mention that one next step is also addressing non-binary gender in translations; this update does not mention that, but I hope it’s still on the roadmap. Either way, it’s commendable that Google has continued pushing on this even after the story has been out of the media for a while now.

Quick productized AI links 🔌

Machine Learning Research 🎛

That’s a lot of authors and institutions.

A large coalition of big-name ML researchers and institutions published Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. The 80-page report recognizes “that existing regulations and norms in industry and academia are insufficient to ensure responsible AI development” and presents a set of recommendations for providing evidence of the “safety, security, fairness, and privacy protection of AI systems.” Specifically, they outline two types of mechanisms:

  1. Mechanisms for AI developers to substantiate claims about their AI systems—going beyond just saying a system is “privacy-preserving” in the abstract, for example.
  2. Mechanisms that users, policy makers, and regulators can use to increase the specificity and diversity of demands they make to AI developers—again, going beyond abstract, unenforceable requirements.

The two-page executive summary and single-page list of recommendations (categorized across institutions, software, and hardware) are certainly worth a read for anyone involved in AI development in any capacity, from researchers to regulators:

Recommendations from the report by Brundage et al. (2020).

On the software side, I found the audit trail recommendation especially interesting. The authors state the problem as follows:

AI systems lack traceable logs of steps taken in problem-definition, design, development, and operation, leading to a lack of accountability for subsequent claims about those systems’ properties and impacts.

Solving this will have to go far beyond just saving a git commit history that traces the development of a model. For the data collection, testing, deployment, and operational aspects, there are no reporting or verification standards in widespread use yet.
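
To give a rough idea of what a first step toward such an audit trail could look like, here’s a hypothetical sketch of a structured, append-only log entry; the fields are my own guesses at what you’d want to record, not something the report prescribes:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AuditLogEntry:
    """Hypothetical record for one step in an ML system's lifecycle."""
    stage: str        # e.g. "data-collection", "training", "deployment"
    description: str  # what was done and why
    actor: str        # person or service responsible for the step
    artifacts: dict = field(default_factory=dict)  # hashes/URIs of datasets, code, models
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def append_to_audit_trail(entry: AuditLogEntry, path: str = "audit_trail.jsonl") -> None:
    # Append-only JSON Lines log; a production system would also need
    # tamper-evident storage and verification, which this sketch skips.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(entry)) + "\n")

append_to_audit_trail(AuditLogEntry(
    stage="training",
    description="Retrained the toxicity classifier on the v2 dataset after a bias review",
    actor="ml-team@example.com",
    artifacts={"dataset": "sha256:...", "code": "git commit abc123", "model": "models/toxicity-v2"},
))
```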

I don’t think these standards can be sensibly defined for “AI” at large, so they’ll have to be implemented on an industry-by-industry basis. There’s a whole different set of things to think about for self-driving cars than for social media auto-moderation, for example.

(Also see OpenAI’s short write-up of the report.)

Quick ML research + resource links 🎛 (see all 61 resources)

Artificial Intelligence for the Climate Crisis 🌍

Quick climate AI links 🌍

Thanks for reading! As usual, you can let me know what you thought of today’s issue using the buttons below or by replying to this email. If you’re new here, check out the Dynamically Typed archives or subscribe below to get new issues in your inbox every second Sunday.

If you enjoyed this issue of Dynamically Typed, why not forward it to a friend? It’s by far the best thing you can do to help me grow this newsletter. 🏠