Productized AI links
I first covered Cyril Diagne’s AR cut and paste tool in May 2020 when it was a cool tech demo on Twitter, and then again when he productized it as ClipDrop in October. As a reminder, ClipDrop lets you take a picture of an object which it then segments (“clips”) out of the background so that you can paste (“drop”) it onto a canvas on your laptop in AR. Diagne has kept busy since the initial launch: he did Y Combinator, raised a seed round, and grew the team. ClipDrop now has 11,000 paying customers; it’s also launching a new web app and an API. (Register for access to the private beta here.) It’s another great example of AI enabling creativity software — see also Photoshop’s Neural Filters, Rosebud AI’s GAN photo models, all the Spleeter-powered apps, and of course RunwayML and Descript.
Apple’s machine learning blog has a detailed new post about their privacy-focused, on-device implementation of facial recognition in the Photos app. Some interesting details, in no particular order: (1) people are identified not only by embeddings of their face, but also by their upper body and metadata from the photo — two photos taken a few minutes apart are relatively likely to contain the same person; (2) an iterative clustering algorithm first groups very certain matches, then groups those groups, etc, and once it’s no longer certain it asks the user whether two clusters are still the same person; (3) constant re-evaluations of bias in the training dataset serve as a guide to what gaps to fill in new rounds of data collection; (4) running on a recent Apple Neural Engine, face embedding generation takes only 4 milliseconds. I’ve recently switched from Google Photos to Apple Photos, and one thing about their person recognition is definitely impressive: Google thinks two of my friends who are twins are the same person, and Apple can keep them apart.
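Apple’s exact clustering algorithm isn’t public, but the iterative idea — merge only very confident matches first, then re-cluster at looser thresholds and ask the user below some confidence — can be sketched as a greedy pass over normalized embeddings. Everything here (2-D vectors, the 0.9 threshold) is illustrative, not Apple’s implementation:

```python
import numpy as np

def greedy_cluster(embeddings, threshold):
    """Assign each embedding to the first cluster whose centroid it matches
    with cosine similarity >= threshold; otherwise start a new cluster.
    A high threshold means only very confident matches get merged."""
    clusters = []
    for i, e in enumerate(embeddings):
        e = e / np.linalg.norm(e)
        for c in clusters:
            centroid = c["sum"] / np.linalg.norm(c["sum"])
            if centroid @ e >= threshold:
                c["sum"] = c["sum"] + e
                c["members"].append(i)
                break
        else:
            clusters.append({"sum": e.copy(), "members": [i]})
    return [c["members"] for c in clusters]

# Four toy 2-D "face embeddings": two near [1, 0], two near [0, 1]
faces = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.1, 0.99]])
print(greedy_cluster(faces, threshold=0.9))  # → [[0, 1], [2, 3]]
```

Repeating this over cluster centroids with a lower threshold gives the “groups of groups” behavior the post describes.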
Merlin, an app by the Cornell Lab of Ornithology, identifies birds based on their songs and calls. The app’s Sound ID feature currently supports 450+ birds in the US and Canada. It works by visualizing an audio recording of a bird’s song or call as a spectrogram — where the x-axis is time, the y-axis is frequency, and each point’s brightness represents decibels, so it’s essentially a monochromatic image — and then classifies it using computer vision. Because the vision model runs on-device, Merlin also works without a cellular connection. Beyond Sound ID, the app also has a Photo ID feature that directly classifies photos of birds, and one that guesses which bird you saw based on three simple questions (how big it was, what its main colors were, and what it was doing) — that last one is probably just some clever filtering though, not an AI model. Links: App Store, Google Play.
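The spectrogram trick is easy to make concrete: a short-time Fourier transform turns audio into exactly the kind of 2-D array a vision model expects. A minimal numpy sketch (window size, hop, and the test tone are my own illustrative choices, not Merlin’s):

```python
import numpy as np

def spectrogram(signal, sample_rate, window_size=512, hop=256):
    """Compute a log-magnitude spectrogram: frequency bins on one axis,
    time frames on the other, values in decibels."""
    window = np.hanning(window_size)
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size] * window
        # Magnitude of the positive-frequency FFT bins
        frames.append(np.abs(np.fft.rfft(frame)))
    # Stack to shape (freq_bins, time_steps) and convert to decibels
    mag = np.stack(frames, axis=1)
    return 20 * np.log10(mag + 1e-10)

# A 1-second 440 Hz tone sampled at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t), 16000)
# Peak energy sits near 440 / (16000 / 512) ≈ bin 14
print(spec.shape)  # → (257, 61)
```

The resulting array can be fed to an image classifier like any other single-channel picture.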
I joined Oxygen Digital for their AI Series panel on AI-assisted coding (YouTube link). To our own surprise, we filled the whole 90-minute slot — it was a lot of fun! We discussed GitHub Copilot and OpenAI Codex, and lots more about the future of professional software engineering as tools like this become a part of every IDE. As I also wrote in Towards talking to computers with Codex, I’m most excited about how these code generation AI models will unlock the power of working with APIs to people who don’t know how to write code.
AlphaFold, DeepMind’s protein folding neural network, represented a breakthrough in structural biology when it was released in December. Given a protein’s sequenced “code” of amino acid chains, the model predicts what shape the molecule “folds” itself into in nature — a key property for understanding how the protein works. After open-sourcing it last month, DeepMind has now partnered with the European Bioinformatics Institute (EMBL-EBI) to build the AlphaFold Protein Structure Database, “which offers the most complete and accurate picture of the human proteome, doubling humanity’s accumulated knowledge of high-accuracy human protein structures.” The database currently has a total of 350,000 predicted 3D protein structures and will soon be extended to cover all 100+ million sequenced proteins. It’s very impressive how in less than a year, AlphaFold went from cutting-edge research to something that end users (scientists and drug developers) can use without having to run or understand the AI model themselves. From EMBL’s press release: “This step change will catalyse a huge amount of research in new areas, and the development of applications that were previously impossible, impractical or limited in their scope by the hitherto relatively restricted amounts of 3D structural information available.” Some people even think this’ll get the AlphaFold team a Nobel prize. More coverage: MIT Tech Review, NYT, DeepMind blog.
Brickit is an iOS app that uses computer vision to identify LEGO bricks in a big pile and then shows you a list of projects you can build with those bricks — with instructions! The most impressive part is that it can detect so many small objects across so many different classes in one photo. I’d guess it does this by tiling the image or sliding a window over the photo, and then running the smaller images through a custom model powered by Core ML and the iPhone’s neural engine; but I can’t find much information about how the app works exactly. Brickit is a great example of productized AI: its core functionality is enabled by a highly complex machine learning model, but it abstracts this away into a simple user interface.
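For scale: the tiling scheme I’m guessing at is only a few lines. This sketch (tile size and overlap are made-up numbers, nothing from Brickit) covers a 12 MP photo with overlapping crops, so a brick cut off at one tile’s border still appears whole in a neighboring tile:

```python
def tile_image(width, height, tile=416, overlap=64):
    """Return (x, y, w, h) crops that cover the image with some overlap,
    so small objects on tile borders still show up whole in a neighbor."""
    step = tile - overlap
    boxes = []
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            boxes.append((x, y, min(tile, width - x), min(tile, height - y)))
    return boxes

boxes = tile_image(4032, 3024)  # a 12 MP iPhone photo
print(len(boxes))  # → 108
```

Each crop would then be run through the detector, and overlapping detections merged (e.g. with non-maximum suppression).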
Google AI researchers Azalia Mirhoseini and Anna Goldie published a Nature paper on their AI-powered computer chip design methodology, which uses “an edge-based graph convolutional neural network architecture capable of learning rich and transferable representations of the chip.” Trained on a dataset of 10,000 chip floorplans, the method replaces “months of intense effort” for humans, and comes up with a more optimal end result. I covered this research when it first came out in April 2020, but the big news now is that it has been productionized: Mirhoseini and Goldie have used it to design the next generation of Google’s Tensor Processing Units (TPUs)!
GPT-3, OpenAI’s language model that doesn’t need fine tuning, went viral on Twitter about a year ago when people showed off demo projects in which they got it to generate code; a few months later, OpenAI exclusively licensed the model’s underlying technology to Microsoft. We’re now starting to see the results of both those stories: Microsoft has launched a new feature for its “low code, no code” Power Platform that uses GPT-3 to turn natural language prompts into database query code. It doesn’t get more “productized AI” than this!
Google previewed its AI-powered dermatology assist tool at I/O, its yearly developer conference. Integrated with Search, the app guides you through taking photos of your skin at different angles, and then uses a deep learning model published in Nature Medicine to potentially detect one of 288 skin conditions. (See how it works in this GIF.) The tool is explicitly not intended to provide a diagnosis or to serve as a substitute for medical advice. Although this theoretically sounds incredible — internet-scale access to early-stage detection of e.g. skin cancer could be an amazing global DALY booster — experts have raised some serious concerns. Google Ethical AI researcher Dr. Alex Hanna, Stanford Medicine dermatologist Roxanna Daneshjou MD/PhD and Vice journalist Todd Feathers have pointed out that, although Google claims to have tested the app across all demographics, it has not sufficiently tested it across all (Fitzpatrick) skin types: the darkest V and VI types — where skin conditions are already misdiagnosed relatively often — were severely underrepresented in the dataset. The app isn’t live yet, and Google Health spokesperson Johnny Luu told Vice that the dataset has been expanded since the Nature paper was published, but this issue must be properly addressed before the app can responsibly be launched. I’d be disappointed to see it go live without at the very least a Datasheet and a Model Card explaining its limitations.
Thomas Berg wrote about How image search works at Dropbox for the company’s blog. Their algorithm uses a combination of image classification to extract relevant ImageNet-style labels from photos (like “beach” or “hotdog”), and word vectors to match non-exact search terms to those labels (e.g. “shore” or “sandwich”). The rest of the post goes into quite some depth on the production architecture and scalability optimizations in the algorithm’s deployment. Always nice to see these technical deep dives on AI-powered features from product companies!
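The label-matching half of that pipeline boils down to a nearest-neighbor lookup in word-vector space. A toy sketch of the idea (the three-dimensional “word vectors” below are invented for illustration; real ones would come from a trained embedding model like word2vec or GloVe):

```python
import numpy as np

# Toy word vectors (made up; real ones come from an embedding model)
vectors = {
    "beach":    np.array([0.9, 0.1, 0.0]),
    "shore":    np.array([0.85, 0.2, 0.05]),
    "hotdog":   np.array([0.0, 0.9, 0.3]),
    "sandwich": np.array([0.1, 0.85, 0.35]),
    "car":      np.array([0.1, 0.1, 0.9]),
}

def best_label(query, labels):
    """Match a free-form search term to the closest classifier label
    by cosine similarity in word-vector space."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    q = vectors[query]
    return max(labels, key=lambda label: cos(q, vectors[label]))

# The classifier tagged photos with "beach", "hotdog", "car";
# a user searches for "shore" and still finds the beach photos
print(best_label("shore", ["beach", "hotdog", "car"]))  # → beach
```

The same trick maps “sandwich” to “hotdog” — which is exactly the non-exact matching the post describes.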
A bit different from usual on DT: the following is a good example of removing an AI-powered feature from a product. Late last year, Twitter users began to notice that the app’s photo cropping algorithm (which decides what portion of an image to show as preview in the timeline) seemed to favor white faces over Black faces. The simple saliency algorithm doesn’t look for faces specifically but rather tries to predict what part of an image a user would look at first, and no one thought to check it for this bias. Twitter has now solved the problem by no longer cropping images at all, instead displaying standard aspect ratio images in full (which I think is better anyway). Director of Software Engineering Rumman Chowdhury wrote an excellent blog post about how the company handled this issue, including details of its own (open-source) study that confirmed the algorithm’s biases. “One of our conclusions is that not everything on Twitter is a good candidate for an algorithm, and in this case, how to crop an image is a decision best made by people.”
Microsoft Teams now has a live meeting transcription feature, launching first for US-English-speaking users. Microsoft’s implementation here is quite impressive: beyond the basics like speaker attribution and saving the transcript for access after the meeting, the feature “uses a meeting’s invitation, participant names, attachments, etc. to improve the accuracy and recognize meeting-specific jargon for each transcript automatically.” Really cool! This also all happens live during the meeting, and the data isn’t saved on Microsoft’s servers after the meeting ends.
Google Maps is getting some new features powered by “new information and AI.” Maps VP of Product Dane Glasgow wrote about them in a post for The Keyword; I’ll highlight the two features where AI seems most central. (1) Live View, a mobile feature that shows augmented reality navigation overlays by mapping the camera feed to the Street View images, is getting an Indoor mode that “can help you find the nearest elevator and escalators, your gate, platform, baggage claim, check-in counters, ticket office, restrooms, ATMs and more.” I think that’s also the most AI-powered bit here: a few classical computer vision algorithms can already do image-comparison-based localization, but the object recognition was probably done using machine learning. (2) Instead of always showing directions for the last mode you used, Maps “will default to the route with the lowest carbon footprint when it has approximately the same ETA as the fastest route” — the AI bit here is that it learns to rank the options by what you’re likely to take yourself and by what’s popular in the city you’re in: cycling in Amsterdam or the metro in New York.
The European Commission has released its Artificial Intelligence Act, “the first ever legal framework on AI, which addresses the risks of AI and positions Europe to play a leading role globally.” The proposal covers software powered by anything from machine learning to more classical statistical and expert system approaches, and applies rules depending on how risky it deems them. Unacceptable-risk applications like broad, real-time facial recognition or automated social credit systems are completely forbidden; but high-risk applications like emotion detection or biometric categorization systems only require the person being analyzed to be notified it’s happening. As noted on Twitter by Dr. Kate Crawford and in Andrew Ng’s DeepLearning.AI newsletter, there are certainly flaws in the proposal — on the one hand it could hinder innovation, on the other there are loopholes — but it could have a similar effect to GDPR in “drawing a line in the sand” and inspiring other big economies’ regulators to create similar legislation. Creating such hand-holds for what AI applications we accept and don’t accept as a society is a very good thing in my book.
Last September, I wrote that autonomous trucks will be the first big self-driving market. A detailed new report by Deloitte’s Rasheq Zarif now came to the same conclusion: Autonomous trucks lead the way. “Driverless trucks are already heading out to the highway, as shipping companies increasingly look to autonomous technology to meet rising demand for goods. The focus now: determining the best way to hand off trailers from machine to human.” In related news, self-driving car company Waymo, which has been developing autonomous heavy-duty trucks since 2017, invited a few journalists along for a (virtual) test ride. Exciting few years ahead here.
After we saw GPT-3 — OpenAI’s gargantuan language model that doesn’t need finetuning — used for lots of cool demos, the model’s API now powers 300+ apps and outputs an average of 4.5 billion (!) words per day. OpenAI published a blog post describing some of these apps, including Viable, which summarizes and answers questions about survey responses, and Algolia, a website plugin for semantically searching through content. Cool stuff! As the OpenAI API scales up to power more products, though, one thing to keep a close eye on will be how often it outputs problematic responses in production systems. Abid et al. (2021) have shown that GPT-3 has a persistent anti-Muslim bias, and TNW’s Tristan Greene got a GPT-3-powered chatbot to spit out racist and anti-LGBT slurs. The OpenAI API runs a content filter on top of the raw GPT-3 model to prevent such responses from reaching end-users (which is pretty strict in my experience: when I was playing around with the beta, I couldn’t get it to say bad things without labeling them as potentially problematic) but no filter is ever perfect. We’ll see what happens in the coming few years, but I do expect that the good and useful products will outweigh the occasional bad response.
This was making the rounds on Twitter: online genealogy platform MyHeritage launched a tool called Deep Nostalgia that animates faces in old family photos. According to the company’s blog, it was used to animate over 10 million faces in its first week of being live. As with many visual deep-learning-powered features, there is a free version with watermarks, as well as a premium version as part of a paid MyHeritage subscription. The model behind Deep Nostalgia is licensed from D-ID, a startup that makes live portrait, talking heads, and video anonymization products.
OpenAI CEO Sam Altman wrote Moore’s Law for Everything, an essay in which he discusses economic implications of the exponential rate at which AI is improving. As AI replaces more labor and makes goods and services cheaper, he argues that we must shift the focus of taxation away from income and toward capital, to prevent extreme inequality from destabilizing democracies. See his full essay for details of the (US-specific) implementation, in the form of an American Equity Fund and broad land taxes. This reminds me of a discussion we had in my undergrad CS ethics class on “taxing robots” because they replace labor (and taxable income with it). At the time, I argued against this idea because it seems impossible to implement in any sane way — should we tax email (which is free!) because there are no more telegram operator jobs left? Altman’s proposal is a different solution to this same problem, and a pretty interesting one at that — right up there with a Universal Basic Income (UBI).
Cade Metz and Kashmir Hill at The New York Times wrote about how old Flickr photos became a part of facial recognition datasets. The story centers around Exposing.AI, a tool that can show you whether your face is featured in any popular facial recognition datasets like VGG Face, MegaFace and FaceScrub, based on your Flickr username or a photo URL. Beyond that, it’s a good read that goes into how, five to ten years ago when AI was not yet very influential, commercial and university labs were building lots of different facial recognition datasets and, in the spirit of open science, sharing them publicly on the internet. Only now that it’s becoming clear that facial recognition systems are biased — as I covered last summer in Is it enough for only big tech to pull out of facial recognition? and Facial recognition false arrest — some of these datasets are being taken offline. But these systems exist now, and taking down the datasets won’t stop them from being used; only regulation will.
Apparently, Google’s Pixel phones can detect car crashes. This was making the rounds on Twitter after a Reddit user wrote on r/GooglePixel that car crash detection saved them from hours of suffering because they had an accident on their own property, where no one would otherwise have found them for a long time. When the phone detects a crash, it calls local emergency services and says, “You are being contacted by an automated emergency voice service on behalf of a caller. The caller’s phone detected a possible car crash, and they were unresponsive. Please send help,” followed by the phone’s latest location. Pretty amazing stuff, that’s being built into more and more products — Apple Watches have a similar fall detection feature. Dave Burke, Google’s VP of engineering for Android, noticed the story and tweeted a photo of the setup they used to train the ML model powering this feature. Worth a click.
Google is adding camera-based vitals measurement to its Fit app on Android. Initially rolling out to Pixel phones, the new feature can measure your respiratory (breathing) rate by looking at your face and upper torso through the selfie camera — something that, judging from a cursory Scholar search, was only becoming a mainstream research topic two years ago! The rate at which computer vision research makes it from an idea to deployment on millions of phones remains pretty astonishing. The Fit app can also read your heart rate when you place your finger on the back-facing camera, though I don’t think this is as new: I’ve used iPhone apps that did this years ago — but one big difference is that Google has actually done clinical studies to validate these features.
Jingwen Lu, Jidong Long and Rangan Majumder wrote a blog post about Speller100, Microsoft’s zero-shot spelling correction models that now collectively work across 100+ languages. Speller100 is currently live in production as part of Microsoft’s Bing search engine, where it corrects typos in search queries — it’s what powers the “did you mean…” prompt. Although this feature has been around for English-language search queries for a very long time, Speller100 newly enables it for a whole host of smaller languages. It’s also an interesting case study of how an AI-powered refinement step of user input can significantly improve a product’s overall experience. By A/B testing Speller100 against not having spelling correction, the researchers found that it reduced the number of pages with no results by 30%, and manual query reformatting by 5%; and that it increased the number of clicks on spelling suggestions by 67%, and clicks on any item on the page by 70%.
Win Suen wrote about a machine learning system running in production at Dropbox that decides for which files previews should be rendered: Cannes: How ML saves us $1.7M a year on document previews. She goes into two design considerations for building a highly performant AI system: the cost-benefit tradeoff of ML-powered infrastructure savings (rendering fewer previews to save compute vs. hurting user experience by not having previews) and the model complexity tradeoff (prediction accuracy vs. interpretability and cost of deployment). The final model is a gradient-boosted classifier that can “predict previews up to 60 days after time of pre-warm with >70% accuracy.”
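The cost-benefit side of that tradeoff is easy to make concrete: pre-render only the files the classifier scores above some threshold, and pick the threshold that minimizes expected cost. This sketch uses invented costs and scores (nothing from the Dropbox post itself):

```python
def expected_cost(probs, threshold, render_cost=1.0, miss_penalty=5.0):
    """Expected cost of pre-rendering previews only for files the classifier
    scores above `threshold` (all costs in made-up illustrative units)."""
    cost = 0.0
    for p in probs:
        if p >= threshold:
            cost += render_cost        # we spend the render, viewed or not
        else:
            cost += p * miss_penalty   # chance a user hits a missing preview
    return cost

# Classifier scores for five files: p(preview will actually be viewed)
scores = [0.9, 0.7, 0.2, 0.05, 0.01]
best = min([0.0, 0.25, 0.5, 0.75, 1.0], key=lambda t: expected_cost(scores, t))
print(best)  # → 0.25
```

Rendering everything (threshold 0.0) or nothing (threshold 1.0) both cost more than the middle ground — which is the whole point of putting a classifier in front of the render farm.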
Facebook has launched a significantly improved version of its automatic alternative text (AAT) feature, which helps blind or visually impaired people understand the contents of images in their Facebook feed. As explained in Facebook’s tech blog post, this new version of AAT can recognize over 1,200 distinct concepts. Interestingly, the model was trained on weakly supervised data, using the hashtags on billions of public Instagram images as labels. So if you’ve ever posted a picture of your latte and tagged it #latte on Instagram, you may have had a tiny impact on this feature. The blog post also details the user research that went into improving AAT — something I think we usually don’t hear enough about (or do enough of!) in productized AI — so make sure to give it a read. (I wish I could credit the person who wrote this post, but sadly Facebook keeps these posts anonymous, which seems a bit out of character for the company.)
Naveen Arivazhagan and Colin Cherry wrote a post for the Google AI Blog about how they solved a problem with the live speech translation feature in Google Translate: translations would frequently get updated as more of the transcribed text became available, which users found distracting. It’s a cool glimpse into all the things besides model accuracy and speed that are important to get right for a successful AI-powered product, and into how engineers think about turning these nonfunctional requirements into measurable performance metrics they can optimize for.
Creative Commons photo sharing site Unsplash (where I also have a profile!) has launched a new feature: Visual Search, similar to Google’s search by image. If you’ve found a photo you’d like to include in a blog post or presentation, for example, but the image is copyrighted, this new Unsplash feature will help you find similar-looking ones that are free to use. The launch post doesn’t go into detail about how Visual Search works, but I’m guessing some (convolutional) classification model extracts features from all images on Unsplash to create a high-dimensional embedding; the same happens to the image you upload, and the site can then serve you photos that are close together in this embedding space. (Here’s an example of how you’d build that in Keras.)
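The retrieval step of my guess is just cosine nearest neighbors over precomputed embeddings. A minimal sketch (the four-image “index” and its 3-D embeddings are toy data; in production the vectors would come from the model and live in an approximate-nearest-neighbor index):

```python
import numpy as np

def visually_similar(query_emb, index_embs, k=2):
    """Return indices of the k index images closest to the query
    in embedding space, by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    db = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    return np.argsort(-(db @ q))[:k]

# Toy 3-D embeddings for four indexed photos
index = np.array([
    [0.9, 0.1, 0.1],   # 0: sunny beach
    [0.8, 0.3, 0.1],   # 1: another beach
    [0.1, 0.9, 0.2],   # 2: forest
    [0.1, 0.2, 0.9],   # 3: city street
])
query = np.array([0.85, 0.2, 0.1])  # the copyrighted beach photo you uploaded
print(visually_similar(query, index))  # the two beach photos rank first
```

With real embeddings the vectors have hundreds of dimensions, but the lookup is the same idea.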
Business Insider’s Mathias Döpfner did a long new interview with Elon Musk. It covers a lot, and most of it isn’t too relevant to DT, but this Musk quote is: “I’m extremely confident that Tesla will have level five [self driving] next year, extremely confident, 100%.” Yes, this definitely isn’t the first time Musk has claimed full self driving is just around the corner, but my slightly contrarian take (from a few months ago) is that I actually do think Tesla will get to a useful level of self-driving — deployed at scale in consumer cars — first. Their big bet years ago that vision (without LIDAR) is enough for autonomy has enabled them to be years ahead of the competition with their dataset. They’ve harnessed their fleet of Teslas on real roads for very clever sampling, feedback loops (ghost mode), and regression testing; Andrej Karpathy (Tesla’s head of AI) had a really great talk on all this in April last year.
Another episode in the saga of deepfakes, videos that make real people look like they’re saying or doing things they never said or did. In the fall of 2019, Facebook, Microsoft, and Google created datasets and challenges to automatically detect deepfakes (see DT #23); in October 2020, Microsoft then launched their Video Authenticator deepfake detection app (#48). Now, just a few months later, Neekhara et al. (2020) present an adversarial deepfake model that handily beats those detectors: “We perform our evaluations on the winning entries of the DeepFake Detection Challenge (DFDC) and demonstrate that they can be easily bypassed in a practical attack scenario.” And the carousel goes ‘round.
Recorder app for Android, which uses on-device AI to transcribe recordings (see DT #25, #31), now has a new ML-powered feature: Smart Scrolling. The feature “automatically marks important sections in the transcript, chooses the most representative keywords from each section, and then surfaces those keywords on the vertical scrollbar, like chapter headings.” This all happens on-device. How long until it also writes concise summaries of your hour-long recordings?
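“Choosing the most representative keywords from each section” is classic TF-IDF territory: score words that are frequent within a section but rare across the rest of the transcript. A stdlib-only sketch of that idea (my own toy scoring, not Recorder’s on-device model):

```python
from collections import Counter
import math

def section_keywords(sections, k=1):
    """Score words per section by TF-IDF: term frequency within the section
    times inverse document frequency across all sections."""
    docs = [s.lower().split() for s in sections]
    df = Counter(w for d in docs for w in set(d))  # document frequency
    keywords = []
    for d in docs:
        tf = Counter(d)
        score = {w: tf[w] * math.log(len(docs) / df[w]) for w in tf}
        keywords.append(sorted(score, key=score.get, reverse=True)[:k])
    return keywords

sections = ["the budget the budget meeting", "the hiring hiring meeting"]
print(section_keywords(sections))  # → [['budget'], ['hiring']]
```

Words like “the” and “meeting” appear in every section, so their inverse document frequency — and score — drops to zero.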
Runway ML, the “app store” of easy-to-use machine learning models for creators (see DT #18), added a new Green Screen feature, which it says is “[the] first real-time web tool for cutting objects out of videos. Using machine learning, it makes rotoscoping (a.k.a. masking) a lot faster and a lot less painful.” It looks very cool, but take their claim of being first with a grain of salt: Kaleido, the folks behind DT-favorite remove.bg, also launched an ML-powered automatic video background removal tool called unscreen earlier this year (#35). However, for Runway ML, Green Screen represents yet another well-integrated feature for their already extensive AI creativity product, which is not something unscreen can match as a single-use tool. Along with Photoshop’s new AI features (#51), this is yet another example of how quickly deep learning vision models are becoming easy to use for everyone.
Apple has forked TensorFlow 2 to optimize it for their new crazy-fast M1 Macs! This came as a pretty big surprise, and it makes the new M1 Macs even more attractive to ML developers: for the first time, this’ll enable using the internal GPU to train TensorFlow models on Mac laptops, leading to ~5x speedups (!) compared to the previous generation. I’ll probably hold out until the next generation — by which time Apple’s optimizations should also be upstreamed to the main TensorFlow branch instead of only being available on their own fork — but it’s clear that even now these laptops are already huge game changers.
Nathan Benaich and Ian Hogarth’s 2020 State of AI report came out. It covers research, talent, industry, and politics, and is once again full of great in-depth data and analysis. It touches on many of the trends I’ve covered in DT this year, including gargantuan (Transformer) language models, productized NLP, reproducibility, accelerator chips, and self-driving progress. A few topics that are a bit outside DT’s scope but that are very interesting nonetheless include their assertion that biology is experiencing its “AI moment”, their analysis of talent education and flow, and their summary of geopolitical trends surrounding AI hardware and software companies. (Google Slides link; an executive summary is on slide 7.)
Duplex, Google’s “AI technology that uses natural conversation to get things done,” was first launched at the company’s 2018 I/O conference as a way to automatically make phone calls for reservations at restaurants or rental car services (see DT #13). It’s now being used in a lot more places, from calling businesses listed on Google Maps to automatically confirm their opening times, to screening potential spam phone calls. Personally I’d feel a little rude having Duplex make a reservation for me, but I think the use case of double-checking opening times is very useful — especially now, during the pandemic — since that single automated call can prevent a lot of people from showing up to closed doors if opening times are wrong on Google Maps.
Lobe, a web app to visually edit high-level compute graphs of machine learning models and train them, has (re)launched as a Microsoft product. “Just show it examples of what you want it to learn, and it automatically trains a custom machine learning model that can be shipped in your app.” The site’s UI looks super slick, and it can export models to TensorFlow (1.15, not 2.x), TFLite, ONNX, CoreML and more. I’d be very interested to find out what kind of optimizations it applies for the mobile and edge deployment targets — anything on top of the standard TFLite conversion, for example?
Cyril Diagne’s AR cut & paste demo (#39) is now an app: ClipDrop lets you take photos of objects on your phone, uses a background removal model to cut them out, and then lets you paste them onto your laptop screen in augmented reality. I’ve tried it on a few objects I had laying around my apartment, and capturing objects (the “clip” bit) works super reliably; sending the photo to my laptop (the “drop” bit) was a bit less robust.
Descript has launched their new video editor. This is another DT-favorite: Descript originally built an app that lets you edit the transcribed text of an audio file and reflects those changes back into the audio (see DT #18), followed by a version of the product optimized for podcast editing (#24). The newest release turns the app into a fully-fledged video editor, including support for Descript’s core transcript-based editing feature: it can delete sections, auto-remove “uhm”s, and even generate new audio (in the speaker’s voice!) for small corrections. And it comes with a great launch video (by Sandwich, of course).
Long (technical) deep-dive from Google on their lessons learned in a decade of software engineering for machine learning systems: Towards ML Engineering: A Brief History Of TensorFlow Extended (TFX). A recurring theme for those of you that have been reading DT for a while: “We also recommend that before focusing on cutting-edge ML modeling techniques, product leaders should invest more time in adopting interoperable ML platforms for their organizations.”
Amsterdam (where I live!) and Helsinki (where I don’t live) have launched their “AI algorithm registries.” These are actually a pretty cool idea: whenever a municipality “utilizes algorithmic systems as part of [their] city services,” these systems must be cataloged in the city’s algorithm registry. Amsterdam’s registry currently has three entries: (1) license plate-recognizing automated parking control cars, (2) a pilot for algorithm-assisted fraud surveillance for holiday home rentals, and (3) a natural language processing system for categorizing reports of trash in public space. These registries may become a good source of productized AI links for me, but more importantly, this is a great step for building transparency, trust and accountability into these systems.
Your weekly reminder that anyone who tries to sell you a facial recognition system without any age, gender, racial or accessory biases, probably does not actually have such a system to sell to you. From the 1800 submissions to the FairFace Challenge at ECCV 2020, Sixta et al. (2020) found that: “[the] top-10 teams [show] higher false positive rates (and lower false negative rates) for females with dark skin tone as well as the potential of eyeglasses and young age to increase the false positive rates too.” I really hope that everyone deploying these systems widely is aware of this and the potential consequences.
Facebook is increasingly talking publicly about the work it does to keep its platform safe, probably at least partially in response to the constant stream of news about its failures in this area (from Myanmar to Plandemic). This does mean we get to learn a lot about the systems that Facebook AI Research (FAIR) is building to stop viral hoaxes before they spread too widely. Examples include the recent inside look into their AI Red Team (DT #47); their Web-Enabled Simulations (WES, #38) and Temporal Interaction Embeddings (TIES, #34) for detecting bots on Facebook; and their DeepFake detection dataset (#23). Now, Halevy et al. (2020) have published an extensive survey on their work preserving integrity in online social networks, in which they “highlight the techniques that have been proven useful in practice and that deserve additional attention from the academic community.” It covers many of the aforementioned topics, plus a lot more.
New must-read essay if you’re at an AI startup, by Martin Casado and Matt Bornstein at Andreessen Horowitz: Taming the Tail: Adventures in Improving AI Economics. “We share some of the lessons, best practices, and earned secrets we learned through formal and informal conversations with dozens of leading machine learning teams. For the most part, these are their words – not ours.” I couldn’t write a more convincing pitch for the post than that, so I didn’t try.
Not quite productized yet, but an example of work I’ve been seeing more and more of in new arXiv uploads lately: Deep Atrous Guided Filter for Image Restoration in Under Display Cameras — using AI to make photos taken through phone screens look decent. The recent activity was probably due to the RLQ ECCV challenge in August, but it’s making me wonder if in-display selfie cameras will go mainstream in the next year or two. (Xiaomi is already hyping it up.)
Related: Caldwell et al. wrote a paper on AI-enabled future crime for Crime Science, a journal associated with University College London. They think the highest-risk possibilities are: audio/video impersonation (e.g. deepfakes, again see DT #23), driverless vehicles as weapons, tailored phishing, disrupting AI-controlled systems (like the Facebook stuff above), large-scale blackmail, and AI-authored fake news. Burglar bots rank as low-risk and killer robots rank as medium-risk—personally I’d rank killer drones (bad title, good 7-minute sci-fi) above those two.
There has always been a cat-and-mouse game between ever-updating automated content filters and users who think of clever new ways to circumvent them: from email spam filters decades ago to blockers for explicit, violent or fake viral content on social media today. A new filter evasion trick falls through the cracks every once in a while, becomes popular and widely used, and is then eventually added to the automated filters. Depending on the severity of the bypass, this process sometimes has to be completed in mere hours or days. In light of, well, the state of the world, the stakes here are obviously very high—I don’t envy the pressure these ML teams must be under. Tom Simonite at Wired wrote a feature on Facebook’s internal AI Red Team, which is the company’s response to this problem. The team tries to hack the company’s own AI-powered filtering systems before users do, to always stay one step ahead of them. It’s a good read that covers the company’s “risk-a-thons”, their deepfakes detection challenge (DT #23), automated testing, and much more.
Voyage has put up a detailed blog post announcing the G3, its next-generation robotaxi aimed at senior citizens. Although the company is not quite as far along as Waymo, which has had customers riding their driverless taxis for over a year now, Voyage’s service should be live in San Jose, California, next year. I’ve been following this company for a while now and I thought I had featured them on DT at least once before, but my archive appears to disagree with me there. To rectify that, here are some more high-quality technical blog posts from Voyage that I’ve read but never got around to covering: one on their automatic emergency braking system, one on their active learning data curation, and one on their Telessist remote operations solution.
Samuel Axon wrote an in-depth feature on machine learning at Apple for Ars Technica, with input from two executives at the company: John Giannandrea (SVP for ML and AI Strategy) and Bob Borchers (VP of Product Marketing). From handwriting recognition to battery charging optimization, AI has—Software 2.0-style—steadily been eating its way into more and more of the iOS software stack, far beyond just powering the obvious things like Siri speech recognition and camera roll semantic search. Of course, Giannandrea and Borchers also talk a lot about Apple’s focus on on-device ML and their “neural engine” accelerator chips. It’s a long article, but a must-read if you’re into productized AI.
From the folks behind DT-favorite remove.bg (DT #3, #5, #12, #16), automatic video background removal tool Unscreen (#35) now has a Pro version that supports full HD, MP4 export, and unlimited-length videos. It’s web-only for now but an API is in the works. I’ve really enjoyed following this team’s progress over the past almost two years, and it’s great to see they’re continuing to execute successfully.
Interesting case study from Amnesty International on using automated satellite image classification for human rights research in war zones: tens of thousands of volunteers labeled 2.6 million image tiles of the Darfur region in western Sudan as empty, human presence, or destroyed, which Amnesty and Element AI researchers used to train a model that could predict these labels at 85% precision and 81% recall on the test set. This model then allowed them to visualize and analyze different waves of destruction in the zone over time. The full case study is well worth a read: it includes detailed notes on the ethical tradeoffs they considered before starting the project—a contrast with the ethics sections in many recent ML papers that read like checkbox afterthoughts.
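To make those reported metrics concrete: precision and recall for a class like destroyed are computed in a one-vs-rest fashion, treating that class as the positive label. A minimal sketch with toy data (not Amnesty's actual evaluation code):

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treated as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels for six image tiles
true = ["empty", "destroyed", "destroyed", "human presence", "empty", "destroyed"]
pred = ["empty", "destroyed", "empty", "human presence", "empty", "destroyed"]
p, r = precision_recall(true, pred, "destroyed")
print(p, r)  # perfect precision, but one destroyed tile was missed
```

The tradeoff between the two matters here: a false negative means overlooking a destroyed village, while a false positive sends a human analyst to double-check an empty tile.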
Related: Fawkes is a new algorithm by Shan et al. that makes imperceptible changes to portrait photos to fool facial recognition: “cloaked images will teach the model a highly distorted version of what makes you look like you.” The University of Chicago researchers wrapped Fawkes into a Windows and macOS app, and they claim that it’s 100% effective against the state-of-the-art models powering commercially available facial recognition APIs. As my friends who study computer security tell me, though, this is always a cat-and-mouse game: at some point, someone will figure out how to make a facial recognition model that’s robust against Fawkes; and then someone else will make a Fawkes 2.0 that’s robust against that; and then… But, at least for a while, running your photos through Fawkes should make them unrecognizable to most facial recognition models out there. Probably.
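The core idea, stripped of everything that makes Fawkes actually work, is adding a perturbation small enough to be invisible to humans. A toy sketch of that bounded-perturbation step (in reality the perturbation is optimized against face feature extractors, not supplied by hand):

```python
def cloak(pixels, perturbation, epsilon=8):
    """Toy illustration of 'cloaking': nudge each pixel by at most epsilon,
    so the image looks unchanged to a human viewer."""
    cloaked = []
    for p, d in zip(pixels, perturbation):
        d = max(-epsilon, min(epsilon, d))        # clamp change to [-eps, eps]
        cloaked.append(max(0, min(255, p + d)))   # keep pixel in [0, 255]
    return cloaked

img = [120, 121, 119, 200]
out = cloak(img, [3, -12, 5, 30])
print(out)  # every value stays within 8 of the original
```

The hard part, of course, is choosing the perturbation so that those tiny nudges push the image's face embedding far away from your real one.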
Google open-sourced Seq2act, a new model that translates natural-language instructions (“How do I turn on Dutch subtitles on YouTube?”) into mobile UI action sequences (tap the video; tap the settings button; tap closed captions; select Dutch from the list). This isn’t quite productized yet, but who wants to bet that the next major version of Android will allow you to say “OK Google, turn on Dutch subtitles” in the YouTube app—as well as millions of other commands in other apps—and that the phone will just tap the right buttons in the background and do it for you? This is the stuff that makes me jealous as an iPhone user.
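The task Seq2act learns can be pictured as mapping an instruction string to a grounded sequence of UI actions. Here is a toy stand-in that uses a hand-written lookup table instead of a learned model; the types and names are my own illustration, not Google's API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UIAction:
    """A single step in a mobile UI action sequence."""
    verb: str      # e.g. "tap", "select"
    target: str    # the on-screen element the action grounds to

def plan(instruction: str) -> List[UIAction]:
    """Toy stand-in for a learned model: a hand-written lookup, not Seq2act."""
    known = {
        "turn on dutch subtitles on youtube": [
            UIAction("tap", "video"),
            UIAction("tap", "settings button"),
            UIAction("tap", "closed captions"),
            UIAction("select", "Dutch"),
        ]
    }
    return known.get(instruction.lower().strip("?! "), [])

steps = plan("Turn on Dutch subtitles on YouTube")
print([f"{a.verb} {a.target}" for a in steps])
```

The interesting part of the real model is exactly what this lookup table punts on: generalizing to instructions and app screens it has never seen before.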
Update on facial recognition in the United States, which big tech recently pulled out of (see DT #42), and which startups then doubled down on (#43): a group of senators has now proposed legislation to block use of the facial recognition technology by law enforcement. Good!
As we feared following the news that IBM, Microsoft, and Amazon are no longer selling facial recognition technology to police departments in the United States (see DT #42), companies that aren’t tied to large consumer-facing brands—and that aren’t under the level of scrutiny that comes with being a household name—are now doubling down on the space. The only real solution to this problem is regulation.
In related news, a Michigan man was arrested because a facial recognition algorithm misidentified him. This is the first time a facial-recognition-induced wrongful arrest has been reported, which actually slightly surprises me because the technology has been rolled out much more widely in China (although cases like this may not make the news there). What’s less surprising is that this first case happened to a Black man, given that commercial facial recognition algorithms have been shown to make more mistakes on people with darker skin (see DT #41).
Android 11 includes much-improved voice access, where instead of having to say the number next to the part of the screen you want to click, you can just say what you’re trying to do, and the phone is pretty good at understanding your intention. Check out Dieter Bohn’s demo video on Twitter.
Background sound removal in Google Meet got improved significantly: G Suite director of product management Serge Lachapelle made a demo video showing it successfully muting all sorts of annoying meeting noises—while preserving his talking at the same time. Reminds me of Krisp.ai (DT #16).
Isaac Caswell and Bowen Liang summarized recent advances in Google Translate, linking out to papers and other posts describing each change in depth. These changes over the past year have together resulted in an average improvement of +5 BLEU across the 100+ languages now supported by Translate (see this fun gif), with low-resource languages improving by an additional +2 BLEU on average.
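For context on what a “+5 BLEU” jump means: BLEU scores a translation from 0 to 100 by its n-gram overlap with reference translations. A greatly simplified sentence-level sketch (real BLEU is corpus-level, uses up to 4-grams, and adds smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """Greatly simplified BLEU: clipped n-gram precision (up to max_n)
    combined by geometric mean, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())      # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return 100 * brevity * geo_mean

ref = "the cat sat on the mat".split()
print(simple_bleu(ref, ref))  # identical sentences score 100
```

On this 0-to-100 scale, a sustained +5 average across 100+ languages is a substantial improvement, and gains concentrated in low-resource languages are the hardest kind to get.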
remove.bg, the service that automatically removes backgrounds from images, has released an update that significantly improves the quality of their cutouts. It includes better hair handling, edge color correction, and multi-object scenes. This is Software 2.0 in action: the same APIs are now powered by better models, providing better results for users who don’t have to change their workflow.
BenchSci helps life science companies reduce failed experiments by curating reagent catalogs and experiments from the literature, decoding them using ML models, and wrapping the resulting data in an easy-to-use interface for researchers. This is the classic productized AI model of (1) automating graduate-student-level work, (2) applying it across the corpus of literature in some niche, and then (3) selling access to the extracted info as a service. I’m personally a big fan of this model and think it has the potential to make many industries more efficient; VCs seem to agree, since BenchSci recently raised a $22 million round of funding.
DeepQuest’s DeepStack AI Servers offer a different twist on machine learning APIs: instead of just being available as endpoints in the cloud (like Google’s, Microsoft’s and Amazon’s ML APIs), DeepStack’s servers and pretrained models can be installed as Docker containers. This way it combines the ease-of-use of cloud APIs with the data privacy of self-hosting—a cool idea I hadn’t heard of before.
Andrea Lewis Åkerman interviewed Tiffany Deng, Tulsee Doshi and Timnit Gebru on their work at Google to make the company’s AI products more inclusive. “Why is it that some products and services work better for some than others, and why isn’t everyone represented around the table when a decision is being made?” They emphasize the importance of tooling and resources, the difficulty of even defining fairness, and the necessity of diversity in both data and teams. I found their journeys toward their positions at Google—each noticing inequalities in tech and wanting to help fix them—especially eye-opening.
Related: Facebook’s 3D photos feature now simulates depth for any image, using techniques very similar to what the iPhone SE is doing.
Ben Sandofsky of iOS camera app Halide wrote a deep dive on how the new iPhone SE, which has only one rear-facing camera, uses single-image monocular depth estimation to do fake background blur in portrait mode photos. I have some experience with this exact computer vision task, and the results achieved here by Apple—on-device!—look very impressive to me.
Otter.ai auto-generates “rich notes for meetings, interviews, lectures, and other important conversations.” This looks like a fun product, and apparently it’s being integrated into Zoom. (Unrelated: the otter emoji may be the purest emoji in existence.)
Google Lens now lets you copy text from handwritten notes by pointing your phone at them.
Emma Beede conducted a user study on how nurses in Thailand are using Google’s AI screening tool to help diagnose diabetic retinopathy. “[The] study found that the AI system could empower nurses to confidently and immediately identify a positive screening, resulting in quicker referrals to an ophthalmologist.” Beede emphasizes, though, that it’s important to engage with clinicians and patients before widely deploying such systems, to ensure they don’t inadvertently hinder diagnosis.
Writing for Ars Technica, Timothy B. Lee shared his experience of getting a burger delivered by a robot. Part-self-driving and part-piloted, these box-on-wheels sidewalk robots by startups like Starship and Kiwibot are getting pretty clever. “If, like, a group of people surrounded the robot and blocked it,” said Starship executive Ryan Touhy, “the robot would identify the situation and say ‘Hello I’m a Starship delivery robot. Can you please let me pass.’” The whole story is a fun read, as is this comment. Also check out Joan Lääne’s post about their mapping and navigation tech for Starship’s blog.
What’s the best way to mitigate the damage malicious bots can do on your social media platform? Facebook’s answer: building your own set of reinforcement-learning-based bots and setting them loose on a simulated version of your network. The company is deploying these Web-Enabled Simulations (WESs) to catch bad actors, search for bad content, and figure out how real-world bots could scrape data off the platform and break privacy rules.
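At its very simplest, you can picture this as letting an automated agent loose on a fake social graph and watching what it manages to get away with. A toy sketch of a greedy scraper bot on a four-user graph (my own illustration; Facebook's WES bots are reinforcement-learning agents running on a simulated copy of the real platform):

```python
# Friendship graph: each user's profile is only visible to their friends.
friends = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}

def scraper_bot(start, max_requests):
    """Greedily scrape every reachable profile, subject to a request budget."""
    scraped, frontier, requests = set(), [start], 0
    while frontier and requests < max_requests:
        user = frontier.pop()
        if user in scraped:
            continue
        scraped.add(user)   # one "request" scrapes this user's profile
        requests += 1
        frontier.extend(sorted(friends[user] - scraped))
    return scraped

print(scraper_bot(0, max_requests=3))  # budget limits it to 3 of 4 profiles
```

Running swarms of agents like this against a simulated platform lets engineers measure how far a scraper could get, and test rate limits and detectors, without any real user data being touched.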
Michael Schoenberg and Adarsh Kowdle wrote a deep dive on uDepth, the set of neural networks on Google’s Pixel 4 phones that enable some cool computational photography features and a real-time depth sensing API at 30 Hz. Fun bonus: their architecture diagram also highlights whether each step runs on the phone’s CPU, GPU, or Neural Core.
I recently came across Martin Zinkevich’s 24-page Rules of Machine Learning (PDF), “intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google.” Lots of good stuff in here.
Acquisitions in the augmented reality space are heating up: in the past two weeks alone, Ikea bought Geomagical Labs (which works on placing virtual furniture in rooms; an obvious fit) and Pokémon Go developer Niantic acquired 6D.ai (which works on indoor mapping; another obvious fit).
Cool new paper from Liu et al. (2020): Learning to See Through Obstructions proposes a learning-based approach that can remove things like chain fences and window reflections from photos (see the paper PDF for examples). This isn’t yet productized, but one of the collaborators works at Google—so: how long until this shows up in the camera app for Pixel phones?
Related to Software 2.0: Martin Casado and Matt Bornstein at venture capital firm Andreessen Horowitz wrote about the new business of AI-powered software and how it’s different from traditional software-as-a-service companies: margins are lower, there’s a bigger services component, and it’s harder to create a defensible moat. Luckily they end with a set of tips.
Jaime Lien and Nicholas Gillion wrote an interesting story for the Google AI blog about how their Soli radar-based perception features went from a chunky prototype in the company’s Advanced Technology and Projects (ATAP) lab to a tiny chip shipping in Pixel 4 phones. It involved a combination of creating and shipping a novel sensor, as well as designing machine learning models to power Motion Sense, the feature that recognizes hand gestures from radar data.
Natt Garun at The Verge: Tempo is a smart home gym that uses computer vision to track your form in real time.
Self-driving car company Waymo has raised a big new round of funding.