Self-driving car company Waymo has also open-sourced its training data, following other players in the field like Chinese tech giant Baidu and ride-hailing company Lyft (see DT #19). Brad Templeton, who previously worked on Google’s car team, wrote about the significance of this release:
Waymo is widely hailed as the best in the business. Their data is good, and it also contains flat 2-D camera images which have been synchronized with LIDAR 3-D scans of the same scene, something else Waymo is good at. … This archive will be a boon for academic researchers, who can’t pay [to collect and label this data themselves].
But there is a catch: the data is strictly licensed for only non-commercial use, so Waymo’s many competitors can’t use it directly. In the long term, though, supporting academia like this should help advance the basic research (object classification, sensor fusion, optical flow, etc.) on top of which all self-driving systems are being built. Read Templeton’s full analysis on Forbes: Waymo Gives Away Free Self-Driving Training Data – But With Restrictions
OpenAI has posted a 6-month follow-up of their GPT-2 release.
GPT-2 is OpenAI’s language model that can generate text which convincingly looks like it was written by a human. It has been controversial from the start: fearing that GPT-2 would be used to automatically generate fake news and propaganda, OpenAI originally released only a small, less capable version of the model, which made it hard for other scientists to evaluate or replicate their work. At the time, I called for them to release the full model to raise public awareness of its capabilities (see DT #8). Instead, over the past half year, OpenAI has been doing a staged release of progressively larger versions of the model (see DT #13), opening it up first to trusted research institutions and then to the public at large. With their latest release a little over two weeks ago, they also published some of their findings about this approach:
Coordination is difficult, but possible: OpenAI has worked closely with over five other groups that have replicated GPT-2 and ensured that no one has released a full-size model yet.
Humans can be convinced by synthetic text: As they feared, fake news written by GPT-2 can be extremely effective; propaganda written by GROVER, a similar language model, actually seems more “plausible” to humans than human-written propaganda.
Detection isn’t that simple: The automated systems to detect text written by GPT-2 or similar models are only around 90% effective, not enough to stop potential bad actors.
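To give a sense of how such detectors work: a common approach is to score text by how likely it looks under a language model and flag high-likelihood text as synthetic. The toy sketch below illustrates only this general idea with a character-bigram model; it is not OpenAI’s actual detector, which fine-tunes a full neural language model, and the corpus, threshold, and function names are all made up for illustration.

```python
# Toy sketch of likelihood-based synthetic-text detection. NOT OpenAI's
# detector; it only illustrates scoring text under a model of
# machine-generated output and thresholding the score.
import math
from collections import Counter

def train_char_bigram_scorer(corpus: str):
    """Return a function scoring text by its average per-character
    log-probability under an add-one-smoothed character bigram model."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    char_counts = Counter(corpus)
    vocab_size = len(char_counts)

    def score(text: str) -> float:
        logprob = sum(
            math.log((pair_counts[(a, b)] + 1) / (char_counts[a] + vocab_size))
            for a, b in zip(text, text[1:])
        )
        return logprob / max(len(text) - 1, 1)

    return score

# Pretend this (made-up) corpus is a sample of machine-generated text.
machine_corpus = "the model generates fluent and repetitive text " * 50
score = train_char_bigram_scorer(machine_corpus)

def looks_machine_generated(text: str, threshold: float) -> bool:
    # Text that the "machine" model finds likely is flagged as synthetic.
    return score(text) > threshold
```

The hard part in practice is exactly what OpenAI reports: real human and model text overlap heavily in likelihood, so no threshold cleanly separates them, which is why even trained neural detectors top out around 90%.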
So what’s next?
As part of our staged release strategy, our current plan is to release the 1558M parameter model in a few months, but it’s plausible that findings from a partner, or malicious usage of our 774M model, could change this.
We think that a combination of staged release and partnership-based model sharing is likely to be a key foundation of responsible publication in AI, particularly in the context of powerful generative models. The issues inherent to large models are going to grow, rather than diminish, over time.
I’ve changed my mind since six months ago and now agree with this approach: detection systems built by trusted partners need to at least catch up to the current publicly available versions of GPT-2 before new, more capable ones become available. For more analysis and a full timeline, see OpenAI’s blog post: GPT-2: 6-Month Follow-Up
- Turbo is a color map for visualizing depth images (common in computer vision research) that’s more perceptually uniform, interpretable, and accessible to color-blind people. Link: Turbo
- Fairness and machine learning is a free online textbook that approaches ML from an anti-discriminatory perspective. Link: fairmlbook.org.