Kavita Ganesan and Romano Foti wrote about OctoLingua, a machine learning model GitHub uses to classify programming languages. Figuring out in what language a piece of code is written may seem like a trivial problem, but GitHub deals with lots of files with missing, incorrect, and overlapping file extensions. In the first of those scenarios (missing extensions), their previous heuristics-based approach, Linguist, has an F1-score of 0.05. This is not surprising, since Linguist mostly looks at file extensions to make its predictions. The new approach, OctoLingua, uses some interesting hand-engineered features (such as the top five special characters and top 20 code tokens per file), which it feeds into a two-layer fully-connected neural network with dropout. Using this approach, and without having to jump straight to more complex sequence-to-sequence models like LSTMs, OctoLingua achieves an F1-score of up to 0.95 in the missing file extension setting. More details on the GitHub Blog:
C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages.
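As a rough illustration, here's what a small classifier along those lines could look like in Keras. The layer sizes, dropout rate, and feature/label counts below are made-up placeholders, not the values GitHub actually uses:

```python
import tensorflow as tf

NUM_FEATURES = 120   # e.g. counts of top special characters + top code tokens (placeholder size)
NUM_LANGUAGES = 50   # number of programming languages to predict (placeholder)

# Two fully-connected hidden layers with dropout, followed by a softmax over languages.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(NUM_FEATURES,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_LANGUAGES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```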
Google released YouTube-8M Segments, a new video classification dataset. Unlike the original YouTube-8M dataset, which had machine-generated labels for eight million YouTube videos, Segments has human-labeled annotations of five-second clips within these videos. Google is also hosting a Kaggle challenge for the new dataset and organizing a related workshop at ICCV'19. More on the Google AI blog:
Announcing the YouTube-8M Segments Dataset.
Vi Hart wrote an essay about AI, Universal Basic Income, and the value of data. She frames machine learning models as “collages” that “glue together” the knowledge of the thousands (or millions) of people whose data was used to train them:
That cancer-screening AI might get it right more often than any individual doctor, but that’s not because it’s “smarter” but because it uses the knowledge of thousands of doctors all collaged together. – M Eifler
Hart argues that we should value data creators much more than we do today. Many traditional jobs are being replaced by machine learning systems that can figure out 99% of cases, and then fall back on 10-cents-a-task Mechanical Turk gig workers for the 1% they can’t handle. These people (like the annotators for YouTube-8M Segments) are doing critically important work—without them, ML systems can’t “learn” anything—but they are woefully underpaid and underprotected by current labor laws that were designed for non-gig work. Hart:
33% of US adults have a bachelor’s degree. A 2016 Pew Research Center study that focused on US workers on [Amazon’s Mechanical Turk platform (MTurk)] found that 51% of US workers on MTurk have a bachelor’s degree. The same Pew Research Center study also found that the majority of US workers on MTurk were making under $5 an hour. Federal minimum wage in the US is $7.25.
That’s people who went to college and got a degree, working online for less than minimum wage, with zero benefits.
This “ghost work” problem is just one small part of Hart’s essay; she also dives deeper into the influences of the history of science and cultural evolution, the “AI purity culture” prevalent in Silicon Valley, and data markets. It’s a long post, but it’s incredibly insightful. I highly recommend taking an hour out of your Sunday to sit down and give it a read. If you do, please let me know your thoughts! Read Vi Hart’s essay at The Art of Research here:
Changing my Mind about AI, Universal Basic Income, and the Value of Data
- IBM’s differential privacy library, which includes a Jupyter notebook to explore the effect of differential privacy on machine learning accuracy using basic classification and clustering models (see the first sketch after this list). Link: IBM/differential-privacy-library
- PyTorch Transformers is a library of state-of-the-art pre-trained models for natural language processing, including BERT, GPT, GPT-2, Transformer-XL, XLNet, and XLM (see the second sketch after this list). Link: huggingface/pytorch-transformers
- I made a repository of BibTeX citations for common Python packages, to make it easier to give credit to the software we use for machine learning research in papers. Link: leonoverweel/bibtex-python-package-citations
- fast.ai has several high-quality free online machine learning courses, from practical deep learning for coders, to cutting edge deep learning from the foundations, to NLP. Link: fast.ai
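For the differential privacy library, here's a minimal sketch of the kind of experiment that notebook runs, assuming diffprivlib's scikit-learn-style interface; the dataset and epsilon value are arbitrary choices for illustration:

```python
# Compare a differentially private Gaussian naive Bayes classifier (diffprivlib)
# against scikit-learn's non-private version on a toy dataset.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from diffprivlib.models import GaussianNB as PrivateGaussianNB

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = GaussianNB().fit(X_train, y_train)
# Smaller epsilon = stronger privacy guarantee, more noise, (usually) lower accuracy.
# diffprivlib will warn that feature bounds weren't specified; that's fine for a demo.
private = PrivateGaussianNB(epsilon=1.0).fit(X_train, y_train)

print("non-private accuracy:", baseline.score(X_test, y_test))
print("private accuracy:    ", private.score(X_test, y_test))
```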
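And for PyTorch Transformers, a quick sketch of extracting BERT features from a sentence, along the lines of the examples in the library's README (the model name and input text here are just examples):

```python
import torch
from pytorch_transformers import BertTokenizer, BertModel

# Load a pre-trained tokenizer and model (downloads weights on first use).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# Encode a sentence and get its contextual token embeddings from BERT.
input_ids = torch.tensor([tokenizer.encode("Dynamically typed languages are fun.")])
with torch.no_grad():
    last_hidden_states = model(input_ids)[0]  # shape: (batch, tokens, hidden size)

print(last_hidden_states.shape)
```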