Understanding RL Vision
Cool new Distill paper from Hilton et al. (2020): Understanding RL Vision. The authors train a reinforcement learning agent to play CoinRun, a procedurally-generated video game, using single frames as input, and then develop an interactive interface (embedded in the article!) to study what different parts of the network learn. Using Circuits editing (see DT #37), they then make the agent blind to, for example, left-moving enemies in the game, and show experimentally that this indeed makes it fail more often by missing such enemies. “Our results depend on levels in CoinRun being procedurally-generated, leading us to formulate a diversity hypothesis for interpretability. [Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).] If it is correct, then we can expect RL models to become more interpretable as the environments they are trained on become more diverse.” As always, the full article is a great Sunday long read.
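To make the model-editing idea a bit more concrete, here is a minimal PyTorch sketch, not the paper's actual code: it zeroes out a couple of hand-picked feature channels in a toy convolutional policy via a forward hook, which is the spirit of "blinding" the agent to whatever those channels detect. In the paper the weights to edit are found with attribution and feature visualization; the network architecture, channel indices, and action count below are all placeholder assumptions.

```python
import torch
import torch.nn as nn

# Toy convolutional policy standing in for the CoinRun agent's vision
# backbone; this is an illustrative stand-in, not the paper's model.
class TinyPolicy(nn.Module):
    def __init__(self, n_actions: int = 15):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=2)
        self.head = nn.Linear(32, n_actions)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = x.mean(dim=(2, 3))  # global average pool over spatial dims
        return self.head(x)

def blind_channels(module: nn.Module, channels: list[int]):
    """Register a forward hook that zeroes the given feature channels,
    mimicking the idea of making the agent 'blind' to whatever those
    channels detect (e.g. left-moving enemies)."""
    def hook(_module, _inputs, output):
        output = output.clone()
        output[:, channels] = 0.0
        return output  # returned value replaces the layer's output
    return module.register_forward_hook(hook)

policy = TinyPolicy()
# Channel indices are placeholders; in the paper they would be
# identified with attribution / feature visualization, not hard-coded.
handle = blind_channels(policy.conv2, channels=[3, 7])

frame = torch.rand(1, 3, 64, 64)   # one observation frame
logits_blind = policy(frame)       # forward pass with ablated features
handle.remove()                    # restore the original behaviour
logits_full = policy(frame)
```

Comparing the agent's behaviour (or failure rate) with and without the hook is the same kind of experiment the authors run when they show the edited agent misses left-moving enemies more often.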