In 2019, Google AI introduced Translatotron, “the first ever model that was able to directly [end-to-end] translate speech between two languages,” instead of chaining together separate speech recognition, machine translation, and speech synthesis models (see DT #14). Jia et al. (2021) have now updated the model to create Translatotron 2, which adds voice transfer: the translated speech sounds like it was spoken by the same voice as the input speech, “even when the input speech contains multiple speakers speaking in turns.” (Check out the blog post for some samples of the generated audio.)

One significant change from the original Translatotron is that the voice and content of the input speech are now captured by a single encoder, which the authors claim makes the model harder to abuse for spoofing arbitrary audio content (making someone’s voice say something they never said). I’m a bit surprised that this is such a central part of the blog post, though, since plenty of dedicated voice-mimicking speech generation models already exist that would be easier to use for that purpose anyway.
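To make the cascade-vs-direct distinction concrete, here’s a toy Python sketch. Everything in it is an illustrative stand-in (the function names, the lookup-table “models,” and the string-based “audio”), not the actual Translatotron code: the point is only that a cascaded pipeline passes through an intermediate text representation that discards speaker identity, while a direct model maps source speech to target speech in one step.

```python
def recognize(audio: str) -> str:
    """Toy ASR stand-in: speech -> source-language text."""
    return {"<es-audio>": "hola mundo"}[audio]

def translate(text: str) -> str:
    """Toy MT stand-in: source text -> target text."""
    return {"hola mundo": "hello world"}[text]

def synthesize(text: str) -> str:
    """Toy TTS stand-in: target text -> speech in a generic voice."""
    return f"<en-audio:{text}>"

def cascaded_s2st(audio: str) -> str:
    # Three separately trained models chained together; the
    # intermediate text loses the speaker's voice and prosody.
    return synthesize(translate(recognize(audio)))

class DirectS2ST:
    """Toy end-to-end stand-in: source speech in, target speech out,
    with no text bottleneck, so voice characteristics can carry over.
    A lookup table plays the role of the trained network here."""
    def __init__(self):
        self.table = {"<es-audio>": "<en-audio:hello world:same-voice>"}

    def __call__(self, audio: str) -> str:
        return self.table[audio]

print(cascaded_s2st("<es-audio>"))   # generic voice
print(DirectS2ST()("<es-audio>"))    # voice preserved
```

In the real systems, of course, the inputs and outputs are spectrograms rather than strings, but the contrast is the same: the cascade’s text bottleneck is exactly where the input voice is thrown away.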