Speech is a rich biometric signal that contains information about the identity, gender, and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g., a reference image or one-hot encoding). Our model is trained in a self-supervised fashion by exploiting the audio and visual signals that are naturally aligned in videos. To train from video data, we present a novel dataset collected for this work, containing high-quality videos of ten YouTubers with notable expressiveness in both the speech and visual signals.
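The training scheme described above can be sketched as a conditional GAN objective: a generator maps a raw speech chunk to a face image, and a discriminator is trained to tell generated faces from the video frame naturally aligned with that chunk. The following is a minimal, runnable toy sketch of that adversarial loop; the tiny linear "networks", the dimensions, and the random audio/frame pair are all stand-ins invented here for illustration, not the paper's actual architecture.

```python
import math
import random

random.seed(0)

# Hypothetical toy dimensions; the real model works on waveforms and images.
DIM_AUDIO, DIM_IMG = 8, 4

def init_weights(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

G = init_weights(DIM_IMG, DIM_AUDIO)   # generator: speech chunk -> "face"
D = init_weights(1, DIM_IMG)           # discriminator: "face" -> real/fake score

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, label):
    # Binary cross-entropy for a single probability.
    eps = 1e-9
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# One self-supervised training pair: an audio chunk and the video frame
# it is naturally aligned with (both random placeholders here).
audio = [random.gauss(0, 1) for _ in range(DIM_AUDIO)]
real_frame = [random.gauss(0, 1) for _ in range(DIM_IMG)]

fake_frame = matvec(G, audio)                  # G(speech) -> generated face
p_real = sigmoid(matvec(D, real_frame)[0])     # D on the aligned real frame
p_fake = sigmoid(matvec(D, fake_frame)[0])     # D on the generated frame

d_loss = bce(p_real, 1) + bce(p_fake, 0)       # discriminator: real vs. fake
g_loss = bce(p_fake, 1)                        # generator: fool the discriminator
```

In the actual system, gradient steps on `d_loss` and `g_loss` would alternate over many audio/frame pairs; the sketch only shows how the self-supervised pairing supplies the "real" examples without any identity labels.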
Metadata
Item Type:
Conference or Workshop Item (Poster)
Event Type:
Conference
Refereed:
Yes
Uncontrolled Keywords:
deep learning; adversarial learning; face synthesis; computer vision
License:
This item is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
Funders:
“la Caixa” Foundation, funded by the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie grant agreement No. 713673; Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund under contracts TEC2015-69266-P and TEC2016-75976-R (MINECO/FEDER, UE); Science Foundation Ireland (SFI) under grant number SFI/15/SIRG/3283
ID Code:
23188
Deposited On:
16 May 2019 14:46 by Kevin McGuinness
Last Modified:
01 Mar 2022 15:46