Alghamdi, Mohammed M; Wang, He; Bulpitt, Andrew J; Hogg, David C (2022)
Talking Head from Speech Audio using a Pre-trained Image Generator.
In: MM '22: Proceedings of the 30th ACM International Conference on Multimedia, pp. 5228-5236.
Association for Computing Machinery (ACM): New York, NY, USA.
Abstract
We propose a novel method for generating high-resolution videos of talking-heads from speech audio and a single 'identity' image. Our method is based on a convolutional neural network model that incorporates a pre-trained StyleGAN generator. We model each frame as a point in the latent space of StyleGAN so that a video corresponds to a trajectory through the latent space. Training the network is in two stages. The first stage is to model trajectories in the latent space conditioned on speech utterances. To do this, we use an existing encoder to invert the generator, mapping from each video frame into the latent space. We train a recurrent neural network to map from speech utterances to displacements in the latent space of the image generator. These displacements are relative to the back-projection into the latent space of an identity image chosen from the individuals depicted in the training dataset. In the second stage, we improve the visual quality of the generated videos by tuning the image generator on a single image or a short video of any chosen identity. We evaluate our model on standard measures (PSNR, SSIM, FID and LMD) and show that it significantly outperforms recent state-of-the-art methods on one of two commonly used datasets and gives comparable performance on the other. Finally, we report on ablation experiments that validate the components of the model. The code and videos from experiments can be found at https://mohammedalghamdi.github.io/talking-heads-acm-mm/.
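The abstract outlines a two-stage pipeline: invert an identity image into the StyleGAN latent space, predict per-frame latent displacements from speech with a recurrent network, and decode each displaced latent into a video frame. The following PyTorch sketch illustrates that idea only; all module names, feature dimensions, and the commented encoder/generator stand-ins are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of the latent-trajectory idea described in the abstract.
# Assumed shapes: 80-dim audio features per frame, an 18x512 StyleGAN W+ latent.
import torch
import torch.nn as nn

class AudioToLatentDisplacement(nn.Module):
    """Recurrent mapping from speech features to per-frame latent displacements."""
    def __init__(self, audio_dim=80, hidden_dim=512, n_latent=18, latent_dim=512):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_latent * latent_dim)
        self.n_latent, self.latent_dim = n_latent, latent_dim

    def forward(self, audio_feats):           # (B, T, audio_dim), e.g. mel-spectrogram frames
        h, _ = self.rnn(audio_feats)           # (B, T, hidden_dim)
        delta = self.head(h)                   # (B, T, n_latent * latent_dim)
        return delta.view(-1, audio_feats.shape[1], self.n_latent, self.latent_dim)

# Hypothetical usage with stand-ins for the pre-trained inversion encoder and
# StyleGAN generator mentioned in the abstract:
#   w_id   = encoder(identity_image)                       # (1, 18, 512) inverted identity latent
#   delta  = AudioToLatentDisplacement()(mel_spectrogram)  # (1, T, 18, 512) displacements
#   frames = [generator(w_id + delta[:, t]) for t in range(delta.shape[1])]
```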
| Field | Value |
|---|---|
| Type | Proceedings paper |
| Title | Talking Head from Speech Audio using a Pre-trained Image Generator |
| Event | MM '22: The 30th ACM International Conference on Multimedia |
| Location | Lisboa, Portugal |
| Dates | 10 Oct 2022 - 14 Oct 2022 |
| ISBN-13 | 9781450392037 |
| Open access status | An open access version is available from UCL Discovery |
| DOI | 10.1145/3503161.3548101 |
| Publisher version | https://doi.org/10.1145/3503161.3548101 |
| Language | English |
| Additional information | This version is the author accepted manuscript. For information on re-use, please refer to the publisher's terms and conditions. |
| Keywords | Audio-driven synthesis; talking head generation; video generation |
| UCL classification | UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science |
| URI | https://discovery.ucl.ac.uk/id/eprint/10215226 |