UCL Discovery

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis

Huang, R; Lam, MWY; Wang, J; Su, D; Yu, D; Ren, Y; Zhao, Z; (2022) FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. In: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22). (pp. 4157-4163). IJCAI: International Joint Conferences on Artificial Intelligence Organization: Vienna. Green open access

PDF: 2204.09934.pdf - Accepted Version (690kB)

Abstract

Denoising diffusion probabilistic models (DDPMs) have recently achieved leading performance in many generative tasks. However, the high cost of their inherently iterative sampling process has hindered their application to speech synthesis. This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis. FastDiff employs a stack of time-aware location-variable convolutions with diverse receptive field patterns to efficiently model long-term time dependencies under adaptive conditions. A noise schedule predictor is also adopted to reduce the number of sampling steps without sacrificing generation quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms without any intermediate features (e.g., Mel-spectrograms). Our evaluation of FastDiff demonstrates state-of-the-art results with high-quality (MOS 4.28) speech samples. FastDiff also enables sampling 58x faster than real-time on a V100 GPU, making diffusion models practically applicable to speech synthesis deployment for the first time. We further show that FastDiff generalizes well to mel-spectrogram inversion for unseen speakers, and that FastDiff-TTS outperforms other competing methods in end-to-end text-to-speech synthesis. Audio samples are available at https://FastDiff.github.io/.
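The few-step reverse sampling the abstract refers to can be illustrated with a generic DDPM ancestral-sampling loop. This is a minimal sketch, not the authors' implementation: the `denoiser` stand-in, the fixed linear schedule, and the step count are illustrative placeholders; in FastDiff the noise schedule predictor selects a short schedule rather than fixing one in advance.

```python
import numpy as np

def make_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.05):
    # Fixed linear noise schedule, used here only for illustration.
    return np.linspace(beta_start, beta_end, num_steps)

def ddpm_reverse_sample(denoiser, shape, betas, rng):
    """Generic DDPM ancestral sampling loop.
    `denoiser(x_t, t)` is assumed to predict the noise added at step t."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = denoiser(x, t)
        # Posterior mean: remove the predicted noise component at step t.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:  # no noise is injected at the final step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

# Toy usage: a dummy "denoiser" that predicts zero noise, 4 sampling steps.
rng = np.random.default_rng(0)
betas = make_beta_schedule(num_steps=4)
waveform = ddpm_reverse_sample(lambda x, t: np.zeros_like(x), (16000,), betas, rng)
print(waveform.shape)  # (16000,)
```

With only four steps the loop runs in a handful of vectorized operations; reducing the number of reverse steps in this way is what makes diffusion vocoders fast enough for real-time deployment.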

Type: Proceedings paper
Title: FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
Event: Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)
ISBN-13: 9781956792003
Open access status: An open access version is available from UCL Discovery
DOI: 10.24963/ijcai.2022/577
Publisher version: https://doi.org/10.24963/ijcai.2022/577
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL
URI: https://discovery.ucl.ac.uk/id/eprint/10156662
