UCL Discovery

Cross-Utterance Conditioned VAE for Speech Generation

Li, Yang; Yu, Cheng; Sun, Guangzhi; Zu, Weiqin; Tian, Zheng; Wen, Ying; Pan, Wei; ... Sun, Fanglei (2024) Cross-Utterance Conditioned VAE for Speech Generation. IEEE Transactions on Audio, Speech and Language Processing, 32, pp. 4263-4276. 10.1109/TASLP.2024.3453598. Green open access.

Abstract

Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.
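
The abstract gives no equations, so the following is only a minimal sketch of the generic conditional-VAE mechanism it refers to, not the CUC-VAE S2 implementation. It assumes a latent prosody variable whose prior is conditioned on a context vector, which is a standard CVAE design choice and an assumption here. The function `conditional_prior`, the three-dimensional `context` vector, and all numeric values are hypothetical placeholders for what would be learned networks and features extracted from surrounding utterances.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def conditional_prior(context):
    """Hypothetical stand-in for a learned network that maps a context
    vector to the mean and log-variance of the latent prior."""
    mu = [0.5 * c for c in context]
    logvar = [0.0 for _ in context]
    return mu, logvar


rng = random.Random(0)

# Placeholder for features pooled from neighbouring utterances (assumption).
context = [0.2, -1.0, 0.7]
mu_p, logvar_p = conditional_prior(context)

# Placeholder posterior parameters, as an encoder might produce from
# the target utterance's acoustics during training (assumption).
mu_q, logvar_q = [0.1, -0.4, 0.3], [-0.5, -0.5, -0.5]

# Training: sample from the posterior and penalise its distance to the prior.
z_train = reparameterize(mu_q, logvar_q, rng)
kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)

# Inference: no target audio is available, so sample from the
# context-conditioned prior instead.
z_infer = reparameterize(mu_p, logvar_p, rng)
```

In this sketch the KL term is what ties the posterior to the context-conditioned prior during training, so that sampling from the prior alone at inference time can still yield latents that reflect the surrounding context.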

Type: Article
Title: Cross-Utterance Conditioned VAE for Speech Generation
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/TASLP.2024.3453598
Publisher version: https://doi.org/10.1109/taslp.2024.3453598
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher's terms and conditions.
Keywords: speech synthesis, TTS, speech editing, pre-trained language model, variational autoencoder
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10206796
