UCL Discovery

Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning

Liu, H; (2017) Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning. Doctoral thesis, UCL (University College London). Green open access

Text: Hao_Liu_110083707_PhD_thesis_final_deposit.pdf (7MB)

Abstract

Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Moreover, whichever learning technique is used, whether hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims to address these issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation, thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model, target approximation (TA), between text and speech to control syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. To demonstrate that human speech movement can be considered a dynamic process of target approximation, and that the TA model is a valid F0 generation model for the motor-to-acoustic stage, a TA-based pitch control experiment is first conducted to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the number of required linguistic features was reduced, first to syllable-level features only (with DNN) and then further by removing all positional information (with RNN). Fewer linguistic features as input, together with a limited number of TA parameters as output, led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.

Type: Thesis (Doctoral)
Title: Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning
Event: UCL (University College London)
Open access status: An open access version is available from UCL Discovery
Language: English
Keywords: Fundamental frequency, F0, Pitch, Prosody, Text-to-Speech, Speech synthesis, Articulation, Speech production, Target approximation, Auditory feedback, Deep neural network, Recurrent neural network
UCL classification: UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences
URI: http://discovery.ucl.ac.uk/id/eprint/1535338