Liu, H;
(2017)
Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning.
Doctoral thesis , UCL (University College London).
Preview |
Text
Hao_Liu_110083707_PhD_thesis_final_deposit.pdf Download (7MB) | Preview |
Abstract
Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model – target approximation (TA) – between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered as a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the amount of required linguistic features were reduced first to syllable level only (with DNN) and then with all positional information removed (with RNN). Fewer linguistic features as input with limited number of TA parameters as output led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.
Type: | Thesis (Doctoral) |
---|---|
Title: | Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning |
Event: | UCL (University College London) |
Open access status: | An open access version is available from UCL Discovery |
Language: | English |
Keywords: | Fundamental frequency, F0, Pitch, Prosody, Text-to-Speech, Speech synthesis, Articulation, Speech production, Target approximation, Auditory feedback, Deep neural network, Recurrent neural network |
UCL classification: | UCL > Provost and Vice Provost Offices UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences |
URI: | https://discovery.ucl.ac.uk/id/eprint/1535338 |
Archive Staff Only
View Item |