UCL Discovery

Statistical language modelling of dialogue material in the British national corpus.

Hunter, G.J.A. (2004) Statistical language modelling of dialogue material in the British national corpus. Doctoral thesis, University of London. Green open access

Statistical language modelling can be used not only to uncover the patterns that underlie the composition of utterances and texts, but also to build practical language processing technology. Contemporary language applications in automatic speech recognition, sentence interpretation and even machine translation exploit statistical models of language. Spoken dialogue systems, in which a human user interacts with a machine via a speech interface in order to get information, make bookings, complaints, etc., are examples of such systems that are now technologically feasible. The majority of statistical language modelling studies to date have concentrated on written text material (or read versions thereof). However, it is well known that dialogue differs significantly from written text in its lexical content and sentence structure. Furthermore, significant logical, thematic and lexical connections are expected between successive turns within a dialogue, whereas "turns" are not generally meaningful in written text. There is therefore a need for statistical language modelling studies to be performed on dialogue, particularly with the longer-term aim of using such models in human-machine dialogue interfaces.

In this thesis, I describe the studies I have carried out on statistically modelling the dialogue material within the British National Corpus (BNC), a very large corpus of modern British English compiled during the 1990s. The thesis begins with a general introductory survey of the field of automatic speech recognition, followed by a general introduction to some standard techniques of statistical language modelling that are employed later in the thesis. The structure of dialogue is then discussed using some perspectives from linguistic theory, and some previous approaches (not necessarily statistical) to modelling dialogue are reviewed. Next, a qualitative description is given of the BNC and the dialogue data within it, together with some descriptive statistics relating to it and results from constructing simple trigram language models for both dialogue and text data. The main part of the thesis describes experiments on the application of statistical language models based on word caches, word "trigger" pairs, and turn clustering to the dialogue data. Several different approaches are used for each type of model. An analysis of the strengths and weaknesses of these techniques is then presented. The results of the experiments lead to a better understanding of how statistical language modelling might be applied to dialogue for the benefit of future language technologies.
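For readers unfamiliar with the term, a word cache model raises the probability of words that have occurred recently, for example in the preceding turns of a dialogue. The sketch below is a minimal illustration of that general idea and is not taken from the thesis: the class name `CacheUnigramModel`, the unigram background model, the add-one smoothing, and the `cache_size` and `cache_weight` parameters are all assumptions chosen for brevity. The thesis itself reports trigram models and several cache variants whose details are not given in this record.

```python
from collections import Counter, deque


class CacheUnigramModel:
    """Hypothetical sketch: a static unigram model interpolated with a word cache."""

    def __init__(self, training_tokens, cache_size=50, cache_weight=0.2):
        self.counts = Counter(training_tokens)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts)
        # The cache holds only the most recent `cache_size` observed words.
        self.cache = deque(maxlen=cache_size)
        self.cache_weight = cache_weight

    def static_prob(self, word):
        # Add-one smoothing, with one extra slot reserved for unseen words.
        return (self.counts[word] + 1) / (self.total + self.vocab_size + 1)

    def cache_prob(self, word):
        # Relative frequency of the word among the recently observed words.
        if not self.cache:
            return 0.0
        return self.cache.count(word) / len(self.cache)

    def prob(self, word):
        # With an empty cache, fall back to the static model alone.
        if not self.cache:
            return self.static_prob(word)
        lam = self.cache_weight
        return (1 - lam) * self.static_prob(word) + lam * self.cache_prob(word)

    def observe(self, word):
        # Record a word from the running dialogue in the cache.
        self.cache.append(word)


# Usage: a word becomes more probable once it has appeared in the recent history.
tokens = "the cat sat on the mat".split()
model = CacheUnigramModel(tokens, cache_size=3, cache_weight=0.5)
p_before = model.prob("mat")
model.observe("mat")
p_after = model.prob("mat")
```

Linear interpolation is used here because it keeps the result a valid probability whenever both component models are; the interpolation weight would normally be tuned on held-out data rather than fixed by hand.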

Type: Thesis (Doctoral)
Title: Statistical language modelling of dialogue material in the British national corpus.
Identifier: PQ ETD:602659
Open access status: An open access version is available from UCL Discovery
Language: English
Additional information: Thesis digitised by Proquest
UCL classification: UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Brain Sciences > Div of Psychology and Lang Sciences > Speech, Hearing and Phonetic Sciences
URI: https://discovery.ucl.ac.uk/id/eprint/1446734