Determining Unintelligible Words from their Textual Contexts

Pintér, B; Vörös, G; Palotai, Z; Szabo, Z; Lőrincz, A; (2013) Determining Unintelligible Words from their Textual Contexts. Procedia - Social and Behavioral Sciences, 73, 101-108. doi:10.1016/j.sbspro.2013.02.028. Green open access

PDF:	pinter13determining.pdf (1MB). Available under licence: see the attached licence file.
PDF:	pinter13determining_presentation.pdf (686kB). Available under licence: see the attached licence file.

Abstract

We propose a method for determining unintelligible words from their textual context. Because each unintelligible word can have many candidate replacements, the method must be robust and work at large scale. The large scale makes the problem sensitive to spurious similarities between contexts, that is, cases where the contexts of two different words are similar. To reduce this effect, we induce structured sparsity on the words by formulating the task as a group Lasso problem. We compare this formulation with approaches based on k-nearest neighbors and on support vector machines, and find that group Lasso outperforms both by a large margin. We achieve up to 75% accuracy when determining the word from among 1000 candidate words, on both the Brown corpus and the British National Corpus. This work is relevant to Optical Character Recognition (OCR), which often produces unintelligible words. The proposed method uses information independent of that used in OCR, so a combined approach could be very successful.
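The group Lasso idea in the abstract can be illustrated with a minimal sketch. This record does not give the paper's exact objective, features, or solver, so everything below is an assumption made for illustration: the columns of a matrix `D` are context vectors, `groups` assigns each column to a candidate word, the query context `x` is coded by minimizing 0.5 * ||x - D a||^2 + lam * sum_g ||a_g||_2 with proximal gradient descent (ISTA), and the predicted word is the group with the largest coefficient norm. The function names and the value of `lam` are hypothetical.

```python
import numpy as np


def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrink the whole block toward zero."""
    norm = np.linalg.norm(v)
    if norm <= t:
        return np.zeros_like(v)
    return (1.0 - t / norm) * v


def group_lasso(D, x, groups, lam=0.1, n_iter=500):
    """Minimize 0.5 * ||x - D a||^2 + lam * sum_g ||a_g||_2 with ISTA.

    D      : (n_features, n_atoms) matrix of context vectors (assumed layout)
    x      : (n_features,) context of the unintelligible word
    groups : (n_atoms,) integer array, the candidate word of each column of D
    """
    # Step size 1/L, where L is the Lipschitz constant of the gradient,
    # i.e. the squared largest singular value of D.
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    labels = np.unique(groups)
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)      # gradient of the squared-error term
        z = a - grad / L              # plain gradient step
        for g in labels:              # block-wise shrinkage, one block per word
            idx = groups == g
            a[idx] = group_soft_threshold(z[idx], lam / L)
    return a


def predict_word(D, x, groups, lam=0.1):
    """Return the label of the group whose coefficients have the largest norm."""
    a = group_lasso(D, x, groups, lam=lam)
    labels = np.unique(groups)
    norms = [np.linalg.norm(a[groups == g]) for g in labels]
    return labels[int(np.argmax(norms))]
```

Because the penalty acts on whole blocks of coefficients, a candidate word is either used as a group or switched off entirely, which is one way to read the "structured sparsity on the words" described above. A single context that happens to resemble the query has to compete against whole groups rather than against individual columns.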

Type:	Article
Title:	Determining Unintelligible Words from their Textual Contexts
Open access status:	An open access version is available from UCL Discovery
DOI:	10.1016/j.sbspro.2013.02.028
Publisher version:	http://dx.doi.org/10.1016/j.sbspro.2013.02.028
Language:	English
Additional information:	This work is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivativeWorks 3.0 license (CC BY-NC-ND). You are free to share (copy, distribute and transmit the work), but you must attribute the author, you may not use this work for commercial purposes and you may not alter, transform, or build upon this work and distribute any derivative works you create under a similar license.
Keywords:	distributional hypothesis, natural language processing, structured sparse coding, word recognition
UCL classification:	UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences
URI:	https://discovery.ucl.ac.uk/id/eprint/1433157