UCL Discovery

Determining Unintelligible Words from their Textual Contexts

Pintér, B; Vörös, G; Palotai, Z; Szabo, Z; Lőrincz, A; (2013) Determining Unintelligible Words from their Textual Contexts. Procedia - Social and Behavioral Sciences, 73, 101-108. 10.1016/j.sbspro.2013.02.028. Green open access

PDF: pinter13determining.pdf (1MB)
Available under License: See the attached licence file.

PDF: pinter13determining_presentation.pdf (686kB)
Available under License: See the attached licence file.

Abstract

We propose a method for determining unintelligible words from their textual context. Because a word may have many candidate replacements, a robust, large-scale method is needed. The large scale makes the problem sensitive to spurious similarities between contexts, that is, cases where the contexts of two different words are similar. To reduce this effect, we induce structured sparsity on the words by formulating the task as a group Lasso problem. We compare this formulation to k-nearest neighbor and support vector machine based approaches, and find that group Lasso outperforms both by a large margin. We achieve up to 75% accuracy when determining the word from among 1000 candidate words, both on the Brown corpus and on the British National Corpus. This work is relevant to Optical Character Recognition (OCR), where unintelligible words are often produced. The proposed method uses information independent of that used in OCR, so a combined approach could be expected to perform very well.
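
The group Lasso formulation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each candidate word contributes a group of context vectors (the columns of a dictionary `D`), that the context of the unintelligible word is a vector `x`, and that the predicted word is the group whose coefficients carry the largest norm. The names `group_lasso`, `predict_word`, `D`, `groups`, and `lam`, as well as the proximal gradient solver, are assumptions made for illustration.

```python
import numpy as np


def group_lasso(D, x, groups, lam=0.1, n_iter=200):
    """Approximately minimise 0.5*||x - D a||^2 + lam * sum_g ||a_g||_2.

    Hypothetical helper: uses proximal gradient descent (ISTA) with
    block soft-thresholding. `groups[j]` is the word label of column j.
    """
    a = np.zeros(D.shape[1])
    # Step size 1/L, where L is the squared spectral norm of D.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    labels = np.unique(groups)
    for _ in range(n_iter):
        # Gradient step on the squared reconstruction error.
        z = a - step * (D.T @ (D @ a - x))
        # Proximal step: shrink each word's coefficient block as a whole,
        # so weakly supported words are switched off entirely.
        for g in labels:
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            a[idx] = scale * z[idx]
    return a


def predict_word(D, x, groups, lam=0.1):
    """Return the label of the group with the largest coefficient norm."""
    a = group_lasso(D, x, groups, lam)
    labels = np.unique(groups)
    energy = np.array([np.linalg.norm(a[groups == g]) for g in labels])
    return labels[int(np.argmax(energy))]


# Toy data (assumed, not from the paper): three candidate words,
# two context vectors each, in a six-dimensional context space.
D = np.eye(6)
groups = np.array([0, 0, 1, 1, 2, 2])
# Query context: mostly explained by word 1, with a small spurious
# overlap with one context of word 0.
x = np.array([0.05, 0.0, 0.8, 0.6, 0.0, 0.0])
print(predict_word(D, x, groups))
```

In this toy setting the group penalty removes the small overlap with word 0 as a block, which is the kind of spurious similarity the structured sparsity is meant to suppress. Real context vectors would come from corpus statistics, and `lam` would need tuning.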

Type: Article
Title: Determining Unintelligible Words from their Textual Contexts
Open access status: An open access version is available from UCL Discovery
DOI: 10.1016/j.sbspro.2013.02.028
Publisher version: http://dx.doi.org/10.1016/j.sbspro.2013.02.028
Language: English
Additional information: This work is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivativeWorks 3.0 license (CC BY-NC-ND). You are free to share (copy, distribute and transmit the work), but you must attribute the author, you may not use this work for commercial purposes and you may not alter, transform, or build upon this work and distribute any derivative works you create under a similar license.
Keywords: distributional hypothesis, natural language processing, structured sparse coding, word recognition
UCL classification: UCL
UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences
URI: https://discovery.ucl.ac.uk/id/eprint/1433157
Downloads since deposit: 179