Pintér, B; Vörös, G; Palotai, Z; Szabo, Z; Lőrincz, A; (2013) Determining Unintelligible Words from their Textual Contexts. Procedia - Social and Behavioral Sciences, 73, 101-108. 10.1016/j.sbspro.2013.02.028.
Abstract
We propose a method to determine unintelligible words based on their textual context. As there can be many different candidate words to choose from for a word, a robust, large-scale method is needed. The large scale makes the problem sensitive to spurious similarities of contexts, that is, cases where the contexts of two different words are similar. To reduce this effect, we induce structured sparsity on the words by formulating the task as a group Lasso problem. We compare this formulation to a k-nearest neighbor and a support vector machine based approach, and find that group Lasso outperforms both by a large margin. We achieve up to 75% accuracy when determining the word from among 1000 candidate words, both on the Brown corpus and on the British National Corpus. The relevance of this work is in Optical Character Recognition (OCR), where unintelligible words are often produced. Our proposed method uses information that is independent of the information used in OCR; in turn, one expects that a combined approach could be very successful.
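To make the group Lasso idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a dictionary `D` whose columns are context vectors of known words, grouped by word, and reconstructs the context `x` of the unknown word by minimizing 0.5·||x − Da||² + λ·Σ_g ||a_g||₂ with a plain proximal gradient loop. The names `group_lasso` and `predict_word`, the toy data, and the rule of picking the word whose coefficient group has the largest norm are all illustrative assumptions.

```python
import numpy as np

def group_lasso(D, x, groups, lam=0.1, n_iter=500):
    """Proximal gradient for 0.5*||x - D a||^2 + lam * sum_g ||a_g||_2."""
    a = np.zeros(D.shape[1])
    # Step size 1/L, where L is the squared spectral norm of D.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    group_ids = np.unique(groups)
    for _ in range(n_iter):
        # Gradient step on the squared reconstruction error.
        z = a - step * (D.T @ (D @ a - x))
        for g in group_ids:
            idx = groups == g
            norm = np.linalg.norm(z[idx])
            # Block soft-thresholding: shrink the whole group or zero it out.
            scale = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            a[idx] = scale * z[idx]
    return a

def predict_word(D, x, groups, words, lam=0.1):
    """Return the candidate word whose coefficient group has the largest norm."""
    a = group_lasso(D, x, groups, lam)
    group_ids = np.unique(groups)
    norms = [np.linalg.norm(a[groups == g]) for g in group_ids]
    return words[group_ids[int(np.argmax(norms))]]

# Toy data (hypothetical): three candidate words, four noisy contexts each.
rng = np.random.default_rng(0)
words = ["bank", "river", "piano"]
dim, per_word = 20, 4
prototypes = np.eye(dim)[: len(words)]
cols, group_list = [], []
for w in range(len(words)):
    for _ in range(per_word):
        c = prototypes[w] + 0.05 * rng.standard_normal(dim)
        cols.append(c / np.linalg.norm(c))  # unit-length context vectors
        group_list.append(w)
D = np.column_stack(cols)
groups = np.array(group_list)

# Context of the "unintelligible" word, generated near the second prototype.
x = prototypes[1] + 0.05 * rng.standard_normal(dim)
predicted = predict_word(D, x, groups, words)
print(predicted)
```

The group penalty switches whole words on or off rather than individual contexts, which is one way to read the abstract's claim that structured sparsity dampens spurious similarities between the contexts of different words.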
Type: | Article |
---|---|
Title: | Determining Unintelligible Words from their Textual Contexts |
Open access status: | An open access version is available from UCL Discovery |
DOI: | 10.1016/j.sbspro.2013.02.028 |
Publisher version: | http://dx.doi.org/10.1016/j.sbspro.2013.02.028 |
Language: | English |
Additional information: | This work is licensed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivativeWorks 3.0 license (CC BY-NC-ND). You are free to share (copy, distribute and transmit the work), but you must attribute the author, you may not use this work for commercial purposes and you may not alter, transform, or build upon this work and distribute any derivative works you create under a similar license. |
Keywords: | distributional hypothesis, natural language processing, structured sparse coding, word recognition |
UCL classification: | UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Life Sciences |
URI: | https://discovery.ucl.ac.uk/id/eprint/1433157 |