UCL Discovery

Integrating Approximate String Matching with Phonetic String Similarity

Ferri, J; Tissot, H; Del Fabro, MD; (2018) Integrating Approximate String Matching with Phonetic String Similarity. In: Proceedings of the European Conference on Advances in Databases and Information Systems: ADBIS 2018. (pp. 173-181). Springer, Cham: Budapest, Hungary.

Text: Tissot_Integrating_approximate_string_matching with_phonetic_string_similarity.pdf - Accepted version
Access restricted to UCL open access staff until 30 July 2019.

Abstract

Well-defined dictionaries of tagged entities are used in many tasks to identify entities where the scope is limited and there is no need to use machine learning. One common solution is to encode the input dictionary into a Trie and use it to find matches in an input text. However, the size of the dictionary and the presence of spelling errors in the input tokens have a negative influence on such solutions. We present an approach that transforms the dictionary and each input token into a compact, well-known phonetic representation. The resulting dictionary is encoded in a Trie that is about 72% smaller than a non-phonetic Trie. We perform inexact matching over this representation to obtain an initial set of candidate results. Lastly, we apply a second similarity measure to select the best candidate with which to annotate a given entity. In our experiments, the approach achieved good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.
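
The pipeline described in the abstract (phonetic encoding, a Trie over the encoded dictionary, inexact matching, then a second similarity filter) can be illustrated with a minimal Python sketch. This is not the authors' implementation or their GATE plug-in. It rests on several assumptions: `phonetic_key` is a crude hypothetical stand-in for Metaphone, the inexact match uses a standard row-by-row Levenshtein traversal of the Trie rather than the active-nodes technique named in the keywords, and the second measure is `difflib.SequenceMatcher`, chosen only because it ships with the standard library. The function names and the `max_dist` and `min_ratio` parameters are likewise illustrative.

```python
from difflib import SequenceMatcher


def phonetic_key(word):
    """Crude stand-in for a phonetic encoding (NOT real Metaphone)."""
    word = word.lower()
    # Map a few similar-sounding spellings to one symbol; order matters.
    for old, new in (("ph", "f"), ("ck", "k"), ("c", "k"), ("z", "s")):
        word = word.replace(old, new)
    letters = [ch for ch in word if ch.isalpha()]
    out = []
    for i, ch in enumerate(letters):
        if i > 0 and ch == letters[i - 1]:
            continue  # collapse doubled letters
        if i > 0 and ch in "aeiouy":
            continue  # drop non-initial vowels
        out.append(ch)
    return "".join(out)


def build_trie(entries):
    """Index dictionary entries by phonetic key; "$" holds the originals."""
    root = {}
    for entry in entries:
        node = root
        for ch in phonetic_key(entry):
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(entry)
    return root


def search(trie, token, max_dist=1):
    """Return entries whose phonetic key is within max_dist edits of the token's key."""
    key = phonetic_key(token)
    hits = []

    def walk(node, prev_row):
        # prev_row[-1] is the edit distance between the path so far and the full key.
        if "$" in node and prev_row[-1] <= max_dist:
            hits.extend(node["$"])
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for j in range(1, len(key) + 1):
                cost = 0 if key[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1, prev_row[j] + 1, prev_row[j - 1] + cost))
            if min(row) <= max_dist:  # prune branches that cannot match
                walk(child, row)

    walk(trie, list(range(len(key) + 1)))
    return hits


def best_match(trie, token, max_dist=1, min_ratio=0.6):
    """Second filter: rank phonetic candidates by similarity of the raw strings."""
    scored = [
        (SequenceMatcher(None, token.lower(), cand.lower()).ratio(), cand)
        for cand in search(trie, token, max_dist)
    ]
    scored = [pair for pair in scored if pair[0] >= min_ratio]
    return max(scored)[1] if scored else None


trie = build_trie(["Boston", "Chicago", "Philadelphia"])
print(best_match(trie, "Filadelfia"))  # Philadelphia (same phonetic key)
print(best_match(trie, "Bostom"))      # Boston (key differs by one edit)
print(best_match(trie, "Denver"))      # None (no candidate within max_dist)
```

Because variant spellings collapse to the same key, several surface forms share a single path in the Trie. That is the intuition behind the size reduction reported in the abstract, although this toy encoding makes no attempt to reproduce the 72% figure.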

Type: Proceedings paper
Title: Integrating Approximate String Matching with Phonetic String Similarity
Event: European Conference on Advances in Databases and Information Systems: ADBIS 2018
ISBN-13: 9783319983974
DOI: 10.1007/978-3-319-98398-1_12
Publisher version: https://doi.org/10.1007/978-3-319-98398-1_12
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Entity recognition, Metaphone, Text tagging, Trie, Active nodes, Fast similarity search
UCL classification: UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Pop Health Sciences > Institute of Health Informatics
UCL > Provost and Vice Provost Offices > School of Life and Medical Sciences > Faculty of Pop Health Sciences > Institute of Health Informatics > Clinical Epidemiology
URI: http://discovery.ucl.ac.uk/id/eprint/10058443