UCL Discovery

Natural Language Processing for Under-resourced Languages: Developing a Welsh Natural Language Toolkit

Cunliffe, D; Vlachidis, A; Williams, D; Tudhope, D; (2022) Natural Language Processing for Under-resourced Languages: Developing a Welsh Natural Language Toolkit. Computer Speech & Language, 72, Article 101311. 10.1016/j.csl.2021.101311. Green open access

WNLT author - accepted version.pdf - Accepted Version

Abstract

Language technology is becoming increasingly important across a variety of application domains which have become commonplace in large, well-resourced languages. However, there is a danger that small, under-resourced languages are being increasingly pushed to the technological margins. Under-resourced languages face significant challenges in delivering the underlying language resources necessary to support such applications. This paper describes the development of a natural language processing toolkit for an under-resourced language, Cymraeg (Welsh). Rather than creating the Welsh Natural Language Toolkit (WNLT) from scratch, the approach involved adapting and enhancing the language processing functionality provided for other languages within an existing framework and making use of external language resources where available. This paper begins by introducing the GATE NLP framework, which was used as the development platform for the WNLT. It then describes each of the core modules of the WNLT in turn, detailing the extensions and adaptations required for Welsh language processing. An evaluation of the WNLT is then reported. Following this, two demonstration applications are presented. The first is a simple text mining application that analyses wedding announcements. The second describes the development of a Twitter NLP application, which extends the core WNLT pipeline. As a relatively small-scale project, the WNLT makes use of existing external language resources where possible, rather than creating new resources. This approach of adaptation and reuse can provide a practical and achievable route to developing language resources for under-resourced languages.

Type: Article
Title: Natural Language Processing for Under-resourced Languages: Developing a Welsh Natural Language Toolkit
Open access status: An open access version is available from UCL Discovery
DOI: 10.1016/j.csl.2021.101311
Publisher version: https://doi.org/10.1016/j.csl.2021.101311
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
Keywords: Natural Language Processing, Under-resourced Languages, Welsh, Cymraeg, Language Technology
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL SLASH
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > Dept of Information Studies
URI: https://discovery.ucl.ac.uk/id/eprint/10136398
Downloads since deposit: 240
