
Arabic dialect identification in the context of bivalency and code-switching

El-Haj, M; Rayson, P; Aboelezz, M; (2018) Arabic dialect identification in the context of bivalency and code-switching. In: Proceedings of the LREC 2018, Eleventh International Conference on Language Resources and Evaluation. (pp. 3622-3627). European Language Resources Association. Green open access

Arabic dialect identification.pdf - Published Version

Abstract

In this paper we present a novel approach to Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans because words are used interchangeably between dialects. The task of automatically identifying dialect is harder still, and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences, so we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new method's classification accuracy can exceed 76%, and that it scores well (66%) when tested on completely unseen data.
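
The abstract names a subtractive bivalency profiling approach without giving its details here. As a rough illustration only, one plausible reading is sketched below in Python: words shared by more than one dialect are subtracted from each dialect's vocabulary, and a sentence is then profiled by counting hits against the remaining dialect-exclusive word lists. The function names, the dialect labels and the placeholder tokens are all assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter


def build_vocabularies(labelled_sentences):
    """Map each dialect label to the set of whitespace tokens seen in its sentences."""
    vocab = {}
    for label, sentence in labelled_sentences:
        vocab.setdefault(label, set()).update(sentence.split())
    return vocab


def subtract_bivalent(vocab):
    """Drop every word that occurs in more than one dialect vocabulary.

    Assumption for this sketch: a word is treated as bivalent if it is
    attested in at least two of the dialect vocabularies.
    """
    counts = Counter(word for words in vocab.values() for word in words)
    return {
        label: {word for word in words if counts[word] == 1}
        for label, words in vocab.items()
    }


def profile(sentence, exclusive):
    """Count how many tokens of the sentence fall in each dialect-exclusive list."""
    tokens = sentence.split()
    return {
        label: sum(token in words for token in tokens)
        for label, words in exclusive.items()
    }


# Toy data with placeholder tokens (hypothetical labels, not real corpus entries).
toy_corpus = [
    ("EGY", "shared egy1 egy2"),
    ("GLF", "shared glf1"),
    ("MSA", "shared msa1 egy2"),
]
exclusive_words = subtract_bivalent(build_vocabularies(toy_corpus))
toy_profile = profile("shared egy1 glf1 egy1", exclusive_words)
```

In a classifier, the per-dialect counts in `toy_profile` could serve as extra features alongside n-grams and the grammatical and stylistic features the abstract mentions; how the paper actually combines them is not stated on this page.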

Type: Proceedings paper
Title: Arabic dialect identification in the context of bivalency and code-switching
Event: LREC 2018, Eleventh International Conference on Language Resources and Evaluation
ISBN-13: 979-10-95546-00-9
Open access status: An open access version is available from UCL Discovery
Publisher version: http://www.lrec-conf.org/proceedings/lrec2018/inde...
Language: English
Additional information: © 2018 The LREC 2018 Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/).
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL SLASH
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > SELCS
URI: https://discovery.ucl.ac.uk/id/eprint/10113748