A comparison of machine learning and rule-based approaches for text mining in the archaeology domain, across three languages

Brandsen, Alex; Vlachidis, Andreas; Lien-Talks, Alphaeus; (2023) A comparison of machine learning and rule-based approaches for text mining in the archaeology domain, across three languages. Presented at: CAA 2023, Amsterdam, Netherlands. Green open access

Text
CAA2023 - NER comparison.pdf - Accepted Version

Abstract

Archaeology is a destructive process in which the evidence primarily becomes written documentation. As such, the archaeological domain creates huge amounts of text, from books and scholarly articles to unpublished ‘grey literature’ fieldwork reports. The number of archaeological investigations is increasing significantly, and easy access to the information hidden in these texts is a substantial problem for the archaeological field, one identified as early as 2005 (Falkingham 2005). In the Netherlands alone, an estimated 4,000 new grey literature reports are created each year, as well as numerous books, papers and monographs. Furthermore, as research, such as desk-based assessments, is increasingly carried out remotely online, these documents need to be made more easily Findable, Accessible, Interoperable and Reusable. Making these documents searchable and analysing them by hand is a time-consuming task, and the results often lack consistency. Text mining provides methods for disclosing information in large text collections, allowing researchers to locate (parts of) texts relevant to their research questions and to identify patterns of past behaviour in these reports. Furthermore, it enables resources to be searched in meaningful ways, using semantically interoperable vocabularies and domain ontologies to answer questions on what, where and when. The EXALT project at Leiden University is creating a semantic search engine for archaeology in and around the Netherlands, indexing all available open-access texts, which include Dutch, English and German language documents. In this context, we are systematically researching and comparing different methods for extracting information from archaeological texts in these three languages. The specific task we are looking at is Named Entity Recognition (NER): finding and recognising certain concepts in text, e.g. artefacts, time periods and places.
In the archaeology domain, the task of entity recognition is particularly specialised and determined by domain semantics that pose challenges to conventional NER. We develop text mining applications tailored to the archaeological domain, and in this process we will compare a rule-based, knowledge-driven approach (using GATE), a ‘traditional’ machine learning method (Conditional Random Fields), and a deep learning method (BERT). Previous studies have investigated different applications of text mining in archaeological literature (Richards et al. 2015), but often at a relatively small scale, in isolated case studies, or as proof-of-concept work. With this study, we are comparing multiple methods in multiple languages, and we aim to contribute to guidelines and good practice for text mining in archaeology. Specifically, we will compare not only the overall accuracy of each approach, but also the time, digital literacy, hardware, and labelled data needed to run each method. We also pay attention to the energy usage and CO2 output of these machine learning models and their impact on climate change, something that is particularly pertinent during the ongoing energy crisis. Besides these more practical aspects, we also aim to describe some general properties of the way we write about archaeology, and how writing in a particular language can make knowledge transfer (and by extension, NER) easier or more difficult.

Type: Conference item (Presentation)
Title: A comparison of machine learning and rule-based approaches for text mining in the archaeology domain, across three languages
Event: CAA 2023
Location: Amsterdam, Netherlands
Dates: 03 - 06 April 2023
Open access status: An open access version is available from UCL Discovery
Publisher version: https://2023.caaconference.org/
Language: English
Keywords: Named Entity Recognition, Archaeology, Multilingual, Information Extraction, Machine Learning, Deep Learning, BERT
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL SLASH
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > Dept of Information Studies
URI: https://discovery.ucl.ac.uk/id/eprint/10168813