Filgueira, R;
Jackson, M;
Terras, M;
Beavan, D;
Roubickova, A;
Hobson, T;
Ardanuy, MC;
... Ahnert, R
(2019)
defoe: A Spark-based Toolbox for Analysing Digital Historical Textual Data.
In: Altintas, Ilkay, (ed.)
Proceedings of the 15th International Conference on eScience 2019.
eScience: San Diego, CA, USA.
Abstract
This work presents defoe, a new scalable and portable digital eScience toolbox that enables historical research. It allows text mining queries to be run in parallel, via Apache Spark, across large datasets such as historical newspapers and books. It handles queries against collections that comprise several XML schemas and physical representations. The proposed tool has been successfully evaluated using five different large-scale historical text datasets and two HPC environments, as well as on desktops. Results show that defoe allows researchers to query multiple datasets in parallel from a single command-line interface and in a consistent way, without any HPC environment-specific requirements.
Type: | Proceedings paper |
---|---|
Title: | defoe: A Spark-based Toolbox for Analysing Digital Historical Textual Data |
Event: | 15th International Conference on eScience 2019, 24-27 September 2019, San Diego, CA, USA |
Open access status: | An open access version is available from UCL Discovery |
Publisher version: | https://escience2019.sdsc.edu/ |
Language: | English |
Additional information: | This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions. |
Keywords: | text mining, distributed queries, Apache Spark, High-Performance Computing, XML schemas, digital tools, digitised primary historical sources, humanities research |
UCL classification: | UCL; UCL > Provost and Vice Provost Offices > UCL SLASH; UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities; UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > Centre for Editing Lives and Letters; UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > Dept of Information Studies; UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > SELCS |
URI: | https://discovery.ucl.ac.uk/id/eprint/10082577 |