Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires

De, Suparna; Moss, Harry; Johnson, Jon; Li, Jenny; Pereira, Haeron; Jabbari, Sanaz; (2022) Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires. IASSIST Quarterly, 46 (1). 10.29173/iq1023. Green open access

Text: IQ46_1_De_1023_F.pdf - Published Version (504kB)

Abstract

Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow, and encouraged the reuse of existing question items through support for versioning and for provenance objects such as BasedOn. However, the dearth of questionnaire banks that include both question text and response domains has limited the ecosystem that supports the development of DDI-ready Computer Assisted Interviewing (CAI) tools. Archives hold this information in PDFs associated with surveys, but extracting it efficiently into DDI-Lifecycle is a significant challenge.
While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift.
This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires held as PDFs. Using CLOSER Discovery as a ‘training and test dataset’, a number of machine learning approaches have been explored to classify parsed text from questionnaires so that it can be output as valid DDI items for inclusion in a DDI-L compliant repository.
The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metric values, which enables model tuning as well as transparent management of data and experiments.
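
The kind of map described in the abstract, linking input metadata, model and parameters to evaluation metric values, could be sketched as below. This is a minimal illustration only and not the authors' implementation: the class `ExperimentLog`, the function `run_key`, the field names, and every snapshot label, model name and metric value are hypothetical placeholders, and the abstract does not say which tracking tooling the pipeline uses.

```python
import hashlib
import json


def run_key(metadata_version, model_name, params):
    """Derive a stable identifier for one (metadata, model, parameters) combination.

    Hypothetical helper: the same combination always yields the same key,
    regardless of the order in which parameters are listed.
    """
    payload = json.dumps(
        {"metadata": metadata_version, "model": model_name, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class ExperimentLog:
    """In-memory map from run key to configuration and evaluation metrics."""

    def __init__(self):
        self.runs = {}

    def record(self, metadata_version, model_name, params, metrics):
        """Store one experiment and return its key; re-recording overwrites."""
        key = run_key(metadata_version, model_name, params)
        self.runs[key] = {
            "metadata_version": metadata_version,
            "model": model_name,
            "params": dict(params),
            "metrics": dict(metrics),
        }
        return key

    def best(self, metric):
        """Return the recorded run with the highest value for `metric`."""
        return max(self.runs.values(), key=lambda run: run["metrics"][metric])

    def to_json(self):
        """Serialise the whole map so it can be versioned alongside the data."""
        return json.dumps(self.runs, indent=2, sort_keys=True)


if __name__ == "__main__":
    # Placeholder values for illustration only; these are NOT results from the paper.
    log = ExperimentLog()
    log.record("snapshot-a", "model-x", {"alpha": 0.1}, {"f1": 0.25})
    log.record("snapshot-b", "model-x", {"alpha": 0.1}, {"f1": 0.75})
    print(log.best("f1")["metadata_version"])
    print(log.to_json())
```

Keying each run on a hash of its full configuration means that a repeated experiment maps to the same entry, which is one simple way to support the reproducibility and comparative analysis that the abstract mentions.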

Type: Article
Title: Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires
Open access status: An open access version is available from UCL Discovery
DOI: 10.29173/iq1023
Publisher version: https://doi.org/10.29173/iq1023
Language: English
Additional information: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms. The Creative Commons-Attribution-Noncommercial License 4.0 International applies to all works published by IASSIST Quarterly. Authors will retain copyright of the work. Your contribution will be available at the IASSIST Quarterly website when announced on the IASSIST list server.
Keywords: automated metadata extraction, longitudinal surveys, machine learning, model provenance, hyperparameter tuning, DDI Lifecycle
UCL classification: UCL > Provost and Vice Provost Offices > School of Education > UCL Institute of Education
UCL > Provost and Vice Provost Offices > School of Education > UCL Institute of Education > IOE - Social Research Institute
UCL > Provost and Vice Provost Offices > School of Education
UCL
URI: https://discovery.ucl.ac.uk/id/eprint/10146590