eprintid: 10146592
rev_number: 8
eprint_status: archive
userid: 699
dir: disk0/10/14/65/92
datestamp: 2023-01-09 15:32:21
lastmod: 2023-01-09 15:32:21
status_changed: 2023-01-09 15:32:21
type: conference_item
metadata_visibility: show
sword_depositor: 699
creators_name: De, Suparna
creators_name: Moss, Harry
creators_name: Jabbari, Sanaz
creators_name: Johnson, Jon
creators_name: Periera, Haeron
creators_name: Li, Jennie
title: Engineering a Machine Learning Pipeline for Automating Metadata Extraction from Longitudinal Survey Questionnaires
divisions: B14
divisions: J81
divisions: B16
divisions: UCL
note: This is an Open Access presentation published under a Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/).
abstract: Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow and, through support for versioning and provenance with objects such as BasedOn, encouraged the reuse of existing question items. However, the dearth of questionnaire banks that include both question text and response domains has limited the growth of an ecosystem to support the development of DDI-ready CAI tools. Archives hold this information in PDFs associated with surveys, but extracting it efficiently into DDI-Lifecycle is a significant challenge.

While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift. 

This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires held as PDFs. Using CLOSER Discovery as a ‘training dataset’, a number of machine learning approaches have been explored to classify parsed text from questionnaires so that it can be output as valid DDI items for inclusion in a DDI-L compliant repository.

The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to keep track of various combinations of the structured DDI-L input metadata, ML models and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metrics' values, which enable model tuning as well as transparent management of data and experiments.
date: 2021-11-30
date_type: published
official_url: https://doi.org/10.5281/zenodo.5742916
oa_status: green
full_text_type: pub
language: eng
primo: open
primo_central: open_green
verified: verified_manual
elements_id: 1948286
doi: 10.5281/zenodo.5742916
lyricists_name: Johnson, Jon
lyricists_id: JDJOH43
actors_name: Johnson, Jon
actors_id: JDJOH43
actors_role: owner
full_text_status: public
pres_type: presentation
event_title: European DDI Conference
event_location: Paris, France
event_dates: 30 November - 01 December 2021
citation: De, Suparna; Moss, Harry; Jabbari, Sanaz; Johnson, Jon; Periera, Haeron; Li, Jennie (2021) Engineering a Machine Learning Pipeline for Automating Metadata Extraction from Longitudinal Survey Questionnaires. Presented at: European DDI Conference, Paris, France. Green open access
document_url: https://discovery.ucl.ac.uk/id/eprint/10146592/1/SDe%20EDDI%20Enhancing%20Metadata.pdf