Transcription enhancement of a digitised multi-lingual pamphlet collection: a case study and guide for similar projects

Watson, AM; Szubarczyk, ES; Salinger, PS; Howe, AJ; Wright, S; Freedman, VR; (2021) Transcription enhancement of a digitised multi-lingual pamphlet collection: a case study and guide for similar projects. UCL Library Services: London, UK. Green open access

Abstract

UCL Library Services holds an extensive collection of over 9,000 Jewish pamphlets, many of them extremely rare. Over the past five years, UCL has embarked on a project to widen access to this collection through an extensive programme of cataloguing, conservation and digitisation. With the cataloguing complete and the most fragile items conserved, the focus is now on making these texts available to global audiences via the UCL Digital Collections website. The pamphlets were ranked for rarity, significance and fragility, and the highest-scoring were selected for digitisation. Unique identifiers allocated at the point of cataloguing were used to track individual pamphlets through the stages of the project. This guide details the text-enhancement methods used, highlighting particular issues relating to Hebrew scripts and early-printed texts. Initial attempts to enable images of these pamphlets to be searched digitally relied on the Optical Character Recognition (OCR) embedded within the software used to create the PDF files. Whilst this was satisfactory for texts chiefly in Roman script, it provided no reliable means to search the extensive corpus of texts in Hebrew. Generous advice offered by the National Library of Israel led to our adoption of ABBYY FineReader software as a means of enhancing the transcriptions embedded within the PDF files. Following image capture, JPEG files were used to create multi-page PDF files of each pamphlet. Pre-processing in ABBYY FineReader consisted of: setting the language and colour mode; detecting page orientation; selecting and refining areas of the text to be read; and reading the text to produce a transcription. The resultant files were stored in folders according to language of text. The software highlighted spelling errors and doubtful readings. A verification tool allowed transcribers to correct these as required.
However, some erroneous or doubtful readings were nevertheless genuine words and so were not highlighted; it was therefore essential to proofread the text, particularly for early-printed scripts. Transcribers maintained logs of common errors; additionally, problems with Hebrew vocalisations, cursive and Gothic scripts were noted. During initial quality checks of the transcriptions, many text searches were unsuccessful due to previously unidentified spaces occurring within words. This was generally linked to the font size being too small. Maintaining logs of font sizes used led to the adoption of a minimum of Arial 8 or Times New Roman 10 in transcribed text. The methodology was revised to include the preliminary quality-checking of one page. We concluded that it was difficult to develop a standardised procedure applicable to all texts given the variance in language, script and typography. We did find, however, that the font Arial gave the highest accuracy ratings for Hebrew script, with a minimum text size of 17 and a minimum title size of 25. ABBYY file preparation took a minimum of 1.5 hours per pamphlet; transcription correction took an average of 10.4 minutes per page; the final quality check took 30 minutes per pamphlet. On average, each pamphlet took at least 6 hours to complete. As a result of the project, average accuracy ratings improved from 60% to 89%, the greatest improvement being for pre-1800 and Hebrew script publications. We are therefore inclined to focus future transcription-enhancement activity on these types of publication for the remainder of our Jewish Pamphlet Collections.
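
For readers planning a similar project, the timing figures quoted in the abstract can be combined into a rough per-pamphlet estimate. The sketch below is illustrative only and is not taken from the report: the function names are hypothetical, and it assumes that the three stages simply add up, with correction time scaling linearly with page count.

```python
# Figures quoted in the abstract, converted to minutes.
PREP_MIN = 1.5 * 60              # ABBYY file preparation (minimum, per pamphlet)
CORRECTION_MIN_PER_PAGE = 10.4   # transcription correction (average, per page)
FINAL_CHECK_MIN = 30             # final quality check (per pamphlet)


def estimate_hours(pages: int) -> float:
    """Rough effort in hours for one pamphlet (hypothetical helper).

    Assumes the stages are additive and correction time is linear in pages.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    total_min = PREP_MIN + CORRECTION_MIN_PER_PAGE * pages + FINAL_CHECK_MIN
    return total_min / 60


def pages_for_hours(hours: float) -> float:
    """Invert the estimate: page count implied by a given total in hours."""
    return (hours * 60 - PREP_MIN - FINAL_CHECK_MIN) / CORRECTION_MIN_PER_PAGE


if __name__ == "__main__":
    print(f"20-page pamphlet: about {estimate_hours(20):.1f} hours")
    print(f"6 hours implies about {pages_for_hours(6):.0f} pages")
```

Under these assumptions, the 6-hour figure would correspond to a pamphlet of roughly 23 pages; the report itself should be consulted for the actual page counts.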

Type: Report
Title: Transcription enhancement of a digitised multi-lingual pamphlet collection: a case study and guide for similar projects
Open access status: An open access version is available from UCL Discovery
Publisher version: https://www.ucl.ac.uk/library/digital-collections/...
Language: English
UCL classification: UCL
UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > VP: Research
UCL > Provost and Vice Provost Offices > VP: Research > Library Services
URI: https://discovery.ucl.ac.uk/id/eprint/10121599