UCL Discovery

The value of books in the age of generative AI training data

Rowberry, Simon; (2025) The value of books in the age of generative AI training data. Convergence: The International Journal of Research into New Media Technologies. DOI: 10.1177/13548565251358020. (In press). Green open access

rowberry-2025-the-value-of-books-in-the-age-of-generative-ai-training-data.pdf - Published Version


Abstract

Controversies around AI companies’ use of pirated book collections, including Books3 and Library Genesis, to train Large Language Models (LLMs) have led to increased scrutiny of books as AI training data. In this article, I contextualize these controversies in relation to the perceived value of books as a training data source compared to other textual data sources. Books are a liminal source for LLMs: they provide edited and curated long-form content while simultaneously presenting substantial legal risks and not aligning with the most popular genres of writing produced by Generative AI services. I propose using the technical concept of ‘epochs’ in machine learning as a proxy for the perceived value of a data source, and using this metric to understand how AI companies value books in the training mix.

Type: Article
Title: The value of books in the age of generative AI training data
Open access status: An open access version is available from UCL Discovery
DOI: 10.1177/13548565251358020
Publisher version: https://doi.org/10.1177/13548565251358020
Language: English
Additional information: This work is licensed under a Creative Commons License. The images or other third-party material in this article are included in the Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
Keywords: Generative AI, digital publishing, Large Language Models, value of books, training data
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL SLASH
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities
UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > Dept of Information Studies
URI: https://discovery.ucl.ac.uk/id/eprint/10210660
Downloads since deposit: 32
