Rowberry, Simon (2025) The value of books in the age of generative AI training data. Convergence: The International Journal of Research into New Media Technologies. DOI: 10.1177/13548565251358020. (In press).
rowberry-2025-the-value-of-books-in-the-age-of-generative-ai-training-data.pdf - Published Version (736kB)
Abstract
Controversies around AI companies’ use of pirated book collections, including Books3 and Library Genesis, to train Large Language Models (LLMs) have led to increased scrutiny of books as AI training data. In this article, I contextualize these controversies in relation to the perceived value of books as a training data source compared to other textual data sources. Books are a liminal source for LLMs: they provide edited and curated long-form content, yet they present substantial legal risks and do not align with the most popular genres of writing produced by Generative AI services. I propose using the technical concept of ‘epochs’ in machine learning as a proxy for the perceived value of a data source, and I use this metric to understand how AI companies value books in the training mix.
Type: | Article |
---|---|
Title: | The value of books in the age of generative AI training data |
Open access status: | An open access version is available from UCL Discovery |
DOI: | 10.1177/13548565251358020 |
Publisher version: | https://doi.org/10.1177/13548565251358020 |
Language: | English |
Additional information: | This work is licensed under a Creative Commons License. The images or other third-party material in this article are included in the Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ |
Keywords: | Generative AI, digital publishing, Large Language Models, value of books, training data |
UCL classification: | UCL; UCL > Provost and Vice Provost Offices > UCL SLASH; UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities; UCL > Provost and Vice Provost Offices > UCL SLASH > Faculty of Arts and Humanities > Dept of Information Studies |
URI: | https://discovery.ucl.ac.uk/id/eprint/10210660 |