UCL Discovery

Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM)

Hauser, J; Kondor, D; Reddish, J; Benam, M; Cioni, E; Villa, F; Bennett, JS; ... Maria del Rio-Chanona, R (2024) Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM). In: Advances in Neural Information Processing Systems 37. (pp. 32336-32369). Neural Information Processing Systems Foundation, Inc. (NeurIPS): Vancouver, Canada. Green open access

Text: NeurIPS-2024-large-language-models-expert-level-global-history-knowledge-benchmark-hist-llm-Paper-Datasets_and_Benchmarks_Track.pdf - Published Version (730kB)

Abstract

Large Language Models (LLMs) have the potential to transform humanities and social science research, yet their knowledge and comprehension of history at a graduate level remain untested. Benchmarking LLMs in history is particularly challenging, given that human knowledge of history is inherently unbalanced, with more information available on Western history and recent periods. We introduce the History Seshat Test for LLMs (HiST-LLM), based on a subset of the Seshat Global History Databank, which provides a structured representation of human historical knowledge, containing 36,000 data points across 600 historical societies and over 2,700 scholarly references. This dataset covers every major world region from the Neolithic period to the Industrial Revolution and includes information reviewed and assembled by history experts and graduate research assistants. Using this dataset, we benchmark a total of seven models from the Gemini, OpenAI, and Llama families. We find that, in a four-choice format, LLMs have a balanced accuracy ranging from 33.6% (Llama-3.1-8B) to 46% (GPT-4-Turbo), outperforming random guessing (25%) but falling short of expert comprehension. LLMs perform better on earlier historical periods. Regionally, performance is more even, though the more advanced models still score highest for the Americas and lowest for Oceania and Sub-Saharan Africa. Our benchmark shows that while LLMs possess some expert-level historical knowledge, there is considerable room for improvement.
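The abstract reports results as balanced accuracy, which weights each answer class equally rather than each question, so a model cannot inflate its score by favouring over-represented options. The sketch below is a minimal illustration of that standard metric (mean of per-class recall) for a four-choice setting; the function name and the toy data are illustrative, not taken from the HiST-LLM evaluation code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each answer class contributes
    equally, regardless of how often it appears in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # indices of questions whose correct answer is class c
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy four-choice example (labels A-D are hypothetical):
truth = ["A", "A", "B", "B", "C", "D"]
preds = ["A", "B", "B", "B", "C", "A"]
print(balanced_accuracy(truth, preds))  # average of recalls 0.5, 1.0, 1.0, 0.0
```

Under this metric a model that guesses uniformly at random among four options has an expected score of 25%, which is the chance baseline the abstract compares against.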

Type: Proceedings paper
Title: Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM)
Event: Advances in Neural Information Processing Systems 37
Dates: 10 Dec 2024 - 15 Dec 2024
Open access status: An open access version is available from UCL Discovery
DOI: 10.52202/079017-1016
Publisher version: https://doi.org/10.52202/079017-1016
Language: English
Additional information: This version is the version of record. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10216741
