PAQ: 65 million probably-asked questions and what you can do with them

Lewis, P; Wu, Y; Liu, L; Minervini, P; Küttler, H; Piktus, A; Stenetorp, P; (2021) PAQ: 65 million probably-asked questions and what you can do with them. Transactions of the Association for Computational Linguistics, 9 pp. 1098-1115. 10.1162/tacl_a_00415. Green open access

tacl_a_00415.pdf - Published Version

Abstract

Open-domain Question Answering models that directly leverage question-answer (QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show promise in terms of speed and memory compared with conventional models which retrieve and read from text corpora. QA-pair retrievers also offer interpretable answers, a high degree of control, and are trivial to update at test time with new knowledge. However, these models fall short of the accuracy of retrieve-and-read systems, as substantially less knowledge is covered by the available QA-pairs relative to text corpora like Wikipedia. To facilitate improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very large resource of 65M automatically generated QA-pairs. We introduce a new QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and caches test questions, enabling RePAQ to match the accuracy of recent retrieve-and-read models, whilst being significantly faster. Using PAQ, we train CBQA models which outperform comparable baselines by 5%, but trail RePAQ by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be configured for size (under 500MB) or speed (over 1K questions per second) while retaining high accuracy. Lastly, we demonstrate RePAQ’s strength at selective QA, abstaining from answering when it is likely to be incorrect. This enables RePAQ to “back-off” to a more expensive state-of-the-art model, leading to a combined system which is both more accurate and 2x faster than the state-of-the-art model alone.
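
The selective QA and back-off idea described in the abstract can be illustrated with a minimal sketch. This is not RePAQ: the token-overlap (Jaccard) similarity, the 0.6 threshold, and the names `retrieve`, `answer_with_backoff`, and `slow_model` are all assumptions made for illustration. The sketch only shows the control flow: look up the closest stored question, return its answer when the match is confident, and otherwise abstain and defer to a more expensive model.

```python
from typing import Callable, List, Optional, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def tokenize(text: str) -> set:
    """Lowercase and split on whitespace; a stand-in for a real encoder."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a: set, b: set) -> float:
    """Token-overlap similarity in [0, 1] (an assumption, not RePAQ's scorer)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(question: str, qa_pairs: List[QAPair]) -> Tuple[Optional[str], float]:
    """Return the answer of the most similar stored question and its score."""
    query = tokenize(question)
    best_answer, best_score = None, 0.0
    for stored_question, stored_answer in qa_pairs:
        score = jaccard(query, tokenize(stored_question))
        if score > best_score:
            best_answer, best_score = stored_answer, score
    return best_answer, best_score


def answer_with_backoff(
    question: str,
    qa_pairs: List[QAPair],
    fallback: Callable[[str], str],
    threshold: float = 0.6,  # hypothetical confidence cut-off
) -> Tuple[str, str]:
    """Answer from the QA-pair store if confident, else defer to `fallback`."""
    answer, score = retrieve(question, qa_pairs)
    if answer is not None and score >= threshold:
        return answer, "qa-pair retriever"
    # Abstain: the cheap retriever is likely wrong, so call the slower model.
    return fallback(question), "fallback"


# Toy QA-pair store; the real PAQ resource holds 65M generated pairs.
qa_pairs: List[QAPair] = [
    ("who wrote on the origin of species", "Charles Darwin"),
    ("what is the capital of france", "Paris"),
]


def slow_model(question: str) -> str:
    """Placeholder for an expensive retrieve-and-read model."""
    return "<answer from expensive model>"


print(answer_with_backoff("What is the capital of France?", qa_pairs, slow_model))
print(answer_with_backoff("Who painted the Mona Lisa?", qa_pairs, slow_model))
```

Because the store is a plain list of pairs, adding knowledge at test time amounts to appending a new `(question, answer)` tuple, which mirrors the abstract's point that QA-pair retrievers are trivial to update.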

Type: Article
Title: PAQ: 65 million probably-asked questions and what you can do with them
Open access status: An open access version is available from UCL Discovery
DOI: 10.1162/tacl_a_00415
Publisher version: https://doi.org/10.1162/tacl_a_00415
Language: English
Additional information: © 2021 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10140063