UCL Discovery

Chaining Value Functions for Off-Policy Learning

Schmitt, Simon; Shawe-Taylor, John; Van Hasselt, Hado; (2022) Chaining Value Functions for Off-Policy Learning. In: Sycara, Katia, (ed.) Proceedings of the AAAI 2022 Conference: The 36th AAAI Conference on Artificial Intelligence. (pp. 8187-8195). Association for the Advancement of Artificial Intelligence (AAAI): Virtual conference.

Text: Chaining value functions.pdf - Published Version (2MB)

Abstract

To accumulate knowledge and improve its policy of behaviour, a reinforcement learning agent can learn ‘off-policy’ about policies that differ from the policy used to generate its experience. This is important to learn counterfactuals, or because the experience was generated outside of its own control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms which are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable and hence the complete algorithm is guaranteed to be stable. Under mild conditions this comes arbitrarily close to the off-policy TD solution when we increase the length of the chain. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore it can be interpreted as estimating a novel objective – that we call a ‘k-step expedition’ – of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically we evaluate the idea on challenging MDPs such as Baird’s counterexample and observe favourable results.
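The chained construction described in the abstract can be sketched compactly for the tabular, expected-update case. The sketch below is an illustration under assumptions, not the authors' implementation: the function names, the tabular MDP interface (transition tensor P[s, a, s'], reward matrix r[s, a], policy matrices behaviour[s, a] and target[s, a]) and all settings are hypothetical.

```python
import numpy as np

def policy_backup(pi, P, r, gamma, v):
    """One expected Bellman backup under policy pi:
    (T_pi v)(s) = sum_a pi(a|s) * ( r(s,a) + gamma * sum_s' P(s'|s,a) * v(s') ).
    Shapes: pi[s, a], P[s, a, s'], r[s, a], v[s']."""
    expected_reward = np.einsum('sa,sa->s', pi, r)
    expected_next_value = np.einsum('sa,saz,z->s', pi, P, v)
    return expected_reward + gamma * expected_next_value

def chained_values(P, r, behaviour, target, gamma=0.9, k=3, eval_sweeps=1000):
    """Build the chain V_0, ..., V_k sketched in the abstract.

    V_0 is the on-policy value of the behaviour policy; each subsequent link
    V_j bootstraps a single target-policy backup on the fixed V_{j-1}, so V_k
    estimates the 'k-step expedition': follow the target policy for k steps,
    then the behaviour policy indefinitely."""
    n_states = P.shape[0]
    v = np.zeros(n_states)

    # Step 0: on-policy evaluation of the behaviour policy via repeated
    # expected backups (stable because it is purely on-policy).
    for _ in range(eval_sweeps):
        v = policy_backup(behaviour, P, r, gamma, v)
    chain = [v]

    # Steps 1..k: each new link bootstraps on the previous, already-fixed
    # link, so every individual step is stable and hence so is the chain.
    for _ in range(k):
        chain.append(policy_backup(target, P, r, gamma, chain[-1]))
    return chain
```

In this reading, every link after V_0 is an ordinary backup onto a fixed earlier estimate, which is why each step, and therefore the complete chain, remains stable; per the abstract, under mild conditions the final link approaches the off-policy TD solution as the chain length k grows.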

Type: Proceedings paper
Title: Chaining Value Functions for Off-Policy Learning
Event: AAAI 2022 Conference: The 36th AAAI Conference on Artificial Intelligence
Location: Virtual conference (online)
Dates: 22 Feb 2022 - 1 Mar 2022
ISBN-13: 9781577358763
Open access status: An open access version is available from UCL Discovery
Publisher version: https://aaai-2022.virtualchair.net/poster_aaai3834
Language: English
Additional information: This version is the author accepted manuscript. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10166269
