UCL Discovery

Reinforcement Learning in Persistent Environments: Representation Learning and Transfer

Borsa, Diana; (2020) Reinforcement Learning in Persistent Environments: Representation Learning and Transfer. Doctoral thesis (Ph.D), UCL (University College London). Green open access

Text: Borsa_10094113_thesis.pdf (9MB)

Abstract

Reinforcement learning (RL) provides a general framework for modelling and reasoning about agents capable of sequential decision making, with the goal of maximising a reward signal. In this work, we focus on the study of situated agents designed to learn autonomously through direct interaction with their environment, under limited or sparse feedback. We consider an agent in a persistent environment. The dynamics of this ‘world’ do not change over time, much like the laws of physics, and the agent would need to learn to master a potentially vast set of tasks in this environment. To efficiently tackle learning in multiple tasks, with the ultimate goal of scaling to a life-long learning agent, we turn our attention to transfer learning. The main insight behind this paradigm is that generalisation may occur not only within tasks, but also across them. The objective of transfer in RL is to accelerate learning by building and reusing knowledge obtained in previously encountered tasks. This knowledge can be in the form of samples, value functions, policies, shared features or other abstractions of the environment or behaviour. In this thesis, we examine different ways of learning transferable representations for value functions. We start by considering jointly learning value functions across multiple reward signals. We explore doing this by leveraging known multitask techniques to learn a shared set of features that cater to the intermediate solutions of popular iterative dynamic learning processes – like value and policy iteration. This learnt representation evolves as the individual value functions improve. At the end of this process, we obtain a shared basis for (near) optimal value functions. We show that this process benefits the learning of good policies for the tasks considered in this joint learning. This class of algorithms is potentially very general, but somewhat agnostic to the persistent environment assumption.
Thus we turn to ways of building this shared basis by leveraging more explicitly the rich structure induced by this assumption. This leads to various extensions of least-squares policy iteration methods to the multitask scenario, under shared dynamics. Here we leverage transfer of samples and multitask regression to further improve sample efficiency in building these shared representations, capturing commonalities across optimal value functions. The second part of the thesis introduces a different way of representing knowledge via successor features. In contrast to the representations learnt in the first part, these are policy dependent and serve as a basis for policy evaluations, rather than directly building optimal value functions. As such, the way to transfer knowledge to a new task changes as well. We do this by first relating the new task to previously learnt ones. In particular, we try to approximate the new reward signal as a linear combination of previous ones. Under this approximation, we can obtain approximate evaluations of the quality of previously learnt policies on the new task. This enables us to carry over knowledge about good or bad behaviour across tasks and strictly improve on previous behaviours. Here the transfer leverages the structure in policy space, with the potential of re-using partial solutions learnt in previous tasks. We show empirically that this leads to a scalable, online algorithm that can successfully re-use the common structure, if present, between a set of training tasks and a new one. Finally, we show that if one has further knowledge about the reward structure an agent would encounter, one can leverage this to learn very effectively, in an off-policy and off-task manner, a parameterised collection of successor features. These correspond to multiple (near) optimal policies for tasks hypothesized by the agent.
This not only makes very efficient use of the data but also proposes a parametric solution to the behaviour basis problem; namely, which policies one should learn to enable transfer.
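
The transfer step described in the second part of the abstract (expressing a new reward as a linear combination of previous ones, re-evaluating previously learnt policies under it, and then improving on them) can be illustrated with a small sketch. This is not the thesis's algorithm or code. It is a toy example under stated assumptions: a hypothetical 3-state, 2-action deterministic environment, two made-up reward tables R1 and R2, two fixed policies, and combination weights W that are assumed to be known exactly rather than fitted. It relies only on the fact that, under shared dynamics, policy evaluation is linear in the reward.

```python
# Toy illustration only: every name, number and table here is hypothetical.
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2

# Shared ("persistent") deterministic dynamics: NEXT_STATE[s][a] -> next state.
NEXT_STATE = [[1, 2], [2, 0], [0, 1]]

# Two previously seen tasks, given as reward tables indexed [state][action].
R1 = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
R2 = [[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]

# Two previously learnt deterministic policies: POLICIES[j][s] -> action.
POLICIES = [[0, 1, 0], [1, 0, 1]]

# Assumed weights relating the new task to the old ones: r_new = 1.0*R1 + 0.5*R2.
W = [1.0, 0.5]


def evaluate(policy, reward, sweeps=500):
    """Iterative policy evaluation: returns Q[s][a] of `policy` under `reward`."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(sweeps):
        q = [[reward[s][a]
              + GAMMA * q[NEXT_STATE[s][a]][policy[NEXT_STATE[s][a]]]
              for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
    return q


def combine(weights, tables):
    """Weighted sum of [state][action] tables."""
    return [[sum(w * t[s][a] for w, t in zip(weights, tables))
             for a in range(N_ACTIONS)]
            for s in range(N_STATES)]


# Values of each old policy on each old task (what is carried across tasks).
q_old = [[evaluate(pi, r) for r in (R1, R2)] for pi in POLICIES]

# Evaluation of each old policy on the new task, obtained without re-solving:
# linearity in the reward gives Q_new^{pi_j} = sum_i W[i] * Q_i^{pi_j}.
q_new = [combine(W, q_old[j]) for j in range(len(POLICIES))]

# Improvement step: in each state, pick the action that looks best under
# any of the old policies' evaluations on the new task.
improved_policy = [
    max(range(N_ACTIONS),
        key=lambda a, s=s: max(q[s][a] for q in q_new))
    for s in range(N_STATES)
]
```

In this sketch the weights are exact, so the combined evaluations match a direct evaluation on the new reward; when the new reward is only approximated by the combination, the evaluations are correspondingly approximate.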

Type: Thesis (Doctoral)
Qualification: Ph.D
Title: Reinforcement Learning in Persistent Environments: Representation Learning and Transfer
Event: UCL (University College London)
Open access status: An open access version is available from UCL Discovery
Language: English
Additional information: Copyright © The Author 2020. Original content in this thesis is licensed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) Licence (https://creativecommons.org/licenses/by/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request.
UCL classification: UCL
UCL > Provost and Vice Provost Offices
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery.ucl.ac.uk/id/eprint/10094113
Downloads since deposit: 68