Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

Kuba, JG; Chen, R; Wen, M; Wen, Y; Sun, F; Wang, J; Yang, Y; (2022) Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning. In: ICLR 2022 - 10th International Conference on Learning Representations. (p. 1046). The International Conference on Learning Representations (ICLR): Virtual. Green open access.

Abstract

Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement may not simply apply; this is because agents, even in cooperative games, could have conflicting directions of policy updates. As a result, achieving a guaranteed improvement on the joint policy where each agent acts individually remains an open challenge. In this paper, we extend the theory of trust region learning to cooperative MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO do not need agents to share parameters, nor do they need any restrictive assumptions on decomposability of the joint value function. Most importantly, we justify in theory the monotonic improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, thereby establishing a new state of the art.
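The abstract names the multi-agent advantage decomposition lemma without stating it. As commonly stated, the lemma says that the joint advantage of a group of agents equals the sum of each agent's advantage conditioned on the actions of the agents that committed before it, for any ordering of the agents. The sketch below is an illustration only and is not taken from this record: it checks the identity numerically at a single state, using a random tabular Q-function and random independent policies. The helper names (marginal_q, advantage, random_policy) and the toy sizes are hypothetical.

```python
import itertools
import random


def marginal_q(q, policies, fixed):
    """Expected joint Q at one state when the agents in `fixed` play the given
    actions and every other agent samples from its own policy."""
    n = len(policies)
    free = [i for i in range(n) if i not in fixed]
    total = 0.0
    for combo in itertools.product(*(range(len(policies[i])) for i in free)):
        joint = [0] * n
        prob = 1.0
        for i, a in fixed.items():
            joint[i] = a
        for i, a in zip(free, combo):
            joint[i] = a
            prob *= policies[i][a]
        total += prob * q[tuple(joint)]
    return total


def advantage(q, policies, given, extra):
    """Multi-agent advantage: the gain from the agents in `extra` committing to
    their actions, given that the agents in `given` have already committed."""
    return marginal_q(q, policies, {**given, **extra}) - marginal_q(q, policies, given)


def random_policy(rng, n_actions):
    """A random categorical distribution over n_actions actions."""
    weights = [rng.random() + 0.1 for _ in range(n_actions)]
    norm = sum(weights)
    return [w / norm for w in weights]


# Toy setup (assumed for illustration): three agents with 2, 3 and 2 actions.
rng = random.Random(0)
n_actions = [2, 3, 2]
policies = [random_policy(rng, k) for k in n_actions]
q = {a: rng.uniform(-1, 1) for a in itertools.product(*(range(k) for k in n_actions))}

order = [2, 0, 1]                  # an arbitrary ordering of the agents
joint_action = {0: 1, 1: 2, 2: 0}  # agent index -> chosen action

# Left-hand side: advantage of the full joint action.
lhs = advantage(q, policies, {}, joint_action)

# Right-hand side: per-agent advantages accumulated along the ordering.
rhs = 0.0
given = {}
for agent in order:
    rhs += advantage(q, policies, given, {agent: joint_action[agent]})
    given[agent] = joint_action[agent]

print(abs(lhs - rhs) < 1e-9)
```

The sum telescopes, so the identity needs no assumption about how the joint value function factorises across agents. This is consistent with the abstract's claim that no decomposability assumption is required, and it is the property that makes updating agents one at a time, as in the sequential policy update scheme, a natural fit.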

Type:	Proceedings paper
Title:	Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning
Event:	ICLR 2022 - 10th International Conference on Learning Representations
Open access status:	An open access version is available from UCL Discovery
Publisher version:	https://iclr.cc/Conferences/2022
Language:	English
Additional information:	This version is the version of record. For information on re-use, please refer to the publisher’s terms and conditions.
UCL classification:	UCL; UCL > Provost and Vice Provost Offices > UCL BEAMS; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI:	https://discovery.ucl.ac.uk/id/eprint/10167464
