Why I’m excited about MARL


I’m excited to be participating in the 2020 cohort of the OpenAI Scholars program. I’ll be spending the next few months studying multi-agent reinforcement learning (MARL), and periodically writing a series of posts to document my progress. In this first post, I’ll discuss the reasons I’m excited about MARL and my plan for the Scholars program.

What is MARL?

Reinforcement learning (RL) is a class of algorithms for optimizing the behavior of an agent interacting with an environment in response to external rewards. RL is partially inspired by theories of animal learning from psychology, taking the term “reinforcement” from Pavlov’s 1927 work on classical conditioning (Sutton and Barto 2018).
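To make the agent/environment/reward loop concrete, here is a minimal sketch. Everything in it is an illustrative assumption rather than something from a particular library: the `Corridor` environment, the `train_q_learning` helper, and the choice of tabular Q-learning as the learning rule are all stand-ins picked for brevity.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clip the position.
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0  # reward only for reaching the goal
        return self.pos, reward, done


def train_q_learning(env, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative choice)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                # Greedy action, breaking ties at random.
                best = max(q[state])
                action = rng.choice([a for a in range(2) if q[state][a] == best])
            next_state, reward, done = env.step(action)
            # Move the estimate toward reward plus discounted future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train_q_learning(Corridor())
# Greedy action in each non-terminal state (1 means "move right").
greedy_policy = [max(range(2), key=lambda a: q_table[s][a]) for s in range(4)]
```

The agent only ever sees observations (its position) and a scalar reward, yet the greedy policy it ends up with heads for the goal from every non-terminal state.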

Training agents to perform well at video games (and computerized board games) is a prototypical use-case for reinforcement learning, since the action space (controls, possible moves) and observation space (pixels on screen, state of the board) are well defined, and the score (points in the game, or victory/loss) provides a clear reward signal. Progress in RL has been marked by success in harder and harder games: backgammon (Tesauro 1995); various Atari games (Mnih et al. 2013); chess, shogi, and Go (Silver et al. 2017); Dota 2 (Berner et al. 2019); and StarCraft II (Vinyals et al. 2019).

Multi-agent reinforcement learning (MARL) is the extension of RL to scenarios with multiple interacting agents. MARL is naturally important for applications like self-driving cars, where agents can only succeed by accounting for the behavior of other agents (Reddy et al. 2018). Research in MARL includes efforts to foster collaborative problem solving, to improve and understand inter-agent communication, and to characterize emergent social phenomena.

In general, RL agents learn by incrementally improving a strategy for attaining a high reward. For this learning to be effective, the reward signal must distinguish between beneficial and detrimental changes to the agent’s behavior. In competitive multi-agent environments like games, agents struggle to learn effective strategies against much stronger opponents (improvements to their behavior still lead to losses). More broadly, training an agent to complete a difficult task generally requires constructing either a sequence (curriculum) of tasks that allows the agent to incrementally build expertise, or a very informative reward signal.

Many successful applications of RL to competitive two-player games have accomplished this with a technique called self-play. With self-play, agents are trained to compete against copies of themselves – so they always have an opponent of a comparable skill level. Learning is effective because the differences between good and bad variations of their strategies are reflected in the reward signal. This remains true as agents’ strategies grow in sophistication. Agents trained by self-play can out-compete human experts in complex strategic games including Go (Silver et al. 2017) and Dota 2 (Berner et al. 2019).
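As a rough illustration of self-play, the sketch below trains a single value table that plays both sides of a tiny subtraction game: players alternate taking 1 to 3 stones, and whoever takes the last stone wins. The game, the `self_play_train` helper, and the one-step negamax-style update are all assumptions chosen for illustration. They are far simpler than the methods used in the cited Go and Dota 2 work.

```python
import random


def legal_moves(stones):
    """A player may take 1, 2, or 3 stones, but no more than remain."""
    return [a for a in (1, 2, 3) if a <= stones]


def self_play_train(start=10, episodes=5000, alpha=0.5, epsilon=0.3, seed=0):
    """Learn move values by playing the game against a copy of the same policy."""
    rng = random.Random(seed)
    # q[(stones, action)] estimates the outcome for the player about to move:
    # +1 for an eventual win, -1 for an eventual loss.
    q = {(s, a): 0.0 for s in range(1, start + 1) for a in legal_moves(s)}

    def best_value(stones):
        return max(q[(stones, a)] for a in legal_moves(stones))

    for _ in range(episodes):
        stones = start
        while stones > 0:
            moves = legal_moves(stones)
            if rng.random() < epsilon:
                action = rng.choice(moves)  # explore
            else:
                action = max(moves, key=lambda a: q[(stones, a)])  # exploit
            remaining = stones - action
            # Taking the last stone wins. Otherwise the opponent (the same
            # table) moves next, so our value is the negation of theirs.
            target = 1.0 if remaining == 0 else -best_value(remaining)
            q[(stones, action)] += alpha * (target - q[(stones, action)])
            stones = remaining  # the "other" player now moves with the same table
    return q


def greedy_move(q, stones):
    return max(legal_moves(stones), key=lambda a: q[(stones, a)])


q = self_play_train()
learned = {s: greedy_move(q, s) for s in (5, 6, 7, 9, 10)}
```

Because the opponent is always the current version of the learner, every improvement on one side immediately becomes a harder opponent for the other. In this game the learned greedy moves leave the opponent with a multiple of four stones, which is the known winning strategy.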

Recent results show that training large and diverse populations of agents in varying scenarios can help individual agents achieve higher performance. Wang et al. (2019) show that this technique – along with procedures for generating curricula of increasingly difficult tasks and transferring successful individual agents between tasks – can encourage individual agents to develop sophisticated and robust strategies and overcome challenges that may be otherwise intractable. Vinyals et al. (2019) extend self-play to diverse populations of StarCraft agents by training them (1v1) in leagues of opponents with varying skill levels and strategy types. This diversity forces agents to learn robust strategies that are not over-fit to weaknesses of particular opponents.

With more than two agents in a shared environment, increasingly complex cooperative and competitive interactions form an automatic curriculum of challenges (Leibo et al. 2019) that can induce a rich variety of emergent phenomena including communication (Jaques et al. 2019) and tool use (Baker et al. 2019).

However, MARL presents challenges that are not present in typical RL problems. Because other agents are constantly learning and adapting, the learning environment in MARL changes continually. RL algorithms with convergence guarantees in the single-agent setting do not necessarily converge or even stabilize when there are multiple agents interacting and learning simultaneously (Balduzzi et al. 2018, Mazumdar et al. 2019). I’ll discuss some of the ways standard RL agent architectures are modified for multi-agent scenarios in a future post.
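A small numerical sketch can make this failure mode concrete. It assumes a deliberately simple stand-in for two learning agents: a zero-sum game with payoff x * y, where one player adjusts x to increase the payoff and the other adjusts y to decrease it, both by plain gradient steps. The game and the `simultaneous_gradient_steps` helper are illustrative assumptions, not taken from the cited papers.

```python
import math


def simultaneous_gradient_steps(x, y, lr=0.1, steps=100):
    """Both players update at once on the zero-sum game with payoff x * y.

    Player 1 controls x and ascends x * y; player 2 controls y and ascends -x * y.
    Returns the distance from the equilibrium (0, 0) at the start and after each step.
    """
    distances = [math.hypot(x, y)]
    for _ in range(steps):
        grad_x = y    # d(x * y) / dx
        grad_y = -x   # d(-x * y) / dy
        # Each player follows its own gradient, treating the other as fixed.
        x, y = x + lr * grad_x, y + lr * grad_y
        distances.append(math.hypot(x, y))
    return distances


distances = simultaneous_gradient_steps(x=1.0, y=0.0)
```

Each player's update would be sensible if the other player stood still. Because both move at once, every step multiplies the distance from the equilibrium by sqrt(1 + lr^2), so the joint strategy spirals outward instead of settling.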

Why I am excited about MARL

Human intelligence is very social

The fact that I have some direct introspective access to the stuff in my head makes it easy to attribute my thoughts and actions to internal cognitive processes. This probably causes me to overestimate the importance of this internal stuff (compared to external influences) in shaping my thoughts and behavior. If I were presented with a novel task that I needed to work through “from scratch”, I would lean heavily on concepts refined over thousands of years of human cultural/social development.

Humans are really smart, even compared to closely related animals like chimps. In the first few million years after their last common ancestor, proto-humans developed a slew of distinct characteristics including larger brains and a notable aptitude for social learning (Henrich 2015).

But anatomically modern humans invented language just 50k years ago – recently enough that biological evolution hasn’t had the time to change our basic cognitive capabilities. This suggests that the remarkable progress humanity has made since then has been due to social/cultural development and accumulation of knowledge rather than improvements to the human brain.

The environment inhabited by modern humans – in the RL sense – is mainly made up of entities that arose through these processes of socio-cultural evolution (companies, countries). I know how to function in society, but I would die pretty quickly if I were transported to the kind of environment that shaped human biological evolution.

I’m excited about MARL as a way to empirically study the emergence of the sort of social phenomena that gave rise to the complexity of modern human culture, and as a way to build agents that can make use of cultural knowledge in a human-like way.

MARL might be a useful tool for understanding how AI will impact society

Reinforcement learning-based automation is poised to be deployed for many practical uses, including stock trading, corporate decision making, robotics, and self-driving. MARL is likely to be important for enabling self-driving cars and household robots to coordinate with humans, and might be a useful tool for training individual agents that are capable enough to succeed at complex real-world tasks. But regardless of how central MARL is in developing such systems, any future that is filled with RL agents could benefit from the ability to understand how those agents may interact.

By analogy to the emergence of entities like companies and countries in large groups of humans, we might expect the interactions of these agents to give rise to emergent behavior. Just as collections of interacting humans pose risks that individual humans do not, the group behavior of these RL agents may be undesirable.

Some potential concerns are pretty straightforward (“are the stock trading bots colluding?”), but the stark difference in the objectives and capabilities between individuals and countries suggests that characterizing emergent social behavior might be really important. As AI agents fill more roles in society, MARL might provide us with invaluable tools for understanding how inadvertent collectives of agents emerge/behave, and the impact they could have on society at large.

Scholars plan

The OpenAI Scholars program lasts for 4 months. The program is divided in half, into a learning portion and a project portion. During the learning portion, I’ll work roughly in parallel on digesting the MARL literature and developing some simple multi-agent experiments.

First, I’ll study and write about the broad state of MARL research (~3 weeks) while I work on the general infrastructure for running experiments. Then I’ll narrow the scope of my reading to focus on communication in MARL while adding inter-agent communication to the experiment scaffolding (~3 weeks). Building on all of this, I plan to use information theoretic tools from the MARL literature to empirically investigate inter-agent communication in this toy system.
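One example of the kind of information theoretic quantity I have in mind is the mutual information between the message one agent sends and the action another agent takes. The sketch below is a plug-in estimate from logged samples. The `mutual_information_bits` helper and the (message, action) log format are assumptions for illustration, and whether this particular measure ends up in the project is still open.

```python
import math
from collections import Counter


def mutual_information_bits(pairs):
    """Plug-in estimate of I(message; action) in bits from (message, action) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    messages = Counter(m for m, _ in pairs)
    actions = Counter(a for _, a in pairs)
    mi = 0.0
    for (m, a), count in joint.items():
        p_joint = count / n
        # p(m, a) / (p(m) * p(a)) written in terms of raw counts.
        ratio = count * n / (messages[m] * actions[a])
        mi += p_joint * math.log2(ratio)
    return mi


# Listener copies a fair binary message: the action reveals the message fully.
copied = [(0, 0), (1, 1)] * 50
# Listener ignores the message: the action carries no information about it.
ignored = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

mi_copied = mutual_information_bits(copied)
mi_ignored = mutual_information_bits(ignored)
```

A value near zero suggests the listener is ignoring the channel, while a value near the entropy of the message suggests its actions depend strongly on what was said. Plug-in estimates like this are biased upward on small samples, so they are best read comparatively.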


Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd Edition. MIT Press, 2018.

Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM. 38 (3), 1995.

Volodymyr Mnih et al. Playing Atari with Deep Reinforcement Learning. arXiv preprint arXiv:1312.5602, 2013.

David Silver et al. Mastering the Game of Go without Human Knowledge. Nature 550, 354-359, 2017.

David Silver et al. Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv preprint arXiv:1712.01815, 2017.

Oriol Vinyals et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350-354, 2019.

Ilge Akkaya et al. Solving Rubik’s Cube with a Robot Hand. arXiv preprint arXiv:1910.07113, 2019.

Siddharth Reddy et al. Shared Autonomy via Deep Reinforcement Learning. arXiv preprint arXiv:1802.01744, 2018.

Christopher Berner et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv preprint arXiv:1912.06680, 2019.

Rui Wang et al. Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions. arXiv preprint arXiv:1901.01753, 2019.

Natasha Jaques et al. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning. arXiv preprint arXiv:1810.08647, 2019.

Bowen Baker et al. Emergent Tool Use From Multi-Agent Autocurricula. arXiv preprint arXiv:1909.07528, 2019.

David Balduzzi et al. The Mechanics of n-Player Differentiable Games. arXiv preprint arXiv:1802.05642, 2018.

Eric Mazumdar et al. Policy-Gradient Algorithms Have No Guarantees of Convergence in Linear Quadratic Games. arXiv preprint arXiv:1907.03712, 2019.

Joel Z. Leibo et al. Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research. arXiv preprint arXiv:1903.00742, 2019.

Joseph Henrich. The Secret of Our Success: How Culture Is Driving Human Evolution, Domesticating Our Species, and Making Us Smarter, 2015.