I spent some time recently exploring reinforcement learning in the excellent MineRL Minecraft environments. I haven’t played much Minecraft, and I haven’t personally accomplished the holy grail objective of mining a diamond. The prospect of building a bot that can learn to accomplish a task that I haven’t completed – one that is as human-accessible as this – is incredibly exciting!
There are a bunch of factors that make the MineRL environments interesting and challenging:
- need to learn from pixels
- need to coordinate actions on short and long time scales
- mixed discrete/continuous action and observation spaces
- very sparse rewards
MineRL also provides many hours of expert data – recorded trajectories of human players accomplishing a variety of in-game tasks. Rewards in the MineRL environments are very sparse; in most variants, agents reap the first reward after successfully chopping a tree. This is very unlikely to happen if the agent’s randomly mashing buttons, which makes the expert data particularly valuable.
I’ve mostly been focusing on RL algorithms that fit the full MineRL environments – in particular, those that work pretty naturally with both discrete and continuous action spaces, can learn from expert demonstrations, and can cope with sparse rewards. Actor-critic algorithms fit the bill pretty well, and soft actor-critic in particular is promising thanks to excellent demonstrated sample efficiency (even while learning directly from pixels).
These algorithms are tricky to implement properly, and their performance can be quite sensitive to hyperparameter values. Furthermore, the extreme reward sparsity makes it very difficult to distinguish between a bug-laden algorithm and one that is correct but poorly tuned: either way, the reward will be zero for a long time.
So I instead began by implementing algorithms in the much simpler and more forgiving Roboschool environments. I implemented the agents with a modular architecture: the networks for encoding observations and emitting actions are inferred from the structure of the environment, but the core of each learning algorithm is not environment-specific. This let me validate agent architectures in Roboschool before moving them to MineRL.
Nevertheless, this often left me watching “validated” agents hop randomly across the map, wondering whether they’d be capable of achieving the basic tree-chop task. A few hours into the N-th fruitless training run, I decided to put myself in the agent’s shoes and actually play the game for a bit. Hoping to build some empathy for the difficulty of the task, I approached a tree and tried to punch out the wood. Surprisingly, this took at least a second of continuous “attack” actions. If I let up on the mouse for an instant, the tree would remain intact.
Agents in off-policy RL algorithms like soft actor-critic choose actions by sampling from distributions. Up to this point, I had been assuming that the agent could succeed in simple tasks by choosing actions independently at each step. If I were presented with a bunch of early-game Minecraft frames out of order and told to choose actions that would lead to tree chops, I’d choose an appropriate action > 90% of the time. But this isn’t nearly good enough: if the agent chose the ‘attack’ action independently with 90% probability at each frame, on average it would take about 7.5 seconds before the agent would successfully chop some wood (assuming this requires 1 second of uninterrupted attacking at 30fps, i.e. 30 consecutive ‘attack’ frames).
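The waiting-time estimate can be checked with the standard formula for the expected number of independent Bernoulli trials needed to see a run of n consecutive successes, E[T] = (1 - p^n) / ((1 - p) p^n). The sketch below assumes, as in the text, that each frame's action is sampled independently and that a chop needs 30 uninterrupted ‘attack’ frames; the function and constant names are just for illustration.

```python
def expected_frames_until_run(p: float, run_length: int) -> float:
    """Expected number of independent frames until `run_length` consecutive successes.

    Uses the closed form for Bernoulli trials with success probability p:
        E[T] = (1 - p**n) / ((1 - p) * p**n)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    q = p ** run_length  # probability that one specific window is all successes
    return (1.0 - q) / ((1.0 - p) * q)


FPS = 30         # frames per second, as assumed in the text
RUN_LENGTH = 30  # one second of uninterrupted attacking

frames = expected_frames_until_run(0.9, RUN_LENGTH)
seconds = frames / FPS
print(f"{frames:.0f} frames, about {seconds:.1f} s")
```

With p = 0.9 this comes out to roughly 7.5 seconds. Plugging in p = 0.99 gives a bit over one second, which shows how strongly the wait depends on per-frame consistency rather than on choosing the right action most of the time.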
There are a few potential approaches to stabilizing actions:
- Shape the observation space by including past observations at every step, so that policies have direct access to some of the environment dynamics
- Shape the action space by de-bouncing jitter in discrete actions
- Add explicit temporal regularization to the policy loss
  - “Observe and Look Further…”, one of the DeepMind Montezuma’s Revenge papers, suggests a “temporal consistency” (TC) loss that penalizes producing different actions at consecutive steps.
- Use auto-regressive policies to give agents information about past actions
- Use fully recurrent networks
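To make the de-bouncing option concrete, here is a minimal sketch of one way to filter jitter out of a stream of discrete actions. It is an assumption about what such a filter could look like, not the code used for these experiments: the `debounce` function and its `min_hold` parameter are hypothetical names. A proposed change is only accepted after it has been requested for `min_hold` consecutive steps.

```python
from typing import Iterable, List


def debounce(actions: Iterable[int], min_hold: int) -> List[int]:
    """Suppress short-lived flips in a stream of discrete actions.

    A change in the raw action is only accepted once the new value has been
    proposed for `min_hold` consecutive steps; until then the previously
    emitted value is held.
    """
    if min_hold < 1:
        raise ValueError("min_hold must be at least 1")
    out: List[int] = []
    current = None    # value currently being emitted
    candidate = None  # most recent proposed value that differs from `current`
    streak = 0        # consecutive steps for which `candidate` has been proposed
    for a in actions:
        if current is None:
            current = a                  # first step: accept whatever is proposed
        elif a == current:
            candidate, streak = None, 0  # the flip did not persist, so forget it
        else:
            if a == candidate:
                streak += 1
            else:
                candidate, streak = a, 1
            if streak >= min_hold:
                current = a              # the change persisted long enough
                candidate, streak = None, 0
        out.append(current)
    return out


# 1 = attack, 0 = no attack; the isolated 0 at index 2 is treated as jitter.
raw = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
smoothed = debounce(raw, min_hold=3)
print(smoothed)
```

The trade-off is latency: the filter delays every intentional change by `min_hold - 1` steps, so it suits actions like ‘attack’ that only pay off when held.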
As soon as I added a TC loss to the policy, the agent started to (occasionally) successfully chop trees! On top of that, using LSTM-based policies instead of feedforward networks begat further improvements, but in the interest of managing complexity and stability I’ve mostly been experimenting with LSTMs in significantly simpler architectures (behavior cloning, advantage-weighted regression, etc).
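For illustration, a penalty of the kind described above (discouraging different actions at consecutive steps) could be sketched as follows. This is a simplified stand-in under stated assumptions, not the formulation from the paper and not necessarily the one used here: it takes the policy's discrete action probabilities along one trajectory and averages the squared change between consecutive steps. The function names and the `tc_weight` coefficient are hypothetical, and in a real agent the same computation would live inside an autodiff framework.

```python
from typing import Sequence


def temporal_consistency_penalty(action_probs: Sequence[Sequence[float]]) -> float:
    """Mean squared change in action probabilities between consecutive steps.

    action_probs[t][a] is the policy's probability of discrete action `a` at
    step t of one trajectory. The penalty is zero when the policy's output is
    identical at every step and grows as consecutive outputs diverge.
    """
    if len(action_probs) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(action_probs[:-1], action_probs[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(action_probs) - 1)


def regularized_policy_loss(policy_loss: float,
                            action_probs: Sequence[Sequence[float]],
                            tc_weight: float = 0.1) -> float:
    """Add the weighted penalty to an already computed policy loss."""
    return policy_loss + tc_weight * temporal_consistency_penalty(action_probs)


steady = [[0.9, 0.1]] * 4                            # same distribution every step
jittery = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]       # flips at every step
print(temporal_consistency_penalty(steady), temporal_consistency_penalty(jittery))
```

The weight controls a real tension: too small and the policy still flickers, too large and the agent becomes reluctant to change actions when the situation calls for it.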
To facilitate experiments with recurrent policies, I implemented a fancy trajectory replay buffer that
- can store and update hidden states,
- can easily sample minibatches of arbitrary-length (obs, act, rew, done, hidden) sequences,
- stores all the data on-disk, in surprisingly efficient memory-mapped numpy arrays.
My implementation of the “RecurrentReplayBuffer”, along with some other utilities that were helpful in managing the complex hierarchical MineRL action/observation spaces, is available on GitHub.
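To give a feel for the idea, here is a toy sketch of an on-disk sequence buffer built on `numpy.memmap`. It is not the code from the repository: the class name `MemmapSequenceBuffer`, its methods, the flat vector shapes and the single `float32` dtype are all assumptions made to keep the example short. It also ignores episode boundaries and the write cursor when sampling, which a real buffer would need to handle.

```python
import os
import tempfile

import numpy as np


class MemmapSequenceBuffer:
    """Toy on-disk ring buffer for (obs, act, rew, done, hidden) transitions."""

    FIELDS = ("obs", "act", "rew", "done", "hidden")

    def __init__(self, directory, capacity, obs_dim, act_dim, hidden_dim):
        self.capacity = capacity
        self.size = 0    # number of valid transitions stored
        self.cursor = 0  # index where the next transition is written
        shapes = {
            "obs": (capacity, obs_dim),
            "act": (capacity, act_dim),
            "rew": (capacity,),
            "done": (capacity,),
            "hidden": (capacity, hidden_dim),
        }
        # One memory-mapped file per field; the OS pages data in and out on demand.
        self.data = {
            name: np.memmap(os.path.join(directory, f"{name}.dat"),
                            dtype=np.float32, mode="w+", shape=shapes[name])
            for name in self.FIELDS
        }

    def add(self, obs, act, rew, done, hidden):
        """Write one transition and return the index it was stored at."""
        i = self.cursor
        for name, value in zip(self.FIELDS, (obs, act, rew, done, hidden)):
            self.data[name][i] = value
        self.cursor = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
        return i

    def update_hidden(self, index, hidden):
        """Overwrite a stored hidden state, e.g. after recomputing it."""
        self.data["hidden"][index] = hidden

    def sample(self, batch_size, seq_len, rng):
        """Sample contiguous sequences; each field has shape (batch, seq_len, ...)."""
        if seq_len > self.size:
            raise ValueError("not enough data for the requested sequence length")
        starts = rng.integers(0, self.size - seq_len + 1, size=batch_size)
        idx = starts[:, None] + np.arange(seq_len)[None, :]
        # Fancy indexing copies the selected rows from disk into memory.
        return {name: np.asarray(arr[idx]) for name, arr in self.data.items()}


with tempfile.TemporaryDirectory() as tmp:
    buf = MemmapSequenceBuffer(tmp, capacity=100, obs_dim=4, act_dim=2, hidden_dim=8)
    for t in range(50):
        buf.add(np.full(4, t), np.zeros(2), 0.0, 0.0, np.zeros(8))
    buf.update_hidden(0, np.ones(8))
    batch = buf.sample(batch_size=3, seq_len=10, rng=np.random.default_rng(0))
    first_hidden = np.array(buf.data["hidden"][0])
    del buf  # release the memory maps before the directory is removed
    print(batch["obs"].shape)
```

Storing hidden states next to the transitions lets sampled sequences start from a stored recurrent state instead of zeros, and `update_hidden` allows stale states to be refreshed as the network changes.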
I’ll update this post as I continue cleaning and refactoring an unseemly mess of private code.