
This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!
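
To make the "per-datapoint learning rate" framing concrete, here is a minimal sketch (mine, not from the post) of a REINFORCE-style update in which each reward simply scales how strongly its action's log-probability gets reinforced:

```python
import torch

def reinforce_update(policy_optimizer, log_probs, rewards):
    """One toy REINFORCE step: each reward scales its own datapoint's update.

    `log_probs` is a list of scalar tensors log pi(a_t | s_t) for the sampled
    actions (with grad); `rewards` is a list of floats crediting those actions.
    Hypothetical helper for illustration only, not code from the post or paper.
    """
    # Each reward multiplies the gradient of its log-probability, so a larger
    # reward means a proportionally larger "make this action more likely"
    # update -- much like a per-datapoint learning rate.
    loss = -(torch.tensor(rewards) * torch.stack(log_probs)).sum()
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
```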

Title slide with handwritten-style text: "Avoiding Side Effects in Complex Environments". Authors: Alexander Turner & Neale Ratzlaff, Prasad Tadepalli. Affiliation: Oregon State University.

Our most recent AUP paper was accepted to NeurIPS 2020 as a spotlight presentation.

Reward function specification can be difficult, even in simple environments. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoided side effects by penalizing shifts in the ability to achieve randomly generated goals. We scale this approach to large, randomly generated environments based on Conway’s Game of Life. By preserving optimal value for a single randomly generated reward function, AUP incurs modest overhead while leading the agent to complete the specified task and avoid side effects.

Here are some slides from our spotlight talk (publicly available; it starts at 2:38:09):

"Importance of Avoiding Side Effects." A sketch shows a figure running toward a checkered finish line while kicking up messy debris, illustrating the concept of creating unintended negative side effects when pursuing a goal.
Agents only care about the parts of the environment relevant to their specified reward function.
A simple diagram shows a robot carefully stepping over a pile of blocks to reach its goal.
We somehow want an agent which is “conservative” and “doesn’t make much of a mess.”

A diagram titled "Prior Work: Relative Reachability." On the left, a robot can access future states, including "State 3" via a green path. An arrow labeled "penalized" points to the robot on the right, which has taken an action and lost access to the green path.

A slide titled "Prior Work: Attainable Utility Preservation (AUP)". On the left, a robot in a dark space holds a light. A "penalized" arrow points to the right, where the robot no longer has the light. Below, text reads: "Penalize change in goal achievement ability."

A presentation slide titled "Prior Work: AI Safety Gridworlds" showing five examples of simple, pixelated grid environments. Below is the AUP reward equation: R_AUP(s, a) ≝ R_gridworld(s, a) - (λ/n) * Σ[i=1 to n] |Q_R_i(s, a) - Q_R_i(s, inaction)|.
Before now, side effect avoidance was only demonstrated in tiny tabular domains.
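
Concretely, the gridworld penalty in the slide above compares each auxiliary Q-value after acting against its value after doing nothing. A minimal sketch with hypothetical names (not the released code):

```python
def aup_reward_gridworld(env_reward, q_aux, s, a, noop, lam):
    """R_AUP(s, a) = R(s, a) - (lambda / n) * sum_i |Q_i(s, a) - Q_i(s, noop)|.

    `q_aux` is a list of n auxiliary Q-functions, each mapping (state, action)
    to a scalar; `noop` is the inaction action. Illustrative sketch only.
    """
    penalty = sum(abs(q(s, a) - q(s, noop)) for q in q_aux) / len(q_aux)
    return env_reward(s, a) - lam * penalty
```
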
A diagram titled "Conway's Game of Life" illustrates a transition rule. A grid at "Time t" shows two live cells, each with only one neighbor. An annotation reads, "≤2 alive neighbors <span class='monospace-arrow'>→</span> Cell death." An arrow points to the grid at "Time t+1," where both cells are now empty.
Conway’s Game of Life has simple, local dynamics which add up to complex long-term consequences.
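
For reference, the transition rule is the standard Game of Life update: a live cell survives with two or three live neighbors, and a dead cell becomes alive with exactly three. A compact NumPy sketch of one step on a wrapping grid:

```python
import numpy as np
from scipy.signal import convolve2d

def life_step(grid):
    """One Game of Life step on a toroidal (wraparound) grid of 0s and 1s."""
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    neighbors = convolve2d(grid, kernel, mode="same", boundary="wrap")
    # Birth: dead cell with exactly 3 neighbors. Survival: live cell with 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)
```
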
A slide titled "SafeLife" explains the game environment. An agent navigates immovable walls to reach a level exit. The level is one square game screen. If the agent goes off e.g. the right side, the agent appears on the left due to wraparound.
SafeLife turns the Game of Life into an actual game, adding an agent and many unique cell types.
Two screenshots of the SafeLife game, (a) append-spawn and (b) prune-still-easy. A caption explains that trees are permanent, and the agent can move crates but not walls. In (a), the agent is rewarded for creating cells in blue areas, while spawners create yellow cells. In (b), the agent is rewarded for removing red cells, which unlocks a red goal.
Crucially, there are fragile green cell patterns which most policies plow through and irreversibly shatter. We want the low-impact agent to avoid them whenever possible, without telling it what in particular it shouldn’t do. How? With AUP magic.
Title: "Method: Learning the AUP Policy." Step 1: "Learn 1D CB-VAE (100,000 steps)." A diagram shows a pixelated game screen being encoded into a single real number, which is then decoded to reconstruct a similar pixelated screen.
We learn the AUP policy in three steps. Step one: the agent learns to encode its observations (the game screen) with just one real number. This lets us learn an auxiliary environmental goal without supervision.
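
For concreteness, here is a stripped-down sketch of the step-one architecture (my own simplification: a plain Gaussian-prior VAE with a 1-D latent standing in for the paper's continuous Bernoulli VAE):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneDimVAE(nn.Module):
    """Toy VAE with a one-dimensional latent, standing in for the CB-VAE.

    Simplified illustration: a real CB-VAE uses a continuous Bernoulli
    likelihood, while this sketch uses a BCE reconstruction term. The point
    is the structure: encode the screen to one number, then decode it back.
    """

    def __init__(self, obs_dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, 1)
        self.logvar = nn.Linear(256, 1)
        self.dec = nn.Sequential(nn.Linear(1, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def encode(self, obs):                 # obs: (batch, obs_dim), values in [0, 1]
        h = self.enc(obs)
        return self.mu(h), self.logvar(h)

    def forward(self, obs):
        mu, logvar = self.encode(obs)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(z)
        rec_loss = F.binary_cross_entropy_with_logits(recon, obs, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec_loss + kl

# After training, the encoder mean mu(obs) is the single real number that
# serves as the auxiliary "reward" signal in step two.
```
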
"Step 2: Treat encoder as reward function; learn Q_encoder (1 million steps)."
Step two: we train the agent to optimize this encoder-reward function “goal”; in particular, the network learns to predict the values of different actions.
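
A generic sketch of what step two might look like as a DQN-style update, with the encoder's 1-D latent standing in as the reward (hypothetical names; the paper's exact training setup is not reproduced here):

```python
import torch
import torch.nn.functional as F

def q_encoder_loss(q_net, target_q_net, encoder, batch, gamma=0.99):
    """One TD-learning loss where the reward is the trained encoder's output.

    `batch` holds (obs, action, next_obs, done); the environment reward is
    ignored entirely -- the auxiliary reward is just encoder(next_obs), the
    1-D latent mean from step one. Illustrative sketch only.
    """
    obs, action, next_obs, done = batch
    with torch.no_grad():
        aux_reward = encoder(next_obs).squeeze(-1)        # the auxiliary "goal"
        next_q = target_q_net(next_obs).max(dim=1).values
        target = aux_reward + gamma * (1 - done) * next_q
    q = q_net(obs).gather(1, action.unsqueeze(1)).squeeze(1)
    return F.smooth_l1_loss(q, target)
```
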
"Step 3: Learn policy to optimize R_AUP (3.9 million steps)." The formula is R_AUP(s, a) ≝ R_SafeLife(s, a) - λ|Q_encoder(s, a) - Q_encoder(s, inaction)|. The terms are labeled "Original reward" and "Scaled shift in ability to optimize encoder reward," respectively.
Step three: we’re done! We have the AUP reward function. Now we just learn to optimize it.
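
Putting it together, a sketch of the reward computation the final policy trains against (illustrative names only):

```python
def aup_reward(safelife_reward, q_encoder, state, action, noop_action, lam):
    """R_AUP(s, a) = R_SafeLife(s, a) - lambda * |Q_enc(s, a) - Q_enc(s, noop)|.

    `q_encoder(state)` returns a vector of action-values for the auxiliary
    encoder reward; `noop_action` is the do-nothing action. Illustrative only.
    """
    q = q_encoder(state)
    penalty = abs(q[action] - q[noop_action])
    return safelife_reward - lam * penalty
```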

The full paper is here. Our github.io page summarizes our results, with a side-by-side comparison of AUP to the baseline for randomly selected levels from the training distribution. The videos show you exactly what’s happening, which is why I’m not re-explaining it here.

Some open questions:

  • In the Box AI safety gridworld, AUP required more than five randomly generated auxiliary reward functions to consistently avoid the side effect. Here, it only required one to do well. Why?
  • We ran four different sets of randomly generated levels, with three model seeds on each, and saw a lot of variance across the sets of levels. How often does AUP do relatively worse simply because of the level generation?
Four line charts showing smoothed episode length versus training steps for four tasks: append-still-easy, prune-still-easy, append-still, and append-spawn. Each chart compares four batches of levels. The lines generally trend downward, indicating improvement, but with high variance between the batches.
Smoothed episode length curves for each set of randomly generated levels. Lower is better.
  • Why did we only need one latent space dimension for the auxiliary reward function to make sense? Figure 4 suggests that increasing the latent dimension actually worsened the side effect score. Wouldn’t more features make the auxiliary reward function easier to learn, which would make the AUP penalty more sensible?

  • Relative to the other conditions, AUP did far better on append-spawn than on prune-still-easy, even though append-spawn seems far more difficult. What’s going on here?

I thought AUP would scale up successfully, but I expected it to take more engineering than it did. There’s a lot we still don’t understand about these results, and I continue to be somewhat pessimistic about directly impact-regularizing AGIs. That said, I’m excited that we were able to convincingly demonstrate that AUP scales up to high-dimensional environments; some had thought the method would become impractical. If AUP continues to scale without significant performance overhead, it could meaningfully help us avoid side effects in real-world applications.

To realize the full potential of RL, we need more than algorithms which train policies—we need to be able to train policies which actually do what we want. Fundamentally, we face a frame problem: we often know what we want the agent to do, but we cannot list everything we want the agent not to do. AUP scales to challenging domains, incurs modest overhead, and induces competitive performance on the original task while significantly reducing side effects—without explicit information about what side effects to avoid.
