
This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!

Note

Here are the slides for a talk I just gave at CHAI's 2021 workshop.

The first part of my talk summarized my existing results on avoiding negative side effects by making the agent “act conservatively.” The second part showed how this helps facilitate iterated negotiation and increase gains from trade in the multi-stakeholder setting.

A drawing titled "Importance of Avoiding Side Effects" shows a figure running toward a checkered flag. As it runs, it kicks aside delicately arranged blocks, damaging its path and illustrating an agent causing negative side effects while pursuing its goal.
Agents only care about the parts of the environment relevant to their specified reward function.
A green robot tiptoes over a green tower of blocks, taking care to not disrupt its environment. The robot sneaks towards its goal.
We somehow want an agent which is “conservative” and “doesn’t make much of a mess.”
A slide titled "Prior Work: Attainable Utility Preservation (AUP)." A diagram shows a robot with a flashlight. An arrow labeled "penalized" points to the robot without its light in a now completely dark space. The caption reads: "Penalize change in goal achievement ability."
AUP penalizes the agent for changing its ability to achieve a wide range of goals. Even though we can’t specify our “true objective” to the agent, we hope that the agent stays able to do the right thing, as a result of staying able to do many things.
A slide titled "Prior Work: AI Safety Gridworlds" shows five examples of simple grid-based environments. Below is the Attainable Utility Preservation formula: R_AUP(s, a) ≝ R_gridworld(s, a) - (λ/n) * sum over i of |Q_Ri(s, a) - Q_Ri(s, inaction)|.
We first demonstrated that AUP avoids side effects in tiny tabular domains.
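To make the penalty concrete, here is a minimal sketch of that R_AUP computation for a tabular gridworld. It assumes the auxiliary Q-tables have already been learned; the function names and the λ value are illustrative, not taken from the paper.

    import numpy as np

    def aup_reward(state, action, env_reward, aux_q_tables, noop_action, lam=0.1):
        """Sketch of R_AUP(s, a) = R(s, a) - (lambda/n) * sum_i |Q_i(s, a) - Q_i(s, noop)|.

        env_reward:   scalar R_gridworld(s, a) from the environment.
        aux_q_tables: list of n Q-tables (one per auxiliary reward R_i),
                      each indexed as Q[state, action].
        noop_action:  index of the "inaction" action.
        lam:          penalty strength (illustrative value).
        """
        # Mean absolute change in auxiliary action-value, relative to doing nothing.
        penalty = np.mean([abs(Q[state, action] - Q[state, noop_action])
                           for Q in aux_q_tables])
        return env_reward - lam * penalty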
Diagram for Conway's Game of Life showing a rule for cell death. At 'Time t', a grid has two live cells. An arrow notes that having '≤2 alive neighbors' leads to 'Cell death', resulting in an empty grid at 'Time t+1'.
Conway’s Game of Life has simple, local dynamics which add up to complex long-term consequences.
The SafeLife game environment. The game represents icons for the Agent, Immovable wall, and Level exit. The grid contains these elements along with fragile green cell patterns. An arrow indicates the level has wraparound edges.
SafeLife turns the Game of Life into an actual game, adding an agent and many unique cell types. Crucially, there are fragile green cell patterns which most policies plow through and irreversibly shatter. We want the low-impact agent to avoid disrupting the green cell patterns, without telling it directly that that’s what it shouldn’t do. AUP pulls this off.
A slide titled "Method: Learning the AUP Policy" outlining step one: "Learn 1D CB-VAE (100,000 steps)." A diagram shows a pixelated game screen being encoded into a real number and then decoded back into a visually similar screen.
We learn the AUP policy in three steps. Step one: the agent learns to encode its observations (the game screen) with just one real number. This lets us learn an auxiliary environmental goal unsupervised.
"2. Treat encoder as reward function; learn Q_encoder (1 million steps)."
Step two: we train the agent to optimize this encoder-reward function “goal”; in particular, the network learns to predict the values of different actions.
Step 3: Learn policy to optimize R_AUP (3.9 million steps). The AUP reward function is defined as the Original reward, R_SafeLife(s, a), minus a penalty term for the "Scaled shift in ability to optimize encoder reward," which is λ|Q_encoder(s, a) − Q_encoder(s, inaction)|.
Step three: we’re done! We have the AUP reward function, and we train a policy to optimize it.
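For concreteness, here is a sketch of how the step-three reward signal can be computed from the frozen Q_encoder network learned in step two. The network interface and the λ value are assumptions for illustration; they are not the paper’s exact implementation.

    import torch

    def aup_safelife_reward(obs, action, env_reward, q_encoder, noop_action, lam=0.01):
        """Sketch of R_AUP(s, a) = R_SafeLife(s, a) - lam * |Q_encoder(s, a) - Q_encoder(s, inaction)|."""
        with torch.no_grad():
            q_values = q_encoder(obs)  # assumed to return a [num_actions] tensor
        # Penalize shifts in the agent's ability to optimize the learned encoder reward.
        penalty = (q_values[action] - q_values[noop_action]).abs().item()
        return env_reward - lam * penalty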

Summary of results: AUP does well.

I expect AUP to scale further, to high-dimensional embodied tasks such as avoiding making a mess on a factory floor. That said, I expect physically distant side effects to be harder for AUP to detect: in those situations, it’s less likely that the distant effects show up in the agent’s value functions for the auxiliary goals in the penalty term.

I think of AUP as addressing the single-principal (AI designer) / single-agent (AI agent) case. What about the multi-principal / single-agent case?

First, assume one principal derives utility from tea and the other from coffee. Then, a state diagram: the agent's initial state, labeled "TC," is in the center and yields +1 tea and +1 coffee per turn. The agent can make an irreversible choice to move to state "TT" (yielding +2 tea) or state "CC" (yielding +2 coffee), losing the ability to produce the other beverage.

In this setting, negotiated agent policies usually destroy option value.

Why negotiated agent policies destroy option value:

  • Principals share beliefs and a discount rate γ ∈ (0, 1).
  • Harsanyi’s utilitarian theorem implies that Pareto-optimal agent policies optimize the utility function θu☕️ + (1 – θ)u🍵 for some θ ∈ [0, 1].
  • Unless θ = 1/2, the agent destroys option value.

The tea/coffee diagram but with the TC -> CC subgraph in focus. Other actions (e.g. staying at TC) and states (e.g. TT) are grayed out.
Optimal actions when θ > 1/2.

The tea/coffee diagram with the TC -> TT subgraph in focus. The other actions (e.g. staying put at TC) and states (CC) are grayed out.
Optimal actions when θ < 1/2.

The tea/coffee diagram with none of its components grayed out.
Optimal actions when θ = 1/2.
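To make the figures concrete, here is a small value-iteration sketch of the beverage MDP, using the per-turn yields from the diagram. The discount factor and the tie-breaking rule are my own choices for illustration; any γ ∈ (0, 1) gives the same qualitative answer.

    # States: TC (tea + coffee), TT (tea only), CC (coffee only).
    # Per-turn yields from the diagram: TC gives +1 tea and +1 coffee,
    # TT gives +2 tea, CC gives +2 coffee. TT and CC are absorbing.
    YIELDS = {"TC": (1, 1), "TT": (2, 0), "CC": (0, 2)}  # (tea, coffee)
    ACTIONS = {"TC": ["stay", "go_TT", "go_CC"], "TT": ["stay"], "CC": ["stay"]}
    NEXT = {("TC", "stay"): "TC", ("TC", "go_TT"): "TT", ("TC", "go_CC"): "CC",
            ("TT", "stay"): "TT", ("CC", "stay"): "CC"}

    def mixture_reward(state, theta):
        """Harsanyi mixture theta * u_coffee + (1 - theta) * u_tea for one turn."""
        tea, coffee = YIELDS[state]
        return theta * coffee + (1 - theta) * tea

    def optimal_first_action(theta, gamma=0.9, iters=500):
        """Value iteration; returns the agent's best first action from TC."""
        V = {s: 0.0 for s in ACTIONS}
        for _ in range(iters):
            V = {s: max(mixture_reward(s, theta) + gamma * V[NEXT[s, a]]
                        for a in ACTIONS[s])
                 for s in ACTIONS}
        q = {a: mixture_reward("TC", theta) + gamma * V[NEXT["TC", a]]
             for a in ACTIONS["TC"]}
        return max(q, key=q.get)  # ties (theta = 1/2) resolve to "stay"

    for theta in (0.3, 0.5, 0.7):
        print(theta, optimal_first_action(theta))  # go_TT, stay, go_CC

Unless θ is exactly 1/2, the optimal policy commits to TT or CC and destroys the option to produce the other beverage.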

This might be OK if the interaction is one-off: the agent’s production possibilities frontier is fairly limited, and it usually specializes in one beverage or the other.

Interactions are rarely one-off: there are often opportunities for later trades and renegotiations as the principals gain resources or change their minds about what they want.

Concretely, imagine the principals are playing a game of their own.

A diagram of the "Principal Extensive-Form Game," showing a decision tree. From an initial state, there is a probability p that "Principal matcha obtains diamond" and 1 − p that "No diamond is obtained." If matcha obtains the diamond, the outcome is either "Principal coffee has diamond" or "Principal matcha has diamond." Two utility functions are defined:

  • Coffee principal's utility: 1 for coffee, 0 for matcha, 1,000 for a diamond.
  • Matcha principal's utility: 0 for coffee, 1 for matcha, 0 for a diamond.

A slide titled "Solving The Joint Beverage/Gem Game." The principals come to a deal over the joint game: The AI agent stays at TC for the first time step. If Tea receives a gem, then Tea gives the gem to Coffee, and Coffee allows Tea to reprogram the agent to optimize for Tea's utility. If Tea does not receive a gem, then Coffee redirects the agent to optimize for Coffee's utility.
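The slide doesn’t quantify the gains from trade, but a rough back-of-the-envelope calculation illustrates why both principals prefer the deal. The numbers below assume an infinite horizon, a shared discount factor of 0.9, the gem treated as an undiscounted one-time payoff, and a no-deal baseline in which the agent specializes in coffee immediately; all of these are my own modeling choices, not from the slide.

    GAMMA = 0.9  # assumed discount factor shared by the principals

    def stream(per_step, start=0):
        """Discounted value of receiving `per_step` every step from `start` onward."""
        return per_step * GAMMA**start / (1 - GAMMA)

    def no_deal(p):
        """Baseline: the agent specializes in coffee (CC) right away.

        Per the next slide, a locked-in coffee agent gives the tea principal
        no reason to trade away the gem, so the coffee principal never gets it.
        """
        return {"coffee": stream(2), "tea": 0.0}

    def deal(p):
        """The negotiated deal: stay at TC for one step, then renegotiate."""
        coffee, tea = 1.0, 1.0  # step 0: TC yields +1 of each beverage
        # With probability p the tea principal gets the gem, hands it to the
        # coffee principal (worth 1,000 to them), and the agent is reprogrammed
        # to optimize tea (TT) from step 1 onward; otherwise the agent is
        # redirected to coffee (CC).
        coffee += p * 1000 + (1 - p) * stream(2, start=1)
        tea += p * stream(2, start=1)
        return {"coffee": coffee, "tea": tea}

    for p in (0.1, 0.5):
        print(p, no_deal(p), deal(p))

Under these assumptions, the deal Pareto-dominates immediate specialization: the agent’s preserved options are exactly what make the later trade possible.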

A slide titled "Directly Solving The Joint Beverage/Gem Game." Text explains policy-conditioned beliefs: if the agent specializes in coffee, the tea principal won't trade away the gemstone, so the coffee principal loses utility. Directly solving the joint game this way is computationally hard and requires specifying the joint game in detail. Two diagrams illustrate the game: one is the tea/coffee diagram; the other is a decision tree showing the probability of a principal obtaining and then trading a gemstone.

A slide explaining Multi-Principal AUP (MP-AUP), which proposes to "Act 'As If' Renegotiation Will Occur." The MP-AUP reward formula balances optimizing a negotiated mix of utilities (for tea and coffee) with preserving attainable utility for future deals. A state diagram shows an agent can produce a mix of tea and coffee or specialize in one. A second diagram shows principals' circumstances might change, motivating the agent to preserve options for future renegotiation.
MP-AUP is my first stab at solving this problem without modeling the joint game. In this beverage production game, MP-AUP gets the agent to stay put until it is corrected (i.e., the agent is given a new reward function, after which it computes a new policy).

We can motivate the MP-AUP objective with an analogous situation. Imagine the agent starts off uncertain about which objective it should optimize, and reduces that uncertainty over time. This uncertainty is modeled using the “assistance game” framework, of which Cooperative Inverse Reinforcement Learning is one example. (The assistance game paper has yet to be publicly released, but I think it’s quite good!)

The agent has reward uncertainty: the goal is coffee with probability P(u = u☕️) = θ and matcha with probability P(u = u🍵) = 1 − θ. At each time step, the agent learns the true objective with probability p.

Assistance games are a certain kind of partially observable Markov decision process (POMDP), and they’re solved by policies which maximize the agent’s expected true reward. So once the agent is certain of the true objective, it should just optimize that. But what about before then?

An assistance game is solved by optimizing a reward function, R(s), at a discount rate γ' ≝ (1-p)γ. The function equals a term to "Optimize negotiated mixture of utilities," which is θu_coffee(s) + (1-θ)u_matcha(s), plus a term to "Preserve attainable utility for future deals," which is [p/(1-p)] * [θV_u_coffee(s) + (1-θ)V_u_matcha(s)].
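Here is a direct transcription of that reward into code, assuming we already have callables for the two principals’ utility functions and for the optimal value functions under each utility (all names are illustrative):

    def assistance_style_reward(s, theta, p, u_coffee, u_matcha, v_coffee, v_matcha):
        """R(s) = theta*u_coffee(s) + (1 - theta)*u_matcha(s)
                  + (p / (1 - p)) * [theta*V_coffee(s) + (1 - theta)*V_matcha(s)].

        A policy optimizing this reward should be trained with the shrunken
        discount rate gamma_prime = (1 - p) * gamma.
        """
        mixture = theta * u_coffee(s) + (1 - theta) * u_matcha(s)
        preserve = (p / (1 - p)) * (theta * v_coffee(s) + (1 - theta) * v_matcha(s))
        return mixture + preserve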

This is suggestive, but the assumptions don’t perfectly line up with our use case (reward uncertainty isn’t obviously equivalent to optimizing a mixture utility function per Harsanyi). I’m interested in more directly motivating MP-AUP axiomatically, as (approximately) solving a certain class of joint principal/agent games under certain renegotiation assumptions, or (in the negative case) in understanding how it falls short.

A slide titled "Similarities Between Single- and Multi-Principal" compares two approaches:

  • AUP: maintain ability to pursue other goals.
  • MP-AUP: preserve ability to add value for all principals.

The shared justification for both is: "Because agent might later be directed to optimize another objective."

Here are some problems that MP-AUP doesn’t address:

  • Multi-principal / multi-agent: even if agent A can make tea, that doesn’t mean agent A will let agent B make tea.
  • Specifying individual principal objectives.
  • Ensuring that the agent remains corrigible to its principals: if MP-AUP agents remain able to act in the interest of each principal, that means nothing if we can no longer correct the agent so that it actually pursues those interests.

Furthermore, it seems plausible to me that MP-AUP helps pretty well in the multiple-principal / single-agent case, without much more work than normal AUP requires. However, I think there’s a good chance I haven’t thought of some crucial considerations which make it fail or which make it less good. In particular, I haven’t thought much about the principal case.

I’d be excited to see more work on this, but I don’t currently plan to do it myself. I’ve only thought about this idea for <20 hours over the last few weeks, so there is probably a lot of low-hanging fruit and many important questions to ask. AUP and MP-AUP seem to tackle similar problems, in that they both (aim to) incentivize the agent to preserve its ability to change course and pursue a range of different tasks.

Thanks

Thanks to Andrew Critch for prompting me to flesh out this idea.
