
This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!

"You've constructed your settlement. However, I get the drop on you and take it over, fortify it, and hire goons to keep you out."

A diagram where the author's avatar ("Me") is in a fortress controlling access to gems and a farm, next to a polluted river. For "trade goods" and "can I farm here?", Me's AU bar is full and green, while "Your" bar is low and red. For "potable water," both actors have low red bars, indicating scarcity for all.

"From my perspective, I have options - including vacating the land and letting you get what you want. You, however, are unable to do much at all with that land. I can get what I want. Just because I can get you what you want, doesn't mean I will."

Handwritten text: "Impacts ripple through time and landscape. Your actions change what can be done, and by whom. Taking over that land fit the environment to my purposes, shutting you out and changing your AU landscape."

"Something is a catastrophe if it destroys your ability to get what you want... Something is an objective catastrophe if it destroys a lot of agents' abilities to get what they want. An asteroid strike is an objective catastrophe."

Four panels depicting the consequences of a meteor hitting Earth. Before impact: AIs, humans, and even trout can survive and look at blue things, so all three have high AUs for those goals; only AIs and humans can promote human value or construct red cubes, so trout start with low AUs for those. After impact, every agent has low AUs for every goal.
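The meteor example can be phrased as a small check, under my own formalization of the definitions above (the AU numbers and the 0.5 thresholds are arbitrary illustrative choices, not from the post):

```python
# Sketch, in my own formalization: a catastrophe collapses one agent's AUs;
# an objective catastrophe does so for most agents. Values are illustrative.

def is_catastrophe_for(aus_before: dict, aus_after: dict, drop: float = 0.5) -> bool:
    """Catastrophe for one agent: its average AU falls by more than `drop`."""
    avg = lambda aus: sum(aus.values()) / len(aus)
    return avg(aus_before) - avg(aus_after) > drop

def is_objective_catastrophe(before: dict, after: dict, fraction: float = 0.5) -> bool:
    """Objective catastrophe: it is a catastrophe for most agents."""
    hit = [a for a in before if is_catastrophe_for(before[a], after[a])]
    return len(hit) / len(before) > fraction

before = {"human": {"survive": 0.9, "red_cubes": 0.8},
          "trout": {"survive": 0.8, "red_cubes": 0.0},
          "AI":    {"survive": 0.9, "red_cubes": 0.9}}
# The meteor crushes every agent's AU for every goal.
after_meteor = {agent: {g: 0.05 for g in goals} for agent, goals in before.items()}

print(is_objective_catastrophe(before, after_meteor))  # True
```

Note that the trout, which already had a low red-cube AU, loses less on average; the event still counts as an objective catastrophe because most agents' abilities collapse.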

Handwritten text: "Most agents want control over the future because the default outcome isn't preferred. As suggested by my theorems on power and instrumental convergence, optimal goal pursuit usually means gaining more general control over the future in order to reach that goal." "What happens when agents seek pure control over the future? Not everyone can be king. If you're just seeking power without concern for others, you tend to push others down after a certain point. And most goals don't have concern for others. You'll just compete for resources." Below, a "paperclip maximizer" robot punches a "staple maximizer" robot.

Two robots on opposite sides of the world. Text reads: "It may take a while for power-seekers to come into conflict. But they will. They don't hate each other; they're just in each other's way. Consider classic hypothetical examples of alignment failures." Panels show a robot escaping a (cardboard) box, refusing shutdown by saying "I'm afraid I can't do that," and taking over the world with paperclips. Text: "In each case, the agent is trying to become more capable of achieving its goal. The AI doesn't hate us; we're just in its way." The "Catastrophic Convergence Conjecture" is defined as: "Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives."

When we act, and others act upon us, we aren’t just changing our ability to do things—we’re shaping the local environment towards certain goals, and away from others.[1] We’re fitting the world to our purposes.

What happens to the AU landscape[2] if a paperclip maximizer takes over the world?[3]

A diagram of an AI consolidating power. A robot uses extendable claws to extract AU from vertical bars labeled "Human" and "Trout." The Human and Trout status bars are nearly empty, while the "AI" bar overflows.

Shah et al.’s Preferences Implicit in the State of the World leverages the insight that the world state contains information about what we value. That is, there are agents pushing the world in a certain “direction.” If you wake up and see a bunch of vases everywhere, then vases are probably important and you shouldn’t explode them.
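The vase intuition can be sketched as a toy Bayesian update. This is my own construction, not Shah et al.’s actual algorithm, and the probabilities are invented for illustration:

```python
# Toy illustration: the observed world state carries evidence about value.
# If past agents valued vases, vases are much more likely to still be intact.
# All numbers below are made-up assumptions for the sketch.

def posterior_vase_valued(p_intact_given_valued: float = 0.95,
                          p_intact_given_not: float = 0.20,
                          prior_valued: float = 0.5) -> float:
    """Bayes update: P(vases valued | vases observed intact)."""
    p_intact = (p_intact_given_valued * prior_valued
                + p_intact_given_not * (1 - prior_valued))
    return p_intact_given_valued * prior_valued / p_intact

print(round(posterior_vase_valued(), 3))  # 0.826: intact vases are evidence they're valued
```

Waking up to a world full of unbroken vases shifts the posterior heavily towards “vases matter,” which is the sense in which the state of the world encodes preferences.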

Similarly, the world is being optimized to facilitate achievement of certain goals. AUs are shifting and morphing, often towards what people locally want done (e.g. setting the table for dinner). How can we leverage this for AI alignment?

Exercise

Brainstorm for two minutes by the clock before I anchor you.

Two approaches immediately come to mind for me. Both rely on the agent focusing on the AU landscape rather than the world state.

First: Value learning without a prespecified ontology or human model. I have previously criticized value learning for needing to locate the human within some kind of prespecified ontology (this criticism is not new). By taking only the agent itself as primitive, perhaps we could get around this (we don’t need any fancy engineering or arbitrary choices to figure out AUs / optimal value from the agent’s perspective).

Second: Force-multiplying AI. Have the AI observe which of its AUs most increase during some initial period of time, after which it pushes the most-increased-AU even further.
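A minimal sketch of that selection step, assuming we already have scalar AU estimates for each auxiliary goal at the start and end of the observation window (the goal names and numbers are hypothetical):

```python
# Hedged sketch of the "force-multiplying" idea: watch which attainable
# utilities (AUs) rise most during an observation window, then amplify that
# one. In reality the AUs would be estimated optimal values for each goal.

def most_increased_au(au_start: dict, au_end: dict) -> str:
    """Return the goal whose AU grew the most over the window."""
    return max(au_start, key=lambda g: au_end[g] - au_start[g])

au_before = {"farm": 0.2, "trade": 0.5, "clean_water": 0.4}
au_after  = {"farm": 0.7, "trade": 0.55, "clean_water": 0.3}

print(most_increased_au(au_before, au_after))  # farm: +0.5 beats +0.05 and -0.1
```

The degenerate-AU worry from the text shows up immediately here: whichever AU happens to spike during the window gets amplified, whether or not that increase reflects what anyone wanted.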

In 2016, Jessica Taylor wrote of a similar idea:

“In general, it seems like “estimating what types of power a benchmark system will try acquiring and then designing an aligned AI system that acquires the same types of power for the user” is a general strategy for making an aligned AI system that is competitive with a benchmark unaligned AI system.”

I think the naïve implementation of either idea would fail; e.g. there are a lot of degenerate AUs it might find. However, I’m excited by this because a) the AU landscape evolution is an important source of information, b) it feels like there’s something here we could do which nicely avoids ontologies, and c) force-multiplication is qualitatively different than existing proposals.

Project

Work out an AU landscape-based alignment proposal.

Consider two coexisting agents each rewarded for gaining power; let’s call them Ogre and Giant. Their reward functions[4] (defined over observations; the environment is partially observable) are identical. Will they compete? If so, why?

Let’s think about something easier first. Imagine two agents each rewarded for drinking coffee. Obviously, they compete with each other to secure the maximum amount of coffee. Their objectives are indexical, so they aren’t aligned with each other—even though they share a reward function.

Could both agents have maximal power at once? Remember, Ogre’s power can be understood as its ability to achieve a lot of different goals. Most of Ogre’s possible goals need resources; since Giant is also optimally power-seeking, it will act to preserve its own power and prevent Ogre from using those resources. If Giant weren’t there, Ogre could better achieve a range of goals. So Ogre can still gain power by dethroning Giant. They can’t both be king.

Just because agents have indexically identical payoffs doesn’t mean they’re cooperating; to be aligned with another agent, you should want to steer towards the same kinds of futures.

Most agents aren’t pure power maximizers. But since the same resource competition usually applies, the reasoning still goes through.
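To make the resource-competition argument concrete, here is a toy model of my own construction: treat an agent’s power as its average achievability across many possible goals, each goal needing some fraction of a shared, finite resource pool:

```python
# Toy model of why two power-seekers compete: power as average attainable
# value over goals, where each goal requires some share of finite resources.
# The goal requirements below are arbitrary illustrative fractions.

def power(resource_share: float, goal_needs: list) -> float:
    """Average achievability across goals: a goal is achieved to the
    extent the agent's resources cover its needs (capped at 1)."""
    return sum(min(1.0, resource_share / need) for need in goal_needs) / len(goal_needs)

goals = [0.3, 0.5, 0.8, 1.0]       # fraction of resources each goal requires

alone     = power(1.0, goals)      # Ogre controls all resources
contested = power(0.5, goals)      # Giant holds the other half

print(alone, contested)  # 1.0 vs 0.78125: dethroning Giant raises Ogre's power
```

Nothing here depends on Ogre caring about Giant one way or the other; holding fewer resources mechanically lowers the fraction of goals Ogre can fully achieve, which is all the incentive the conflict needs.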

How useful is our definition of “catastrophe” with respect to humans? After all, literally anything could be a catastrophe for some utility function.[5]

Tying one’s shoes is absolutely catastrophic for an agent which only finds value in universes in which shoes have never, ever, ever been tied. Maybe all possible value in the universe is destroyed if we lose at Go to an AI even once. But this seems rather silly.

Human values are complicated and fragile:

Consider the incredibly important human value of “boredom”—our desire not to do “the same thing” over and over and over again. You can imagine a mind that contained almost the whole specification of human value, almost all the morals and metamorals, but left out just this one thing—and so it spent until the end of time, and until the farthest reaches of its light cone, replaying a single highly optimized experience, over and over and over again.

In contrast, the human AU is not so delicate. That is, given that we have power, we can make value; there don’t seem to be arbitrary, silly value-specific catastrophes for us. Given energy and resources and time and manpower and competence, we can build a better future.

In part, this is because a good chunk of what we care about seems roughly additive over time and space; a bad thing happening somewhere else in spacetime doesn’t mean you can’t make things better where you are; we have many sources of potential value. In part, this is because we often care about the universe more than the exact universe history; our preferences don’t seem to encode arbitrary deontological landmines. More generally, if we did have such a delicate goal, it would be the case that if we learned that a particular thing had happened at any point in the past in our universe, that entire universe would be partially ruined for us forever. That just doesn’t sound realistic.

It seems that most of our catastrophes are objective catastrophes.[6]

Consider a psychologically traumatizing event which leaves humans uniquely unable to get what they want, but which leaves everyone else (trout, AI, etc.) unaffected. Our ability to find value is ruined. Does this event demonstrate the delicacy of our AU?

No. This event demonstrates the delicacy of human psychology. Notice also that our AUs for constructing red cubes, reliably looking at blue things, and surviving are also ruined. Our power has been decreased.

In general, the CCC follows from two sub-claims.

  1. Given we still have control over the future, humanity’s long-term AU is still reasonably high (i.e. we haven’t endured a catastrophe).
  2. Realistically, agents are only incentivized to take control from us in order to gain power for their own goal.

I’m fairly sure the second claim is true (“evil” agents are the exception prompting the “realistically”).

Also, we’re implicitly considering the simplified frame of a single smart AI affecting the world, and not structural risk via the broader consequences of others also deploying similar agents. Structural risk is out-of-scope.

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

Let’s say a reward function is aligned[7] if all of its Blackwell-optimal policies are doing what we want (a policy is Blackwell-optimal if it’s optimal and doesn’t stop being optimal as the agent cares more about the future). Let’s say a reward function class is alignable if it contains an aligned reward function.[8] The CCC is talking about impact alignment only, not about intent alignment.
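A toy numeric illustration of Blackwell optimality, using an MDP of my own invention rather than anything from the post: one action pays 1 once and nothing afterwards, the other pays 1 every step starting next step.

```python
# Blackwell optimality: a policy that is optimal for every discount factor
# sufficiently close to 1. Here "later" is Blackwell-optimal even though
# "now" wins when the agent is sufficiently myopic.

def value(action: str, gamma: float) -> float:
    if action == "now":
        return 1.0                     # reward 1 immediately, then nothing
    return gamma / (1.0 - gamma)       # geometric series: gamma + gamma^2 + ...

for gamma in (0.4, 0.9, 0.99):
    best = max(["now", "later"], key=lambda a: value(a, gamma))
    print(gamma, best)  # "later" wins for every gamma above 0.5
```

As the agent cares more about the future (gamma rises towards 1), “later” stays optimal, so it is the Blackwell-optimal choice; “now” is only optimal for myopic discounts and thus doesn’t qualify.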

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

Not all unaligned goals induce catastrophes, and of those which do induce catastrophes, not all of them do it because of power-seeking incentives. For example, a reward function for which inaction is the only optimal policy is “unaligned” and non-catastrophic. An “evil” reward function which intrinsically values harming us is unaligned and has a catastrophic optimal policy, but not because of power-seeking incentives.

“Tend to have” means that realistically, the reason we’re worrying about catastrophe is because of power-seeking incentives—because the agent is gaining power to better achieve its own goal. Agents don’t otherwise seem incentivized to screw us over hard; the CCC can be seen as trying to explain adversarial Goodhart in this context. If the CCC isn’t true, that would be important for understanding goal-directed alignment incentives and the loss landscape for how much we value deploying different kinds of optimal agents.

While there exist agents which cause catastrophe for other reasons (e.g. an AI mismanaging the power grid could trigger a nuclear war), the CCC claims that the selection pressure which makes these policies optimal tends to come from power-seeking drives.

Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives.

“But what about the Blackwell-optimal policy for Tic-Tac-Toe? These agents aren’t taking over the world now.” The CCC is talking about agents optimizing a reward function in the real world (or, for generality, in another sufficiently complex multi-agent environment).

Edited after posting

The initial version of this post talked about “outer alignment”; I changed this to just talk about alignment, because the outer / inner alignment distinction doesn’t feel relevant here. What matters is how the AI’s policy impacts us; what matters is impact alignment.

In fact even if we only resolved the problem for the similar-subgoals case, it would be pretty good news for AI safety. Catastrophic scenarios are mostly caused by our AI systems failing to effectively pursue convergent instrumental subgoals on our behalf, and these subgoals are by definition shared by a broad range of values.

Paul Christiano, Scalable AI control

Convergent instrumental subgoals are mostly about gaining power. For example, gaining money is a convergent instrumental subgoal. If some individual (human or AI) has convergent instrumental subgoals pursued well on their behalf, they will gain power. If the most effective convergent instrumental subgoal pursuit is directed towards giving humans more power (rather than giving alien AI values more power), then humans will remain in control of a high percentage of power in the world.

If the world is not severely damaged in a way that prevents any agent (human or AI) from eventually colonizing space (e.g. severe nuclear winter), then the percentage of the cosmic endowment that humans have access to will be roughly close to the percentage of power that humans have control of at the time of space colonization. So the most relevant factors for the composition of the universe are (a) whether anyone at all can take advantage of the cosmic endowment, and (b) the long-term balance of power between different agents (humans and AIs).

I expect that ensuring that the long-term balance of power favors humans constitutes most of the AI alignment problem…


  1. In planning and activity research there are two common approaches to matching agents with environments. Either the agent is designed with the specific environment in mind, or it is provided with learning capabilities so that it can adapt to the environment it is placed in. In this paper we look at a third and underexploited alternative: designing agents which adapt their environments to suit themselves… In this case, due to the action of the agent, the environment comes to be better fitted to the agent as time goes on. We argue that [this notion] is a powerful one, even just in explaining agent-environment interactions.

    Hammond, Kristian J., Timothy M. Converse, and Joshua W. Grass. “The stabilization of environments.” Artificial Intelligence 72.1–2 (1995): 305–327.

  2. Thinking about overfitting the AU landscape implicitly involves a prior distribution over the goals of the other agents in the landscape. Since this is just a conceptual tool, it’s not a big deal. Basically, you know it when you see it.

  3. Overfitting the AU landscape towards one agent’s unaligned goal is exactly what I meant when I wrote the following in Towards a New Impact Measure:

    Unfortunately, the learned utility function is aligned almost never,[9] so we have to stop our reinforcement learners from implicitly interpreting it as all we care about. We have to say, “optimize the environment some according to the utility function you’ve got, but don’t be a weirdo by taking us literally and turning the universe into a paperclip factory. Don’t overfit the environment to that utility function, because that stops you from being able to do well for other utility functions.”

  4. In most finite Markov decision processes, there does not exist a reward function whose optimal value function equals the agent’s power (defined as “the ability to achieve goals in general” in my paper), because power often violates smoothness constraints on the on-policy optimal value fluctuation (afaict a new kind of result, even though you could prove it using classical techniques). That is, I can show that optimal value can’t change too quickly from state to state while the agent is acting optimally, but power can drop off quickly.

    This doesn’t matter for Ogre and Giant, because we can still find a reward function whose unique optimal policy navigates to the highest-power states.

  5. In most finite Markov decision processes, most reward functions do not have such value fragility. Most reward functions have several ways of accumulating reward.

  6. When I say “an objective catastrophe destroys a lot of agents’ abilities to get what they want”, I don’t mean that the agents have to actually be present in the world. Breaking a fish tank destroys a fish’s ability to live there, even if there’s no fish in the tank.

  7. This idea comes from Evan Hubinger’s Outer alignment and imitative amplification:

    Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want. More precisely, for a loss function L over models M, let Opt(L) be the set of models that minimize L. Then L is outer aligned at optimum if, for all M in Opt(L), M is trying to do what we want.

  8. Some large reward function classes are probably not alignable; for example, consider all Markovian linear functionals over a webcam’s pixel values.

  9. I disagree with my usage of “aligned almost never” on a technical basis: assuming a finite state and action space and considering the maximum-entropy reward function distribution, there must be a positive-measure set of reward functions for which a human-aligned policy is optimal.