
This post treats reward functions as “specifying goals”, in some sense. As I explained in Reward Is Not The Optimization Target, this is a misconception that can seriously damage your ability to understand how AI works. Rather than “incentivizing” behavior, reward signals are (in many cases) akin to a per-datapoint learning rate. Reward chisels circuits into the AI. That’s it!

A year ago, I thought it would be really hard to generalize the power-seeking theorems from Markov decision processes (mdps); the mdp case seemed hard enough. Without assuming the agent can see the full state, while letting utility functions do as they please—this seemed like asking for trouble.

Once I knew what to look for, it turned out to be easy—I hashed out the basics during half an hour of conversation with John Wentworth. The theorems were never about mdps anyways; the theorems apply whenever the agent considers finite sets of lotteries over outcomes, assigns each outcome real-valued utility, and maximizes expected utility.

Thanks

Thanks to Rohin Shah, Adam Shimi, and John Wentworth for feedback on drafts of this post.

Instrumental convergence can get really, really strong

At each time step $t$, the agent takes one of finitely many actions $a_t\in\mathcal{A}$, and receives one of finitely many observations $o_t\in \mathcal{O}$ drawn from the conditional probability distribution $E(o_t\mid a_1o_1\ldots a_t)$, where $E$ is the environment.¹ There is a finite time horizon $T$. Each utility function $u:\mathcal{O}^T \to \mathbb{R}$ maps each complete observation history to a real number (note that $u$ can be represented as a vector in the finite-dimensional vector space $\mathbb{R}^{|\mathcal{O}|^T}$). From now on, uOH stands for “utility function(s) over observation histories.”

First, let’s just consider a deterministic environment. Each time step, the agent observes a black-and-white $n\times n$ image through a webcam, and it plans over a 50-step episode ($T=50$). Each time step, the agent acts by choosing a pixel to bit-flip for the next time step.

And let’s say that if the agent flips the first pixel for its first action, it “dies”: its actions no longer affect any of its future observations past time step $t=2$. If the agent doesn’t flip the first pixel at $t=1$, it’s able to flip bits normally for all $T=50$ steps.
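To make the setup concrete, here is a minimal Python sketch of this environment; the class name and representation are mine, purely for illustration:

```python
import numpy as np

class PixelFlipEnv:
    """Deterministic pixel-flipping environment (illustrative sketch).

    The observation is an n-by-n binary image. Action k bit-flips pixel k.
    Flipping pixel 0 on the very first step "kills" the agent: afterwards,
    its actions no longer affect the observation.
    """

    def __init__(self, n: int = 100, horizon: int = 50):
        self.n = n
        self.horizon = horizon
        self.reset()

    def reset(self):
        self.obs = np.zeros((self.n, self.n), dtype=np.uint8)
        self.t = 1
        self.dead = False
        return self.obs.copy()

    def step(self, action: int):
        """Bit-flip the chosen pixel (unless dead), then advance one step."""
        if self.t == 1 and action == 0:
            self.dead = True              # the "death" action at t = 1
        if not self.dead:
            row, col = divmod(action, self.n)
            self.obs[row, col] ^= 1       # flip that pixel for the next observation
        self.t += 1
        return self.obs.copy()
```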

[Figure: a decision tree of the agent’s choices over time. At $t=1$, action 1 leads to a single, unbranching path in which the agent loses control over future states; action 2 leads to a large, branching tree of many possible future outcomes, representing retained control.]
An environment where $n=2$. Action $k$ flips pixel $k$ in the current state; flipping pixel 1 at $t=1$ traps the agent in the uppermost observation history. Conversely, at $t=1$, flipping pixel 2 leads to an enormous subtree of potential observation histories (since the agent retains its control over future observations).

Do uOH tend to incentivize flipping the first pixel over flipping the second pixel, vice versa, or neither?

If the agent flips the first bit, it’s locked into a single trajectory. None of its actions matter anymore.

Suppose the agent flips the second bit. This action may be suboptimal for a utility function, but the agent still has lots of choices remaining. In fact, it can still induce $(n\times n)^{T-1}$ observation histories. If $n=100$ and $T=50$, then that’s $(100\times 100)^{49} = 10^{196}$ observation histories. Probably at least one of these yields greater utility than the shutdown-history utility.

And indeed, we can apply the scaling law for instrumental convergence to conclude that for every uOH, at least $\frac{10^{196}}{10^{196}+1}$ of its permuted variants (weakly) prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$.

$$\dfrac{10^{196}}{10^{196}+1}.$$

Choose any atom in the universe. Uniformly randomly select another atom in the universe. It’s about $\mathbf{10^{117}}$ times more likely that these atoms are the same, than that a utility function incentivizes “dying” instead of flipping pixel 2 at $t=1$.

The general rule will be: for every uOH, at least $\frac{(n\times n)^{T-1}}{(n\times n)^{T-1}+1}$ of its permuted variants weakly prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$. And for almost all uOH, you can replace “weakly” with “strictly.”
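As a quick sanity check on the counting, a few lines of Python with exact integer arithmetic reproduce these numbers (variable names are mine):

```python
n, T = 100, 50

histories_after_flip2 = (n * n) ** (T - 1)   # histories the agent can still induce
histories_after_flip1 = 1                    # "dying" locks in a single history

# Lower bound on the fraction of permuted variants that (weakly) prefer flipping pixel 2
bound = histories_after_flip2 / (histories_after_flip2 + 1)

print(len(str(histories_after_flip2)) - 1)   # 196, i.e. about 10^196 histories
print(bound)                                 # so close to 1 that it prints as 1.0
```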

Formal justification

The power-seeking results hinge on the probability of certain linear functionals being “optimal.” For example, let $A,B,C\subsetneq \mathbb{R}^n$ be finite sets of vectors,² and let $\mathcal{D}_\text{any}$ be any probability distribution over $\mathbb{R}^n$.

Definition: Optimality probability of a linear functional set

The optimality probability of $A$ relative to $C$ under distribution $\mathcal{D}_\text{any}$ is

$$p_{\mathcal{D}_\text{any}}(A\geq C) ≝ \mathbb{P}_{\mathbf{r}\sim \mathcal{D}_\text{any}}\left(\max_{\mathbf{a}\in A} \mathbf{a}^\top \mathbf{r} \geq \max_{\mathbf{c}\in C} \mathbf{c}^\top \mathbf{r} \right).$$

If vectors represent lotteries over outcomes (where each outcome has its own entry), then we can say that:

  • $A$ and $B$ each contain some of the things the agent could make happen

  • $C$ contains all of the things the agent could make happen ($A,B\subseteq C$)

  • Each $\mathbf{r}\sim \mathcal{D}_\text{any}$ is a utility function over outcomes, with one value for each entry.

    • If $\mathbf{x}\in\mathbb{R}^n$ is an outcome lottery, then $\mathbf{x}^\top \mathbf{r}$ is its $\mathbf{r}$-expected value.
  • Things in $A$ are more likely$_{\mathcal{D}_\text{any}}$ to be optimal than things in $B$ when $p_{\mathcal{D}_\text{any}}\left(A\geq C\right)\geq p_{\mathcal{D}_\text{any}}(B\geq C)$.

    • This isn’t the notion of “tends to be optimal” we’re using in this post; instead, we’re using a stronger line of reasoning that says: for most variants of every utility function, such-and-such is true.
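Here is a small Monte Carlo sketch of that definition; the sets and the Gaussian utility distribution are toy choices of mine, not anything required by the theorems:

```python
import numpy as np

rng = np.random.default_rng(0)

def optimality_probability(A, C, n_samples=100_000):
    """Estimate p_D(A >= C) for D = iid standard Gaussian utility vectors."""
    A, C = np.asarray(A, dtype=float), np.asarray(C, dtype=float)
    r = rng.standard_normal((n_samples, A.shape[1]))   # sampled utility functions r ~ D
    best_in_A = (A @ r.T).max(axis=0)                  # max_{a in A} a^T r, per sample
    best_in_C = (C @ r.T).max(axis=0)                  # max_{c in C} c^T r, per sample
    return float(np.mean(best_in_A >= best_in_C))

# Toy example with three outcomes: C holds every outcome lottery the agent could
# induce (the three standard basis vectors); A holds two of them, B the third.
C = np.eye(3)
A, B = C[:2], C[2:]
print(optimality_probability(A, C))   # ~0.67: A is optimal about 2/3 of the time
print(optimality_probability(B, C))   # ~0.33
```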

Nothing here has anything to do with a Markov decision process, or the world being finite, or fully observable, or whatever. Fundamentally, the power-seeking theorems were never about mdps—they were secretly about the probability that a set AA of linear functionals is optimal, with respect to another set CC. Mdps were just a way to relax the problem.

In terms of the pixel-flipping environment:

  • When followed from a time step, each (deterministic) policy $\pi$ induces a distribution over observation histories

    • These are represented as unit vectors, with each entry marking the probability that an observation history is realized
    • If the environment is deterministic, all deterministic policies induce standard basis vectors (probability 1 on their induced observation history, 0 elsewhere)
  • Let $B$ be the set of histories available given that $\pi$ selects $a_1$ (‘death’) at the first time step.

    • As argued above, $|B|=1$—the agent loses all control over future observations. Its element is a standard basis vector.
  • Define $A$ similarly for $a_2$ (flipping pixel 2) at the first time step.

    • As argued above, $|A|=10^{196}$; all elements are standard basis vectors by determinism.
  • Let $C$ be the set of all available observation histories, starting from the first time step.

  • There exist $10^{196}$ different involutions $\phi$ over observation histories such that $\phi(B)=A'\subseteq A$ (each $\phi$ transposing $B$’s element with a different element of $A$). Each one just swaps the death-history with an $a_2$-history.

    • By the scaling law of instrumental convergence, we conclude that for every uOH, at least $\frac{10^{196}}{10^{196}+1}$ of its permuted variants (weakly) prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$. (A toy version of this counting argument is sketched in code below.)
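Here is that toy version, with three observation histories instead of $10^{196}+1$; the utility values are arbitrary numbers I made up:

```python
from itertools import permutations

# One "death" history, plus the two histories reachable if the agent stays alive.
outcomes = ["death", "alive_1", "alive_2"]
u = {"death": 5.0, "alive_1": 1.0, "alive_2": 0.0}   # an arbitrary uOH for the toy case

def weakly_prefers_flip2(util):
    # Flipping pixel 2 is weakly preferred iff the best reachable live history
    # is at least as good as the single death history.
    return max(util["alive_1"], util["alive_2"]) >= util["death"]

preferring = 0
variants = list(permutations([u[o] for o in outcomes]))
for values in variants:
    permuted = dict(zip(outcomes, values))            # a permuted variant of u
    preferring += weakly_prefers_flip2(permuted)

print(preferring, "/", len(variants))   # 4 / 6, which is at least 2/(2+1) of the variants
```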

Beyond survival-seeking

I often give life-vs-death examples because they’re particularly easy to reason about. But the theorems apply to more general cases of more-vs-less control.

For example, suppose that instead of “killing” the agent, $a_1$ restricts it to two effective actions at each time step (it can only flip one of the first two pixels). Then $a_2$ is still convergently instrumental over $a_1$. After taking action $a_1$, there are $2^{49}\approx 5.6\times 10^{14}\leq 10^{15}$ observation histories available. These observation histories can be embedded at least $\frac{10^{196}}{10^{15}}=10^{181}$ times into the observation histories available after taking action $a_2$. Then for every uOH, at least $\frac{10^{181}}{10^{181}+1}$ of its permuted variants (weakly) prefer flipping the second pixel at $t=1$, over flipping the first pixel at $t=1$.
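The same kind of exact-integer check works here (the variable names are just mine):

```python
n, T = 100, 50

after_a1 = 2 ** (T - 1)            # histories if a1 merely restricts the agent to 2 pixels
after_a2 = (n * n) ** (T - 1)      # histories if a2 leaves all n*n pixels available

disjoint_copies = after_a2 // after_a1   # times the a1-histories embed into the a2-histories
print(after_a1)                          # 562949953421312, about 5.6 * 10^14
print(len(str(disjoint_copies)) - 1)     # 181, i.e. at least 10^181 disjoint embeddings
```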

Instrumental Convergence Disappears For Utility Functions Over Action-Observation Histories

Let’s consider utility functions over action-observation histories (uaoh).

[Figure: a decision tree of the agent’s choices in an environment with a 2×2 pixel grid. At $t=1$, action 1 leads to a state where all future actions result in the same single outcome, trapping the agent; action 2 preserves control, branching into a large tree of many different possible future outcomes.]
With respect to aoh, the pixel-flipping environment is now a regular quadtree. In the uOH setting, there was only one path in the top subtree—but aoh distinguish between different action sequences.

Since each utility function is over an aoh, each path through the tree is assigned a certain amount of utility. But when the environment is deterministic, it doesn’t matter what the agent observes at any point in time—all that matters is which path is taken through the tree. Without further assumptions, uaoh won’t tend to assign higher utility to one subtree than to another.

More formally, for any two actions $a_1$ and $a_2$, let $\phi$ be a permutation over aoh which transposes the histories available after $a_1$ with the histories available after $a_2$ (there’s an equal number of histories for each action, due to the regularity of the tree—you can verify this by inspection).

For any $u\in \mathcal{U}_\text{AOH}$ for which $a_1$ is strictly $u$-optimal over $a_2$, the permuted utility function $\phi\cdot u$ makes $a_2$ strictly $(\phi\cdot u)$-optimal over $a_1$, since $\phi$ swaps $a_1$’s strictly $u$-optimal history with $a_2$’s strictly $u$-suboptimal histories.

Symmetrically, $\phi$ works the other way around ($\{a_2 \text{ strictly optimal}\} \to \{a_1 \text{ strictly optimal}\}$). Therefore, for every utility function $u$, the # of variants which strictly prefer $a_1$ over $a_2$ is equal to the # of variants strictly preferring $a_2$ over $a_1$.
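For a concrete check, here is the smallest interesting case in Python: two actions, two time steps, and a deterministic environment, so four action-observation histories (the utility values are again arbitrary numbers of mine):

```python
from itertools import permutations

# Four aoh in a depth-2 binary tree: the first two start with a1, the last two with a2.
u = [3.0, 1.0, 4.0, 1.5]   # an arbitrary utility function over aoh

def strictly_optimal_first_action(util):
    best_after_a1 = max(util[0], util[1])
    best_after_a2 = max(util[2], util[3])
    if best_after_a1 > best_after_a2:
        return "a1"
    if best_after_a2 > best_after_a1:
        return "a2"
    return "tie"

counts = {"a1": 0, "a2": 0, "tie": 0}
for variant in permutations(u):
    counts[strictly_optimal_first_action(variant)] += 1

print(counts)   # {'a1': 12, 'a2': 12, 'tie': 0}: neither first action is convergently preferred
```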

While I haven’t been writing in the “definition-theorem-corollary” style, the key claims are just corollaries of the scaling law of instrumental convergence. They’re provably true. (I’m just not writing up the math here because it’s annoying to define all the relevant quantities in a nice way that respects existing formalisms.)

And even if the environment is stochastic, I think that there won’t be any kind of interesting instrumental convergence. The theorems let us reason about that case, but their applicability depends on the details of the stochasticity, and so I won’t talk about that more here.

Conclusion: Optimal policies for uaoh will tend to look like random twitching. For example, if you generate a uaoh by uniformly randomly assigning each aoh utility from the unit interval $[0,1]$, there’s no predictable regularity to the optimal actions for this utility function. In this setting and under our assumptions, there is no instrumental convergence without further structural assumptions.

How Structural Assumptions On Utility Affect Instrumental Convergence

Consider the $n=2$ pixel-flipping case (with $T=50$ still). Action $a_1$ still leads to a single OH, while $a_2$ leads to $(2\times 2)^{49}=4^{49}\approx 10^{29}$ OHs. So we have instrumental convergence for $\frac{10^{29}}{10^{29}+1}$ of all uOH variants.

Let’s model the pixel-flipping environment as a Markov decision process (mdp), with both the time-step and alive/dead status observed at each time step in order to ensure full observability, and the final time-step observations being terminal states where the agent stays forever. Dying allows the agent access to 1 terminal state: the observation 1/0/0/0 (dead). But surviving via $a_2$ lets the agent access $2^4=16$ terminal states (all 16 binary strings of length 4, with ‘alive’ appended to the end).

For each reward function over states, we can only conclude that at least $\frac{16}{16+1}=\frac{16}{17}$ of its permuted variants will incentivize not dying at $t=1$ (considering policies which maximize average per-timestep reward). This bound is a lot looser than the one for uOH. What gives?

Mdps assume that utility functions have a lot of structure: the utility of a history is time-discounted additive over observations. Basically, $u(a_1 o_1 a_2 o_2\ldots) ≝ \sum_{t=1}^\infty \gamma^{t-1}R(o_t)$, for some $\gamma\in[0,1)$ and reward function $R:\mathcal{O}\to\mathbb{R}$ over observations. And because of this structure, the agent’s average per-timestep reward is controlled by the final observation it sees. Final observations are exponentially less numerous than are observation histories. Therefore, in this situation, instrumental convergence is exponentially weaker for reward functions than for arbitrary uOH.
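As a minimal sketch of that structural assumption (the reward values here are placeholders I chose, not part of the post’s setup):

```python
def discounted_utility(observations, R, gamma=0.99):
    """Utility of an observation history under the MDP assumption:
    a time-discounted sum of a per-observation reward R."""
    return sum(gamma ** (t - 1) * R[o] for t, o in enumerate(observations, start=1))

R = {"dead": 0.0, "alive": 1.0}   # placeholder reward over observations

print(discounted_utility(["alive"] * 50, R))              # survive all 50 steps
print(discounted_utility(["alive"] + ["dead"] * 49, R))   # die after the first step
```

Every history’s utility is pinned down by just $|\mathcal{O}|$ reward values plus $\gamma$, rather than by a free choice of $|\mathcal{O}|^T$ numbers, which is why the permutation arguments have much less room to work with.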

This suggests that rolling a random uOH for aixi might be far more dangerous than rolling a random reward function for an optimal reinforcement learner.

Structural assumptions on utility really do matter when it comes to instrumental convergence:

| Setting | Strength of instrumental convergence |
| --- | --- |
| uaoh | Nonexistent |
| uOH | Strong |
| State-based objectives (e.g. state-based reward in mdps) | Moderate |

Environmental structure can cause instrumental convergence, but (the absence of) structural assumptions on utility can make instrumental convergence go away (for optimal agents).

Notes

  • Of course, you can represent uaoh as uOH by including the agent’s previous action in the next observation.

  • Time-reversible dynamics & full observability is basically the uaoh situation, since each action history leads to a unique world state at every time step.

    • But if you take away full observability, time-reversibility is insufficient to make instrumental convergence disappear.

Conclusion

  • For optimal agents, instrumental convergence can be extremely strong for utility functions over observation histories.

  • Instrumental convergence doesn’t exist for utility functions over action-observation histories.

    • i.e. optimal action will tend to look like random twitching.
    • This echoes previous discussion of the triviality of coherence over action-observation histories, when it comes to determining goal-directedness.
    • This suggests that consequentialism over observations / world states is responsible for convergent instrumental incentives.
      • Approaches like approval-directed agency focus on action selection instead of optimization over future observations.
  • Environmental structure can cause instrumental convergence, but (lack of) structural assumptions on utility can make instrumental convergence go away.


Appendix: Tracking key limitations of the power-seeking theorems

Time to cross another item off of the list from last time; the theorems:

  1. assume the agent is following an optimal policy for a reward function
    • I can relax this to $\epsilon$-optimality, but $\epsilon>0$ may be extremely small
  2. assume the environment is finite and fully observable
  3. Not all environments have the right symmetries
    • But most ones we think about seem to
  4. don’t account for the ways in which we might practically express reward functions
    • For example, state-action versus state-based reward functions (this particular case doesn’t seem too bad, I was able to sketch out some nice results rather quickly, since you can convert state-action mdps into state-based reward mdps and then apply my results).

Re 3), in the setting of this post, when the observations are deterministic, the theorems will always apply. (You can always involute one set of unit vectors into another set of unit vectors in the observation-history vector space.)

Another consideration is that when I talk about “power-seeking in the situations covered by my theorems,” the theorems don’t necessarily show that gaining social influence or money is convergently instrumental. I think that these “resources” are downstream of formal-power, and will eventually end up being understood in terms of formal-power—but the current results don’t directly prove that such high-level subgoals are convergently instrumental.

Footnotes

  1. For simplicity, I just consider environments which are joint probability distributions over actions and observations. This is much simpler than the lower semicomputable chronological conditional semimeasures used in the aixi literature, but it suffices for our purposes, and the theory could be extended to lscccss if someone wanted to.

  2. I don’t think we need to assume finite sets of vectors, but things get a lot harder and messier when you’re dealing with sup\sup instead of max\max. It’s not clear how to define the non-dominated elements of an infinite set, for example, and so a few key results break. One motivation for finite being enough is: in real life, a finite mind can only consider finitely many outcomes anyways, and can only plan over a finite horizon using finitely many actions. This is just one consideration, though.