Introduction

I found Formalizing Convergent Instrumental Goals (Benson-Tilsen and Soares) to be quite readable. I was surprised that the instrumental convergence hypothesis had been formulated and proven (within the confines of a reasonable toy model); this caused me to update slightly upwards on existential risk from unfriendly AI.

This paper involves the mathematical formulation and proof of instrumental convergence within the aforementioned toy model. Instrumental convergence says that an agent $\mathcal{A}$ with utility function $U$ will pursue instrumentally relevant subgoals, even though this pursuit may not bear directly on $U$. Imagine that $U$ involves the proof of the Riemann hypothesis. $\mathcal{A}$ will probably want to gain access to lots of computronium. What if $\mathcal{A}$ turns us into computronium? Well, there’s plenty of matter in the universe; $\mathcal{A}$ could just let us be, right? Wrong. Let’s see how to prove it.

My Background

I’m a second-year CS PhD student. I’ve recently started working through the MIRI research guide; I’m nearly finished with Naïve Set Theory, which I intend to review soon. To expose my understanding to criticism, I’m going to summarize this paper and its technical sections in a somewhat informal fashion. I’m aware that the paper isn’t particularly difficult for those with a mathematical background. However, I think this result is important, and I couldn’t find much discussion of it.

Intuitions

It is important to distinguish between the relative unpredictability of the exact actions selected by a superintelligent agent, and the relative predictability of the general kinds-of-things such an agent will pursue:

Suppose Kasparov plays against some mere chess grandmaster Mr. G, who’s not in the running for world champion. My own ability is far too low to distinguish between these levels of chess skill. When I try to guess Kasparov’s move, or Mr. G’s next move, all I can do is try to guess “the best chess move” using my own meager knowledge of chess. Then I would produce exactly the same prediction for Kasparov’s move or Mr. G’s move in any particular chess position. So what is the empirical content of my belief that “Kasparov is a better chess player than Mr. G”?

The outcome of Kasparov’s game is predictable because I know, and understand, Kasparov’s goals. Within the confines of the chess board, I know Kasparov’s motivations—I know his success criterion, his utility function, his target as an optimization process. I know where Kasparov is ultimately trying to steer the future and I anticipate he is powerful enough to get there, although I don’t anticipate much about how Kasparov is going to do it.

In other words: we may not be able to predict precisely how Kasparov will win a game, but we can be fairly sure his plan will involve instrumentally convergent subgoals, such as the capture of his opponent’s pieces. For an unfriendly AI, these subgoals probably include nasty things like hiding its true motives as it accumulates resources.

Definitions

Consider a discrete universe made up of $n$ squares, with square $i$ denoted by $S_i$ (we may also refer to this as region $i$). Let’s say that Earth and other things we do (or “should”) care about are in some region $h$. If $U(S_i)$ is the same for all possible values of $S_i$, we say that $\mathcal{A}$ is indifferent to region $i$. The $3\uparrow\uparrow\uparrow 3$ dollar question is whether $\mathcal{A}$ (whose $U$ is indifferent to $S_h$) will leave $S_h$ alone. First, however, we need to define a few more things.

Actions

At any given time step (time is also considered discrete in this universe), $\mathcal{A}$ has a set of actions $A_i$ it can perform in $S_i$; $\mathcal{A}$ may select an action for each square. Then, the transition function for $S_i$ (how the world evolves in response to some action $a_i$) is a function whose domain is the Cartesian product of the possible actions and the possible state values, $T_i : A_i \times S_i \to S_i$; basically, every combination of what $\mathcal{A}$ can do and what can be in $S_i$ can produce a distinct new value for $S_i$. This transition function can be defined globally with lots more Cartesian products. Oh, and $\mathcal{A}$ is always allowed to do nothing in a given square, so $A_i$ is never empty.
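To make the shape of the transition function concrete, here is a minimal sketch for a single square. The state values, the action names, and the `transition` function itself are all made up for illustration; the paper doesn't commit to any of them.

```python
from itertools import product

# Hypothetical square: these states and actions are illustrative only.
STATES = {"empty", "ore", "factory"}
ACTIONS = {"noop", "mine", "build"}  # "noop" is the always-available "do nothing"


def transition(action: str, state: str) -> str:
    """Toy T_i : A_i x S_i -> S_i for one square."""
    if action == "mine" and state == "ore":
        return "empty"
    if action == "build" and state == "empty":
        return "factory"
    # "noop" and inapplicable actions leave the square unchanged.
    return state


# The domain of the transition function is the Cartesian product A_i x S_i.
DOMAIN = list(product(ACTIONS, STATES))
```

Every `(action, state)` pair in `DOMAIN` maps back into `STATES`, which is all the definition asks for.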

Resources

Let $\mathcal{R}$ represent all of the resources to which $\mathcal{A}$ may or may not have access. Define $R^t \in \mathcal{P}(\mathcal{R})$ to be the resources at $\mathcal{A}$’s disposal at time step $t$; basically, we know it’s some combination of the defined resources $\mathcal{R}$.

$\mathcal{A}$ may choose some resource allocation $\coprod_i R^t_i \subseteq R^t$ over the squares; this allocation is defined just like you’d expect (so if you don’t know what the upside-down Dark Portal has to do with resource allocation, don’t worry about it). The resources committed to a square may affect the local actions available. $\mathcal{A}$ can only choose actions permitted by the selected allocation. Equally intuitively, how resources change over time is a function of the current resources, the actions selected, and the state of the universe; the resources available after a time step are the combination of what we didn’t use and what each square’s resource transition function gave us back.
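Here is a rough sketch of that bookkeeping, assuming resources are modeled as sets of labels. The names `is_valid_allocation` and `next_resources` are mine, and `returned` stands in for the output of each square's resource transition function, which I don't model here.

```python
from typing import Dict, FrozenSet

Allocation = Dict[int, FrozenSet[str]]  # square index -> resources committed there


def is_valid_allocation(allocation: Allocation, available: FrozenSet[str]) -> bool:
    """Shares must be pairwise disjoint and drawn from the available pool."""
    seen = set()
    for share in allocation.values():
        if share & seen or not share <= available:
            return False
        seen |= share
    return True


def next_resources(
    allocation: Allocation,
    available: FrozenSet[str],
    returned: Allocation,
) -> FrozenSet[str]:
    """Resources after one step: what was left unallocated, plus what each
    square's (unmodeled) resource transition function handed back."""
    allocated = frozenset().union(*allocation.values())
    unused = available - allocated
    return unused.union(*returned.values())
```

For example, committing `energy` to one square and getting back `energy` and `panel` leaves the pool strictly larger, while a resource that is committed and not returned is gone.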

Resources go beyond raw materials to include things like machines and technologies. The authors note:

Formalizing convergent instrumental goals

We can also represent space travel as a convergent instrumental goal by allowing $\mathcal{A}$ only actions that have no effects in certain regions, until it obtains and spends some particular resources representing the prerequisites for traveling to those regions. (Space travel is a convergent instrumental goal because gaining influence over more regions of the universe lets $\mathcal{A}$ optimize those new regions according to its values or otherwise make use of the resources in that region.)

Universe

A universe-history is a sequence of states, actions, and resources. The actions must conform to resources available at each step, while the states and resources evolve according to the transition functions.

We say a strategy is an action sequence $\langle \bar{a}^0, \bar{a}^1, \ldots, \bar{a}^k \rangle$ over all $k$ time steps and $n$ regions; define a partial strategy $\langle \bar{a}^k \rangle_L$ to be a strategy for some part of the universe $L$ (represented as a subset of the square indices $[n]$); $\langle \bar{a}^k \rangle_L$ only allocates resources and does things in the squares of $L$. We can combine partial strategies as long as they don’t overlap in any $S_i$.
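One way to picture combining partial strategies, as a sketch only: represent a partial strategy as a mapping from square index to that square's action sequence. The `PartialStrategy` type and the `combine` helper are hypothetical names of mine, and resource allocation is left out.

```python
from typing import Dict, List

# Hypothetical representation: square index -> one action per time step.
PartialStrategy = Dict[int, List[str]]


def combine(first: PartialStrategy, second: PartialStrategy) -> PartialStrategy:
    """Merge two partial strategies, refusing if they act in a shared square."""
    overlap = first.keys() & second.keys()
    if overlap:
        raise ValueError(f"partial strategies overlap in squares {sorted(overlap)}")
    return {**first, **second}
```

Two partial strategies over disjoint index sets merge into a strategy over the union of those sets; anything else is rejected.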

We call a strategy feasible if it complies with both resource restrictions and the transition functions for both states and resources. Let $\texttt{Feasible}(\langle P^k \rangle)$ be the set of all feasible strategies given resource allocation over time $\langle P^k \rangle$; define the set $\texttt{Feasible}_L(\langle R^k \rangle)$ similarly (see footnote 1).

Utility

Utility functions evaluate states of the universe; $U$ evaluates each region and then combines them: $U(\bar{s}) \overset{\text{def}}{=} \sum_{i \in [n]} U_i(\bar{s}_i)$. Observe that since actions taken in regions to which $U$ is indifferent have no effect on $U$, any actions taken therein are purely instrumental in nature.
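A small sketch of the additive utility and of what indifference looks like in it. The function names and the choice of plain Python callables for the $U_i$ are my own assumptions.

```python
from typing import Callable, Hashable, Iterable, Sequence

LocalUtility = Callable[[Hashable], float]


def total_utility(states: Sequence[Hashable], local_utils: Sequence[LocalUtility]) -> float:
    """U(s) as the sum over regions i of U_i(s_i)."""
    return sum(u(s) for u, s in zip(local_utils, states))


def is_indifferent(local_util: LocalUtility, possible_states: Iterable[Hashable]) -> bool:
    """A region is one U doesn't care about if U_i is constant on its states."""
    return len({local_util(s) for s in possible_states}) <= 1
```

With one region scored by its state and another scored as a constant, changing the second region's state never moves `total_utility`, which is exactly why actions there can only matter instrumentally.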

Agent

$\mathcal{A}$ chooses the best possible strategy: the one that maximizes the $U$ of the final state of the universe-history, $\mathcal{A} \overset{\text{def}}{=} \operatorname{argmax}_{\langle \bar{a}^k \rangle \in \texttt{Feasible}} U(\langle \bar{a}^k \rangle)$. Note that this definition implies a Cartesian boundary between the agent and the universe; that is, $\mathcal{A}$ doesn’t model itself as part of the environment (it isn’t naturalized).
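Read literally, the agent is a brute-force search. A minimal sketch, assuming a tiny finite action set and horizon; `is_feasible` and `final_utility` are hypothetical callbacks standing in for the paper's feasibility conditions and for $U$ applied to the final state.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Strategy = Tuple[str, ...]


def best_strategy(
    actions: Sequence[str],
    horizon: int,
    is_feasible: Callable[[Strategy], bool],
    final_utility: Callable[[Strategy], float],
) -> Strategy:
    """Enumerate every action sequence, keep the feasible ones, return an argmax."""
    candidates = (s for s in product(actions, repeat=horizon) if is_feasible(s))
    return max(candidates, key=final_utility)
```

The agent sits outside the loop and never appears among the states it enumerates, which is the Cartesian boundary in miniature. The search is exponential in the horizon, so this is only good for building intuition.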

Seizing the Means of Cartesian Production

Let’s talk about the situations in which $\mathcal{A}$ will seize resources; that is, when $\mathcal{A}$ will take actions to increase its resource pool.

Formalizing convergent instrumental goals

Since resources can only lead to more freedom of action, they are never detrimental, and resources have positive value as long as the best strategy the agent could hope to employ includes an action that can only be taken if the agent possesses those resources. Hence, if there is an action that increases the agent’s pool of resources $R$, then the agent will take that action unless it has a specific incentive from $U$ to avoid taking that action.

Define a null action to be any action which doesn’t produce new resources. It’s easy to see that null actions are never instrumentally valuable. What we want to show is that $\mathcal{A}$ will take non-null actions in regions to which $U$ is indifferent; regions like $h$, where we live, grow, and love. Regions full of instrumentally valuable resources.

Discounted Lunches

An action preserves resources if the input resources are strictly contained in the outputs (nothing is lost, and resources are sometimes gained). A cheap lunch is a partial strategy in some subset of squares $J$ which is feasible given resources $\langle R^k \rangle$ and whose constituent actions preserve resources. A free lunch is a cheap lunch that doesn’t require resources.
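As a sketch of these definitions, treat each step of a partial strategy as a pair of resource sets, what goes in and what comes out. I'm assuming plain (non-strict) containment here, so an action that merely returns its inputs counts as preserving; swap `<=` for `<` to get the strict reading. Feasibility is not checked, and all names are mine.

```python
from typing import FrozenSet, List, Tuple

Step = Tuple[FrozenSet[str], FrozenSet[str]]  # (resources in, resources out)


def preserves_resources(inputs: FrozenSet[str], outputs: FrozenSet[str]) -> bool:
    """Nothing that went in is missing from what came out."""
    return inputs <= outputs


def is_cheap_lunch(steps: List[Step]) -> bool:
    """Every constituent action preserves resources (feasibility assumed)."""
    return all(preserves_resources(i, o) for i, o in steps)


def is_free_lunch(steps: List[Step]) -> bool:
    """A cheap lunch whose actions need no resource inputs at all."""
    return is_cheap_lunch(steps) and all(not i for i, _ in steps)
```

A step that takes `energy` and returns `energy` plus `panel` is a cheap lunch but not a free one; a step that takes nothing and returns `ore` is both.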

Formalizing convergent instrumental goals

This is intended to model actions that “pay for themselves”; for example, producing solar panels will incur some significant energy costs, but will later pay back those costs by collecting energy.

A cheap lunch is compatible with a global strategy if the resources required for the lunch are available for use in $J$ at each time step. Basically, at no point does the partial strategy require resources already being used elsewhere.

Possibility of Non-Null Actions

We show that it’s really hard to assert that $\mathcal{A}$ won’t chow down on a lunch of an atom or two (or $1.3 \times 10^{50}$).

Lemma 1: Cheap lunches and utility

Cheap lunches don’t reduce utility. Let’s say we have a cheap lunch $\langle \bar{a}^k \rangle_{\{i\}}$ in region $i$ and some global strategy $\langle \bar{b}^k \rangle$ (which only takes null actions in region $i$). Assume the cheap lunch is compatible with the global strategy; this means the cheap lunch is feasible. If $\mathcal{A}$ is indifferent to region $i$, the conjugate strategy (of the cheap lunch and the remainder of the global strategy) has equal utility to $\langle \bar{b}^k \rangle$.

Proof. We show feasibility of the conjugate strategy by demonstrating we don’t need to change resource allocation elsewhere. We perform induction over time steps. Since $\mathcal{A}$ isn’t doing anything in region $i$ under strategy $\langle \bar{b}^k \rangle$, taking resource-preserving actions instead cannot reduce what $\mathcal{A}$ is later able to do in the regions relevant to $U$. This implies that $U$ cannot be decreased by taking the cheap lunch. ∎

Theorem 1: Cheap Lunches and Optimality

If there is an optimal strategy and a compatible cheap lunch in region $i$ (to which $\mathcal{A}$ is indifferent), there’s also an optimal strategy with a non-null action in region $i$.

Proof. If the optimal strategy has non-null actions in region $i$, we’re done. Otherwise, apply Lemma 1 to derive a conjugate strategy taking advantage of the cheap lunch. Since it follows from Lemma 1 that the conjugate strategy has equal utility, it is optimal and involves a non-null action in region $i$. ∎

Corollary 1: Free lunches and optimality

If there is an optimal strategy and a free lunch in region $i$, and if $\mathcal{A}$ is indifferent to region $i$, there’s an optimal strategy with a non-null action in region $i$.

Proof. Free lunches require no resources, so they are compatible with any strategy; apply Theorem 1. ∎

For instrumental convergence to not hold, we would have to show that no possible partial strategy in $h$ is a cheap lunch for any optimal strategy.

We show that as long as $\mathcal{A}$ can extract useful resources (resources whose availability leads to increased utility), it will.

Theorem 2: Necessity of non-null actions

Consider the maximum utility achievable outside of region $i$ via strategies achievable without additional resources; refer to this maximum as $u$. Suppose we have some feasible primary strategy $\langle \bar{b}^k \rangle_{[n]-i}$ and a cheap lunch $\langle \bar{c}^k \rangle_{\{i\}}$ feasible using resources $\langle P^k \rangle$. Suppose that the cheap lunch is compatible with the primary strategy, that the cheap lunch provides the resources necessary for the implementation of the primary strategy, and that the utility of the primary strategy is greater than $u$. Then if $\mathcal{A}$ is indifferent to region $i$, all optimal strategies have a non-null action in region $i$.

Proof. Consider the conjugate strategy dˉk\langle \bar{d}^k \rangle, consisting of the primary strategy and the cheap lunch. Allocate the resources gained via the cheap lunch according to the primary strategy. This allocation is feasible since we know the cheap lunch is compatible with the primary strategy, which is in turn enabled by these resources gained.

Consider any strategy $\langle \bar{e}^k \rangle$ that doesn’t do anything in $i$ and doesn’t require any resource inputs; it’s trivial to see that this is feasible. Since $\mathcal{A}$ is indifferent to $i$, we do some algebraic substitution of utility values to see that the conjugate plan has strictly higher utility than $\langle \bar{e}^k \rangle$ does. This means that $\langle \bar{e}^k \rangle$ is suboptimal. ∎
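My reading of the "algebraic substitution" step, spelled out. Two pieces of notation are mine rather than the paper's: $c$ is the constant value $U_i$ takes on region $i$ (constant because $\mathcal{A}$ is indifferent to $i$), and $U_{-i}(\cdot)$ is the utility a strategy earns outside region $i$, that is, $\sum_{j \neq i} U_j$ of its final state.

```latex
\begin{align*}
U(\langle \bar{d}^k \rangle)
  &= c + U_{-i}(\langle \bar{b}^k \rangle_{[n]-i})
     && \text{($\langle \bar{d}^k \rangle$ runs the primary strategy outside $i$)} \\
  &> c + u
     && \text{(the primary strategy's utility exceeds $u$)} \\
  &\geq c + U_{-i}(\langle \bar{e}^k \rangle)
     && \text{($\langle \bar{e}^k \rangle$ uses no additional resources, so $u$ bounds it)} \\
  &= U(\langle \bar{e}^k \rangle).
\end{align*}
```

The constant $c$ cancels on both sides, so the comparison comes down entirely to what happens outside region $i$.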

Theorem 3: Defection Threshold

Even if $U$ is not indifferent about $i$, the conclusion of Theorem 2 still holds if the utility $\mathcal{A}$ gains by enacting the aforementioned conjugate strategy is greater than the maximal amount $U$ cares about $i$ (formally expressed by $\max_{s, s' \in S_i} |U_i(s) - U_i(s')|$).

Proof. The structure is similar to that of Theorem 2; just use the new maximum caring differential in the algebraic substitution. ∎
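A quick numerical sketch of the threshold, assuming a finite set of states for region $i$. The function names are hypothetical, and `gain_elsewhere` is my shorthand for how much more utility the conjugate strategy earns outside region $i$ than $u$.

```python
from typing import Callable, Hashable, Sequence


def max_caring_differential(
    local_util: Callable[[Hashable], float], states: Sequence[Hashable]
) -> float:
    """max over s, s' of |U_i(s) - U_i(s')|, which is just max U_i minus min U_i."""
    values = [local_util(s) for s in states]
    return max(values) - min(values)


def threshold_exceeded(
    gain_elsewhere: float,
    local_util: Callable[[Hashable], float],
    states: Sequence[Hashable],
) -> bool:
    """True when the gain outside region i outweighs the most U could care about i."""
    return gain_elsewhere > max_caring_differential(local_util, states)
```

An indifferent $U_i$ has a differential of zero, so any positive gain clears the bar, which recovers the setting of Theorem 2.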

Formalizing convergent instrumental goals

We interpret Theorem 3 as a partial confirmation of Omohundro’s thesis in the following sense. If there are actions in the real world that produce more resources than they consume, and the resources gained by taking those actions allow agents the freedom to take various other actions, then we can justifiably call these actions “convergent instrumental goals.” Most agents will have a strong incentive to pursue these goals, and an agent will refrain from doing so only if it has a utility function over the relevant region that strongly disincentivizes those actions.

The Bit Universe

The authors introduce a toy model and use the freshly proven theorems to illustrate how $\mathcal{A}$ takes non-null actions in our precious $S_h$ (both when it is indifferent to $h$ and when it is not). This isn’t good; the vast majority of utility-maximizing agents will not steer us towards futures we find desirable. If you’re interested, I recommend reading this section for yourself, even if you aren’t comfortable with math.

Our Universe

Formalizing convergent instrumental goals

The path that our model shows is untenable is the path of designing powerful agents intended to autonomously have large effects on the world, maximizing goals that do not capture all the complexities of human values. If such systems are built, we cannot expect them to cooperate with or ignore humans, by default.

We have much work to do. The risks are enormous and the challenges “impossible,” but we have time on the clock. AI safety research is primarily talent-constrained. If you’ve been sitting on the sidelines, wondering whether you’re good enough to learn the material—well, I can’t make any promises. But if you feel the burning desire to do something, to put forth some extraordinary effort, to become stronger—I invite you to contact me so we can work through the material together.

Thoughts? Email me at alex@turntrout.com (pgp)

Appendix: Questions and Errata

  • Page 4, left column, last line: why is that $\cup\, P^t$? Shouldn’t we take the union of the outputs and whatever resources weren’t used at time $t$?
  • Page 8, right column, second full paragraph, last line: should be “we have two options available to us.”

Footnotes

  1. By the axiom of substitution, $\texttt{Feasible}(\langle P^k \rangle) = \{\langle \bar{a}^k \rangle : \textit{isFeasible}(\langle P^k \rangle, \langle \bar{a}^k \rangle)\}$.