The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent’s environment.

In The Catastrophic Convergence Conjecture, I wrote:

"What happens when agents seek pure control over the future? Not everyone can be king. If you're just seeking power without concern for others, you tend to push others down after a certain point. And most goals don't have concern for others. You'll just compete for resources." Below, a "paperclip maximizer" robot punches a "staple maximizer" robot.A hand-drawn illustration shows two robots on opposite sides of a green planet. Text reads: "It may take a while for power-seekers to come into conflict. But they will. They don't hate each other; they're just in each other's way."

Are there worlds where this isn’t true? Consider a world where you supply a utility-maximizing AGI with a utility function.

[Illustration: a universe divided by a jagged line into two halves. The left half contains a person on Earth; the right half contains a robot on a planet. Text below the line reads "No interaction," indicating that the two halves are causally separate.]
The AGI is in a “separate part of the universe”: after the initial specification of the utility function, the left half of the universe evolves independently of the right half. Nothing you do after specification can affect the AGI’s half, and vice versa. No communication can take place between the two halves.

The only information you have about the other half is your utility. For simplicity, let’s suppose you and the AGI have utility functions over universe-histories that are additive across the halves of the universe. You don’t observe any utility information about the other part of the universe until the end of time, and vice versa for the AGI. That is, for a universe-history $h = (h_{\text{left}}, h_{\text{right}})$,

$$u_{\text{human}}(h) = u_{\text{human}}^{\text{left}}(h_{\text{left}}) + u_{\text{human}}^{\text{right}}(h_{\text{right}}),$$
$$u_{\text{AGI}}(h) = u_{\text{AGI}}^{\text{left}}(h_{\text{left}}) + u_{\text{AGI}}^{\text{right}}(h_{\text{right}}).$$

If the AGI uses something like causal decision theory, then it won’t try to kill you or “seek power” over you. The effects of its actions have no causal influence on what happens in your half of the universe. Your universe’s evolution adds a constant term to its expected utility.
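
A minimal sketch of why, using the additive decomposition above: write $\pi$ for the AGI’s policy and take expectations over the AGI’s beliefs. Under causal decision theory, choosing $\pi$ does not change the distribution over $h_{\text{left}}$, so

$$\mathbb{E}\left[u_{\text{AGI}}(h) \mid \pi\right] = \underbrace{\mathbb{E}\left[u_{\text{AGI}}^{\text{left}}(h_{\text{left}})\right]}_{\text{constant in }\pi} + \mathbb{E}\left[u_{\text{AGI}}^{\text{right}}(h_{\text{right}}) \mid \pi\right],$$
$$\arg\max_{\pi} \mathbb{E}\left[u_{\text{AGI}}(h) \mid \pi\right] = \arg\max_{\pi} \mathbb{E}\left[u_{\text{AGI}}^{\text{right}}(h_{\text{right}}) \mid \pi\right].$$

The left-half term is the same additive constant for every policy, so it drops out of every comparison between policies: the AGI gains nothing by modeling you, let alone by seeking power over you.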

Note

Other decision theories might have the AGI precommit to minimizing human utility unless it attains maximal AGI-utility from the left half of the universe-history, or some other shenanigans. This precommitment isn’t relevant to the point I want to make in this post, but it’s important to consider.

However, the setup is still interesting because

  1. Goodhart’s law still applies: if you give the AGI an incomplete proxy objective, you’ll get suboptimal true performance (see the toy sketch below this list).
  2. Value is still complex: it’s still hard to get the AGI to optimize the right half of the universe for human flourishing.
  3. If the AGI is autonomously trained via stochastic gradient descent in the right half of the universe, then we may still hit inner alignment problems.
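
As a purely toy illustration of the first point: the resource-budget setup and the specific utility functions below are hypothetical choices for the example, not part of the split-universe scenario. The true objective needs two resources in balance, while the proxy only credits one of them, so hard-optimizing the proxy scores badly on the true objective.

```python
# Toy Goodhart illustration: an incomplete proxy objective, optimized hard,
# performs poorly under the true objective. The budget/resource framing is a
# hypothetical example, not anything from the scenario above.

def true_utility(x: float, y: float) -> float:
    """Flourishing requires both resources; the scarcer one is the bottleneck."""
    return min(x, y)

def proxy_utility(x: float, y: float) -> float:
    """Incomplete proxy: only measures resource x."""
    return x

BUDGET = 10.0
# Candidate allocations: how much of the budget goes to resource x (rest to y).
allocations = [i / 10 for i in range(101)]

proxy_best_x = max(allocations, key=lambda x: proxy_utility(x, BUDGET - x))
true_best_x = max(allocations, key=lambda x: true_utility(x, BUDGET - x))

print(f"proxy-optimal x={proxy_best_x:.1f} -> true utility "
      f"{true_utility(proxy_best_x, BUDGET - proxy_best_x):.1f}")   # x=10.0 -> 0.0
print(f"true-optimal  x={true_best_x:.1f} -> true utility "
      f"{true_utility(true_best_x, BUDGET - true_best_x):.1f}")     # x=5.0 -> 5.0
```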

Alignment is still hard, and we still want to get the AGI to do good things on its half of the universe. But it isn’t instrumentally convergent for the AGI to seek power over you, and so you shouldn’t expect an unaligned AGI to try to kill you in this universe. You shouldn’t expect the AGI to kill other humans, either, since none exist in the right half of the universe, and it won’t create any.

To restate: Bostrom’s original instrumental convergence thesis needs to be applied carefully. The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent’s environment. I sometimes bump into reasoning that feels like “instrumental convergence + smart AI + humans exist in the universe ⇒ bad things happen to us / the AI finds a way to hurt us.” I think this implication is usually true, but it is not necessarily true, and this extreme example illustrates how it can fail.

Thanks

Thanks to John Wentworth for feedback on this post. Edited to clarify the broader point I’m making.
