
Handwritten text reads: "Last time, on Reframing Impact:". Below, a cloud-shaped bubble defines the "Catastrophic Convergence Conjecture": "Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives."

Text: "If the CCC is right, then if power gain is disincentivized, the agent isn't incentivized to overfit and disrupt our AU landscape. Without even knowing who we are or what we want, the agent's actions preserve our attainable utilities." Below are illustrated examples of commands: "Make paperclips," "Put the strawberry on the plate," or "Paint the car pink," followed by the main constraint: "... but don't gain power." Title text reads: "This approach is called Attainable Utility Preservation." A diagram shows a robot's reach being stopped by a blue energy field. Beside it, three status bars labeled "Human", "Trout", and "AI" all show high attainable utility levels.

Handwritten text: "imagine an agent receiving reward for a primary task minus a scaled penalty for how much its actions change its power." It specifies this is "AUP_conceptual, not" (underlined) "any formalization you may be familiar with."

A diagram asking what a paperclip-manufacturing AUP_conceptual agent might do. Disallowed actions, crossed out with red Xs, are "Build lots of factories," "Copy itself," and "Nothing." The encouraged policy is to "Narrowly improve paperclip production efficiency." Text explains that AUP_conceptual is designed to encourage this and that the optimal policy won't be catastrophic.

Handwritten text: "AUP_conceptual dissolves thorny problems in impact measurement." The question "Is the agent's ontology reasonable?" is answered with "Who cares." A robot in a thinking pose sits on a large green circle. Text explains: "Instead of regulating its complex physical effects on the outside world, the agent is looking inwards at itself and its own abilities." A thought bubble shows the robot picturing itself and a power meter. A diagram on the "locality" problem for AI impact penalties. It asks how to avoid penalties from distant state changes, using the example of rearranging inaccessible stars—a huge but irrelevant change. The solution is for AUP_conceptual to regularize the agent's impact on the nearby AU landscape.

A diagram asks, "What about butterfly effects? How can the agent possibly determine which effects it's responsible for?" An agent in a maze chooses between going north (somehow causing a tornado in the east) and going south (somehow causing a volcano to erupt in the west). The diagram concludes, "Forget about it." Handwritten text: "AUP_conceptual agents are respectful and conservative with respect to their AU landscape..." A robot interacts with the environment while green bars for "Human," "Trout," and "AI" indicate preserved attainable utilities for these agents.

Handwritten text asks, "How can an idea go wrong?" It describes two gaps: one between what we want and the concept, and another between the concept and execution. It then critiques past impact measures for focusing on minimizing physical change or maintaining world states, questioning if this is well-aimed. A cartoon robot with a box head is surrounded by red laser-like tripwires. Text above reads: "The hope is that in order for the agent to have a large impact on us, it has to snap a tripwire." Text below reads: "The problem is... well, it's not clear how we could possibly know whether the agent can still find a catastrophic policy; in a sense, the agent is still trying to sneak by the restrictions and gain power over us. An agent maximizing expected utility while actually minimally changing the physical world still probably leads to catastrophe."

Handwritten text: "That doesn't seem to be the case for AUP conceptual. Assuming CCC, an agent which doesn't gain much power doesn't cause catastrophes. This has no dependency on complicated human value, and most realistic tasks should have reasonable, high-reward policies not gaining undue power." A hand-drawn diagram states "So AUP_conceptual meets our desiderata:". Inside a cloud bubble is a list titled "The distance measure should: 1) Be easy to specify, 2) Put catastrophes far away, 3) Put reasonable plans nearby." Next to the list is a "no" symbol over a nervous robot sweating in front of a devilish pink smiley face.

Top text: "Therefore, I consider AUP to conceptually be a solution to impact measurement." with a cartoon robot popping champagne. Bottom text: "Wait! Let's not get ahead of ourselves! I don't think we've fully bridged the concept/execution gap. However, for AUP, it seems possible – more on that later."


What if we want the agent to single-handedly ensure the future is stable and aligned with our values? AUP probably won’t allow policies which actually accomplish this goal—one needs power to e.g. nip unaligned superintelligences in the bud. AUP aims to prevent catastrophes by stopping bad agents from gaining power to do bad things, but it symmetrically impedes otherwise-good agents.

This doesn’t mean we can’t get useful work out of agents—there are important asymmetries provided by both the main reward function and AU landscape counterfactuals.

First, even though we can’t specify an aligned reward function, the provided reward function still gives the agent useful information about what we want. If we need paperclips, then a paperclip-AUP agent prefers policies which make some paperclips. Simple.

Second, if we don’t like what it’s beginning to do, we can shut it off (because it hasn’t gained power over us). Therefore, it has “approval incentives” which bias it towards AU landscapes in which its power hasn’t decreased too much, either.

So we can hope to build a non-catastrophic AUP agent and get useful work out of it. We just can’t directly ask it to solve all of our problems: it doesn’t make much sense to speak of a “low-impact singleton.”

  • To emphasize, when I say “AUP agents do X” in this post, I mean that AUP agents correctly implementing the concept of AUP tend to behave in a certain way.
  • As pointed out by Daniel Filan, AUP suggests that one might work better in groups by ensuring one’s actions preserve teammates’ AUs.