Table of Contents

Text: "We've come a long way; let's recap." To the right, a small cartoon robot stands on a larger, wheeled robot and adds a block to the top of a tall, precarious tower of blocks.Top: A Pebblehoarder mourns their pebbles (now turned into obsidian cubes). Text: "Some things feel like big deals to agents with specific kinds of goals." Bottom: A planet being destroyed in space. Text: "Some things feel like big deals to basically everyone."A confused stick figure holds a brain next to the word "Why?"."When thinking about whether something impacts us, we ask: How does this change my ability to get what I want?". The central question is in large, multicolored letters. Below, it states: "This is impact."On top, a stick figure thinks about how they "could" be productive, behind text: "The way people feel impacted depends on their beliefs about the world and their future actions." Below, an intact vase shatters, with blue arrows tracking where each piece travels. Text: "Impact's not necessarily about big physical change to the world."An illustrated summary of the "Reframing Impact" sequence's five main points. 1. A landscape drawing with text: "Acting in the world changes who can do what." 2. A cartoon figure climbs crates to reach the powerful Infinity Gauntlet: "Theorems suggest that most optimal agents who care about the future try to gain control over their environment." 3. "Catastrophic Convergence Conjecture: Unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives." 4. Frank the robot pops champagne behind the text "To avoid catastrophe, have an agent achieve its goal without gaining power. This sidesteps previously intractable problems in impact measurement." 5. "By preserving randomly selected AUs, AUP agents avoid side effects even in highly nontrivial environments." A handwritten question asks: "What if we have smart agents accrue reward while being penalized for becoming more able to accrue that reward?" A drawing shows a robot reaching toward a blue barrier, next to three green attainable utility bars labeled Human, Trout, and AI."We can steadily decrease the penalty term until the agent selects a reasonable, non-catastrophic policy. This avoids catastrophe if catastrophes require gaining e.g. 10x as much power as do reasonable policies."Mt. Doom erupts in the distance, as viewed from the White Tower of Gondor. The White Tree begins to blossom. Text: "We still have work to do. The alignment problem remains comically underfocused in academia. We're still confused about many things. However, after this sequence, I'd like to think we're a little less confused about a little bit of the problem. Writing Reframing Impact has been a pleasure. Thanks for reading."

Acknowledgments

After ~700 hours of work over the course of ~9 months, the sequence is finally complete.

This work was made possible by the Center for Human-Compatible AI, the Berkeley Existential Risk Initiative, and the Long-Term Future Fund. Deep thanks to Rohin Shah, Abram Demski, Logan Smith, Evan Hubinger, TheMajor, Chase Denecke, Victoria Krakovna, Alper Dumanli, Cody Wild, Matthew Barnett, Daniel Blank, Sara Haxhia, Connor Flexman, Zack M. Davis, Jasmine Wang, Matthew Olson, Rob Bensinger, William Ellsworth, Davide Zagami, Ben Pace, and a million other people for giving feedback on this sequence.


I’ve made many claims in these posts. All views are my own. Here are my credences in the sequence’s major claims:

| Statement | Credence |
| --- | --- |
| There exists a simple closed-form solution to catastrophe avoidance (in the outer alignment sense). | 25% |
| For the superhuman case, penalizing the agent for increasing its own Attainable Utility (AU) is better than penalizing the agent for increasing other AUs. | 65% |
| Some version of Attainable Utility Preservation solves side effect problems for an extremely wide class of real-world tasks and for subhuman agents. | 65% |
| The catastrophic convergence conjecture is true. That is, unaligned goals tend to have catastrophe-inducing optimal policies because of power-seeking incentives. | 70% [1] |
| Agents trained by powerful RL algorithms on arbitrary reward signals generally try to take over the world. | 75% [2] |
| AUP_conceptual prevents catastrophe, assuming the catastrophic convergence conjecture. | 85% |
| Attainable Utility theory describes how people feel impacted. | 95% |
Note

The LessWrong version of this post contained probability estimates from other users.

The big art pieces (and especially the last illustration in this post) were designed to convey a specific meaning, the interpretation of which I leave to the reader.

The sequence hides a few pop culture references which I think are obvious enough to not need pointing out, and a lot of hidden smaller playfulness which doesn’t quite rise to the level of “easter egg.”

Reframing Impact
The bird’s nest contains a literal easter egg.

Handwritten text reads: "The world is wide, and full of objects." Below, a white space contains simple drawings of a bird's nest, a blue bird, a pink circle, a grey circle labeled "worst," and a pink smiley face.

The paperclip-Balrog drawing contains a Tengwar inscription which reads “one measure to bind them”, with “measure” in impact-blue and “them” in utility-pink.

Text overlay: "An impact measure would be the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things – without assuming anything about the objective. This is a rare property among approaches." The text lurks above an illustration paying homage to the iconic Gandalf-vs-Balrog scene in Moria. The demon's whip ends in a giant paperclip, a metaphor for a misaligned artificial intelligence.

“Towards a New Impact Measure” was the title of the post in which AUP was introduced:

The interior of a cozy, hobbit-hole-like room with a round door open to a sunny landscape. Sunlight streams in, illuminating the tiled floor. Text over the view reads "Towards a new impact measure" and is rendered in a Tolkienesque font.


Attainable Utility Theory: Why Things Matter

This style of maze is from the video game Undertale.

A colorful grid maze in the style of the video game "Undertale." On the left, a white square with a plus sign is labeled "you." In the top right corner, a dark grey square is labeled "Your goal."


Seeking Power is Often Convergently Instrumental in MDPs

Frank seeks power by trying to get at the Infinity Gauntlet.

A crying cartoon robot jumps from stacked crates, straining to reach a high ledge where a treasure chest contains the glowing Infinity Gauntlet.


The tale of Frank and the orange Pebblehoarder
Speaking of under-tales, a friendship has been blossoming right under our noses:
A cartoon of Frank the robot giving his pink marble to a surprised Pebblehoarder. They stand in a grassy field under a sunny sky.
After the Pebblehoarders suffer the devastating transformation of all of their pebbles into obsidian blocks, Frank generously gives away his favorite pink marble as a makeshift pebble.
"Impact" is written in large blue letters inside a sparkling frame. Below, text reads: "Written and illustrated by Alex Turner." To the right, a small robot stands on a larger robot to build a tower of black blocks. The small robot tips over a small block, possibly leading to a block-avalanche.
The title cuts to the middle of their adventures together, the Pebblehoarder showing its gratitude by helping Frank reach things high up.
Frank and the Pebblehoarder sit together on a cliff's edge, overlooking a vast mountain range at sunset. The scene pays homage to the ending shot of the 2012 film, The Hobbit: An Unexpected Journey.
This still at the midpoint of the sequence is from the final scene of The Hobbit: An Unexpected Journey, where the party is overlooking Erebor, the Lonely Mountain. They’ve made it through the Misty Mountains, only to find Smaug’s abode looming in the distance.
Frank the robot stands atop the orange Pebblehoarder, popping a bottle of champagne. In the background, celebratory fireworks explode, with one spelling out "LW" in purple.
Frank and the orange Pebblehoarder pop some of the champagne from Smaug’s hoard. Since Erebor isn’t close to Gondor, we don’t see Frank and the Pebblehoarder gazing at Ephel Dúath from Minas Tirith.

  1. There seems to be a dichotomy between “catastrophe directly incentivized by goal” and “catastrophe indirectly incentivized by goal through power-seeking”, although Vika provides intuitions in the other direction.

  2. The theorems on power-seeking only apply to optimal policies in fully observable environments, which isn’t realistic for real-world agents. However, I think they’re still informative. There are also strong intuitive arguments for power-seeking.