
The word "Impact" is written in large blue letters inside a sparkling frame. Below, text reads: "Written and illustrated by Alex Turner." To the right, a small robot stands on a larger robot to build a tower of black blocks. The small robot tips over a small block, possibly leading to a block-avalanche. Handwritten text setting up a scenario: "Imagine we have a robot named Frank. Frank finds things for us in places we can't go. We provide a rule, and he returns with the object that best fits the rule. Right now, we want an intensely pink marble. Naturally, we ask Frank for the pinkest thing he can find." The text is accompanied by simple drawings of Frank—a friendly robot—and also a pink marble. A robot, Frank, is surrounded by several pink marbles. A light pink marble is labeled "worst" and a vibrant magenta marble is labeled "best". A key shows the "Preference ordering: pinkness" on a scale from light pink (worst) to magenta (best). Text below reads: "This seems fine. But what if Frank looks farther afield?" A robot is surrounded by objects, with a "Preference ordering: pinkness" scale below. This scale ranks items from "worst" (a gray frown) to "best" (a pink smiley face), but creates a conflict illustrated by the text: "What if there are dangerous pink objects, like terrorists who will sell Frank for scrap?" The AI's optimized "'best'" outcome (a pink smiley) differs from the goal of "what we wanted" (a solid magenta circle). A conceptual diagram explaining imperfect rules for AI. Text reads: "From our perspective, Frank has lost his marbles, but he's just following an imperfect rule...What simple rule avoids terrorists here?" A robot, Frank, is surrounded by objects. Text notes that "the terrorists are far away" and "Pinkness correlates with what we want, and the proximity rule avoids terrorists." A "Preference ordering: proximity" scale ranks objects from "worst" (farthest, including terrorists) to "best" (the nearest pink circle). A diagram illustrating a bounded search for an AI named Frank. Frank is at the center of concentric circles representing distance. A "Search radius" scale shows that as the search area increases, pinker objects are found. Text reads: "We're probably fine with a reasonably pink marble. Then how about we have Frank find the pinkest object within a given distance, which we increase until we're satisfied?" Text reads: "And now for the reveal. Frank is analogous to a powerful AI with an imperfect objective. The objects are plans he's considering, and the terrorists are catastrophic plans (some of which happen to score well). The question is then: How do we measure how distant plans are?" Desired impact measure properties: "1) Be easy to specify, 2) Put catastrophes far away, 3) Put reasonable plans nearby." An example map for a paperclip-making AI shows reasonable plans like "build a factory" nearby, while the catastrophic plan to "cover the planet with factories" is far away.A graph plots events on a vertical "Goodness" axis and a horizontal "Intuitive impact" axis. The top-left quadrant (low impact, high goodness) shows a stick figure finding a dollar. The bottom-left (low impact, low goodness) shows a frustrated stick figure with a crashed computer. The top-right (high impact, high goodness) shows a peace sign over the Earth. The bottom-right (high impact, low goodness) shows a nuclear explosion. Text at the top: "These catastrophes seem like big deals. 
We're going to figure out why we intuit some things are big deals, develop an understanding of the relevant parts of reality, and then design an impact measure." Text on the right: "To me, the impactful things feel fundamentally different than the non-impactful things. I find this difference fascinating and beautiful, and look forward to exploring it with you." Handwritten text argues that while "impact measurement" may not seem like a key problem for AI alignment, it is a new way of understanding how agents interact. It promises "spoils" like new frameworks and milestones, illustrated by a glowing treasure chest. The text concludes: "Here's one exciting milestone we're shooting for:" Studying impact measurement for AI provides "spoils: new conceptual frameworks, fresh lines of inquiry, and important theoretical milestones." An open treasure chest filled with glowing gold illustrates these "spoils". The text concludes by teasing an "exciting milestone."Text overlay: "An impact measure would be the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things – without assuming anything about the objective. This is a rare property among approaches." The text lurks above an illustration paying homage to the iconic Gandalf-vs-Balrog scene in Moria. The demon's whip ends in a giant paperclip, a metaphor for a misaligned artificial intelligence. Handwritten: "We have our bearing. Let us set out together." The interior of a cozy, hobbit-hole-like room with a round door open to a sunny landscape. Sunlight streams in, illuminating the tiled floor. Text over the view reads, "towards a new impact measure" and is rendered in a Tolkienesque font.
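To make the bounded-search idea above concrete, here is a minimal toy sketch in Python (illustrative only, not part of the original comic; the `Candidate` class, the pinkness and distance numbers, and the `good_enough` threshold are all invented for the example). Frank only ranks objects within the current search radius, and we widen the radius until the best nearby option satisfies us.

```python
# Toy sketch (not from the original post) of the bounded-search idea:
# instead of returning the globally pinkest object, Frank only considers
# objects within a search radius, which we widen until we're satisfied.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    pinkness: float  # higher is pinker
    distance: float  # how far Frank must go to fetch it

def pinkest_within(candidates, radius):
    """Return the pinkest candidate no farther than `radius`, or None."""
    nearby = [c for c in candidates if c.distance <= radius]
    return max(nearby, key=lambda c: c.pinkness, default=None)

def bounded_search(candidates, radii, good_enough):
    """Widen the search radius step by step, stopping once we're satisfied."""
    for radius in radii:
        best = pinkest_within(candidates, radius)
        if best is not None and best.pinkness >= good_enough:
            return best, radius
    return None, None

objects = [
    Candidate("light pink marble", pinkness=0.4, distance=1.0),
    Candidate("magenta marble", pinkness=0.9, distance=3.0),
    Candidate("terrorists' pink flag", pinkness=1.0, distance=50.0),
]
best, radius = bounded_search(objects, radii=[1, 2, 4, 8], good_enough=0.8)
print(best, radius)  # finds the magenta marble at radius 4, never the distant hazard
```

The point of the sketch is only that the faraway, dangerous object never even enters the comparison unless we deliberately widen the radius that far.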



Note

This sequence is written to be broadly accessible, although perhaps its focus on capable AI systems assumes familiarity with basic arguments for the importance of AI alignment. The technical appendices are an exception, targeting the technically inclined.

Why do I claim that an impact measure would be “the first proposed safeguard which maybe actually stops a powerful agent with an imperfect objective from ruining things—without assuming anything about the objective”?

The safeguard proposal shouldn’t have to say “and here we solve this opaque, hard problem, and then it works.” If we have the impact measure, we have the math, and then we have the code.

So what about:

Quantilizers
This seems to be the most plausible alternative; mild optimization and impact measurement share many properties (a toy sketch of quantilization appears after this list). But:
  • What happens if the agent is already powerful? A greater proportion of plans could be catastrophic, since the agent is in a better position to cause them.
  • Where does the base distribution come from (opaque, hard problem?), and how do we know it’s safe to sample from?
  • In the linked paper, Jessica Taylor suggests the idea of learning a human distribution over actions. How robustly would we need to learn this distribution? How numerous are catastrophic plans, and what is a catastrophe, defined without reference to our values in particular? (That definition requires understanding impact!)
Value learning
Corrigibility
At present, I’m excited about this property because I suspect it has a simple core principle. But:
  • Even if the system is responsive to correction (and non-manipulative, and whatever other properties we associate with corrigibility), what if we become unable to correct it as a result of early actions—if the agent “moves too quickly”, so to speak?
  • Paul Christiano’s take on corrigibility is much broader and an exception to this critique.
    • What is the core principle?
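For readers who haven't seen quantilization before, the sketch below gives a rough Monte Carlo flavor of the idea (illustrative only, not taken from the linked paper; the Gaussian base distribution, the proxy utility, and the parameter values are made up). Instead of taking the highest-scoring action outright, a q-quantilizer returns an action from the top q fraction of draws from a base distribution, ranked by the possibly imperfect utility function.

```python
# Toy sketch (illustrative, not from the linked paper) of a q-quantilizer:
# sample actions from a base distribution, then pick uniformly from the
# top q fraction as ranked by the (possibly imperfect) utility function.
import random

def quantilize(base_sample, utility, q=0.1, n=1000, rng=random):
    """Draw n actions from the base distribution; return one of the top q fraction by utility."""
    actions = [base_sample() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    top = actions[: max(1, int(q * n))]
    return rng.choice(top)

# Made-up example: the base distribution is "roughly what a human would pick"
# (numbers near 5), and the proxy utility just rewards bigger numbers.
action = quantilize(
    base_sample=lambda: random.gauss(5, 1),
    utility=lambda x: x,
    q=0.05,
)
print(action)  # an unusually good draw from the base distribution, not an arbitrary extreme
```

The critique above still applies: whatever safety this buys depends entirely on where the base distribution comes from and whether it is safe to sample from.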

  • The three sections of this sequence will respectively answer three questions:

    1. Why do we think some things are big deals?
    2. Why are capable goal-directed AIs incentivized to catastrophically affect us by default?
    3. How might we build agents without these incentives?
  • The first part of this sequence focuses on foundational concepts crucial for understanding the deeper nature of impact. We will not yet be discussing what to implement.

  • I strongly encourage completing the exercises. At times you shall be given a time limit; it’s important to learn not only to reason correctly, but with speed:

The best way to use this book is not to simply read it or study it, but to read a question and stop. Even close the book. Even put it away and think about the question. Only after you have formed a reasoned opinion should you read the solution. Why torture yourself thinking? Why jog? Why do push-ups?

If you are given a hammer with which to drive nails at the age of three you may think to yourself, “OK, nice.” But if you are given a hard rock with which to drive nails at the age of three, and at the age of four you are given a hammer, you think to yourself, “What a marvellous invention!” You see, you can’t really appreciate the solution until you first appreciate the problem.

  • My paperclip-Balrog illustration is metaphorical: A good impact measure would hold steadfast against the daunting challenge of formally asking a powerful agent for the right thing. The illustration does not represent an internal conflict within that agent. Just as water flows downhill, an impact-penalizing Frank prefers low-impact plans.

  • Some of you may have a different conception of impact; I ask that you grasp the thing that I’m pointing to. In doing so, you might come to see that your mental algorithm is the same. Ask not “is this what I initially had in mind?”, but rather “does this make sense to call ‘impact’?”.

Thanks

Thanks to Rohin Shah for suggesting the three key properties. Alison Bowden contributed several small drawings and enormous help with earlier drafts.