
"In 'The Gears of Impact', we discussed how your attainable utility calculation roughly takes the best of different possibilities." A flashback cloud shows a map with paths to goals: relaxing at home, buying groceries, or hiking. A stick figure asks, "How do different AUs interact with the environment, and how does the environment interact with us?"

The sentence "There's a lot to think about when staking out a settlement."

A simple landscape with questions about resources and corresponding "AU meters". Near mountains with gems: "How plentiful are trade goods?" (meter is full). By a blue river: "How accessible is potable water?" (meter is high). Next to grass: "How fertile is the soil?" (meter is medium).

Handwritten text: "These considerations are proxies for future prosperity. Each is an AU for a different goal (e.g. trade good acquisition), conditioned on possibilities going through this part of the world."

A stick figure with a small tool looks puzzled at a large rock with golden ore veins. Text reads: "For example, if the hills run rich with ore inaccessible to your equipment, then this isn't beneficial until later."

Handwritten text with a yellow-highlighted title: "The attainable utility landscape." It's defined as the "attainable utilities of all kinds of different goals." An example explains a trade-off: cultivating the soil raises one AU, but soiling the water lowers another.

An illustration of environmental resources with corresponding meters. "How plentiful are trade goods?" next to a mountain with gems shows a full green meter. "How accessible is potable water?" shows a low red meter. "How fertile is the soil?" shows a full green meter.

Text: "Exercise: What are various attainable utilities like on the moon?"

(One interpretation of the prompt is that you haven’t chosen to go to the moon. If you imagined yourself as more prepared, that’s also fine.)

If you were plopped onto the moon, you’d die pretty fast. Maybe the “die as quickly as possible” AU is high, but not much else—not even the “live on the moon” AU! We haven’t yet reshaped the AU landscape on the moon to be hospitable to a wide range of goals. Earth is special like that.

Handwritten text: "When we think about the world, we usually think about the world state first, and only then imagine what can be done with it. The AU landscape inverts this by instead taking 'ability to do things' as primary, thus considering the world state details to be secondary. This is nice."

Handwritten text reads: "Imagine the ocean submerges a forest." Below is a cartoon drawing of a dense green forest completely underwater, with blue waves on the surface above it.

Handwritten text asks: "What's happened to the survival AU in the forest?" The answer, "Depends on who's asking:", is followed by two examples: "Deer" (decrease) while "Fish" (increase).

Handwritten text reads: "Events have asymmetric impact on agents, depending on their: Capabilities • Goals • Vantage point • Knowledge." It continues: "Instead of seeing a flood and thinking “ugh, that’s probably bad?”, we can use the AU landscape to cleanly disentangle and understand these effects."

Attainable utilities are calculated by winding your way through possibility-space, considering and discarding possibility after possibility to find the best plan you can. This frame is unifying.

Sometimes you advantage one AU at the cost of another, moving through the state space towards the best possibilities for one goal and away from the best possibilities for another goal. This tradeoff is opportunity cost.

Hand-drawn diagram illustrating opportunity cost with three trade-offs. "TV vs books" shows a person watching TV with a book dangling unread from their hand. "Hiking vs minimizing distance to nearest airport" shows a hiker 5 miles from an airport. "Reading this post vs sleeping" shows this article on a computer vs a person asleep at 3:10.

Sometimes you gain more control over the future: most of the best possibilities make use of a windfall of cash. Sometimes you act to preserve control over the future: most Tic-Tac-Toe goals involve not ending the game right away. Otherwise put: preserving power.

A game tree for Tic-Tac-Toe showing how choices affect future possibilities. A central board state branches into many possible future game states. Some branches lead to terminal states while others continue to branch, illustrating the preservation of options.

Other people usually objectively impact you by decreasing or increasing a bunch of your AUs (generally, by changing your power). This happens for an extremely wide range of goals because of the structure of the environment.

Sometimes, the best possibilities are made unavailable or worsened only for goals much like yours. This is value impact on your goals.

Text: "Value impact" is "important to agents like you." "Objective impact" is "important to agents in general," illustrated by an asteroid hitting Earth and two agents using money for different goals: "I can buy John a gift!" and "I can buy pebbles!"

Sometimes a bunch of the best possibilities go through the same part of the future: fast travel to random places on Earth usually involves the airport. This commonality is instrumental convergence.

An illustration of instrumental convergence on a globe. Multiple flight paths to different destinations all originate from a single airport. A nearby arrow notes "you start here [a small distance from the airport]."

Exercise

Track what’s happening to your various AUs during the following story: you win the lottery. Being an effective spender, you use most of your cash to buy a majority stake in a major logging company. Two months later, the company goes under.


In the context of finite deterministic Markov decision processes, there’s a wonderful handful of theorems which basically say that the AU landscape and the environmental dynamics encode each other. That is, they contain the same information, just with different emphasis. This supports thinking of the AU landscape as a “dual” of the world state.

Quote

Let $M = \langle \mathcal{S}, \mathcal{A}, T, \gamma \rangle$ be a rewardless deterministic MDP with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, deterministic transition function $T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, and discount factor $\gamma \in [0, 1)$. As our interest concerns optimal value functions, we consider only stationary, deterministic policies: $\pi \in \Pi$, the set of functions $\mathcal{S} \to \mathcal{A}$.

The first key insight is to consider not policies, but the trajectories induced by policies from a given state; to not look at the state itself, but the paths through time available from the state. We concern ourselves with the possibilities available at each juncture of the MDP.

To this end, for each $\pi \in \Pi$, consider the mapping $\pi \mapsto f^{\pi}$ (where $f^{\pi}(s) := \sum_{t=0}^{\infty} \gamma^{t}\, \mathbf{e}_{s_{t}}$, with $s_{0} := s$ and $s_{t+1} := T(s_{t}, \pi(s_{t}))$); in other words, each policy maps to a function taking each state $s$ to a discounted state visitation frequency vector $f^{\pi}(s) \in \mathbb{R}^{|\mathcal{S}|}$, which we call a possibility. The meaning of each frequency vector is: starting in state $s$ and following policy $\pi$, what sequence of states do we visit in the future? States visited later in the sequence are discounted according to $\gamma$: the sequence $s_{1} s_{2} s_{3} \ldots$ would induce $1$ visitation frequency on $s_{1}$, $\gamma$ visitation frequency on $s_{2}$, and $\gamma^{2}$ visitation frequency on $s_{3}$.

A maze represents the space of possibilities, with a pink dashed line showing one specific path being taken. Above, text reads: "Each [state visit distribution] f is a possible path through time."

The possibility function $F$ outputs the possibilities available at a given state $s$, $F(s) := \{ f^{\pi}(s) \mid \pi \in \Pi \}$:

Figure 1: "A simple example." A state diagram shows blue node 1 with arrows to red node 2 and black node 3. Nodes 2 and 3 have self-loops. Below, equations for the possibility function F at each state are: F(1) = { [1 (blue), γ/(1-γ) (red), 0 (black)], [1, 0, γ/(1-γ)] }, F(2) = { [0, 1/(1-γ), 0] }, F(3) = { [0, 0, 1/(1-γ)] }.

Put differently, the possibilities available are all of the potential film-strips of how-the-future-goes you can induce from the current state.

A film strip illustrating a "possibility" or a "film-strip of how-the-future-goes." Each frame shows a successive move in a single game of tic-tac-toe, from an early move to the game's completion.

We say two rewardless MDPs $M_{1}$ and $M_{2}$ are isomorphic up to possibilities if they induce the same possibilities. Possibility isomorphism captures the essential aspects of an MDP's structure, while being invariant to state representation, state labelling, action labelling, and the addition of superfluous actions (actions whose results are duplicated by other actions available at that state). Formally, $M_{1}$ and $M_{2}$ are isomorphic up to possibilities when there exists a bijection $\phi : \mathcal{S}_{1} \to \mathcal{S}_{2}$ (letting $P_{\phi}$ be the corresponding $|\mathcal{S}_{1}|$-by-$|\mathcal{S}_{1}|$ permutation matrix) satisfying $F_{2}(\phi(s)) = \{ P_{\phi} f \mid f \in F_{1}(s) \}$ for all $s \in \mathcal{S}_{1}$.
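As a hedged sketch of the permutation-matrix condition above (my own encoding, not the post's): represent each possibility function as a dict from state indices to lists of visitation vectors, and a candidate bijection $\phi$ as a dict between state indices.

```python
import numpy as np

# Sketch of the possibility-isomorphism check: M1 and M2 are isomorphic up to
# possibilities via a state bijection phi exactly when permuting every
# possibility of M1 by phi reproduces M2's possibility sets.

def same_vector_set(A, B):
    """True iff the two lists contain the same vectors, as sets (float-tolerant)."""
    return len(A) == len(B) and all(any(np.allclose(a, b) for b in B) for a in A)

def isomorphic_up_to_possibilities(F1, F2, phi):
    """Check F2(phi(s)) == {P_phi f : f in F1(s)} for every state s of M1."""
    n = len(phi)
    P = np.zeros((n, n))                  # permutation matrix P_phi
    for s, s2 in phi.items():
        P[s2, s] = 1.0
    return all(same_vector_set([P @ f for f in F1[s]], F2[phi[s]]) for s in F1)

# Demo: the Figure 1 MDP is isomorphic to itself under the bijection swapping
# states 2 and 3 (indices 1 and 2), since its possibilities are symmetric in them.
g = 0.9
F = {0: [np.array([1.0, g / (1 - g), 0.0]), np.array([1.0, 0.0, g / (1 - g)])],
     1: [np.array([0.0, 1 / (1 - g), 0.0])],
     2: [np.array([0.0, 0.0, 1 / (1 - g)])]}
print(isomorphic_up_to_possibilities(F, F, {0: 0, 1: 2, 2: 1}))   # True
```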

This isomorphism is a natural contender¹ for the canonical (finite) MDP isomorphism:

A theorem I proved

$M_{1}$ and $M_{2}$ are isomorphic up to possibilities iff their directed graphs are isomorphic (and they have the same discount rate).

Suppose I give you the following possibility sets, each containing the possibilities for a different state:

Exercise

What can you figure out about the MDP structure? Hint: each entry in the column vector corresponds to the visitation frequency of a different state; the first entry is always state $s_{1}$'s, the second $s_{2}$'s, and the third $s_{3}$'s.

You can figure out everything: the entire rewardless MDP, up to possibility isomorphism.

Solution

How? Well, the $\ell_{1}$ norm of each possibility vector is always $\frac{1}{1-\gamma}$, so you can deduce $\gamma$ easily. The state with only a single possibility must be isolated (it can do nothing but loop back to itself), so we can mark that down in our graph. Also, it corresponds to the third entry.

The other two states correspond to the "1" entries in their possibilities (a possibility always assigns visitation frequency $1$ to its starting state), so we can mark that down. The rest follows straightforwardly.
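The first step of that solution is just arithmetic; here's a tiny sketch using a Figure 1 possibility as stand-in data (the exercise's own vectors appeared in an image): since $\lVert f \rVert_{1} = 1 + \gamma + \gamma^{2} + \cdots = \frac{1}{1-\gamma}$, we get $\gamma = 1 - \frac{1}{\lVert f \rVert_{1}}$.

```python
import numpy as np

# Recover gamma from a single possibility vector: its l1 norm is always 1/(1 - gamma).
gamma_true = 0.9
f = np.array([1.0, gamma_true / (1 - gamma_true), 0.0])   # one possibility from F(1)

gamma_recovered = 1 - 1 / np.sum(np.abs(f))
print(round(gamma_recovered, 6))   # 0.9, up to floating point
```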

Theorem

Suppose the rewardless MDP $M$ has possibility function $F$. Given only $F$,² $M$ can be reconstructed up to possibility isomorphism.

In MDPs, the "AU landscape" is the set of optimal value functions for all reward functions over states in that MDP. If you know the optimal value functions for just $|\mathcal{S}|$ reward functions, you can also reconstruct the rewardless MDP structure.³

From the environment (rewardless MDP), you can deduce the AU landscape (all optimal value functions) and all possibilities. From possibilities, you can deduce the environment and the AU landscape. From the AU landscape, you can deduce the environment (and thereby all possibilities).

A diagram showing that "Rewardless MDP," "Optimal value functions," and "Possibilities" are equivalent concepts. The three terms are arranged in a triangle, connected by double-headed arrows in a cycle to show they can all be derived from one another and encode the same information.
All of these encode the same mathematical object.
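One arrow of this triangle can be spelled out directly: given the possibilities at a state, the optimal value (the AU) for any state-based reward vector $r$ is just the best dot product, $V^{*}_{r}(s) = \max_{f \in F(s)} f \cdot r$. Below is a short sketch of that computation (my code and reward choice, not the post's), reusing the Figure 1 possibilities.

```python
import numpy as np

# "Possibilities => AU landscape": the optimal value of a state for a state-based
# reward vector r is the best discounted return among its possibilities.
gamma = 0.9

# Possibilities for the Figure 1 MDP (states 1, 2, 3 as indices 0, 1, 2).
F = {
    0: [np.array([1.0, gamma / (1 - gamma), 0.0]),
        np.array([1.0, 0.0, gamma / (1 - gamma)])],
    1: [np.array([0.0, 1 / (1 - gamma), 0.0])],
    2: [np.array([0.0, 0.0, 1 / (1 - gamma)])],
}

def attainable_utility(s, r):
    """Optimal value (AU) of state s for reward vector r, computed from possibilities."""
    return max(f @ r for f in F[s])

r_red = np.array([0.0, 1.0, 0.0])              # reward 1 for occupying the red state
print(round(attainable_utility(0, r_red), 3))  # gamma / (1 - gamma) = 9.0
print(round(attainable_utility(2, r_red), 3))  # 0.0: the black state can never reach red
```

Sweeping over different reward vectors $r$ traces out the whole AU landscape of this little world.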

Opportunity cost is when an action you take makes you more able to achieve one goal but less able to achieve another. Even this simple world has opportunity cost:

A state diagram illustrating opportunity cost. A central black state connects to a purple state on the left and a green state on the right. To move between the purple and green states, one must pass through the black state. The purple and green states also have self-loops.

Going to the green state means you can’t get to the purple state as quickly.
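To put a number on that, here's a back-of-the-envelope sketch (my own toy computation, with an assumed $\gamma = 0.9$): reward 1 for occupying the purple state, comparing the plan that heads straight to purple with the plan that detours through green first.

```python
# Opportunity cost in the three-state world: detouring through green delays
# the purple reward, so the purple-goal AU of that plan is strictly lower.
gamma = 0.9

def discounted_return(states, reward_state, tail_state=None, horizon=500):
    """Discounted return of a state sequence; after it ends, loop on tail_state."""
    seq = list(states)
    if tail_state is not None:
        seq += [tail_state] * (horizon - len(seq))
    return sum(gamma ** t for t, s in enumerate(seq) if s == reward_state)

# Starting at black: go straight to purple and stay, vs. detour through green first.
straight = discounted_return(["black"], "purple", tail_state="purple")
detour = discounted_return(["black", "green", "black"], "purple", tail_state="purple")
print(round(straight, 2), round(detour, 2))   # 9.0 vs 7.29: the detour costs purple-AU
```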

On a deep level, why is the world structured such that this happens? Could you imagine a world without opportunity cost of any kind? The answer, again in the rewardless MDP setting, is simple: “yes, but the world would be trivial: you wouldn’t have any choices.” Using a straightforward formalization of opportunity cost, we have:

Existence of opportunity cost

Opportunity cost exists in an environment iff there is a state with more than one possibility.

Philosophically, opportunity cost exists when you have meaningful choices. When you make a choice, you’re necessarily moving away from some potential future but towards another; since you can’t be in more than one place at the same time, opportunity cost follows. Equivalently, we assumed the agent isn’t infinitely farsighted ($\gamma < 1$); if it were, it would be possible to be in “more than one place at the same time”, in a sense (thanks to Rohin Shah for this interpretation).

While understanding opportunity cost may seem like a side-quest, each insight is another brick in the edifice of our understanding of the incentives of goal-directed agency.

  • Just as game theory is a great abstraction for modeling competitive and cooperative dynamics, the AU landscape is great for thinking about consequences: it automatically excludes irrelevant details about the world state. We can think about the effects of events without needing a specific utility function or ontology to evaluate them. In multi-agent systems, we can straightforwardly predict the impact the agents have on each other and the world.
  • “Objective impact to a location” means that agents whose plans route through the location tend to be objectively impacted.
  • The landscape is not the territory: AU is calculated with respect to an agent’s beliefs, not necessarily with respect to what really “could” or will happen.

  1. The possibility isomorphism is new to my work, as are all other results shared in this post. This apparent lack of basic theory regarding MDPs is strange; even stranger, this absence was actually pointed out in two published papers!

    I find the existing MDP isomorphisms / equivalences to be pretty lacking. The details don’t fit in this margin, but perhaps in a paper at some point. If you want to coauthor this (mainly compiling results, finding a venue, and responding to reviews), let me know. Added later: The results are available in the appendices of my dissertation.

  2. In fact, you can reconstruct the environment using only a limited subset of possibilities: the non-dominated possibilities.

  3. As a tensor, the transition function $T$ has size $|\mathcal{S}|^{2}|\mathcal{A}|$, while the AU landscape representation only has size $|\mathcal{S}|^{2}$. However, if you’re just representing $T$ as a transition function, it has size $|\mathcal{S}||\mathcal{A}|$.