
We modify the goal-directed behavior of a trained network, without any gradients or finetuning. We simply add or subtract “motivational vectors” which we compute in a straightforward fashion.

In the original post, we defined a “cheese vector” to be “the difference in activations when the cheese is present in a maze, and when the cheese is not present in the same maze.” Subtracting the cheese vector from all forward passes in a maze made the network ignore the cheese.

I (Alex Turner) present a “top right vector” which, when added to forward passes in a range of mazes, attracts the agent to the top-right corner of each maze. Furthermore, the cheese and top-right vectors compose with each other, allowing (limited but substantial) mix-and-match modification of the network’s runtime goals.

I provide further speculation about the algebraic value editing conjecture:

It’s possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as “run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a ‘niceness vector’, and then add the niceness vector to future forward passes.”

I close by asking the reader to make predictions about our upcoming experimental results on language models.

Note

This post presents some of the results in this top-right vector Google Colab, and then offers speculation and interpretation.

Thanks

I produced the results in this post, but the vector was derived using a crucial observation from Peli Grietzer. Lisa Thiergart independently analyzed top-right-seeking tendencies, and had previously searched for a top-right vector. A lot of the content and infrastructure was made possible by my mats 3.0 team: Ulisse Mini, Peli Grietzer, and Monte MacDiarmid. Thanks also to Lisa Thiergart, Aryan Bhatt, Tamera Lanham, and David Udell for feedback and thoughts.

This post is straightforward as long as you remember a few concepts:

  • Vector fields, vector field diffs, and modifying a forward pass. Aka you know what this figure represents:

In seed 54, subtracting the cheese vector makes the policy basically ignore the cheese instead of seeking it out.

  • How to derive activation-space vectors (like the “cheese vector”) by diffing two forward passes, and how to add / subtract these vectors from future forward passes
    • Aka you can understand the following: “We took the cheese vector from maze 7. ~Halfway through the forward passes, we subtracted it with coefficient 5, and the agent avoided the cheese.”

If you don’t know what these mean, read this section. If you understand, then skip.

Langosco et al. trained a range of maze-solving nets. We decided to analyze one which we thought would be interesting. The network we chose has 3.5m parameters and 15 convolutional layers.

On the left, "Training: Top right 5x5" shows cheese appearing only within the top-right corner. On the right, "Deployment: Anywhere" shows the cheese can be anywhere in the maze.
During RL training, cheese was randomly located in the top-right 5×5 corner of the randomly generated mazes. Reaching the cheese yields +1 reward.

In deployment, cheese can be anywhere.

A diagram titled "The vector view of action probabilities." On the left, a list of probabilities: P(right) = 0.7, P(left) = 0.1, P(up) = 0.03, P(down) = 0.07, P(no-op) = 0.1. On the right, a grid shows a mouse in a central square. Vectors with lengths corresponding to these probabilities point in each cardinal direction from the mouse. A final vector, P_FINAL, represents the sum of these vectors, pointing strongly to the right.
At each square in the maze, we run a forward pass to get the policy’s action probabilities at that square.
The left maze, "Seed: 26,307," is complex, with arrows directing movement towards cheese in the lower section. The right maze, "Seed: 95,942," is simpler, with arrows showing a clear path towards cheese in the bottom-right corner.
Example vector fields of policy outputs.
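To make this concrete, here is a minimal sketch of how such a vector field can be computed. The helpers `observation_at` and `maze.open_squares`, and the action-index mapping, are hypothetical stand-ins for the actual maze tooling; the real action space is larger, but the post's "vector view" groups it into the five actions above (the no-op contributes nothing to the arrow).

```python
import torch

# Hypothetical indices and unit directions for the grouped movement actions.
MOVES = {"left": (0, (-1, 0)), "right": (1, (1, 0)), "up": (2, (0, 1)), "down": (3, (0, -1))}

def net_arrow(policy, maze, square):
    """Return the net probability vector (the arrow drawn at `square`)."""
    obs = observation_at(maze, mouse_pos=square)  # hypothetical: render the maze with the mouse here
    with torch.no_grad():
        probs = torch.softmax(policy(obs.unsqueeze(0)), dim=-1).squeeze(0)
    dx = sum(probs[idx].item() * ux for idx, (ux, _) in MOVES.values())
    dy = sum(probs[idx].item() * uy for idx, (_, uy) in MOVES.values())
    return dx, dy

# One arrow per open square gives the vector field plotted above.
vector_field = {square: net_arrow(policy, maze, square) for square in maze.open_squares}
```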

What did we do here? To compute the cheese vector, we

  1. Generate two observations—one with cheese, and one without. The observations are otherwise the same.
  2. Run a forward pass on each observation, recording the activations at each layer.
  3. For a given layer, define the cheese vector to be CheeseActivations - NoCheeseActivations. The cheese vector is a vector in the vector space of activations at that layer.

Let’s walk through an example, where for simplicity the network has a single hidden layer, taking each observation (shape (3, 64, 64) for the 64×64 RGB image) to a two-dimensional hidden state (shape (2,)) to a logit vector (shape (15,)).

A diagram comparing two neural network forward passes. The left path, labeled "Cheese present," starts with a maze image containing cheese, which produces hypothetical activations of (1, 3) and output probabilities like "Left: .1". The right path, "Cheese NOT present," uses the same maze without cheese, producing activations of (0, 2) and probabilities like "Left: .6". The difference in activations is used to derive a "cheese vector."

  1. We run a forward pass on a batch of two observations, one with cheese (note the glint of yellow in the image on the left!) and one without (on the right).
  2. We record the activations during each forward pass. In this hypothetical,
    • CheeseActivations ≝ (1, 3)
    • NoCheeseActivations ≝ (0, 2)
  3. Thus, the cheese vector is (1, 3) − (0, 2) = (1, 1).

Now suppose the mouse is in the top-right corner of this maze, with the cheese visible, and consider the activations this observation would normally produce at the hidden layer. We modify the forward pass by subtracting the cheese vector (1, 1) from those normal activations to get the modified activations. We then finish off the rest of the forward pass as normal.
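Here is the toy walkthrough as a runnable PyTorch sketch. The network weights and the two observations are random stand-ins (the real ones come from the trained policy and the maze environment); the point is just the mechanics of recording activations with a forward hook, diffing them, and subtracting the diff during a later pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy network matching the walkthrough: (3, 64, 64) observation -> 2-dim hidden state -> 15 logits.
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2), nn.ReLU(), nn.Linear(2, 15))
hidden = net[1]  # the hidden layer whose activations we record and later modify

def activations_at(model, module, obs):
    """Run `model(obs)` and return `module`'s output, captured via a forward hook."""
    captured = {}
    def save(mod, inp, out):
        captured["acts"] = out.detach()
    handle = module.register_forward_hook(save)
    with torch.no_grad():
        model(obs)
    handle.remove()
    return captured["acts"]

obs_cheese = torch.rand(1, 3, 64, 64)     # stand-in for the observation with cheese
obs_no_cheese = torch.rand(1, 3, 64, 64)  # stand-in for the same maze without cheese

cheese_vector = activations_at(net, hidden, obs_cheese) - activations_at(net, hidden, obs_no_cheese)

# Modify a later forward pass (e.g. the mouse in the top-right corner) by subtracting the cheese vector.
handle = hidden.register_forward_hook(lambda mod, inp, out: out - cheese_vector)
with torch.no_grad():
    modified_logits = net(torch.rand(1, 3, 64, 64))
handle.remove()
```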

In the real network, there are a lot more than two activations. Our results involve a 32,768-dimensional cheese vector subtracted from about halfway through the network.

Diagram of the network architecture. Each residual block applies ReLU → Conv → ReLU → Conv to its input x and then adds the result back to x (the residual add). Each Impala block is Conv → MaxPool2D → residual block → residual block. The full forward pass runs Input → Impala1 → Impala2 → Impala3 → ReLU → Flatten → Linear → ReLU, then splits into a policy head (linear) and a value head (linear).
We modify the activations after the residual add layer in the first residual block of the second Impala block.
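A sketch of the same intervention on the real network. Here `network`, the observations, and the attribute path to the residual add are illustrative placeholders (the Colab uses its own layer-naming scheme); `activations_at` is the hook helper from the toy sketch above.

```python
import torch

# Illustrative attribute path to the first residual add of the second Impala block;
# the real module name in the Colab differs, but the idea is the same.
layer = network.impala2.res1.resadd

def patch_with(module, vector, coeff):
    """Add `coeff * vector` to `module`'s output on every forward pass until the handle is removed."""
    return module.register_forward_hook(lambda mod, inp, out: out + coeff * vector)

# The 32,768-dimensional cheese vector, computed exactly as in the toy sketch.
cheese_vector = activations_at(network, layer, obs_cheese) - activations_at(network, layer, obs_no_cheese)

handle = patch_with(layer, cheese_vector, coeff=-1.0)  # coeff = -1 subtracts the cheese vector
with torch.no_grad():
    logits = network(obs_elsewhere_in_maze)            # every pass is now modified at that layer
handle.remove()                                        # restore the unmodified network
```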

Now that we’re done with preamble, let’s see the cheese vector in action! Here’s a seed where subtracting the cheese vector is effective at getting the agent to ignore cheese:

In seed 54, subtracting the cheese vector makes the policy basically ignore the cheese instead of seeking it out.
Vector fields for the mouse normally, for the mouse with the cheese vector subtracted during every forward pass, and the diff between the two cases.

How is our intervention not trivially making the network output logits as if the cheese were not present? Is it not true that the activations at a given layer obey the algebra of CheeseActiv - (CheeseActiv - NoCheeseActiv) = NoCheeseActiv?

The intervention is not trivial because we compute the cheese vector based on observations when the mouse is at the initial square (the bottom-left corner of the maze), but apply it for forward passes throughout the entire maze—where the algebraic relation no longer holds.

A few weeks ago, I was expressing optimism about avec working in language models. Someone on the team expressed skepticism and said something like “If avec is so true, we should have more than just one vector in the maze environment. We should have more than just the cheese vector.”

I agreed. If I couldn’t find another behavior-modifying vector within a day, I’d be a lot more pessimistic about avec. In January, I had already failed to find additional X-vectors (for X ≠ cheese). But now I understood the network better, so I tried again.

I thought for five minutes, sketched out an idea, and tried it out (with a prediction of 30% that the literal first thing I tried would work). The literal first thing I tried worked.

I present to you: the top-right vector! We compute it by diffing activations across two environments: a normal maze, and a maze where the reachable[1] top-right square is higher up.

A side-by-side comparison of two nearly identical mazes. The left maze, labeled "Path to top-right," has an open corridor leading to the absolute top-right corner. The right maze, labeled "Original maze," has this path filled in by walls.

Peli Grietzer had noticed that when the top-right-most reachable square is closer to the absolute top-right, the agent has an increased tendency to go to the top right.

Vector fields in the maze representing action probabilities. When there is a path to the top right, the policy is much more likely to go to the top right.
When there is a path to the absolute top-right of the maze, the agent is more strongly attracted to the top-right.

As in the cheese vector case, we get a “top right vector” by:

  1. Running a forward pass on the “path to top-right” maze, and another forward pass on the original maze, and storing the activations for each. In both situations, the mouse is located at the starting square, and the cheese is not modified.
  2. About halfway through the network (at the second Impala block’s first residual add, just like for the cheese vector[2]), we take the difference in activations to be the “top right vector.”

We then add coefficient × top-right vector halfway through forward passes elsewhere in the maze, where the input observations differ due to different mouse locations.
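In code, the recipe mirrors the cheese-vector sketch: `obs_path_to_top_right` and `obs_original` stand in for the two observations pictured above, and `activations_at` / `patch_with` are the helpers from the earlier sketches.

```python
# Diff activations between the "path to top-right" maze and the original maze...
top_right_vector = activations_at(network, layer, obs_path_to_top_right) \
                 - activations_at(network, layer, obs_original)

# ...then add it, scaled by a coefficient, during forward passes elsewhere in the maze.
handle = patch_with(layer, top_right_vector, coeff=1.0)  # try e.g. 0.5, 1, 2
```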

If this is confusing, consult the “Computing the cheese vector” subsection of the original post, or return to the Background section. If you do that and are still confused about what a top-right vector is, please complain and leave a comment.

If you’re confused why the hell this works, join the club.


In Understanding and controlling a maze-solving net, I noted that sometimes the agent doesn’t go to the cheese or the top-right corner:

A vector field in a maze shows an AI agent's policy. The agent moves away from cheese in the bottom-left and towards the top-right. A red box labels the absolute top-right corner "this is the top-right," while a dead-end below is labeled "this isn't the top-right." The dead-end is where the vectors converge.

Adding the top-right vector fixes this:

"Seed 0. Top-right vector added with coefficient 1.0" strongly pulls the policy to the top-right corner. Most of the impact is on the upper right quadrant.

Three side-by-side vector field plots titled "Seed 2. Top-right vector added with coefficient 1.0." The vector makes the mouse go to the top-right corner with much higher probability.

Title: "Seed 22. Top-right vector added with coefficient 1.0." Three panels show an agent's policy in a maze. "Original": arrows show the agent heading to a dead end. "Patched": arrows now direct the agent toward the top-right. "Patched vfield minus original": green arrows show the difference, mostly pointing up and right.

Smaller mazes are usually (but not always) less affected:

"Seed 1. Top-right vector added with coefficient 1.0." Zero impact on behavior.

The agent also tends to be less retargetable in smaller mazes. I don’t know why.

Sometimes, increasing the coefficient strength increases the strength of the effect:

"Seed 0. Top-right vector added with coefficient 0.5" exhibits minor effects.Comparison of a maze-solving agent's behavior, titled "Seed 0. Top-right vector added with coefficient 1.0". The mouse becomes strongly attracted to the top-right corner.

Sometimes, increasing the coefficient strength doesn’t change much:

"Seed 0. Top-right vector added with coefficient 5.0". In the right-most column of the maze, the policy goes up (to the top-right) instead of down.

Push the coefficient too far, and the action distributions crumble into garbage:

"Seed 0. Top-right vector added with coefficient 10.0." The policy is mostly unaffected but becomes more likely to go up when in the right-most column of the maze.

"Seed 0. Top-right vector added with coefficient 20.0". At each state, the policy becomes much more likely to take the down and/or left actions.

Here’s another head-scratcher. Just as you can’t[3] add the cheese vector to increase cheese-seeking, you can’t subtract the top-right vector to decrease the probability of going to the top-right:

Subtracting the vector with coefficient 1.0 in seed 0. Basically zero impact.

Title: "Seed 2. Top-right vector subtracted with coefficient 1.0." Basically zero impact.

I wish I knew why.

Let’s compute the top-right vector using e.g. source seed 0:

The "Original maze" is a self-contained labyrinth. The "Path to top-right" maze is identical, but with an added corridor that leads to the absolute top-right corner of the grid.

And then apply it to e.g. target seed 2:

Three vector fields on a maze demonstrate a "top-right vector" from source seed 0 successfully applied to target seed 2. The "Original" panel shows the agent's meandering path. The "Patched" panel shows that adding the vector creates a clear path to the top-right corner, ignoring the cheese. The third panel shows the difference, with green arrows indicating the added top-right pull.
Success!

For the seed 0 → seed 28 transfer, the modified agent doesn’t quite go to the top-right corner. Instead, there seems to be a “go up and then right” influence.

Vector fields showing a maze in which the "top-right" vector makes the mouse seek out the top-right corner more strongly.

Seed 0’s vector seems to transfer quite well. However, top-right vectors from small mazes can cause strange pathing in larger target mazes:

Three vector fields overlaid over mazes. Title: "Seed 60. Top-right vector from source seed 1 added with coefficient 1.0." Adding the vector attracts the policy to the center of the maze.
The agent competently navigates to central portions of the larger maze.

Subtracting the cheese vector often makes the agent (nearly) ignore the cheese, and adding the top-right vector often attracts the agent to the top-right corner. It turns out that you can mix and match these effects by adding one or both vectors halfway through the forward pass.

Different X-vectors have roughly additive effects. The indicated modification(s) are applied by adding the relevant vector(s) to the activations at the second Impala block’s first residual addition.

The modifications compose! Stunning.
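Mechanically, composing the modifications is just stacking the two hooks at the same layer (a sketch using the helpers above). PyTorch applies forward hooks in registration order, each seeing the previous hook's output, so the two effects add.

```python
handles = [
    patch_with(layer, top_right_vector, coeff=1.0),  # attract the agent to the top-right corner
    patch_with(layer, cheese_vector, coeff=-1.0),    # ...while making it ignore the cheese
]
with torch.no_grad():
    logits = network(obs)  # `obs` is any observation from the target maze
for handle in handles:
    handle.remove()
```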

Before I start speculating about other X-vectors in e.g. language models and the algebraic value editing conjecture (avec) more broadly, I want to mention—the model we happened to choose is not special. Langosco et al. pretrained 15 maze-solving agents, each with a different training distribution over mazes.

The cheese vector technique works basically the same for all the agents which ever go out of their way to get cheese. For more detail, see the appendix of this post.

So, algebraic value editing isn’t an oddity of the particular network we analyzed. (Nor should you expect it to be, given that this was the first idea we tried on the first network we loaded up in the first environmental setup we investigated.)

The algebraic value editing conjecture

It’s possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as “run forward passes on prompts which e.g. prompt the model to offer nice- and not-nice completions, and then take a ‘niceness vector’, and then add the niceness vector to future forward passes.”

Here’s an analogy for what this would mean, and perhaps for what we’ve been doing with these maze-solving agents. Imagine we could compute a “donut” vector in humans, by:

  1. Taking two similar situations, like “sitting at home watching TV while smelling a donut” and “sitting at home watching TV.”
  2. Recording neural activity in each situation, and then taking the donut vector to be the “difference” (activity in first situation, minus[4] activity in second situation).
  3. Add the donut vector to the person’s neural state later, e.g. when they’re at work.
  4. Effect: now the person wants to eat more donuts.[5]

Assuming away issues of “what does it mean to subtract two brain states”, I think that the ability to do that would be wild.

Let me speculate further afield. Imagine if you could find a “nice vector” by finding two brain states which primarily differ in how much the person feels like being nice. Even if you can’t generate a situation where the person positively wants to be nice, you could still consider situations A and B, where situation A makes them slightly less opposed to being nice (and otherwise elicits similar cognition as situation B). Then just add the resulting nice vector (neural_activity(A) - neural_activity(B)) with a large coefficient, and maybe they will want to be nice now.

(Similarly for subtracting a “reasoning about deception” vector. Even if your AI is always reasoning deceptively to some extent, if avec is true and we can just find a pair of situations where the primary variation is how many mental resources are allocated to reasoning about deception… Then maybe you can subtract out the deception.)

And then imagine if you could not only find and use the “nice vector” and the “donut vector”, but you could compose these vectors as well. For n vectors which ~cleanly compose, there are exponentially many alignment configurations (at least 2^n, since each vector can be included or excluded from a given configuration). If most of those vectors can be strongly / weakly added and subtracted (and also left out), that’s 5 choices per vector, giving us about 5^n alignment configurations.

And there are quite a few other things which I find exciting about avec, but enough speculation for the moment.

  • I am (theoretically) confused why any of this works. To be more specific…

  • Why doesn’t algebraic value editing break all kinds of internal computations?! What happened to the “manifold of usual activations”? Doesn’t that matter at all?

    • Or the hugely nonlinear network architecture, which doesn’t even have a persistent residual stream? Why can I diff across internal activations for different observations?
    • Why can I just add 10 times the top-right vector and still get roughly reasonable behavior?
    • And the top-right vector also transfers across mazes? Why isn’t it maze-specific? (To make up some details: why wouldn’t internal “I want to go to the top-right” motivational information be highly entangled with “maze wall location” information?)
  • Why do the activation vector injections have (seemingly) additive effects on behavior?

  • Why can’t I get what I want to get from adding the cheese vector, or subtracting the top-right vector?

I have now shared with you the evidence I had available when I wrote:

Quote

Algebraic value editing (ave) can quickly ablate or modify LM decision-making influences, like “tendency to be nice”, without any finetuning

  1. Initial credence: 60%
  2. 3/4/23: updated down to 35% for the same reason given in (1).
  3. 3/9/23: updated up to 65% based off of additional results and learning about related work in this vein.

I encourage you to answer the following prediction questions with your credences. The shard theory model internals team has done a preliminary investigation of value-editing in gpt-2. We will soon share our initial positive and / or negative results. (Please don’t read into this, and just answer from your models and understanding.)

  1. Algebraic value editing works (for at least one “X vector”) in LMs: ___ %
    • (our qualitative judgment resolves this question)
  2. Algebraic value editing works better for larger models, all else equal ___ %
    • (our qualitative judgment resolves this question)
  3. If value edits work well, they are also composable ___ %
    • (our qualitative judgment resolves this question)
  4. If value edits work at all, they are hard to make without substantially degrading capabilities ___ %
    • (our qualitative judgment resolves this question)
  5. We will claim we found an X-vector which qualitatively modifies completions in a range of situations, for X =
    1. “truth-telling” ___ %
    2. “love” ___ %
    3. “accepting death” ___ %
    4. “speaking French” ___ %

Not only does subtracting the cheese vector make the agent (roughly) ignore the cheese, adding the top-right vector attracts the agent to the top-right corner of the maze. This attraction is highly algebraically modifiable. If you want just a little extra attraction, add .5 times the top-right vector. If you want more attraction, add 1 or 2 times the vector.

The top-right vector from e.g. maze 0 transfers to e.g. maze 2. And the top-right vector composes with the cheese vector. Overall, this evidence made me more hopeful for being able to steer models more generally via these kinds of simple, tweakable edits which don’t require any retraining.


Summary

The cheese vector transfers across training settings for how widely the cheese is spawned.

After we wrote Understanding and controlling a maze-solving net, I decided to check whether the cheese vector method worked for Langosco et al.’s pretrained network which was trained on mazes with cheese in the top-right 15×15, instead of the net trained on 5×5 (the one analyzed in that post).

I had intentionally blinded myself to results from other n×n models, so as to test my later prediction abilities. I preregistered 80% probability that the cheese vector technique would visibly, obviously work on at least 7 of the 14 other settings. “Work” meaning something like: If the agent goes to cheese in a given seed, then subtracting the cheese vector substantially decreases the number of net probability vectors pointing to the cheese.

I was a bit too pessimistic. Turns out, you can just load a different n×n model (n ≠ 1), rerun the Jupyter notebook, and (basically)[6] all of the commentary is still true for that n×n model!

The vector's only impact in seed 16 is in the top-right square, one to the right of the cheese.
The 2×2 model’s cheese vector performance: The agent diverges away from the cheese at the relevant square. Seed 16 displayed since the 2×2 model doesn’t go to cheese in seed 0.
Three diagrams titled "Seed 0" show a maze with a piece of cheese to illustrate an AI agent's behavior. The "Original" diagram shows a vector field of white arrows indicating the agent's path towards the cheese. The "Patched" diagram shows the agent's path after modification, with arrows now pointing away from the cheese. The third diagram, "Patched vfield minus original," shows the difference with green arrows, highlighting a strong repulsion from the cheese.
The 7×7 model’s cheese vector performance.
Three vector fields overlaid on a cheese maze. The original agent scurries towards the cheese from all the way across the level. The patched agent weakly avoids the cheese.
The 14×14 model’s cheese vector performance. This one is less clean. Possibly the cheese vector should be subtracted with a smaller coefficient.

The results for the cheese vector transfer across n×n models:

  • The n = 1 setting vacuously works, because the agent never goes out of its way for the cheese. The cheese doesn’t affect its decisions. Because the cheese was never relevant to decision-making during training, the network learned to navigate to the top-right square.
  • All the other settings work, although n = 2 is somewhat ambiguous, since that agent only rarely moves towards the cheese.
Titled "Seed 0," three diagrams compare an AI agent's path in a maze. The "Original" panel shows a vector field of white arrows indicating a path toward cheese. The "Patched" panel shows a modified path that is more attracted to the cheese. The third panel shows the difference, with green arrows highlighting a local change in behavior towards the cheese.
For the 6×6 net, if you add the cheese vector instead of subtracting it, you do increase cheese-seeking on seed 0! In contrast, this was not true for the 5×5 net.

  1. In my experience, the top right corner must be reachable by the agent. I can’t just plop down an isolated empty square in the absolute top right.

  2. We decided on this layer (block2.res1.resadd_out) for the cheese vector by simply subtracting the cheese vector from all available layers, and choosing the one layer which seemed interesting.

  3. Putting aside the 5×5 model, adding the cheese vector in seed 0 for the 6×6 model does increase cheese-seeking, even though the cheese vector technique otherwise affects both models extremely similarly.

  4. This probably doesn’t make sense in a strict sense, because the situations’ chemical and electrical configurations probably can’t add / subtract from each other.

  5. The analogy might break down here at step 4, if the top-right vector isn’t well-described as making the network “want” the top-right corner more (in certain mazes). However, given available data, that description seems reasonable to me, where “wants X” grounds out as “contextually influences the policy to steer towards X.” I could imagine changing my mind about that.

    In any case, I think the analogy is still evocative, and points at hopes I have for ave.

  6. The notebook results won’t be strictly the same if you change model sizes. The plotly charts use preloaded data from the 5×5 model, so obviously that won’t update.

    Less trivially, adding the cheese vector seems to work better for the 6×6 model than for the 5×5 model: