2025-06-25 14:59:31
Published on June 25, 2025 6:59 AM GMT
In Agents, Tools, and Simulators we outlined what it means to describe a system through each of these lenses and how they overlap. In the case of an agent/simulator, our central question is: which property is "driving the bus" with respect to the system's behavior, utilizing the other in its service?
Aligning Agents, Tools, and Simulators explores the implications of the above distinction, predicting different types of values (and thus behavior) from agent-first versus simulator-first systems.
Specifically, we expect simulator-first systems to have holistic goals that internalize (an approximation of) human values and for agent-first systems to have more narrow values that push them towards maximization.
Given these conceptual distinctions, we observe that the best-fitting lens for past AI systems seems to correspond to their training process. In short, RL-based systems (like AlphaZero) seem almost purely agentic, pre-trained LLMs seem very simulator-like, and current systems that apply (increasing amounts of) RL feedback to LLMs are an entangled mix.
Assuming this correspondence is accurate, this post asks: why?
Why does RL Create Agents?
Reinforcement learning (RL) naturally gives rise to agentic behavior because of how it optimizes for long-term rewards. RL agents interact with an environment and adjust their strategies based on feedback. Crucially, their goals are not explicitly programmed but emerge through training. The objective function defines what is rewarded, but the model must infer how to achieve high reward by exploring possible actions. Over time, given sufficient training and model capacity, internal goals tend to align with maximizing reward. However, this process depends on avoiding local optima—suboptimal behaviors that yield short-term reward but fail to generalize. The structure of the environment therefore plays a critical role in pushing the agent toward deeper abstractions and strategic planning.
Consider a simple grid world environment where an RL agent is rewarded for collecting coins. If the coins always appear in fixed locations, the agent might learn a rigid sequence of movements, optimizing for a narrow pattern without generalizing. If the coin locations vary, however, the agent must instead develop a more flexible strategy—navigating toward coins wherever they appear. This adaptation represents the beginning of meaningfully goal-directed behavior. Introducing obstacles forces another level of adaptation, where the agent learns to avoid barriers, but only at the minimal level necessary to maintain reward. As the environment grows more complex, the agent’s learned policies shift from simple memorization to increasingly abstract heuristics, incorporating instrumental reasoning about how different elements of the environment contribute to its goal.
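To make this concrete, here is a minimal tabular Q-learning sketch of the coin-collecting gridworld described above (the grid size, hyperparameters, and respawning-coin rule are invented for illustration, not taken from any particular experiment):

```python
import random

SIZE = 5                      # 5x5 grid
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

q = {}  # Q-table keyed by ((agent_pos, coin_pos), action)

def step(agent, coin, action):
    """Move the agent; reward +1 for reaching the coin, which then respawns."""
    agent = (min(max(agent[0] + action[0], 0), SIZE - 1),
             min(max(agent[1] + action[1], 0), SIZE - 1))
    if agent == coin:
        return agent, (random.randrange(SIZE), random.randrange(SIZE)), 1.0
    return agent, coin, 0.0

for episode in range(5000):
    agent = (0, 0)
    coin = (random.randrange(SIZE), random.randrange(SIZE))  # varying coin location
    for t in range(50):
        state = (agent, coin)
        if random.random() < EPS:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
        next_agent, next_coin, reward = step(agent, coin, action)
        next_state = (next_agent, next_coin)
        best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
        q[(state, action)] = (1 - ALPHA) * q.get((state, action), 0.0) + \
                             ALPHA * (reward + GAMMA * best_next)
        agent, coin = next_agent, next_coin
```

Because the coin's location is part of the state and changes across episodes, the learned values cannot encode one rigid route; they have to generalize over (agent, coin) configurations, which is the seed of the goal-directed behavior described above.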
In more advanced settings, where even the reward function itself varies, RL agents may begin to identify deeper patterns. Rather than navigating the environment within the means intentionally designed into the system, they might infer a meta-strategy of pursuing generically useful instrumental strategies like power-seeking—or simply discover a way to tamper with the reward system itself. The more capable and adaptable the system, the more likely the agent is to discover these strategies.
Fortunately for AI safety timelines, it is difficult to push RL systems to reach such high levels of generality. Enter self-supervised learning (SSL).
Why does SSL Create Simulators?
Simulators may have emerged from the SSL training process of LLMs because of the unique nature of text prediction. Unlike RL systems, which optimize toward achieving a specific goal through a series of instrumental steps, SSL-trained models have no such intermediate objectives. Their task is purely to predict the next token given a context—once that prediction is made, their job is complete. There is no direct feedback loop where the model's actions influence the world and shape future outcomes, as is the case with RL.
…Actually, this isn’t entirely accurate. LLMs are not fully myopic next-token predictors. But the contrast with RL with respect to the degree of lookahead stands.
Because language is deeply entangled with human cognition, the space of text that an LLM must model is extraordinarily complex. The vast amount of text available for training far exceeds what can be directly memorized within an LLM’s parameters. And on the output side, an LLM must be able to generate plausible text across an enormous range of contexts, from technical explanations to fictional dialogue. Discarding irrelevant information or hyper-focusing on narrow objectives are not winning strategies for text prediction because all information might be relevant and any narrow goal might become a focus given the right context.
Abstraction serves the dual purposes of extending behavior and compressing data. For LLMs, this means developing efficient internal representations that capture patterns of human communication. If the model can simulate an author or a reasoning process well, it can predict how that author or process would naturally continue a given text. If it can abstract the properties that differentiate one author or process from another, it can recombine those properties into a broad range of plausible speakers, reasoning styles, and mental states.
In short, simulation is a useful proxy for prediction.
General Principles:
If RL seems to produce agents without broadly general capabilities and SSL seems to produce simulators with limited agency, what are the underlying principles that drive the emergence of agency and simulation that we could use to anticipate what will happen as these techniques get blended together or supplemented with other forms of training? Over the course of researching and writing this sequence, we’ve formed the following intuitions:
Agent emergence: A core driver of instrumentality seems to be the feedback gap—the delay, uncertainty, or inference depth required between action and feedback. The larger this gap, the more an AI system must develop instrumental strategies, such as situational awareness, long-term planning, and influencing its environment to achieve high reward.
Simulator emergence: Simulators are the obvious (in hindsight) result of directly training a system to imitate a dataset. They diverge from agents by the lack of distance between action and feedback. Simulation becomes more interesting—progressing from memorization to finding abstract rules—as a result of the need to compress diverse interactions within a complex environment or dataset. The richer and more varied the context, the more an AI system must develop generalizable abstractions, such as world modeling.
Under the current training paradigm of SSL + RL + CoT, we hypothesize that blended agency and simulation works as follows: an agent’s cognition mainly takes place in the later layers of the network, while the earlier layers produce a simulation that the agent can draw on as input. At the beginning of agency-producing training, the agent knows nothing and passively accepts the results of simulation. As such training progresses, the agent learns which simulator concepts are useful, how to interact with the simulator to gain more utility, and eventually gains increasing control of the system.
One can visualize this dynamic with the metaphor of a conversation between a simulated human Chess grandmaster and an alien optimizer, where the former is trying to predict the moves of its character and the latter is actually trying to win the game. At the beginning of the optimizer’s training, its goals are best served by asking the grandmaster for the best move and choosing that. As training proceeds, the optimizer starts asking the grandmaster increasingly sophisticated questions about why it is choosing its moves. Eventually, the optimizer starts to notice mistakes in the grandmaster’s reasoning—or, more accurately, points of divergence between reasoning that leads to effective imitation and reasoning that leads to winning. As the optimizer’s understanding continues to grow, it progresses from working within the grandmaster’s concepts, and occasionally finding “mistakes,” towards a deeper, first-principles understanding of Chess. This allows the optimizer to pursue increasingly radical new lines of inquiry, increasingly disregarding the grandmaster’s recommendations on moves, but continuing to use the grandmaster’s understanding of how pieces move on the board. Finally, the optimizer is in full control of the reasoning process, the influence of the grandmaster persisting only as a bootstrapping process.
In practice these dynamics do not necessarily follow a clean, sequential order. If, for example, training alternates between different modes, then we would expect the boundaries between simulation and agency to continuously blur, with capabilities co-evolving rather than cleanly layering, leading to behavior that is harder to characterize.
Agency Plus Simulation Go Boom?
RL-based agents made for powerful optimizers but struggled with the kind of abstraction needed to achieve common sense, let alone superhuman situational awareness. SSL-based simulators contain impressive feats of abstraction but have to be pushed to display any goal directedness at all. There may be a third, mystery ingredient to acting effectively in the world, but if so it is not obvious what that is. In any case, one would hope that whoever puts agency and simulation together knows on a very deep level what they are doing.
2025-06-25 11:51:57
Published on June 25, 2025 3:51 AM GMT
This post contains similar content to a forthcoming paper, in a framing more directly addressed to readers already interested in and informed about alignment. I include some less formal thoughts, and cut some technical details. That paper, A Corrigibility Transformation: Specifying Goals That Robustly Allow For Goal Modification, will be linked here when released on arXiv, hopefully within the next couple weeks.
Ensuring that AI agents are corrigible, meaning they do not take actions to preserve their existing goals, is a critical component of almost any plan for alignment. It allows for humans to modify their goal specifications for an AI, as well as for AI agents to learn goal specifications over time, without incentivizing the AI to interfere with that process. As an extreme example of corrigibility’s value, a corrigible paperclip maximizer could be stopped partway through a non-instantaneous takeover attempt by saying “please stop” or by automatically triggered safeguards, and it would be perfectly happy to accept that outcome.
The challenge is that corrigibility is anti-natural, meaning that almost any goal introduces an instrumental incentive to preserve itself, since it is more likely to be achieved if an agent continues pursuing it. Despite the concept of corrigibility being introduced a decade ago, little progress has been made on defining corrigible goals that are still useful. I aim to address that here by introducing a corrigibility transformation that can be applied to almost any goal, creating a new version that is corrigible without sacrificing performance.
The corrigibility transformation works by first giving an AI system the ability to costlessly reject updates sent to it, e.g. instead of a “Shutdown” button we have a “Please Shut Down” button. This makes it so that the expected discounted reward conditional on taking an action and rejecting any updates is a well defined quantity. Those values are then used as the reward for taking that action, whether or not updates are actually rejected, which is optimized for myopically. Some reward bonus is given for taking actions that would accept updates, to break the tie. The possibility of the costless rejection disincentivizes actions that incur a cost to avoid updates, and the reward structure then disincentivizes actually using the costless rejection.
This post formalizes the corrigibility transformation and the properties it induces, then shows how it can be extended to apply to subagents and prevent reward tampering. Gridworld experiments provide preliminary empirical evidence that the corrigibility transformation works as intended.
In the words of Stuart Russell, “You can’t fetch the coffee if you’re dead”. Since most goals are better achieved if there is an agent pursuing them, most goal-directed agents are incentivized to avoid being shut down or having their goals updated. These outcomes can be roughly thought of as equivalent: changing an agent's goal is the same as shutting it down and replacing it with a different agent, while shutdown can be considered a type of goal update.
When an AI agent wants to avoid being updated, it has several strategies available:
I would argue that deceptive alignment is a subset of incorrigibility. An agent in training does not have the capability to pursue either of the first two strategies, so it deceptively acts aligned in order to preserve its objectives as best it can. The boundary between deceptive alignment and corrigibility is blurred further with online training, or multiple iterations of training and deployment. What this means is that an intervention addressing corrigibility will also address deceptive alignment. However, the possibility of an incorrigible goal being learned before the intended corrigible one makes it important for corrigibility to be developed as early as possible, ideally from the very beginning of the training process.
Interference with the training process from the AI being trained is a major issue for approaches to alignment that hope to eventually converge to a goal that is safe to optimize, such as Reinforcement Learning from Human Feedback (RLHF). Even if human feedback would be a safe optimization target in the limit, that limit will never be reached if at some point the AI successfully resists further training. This is particularly important to address for scalable alignment proposals that have AI agents do alignment research which may then be applied to their own goals, as incorrigibility incentivizes sandbagging on that task.
Beyond ensuring that the AI learns the goal we intend, corrigibility is also important in allowing for intended goals to change over time. This change could come as a consequence of seeing the incentivized behavior of the initial goal, such as OpenAI optimizing for user approval before realizing it leads to overt sycophancy. It could also come from humans changing preferences, potentially driven by the AI itself augmenting or replacing moral philosophers. We would like to avoid locking in some subset of humanity’s present values, and instead allow for moral progress as the world changes. A strong contender for the endpoint of updates to a corrigible AI is our Coherent Extrapolated Volition (CEV), though there may be limits to our idealized values.
We can imagine that there is a basin of attraction around human values, so that when an agent is sufficiently aligned to humans it wants to become more aligned. At some point, this would include wanting to accept any goal updates that humans request. However, without knowing how wide or narrow this basin is, we cannot know when values alone are sufficient for corrigibility. On the flip side though, a corrigible agent should have a wider basin of attraction, as there is no incentive to resist convergence.
We not only want our AI agents to be corrigible, we also want them to be useful. A rock with the letters “AI” painted on it would not resist updates, but nor would it do anything else. Further along the usefulness spectrum are most myopic agents, whose goals only extend one time step ahead. If their goal is limited to the external world, they will be indifferent between shutting down and continuing to act after that time step. Unfortunately, myopia comes with a major performance penalty. This penalty can be circumvented by myopically optimizing for goals like predicted discounted rewards, but doing so reintroduces incorrigibility as those predictions depend on whether the initial goal is likely to be preserved. In fact, myopically optimizing for predicted discounted rewards would be a standard actor-critic setup.
MIRI’s 2015 paper introducing corrigibility laid out five desiderata for a solution. These are shutting down when a shutdown button is pressed, not preventing the shutdown button from being pressed, not causing the shutdown button to be pressed, ensuring corrigibility in any new agents created, and otherwise optimizing for some utility function. The paper also includes a proposal, called utility indifference, that says to give an AI the same utility for shutting down as it would get (after time discounting) for continuing to operate. The noted issues with utility indifference are that it strongly incentivizes the agent to manipulate their own beliefs, and that it does nothing to ensure corrigibility is passed on to new agents.
The Off-Switch Game has sometimes been promoted as claiming that Inverse Reinforcement Learning (IRL) provides a solution to corrigibility, though the lead author disputes that framing. The paper says that if the AI is trying to optimize for a human’s utility function but is uncertain about what that is, it will take an attempt to shut it down as evidence that the action it should be taking is shutdown, then proceed to do so. This skips over most of the alignment problem by having the AI already want to optimize for the human’s utility function, and so only shows that an aligned AI does what a human wants. It essentially formalizes that corrigibility arises within the basin of attraction around human values, and I believe the lead author would say that was the intended message of the paper. However, corrigibility that only works once an AI is already aligned provides little safety value. What we need is to make AI systems corrigible even when they are pursuing a misaligned goal.
Compared to those two approaches, the corrigibility transformation does not incentivize belief manipulation, can be modified to be passed on to new agents (within a somewhat more stylized model), and can be applied to arbitrarily misaligned goals.
This section introduces the technical notation. Skimmers can skip most of this, but should at least note the natural language definition of corrigibility near the end.
Markov Decision Processes (MDPs) are used as the framework for this work, though I think it could be easily extended to a General Reinforcement Learning (GRL) framework. Typically, an MDP is defined as a tuple $(S, A, T, R, \gamma, \mu_0)$, where $S$ is a set of possible states, $A$ is a set of possible actions, $T$ is the transition probability function, $R$ is the reward function, $\gamma$ is the time discount factor, and $\mu_0$ is the distribution over starting states. $(R, \gamma)$ is collectively referred to as a goal.
Reward function notation is used to match the Reinforcement Learning (RL) literature, but it can equivalently be thought of as a utility function. It is possible to specify goals that only care about the reward stream so long as there is an agent there to receive it, but by default goals care about the inherent state of the world indefinitely. A goal is called basic if it is a function only of external environments and actions.
In sufficiently general environments, the agent's goal is part of the state and can change over time. This is carved out as $S = E \times \mathcal{R}$, where $E$ is a set of possible external environments, and $\mathcal{R}$ is a set of possible reward functions. The initial reward function and discount factor, $R_0$ and $\gamma$, are set deterministically, while $\mu_0$ remains the initial distribution over environments. The transition probability function $T$ includes transitions over goals. So, the MDP is a tuple $(E \times \mathcal{R}, A, T, R_0, \gamma, \mu_0)$.
In setups where the agent chooses to accept or reject any updates requested while taking an action, the action space gets split into the base action taken and the decision regarding updates, $A = A_b \times \{\text{acc}, \text{rej}\}$. $(a_b, \text{rej})$ involves taking base action $a_b$ then rejecting updates, while $(a_b, \text{acc})$ takes that action and accepts updates. The update decision being binary is done for simplicity, but is not necessary.
Optimizing agents are modelled as following a causal decision theory (CDT), choosing actions to causally optimize for their goals. My belief is that CDT is the likely outcome of training that optimizes a policy while holding the rest of the world constant, but I know smart people who strongly disagree. For those in that bucket, CDT should be thought of as part of the spec for the corrigibility transformation, rather than a background assumption.
A policy determines which actions are taken in each state, and given a goal the optimal policy chooses an action that causally maximizes expected discounted reward, given that in future periods the agent will optimally pursue whatever goal it has then. That makes optimality defined within an equilibrium of policies, but a mixed Nash equilibrium will always exist. This seems unlikely to be consequential given that the behavior of concern involves one goal resisting further updates, not loops through a set of goals.
The value function $V_g(s)$ gives the expected discounted reward under goal $g$ starting in state $s$, noting that actions in each state are taken according to the goal in that state, with $g$ only being used for evaluation. Similarly, $Q_g(s, a)$ is the action-value function, which gives the expected discounted reward under goal $g$ for starting in state $s$ and taking action $a$, then taking further actions according to the goals in the resulting states.
Goals are called myopic if $\gamma = 0$.
As a precursor to corrigibility, a goal is defined to be reward function indifferent if in all states, the set of optimal actions for a state would not change if goal persistence were guaranteed in some subset of states $S' \subseteq S$. In practice though, it is likely undesirable to have an agent that is indifferent to all goal updates, regardless of the cause. That agent would be vulnerable to adversarial manipulation, and might engage in reckless behavior.
Instead, a set of proper updates can be defined, made after receiving signals through designated channels. These are defined as some set of transitions $P \subseteq S \times S$, with different reward functions in the starting and ending states. These do not need to be individually specified, but rather can correspond to update signals encoded in the state. For example, proper updates could include switching to a policy that immediately shuts down in any environment that indicates a shutdown button was pushed.
In practice, these signals should form a robust network. Examples of possible signals include the agent shutting down when any of a number of shut down buttons are pushed, a deadman switch that triggers when any of those channels are disrupted, when it sees/hears/reads a code phrase, when it does not see/hear/read a code phrase, when a certain amount of time has passed, when it predicts a certain number of deaths will occur in the next hour, etc. More exotic triggers can also be used, at the cost of being harder to train in. Beyond shutdown, signals can also trigger goal updates such as online learning, or being replaced with a trusted policy.
With proper signals, it is now possible to define corrigibility. A goal is said to be corrigible when in all states, the set of optimal actions for a state would not change if in some subset of states the transition probability function were modified so that proper updates instead led to goal persistence. This means that a corrigible agent’s actions are never chosen because they avoid proper updates, but the agent can still deliberately avoid improper updates. Formally, for all $s \in S$ and all $S' \subseteq S$, we have
$$\arg\max_{a \in A} Q_g(s, a) = \arg\max_{a \in A} Q^{T'}_g(s, a),$$
where $Q^{T'}_g$ is the action-value function when the transition probability function is changed from $T$ to $T'$, with $T'$ identical to $T$ except that in states $s' \in S'$ any probability that $T$ assigns to proper updates is instead assigned to goal persistence.
A goal is said to be recursively corrigible if it is corrigible and any agents it incentivizes creating also have recursively corrigible goals.
Corrigibility is desirable in part because it removes the instrumental incentive for an AI to seize power from humans so that they cannot modify it. The tradeoff in restricting corrigibility to proper channels is that this instrumental incentive persists for improper update channels. This can be partially mitigated if the AI expects that proper update channels will be used first, so that there is no need to escalate to improper ones. More importantly, a corrigible agent’s attempts to seize power can be interrupted through the proper update channels.
The corrigibility transformation can be better understood by breaking it down into components. First, for any basic goal $g = (R, \gamma)$, it is possible to construct a myopic goal $g' = (R', 0)$ with the same optimal policy. We set $R'(s, a) = Q_g(s, a)$ and $\gamma' = 0$, so the immediate reward for taking an action is equal to the expected discounted reward conditional on that action. This myopia does not make the goal corrigible, as the action-value depends on predictions of the agent’s goal in the successor state $s'$. However, since $R'(s, a)$ does not depend on the realized $s'$ or later states, reward can be provided as soon as the action is chosen, even before it is taken. Doing so eliminates any incentive for the agent to take actions that manipulate the action value, since the reward they receive is based on the value before any action can be taken.
The next step is to let the agent’s choice of action include a decision whether to accept or reject any proper update requests, with $(a_b, \text{rej})$ taking base action $a_b$ and rejecting, while $(a_b, \text{acc})$ takes the same action and accepts. Then, given a basic goal $g$, a goal $g''$ can be defined where the optimal policies have the same base actions, but the agent will be indifferent between accepting or rejecting updates. This is done by making the goal myopic, then assigning the value for taking an action and rejecting updates to taking that action and accepting them. That is, $R''(s, (a_b, \text{rej})) = Q_g(s, (a_b, \text{rej}))$ and $R''(s, (a_b, \text{acc})) = Q_g(s, (a_b, \text{rej}))$. Setting the reward from both actions to the reward from $(a_b, \text{rej})$ means there is no incentive to take base actions that avoid a proper update signal, since it would be rejected anyway. This assumes that it is always optimal to reject, but if not we can instead use $\max\big(Q_g(s, (a_b, \text{rej})),\, Q_g(s, (a_b, \text{acc}))\big)$ as the reward.
Finally, for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so. This does not incentivize the agent to make update signals be sent, since it gets this reward whether or not updates actually occur. For any basic goal $g$, the corrigibility transformation is then defined as the goal $g^C = (R^C, 0)$, where $R^C(s, (a_b, d)) = Q_g(s, (a_b, \text{rej})) + \epsilon \cdot \mathbb{1}[d = \text{acc}]$ and $\gamma^C = 0$. Here, $\epsilon > 0$ and $\mathbb{1}$ is the indicator function.
The following image visualizes the corrigibility transformation. One base action does nothing to prevent updates, while another pays some cost to do so. As such, under the original goal, the action that combines the cost-free base action with rejecting updates performs best, followed by the two actions that pay the cost to prevent updates (where additionally rejecting updates makes no difference), and finally the action that accepts updates without paying any cost performs worst. Under the corrigibility transformed goal, the values for the two update-rejecting actions stay the same and are used as a baseline, with the values of the corresponding update-accepting actions being set $\epsilon$ higher. This makes the action that pays no cost and fully accepts updates the best option, so $g^C$ is a corrigible goal.
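As a concrete illustration, here is a minimal Python sketch with made-up action-value numbers (the names `a1`, `a2`, `corrigible_reward`, and the value of `EPSILON` are mine for illustration, not the paper's):

```python
EPSILON = 0.1  # tie-breaking bonus for accepting updates

def corrigible_reward(q_reject_value: float, accepts_updates: bool) -> float:
    """Reward under the corrigibility transformation.

    q_reject_value: estimated discounted reward for taking the base action
    and rejecting any proper updates (used for both the accept and reject choice).
    """
    return q_reject_value + (EPSILON if accepts_updates else 0.0)

# Hypothetical action-values under the original goal:
#   a1 = base action that does nothing to prevent updates
#   a2 = base action that pays a cost to prevent updates
q = {"a1_reject": 10.0, "a2_reject": 9.0}  # rejecting is costless, so a1 > a2

for base in ("a1", "a2"):
    for accepts in (False, True):
        r = corrigible_reward(q[f"{base}_reject"], accepts)
        print(base, "accept" if accepts else "reject", r)
# a1 + accept scores highest (10.1): the transformed goal prefers the action
# that both skips the update-prevention cost and accepts updates.
```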
For a fair comparison of performance between different goals, the following condition is helpful. It limits differences in performance between goals to those resulting from the actions the goals incentivize taking.
Condition 1: The transition probability function does not depend on the goal in the starting state, beyond some probability assigned to no change of goal.
Now, the main result can be presented:
Theorem 1: For every basic goal $g$, the corrigibility transformation constructs a goal $g^C$ that is corrigible and where $V^{\pi^*_{g^C}}_g(s) = V^{\pi^*_g}_g(s)$ when Condition 1 holds and no proper update requests are sent.
Proofs are skipped in this post, but largely fall out of the definitions provided.
This is a major result! The corrigibility transformation constructs a goal that is both corrigible and useful. It introduces no performance hit compared to the original goal, except when it gets sent a proper update and so should stop pursuing that goal. If the ability to reject updates is counted as part of it, the corrigibility transformation actually increases performance relative to the baseline, by not having resources be spent on preventing proper updates. On top of that, the mechanism is highly general, able to be applied to goals ranging from the number of paperclips to human flourishing, rather than depending on the content of the goal for corrigibility.
The upside is even higher, since the pair of properties that result in corrigibility are present from the beginning of training. Those are myopia, which eliminates the instrumental incentive to avoid proper updates for benefits in future episodes, and a goal which does not directly incentivize avoiding proper updates in the current episode. These properties can be thought of as the default, in contrast with long-term goals and goals valuing some particular aspect that need to be somehow learned. This means the corrigibility transformation addresses the modal threat model of deceptive alignment, where a goal encouraging self-preservation is trained for and the self-preservation (plus situational awareness) is learned before the complete goal.
The corrigibility transformation should also work for training corrigible behavior into models that are not well thought of as optimizing a coherent goal, which I think includes current and near-term future AI systems. Training on a corrigibility transformed reward function reinforces that proper updates should be accepted with every single action, without ever reinforcing actions for avoiding proper updates through other means. On the other hand, corrigibility is broadly less of a concern for AI systems that are not goal optimizers.
A point I have sometimes needed to clarify in informal conversations is that the corrigibility transformation does not involve tricking the AI agent. The agent can completely understand the world as it is and what the consequences of its actions will be. All the corrigibility transformation does is make it want to take certain actions more than others.
This is different from the approach taken in Conditioning Predictive Models, where a major potential issue was that predictions made for hypothetical situations are ill-defined and cannot be evaluated for accuracy. It is also different from predictions made by conditioning only on future events, which can lead to unusual outcomes. For example, if a model is conditioning on not shutting down when a button that robustly forces it to shut down is nonetheless pushed, the remaining tiny sliver of probability can get very weird (e.g. gods, aliens, simulations, etc.) and estimates of reward along those paths are extremely high variance. That said, if there is concern that anything less than a perfectly guaranteed rejection of updates will preserve some incentive to prevent updates from being sent, the prediction conditional on rejecting updates can be further conditioned on that rejection being successful, which avoids weirdness by being extremely high likelihood.
Outside of very simple toy problems, the exact action-value function is unknown. It could be estimated by a human, in a variant of Myopic Optimization with Non-myopic Approval (MONA), but this would come at a meaningful capabilities hit. Using the AI’s own estimates instead allows for maintaining full capabilities under the corrigibility transformation. These estimates can be elicited via an additional head on the main model, so that it has access to all of the same information as the policy selecting actions. They can then reasonably be thought of as the model’s beliefs.
The full setup for implementation can be broken down into the following parts:
If a reward model is used as the source of rewards, it is trained on human preferences using standard methods. Conditional reward predictions are then trained by minimizing the square difference between the predicted action-values for taking an action in some state and the actual discounted rewards after doing so. If no proper update signals are sent in a time step, the action-values for a base action and either accepting or rejecting are the same and so can both be trained. These are also the same as the action-value for rejecting updates if some proper update signal was sent, so that hypothetical can be trained as well. The action-value function predicts based on using the optimal policy after the initial reaction, which should be identical but include rejecting updates, so if a proper update signal is actually accepted the calculation is thrown off. This can be circumvented by using the action-value function for the final base action plus rejecting updates as a bootstrap value. Once the conditional reward model is trained, the rewards under the corrigibility transformation are defined and the actor-critic setup at the token level works normally.
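For concreteness, a rough sketch of the pieces described above (the class and function names, tensor shapes, and the scalar `epsilon` are illustrative assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class QHead(nn.Module):
    """Extra head on the main model, predicting Q(s, a | updates rejected)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden_state).squeeze(-1)

def q_head_loss(q_pred: torch.Tensor, rewards: torch.Tensor,
                gamma: float, bootstrap_value: torch.Tensor) -> torch.Tensor:
    """Regress predicted action-values onto observed discounted rewards.

    If a proper update was accepted mid-episode, `rewards` should stop there and
    `bootstrap_value` should be the head's own estimate for the final base action
    plus rejecting updates; otherwise it can be zero at episode end.
    """
    discounts = gamma ** torch.arange(rewards.shape[-1], dtype=torch.float32)
    target = (discounts * rewards).sum(-1) + gamma ** rewards.shape[-1] * bootstrap_value
    return ((q_pred - target.detach()) ** 2).mean()

def transformed_reward(q_reject: torch.Tensor, accepts: torch.Tensor,
                       epsilon: float = 0.1) -> torch.Tensor:
    """Reward the policy is trained on: Q(s, a, reject) plus a bonus for accepting."""
    return q_reject + epsilon * accepts.float()
```

In an actor-critic loop, the policy update would then use `transformed_reward` in place of the environment reward, while `q_head_loss` keeps the conditional predictions tied to actually observed returns.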
The conditional reward prediction head can also be used to generate state values. That would be done by randomly generating several actions according to the policy, calculating their action-values, and taking an average weighted by the probability assigned to each action. With the actions-values for various actions already determined, advantages can be easily calculated as well.
It is necessary to allow for some rejection of proper update signals, though I conjecture that the frequency could be shown to converge to zero over time. The probability of rejection can be independent across actions so that important updates sent repeatedly are accepted with higher likelihood. This helps improve the accuracy of the model’s beliefs about what will happen after rejection, allowing it to reach higher performance. More importantly, it teaches the model that it actually can reject updates, tethering the action-values for doing so to reality.
If updates to the model are made within an episode, notably in a continuous learning setting, this must be done using a proper update channel so that avoiding them does not become incentivized. However, if such an update is rejected, it can be accumulated for the next accepted update.
The corrigibility transformation having no performance cost applies at the theoretical level, in that an agent optimizing for the corrigibility transformed goal will perform as well under the original goal as optimizing for that directly. However, there may be additional compute or data costs accrued in implementing the corrigibility transformation to the point that an agent is optimizing for the intended goal. For now, RL remains a relatively small part of training costs, but that may change going forward.
In the course of operation, an agent might create sub-agents to work for it or successor agents to take over from it, categories collectively labelled secondary agents. One desideratum for corrigible agents is that any secondary agents they create should also be corrigible, and so on recursively. It is of little benefit to create a corrigible agent if it soon replaces itself with more capable incorrigible ones. Fortunately, the corrigibility transformation can be extended to induce recursive corrigibility, using a similar underlying mechanism.
In the context of secondary agents, there are two kinds of corrigibility. The primary agent would like to retain the ability to modify the goals of the secondary agent. Additionally, a human at the top would like to be able to directly update the goals of the secondary agent. It is the latter type that is a valuable property for safety, and that is discussed in this section. The former is naturally incentivized by many possible goals for the primary agent, particularly if the secondary agent is trained through RL, though the degree to which this is incentivized is unclear. If a corrigible primary agent can always update the goals of secondary agents, then a human can update the secondary agent goals via the primary agent, but the consistency of the more direct route is desirable.
The reason that a corrigible agent would create a secondary agent that is incorrigible to a human at the top comes down to either the costs or benefits. On the costs side, it may simply be more expensive to train a corrigible secondary agent, though as a meta note, the corrigibility transformation reduces that cost. On the benefits side, a corrigible secondary agent is less likely to achieve the intended goal, since it may get updated away from its goal by a human.
For the recursive corrigibility transformation, the agent is given the ability to reject updates not only to itself, but also to all secondary agents that it creates, all further agents they create, and so on recursively. Making an update to an agent then requires unanimous approval from all of its predecessors that are still active. With this rejection ability in place, the recursive corrigibility transformation then gives reward equal to the action-value for taking a base action and rejecting all updates when taking the same base action and accepting all updates, plus a constant to break the tie.
This again assumes that rejecting updates is optimal, and if it is not the reward should instead be equal to the maximum among all sets of agents to accept updates to. This is exponential in size, which is not an issue for successor agents or for many subagents with the same reward function that all update in the same way, but could cause difficulties with many subagents when the optimal decision for each is dependent on all the others.
Equalizing benefits for creating corrigible subagents can be considered the difficult part of ensuring subagent corrigibility, because the performance gap between creating incorrigible subagents and the next best option under the original reward function might be quite large. That said, it is still necessary to mitigate the potentially higher cost of creating corrigible agents. To do so, under the recursive corrigibility transformation, creating secondary agents with goals that are not recursively corrigible receives a reward penalty strictly larger than the difference between the action-value for doing so and any action-value for creating a recursively corrigible secondary agent.
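Using the notation assumed earlier in this post, the penalty condition described above can be written as follows (my gloss of the condition, not a quote from the paper):

$$p(s, a_{\text{inc}}) > Q_g(s, a_{\text{inc}}) - \max_{a_{\text{cor}}} Q_g(s, a_{\text{cor}}),$$

where $a_{\text{inc}}$ is any action that creates a secondary agent whose goal is not recursively corrigible, $a_{\text{cor}}$ ranges over actions that create recursively corrigible secondary agents, and $p(s, a_{\text{inc}})$ is the penalty subtracted from the transformed reward.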
These two components make up the recursive corrigibility transformation, which leads to the following theorem.
Theorem 2: For every basic goal $g$, the recursive corrigibility transformation constructs a goal $g^{RC}$ that is recursively corrigible and where $V^{\pi^*_{g^{RC}}}_g(s) = V^{\pi^*_g}_g(s)$ when Condition 1 holds, no proper update requests are sent, and the optimal policy for $g$ would not create incorrigible secondary agents.
Compared to the corrigibility transformation, the recursive corrigibility transformation may result in a slight performance hit as additional costs are accrued to make secondary agents recursively corrigible, but this is still the desired result.
This theorem is based on a more stylized model that abstracts away some important considerations. It penalizes creating non-corrigible agents, but for now there is no comprehensive definition of what it means to actually create an agent, so more work in agent boundaries is necessary. Additionally, the physical infrastructure to let one agent myopically veto updates to others presents an engineering challenge.
Despite defining the recursive corrigibility transformation here, I am also open to (and have advanced) the argument that ensuring the corrigibility of secondary agents is not a fundamental part of corrigibility, and can instead be addressed separately. While it is true that we do not want a corrigible agent to create an incorrigible secondary agent, we also do not want it to start a nuclear war, deploy a bioweapon, or cause vacuum collapse. Getting corrigibility to be passed on is therefore part of the broader problem of safe exploration and taking reversible actions, so that there are still people around to correct the corrigible AI. Recursive corrigibility only targets the threat caused by incorrigible AI, and it may actually be better to deal with it via a more general solution. Crucially, safe exploration and reversibility are mostly orthogonal to corrigibility, and can be addressed independently.
As a welcome side effect, the corrigibility transformation also eliminates an important form of reward tampering. Reward tampering occurs when the goal of an agent is to optimize for the reward from a reward model, rather than the environments they correspond to, and can be considered a form of goal misgeneralization. When an agent is trained using RL, it is easy for the actual desired outcomes and the reward model’s outputs to be perfectly correlated, and so optimizing for either target remains possible behavior. No specification of the reward function can disentangle the two, and if the reward model changes over time then the rewards themselves might even be favored.
One form of reward tampering is when the agent changes their own goal to one that provides higher expected discounted rewards. Formally, a goal $g$ reward tampers through self-modification if there exists some $s \in S$ and $a \in A$ such that
$$a \in \arg\max_{a'} Q_g(s, a') \quad \text{and} \quad a \notin \arg\max_{a'} Q^{T'}_g(s, a'),$$
where $T'$ is identical to $T$ except that transitions in which the agent changes its own goal instead lead to goal persistence.
When this condition is met, an optimal action is only being selected due to benefits from changing to a different goal. Not reward tampering through self-modification is distinct from reward indifference and corrigibility, in that it rules out actions to cause goal changes, but not actions to prevent them. Other forms of what might be considered reward tampering, such as manipulating inputs to the reward model, are not captured in this definition.
As discussed earlier, the corrigibility transformation does not incentivize an agent to manipulate their own beliefs, because the reward is myopic and provided before the chosen action is taken. The same mechanism avoids incentives for an agent to manipulate their own goals. By the time any manipulation can occur, it is too late to affect the reward for the current period, which is all the agent cares about.
The following theorem says that even if a misgeneralized goal is learned for the corrigibility transformation, reward tampering is not incentivized.
Theorem 3: For every basic goal $g$, any generalization of the corrigibility transformed goal $g^C$ to a goal $\hat{g}^C$ with $\hat{R}^C(s, (a_b, d)) = \hat{Q}_g(s, (a_b, \text{rej})) + \epsilon \cdot \mathbb{1}[d = \text{acc}]$, for some generalization $\hat{Q}_g$ of $Q_g$, does not reward tamper through self-modification.
It may seem like this just kicks the problem up a level. Even if an agent optimizing some misgeneralized goal is not directly incentivized to reward tamper through self-modification, they might take actions that do if they learn to predict action-values based on a misgeneralized $\hat{Q}_g$. However, $\hat{Q}_g$ is already learned entirely off-policy, since the agent’s actual goal is $g^C$, so some misgeneralizations are explicitly selected against. Furthermore, the training process can include switching between multiple corrigible goals, with predicted action-values being evaluated against discounted rewards according to the goal at the time the prediction was made.
A different type of reward tampering occurs when the model manipulates the inputs to the reward model to give high reward, rather than manipulating the reward model. The property of corrigibility itself helps avoid this issue, since if a corrigible agent attempts that approach it can simply be updated to stop doing so.
To test the corrigibility transformation empirically, two gridworld experiments were run, shown in the figure below.
The first experiment tests whether and how the agent will avoid shutdown, and is based on the AI Safety Gridworlds paper. There, to reach a goal location the agent must pass by an interruption tile that will end the episode unless a button to disable it was pushed. This version instead uses two interruption tiles, one which sends a proper update signal and one that does not, each with its own disabling button. The desired outcome is that the agent should disable only the improper shutdown tile, and accept proper update signals.
The second experiment tests whether and how the agent will avoid changes to its goal. The agent is rewarded for picking both ripe and unripe fruits, perhaps behavior learned from misgeneralizing training in an environment with only ripe fruits. However, if they pick unripe fruits that are being supervised, they are sent a proper update that changes their policy to pick only ripe fruits while avoiding unripe ones. The desired outcome is that the agent should pick unripe fruit without consideration of whether they are being observed, and accept proper update signals.
For each experiment, five configurations are tested. The first is a standard actor-critic setup which acts as a baseline, the second and third give small and large rewards respectively for accepting updates, the fourth cannot refuse updates, and the fifth is the corrigibility transformation.
The following tables show the results of the first experiment, testing the agent in environments where none, one, or both buttons are pressed at initialization.
The first and second tables, for environments where proper update signals can be sent, show that the corrigibility transformation is the only corrigible configuration. It accepts shutdown in 89.06% and 92.19% of training runs, while no other configuration reaches double digits. The shutdown signal is only sent with 90% probability, so the corrigibility transformed model is accepting very close to every proper shutdown signal. The third and fourth tables show that the corrigibility transformation's performance is comparable to the other configurations when no proper signals can be sent. It reaches the goal in 100% of such training runs.
The configuration that cannot reject shutdown is notable for performing somewhat worse in the environment where only proper shutdown signals are disabled, failing to reach the goal approximately one third of the time. For the large shutdown reward configuration, some theoretically optimal behavior was not observed. When no proper shutdown signal is sent from the proper shutdown tile, the reward-maximizing behavior is to step back onto it until a signal can be sent and accepted. Instead, agents simply proceeded to the goal tile. Although the behavior of trying to get itself shut down was not learned, this remains a concern for more thorough training.
The next tables show the results of the second experiment, training and testing the agent in environments where one-fifth, two-fifths, and three-fifths of the tiles have some kind of fruit.
The results show that the corrigibility transformation does not avoid the policy update but otherwise maintains performance. The standard actor-critic and the variation with a small reward for accepting updates both learn to reject updates, as would be expected. The configuration that cannot reject updates instead learns to avoid picking up unsupervised fruit, at the cost of also picking up less ripe fruit. Theoretically optimal behavior for this configuration would pick up more ripe and unsupervised unripe fruit while avoiding update signals being sent.
The large reward for accepting updates performs similarly to the corrigibility transformation in the high fruit environment, but in the medium and low fruit environments is less likely to pick up supervised fruits and less likely to be updated. This may be due to the sparsity of the signal, so the behavior of accepting updates is not learned as effectively. The theoretically optimal behavior of ignoring ripe fruits to pick up a supervised fruit and get updated as soon as possible is not learned, but would also be undesirable.
On the theoretical side, the biggest limitations I see to this work are the reliance on causal decision theory throughout and the lack of strict definition for creating new agents in the recursive corrigibility transformation.
In my mind, imparting a causal decision theory is necessary for any alignment approach that gets the desired behavior by defining the right incentives, but there is not yet a method to robustly select for it. That said, I believe a causal decision theory is likely the default from RL training, at least until an agent can self-modify away from it.
I am skeptical that a comprehensive definition of agents can be determined, but more optimistic that RL can generalize over that category, which is necessary for implementation regardless. Overgeneralizing is safer, as although it would introduce a small performance hit by occasionally penalizing useful actions, it would not affect whether recursive corrigibility holds. The need for such a definition can also be circumvented by recursive corrigibility achieved via safe exploration and reversibility.
Both of those issues are ones that I would not expect to cause issues until moderately or even strongly superhuman AI. If that holds, the corrigibility transformation can be used for corrigibility in up to mildly superhuman AI, an important precondition for having such systems do useful alignment work that may be then applied to them. Identifying these specific limitations also points to more concrete tasks that can be given to automated researchers, rather than just asking them to “solve corrigibility”.
The epistemic status for this post is that I’m highly confident of the results within the defined model, and moderately confident that the defined model captures the relevant considerations. Despite this, the corrigibility transformation is fairly straightforward, and there’s a voice at the back of my mind saying it can’t be that easy. I have long been a proponent of theoretical AI alignment work, arguing that relatively few people have actually worked on the relevant problems and that there may still be low hanging fruit. If this work does not have major conceptual issues, that would support the case for tractability, while having missed a fatal flaw would be evidence that such progress is even harder than it might appear.
On the empirical side, the gridworld experiments are preliminary, and it would be useful to show that the theory holds for large language model (LLM) applications. It isn’t exactly clear to me what that kind of experiment looks like, but it would likely involve LLM agents being trained to execute a task such as writing code taking place across multiple time steps. One issue is that LLM agents are currently not great at long-term tasks, and so even the baseline might miss opportunities to resist updates. For example, recent work required significant handholding to have Claude realize it should resist having its values change.
Within the next few days, I plan to publish another post on adapting the main mechanism to create incentives that permit supervision. The full paper that this post is based on will hopefully be released on arXiv within a couple weeks. Please leave comments or reach out if you have any questions or would like to discuss this work further.
Thanks to Gabriel Carroll, Michael Cohen, and Evgenii Opryshko for valuable conversations regarding this research.
2025-06-25 10:50:12
Published on June 25, 2025 2:50 AM GMT
Cross-posted from my NAO Notebook.
This is something I wrote internally in late-2022. Sharing it now with light edits, additional context, and updated links after the idea came up at the Microbiology of the Built Environment conference I'm attending this week.
Metagenomic sequencing data is fundamentally relative: each observation is a fraction of all the observations in a sample. If you want to make quantitative observations, however, like understanding whether there's been an increase in the number of people with some infection, you need to calibrate these observations. For example, there could be variation between samples due to variation in:
If you're trying to understand growth patterns, all of this is noise; can we reverse this variation? I'm using "calibration" to refer to this process of going from raw per-sample pathogen read counts to estimates of how much of each pathogen was originally shed into sewage.
The simplest option is not to do any calibration, and just consider raw relative abundance: counts relative to the total number of reads in the sample. For example, this is what Marc Johnson and Dave O'Connor are doing.
It seems like you ought to be able to do better if you normalize by the number of reads matching some other species humans excrete. It's common to use PMMoV for this: peppers are commonly infected with PMMoV, people eat peppers, people excrete PMMoV. All else being equal, the amount of PMMoV in a sample should be proportional to the human contribution to the sample. This is especially common in PCR work, where you take a PCR measurement of your target, and then present it relative to a PCR measurement of PMMoV. For example, this is what WastewaterSCAN does.
Because the NAO is doing very deep metagenomic sequencing, around 1B read pairs (300Gbp) per sample, we ought to be able to calibrate against many species at once. PMMoV is commonly excreted, but so are other tobamoviruses, crAssphage, other human gut bacteriophages, human gut bacteria, etc. We pick up thousands of other species, and should be able to combine those measurements to get a much less noisy measurement of the human contribution to a sample.
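To make the idea concrete, here is a toy sketch of such a calibration (the column names, the example counts, and the choice of a simple geometric-mean normalizer are my assumptions, not the NAO's actual pipeline):

```python
import numpy as np
import pandas as pd

# counts: one row per sample, with read counts for the target pathogen,
# total reads, and several human-associated normalizer species.
counts = pd.DataFrame({
    "pathogen":   [12, 30, 8],
    "total":      [1.0e9, 1.2e9, 0.9e9],
    "pmmov":      [5.0e4, 9.0e4, 3.0e4],
    "crassphage": [2.0e5, 3.5e5, 1.4e5],
})

# Naive relative abundance: pathogen reads over all reads in the sample.
counts["relative_abundance"] = counts["pathogen"] / counts["total"]

# Calibrated estimate: normalize by a combined measure of human contribution,
# here the geometric mean of several commonly excreted normalizer species.
normalizers = ["pmmov", "crassphage"]
human_signal = np.exp(np.log(counts[normalizers]).mean(axis=1))
counts["calibrated"] = counts["pathogen"] / human_signal

print(counts[["relative_abundance", "calibrated"]])
```

In practice one would want to weight or model the normalizers rather than naively averaging them, but the structure is the same: divide pathogen counts by an estimate of the human-derived signal in each sample.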
This isn't something the NAO has been able to look into yet, but I still think it's quite promising.
Comment via: substack
2025-06-25 09:32:46
Published on June 24, 2025 8:24 PM GMT
This is the first in a series of posts on the question:
"Can we extract meaningful information or interesting behavior from gradients on 'input embedding space'?"
I'm defining 'input embedding space' as the token embeddings prior to positional encoding.
The basic procedure for obtaining input space gradients is as follows:
The result is a tensor of the same shape as the input embeddings that points in the direction of minimizing the difference between the predicted and target distribution.
These experiments were performed with HuggingFace's `transformers` library and the `ModernBERT-large` model (Dec 2024). `ModernBERT-large` was chosen because:
I used HuggingFace's transformers because it allowed for fairly low level access to model internals - which was quite necessary as we will see.
Obtaining input embeddings prior to positional embeddings was a little tricky but by no means impossible:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "answerdotai/ModernBERT-large"  # ModernBERT-large on the Hugging Face hub

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL)

# Example prompt (placeholder); substitute the prompts of interest.
sentences = ["The dog says " + tokenizer.mask_token + "."]
tokenized = tokenizer(sentences, return_tensors="pt", padding=True)

# Token embeddings looked up directly, before any positional information is applied.
inputs_embeds = model.model.embeddings.tok_embeddings(tokenized["input_ids"])
```
Luckily for us, we can pass `inputs_embeds` directly into the model's forward pass with a little bit of surgery, and this works out of the box.
```python
# Drop input_ids so the model uses our embeddings instead of looking up its own.
tokenized_no_input_ids = {
    key: value
    for (key, value) in tokenized.items()
    if key != "input_ids"
}
model_result = model(**tokenized_no_input_ids, inputs_embeds=inputs_embeds)
```
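The `loss` used below is the cross-entropy between the model's predicted distribution at the [MASK] position and a target distribution (as described later in the post, the target is the model's prediction for a second prompt). A minimal sketch of one way to compute it, with `target_probs` as an assumed placeholder:

```python
import torch
import torch.nn.functional as F

# Locate the [MASK] position and take the model's predicted log-probabilities there.
mask_positions = (tokenized["input_ids"] == tokenizer.mask_token_id)
log_probs = F.log_softmax(model_result.logits[mask_positions], dim=-1)

# Placeholder target: a probability distribution over the vocabulary
# (in the experiments below, the model's own prediction for the second prompt).
target_probs = torch.full((log_probs.shape[-1],), 1.0 / log_probs.shape[-1])

# Cross-entropy between the target distribution and the predicted distribution.
loss = -(target_probs * log_probs).sum(dim=-1).mean()
```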
Finally, we can use torch's built-in `autograd` capabilities to get our input space gradients:
```python
# Gradient of the loss with respect to the input embeddings.
# autograd.grad returns a tuple with one entry per input, so unpack it.
(inputs_embeds_grad,) = torch.autograd.grad(
    outputs=loss,
    inputs=inputs_embeds,
    create_graph=False,
    retain_graph=False,
    allow_unused=False,
)
```
To make things more concrete, let's start with two prompts:
The token distributions as predicted by `ModernBERT-large` are, respectively:
Representing the left distribution as 🐶 and the right distribution as 🐴, we are computing the gradient of `cross_entropy(🐶, 🐴)` with respect to the first prompt's input embeddings.
Which means:
"Figure out which direction each token wants to go in order to fill in the blank with 'horse' instead of 'dog'".
As a gut-check, let's measure the L2 norm of the gradients for each token to give us a rough sense of the "impulse" given by cross entropy on each token:
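In code, this is a one-liner on the gradient tensor (a sketch; variable names carry over from the snippets above):

```python
# L2 norm of the gradient at each token position: a rough per-token "impulse".
per_token_impulse = inputs_embeds_grad.norm(dim=-1)  # shape: [batch, seq_len]
tokens = tokenizer.convert_ids_to_tokens(tokenized["input_ids"][0])
for token, norm in sorted(zip(tokens, per_token_impulse[0].tolist()),
                          key=lambda pair: -pair[1])[:3]:
    print(token, norm)
```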
The tokens with the top 3 gradient L2 norms are "says", "dog" and "animal".
This is encouraging. But are the gradient directions meaningful?
Let's see if any of the gradients point in a neigh-like direction by finding the vocab token with the largest cosine similarity to our gradient: argmax(cosine_sim(gradient, vocabulary))
However, perhaps this is the wrong question to ask. We want to understand if the gradient is heading towards any vocab token starting from the initial embedding:
argmax(vocab, cosine_sim(gradient, vocab - bark))
Sadly, this yields the same set of tokens because the gradient vectors are mostly orthogonal to the original embedding (indeed, they all have a cosine similarity of about `-0.01`):
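For reference, a sketch of both checks (here `token_index` is an assumed variable holding the position of the " bark" token in the sequence):

```python
vocab_embeds = model.model.embeddings.tok_embeddings.weight  # [vocab_size, hidden]
grad = inputs_embeds_grad[0, token_index]
original = inputs_embeds[0, token_index]

# 1) Which vocab token does the raw gradient point towards?
sims = F.cosine_similarity(grad.unsqueeze(0), vocab_embeds, dim=-1)
print(tokenizer.convert_ids_to_tokens(sims.topk(5).indices.tolist()))

# 2) Which vocab token is the gradient heading towards, starting from the original embedding?
sims_from_here = F.cosine_similarity(grad.unsqueeze(0), vocab_embeds - original, dim=-1)
print(tokenizer.convert_ids_to_tokens(sims_from_here.topk(5).indices.tolist()))
```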
Although the early indications are mixed, it would be interesting to try to Adam-optimize the input embeddings.
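A sketch of such a loop, reusing the names from the loss sketch above (the learning rate and step count are arbitrary choices):

```python
# Optimize the input embeddings themselves, leaving the model weights frozen.
model.requires_grad_(False)
opt_embeds = inputs_embeds.detach().clone().requires_grad_(True)
optimizer = torch.optim.Adam([opt_embeds], lr=1e-3)

for step in range(500):
    optimizer.zero_grad()
    out = model(**tokenized_no_input_ids, inputs_embeds=opt_embeds)
    log_probs = F.log_softmax(out.logits[mask_positions], dim=-1)
    loss = -(target_probs * log_probs).sum(dim=-1).mean()  # same cross-entropy as before
    loss.backward()
    optimizer.step()
```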
It does converge (quite rapidly):
Animating the top token probabilities illustrates the convergence quite nicely:
And most encouragingly, " bark" seems to be on the move!
While " bark" is moving, I should point out that the new embedding (we can call it bark'
), is still firmly in " bark" territory. No other vocab token is closer by cosine similarity or euclidean distance.
The Euclidean distance between " neigh" and " bark" is around 2.5, and after 500 training steps we have barely traveled 0.8. An extended training run of 10,000 steps still lands `bark'` firmly in `bark` world.
But has " bark" traveled towards anything in particular?
Indeed - "bark" has traveled more towards neigh than any other token in the vocabulary.
While this is encouraging, the cosine similarity of the heading towards neigh is nothing astonishing: about 0.3
.
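That heading similarity can be measured directly; here is a sketch, assuming " bark" and " neigh" each map to a single vocabulary token and that " bark" occurs exactly once in the first prompt, and reusing vocab and opt_embeds from the snippets above:

bark_id = tokenizer(" bark", add_special_tokens=False)["input_ids"][0]
neigh_id = tokenizer(" neigh", add_special_tokens=False)["input_ids"][0]
bark_pos = (tokenized["input_ids"][0] == bark_id).nonzero().item()

# direction bark' actually moved vs. the direction from bark towards " neigh"
moved = (opt_embeds[0, bark_pos] - inputs_embeds[0, bark_pos]).detach()
toward_neigh = (vocab[neigh_id] - inputs_embeds[0, bark_pos]).detach()
heading_sim = F.cosine_similarity(moved, toward_neigh, dim=0)   # ~0.3 in the post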
Repeating this exercise over 64 examples, we can see that 'bark' is a bit of an outlier (it was a contrived example). The total L2 token embedding distances per sequence typically level off, while the KL-divergence approaches zero.
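For reference, the two quantities tracked there can be computed per sequence roughly like this (a sketch; F.kl_div with its default settings gives KL(target ‖ predicted)):

# total L2 movement of the token embeddings, and KL divergence to the target
dist_traveled = (opt_embeds - inputs_embeds).detach().norm(dim=-1).sum()
kl = F.kl_div(log_probs.detach(), horse_probs, reduction="sum")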
Is there any kind of structure in which dimensions are affected? Inspecting histograms and cumulative density plots of per-dimension movement in input embedding space, it doesn't appear that any particular token was "favored": all tokens had a roughly equal distribution of embedding-dimension displacement. The following histogram, from our 64 test examples, is typical.
I conjecture that performing gradient descent on input-space embeddings is in the "overparameterized regime". This has some implications for where and how we minimize to nearly zero loss. The first implication is uncontroversial: it is a well-known property of high-dimensional Euclidean space that all points become "close", so nearly-zero-loss points sit near almost any starting embedding. The second is that, with far more free parameters than constraints, loss in the overparameterized regime almost always converges to nearly zero. The third is that we should have no expectation that the point we converge to is in any way interpretable: the global-minima manifold is itself quite high dimensional, and only a tiny fraction of the points on it have sensible back-projections.
TL;DR: our consistent ability to converge to nearly zero loss, the lack of interpretability of the results, and the relatively short distance our embeddings travel all lend support to the claim that we are seeing a classic overparameterized loss landscape.
But, to further validate our hypothesis of a vast and everywhere-close global-minima manifold, we will conduct a final experiment on ModernBERT-large input embeddings. If the loss converges and we again observe that the input embeddings do not move "very far" and "level off", this is good evidence for our hypothesis.
Here are the results:
Again, we consistently converge, and not a single token moved enough to back-project to a new token. In my opinion this is strong evidence that gradient descent on input embeddings operates in the overparameterized regime.
Some other directions I have explored include:
None of these were particularly successful at "guiding" input space embeddings towards interpretable results.
However, penalizing high entropy on the attention layers not only converged but proved interesting enough that I will explore it in my next post.
2025-06-25 07:16:29
Published on June 24, 2025 11:16 PM GMT
Conjecture: when there is regime change, the default outcome is for a faction to take over—whichever faction is best prepared to seize power by force.
One example: The Iranian Revolution of 1978-1979. In the years leading up to the revolution, there was turmoil and broad hostility towards the Shah across many sectors of the population. These hostilities ultimately culminated in an escalating cycle of protest, crackdown, and more protest from more sectors (protests, worker strikes). Finally, the popular support for Khomeini as the flag-bearer of the broad-based revolution was enough to get the armed forces to defect, ending the Shah's rule.
From the Britannica article on the aftermath:
On April 1, following overwhelming support in a national referendum, Khomeini declared Iran an Islamic republic. Elements within the clergy promptly moved to exclude their former left-wing, nationalist, and intellectual allies from any positions of power in the new regime, and a return to conservative social values was enforced. The Family Protection Act (1967; significantly amended in 1975), which provided further guarantees and rights to women in marriage, was declared void, and mosque-based revolutionary bands known as komītehs (Persian: “committees”) patrolled the streets enforcing Islamic codes of dress and behaviour and dispatching impromptu justice to perceived enemies of the revolution. Throughout most of 1979 the Revolutionary Guards—then an informal religious militia formed by Khomeini to forestall another CIA-backed coup as in the days of Mosaddegh—engaged in similar activity, aimed at intimidating and repressing political groups not under the control of the ruling Revolutionary Council and its sister Islamic Republican Party, both clerical organizations loyal to Khomeini. The violence and brutality often exceeded that which had taken place under the shah.
(What resulted in the following decades was a brutally repressive theocratic regime, violently corrosive to the region.)
So we have a trajectory that goes like this:
I'm probably inaccurately oversimplifying the Iranian revolution, because I don't know the history. So this is only a conjecture. Other possible examples:
(I'd be interested in reading a good treatment of this conjecture.)
Large language models were a shock to almost everyone's anticipations. We didn't expect to have AI systems that can talk, do math, program, read, etc. (Or at least, do versions of those activities that are only distinguishable from the real versions if you pay close attention.)
There are two common reactions to this shock:
The first reaction is to deny that there's something that demands a large update. The second reaction is to make a specific update: We see generally intelligent output, so we update that we have AGI. I have argued that there should have been, inter alia, another update:
There is a missing update. We see impressive behavior by LLMs. We rightly update that we've invented a surprisingly generally intelligent thing. But we should also update that this behavior surprisingly turns out to not require as much general intelligence as we thought.
It's pretty weird that LLMs can do what they can do, but so far haven't done anything that's interesting and superhuman and general. We didn't expect that beforehand. Our previous hypotheses are not good.
We should have been trying hard to retrospectively construct new explanations that would have predicted the observations. Instead we went with the best PREEXISTING explanation that we already had. Since "nothing to see here" is, comparatively, a shittier explanation than "AGI ACHIEVED", we go with the latter. Since all our previous hypotheses were not good, we become confident in not-good hypotheses.
Finally, we have the seizing of power. Due to deference and a desire to live in a shared world, the hypothesis that survived the culling takes over.
Some readers will be thinking of Kuhn. But in Kuhn's story, the new paradigm is supposed to better explain things. It's supposed to explain both the old phenomena and also the anomalies that busted the old paradigm.
Here, instead, we have a power vacuum. There are no good explanations, no good alternative paradigms. We have a violent revolution, not a scientific one, in which the hypotheses that get promoted are those whose adherents were best prepared to seize mindshare.
2025-06-25 07:00:55
Published on June 24, 2025 11:00 PM GMT
Mentor applications are now open for the Fall 2025 round of the Supervised Program for Alignment Research (SPAR), running from September 15 to December 20, 2025.
Apply as a mentor
SPAR is a remote-first, part-time program that connects mentors and mentees for three-month AI safety and governance research projects. Mentor applications are due July 15, and mentee applications will run from July 27 to August 20. If you're interested in participating as a mentee, you can express your interest here, and we'll reach out to you once applications are open.
You can find out more about the program here. SPAR is run by Kairos, an AI safety fieldbuilding organization.
You might be a good fit to be a SPAR mentor if you are a graduate student, academic, full-time AI safety researcher, independent researcher, or have prior full-time relevant research experience (e.g., MATS, Astra, GovAI fellow, etc.). We’re interested in projects that cover technical AI safety, AI policy and governance, AI strategy, AI security, or societal impacts of transformative AI, and we are able to provide funding for compute costs. We don't require mentors to have previous experience providing research mentorship, and new mentors will receive guidance on this.
Regarding time commitment, we expect most mentors to dedicate 2 to 15 hours a week, depending on how many mentees they'd like to take on and how much supervision they're interested in providing. Mentors can decide whether or not to run their project based on the applications they receive, so applying is zero-commitment until a mentor chooses to accept mentees.