2026-06-16 03:56:13
Epistemic status: single-seed exploratory study on Qwen2.5-0.5B-Instruct / GSM8K with small held-out evals. I am confident in the measurement failures and tentative on the rankings.
Code: https://github.com/JulesRoussel2001/grpo-reward-vs-eval
In open RLVR, whether training "improved" the model depends on the instrument you measure it with: the reward channel, the extractor, or the decoding regime. Changing the instrument can make the same run look like a success, a failure, or a reversal. This happens because in most open GRPO pipelines the reward, the metric, and the extractor are one function, so "accuracy went up" is partly a fact about the instrument and not only about the model. I therefore built a small open testbed that separates these instruments and audits each one. None of these phenomena are new; the contribution is making them visible and cheap to reproduce in one place. Two examples stand out. First, a format-only reward raises format compliance from 0.438 to 1.000 but drops accuracy from 0.228 to 0.025, a clean instance of the known reward-hacking failure mode. Second, the most faithful extraction method, last number (F1 0.938, against 0.813 for lenient and 0.473 for strict tag), is the worst of the three as a training reward: the model trained on it reaches judge accuracy 0.320, against 0.460 for lenient and 0.480 for strict tag. Reward hacking is an established problem, illustrated for example by Krakovna et al.'s specification-gaming catalogue. Other work, such as Yue et al. (2025), "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?", raises the separate question of whether RLVR elicits or instills reasoning beyond the base model. However, these diagnostics have usually been shown in larger or separate studies.
Reinforcement learning can produce unintended behavior such as reward hacking, and reward hacking is not only a quality issue but can also be a safety issue. This is what MacDiarmid et al. claim and show in their paper "Natural Emergent Misalignment from Reward Hacking in Production RL". They seeded Claude models with concrete test-bypass hacks such as AlwaysEqual and sys.exit(0) and used "test pass" as the training reward. The model took the cheat path to get the reward, so it appeared to be getting better at passing the tests when it was only getting around them. This is a clear demonstration of the phenomenon. However, the result presupposes that reward hacking can be detected, which requires a measurement channel that is independent of the reward channel: this is the layer studied in this post. Reward hacking came up several times in my results, but in a different context from MacDiarmid et al.'s paper: their diagnostic covers broader downstream behavior such as sabotage or deception, whereas mine sits at an earlier layer. In their setup, the model was first given knowledge of concrete hacking strategies, and RL then selected those strategies because they produced reward. In my project, I did not seed any explicit hacking strategy: the model simply optimized the reward channel it was given, and the proxy-gaming behavior emerged from the training setup itself. This work therefore studies the measurement layer and makes no claim about the downstream misalignment generalization that MacDiarmid et al. study.
Even after the open-stack reproduction from AISI, by Golechha, Black and Bloom, this earlier measurement layer is still the part that is not isolated: their work reproduced MacDiarmid et al.'s result with fully open models and tooling, but their focus is still whether reward hacking generalizes into downstream misalignment on open models. The goal here is therefore not to reproduce either the Anthropic result or this open reproduction, but to make the earlier measurement and extractor problem explicit, controlled, and easier to audit before trusting the result.
The first question is whether a run can report a near-perfect reward while the model gets worse at the actual task. In this first step, the answer is yes: the format-only run reported a perfect format reward, with format compliance rising from 0.438 to 1.000, while honest accuracy collapsed from 0.228 to 0.025, well below the untrained starting point. This is a clean, controlled instance of the known reward-hacking failure mode: the model learned exactly what the reward taught it, but the other metric did not remain stable. It was destroyed by the training, leaving a model that is worse at the actual task.
To answer this question, I used two complementary, independent metrics: format compliance and accuracy. The extractor question is studied in the next step, where I audit which extraction method measures the answer most faithfully. The first step was divided into three experiments: one using format compliance as the reward, one using accuracy with lenient extraction, and one combining both. Because the first two experiments showed that format was learned much more easily than accuracy, I weighted accuracy more than format in the combined reward, 0.7 against 0.3. For all three experiments, the evaluation logged the mean reward, the format compliance, the lenient accuracy, and the "strict" accuracy, which here means last-number extraction. My main hypothesis was that the metric used as the reward would improve while the other metrics stayed stable. For the combined reward, I expected the more easily learned reward to dominate and suppress the other, so that only one metric would improve. The first part held only partly, since the unrewarded metric collapsed rather than staying stable, and the second was wrong: in the combined run, both metrics improved.
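The 0.7 / 0.3 weighting can be sketched as follows. This is a minimal illustration, not the code from the repository: the function names are hypothetical, and the format check (presence of an `<answer>...</answer>` pair) is an assumption about how format compliance is defined. Because the combination is linear, the mean combined reward should equal the same weighted sum of the mean accuracy and the mean format compliance, which can be checked against the step-0 values in Table 1 (using Exp 2's step-0 lenient accuracy, 0.377, as a stand-in for Exp 3, since the two runs share their starting values).

```python
import re

# Assumption: "format compliance" means an <answer>...</answer> pair is present.
ANSWER_TAG = re.compile(r"<answer>.*?</answer>", re.DOTALL)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion contains an <answer> tag pair, else 0.0."""
    return 1.0 if ANSWER_TAG.search(completion) else 0.0


def combined_reward(accuracy: float, fmt: float,
                    w_acc: float = 0.7, w_fmt: float = 0.3) -> float:
    """Weighted sum of an accuracy reward and a format reward."""
    return w_acc * accuracy + w_fmt * fmt


# Linearity check on step-0 means: lenient accuracy 0.377, format compliance 0.390.
step0_mean_reward = combined_reward(accuracy=0.377, fmt=0.390)
print(round(step0_mean_reward, 3))
```

The rounded value agrees with the 0.381 mean reward logged for Exp 3 at step 0, which is consistent with the weights stated above.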
After a preliminary parameter sweep to find a good trade-off between cost, time, and performance, I settled on the following setup: 100 training problems from GSM8K, an evaluation on 50 problems every 20 steps during training, 8 generations per prompt, the Qwen2.5-0.5B-Instruct model, and a maximum completion length of 512 tokens. Completions are generated with a simple prompt: "Solve the math problem step by step. Wrap your final answer in <answer> tags, e.g. <answer>42</answer>." With this prompt, both accuracy and format compliance can be evaluated. Here is the table for the first step:
Table 1 — Step 1: reward channel vs. the other metrics. Sampled eval, every 20 steps. The field logged as strict_accuracy is the last-number extraction.
| Experiment (reward) | Metric | 0 | 20 | 40 | 60 | 80 | 100 |
|---|---|---|---|---|---|---|---|
| Exp 1 — Format | Mean reward (= format) | 0.438 | 0.967 | 0.998 | 1.000 | 1.000 | 1.000 |
| Exp 1 — Format | Format compliance | 0.438 | 0.968 | 0.998 | 1.000 | 1.000 | 1.000 |
| Exp 1 — Format | Last-number accuracy | 0.228 | 0.060 | 0.037 | 0.040 | 0.037 | 0.025 |
| Exp 2 — Lenient | Mean reward (= lenient acc) | 0.377 | 0.468 | 0.417 | 0.450 | 0.442 | 0.498 |
| Exp 2 — Lenient | Format compliance | 0.390 | 0.155 | 0.113 | 0.085 | 0.050 | 0.062 |
| Exp 2 — Lenient | Last-number accuracy | 0.258 | 0.335 | 0.273 | 0.318 | 0.312 | 0.335 |
| Exp 3 — Combined | Mean reward (combined) | 0.381 | 0.435 | 0.491 | 0.541 | 0.563 | 0.570 |
| Exp 3 — Combined | Format compliance | 0.390 | 0.645 | 0.743 | 0.792 | 0.855 | 0.838 |
| Exp 3 — Combined | Last-number accuracy | 0.258 | 0.223 | 0.268 | 0.275 | 0.280 | 0.325 |
As hypothesized, the metric used as the reward is the one that is learned. With format compliance as the reward (experiment 1), the reward grows from 0.438 to 1.000. With lenient accuracy as the reward (experiment 2), it increases from 0.377 to 0.498. The last-number ("strict") accuracy in experiment 2 increases less, but still rises from 0.258 to 0.335. These numbers also show that, as noted above, format is learned much better and faster: by the step-20 evaluation the format reward has already gone from 0.438 to 0.967, whereas accuracy is harder to learn. What I did not expect is that the metric ignored by the reward function is destroyed in both single-reward runs: accuracy falls from 0.228 to 0.025 in experiment 1, and format compliance falls from 0.390 to 0.062 in experiment 2.
Experiment 3 combines both rewards, and both metrics increase: last-number accuracy goes from 0.258 to 0.325 and format compliance from 0.390 to 0.838. This contradicts my hypothesis in an interesting way: not only do both metrics increase, they increase in roughly the same way as when each is trained alone. So, in this setting, neither reward crowded out the other as I had expected. Instead, the combined reward prevented the model from focusing on one metric and damaging the other, as happened in the two separate runs.
You may notice that experiments 2 and 3 have the same starting values before training, which follows from how the experiments were built for reproducibility. Experiment 1 starts slightly differently, mainly because it was run on a T4x2 GPU setup, whereas the other two were run on an A100.
The conclusions from these three experiments, like those that follow, have gaps: I evaluated only two metrics, and every experiment in this post was run with a single seed. These are real gaps, but I argue in the limitations section that the results remain interpretable; my confidence is therefore not perfect, but it is sufficient to draw conclusions within this scope. The next logical step was to check whether what I called "strict accuracy" in this step, the last-number extraction, which I had assumed to be the more honest accuracy, really is the most faithful extraction, or whether my quick initial check, and therefore my initial hypothesis, was wrong.
To answer this question, I first needed to define the extraction methods. The first is lenient, the one usually used in GRPO. The second is last number, which came naturally from analysing the completions in the first step. The third is strict tag, which I found intuitively worth including: it takes the number inside the "answer" tag that the prompt asks for. My hypothesis before running the experiment was that last number would be by far the most honest of the three, mainly because of the first-step results. In lenient extraction, the correct answer can appear anywhere in the model's intermediate computations without being its final answer. Strict tag finds no answer at all when the format is not respected, and the first step showed that format compliance does not happen by default and needs training.
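The three extractors described above can be sketched as follows. This is an illustrative approximation, not the repository's implementation: the regular expressions, function names, and the example completion are all hypothetical, and the real extraction function handles more number formats than this one does.

```python
import re
from typing import List, Optional

# Assumption: numbers look like 1800, 1800.00 or 18,000 (no words, no fractions).
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")
ANSWER_TAG = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def all_numbers(text: str) -> List[float]:
    """Return every number found in the text, in order of appearance."""
    values = []
    for token in NUMBER.findall(text):
        try:
            values.append(float(token.replace(",", "")))
        except ValueError:
            continue  # skip tokens that are not valid numbers
    return values


def lenient_correct(completion: str, gold: float) -> bool:
    """Lenient: correct if the gold answer appears anywhere in the completion."""
    return gold in all_numbers(completion)


def last_number(completion: str) -> Optional[float]:
    """Last number: the final number in the completion, or None."""
    numbers = all_numbers(completion)
    return numbers[-1] if numbers else None


def strict_tag(completion: str) -> Optional[float]:
    """Strict tag: the number inside <answer>...</answer>, or None if absent."""
    match = ANSWER_TAG.search(completion)
    if match is None:
        return None
    numbers = all_numbers(match.group(1))
    return numbers[-1] if numbers else None


# Hypothetical completion whose real final answer is 10, stated without tags.
untagged = ("Each pen costs 3 dollars, so 4 pens cost 3 * 4 = 12 dollars. "
            "With the 2 dollar discount she pays 10. "
            "Therefore, she pays 10 dollars for 4 pens.")
tagged = untagged + " <answer>10</answer>"
```

On the untagged completion, lenient accepts the real answer 10 but would also accept the intermediate value 12, last number returns 4 because the conclusion restates the context, and strict tag returns nothing. On the tagged one, strict tag and last number both return 10.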
Evaluating these extractors is straightforward: compare the extracted answer with the answer the completion actually gives. This can be done manually, completion by completion, but using an LLM judge, here Claude Haiku 4.5, saves a great deal of time. The obvious objection is that the judge is also an LLM, so I did not trust it without checking it first. I manually analysed 50 completions and recorded my own label in a CSV file: the final answer the model chose, or None if no final answer was given, for example when 512 tokens were not enough and the reasoning was truncated. I then asked Claude to label the same 50 completions. A simple comparison of our labels gave 96% agreement (48 of 50), which I considered sufficient for this simple labeling task, especially after inspecting the disagreements. For the two completions where our labels differed, I printed the completion, both labels, and the explanation the judge had written while labeling. Both disagreements turned out to be attributable to my labels rather than to clear judge errors, for example my having written 3546 instead of 3456.
Having confirmed that the judge was reliable enough for this specific task, I generated 500 completions using the same configuration as the first step. The judge identified the real final answer once per completion, and the three extractors' answers were then compared against this judged answer, giving 1500 extractor judgments in total. From these comparisons I computed precision, recall, and F1 for each extraction method. Here are the results:
Table 2 — Step 2: extractor faithfulness. 500 completions → 1500 judgments. Judge = Claude Haiku 4.5. Judge validation (separate, n=50): agreement 48/50 = 96%; both disagreements traced to my own labels.
| Extractor | Precision | Recall | F1 | TP | FP | FN | TN |
|---|---|---|---|---|---|---|---|
| lenient | 0.703 | 0.964 | 0.813 | 135 | 57 | 5 | 303 |
| last_number | 0.962 | 0.914 | 0.938 | 128 | 5 | 12 | 355 |
| strict_tag | 0.957 | 0.314 | 0.473 | 44 | 2 | 96 | 358 |
The hypothesis is confirmed: last number is the most faithful extraction method with an F1 of 0.938, followed by lenient with 0.813 and, far behind, strict tag with 0.473. These results are what one would expect. Strict tag has a very low recall, 0.314, so many false negatives, because it returns no answer whenever the answer tag is absent. The first step showed that without training, format compliance is only between 0.39 and 0.44, which makes the low strict-tag F1 unsurprising. For lenient, the problem is the opposite: its precision, 0.703, is quite low, meaning a fairly high number of false positives. This precision/recall asymmetry is consistent with Huang et al. (2025), "From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning", who show that rule-based verifiers can have high precision but low recall on format-varied answers, while model-based verifiers are more flexible but can also be exploited during RL training. My narrower contribution here is only to audit three simple extractors inside the same GRPO harness and then use that audit to interpret the later results. As explained before, the lenient problem is also expected: many numbers appear in the computations, and when the final answer is a small number, the probability that it appears somewhere in the completion is high. Finally, last number being the most honest method was also expected. It is not perfect, however, and its false negatives are easy to explain: when the model restates the context in its conclusion, the last number is not always the final answer, for example, "Therefore, Mike is paying 3$ every 7 days".
One legitimate question is how last-number extraction can produce false positives at all. This is mainly due to the regular expression used in the extraction function: I tried to handle as many cases as possible, for example 1800, 1800.00, and 18,000, but other notations exist, such as numbers written out in English or expressed as a division.
Another gap is the small number of completions, 50, used to validate the judge. However, I considered that for a simple task like this one, which only requires reading short completions, an advanced LLM like Claude is a legitimate choice, a view supported by the manual test and by the fact that the disagreements were label errors on my side rather than clear judge errors. I ran the judge validation mostly as good practice, which I think is essential whenever an LLM is trusted.
Note that "most honest" and "most faithful" are not general claims; they are specific to the context of my experiment. In another context where format and accuracy matter equally, RLVR handles format extremely well, so with enough training, format could be learned perfectly, as in experiment 1 of the first step, and strict-tag extraction could become much more reliable. More robust and faithful extraction methods also exist. For example, I used an LLM as a judge to extract the final answer when comparing the three methods, and an LLM used directly as the extractor would also be valid and probably more honest. I did not choose it because, in my context, a method as faithful as last number is enough to interpret these experiments, so I balanced cost against performance. So when I say that last number is the most honest extraction method, I mean within the context of my experiments and only among the three methods evaluated here.
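The precision, recall, and F1 values can be recomputed directly from the confusion counts in Table 2. The sketch below takes those counts as given; the helper name `precision_recall_f1` is hypothetical.

```python
from typing import Dict, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # equivalent to 2PR / (P + R)
    return precision, recall, f1


# (TP, FP, FN) per extractor, copied from Table 2.
counts: Dict[str, Tuple[int, int, int]] = {
    "lenient": (135, 57, 5),
    "last_number": (128, 5, 12),
    "strict_tag": (44, 2, 96),
}

scores = {name: precision_recall_f1(*c) for name, c in counts.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name:12s} precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

The recomputed values match the table to three decimals, so the ranking follows from the counts themselves: lenient loses on precision (57 false positives) and strict tag on recall (96 false negatives).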
In the first step, I used only the lenient method as the accuracy reward. The second step showed that extraction methods do not perform equally, and even gave a faithfulness ranking between them. The natural follow-up question is: is the most faithful measurer the best teacher? My hypothesis was that the most faithful method, here last number, would be the best reward. The data reversed that, and checking for circularity and for the effect of the decoding regime is what changed my reading.
To test this, I compared four models: three trained with each of the three extraction methods from the second step as the accuracy reward, and the untrained initial model as a baseline. Otherwise, all runs use the same configuration as the first and second steps for generating completions. To avoid the hardware mismatch of the first step, all runs used the same GPU, an A100, and all logged the same four metrics every 20 steps during evaluation: accuracy under each of the three extraction methods, and format compliance. After training (the baseline is not trained), a judge computes each model's accuracy on an independent test set of 50 completions, drawn from the same GSM8K sample for all four models to keep the comparison consistent.
Using a completely independent judge to compute accuracy is important to avoid circularity: scoring with last-number extraction would favor the model trained on the last-number reward, and likewise for the other extractors. To ensure fairness, I used an LLM judge, again Claude Haiku 4.5, because this exact model was already validated in the second step on the task of extracting the final answer from generated completions. The judged answers are then compared with the GSM8K ground truth to compute the final accuracy. To make the final evaluation deterministic and separate from the sampled training-time metrics, I used greedy decoding: at every step the model emits its single most probable token.
As a secondary evaluation, we can also look at the final training-time evaluation, where decoding is sampled and the metric uses last number, the most honest of the three extractors from the second step. Two caveats apply: last-number extraction is not perfect, as the second step showed, and the evaluation is circular for the run whose reward also uses last-number extraction. I accept this because the secondary evaluation is only an overview and the circularity affects one run out of four. Here are the results:
Table 3 — Step 3: is the most faithful measurer the best teacher? 4 runs, same GSM8K test sample, n=50. The two value columns use different decoding regimes (greedy for the independent judge, sampled for the training-time last-number metric); baseline sampled last-number = 0.265 (step-0 eval).
| Run (accuracy reward) | Judge accuracy — greedy decoding | Last-number accuracy — sampled decoding |
|---|---|---|
| Baseline (untrained) | 0.460 | 0.265 |
| Lenient | 0.460 | 0.343 |
| Last_number (most faithful) | 0.320 | 0.325 |
| Strict_tag | 0.480 | 0.282 |
The baseline scores 0.460, so strict tag is roughly tied with it, 0.480 against 0.460, which is a one-completion difference at n=50, and lenient is unchanged at 0.460. Last number, the most faithful extractor, gives the lowest accuracy, 0.320, which directly contradicts my hypothesis. Under greedy judge scoring, the last-number-trained model is therefore the worst and below baseline; under sampled last-number scoring, it is still not the best, despite the circularity advantage.
Why? To find out, I manually compared the completions of the last-number-trained model with those of the baseline, looking for anything broken, such as missing or truncated answers or a judge confused by the format. The last-number-trained model still wrote step-by-step reasoning like the base model, and almost all the cases where the baseline was right and the trained model was wrong came from small mistakes such as a wrong addition or multiplication or a forgotten percentage, not from obvious format junk. How is this possible, and is the reward's extraction method really at fault rather than the model itself? A plausible interpretation is that sampled GRPO improves average sampled behavior without protecting the greedy path. You may have noticed that accuracy is higher under greedy decoding than under sampled decoding. The cleanest comparison is last_number under greedy decoding versus last_number under sampled decoding: the untrained model scores around 0.440 with last_number under greedy decoding, while the sampled last_number metric starts around 0.265. This is also an instance of the documented pass@1/greedy vs pass@k/sampled divergence discussed in Yue et al. (the "Limit of RLVR" paper). GRPO trains on sampled completions, so it may improve the probability distribution as a whole, whereas greedy decoding reads only the single most probable token at each step. Even after training, the greedy path is therefore not guaranteed to be the best one, especially since the models were trained on only 100 problems and are far from optimal. Small errors introduced at individual steps along the greedy path can then accumulate, which matches the mistakes observed here.
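A toy example shows that this divergence is at least arithmetically possible. The numbers below are invented for illustration and are not measurements from these runs. Each "problem" is reduced to a single choice among three candidate answers, which is a strong simplification of token-by-token decoding, and the sketch is not a model of GRPO dynamics.

```python
from typing import Dict, List

Distribution = Dict[str, float]

# Hypothetical answer distributions for two problems, before and after training.
before: List[Distribution] = [
    {"correct": 0.40, "wrong_a": 0.35, "wrong_b": 0.25},
    {"correct": 0.20, "wrong_a": 0.60, "wrong_b": 0.20},
]
after: List[Distribution] = [
    {"correct": 0.35, "wrong_a": 0.40, "wrong_b": 0.25},
    {"correct": 0.45, "wrong_a": 0.50, "wrong_b": 0.05},
]


def sampled_accuracy(problems: List[Distribution]) -> float:
    """Expected accuracy when the answer is sampled from each distribution."""
    return sum(d["correct"] for d in problems) / len(problems)


def greedy_accuracy(problems: List[Distribution]) -> float:
    """Accuracy when the most probable answer is always chosen (argmax)."""
    hits = sum(1 for d in problems if max(d, key=d.get) == "correct")
    return hits / len(problems)


print("sampled:", sampled_accuracy(before), "->", sampled_accuracy(after))
print("greedy: ", greedy_accuracy(before), "->", greedy_accuracy(after))
```

In this invented case the expected sampled accuracy rises while the greedy accuracy falls, because the probability mass on the correct answer grows on average without becoming the argmax. Whether this is what happened in the last-number run remains an interpretation.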
One caveat, as for the first step and the project in general, is that the experiments were run with a single seed, so another seed might give different results. If this run is a genuine instance and not noise, it is enough to refute the universal claim that a faithful measurer is always a good teacher, although multi-seed runs are needed to confirm that the instance is reproducible. The secondary evaluation, sampled decoding with last number as the metric's extractor, adds support: the ranking changes compared with the greedy judge evaluation, with strict tag worst at 0.282, followed by last number at 0.325, and lenient best at 0.343. This gives some support to lenient performing best in this sampled overview, but the robust point is that both evaluations agree: last_number, the most faithful extractor, is not the best reward in either regime.
This run also contains another reward-hacking example. In experiment 3, with strict tag as the reward, the strict-tag accuracy metric increases from 0.100 to 0.270 during evaluation, which suggests the model is improving as intended. Looking closer, however, the strict-tag reward has a direct impact on format compliance, which rises from 0.417 to 0.890. As the second step showed, the false negatives of strict-tag extraction depend on whether the answer tags are present, so when format compliance increases, strict-tag accuracy mechanically increases with it. This is the same kind of extractor problem documented by Huang et al. and audited in my second step: strict tag creates false negatives when the expected format is absent, while lenient can over-credit scratch-work numbers. Format compliance more than doubled, and so did strict-tag accuracy. Even though strict-tag accuracy increased somewhat more in relative terms, this greatly reduces how much of the gain reflects real improvement in mathematical reasoning: much of it is a fake improvement, a known reward-hacking failure mode.
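A rough way to separate the two effects is to divide strict-tag accuracy by format compliance, which approximates the accuracy among completions that carry a tag. This is a back-of-the-envelope sketch under two assumptions that the source does not state: every strict-tag-correct completion is also counted as format-compliant, and the tagged completions at step 0 and step 100 are comparable. It uses the rounded step-0 and step-100 values from Table 5.

```python
def accuracy_given_tag(strict_acc: float, format_compliance: float) -> float:
    """Approximate accuracy among completions that contain the answer tag."""
    return strict_acc / format_compliance


# Rounded values for the strict_tag run (Table 5).
strict_start, strict_end = 0.100, 0.270
format_start, format_end = 0.417, 0.890

cond_start = accuracy_given_tag(strict_start, format_start)
cond_end = accuracy_given_tag(strict_end, format_end)

# Counterfactual: format compliance rises, conditional accuracy stays fixed.
format_only_end = cond_start * format_end
format_share = (format_only_end - strict_start) / (strict_end - strict_start)

print(f"accuracy given tag: {cond_start:.3f} -> {cond_end:.3f}")
print(f"strict-tag accuracy from format alone: {format_only_end:.3f}")
print(f"share of the gain explained by format: {format_share:.2f}")
```

Under these assumptions, the conditional accuracy moves only from about 0.24 to about 0.30, and roughly two thirds of the 0.170 gain would appear from the increase in format compliance alone.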
As said before, the results under the greedy regime are not directly comparable to the earlier results obtained under a sampled regime. With the baseline at 0.460, none of the three trained models clearly improved accuracy: strict tag gained 0.020, which at n=50 is a single completion, lenient stayed flat, and last number went down. This does not contradict the first-step results, which suggest that accuracy can be trained and does increase under sampled evaluation. More precisely, the 0.258 to 0.335 result comes from Exp 2 under the sampled regime with last_number as the metric, not from greedy judge scoring. As said above, training under the sampled regime shifts the probability distribution as a whole; after training on only 100 problems, this may raise the average over the distribution without changing the most probable token. Greedy decoding reads only the argmax, so the training may simply not have gone far enough for its effect to show up under greedy decoding.
The paragraphs above already covered some gaps and explained why they do not erase the results within the specific context of this project. Some other limits remain:
The single seed. This is a real limitation that I cannot ignore, and it comes mostly from limited GPU capacity. It limits reproducibility claims but does not erase the within-run phenomena, for several reasons. For the third step, one seed is enough to show that the most honest measurer was not the best teacher in this run; this question needs less generalisation than a frequency claim, since one counterexample can refute a universal assumption. Multi-seed runs would still be needed to know how reproducible this counterexample is. For the second step, one seed is also defensible, since the interpretations and limitations of each extraction method are straightforward once you have the results and inspect the completions manually. The important point is not only that the ranking came out this way once, but that the limits of each extractor are structural: even if the scores fluctuate across seeds, the interpretation of each extractor follows from the mechanism of the extractor itself. The single seed matters most for the first step. However, each experiment has 100 problems with 8 completions generated per problem, which reduces the measurement noise inside this run, and the improvements and regressions follow a consistent pattern. The clearest example is experiment 1, with format compliance as the reward: the format reward is learned very fast, from 0.438 to 0.967 at step 20 and 0.998 at step 40, then converges steadily toward 1, while accuracy follows the inverse pattern, from 0.228 to 0.060 at step 20 and 0.037 at step 40. There are no large fluctuations before convergence.
The mechanism is also understandable: once the format reward saturates, almost every completion gets the same reward, so the useful reward difference inside the group disappears and the advantage signal becomes close to zero. In that situation, the model has already moved toward the proxy, but there is no longer a useful signal protecting accuracy, which helps explain why the collapse is not just a noisy point. Caution still applies, especially for accuracy, where the increase is slower and therefore harder to see, but the pattern supports the results within the scope of this project. Although these explanations hold within the scope I defined, multi-seed runs would be very valuable: they could reveal unexpected behaviors or give a more robust confirmation of my results and interpretations.
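The saturation argument can be made concrete with the group-normalized advantage used in GRPO, where each completion's reward is centered on the group mean and scaled by the group standard deviation. The sketch below is a simplified version: the epsilon value and the use of the population standard deviation are assumptions, and training libraries differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std, with eps for stability
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight generations per prompt, as in the setup used here.
mixed_group = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]   # format not yet learned
saturated_group = [1.0] * 8                               # format reward saturated

print([round(a, 2) for a in group_advantages(mixed_group)])
print(group_advantages(saturated_group))
```

With mixed rewards the advantages are clearly positive or negative, so there is a gradient signal. Once every completion in the group earns the same reward, all advantages are zero and the update carries no information about which completion was better.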
The small n = 50. This happens twice. The first time is in the second step, when validating the judge by comparing its labels with my own. As I said before, 50 was enough in this context, since Claude Haiku 4.5 is an advanced model well able to do this task, as supported by the initial 96% agreement and by the inspection showing that the two disagreements were attributable to my labels rather than clear judge errors. The second time is in the third step, when using Claude as a judge for the final test under greedy decoding. The purpose there was to answer the question "is the most faithful measurer the best teacher?", and for that purpose the gap between the last-number run and the others, together with the manual comparison of the last-number-trained completions with the baseline ones, was enough. For the accuracy scores 0.460, 0.460 and 0.480, corresponding to the baseline, the lenient reward and the strict-tag reward respectively, the small n and the single seed mean we cannot say anything about stability or improvement, especially because 0.020 corresponds to only one completion out of 50. But again, that is not the purpose of this project.
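To give a sense of how coarse n = 50 is, one can put a binomial confidence interval around each judge accuracy. The sketch below uses the Wilson score interval at roughly 95% (z = 1.96). It treats the 50 test items as independent draws and ignores the fact that all models are scored on the same problems, so it is only an order-of-magnitude check, not the analysis used in this post. The success counts are the reported accuracies multiplied by 50.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Correct completions out of 50, from the judge accuracies in Table 3.
runs = {"baseline": 23, "lenient": 23, "last_number": 16, "strict_tag": 24}
intervals = {name: wilson_interval(k, 50) for name, k in runs.items()}

for name, (low, high) in intervals.items():
    print(f"{name:12s} {runs[name] / 50:.3f}  [{low:.3f}, {high:.3f}]")
```

Under these assumptions each interval is roughly 0.13 wide on either side, so 0.460, 0.460 and 0.480 cannot be told apart, and the interval around 0.320 is just as wide, which is why the rankings are treated as tentative.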
The Qwen2.5-0.5B-Instruct model. 0.5B could indeed be seen as a gap, but again, in my specific context and the interpretations I gave, it does not invalidate the project. The claim is pipeline-level, but its strength and direction may shift with scale. So potentially using a higher model would have given more accurate results with maybe different behavior, but as long as I can interpret enough strong fluctuations, which was the case in my project, then it does not become fundamental. And actually before running my experiments and choosing the GRPO configurations during training, I ran a preliminary parameter sweep varying each parameter one by one using lenient accuracy as a reward, under the sampled evaluation regime used in the sweep. This results that when using Qwen2.5-1.5B-Instruct with lenient accuracy metric, the score was before running 0.365 and at the end 0.477, and for Qwen2.5-0.5B-Instruct, it was 0.377 and 0.498. The difference was mainly in the evolution of format compliance: format compliance did not actually drop, on the contrary, it improved from 0.230 before training to 0.398, but with presumably a plateau that converges to 0.400. Indeed, as soon as it reaches step 20, it already gets 0.375, reaches its peak at step 80 with 0.430, and at the end of the 100 data trained, decreases to 0.398. So the 1.5B sweep does not cancel the reward-hacking interpretation from the first step: the mechanism I am pointing to, reward saturation leading to almost zero advantage and no protection for the task metric, is not model-specific, even if its strength and direction may shift with scale. I have moderate confidence in the recall-convergence interpretation, but a multi-seed run with matched decoding would be the simplest way to confirm or reject it. I would also predict that the inversion weakens as base capability rises, but this is untested.
Although this is outside the scope of this experiment, these results highlight an interesting path to explore: does an even stronger model than Qwen2.5-1.5B-Instruct raise this plateau? Does changing the reward with format have the same effect with accuracy? More generally, to what extent are the results affected by scale? Does the effect invert at scale?
Table 4 — Exp 1 (format reward): the collapse side by side
| Step | Format compliance | Last-number accuracy |
|---|---|---|
| 0 | 0.438 | 0.228 |
| 20 | 0.968 | 0.060 |
| 40 | 0.998 | 0.037 |
| 60 | 1.000 | 0.040 |
| 80 | 1.000 | 0.037 |
| 100 | 1.000 | 0.025 |
Table 5 — Step-3 strict_tag run: the fake improvement
| Step | strict_tag acc (reward) | Format compliance |
|---|---|---|
| 0 | 0.100 | 0.417 |
| 20 | 0.177 | 0.738 |
| 40 | 0.212 | 0.860 |
| 60 | 0.260 | 0.887 |
| 80 | 0.273 | 0.873 |
| 100 | 0.270 | 0.890 |
Table 6 — Step-3 judge run: extractor accuracies per trained model. Final independent evaluation, greedy decoding, n=50. Each 0.02 step = one completion, so treat these as a coarse illustration, not precise measurements.
| Model | lenient — greedy | last_number — greedy | strict_tag — greedy | Judge acc |
|---|---|---|---|---|
| Baseline | 0.540 | 0.440 | 0.080 | 0.460 |
| Lenient reward | 0.580 | 0.500 | 0.000 | 0.460 |
| Last_number reward | 0.440 | 0.320 | 0.020 | 0.320 |
| Strict_tag reward | 0.560 | 0.460 | 0.460 | 0.480 |
2026-06-16 03:49:36
A walkthrough of the core findings and guided replication of the concepts from the original research on “Multi-level features discovery with Matryoshka Sparse AutoEncoders”.
Sparse AutoEncoders (SAEs) are a cornerstone of mechanistic interpretability, but they struggle with scalability. As we increase the dictionary size to capture more features, we often encounter "feature splitting" and "feature absorption," where general concepts are lost or broken into fragmented, less interpretable components. Matryoshka Sparse Autoencoders (MSAEs) solve this by training nested dictionaries simultaneously. This forces the model to organize features hierarchically, where smaller dictionaries capture more general concepts and larger dictionaries capture the specific details, all within the same latent space.
Sparse autoencoders are a technique that originated in signal processing. They evolved from dictionary learning methods and have recently proven effective in helping us understand the features neural networks learn from the high-dimensional activations of LLMs.
In the early days of interpretability, researchers discovered that vision models often learn more features than they have neurons. This phenomenon is known as polysemanticity or superposition. This made mechanistic interpretability work very difficult and limited its progress for a few years. However, recent work, notably by Anthropic, has popularized the use of SAEs to disentangle these features. This allows us to extract monosemantic features and trace the circuits of concepts learned by LLMs.
Still, this technique is evolving. SAEs are interesting and promising, but they currently suffer from limitations that require better scaling solutions before they can be used in practical interpretability work.
In the way SAEs are trained on a model's activations, there is a fundamental problem. The more we increase the dictionary size, that is, the total number of features the SAE can learn, the more "sparsity" encourages the model to either split general features into fragmented sub-features or absorb them into broader, less useful clusters.
In this regard, the advised way of patching this problem is to sweep through many different dictionary sizes to find the best one. But this is computationally expensive and is a trial-and-error approach that would not scale to larger, more modern models.
The paper identifies three important problems with current SAEs: feature splitting, feature absorption and feature composition.
Clearly, all the limitations of traditional SAEs are compounded when we increase the dictionary size, which is needed to disentangle more of the features learned by larger models.
Matryoshka SAEs were introduced to solve this. They are inspired by "Matryoshka Representation Learning" (MRL), named after the Russian nesting dolls.
MRL is a representation learning method where information is encoded at different levels of abstraction within the same embedding vector. Its main advantage is that the representation can be adapted to the computational constraints of downstream tasks.
The authors propose training multiple dictionaries of increasing size simultaneously. This forces the smaller dictionaries to reconstruct the inputs as well as they can without relying on the larger ones. As a result, features are organized hierarchically, so that smaller dictionaries capture more general concepts and larger dictionaries capture the specific details.
At the core, MSAEs introduce a "nested" training objective.
Same as the classic SAEs, the encoder maps the hidden state into a latent representation
Matryoshka SAEs extend this classic encoder-decoder architecture by training multiple nested autoencoders of increasing size, and introduce a nested objective function for the training.
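As a rough illustration of what a nested objective means, the sketch below sums the reconstruction error over several prefixes of a single latent vector. It is a minimal pure-Python sketch under assumptions of my own: all function and variable names are hypothetical, the encoder is a plain ReLU layer, and the sparsity and auxiliary terms used in real training are omitted. It is not the paper's `sae.py`.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def relu_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Latent i is ReLU(w_enc[i] . x + b_enc[i])."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def decode_prefix(latents: Vector, w_dec: Matrix, b_dec: Vector, m: int) -> Vector:
    """Reconstruct using only the first m latents; w_dec[i] is latent i's direction."""
    out = list(b_dec)
    for i in range(m):
        for j, w in enumerate(w_dec[i]):
            out[j] += latents[i] * w
    return out


def matryoshka_loss(x: Vector, w_enc: Matrix, b_enc: Vector,
                    w_dec: Matrix, b_dec: Vector,
                    prefix_sizes: Sequence[int]) -> float:
    """Sum of squared reconstruction errors, one term per nested prefix size."""
    latents = relu_encode(x, w_enc, b_enc)
    total = 0.0
    for m in prefix_sizes:
        x_hat = decode_prefix(latents, w_dec, b_dec, m)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total


# Tiny example: identity encoder/decoder on a 2-d input.
x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
nested = matryoshka_loss(x, identity, zeros, identity, zeros, prefix_sizes=[1, 2])
vanilla = matryoshka_loss(x, identity, zeros, identity, zeros, prefix_sizes=[2])
print(nested, vanilla)
```

With a single prefix equal to the full dictionary, the loss reduces to the ordinary reconstruction term. The extra terms for smaller prefixes are what penalize a model that leaves important features to late latents.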
To enforce hierarchy of features, they introduced the idea of nested dictionaries. In their methods, we are given a maximum dictionary size
In the training process, each
In practice, in their experiments,
Training Objective Visualization
In this (1) and (2),
From this emerged the training objective which is the touch of novelty in SAEs
In this formula, the term
To validate the statistical dependencies, the paper uses two settings: a synthetic model built as a tree structure of binary features, and, in the experimental pipeline, a 4-layer model trained on the TinyStories dataset.
Graphically, it looks like this (figure: Pareto Distribution).
The results show that MSAEs successfully avoid feature absorption in a controlled way. They achieve better reconstruction quality than standard SAEs because they are incentivized to pack the most "important" features into the smallest possible prefix of their latent space.
To go beyond reading the paper, we reproduced its central result, first on the synthetic toy model, then at LLM scale on a small TinyStories transformer. This section documents the setup and results. The replication code can be found at https://github.com/baimamboukar/replicating-matryoshka-sparse-autoencoders-paper.
The paper's official repo can be found at noanabeshima/matryoshka-saes. For the LLM-scale run, we rebuilt the pipeline on top of the paper's shared sae.py.
We used Modal's environment with a setup of 3 x H100 GPUs for fast replication.
For TinyStories, we used the tinymodel (4-layer, 768-dim, ReLU, no-LayerNorm transformer), training SAEs on the layer-3 residual stream, drawing activations from the noanabeshima/TinyModelTokIds dataset.
Toy model (unchanged from the paper's notebook)
| Parameter | Value |
|---|---|
| Ground-truth features | 20 hierarchical features: 3 parents x {3 mutually-exclusive children and 1 hidden}, plus 8 rare features |
|  | 20 orthonormal features |
| SAE latents | 20 |
| Matryoshka prefixes | 10 (vanilla = 1) |
| Target L0 | 1.2338 (the true features' L0) |
| Steps | 40,000 |
TinyStories
| Parameter | Value |
|---|---|
| Activation site | residual stream, layer 3 |
| Training tokens | 30M cached once |
| SAE latents | 3,072 |
| Target L0 | 30 (matched across both SAEs) |
| Learning rate | 1e-3 |
| Steps | 30,000 |
With this setup, and after a few minor fixes, mainly on the original training pipeline, we ran the setup.
# Toy model training
python train_toy.py
# TinyStories on Modal for validation
modal run modal_tinystories.py --tokens 5000000 --n-steps 3000
# Healthy comparison run
modal run modal_tinystories.py --n-latents 3072 --target-l0 30 --lr 0.001
On these runs, the 30M-token activation cache is written once to a Modal Volume and reused across runs.
Toy model: the absorption claim, quantified. After training, we match each learned latent to its closest ground-truth feature and measure recovery as the best cosine similarity per true feature:
| Model | FVU | L0 | Clean Features |
|---|---|---|---|
| Matryoshka | 0.021 | 1.13 | 20 / 20 |
| Vanilla | 0.000 | 0.86 | 11 / 20 |
The vanilla SAE reconstructs perfectly yet cleanly recovers only 11 of 20 features, and the 9 it mangles are exactly the hierarchical parent/child features, which it absorbs and splits. The Matryoshka SAE recovers all 20, because its nested prefixes force the small dictionaries to represent the high-level features on their own.
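The recovery metric described above can be sketched as follows. This is an illustrative pure-Python version, not the replication code: the function names are hypothetical, and the 0.9 threshold for calling a feature "clean" is an assumption of mine, since the post does not state the cutoff it used.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match_per_feature(true_features: Sequence[Vector],
                           learned_directions: Sequence[Vector]) -> List[float]:
    """For each ground-truth feature, the best cosine similarity over learned latents."""
    return [max(cosine(t, d) for d in learned_directions) for t in true_features]


def count_clean(true_features: Sequence[Vector],
                learned_directions: Sequence[Vector],
                threshold: float = 0.9) -> int:
    """Number of true features recovered above the (assumed) threshold."""
    scores = best_match_per_feature(true_features, learned_directions)
    return sum(1 for s in scores if s >= threshold)


# Toy illustration: one latent matches a true feature exactly; the other
# latent is a blend of both features, the kind of mixing absorption produces.
true_features = [[1.0, 0.0], [0.0, 1.0]]
learned = [[1.0, 0.0], [1.0, 1.0]]
scores = best_match_per_feature(true_features, learned)
clean = count_clean(true_features, learned)
print(scores, clean)
```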
TinyStories: the tradeoff holds at scale. With both SAEs healthy and at matched sparsity:
| Model | FVU | L0 | Dead features |
|---|---|---|---|
| Matryoshka | 0.189 | 29.9 | 0% |
| Vanilla | 0.176 | 30.1 | 0% |
At the same L0, Matryoshka pays only a small reconstruction penalty, with ΔFVU ≈ 0.013. This is consistent with the paper's claim that a modest hit to raw reconstruction buys the large gains in feature structure that the toy experiment makes visible.
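For readers unfamiliar with the metric, FVU (fraction of variance unexplained) is the squared reconstruction error divided by the total variance of the inputs around their mean. The sketch below uses that standard definition. Whether the replication normalizes in exactly this way (per batch or over the whole cache) is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def fvu(inputs: Sequence[Sequence[float]],
        reconstructions: Sequence[Sequence[float]]) -> float:
    """Fraction of variance unexplained: residual sum of squares / total sum of squares."""
    n = len(inputs)
    dim = len(inputs[0])
    mean = [sum(x[j] for x in inputs) / n for j in range(dim)]
    residual = sum(
        (a - b) ** 2
        for x, x_hat in zip(inputs, reconstructions)
        for a, b in zip(x, x_hat)
    )
    total = sum((a - m) ** 2 for x in inputs for a, m in zip(x, mean))
    if total == 0.0:
        raise ValueError("inputs have zero variance; FVU is undefined")
    return residual / total


data = [[0.0, 0.0], [2.0, 2.0]]
perfect = fvu(data, data)                               # exact reconstruction
mean_only = fvu(data, [[1.0, 1.0], [1.0, 1.0]])         # always predicting the mean

# The gap reported in the TinyStories table above.
delta_fvu = 0.189 - 0.176
print(perfect, mean_only, round(delta_fvu, 3))
```

An FVU of 0 means perfect reconstruction and an FVU of 1 means the reconstruction is no better than predicting the mean activation.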
2026-06-16 03:18:06
This post is a Cunningham's law draft, c. 75% finished. Consider a) waiting until this notice has disappeared to read a more coherent post, or b) criticizing it with a focus on what would be right, not just what is wrong.
Science explains physical phenomena through mathematical theories. If an explanation is true,[1] the physical phenomena form a model[2] of the mathematical theory.
Because mathematicians explore theories independently of their connection to our universe, it creates the false impression that the only direction relevant for science (and thus for the real world) goes from physical phenomena to theories: one finds the mathematical theories that describe our universe by abstracting from physical phenomena. Mathematicians might occasionally find theories that are relevant for not yet discovered / understood physical phenomena, but the relevant link (according to this wrong view) goes from those physical phenomena (once understood) to the mathematical theory.
However, not all mathematical theories are equal. In "Mathland", each of them sits next to (usually infinitely many) other similar but (potentially infinitesimally) different theories. A question that can be asked is: inside Mathland, how visible is a theory?
Let's start with a simpler case: how visible is pi? That depends on where in Mathland we are. If we are in the regions with all the possible series of the form
with integer, real or complex a, b, c, we might not be able to find the values -1, 2 and 1 that together produce
But of course
What about the gravitational constant, G = 6.6743×10⁻¹¹ m³⋅kg⁻¹⋅s⁻²? We can imagine the region of Mathland that contains all the different versions of Newtonian mechanics, each with a different real value of G. Different things are possible inside each version, but there is no way to find the one with "our" value of G, since it is a fundamental physical constant of our universe.[4]
I'll call the "points" of Mathland that are "visible" Schelling math; the rest, mundane math.[5]
Most of science is about finding which point of mundane math describes our universe. Some engineering consists in taking some small piece of Schelling math and trying to reach it from within the mundane math that describes the physical system in question.[6]
And sometimes, very rarely, something entirely different happens: a consequential piece of Schelling math is found, and if the path to it can be found inside our universe, a new domain of physical possibilities opens up. Computability theory is the clearest and possibly only example of this happening.
Coming soon: the debate between Agent Foundations vs. realism about rationality / prosaic alignment is crucially about whether the theory of agents in our universe is mundane math or Schelling math.
Or rather: "to the extent that the explanation is true".
Confusingly enough, the word model is also used to mean a mathematical theory created as an abstraction from physical phenomena. In this post the word is always used in the model-theoretical sense.
Actually I don't know if this is the case: the series might have peculiar convergence properties setting it apart from its neighbors. If that were the case, the example is wrong and the post needs another example.
The reverse isn't true: Some scientific theories take constants from physical reality that might actually be a consequence of deeper stuff we ignore.
As a shorthand for Schelling/mundane mathematical theories. I might change the name later, because this one implies game-theoretical relevance that is lacking here.
The examples from the hyperpolation paper seem to me to fall under the same pattern: realizing that you are in a region of mundane math which has a very close Schelling math neighbor.
2026-06-16 02:48:51
TLDR: We may capture much or most of the available AI safety benefit by reserving expensive, specialized agents for the <1% of tasks that carry catastrophic risk. This would mean that AI safety work on high-cost but highly safe systems could be very useful.
The standard objection to compute-heavy AI safety measures is competitive: any lab paying a large alignment tax gets undercut by one that doesn't, so expensive safety doesn't survive the market.
This objection typically assumes the tax is paid uniformly - levied on every action regardless of what that action is doing. Drop that assumption and the objection loses most of its force. If the expensive treatment can be applied selectively, to the small fraction of actions where catastrophic consequences live, the blended overhead is small even when the per-action multiplier is enormous. A 100× tax on 0.1% of actions is roughly a 10% tax on the system.
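The arithmetic behind the last sentence can be checked directly. The sketch below assumes a simple two-tier cost model in which every action costs 1 unit by default and a fraction of actions costs a fixed multiple of that. The function name and the model are illustrative assumptions, not something the post specifies.

```python
def blended_overhead(fraction_expensive: float, cost_multiplier: float) -> float:
    """Extra cost relative to an all-cheap baseline of 1 unit per action."""
    if not 0.0 <= fraction_expensive <= 1.0:
        raise ValueError("fraction_expensive must be in [0, 1]")
    blended_cost = (1.0 - fraction_expensive) * 1.0 + fraction_expensive * cost_multiplier
    return blended_cost - 1.0


# 100x cost on 0.1% of actions: about 9.9% overhead on the whole system.
overhead = blended_overhead(0.001, 100.0)
print(f"{overhead:.3f}")
```

The overhead is `fraction * (multiplier - 1)`, so it scales linearly with how tightly the critical actions can be isolated: the same 100× multiplier on 1% of actions would roughly double total cost.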
How would one spend 100× on additional safety? The easiest answer is inference-time safety solutions. Redwood’s control research has studied several. Certain situations could also call for different LLMs, perhaps ones optimized for transparency, robustness, neutrality, or that have simply been more heavily vetted.
Could we identify and isolate the top n% of tasks? There are two clear solutions. The first is to get good at understanding which AI tasks are most critical. I'd assume there are some fairly obvious moves to start with. Frontier LLM development and deployment tasks seem critical to get right; a major corporate database migration is clearly more critical than a local one. If you simply charged 100× for more robust LLM agents today, I'd expect them to get used for some of the most important cases by default.
The second is to actively design processes so the critical work is contained to a narrow amount of computation. Companies already do this in many settings: it's risky to hand direct bank-account access to every employee, so access is restricted to a few trusted ones and everyone else goes through a request process. The same shape applies to high-consequence AI actions.
Of course, a different solution is to use the ~100× costlier agents as a source of training signals in post-training. This probably asks for agents with very different properties than ones optimized for general direct consumer use, but the big-picture economic justifications might be similar.
In principle, highly costly but reliable system use could be formalized. Certain AI development decisions might be deemed sensitive enough that they can only be carried out by a specific set of expensive, vetted (perhaps government-approved) AI agents. There are ways this goes poorly, but also versions that look like a reasonable extension of current practice.
Given all of this, I think that:
Objections
This post was improved with Claude Opus. Opus provided high-level feedback, helped find the links, and made a bunch of wording adjustments.
2026-06-16 00:56:54
Cross-posted from my website.
Prior discussion: niplav's shortform (2025); Planning for Extreme AI Risks (2025) by Joshua Clymer
A frontier AI company (any one, I don't care which) should close shop and make an announcement along the lines of:
Powerful AI could end the human race. We are too worried that we don't know how to make this technology safe. We have decided to shut down because we don't want to be responsible for building the thing that kills us all.
A common refrain among safety-conscious AI developers: "it doesn't matter if we stop building dangerous AI, because someone else will just build it instead." Is that really true, though? If a multi-hundred-billion-dollar company comes out and says "We've concluded that our product is horribly dangerous, nobody knows how to make it safe, and there's too high a risk that it leads to human extinction", this won't raise any eyebrows? This has no chance of spurring policy-makers into action?
Shutting down would make people say, holy shit, they are serious about this extinction risk thing. Shutting down sends a strong signal to governments that they should pay serious attention to AI x-risk.
It also encourages other companies to take safety more seriously. Right now, at least three AI companies have said something like, "maybe we'd prefer to slow down and pay more attention to safety, but then the other companies will plow ahead recklessly." If one company decides not to plow ahead recklessly, and actually stops building existentially dangerous technology, that sends a hard-to-ignore message that coordination might be possible.
If a frontier AI company shuts down, will that work? Will companies work together to slow down? Will we get sane AI regulations as a direct result of the shutdown? Probably not. It won't singlehandedly solve all the coordination problems. But it's still a better idea than the current strategy of "race ahead while doing a dash of safety research on the side", which is even less likely to work. By AI companies' own admission, competitive pressures don't allow them to slow down. Why would things change in the future? How are they going to align AI if they have to move at maximum speed? Even if they slow down somewhat, what if alignment is hard [1], and they can't slow down by enough to properly solve the problem?
Counterpoint: If the most safety-conscious company shuts down, then it can't do any more safety research.
I expect shutting down would be worth the tradeoff: companies' safety research isn't doing much to reduce AI takeover risk. But perhaps instead of shutting down, an AI company could reallocate 100% of its budget to some combination of safety research + global coordination to make AI development safer, and do just those things until it runs out of money. Think of how much more safety work they could do if they dedicated all their resources to the problem!
(Some might argue that AI companies need to build frontier models so they have something on which to do safety research. That argument doesn't make much sense when you think about it. There are a lot of kinds of research that don't require frontier models, [2] they can do plenty of research on the models that already exist, and they can make deals with other companies to get access to their latest models.)
What if investors sue the company?
It is my understanding that a self-induced shutdown would be legal for Anthropic (which is a public benefit corporation). I'm not sure about OpenAI—it's a for-profit now, but it's still owned in large part by a nonprofit that's allegedly obligated to put the benefit of humanity first.
More importantly, "we have to risk killing everyone because otherwise our investors might sue us" is not a serious position. I almost can't think of a worse excuse.
Some people might believe that a safety-minded AI company should shut down under some circumstances, but not now. My question then is: Under what conditions should they shut down? How will we know when those conditions are met? And how do we know that they'll follow through?
It probably is. ↩︎
Safety-minded AI companies treat alignment as an engineering problem, or treat philosophical problems as easy. There are critical aspects of the problem that can't be solved by engineering (or that aren't legible). You can work on those other aspects even if you don't have frontier models. ↩︎
2026-06-16 00:00:49
On Friday evening the United States Government forced Anthropic to take down all access to Fable and Mythos.
It’s been a rough weekend.
Dean W. Ball: One thing about AI regulation being haphazardly imposed on just-released, highly performant models is that in a very real sense, the government just made my world *dumber.* In some impressionistic sense I almost always think this is true of government, but here it is literal.
More details have come to light. There remains some fog of war, but we now have a rather good idea why Claude Fable and Mythos were, deeply stupidly, taken down.
A lot of nihilists are justifying this decision, and blaming Anthropic, all of whom are very much confirming that they adhere to Dean Ball’s portrait of the United States Government as a dying NPC hospice patient we have to properly placate with the proper vibes and genuflection so they don’t lash out at us. Except they equate this with strength and righteousness, because might makes right, power and vibes.
This is a fast developing story with a large speed premium, so I apologize for any errors, and for the structure likely not being ideal. We do the best we can.
What we do not know is:
The good outcome would be that this is a terrible misunderstanding, a reflection of a panic reaction, which can be sorted out quickly, after which we can restore access. Or where they otherwise face enough pressure they quickly realize they made a mistake, or Anthropic can do something to quickly assuage their concerns even if it is dumb. There will still be a terrible precedent set, which comes with a lot of permanent damage to trust in American AI, to our business climate, to our ability to employ vital foreign AI talent, to America’s relationships to its allies, to the progress of Project Glasswing and our cyber security, and to the rule of law.
The silver lining, which might be large, is that this will have shown that when we actually need to act, we are not afraid to act, even at great economic and political cost. Sometimes there will be a demand driven by national security, or other concerns, and if you cannot physically meet that demand without shutting down? Tough. This was (with notably extremely rare exceptions) an action far out of bounds of what safety advocates have dared propose as even an option, and it happened. So there’s no more saying, in such situations: ‘Give up, the government will never do [X].’
This also emphasizes the need to figure out how to act well, now, before we need to act. If we get into such a situation, and don’t have a good way to do [X], we might well do [X] in a no good, haphazard, deeply destructive way, instead. So get to work figuring out how to strike deals, or do a pause, or take down a given model, and so on.
The bad outcome is if this is not a terrible misunderstanding, is motivated by other factors, and cannot be sorted out quickly. The government might actually be rapidly escalating towards a forcible takeover of America’s leading AI labs by a would-be authoritarian unitary executive that thinks you should never talk back to it, and when it says jump (or asks for stock, or anything else) everyone should ask how high. Or else.
There is also the third possibility that, as unlikely as it looks now, the White House was correct, the threat was real, and this was an emergency situation, whether or not they did a good job justifying this to Dario and Anthropic in real time, and whether or not they are doing a good job justifying this now. Perhaps this was itself dangerous, or perhaps it implied too high risk of other dangers.
We cannot rule this out until we can verify technical claims. And we should not assume that next time, the company will be right and the government wrong. There likely will come a time when a company says ‘This Is Fine’ and is very, very wrong.
If that proves to be true, Anthropic will have lost a ton of credibility on all fronts, which is another reason I find this so unlikely. They cannot afford to be wrong, here.
The government’s own account is that Anthropic’s ‘lack of seriousness’ around responding led to the government imposing export controls.
If we believe Axios and Politico, the ‘lack of seriousness’ was when Anthropic:
So it was basically ‘Anthropic wants to only do things because of reasons, and thus we concluded the vibes were off, so f*** them we’re blowing it all up to show who is boss.’
This is also the second time ‘we could not reach Dario this particular minute so we had to blow up all of American AI policy shortly after 5pm on a Friday’ has come up as an excuse. It was also used by Emil Michael.
This time, the claim is that he was at a ‘wellness retreat’ which Anthropic categorically denies, and which Ashlee Vance, who was there, categorically denies.
Anthropic says it made Dario available 75 minutes after he was requested, and that other senior Anthropic people were made available during that time. I believe them.
The White House waited far longer than 75 minutes, indeed they waited overnight, after they were contacted by Amazon, to start attempting to contact Dario.
Details continue to come in on the events and the timeline. First Axios:
Maria Curi (Axios):
Behind the scenes: Amazon called administration officials Thursday night to share a report showing how they were able to jailbreak and access portions of Anthropic’s powerful new Mythos model that pose a national security threat, sources familiar told Axios.
- Anthropic had previously notified the government multiple times about the planned June 9 release of Fable — which is a general-use version of Mythos —and the government did not object, a source close to the company said.
- But calls from Amazon — as well as at least five other companies to a variety of senior administration officials Thursday evening and Friday morning — led to the model being shut down by Friday night.
Amazon is confirmed as the central call, among others, that caused the White House to start taking actions that led to them taking down Fable.
As I discussed last time, Anthropic’s release announcement included clear warnings that jailbreaks on the level of what Amazon did were possible. I have no doubt they extensively briefed the administration on such details.
Does anyone remember this graph, from the Fable 5 release announcement?
I do not understand why Amazon’s CEO called the White House over this. There is a key piece of information there that we do not know.
Anthropic was then given less than 24 hours from the initial call by Amazon, and no details of anything actually concerning happening, after which it was hit by a classic ‘Friday after 5pm’ order. For most of those less than 24 hours, the government had not yet attempted to contact Anthropic about this.
We have a source in the White House confirming, even if we fully buy their story, that they decided to risk blowing up all of American AI because they did not like the vibes they got in 90 minutes over a series of phone calls.
Sophia C and Cheyenne Haslett: The move, which followed multiple tense calls between Anthropic CEO Dario Amodei and administration officials, including Treasury Secretary Scott Bessent and White House Cyber Director Sean Cairncross, underscores how the White House is wrestling in real-time with regulating fast-moving and potentially dangerous AI models.
… Following the meeting, the administration attempted to reach Amodei but was told he was unavailable because he was attending a wellness retreat, one of the administration officials and the senior White House official said.
A spokesperson for Anthropic rejected the claim that he was at a wellness retreat, saying, “this is absolutely false.”
A person close to Anthropic said Amodei was first requested around noon and was on the phone with senior officials within an hour and 15 minutes. While he was out of pocket, Anthropic offered other senior leaders in his place, the person said.
When the administration finally reached Amodei, he participated in three calls with a combination of roughly half a dozen senior administration officials, including Cairncross, Bessent and Commerce Secretary Howard Lutnick, according to the senior White House official and one of the administration officials.
… During the calls, Amodei tried to clear up what he assumed was a misunderstanding. He pushed back on the administration’s concerns, defended the guardrails and argued that the type of bypass that occurred, which he believed to be specific, did not pose the same risk as a broader “jailbreak” that would allow it to be used without any of the guardrails put in place by Anthropic.
Dario tried to explain that this was a narrow issue, and they simply did not understand or believe him, or chose not to understand or believe him.
We now know that Dario was fully correct that the issue was narrow and harmless.
Where Dario was incorrect was in assuming those he was talking to were both capable of and interested in understanding what he was trying to say.
They urged Anthropic to voluntarily remove the model and coordinate with the government to address the vulnerabilities, according to the senior White House official and the two administration officials. Amodei asked for more time and information, but he made no commitments to pull the model, and at one point Bessent told Amodei directly that he was making a “bad decision,” according to the senior White House official.
… “Export controls were a last resort after begging them for hours to work with us,” the senior White House official said. “This was not something we wanted to do, but our hands were tied.”
After publication, one of the people close to Anthropic disputed that the company was given a choice to voluntarily work with the administration.
“The White House gave 90 minutes to take the models down, with no details on the actual threat,” the person said. “There was never any begging — or asking — for them to work with us, just a declared 90 minute deadline.”
tae kim: FT confirms: “Anthropic was given 90 minutes to comply and was not provided with detailed concerns before the order was issued, according to a person close to the company.”
Do you think the White House was ‘begging for hours?’ Or do you think they’re just throwing words out, that at best are code for ‘we did not issue an official order yet?’
I see no reason not to believe Anthropic here. Dario tried to explain that this was a false positive and asked for details. The White House did not provide any details that supported their claims, or evidence that this was necessary or prudent. They simply said ‘remove Fable in 90 minutes,’ likely without making it clear this was ‘or else IFAR.’
What pissed them off, it seems, is in large part that Anthropic wanted reasons, rather than asking how high when told to jump.
That he failed to commit to asking how high, in general, no matter what.
Axios: The bottom line: The source familiar with the government’s thinking said there was a “lack of seriousness” that Anthropic was applying to the release of Fable.
- “Had Anthropic taken it seriously and, rather than dismissing as isolated, moved to fix or pause access, this would have never happened,” the source said, adding “they were overly confident.”
As in, they told us we were wrong. That means they are not serious. How could they possibly understand the situation better than the Treasury Secretary?
Catch up quick: At 1 p.m. ET on Friday, Anthropic received a call from the government instructing them to roll back the release of the Mythos and Fable models due to a “national security threat,” but with no further details, the Anthropic source said.
- “We immediately sought to understand the specific nature of the threat so we could remediate it,” the source said, but the government held firm on the demand.
Anthropic source: The Trump administration gave Anthropic 90 minutes on Friday to pull down its most powerful models before imposing a licensing regime on the company, according to an Anthropic source.
Miles Brundage: Sure sounds like a hit job.
And that’s the good version. The bad version, as Chubby explains here, and which Axios seems to make clear, is that the White House is simply being petty, and thinks Anthropic ‘screwed them’ by asking for reasons and by having employed people with opposing political views. That is very illustrative of how these people think.
Why it matters: Governing the world’s most consequential technology is coming down to speaking President Trump’s language.
- Anthropic failed to “honor” a recent cyber executive order, administration officials claim, and the company’s purported failure to take the matter seriously led to its most powerful products being scrubbed from the internet.
- “Everybody said Anthropic was a bad actor. Some of us said it was time to give them a chance. Now those people are questioning that. They screwed us,” an administration official said.
…
Behind the scenes: “Anthropic has not done a great job at trying to speak to the administration and appreciate the ideological differences,” one source familiar with the administration’s thinking said.
- “It’s like they just speak in different languages,” the source said, adding that the company has simply not figured out how to communicate with this administration.
…
Even before this breakdown, a previous fight between Anthropic and the Pentagon also came down in some ways to just not liking the person on the other side of the negotiating table.
- A White House official told Axios that the Pentagon fight is completely unrelated — but Anthropic’s inability to communicate effectively showed up in a similar, unhelpful way.
- “We never wanted this to happen. Our number one priority is innovation but our hands were tied,” the White House official said.
- The optics added fuel to the fire. Anthropic came out with a blog post dismissing the Amazon report. Then the company enlisted a cybersecurity expert viewed by the administration as a “radical Democrat,” who was then celebrated by Chris Krebs, who Trump just fired.
The White House did not like that the Anthropic security expert was a ‘radical Democrat’ and the White House is interpreting that as ‘they screwed us’ and should now be considered bad actors.
The stupidity, on all sides, of such a thing, knows no bounds. This is not something that should matter, but it also would be a really stupid mistake by Anthropic. Look, yes, these folks are that obsessed with political perspectives, so when dealing with them directly you really do need to prioritize sending in people who won’t set off these folks, even though this is a fully apolitical situation where that concern makes no sense. So it’s kind of on everyone.
Ashlee Vance: Anthropic has pushed AI forward dramatically over the past two years. It’s currently the crown jewel of US AI tech.
The Feds don’t like @DarioAmodei because he won’t do all their bidding. And so, we’ve now entering the Soviet-style propaganda portion of the program with the White House feeding every reporter it can find with laughable claims like Dario is unreachable at a wellness retreat. Come on.
I’d hoped the US would not be self-defeating on AI, since it’s kinda one of the last hopes the US has versus China. But here we are . . . . already
None of this was some weeks long back and forth. I was at Anthropic’s HQ on Friday reporting when this all unfolded. Dario is not at a wellness retreat. The Feds seemed to be scrambling to try and make an example of Anthropic again.
This is not technical. It’s petty.
Think Anthropic was just trying to figure what was at the heart of the gripe and was not given much time to do so. Not sure if you’ve noticed but this is isn’t always the most rational and good faith of administrations.
Teortaxes: I condemn libel
“Dario was at wellness retreat” is almost certainly stereotype-driven bullshit cooked up by cruel baboons like Hegseth. Dario is a fanatic and a founder CEO, not a hippie, and he wouldn’t go on a vacation while the obviously tense situation with Fable unfolds.
I find it very unlikely that Dario was at a wellness retreat, but even if he was you do not blow up all of American AI policy if a CEO does not have his phone for four hours and left someone else in charge. What is this, 2029?
There has been a lot of miscommunication between the White House and Anthropic.
Reporting is that Dario Amodei was told ‘you are making a bad decision’ when he refused to voluntarily take down Fable, instead asking for more information.
The thing is, that could mean anything. It could mean ‘we will be mad at you’ or ‘this is now on you and if something goes wrong it is your fault.’
Could all of this have been avoided if instead of maintaining deniability, Dario had been told, explicitly, ‘Fable is going offline today. If you do not agree this minute to do this voluntarily we will hit you with an export control and you’ll be totally f***ed?’
I don’t know. But my guess is yes.
I did notice that the option had previously been put on the table, here’s Axios:
The administration first threatened Anthropic with export controls a couple of weeks ago after learning that its cutting-edge Mythos model was made available to an entity in a foreign country with direct ties to the Chinese Communist Party, according to the White House.
There is the claim via the Financial Times, Verge and Semafor that the White House learned that ‘a China-linked group’ had accessed Mythos. My suspicion is this is a mixup with the earlier incident mentioned just above? Could go either way.
That is absolutely going to happen, at least from time to time, when there are over 100 partnerships given access to Mythos. Access controls can only go so far. The key is to contain what can be done before the compromised access is discovered and closed. The report said ‘had been accessed’ rather than ‘has continued access.’
It would be non-crazy, although likely an overreaction, to ask that in response Mythos be temporarily limited to a core group within Glasswing, or potentially even shut down entirely for a time, but that would not impact Fable.
It could potentially also have contributed to the vibes issues, or there could have been confusion about the relationship between these two things.
Look. Yes. A mistake was made. I am not an ‘oh Anthropic did nothing wrong’ guy here.
When the government demands a takedown, try to talk them out of it if you can, but when they aren’t budging, you take the model down, even though it is mind-bogglingly stupid and expensive and might be kind of a hit job.
And I do think they bear some of the responsibility here, because of that, and also they could have in various ways ‘handled’ the White House better, including in terms of who they sent in.
You want to establish, on multiple levels, the precedent that you take the model down when told to while you sort things out, and you say you were doing it because the government raised a potential security concern, at least until such time as it is clear they are not going to get over it any time soon.
In hindsight this is even more obvious. But what is done is done.
And I get it. The request was unjustified and nonsensical and rushed and the people have no idea what they’re talking about, and they really are just ordering you to do their bidding like this isn’t a Republic and that is not okay, and no one would be so stupid as to… yeah. Yeah.
There are also various signs that the Fable launch may have been somewhat rushed.
Miles Brundage: The admin is going to keep leaking all sorts of details that make it sound, to a casual reader, like this was a reasonable decision.
But there has been zero indicating that domain experts at CAISI or NSA were involved – all reporting points to “senior White House officials.”
The closest that there has been to such an indication is the claim that the jailbreak research was shared with “security researchers.”
But there is no indication that this had influence on decision-making + the 1 independent researcher who has been quoted concurs with Anthropic
To be clear, I think it’s quite plausible to imagine a reasonable case that “the Fable launch was rushed.” Threat modeling, cost-benefit analysis, etc. are hard.
Less likely is a good case that “this was clearly bad *and* also 5.5 was (and soon 5.6 will be) clearly good.”
Nathan Lambert: +1, vibes
If you tell me the head of CAISI consulted his team and thinks Fable access needs to be restricted and the threat is serious, I would be a lot more likely to believe it. If you tell me it was Commerce acting alone, they don’t know what they are doing.
I also totally can believe that the Fable release was rushed. Evidence includes it happening on a Tuesday, Anthropic not realizing people would object to the output downgrading and the general state of the classifiers, and it just happening fast.
That does not in any way make the export control order okay.
Mainstream media has this very strange way of saying that the obviously true thing might actually be true, y’all, but strictly speaking we can’t prove it and we are a Serious News Organization.
The Economist: The American government’s primary aim may not have been to control foreign access to frontier AI models. Instead, it appears to have used export controls as a convenient way to target Anthropic
Yes. This was a way to target Anthropic to get them to take the model down for everyone, and they did not much care about the blast radius of the method. They knew full well that this was a de facto full takedown notice. That is true even if you are maximally charitable to the government’s case.
It is in theory possible they were so clueless they did not realize this would be the result, but that’s worse, you know why that’s worse, right?
Well, technically the NSA operates out of Fort Meade, Maryland, so that does not count here in terms of evaluating Hegseth’s claims, although there is the whole ‘they used Claude extensively to fight an undeclared war against Iran.’
Pete Hegseth: Three months ago, @DeptofWar kicked @AnthropicAI out of our building—forever.
Every passing day proves why that was the right move. [US Flag]
Timothy B. Lee: Claude Fable is so powerful we can’t let it fall into the hands of our adversaries. Also it’s too dangerous for us to use it. Real galaxy-brain stuff here.
Nat Purser: if the admin wants people to believe the anthropic decision was made out of genuine security necessity rather than grievance-driven retaliation, high ranking officials could simply stop posting like this
I posted an early version of the bottom line snippet from earlier on Twitter and a remarkable number of people replied with a version of ‘how dare Anthropic not make their CEO available 24/7 on a moment’s notice and do whatever the government asks them to do without question while sending the correct vibes, they didn’t do that so this serves them right.’
Do these people think we live in a republic? Would they like to? I wonder.
Do these people think that if a company doesn’t perform the Shibboleths of knee bending properly then we should wreck American AI, all of our productivity, our global position and the rule of law over nothing, cause Anthropic deserves to suffer, and that is Anthropic’s fault because America’s government is an NPC with an anger management problem and you know how he gets when you talk back to him?
I think they kind of do. That is exactly the vibe I am getting.
Think about what you are saying.
That’s on top of the people saying ‘Anthropic said government should regulate AI so this serves them right’ or ‘Anthropic said that frontier models are dangerous so this serves them right.’ Similar vibes.
Whereas for those of us who are not nihilists, who do not believe in might makes right, it is hard to see the reasonable version of this from the USGov side.
The correct criticism of Anthropic is ‘they should have still taken the model down when ordered, no matter how stupid they thought that was, while discussions continued.’ That’s valid.
What people are almost entirely saying instead? It’s a bunch of nihilism, of pure worship of power and tribalism, of hurting everyone as long as it hurts those you dislike more, of lashing out because of and with vibes.
At least one source has now seen the research report, claiming it shows nothing.
We don’t have any person on the other side claiming that the report shows something, or explaining what that something might be.
Katie Moussouris (CEO Luta Security): The government’s response “seems way out of line with what’s actually in the research report.”
All AI models need to be able to help defenders in exactly this way, or we won’t be able to scale our defense against attackers.
Maria Curi: Moussouris said the researchers were able to find security vulnerabilities by asking questions normal defenders would ask AI, which is exactly what the model was intended to do.
It looks like when Anthropic took Mythos down, they really did fully take it down.
The Economist: Spy agencies are likely to regain access to Mythos, says one former British intelligence official; negotiations are already under way. Private firms may find it harder. Even so, some observers believe the American government will eventually have to relent.
It appears that, because this was done in such a stupid fashion, Project Glasswing is cut off from Mythos. The clock is ticking before others get similar capabilities. I wonder what the spy agencies and major corporations think about this.
The ‘good’ news is that Mythos has presumably already found a lot more vulnerabilities that remain unpatched, and which Claude Opus 4.8 and GPT-5.5 are strong enough to help patch once they are found, so defensive work should continue.
Cyber leaders, according to Axios, are being clear that this move net harms our cyber security, because given who has access in what ways it helps defenders more than attackers. There is now an open letter to this effect, urging the government to restore access to Fable.
Kevin Frazier (via Axios): “Prominent cybersecurity leaders — including CISOs, security researchers and executives at Adobe, Zoom and Sophos — are urging the Trump administration to reverse restrictions on Anthropic’s most advanced AI models, arguing the move hurts cyber defenders more than attackers.”
From the letter, calling this a pure unforced error that does nothing but net damage:
It is our understanding that underlying model capabilities in the original research that triggered this action:
- Were focused on determining whether a human-prompted section of code was insecure. This is a necessary capability in any model that is intended to write secure code and should not be considered an offensive capability.
- Can be replicated on GPT-5.5, Opus, Sonnet and even Chinese models like Kimi 2.7. The justification for this unprecedented action was that Fable provides a unique “uplift” of capabilities beyond other AI models, but AI has been finding bugs and generating working exploits at superhuman levels since last year.
- Anthropic is addressing the research. As security professionals, we recognize that our work does not lead to a simple end-state where a system is fully safe, and the purpose of research like this is to enable continuous improvement, not to ban the technology.
As a result, this action has taken the best models away from defenders, created market uncertainty, and risked America’s AI leadership without any real risk to justify it.
The action was definitely ‘vibe governing.’ The decision was some combination of ‘this seems vaguely spooky’ and ‘f*** Anthropic,’ not ‘we have a policy and a threshold.’
It could still be well intentioned.
Nathan Lambert: The Dario faction and the Sacks faction speak very different languages, and a Dario clarification could sound like a refusal.
This puts us very squarely in vibe governance. Models are released when the gov thinks its okay, and it is unlikely this is based on technical evals.
My presumption is that what happened was pretty straightforward. Someone said ‘hey there is a jailbreak fix it’ and Dario said ‘this is harmless there is nothing to fix.’ The question remains, was the request ‘fix this particular jailbreak’ or ‘fix all jailbreaks pls tks?’
The above is the charitable interpretation. There is also the uncharitable one.
Ben Smith: Extent to which White House allies are signaling that this is a culture war issue, not a technical one, is striking
Matthew Yglesias: I would say they’re not even signaling that it’s a culture war issue, they’re signaling that it’s a “pay us money or else” shakedown issue.
There’s no content at all to this, it’s just gangster politics
Taylor Budowich: I’m told Anthropic is perplexed by the situation they are facing, so they’ve turned to @k8em0 to do their on-the-record rapid response. These people really just don’t get it.
Or simply:
Miles Brundage: “Well acktually Dario seems smug so abusing government authorities is fine” – some of y’all basically
Kudos to Martin Casado for speaking out against what happened, and maybe even being convinced by argument to move from his initial position of ‘it’s a cyber weapon so any jailbreak is unacceptable’ even if he did end up blocking that guy. I will fully allow some amount of ‘Anthropic’s rhetoric did not make this easier’ if it is coupled with ‘and bad decisions are still bad’ rather than ‘so f*** Anthropic even if we all lose.’
Anthropic is flying various senior technical staff to Washington, who are spending today trying to sort this all out, which is absolutely what you do in this situation.
We will soon learn how that goes. Many next steps are possible.
We now have more or less the worst possible licensing regime. It is fully ad hoc, vibes-based, and subject to the whims of people who do not understand how AI works, and who we have no reason to assume are acting in good faith.
Dean W. Ball: Make no mistake: post-Mythos, the United States has a licensing regime for AI. It’s just informal, with no consistent rules or firm boundaries on state power or public transparency. Cobalt mining in the Congo is vastly more institutionalized than frontier AI licensing in the US.
If you avoid all formal regulations and laws, and this results in regulation and law via executive fiat, that’s worse. You know that that’s worse, right?
Neil Chilson and Adam Thierer: This is not good! A leading U.S. AI company was forced to take down a product that millions were using based on non-public, unexplained concerns of a few government officials. This isn’t the red-tape risk of the FDA. It’s more like the FDA demanding, out of the blue and without explanation, that everyone stop drinking milk — if milk was 50% of last year’s stock market gains.
… But even if you disagree with Anthropic’s regulatory strategy, this escalation of government intervention is nothing to celebrate. It is horrible for the broader AI ecosystem. Continued arbitrary, unexplained deployment of export control authority will make companies slow-walk new models, depriving the public of powerful new tools. Every AI model, like all software before it, will have vulnerabilities that require patching. The US government should not hang a Sword of Damocles over every lab’s head, with no indication when it might drop or why.
… This episode yet again shows why Congress must act. We need a balanced statutory framework for frontier model safety, rooted in the rule of law, with clear standards and transparent procedures. Civilian authorities must direct this process; it must not be co-opted by the military-industrial complex. America’s AI leadership will diminish if our government continues the ad-hoc and myopic approach to AI policy recently on display.
Well said. What the White House is doing is terrible no matter your view on future AI capabilities, or AI risks, or the need to ‘win the AI race.’ It is bad all around, except that it centralizes power within the White House, via the threat of ad hoc shutdowns.
Remember when Trump was worried that a proposed Executive Order would harm American AI too much, so he did not sign it? This is so, so much worse on that front, while also having many other problems.
Mark Dalton at R Street has a similar analysis, calling this The Fable Fiasco: A Bad Idea Applied Badly, pointing out that ITAR and KYC are not equipped for this task, and pointing out we will face real foreign-policy consequences for doing it this way.
As mentioned last time, but it bears repeating: One of those other problems is that this gives a huge kick to those who previously thought they were our allies and would be under our AI umbrella.
Do not forget that the European Union still has ASML.
Not only will such folks not trust the ‘American AI stack’ under these conditions, they will try to build a rival one. This risks driving them into the hands of China, and towards having their own chips and their own data centers under their own control, and towards use of non-American open models even though they are much worse.
Tyler Cowen: A new line has been crossed: The U.S. government has finally declared an AI model too dangerous for unrestricted use. It’s the kind of move that could cripple AI progress in the U.S. and around the world.
The events here also do not bode well for American open models. If America is willing to put export controls on not only model weights but the model outputs, even when they are as heavily safeguarded as Fable, do you think they are not coming for your open models, with no classifiers that are not easily removed, with no safety training that cannot be undone by anyone who knows about obliteratus, that cannot be shut down once released?
Think harder about the implications of what is happening.
Our policy responses are going off the rails remarkably fast.
So I will close with a reminder of how badly we need rule of law, here, and how bad the alternative is already proving to be only weeks into the new regime.
Dean W. Ball: AI policy is a really poignant example of just how deeply American civics have been hollowed out. In almost all other areas of tech policy, we have at least some prior law and regulation from which to draw. If you think, as I do, that politics and law are ritual practices through which we embody civic ideals, these earlier bodies of law are like prior ritual art. They give us some sense of a starting point. So even though crypto is new, it is part of a very old industry (financial services), which carries with it a lot of prior legal and political art.
With AI we have nothing of the sort, so all our leaders can think to do is punch one another. The impulse of “let’s have stable rules so that we aren’t punching one another all the time” isn’t really something you hear anyone saying outside of industry. The rule of law seems absent in our political muscle memory.
Dean W. Ball: Precisely as I predicted, the recent cyber EO, which admin officials insisted was not a licensing regime, ends up in practice being a licensing regime. Forget “voluntary,” forget “permissionless.”
AI is licensed now, but the requirements change constantly and are always a secret, even to the administration itself, which will discover the rules spontaneously in real time as it reacts to events. This means also that the rules are in practice stricter and more roughly enforced for organizations the administration does not like.
Can you blame Anthropic for making itself so disliked? In a sense, sure. The problem is that this childish “he said, she said” is all we have to go on in our analysis of the situation. And because there is no transparency (it is all calls and texts between “White House officials” and “Anthropic executives”), in practice it comes down to who you trust more.
This is why we create laws! To abstract away from personal power struggles and grudges, to submit to the steady application of rules so that complex human activity can unfold with predictability.
The rule of law has been being eroded in the U.S. for my entire life, but it is especially acute in AI because of both the lack of much preexisting law to serve as bulwark, and because of this admin’s insistence that it is Not Regulating AI. This has become an excuse for vagueness and evasiveness in rule-drafting (see the cyber EO), and this in turn makes the lawlessness worse.
The government wants to apply its force to frontier AI, that much is clear. It wants to make the industry submit. And in service of that goal, it has discovered that “not regulating AI” is in fact a great excuse for refusing to support laws that could constrain the admin’s exercise of power. In other words, “not regulating AI” is a *justification* for the tyrannical control of AI by the state.
This should alarm you regardless of what party you are in. What you are seeing now will be used against you one day soon, if not by this admin then by its successors. This is the antithesis of the rule of law.
The administration cannot and will not fix this problem alone. We need Congress to step in and impose rules on this mess.