Blog of Armin Ronacher

I'm currently located in Austria and working as a Director of Engineering for Sentry. Aside from that I do open source development.

Skills vs Dynamic MCP Loadouts

2025-12-13 08:00:00

I’ve been moving all my MCPs to skills, including the remaining one I still used: the Sentry MCP [1]. Previously I had already moved entirely away from Playwright to a Playwright skill.

In the last month or so there have been discussions about using dynamic tool loadouts to defer loading of tool definitions until later. Anthropic has also been toying around with the idea of wiring together MCP calls via code, something I have experimented with.

I want to share my updated findings on all of this and explain why the deferred tool loading that Anthropic came up with does not fix my lack of love for MCP. Maybe they are useful for someone else.

What is a Tool?

When the model encounters a tool definition, it is encouraged, through reinforcement learning or otherwise, to emit tool calls via special tokens whenever it hits a situation where that tool would be appropriate. For all intents and purposes, tool definitions can only appear between special tool definition tokens in a system prompt. Historically this means that you cannot emit tool definitions later in the conversation. So your only real option is for a tool to be loaded when the conversation starts.
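
To make that concrete, here is a rough sketch of what a client sends: the tool definitions travel up front with the rest of the request and get serialized into those special tokens on the server side. The shape loosely follows Anthropic's Messages API; the model name and the tool itself are just illustrative.

const request = {
  model: "claude-sonnet-4-5",  // illustrative
  system: "You are a helpful agent.",
  tools: [
    {
      name: "get_weather",
      description: "Get the current weather for a city.",
      input_schema: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  ],
  messages: [{ role: "user", content: "What's the weather in Boston?" }],
};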

In agentic uses, you can of course compress your conversation state or change the tool definitions in the system message at any point. But the consequence is that you will lose the reasoning traces and also the cache. In the case of Anthropic, for instance, this will make your conversation significantly more expensive. You would basically start from scratch and pay full token rates plus cache write cost, compared to cache read.

One recent innovation from Anthropic is deferred tool loading. You still declare tools ahead of time in the system message, but they are not injected into the conversation when the initial system message is emitted. Instead they appear at a later point. The tool definitions however still have to be static for the entire conversation, as far as I know. So the tools that could exist are defined when the conversation starts. The way Anthropic discovers the tools is purely by regex search.

Contrasting with Skills

This is all quite relevant because even though MCP with deferred loading feels like it should perform better, it actually requires quite a bit of engineering on the LLM API side. The skill system gets away without any of that and, at least from my experience, still outperforms it.

Skills are really just short summaries of which skills exist and in which file the agent can learn more about them. These are proactively loaded into the context. So the agent understands in the system context (or maybe somewhere later in the context) what capabilities it has and gets a link to the manual for how to use them.

Crucially, skills do not actually load a tool definition into the context. The tools remain the same: bash and the other tools the agent already has. All it learns from the skill are tips and tricks for how to use these tools more effectively.
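
For illustration, a minimal sketch of what such a skill might look like on disk, assuming the SKILL.md convention: the YAML frontmatter carries the short summary that gets loaded proactively, while the body is the manual that is only read on demand.

---
name: sentry
description: Look up Sentry issues and events. Use when debugging production errors.
---

# Sentry

Tips for querying Sentry from the shell go here: which helper script to
run, how to authenticate, what the query syntax looks like, and a few
worked examples.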

Because the main thing it learns is how to use other command line tools and similar utilities, the fundamentals of how to chain and coordinate them together do not actually change. The reinforcement learning that made the Claude family of models very good tool callers just helps with these newly discovered tools.

MCP as Skills?

So that obviously raises the question: if skills work so well, can I move the MCP outside of the context entirely and invoke it through the CLI in a similar way as Anthropic proposes? The answer is yes, you can, but it doesn’t work well. One option here is Peter Steinberger’s mcporter. In short, it reads the .mcp.json files and exposes the MCPs behind it as callable tools:

npx mcporter call 'linear.create_comment(issueId: "ENG-123", body: "Looks good!")'

And yes, it looks very much like a command line tool that the LLM can invoke. The problem however is that the LLM does not have any idea about what tools are available, and now you need to teach it that. So you might think: why not make some skills that teach the LLM about the MCPs? Here the issue for me comes from the fact that MCP servers have no desire to maintain API stability. They are increasingly starting to trim down tool definitions to the bare minimum to preserve tokens. This makes sense, but for the skill pattern it’s not what you want. For instance, the Sentry MCP server at one point switched the query syntax entirely to natural language. A great improvement for the agent, but my suggestions for how to use it became a hindrance and I did not discover the issue straight away.

This is in fact quite similar to Anthropic’s deferred tool loading: there is no information about the tool in the context at all, so you need to create a summary. The eager loading of MCP tools we have done in the past has left us with an awkward compromise: the description is both too long to load eagerly and too short to really tell the agent how to use the tool. So at least from my experience, you end up maintaining these manual skill summaries for MCP tools exposed via mcporter or similar.

Path Of Least Resistance

This leads me to my current conclusion: I tend to go with what is easiest, which is to ask the agent to write its own tools as a skill. Not only does it not take all that long, but the biggest benefit is that the tool is largely under my control. Whenever it breaks or needs some other functionality, I ask the agent to adjust it. The Sentry MCP is a great example. I think it’s probably one of the better designed MCPs out there, but I don’t use it anymore. In part because when I load it into the context right away I lose around 8k tokens out of the box, and I could not get it to work via mcporter. On the other hand, I have Claude maintain a skill for me. And yes, that skill is probably quite buggy and needs to be updated, but because the agent maintains it, it works out better.

It’s quite likely that all of this will change, but at the moment manually maintained skills and agents writing their own tools have become my preferred way. I suspect that dynamic tool loading with MCP will become a thing, but it will probably require quite a few protocol changes to bring in skill-like summaries and built-in manuals for the tools. I also suspect that MCP would greatly benefit from protocol stability. The fact that MCP servers keep changing their tool descriptions at will does not work well with materialized calls and external tool descriptions in READMEs and skill files.

  1. Keen readers will remember that last time, the last MCP I used was Playwright. In the meantime I added and removed two more MCPs: Linear and Sentry, mostly because of authentication issues and neither having a great command line interface.

Let’s Destroy The European Union!

2025-12-09 08:00:00

Elon Musk is not happy with the EU fining his X platform and is currently on a tweet rampage complaining about it. Among other things, he wants the whole EU to be abolished. Sadly, he is hardly the first wealthy American to share his opinions on European politics lately. I’m not a fan of this outside attention, but I believe it’s noteworthy and something to pay attention to, in particular because the idea of destroying and ripping apart the EU is not just popular in the US; it’s popular over here too. That greatly concerns me.

We Have Genuine Problems

There is definitely a bunch of stuff we might want to fix over here. I have complained about our culture before. Unfortunately, I happen to think that our challenges are not coming from politicians or civil servants, but from us, the people. Europeans don’t like to take risks and are quite pessimistic about the future compared to their US counterparts. Additionally, we Europeans have been trained to feel a lot of guilt over the years, which makes us hesitant to stand up for ourselves. This has led to all kinds of interesting counter-cultural movements in Europe, like years of significant support for unregulated immigration and an unhealthy obsession with the idea of degrowth. Today, though, neither seems quite as popular as it once was.

Morally these things may be defensible, but in practice they have led to Europe losing its competitive edge and eroding social cohesion. The combination of a strong social state and high taxes in particular does not mix well with the kind of immigration we have seen in the last decade: mostly people escaping wars ending up in low-skilled jobs. That means it’s not unlikely that certain classes of immigrants are going to be net-negative for a very long time, if not forever, and increasingly society is starting to think about what the implications of that might be.

Yet even all of that is not where our problems lie, and it’s certainly not our presumed lack of free speech. Any conversation on that topic is foolish because it’s too nuanced. Society clearly wants to place some limits on free speech here, but the same is true in the US. In the US we can currently see a significant push-back against “woke ideologies,” and a lot of that push-back involves restricting freedom of expression through different avenues.

America Likes a Weak Europe

The US might try to lecture Europe right now on free speech, but what it should be lecturing us on is our economic model. Europe has too much fragmentation, incredibly strict regulation that harms innovation, ineffective capital markets, and a massive dependency on both the United States and China. If the US were to cut us off from their cloud providers, we would not be able to operate anything over here. If China were to stop shipping us chips, we would be in deep trouble too (we have seen this).

This is painful because the US is historically a great example when it comes to freedom of information, direct democracy at the state level, and rather low corruption. These are all areas where we’re not faring well, at least not consistently, and we should be lectured. Fundamentally, the US approach to capitalism is about as good as it’s going to get. If there was any doubt that alternative approaches might have worked out better, at this point there’s very little evidence in favor of that. Yet because of increased loss of civil liberties in the US, many Europeans now see everything that the US is doing as bad. A grave mistake.

Both China and the US are quite happy with the dependency we have on them and with us falling short of our potential. Europe’s attempt at dealing with the dependency so far has been to regulate and tax US corporations more heavily. That’s not a good strategy. The solution must be to become competitive again so that we can redirect that tax revenue to local companies instead. The Digital Services Act is a good example: we’re punishing Apple and forcing them to open up their platform, but we have no company that can take advantage of that opening.

Europe is Europe’s Biggest Problem

If you read my blog here, you might remember my musings about the lack of clarity of what a foreigner is in Europe. The reality is that Europe has been deeply integrated for a long time now as a result of how the EU works — but still not at the same level as the US. I think this is still the biggest problem. People point to languages as the challenge, but underneath the hood, the countries are still fighting each other. Austria wants to protect its local stores from larger competition in Germany and its carpenters from the cheaper ones coming from Slovenia. You can replace Austria with any other EU country and you will find the same thing.

The EU might not be perfect, but it’s hard to imagine that abolishing it would solve any problem, given how nation states have shown themselves to behave. The moment the EU fell away, we would be warming up all the old border struggles again. We have already seen similar issues pop up in Northern Ireland after the UK left.

And we just have so much bureaucracy, so many non-functioning social systems, and such a tremendous amount of incoming governmental debt to support our flailing pension schemes. We need growth more than any other bloc, and we have such a low probability of actually accomplishing that.

Given how the EU is structured, it also acts as the punching bag for the failure of the nation states to come to agreements. It’s not that EU bureaucrats are telling Europeans to take in immigrants, to enact chat control, or to mandate cookie banners and attached plastic caps. Those are all initiatives that come from one or more member states. But the EU in the end will always take the blame, because even local politicians who voted in support of some of these things can easily point towards “Brussels” as having created the problem.

The United States of Europe

A Europe in pieces does not sound appealing to me at all, and that’s because I can look at what China and the US have.

What China and the US have that Europe lacks is a strong national identity. Both countries have recognized that strength comes from unity. China in particular is fighting any kind of regionalism tooth and nail. The US has accomplished this through the pledge of allegiance, a civil war, the Department of Education pushing a common narrative in schools, and historically putting post offices and infrastructure everywhere. Europe has none of that. More importantly, Europeans don’t even want it. There is a mistaken belief that we can just become these tiny states again and be fine.

If Europe wants to be competitive, it seems unlikely that this can be accomplished without becoming a unified superpower. Yet there is no belief in Europe that this can or should happen, and the other superpowers have little interest in seeing it happen either.

What Would Fixing Actually Look Like?

If I had to propose something constructive, it would be this: Europe needs to stop pretending it can be 27 different countries with 27 different economic policies while also being a single market. The half-measures are killing us. We have a common currency in the Eurozone but no common fiscal policy. We have freedom of movement but wildly different social systems. We have common regulations but fragmented enforcement: 27 labor laws, 27 legal systems, 27 tax codes, complex VAT rules, and so on.

The Draghi report from last year laid out many of these issues quite clearly: Europe needs massive investment in technology and infrastructure. It needs a genuine single market for services, not just goods. It needs capital markets that can actually fund startups at scale. None of this is news to anyone paying attention.

But here’s the uncomfortable truth: none of this will happen without Europeans accepting that more integration is the answer, not less. And right now, the political momentum is in the opposite direction. Every country wants the benefits of the EU without the obligations. Every country wants to protect its own industries while accessing everyone else’s markets.

One of the arguments against deeper integration hinges on some quite unrelated issues. For instance, the EU is seen as non-democratic, but some of that criticism just does not sit right with me. Sure, I too would welcome more democracy in the EU, but at the same time, the system really is not undemocratic today. Take something like chat control: the reason it does not die is that some member states and their elected representatives keep pushing for it.

What stands in the way is that the member countries and their people don’t actually want to strengthen the EU further. The “lack of democracy” is very much intentional and the exact outcome you get if you want to keep the power with the national states.

Foreign Billionaires and European Sovereignty

So back to where we started: should the EU be abolished as Musk suggests? I think this is a profoundly unserious proposal from someone who has little understanding of European history and even less interest in learning. The EU exists because two world wars taught Europeans that nationalism without checks leads to catastrophe. It exists because small countries recognized they have more leverage negotiating as a bloc than individually.

I also take a lot of issue with the idea that European politics should be driven by foreign interests. Neither Russians nor Americans have any good reason to take so much interest in European politics. They are not living here; we are.

Would Europe be more “free” without the EU? Perhaps in some narrow regulatory sense. But it would also be weaker, more divided, and more susceptible to manipulation by larger powers — including the United States. I also find it somewhat rich that American tech billionaires are calling for the dissolution of the EU while they are greatly benefiting from the open market it provides. Their companies extract enormous value from the European market, more than even local companies are able to.

The real question isn’t whether Europe should have less regulation or more freedom. It’s whether we Europeans can find the political will to actually complete the project we started. A genuine federation with real fiscal transfers, a common defense policy, and a unified foreign policy would be a superpower. What we have now is a compromise that satisfies nobody and leaves us vulnerable to exactly the kind of pressure Musk and other oligarchs represent.

A Different Path

Europe doesn’t need fixing in the way the loud present-day critics suggest. It doesn’t need to become more like America or abandon its social model entirely. What it needs is to decide what it actually wants to be. The current state of perpetual ambiguity is unsustainable.

It also should not lose its values. Europeans might no longer be quite as hot on the human rights that the EU provides, and they might no longer want to have the same level of immigration. Yet simultaneously, Europeans are presented with a reality that needs all of these things. We’re all highly dependent on movement of labour, and that includes people from abroad. Unfortunately, the wars of the last decade have dominated any migration discourse, and that has created ground for populists to thrive. Any skilled tech migrant is running into the same walls as everyone else, which has made it less and less appealing to come.

Or perhaps we’ll continue muddling through, which historically has been Europe’s preferred approach. It’s not inspiring, but it’s also not the catastrophe the internet would have you believe.

Is there reason to be optimistic? On a long enough timeline the graph goes up and to the right. We might be going through some rough patches, but structurally the whole thing here is still pretty solid. And it’s not as if the rest of the world is cruising along smoothly: the US, China, and Russia are each dealing with their own crises. That shouldn’t serve as an excuse, but it does offer context. As bleak as things can feel, we’re not alone in having challenges, but ours are uniquely ours and we will face them. One way or another.

LLM APIs are a Synchronization Problem

2025-11-22 08:00:00

The more I work with large language models through provider-exposed APIs, the more I feel like we have built ourselves into quite an unfortunate API surface area. It might not actually be the right abstraction for what’s happening under the hood. The way I like to think about this problem now is that it’s actually a distributed state synchronization problem.

At its core, a large language model takes text, tokenizes it into numbers, and feeds those tokens through a stack of matrix multiplications and attention layers on the GPU. Using a large set of fixed weights, it produces activations and predicts the next token. If it weren’t for temperature (randomization), you could think of it as having the potential to be a much more deterministic system, at least in principle.

As far as the core model is concerned, there’s no magical distinction between “user text” and “assistant text”—everything is just tokens. The only difference comes from special tokens and formatting that encode roles (system, user, assistant, tool), injected into the stream via the prompt template. You can look at the system prompt templates on Ollama for the different models to get an idea.
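
As a sketch of what such a template does (the ChatML-style markers below are illustrative; every model family bakes its own special tokens into its prompt template):

function renderPrompt(messages) {
  // Serialize role-tagged messages into one flat token stream.
  return messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n") + "\n<|im_start|>assistant\n";  // cue the model to respond
}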

The Basic Agent State

Let’s ignore for a second which APIs already exist and just think about what usually happens in an agentic system. If I were to have my LLM run locally on the same machine, there is still state to be maintained, but that state is very local to me. You’d maintain the conversation history as tokens in RAM, and the model would keep a derived “working state” on the GPU — mainly the attention key/value cache built from those tokens. The weights themselves stay fixed; what changes per step are the activations and the KV cache.

One further clarification: when I talk about state I don’t just mean the visible token history because the model also carries an internal working state that isn’t captured by simply re-sending tokens. In other words: you can replay the tokens and regain the text content, but you won’t restore the exact derived state the model had built.

From a mental-model perspective, caching means “remember the computation you already did for a given prefix so you don’t have to redo it.” Internally, that usually means storing the attention KV cache for those prefix tokens on the server and letting you reuse it, not literally handing you raw GPU state.
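
A toy sketch of that mental model, assuming nothing about any real inference server: the derived state for an already-seen prefix is stored under a hash of its tokens, so only the new suffix has to be recomputed.

import { createHash } from "node:crypto";

const kvCache = new Map();  // prefix hash -> opaque derived state

function keyFor(tokens) {
  return createHash("sha256").update(tokens.join(",")).digest("hex");
}

function process(tokens, runModel) {
  // If we already have state for everything but the last token, only the
  // suffix goes through the model; otherwise we compute from scratch.
  const prefix = tokens.slice(0, -1);
  const cached = kvCache.get(keyFor(prefix));
  const state = cached
    ? runModel(tokens.slice(-1), cached)
    : runModel(tokens, undefined);
  kvCache.set(keyFor(tokens), state);
  return state;
}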

There are probably some subtleties to this that I’m missing, but I think this is a pretty good model to think about it.

The Completion API

The moment you’re working with completion-style APIs such as OpenAI’s or Anthropic’s, abstractions are put in place that make things a little different from this very simple system. The first difference is that you’re not actually sending raw tokens around. The way the GPU looks at the conversation history and the way you look at it are on fundamentally different levels of abstraction. While you could count and manipulate tokens on one side of the equation, extra tokens are being injected into the stream that you can’t see. Some of those tokens come from converting the JSON message representation into the underlying input tokens fed into the machine. But you also have things like tool definitions, which are injected into the conversation in proprietary ways. Then there’s out-of-band information such as cache points.

And beyond that, there are tokens you will never see. For instance, with reasoning models you often don’t see any real reasoning tokens, because some LLM providers try to hide as much as possible so that you can’t retrain your own models with their reasoning state. On the other hand, they might give you some other informational text so that you have something to show to the user. Model providers also love to hide search results and how those results were injected into the token stream. Instead, you only get an encrypted blob back that you need to send back to continue the conversation. All of a sudden, you need to take some information on your side and funnel it back to the server so that state can be reconciled on either end.

In completion-style APIs, each new turn requires resending the entire prompt history. The size of each individual request grows linearly with the number of turns, but the cumulative amount of data sent over a long conversation grows quadratically because each linear-sized history is retransmitted at every step. This is one of the reasons long chat sessions feel increasingly expensive. On the server, the model’s attention cost over that sequence also grows quadratically in sequence length, which is why caching starts to matter.
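
To make the arithmetic concrete: if each turn adds roughly k tokens, turn n resends about n·k tokens of history, so after n turns you have transmitted about k·(1 + 2 + … + n) = k·n(n+1)/2 tokens in total. Each individual request grows linearly; the cumulative traffic grows quadratically.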

The Responses API

One of the ways OpenAI tried to address this problem was to introduce the Responses API, which maintains the conversational history on the server (at least in the version with the saved state flag). But now you’re in a bizarre situation where you’re fully dealing with state synchronization: there’s hidden state on the server and state on your side, but the API gives you very limited synchronization capabilities. To this point, it remains unclear to me how long you can actually continue that conversation. It’s also unclear what happens if there is state divergence or corruption. I’ve seen the Responses API get stuck in ways where I couldn’t recover it. It’s also unclear what happens if there’s a network partition, or if one side got the state update but the other didn’t. The Responses API with saved state is quite a bit harder to use, at least as it’s currently exposed.

Obviously, for OpenAI it’s great because it allows them to hide more behind-the-scenes state that would otherwise have to be funneled through with every conversation message.

State Sync API

Regardless of whether you’re using a completion-style API or the Responses API, the provider always has to inject additional context behind the scenes—prompt templates, role markers, system/tool definitions, sometimes even provider-side tool outputs—that never appears in your visible message list. Different providers handle this hidden context in different ways, and there’s no common standard for how it’s represented or synchronized. The underlying reality is much simpler than the message-based abstractions make it look: if you run an open-weights model yourself, you can drive it directly with token sequences and design APIs that are far cleaner than the JSON-message interfaces we’ve standardized around. The complexity gets even worse when you go through intermediaries like OpenRouter or SDKs like the Vercel AI SDK, which try to mask provider-specific differences but can’t fully unify the hidden state each provider maintains. In practice, the hardest part of unifying LLM APIs isn’t the user-visible messages—it’s that each provider manages its own partially hidden state in incompatible ways.

It really comes down to how you pass this hidden state around in one form or another. I understand that from a model provider’s perspective, it’s nice to be able to hide things from the user. But synchronizing hidden state is tricky, and none of these APIs have been built with that mindset, as far as I can tell. Maybe it’s time to start thinking about what a state synchronization API would look like, rather than a message-based API.

The more I work with these agents, the more I feel like I don’t actually need a unified message API. The core idea of it being message-based in its current form is itself an abstraction that might not survive the passage of time.

Learn From Local First?

There’s a whole ecosystem that has dealt with this kind of mess before: the local-first movement. Those folks spent a decade figuring out how to synchronize distributed state across clients and servers that don’t trust each other, drop offline, fork, merge, and heal. Peer-to-peer sync and conflict-free replicated storage engines all exist because “shared state but with gaps and divergence” is a hard problem that nobody could solve with naive message passing. Their architectures explicitly separate canonical state, derived state, and transport mechanics — exactly the kind of separation missing from most LLM APIs today.

Some of those ideas map surprisingly well to models: KV caches resemble derived state that could be checkpointed and resumed; prompt history is effectively an append-only log that could be synced incrementally instead of resent wholesale; provider-side invisible context behaves like a replicated document with hidden fields.
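
As a sketch of what that could mean in practice (all names here are hypothetical, not a real protocol): treat the conversation as an append-only log where only the unsynced suffix goes over the wire, and where full replay stays possible.

class ConversationLog {
  constructor() {
    this.entries = [];  // append-only message log (canonical state)
    this.acked = 0;     // how many entries the server has confirmed
  }
  append(role, content) {
    this.entries.push({ index: this.entries.length, role, content });
  }
  pendingSync() {
    // Only the unacknowledged suffix needs to be transmitted.
    return this.entries.slice(this.acked);
  }
  confirm(upTo) {
    this.acked = upTo;
  }
  replayAll() {
    // If the server wiped its derived state, rebuild from the log.
    return this.entries;
  }
}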

At the same time though, if the remote state gets wiped because the remote site doesn’t want to hold it for that long, we would want to be in a situation where we can replay it entirely from scratch — which for instance the Responses API today does not allow.

Future Unified APIs

There’s been plenty of talk about unifying message-based APIs, especially in the wake of MCP (Model Context Protocol). But if we ever standardize anything, it should start from how these models actually behave, not from the surface conventions we’ve inherited. A good standard would acknowledge hidden state, synchronization boundaries, replay semantics, and failure modes — because those are real issues. There is always the risk that we rush to formalize the current abstractions and lock in their weaknesses and faults. I don’t know what the right abstraction looks like, but I’m increasingly doubtful that the status-quo solutions are the right fit.

Agent Design Is Still Hard

2025-11-21 08:00:00

I felt like it might be a good time to write about some new things I’ve learned. Most of this is going to be about building agents, with a little bit about using agentic coding tools.

TL;DR: Building agents is still messy. SDK abstractions break once you hit real tool use. Caching works better when you manage it yourself, but differs between models. Reinforcement ends up doing more heavy lifting than expected, and failures need strict isolation to avoid derailing the loop. Shared state via a file-system-like layer is an important building block. Output tooling is surprisingly tricky, and model choice still depends on the task.

Which Agent SDK To Target?

When you build your own agent, you have the choice of targeting an underlying SDK like the OpenAI SDK or the Anthropic SDK, or you can go with a higher level abstraction such as the Vercel AI SDK or Pydantic. The choice we made a while back was to adopt the Vercel AI SDK but only the provider abstractions, and to basically drive the agent loop ourselves. At this point we would not make that choice again. There is absolutely nothing wrong with the Vercel AI SDK, but when you are trying to build an agent, two things happen that we originally didn’t anticipate:

The first is that the differences between models are significant enough that you will need to build your own agent abstraction. We have not found a solution in any of these SDKs that builds the right abstraction for an agent. I think this is partly because, despite the basic agent design being just a loop, there are subtle differences based on the tools you provide. These differences affect how easy or hard it is to find the right abstraction (cache control, different requirements for reinforcement, tool prompts, provider-side tools, etc.). Because the right abstraction is not yet clear, using the original SDKs from the dedicated platforms keeps you fully in control. With some of these higher-level SDKs you have to build on top of their existing abstractions, which might not be the ones you actually want in the end.

We also found it incredibly challenging to work with the Vercel SDK when it comes to dealing with provider-side tools. The attempted unification of messaging formats doesn’t quite work. For instance, the web search tool from Anthropic routinely destroys the message history with the Vercel SDK, and we haven’t yet fully figured out the cause. Also, in Anthropic’s case, cache management is much easier when targeting their SDK directly instead of the Vercel one. The error messages when you get things wrong are much clearer.

This might change, but right now we would probably not use an abstraction when building an agent, at least until things have settled down a bit. The benefits do not yet outweigh the costs for us.

Someone else might have figured it out. If you’re reading this and think I’m wrong, please drop me a mail. I want to learn.

Caching Lessons

The different platforms have very different approaches to caching. A lot has been said about this already, but Anthropic makes you pay for caching. It makes you manage cache points explicitly, and this really changes the way you interact with it from an agent engineering level. I initially found the manual management pretty dumb. Why doesn’t the platform do this for me? But I’ve fully come around and now vastly prefer explicit cache management. It makes costs and cache utilization much more predictable.

Explicit caching allows you to do certain things that are much harder otherwise. For instance, you can split off a conversation and have it run in two different directions simultaneously. You also have the opportunity to do context editing. The optimal strategy here is unclear, but you clearly have a lot more control, and I really like having that control. It also makes it much easier to understand the cost of the underlying agent. You can assume much more about how well your cache will be utilized, whereas with other platforms we found it to be hit and miss.

The way we do caching in the agent with Anthropic is pretty straightforward. One cache point is after the system prompt. Two cache points are placed at the beginning of the conversation, where the last one moves up with the tail of the conversation. And then there is some optimization along the way that you can do.
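
A rough sketch of that layout against Anthropic's SDK directly, simplified to two of the cache points (the model name and helper are illustrative, not our actual code):

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function agentTurn(systemPrompt, history, latestUserMessage) {
  return client.messages.create({
    model: "claude-sonnet-4-5",  // illustrative
    max_tokens: 4096,
    system: [
      // Cache point after the static system prompt.
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
    ],
    messages: [
      ...history,  // earlier turns carry their own cache markers
      {
        role: "user",
        content: [
          // The tail cache point moves up as the conversation grows.
          { type: "text", text: latestUserMessage, cache_control: { type: "ephemeral" } },
        ],
      },
    ],
  });
}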

Because the system prompt and the tool selection now have to be mostly static, we feed a dynamic message later to provide information such as the current time. Otherwise, this would trash the cache. We also leverage reinforcement during the loop much more.

Reinforcement In The Agent Loop

Every time the agent runs a tool you have the opportunity to not just return data that the tool produces, but also to feed more information back into the loop. For instance, you can remind the agent about the overall objective and the status of individual tasks. You can also provide hints about how the tool call might succeed when a tool fails. Another use of reinforcement is to inform the system about state changes that happened in the background. If you have an agent that uses parallel processing, you can inject information after every tool call when that state changed and when it is relevant for completing the task.

Sometimes it’s enough for the agent to self-reinforce. In Claude Code, for instance, the todo write tool is a self-reinforcement tool. All it does is take from the agent a list of tasks that it thinks it should do and echo out what came in. It’s basically just an echo tool; it really doesn’t do anything else. But that is enough to drive the agent forward better than if the only task and subtask were given at the beginning of the context and too much has happened in the meantime.
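
A sketch of such an echo tool (not Claude Code's actual definition, just the shape of the idea):

const todoWriteTool = {
  name: "todo_write",
  description: "Record your current task list whenever the plan changes.",
  input_schema: {
    type: "object",
    properties: {
      todos: {
        type: "array",
        items: {
          type: "object",
          properties: {
            task: { type: "string" },
            status: { type: "string", enum: ["pending", "in_progress", "done"] },
          },
          required: ["task", "status"],
        },
      },
    },
    required: ["todos"],
  },
};

// The handler is literally just an echo back into the context.
function handleTodoWrite({ todos }) {
  return todos.map((t) => `[${t.status}] ${t.task}`).join("\n");
}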

We also use reinforcements to inform the system if the environment changed during execution in a way that’s problematic for the agent. For instance, if our agent fails and retries from a certain step forward but the recovery operates off broken data, we inject a message informing it that it might want to back off a couple of steps and redo an earlier step.

Isolate Failures

If you expect a lot of failures during code execution, there is an opportunity to hide those failures from the context. This can happen in two ways. One is to run tasks that might require iteration individually. You would run them in a subagent until they succeed and only report back the success, plus maybe a brief summary of approaches that did not work. It is helpful for an agent to learn about what did not work in a subtask because it can then feed that information into the next task to hopefully steer away from those failures.

The second option doesn’t exist in all agents or foundation models, but with Anthropic you can do context editing. So far we haven’t had a lot of success with context editing, but we believe it’s an interesting thing we would love to explore more. We would also love to learn if people have success with it. What is interesting about context editing is that you should be able to preserve tokens for further down the iteration loop. You can take out of the context certain failures that didn’t drive towards successful completion of the loop, but only negatively affected certain attempts during execution. But as with the point I made earlier: it is also useful for the agent to understand what didn’t work, but maybe it doesn’t require the full state and full output of all the failures.

Unfortunately, context editing will automatically invalidate caches. There is really no way around it. So it can be unclear when the trade-off of doing that compensates for the extra cost of trashing the cache.

Sub Agents / Sub Inference

As I mentioned a couple of times on this blog already, most of our agents are based on code execution and code generation. That really requires a common place for the agent to store data. Our choice is a file system—in our case a virtual file system—but that requires different tools to access it. This is particularly important if you have something like a subagent or subinference.

You should try to build an agent that doesn’t have dead ends. A dead end is where a task can only continue executing within the sub-tool that you built. For instance, you might build a tool that generates an image, but is only able to feed that image back into one more tool. That’s a problem because you might then want to put those images into a zip archive using the code execution tool. So there needs to be a system that allows the image generation tool to write the image to the same place where the code execution tool can read it. In essence, that’s a file system.

Obviously it has to go the other way around too. You might want to use the code execution tool to unpack a zip archive and then go back to inference to describe all the images so that the next step can go back to code execution and so forth. The file system is the mechanism that we use for that. But it does require tools to be built in a way that they can take file paths to the virtual file system to work with.

So basically an ExecuteCode tool would have access to the same file system as the RunInference tool which could take a path to a file on that same virtual file system.
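
In sketch form (every name here is hypothetical), the contract is simply that tools accept and return paths on the shared virtual file system:

// All tools go through one shared virtual file system, so no tool is a
// dead end. callImageModel and describeWithModel are assumed helpers.
async function generateImageTool(fs, { prompt, outPath }) {
  const png = await callImageModel(prompt);
  await fs.write(outPath, png);  // code execution can pick this up later
  return `wrote image to ${outPath}`;
}

async function runInferenceTool(fs, { prompt, inputPath }) {
  const bytes = await fs.read(inputPath);  // e.g. unpacked from a zip earlier
  return describeWithModel(prompt, bytes);
}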

The Use Of An Output Tool

One interesting thing about how we structured our agent is that it does not represent a chat session. It will eventually communicate something to the user or the outside world, but all the messages that it sends in between are usually not revealed. The question is: how does it create that message? We have one tool which is the output tool. The agent uses it explicitly to communicate to the human. We then use a prompt to instruct it when to use that tool. In our case the output tool sends an email.

But that turns out to pose a few other challenges. One is that it’s surprisingly hard to steer the wording and tone of that output tool compared to just using the main agent loop’s text output as the mechanism to talk to the user. I cannot say why this is, but I think it’s probably related to how these models are trained.

One attempt that didn’t work well was to have the output tool run another quick LLM like Gemini 2.5 Flash to adjust the tone to our preference. But this increases latency and actually reduces the quality of the output. In part, I think the model just doesn’t word things correctly and the subtool doesn’t have sufficient context. Providing more slices of the main agentic context into the subtool makes it expensive and also didn’t fully solve the problem. It also sometimes reveals information in the final output that we didn’t want to be there, like the steps that led to the end result.

Another problem with an output tool is that sometimes it just doesn’t call the tool. One of the ways in which we’re forcing this is we remember if the output tool was called. If the loop ends without the output tool, we inject a reinforcement message to encourage it to use the output tool.
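
In sketch form (the helper names are hypothetical), that enforcement looks roughly like this:

let outputToolCalled = false;

while (true) {
  const result = await runAgentStep(messages);  // assumed agent-loop step
  for (const call of result.toolCalls) {
    if (call.name === "send_output") outputToolCalled = true;
    messages.push(await executeTool(call));  // assumed helper
  }
  if (!result.finished) continue;
  if (outputToolCalled) break;
  // Ended without communicating anything: reinforce and go around again.
  messages.push({
    role: "user",
    content: "You finished without reporting a result. Use the send_output tool to communicate your findings.",
  });
}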

Model Choice

Overall our choices for models haven’t dramatically changed so far. I think Haiku and Sonnet are still the best tool callers available, so they make for excellent choices in the agent loop. They are also somewhat transparent with regards to what the RL looks like. The other obvious choices are the Gemini models. We so far haven’t found a ton of success with the GPT family of models for the main loop.

For the individual sub-tools, which in part might also require inference, our current choice is Gemini 2.5 if you need to summarize large documents or work with PDFs and things like that. That is also a pretty good model for extracting information from images, in particular because the Sonnet family of models likes to run into a safety filter which can be annoying.

There’s also the fairly obvious realization that token cost alone doesn’t define how expensive an agent is. A better tool caller will do the job in fewer tokens. There are models available today that are cheaper than Sonnet, but they are not necessarily cheaper in a loop.

But all things considered, not that much has changed in the last couple of weeks.

Testing and Evals

We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there’s too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here. Unfortunately, I have to report that at the moment we haven’t found something that really makes us happy. I hope we’re going to find a solution for this because it is becoming an increasingly frustrating aspect of building an agent.

Coding Agent Updates

As for my experience with coding agents, not really all that much has changed. The main new development is that I’m trialing Amp more. In case you’re curious why: it’s not that it’s objectively a better agent than what I’m using, but I really quite like the way they’re thinking about agents from what they’re posting. The interaction of the different sub agents, like the Oracle, with the main loop is beautifully done, and not many other harnesses do this today. It’s also a good way for me to validate how different agent designs work. Amp, similar to Claude Code, really feels like a product built by people who also use their own tool. I do not feel every other agent in the industry does this.

Quick Stuff I Read And Found

That’s just a random assortment of things that I feel might also be worth sharing:

  • What if you don’t need MCP at all?: Mario argues that many MCP servers are overengineered and include large toolsets that consume lots of context. He proposes a minimalist approach for browser-agent use-cases by relying on simple CLI tools (e.g., start, navigate, evaluate JS, screenshot) executed via Bash, which keeps token usage small and workflows flexible. I built a Claude/Amp Skill out of it.
  • The fate of “small” open source: The author argues that the age of tiny, single-purpose open-source libraries is coming to an end, largely because built-in platform APIs and AI tools can now generate simple utilities on demand. Thank fucking god.
  • Tmux is love. There is no article that goes with it, but the TLDR is that Tmux is great. If you have anything that remotely looks like an interactive system that an agent should work with, you should give it some Tmux skills.
  • LLM APIs are a Synchronization Problem. This was a separate realization that was too long for this post, so I wrote a separate one.

Absurd Workflows: Durable Execution With Just Postgres

2025-11-03 08:00:00

It’s probably no surprise to you that we’re building agents somewhere. Everybody does it. Building a good agent, however, brings back some of the historic challenges involving durable execution.

Entirely unsurprisingly, a lot of people are now building durable execution systems. Many of these, however, are incredibly complex and require you to sign up for another third-party service. I generally try to avoid bringing in extra complexity if I can avoid it, so I wanted to see how far I can go with just Postgres. To this end, I wrote Absurd [1], a tiny SQL-only library with a very thin SDK to enable durable workflows on top of just Postgres — no extension needed.

Durable Execution 101

Durable execution (or durable workflows) is a way to run long-lived, reliable functions that can survive crashes, restarts, and network failures without losing state or duplicating work. Durable execution can be thought of as the combination of a queue system and a state store that remembers the most recently seen execution state.

Because Postgres is excellent at queues thanks to SELECT ... FOR UPDATE SKIP LOCKED, you can use it for the queue (e.g., with pgmq). And because it’s a database, you can also use it to store the state.
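
To illustrate the pattern (a generic sketch with a made-up schema, not Absurd's actual SQL): each worker atomically claims one pending task, and SKIP LOCKED lets concurrent workers pass over rows another worker is already holding.

import { Pool } from "pg";

const pool = new Pool();

async function claimTask() {
  const { rows } = await pool.query(
    `UPDATE tasks
        SET state = 'running', claimed_at = now()
      WHERE id = (
        SELECT id FROM tasks
         WHERE state = 'pending'
         ORDER BY enqueued_at
         FOR UPDATE SKIP LOCKED
         LIMIT 1
      )
      RETURNING *`
  );
  return rows[0] ?? null;  // null means the queue is empty right now
}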

The state is important. With durable execution, instead of running your logic in memory, the goal is to decompose a task into smaller pieces (step functions) and record every step and decision. When the process stops (whether it fails, intentionally suspends, or a machine dies) the engine can replay those events to restore the exact state and continue where it left off, as if nothing happened.

Absurd At A High Level

Absurd at the core is a single .sql file (absurd.sql) which needs to be applied to a database of your choice. That SQL file’s goal is to move the complexity of SDKs into the database. SDKs then make the system convenient by abstracting the low-level operations in a way that leverages the ergonomics of the language you are working with.

The system is very simple: A task dispatches onto a given queue from where a worker picks it up to work on. Tasks are subdivided into steps, which are executed in sequence by the worker. Tasks can be suspended or fail, and when that happens, they execute again (a run). The result of a step is stored in the database (a checkpoint). To avoid repeating work, checkpoints are automatically loaded from the state storage in Postgres again.

Additionally, tasks can sleep or suspend for events and wait until they are emitted. Events are cached, which means they are race-free.

With Agents

What is the relationship of agents with workflows? Normally, workflows are DAGs defined by a human ahead of time. AI agents, on the other hand, define their own adventure as they go. That means they are basically a workflow with mostly a single step that iterates over changing state until it determines that it has completed. Absurd enables this by automatically counting up steps if they are repeated:

absurd.registerTask({name: "my-agent"}, async (params, ctx) => {
  let messages = [{role: "user", content: params.prompt}];
  let step = 0;
  while (step++ < 20) {
    const { newMessages, finishReason } = await ctx.step("iteration", async () => {
      return await singleStep(messages);
    });
    messages.push(...newMessages);
    if (finishReason !== "tool-calls") {
      break;
    }
  }
});

This defines a single task named my-agent, and it has just a single step. The return value is the changed state, but the current state is passed in as an argument. Every time the step function is executed, the data is looked up first from the checkpoint store. The first checkpoint will be iteration, the second iteration#2, iteration#3, etc. Each state only stores the new messages it generated, not the entire message history.

If a step fails, the task fails and will be retried. And because of checkpoint storage, if you crash in step 5, the first 4 steps will be loaded automatically from the store. Steps are never retried, only tasks.

How do you kick it off? Simply enqueue it:

await absurd.spawn("my-agent", {
  prompt: "What's the weather like in Boston?"
}, {
  maxAttempts: 3,
});

And if you are curious, this is an example implementation of the singleStep function used above:

Single step function
async function singleStep(messages) {
  const result = await generateText({
    model: anthropic("claude-haiku-4-5"),
    system: "You are a helpful agent",
    messages,
    tools: {
      getWeather: { /* tool definition here */ }
    },
  });

  const newMessages = (await result.response).messages;
  const finishReason = await result.finishReason;

  if (finishReason === "tool-calls") {
    const toolResults = [];
    for (const toolCall of result.toolCalls) {
      /* handle tool calls here */
      if (toolCall.toolName === "getWeather") {
        const toolOutput = await getWeather(toolCall.input);
        toolResults.push({
          toolName: toolCall.toolName,
          toolCallId: toolCall.toolCallId,
          type: "tool-result",
          output: {type: "text", value: toolOutput},
        });
      }
    }
    newMessages.push({
      role: "tool",
      content: toolResults
    });
  }

  return { newMessages, finishReason };
}

Events and Sleeps

And like Temporal and other solutions, you can yield if you want. If you want to come back to a problem in 7 days, you can do so:

await ctx.sleep(60 * 60 * 24 * 7); // sleep for 7 days

Or if you want to wait for an event:

const eventName = `email-confirmation-${userId}`;
try {
  const payload = await ctx.waitForEvent(eventName, {timeout: 60 * 5});
  // handle event payload
} catch (e) {
  if (e instanceof TimeoutError) {
    // handle timeout
  } else {
    throw e;
  }
}

Which someone else can emit:

const eventName = `email-confirmation-${userId}`;
await absurd.emitEvent(eventName, { confirmedAt: new Date().toISOString() });

That’s it!

Really, that’s it. There is really not much to it. It’s just a queue and a state store — that’s all you need. There is no compiler plugin and no separate service or whole runtime integration. Just Postgres. That’s not to throw shade on these other solutions; they are great. But not every problem necessarily needs to scale to that level of complexity, and you can get quite far with much less. Particularly if you want to build software that other people should be able to self-host, that might be quite appealing.

  1. It’s named Absurd because durable workflows are absurdly simple, but have been overcomplicated in recent years.

Regulation Isn’t the European Trap — Resignation Is

2025-10-21 08:00:00

Plenty has been written about how hard it is to build in Europe versus the US. The list is always the same, with little progress: brittle politics, dense bureaucracy, mandatory notaries, endless and rigid KYC and AML processes. Fine. I know, you know.

I’m not here to add another complaint to the pile (but if we meet over a beer or coffee, I’m happy to unload a lot of hilarious anecdotes on you). The unfortunate reality is that most of these constraints won’t change in my lifetime and maybe ever. Europe is not culturally aligned with entrepreneurship, it’s opposed to the idea of employee equity, and our laws reflect that.

What bothers me isn’t the rules — it’s the posture that develops from them in people who should know better. Across the system, everyone points at someone else. If a process takes 10 steps, you’ll find 10 people who feel absolved of responsibility because they can cite 9 other blockers. Friction becomes a moral license to do a mediocre job (while lamenting it).

The vibe is: “Because the system is slow, I can be slow. Because there are rules, I don’t need judgment. Because there’s risk, I don’t need initiative.” And then we all nod along and nothing moves.

There are excellent people here; I’ve worked with them. But they are fighting upstream against a default of low agency. When the process is bad, too many people collapse into it. Communication narrows to the shortest possible message. Friday after 2pm, the notary won’t reply — and the notary will surely blame labor costs or regulation for why service ends there. The bank will cite compliance for why they don’t need to do anything. The registrar will point at some law that allows them to demand a translation of a document by a court-appointed translator. Everyone has a reason. No one owns the outcome.

Meanwhile, in the US, our counsel replies when it matters, even after hours. Bankers answer the same day. The instinct is to enable progress, not enumerate reasons you can’t have it. The goal is the outcome and the rules are constraints to navigate, not a shield to hide behind.

So what’s the point? I can’t fix politics. What I can do: act with agency, and surround myself with people who do the same and speak in support of it. Work with those who start from “how do we make this work?” not “why this can’t work.” Name the absurdities without using them as cover. Be transparent, move anyway and tell people.

Nothing stops a notary from designing an onboarding flow that gets an Austrian company set up in five days — standardized KYC packets, templated resolutions, scheduled signing slots, clear checklists, async updates, a bias for same-day feedback. That could exist right now. It rarely does, and where it does, it falls short.

Yes, much in Europe is objectively worse for builders. We have to accept it. Then squeeze everything you can from what is in your control:

  • Own the handoff. When you’re step 3 of 10, behave like step 10 depends on you and like you control all 10 steps. Anticipate blockers further down the line. Move same day. Eliminate ambiguity. Close loops.
  • Default to clarity. Send checklists. Preempt the next two questions. Reduce the number of touches.
  • Model urgency without theatrics. Be calm, fast, and precise. Don’t make your customer chase you.
  • Use judgment. Rules exist and we can’t break them all. But we can work with them and be guided by them.

Select for agency. Choose partners who answer promptly when it’s material and who don’t confuse process with progress.

The trap is not only regulation. It’s the learned helplessness it breeds. If we let friction set our standards, we become the friction. We won’t legislate our way to a US-style environment anytime soon. But we don’t need permission to be better operators inside a bad one.

That’s the contrast and it’s the part we control.


Postscript: Comparing Europe to the US triggers people, and I’m conscious of that. Maturity is holding two truths at once: they do some things right and some things wrong, and so do we. You don’t win by talking others down or praying for their failure. I’d rather see both Europe and the US succeed than celebrate Europe failing slightly less.

And no, saying I feel gratitude and happiness when I get a midnight reply doesn’t make me anti-work-life balance (I am not). It means when something is truly time-critical, fast, clear action lifts everyone. The times someone sent a document in minutes, late at night, both sides felt good about it when it mattered. Responsiveness, used with judgment, is not exploitation; it’s respect for outcomes and the relationships we form.