2026-02-08 05:31:44
I am having more fun programming than I ever have, because so many more of the programs I wish I could find the time to write actually exist. I wish I could share this joy with the people who are fearful about the changes agents are bringing. The fear itself I understand, I have fear more broadly about what the end-game is for intelligence on tap in our society. But in the limited domain of writing computer programs these tools have brought so much exploration and joy to my work.
— David Crawshaw, Eight more months of agents
Tags: coding-agents, ai-assisted-programming, generative-ai, ai, llms
2026-02-07 23:40:48
Last week I hinted at a demo I had seen from a team implementing what Dan Shapiro called the Dark Factory level of AI adoption, where no human even looks at the code the coding agents are producing. That team was part of StrongDM, and they've just shared the first public description of how they are working in Software Factories and the Agentic Moment:
We built a Software Factory: non-interactive development where specs + scenarios drive agents that write code, run harnesses, and converge without human review. [...]
In kōan or mantra form:
- Why am I doing this? (implied: the model should be doing this instead)
In rule form:
- Code must not be written by humans
- Code must not be reviewed by humans
Finally, in practical form:
- If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement
I think the most interesting of these, without a doubt, is "Code must not be reviewed by humans". How could that possibly be a sensible strategy when we all know how prone LLMs are to making inhuman mistakes?
I've seen many developers recently acknowledge the November 2025 inflection point, where Claude Opus 4.5 and GPT 5.2 appeared to turn the corner on how reliably a coding agent could follow instructions and take on complex coding tasks. StrongDM's AI team was founded in July 2025 based on an earlier inflection point relating to Claude 3.5 Sonnet:
The catalyst was a transition observed in late 2024: with the second revision of Claude 3.5 (October 2024), long-horizon agentic coding workflows began to compound correctness rather than error.
By December of 2024, the model's long-horizon coding performance was unmistakable via Cursor's YOLO mode.
Their new team started with the rule "no hand-coded software" - radical for July 2025, but something I'm seeing significant numbers of experienced developers start to adopt as of January 2026.
They quickly ran into the obvious problem: if you're not writing anything by hand, how do you ensure that the code actually works? Having the agents write tests only helps if they don't cheat and assert true.
This feels like the most consequential question in software development right now: how can you prove that software you are producing works if both the implementation and the tests are being written for you by coding agents?
StrongDM's answer was inspired by Scenario testing (Cem Kaner, 2003). As StrongDM describe it:
We repurposed the word scenario to represent an end-to-end "user story", often stored outside the codebase (similar to a "holdout" set in model training), which could be intuitively understood and flexibly validated by an LLM.
Because much of the software we grow itself has an agentic component, we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term satisfaction to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?
That idea of treating scenarios as holdout sets - used to evaluate the software but not stored where the coding agents can see them - is fascinating. It imitates aggressive testing by an external QA team - an expensive but highly effective way of ensuring quality in traditional software.
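The satisfaction number itself is just a ratio, but it's worth seeing how little machinery the definition needs. Here's a minimal sketch, assuming a hypothetical judge() function standing in for whatever LLM call scores each recorded trajectory - the names here are mine, not StrongDM's implementation:

```python
# Minimal sketch of "satisfaction": the fraction of observed trajectories
# that an LLM judge believes met the user's goal. judge() is a hypothetical
# stand-in for whatever model call scores a trajectory - my assumption.

def satisfaction(trajectories, judge):
    """trajectories: list of (scenario, transcript) pairs recorded while
    agents exercised the system against its held-out scenarios."""
    if not trajectories:
        return 0.0
    satisfied = sum(
        1 for scenario, transcript in trajectories
        if judge(scenario, transcript)  # True if the user's goal appears met
    )
    return satisfied / len(trajectories)

# e.g. 870 satisfying runs out of 1,000 observed trajectories -> 0.87
```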
Which leads us to StrongDM's concept of a Digital Twin Universe - the part of the demo I saw that made the strongest impression on me.
The software they were building helped manage user permissions across a suite of connected services. This in itself was notable - security software is the last thing you would expect to be built using unreviewed LLM code!
[The Digital Twin Universe is] behavioral clones of the third-party services our software depends on. We built twins of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, replicating their APIs, edge cases, and observable behaviors.
With the DTU, we can validate at volumes and rates far exceeding production limits. We can test failure modes that would be dangerous or impossible against live services. We can run thousands of scenarios per hour without hitting rate limits, triggering abuse detection, or accumulating API costs.
How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!
As I understood it, the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation.
With their own, independent clones of those services - free from rate-limits or usage quotas - their army of simulated testers could go wild. Their scenario tests became scripts for agents to constantly execute against the new systems as they were being built.
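To make that concrete, here's a minimal sketch of what one tiny corner of a behavioral twin might look like - in Python rather than the self-contained Go binaries StrongDM described, and with a made-up /api/v1/users endpoint purely for illustration:

```python
# Illustrative only: an in-memory "twin" of a tiny slice of a user-directory API.
# Real twins replicate documented endpoints, edge cases and error behaviors;
# this just shows the shape: no external service, no rate limits, fully scriptable.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

USERS = {}  # in-memory state, reset whenever a scenario run starts

class TwinHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/api/v1/users":  # hypothetical endpoint
            body = json.dumps(list(USERS.values())).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def do_POST(self):
        if self.path == "/api/v1/users":
            length = int(self.headers.get("Content-Length", 0))
            user = json.loads(self.rfile.read(length))
            USERS[user["id"]] = user  # assumes an "id" field, purely for illustration
            self.send_response(201)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TwinHandler).serve_forever()
```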
This screenshot of their Slack twin also helps illustrate how the testing process works, showing a stream of simulated Okta users who are about to need access to different simulated systems.
![Screenshot of a Slack-like interface titled "DTU Slack" showing a thread view (Thread — C4B9FBB97) with "Focus first" and "Leave" buttons. The left sidebar lists channels including # org-general (182), # general (0) (shared×2), # it-support (0), # channel-0002 (0) (shared×2), # channel-0003 (0) through # channel-0020 (0), # org-finance (1), and a DMs section with a "Start" button. A "Create" button appears at the top of the sidebar. The main thread shows approximately 9 automated introduction messages from users with Okta IDs (e.g. @okta-u-423438-00001, @okta-u-423438-00002, etc.), all timestamped 2025-11-12Z between 18:50:31 and 18:51:51. Each message follows the format "Hi team! I'm [Name], joining as Employee in general. Key skills: [fictional skill phrases]. Excited to contribute!" All users have red/orange "O" avatar icons.](https://static.simonwillison.net/static/2026/strong-dm-slack.jpg)
This ability to quickly spin up a useful clone of a subset of Slack helps demonstrate how disruptive this new generation of coding agent tools can be:
Creating a high fidelity clone of a significant SaaS application was always possible, but never economically feasible. Generations of engineers may have wanted a full in-memory replica of their CRM to test against, but self-censored the proposal to build it.
The techniques page is worth a look too. In addition to the Digital Twin Universe they introduce terms like Gene Transfusion for having agents extract patterns from existing systems and reuse them elsewhere, Semports for directly porting code from one language to another and Pyramid Summaries for providing multiple levels of summary such that an agent can enumerate the short ones quickly and zoom in on more detailed information as it is needed.
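Pyramid Summaries lend themselves to a simple illustration. This is my own sketch of the general idea, not StrongDM's implementation: each document carries progressively more detailed tiers, and an agent scans the cheapest tier across everything before deciding what to read in full.

```python
# My own sketch of the Pyramid Summaries idea: store progressively more
# detailed summaries so an agent can skim cheaply and zoom in selectively.
from dataclasses import dataclass

@dataclass
class PyramidSummary:
    one_liner: str   # a few tokens - cheap to enumerate across many documents
    paragraph: str   # a short abstract, read only for documents that look relevant
    full_text: str   # fetched only when the agent commits to reading it

corpus = {
    "billing-service": PyramidSummary(
        one_liner="Handles invoices and payment retries",
        paragraph="Owns invoice generation, retry schedules and webhook handling...",
        full_text="(entire design doc)",
    ),
}

# An agent might enumerate every one_liner, pick the promising documents,
# read their paragraphs, then pull full_text for just one or two of them.
relevant = [key for key, s in corpus.items() if "invoice" in s.one_liner.lower()]
```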
StrongDM AI also released some software - in an appropriately unconventional manner.
github.com/strongdm/attractor is Attractor, the non-interactive coding agent at the heart of their software factory. Except the repo itself contains no code at all - just three markdown files describing the spec for the software in meticulous detail, and a note in the README that you should feed those specs into your coding agent of choice!
github.com/strongdm/cxdb is a more traditional release, with 16,000 lines of Rust, 9,500 of Go and 6,700 of TypeScript. This is their "AI Context Store" - a system for storing conversation histories and tool outputs in an immutable DAG.
It's similar to my LLM tool's SQLite logging mechanism but a whole lot more sophisticated. I may have to gene transfuse some ideas out of this one!
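I haven't dug into cxdb's actual schema yet, but the general shape of an immutable, content-addressed DAG of conversation turns is easy to sketch - this is my own illustration, not cxdb's data model or API:

```python
# My own illustration of an immutable, content-addressed DAG for conversation
# history and tool outputs - not cxdb's actual data model or API.
import hashlib
import json

STORE = {}  # node_id -> node; nodes are only ever appended, never mutated

def add_node(content, parents=()):
    """Append a conversation turn or tool output; its ID is a hash of its
    content plus its parent IDs, so history cannot be silently rewritten."""
    node = {"content": content, "parents": list(parents)}
    node_id = hashlib.sha256(json.dumps(node, sort_keys=True).encode()).hexdigest()
    STORE[node_id] = node
    return node_id

root = add_node({"role": "user", "text": "Summarize the failing scenarios"})
reply = add_node({"role": "assistant", "text": "Three scenarios regressed..."}, parents=[root])
```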
I visited the StrongDM AI team back in October as part of a small group of invited guests.
The three-person team of Justin McCarthy, Jay Taylor and Navan Chauhan had formed just three months earlier, and they already had working demos of their coding agent harness, their Digital Twin Universe clones of half a dozen services and a swarm of simulated test agents running through scenarios. And all of this was a month before the Opus 4.5/GPT 5.2 releases that made agentic coding significantly more reliable.
It felt like a glimpse of one potential future of software development, where software engineers move from building the code to building and then semi-monitoring the systems that build the code. The Dark Factory.
I glossed over the cost - that suggested $1,000/day token spend per engineer - in my first published version of this post, but it deserves some serious attention.
If these patterns really do add $20,000/month per engineer to your budget they're far less interesting to me. At that point this becomes more of a business model exercise: can you create a profitable enough line of products that you can afford the enormous overhead of developing software in this way?
Building sustainable software businesses also looks very different when any competitor can potentially clone your newest features with a few hours of coding agent work.
I hope these patterns can be put into play with a much lower spend. I've personally found the $200/month Claude Max plan gives me plenty of space to experiment with different agent patterns, but I'm also not running a swarm of QA testers 24/7!
I think there's a lot to learn from StrongDM even for teams and individuals who aren't going to burn thousands of dollars on token costs. I'm particularly invested in the question of what it takes to have agents prove that their code works without needing to review every line of code they produce.
Tags: ai, generative-ai, llms, ai-assisted-programming, coding-agents, parallel-agents
2026-02-07 07:41:31
I don't know why this week became the tipping point, but nearly every software engineer I've talked to is experiencing some degree of mental health crisis.
[...] Many people assuming I meant job loss anxiety but that's just one presentation. I'm seeing near-manic episodes triggered by watching software shift from scarce to abundant. Compulsive behaviors around agent usage. Dissociative awe at the temporal compression of change. It's not fear necessarily just the cognitive overload from living in an inflection point.
— Tom Dale
Tags: ai-ethics, careers, coding-agents, generative-ai, ai, llms
2026-02-07 06:31:31
There's a jargon-filled headline for you! Everyone's building sandboxes for running untrusted code right now, and Pydantic's latest attempt, Monty, provides a custom Python-like language (a subset of Python) in Rust and makes it available as both a Rust library and a Python package. I got it working in WebAssembly, providing a sandbox-in-a-sandbox.
Here's how they describe Monty:
Monty avoids the cost, latency, complexity and general faff of using a full container-based sandbox for running LLM generated code.
Instead, it lets you safely run Python code written by an LLM embedded in your agent, with startup times measured in single digit microseconds not hundreds of milliseconds.
What Monty can do:
- Run a reasonable subset of Python code - enough for your agent to express what it wants to do
- Completely block access to the host environment: filesystem, env variables and network access are all implemented via external function calls the developer can control
- Call functions on the host - only functions you give it access to [...]
A quick way to try it out is via uv:
uv run --with pydantic-monty python -m asyncio
Then paste this into the Python interactive prompt - the -m asyncio enables top-level await:
import pydantic_monty
code = pydantic_monty.Monty('print("hello " + str(4 * 5))')
await pydantic_monty.run_monty_async(code)
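Monty interpreters can also declare named inputs and then be re-run with different values using the synchronous API - this is the same pattern the Pyodide demo further down exercises:

```python
import pydantic_monty

# Declare named inputs, then execute the same sandboxed expression
# repeatedly with different values.
m = pydantic_monty.Monty('x + y', inputs=['x', 'y'])
print(m.run(inputs={"x": 10, "y": 20}))    # 30
print(m.run(inputs={"x": 100, "y": 200}))  # 300
```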
Monty supports a very small subset of Python - it doesn't even support class declarations yet!
But, given its target use-case, that's not actually a problem.
The neat thing about providing tools like this for LLMs is that they're really good at iterating against error messages. A coding agent can run some Python code, get an error message telling it that classes aren't supported and then try again with a different approach.
I wanted to try this in a browser, so I fired up a code research task in Claude Code for web and kicked it off with the following:
Clone https://github.com/pydantic/monty to /tmp and figure out how to compile it into a python WebAssembly wheel that can then be loaded in Pyodide. The wheel file itself should be checked into the repo along with build scripts and passing pytest playwright test scripts that load Pyodide from a CDN and the wheel from a “python -m http.server” localhost and demonstrate it working
Then a little later:
I want an additional WASM file that works independently of Pyodide, which is also usable in a web browser - build that too along with playwright tests that show it working. Also build two HTML files - one called demo.html and one called pyodide-demo.html - these should work similar to https://tools.simonwillison.net/micropython (download that code with curl to inspect it) - one should load the WASM build, the other should load Pyodide and have it use the WASM wheel. These will be served by GitHub Pages so they can load the WASM and wheel from a relative path since the .html files will be served from the same folder as the wheel and WASM file
Here's the transcript, and the final research report it produced.
I now have the Monty Rust code compiled to WebAssembly in two different shapes - as a .wasm bundle you can load and call from JavaScript, and as a monty-wasm-pyodide/pydantic_monty-0.0.3-cp313-cp313-emscripten_4_0_9_wasm32.whl wheel file which can be loaded into Pyodide and then called from Python in Pyodide in WebAssembly in a browser.
Here are those two demos, hosted on GitHub Pages:
![Screenshot of a web app titled "Monty via Pyodide" with description "Run Monty (a sandboxed Python interpreter by Pydantic) inside Pyodide (CPython compiled to WebAssembly). This loads the pydantic-monty wheel and uses its full Python API. Code is saved in the URL for sharing." A green banner reads "Code executed successfully!" Below are example buttons labeled "Basic", "Inputs", "Reuse", "Error Handling", "Fibonacci", and "Classes". A code editor labeled "Python Code (runs inside Monty sandbox via Pyodide):" contains: "import pydantic_monty\n\n# Create interpreter with input variables\nm = pydantic_monty.Monty('x + y', inputs=['x', 'y'])\n\n# Run with different inputs\nresult1 = m.run(inputs={"x": 10, "y": 20})\nprint(f"10 + 20 = {result1}")\n\nresult2 = m.run(inputs={"x": 100, "y": 200})" with "Run Code" and "Clear" buttons. The Output section shows "10 + 20 = 30" and "100 + 200 = 300" with a "Copy" button. Footer reads "Executed in 4.0ms".](https://static.simonwillison.net/static/2026/monty-pyodide.jpg)
As a connoisseur of sandboxes - the more options the better! - this new entry from Pydantic ticks a lot of my boxes. It's small, fast, widely available (thanks to Rust and WebAssembly) and provides strict limits on memory usage, CPU time and access to disk and network.
It was also a great excuse to spin up another demo showing how easy it is these days to turn compiled code like C or Rust into WebAssembly that runs in both a browser and a Pyodide environment.
Tags: javascript, python, sandboxing, ai, rust, webassembly, pyodide, generative-ai, llms, ai-assisted-programming, pydantic, coding-agents, claude-code
2026-02-07 02:44:21
An ominous headline to see on the official Heroku blog and yes, it's bad news.
Today, Heroku is transitioning to a sustaining engineering model focused on stability, security, reliability, and support. Heroku remains an actively supported, production-ready platform, with an emphasis on maintaining quality and operational excellence rather than introducing new features. We know changes like this can raise questions, and we want to be clear about what this means for customers.
Based on context I'm guessing a "sustaining engineering model" (this definitely isn't a widely used industry term) means that they'll keep the lights on and that's it.
This is a very frustrating piece of corporate communication. "We want to be clear about what this means for customers" - then proceeds to not be clear about what this means for customers.
Why are they doing this? Here's their explanation:
We’re focusing our product and engineering investments on areas where we can deliver the greatest long-term customer value, including helping organizations build and deploy enterprise-grade AI in a secure and trusted way.
My blog is the only project I have left running on Heroku. I guess I'd better migrate it away (probably to Fly) before Salesforce lose interest completely.
Tags: salesforce, heroku, fly
2026-02-06 08:42:22
When I want to quickly implement a one-off experiment in a part of the codebase I am unfamiliar with, I get codex to do extensive due diligence. Codex explores relevant slack channels, reads related discussions, fetches experimental branches from those discussions, and cherry picks useful changes for my experiment. All of this gets summarized in an extensive set of notes, with links back to where each piece of information was found. Using these notes, codex wires the experiment and makes a bunch of hyperparameter decisions I couldn’t possibly make without much more effort.
— Karel D'Oosterlinck, I spent $10,000 to automate my research at OpenAI with Codex
Tags: codex-cli, coding-agents, ai-assisted-programming, generative-ai, openai, ai, llms