2026-03-22 23:30:41
Our 5th cohort of Becoming an AI Engineer starts in less than a week. This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.
Here’s what makes this cohort special:
Learn by doing: Build real-world AI applications instead of just watching videos.
Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.
Live feedback and mentorship: Get direct feedback from instructors and peers.
Community driven: Learning alone is hard. Learning with a community is easy!
We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.
If you want to start learning AI from scratch, this is the perfect platform for you to begin.
2026-03-21 23:31:00
If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
Top 12 GitHub AI Repositories
Where Different Types of Tests Fit
How Single Sign-On (SSO) Works
How LLMs Use AI Agents with Deep Research
How Hackers Steal Passwords
These repositories were selected based on their overall popularity and GitHub stars.
OpenClaw: The always-on personal AI agent that lives on your device and talks to you through WhatsApp, Telegram, and 50+ other platforms.
N8n: A visual workflow automation platform with native AI capabilities and 400+ integrations.
Ollama: Run powerful LLMs locally on your own hardware with a single command.
Langflow: A drag-and-drop visual builder for designing and deploying AI agents and RAG workflows.
Dify: A full-stack, production-ready platform for building and deploying AI-powered apps and agentic workflows.
LangChain: The foundational framework powering the AI agent ecosystem with modular building blocks.
Open WebUI: A self-hosted, offline-capable ChatGPT alternative.
DeepSeek-V3: An open-weight LLM that rivals GPT on benchmarks and is free for commercial use.
Gemini CLI: Google’s open-source tool to interact with the Gemini model right from your terminal.
RAGFlow: An enterprise-grade RAG engine that grounds AI answers in real documents with citation tracking.
Claude Code: An agentic coding tool that understands your entire codebase and executes engineering tasks from the terminal.
CrewAI: A lightweight Python framework for assembling teams of role-playing AI agents to collaborate on tasks.
Over to you: Which other repository will you add to the list?
Writing code is easy now, but testing code is hard.
Let’s take a look at where different types of tests fit.
Unit + Component Tests: These test individual functions or UI components in isolation. They’re fast, inexpensive to run, and easy to maintain. Tools like Jest, Vitest, JUnit, pytest, React Testing Library, Cypress, Vue Test Utils, and Playwright are commonly used here, and most of your test coverage should come from this layer.
Integration Tests: These verify communication between services, APIs, and databases. Common tools include Testcontainers, Postman, Bruno, and Supertest. Unit tests won't catch a broken API contract, but integration tests will.
End-to-End Tests: Tools like Cypress, Playwright, Appium, and QA Wolf validate full user journeys across the whole system. They are expensive to run and maintain, which is why fewer tests live in this layer.
AI tools are becoming part of the testing workflow. Tools like GitHub Copilot, ChatGPT, Claude, Cursor, and Qodo can help draft tests, update suites, and spot gaps in coverage. They take care of repetitive tasks and give engineers more time to focus on the edge cases that may arise in production.
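To make the base of the pyramid concrete, here is a minimal pytest-style unit test sketch. The `slugify` helper and its behavior are purely illustrative, not from any project mentioned above:

```python
# Hypothetical helper under test -- name and behavior are illustrative.
def slugify(title: str) -> str:
    """Convert a title to a URL-friendly, lowercase, hyphenated slug."""
    return "-".join(title.lower().split())

# Unit tests: fast, isolated, no I/O -- pytest discovers test_* functions.
def test_slugify_basic():
    assert slugify("Hello World") == "hello-world"

def test_slugify_extra_spaces():
    # split() with no arguments collapses runs of whitespace
    assert slugify("  System   Design  ") == "system-design"
```

Tests like these run in milliseconds, which is why most coverage should live at this layer.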
Over to you: How do you test your code?
Building web scrapers for RAG pipelines or model training usually means managing fragile fleets of headless browsers and complex scraping logic. Cloudflare’s new Browser Rendering endpoint changes that. You can now crawl an entire website asynchronously with a single API call. Submit a starting URL, and the endpoint automatically discovers pages, renders them, and returns clean HTML, Markdown, or structured JSON. It fully respects robots.txt out of the box, supports incremental crawling to reduce costs, and includes a fast static mode. Stop managing scraping infrastructure and get back to building your application.
Single Sign-On (SSO) makes access feel effortless. One login, and you’re inside Slack and several other internal tools without logging in again.
But there’s a lot going on behind that single login.
Step 1: The first login
A user opens an application, for example Salesforce.
Instead of asking for credentials directly, Salesforce redirects the browser to an Identity Provider (IdP) like Okta or Auth0. This redirect usually happens through an HTTP 302 response.
The browser then sends an authentication request to the IdP using protocols such as SAML or OpenID Connect (OIDC).
The IdP presents the login page. The user enters their credentials, sometimes along with MFA.
Once verified, the IdP creates a login session and sends back an authentication response (a SAML assertion or ID token) through the browser.
The browser forwards that response back to Salesforce.
Salesforce validates the token and creates its own local session, typically stored as a cookie, and grants access.
Step 2: The SSO magic
Now the user opens another app, say Slack.
Slack also redirects the browser to the same identity provider. But the IdP sees that the user already has an active session, so it skips the login step entirely and issues a new authentication token.
The browser forwards that token to Slack.
Slack validates it, creates its own session cookie, and grants access.
The key idea behind SSO is simple. Applications don’t authenticate users themselves. They rely on a central identity provider to verify the user and issue a token that other systems trust.
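The first redirect in the flow is just a URL with a handful of query parameters. Here is a sketch of how a service provider might construct an OIDC authorization request; the endpoint, client ID, and redirect URI are illustrative placeholders, not real Okta or Auth0 values:

```python
from urllib.parse import urlencode

def build_auth_redirect(idp_authorize_url: str, client_id: str,
                        redirect_uri: str, state: str) -> str:
    """Build the URL the app redirects the browser to (HTTP 302 target)."""
    params = {
        "response_type": "code",       # authorization code flow
        "client_id": client_id,        # identifies the app to the IdP
        "redirect_uri": redirect_uri,  # where the IdP sends the user back
        "scope": "openid profile email",
        "state": state,                # opaque CSRF-protection token
    }
    return f"{idp_authorize_url}?{urlencode(params)}"

url = build_auth_redirect("https://idp.example.com/authorize",
                          "salesforce-app",
                          "https://app.example.com/callback",
                          "xyz123")
```

After login, the IdP redirects back to `redirect_uri` with a code the app exchanges for an ID token, which is the "authentication response" in the steps above.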
Over to you: What SSO solutions have you used, and which is your favorite?
When you ask an LLM such as Claude, ChatGPT, or Gemini to do deep research on a complex topic, it’s not just one model doing all the work. It’s a coordinated system of specialized AI agents.
Here’s how it works:
Step 1: Understanding The Question and Making a Plan
It all starts with the query, something like “Analyze the competitive landscape of AI agents in 2026.” The system doesn’t just dive in blindly. First, it may ask clarifying questions to understand exactly what is needed. Then, it generates a plan and breaks the big question down into smaller, manageable tasks.
Step 2: Sub-Agents Get to Work
Each small task gets assigned to a sub-agent, which is basically a mini AI worker with a specific job. For example, one sub-agent might be tasked with finding the latest Nvidia earnings. It figures out which tools to use, such as searching the web, browsing a specific page, or even running code to analyze data. All of this happens through a secure layer of APIs and services that connect the AI to the outside world.
Step 3: Putting it All Together
Once all the sub-agents finish their tasks, a Synthesizer Agent takes over. It aggregates everything, identifies key themes, plans an outline, and removes any redundant or duplicate information. At the same time, a Citation Agent makes sure every claim is linked back to its source and properly formatted. The end result is a polished, well-cited final output ready for use.
Over to you: Have you tried deep research in any LLM?
Most password attacks don't involve sophisticated hacking. They rely on automation, reused credentials, and predictable human behavior.
Here are six common techniques:
Brute-force attack: Automated tools cycle through password combinations at high speed until one works. No logic involved, just volume.
Dictionary attacks: Instead of random guesses, attackers use curated wordlists built from common passwords, leaked data, and predictable patterns.
Credential stuffing: When one site is breached, attackers reuse those stolen username–password pairs across many other services. It works because a large portion of users reuse passwords across multiple accounts.
Password spraying: One common password gets tried across many accounts in the same organization. Spreading attempts across accounts avoids triggering lockout thresholds.
Phishing: The victim lands on a fake login page and enters credentials. The attacker captures them in real time. No malware needed.
Keylogger malware: Malicious software records keystrokes and sends them to the attacker. Passwords, usernames, and anything else typed on the device are captured.
Over to you: Which attack have you seen most often?
2026-03-19 23:30:59
Every time we run an UPDATE statement in a database, something disappears. The old value, whatever was there a moment ago, is gone.
In fact, most databases are designed to forget. Every UPDATE overwrites what came before, every DELETE removes it entirely, and the application is left with only a snapshot of the present state. We accept this as normal because it’s the most natural way to think about things.
But what if your system needs to answer a different kind of question: not just “what is the current state?” but “how did we get here?”
That’s the question Event Sourcing is built to answer. And the solution is both more rewarding and more demanding than it first appears. In this article, we will look at Event Sourcing along with its benefits and trade-offs.
2026-03-18 23:30:41
Only 35% of engineering leaders report significant ROI from AI, and most ROI models miss the full picture.
The majority of engineering time is spent on investigating alerts, diagnosing incidents, and coordinating decisions across tools that don’t share context. The cost of that work rarely appears in ROI models.
When organizations only measure what it costs to produce code, they’re missing the downstream costs that pop up in production.
Learn how engineering teams at Zscaler, DoorDash, and Salesforce are measuring AI ROI across the full engineering lifecycle and finding the largest returns in production.
When OpenAI shipped Codex, their cloud-based coding agent, the hardest problems they had to solve had almost nothing to do with the AI model itself.
The model, codex-1, is a version of OpenAI’s o3 fine-tuned for software engineering. It was important, but it was also just one component in a much larger system. The real engineering went into everything around it.
How do you assemble the right prompt from five different sources? What happens when your conversation history grows so large it threatens to exceed the model’s memory? How do you make the same agent work in a terminal, a web browser, and three different IDEs without rewriting it each time?
When the Codex team needed their agent to work inside VS Code, they first tried the obvious approach and exposed it through MCP, the emerging standard for connecting AI models to tools. It didn’t work. The rich interaction patterns that a real agent needs, things like streaming progress, pausing mid-task for user approval, and emitting code diffs, didn’t map cleanly to what MCP offered. So the team built a new protocol from scratch.
In this article, we will look at how OpenAI built the right orchestration layer around the model.
Disclaimer: This post is based on publicly shared details from the OpenAI Engineering Team. Please comment if you notice any inaccuracies.
Codex is a coding agent that can write features, fix bugs, answer questions about your codebase, and propose pull requests.

Each task runs in its own isolated cloud sandbox, preloaded with your repository. You can assign multiple tasks in parallel and monitor progress in real time.
How Codex works behind the scenes is also quite interesting. The system has three layers worth understanding: the agent loop, prompt and context management, and the multi-surface architecture that lets one agent serve many different interfaces.
At the heart of Codex is something called the agent loop. The agent takes user input, constructs a prompt, sends it to the model for inference, and gets back a response.
However, that response isn’t always a final answer. Often, the model responds with a tool call instead, something like “run this shell command and tell me what happened.” When that happens, the agent executes the tool call, appends the output to the prompt, and queries the model again with this new information. This cycle repeats, sometimes dozens of times, until the model finally produces a message for the user.
See the diagram below:
What makes this more than a simple loop is everything the harness manages along the way.
Codex can read and edit files, run shell commands, execute test suites, invoke linters, and run type checkers. A single user request like “fix the bug in the auth module” might trigger the agent to read several files, run the existing tests to see what fails, edit the code, run the tests again, fix a linting error, and run the tests one more time before producing a final commit.
The model does the reasoning at each step, but the harness handles everything else, such as executing commands, collecting outputs, managing permissions, and deciding when the loop is done.
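The loop described above can be sketched in a few lines. This is an illustrative toy, not OpenAI's actual code: a scripted "model" stands in for real inference, and tool execution is simulated with a dict of callables.

```python
def run_shell(cmd: str) -> str:
    # Simulated tool: pretend to run a shell command and return its output
    return f"ran: {cmd}"

TOOLS = {"run_shell": run_shell}

def agent_loop(model, user_input: str) -> str:
    prompt = [{"role": "user", "content": user_input}]
    while True:
        response = model(prompt)               # one inference call
        if response["type"] == "tool_call":    # model asked for a tool
            output = TOOLS[response["tool"]](response["args"])
            prompt.append({"role": "tool", "content": output})
            continue                           # query again with new info
        return response["content"]             # final message for the user

# Scripted model: first asks to run the tests, then produces an answer.
script = iter([
    {"type": "tool_call", "tool": "run_shell", "args": "pytest"},
    {"type": "message", "content": "All tests pass."},
])
result = agent_loop(lambda prompt: next(script), "fix the auth bug")
```

The harness (here, `agent_loop`) owns execution, output collection, and termination; the model only decides what to do next.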
This distinction between model and harness matters because it shapes how developers actually use Codex. OpenAI’s own engineering teams use it to offload repetitive, well-scoped work like refactoring, renaming, writing tests, and triaging on-call issues.
The agent loop also has an outer layer. Each cycle of inference and tool calls constitutes what OpenAI calls a “turn.” However, conversations don’t end after one turn. When the user sends a follow-up message, the entire history of previous turns, including all the tool calls and their outputs, gets included in the next prompt. This is where things get expensive, and where the next layer of complexity kicks in.
See the diagram below:
When you type a request into Codex, your message becomes the bottom layer of a much larger prompt. Above it, the system stacks environment context like your current working directory and shell, the contents of any AGENTS.md files in your repository (these are project-specific instructions for the agent, covering things like coding conventions and which test commands to run), sandbox permission rules, developer instructions from configuration files, model-specific instructions, tool definitions, and a system message.

Each layer carries a role, either system, developer, or user, that signals its priority to the model. The server controls the ordering of the top layers; the client controls the rest. This layered construction means the model always has rich context about the environment it’s operating in. However, it also means the prompt is already large before the user says a single word. And it only grows from there.
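The stacking can be sketched as a simple list of role-tagged layers. The layer names and contents below are illustrative, not Codex's actual prompt format:

```python
def build_prompt(system_msg, tool_defs, agents_md, env, user_msg):
    """Assemble a layered prompt, top (highest priority) to bottom."""
    layers = [
        ("system", system_msg),     # server-controlled system message
        ("developer", tool_defs),   # tool definitions
        ("developer", agents_md),   # project instructions from AGENTS.md
        ("developer", env),         # cwd, shell, sandbox permission rules
        ("user", user_msg),         # the user's message is the bottom layer
    ]
    return [{"role": role, "content": content} for role, content in layers]

prompt = build_prompt("You are a coding agent.",
                      "tools: shell, edit_file",
                      "Run `pytest` before committing.",
                      "cwd=/repo shell=bash",
                      "fix the auth bug")
```

Even this toy version shows the point: four of the five layers exist before the user types anything.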
Every tool call the model makes produces output that gets appended to the prompt. Every new conversation turn includes the full history of all previous turns, tool calls included.
See the diagram below:

This means that the total JSON sent to the API over the course of a conversation grows quadratically. If the first turn sends X amount of data, the second turn resends all of X plus the new data, the third turn resends all of that plus more, and so on.
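A quick back-of-envelope sketch makes the quadratic cost concrete. If each turn appends a fixed `delta` of new content and every request resends the whole prompt:

```python
def total_sent(turns: int, delta: int) -> int:
    """Total data sent to the API when every turn resends the full prompt."""
    sent = 0
    prompt_size = 0
    for _ in range(turns):
        prompt_size += delta   # new content appended this turn
        sent += prompt_size    # entire prompt resent with this request
    return sent

# 10 turns of 1,000 tokens each: 55,000 tokens transferred in total,
# even though only 10,000 tokens of unique content exist.
```

The sum 1,000 + 2,000 + … + 10,000 is the familiar n(n+1)/2 pattern, which is why transfer grows quadratically in the number of turns.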
OpenAI accepts this cost on purpose. They could use a server-side parameter that lets the API remember previous conversation state, avoiding the need to resend everything. They chose not to because doing so would break the statelessness of each request and prevent support for customers who require Zero Data Retention. Therefore, every request is self-contained and carries the full conversation with it.
The key mitigation is prompt caching. Since Codex always appends new content to the end of the existing prompt, the old prompt is always an exact prefix of the new one. This prefix property lets OpenAI reuse computation from previous inference calls, so even though the data transfer is quadratic, the actual model computation stays closer to linear.
However, the prefix property is fragile. Anything that changes the beginning or middle of the prompt, like switching models, changing tools, or altering sandbox configuration, breaks the cache. When OpenAI added support for MCP tools, they accidentally introduced a bug where the tools weren’t listed in a consistent order between requests. That inconsistency alone was enough to destroy cache hits.
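A tiny simulation shows both the benefit and the fragility. Caching reuses computation for the longest shared prefix of consecutive prompts, so appending preserves everything while changing an early layer destroys it (the prompt layers below are illustrative stand-ins):

```python
def shared_prefix_len(old: list, new: list) -> int:
    """How many leading prompt segments a cache could reuse."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

turn1 = ["system", "tools:[shell,edit]", "user: fix bug"]
turn2 = turn1 + ["tool: output", "user: thanks"]  # append-only: full reuse
hit = shared_prefix_len(turn1, turn2)             # 3 of 3 segments reused

# Reordering tools between requests (like the MCP bug described above)
# changes an early segment and collapses the reusable prefix:
turn2_bug = ["system", "tools:[edit,shell]"] + turn2[2:]
miss = shared_prefix_len(turn2, turn2_bug)        # reuse drops to 1
```

One swapped segment near the front invalidates everything after it, which is why a tool-ordering inconsistency alone was enough to destroy cache hits.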
Eventually, even with caching, conversations hit the context window limit, the maximum number of tokens the model can process in a single inference call. When that happens, Codex compacts the conversation. It replaces the full history with a smaller, representative version that preserves the model’s understanding of what happened through an encrypted payload that carries the model’s latent state. In reality, the compaction mechanism involves more nuance than a simple summary, but the core idea stands: managing the context window is a first-class engineering problem, not an afterthought.
AGENTS.md files deserve a quick mention here because they represent a design decision about where context should live. Rather than hardcoding project-specific knowledge into the system, OpenAI lets developers place AGENTS.md files in their repositories, right alongside their code. These files tell Codex how to navigate the codebase, which commands to run for testing, and how to follow the project’s conventions. The model performs better with them, but also works without them.
Codex started life as a CLI tool. You ran it in your terminal, and it operated on your local codebase.
Then OpenAI needed it in VS Code, then in a web app, then as a macOS desktop app. Third-party IDEs like JetBrains and Xcode wanted to integrate it as well. Rewriting the agent logic for every surface was not an option.
As mentioned earlier, the first attempt was to expose Codex as an MCP server. However, the team found that MCP’s semantics couldn’t carry the full weight of what an agent conversation actually looks like. Codex needed to stream incremental progress as the model reasoned. It needed to pause mid-task and ask the user for approval before running certain commands. It needed to emit structured diffs. These interaction patterns were too rich for what MCP offered at the time.
So they built the App Server. All of the core agent logic, the agent loop, thread management, tool execution, configuration, and authentication live in a single codebase that OpenAI calls “Codex core.” The App Server wraps this core in a JSON-RPC protocol that any client can speak over standard input/output.
The protocol is fully bidirectional:
The client can send requests to the server (start a thread, submit a task).
The server can also send requests back to the client, for example, asking for approval before executing a shell command.
The agent’s turn pauses until the user responds with “allow” or “deny.” This lets the agent balance autonomy with human oversight without hardcoding that policy into the agent loop itself.
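The exchange can be sketched as plain JSON-RPC 2.0 messages over stdio. The method names below (`thread/start`, `exec/approval`) are illustrative, not the actual App Server protocol:

```python
import json

def rpc_request(msg_id: int, method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request (either side may send one)."""
    return json.dumps({"jsonrpc": "2.0", "id": msg_id,
                       "method": method, "params": params})

# Client -> server: start a task
client_req = rpc_request(1, "thread/start", {"task": "fix the auth bug"})

# Server -> client: pause the turn and ask for approval first
server_req = rpc_request(2, "exec/approval", {"command": "rm -rf build/"})

# Client -> server: the user's decision, as a response to id 2,
# resumes the paused turn
approval = json.dumps({"jsonrpc": "2.0", "id": 2,
                       "result": {"decision": "allow"}})
```

Because both sides can originate requests, the approval policy lives in the client UI rather than being hardcoded into the agent loop.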
Different surfaces use this architecture differently:
The VS Code extension and the desktop app bundle the App Server binary, launch it as a child process, and keep a bidirectional stdio channel open.
The web app runs the App Server inside a cloud container. A worker provisions the container with the checked-out repository, launches the binary, and streams events to the browser over HTTP. State lives on the server, so work continues even if the user closes the tab.
Partners like Xcode decouple their release cycles from OpenAI’s by keeping their client stable and pointing it at newer App Server binaries as they become available. The protocol is designed to be backward compatible, so older clients can safely talk to newer servers.
This architecture wasn’t planned from the start. It evolved from a CLI, through a failed MCP attempt, to the App Server protocol that now underpins every Codex surface. That trajectory is itself a useful lesson about system design: the right abstraction usually doesn’t exist until you’ve tried the wrong one.
OpenAI’s experience shows that the model is a component and the agent is the system. Most of the engineering is in the system.
If you use tools like Codex, understanding these mechanics helps you use them more effectively. Writing clear AGENTS.md files gives the agent project-specific context that meaningfully improves its output. Scoping tasks tightly works better than vague, open-ended requests because the agent loop is most effective when each cycle has a clear next step. And knowing that long conversations degrade due to context window limits and compaction explains why starting fresh threads for new tasks often gives better results.
Codex still has real constraints. It can’t accept image inputs for frontend work. You can’t course-correct the agent mid-task. Delegating to a remote agent takes longer than interactive editing, and that shift in workflow takes getting used to. OpenAI is working toward a future where interacting with Codex feels more like asynchronous collaboration with a colleague, but the gap between that vision and the current product is still significant.
References:
2026-03-17 23:30:44
Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and assessing the potential impact of every pull request.
Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.
CodeRabbit reviews 1 million PRs every week across 3 million repositories and is used by 100,000 open-source projects.
CodeRabbit is free for all open-source repos.
The Reddit Engineering Team completed one of the most demanding infrastructure migrations in the company’s history. It moved its entire Apache Kafka fleet, comprising over 500 brokers and more than a petabyte of live data, from Amazon EC2 virtual machines onto Kubernetes.
The migration was done with zero downtime and without asking a single client application to change how it connected to Kafka.
In this article, we will look at the breakdown of this migration, the challenges the engineering team faced, and how they achieved their goal of a successful migration.
Disclaimer: This post is based on publicly shared details from the Reddit Engineering Team. Please comment if you notice any inaccuracies.
To put things into perspective, let us first understand what exactly Apache Kafka is.
Apache Kafka is an open-source message streaming platform. Applications called producers write messages into Kafka partitions, and other applications called consumers read those messages out. Kafka sits in the middle and stores those messages reliably, even if the producer and consumer are running at completely different times. A single Kafka server is called a broker, whereas a collection of brokers working together forms a cluster.
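The producer/consumer decoupling can be illustrated with a toy in-memory model of a single partition. This is purely illustrative, not a real Kafka client; partitions are append-only logs, and consumers track their own read offsets:

```python
class Partition:
    """Toy model of one Kafka partition: an append-only message log."""

    def __init__(self):
        self.log = []

    def produce(self, message) -> int:
        self.log.append(message)
        return len(self.log) - 1   # offset assigned to the new message

    def consume(self, offset: int) -> list:
        # Consumers read from any offset; messages are retained, so a
        # consumer that starts later still sees everything.
        return self.log[offset:]

p = Partition()
p.produce("upvote:post123")
p.produce("comment:post123")
messages = p.consume(0)   # a late-starting consumer reads from the beginning
```

Because the log retains messages, the producer and consumer never need to be running at the same time, which is the decoupling described above.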
At Reddit, Apache Kafka is not a peripheral tool. It sits underneath hundreds of business-critical services, processing tens of millions of messages every second. If Kafka went down, large portions of Reddit would break.
Before the migration, Reddit managed its Kafka brokers on Amazon EC2 instances using a combination of Terraform, Puppet, and custom scripts. Operators handled upgrades, configuration changes, and machine replacements by running commands directly from their laptops. This worked fine up to a point. However, as the fleet grew, it became increasingly slow, error-prone, and expensive. Reddit needed a more scalable and reliable way to operate Kafka.
Kubernetes, paired with a tool called Strimzi, offered that path.
Kubernetes is an open-source platform for running and managing containerized applications. Instead of manually provisioning and maintaining individual servers, Kubernetes lets developers describe what should be running and handles deployment, scaling, and recovery automatically. Strimzi, on the other hand, is a project under the Cloud Native Computing Foundation that specifically lets you run Kafka on Kubernetes. It provides a declarative way to manage Kafka clusters. This means that developers can describe what they want in a configuration file, and Strimzi handles deployment, upgrades, and maintenance. This promised fewer manual interventions and more predictable operations.
Reddit did not jump straight into moving brokers. Before writing a single line of migration code, Reddit identified four hard constraints that ruled out entire categories of approaches. The constraints are as follows:
Kafka had to stay up. There was no acceptable maintenance window. Downtime, data loss, or forcing client applications to change their configuration was not an option. This ruled out scheduled cutovers, dual-write strategies, and replay-based migrations.
Kafka’s metadata could not be rebuilt from scratch. Apache Kafka maintains a detailed internal state called metadata. This includes information about which brokers exist, which broker holds which data, and where replicas of that data are stored. ZooKeeper, an external service, was responsible for managing this metadata. There is no supported way to recreate this metadata on a fresh cluster while keeping the system available. New brokers had to join the existing cluster rather than replace it.
Client connectivity was tightly coupled to specific brokers. Over time, applications across Reddit had been configured to connect directly to specific broker hostnames, typically the first few brokers in a cluster, rather than using a single load-balanced endpoint. Turning off those brokers would immediately break hundreds of services. Reddit did not control the layer through which clients found and connected to Kafka.
Every step had to be reversible. No single action during the migration could leave the system in a state from which recovery was impossible. This meant Reddit had to accept a long period where EC2 brokers and Kubernetes brokers ran side by side, and it meant that riskier changes had to wait until everything else was stable.
The first phase of the migration did not touch Kafka at all.
Reddit introduced a DNS facade, which is a set of DNS records that act as an intermediate layer between client applications and the actual Kafka brokers. DNS is the system that translates human-readable names into the addresses of servers. By creating new, infrastructure-controlled DNS names that initially pointed to the same EC2 brokers, Reddit changed nothing from the perspective of client applications.
Reddit then rolled out these new connection strings across more than 250 services using automated tooling that generated batch pull requests to update configuration files. Once all clients were talking through this DNS layer, Reddit could change where those names pointed, from EC2 to Kubernetes, without modifying any client code.
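The mechanics of the facade reduce to a level of indirection that operators, not clients, control. A minimal sketch with illustrative hostnames:

```python
# Facade records: stable names clients resolve, pointing at real brokers.
facade = {
    "kafka-1.facade.internal": "ec2-broker-1.internal",
    "kafka-2.facade.internal": "ec2-broker-2.internal",
}

def resolve(name: str) -> str:
    """Stand-in for a DNS lookup against the facade records."""
    return facade[name]

# Baked into 250+ services' configuration -- never changes again:
client_bootstrap = "kafka-1.facade.internal"

# Cutover: operators repoint the facade; clients are untouched.
facade["kafka-1.facade.internal"] = "k8s-broker-1.cluster.local"
target = resolve(client_bootstrap)
```

Clients keep resolving the same name before, during, and after the cutover, which is exactly what made the migration invisible to them.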
Each Kafka broker is identified by a unique numeric ID. Strimzi assigns broker IDs starting at 0 by default. However, Reddit’s existing EC2 brokers already occupied those low numbers.
To free up that ID space, Reddit doubled the cluster size by adding new EC2 brokers with higher IDs, then terminated the original low-numbered brokers. This shifted all data onto the higher-numbered brokers and opened up IDs 0, 1, 2, and so on for Strimzi-managed brokers to use.
See the diagram below:
This was the most technically complex phase.
Reddit needed Strimzi brokers running on Kubernetes to join the same cluster as the existing EC2 brokers and communicate with them directly. Strimzi does not support this out of the box, so Reddit created a fork of the Strimzi operator. The changes Reddit made were deliberately small and targeted:
The inter-broker listener configuration was set to use plaintext listeners accessible from both EC2 and Kubernetes, ensuring brokers in different environments could talk to each other.
The ZooKeeper connection was pointed at Reddit’s existing EC2-hosted ZooKeeper, so that both old and new brokers shared the same metadata store and were part of the same logical cluster.
The Cruise Control topic was overridden to stay consistent across both broker sets, allowing Reddit to use Cruise Control to move data between EC2 and Kubernetes brokers. Cruise Control is a Kafka tool that automates the process of rebalancing data across brokers in a controlled, measured way. It was central to the actual movement of data during the migration.
Running a forked operator in production carries risk. Reddit kept the scope of changes narrow and planned from the start to switch back to the standard Strimzi operator once the migration was complete.
With both sets of brokers running inside the same cluster, Reddit used Cruise Control to incrementally move partition leadership and replicated data from EC2 brokers to the Kubernetes brokers.
Partition leadership determines which broker is responsible for serving reads and writes for a given piece of data. Kafka stores copies of each partition on multiple brokers for redundancy. This is called the replication factor. Moving data meant reassigning both the leadership and the replicas to the new set of brokers, one partition at a time.
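The reassignment step can be sketched as updating a partition's replica set and leader, one partition at a time. Broker IDs and the topic name are illustrative; the real work was driven by Cruise Control, not hand-written code like this:

```python
# Current assignment: partition -> replica broker IDs and current leader.
assignment = {
    "events-0": {"replicas": [101, 102, 103], "leader": 101},  # EC2 IDs
}

def reassign(partition: str, new_replicas: list, new_leader: int):
    """Move a partition's replicas, then elect a leader among them."""
    assert new_leader in new_replicas, "leader must hold a replica"
    assignment[partition] = {"replicas": new_replicas, "leader": new_leader}

# Move replicas onto Strimzi-managed brokers (IDs 0-2), then lead there.
reassign("events-0", [0, 1, 2], 0)
```

Doing this incrementally, partition by partition, is what let Reddit pause or reverse the process at any point.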
Reddit monitored this process continuously: partition leadership on EC2 declined steadily over roughly a week while leadership on Strimzi climbed in parallel. Network traffic followed the same pattern. At every point, Reddit could pause or reverse the process if something looked wrong.
See the dashboard view below:

ZooKeeper had managed Kafka’s metadata throughout the entire broker migration. Reddit made a deliberate choice not to change the control plane until after the data plane was fully stable on Kubernetes. This separation of concerns reduced the risk of compounding failures.
Once all EC2 brokers were terminated and all data and traffic were running on Kubernetes, Reddit executed the migration from ZooKeeper to KRaft. KRaft is Kafka’s built-in metadata management system that eliminates the need for ZooKeeper.
See the diagram below:
Since Strimzi and Kafka both provide documented steps for this migration, and because the rest of the system had already settled, this final phase was comparatively straightforward.
After both the data plane and the control plane were fully running on Kubernetes, Reddit removed all the configuration overrides that the forked Strimzi operator had introduced.
Control of the clusters was handed off to the standard, unmodified Strimzi operator. The EC2 infrastructure was decommissioned.
Reddit’s migration is a good example of how large-scale infrastructure changes do not have to be dramatic, high-risk events. By breaking the work into small, reversible, well-understood steps and by respecting the constraints the system imposed, Reddit moved a petabyte-scale platform to Kubernetes without a single moment of downtime.
Some key lessons from Reddit’s migration journey were as follows:
Introducing a controllable abstraction layer between clients and infrastructure, whether that is DNS, a proxy, or an API gateway, is one of the highest-leverage changes you can make during a migration. It decouples the two sides and lets you change the infrastructure without forcing every team to update their code.
Metadata and logical state tend to outlive the physical machines they run on. When planning any large migration, treat the logical state as the thing you are protecting, and treat the infrastructure as something you are replacing around it.
Designing each step to be undoable is not just a safety measure. It changes how confidently and quickly you can move forward, because you know you can always step back if something goes wrong.
A migration that looks messy in the middle but never breaks production is far preferable to a clean design that requires a moment where things could go wrong with no recovery path.

2026-03-16 23:31:14
npx workos launches an AI agent, powered by Claude, that reads your project, detects your framework, and writes a complete auth integration directly into your existing codebase. It’s not a template generator. It reads your code, understands your stack, and writes an integration that fits.
The WorkOS agent then typechecks and builds, feeding any errors back to itself to fix.
Every week, Stripe merges over 1,300 pull requests that contain zero human-written code. Not a single line. These PRs are produced by “Minions,” Stripe’s internal coding agents, which work completely unattended.
An engineer sends a message in Slack, walks away, and comes back to a finished pull request that has already passed automated tests and is ready for human review. The productivity gain is compelling.
Here’s what it looks like:

Consider a Stripe engineer who is on-call when five small issues pile up overnight. Instead of working through them sequentially, they open Slack and fire off five messages, each tagging the Minions bot with a description of the fix. Then, they go to get coffee. By the time they come back, five agents have each spun up an isolated cloud machine in under ten seconds, read the relevant documentation, written code, run linters, pushed to CI, and prepared pull requests. The developer reviews them, approves three, sends feedback on one, and discards the last. In other words, five issues were handled in the time it would have taken to fix two manually.
However, the primary reason the Minions work has almost nothing to do with the AI model powering them. It has everything to do with the infrastructure that Stripe built for human engineers, years before LLMs existed. In this article, we will look at how Stripe managed to reach this level.
Disclaimer: This post is based on publicly shared details from the Stripe Engineering Team. Please comment if you notice any inaccuracies.
The AI coding tools you’ve probably encountered fall into a category called attended agents. Tools like Cursor and Claude Code work alongside you. Developers watch them, steer them when they drift, and approve each step.
See the diagram below that shows the typical view of an AI Agent:
Stripe’s engineers use these tools too. However, Minions are what’s known as unattended agents. No one is watching or steering them. The agent receives a task, works through it alone, and delivers a finished result. This distinction changes the design requirements for everything downstream.
Stripe’s codebase makes this harder than it sounds. It consists of hundreds of millions of lines of code, mostly written in Ruby with Sorbet typing, a relatively uncommon stack. The code is full of homegrown libraries that LLMs have never encountered in training data, and it moves well over $1 trillion per year in payment volume through production. The stakes are as extreme as the complexity.
Building a prototype from scratch is fundamentally different from contributing code to a codebase of this scale and maturity. So Stripe built Minions specifically for unattended work, and let third-party tools handle attended coding.
AI coding tools are fast, capable, and completely context-blind. Even with rules, skills, and MCP connections, they generate code that misses your conventions, ignores past decisions, and breaks patterns. You end up paying for that gap in rework and tokens.
Unblocked changes the economics.
It builds organizational context from your code, PR history, conversations, docs, and runtime signals. It maps relationships across systems, reconciles conflicting information, respects permissions, and surfaces what matters for the task at hand. Instead of guessing, agents operate with the same understanding as experienced engineers.
You can:
Generate plans, code, and reviews that reflect how your system actually works
Reduce costly retrieval loops and tool calls by providing better context up front
Spend less time correcting outputs for code that should have been right in the first place
Once Stripe decided to build custom, the first problem was where to actually run these agents.
An unattended agent needs three properties from its environment:
It needs isolation, so mistakes can’t touch production.
It needs parallelism, so multiple agents can work simultaneously on separate tasks.
And it needs predictability, so every agent starts from a clean, consistent state.
Stripe already had all three. Their “devboxes” are cloud machines pre-loaded with the entire codebase, tools, and services. They spin up in ten seconds because Stripe proactively provisions and warms a pool of them, cloning repositories, warming caches, and starting background services ahead of time. Engineers already used one devbox per task, and a single engineer might have half a dozen running at once. Agents slot into this same pattern.
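The warm-pool idea behind those ten-second startups can be sketched as follows. The expensive provisioning work happens ahead of demand, so claiming a devbox is just a queue pop; the class and method names are illustrative, not Stripe's actual API:

```python
# Hedged sketch of a warm devbox pool: machines are prepared before they
# are needed, so "claim" is near-instant rather than a cold boot.
import queue

class DevboxPool:
    def __init__(self, warm_target: int):
        self.warm_target = warm_target
        self._ready = queue.Queue()
        self._replenish()

    def _provision(self) -> dict:
        # In reality: clone repositories, warm caches, start background services.
        return {"repo": "cloned", "caches": "warm", "services": "running"}

    def _replenish(self):
        # Keep the pool topped up to the warm target.
        while self._ready.qsize() < self.warm_target:
            self._ready.put(self._provision())

    def claim(self) -> dict:
        box = self._ready.get_nowait()  # near-instant: work was done upfront
        self._replenish()               # re-warm for the next task
        return box

pool = DevboxPool(warm_target=3)
boxes = [pool.claim() for _ in range(5)]  # one isolated box per task, in parallel
```

The same pattern serves a human with half a dozen devboxes or five agents fired off from Slack: each task gets its own disposable, pre-warmed machine.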
Since devboxes run in a QA environment, they are already isolated from production data, real user information, and arbitrary network access. That means agents can run with full permissions and no confirmation prompts. The blast radius of any mistake is contained to one disposable machine.
The important thing to understand is that Stripe didn’t build this for agents. They built it for humans. Parallelism, predictability, and isolation were desirable properties for engineers long before LLMs entered the picture. In other words, what’s good for humans is good for agents as well.
A good environment gives the agent a place to work. But it doesn’t tell the agent how to work.
There are two common ways to orchestrate an LLM system:
A workflow is a fixed graph of steps where each step does one narrow thing, and the sequence is predetermined.
An agent is a loop where the LLM decides what to do next based on the results of its previous actions.
Workflows are predictable but rigid. Agents are flexible but unreliable.
Stripe built something in between that they call “blueprints.” A blueprint is a sequence of nodes where some nodes run deterministic code, and other nodes run an agentic loop. Think of it as a structure that alternates between rigid steps and creative steps. For example, the “implement the feature” step or “fix CI failures” step gets the full agentic loop with tools and freedom. On the other hand, the “run linters” step is hardcoded. The “push the branch” step is hardcoded.
This separation matters because some tasks should never be left to the agent’s judgment. You always want linters to run. You always want the branch pushed in a specific way that follows the company’s PR template. Making these deterministic saves tokens, reduces errors, and guarantees that critical steps happen every single time. Across hundreds of runs per day, each deterministic node is one less thing that can go wrong, and that compounds into big reliability gains.
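The blueprint structure described above can be sketched as a linear sequence of nodes, some deterministic and some agentic. The node names mirror the examples in the text; the agentic step is a stub standing in for a real LLM-driven loop, and none of this is Stripe's actual implementation:

```python
# Minimal sketch of a "blueprint": a fixed sequence of nodes where some run
# hardcoded deterministic code and others run an agentic loop.

def deterministic(fn):
    fn.agentic = False
    return fn

def agentic(fn):
    fn.agentic = True
    return fn

@agentic
def implement_feature(state):
    # Real version: an LLM loop with tools, deciding its own next action.
    state["code"] = "feature implemented"
    return state

@deterministic
def run_linters(state):
    # Always runs, always the same way -- never left to agent judgment.
    state["lint"] = "passed"
    return state

@deterministic
def push_branch(state):
    # Hardcoded to follow the company's PR template exactly.
    state["pushed"] = True
    return state

BLUEPRINT = [implement_feature, run_linters, push_branch]

def run(blueprint, state):
    for node in blueprint:
        state = node(state)
    return state

result = run(BLUEPRINT, {})
```

The key design choice is that the sequence itself is fixed: the agent gets freedom inside a node, never over which nodes run.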
Blueprints tell the agent how to work. But the agent still needs to know what it’s working with. In a codebase of hundreds of millions of lines, getting the right information into the agent’s limited context window is an engineering challenge.
LLMs can only hold so much text at once. If you try to load every coding rule and convention globally, the agent’s context fills up before it even starts working. Stripe uses global rules “very judiciously” for exactly this reason. Instead, they scope rules to specific subdirectories and file patterns. As the agent moves through the filesystem, it automatically picks up only the rules relevant to where it’s working. These are the same rule files that human-directed tools like Cursor and Claude Code read, so there is no duplication and no agent-specific overhead.
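Path-scoped rule loading can be sketched like this: collect only the rule files on the path from the repository root down to the file being edited, rather than loading everything globally. The directory names and rules are invented for illustration:

```python
# Hedged sketch of path-scoped rules: the agent picks up only the rules
# relevant to where in the filesystem it is working.
from pathlib import PurePosixPath

# Map of directory -> rules that apply beneath it (illustrative content).
SCOPED_RULES = {
    ".": ["use the standard logger"],
    "payments": ["all money amounts are integer cents"],
    "payments/refunds": ["refunds must be idempotent"],
}

def rules_for(file_path: str) -> list:
    """Collect rules from the repo root down to the file's directory."""
    rules = list(SCOPED_RULES.get(".", []))
    parts = PurePosixPath(file_path).parent.parts
    for i in range(1, len(parts) + 1):
        rules += SCOPED_RULES.get("/".join(parts[:i]), [])
    return rules

# Editing a refunds file pulls in three rules; an unrelated file gets one.
refund_rules = rules_for("payments/refunds/core.rb")
doc_rules = rules_for("docs/readme.md")
```

Because the same rule files serve Cursor, Claude Code, and the Minions, the context budget stays small without maintaining a parallel agent-only rule set.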
For information that doesn’t live in the filesystem, Stripe built a centralized internal server called Toolshed. It hosts nearly 500 tools using MCP, which stands for Model Context Protocol and is essentially an industry standard that gives agents a uniform way to call external services. Through MCP, agents can fetch internal documentation, ticket details, build statuses, code search results, and more.
But more tools aren’t better. Agents perform best with a carefully curated subset relevant to their task. Stripe gives Minions a small default set and lets engineers add more when needed.
The agent now has an environment, a structure, and the right context. However, the code still had to be correct, which meant more feedback loops.
Stripe’s feedback architecture works in layers:
First, local linting runs on every push in under five seconds. A background daemon precomputes which lint rules apply and caches the results, so this step is nearly instantaneous.
Second, CI selectively runs tests from Stripe’s battery of over three million tests, and autofixes are applied automatically for known failure patterns.
Third, if failures remain without an autofix, the agent gets one more chance to fix and push again.
Then it stops. At most two rounds of CI. If the code doesn’t pass after the second push, the branch goes back to the human engineer.
This cap is intentional. LLMs show diminishing returns when retrying the same problem repeatedly. More rounds cost more tokens and compute without proportional improvement. Knowing when to stop is as important as knowing how to start.
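The capped loop described in the layers above can be sketched as follows. Every function here is an illustrative stand-in (a real system would call lint daemons, CI, and the agent), but the control flow, including the hard two-round cap, matches the description:

```python
# Sketch of the capped feedback loop: local lint on every push, up to two
# CI rounds with one agent-fix pass in between, then a hard stop that hands
# the branch back to a human.

MAX_CI_ROUNDS = 2

def run_feedback_loop(push, run_lint, run_ci, agent_fix):
    push()
    for round_num in range(1, MAX_CI_ROUNDS + 1):
        run_lint()              # sub-5-second local lint on every push
        if run_ci():            # selective tests + known-pattern autofixes
            return "merged-candidate"
        if round_num < MAX_CI_ROUNDS:
            agent_fix()         # one more chance to fix...
            push()              # ...and push again
    return "handed-to-human"    # diminishing returns: stop, don't loop

# Simulated run where CI fails on the first round and passes on the second.
ci_outcomes = iter([False, True])
result = run_feedback_loop(
    push=lambda: None,
    run_lint=lambda: None,
    run_ci=lambda: next(ci_outcomes),
    agent_fix=lambda: None,
)
```

The fixed upper bound is the point: cost and latency are predictable per run, and a stuck agent fails fast instead of burning tokens.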
When a Minion run doesn’t fully succeed, it’s still often a useful starting point. A partially correct PR that an engineer can polish in twenty minutes is still a significant win. The workflow is designed for this reality rather than assuming every run will be perfect.
Four layers make Stripe’s Minions work:
Isolated environments that give agents safe, parallel workspaces.
Hybrid orchestration that mixes deterministic guardrails with agentic flexibility.
Curated context that feeds agents the right information without overwhelming them.
And fast feedback loops with hard limits on iteration.
Each layer is necessary, and none alone is sufficient.
The primary insight in Stripe’s approach is that years of investment in developer productivity can pay unexpected dividends once agents join the workflow. Human review didn’t disappear; it shifted. Engineers moved from writing code to reviewing it.
A key lesson for anyone deploying coding agents: don’t start with model selection. Start with your developer environment, your test infrastructure, and your feedback loops. If those are solid, agents will benefit from them. If they’re not, no model will save you. Stripe’s experience suggests the answer lies less in AI breakthroughs and more in the engineering fundamentals that were always supposed to matter.