2026-04-29 23:31:30
Most teams optimize models. Few optimize inference. We benchmarked NVIDIA RTX PRO 6000 Blackwell on Akamai Cloud against H100 using real LLM workloads.
At 100 concurrent requests, Blackwell reached 24,240 tokens/sec per server, compared to 1,863 TPS on H100. That’s up to 1.63× higher throughput, with additional gains from FP4 precision.
The difference comes down to architecture. These GPUs run on a globally distributed platform built for real-time, latency-sensitive inference, not centralized batch jobs.
If you're building agentic systems or high-concurrency AI apps, infrastructure choices matter as much as model selection. See the full setup, methodology, and benchmark results.
In 2024, Wise’s deployment system automatically blocked hundreds of releases that would have caused production incidents.
There was no human intervention. The system routed just 5% of traffic to the new version, watched technical and business metrics for 30 minutes, and rolled back when it detected anomalies. Three years earlier, Wise deployed with a simpler in-house tool that treated each release as a basic transaction: push the code and hope for the best.
This leap was made possible by a series of engineering decisions that we will explore in this article.
For reference, Wise moves about £36 billion across borders every quarter, with 65% of transfers arriving instantly. One might assume that kind of reliability requires a tightly controlled, top-down engineering organization. However, the opposite is true.
Wise has 850+ engineers organized into autonomous squads, each empowered to make their own technical decisions. The reason this works, and the reason it would collapse without a very specific set of infrastructure investments, is the real engineering story behind Wise.
Behind the product that 15.6 million active customers interact with, there are over 1000 microservices, 700+ Java repositories, 40 web applications, and native iOS and Android apps with hundreds of modules each. What holds all of this together is an internal platform, a set of shared tools, frameworks, and automated systems that make the right engineering choice the easy one.
Disclaimer: This post is based on publicly shared details from the Wise Engineering Team. Please comment if you notice any inaccuracies.
When a system has 1000+ services owned by dozens of independent teams, the most dangerous form of complexity is inconsistency. If every team wires up security, database connections, Kafka consumers, and logging differently, you end up with 1000 slightly different systems that are all hard to debug, upgrade, and secure.
Wise’s answer to this is a microservice chassis framework, an opinionated, pre-configured foundation that every new backend service can start from. The chassis handles security, observability, database communication, Kafka integration, and more, all with recommended defaults so that teams can focus on business logic rather than plumbing.
What makes Wise’s approach distinct is that the chassis is shipped as a versioned artifact rather than a template you fork and modify.
The difference matters. With a template, the service diverges from the standard the moment you create it. With an artifact dependency, updates to the chassis flow downstream when teams bump the version. Security patches, observability improvements, and new defaults reach services through a regular dependency upgrade rather than a manual migration.
This approach also extends to the build pipeline.
Wise built a collection of in-house Gradle plugins, including one that standardizes GitHub Actions workflows. When Wise decided to roll out SLSA (a framework for protecting software supply-chain integrity) across the organization, it became a plugin version update across 700+ Java repositories rather than 700 individual pull requests.
On top of this, a language-agnostic automation service can make complex changes across the codebase at scale and create pull requests for the owning team to review. Dependency upgrades for Java services are now fully automated through this system.
The same standardization mindset shows up on the frontend.
Wise’s web applications are built on CRAB, a Wise-specific abstraction on top of Next.js, split across 40 distinct apps that handle specific product functions. Visual regression testing is handled by Storybook paired with Chromatic, which captures snapshots of React components after each change and highlights visual differences to catch UI bugs before they reach customers.
Standardizing how services are built is only half the problem. The other half is standardizing how they reach production.
Since 2018, Wise has relied on Kubernetes to host its services, originally built with Terraform, JSONNET, and ConcourseCI. That setup supported service-mesh controls through Envoy, PCI-DSS compliance, and frictionless deployments for several years.
But as Wise grew, the original approach could not scale further without becoming a maintenance burden. This led to the Compute Runtime Platform (CRP), a ground-up rebuild of Wise’s Kubernetes infrastructure. Terraform still provisions infrastructure, but the codebase was rewritten from scratch for flexibility. RKE2 now handles cluster bootstrapping, with Rancher managing overall cluster state. Helm replaced JSONNET for better maintainability and upstream compatibility. ArgoCD with custom plugins ensures fully automated provisioning and consistency across environments. The result is that Wise grew from 6 Kubernetes clusters to more than 20 while keeping maintenance manageable.

CRP also brought efficiency improvements:
Automated container CPU rightsizing through Vertical Pod Autoscaler is now live in non-production and rolling out to production for non-critical workloads.
Horizontal scaling through KEDA optimizes workloads based on daily and weekly traffic patterns.
Fully managed sidecar containers like the Envoy proxy simplify deployments for product teams, and Wise’s Envoy-powered service proxy now includes seamless integration and discovery between services.
All of this feeds into Wise’s broader Mission Zero cost optimization goals.
Wise’s deployment strategy has undergone a big shift with the transition from Octopus, their former in-house tool, to Spinnaker. This was more than a tool swap.
Octopus treats deployments as simple transactions, while Spinnaker treats them as orchestrated sequences of events with built-in canary analysis, metric validation, and automatic rollback.
See the diagram below:
The canary process is simple in concept but powerful in practice.
When a new version of a service is deployed, only 5% of traffic routes to it. Over a 30-minute window, the system analyzes both technical metrics (error rates, latency) and business metrics (transaction success rates, conversion). If it detects significant anomalies, it rolls back automatically. In 2024, this system prevented hundreds of potentially incident-causing deployments and saved thousands of engineering hours. Over half of Wise’s services already run on Spinnaker, with full migration expected by mid-2025.
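The gating logic behind such a canary can be sketched in a few lines. This is a simplified illustration, not Wise's actual Spinnaker configuration; the metric names and thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class MetricWindow:
    error_rate: float        # fraction of failed requests
    p99_latency_ms: float    # 99th-percentile latency
    txn_success_rate: float  # business metric: completed transactions

def should_rollback(baseline: MetricWindow, canary: MetricWindow,
                    max_error_delta: float = 0.01,
                    max_latency_ratio: float = 1.5,
                    min_success_ratio: float = 0.98) -> bool:
    """Compare the 5% canary against the stable baseline over the
    observation window and decide whether to roll back automatically."""
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return True  # technical regression: error rate spiked
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return True  # technical regression: latency blew up
    if canary.txn_success_rate < baseline.txn_success_rate * min_success_ratio:
        return True  # business regression: transactions failing
    return False

baseline = MetricWindow(error_rate=0.002, p99_latency_ms=120, txn_success_rate=0.995)
healthy = MetricWindow(error_rate=0.003, p99_latency_ms=130, txn_success_rate=0.994)
broken = MetricWindow(error_rate=0.050, p99_latency_ms=400, txn_success_rate=0.90)

print(should_rollback(baseline, healthy))  # False: promote the canary
print(should_rollback(baseline, broken))   # True: automatic rollback
```

The key design point is that business metrics sit alongside technical ones: a release that serves requests quickly but quietly breaks transfers still gets rolled back.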
On the CI side, migrating from CircleCI to GitHub Actions opened new optimization possibilities. By tracking detailed build metrics, Wise discovered that pre-populating caches for frequently used containers could slash build times by 15%. At their scale of 500,000 monthly builds, that translates to over 1,000 hours saved each month. Wise has also been methodically implementing the SLSA framework across build processes, strengthening supply-chain security one language at a time.
Mobile teams have seen similar gains.
iOS engineers migrated 250+ Xcode modules from Xcodegen to Tuist and switched from Cocoapods to Swift Package Manager, dropping zero-change build times from 28 seconds to 2 seconds.
On Android, the team manages over 300 Gradle modules and has fully moved to Jetpack Compose for UI, adopted Kotlin 2.0 and 2.1, and is exploring Kotlin Multiplatform for cross-platform code sharing.
Backend-for-frontend services (BFFs, which are lightweight backends tailored to a specific frontend’s needs) help share logic between Android, iOS, and web teams.
Fast, safe deployments matter even more at Wise than at most companies, because those services move real money through real banking systems around the world.
Wise connects directly to local payment schemes rather than routing through intermediary banks. They went live with InstaPay in the Philippines, were granted access to Zengin (Japan’s instant payments system), and received access to PIX in Brazil. Each of these integrations has different technical requirements, and some demand a physical data center presence in the country.
This creates an infrastructure consistency challenge.
Wise centralizes networking using AWS Transit Gateways, but the details of each integration vary substantially across their UK, Hungary, and Australia data center deployments. The Australian deployment is particularly interesting because it was one of the first deployments of AWS Outpost Servers, which allowed Wise to maintain consistent AWS tooling even inside a physical data center.
The goal is always to keep the infrastructure as uniform as possible so that teams working on payment integrations can focus on the business logic rather than wrestling with environment differences.
Wise also exposes this infrastructure through a public API, allowing banks, financial institutions, and enterprises to integrate cross-border payment services directly. For reference, the Wise Platform supports over 40 currencies and multiple payment routes, with OAuth authentication and built-in compliance features.
All of this money movement generates enormous volumes of data. Wise’s data architecture is built as a pipeline from movement to insight to action, with each layer feeding the next.
Kafka underpins most of Wise’s real-time data movement, handling asynchronous messaging between services, log collection, and streaming updates for analytics. Their Kafka clusters have grown significantly, with features like rack-aware standby replicas for fault tolerance.
An in-house data movement service funnels information from Kafka and databases into destinations like Snowflake, S3 Parquet, and Apache Iceberg, with automated checks in the configuration process to reduce human error.
A separate Data Archives service now handles over 100 billion records across multiple databases, reducing storage costs and making backups faster.
Wise has migrated most MariaDB and Postgres workloads from self-hosted EC2 instances to Amazon RDS, and is moving from self-hosted MongoDB to MongoDB Atlas. Redis continues to handle in-memory workloads.

For analytics, Wise is building a data lake on Amazon S3 using Apache Iceberg, which allows modifying table structures without rewriting all the data. Trino serves as the federated query engine, letting teams query Iceberg tables, Snowflake, or Kafka streams from one place. A Trino gateway handles workload separation and fault-tolerant queries, while Airflow and dbt-core manage complex data workflows.
Wise has built a comprehensive inventory system and governance portal that tracks where data is stored, who created it, and how it is classified.
Automated data discovery feeds into data deletion, compliance, and audit initiatives. For a regulated financial services company, this is load-bearing infrastructure, and as more engineers join the governance effort, Wise is rolling out stricter policies, enhanced privacy checks, and automated data lifecycle management.
Data scientists work in Amazon SageMaker Studio, with large-scale processing on Spark in EMR and orchestration through Airflow.
SageMaker Feature Store keeps hundreds of features in sync for training and inference, while MLflow tracks experiments, metrics, and model versions. When a model is ready for production, it is deployed through an in-house prediction service built on Ray Serve.

Applications span fraud detection, KYC verification, and customer onboarding, where every millisecond of inference time matters. Automated checks catch data drift and feature inconsistencies before they become serious issues.
Wise has built a secure gateway connecting to multiple LLM providers, including Anthropic (Claude), AWS Bedrock, Google Gemini, and OpenAI. This lets teams experiment with different models without managing separate credentials or compliance checks.
A Python library inspired by LangChain wraps these APIs to speed up prototyping. For cases requiring internal context, a custom Retrieval-Augmented Generation (RAG) service pulls the latest information from various data stores before generating responses, which is useful for summarizing complex documents and automating parts of customer service.
Building all of this is one challenge. Knowing whether it is working is another.
Wise has consolidated its observability stack onto the LGTM stack from Grafana. Loki handles logs, Grafana provides dashboards, Tempo handles traces, and Mimir handles metrics. The migration from Thanos to Mimir was driven by scalability needs, and the metrics stack now ingests roughly 6 million metric samples per second and processes 150 million active series for its largest tenant.

The real value of unification is correlation. When a service fails at 3 AM, developers want to see the error log, the trace showing which downstream call failed, and the metric spike that triggered the alert, all in one place and linked together. Running separate tools for each of those signals means context-switching during an incident, which costs time when time matters most.
Wise has also implemented dedicated observability clusters within its Compute Runtime Platform, separating monitoring infrastructure from production workloads so that a spike in monitoring load cannot affect the services being monitored. They are pilot testing Grafana Pyroscope for profiling select services, and have adopted Temporal as a workflow engine to automate database switchovers and recovery tests, keeping downtime minimal while staying compliant with strict resilience regulations.
Cost optimization is an ongoing task. Wise has invested in reducing operational costs and improving resource utilization across the observability stack, tying these efforts to their broader Mission Zero initiative.
A pattern runs through every layer of Wise’s tech stack.
At the frontend, the chassis and CRAB framework standardize how applications are built.
At the deployment layer, Spinnaker and automated canary analysis standardize how code reaches production.
At the infrastructure layer, CRP and Kubernetes absorb the complexity of hosting services.
At the data layer, shared pipelines and governance tools standardize how information flows.
At the observability layer, the LGTM stack standardizes how teams understand what is happening.
The common thread is that Wise treats its internal infrastructure as a product. The customers are the 850+ engineers.
The features are faster builds, safer deployments, easier service creation, and unified monitoring. And just like any product, the platform keeps evolving. CircleCI gives way to GitHub Actions, Octopus gives way to Spinnaker, Thanos gives way to Mimir, because the needs of those internal customers keep changing.
The trade-off is that this level of platform investment requires dedicated teams, carries significant migration cost, and only starts to pay off past a certain organizational size.
2026-04-28 11:03:50
Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you access to an exact framework to evaluate AI search and retrieval.
What you’ll get:
A four-phase framework for evaluating AI search
How to build a golden set of queries that predicts real-world performance
Metrics and code for measuring accuracy
Go from “looks good” to proven quality.
Every time you buy something online from a Stripe-powered business, a machine learning model evaluates over 1,000 signals about your transaction and decides in under 100 milliseconds whether to let it through.
Across billions of legitimate payments, it reaches the correct verdict 99.9% of the time. The system that delivers those numbers, however, looks entirely different from what Stripe originally built.
The architecture has been overhauled multiple times, and one of the most important upgrades required removing a component the team knew was actively improving accuracy, because keeping it was holding back everything else the team wanted to do.
For reference, online payment fraud occurs in roughly 1 out of every 1,000 transactions. That rarity makes fraud detection a difficult machine learning problem because the system has to surface a small number of fraudulent payments from a massive volume of legitimate ones, and it has to do this quickly and cheaply on every single transaction.
In this article, we will look at how Stripe’s Radar does this effectively and the architectural decisions the team took while building it.
Disclaimer: This post is based on publicly shared details from the Stripe Engineering Team. Please comment if you notice any inaccuracies.
Stripe began with relatively simple ML models like logistic regression (a statistical method that predicts the probability of an outcome based on input variables). Over time, as the Stripe network grew and ML technology advanced, they moved to more complex architectures. Each jump produced an equivalent leap in model performance.
The architecture preceding the current one was called Wide & Deep. It combined two models into an ensemble.
The “wide” component was XGBoost, a gradient-boosted decision tree that works by combining many small decision trees into one powerful predictor. XGBoost excelled at memorization, meaning it was strong at recognizing specific patterns and feature correlations it had encountered in training data.
The “deep” component was a deep neural network (DNN) that excelled at generalization, meaning it could learn abstract concepts like “unusual payment velocity on a card” and apply them to entirely new situations it had never seen before.
Together, the two components worked well. But XGBoost was creating operational bottlenecks. It was hard to parallelize, which made retraining the combined model slow. It was incompatible with advanced ML techniques Stripe wanted to adopt, such as transfer learning (using knowledge gained from one task to improve performance on a different but related task) and embeddings. It was also limiting how quickly the many engineers working on Radar each day could experiment with new ideas.
Simply dropping XGBoost would have caused a 1.5% drop in recall, meaning 1.5% more fraud would go undetected. That was an unacceptably large regression in performance. The value XGBoost provided was real and measurable, so it had to be replicated within a new architecture rather than just discarded.
Stripe’s solution drew inspiration from a research architecture called ResNeXt.
The core idea, sometimes called “Network-in-Neuron,” splits computation into multiple distinct branches, where each branch functions as a small neural network on its own. The outputs from all branches are summed to produce a final result. This multi-branch approach enriches feature representation along a new dimension, and it achieves this more effectively than the brute-force approach of simply making a DNN wider or deeper, which risks overfitting (the model memorizing random noise rather than learning real patterns).
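The multi-branch idea can be demonstrated in a toy NumPy sketch. This is an illustration of the general ResNeXt-style pattern, not Stripe's Shield NeXt implementation; the dimensions, branch count, and initialization are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class Branch:
    """One small 'network-in-neuron': input -> bottleneck -> output."""
    def __init__(self, d_in, d_hidden, d_out):
        self.w1 = rng.normal(0, 0.1, (d_in, d_hidden))
        self.w2 = rng.normal(0, 0.1, (d_hidden, d_out))

    def forward(self, x):
        return relu(x @ self.w1) @ self.w2

def multi_branch_block(x, branches):
    # The defining move: run every branch independently,
    # then SUM the outputs into one representation.
    return sum(b.forward(x) for b in branches)

x = rng.normal(size=(4, 64))                       # batch of 4 feature vectors
branches = [Branch(64, 8, 64) for _ in range(32)]  # cardinality = 32
out = multi_branch_block(x, branches)
print(out.shape)  # (4, 64)
```

Instead of one wide 64→256→64 layer, the block uses 32 narrow 64→8→64 branches. The branch count ("cardinality") becomes a new tuning dimension, which the ResNeXt paper found more effective than simply going wider or deeper.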
The resulting architecture, internally called Shield NeXt, reduced training time by over 85%, bringing it to under two hours. Experiments that previously required overnight jobs could now run multiple times in a single working day. Stripe is now exploring techniques that this architectural shift made possible, including multi-task learning, where a single model is trained to handle several related objectives simultaneously.
Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.
More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.
Join us live (FREE) on May 6 to see:
Where teams get stuck on the AI maturity curve and why common fixes fall short
How a context engine solves for quality, efficiency, and cost
Live demo: the same coding task with and without a context engine
The model architecture matters, but Radar’s biggest competitive advantage comes from the data flowing through the Stripe network. Stripe has engineered specific mechanisms to convert that scale into model performance.
90% of cards used on the Stripe network have been seen more than once across different merchants. A single business has visibility into only its own transactions. Radar, by contrast, sees patterns across millions of businesses and thousands of partner banks around the world.
There is also a structural advantage in how Radar gets its training labels, the data that tells the model which past transactions were actually fraudulent. Since Radar is built directly into Stripe’s payment flow, it receives these labels automatically when cardholders dispute charges. Most third-party fraud solutions require businesses to build separate data pipelines for sending payment labels back to the fraud provider, or to label payments manually, which is time-consuming and error-prone. Radar sidesteps all of this by ingesting ground truth data straight from the payment flow and card networks.
Stripe uses hundreds of features in its model, and most of them are aggregates computed across the entire network. As the network grows, each feature becomes more informative because the training data better represents the feature’s real-world distribution.
A “feature” in this context is a single signal the model uses to evaluate a transaction. Some are intuitive. For example, does the cardholder’s name match the provided email address? How many different cards have been associated with this IP address? A high count might indicate someone testing stolen cards. Other features are more surprising. The difference between the device’s local time and UTC, or the count of countries where a card has been successfully authorized, both turn out to be meaningful fraud signals.
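Two of the signals described above are simple enough to sketch directly. The function names and the in-memory storage are illustrative assumptions; a production system would read per-entity state from a low-latency store.

```python
from collections import defaultdict
from datetime import datetime, timezone, timedelta

# Signal 1: difference between the device's reported local time and UTC.
def tz_offset_hours(device_local: datetime, utc_now: datetime) -> float:
    return (device_local - utc_now).total_seconds() / 3600

# Signal 2: how many distinct cards an IP address has been seen with.
cards_by_ip: dict[str, set] = defaultdict(set)

def record_payment(ip: str, card_fingerprint: str) -> int:
    cards_by_ip[ip].add(card_fingerprint)
    return len(cards_by_ip[ip])  # the feature value for this transaction

utc = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
local = utc + timedelta(hours=-3)   # device claims it is 3 hours behind UTC
print(tz_offset_hours(local, utc))  # -3.0

record_payment("203.0.113.7", "card_a")
record_payment("203.0.113.7", "card_b")
n = record_payment("203.0.113.7", "card_c")
print(n)  # 3 distinct cards on one IP: a possible card-testing signal
```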
Finding new features is part forensics and part experimentation. Stripe’s team reviews past fraud attacks in detail, building investigation reports that try to reconstruct how fraudsters operate. They look for patterns in throwaway email addresses used to set up multiple accounts quickly. They monitor dark web activity weekly. From this research, they build a prioritized list of candidate features, implement each one rapidly, and prototype them to measure model impact.
Sometimes the most promising ideas yield little. For example, Stripe once built a feature capturing whether a business was currently under a distributed fraud attack. It barely moved the model performance because the model was already learning that pattern implicitly.
One of the more powerful techniques Stripe uses is embeddings, which are learned numerical representations for categorical data. Things like merchant identity, issuing bank, user country, and day of the week have many possible values, and defining useful numerical representations for them is challenging.
Stripe trains its model to learn an embedding for each value, essentially a set of coordinates that position it relative to others based on transaction patterns. Uber and Lyft, for example, would end up with similar embedding coordinates because their transaction patterns resemble each other, while Slack would be positioned very differently.
Embeddings enable geographic transfer of fraud knowledge. If Stripe identifies a new fraud pattern in Brazil, the embeddings allow the system to recognize that same pattern in the US automatically, without retraining. The model essentially learns which merchants and regions behave similarly, then applies fraud knowledge across the entire network.
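The "similar merchants end up close together" intuition can be shown with cosine similarity over toy vectors. The coordinates below are hand-picked purely for illustration; real embeddings are learned during training and have many more dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, chosen to mimic the example:
# two ride-hailing merchants land near each other, a SaaS app far away.
embeddings = {
    "uber":  np.array([0.90, 0.10, 0.80]),
    "lyft":  np.array([0.85, 0.15, 0.75]),
    "slack": np.array([0.10, 0.90, 0.20]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embeddings["uber"], embeddings["lyft"]))   # high: similar merchants
print(cosine(embeddings["uber"], embeddings["slack"]))  # low: different pattern
```

Because fraud knowledge attaches to regions of this space rather than to individual merchant IDs, a pattern learned on one merchant transfers automatically to its near neighbors.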
Stripe also found that scaling up training data continued to yield significant gains. A 10x increase in training transaction data still produced meaningful model improvements, and the team was working on a 100x version. This kind of scaling was only feasible because the DNN-only architecture could train fast enough to handle much larger datasets practically.

Having a great model and great data still leaves a fundamental question unanswered.
How much fraud should you actually block?
Every fraud detection system faces an inherent tension between two types of errors:
A false negative is when fraud slips through undetected, costing the business the product, a chargeback fee, and potential reputational damage with card networks.
A false positive is when a legitimate customer gets blocked, and the business loses the sale, along with potentially the customer forever. A survey found that 33% of consumers said they would stop shopping at a business after a single false decline.
These two errors exist on a curve.
Precision measures the fraction of blocked transactions that are actually fraudulent. Recall measures the fraction of all actual fraud that gets caught. As you raise the blocking threshold, requiring a higher fraud probability before blocking a payment, precision goes up because you become more selective about what you block. But recall goes down because more marginal fraud slips through. Lowering the threshold pushes things in the opposite direction.
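The threshold trade-off is easy to see on a tiny synthetic example (the scores and labels below are made up for illustration):

```python
def precision_recall_at(threshold, scored):
    """scored: list of (fraud_probability, is_fraud) pairs."""
    blocked = [(p, y) for p, y in scored if p >= threshold]
    tp = sum(y for _, y in blocked)              # fraud correctly blocked
    total_fraud = sum(y for _, y in scored)
    precision = tp / len(blocked) if blocked else 1.0
    recall = tp / total_fraud if total_fraud else 1.0
    return precision, recall

# (model score, 1 = actually fraud)
scored = [(0.95, 1), (0.90, 1), (0.85, 0), (0.70, 1),
          (0.60, 0), (0.40, 1), (0.20, 0), (0.10, 0)]

for t in (0.5, 0.8):
    p, r = precision_recall_at(t, scored)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold from 0.5 to 0.8 makes blocking more selective, so precision rises while recall falls, exactly the tension described above.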
Stripe frames this as two distinct problems:
The data science problem is about making the model better by adding predictive features, training on more data, and refining the architecture. A better model shifts the entire precision-recall curve upward, meaning that at any given threshold, you get better outcomes on both dimensions.
The business problem is about choosing where on that curve to operate, and the right answer depends entirely on the economics of each merchant.
For example, consider two businesses.
A food delivery company with thin margins might earn $2 in profit per order. Once you account for product cost and chargeback fees, a single fraudulent transaction can wipe out the profit from nearly 19 legitimate ones. For this business, aggressive blocking makes sense because the cost of missed fraud is devastating. On the other hand, a SaaS company with high margins faces the opposite calculation. The lifetime revenue lost by blocking a legitimate subscriber who would have paid $200 per month for years far outweighs the cost of an occasional fraudulent charge.
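The arithmetic behind the two cases is worth making explicit. The $2 profit and "nearly 19 orders" figures come from the text; the product cost, chargeback fee, and SaaS lifetime are illustrative assumptions, not Stripe's numbers.

```python
def orders_wiped_out(profit_per_order, product_cost, chargeback_fee):
    """How many legitimate orders' profit one fraudulent charge erases."""
    return (product_cost + chargeback_fee) / profit_per_order

# Thin-margin food delivery: $2 profit; assume ~$23 product cost
# plus a $15 chargeback fee per fraudulent order.
print(orders_wiped_out(2.0, 23.0, 15.0))  # 19.0 orders per fraud

# High-margin SaaS: assume blocking one good $200/month subscriber
# who would have stayed 2 years, versus eating one $200 fraud.
lost_ltv = 200 * 24            # revenue lost to a false decline
fraud_cost = 200 + 15          # charge plus chargeback fee
print(lost_ltv / fraud_cost)   # a false positive costs ~22x more
```

Under these assumptions the food delivery company should sit at an aggressive, low threshold, while the SaaS company should tolerate more fraud to avoid false declines.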
This is why Stripe built Radar to be configurable.
Merchants can adjust their risk threshold, and Radar for Fraud Teams lets them compose custom rules and set up manual review queues.
Stripe evaluates custom rules with the same precision-recall framework it uses for the model itself. When a merchant creates a rule, Stripe shows historical statistics on matching transactions that were actually disputed, refunded, or accepted, so the merchant can evaluate the impact before the rule goes live. Stripe also uses additional evaluation tools like ROC curves and AUC (area under the curve) scores to assess overall model quality, but the precision-recall framing captures the core tension most directly.
Manual review adds yet another lever.
Sending borderline transactions to human reviewers instead of blocking them outright improves precision with minimal impact on recall. Conversely, sending borderline transactions to review instead of allowing them through improves recall with minimal impact on precision. The cost is human effort, but it gives merchants a way to reshape their own precision-recall curve using business knowledge the model cannot access on its own.
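In effect, manual review turns one threshold into two, carving the score range into three actions. A minimal sketch (the threshold values are illustrative assumptions):

```python
def route(score, block_above=0.9, review_above=0.6):
    """Two thresholds split the fraud score into three bands."""
    if score >= block_above:
        return "block"           # high-confidence fraud: decline outright
    if score >= review_above:
        return "manual_review"   # borderline: let a human decide
    return "allow"               # low risk: let the payment through

print(route(0.95))  # block
print(route(0.72))  # manual_review
print(route(0.10))  # allow
```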
All machine learning models are opaque to some degree, and deep neural networks especially so. Stripe accepted this trade-off when it chose DNNs over simpler, more interpretable techniques. The predictions are better, but explaining why a specific transaction received a given score is harder.
Stripe’s response was to build layers of interpretability around the model.
In 2020, they launched risk insights, a feature that shows merchants which factors contributed to a transaction being declined. The interface displays the top fraud signals, like an address being associated with a previous early fraud warning or an unusually high number of names linked to a card. It includes a location map showing distances between the billing address, shipping address, and IP address. It shows customer metadata like email, cardholder name, and the authorization rate for transactions associated with that email.
See the diagram below:

Stripe also uses Elasticsearch, a search engine optimized for fast lookups across large datasets, to surface related transactions and help merchants put a specific decline in a broader context.
Internally, the team built a table view displaying the exact features contributing most to a transaction’s fraud score, which engineers use to debug support cases. Stripe is working on sharing more of these internal tools with merchants, closing the gap between what engineers can see and what users can see.
Explainability serves a practical purpose beyond building trust. When merchants understand why Radar scored a transaction the way it did, they can improve the data they send to Stripe for more accurate decisions. They can create custom rules that incorporate knowledge only they have about their own business. The explanation layer transforms Radar from a black box into something merchants can actively collaborate with.
Building a better model is half the challenge. Deploying it safely at Stripe’s scale is the other half, and it involves two hard engineering problems.
The first is real-time feature computation. Every feature the model uses during training must also be computable in production, because Radar needs to score every incoming payment as part of the Stripe API flow. For a feature like “the two most frequent IP addresses previously used with this card,” Stripe maintains an up-to-date state on every card ever seen on the network, and fetching or updating that state has to be fast. Stripe’s ML infrastructure team built systems that let engineers define features declaratively, with current values made available automatically in production at low latency.
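The "two most frequent IP addresses" feature illustrates why per-entity state is needed. A toy in-memory version of the update-and-read cycle might look like this; in production the state would live in a low-latency store keyed by card, and the names here are hypothetical.

```python
from collections import Counter, defaultdict

# Per-card state, updated on every payment and read back at scoring time.
ip_counts: dict[str, Counter] = defaultdict(Counter)

def update_card_state(card: str, ip: str) -> None:
    ip_counts[card][ip] += 1

def top_two_ips(card: str) -> list[str]:
    """Feature: the two most frequent IPs previously seen with this card."""
    return [ip for ip, _ in ip_counts[card].most_common(2)]

for ip in ["1.1.1.1", "1.1.1.1", "2.2.2.2", "2.2.2.2", "2.2.2.2", "3.3.3.3"]:
    update_card_state("card_123", ip)

print(top_two_ips("card_123"))  # ['2.2.2.2', '1.1.1.1']
```

The hard engineering problem is doing exactly this, for hundreds of features, within the latency budget of a live API call while keeping training-time and serving-time definitions identical.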
The second is ensuring that model improvements hold across the entire user base, all the way down to individual merchants. A model that performs better on aggregate metrics might still cause a spike in block rate for smaller businesses, which would be disruptive for those merchants and their customers. Before releasing any model, Stripe measures the change it would cause to the false positive rate, block rate, and authorization rate on both an aggregate and per-merchant basis. If a model would cause undesirable shifts for certain users, they adjust it for those segments before release. They also compare score distributions between old and new models, aiming to keep the proportion of transactions above each merchant’s blocking threshold stable.
Fraud patterns shift constantly, which means even a well-performing model degrades over time, a phenomenon called model drift. Stripe found that retraining the same model on more recent data, with identical features and architecture, improves recall by up to half a percentage point per month. That is a big gain from simply keeping the data fresh. By investing in automated training, tuning, and evaluation tooling, Stripe tripled their model release cadence. They continuously update performance dashboards after training but before release, so engineers can spot stale model candidates and proactively retrain them.
The fraud landscape itself keeps evolving. Patterns have shifted from primarily stolen credit card fraud to a growing mix of traditional card fraud and high-velocity card testing attacks, where automated scripts try large numbers of stolen card numbers against a merchant’s checkout flow. Stripe’s deployment infrastructure is built to support this kind of rapid adaptation.
Stripe also faces a subtle measurement challenge in production. Transactions that the model blocks have unknown true outcomes because the payment was never completed. Computing a full production precision-recall curve requires counterfactual analysis, meaning statistical methods that estimate what would have happened to payments Radar blocked. Stripe has developed proprietary techniques for this over the years.
Radar is a very different product from what it was at launch.
The models, the data pipelines, the explainability tools, and the way Stripe communicates fraud decisions to merchants have all been rebuilt. Fraud patterns have changed considerably in that time as well.
However, the core goal of the Radar team remains the same.
They are still working to create an environment where businesses and customers can transact with confidence, still optimizing that brief moment customers barely register, the instant between clicking “purchase” and seeing the transaction confirmed. Every architectural choice, every feature, every deployment safeguard exists to make that 100-millisecond window as accurate, fair, and trustworthy as possible.
References:
2026-04-27 23:30:57
Most AI agents don’t fail because the model is bad. They fail because the model doesn’t have the proper infrastructure to reason well.
Simba Khadder, Head of Engineering at Redis, lays out a four-pillar framework for building context systems that hold up in production, plus an architectural self-audit checklist you can run against your stack today.
Search “shoes for pregnant women” on Amazon, and the best results might be slip-resistant shoes, even though the word “pregnant” appears nowhere in those product listings.
In other words, there is zero keyword overlap between the query and the product. The search engine has to reason that pregnant women need stability, that stability means slip-resistance, and that slip-resistant shoes are the right match.
Traditional recommendation systems match text to text and purchase history to purchase history. They handle keyword overlap quite well. However, when a shopper’s intent requires a reasoning step that lives entirely in human common sense, those systems hit a wall.
Amazon’s search team recognized this blind spot and built a commonsense knowledge graph called COSMO that teaches the recommendation engine to think the way a human shopper would.
In this article, we will look at how COSMO works and the challenges the engineering team faced.
Disclaimer: This post is based on publicly shared details from the Amazon Engineering Team. Please comment if you notice any inaccuracies.
Amazon already operates large-scale knowledge graphs that store factual product attributes like brand, color, material, and category. These graphs power a lot of what works well in product search today. However, they mainly try to encode what a product is, and they don’t explain why a human would want it.
This is the semantic gap problem.
For example, a query like “winter clothes” carries an implicit intent around warmth. The product catalog for a long-sleeve puffer coat describes its material, size options, and sleeve length, but it may say nothing about warmth directly. The gap between what the customer typed and what the product listing says requires a reasoning step that factual knowledge graphs were never designed to handle.
Amazon’s team surveyed the landscape of existing solutions.
Alibaba built AliCoCo (163K nodes, 91 relations) and AliCG (5M nodes), both extracted from search logs. These capture product concepts, but they stay focused on product attributes and categories, skipping user intent entirely.
General commonsense knowledge bases like ConceptNet (8M nodes, 21M edges) cover everyday reasoning but are built for general purposes, with little grounding in shopping behavior.
Amazon’s own earlier effort, FolkScope, demonstrated that commonsense knowledge could be extracted from shopping data, but it covered only 2 product categories and only co-purchase behavior.
The gap was clear. Though factual product knowledge and general commonsense knowledge existed, structured knowledge about why people buy things at an e-commerce scale was missing.
The intuition behind Amazon’s approach was simple. Large language models encode enormous amounts of world knowledge in their parameters. Taking our earlier example, if you ask an LLM why a customer who searched “winter coat” bought a long-sleeve puffer coat, it can reason that puffer coats provide warmth, and warmth is what the customer wanted.
The team fed millions of user behavior pairs into OPT-175B and OPT-30B, large language models hosted internally on 16 A100 GPUs. The choice of OPT over GPT-4 was driven by a hard constraint around data privacy. Customer behavior data, meaning which queries led to which purchases, could only be processed on Amazon’s own infrastructure.
Two types of behavior data went into the system.
Query-purchase pairs capture the connection between a search query and the product a customer ultimately bought.
Co-purchase pairs capture products bought together in the same shopping session.
Across 18 product categories, the team sampled 3.14 million co-purchase pairs and 1.87 million query-purchase pairs.
The sampling strategy was itself a design decision.
For products, Amazon covered popular browse node categories and selected top-tier products with high interaction volume, also using product type labels (more than a thousand classes like “umbrella” or “chair”) for finer-grained selection.
For co-purchase pairs, the team cross-checked product types to remove random co-purchases and filtered out products that co-occurred with too many different product types (a signal of noise rather than intent).
For search-buy pairs, thresholds on both purchase rate and click rate determined which queries and products entered the sample.
Crucially, an in-house query specificity service helped prioritize broad or ambiguous queries, because those are exactly where the semantic gap is largest and commonsense knowledge adds the most value.
Prompt design mattered too. Rather than using simple text continuation, Amazon formatted each behavior pair as a question-answering task and instructed the LLM to generate a numbered list of candidates rather than a single response.
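The exact prompt wording is not public, so the template below is a hedged illustration of the shape described: a behavior pair framed as a question, with the model instructed to answer as a numbered list.

```python
# Illustrative question-answering prompt for a query-purchase pair.
# The wording is an assumption; only the format (QA task + numbered
# list of candidates) follows the description above.
QP_TEMPLATE = (
    'A customer searched for "{query}" and bought "{product}".\n'
    "Question: Why would this search lead to this purchase?\n"
    "Answer with a numbered list of possible reasons:\n"
    "1."
)

def build_prompt(query, product):
    return QP_TEMPLATE.format(query=query, product=product)

print(build_prompt("winter coat", "long-sleeve puffer coat"))
```

Asking for a numbered list nudges the LLM toward several distinct hypotheses per pair instead of one generic continuation.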
The LLM generated millions of candidate explanations. However, only 35% of search-buy explanations met Amazon’s quality bar for typicality, meaning they were representative of genuine shopping intent. For co-purchase explanations, that number dropped to 9%. The rest were filler. The LLM produced circular rationales like “customers bought them together because they like them,” or trivially obvious statements like “customers bought an Apple Watch because it is a type of watch.”
The 9% vs. 35% gap reveals something about how LLMs reason. Explaining why a query led to a purchase is relatively constrained because the query provides clear context about intent. But explaining why two products were bought together requires identifying a shared reason across two different items, and LLMs tend to default to generic explanations for one item rather than reasoning about the pair.
Amazon also needed a way to categorize the relationships that the LLM was generating. The team started with 4 broad seed relations (usedFor, capableOf, isA, cause) that prior work had shown produce diverse outputs. From there, they mined finer-grained relation types directly from the LLM’s generated text by looking for recurring predicate patterns.
The most common pattern was “the product is capable of being used [preposition],” where different prepositions mapped to different semantic relationships. This data-driven process produced 15 relation types that capture distinct ways humans reason about products. These include used_for_function (“dry face”), used_for_event (“walk the dog”), used_for_audience (“daycare worker”), used_in_location (“bedroom”), used_in_body (“sensitive skin”), used_with (complementary products like “surface cover”), and person-centric relations like xIs_a (“pregnant women”) and xWant (“play tennis”). The ontology was shaped by what the LLM actually generated, then canonicalized and structured by Amazon’s researchers, rather than being designed top-down by a team of knowledge engineers.
The LLM produced a mountain of hypotheses, which were mostly noise. Amazon’s solution was a multi-stage refinement pipeline, where each stage catches a different type of failure.
Coarse-grained filtering tackled the most obvious problems first. Rule-based filters removed incomplete sentences by measuring sentence quality with a language model (GPT-2) and tuning a threshold. Generations that exactly matched the query text, the product type, or the product title (or fell within a small edit distance) were discarded. For generic statements like “used for the same reason” or “used with clothes,” Amazon identified these by combining frequency and entropy, since generic explanations tend to co-occur with many different products rather than specific ones.
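The frequency-plus-entropy idea for catching generic statements can be sketched as follows. The thresholds are hypothetical; the signal is the same one described above: generic explanations appear often *and* spread evenly across many different products.

```python
import math
from collections import Counter, defaultdict

def explanation_stats(pairs):
    """pairs: (explanation, product) tuples.
    Returns explanation -> (frequency, entropy over co-occurring products)."""
    by_expl = defaultdict(list)
    for expl, product in pairs:
        by_expl[expl].append(product)
    stats = {}
    for expl, products in by_expl.items():
        counts = Counter(products)
        total = len(products)
        # High entropy = the explanation attaches to many different
        # products, a sign it is generic rather than intent-specific.
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        stats[expl] = (total, h)
    return stats

def is_generic(expl, stats, min_freq=3, min_entropy=1.5):
    freq, h = stats[expl]  # thresholds here are illustrative assumptions
    return freq >= min_freq and h >= min_entropy

pairs = [("used with clothes", p) for p in ["shirt", "pants", "hat", "socks"]]
pairs.append(("protects a camera", "camera case"))
stats = explanation_stats(pairs)
print(is_generic("used with clothes", stats))   # True
print(is_generic("protects a camera", stats))   # False
```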
Similarity filtering handled a subtler problem. Some LLM outputs looked different from the input on the surface but were semantically just paraphrases of the original query or product description.
Amazon used an in-house language model, pre-trained on e-commerce text including queries and product information, to compute embeddings for the generated knowledge, the query, and the product. When the vector similarity (measured by cosine distance) between the generated knowledge and the original context was too high, the candidate was filtered out. The team found that filtered generations were essentially syntactic transformations of the original input, rearranging the same meaning in slightly different words.
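A minimal sketch of that similarity gate, with plain cosine similarity over embedding vectors. The 0.9 cutoff is an illustrative assumption; the real threshold was tuned internally.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def keep_candidate(generated_vec, context_vec, max_sim=0.9):
    # Drop candidates whose embedding is nearly identical to the original
    # query/product context: they are paraphrases, not new knowledge.
    return cosine(generated_vec, context_vec) < max_sim
```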
Human-in-the-loop annotation came next. Amazon sampled 30,000 knowledge candidates for human review, with 15,000 from co-purchase behavior and 15,000 from search-buy behavior spread across 18 categories. Rather than picking candidates uniformly, the team used a weighted formula that combined the frequency of a piece of generated knowledge with the popularity of the associated product or query. Popular products produce common knowledge, so the weighting pushed toward diverse, less obvious knowledge that the classifier would later need to generalize.
Annotators evaluated each candidate on two dimensions:
Plausibility measures whether the posited relationship is reasonable.
Typicality measures whether the knowledge is representative of genuine shopping behavior.
As a concrete example, the more typical reason people buy Apple Watches is that they are intelligent watches, rather than that they tell the time. Both statements are plausible, but only the first is typical.
To reduce cognitive burden and disagreement among annotators, Amazon decomposed these assessments into five yes/no questions covering completeness, relevance, informativeness, plausibility, and typicality. Two annotators labeled each question independently, with a third resolving disagreements. A pilot study of 2,000 examples showed this decomposition significantly reduced the disagreement rate, and internal auditing of 5% of all annotations showed over 90% accuracy. Due to data privacy requirements, Amazon employed a professional data annotation vendor, followed by a strict internal auditing process.
Classifier generalization was the final step. Amazon fine-tuned DeBERTa-large (a high-performing language model for classification tasks) and an in-house language model on the 30,000 annotated samples to predict plausibility and typicality scores for all remaining candidates. Only candidates scoring above a 0.5 plausibility threshold survived.
The output of this pipeline is a set of structured knowledge triples. A triple connects two entities through a defined relationship. For example, the triple <co-purchase of camera case and screen protector, capableOf, protecting camera> captures the commonsense reasoning that these two products are bought together because they both serve the purpose of protecting a camera. Assembled, these triples form a knowledge graph of 6.3 million nodes and 29 million edges spanning 18 product categories. From 30,000 human judgments to 29 million edges.
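Assembling triples into a graph can be sketched with a simple adjacency structure. The relation names follow the ones described earlier; the graph API itself is illustrative.

```python
from collections import defaultdict

# Each triple is (head entity, relation, tail entity).
triples = [
    ("co-purchase: camera case + screen protector", "capableOf", "protecting camera"),
    ("slip-resistant shoes", "used_for_audience", "pregnant women"),
    ("long-sleeve puffer coat", "used_for_function", "warmth"),
]

# Adjacency map: head -> list of (relation, tail) edges.
graph = defaultdict(list)
for head, relation, tail in triples:
    graph[head].append((relation, tail))

print(graph["slip-resistant shoes"])  # [('used_for_audience', 'pregnant women')]
```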
The knowledge graph captures pre-computed commonsense relationships, but Amazon’s search engine encounters new queries and products constantly. Running the full pipeline (OPT-175B generation followed by classifier scoring) for every new behavior pair would be prohibitively expensive in production.
Amazon’s solution was instruction tuning.
The team used their 30,000 annotated samples to create instruction data and fine-tuned LLaMA 7B and 13B models. These base models offered the best balance between generation quality and inference cost for production serving, with far fewer parameters than OPT-175B while still producing high-quality outputs when trained on domain-specific data. The resulting model, COSMO-LM, was trained across 18 product domains, 15 relation types, and 5 distinct tasks.
Beyond commonsense generation, those tasks included plausibility prediction, typicality prediction, search relevance prediction, and co-purchase prediction. The multi-task training means COSMO-LM can both generate knowledge and evaluate its own output quality, effectively collapsing the “big LLM plus classifier” stack into a single, smaller model.
To make the model robust to different input formats, Amazon varied the templates during training. The same query-product pair might be prefixed with “search query,” “user input,” or “user searched” across different training examples. This prevents COSMO-LM from becoming brittle to prompt phrasing.
The result is two complementary artifacts in production. The static knowledge graph (29 million pre-computed edges) handles known product relationships. COSMO-LM generates fresh commonsense knowledge on the fly for new or unseen query-product pairs, with dramatically lower inference cost than the original OPT-175B pipeline. A demo of the system shows COSMO-LM generating knowledge for a query like “how to decorate a home,” producing a list of product types (wall art, decorative signage, sticker decal, decorative pillow cover, artificial plant, rug, home mirror, lamp), each accompanied by a commonsense explanation of its role in home decoration.
Having a model that generates useful knowledge is one challenge. Serving it at Amazon’s scale with acceptable latency is another.
Amazon’s deployment architecture centers on two components:
A Feature Store transforms COSMO-LM’s raw text outputs into structured features that downstream applications can consume directly. These features include product key-value pairs, semantic subcategory representations, and intent signals.
An Asynchronous Cache Store manages the serving layer through a two-tiered caching strategy.
The first tier pre-loads responses for yearly frequent searches, covering the majority of traffic. The second tier batch-processes daily requests for newer or less common queries and updates the cache.
When a user query arrives, the system checks the cache first. Hits get immediate responses. Misses go to batch processing, and the cache updates for future identical queries.
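That flow can be sketched as a small class. The class and method names are illustrative; the key behavior is that a miss never triggers inline model inference, only enrollment in the next batch run.

```python
class TwoTierCosmoCache:
    """Sketch of the serving flow described above. Tier 1 holds pre-loaded
    answers for frequent queries; misses are queued for the daily batch job
    instead of being computed inline."""

    def __init__(self, preloaded):
        self.cache = dict(preloaded)   # tier 1: yearly frequent searches
        self.batch_queue = []          # tier 2: processed daily, then cached

    def lookup(self, query):
        if query in self.cache:
            return self.cache[query]   # hit: immediate response
        self.batch_queue.append(query) # miss: enqueue for the daily batch
        return None

    def apply_batch_results(self, results):
        # Called after the daily batch run to warm the cache.
        self.cache.update(results)
        self.batch_queue.clear()

cache = TwoTierCosmoCache({"winter coat": ["puffer coat", "wool coat"]})
print(cache.lookup("winter coat"))       # immediate hit
print(cache.lookup("lakeside camping"))  # miss: None now, answered tomorrow
cache.apply_batch_results({"lakeside camping": ["tent", "air mattress"]})
print(cache.lookup("lakeside camping"))  # future identical queries hit
```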
SageMaker manages model deployment and refresh, ingesting customer behavior session logs daily. The structured data from the cache feeds three downstream systems simultaneously: Search Relevance, Recommendation, and Navigation.
This architecture meets Amazon’s strict search latency requirements while keeping storage costs comparable to real-time serving for most traffic. But it comes with a tradeoff. COSMO updates daily, which means it cannot incorporate real-time events like flash sales that fluctuate within hours. Amazon explicitly acknowledges this limitation and identifies it as an area for future development.
Search relevance saw the most dramatic offline improvements.
On the public ESCI dataset from KDD Cup 2022, a cross-encoder (a model architecture that jointly processes query and product features together, rather than encoding them separately) augmented with COSMO triples achieved 73.48% Macro F1 and 90.78% Micro F1 with trainable encoders.
For context, Macro F1 averages performance across all product categories equally (so rare categories matter just as much as common ones), while Micro F1 measures overall accuracy regardless of category. That cross-encoder result surpassed the top-1 ensemble model on the KDD Cup leaderboard. With frozen encoders, where the only difference was whether COSMO triples were included as input, the improvement was 60% on Macro F1.
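The macro/micro distinction is easy to see in code. This toy computation (with made-up counts, not the ESCI numbers) shows how one weak rare category drags macro F1 far more than micro F1.

```python
def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_micro_f1(per_category):
    # per_category: list of (true positives, false positives, false negatives).
    macro = sum(f1(*c) for c in per_category) / len(per_category)  # categories weighted equally
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    return macro, f1(tp, fp, fn)  # micro: pooled counts, volume-weighted

# A common category done well, a rare category done poorly (illustrative counts).
macro, micro = macro_micro_f1([(90, 10, 10), (1, 9, 9)])
print(round(macro, 3), round(micro, 3))  # 0.5 0.827
```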
On private datasets spanning four markets (US, Canada, UK, and India), the COSMO-enhanced model consistently outperformed baselines in every locale, with the strongest gains in the India market, where the gap between query language and product catalog language tends to be larger.
Session-based recommendation benefited from COSMO knowledge as well.
Amazon built COSMO-GNN, extending a graph neural network model (a model that learns relationships between items by treating shopping sessions as connected graphs) for session-based recommendations with COSMO-generated intent knowledge. It outperformed all competitive baselines on Hits@10 and NDCG@10 in both clothing and electronics categories.
The improvement was larger for electronics (5.82% vs. 4.05% on Hits@10), where users revise their search queries more frequently (2.47 unique queries per session versus 1.36 for clothing). This pattern makes sense. When users are actively reformulating queries to narrow down what they want, commonsense knowledge about why they are searching becomes especially valuable.
Search navigation is where COSMO reached production and generated real business impact. COSMO powers a multi-turn navigation system that organizes intent hierarchically. A search for “camping” branches into fine-grained intents like “winter camping,” “beach camping,” or “lakeside camping.” These connect to product types like “air mattress” or “winter boots,” which are then further refined by attributes like “4 person.”
This hierarchical organization of knowledge allows the system to mirror a natural discovery process, helping customers progressively narrow their search through multiple rounds of refinement rather than requiring them to formulate the perfect query upfront.
Amazon ran A/B tests over several months, targeting approximately 10% of U.S. traffic. The results were significant. A 0.7% relative increase in product sales within the test segment translated to hundreds of millions of dollars in additional annual revenue.
An 8% increase in navigation engagement was observed in the same segment. These outcomes came from a single, relatively small feature on the search page with limited visibility. Amazon has projected that extending COSMO-LM across all traffic for navigation alone could produce revenue gains in the billions.
COSMO is Amazon’s first production system that uses instruction-tuned large language models to construct a knowledge graph and serve it to online applications. It marks a shift from factual product knowledge graphs toward intent-based commonsense knowledge graphs.
The most important number from this entire project may be the leverage ratio. Thirty thousand human annotations became 29 million knowledge graph edges across 18 product categories. That ratio was possible because Amazon invested heavily in sampling strategy, annotation design, classifier training, and instruction tuning rather than in brute-force labeling.
The system’s acknowledged limitations are worth keeping in mind as well.
COSMO’s daily refresh cycle means it cannot keep up with real-time dynamics. Its aggressive filtering (only candidates above 0.5 plausibility survive) means the knowledge graph has gaps in coverage, especially for long-tail products and unusual queries. These are genuine tradeoffs, and Amazon chose precision over recall because unreliable commonsense knowledge in production would be worse than missing knowledge.
References:
2026-04-25 23:30:59
If slow QA processes are bottlenecking your software engineering team and you’re releasing slower because of it, you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
Coding Agents Explained: How Claude Code, Codex & Cursor Actually Work (Youtube video)
Data Warehouse vs Data Lake vs Data Mesh
API Concepts Every Software Engineer Should Know
Polling vs Long Polling vs Webhooks vs SSE
SLA vs SLO vs SLI
Build with Claude Code — Course Direction Survey
Storing data is the easy part. Deciding where and how to organize it is the real challenge.
A data warehouse is the traditional approach. It cleans and structures data before storing it. Queries run fast, and reports stay consistent. But adding a new data source takes effort because everything has to fit the schema first.
A data lake takes the opposite approach. It stores everything raw, like databases, logs, images, and video. Process it when you need it. The flexibility is great, but if rules around naming, formatting, and ownership are not properly set, you end up with duplicate, outdated, and undocumented data that is hard to manage.
Data mesh shifts data ownership from a central team to individual departments. For example, sales publishes sales data, and finance publishes finance data. Shared standards keep things compatible across teams.
It works well in larger organizations. But it requires every team to have the right people and processes to manage their data quality, documentation, and access, which is a challenge.
In practice, many companies use more than one approach. They'll use a warehouse for dashboards and reporting, a lake for machine learning workloads, and start applying mesh principles as teams scale.
Most engineers use APIs every day. Sending a request and reading JSON is one thing. Designing an API that other people can rely on is where things get complicated.
A lot of problems begin with basic HTTP details that seem small at first. Methods, status codes, request formats, and response structure can make an API feel clear and predictable, or confusing and inconsistent.
Then there are the bigger design choices. REST, GraphQL, gRPC, webhooks, and WebSockets each make sense in different situations. The challenge is knowing what actually fits the system and the use case.
A lot of API problems also come from design decisions that do not get enough attention early on. Naming, pagination, versioning, error responses, and backward compatibility often decide whether an API is easy to work with or frustrating to maintain.
Security is another area where weak decisions can cause real problems. API keys, OAuth, JWTs, scopes, and permissions are easy to mention. Getting them right is harder, and mistakes here can be costly.
Reliability matters too. Timeouts, retries, idempotency, rate limits, and caching are often easy to ignore until the system is under pressure.
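Retries and idempotency interact: a retried POST must not charge a customer twice. A common pattern, sketched here with a hypothetical `send` callable standing in for the HTTP client, is to generate one idempotency key and reuse it on every retry so the server can deduplicate. The header name varies by API.

```python
import uuid

def post_with_retries(send, payload, max_attempts=3):
    """Retry a POST safely. `send(payload, headers=...)` performs the request
    and raises ConnectionError on network failure (an illustrative interface)."""
    key = str(uuid.uuid4())  # same key on every attempt, so the server can
    last_error = None        # recognize and deduplicate a repeated request
    for _ in range(max_attempts):
        try:
            return send(payload, headers={"Idempotency-Key": key})
        except ConnectionError as err:
            last_error = err
    raise last_error
```

Without the stable key, a timeout followed by a retry could execute the same side effect twice.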
And once an API starts growing, the supporting work matters too. Clear documentation, solid specs, observability, and contract testing make it much easier for teams to trust the API and use it without guessing how it works.
Over to you: What’s the most overlooked API concept in your experience?
Four ways to get updates from a server. Each one makes a different tradeoff between simplicity, efficiency, and real-time delivery.
Here's how they compare:
Polling: The client sends a request every few seconds asking "anything new?" The server responds immediately, whether or not there's new data. Most of those requests come back empty, wasting client and server resources. For use cases like an order status page where a small delay is acceptable, polling is the simplest option to implement.
Long Polling: The client sends a request, and the server keeps the HTTP connection open until new data is available or a timeout occurs. This means fewer empty responses compared to regular polling. Some chat applications used this pattern to deliver messages closer to real-time communication.
Server-Sent Events (SSE): The client opens a persistent HTTP connection, and the server streams events through it as they're generated. It is one-way, lightweight, and built on plain HTTP. Many AI responses that appear token by token are delivered through SSE, streaming each chunk over a single open connection.
Webhooks: Instead of the client asking for updates, the service sends an HTTP POST to a pre-registered callback URL whenever a specific event occurs. Stripe uses this for payment confirmations. GitHub uses it for push events. The client never polls or holds a connection open; it just waits for the server to call.
Many systems don't rely on a single pattern. You may use polling for order status, SSE for streaming AI responses, and webhooks for payment confirmations.
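The simplest of the four, plain polling, can be sketched as a loop. The `fetch` callable stands in for the HTTP request; the interval and cap are illustrative.

```python
import time

def poll(fetch, interval=2.0, max_polls=10):
    """Ask every `interval` seconds until `fetch` returns data.
    Most calls come back empty (None), which is exactly the waste
    that long polling, SSE, and webhooks are designed to avoid."""
    for _ in range(max_polls):
        data = fetch()
        if data is not None:
            return data
        time.sleep(interval)
    return None  # gave up: caller decides whether to keep waiting
```

Long polling moves the waiting to the server side of the same request; SSE and webhooks remove the repeated asking entirely.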
These three terms are related, but they mean different things. Knowing the difference helps you define what to measure, aim for, and promise your customers.
Here's how they actually connect:
SLI (Service Level Indicator): This is the metric you're measuring. For a login service, it could be the ratio of successful login requests to total valid requests. It tells you how your service is performing right now.
SLO (Service Level Objective): You take that SLI and define a target around it. Something like "login availability should stay above 99.9% over a rolling 28-day window." When you're missing your SLO, it’s a signal to find out what's failing before customers notice.
SLA (Service Level Agreement): This is what you promise your customers in a contract. It's usually set lower than the SLO, say 99.5% monthly availability. If you breach it, you owe service credits.
If your SLO and SLA are both set to 99.9%, you have no buffer: the moment availability drops below 99.9%, you've already breached the agreement instead of just tripping an internal alarm.
The SLI tells you where you stand. The SLO tells you where you should be. The SLA tells your customers what they can expect.
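The relationship between the three can be sketched in a few lines. The login example and the 99.9%/99.5% targets follow the text above; the function names are illustrative.

```python
def availability_sli(successful_requests, total_valid_requests):
    """The SLI: the measured ratio, e.g. successful logins / valid login requests."""
    return successful_requests / total_valid_requests

def check(sli, slo=0.999, sla=0.995):
    # Missing the SLO is an internal signal to investigate;
    # missing the SLA means owing customers service credits.
    return {"slo_met": sli >= slo, "sla_met": sli >= sla}

print(check(availability_sli(99_940, 100_000)))  # healthy: both met
print(check(availability_sli(99_800, 100_000)))  # SLO missed, SLA still met
```

The second case is the buffer working as intended: engineers get an alarm while the contract is still intact.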
Over to you: How do you decide what the right SLO target is when you're launching a new service?
We’re building a new course, Build with Claude Code, and we’d love your input before we finalize it.
If you’re an engineer or engineering leader, we’d appreciate 3 minutes of your time. Your answers will directly shape what we cover. Thank you so much!
2026-04-23 23:30:59
Every database has to solve the same basic problem.
Data lives on disk, and accessing disk is slow. Every read and every write eventually has to reach the disk, and how a database organizes data on that disk determines everything about its performance.
Over decades of research, two dominant approaches have emerged.
B-Trees keep data sorted on disk so reads are fast, but pay for it on every write.
LSM Trees buffer writes in memory and flush them to disk in bulk, making writes cheap but reads more expensive.
Neither is strictly better. They represent two different sets of tradeoffs, and understanding the tension between them is one of the most useful mental models in system design.
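The LSM write path can be sketched in miniature: writes land in an in-memory buffer (the memtable), and when it fills, it is flushed as a sorted, immutable run (an SSTable). The tiny flush limit here is for illustration; real systems flush at megabytes and also compact runs together.

```python
import bisect

class TinyLSM:
    """Toy LSM tree: cheap buffered writes, reads that may check several runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # newest first; each run is a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value          # write: one in-memory update
        if len(self.memtable) >= self.limit:
            run = sorted(self.memtable.items())
            self.sstables.insert(0, run)    # flush as an immutable sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:           # read amplification: newest to oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # triggers a flush
db.put("a", 3)   # newer value shadows the flushed one
print(db.get("a"), db.get("b"))  # 3 2
```

A B-Tree would instead update pages in place on disk, paying per write so that every read is a single tree descent.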
In this article, we will look at B-Trees and LSM trees in detail, along with the trade-offs associated with each of them.
2026-04-21 23:30:21
Skip the guesswork with this MongoDB cheatsheet from Datadog. You’ll get a quick, practical reference for monitoring performance and diagnosing issues in real systems.
Use it to:
Track key metrics like latency, throughput, and resource utilization
Monitor MongoDB and Atlas health with the right signals
Set up dashboards to quickly identify bottlenecks and performance issues
When DoorDash needed to launch Dasher onboarding in Puerto Rico, it took about a week. That wasn’t because they cut corners or threw a huge team at it. It took a week because almost no new code was needed. The steps that Puerto Rican Dashers would go through (identity checks, data collection, compliance validation) already existed as independent modules, battle-tested by thousands of Dashers in other countries. The team assembled them into a new workflow, made one minor customization, and shipped.
And it wasn’t just Puerto Rico. Australia’s migration was completed in under a month. Canada took two weeks, and New Zealand required almost no new development at all.
This speed came from an architectural decision the DoorDash engineering team made when they looked at their growing mess of country-specific if/else statements and decided to stop patching.
They rebuilt their onboarding system around a simple idea. Decompose the process into self-contained modules with standardized interfaces, then connect them through a deliberately simple orchestration layer.
In this article, we will look at how this architecture was designed and the challenges they faced.
Disclaimer: This post is based on publicly shared details from the DoorDash Engineering Team. Please comment if you notice any inaccuracies.
DoorDash’s Dasher onboarding started simple, with just a few steps serving a single country through straightforward logic. Then the company expanded internationally, and every new market meant new branches in the code.
At one point, three API versions ended up coexisting. V3, the newest, continued calling V2 handlers for backward compatibility and also continued writing to V2 database tables. The system literally couldn’t avoid its own history. All developers have probably seen something like this before, where nobody can fully explain which version handles what, and removing any piece feels dangerous because something else might depend on it.
See the diagram below that shows the legacy system view:
The step sequences themselves were hard-coded, with country-specific logic spread throughout. Business logic started immediately after receiving a request, branching into deep if/else chains based on country, step type, or prior state. Adding a new market meant carefully threading new conditions through this maze.
Vendor integrations followed no consistent pattern either. Some onboarding steps used internal services, which called third-party vendors. Other steps called vendors directly. This inconsistent layering made testing and debugging unpredictable.
And then there was also the state management problem. Onboarding progress was tracked across multiple separate database tables. Flags like validation_complete = true or documents_uploaded = false lived in different systems. If a user dropped off mid-onboarding and came back later, reconstructing where they actually stood required querying several systems and inferring their state from scattered flags. This frequently led to errors.
The practical cost was that adding a new country took months of engineering effort across APIs, tables, and code branches. Every change carried the risk of breaking something in a market on the other side of the world.
DoorDash’s rebuild was organized around three distinct layers, each with a single responsibility. It’s easy to blur these layers together, but the separation between them is where the real power lives.
The Orchestrator sits at the top. It’s a lightweight routing layer that looks at context (which country and which market type) and decides which workflow definition should handle the request. That’s all it does. It doesn’t execute steps, manage state, or contain business logic. The smartest thing about the orchestrator is how little it does. Developers tend to imagine the central controller as the brain of the system, but in this architecture the brain is distributed, and the orchestrator is just a traffic cop.
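The orchestrator’s job is small enough to sketch in a few lines. This is a hypothetical illustration, not DoorDash’s actual code: the registry contents, workflow names, and `route` function are all invented for the example.

```python
# Hypothetical sketch of the orchestrator's only job: pick a workflow
# definition from the request context. All names here are invented.

WORKFLOW_REGISTRY = {
    ("US", "marketplace"): "UsOnboardingWorkflow",
    ("AU", "marketplace"): "AuOnboardingWorkflow",
    ("CA", "marketplace"): "CaOnboardingWorkflow",
}

def route(country: str, market_type: str) -> str:
    # No business logic, no state, no step execution -- a pure lookup,
    # the "traffic cop" described above.
    try:
        return WORKFLOW_REGISTRY[(country, market_type)]
    except KeyError:
        raise ValueError(f"no workflow registered for {country}/{market_type}")
```

Everything market-specific lives behind the lookup, which is why adding a country never means touching the orchestrator itself.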
Workflow Definitions are the second layer. A workflow is simply an ordered list of steps for a specific market. The US workflow might look like Data Collection, followed by Identity Verification, followed by Compliance Check, followed by Additional Validation. Australia’s workflow skips one step and reorders another. Puerto Rico adds a regional customization. Each workflow is defined as a class with a list of step references, making it easy to see exactly what each market’s onboarding process looks like.
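A workflow definition of this shape might look like the sketch below. The class and step names are hypothetical, chosen to mirror the US and Australia examples in the text.

```python
# Hypothetical workflow definitions: each market's workflow is just a
# class holding an ordered list of step references. Names are invented.

class Workflow:
    steps: list = []

class UsOnboardingWorkflow(Workflow):
    steps = [
        "data_collection",
        "identity_verification",
        "compliance_check",
        "additional_validation",
    ]

class AuOnboardingWorkflow(Workflow):
    # Same bricks, different build: one step skipped, another reordered.
    steps = [
        "data_collection",
        "compliance_check",
        "identity_verification",
    ]
```

Reading a workflow class tells you exactly what a market’s onboarding looks like, with no branching to trace.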
Think of it like a Lego set. Each brick has a standardized shape, studs on top, tubes on the bottom, and that standard interface lets you build anything. A workflow definition is like building instructions for a specific model.
Step Modules are the third layer, and this is where the actual work happens. Each step (data collection, identity verification, risk and compliance checking, document verification) is implemented as an independent and self-contained module. A step knows how to collect its data, validate it, call its external vendors, handle retries and failures, and report success or failure. What it doesn’t know is which workflow it belongs to, or what step comes before or after it. This isolation is what makes reuse possible.
The mechanism enabling this plug-and-play behavior is the interface contract. Every step implements the same standardized interface, with a method to process the step, a method to check if it’s complete, and a method to return its response data. As long as a new step honors this contract, it can slot into any workflow without the workflow knowing or caring about its internals.
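The contract can be sketched as an abstract base class. This is an illustrative rendering under assumed names: the article’s `isStepCompleted()` appears here as `is_step_completed()`, and `AddressStep` is a deliberately simplified stand-in for a real module.

```python
from abc import ABC, abstractmethod

class Step(ABC):
    # The shared interface contract: any module honoring these three
    # methods can slot into any workflow without the workflow knowing
    # its internals.

    @abstractmethod
    def process(self, context: dict) -> None:
        """Do the step's work: collect data, call vendors, handle retries."""

    @abstractmethod
    def is_step_completed(self, status_map: dict) -> bool:
        """Each step defines its own completion logic."""

    @abstractmethod
    def response_data(self) -> dict:
        """Return the step's result data."""

class AddressStep(Step):
    # Minimal illustration; a real module would validate the address
    # and call external services with retries.
    def __init__(self) -> None:
        self._address = None

    def process(self, context: dict) -> None:
        self._address = context.get("address")

    def is_step_completed(self, status_map: dict) -> bool:
        return status_map.get("address", {}).get("status") == "DONE"

    def response_data(self) -> dict:
        return {"address": self._address}
```

Note that the step reads completion from the shared status map rather than from the workflow, which is what keeps it workflow-agnostic.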
This contract also enables team autonomy. The identity verification step can be owned entirely by the security team. Payment setup can belong to the finance team. Each team iterates on their step independently, as long as they maintain the shared interface. In a way, the architecture mirrors the organizational structure, or more accurately, it lets the organizational structure work for the system instead of against it.
Two additional capabilities make the system even more flexible:
Composite steps group multiple granular steps into a single logical unit. One country might collect all personal information on a single screen. Another might split it across three screens. A composite step called “PersonalDetails” can wrap Profile, Additional Info, and Vehicle steps together, handling that variation without changing the individual step implementations underneath.
And steps can be dynamic and conditional. A Waitlist step might only appear in markets with specific supply conditions. The same step can even appear multiple times within a single workflow.
This flexibility goes beyond simple reordering and confirms that steps are truly stateless and workflow-agnostic.
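A composite step can be sketched as a wrapper that exposes the same interface as a granular step. Everything here is hypothetical: `FieldStep` is a toy stand-in for steps like Profile or Vehicle, and the completion rule (all sub-steps done) is one reasonable choice, not necessarily DoorDash’s.

```python
# Hypothetical composite step: wraps several granular steps behind the
# same interface, so one market can present them as a single screen
# while another splits them up. Names are invented.

class FieldStep:
    # A tiny stand-in for a granular step such as Profile or Vehicle.
    def __init__(self, key: str):
        self.key = key

    def process(self, context: dict) -> None:
        context.setdefault("collected", []).append(self.key)

    def is_step_completed(self, status_map: dict) -> bool:
        return status_map.get(self.key, {}).get("status") == "DONE"

class CompositeStep:
    def __init__(self, name: str, sub_steps: list):
        self.name = name
        self.sub_steps = sub_steps

    def process(self, context: dict) -> None:
        for step in self.sub_steps:
            step.process(context)

    def is_step_completed(self, status_map: dict) -> bool:
        # The composite is done only when every wrapped step is done.
        return all(s.is_step_completed(status_map) for s in self.sub_steps)

personal_details = CompositeStep(
    "PersonalDetails",
    [FieldStep("profile"), FieldStep("additional_info"), FieldStep("vehicle")],
)
```

Because the composite honors the same contract, a workflow treats “PersonalDetails” exactly like any single step.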
The address collection step is the clearest proof that this works in practice. DoorDash built it once as a standalone module. When Australia needed address collection early in their flow for compliance checks, the team simply inserted the module before the compliance step in Australia’s workflow definition, without any special logic or branching. Canada later adopted the same step for validation and service-area mapping. It worked out of the box. The US team then experimented by enabling it in select regions, and again, with no new code.
This three-layer pattern isn’t specific to onboarding. Any multi-step process that varies across contexts (checkout flows, approval pipelines, content moderation queues) can be decomposed this way.
One important clarification here is that DoorDash’s step modules are not separate microservices. They are modules within a single service, which means the lesson here is about logical decomposition and interface design rather than strict deployment boundaries. Technically, we could apply this same pattern inside a monolith.
How does the system know where each applicant is in their journey?
Answering this question is essential to making modular steps work.
In the legacy system, this was a mess. Progress was tracked across multiple separate tables, each representing part of the workflow. Introducing a new onboarding step meant modifying several of these tables. Ensuring synchronization between them required close coordination across services, and it often broke down, leading to data mismatches and brittle integrations.
The new system introduced the status map, a single JSON object in the database where every step writes its own progress. It looks something like this:
{
  "personal_info": { "status": "DONE", "metadata": { "name": "Jane" } },
  "address": { "status": "DONE", "metadata": { "address_id": "abc123" } },
  "validation": { "status": "IN_PROGRESS" },
  "compliance": { "status": "INIT" }
}

Each step is responsible for updating its own entry in the map. When a step starts, completes, fails, or gets skipped, it writes that transition directly to its entry. The workflow layer never writes to the status map. It just reads it.
See the diagram below:

Each step also exposes an isStepCompleted() method that defines its own completion logic based on the status map. One step might treat “SKIPPED” as a terminal state, while another might not. This flexibility lives at the step level, not the workflow level, which keeps the orchestration logic simple and stateless.
The practical benefit is immediate. A single query on the status map tells you exactly where any applicant stands in their onboarding journey. Partial updates are handled through atomic JSON key merges, meaning that when one step updates its status, it only touches its own entry without overwriting the rest of the map.
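The merge semantics can be simulated in a few lines. This is a sketch: in production the merge would be an atomic JSON key update in the database, while here the same shallow-merge behavior is modeled in memory with invented helper and key names.

```python
# Sketch of the status-map update semantics: a step replaces only its
# own top-level entry, leaving sibling entries untouched. In a real
# system this would be an atomic JSON key merge in the database.

def merge_step_status(status_map: dict, step_key: str, entry: dict) -> dict:
    # Shallow merge: swap in this step's entry, keep everything else.
    return {**status_map, step_key: entry}

status_map = {
    "personal_info": {"status": "DONE", "metadata": {"name": "Jane"}},
    "address": {"status": "IN_PROGRESS"},
}
status_map = merge_step_status(
    status_map, "address",
    {"status": "DONE", "metadata": {"address_id": "abc123"}},
)
```

Because each step owns exactly one top-level key, concurrent steps never clobber each other’s progress.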
The architecture is only half the story. Getting there without breaking a running system is where the real engineering difficulty lives.
DoorDash didn’t flip a switch. They designed the new platform to coexist with the existing V2 and V3 APIs, running old and new systems side by side. Applicants who had partially completed onboarding under the legacy system needed to continue seamlessly, so the team built temporary synchronization mechanisms that mirrored progress between systems until the migration was complete. This parallel operation was itself a temporary technical debt, built intentionally to be thrown away.
Other major initiatives were underway during the rebuild, sometimes conflicting with the new onboarding design. Rather than treating these as blockers, the team collaborated across those efforts and adapted the architecture where necessary.
The migration started in January 2025 with the US, DoorDash’s largest and most complex market, as the proving ground. Then the compounding payoff kicked in. Australia was completed in under a month, needing only two localized steps. Canada followed in two weeks with a single new module. Puerto Rico took a week with a minor customization. New Zealand required almost no new development.
Every migration launched with zero regressions, no user-facing incidents, no onboarding downtime, and no unexpected drop-offs in completion rates. Each rollout got faster because more modules had already been battle-tested by thousands of Dashers in prior markets.
The architecture has also proven its value beyond adding countries. DoorDash is integrating its onboarding with another large, independently developed ecosystem that has its own mature onboarding flow. The modular design allowed them to build integration-specific workflows while reusing much of the existing logic, something that would have been extremely painful with the legacy system.
The tradeoffs are real, though. Modularity adds coordination overhead. For a single-market startup, this architecture is likely overkill. A monolithic onboarding flow is completely fine until you hit the inflection point where country-specific branching becomes more expensive than decomposition.
Reusable modules work well when the underlying concept generalizes across markets. For example, addresses are conceptually similar everywhere, which is why the address step was reused so cleanly. However, compliance requirements can be fundamentally different between regulatory regimes.
The boundary between the platform team and domain teams also requires ongoing negotiation. DoorDash addresses this through published platform principles, versioned interface contracts, and joint KPIs that create shared accountability. Domain expert teams own their business logic (fraud detection, compliance, payments) while the platform enforces consistency. This is a human coordination challenge that architecture alone doesn’t solve.
Looking ahead, DoorDash’s roadmap includes dynamic configuration loading to enable workflows to go live through config rather than code, step versioning to allow multiple iterations of a step to coexist during experiments or rollouts, and enhanced operational tooling to give non-engineering teams the ability to manage workflows directly.
That said, DoorDash deliberately kept workflows code-defined rather than jumping straight to config-driven. While config-driven systems are powerful, they introduce their own complexity. They can be harder to debug and harder to test.
Ultimately, what DoorDash built is a sort of pattern for any system that needs to support multiple variants of a multi-step process. The core idea is three layers (a thin orchestrator, composable workflows, and self-contained steps behind standardized interfaces) connected by a single shared state structure.