Eventual consistency is a key architectural choice in modern distributed systems. When we choose eventual consistency, we trade immediate synchronization across all database copies for better performance, scalability, and availability.
Using eventual consistency has been a key factor in our ability to build systems that serve millions of users globally. Whether we are building social media platforms, e-commerce sites, or real-time gaming applications, eventual consistency gives us the tools to keep serving data under heavy load and during failures.
In this article, we will look at what eventual consistency is, why it exists, how to control it, and how to handle the challenges it creates.
If slow QA processes are bottlenecking your software engineering team and you’re releasing more slowly because of it, you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
When Stripe first launched, they became known for letting any business integrate payment processing with just seven lines of code.
This was a really big achievement. Taking something as complex as credit card processing and reducing it to a simple code snippet felt revolutionary. In essence, a developer could open a terminal, run a basic curl command, and immediately see a successful credit card payment.
However, building and maintaining a payment API that works across dozens of countries, each with different payment methods, banking systems, and regulatory requirements, is one of the most difficult engineering problems in the industry. Most of the time, companies either lock themselves into supporting just one or two payment methods or force developers to write different integration code for each market.
Stripe had to evolve the API multiple times over the next 10 years to handle credit cards, bank transfers, Bitcoin wallets, and cash payments through a unified integration.
But getting there wasn’t easy. In this article, we look at how Stripe’s payment APIs evolved over the years, the technical challenges they faced, and the engineering decisions that shaped modern payment processing.
Disclaimer: This post is based on publicly shared details from the Stripe Engineering Team. Please comment if you notice any inaccuracies.
When Stripe launched in 2011, credit cards dominated the US payment landscape. The initial API design reflected this reality.
Stripe introduced two fundamental concepts that would become the foundation of their platform.
The Token was the first concept. When a customer entered their card details in a web browser, those details were sent directly to Stripe’s servers using a JavaScript library called Stripe.js.
This was crucial for security. By never allowing card data to touch the merchant’s servers, Stripe helped businesses avoid complex PCI compliance requirements. PCI compliance refers to security standards that businesses must follow when handling credit card information. These requirements are expensive and technically demanding to implement correctly.
In exchange for the card details, Stripe returned a Token. Think of a Token as a safe reference to the card information. The actual card number lived in Stripe’s secure systems. The Token was just a pointer to that data.
The Charge was the second concept. After receiving a Token from the client, the merchant’s server could create a Charge using that Token and a secret API key.
A Charge represented the actual payment request. When you created a Charge, the payment either succeeded or failed immediately. This immediate response is called synchronous processing, meaning the result comes back right away.
See the diagram below that shows this approach:
The payment flow followed a pattern common in traditional web applications (sketched in code after the list):
JavaScript client creates a Token using a publishable API key
The browser sends the Token to the merchant’s server
The server creates a Charge using the Token and a secret API key
Payment succeeds or fails immediately
The server can fulfill the order based on the result
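The following is a minimal server-side sketch of this flow using Stripe’s Node library and the legacy Charges API. The API key, token id, and amount are placeholders, and the exact parameter shapes may differ from today’s library versions.

```typescript
// Server-side sketch of the original Tokens + Charges flow (legacy Stripe API).
// Assumes the browser has already exchanged card details for a Token via
// Stripe.js and posted its id (e.g. "tok_...") to this endpoint.
import Stripe from "stripe";

const stripe = new Stripe("sk_test_..."); // secret API key (placeholder)

async function handleCheckout(tokenId: string) {
  // Create a Charge: the payment succeeds or fails within this single call.
  const charge = await stripe.charges.create({
    amount: 2000,     // amount in the smallest currency unit (cents)
    currency: "usd",
    source: tokenId,  // the Token returned by Stripe.js
  });

  if (charge.status === "succeeded") {
    // Card payments finalize synchronously, so the order can be fulfilled here.
  }
  return charge;
}
```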
As Stripe expanded, they needed to support payment methods beyond credit cards. In 2015, they added ACH debit and Bitcoin. These payment methods introduced fundamental differences that challenged the existing API design.
Payment methods differ along two important dimensions.
First, when is the payment finalized? Finalized means you have confidence that the funds are guaranteed and you can ship goods to the customer. Credit card payments are finalized immediately. However, Bitcoin payments can take about an hour, whereas ACH debit payments may take days to finalize.
Second, who initiates the payment? With credit cards and ACH debit, the merchant initiates the payment by charging the customer. With Bitcoin, the customer creates a transaction and sends it to the merchant. This requires the customer to take action before any money moves.
For ACH debit, Stripe extended the Token resource to represent both card details and bank account details. However, they needed to add a pending state to the Charge. An ACH debit Charge would start as pending and only transition to successful days later. Merchants had to implement webhooks to know when the payment actually succeeded.
See the diagram below:
For reference, a webhook is a mechanism where Stripe calls your server when something happens. Instead of your server repeatedly asking Stripe if the payment succeeded yet, Stripe sends a notification to a URL on your server when the status changes. Your server needs to set up an endpoint that listens for these notifications and processes them accordingly.
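A hedged sketch of such a webhook endpoint is shown below, using Express. The event names match Stripe’s documented charge events, but the route, port, and order-handling logic are placeholders, and signature verification is omitted for brevity.

```typescript
// Minimal webhook endpoint sketch (Express). Stripe POSTs events here when a
// pending ACH charge eventually succeeds or fails, so the server never polls.
import express from "express";

const app = express();

app.post("/stripe/webhook", express.raw({ type: "application/json" }), (req, res) => {
  // Production code should verify the payload with the endpoint's signing
  // secret (stripe.webhooks.constructEvent); skipped here for brevity.
  const event = JSON.parse(req.body.toString());

  if (event.type === "charge.succeeded") {
    // The ACH debit finalized days after creation: mark the order as paid.
  } else if (event.type === "charge.failed") {
    // The bank debit did not go through: notify the customer.
  }

  res.sendStatus(200); // acknowledge so Stripe stops retrying delivery
});

app.listen(3000);
```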
For Bitcoin, the existing abstractions did not work at all. Stripe introduced a new BitcoinReceiver resource. A receiver was a temporary holding place for funds. It had a simple state machine with one boolean property called filled. A state machine is a system that can be in different states and transitions between them based on events. The BitcoinReceiver could be filled (true) or not filled (false).
The Bitcoin payment flow worked like this:
Client creates a BitcoinReceiver.
The customer sends Bitcoin to the receiver’s address.
Receiver transitions to filled.
The server creates a Charge using the BitcoinReceiver.
The charge starts in the pending state.
Charge transitions to “succeeded” after confirmations.
See the diagram below:
This introduced complexity. Merchants now had to manage two state machines to complete a single payment: BitcoinReceiver on the client side and Charge on the server side. Additionally, they needed to handle asynchronous payment finalization through webhooks.
Over the next two years, Stripe added many more payment methods. Most were similar to Bitcoin, requiring customer action to initiate payment. The Stripe engineering team realized that creating a new receiver-like resource for each payment method would become unmanageable. Therefore, they decided to design a unified payments API.
To do so, Stripe combined Tokens and BitcoinReceivers into a single client-driven state machine called a Source. When created, a Source could be immediately chargeable, like credit cards, or pending, like payment methods requiring customer action. The server-side integration remained simple: create a Charge using the Source.
See the diagram below:
The Sources API supported cards, ACH debit, SEPA direct debit, iDEAL, Alipay, Giropay, Bancontact, WeChat Pay, Bitcoin, and many others. All of these payment methods use the same two API abstractions: a Source and a Charge.
While this approach seemed elegant at first, the team discovered serious problems once they understood how the flow integrated into real applications. Consider a common scenario with iDEAL, the predominant payment method in the Netherlands:
The customer completes payment on their bank’s website.
If the browser loses connectivity before communicating back to the merchant’s server, the server never creates a Charge.
After a few hours, Stripe automatically refunds the money to the customer. The merchant loses the sale even though the customer successfully paid. This is a conversion nightmare.
To reduce this risk, Stripe recommended that merchants either poll the API from their server until the Source became chargeable or listen for the source.chargeable webhook event to create the Charge. However, if a merchant’s application went down temporarily, these webhooks would not be delivered, and the server would not create the Charge.
The integration grew more complex because different Sources behaved differently:
Some Sources, like cards and bank accounts, were synchronously chargeable and could be charged immediately on the server. Others were asynchronous and could only be charged hours or days later. Merchants often built parallel integrations using both synchronous HTTP requests and event-driven webhook handlers.
For payment methods like OXXO, where customers print a physical voucher and pay cash at a store, the payment happens entirely outside the digital flow. Listening for the webhook became necessary for these payment methods.
Merchants also had to track both the Charge ID and Source ID for each order. If two Sources became chargeable for the same order, perhaps because a customer decided to switch payment methods mid-payment, the merchant needed logic to prevent double-charging.
See the diagram below:
Stripe realized they had designed their system around the simplest payment method: credit cards. Looking at all payment methods, cards were actually the outlier. Cards were the only payment method that finalized immediately and required no customer action to initiate payment. Everything else was more complex.
Developers had to understand the success, failure, and pending states of two state machines whose states varied across different payment methods. This demanded far more conceptual understanding than the original seven lines of code promised.
In late 2017, Stripe assembled a small team: four engineers and one product manager. They locked themselves in a conference room for three months with a singular goal of designing a truly unified payments API that would work for all payment methods globally.
The team followed strict rules:
They closed their laptops during working sessions to stay fully present.
They started each session with questions they wanted to answer and wrote down new questions for later sessions rather than getting sidetracked.
They used colors and shapes on whiteboards instead of naming concepts prematurely, avoiding premature anchoring on specific definitions.
Most importantly, they focused on enabling real user integrations. They wrote hypothetical integration guides for every payment method to validate their concepts.
They even wrote guides for imaginary payment methods to ensure the abstractions were flexible enough.

The team created two new concepts that finally achieved true unification.
PaymentMethod represents the “how of a payment.” It contains static information about the payment instrument the customer wants to use. This includes the payment scheme and credentials needed to move money, such as card information, bank account details, or customer email. For some methods (like Alipay), only the payment method name is required because the payment method itself handles collecting further information. Importantly, a PaymentMethod has no state machine and contains no transaction-specific data. It is simply a description of how to process a payment.
PaymentIntent represents the “what of a payment.” It captures transaction-specific data such as the amount to charge and the currency. The PaymentIntent is the stateful object that tracks the customer’s attempt to pay. If one payment attempt fails, the customer can try again with a different PaymentMethod. The same PaymentIntent can be used with multiple PaymentMethods until payment succeeds.
See the diagram below:
The key insight was creating one predictable state machine for all payment methods (captured in a short type sketch below):
requires_payment_method: Need to specify how the customer will pay
requires_confirmation: Have the payment method ready to initiate payment
requires_action: Customer must do something like authenticate or redirect
processing: Stripe is processing the payment
succeeded: Funds are guaranteed, and the merchant can fulfill the order
Notably, there is no failed state. If a payment attempt fails, the PaymentIntent returns to requires_payment_method so the customer can try again with a different method.
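The lifecycle can be captured in a few lines of TypeScript. The state names come straight from the list above; the transition map is an illustrative simplification rather than Stripe’s internal implementation.

```typescript
// The PaymentIntent lifecycle as a TypeScript union type.
type PaymentIntentStatus =
  | "requires_payment_method"
  | "requires_confirmation"
  | "requires_action"
  | "processing"
  | "succeeded";

// Illustrative transition map. Note that a failed attempt loops back to
// requires_payment_method instead of entering a terminal "failed" state.
const transitions: Record<PaymentIntentStatus, PaymentIntentStatus[]> = {
  requires_payment_method: ["requires_confirmation"],
  requires_confirmation: ["requires_action", "processing", "requires_payment_method"],
  requires_action: ["processing", "requires_payment_method"],
  processing: ["succeeded", "requires_payment_method"],
  succeeded: [],
};
```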
The new integration works consistently across all payment methods (a condensed code sketch follows the list):
The server creates a PaymentIntent with an amount and a currency
Server sends the PaymentIntent’s client_secret to the browser
The browser collects the customer’s preferred payment method
The browser confirms the PaymentIntent using the secret and payment method
PaymentIntent may enter requires_action state with instructions
The browser handles the action, such as 3D Secure authentication
Server listens for payment_intent.succeeded webhook
The server fulfills the order when payment succeeds
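The sketch below condenses the server and browser halves of this flow, using Stripe’s Node library and Stripe.js. The key, amount, element reference, and the stripeJs/clientSecret variables are placeholders, and the exact parameter shapes may differ from the current API.

```typescript
// --- Server: create the intent and hand its client_secret to the browser ---
import Stripe from "stripe";

const stripe = new Stripe("sk_test_..."); // placeholder secret key

async function createIntent() {
  const intent = await stripe.paymentIntents.create({
    amount: 2000,
    currency: "eur",
    // accepted payment method types vary by market, e.g. cards, iDEAL, SEPA debit
  });
  return intent.client_secret; // safe to send to the browser
}

// --- Browser: collect the payment method and confirm the intent ---
// (Stripe.js; cardElement would come from Stripe Elements.)
declare const stripeJs: any, cardElement: any, clientSecret: string;

async function confirmOnClient() {
  const { paymentIntent, error } = await stripeJs.confirmCardPayment(clientSecret, {
    payment_method: { card: cardElement },
  });
  // If 3D Secure was required, confirmCardPayment drove the requires_action
  // step; the server still waits for payment_intent.succeeded before fulfilling.
  return error ?? paymentIntent.status;
}
```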
This approach brought major improvements over Sources and Charges. Only one webhook handler was needed, and it was not in the critical path for collecting money. The entire flow used one predictable state machine. The integration was resilient to client disconnects because the PaymentIntent persisted on the server. Most importantly, the same integration worked for all payment methods with just parameter changes.
Designing the PaymentIntents API was the hard but enjoyable part. Launching it took almost two years because of a perception challenge: the new API did not feel like seven lines of code anymore.
In normalizing the API across all payment methods, card payments became more complicated to integrate. The new flow flipped the order of client and server requests. It also introduced webhook events that were optional before. For developers building traditional web applications who only cared about accepting card payments in the US and Canada, PaymentIntents was objectively harder than Charges.
The power-to-effort curve looked different. Each incremental payment method was cheap to add to a PaymentIntents integration. However, getting started with just card payments required more upfront effort. Speed matters for startups that want to get up and running quickly. With Charges, getting cards working was intuitive and low-effort.

Stripe’s solution was to add convenient packaging of the API that catered to developers who wanted the simplest possible flow. They called the default integration the global payments integration and created a simpler version called card payments without bank authentication.
This simpler integration used a special parameter called error_on_requires_action. This parameter tells the PaymentIntent to return an error if any customer action is required to complete the payment. A merchant using this parameter cannot handle actions required by the PaymentIntent state machine, effectively making it behave like the old Charges API.
The parameter name makes it very clear what merchants are choosing. When they eventually need to handle actions or add new payment methods, it is obvious what to do: remove this parameter and start handling the requires_action state. Developers using this packaging do not have to change the core resources even when upgrading to the full global integration.
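Below is a hedged sketch of that simpler packaging using Stripe’s Node library: creating and confirming in one call with error_on_requires_action set. The key, amount, and payment method id are placeholders.

```typescript
import Stripe from "stripe";

const stripe = new Stripe("sk_test_..."); // placeholder secret key

async function chargeCardSimply(paymentMethodId: string) {
  // A card that needs 3D Secure (or any other customer action) makes this call
  // fail with an error instead of entering requires_action. Upgrading later
  // means removing error_on_requires_action and handling that state.
  return stripe.paymentIntents.create({
    amount: 2000,
    currency: "usd",
    payment_method: paymentMethodId, // collected in the browser
    confirm: true,                   // create and confirm in a single request
    error_on_requires_action: true,  // opt out of handling customer actions
  });
}
```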
Stripe emphasized that a great API requires more than just the API itself. Some approaches they used are as follows:
They developed the Stripe CLI, a command-line tool that made testing webhooks locally much simpler.
They created Stripe Samples, allowing developers who prefer learning by example to start with working code.
They redesigned the Stripe Dashboard to help developers debug and understand the PaymentIntent state machine visually.
The team also handled the unglamorous but essential work of updating every piece of documentation, support article, and canned response that referenced old APIs. They reached out to community content creators, asking them to update their materials. They recorded numerous tutorials for both users and internal support teams.
The journey from Charges to PaymentIntents revealed important principles about API design.
First, successful products tend to accumulate product debt over time, similar to technical debt. For API products, this debt is particularly hard to address because you cannot force developers to restructure their integrations fundamentally. It is much easier to add parameters to existing requests than to introduce new abstractions.
Second, designing from first principles is essential. Stripe realized that Charges and Tokens were foundational, not because they were the right abstraction for global payments, but simply because they were the first APIs built. They had to set aside the existing APIs and think about the problem fresh.
Third, keeping things simple does not mean reducing the number of resources or parameters. Two overloaded abstractions are not simpler than four clearly-defined abstractions. Simplicity means making APIs consistent and predictable while creating the right packages.
Fourth, migration requires compromise. Stripe created Charge objects behind the scenes for each PaymentIntent to maintain compatibility with existing integrations. This allowed merchants to migrate their payment flow without breaking their analytics and reporting systems.
Finally, API design is fundamentally collaborative work. The breakthrough came when engineers and product managers worked together intensively, closing laptops and focusing completely on understanding the problem space.
In a nutshell, Stripe’s evolution from seven lines of code to a sophisticated global payments API demonstrates that simplicity and power are not opposing goals. The challenge is creating abstractions that handle complexity internally while presenting a predictable, consistent interface to developers.
Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and an analysis of the potential impact of every pull request.
Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.
CodeRabbit reviews 1 million PRs every week across 3 million repositories and is used by 100,000 open-source projects.
CodeRabbit is free for all open-source repos.
Cloudflare has reduced cold start delays in its Workers platform by a factor of 10 through a technique called worker sharding.
A cold start occurs when serverless code must initialize completely before handling a request. For Cloudflare Workers, this initialization involves four distinct phases:
Fetching the JavaScript source code from storage
Compiling that code into executable machine instructions
Executing any top-level initialization code
Finally, invoking the code to handle the incoming request
See the diagram below:
The improvement around cold starts means that 99.99% of requests now hit already-running code instances instead of waiting for code to start up.
The overall solution works by routing all requests for a specific application to the same server using a consistent hash ring, reducing the number of times code needs to be initialized from scratch.
In this article, we will look at how Cloudflare built this system and the challenges it faced.
Disclaimer: This post is based on publicly shared details from the Cloudflare Engineering Team. Please comment if you notice any inaccuracies.
In 2020, Cloudflare introduced a solution that masked cold starts by pre-warming Workers during TLS handshakes.
TLS is the security protocol that encrypts web traffic and makes HTTPS possible. Before any actual data flows between a browser and server, they perform a handshake to establish encryption. This handshake requires multiple round-trip messages across the network, which takes time.
The original technique worked because Cloudflare could identify which Worker to start from the Server Name Indication (SNI) field in the very first TLS message. While the rest of the handshake continued, they would initialize the Worker in the background. If the Worker finished starting up before the handshake completed, the user experienced zero visible delay.
See the diagram below:
This technique succeeded initially because cold starts took only 5 milliseconds while TLS 1.2 handshakes required three network round-trips. The handshake provided enough time to hide the cold start entirely.
The effectiveness of the TLS handshake technique depended on a specific timing relationship in which cold starts had to complete faster than TLS handshakes. Over the past five years, this relationship broke down for two reasons.
First, cold starts became longer. Cloudflare increased Worker script size limits from 1 megabyte to 10 megabytes for paying customers and to 3 megabytes for free users. They also increased the startup CPU time limit from 200 milliseconds to 400 milliseconds. These changes allowed developers to deploy much more complex applications on the Workers platform. Larger scripts require more time to transfer from storage and more time to compile. Longer CPU time limits mean initialization code can run for longer periods. Together, these changes pushed cold start times well beyond their original 5-millisecond duration.
Second, TLS handshakes became faster. TLS 1.3 reduced the handshake from three round-trips to just one round-trip. This improvement in security protocols meant less time to hide cold start operations in the background.
The combination of longer cold starts and shorter TLS handshakes meant that users increasingly experienced visible delays. The original solution no longer eliminated the problem.
Cloudflare realized that further optimizing cold start duration directly would be ineffective. Instead, they needed to reduce the absolute number of cold starts happening across their network.
The key insight involved understanding how requests were distributed across servers. Consider a Cloudflare data center with 300 servers. When a low-traffic application receives one request per minute, load balancing distributes these requests evenly across all servers. Each server receives one request approximately every five hours.
This distribution creates a problem. In busy data centers, five hours between requests is long enough that the Worker must be shut down to free memory for other applications. When the next request arrives at that server, it triggers a cold start. The result is a 100% cold start rate for low-traffic applications.
The solution involves routing all requests for a specific Worker to the same server within a data center. If all requests go to one server, that server receives one request per minute rather than one request every five hours. The Worker stays active in memory, and subsequent requests find it already running.
This approach provides multiple benefits. The application experiences mostly warm requests with only one initial cold start. Memory usage drops by over 99% because 299 servers no longer need to maintain copies of the Worker. This freed memory allows other Workers to stay active longer, creating improved performance across the entire system.
Cloudflare borrowed a technique from its HTTP caching system to implement worker sharding. The core data structure is called a consistent hash ring.
A naive approach to assigning Workers to servers would use a standard hash table. In this approach, each Worker identifier maps directly to a specific server address. This works fine until servers crash, get rebooted, or are added to the data center. When the number of servers changes, the entire hash table must be recalculated. Every Worker would get reassigned to a different server, causing universal cold starts.
A consistent hash ring solves this problem. Instead of directly mapping Workers to servers, both are mapped to positions on a number line that wraps around from end to beginning. Think of a clock face where positions range from 0 to 359 degrees.
The assignment process works as follows:
Hash each server address to a position on the ring
Hash each Worker identifier to a position on the ring
Assign each Worker to the first server encountered, moving clockwise from the Worker’s position
When a server disappears from the ring, only the Workers positioned immediately before it need reassignment. All other Workers remain with their current servers.
Similarly, when a new server joins, only Workers in a specific range move to the new server.
This stability is crucial for maintaining warm Workers. If the system constantly reshuffled Worker assignments, the benefits of routing requests to the same server would disappear.
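A minimal consistent hash ring can be sketched in a few dozen lines of TypeScript. This is an illustration of the idea rather than Cloudflare’s implementation: the hash function, server addresses, and Worker id are arbitrary, and refinements such as virtual nodes are omitted.

```typescript
// Servers and Worker ids are hashed onto a ring; a Worker is owned by the
// first server clockwise from its position.

function hash(key: string): number {
  // FNV-1a (32-bit) -- any stable hash works for this sketch.
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  private points: { pos: number; server: string }[] = [];

  addServer(server: string) {
    this.points.push({ pos: hash(server), server });
    this.points.sort((a, b) => a.pos - b.pos);
  }

  removeServer(server: string) {
    // Only Workers that hashed to the range just before this server move.
    this.points = this.points.filter((p) => p.server !== server);
  }

  homeServer(workerId: string): string {
    if (this.points.length === 0) throw new Error("no servers in the ring");
    const pos = hash(workerId);
    // First server clockwise from the Worker's position (wrapping around to 0).
    const owner = this.points.find((p) => p.pos >= pos) ?? this.points[0];
    return owner.server;
  }
}

// Usage: the shard client either runs the Worker locally or forwards it.
const ring = new HashRing();
["10.0.0.1", "10.0.0.2", "10.0.0.3"].forEach((s) => ring.addServer(s));
const target = ring.homeServer("customer-worker-42");
// if (target === selfAddress) run locally; else forward across the data center network
```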
The sharding system introduces two server roles in request handling:
The shard client is the server that initially receives a request from the internet.
The shard server is the home server for that specific Worker according to the consistent hash ring.
When a request arrives, the shard client looks up the Worker’s home server using the hash ring. If the shard client happens to be the home server, it executes the Worker locally. Otherwise, it forwards the request to the appropriate shard server over the internal data center network.
Forwarding requests between servers adds latency. Each forwarded request must travel across the data center network, adding approximately one millisecond to the response time. However, this overhead is much less than a typical cold start, which can take hundreds of milliseconds. Forwarding a request to a warm Worker is always faster than starting a cold Worker locally.
Worker sharding can concentrate traffic onto fewer servers, which creates a new problem. Individual Workers can receive enough traffic to overload their home server. The system must handle this situation gracefully without serving errors to users.
Cloudflare evaluated two approaches for load shedding:
The first approach has the shard client ask permission before sending each request. The shard server responds with either approval or refusal. If refused, the shard client handles the request locally by starting a cold Worker. This permission-based approach introduces an additional latency of one network round-trip on every sharded request. The shard client must wait for approval before sending the actual request data.
The second approach sends the request optimistically without waiting for permission. If the shard server becomes overloaded, it forwards the request back to the shard client. This avoids the round-trip latency penalty when the shard server can handle the request, which is the common case.
See the diagram below that shows the first, permission-based (pessimistic) approach:
Cloudflare chose the optimistic approach for two reasons.
First, refusals are rare in practice. When a shard client receives a refusal, it starts a local Worker instance and serves all future requests locally. After one refusal, that shard client stops sharding requests for that Worker until traffic patterns change.
Second, Cloudflare developed a technique to minimize the cost of forwarding refused requests back to the client.
See the diagram below:
The Workers runtime uses Cap’n Proto RPC for communication between server instances.
Cap’n Proto provides a distributed object model that simplifies complex scenarios. When assembling a sharded request, the shard client includes a special handle called a capability. This capability represents a lazy Worker instance that exists on the shard client but has not been initialized yet. The lazy Worker has the same interface as any other Worker, but only starts when first invoked.
If the shard server must refuse the request due to overload, it does not send a simple rejection message. Instead, it returns the shard client’s own lazy capability as the response.
The shard client’s application code receives a Worker capability from the shard server. It attempts to invoke this capability to handle the request. The RPC system recognizes that this capability actually points back to a local lazy Worker. Once it realizes the request would loop back to the shard client, it stops sending additional request bytes to the shard server and handles everything locally.
This mechanism prevents wasted bandwidth. Without it, the shard client might send a large request body to the shard server, only to have the entire body forwarded back again. Cap’n Proto’s distributed object model automatically optimizes this pattern by recognizing local capabilities and short-circuiting the communication path.
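The sketch below is a drastically simplified model of that handoff, with capabilities represented as plain TypeScript objects rather than Cap’n Proto’s actual API. It only shows the shape of the decision: send optimistically, and if the shard server refuses, the “response” is the client’s own lazy Worker, so the work happens locally.

```typescript
interface WorkerCap { run(req: string): Promise<string> }

// A lazy Worker has the normal interface but only initializes on first use.
function lazyWorker(init: () => WorkerCap): WorkerCap {
  let started: WorkerCap | null = null;
  return {
    run(req) {
      if (!started) started = init(); // cold start happens here, only if needed
      return started.run(req);
    },
  };
}

// Shard server: either handle the request or hand back the client's capability.
async function shardServerHandle(
  req: string,
  overloaded: boolean,
  clientLazy: WorkerCap,
  localWorker: WorkerCap,
): Promise<{ result?: string; runHereInstead?: WorkerCap }> {
  if (overloaded) return { runHereInstead: clientLazy };
  return { result: await localWorker.run(req) };
}

// Shard client: send first; if the returned capability points back at us,
// short-circuit and run locally (and stop sharding this Worker for a while).
async function shardClientSend(
  req: string,
  myLazy: WorkerCap,
  serverBusy: boolean,
  serverWorker: WorkerCap,
): Promise<string> {
  const reply = await shardServerHandle(req, serverBusy, myLazy, serverWorker);
  if (reply.runHereInstead === myLazy) return myLazy.run(req);
  return reply.result!;
}
```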
Many Cloudflare products involve Workers invoking other Workers.
Service Bindings allow one Worker to call another directly. Workers KV, despite appearing as a storage service, actually involves cross-Worker invocations. However, the most complex scenario involves Workers for Platforms.
Workers for Platforms enables customers to build their own serverless platforms on Cloudflare infrastructure. A typical request flow involves three or four different Workers.
First, a dynamic dispatch Worker receives the request and selects which user Worker should handle it.
The user Worker processes the request, potentially invoking an outbound Worker to intercept network calls.
Finally, a tail Worker might collect logs and traces from the entire request flow.
These Workers can run on different servers across the data center. Supporting sharding for nested Worker invocations requires passing the execution context between servers.
The execution context includes information like permission overrides, resource limits, feature flags, and logging configurations. When Workers ran on a single server, managing this context was straightforward. With sharding, the context must travel between servers as Workers invoke each other.
Cloudflare serializes the context stack into a Cap’n Proto message and includes it in sharded requests. The shard server deserializes the context and continues execution with the correct configuration.
The tail Worker scenario demonstrates Cap’n Proto’s power. A tail Worker must receive traces from potentially many servers that participated in handling a request. Rather than having each server know where to send traces, the system includes a callback capability in the execution context. Each server simply invokes this callback with its trace data. The RPC system automatically routes these calls back to the dynamic dispatch Worker’s home server, where all traces are collected together.
After deploying worker sharding globally, Cloudflare measured several key metrics:
Only 4% of total enterprise requests are sharded to a different server. This low percentage reflects that 96% of requests go to high-traffic Workers that run multiple instances across many servers.
Despite sharding only 4% of requests, the global Worker eviction rate dropped by a factor of 10. Eviction rate measures how often Workers are shut down to free memory. Fewer evictions indicate that memory is being used more efficiently across the system.
The fact that sharding just 4% of requests yields a tenfold efficiency improvement stems from the power-law distribution of internet traffic. A small number of Workers receive the vast majority of requests.
These high-traffic Workers already maintained warm instances before sharding. Meanwhile, a large number of Workers receive relatively few requests. These low-traffic Workers suffered from frequent cold starts and are exactly the ones sharding helps.
The warm request rate for enterprise traffic increased from 99.9% to 99.99%. This improvement represents going from three nines to four nines of reliability. Equivalently, the cold start rate decreased from 0.1% to 0.01% of all requests. This is a 10 times reduction in how often users experience cold start delays.
The warm request rate also became less volatile throughout each day. Previous patterns showed significant variation as traffic levels changed. Sharding smoothed these variations by ensuring low-traffic Workers maintained warm instances even during off-peak hours.
Cloudflare’s worker sharding system demonstrates how distributed systems techniques can solve performance problems that direct optimization cannot address. Rather than making cold starts faster, they made cold starts less frequent. Rather than using more computing resources, they used existing resources more efficiently.
🤖 Most AI coding tools only see your source code. Seer, Sentry’s AI debugging agent, uses everything Sentry knows about how your code has behaved in production to debug locally, in your PR, and in production.
🛠️ How it works:
Seer scans and analyzes issues using all of Sentry’s available context.
In development, Seer debugs alongside you as you build.
In review, Seer alerts you to bugs that are likely to break production, not nits.
In production, Seer can find a bug’s root cause, suggest a fix, open a PR automatically, or send the fix to your preferred IDE.
OpenAI scaled PostgreSQL to handle millions of queries per second for 800 million ChatGPT users. They did it with just a single primary writer supported by read replicas.
At first glance, this should sound impossible. The common wisdom suggests that beyond a certain scale, you must shard the database or risk failure. The conventional playbook recommends embracing the complexity of splitting the data across multiple independent databases.
OpenAI’s engineering team chose a different path. They decided to see just how far they could push PostgreSQL.
Over the past year, their database load grew by more than 10x. They experienced the familiar pattern of database-related incidents: cache layer failures causing sudden read spikes, expensive queries consuming CPU, and write storms from new features. Yet through systematic optimization across every layer of their stack, they achieved five-nines availability with low double-digit millisecond latency. But the road wasn’t easy.
In this article, we will look at the challenges OpenAI faced while scaling Postgres and how the team handled the various scenarios.
Disclaimer: This post is based on publicly shared details from the OpenAI Engineering Team. Please comment if you notice any inaccuracies.
A single-primary architecture means one database instance handles all writes, while multiple read replicas handle read queries.
See the diagram below:
This design creates an inherent bottleneck because writes cannot be distributed. However, for read-heavy workloads like ChatGPT, where users primarily fetch data rather than modify it, this architecture can scale effectively if properly optimized.
OpenAI avoided sharding its PostgreSQL deployment for pragmatic reasons. Sharding would require modifying hundreds of application endpoints and could take months or years to complete. Since their workload is primarily read-heavy and current optimizations provide sufficient capacity, sharding remains a future consideration rather than an immediate necessity.
So how did OpenAI go about scaling the read replicas? There were three main pillars to their overall strategy:
The primary database represents the system’s most critical bottleneck. OpenAI implemented multiple strategies to reduce pressure on this single writer:
Offloading Read Traffic: OpenAI routes most read queries to replicas rather than the primary. However, some read queries must remain on the primary because they occur within write transactions. For these queries, the team ensures maximum efficiency to avoid slow operations that could cascade into broader system failures.
Migrating Write-Heavy Workloads: The team migrated workloads that could be horizontally partitioned to sharded systems like Azure Cosmos DB. These shardable workloads can be split across multiple databases without complex coordination. Workloads that are harder to shard continue to use PostgreSQL but are being gradually migrated.
Application-Level Write Optimization: OpenAI fixed application bugs that caused redundant database writes. They implemented lazy writes where appropriate to smooth traffic spikes rather than hitting the database with sudden bursts. When backfilling table fields, they enforce strict rate limits even though the process can take over a week. This patience prevents write spikes that could impact production stability.
First, OpenAI identified several expensive queries that consumed disproportionate CPU resources. One particularly problematic query joined 12 tables, and spikes in this query’s volume caused multiple high-severity incidents.
The team learned to avoid complex multi-table joins in their OLTP system. When joins are necessary, OpenAI breaks down complex queries and moves join logic to the application layer, where it can be distributed across multiple application servers.
Object-Relational Mapping frameworks, commonly known as ORMs, generate SQL automatically from code objects. While convenient for developers, ORMs can produce inefficient queries. OpenAI carefully reviews all ORM-generated SQL to ensure it performs as expected. They also configure timeouts like idle_in_transaction_session_timeout to prevent long-running idle queries from blocking autovacuum (PostgreSQL’s cleanup process).
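A small sketch of this kind of guardrail is shown below, using the node-postgres (pg) client to apply per-session timeouts. The specific values and the connection string are illustrative assumptions, not OpenAI’s actual configuration.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

pool.on("connect", async (client) => {
  // Abort transactions that sit idle so they cannot block autovacuum...
  await client.query("SET idle_in_transaction_session_timeout = '30s'");
  // ...and cap how long any single (possibly ORM-generated) statement may run.
  await client.query("SET statement_timeout = '5s'");
});
```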
Second, Azure PostgreSQL instances have a maximum connection limit of 5,000. OpenAI previously experienced incidents where connection storms exhausted all available connections, bringing down the service.
Connection pooling solves this problem by reusing database connections rather than creating new ones for each request. Think of it as carpooling. Instead of everyone driving their own car to work, people share vehicles to reduce traffic congestion.
OpenAI deployed PgBouncer as a proxy layer between applications and databases. PgBouncer runs in statement or transaction pooling mode, efficiently reusing connections and reducing the number of active client connections. In benchmarks, average connection time dropped from 50 milliseconds to just 5 milliseconds.
Each read replica has its own Kubernetes deployment running multiple PgBouncer pods. Multiple deployments sit behind a single Kubernetes Service that load-balances traffic across pods. OpenAI co-locates the proxy, application clients, and database replicas in the same geographic region to minimize network latency and connection overhead.
See the diagram below:
Today’s AI agents are mostly chatbots and copilots - reactive tools waiting for human input. But agents are moving into the backend: running autonomously, replacing brittle rule engines with reasoning, creating capabilities you couldn’t build with deterministic pipelines.
This changes everything about your architecture. Agent reasoning takes seconds, not milliseconds. You need identity beyond API keys. You need to know why an agent made every decision. And you need to scale from one prototype to thousands.
AgentField is the open-source infrastructure layer for autonomous AI agents in production.
OpenAI identified a recurring pattern in their incidents. To reduce read pressure on PostgreSQL, OpenAI uses a caching layer to serve most read traffic.
However, when cache hit rates drop unexpectedly, the burst of cache misses can push massive request volumes directly to PostgreSQL. In other words, an upstream issue causes a sudden spike in database load. This could be widespread cache misses from a caching layer failure, expensive multi-way joins saturating the CPU, or a write storm from a new feature launch.
As resource utilization climbs, query latency rises, and requests begin timing out. Applications then retry failed requests, which further amplifies the load. This creates a feedback loop that can degrade the entire service.
To prevent this situation, the OpenAI engineering team implemented a cache locking and leasing mechanism. When multiple requests miss on the same cache key, only one request acquires a lock and fetches data from PostgreSQL to repopulate the cache. All other requests wait for the cache update rather than simultaneously hitting the database.
See the diagram below:
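The control flow can be illustrated with a single-flight sketch in TypeScript. A real deployment would use a distributed lock or lease in the cache tier (and the loadFromPostgres function is a placeholder); this in-process version only shows how concurrent misses collapse into one database fetch.

```typescript
const cache = new Map<string, string>();
const inFlight = new Map<string, Promise<string>>();

async function getWithLease(
  key: string,
  loadFromPostgres: (key: string) => Promise<string>, // placeholder DB query
): Promise<string> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;          // normal cache hit

  const pending = inFlight.get(key);
  if (pending) return pending;                // another request holds the lease

  const fetch = (async () => {
    try {
      const value = await loadFromPostgres(key); // exactly one DB round trip
      cache.set(key, value);
      return value;
    } finally {
      inFlight.delete(key);                      // release the lease
    }
  })();

  inFlight.set(key, fetch);
  return fetch;
}
```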
Taking further precautions, OpenAI implemented rate limiting across the application, connection pooler, proxy, and query layers. This prevents sudden traffic spikes from overwhelming database instances and triggering cascading failures. They also avoid overly short retry intervals, which can trigger retry storms where failed requests multiply exponentially.
The team enhanced their ORM layer to support rate limiting and can fully block specific query patterns when necessary. This targeted load shedding enables rapid recovery from sudden surges of expensive queries.
Despite all this, OpenAI encountered situations where certain requests consumed disproportionate resources on PostgreSQL instances, creating a problem known as the noisy neighbor effect. For example, a new feature launch might introduce inefficient queries that heavily consume CPU, slowing down other critical features.
To mitigate this, OpenAI also isolates workloads onto dedicated instances. They split requests into low-priority and high-priority tiers and route them to separate database instances. This ensures that low-priority workload spikes cannot degrade high-priority request performance. The same strategy applies across different products and services.
PostgreSQL uses Multi-Version Concurrency Control for managing concurrent transactions. When a query updates a tuple (database row) or even a single field, PostgreSQL copies the entire row to create a new version. This design allows multiple transactions to access different versions simultaneously without blocking each other.
However, MVCC creates challenges for write-heavy workloads. It causes write amplification because updating one field requires writing an entire row. It also causes read amplification because queries must scan through multiple tuple versions, called dead tuples, to retrieve the latest version. This leads to table bloat, index bloat, increased index maintenance overhead, and complex autovacuum tuning requirements.
OpenAI’s primary strategy for addressing MVCC limitations involves migrating write-heavy workloads to alternative systems and optimizing applications to minimize unnecessary writes. They also restrict schema changes to lightweight operations that do not trigger full table rewrites.
Another constraint with Postgres is related to schema changes. Even small schema changes like altering a column type can trigger a full table rewrite in PostgreSQL. During a table rewrite, PostgreSQL creates a new copy of the entire table with the change applied. For large tables, this can take hours and block access.
To handle this, OpenAI enforces strict rules around schema changes:
Only lightweight schema changes are permitted, such as adding or removing certain columns that do not trigger table rewrites.
All schema changes have a 5-second timeout.
Creating and dropping indexes must be done concurrently to avoid blocking.
Schema changes are restricted to existing tables only.
New features requiring additional tables must use alternative sharded systems like Azure Cosmos DB.
When backfilling a table field, OpenAI applies strict rate limits even though the process can take over a week. This ensures stability and prevents production impact.
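The sketch below shows what such guardrails might look like with node-postgres. The table and column names are hypothetical, the timeout values simply mirror the 5-second rule above, and OpenAI’s actual migration tooling is not public.

```typescript
import { Client } from "pg";

async function runSafeMigration() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // Give up after 5 seconds if the DDL cannot acquire its lock, instead of
  // queueing behind long transactions and blocking every other query.
  await client.query("SET lock_timeout = '5s'");
  await client.query("SET statement_timeout = '5s'");

  // Lightweight change only: adding a nullable column does not rewrite the table.
  await client.query("ALTER TABLE conversations ADD COLUMN archived_at timestamptz");

  // Index builds must not block writes, so they run CONCURRENTLY
  // (outside a transaction block, and without the statement timeout).
  await client.query("SET statement_timeout = 0");
  await client.query(
    "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_conversations_archived_at " +
      "ON conversations (archived_at)"
  );

  await client.end();
}
```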
With a single primary database, the failure of that instance affects the entire service. OpenAI addressed this critical risk through multiple strategies.
First, they offloaded most critical read-only requests from the primary to replicas. If the primary fails, read operations continue functioning. While write operations would still fail, the impact is significantly reduced.
Second, OpenAI runs the primary in High Availability mode with a hot standby. A hot standby is a continuously synchronized replica that remains ready to take over immediately. If the primary fails or requires maintenance, OpenAI can quickly promote the standby to minimize downtime. The Azure PostgreSQL team has done significant work ensuring these failovers remain safe and reliable even under high load.
For read replica failures, OpenAI deploys multiple replicas in each region with sufficient capacity headroom. A single replica failure does not lead to a regional outage because traffic automatically routes to other replicas.
The primary database streams Write Ahead Log data to every read replica. WAL contains a record of all database changes, which replicas replay to stay synchronized. As the number of replicas increases, the primary must ship WAL to more instances, increasing pressure on network bandwidth and CPU. This causes higher and more unstable replica lag.
OpenAI currently operates nearly 50 read replicas across multiple geographic regions. While this scales well with large instance types and high network bandwidth, the team cannot add replicas indefinitely without eventually overloading the primary.
To address this future constraint, OpenAI is collaborating with the Azure PostgreSQL team on cascading replication. In this architecture, intermediate replicas relay WAL to downstream replicas rather than the primary streaming to every replica directly. This tree structure allows scaling to potentially over 100 replicas without overwhelming the primary. However, it introduces additional operational complexity, particularly around failover management. The feature remains in testing until the team ensures it can fail over safely.
See the diagram below:
OpenAI’s optimization efforts have delivered impressive results.
The system handles millions of queries per second while maintaining replication lag near zero. The architecture delivers low double-digit millisecond p99 latency, meaning 99 percent of requests complete within a few tens of milliseconds. The system achieves five-nines availability, equivalent to 99.999 percent uptime.
Over the past 12 months, OpenAI experienced only one SEV-0 PostgreSQL incident. This occurred during the viral launch of ChatGPT ImageGen when write traffic suddenly surged by more than 10x as over 100 million new users signed up within a week.
Looking ahead, OpenAI continues migrating remaining write-heavy workloads to sharded systems. The team is working with Azure to enable cascading replication for safely scaling to significantly more read replicas. They will continue exploring additional approaches, including sharded PostgreSQL or alternative distributed systems as infrastructure demands grow.
OpenAI’s experience shows that PostgreSQL can reliably support much larger read-heavy workloads than conventional wisdom suggests. However, achieving this scale requires rigorous optimization, careful monitoring, and operational discipline. The team’s success came not from adopting the latest distributed database technology but from deeply understanding their workload characteristics and eliminating bottlenecks.
Richard Socher and Bryan McCann are among the most-cited AI researchers in the world. They just released 35 predictions for 2026. Three that stand out:
The LLM revolution has been “mined out” and capital floods back to fundamental research
“Reward engineering” becomes a job; prompts can’t handle what’s coming next
Traditional coding will be gone by December; AI writes the code and humans manage it
This week’s system design refresher:
MCP vs RAG vs AI Agents
How ChatGPT Routes Prompts and Handles Modes
Agent Skills, Clearly Explained
12 Architectural Concepts Developers Should Know
How to Deploy Services
Everyone is talking about MCP, RAG, and AI Agents. Most people are still mixing them up. They’re not competing ideas. They solve very different problems at different layers of the stack.
MCP (Model Context Protocol) is about how LLMs use tools. Think of it as a standard interface between an LLM and external systems. Databases, file systems, GitHub, Slack, internal APIs.
Instead of every app inventing its own glue code, MCP defines a consistent way for models to discover tools, invoke them, and get structured results back. MCP doesn’t decide what to do. It standardizes how tools are exposed.
RAG (Retrieval-Augmented Generation) is about what the model knows at runtime. The model stays frozen. No retraining. When a user asks a question, a retriever fetches relevant documents (PDFs, code, vector DBs), and those are injected into the prompt.
RAG is great for:
Internal knowledge bases
Fresh or private data
Reducing hallucinations
But RAG doesn’t take actions. It only improves answers.
AI Agents are about doing things. An agent observes, reasons, decides, acts, and repeats. It can call tools, write code, browse the internet, store memory, delegate tasks, and operate with different levels of autonomy.
GPT-5 is not one model.
It is a unified system with multiple models, safeguards, and a real-time router.
This post and diagram are based on our understanding of the GPT-5 system card.
When you send a query, the mode determines which model to use and how much work the system does.
Instant mode sends the query directly to a fast, non-reasoning model named GPT-5-main. It optimizes for latency and is used for simple or low-risk tasks like short explanations or rewrites.
Thinking mode uses a reasoning model named GPT-5-thinking that runs multiple internal steps before producing the final answer. This improves correctness on complex tasks like math or planning.
Auto mode adds a real-time router. A lightweight classifier looks at the query and decides whether to route it to GPT-5-main or, when deeper reasoning is needed, to GPT-5-thinking.
Pro mode does not use a different model. It uses GPT-5-thinking but samples multiple reasoning attempts and selects the best one using a reward model.
Across all modes, safeguards run in parallel at various stages. A fast topic classifier determines whether the topic is high-risk, followed by a reasoning monitor that applies stricter checks to ensure unsafe responses are blocked.
Over to you: What's your favorite AI chatbot?
Unblocked is the only AI code review tool that has deep understanding of your codebase, past decisions, and internal knowledge, giving you high-value feedback shaped by how your system actually works instead of flooding your PRs with stylistic nitpicks.
Why do we need Agent Skills? Long prompts hurt agent performance. Instead of one massive prompt, agents keep a small catalog of skills: reusable playbooks with clear instructions, loaded only when needed.
Here is what the Agent Skills workflow looks like:
User Query: A user submits a request like “Analyze data & draft report”.
Build Prompt + Skills Index: The agent runtime combines the query with Skills metadata, a lightweight list of available skills and their short descriptions.
Reason & Select Skill: The LLM processes the prompt, thinks, and decides: "I want Skill X."
Load Skill into Context: The agent runtime receives the specific skill request from the LLM. Then, it loads SKILL.md and adds it to the LLM's active context.
Final Output: The LLM follows SKILL.md, runs scripts, and generates the final report.
By dynamically loading skills only when needed, Agent Skills keep context small and the LLM’s behavior consistent.
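A hedged sketch of this pattern is shown below. The file layout, skill names, and the llm helper are hypothetical; the idea is simply that only the lightweight index lives in the base prompt, and the full SKILL.md is loaded on demand.

```typescript
import { readFile } from "node:fs/promises";

interface SkillMeta { name: string; description: string; path: string }

const skills: SkillMeta[] = [
  { name: "analyze-data", description: "Run statistical analysis on CSV files", path: "skills/analyze-data/SKILL.md" },
  { name: "draft-report", description: "Turn analysis results into a report",   path: "skills/draft-report/SKILL.md" },
];

async function runAgent(userQuery: string, llm: (prompt: string) => Promise<string>) {
  // Steps 1-2: build a prompt that contains only the lightweight skills index.
  const index = skills.map((s) => `- ${s.name}: ${s.description}`).join("\n");
  const choice = await llm(`Available skills:\n${index}\n\nTask: ${userQuery}\nWhich skill do you want?`);

  // Steps 3-4: load the chosen skill's full playbook into context on demand.
  const selected = skills.find((s) => choice.includes(s.name));
  if (!selected) return llm(userQuery);             // fall back to no skill
  const playbook = await readFile(selected.path, "utf8");

  // Step 5: the model follows the loaded SKILL.md to produce the final output.
  return llm(`${playbook}\n\nTask: ${userQuery}`);
}
```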
Over to you: What skills would you find most useful in agents?
Load Balancing: Distributes incoming traffic across multiple servers to ensure no single node is overwhelmed.
Caching: Stores frequently accessed data in memory to reduce latency.
Content Delivery Network (CDN): Stores static assets across geographically distributed edge servers so users download content from the nearest location.
Message Queue: Decouples components by letting producers enqueue messages that consumers process asynchronously.
Publish-Subscribe: Enables multiple consumers to receive messages from a topic.
API Gateway: Acts as a single entry point for client requests, handling routing, authentication, rate limiting, and protocol translation.
Circuit Breaker: Monitors downstream service calls and stops attempts when failures exceed a threshold.
Service Discovery: Automatically tracks available service instances so components can locate and communicate with each other dynamically.
Sharding: Splits large datasets across multiple nodes based on a specific shard key.
Rate Limiting: Controls the number of requests a client can make in a given time window to protect services from overload.
Consistent Hashing: Distributes data across nodes in a way that minimizes reorganization when nodes join or leave.
Auto Scaling: Automatically adds or removes compute resources based on defined metrics.
Over to you: Which architectural concept will you add to the list?
Deploying or upgrading services is risky. In this post, we explore risk mitigation strategies.
The diagram below illustrates the common ones.
Multi-Service Deployment
In this model, we deploy new changes to multiple services simultaneously. This approach is easy to implement. But since all the services are upgraded at the same time, it is hard to manage and test dependencies. It’s also hard to roll back safely.
Blue-Green Deployment
With blue-green deployment, we have two identical environments: one is staging (blue) and the other is production (green). The staging environment is one version ahead of production. Once testing is done in the staging environment, user traffic is switched over, and staging becomes the new production. This strategy makes rollback simple, but maintaining two identical production-quality environments can be expensive.
Canary Deployment
A canary deployment upgrades services gradually, each time for a subset of users. It is cheaper than blue-green deployment and makes rollback easy. However, since there is no staging environment, we have to test in production. This process is more complicated because we need to monitor the canary while gradually migrating more and more users away from the old version.
A/B Test
In an A/B test, different versions of a service run in production simultaneously. Each version runs an “experiment” for a subset of users. A/B testing is a cheap way to test new features in production. However, we need to control the deployment process carefully to ensure features are not pushed to users by accident.
Over to you - Which deployment strategy have you used? Did you witness any deployment-related outages in production and why did they happen?
Software architecture patterns are reusable solutions to common problems that occur when designing software systems. Think of them as blueprints that have been tested and proven effective by countless developers over many years.
When we build applications, we often face similar challenges, such as how to organize code, how to scale systems, or how to handle communication between different parts of an application. Architecture patterns provide us with established approaches to solve these challenges.
Learning about architecture patterns offers several key benefits.
First, it increases our productivity because we do not need to invent solutions from scratch for every project.
Second, it improves our code quality by following proven approaches that make systems more maintainable and easier to understand.
Third, it enhances communication within development teams by providing a common vocabulary to discuss design decisions.
In this article, we will explore the essential architecture patterns that every software engineer should understand. We will look at how each pattern works, when to use it, what performance characteristics it has, and see practical examples of each.