
Messaging Patterns Explained: Pub-Sub, Queues, and Event Streams

2025-05-08 23:30:54

Modern software rarely lives on one machine anymore. Services run across clusters, web applications are dynamically rendered on the browser, and data resides on a mix of cloud platforms and in-house data centers. 

In this environment, coordination becomes harder, latency becomes visible, and reliability is crucial. Messaging patterns become a key implementation detail.

When two services or applications need to talk, the simplest move is a direct API call. It’s easy, familiar, and synchronous: one service waits for the other to respond. But that wait is exactly where things break.

What happens when the downstream service is overloaded? Or slow? Or down entirely? Suddenly, the system starts to stall: call chains pile up, retries back up, and failures increase drastically.

That’s where asynchronous communication changes the game.

Messaging decouples the sender from the receiver. A service doesn’t wait for another to complete work. It hands off a message and moves on. The message is safely stored in a broker, and the recipient can process it whenever it's ready. If the recipient fails, the message waits. If it’s slow, nothing else stalls.

However, not all messaging systems look the same. Three main patterns show up frequently:

  • Message Queues: One producer, one consumer. Tasks get processed once and only once.

  • Publish-Subscribe: One producer, many consumers. Messages fan out to multiple subscribers.

  • Event Streams: A durable, replayable log of events. Consumers can rewind, catch up, or read in parallel.

Each pattern solves a different problem and comes with trade-offs in reliability, ordering, throughput, and complexity. And each maps to different real-world use cases, from task queues in background job systems to high-throughput clickstream analytics to real-time chat.
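To make the first two patterns concrete, here is a minimal, in-memory sketch in Python (not tied to any particular broker): a queue delivers each message to exactly one consumer, while pub-sub fans each message out to every subscriber. An event stream would add a durable, replayable log with per-consumer offsets, which is omitted here for brevity.

```python
from collections import defaultdict, deque

class Broker:
    """Toy in-memory broker illustrating two delivery patterns."""

    def __init__(self):
        self.queues = defaultdict(deque)        # work queue: each message is consumed once
        self.subscribers = defaultdict(list)    # pub-sub: each message fans out to all subscribers

    # --- Message queue: one producer, one consumer per message ---
    def enqueue(self, queue_name, message):
        self.queues[queue_name].append(message)

    def dequeue(self, queue_name):
        return self.queues[queue_name].popleft() if self.queues[queue_name] else None

    # --- Publish-subscribe: fan out to every subscriber ---
    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

broker = Broker()
broker.enqueue("emails", {"to": "user@example.com"})           # exactly one worker will pick this up
broker.subscribe("orders", lambda m: print("billing saw", m))
broker.subscribe("orders", lambda m: print("shipping saw", m))
broker.publish("orders", {"order_id": 42})                      # both subscribers receive it
```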

In this article, we will look at these patterns in more detail, along with some common technologies that help implement these patterns.

Core Concepts

Read more

How Halo on Xbox Scaled to 10+ Million Players using the Saga Pattern

2025-05-06 23:31:14

The 6 Core Competencies of Mature DevSecOps Orgs (Sponsored)

Understand the core competencies that define mature DevSecOps organizations. This whitepaper offers a clear framework to assess your organization's current capabilities, define where you want to be, and outline practical steps to advance in your journey. Evaluate and strengthen your DevSecOps practices with Datadog's maturity model.

Access the whitepaper


Disclaimer: The details in this post have been derived from the articles/videos shared online by the Halo engineering team. All credit for the technical details goes to the Halo/343 Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

In today's world of massive-scale applications, whether it’s gaming, social media, or online shopping, building reliable systems is a difficult task. As applications grow, they often move from using a single centralized database to being spread across many smaller services and storage systems. This change, while necessary for handling more users and data, brings a whole new set of challenges, especially around consistency and transaction handling.

In traditional systems, if we wanted to update multiple pieces of data (say, saving a new customer order and reducing inventory), we could easily rely on a transaction. A transaction would guarantee that either all updates succeed together or none of them happen at all. 

However, in distributed systems, there is no longer just one database to talk to. We might have different services, each managing its data in different locations. Each one might be running on different servers, cloud providers, or even different continents. Suddenly, getting all of them to agree at the same time becomes much harder. Network failures, service crashes, and inconsistencies are now everyday situations.

This creates a huge problem: how do we maintain correct and reliable behavior when we can't rely on traditional transactions anymore? If we’re booking a trip, we don’t want to end up with a hotel reservation but no flight. If we’re updating a player's stats in a game, we can't afford for some stats to update and others to disappear.

Engineers must find new ways to coordinate operations across multiple independent systems to tackle these issues. One powerful pattern for solving this problem is the Saga Pattern, a technique originally proposed in the late 1980s but increasingly relevant today. In this article, we’ll look at how the Halo engineering team at 343 Industries (now Halo Studios) used the Saga pattern to improve the player experience.


ACID Transactions

When engineers design systems that store and update data, they often rely on a set of guarantees called ACID properties. These properties make sure that when something changes in the system, like saving a purchase or updating a player's stats, it happens safely and predictably. 

Here’s a quick look at each property.

  • Atomicity: Atomicity means "all or nothing". Either every step of a transaction happens successfully, or none of it does. If something goes wrong halfway, the system cancels everything so that it is not stuck with half-completed work. For example, either book both a flight and a hotel together, or neither (see the sketch after this list).

  • Consistency: Consistency guarantees that the system’s data moves from one valid state to another valid state. After a transaction finishes, the data must still follow all the rules and constraints it’s supposed to. For example, after buying a ticket, the system shouldn’t show negative seat availability. The rules about what’s valid are always respected.

  • Isolation: Isolation ensures that transactions don’t interfere with each other even if they happen at the same time. It should feel like each transaction happens one after another, even when they’re happening in parallel. For example, if two people are trying to buy the last ticket at the same time, only one transaction succeeds.

  • Durability: Durability means that once a transaction is committed, it’s permanent. Even if the system crashes right after, the result stays safe and won't be lost. For example, if a user bought a ticket and the app crashes immediately after, the ticket is still booked and not lost in the system.
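As a concrete sketch of the atomicity example above (flight and hotel booked together or not at all), here is a minimal example using Python's built-in sqlite3 module. The CHECK constraint stands in for a business rule the data must always satisfy, so the failed hotel update rolls back the flight update as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (item TEXT, qty INTEGER CHECK (qty >= 0))")
conn.execute("INSERT INTO bookings VALUES ('flight_seats', 1), ('hotel_rooms', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE bookings SET qty = qty - 1 WHERE item = 'flight_seats'")
        conn.execute("UPDATE bookings SET qty = qty - 1 WHERE item = 'hotel_rooms'")  # violates CHECK
except sqlite3.IntegrityError:
    print("booking failed; neither update was kept")

print(conn.execute("SELECT * FROM bookings").fetchall())  # flight_seats is still 1
```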

Single Database Model

In older system architectures, the typical way to build applications was to have a single, large SQL database that acted as the central source of truth. 

Every part of the application, whether it was a game like Halo, an e-commerce site, or a banking app, would send all of its data to this one place.

Here’s how it worked:

  • Applications (like a game server or a website) would talk to a stateless service.

  • That stateless service would then communicate directly with the single SQL database.

  • The database would store everything: player statistics, game progress, inventory, financial transactions, all in one consistent, centralized location.

Some advantages of the single database model were strong guarantees enforced by the ACID properties, simplicity of development, and vertical scaling.

Scalability Crisis of Halo 4

During the development of Halo 4, the engineering team faced scale challenges that had not been encountered in earlier titles of the franchise.

Halo 4 experienced an overwhelming level of engagement:

  • Over 1.5 billion games were played.

  • More than 11.6 million unique players connected and competed online.

Every match generated detailed telemetry data for each player: kills, assists, deaths, weapon usage, medals, and various other game-related statistics. This information needed to be ingested, processed, stored, and made accessible across multiple services, both for real-time feedback in the game itself and for external analytics platforms like Halo Waypoint.

The complexity further increased because a single match could involve anywhere from 1 to 32 players. For each game session, statistics needed to be reliably updated across multiple player records simultaneously, preserving data accuracy and consistency.

Inadequacy of a Single SQL Database

Before Halo 4, earlier installments in the series relied heavily on a centralized database model. 

A single large SQL Server instance operated as the canonical source of truth. Application services would interact with this centralized database to read and write all gameplay, player, and match data, relying on the built-in ACID guarantees to ensure data integrity.

However, the scale required by Halo 4 quickly revealed serious limitations in this model:

  • Vertical Scaling Limits: While the centralized database could be scaled vertically (by adding more powerful hardware), there were inherent physical and operational limits. Beyond a certain threshold, no amount of memory, CPU power, or storage optimization could compensate for the growing volume of concurrent reads and writes.

  • Single Point of Failure: Relying on one database instance introduced a critical operational risk. Any downtime, data corruption, or resource saturation in that instance could bring down essential services for the entire player base.

  • Contention and Locking Issues: With millions of users interacting with the database simultaneously, contention for locks and indexes became a bottleneck.

  • Operational Complexity of Partitioning: The original centralized system was not designed for partitioned workloads. Retroactively introducing sharding or partitioning into a monolithic SQL structure would have required major rewrites and complex operational procedures, creating risks of inconsistency and service instability.

  • Mismatch with Cloud-native Architecture: Halo 4’s backend migrated to Azure’s cloud infrastructure, specifically using Azure Table Storage. Azure Table Storage is a NoSQL system that inherently relies on partitioned storage and offers eventual consistency. The old model of transactional consistency across a single database did not align with this partitioned, distributed environment.

Given these challenges, the engineering team recognized that continuing to rely on a monolithic SQL database would limit scalability and expose the system to unacceptable levels of risk and downtime. A transition to a distributed architecture was necessary.

Introduction to Saga Pattern

The Saga Pattern originated through a research paper published in 1987 by Hector Garcia-Molina and Kenneth Salem at Princeton University. The research addressed a critical problem: how to handle long-lived transactions in database systems.

At the time, traditional transactions were designed to be short-lived operations, locking resources for a minimal duration to maintain ACID guarantees. However, some operations, such as generating a complex bank statement, processing large historical datasets, or reconciling multi-step financial workflows, require holding locks for extended periods. These long-running transactions created bottlenecks by tying up resources, reducing system concurrency, and increasing the risk of failure.

The Saga Pattern solves these issues in the following ways:

  • A single logical operation is split into a sequence of smaller sub-transactions.

  • Each sub-transaction is executed independently and commits its changes immediately.

  • If all sub-transactions succeed, the Saga as a whole is considered successful.

  • If any sub-transaction fails, compensating transactions are executed in reverse order to undo the changes of previously completed sub-transactions.

See the diagram below that shows an example of this pattern:

Key technical points about Sagas are as follows:

  • Sub-transactions must be independent: The successful execution of one sub-transaction should not directly depend on the outcome of another. This independence allows for better concurrency and avoids cascading failures.

  • Compensating transactions are required: For every sub-transaction, a corresponding compensating action must be defined. These compensating transactions semantically "undo" the effect of their associated operations. However, the system may not always be able to return to the exact previous state; instead, it aims for a semantically consistent recovery.

  • Atomicity is weakened: Unlike traditional transactions, where partial updates are never visible, in a Saga, intermediate states are visible to other parts of the system. Partial results may exist temporarily until either the full sequence completes or a failure triggers rollback through compensation.

  • Consistency is preserved through business logic: Instead of relying on database-level transactional guarantees, Sagas maintain application-level consistency by ensuring that after all sub-transactions and compensations, the system is left in a valid and coherent state.

  • Failure management is built in: Sagas treat failures as an expected part of the system's operation. The pattern provides a structured way to handle errors and maintain resilience without assuming perfect reliability.

Saga Execution Models

The main aspects of the Saga execution model are as follows:

Single Database Execution

When the Saga Pattern was first introduced, it was designed to operate within a single database system. In this environment, executing a saga requires two main components:

1 - Saga Execution Coordinator (SEC)

The Saga Execution Coordinator is a process that orchestrates the execution of all the sub-transactions in the correct sequence. It is responsible for:

  • Starting the saga

  • Executing each sub-transaction one after another

  • Monitoring the success or failure of each sub-transaction

  • Triggering compensating transactions if something goes wrong

The SEC ensures that the saga progresses correctly without needing distributed coordination because everything is happening within the same database system.

2 - Saga Log

The Saga Log acts as a durable record of everything that happens during the execution of a saga. Every major event is written to the log: starting the saga, beginning and completing each sub-transaction, beginning and completing each compensating transaction, and ending the saga.

The Saga Log guarantees that even if the SEC crashes during execution, the system can recover by replaying the events recorded in the log. This provides durability and recovery without relying on traditional transaction locking across the entire saga.

Failure Handling

Handling failures in a single database saga relies on a strategy called backward recovery. 

This means that if any sub-transaction fails during the saga’s execution, the system must roll back by executing compensating transactions for all the sub-transactions that had already completed successfully.

Here’s how the process works:

  • Detection of Failure: The SEC detects that a sub-transaction has failed because the database operation either returns an error or violates a business rule.

  • Recording the Abort: The SEC writes an abort event to the Saga Log, marking that the forward execution path has been abandoned.

  • Starting Compensations: The SEC reads the Saga Log to determine which sub-transactions had been completed. It then begins executing the corresponding compensating transactions in the reverse order.

  • Completing Rollback: Each compensating transaction is logged in the Saga Log as it begins and completes.

After all necessary compensations have been successfully applied, the saga is formally marked as aborted in the log.
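A minimal sketch of this flow in Python, with an in-memory list standing in for the durable Saga Log and hypothetical trip-booking steps; a real SEC would persist every log record before acting on it.

```python
class SagaCoordinator:
    """Toy SEC: execute sub-transactions in order, log progress, compensate in reverse on failure."""

    def __init__(self, steps):
        self.steps = steps      # list of (name, action, compensation) tuples
        self.log = []           # stand-in for the durable Saga Log

    def run(self):
        completed = []
        self.log.append("begin saga")
        for name, action, compensation in self.steps:
            self.log.append(f"begin {name}")
            try:
                action()
            except Exception:
                self.log.append(f"abort saga at {name}")
                # backward recovery: compensate completed sub-transactions in reverse order
                for done_name, compensate in reversed(completed):
                    self.log.append(f"begin comp-{done_name}")
                    compensate()
                    self.log.append(f"end comp-{done_name}")
                self.log.append("end saga (aborted)")
                return False
            self.log.append(f"end {name}")
            completed.append((name, compensation))
        self.log.append("end saga (committed)")
        return True

def fail(reason):
    raise RuntimeError(reason)

# hypothetical trip-booking saga: the hotel step fails, so the flight booking is compensated
saga = SagaCoordinator([
    ("book_flight", lambda: print("flight booked"), lambda: print("flight cancelled")),
    ("book_hotel", lambda: fail("no rooms left"), lambda: print("hotel cancelled")),
])
saga.run()
print(saga.log)
```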

Halo 4 Stats Service

Here are the key components of how Halo used the Saga Pattern:

Service Architecture

The Halo 4 statistics service was built to handle large volumes of player data generated during and after every game. The architecture used a combination of cloud-based storage and actor-based programming models to manage this complexity effectively.

The service architecture included the following major components:

  • Azure Table Storage: All persistent player data was stored in Azure Table Storage, a NoSQL key-value store. Each player's data was assigned to a separate partition, allowing for highly parallel reads and writes without a single centralized bottleneck.

  • Orleans Actor Model: The team adopted Microsoft's Orleans framework, which is based on the actor model. Actors (referred to as "grains" in Orleans) represented different logical units in the system. 

    • Game Grains handled the aggregation of statistics for a game session.

    • Player Grains were responsible for persisting each player’s statistics. In Halo 4’s backend system, a Player Grain is a logical unit that represents a single player’s data and behavior inside the server-side application.

  • Azure Service Bus: Azure Service Bus was used as a message queue and a distributed, durable log. When a new game is completed, a message containing the statistics payload is published to the Service Bus. This acted as the start of a saga.

  • Stateless Frontend Services: These services accepted raw statistics from Xbox clients, published them to the Service Bus, and triggered the saga processing pipelines.

The separation into game and player grains, combined with distributed cloud storage, provided a scalable foundation that could process thousands of simultaneous games and millions of concurrent player updates.

Saga Application

The team applied the Saga Pattern to manage the complex updates needed for player statistics across multiple partitions.

The typical sequence was:

  • Aggregation: After a game session ended, the Game Grain aggregated statistics from the participating players.

  • Saga Initiation: A message was logged in the Azure Service Bus indicating the start of a saga for updating statistics for that game.

  • Sub-requests to Player Grains: For each player in the game (up to 32), the Game Grain sent individual update requests to their corresponding Player Grains. Each Player Grain then updated its player's statistics in Azure Table Storage.

  • Logging Progress: Successful updates and any errors were recorded through the durable messaging system, ensuring that no state was lost even if a process crashed.

  • Completion or Failure: If all player updates succeeded, the saga was considered complete. If any player update failed (for example, due to a temporary Azure storage issue), the saga would not be rolled back but would move into a forward recovery phase.

Through this structure, the team could ensure that updates were processed independently per player without relying on traditional ACID transactions across all player partitions.

Forward Recovery Strategy

Rather than using traditional backward recovery (rolling back completed sub-transactions), the Halo 4 team implemented forward recovery for their statistics sagas.

The main reasons for choosing forward recovery are as follows:

  • User Experience: Players who had their stats updated successfully should not see those stats suddenly disappear if a rollback occurred. Rolling back visible, successfully processed data would create a confusing and poor experience.

  • Operational Efficiency: Retrying only the failed player updates was more efficient than undoing successful writes and restarting the entire game processing.

Here’s how forward recovery works:

  • If a Player Grain failed to update its statistics (for example, due to storage partition unavailability or quota exhaustion), the system recorded the failure but did not undo any successful updates already completed for other players.

  • The failed update was queued for a retry using a back-off strategy. This allowed time for temporary issues to resolve without overwhelming the storage system with aggressive retries.

  • Retried updates were required to be idempotent. That is, repeating the update operation would not result in duplicated statistics or corruption. This was achieved by relying on database operations that safely applied incremental changes or overwrote fields as necessary.

  • Successful retries eventually brought all player records to a consistent state, even if it took minutes or hours to do so after the original game session ended.

By using forward recovery and designing for idempotency, the Halo 4 backend was able to achieve high availability.
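A minimal sketch of that combination, retry with back-off plus an idempotent update, using hypothetical names rather than the actual Halo service code:

```python
import time

def update_player_stats(store, player_id, game_id, kills):
    """Idempotent update: re-applying the same (player, game) result has no extra effect."""
    record = store.setdefault(player_id, {"kills": 0, "applied_games": set()})
    if game_id in record["applied_games"]:
        return                                     # duplicate or retried delivery; already applied
    record["kills"] += kills
    record["applied_games"].add(game_id)

def retry_forward(task, attempts=5, base_delay=0.1):
    """Forward recovery: keep retrying the failed sub-request instead of rolling others back."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # exponential back-off between retries
    raise RuntimeError("still failing; leave the update queued for a later retry")

store = {}
retry_forward(lambda: update_player_stats(store, "player-1", "game-42", kills=3))
retry_forward(lambda: update_player_stats(store, "player-1", "game-42", kills=3))  # retried delivery
print(store["player-1"]["kills"])  # 3, not 6
```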

Conclusion

As systems grow to support millions of users, traditional database models that rely on centralized transactions and strong ACID guarantees begin to break down. 

The Halo 4 engineering team’s experience highlighted the urgent need for a new approach: one that could handle enormous scale, tolerate failures gracefully, and still maintain consistency across distributed data stores.

The Saga Pattern provided an elegant and practical solution to these challenges. By decomposing long-lived operations into sequences of sub-transactions and compensating actions, the team was able to build a system that prioritized availability, resilience, and operational correctness without relying on expensive distributed locking or rigid coordination protocols.

The lessons learned from this system apply broadly, not only to gaming infrastructure but to any domain where distributed operations must maintain reliability at massive scale. 

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How Canva Collects 25 Billion Events a Day

2025-05-05 23:30:52

ACI.dev: The Only MCP Server Your AI Agents Need (Sponsored)

ACI.dev’s Unified MCP Server provides every API your AI agents will need through just one MCP server and two functions. One connection unlocks 600+ integrations with built-in multi-tenant auth and natural-language permission scopes.

  • Plug & Play – Framework-agnostic, works with any architecture.

  • Secure – Tenant isolation for your agent’s end users.

  • Smart – Dynamic intent search finds the right function for each task.

  • Reliable – Sub-API permission boundaries to improve agent reliability.

  • Fully Open Source – Backend, dev portal, library, MCP server implementation.

Skip months of infra plumbing; ship the agent features that matter.

Star us on GitHub!


Disclaimer: The details in this post have been derived from the articles written by the Canva engineering team. All credit for the technical details goes to the Canva Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Every product team wants data. Not just numbers, but sharp, trustworthy, real-time answers to questions like: Did this new feature improve engagement? Are users abandoning the funnel? What’s trending right now?

However, collecting meaningful analytics at scale is less about dashboards and more about plumbing.

At Canva, analytics isn’t just a tool for dashboards but a part of the core infrastructure. Every design viewed, button clicked, or page loaded gets translated into an event. Multiply that across hundreds of features and millions of users, and it becomes a firehose: 25 billion events every day, flowing with five nines of uptime.

Achieving that kind of scale requires deliberate design choices: strict schema governance, batch compression, fallback queues, and a router architecture that separates ingestion from delivery. 

This article walks through how Canva structures, collects, and distributes billions of events daily, without drowning in tech debt and increasing cloud bills.

Their system is organized into three core stages:

  • Structure: Define strict schemas

  • Collect: Ingest and enrich events

  • Distribute: Route events to appropriate destinations

Let’s look at each stage in detail.

Structure

Most analytics pipelines start with implementation speed in mind, resulting in undocumented types and incompatible formats. It works until someone asks why this metric dropped, and there is no satisfactory answer.

Canva avoided that trap by locking down its analytics schema from day one. Every event, from a page view to a template click, flows through a strictly defined Protobuf schema.

Instead of treating schemas as an afterthought, Canva treats them like long-term contracts. Every analytics event must conform to a Protobuf schema that guarantees full transitive compatibility:

  • Forward-compatible: Old consumers must handle events created by new clients.

  • Backward-compatible: New consumers must handle events from old clients.

Breaking changes like removing a required field or changing types aren’t allowed. If something needs to change fundamentally, engineers ship an entirely new schema version. This keeps years of historical data accessible and analytics queries future-proof.

To enforce these schema rules automatically, Canva built Datumgen: a layer on top of protoc that goes beyond standard code generation.

Datumgen handles various components like:

  • TypeScript definitions for frontends, ensuring events are type-checked at compile time.

  • Java definitions for backend services that produce or consume analytics.

  • SQL schemas for Snowflake, so the data warehouse always knows the shape of incoming data.

  • A live Event Catalog UI that anyone at Canva can browse to see what events exist, what fields they contain, and where they’re routed.

Every event schema must list two human owners:

  • A Technical Owner: Usually the engineer who wrote the event logic.

  • A Business Owner: Often a data scientist who knows how the event maps to product behavior.

Fields must also include clear, human-written comments that explain what they mean and why they matter. These aren’t just helpful for teammates. They directly power the documentation shown in Snowflake and the Event Catalog.

Collect

The biggest challenge with analytics pipelines isn’t collecting one event, but collecting billions, across browsers, devices, and flaky networks, without turning the ingestion service into a bottleneck or a brittle mess of platform-specific hacks.

Canva’s ingestion layer solves this by betting on two things: a unified client and an asynchronous, AWS Kinesis-backed enrichment pipeline. Rather than building (and maintaining) separate analytics SDKs for iOS, Android, and web, Canva went the other way: every frontend platform uses the same TypeScript analytics client, running inside a WebView shell.

Only a thin native layer is used to grab platform-specific metadata like device type or OS version. Everything else, from event structure to queueing to retries, is handled in one shared codebase.

This pays off in a few key ways:

  • Engineers don’t have to fix bugs in three places.

  • Schema definitions stay consistent across platforms.

  • Feature instrumentation stays unified, reducing duplication and drift.

Once events leave the client, they land at a central ingestion endpoint.

Before anything else happens, each event is checked against the expected schema. If it doesn’t match (for example, if a field is missing, malformed, or just plain wrong) it’s dropped immediately. This upfront validation acts as a firewall against bad data.

Valid events are then pushed asynchronously into Amazon Kinesis Data Streams (KDS), which acts as the ingestion buffer for the rest of the pipeline.

The key move here is the decoupling: the ingestion endpoint doesn’t block on enrichment or downstream delivery. It validates fast, queues fast, and moves on. That keeps response times low and isolates ingest latency from downstream complexity.

The Ingest Worker pulls events from the initial KDS stream and handles all the heavy lifting that the client can’t or shouldn’t do, such as:

  • Geolocation enrichment based on IP.

  • Device fingerprinting from available metadata.

  • Timestamp correction to fix clock drift or stale client buffers.

Once events are enriched, they’re forwarded to a second KDS stream that acts as the handoff to the routing and distribution layer.

This staging model brings two major benefits:

  • It keeps enrichment logic separate from the ingestion path, preventing slow lookups or third-party calls from impacting front-end latencies.

  • It isolates faults. If enrichment fails or lags, it doesn’t block new events from entering the pipeline.
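The shape of that handoff can be sketched in a few lines of Python, with a thread-safe queue standing in for the Kinesis streams and placeholder enrichment logic; the field names here are illustrative, not Canva's actual schema.

```python
import queue
import threading

REQUIRED_FIELDS = {"event_name", "user_id", "timestamp"}   # stand-in for a real Protobuf schema
raw_events = queue.Queue()                                 # stand-in for the first KDS stream

def ingest(event):
    """Validate fast, queue fast, return immediately; enrichment happens downstream."""
    if not REQUIRED_FIELDS.issubset(event):
        return False                                       # schema firewall: drop bad events up front
    raw_events.put(event)
    return True

def enrichment_worker():
    while True:
        event = raw_events.get()
        event["country"] = "US"                            # placeholder for geolocation, device info, etc.
        print("handing off to router:", event)             # stand-in for the second KDS stream
        raw_events.task_done()

threading.Thread(target=enrichment_worker, daemon=True).start()
ingest({"event_name": "design_viewed", "user_id": "u1", "timestamp": 1715200000})
ingest({"event_name": "broken"})                           # dropped: missing required fields
raw_events.join()
```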

Deliver

A common failure mode in analytics pipelines isn’t losing data but delivering it too slowly. When personalization engines lag, dashboards go blank, or real-time triggers stall, it usually traces back to one culprit: tight coupling between ingestion and delivery.

Canva avoids this trap by splitting the pipeline cleanly. Once events are enriched, they flow into a decoupled router service.

The router service sits between enrichment and consumption. Its job is simple in theory but crucial in practice: get each event to the right place, without letting any consumer slow down the others.

Here’s how it works:

  • Pulls enriched events from the second Kinesis Data Stream (KDS).

  • Matches each event against the routing configuration defined in code.

  • Delivers each event to the set of downstream consumers that subscribe to its type.

Why decouple routing from the ingest worker?  Because coupling them would mean:

  • A slow consumer blocks all others.

  • A schema mismatch in one system causes cascading retries.

  • Scaling becomes painful, especially when some consumers want real-time delivery and others batch once an hour.

Canva delivers analytics events to a few key destinations, each optimized for a different use case:

  • Snowflake (via Snowpipe Streaming): This is where dashboards, metrics, and A/B test results come from. Latency isn’t critical. Freshness within a few minutes is enough. However, reliability and schema stability matter deeply.

  • Kinesis: Used for real-time backend systems related to personalization, recommendations, or usage tracking services. Kinesis shines here because it supports high-throughput parallel reads, stateful stream processing, and replay.

  • SQS Queues: Ideal for services that only care about a handful of event types. SQS is low-maintenance and simple to integrate with. 

This multi-destination setup lets each consumer pick the trade-off it cares about: speed, volume, simplicity, or cost.

The platform guarantees “at-least-once” delivery. In other words, an event may be delivered more than once, but never silently dropped. That means each consumer is responsible for deduplication, whether by using idempotent writes, event IDs, or windowing logic.

This trade-off favors durability over purity. In large-scale systems, it’s cheaper and safer to over-deliver than to risk permanent data loss due to transient failures.
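A minimal sketch of consumer-side deduplication using event IDs; in practice the seen-set would live in a shared store with a TTL rather than in process memory.

```python
class DedupingConsumer:
    """At-least-once delivery implies duplicates; track processed event IDs to stay idempotent."""

    def __init__(self):
        self.seen_ids = set()       # production systems would use a TTL'd external store

    def handle(self, event):
        if event["event_id"] in self.seen_ids:
            return                  # duplicate delivery: already applied, safe to skip
        self.apply(event)
        self.seen_ids.add(event["event_id"])

    def apply(self, event):
        print("processing", event["event_id"])

consumer = DedupingConsumer()
consumer.handle({"event_id": "evt-1", "type": "page_view"})
consumer.handle({"event_id": "evt-1", "type": "page_view"})   # redelivery, ignored
```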

Infrastructure Cost Optimization

Here’s how the team brought infrastructure costs down by over 20x, without sacrificing reliability or velocity.

SQS + SNS

The MVP version of Canva’s event delivery pipeline leaned on AWS SQS and SNS:

  • Easy to set up.

  • Scaled automatically.

  • Integrated smoothly with existing services.

But convenience came at a cost. Over time, SQS and SNS accounted for 80% of the platform’s operating expenses.

That kicked off a debate between streaming solutions:

  • Amazon MSK (Managed Kafka) offered a 40% cost reduction but came with significant operational overhead: brokers, partitions, storage tuning, and JVM babysitting.

  • Kinesis Data Streams (KDS) wasn’t the fastest, but it won on simplicity, scalability, and price. 

The numbers made the decision easy: KDS delivered an 85% cost reduction compared to the SQS/SNS stack, with only a modest latency penalty (10–20ms increase). The team made the switch and cut costs by a factor of 20.

Compress First: Then Ship

Kinesis charges by volume, not message count. That makes compression a prime lever for cost savings. Instead of firing events one by one, Canva performs some key optimizations such as:

  • Batch collecting hundreds of events at a time.

  • Compressing them using ZSTD: a fast, high-ratio compression algorithm.

  • Pushing compressed blobs into KDS.

This tiny shift delivered a big impact. Some stats are as follows:

  • 10x compression ratio on typical analytics data (which tends to be repetitive).

  • ~100ms compression/decompression overhead per batch: a rounding error in stream processing.

  • $600,000 in annual savings, with no visible trade-off in speed or accuracy.
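A rough sketch of the batch-compress-ship idea, assuming the `zstandard` Python package and boto3; the stream name, partition-key choice, and newline-delimited JSON framing are illustrative assumptions, not Canva's implementation.

```python
import json

import boto3
import zstandard as zstd     # assumes the `zstandard` package is installed

kinesis = boto3.client("kinesis")
compressor = zstd.ZstdCompressor()

def ship_batch(events, stream_name="analytics-events"):      # hypothetical stream name
    """Batch many small events, compress once, and send a single record (KDS bills by volume)."""
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    blob = compressor.compress(payload)                       # repetitive analytics data compresses well
    kinesis.put_record(
        StreamName=stream_name,
        Data=blob,
        PartitionKey=events[0]["user_id"],                    # illustrative partitioning choice
    )
```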

KDS Tail Latency

Kinesis isn’t perfect. While average latency stays around 7ms, tail latency can spike over 500ms, especially when shards approach their 1MB/sec write limits.

This poses a threat to frontend response times. Waiting on KDS means users wait too. That’s a no-go.

The fix was a fallback to SQS whenever KDS misbehaves:

  • If a write is throttled or delayed, the ingestion service writes to SQS instead.

  • This keeps p99 response times under 20ms, even under shard pressure.

  • It costs less than $100/month to maintain this overflow buffer.

This fallback also acts as a disaster recovery mechanism. If KDS ever suffers a full outage, the system can redirect the full event stream to SQS with no downtime. 
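A hedged sketch of the fallback path with boto3: catch the Kinesis throttling error and spill the record into an SQS overflow queue. The queue URL and base64 framing are assumptions for illustration.

```python
import base64

import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")
sqs = boto3.client("sqs")
OVERFLOW_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/analytics-overflow"  # hypothetical

def publish(blob: bytes, partition_key: str, stream_name: str = "analytics-events"):
    """Prefer KDS; if the shard throttles the write, fall back to the SQS overflow buffer."""
    try:
        kinesis.put_record(StreamName=stream_name, Data=blob, PartitionKey=partition_key)
    except ClientError as err:
        if err.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
            raise
        # SQS message bodies must be text, so wrap the compressed blob in base64
        sqs.send_message(QueueUrl=OVERFLOW_QUEUE_URL,
                         MessageBody=base64.b64encode(blob).decode("ascii"))
```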

Conclusion

Canva’s event collection pipeline is a great case of fundamentals done right: strict schemas, decoupled services, typed clients, smart batching, and infrastructure that fails gracefully. Nothing in the architecture is wildly experimental, and that’s the point.

Real systems break when they’re over-engineered for edge cases or under-designed for scale. Canva’s approach shows what it looks like to walk the line: enough abstraction to stay flexible, enough discipline to stay safe, and enough simplicity to keep engineers productive.

For any team thinking about scaling their analytics, the lesson would be to build for reliability, cost, and long-term clarity. That’s what turns billions of events into usable insight.

References:



EP161: A Cheatsheet on REST API Design Best Practices

2025-05-03 23:30:25

WorkOS + MCP: Authorization for AI Agents (Sponsored)

Wide-open access to every tool on your MCP server is a major security risk. Unchecked access can quickly lead to serious incidents.

Teams need a fast, easy way to lock down access with roles and permissions.

WorkOS AuthKit makes it simple with RBAC — assign roles, enforce permissions, and control exactly who can access critical tools.

Don’t wait for a breach to happen. Secure your server today.

Watch the demo to learn more


This week’s system design refresher:

  • System Design Was HARD - Until You Knew the Trade-Offs, Part 2 (Youtube video)

  • A Cheatsheet on REST API Design Best Practices

  • Top 30 AWS Services That Are Commonly Used

  • The Large-Language Model Glossary

  • We're hiring at ByteByteGo

  • SPONSOR US


System Design Was HARD - Until You Knew the Trade-Offs, Part 2


A Cheatsheet on REST API Design Best Practices

Well-designed APIs behave consistently, fail predictably, and grow without friction. Some best practices to keep in mind are as follows:

  1. Resource-oriented paths and proper use of HTTP verbs help APIs align with standard tools.

  2. Use a proper API versioning approach.

  3. Use standard error codes while generating API responses.

  4. APIs should be idempotent where it matters: a repeated request produces the same result as the first one, which makes retries safe, especially for operations with side effects such as POST.

  5. Idempotency keys allow clients to safely deduplicate operations with side effects (see the sketch after this list).

  6. APIs should support pagination to prevent performance bottlenecks and payload bloat. Some common pagination strategies are offset-based, cursor-based, and keyset-based.

  7. API security is mandatory for well-designed APIs. Use proper authentication and authorization with APIs using API Keys, JWTs, OAuth2, and other mechanisms. HTTPS is also a must-have for APIs running in production.
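As a sketch of point 5, here is one way an idempotency key can be honored server-side, assuming Flask and an in-memory dict standing in for a shared, expiring key store:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
responses_by_key = {}    # idempotency key -> cached response (use a shared store with a TTL in practice)

@app.post("/payments")
def create_payment():
    key = request.headers.get("Idempotency-Key")
    if key and key in responses_by_key:
        return jsonify(responses_by_key[key]), 200       # replayed request: return the original result
    result = {"payment_id": len(responses_by_key) + 1, "status": "created"}
    if key:
        responses_by_key[key] = result
    return jsonify(result), 201
```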

Over to you: Which other best practices do you follow while designing APIs?


Pgvector vs. Qdrant: Open-Source Vector Database Comparison (Sponsored)

Looking for an open-source, high-performance vector database for large-scale workloads? We compare Qdrant vs. Postgres + pgvector + pgvectorscale.

Read The Benchmark


Top 30 AWS Services That Are Commonly Used

Let's group them by category and understand what they do.

Compute Services
1 - Amazon EC2: Virtual servers in the cloud
2 - AWS Lambda: Serverless functions for event-driven workloads
3 - Amazon ECS: Managed container orchestration
4 - Amazon EKS: Kubernetes cluster management service
5 - AWS Fargate: Serverless compute for containers

Storage Services
6 - Amazon S3: Scalable secure object storage
7 - Amazon EBS: Block storage for EC2 instances
8 - Amazon FSx: Fully managed file storage
9 - AWS Backup: Centralized backup automation
10 - Amazon Glacier: Archival cold storage for backups

Database Services
11 - Amazon RDS: Managed relational database service
12 - Amazon DynamoDB: NoSQL database with low latency
13 - Amazon Aurora: High-performance cloud-native database
14 - Amazon Redshift: Scalable data warehousing solution
15 - Amazon Elasticache: In-memory caching with Redis/Memcached
16 - Amazon DocumentDB: NoSQL document database (MongoDB-compatible)
17 - Amazon Keyspaces: Managed Cassandra database service

Networking & Security
18 - Amazon VPC: Secure cloud networking
19 - AWS CloudFront: Content Delivery Network
20 - AWS Route53: Scalable domain name system (DNS)
21 - AWS WAF: Protects web applications from attacks
22 - AWS Shield: DDoS protection for AWS workloads

AI & Machine Learning
23 - Amazon SageMaker: Build, train, and deploy ML models
24 - AWS Rekognition: Image and video analysis with AI
25 - AWS Textract: Extracts text from scanned documents
26 - Amazon Comprehend: AI-driven natural language processing

Monitoring & DevOps
27 - Amazon CloudWatch: AWS performance monitoring and alerts
28 - AWS X-Ray: Distributed tracing for applications
29 - AWS CodePipeline: CI/CD automation for deployments
30 - AWS CloudFormation: Infrastructure as Code (IaC)

Over to you: Which other AWS service will you add to the list?


The Large-Language Model Glossary

This glossary can be divided into high-level categories:

  1. Models: Includes the types of models such as Foundation, Instruction-Tuned, Multi-modal, Reasoning, and Small Language Model.

  2. Training LLM: Training begins with pretraining, followed by RLHF, DPO, and Synthetic Data. Fine-Tuning adds control with datasets, checkpoints, LoRA/QLoRA, guardrails, and parameter tuning.

  3. Prompts: Prompts drive how models respond using User/System Prompts, Chain of Thought, or Few/Zero-Shot learning. Prompt Tuning and large Context Windows help shape more precise, multi-turn conversations.

  4. Inference: This is how models generate responses. Key factors include Temperature, Max Tokens, Seed, and Latency. Hallucination is a common issue here, where the model makes things up that sound real.

  5. Retrieval-Augmented Generation: RAG improves accuracy by fetching real-world data. It uses Retrieval, Semantic Search, Chunks, Embeddings, and VectorDBs. Reranking and Indexing ensure the best answers are surfaced, not just the most likely ones.

Over to you: What else will you add to the LLM glossary?


We're hiring two new positions at ByteByteGo: Full-Stack Engineer and Sales/Partnership

Role Type: Part-time (20+ hours weekly) or Full-time
Compensation: Competitive

Full-Stack Engineer (Remote)
We are hiring a Full Stack Engineer to build an easy-to-use educational platform and drive product-led growth. You'll work closely with the founder, wearing a product manager's hat when needed to prioritize user experience and feature impact. You'll operate in a fast-paced startup environment where experimentation, creativity, and using AI tools for rapid prototyping are encouraged.

We’re less concerned with years of experience. We care more about what you've built than about your resume. Share your projects, GitHub, portfolio, or any artifacts that showcase your ability to solve interesting problems and create impactful solutions. When you're ready, send your resume and a brief note about why you're excited to join ByteByteGo to [email protected]

Sales/Partnership (US based remote role)
We’re looking for a sales and partnerships specialist who will help grow our newsletter sponsorship business. This role will focus on securing new advertisers, nurturing existing relationships, and optimizing revenue opportunities across our newsletter and other media formats.

We’re less concerned with years of experience. What matters most is that you’re self-motivated, organized, and excited to learn and take on new challenges.

How to Apply: send your resume and a short note on why you’re excited about this role to [email protected]


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

Synchronous vs Asynchronous Communication: When to Use What?

2025-05-01 23:30:47

A system's usefulness depends heavily on its ability to communicate with other systems. 

That’s true whether it’s a pair of microservices exchanging user data, a mobile app fetching catalog details, or a distributed pipeline pushing events through a queue. At some point, every system has to make a call: Should this interaction happen synchronously or asynchronously?

That question surfaces everywhere: sometimes explicitly in design documents, sometimes buried in architectural decisions that later appear as latency issues, cascading failures, or observability blind spots. It affects how APIs are designed, how systems scale, and how gracefully they degrade when things break.

Synchronous communication feels familiar. One service calls another, waits for a response, and moves on. It’s clean, predictable, and easy to trace. However, when the service being called slows down or fails, everything that depends on it is also impacted.

Asynchronous communication decouples these dependencies. A message is published, a job is queued, or an event is fired, and the sender moves on. It trades immediacy for flexibility. The system becomes more elastic, but harder to debug, reason about, and control.

Neither approach is objectively better. They serve different needs, and choosing between them (or deciding how to combine them) is a matter of understanding different trade-offs: 

  • Latency versus throughput

  • Simplicity versus resilience

  • Real-time response versus eventual progress

In this article, we’ll take a detailed look at synchronous and asynchronous communication, along with their trade-offs. We will also explore some popular communication protocols that make synchronous or asynchronous communication possible.

Understanding the Basics

Read more

How Meta Built Threads to Support 100 Million Signups in 5 Days

2025-04-29 23:30:31

Generate your MCP server with Speakeasy (Sponsored)

Like it or not, your API has a new user: AI agents. Make accessing your API services easy for them with an MCP (Model Context Protocol) server. Speakeasy uses your OpenAPI spec to generate an MCP server with tools for all your API operations to make building agentic workflows easy.

Once you've generated your server, use the Speakeasy platform to develop evals, prompts and custom toolsets to take your AI developer platform to the next level.

Start Generating


Disclaimer: The details in this post have been derived from the articles written by the Meta engineering team. All credit for the technical details goes to the Meta/Threads Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Threads, Meta’s newest social platform, launched on July 5, 2023, as a real-time, public conversation space. 

Built in under five months by a small engineering team, the product received immediate momentum. Infrastructure teams had to respond immediately to the incredible demand. 

When a new app hits 100 million signups in under a week, the instinct is to assume someone built a miracle backend overnight. That’s not what happened with Threads. There was no time to build new systems or bespoke scaling plans. The only option was to trust the machinery already in place.

And that machinery worked quite smoothly. As millions signed up in 5 days, the backend systems held on, and everything from the user’s perspective worked as intended.

Threads didn’t scale because it was lucky. It scaled because it inherited Meta’s hardened infrastructure: platforms shaped by a decade of lessons from Facebook, Instagram, and WhatsApp. 

This article explores two of those platforms that played a key role in the successful launch of Threads:

  • ZippyDB, the distributed key-value store powering state and search.

  • Async, the serverless compute engine that offloads billions of background tasks.

Neither of these systems was built for Threads. But Threads wouldn’t have worked without them.

ZippyDB was already managing billions of reads and writes daily across distributed regions. Also, Async had been processing trillions of background jobs across more than 100,000 servers, quietly powering everything from feed generation to follow suggestions.

ZippyDB: Key-Value at Hyperscale

ZippyDB is Meta’s internal, distributed key-value store designed to offer strong consistency, high availability, and geographical resilience at massive scale. 

At its core, it builds on RocksDB for storage, extends replication with Meta’s Data Shuttle (a Multi-Paxos-based protocol), and manages placement and failover through a system called Shard Manager.

Unlike purpose-built datastores tailored to single products, ZippyDB is a multi-tenant platform. Dozens of use cases (from metadata services to product feature state) share the same infrastructure. This design ensures higher hardware utilization, centralized observability, and predictable isolation across workloads.

The Architecture of ZippyDB

ZippyDB doesn’t treat deployment as a monolith. It’s split into deployment tiers: logical groups of compute and storage resources distributed across geographic regions. 

Each tier serves one or more use cases and provides fault isolation, capacity management, and replication boundaries. The most commonly used is the wildcard tier, which acts as a multi-tenant default, balancing hardware utilization with operational simplicity. Dedicated tiers exist for use cases with strict isolation or latency constraints.

Within each tier, data is broken into shards, the fundamental unit of distribution and replication. Each shard is independently managed and:

  • Synchronously replicated across a quorum of Paxos nodes for durability and consistency. This guarantees that writes survive regional failures and meet strong consistency requirements.

  • Asynchronously replicated to follower replicas, which are often co-located with high-read traffic regions. These replicas serve low-latency reads with relaxed consistency, enabling fast access without sacrificing global durability.

This hybrid replication model (strong quorum-based writes paired with regional read optimization) gives ZippyDB flexibility across a spectrum of workloads.

See the diagram below that shows the concept of region-based replication supported by ZippyDB.

To push scalability even further, ZippyDB introduces a layer of logical partitioning beneath shards: μshards (micro-shards). These are small, related key ranges that provide finer-grained control over data locality and mobility.

Applications don’t deal directly with physical shards. Instead, they write to μshards, which ZippyDB dynamically maps to underlying storage based on access patterns and load balancing requirements.

ZippyDB supports two primary strategies for managing μshard-to-shard mapping:

  • Compact Mapping: Best for workloads with relatively static data distribution. Mappings change only when shards grow too large or too hot. This model prioritizes stability over agility and is common in systems with predictable access patterns.

  • Akkio Mapping: Designed for dynamic workloads. A system called Akkio continuously monitors access patterns and remaps μshards to optimize latency and load. This is particularly valuable for global products where user demand shifts across regions throughout the day. Akkio reduces data duplication while improving locality, making it ideal for scenarios like feed personalization, metadata-heavy workloads, or dynamic keyspaces.

In ZippyDB, the Shard Manager acts as the external controller for leadership and failover. It doesn’t participate in the data path but plays a critical role in keeping the system coordinated.

The Shard Manager assigns a Primary replica to each shard and defines an epoch: a versioned leadership lease. The epoch ensures only one node has write authority at any given time. When the Primary changes (for example, due to failure), Shard Manager increments the epoch and assigns a new leader. The Primary sends regular heartbeats to the Shard Manager. If the heartbeats stop, the Shard Manager considers the Primary unhealthy and triggers a leader election by promoting a new node and bumping the epoch.

See the diagram below that shows the role of the Shard Manager:

Consistency and Durability in ZippyDB

In distributed systems, consistency is rarely black-and-white. ZippyDB embraces this by giving clients per-request control over consistency and durability levels, allowing teams to tune system behavior based on workload characteristics.

1 - Strong Consistency

Strong consistency in ZippyDB ensures that reads always reflect the latest acknowledged writes, regardless of where the read or write originated. To achieve this, ZippyDB routes these reads to the primary replica, which holds the current Paxos lease for the shard. The lease ensures that only one primary exists at any time, and only it can serve linearizable reads.

If the lease state is unclear (for example, during a leadership change), the read may fall back to a quorum check to avoid split-brain scenarios. This adds some latency, but maintains correctness.

2 - Bounded Staleness (Eventual Consistency)

Eventual consistency in ZippyDB isn’t the loose promise it implies in other systems. Here, it means bounded staleness: a read may be slightly behind the latest write, but it will never serve stale data beyond a defined threshold.

Follower replicas (often located closer to users) serve these reads. ZippyDB uses heartbeats to monitor follower lag, and only serves reads from replicas that are within an acceptable lag window. This enables fast, region-local reads without compromising on the order of operations.

3 - Read-Your-Writes Consistency

For clients that need causal guarantees, ZippyDB supports a read-your-writes model. 

After a write, the server returns a version number (based on Paxos sequence ordering). The client caches this version and attaches it to subsequent reads. ZippyDB then ensures that reads reflect data at or after that version.

This model works well for session-bound workloads, like profile updates followed by an immediate refresh.
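A toy model of the version-token handshake (not ZippyDB's actual client API): the client remembers the version returned by its last write and only reads from a follower that has caught up to it.

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.data = {}

class SessionClient:
    """Read-your-writes: carry the last seen write version and demand at least that freshness."""

    def __init__(self, primary, follower):
        self.primary, self.follower = primary, follower
        self.last_seen_version = 0

    def write(self, key, value):
        self.primary.version += 1
        self.primary.data[key] = value
        self.last_seen_version = self.primary.version     # version token returned by the server

    def read(self, key):
        # serve from the nearby follower only if it has replicated past our last write
        replica = self.follower if self.follower.version >= self.last_seen_version else self.primary
        return replica.data.get(key)

primary, follower = Replica(), Replica()
client = SessionClient(primary, follower)
client.write("profile:alice", {"bio": "updated"})
print(client.read("profile:alice"))   # served from the primary until the follower catches up
```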

4 - Fast-Ack Write Mode

In scenarios where write latency matters more than durability, ZippyDB offers a fast-acknowledgment write mode. Writes are acknowledged as soon as they are enqueued on the primary for replication, not after they’re fully persisted in the quorum.

This boosts throughput and responsiveness, but comes with trade-offs:

  • Lower durability (data could be lost if the primary crashes before replication).

  • Weaker consistency (readers may not see the write until replication completes).

This mode fits well in systems that can tolerate occasional loss or use an idempotent retry.

Transactions and Conditional Writes

ZippyDB supports transactional semantics for applications that need atomic read-modify-write operations across multiple keys. Unlike systems that offer tunable isolation levels, ZippyDB keeps things simple and safe: all transactions are serializable by default.

Transactions

ZippyDB implements transactions using optimistic concurrency control:

  • Clients read a database snapshot (usually from a follower) and assemble a write set.

  • They send both read and write sets, along with the snapshot version, to the primary.

  • The primary checks for conflicts, whether any other transaction has modified the same keys since the snapshot.

  • If there are no conflicts, the transaction commits and is replicated via Paxos.

If a conflict is detected, the transaction is rejected, and the client retries. This avoids lock contention but works best when write conflicts are rare.

ZippyDB maintains recent write history to validate transaction eligibility. To keep overhead low, it prunes older states periodically. Transactions spanning epochs (i.e., across primary failovers) are automatically rejected, which simplifies correctness guarantees at the cost of some availability during leader changes.
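A compact sketch of optimistic concurrency control with per-key versions; the structure is hypothetical, chosen to mirror the snapshot, validate, and commit flow described above.

```python
class OCCStore:
    """Toy optimistic concurrency: commit only if the read set is unchanged since the snapshot."""

    def __init__(self):
        self.data = {}              # key -> value
        self.versions = {}          # key -> version of the last commit that touched it
        self.commit_counter = 0

    def snapshot(self, keys):
        return {k: self.versions.get(k, 0) for k in keys}

    def commit(self, snapshot, writes):
        # conflict check: has any key in the read set been modified since the snapshot was taken?
        if any(self.versions.get(k, 0) != v for k, v in snapshot.items()):
            return False            # rejected; the caller retries with a fresh snapshot
        self.commit_counter += 1
        for k, v in writes.items():
            self.data[k] = v
            self.versions[k] = self.commit_counter
        return True

store = OCCStore()
snap = store.snapshot(["stats:player-1"])
store.commit(store.snapshot(["stats:player-1"]), {"stats:player-1": 10})   # a concurrent writer wins
print(store.commit(snap, {"stats:player-1": 5}))                           # False: conflict, retry
```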

Conditional Writes

For simpler use cases, ZippyDB exposes a conditional write API that maps internally to a server-side transaction. This API allows operations like:

  • “Set this key only if it doesn’t exist.”

  • “Update this value only if it matches X.”

  • “Delete this key only if it’s present.”

These operations avoid the need for client-side reads and round-trips. Internally, ZippyDB evaluates the precondition, checks for conflicts, and commits the write as a transaction if it passes.

This approach simplifies client code and improves performance in cases where logic depends on the current key's presence or state.
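In spirit, each conditional write is a compare-and-set evaluated on the server. A minimal dictionary-backed sketch of the three operations listed above (illustrative only, not ZippyDB's API):

```python
def set_if_absent(store, key, value):
    """'Set this key only if it doesn't exist.'"""
    if key in store:
        return False
    store[key] = value
    return True

def update_if_matches(store, key, expected, new_value):
    """'Update this value only if it matches X.'"""
    if store.get(key) != expected:
        return False
    store[key] = new_value
    return True

def delete_if_present(store, key):
    """'Delete this key only if it's present.'"""
    return store.pop(key, None) is not None

store = {}
print(set_if_absent(store, "handle:alice", "user-1"))                 # True
print(set_if_absent(store, "handle:alice", "user-2"))                 # False, already taken
print(update_if_matches(store, "handle:alice", "user-1", "user-9"))   # True
```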

Why Was ZippyDB Critical to Threads?

Threads didn’t have months to build a custom data infrastructure. It needed to read and write at scale from day one. ZippyDB handled several core responsibilities, such as:

  • Counters: Like counts, follower tallies, and other rapidly changing metrics.

  • Feed ranking state: Persisted signals to sort and filter what shows up in a user's home feed.

  • Search state: Underlying indices that powered real-time discovery.

What made ZippyDB valuable wasn’t just performance but adaptability. As a multi-tenant system, it supported fast onboarding of new services. Teams didn’t have to provision custom shards or replicate schema setups. They configured what they needed and got the benefit of global distribution, consistency guarantees, and monitoring from day one.

At launch, Threads was expected to grow. But few predicted the velocity: 100 million users in under a week. That kind of growth doesn’t allow for manual shard planning or last-minute migrations.

ZippyDB’s resharding protocol turned a potential bottleneck into a non-event. Its clients map data into logical shards, which are dynamically routed to physical machines. When load increases, the system can:

  • Provision new physical shards.

  • Reassign logical-to-physical mappings live, without downtime.

  • Migrate data using background workers that ensure consistency through atomic handoffs.

No changes are required in the application code. The system handles remapping and movement transparently. Automation tools orchestrate these transitions, enabling horizontal scale-out at the moment it's needed, not hours or days later.

This approach allowed Threads to start small, conserve resources during early development, and scale adaptively as usage exploded, without risking outages or degraded performance.

During Threads’ launch window, the platform absorbed thousands of machines in a matter of hours. Multi-tenancy played a key role here. Slack capacity from lower-usage keyspaces was reallocated, and isolation boundaries ensured that Threads could scale up without starving other workloads.

Async - Serverless at Meta Scale

Async is Meta’s internal serverless compute platform, formally known as XFaaS (eXtensible Function-as-a-Service). 

At peak, Async handles trillions of function calls per day across more than 100,000 servers. It supports multiple languages such as HackLang, Python, Haskell, and Erlang.

What sets Async apart is the fact that it abstracts away everything between writing a function and running it on a global scale. There is no need for service provisioning. Drop code into Async, and it inherits Meta-grade scaling, execution guarantees, and disaster resilience.

Async’s Role in Threads

When Threads launched, one of the most critical features wasn’t visible in the UI. It was the ability to replicate a user’s Instagram follow graph with a single tap. Behind that one action sat millions of function calls: each new Threads user potentially following hundreds or thousands of accounts in bulk.

Doing that synchronously would have been a non-starter. Blocking the UI on graph replication would have led to timeouts, poor responsiveness, and frustrated users. Instead, Threads offloaded that work to Async.

Async queued those jobs, spread them across the fleet, and executed them in a controlled manner. That same pattern repeated every time a celebrity joined Threads—millions of users received follow recommendations and notifications, all piped through Async without spiking database load or flooding services downstream.

How Async Handled the Surge

Async didn’t need to be tuned for Threads. It scaled the way it always does.

Several features were key to the success of Threads:

  • Queueing deferred less-urgent jobs to prevent contention with real-time tasks.

  • Batching combined many lightweight jobs into fewer, heavier ones, reducing overhead on dispatchers and improving cache efficiency.

  • Capacity-aware scheduling throttled job execution when downstream systems (like ZippyDB or the social graph service) showed signs of saturation.

This wasn’t reactive tuning. It was a proactive adaptation. Async observed system load and adjusted flow rates automatically. Threads engineers didn’t need to page anyone or reconfigure services. Async matched its execution rate to what the ecosystem could handle.
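A simplified sketch of that behavior, combining queueing, batching, and capacity-aware throttling; the health check is a stand-in for the real saturation signals Async would consult.

```python
import time
from collections import deque

class BatchDispatcher:
    """Queue deferred jobs, run them in batches, and back off when downstream systems look saturated."""

    def __init__(self, run_batch, downstream_healthy, batch_size=100):
        self.pending = deque()
        self.run_batch = run_batch
        self.downstream_healthy = downstream_healthy
        self.batch_size = batch_size

    def submit(self, job):
        self.pending.append(job)            # deferred: never executed inline with the user request

    def drain(self):
        while self.pending:
            if not self.downstream_healthy():
                time.sleep(1)               # capacity-aware scheduling: slow down under pressure
                continue
            batch = [self.pending.popleft() for _ in range(min(self.batch_size, len(self.pending)))]
            self.run_batch(batch)           # one heavier call instead of many tiny ones

dispatcher = BatchDispatcher(
    run_batch=lambda jobs: print(f"executed {len(jobs)} follow jobs"),
    downstream_healthy=lambda: True,        # would check real load signals in practice
)
for user_id in range(250):
    dispatcher.submit({"follow_user": user_id})
dispatcher.drain()
```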

Developer Experience

One of the most powerful aspects of Async is that the Threads engineers didn’t need to think about scale. Once business logic was written and onboarded into Async, the platform handled the rest:

  • Delivery windows: Jobs could specify execution timing, allowing deferment or prioritization.

  • Retries: Transient failures were retried with backoff, transparently.

  • Auto-throttling: Job rates were adjusted dynamically based on system health.

  • Multi-tenancy isolation: Surges in one product didn’t impact another.

These guarantees allowed engineers to focus on product behavior, not operational limits. Async delivered a layer of predictable elasticity, absorbing traffic spikes that would have crippled a less mature system.

Conclusion

The Threads launch was like a stress test for Meta’s infrastructure. And the results spoke for themselves. A hundred million users joined in less than a week. But there were no major outages. 

That kind of scale doesn’t happen by chance. 

ZippyDB and Async weren’t built with Threads in mind. But Threads only succeeded because those systems were already in place, hardened by a decade of serving billions of users across Meta’s core apps. They delivered consistency, durability, elasticity, and observability without demanding custom effort from the product team. 

This is the direction high-velocity engineering is heading: modular infrastructure pieces that are composable and deeply battle-tested. Not all systems need to be cutting-edge. However, the ones handling critical paths, such as state, compute, and messaging, must be predictable, reliable, and invisible.

References:

