
How Spotify Ships to 675 Million Users Every Week Without Breaking Things

2026-04-08 23:30:20

Unlock access to the data your product needs (Sponsored)

Most tools are still locked to their own database, blind to everything users already have in Slack, GitHub, Salesforce, Google Drive, and dozens of other apps. That's the ceiling on what you can build.

WorkOS Pipes removes it. One API call connects your product to the apps your users live in. Pull context from their tools, cross-reference data across silos, power AI agents that act across services. All with fresh, managed credentials you never have to think about.

Turn data to insight →


Every Friday morning, a team at Spotify takes hundreds of code changes written by dozens of engineering teams and begins packaging them into a single app update. That update will eventually reach more than 675 million users on Android, iOS, and Desktop. They do this every single week. And somehow, more than 95% of those releases ship to every user without a hitch.

The natural assumption is that they’re either incredibly careful and therefore slow, or incredibly fast and therefore reckless. The truth is neither.

How do you ship to 675 million users every week, with hundreds of changes from dozens of teams running on thousands of device configurations, without breaking things?

The answer is not simply to test harder. Spotify built a release architecture where speed and safety reinforce each other. In this article, we will walk through this process in detail and draw out the lessons.

Disclaimer: This post is based on publicly shared details from the Spotify Engineering Team. Please comment if you notice any inaccuracies.

The Two-Week Journey of a Spotify Release

To see how this works, let us follow a single release from code merge to production.

Spotify practices trunk-based development, which means that all developers merge their code into a single main branch as soon as it’s tested and reviewed. There are no long-lived feature branches where code sits in isolation for weeks. Everyone pushes to the same branch continuously, which keeps integration problems small but requires discipline and solid automated testing.

See the diagram below that shows the concept of trunk-based development:

Each release cycle starts on a Friday morning. The version number gets bumped on the main branch. From that point, nightly builds start going out to Spotify employees and a group of external alpha testers. During this first week, teams develop and merge new code freely. Bug reports flow in from internal and alpha users. Crash rates and other quality metrics are tracked for each build, both automatically and by human review. When a crash or issue crosses a predefined severity threshold, a bug ticket gets created automatically. When something looks suspicious but falls below that threshold, the Release Manager can create one manually.

On the Friday of the second week, the release branch gets cut, meaning a separate copy of the codebase is created specifically for this release. This is the key moment in the release cycle. From this point, only critical bug fixes are allowed on the release branch. Meanwhile, the main branch keeps moving. New features and non-critical fixes continue to merge there, destined for next week’s release. This separation is the mechanism that lets Spotify develop at full speed while simultaneously stabilizing what’s about to ship.

Teams then perform regression testing, checking that existing features still work correctly after the new changes, and report their results. Teams with high confidence in their automated tests and pre-merge routines can opt out of manual testing entirely. Beta testers receive builds from the more stable release branch, providing additional real-world runtime over the weekend.

By Monday, the goal is to submit the app to the stores. By Tuesday, if the app store review passes and quality metrics look good, Spotify rolls it out to 1% of users. By Wednesday, if nothing alarming surfaces, they roll out to 100%.

The flow below shows all the steps in a typical release process:

As an example, for version 8.9.2, which carried the Audiobooks feature launch in new markets, this timeline played out almost exactly as planned. What made that possible was everything happening behind the scenes.

Rings of Exposure: Catching Bugs Where They’re Cheapest to Fix

The code doesn’t go from a developer’s laptop to 675 million users in one jump. It passes through concentric rings of users, and each ring exists to catch a specific category of failure.

  • The first ring is Spotify’s own employees. They run nightly builds from the main branch, using the app the way real users do. This catches obvious functional bugs early. Even a crash that only affects a small number of employees gets investigated, because a bug that appears minor internally could signal a much larger problem once it hits millions of devices.

  • The second ring is external alpha testers. These users introduce more device diversity and real-world usage patterns that the internal team may not have anticipated. They’re running builds that are still being actively developed, so rough edges are expected, but the data they generate is invaluable.

  • The third ring is beta testers, who receive builds from the release branch rather than the main branch. These builds are expected to be more stable. Beta users provide additional runtime over weekends and evenings, and their feedback either builds confidence that the release is solid or surfaces issues that slipped through the first two rings.

  • The fourth ring is the 1% production rollout. Real users, real devices, real conditions. Spotify’s user base is large enough that even 1% provides statistically meaningful data. If a severe issue appears during this phase, the rollout is paused immediately, and the responsible team starts working on a fix.

  • The fifth and final ring is the 100% rollout. Only after the 1% rollout looks clean does the release go out to everyone.

For reference, the Audiobooks launch in version 8.9.2 shows how this system works at an even more granular level.

The Audiobooks feature didn’t just pass through these five rings of app release. It had its own layered rollout on top of that. The feature code had been sitting in the app for multiple releases already, hidden behind a backend feature flag. It was turned on for most employees first. The team watched for any crash, no matter how small, that might indicate trouble. Only after the app release itself reached a sufficient user base did the Audiobooks team begin gradually enabling the feature for real users in specific markets, using the same backend flag to control the percentage.

See the diagram below that shows the concept of a feature flag:

This separation between deploying code and activating a feature is a powerful pattern in the Spotify release process. It allows code to sit in the app, baking in production conditions invisibly, and get turned on later. If something goes wrong after activation, the feature can be turned off without shipping a new release. At Spotify’s scale, feature flags are a core safety mechanism, though managing hundreds of them across a large organization, each with per-market and per-user-percentage controls, is its own engineering challenge.
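The deploy-versus-activate split can be sketched in a few lines. This is a minimal, hypothetical feature-flag implementation with the per-market and per-user-percentage controls the article describes; the class name, hashing scheme, and market codes are illustrative assumptions, not Spotify's actual system.

```python
import hashlib

# A hypothetical backend feature flag with per-market and percentage controls.
class FeatureFlag:
    def __init__(self, name, enabled_markets=None, rollout_percent=0):
        self.name = name
        self.enabled_markets = set(enabled_markets or [])
        self.rollout_percent = rollout_percent  # 0-100

    def _bucket(self, user_id):
        # Hash the user id into a stable bucket 0-99, so the same user
        # keeps the same answer as the rollout percentage grows.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, user_id, market):
        if market not in self.enabled_markets:
            return False
        return self._bucket(user_id) < self.rollout_percent

# The feature code ships dark; widening the flag needs no new app release.
audiobooks = FeatureFlag("audiobooks", enabled_markets={"GB", "AU"})
print(audiobooks.is_enabled("user-42", "GB"))   # False: 0% rollout
audiobooks.rollout_percent = 100
print(audiobooks.is_enabled("user-42", "GB"))   # True: market fully enabled
print(audiobooks.is_enabled("user-42", "US"))   # False: market not enabled
```

Turning the feature off after a bad activation is just setting `rollout_percent` back to zero, which is exactly the kill switch the article credits feature flags with providing.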

The Release Manager also made a deliberate coordination decision for 8.9.2. Since the Audiobooks feature was a high-stakes launch with marketing events already scheduled, another major feature that had been planned for the same release was rescheduled to the following week. Fewer variables in a single release means easier diagnosis if something goes wrong. That kind of judgment call is one of the things that separates release management from pure automation.

From Jira to a Release Command Center

The multi-ring system generates a lot of data: crash rates, bug tickets, sign-off statuses, build verification results, and app store review progress. Someone has to make sense of all of it, and that wasn't an easy task.

Before the Release Manager Dashboard existed, everything lived in Jira. The Release Manager had to jump between tickets, check statuses across multiple views, and verify conditions manually, all while answering questions from teams on Slack. It was easy to miss a small detail, and a missed detail could mean extra work or a bug slipping through.

So the Release team built a dedicated command center with clear goals:

  • Optimize for the Release Manager’s workflow

  • Minimize context switching

  • Reduce cognitive load

  • Enable fast and accurate decisions

The result was the Release Manager Dashboard, built as a plugin on Backstage, Spotify’s internal developer portal.

It pulls and aggregates data from around 10 different backend systems into a single view. For each platform (Android, iOS, Desktop), the dashboard shows blocking bugs, the latest build status, automated test results, crash rates normalized against actual usage (so a crash rate is meaningful whether 1,000 or 1,000,000 people are using the build), team sign-off progress, and rollout state. Everything is color-coded:

  • Green means ready to advance

  • Yellow means something needs attention

  • Red means there’s a problem requiring action

Here’s an example of how the dashboard appears:

The dashboard also surfaces release criteria as a visible checklist:

  • All commits on the release branch are included in the latest build and passing tests

  • No open blocking bug tickets

  • All teams signed off

  • Crash rates below defined thresholds

  • Sufficient real-world usage of the build

When everything goes green, the release is ready to advance.
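The checklist above maps naturally onto code. Here is a hedged sketch of how such a readiness check might be expressed; the field names and thresholds are invented for illustration and are not Spotify's actual dashboard schema.

```python
# Hypothetical release-criteria check: every gate must pass to go green.
def release_status(snapshot):
    checks = {
        "latest build passing tests": snapshot["tests_passing"],
        "no open blocking bugs": snapshot["blocking_bugs"] == 0,
        "all teams signed off": snapshot["signoffs_done"] == snapshot["signoffs_total"],
        "crash rate below threshold": snapshot["crash_rate"] < snapshot["crash_threshold"],
        "sufficient real-world usage": snapshot["sessions"] >= snapshot["min_sessions"],
    }
    failed = [name for name, ok in checks.items() if not ok]
    return ("green", []) if not failed else ("red", failed)

snapshot = {
    "tests_passing": True, "blocking_bugs": 0,
    "signoffs_done": 12, "signoffs_total": 12,
    "crash_rate": 0.004, "crash_threshold": 0.01,
    "sessions": 250_000, "min_sessions": 100_000,
}
print(release_status(snapshot))  # ('green', [])
```

The value of encoding criteria this way is that "ready to advance" stops being a judgment scattered across Jira views and becomes a single computable answer.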

The dashboard got off to a rocky start, however. The first version was slow and expensive. Every page reload triggered queries to all 10 of the source systems it depended on, causing long load times and high costs. The Spotify engineering team noted that each reload cost about as much as a decent lunch in Stockholm. After switching to caching and pre-aggregating data every five minutes, load time dropped to eight seconds, and the cost became negligible.

The Robot: Automating the Predictable, Keeping Humans for the Ambiguous

The dashboard gave the Release Manager the information to make fast decisions.

However, by analyzing the time-series data the dashboard generated, the team noticed that a lot of the time in the release cycle wasn’t spent on hard decisions, but waiting.

The biggest time sinks were testing and fixing bugs (unavoidable), waiting for app store approval (outside Spotify’s control), and delays from manually advancing a release when a step was completed outside working hours. That last one alone could cost up to 12 hours. If the app store approved a build at 11 PM, the release just sat there until someone woke up and clicked “next.”

Therefore, the team built what they called “the Robot.”

It’s a backend service that models the release process as a state machine: a set of defined stages with specific conditions that must be met before moving to the next one. The Robot tracks seven states. The five states on the normal path forward are release branched, final release candidate (the build that will actually ship), submitted for app store review, rolled out to 1%, and rolled out to 100%. Two additional states handle problems: the rollout gets paused, or the release gets cancelled entirely.

See the diagram below:

The Robot continuously checks whether the conditions for advancing to the next state are met. If manual testing is signed off, no blocking bugs are open, and automated tests pass on the latest commit on the release branch, the Robot automatically submits the build for app store review without human intervention. If the app store approves the build at 3 AM, the Robot initiates the 1% rollout immediately instead of waiting for someone to show up at the office.
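A state machine like this is compact to express in code. The sketch below is a simplified model based on the article's description; the state names follow the seven states listed above, but the condition-checking interface is an assumption.

```python
# A minimal model of "the Robot": a release state machine that advances
# only when every gating condition for the next state is satisfied.
FORWARD = [
    "release_branched",
    "final_release_candidate",
    "submitted_for_review",
    "rolled_out_1_percent",
    "rolled_out_100_percent",
]

class ReleaseRobot:
    def __init__(self):
        self.state = FORWARD[0]

    def try_advance(self, conditions):
        """Move one step forward only if all conditions are met."""
        if self.state in ("paused", "cancelled"):
            return self.state           # problem states block the happy path
        idx = FORWARD.index(self.state)
        if idx + 1 < len(FORWARD) and all(conditions.values()):
            self.state = FORWARD[idx + 1]
        return self.state

    def pause(self):
        self.state = "paused"

robot = ReleaseRobot()
# Manual testing signed off, no blocking bugs, CI green on the release branch:
robot.try_advance({"signed_off": True, "no_blocking_bugs": True, "ci_green": True})
print(robot.state)  # final_release_candidate
# The next gate clears at 3 AM; no human needs to click "next".
robot.try_advance({"store_approved": True})
print(robot.state)  # submitted_for_review
```

Because the Robot only checks rules, pausing is the one transition that stays trivially available to humans at any point, which is the split the next paragraphs describe.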

The result was an average reduction of about eight hours per release cycle.

However, the Robot doesn’t make the hard calls. It doesn’t decide whether a crash affecting users in a specific region is severe enough to block a release. It doesn’t decide whether a bug in a new feature like Audiobooks, with marketing events already scheduled, should delay the entire release or just the feature rollout. It doesn’t negotiate with feature teams about timing. Those decisions require judgment, context, and sometimes difficult conversations. The Release Manager handles all of them.

This split is deliberate. Predictable transitions that depend on rule-checks get automated. Ambiguous decisions that require coordination and judgment stay with humans.

Conclusion

Spotify ships weekly to 675 million users through a strong release architecture. Layered exposure catches bugs where they’re cheapest to fix and centralized tooling turns scattered data into fast decisions. Automation handles the predictable so humans can focus on the ambiguous.

The key lesson here is that speed and safety aren’t opposites. At Spotify, each one enables the other. A weekly cadence means each release carries fewer changes. Fewer changes mean less risk per release. Less risk means shipping with confidence.

Since a cancelled release only costs one week, not a month or a quarter, teams are more willing to kill a bad release rather than push it through and hope for the best.


Nextdoor’s Database Evolution: A Scaling Ladder

2026-04-07 23:32:00

New Year, New Metrics: Evaluating AI Search in the Agentic Era (Sponsored)

Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you access to an exact framework to evaluate AI search and retrieval.

What you’ll get:

  • A four-phase framework for evaluating AI search

  • How to build a golden set of queries that predicts real-world performance

  • Metrics and code for measuring accuracy

Go from “looks good” to proven quality.

Learn how to run an eval


Nextdoor operates as a hyper-local social networking service that connects neighbors based on their geographic location.

The platform allows people to share local news, recommend local businesses, and organize neighborhood events. Since the platform relies on high-trust interactions within specific communities, the data must be both highly available and extremely accurate.

However, as the service scaled to millions of users across thousands of global neighborhoods, the underlying database architecture had to evolve from a simple setup into a sophisticated distributed system.

This engineering journey at Nextdoor highlights a fundamental rule of system design.

Every performance gain introduces a new requirement for data integrity. The team followed a predictable progression, moving from a single database instance to a complex hierarchy of connection poolers, read replicas, versioned caches, and background reconcilers. In this article, we will look at how the Nextdoor engineering team handled this evolution and the challenges they faced.

Disclaimer: This post is based on publicly shared details from the Nextdoor Engineering Team. Please comment if you notice any inaccuracies.

The Limits of the “Big Box”

In the early days, Nextdoor relied on a single PostgreSQL instance to handle every post, comment, and neighborhood update.

For many growing platforms, this is the most logical starting point. It is simple to manage, and PostgreSQL provides a robust engine capable of handling significant workloads. However, as more neighbors joined and the volume of simultaneous interactions grew, the team hit a wall that had less to do with the total amount of data stored and more to do with the connection limit.

PostgreSQL uses a process-per-connection model. In other words, every time an application worker wants to talk to the database, the server creates a completely new process to handle that request. If an application has five thousand web workers trying to access the database at the same time, the server must manage five thousand separate processes. Each process consumes a dedicated slice of memory and CPU cycles just to exist.

Managing thousands of processes creates a massive overhead for the operating system. The server eventually spends more time switching between these processes than it does running the actual queries that power the neighborhood feed. This is often the point where vertical scaling, or buying a larger server with more cores, starts to show diminishing returns. The overhead of the “process-per-connection” model remains a bottleneck regardless of how much hardware is thrown at the problem.

To solve this, Nextdoor introduced a layer of middleware called PgBouncer. This is a connection pooler that sits between the application and the database. Instead of every application worker maintaining its own dedicated line to the database, they all talk to PgBouncer.

  • The Request Phase: A web worker requests a connection from PgBouncer to execute a quick query.

  • The Assignment Phase: PgBouncer assigns an idle connection from its pre-established pool rather than forcing the database to create a new process.

  • The Execution Phase: The query runs against the database using that shared connection.

  • The Release Phase: The worker finishes its task, and the connection returns to the pool immediately for the next worker to use.

This allows thousands of application workers to share a few hundred “warm” database connections. This effectively removed the connection bottleneck and allowed the primary database to focus entirely on data processing.
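The four phases above amount to a classic pooling loop. Here is a toy, in-memory illustration of the idea; `FakeConnection` stands in for a real PostgreSQL connection, and real PgBouncer is of course far more sophisticated (transaction pooling modes, timeouts, and so on).

```python
import queue

# Toy connection pool: many workers share a few pre-opened connections.
class FakeConnection:
    def __init__(self, conn_id):
        self.conn_id = conn_id
    def execute(self, sql):
        return f"conn{self.conn_id}: {sql}"

class Pool:
    def __init__(self, size):
        self._idle = queue.Queue()
        for i in range(size):            # connections opened once, up front
            self._idle.put(FakeConnection(i))

    def run(self, sql):
        conn = self._idle.get()          # assignment: borrow an idle connection
        try:
            return conn.execute(sql)     # execution: run on the shared connection
        finally:
            self._idle.put(conn)         # release: return it for the next worker

pool = Pool(size=3)
# A thousand sequential "workers" reuse the same three connections.
results = [pool.run("SELECT 1") for _ in range(1000)]
print(len(results))  # 1000
```

The database sees only three long-lived connections regardless of how many workers queue up, which is exactly why the process-per-connection overhead disappears.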

Dividing the Labor and the “Lag” Problem

Once connection management was stable, the next bottleneck appeared in the form of read traffic.

In a social network like Nextdoor, the ratio of people reading the feed compared to people writing a post is heavily skewed. For every one person who posts a new neighborhood update, hundreds of others might view it. A single database server must handle both the “Writes” and the “Reads” at the same time. This creates resource contention where heavy read queries can slow down the ability of the system to save new data.

The solution was to move to a Primary-Replica architecture. In this setup, one database server is designated as the Primary. It is the only server allowed to modify or change data. Several other servers, known as Read Replicas, maintain copies of the data from the Primary. All the “Read” traffic from the application is routed to these replicas, while only the “Write” traffic goes to the Primary.

See the diagram below:

This separation of labor allows for massive horizontal scaling of reads. However, this introduces the challenge of Asynchronous Replication. The Primary database sends its changes to the replicas using a stream of logs. It takes time for a new post saved on the Primary to travel across the network and appear on the replicas. This delay is known as replication lag.

See the diagram below that shows the difference between synchronous and asynchronous replication:

To solve the issue of a neighbor making a post and then seeing it disappear upon a refresh, Nextdoor uses Time-Based Dynamic Routing. This is a smart routing logic that ensures users always see the results of their own actions. Here’s how it works:

  • The Write Marker: When a user performs a write action, like posting a comment, the application notes the exact timestamp of that event.

  • The Protected Window: For a specific period of time after that write, often a few seconds, the system treats that specific user as sensitive to replication lag.

  • Dynamic Routing: During this window, all read requests from that user are dynamically routed to the Primary database instead of a replica.

  • The Handover: Once the time window expires and the system is confident the replicas have caught up with the Primary, the user’s traffic is routed back to the replicas to save resources.

This ensures that while the general neighborhood sees eventually consistent data, the person who made the change always sees strongly consistent data.
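The four-step routing logic above is small enough to sketch directly. This is an illustrative model, not Nextdoor's code: the class name, the window length, and the injectable clock are all assumptions made for the example.

```python
import time

# Time-based dynamic routing: after a write, pin that user's reads to the
# primary for a protected window, then hand them back to the replicas.
PROTECTED_WINDOW_SECONDS = 5.0

class Router:
    def __init__(self, window=PROTECTED_WINDOW_SECONDS, clock=time.monotonic):
        self.window = window
        self.clock = clock
        self.last_write = {}             # user_id -> timestamp of last write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()   # the write marker

    def pick(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.window:
            return "primary"             # their write may not have replicated yet
        return "replica"

# Injecting a fake clock makes the handover easy to see.
now = [100.0]
router = Router(window=5.0, clock=lambda: now[0])
router.record_write("alice")
print(router.pick("alice"))  # primary (inside the protected window)
print(router.pick("bob"))    # replica (no recent write)
now[0] += 10.0
print(router.pick("alice"))  # replica (window expired, replicas caught up)
```

Note that only the writer pays the cost of hitting the primary; every other reader keeps the cheap, horizontally scaled replica path.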


Why writing code isn’t the hard part anymore (Sponsored)

Coding is no longer the bottleneck, it’s prod.

With the rise in AI coding tools, teams are shipping code faster than they can operate it. And production work still means jumping between fragmented tools, piecing together context from systems that don’t talk to each other, and relying on the few engineers who know how everything connects.

Leading teams like Salesforce, Coinbase, and Zscaler cut investigation time by over 80% with Resolve AI, using multi-agent investigation that works across code, infrastructure, and telemetry.

Learn how AI-native engineering teams are implementing AI in their production systems

Get the free AI for Prod ebook ➝


The High-Speed Library

Even with multiple replicas, hitting a database for every single page load is an expensive operation.

Databases must read data from a disk or a large memory pool and often perform complex joins between different tables to assemble a single record. To provide the millisecond response times neighbors expect, Nextdoor implemented a caching layer using Valkey. This is an open-source high-performance data store that holds information in RAM for near-instant access.

The team uses a Look-aside Cache pattern. When the application needs data, it follows a specific sequence:

  • The Cache Check: The application looks for the data in Valkey using a unique key.

  • The Cache Hit: If the data is found, it is returned instantly to the user without touching the database.

  • The Cache Miss: If the data is missing, the application queries the PostgreSQL database to find the truth.

  • The Population Step: The application takes the database result, saves a copy in Valkey for future requests, and then returns it to the user.

Efficiency is vital when managing a cache at this scale. RAM is much more expensive than disk storage, so the data must be as small as possible.

Nextdoor uses a binary serialization format called MessagePack. In other words, instead of storing data as a bulky text format like JSON, they convert it into a highly compressed binary format that is much faster for the computer to parse.

MessagePack is particularly useful for Nextdoor because it supports schema evolution. If the engineering team adds a new field to a neighbor’s profile, the older cached data can still be read without crashing the application. For even larger pieces of data, they use Zstd compression. By combining these two tools, Nextdoor reduces the memory footprint of its cache servers.

Versioning and Atomic Updates

Caching creates a serious problem in certain scenarios: the cache starts lying. If the database is updated but the cache is not refreshed, users see old, incorrect information. Most simple caching strategies rely on a “Time to Live,” or TTL: a timer that tells the cache to delete an entry after a few minutes. For a real-time social network, waiting several minutes for a post to update is not an acceptable solution.

Nextdoor built a sophisticated versioning engine to ensure the cache stays up to date. They added a special column called system_version to their database tables and used PostgreSQL Triggers to manage this number. For reference, a trigger is a small script that runs automatically inside the database whenever a row is touched. Every time a post is updated, the trigger increments the version number. This ensures that the database remains the ultimate source of truth regarding which version of a post is the newest.

When the application tries to update the cache, it does not just overwrite the old data. It uses a Lua script executed inside Valkey. This script performs an atomic compare-and-set operation that works as follows:

  • The Metadata Fetch: The script retrieves the version number currently stored in the cache entry.

  • The Version Comparison: It compares the version to the version number of the new update being sent by the application.

  • The Conditional Write: If the new version is strictly greater than the cached version, the update is saved.

  • The Rejection: If the cached version is already equal to or higher than the new update, the script rejects the change entirely.

This prevents “race conditions.” Imagine two different servers trying to update the same post at the same time. Without this logic, an older update could arrive a millisecond later and overwrite a newer update. This would leave the cache permanently out of sync with the database. By using Lua, the entire process of checking the version and updating the data happens as a single, unbreakable step that cannot be interrupted.
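To make the comparison logic concrete, here is a pure-Python model of that compare-and-set. In production this check runs as a Lua script inside Valkey, which is what makes it a single uninterruptible step; a Python function has no such atomicity guarantee, so this is only a sketch of the decision rule.

```python
# Model of the versioned compare-and-set: accept an update only if its
# version is strictly greater than what the cache already holds.
cache = {}  # key -> {"version": int, "data": ...}

def cas_update(key, version, data):
    entry = cache.get(key)
    if entry is not None and entry["version"] >= version:
        return False                     # rejection: stale or duplicate update
    cache[key] = {"version": version, "data": data}
    return True                          # conditional write: newer version wins

# Two servers race: the newer update (v3) lands first, so the older
# in-flight update (v2) is rejected instead of clobbering it.
print(cas_update("post:1", 3, "edited title"))    # True
print(cas_update("post:1", 2, "original title"))  # False (stale)
print(cache["post:1"]["data"])                    # edited title
```

The strict "greater than" comparison is what makes out-of-order delivery harmless: whichever order the updates arrive in, the highest version number is the one that survives.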

CDC and Reconciliation

Even with versioning and Lua scripts, errors can occur.

A network partition might prevent a cache update from reaching Valkey, or an application process might crash before it can finish the population step. Nextdoor needed a final safety net to catch these discrepancies. They implemented Change Data Capture, also known as CDC, using a tool called Debezium.

See the diagram below:

CDC works by “listening” to the internal logs of the PostgreSQL database. Specifically, it watches the Write-Ahead Log, where every single change is recorded before it is committed. Every time a change happens in the database, Debezium captures that event and turns it into a message in a data stream. A background service known as the Reconciler watches this stream.

The reconciliation flow provides a “self-healing” mechanism for the entire setup:

  • The Database Update: A user updates their neighborhood bio in the Primary PostgreSQL database.

  • The Log Capture: Debezium detects the new log entry and publishes a change event message.

  • The Reconciler Action: The background service receives this message and identifies which cache key needs to be corrected.

  • The Invalidation: The service tells the cache to delete the old entry. The next time a neighbor requests that bio, the application will experience a “Cache Miss” and fetch the perfectly fresh data from the database.

This process provides eventual consistency. While the primary cache update might fail for a fraction of a second, the CDC Reconciler will eventually detect the change and fix the cache. It acts like a detective that constantly audits the system to ensure the fast truth in the cache matches the real truth in the database.
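The reconciler's core loop is simple once the change events arrive. The sketch below assumes a Debezium-style event shape and a cache key naming scheme invented for this example; the real event payloads and key derivation are Nextdoor's own.

```python
# Sketch of the CDC reconciler: consume change events from the stream and
# invalidate the matching cache entries so the next read refetches fresh data.
cache = {"user:7:bio": "Old bio text", "user:9:bio": "Unchanged"}

def cache_key_for(event):
    # Hypothetical mapping from a change event to a cache key.
    return f"{event['table']}:{event['row_id']}:bio"

def reconcile(events):
    invalidated = []
    for event in events:
        key = cache_key_for(event)
        if cache.pop(key, None) is not None:   # delete the possibly stale entry;
            invalidated.append(key)            # the next read falls through to the DB
    return invalidated

# A WAL change event for user 7's bio flows through the stream:
events = [{"table": "user", "row_id": 7, "op": "update"}]
print(reconcile(events))                              # ['user:7:bio']
print("user:7:bio" in cache, "user:9:bio" in cache)   # False True
```

Deleting rather than rewriting the entry is deliberate: invalidation lets the ordinary look-aside miss path repopulate the cache, so there is only one code path that writes fresh data.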

Sharding

There comes a point where even the most optimized single Primary database cannot handle the volume of incoming writes. When a platform processes billions of rows, the hardware itself reaches physical limits. This is when Nextdoor moves to the final rung of the ladder: sharding.

Sharding is the process of breaking a single, massive table into smaller pieces and spreading them across entirely different database clusters. Nextdoor typically shards data by a unique identifier such as a Neighborhood ID.

  • The Cluster Split: All data for Neighborhoods 1 through 500 might live on Cluster A, while Neighborhoods 501 through 1,000 live on Cluster B.

  • The Shard Key: The application uses the neighborhood_id to know exactly which database cluster to talk to for any given request.

Sharding allows for much greater scaling because we can keep adding more clusters based on growth. However, it comes at a high cost in complexity. Once we shard a database, we can no longer easily perform a “Join” between data on two different shards.
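Range-based shard routing is little more than a lookup table. This sketch uses the article's example ranges; real systems typically store the mapping in a config service and handle resharding, which is elided here.

```python
# Range-based shard routing by neighborhood_id, per the cluster split above.
SHARD_RANGES = [
    (1, 500, "cluster_a"),
    (501, 1000, "cluster_b"),
]

def cluster_for(neighborhood_id):
    for low, high, cluster in SHARD_RANGES:
        if low <= neighborhood_id <= high:
            return cluster
    raise KeyError(f"no shard covers neighborhood {neighborhood_id}")

print(cluster_for(42))    # cluster_a
print(cluster_for(777))   # cluster_b
```

The cost the article mentions is visible even here: a query spanning neighborhoods 42 and 777 must hit two clusters and be joined in application code, because no single database sees both rows.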

Conclusion

The journey of Nextdoor’s database shows that great engineering is rarely about choosing the most complex tool first. It is about a disciplined progression.

They started with a single server and added connection pooling when the lines got too long. They added replicas when the read traffic became too heavy. Finally, they built a world-class versioned caching system to provide the speed neighbors expect without sacrificing the accuracy they require.

The takeaway is that complexity must be earned. Each layer of the scaling ladder solves one problem while introducing a new challenge in data consistency. By building robust safety nets such as versioning and reconciliation, the Nextdoor engineering team ensured that its system could grow without losing the trust of the communities it serves.


A Guide to Context Engineering for LLMs

2026-04-06 23:30:52

The workshop for teams drowning in observability tools (Sponsored)

Five vendors, rising costs, and you still can’t tell why something broke.

Sentry’s Lazar Nikolov sits down with Recurly’s Chris Barton to talk through what observability consolidation actually looks like in practice: how to evaluate your options, where AI fits in, and how to think about cost when you’re ready to simplify.

Register your spot today


Giving an LLM more information can make it dumber. A 2025 research study by Chroma tested 18 of the most powerful language models available, including GPT-4.1, Claude, and Gemini, and found that every single one performed worse as the amount of input grew.

The degradation wasn’t minor, either. Some models held steady at 95% accuracy and then nosedived to 60% once the input crossed a certain length.

This finding busts one of the most common myths about working with LLMs: that more context is always better. The reality is that LLMs have architectural blind spots that make what you put in front of them, and how you structure it, far more important than how much you include.

The discipline of getting this right is called context engineering.

In this article, we’ll look at how LLMs actually process the information you give them, what context engineering is, and the strategies that can help with it.

Key Terminologies

Before we go further, there are three terms that come up constantly when talking about LLMs. Getting clear on these first will make everything that follows much easier to reason about.

  • Tokens: They are the units LLMs think in. They aren’t full words, but rather chunks of text that average roughly three-quarters of a word each. The word “context” is one token, while the word “engineering” gets split into two. Every piece of text the model processes, from your question to its instructions to any documents you’ve included, is measured in tokens.

  • Context Window: It is the total number of tokens the model can see at once during a single interaction. Everything has to fit inside this window: the system instructions that define the model’s behavior, the conversation history, any external documents or data you’ve injected, and your actual question. Modern models advertise context windows ranging from 128,000 to over 2 million tokens. That sounds enormous, but as we’ll see, bigger isn’t straightforwardly better.

  • Attention: This is the mechanism the model uses to figure out which tokens matter to which other tokens. Before generating each new token of its response, the model compares it against every other token currently in the context window. This gives LLMs their ability to connect ideas across long stretches of text, but it’s also the source of their most important limitations.

How LLMs Process Context

When we send text to an LLM, it doesn’t read from top to bottom the way a human would. The attention mechanism compares every token against every other token to compute relationships, which means the model can, in principle, connect an idea from the first sentence of the input to one in the last sentence. However, this power comes with two critical costs.

  • The first is computational. Doubling the number of tokens in the context window roughly quadruples the computation required. Longer contexts are disproportionately slower and more expensive.

  • The second cost is more consequential. Attention isn’t distributed evenly across the context window. Research has consistently shown that LLMs pay the most attention to tokens at the beginning and end of the input, with a significant drop-off in the middle. This is known as the “lost in the middle” problem; studies have found that accuracy can drop by over 30% when relevant information sits in the middle of the input rather than at the beginning or end.
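
A quick back-of-the-envelope calculation makes the first cost tangible. Because attention compares every token against every other token, compute grows roughly with the square of the context length:

```python
# Self-attention compares every token with every other token, so
# compute scales roughly quadratically with context length.

def relative_attention_cost(tokens: int, baseline: int = 1_000) -> float:
    """Attention cost at `tokens` relative to a baseline context length."""
    return (tokens / baseline) ** 2

for n in (1_000, 2_000, 4_000, 8_000):
    print(f"{n:>6} tokens -> {relative_attention_cost(n):4.0f}x baseline cost")
# 8x the tokens costs roughly 64x the attention compute.
```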

See the diagram below that shows the attention curve:

This isn’t a bug in any particular model, but rather a structural property of how transformers (the neural network architecture that powers virtually all modern LLMs) encode the position of tokens.

The positional encoding method used in most modern LLMs (called Rotary Position Embedding, or RoPE) introduces a decay effect that makes tokens far from both the start and end of the sequence land in a low-attention zone. Newer models have reduced the severity, but no production model has fully eliminated it.

The practical implication is that the position of information in the input matters as much as the information itself. If we paste a long document into an LLM, the model is most likely to miss information buried in the middle pages.
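
One practical mitigation is to arrange retrieved material so the most relevant pieces land at the edges of the prompt, where attention is strongest. Here is a minimal sketch; the relevance scores are hypothetical values that would come from a retrieval step.

```python
# Mitigating "lost in the middle": place the highest-scoring chunks at
# the start and end of the prompt, pushing weaker material to the middle.

def order_for_attention(chunks_with_scores):
    """Alternate ranked chunks between the front and back of the prompt,
    so the best chunks end up at the edges."""
    ranked = sorted(chunks_with_scores, key=lambda cs: cs[1], reverse=True)
    front, back = [], []
    for i, (chunk, _) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

chunks = [("A", 0.9), ("B", 0.2), ("C", 0.7), ("D", 0.4)]
print(order_for_attention(chunks))  # ['A', 'D', 'B', 'C']: A and C at the edges
```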

Why More Context Can Hurt

The uneven attention distribution is one problem, but there’s a broader pattern that compounds it, known as context rot.

Context rot is the degradation of LLM performance as input length increases, even on simple tasks. The Chroma research team’s 2025 study tested 18 frontier models and found that this degradation isn’t gradual. Models can maintain near-perfect accuracy up to a certain context length, and then performance falls off a cliff. The breaking point varies by model and by task, making it impossible to predict reliably.

Why does this happen?

Every token you add to the context window draws from a finite attention budget. Irrelevant information buries important information in low-attention zones, and content that sounds related but isn’t actually useful confuses the model’s ability to identify what’s relevant. The model doesn’t get smarter with more input; it gets distracted.

On top of this, LLMs are stateless. They have zero memory between calls, and each interaction starts completely fresh. When a multi-turn conversation with an LLM like ChatGPT seems to “remember” what we said earlier, that’s because the system is re-injecting the conversation history into the context window each time. The model itself remembers nothing, which means someone, or some system, has to decide for every single call what information to include, what to leave out, and how to structure it.
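
The statelessness point is easy to see in code. In the sketch below, `fake_llm` is a stand-in for a real model API call; the only reason the conversation appears continuous is that the application re-sends the full history on every call.

```python
# "Memory" in a chat app is the application re-injecting the full
# history on every call. `fake_llm` is a placeholder for a real API:
# it just reports how many messages it was actually given.

def fake_llm(messages):
    return f"(model saw {len(messages)} messages)"

history = [{"role": "system", "content": "You are a helpful assistant."}]

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = fake_llm(history)  # the FULL history goes in, every time
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("Hi"))         # (model saw 2 messages)
print(chat("And again"))  # (model saw 4 messages)
```

Each call sees more messages only because the application keeps appending to `history`; drop that list and the model forgets everything.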

There’s also a meaningful gap between marketing and reality. Models advertise million-token context windows, and they pass simple benchmarks at those lengths. However, the effective context length, where the model actually uses information reliably, is often much smaller. Passing a “needle in a haystack” test (finding one planted sentence in a long document) is very different from reliably synthesizing information scattered across hundreds of pages.

Defining Context Engineering

Context engineering is the practice of designing, assembling, and managing the entire information environment an LLM sees before it generates a response. It goes beyond writing a single good instruction to orchestrating everything that fills the context window, so the model has exactly what it needs for the task at hand and nothing more.

To understand what this involves, it helps to see what actually competes for space inside a context window. There are six types of context in a typical LLM call:

  • System instructions (the behavioral rules, persona, and guidelines the model follows)

  • User input (your actual question or command)

  • Conversation history (the short-term memory of the current session)

  • Retrieved knowledge (documents, database results, or API responses pulled in from external sources)

  • Tool descriptions (definitions of tools the model can call and how to use them)

  • Tool outputs (results returned from previous tool calls)

Notice that the user’s actual question is often a tiny fraction of the total token count.

The rest is infrastructure, and that infrastructure is what context engineering designs.
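
A simple sketch shows how those six context types might be assembled into a single prompt. The section names and strings here are illustrative placeholders, not a real API.

```python
# Assembling the six context types into one prompt. Empty sections are
# dropped so they don't waste tokens.

def assemble_context(system, history, retrieved, tools, tool_outputs, question):
    sections = [
        ("SYSTEM INSTRUCTIONS", system),
        ("CONVERSATION HISTORY", "\n".join(history)),
        ("RETRIEVED KNOWLEDGE", "\n".join(retrieved)),
        ("AVAILABLE TOOLS", "\n".join(tools)),
        ("TOOL OUTPUTS", "\n".join(tool_outputs)),
        ("USER QUESTION", question),
    ]
    return "\n\n".join(f"## {name}\n{body}" for name, body in sections if body)

prompt = assemble_context(
    system="Answer concisely.",
    history=["user: hello", "assistant: hi"],
    retrieved=["Doc 1: refund policy is 30 days."],
    tools=["search(query) -> results"],
    tool_outputs=[],
    question="What is the refund window?",
)
print(prompt)
```

Even in this toy version, the user's one-line question is dwarfed by the surrounding scaffolding, which is exactly the point.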

This also clarifies how context engineering differs from prompt engineering. Prompt engineering asks, “How do I phrase my instruction to get the best result?” Context engineering, on the other hand, asks, “What does the model need to see right now, and how do I assemble all of it dynamically?”

Prompt engineering is one component within context engineering, focused on the instruction layer, while context engineering encompasses the full information system around the model. As Andrej Karpathy put it in a widely referenced post, context engineering is the “delicate art and science of filling the context window with just the right information for the next step.”

Two people using the same model can get wildly different results. The model is the same, but the context is different, and context engineering is what makes the difference.

Core Strategies

Developers have converged on four broad strategies for managing context, categorized as write, select, compress, and isolate. Each one is a direct response to a specific constraint we’ve already covered.

Write: Save Context Externally

The constraint it addresses is that the context window is finite, and statelessness means information is lost between calls.

Instead of trying to keep everything inside the context window, save important information to external storage and bring it back when needed. This takes two main forms.

  • The first is scratchpads, where an agent saves intermediate plans, notes, or reasoning steps to external storage during a long-running task. Anthropic’s multi-agent research system does exactly this. The lead researcher agent writes its plan to external memory at the start of a task, because if the context window exceeds 200,000 tokens, it gets truncated and the plan would be lost.

  • The second form is long-term memory, which involves persisting information across sessions. ChatGPT auto-generates user preferences from conversations, Cursor and Windsurf learn coding patterns and project context, and Claude Code uses CLAUDE.md files as persistent instruction memory. All of these systems treat external storage as the real memory layer, with the context window serving as a temporary workspace.
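
A minimal scratchpad can be sketched in a few lines: the agent persists its plan outside the context window and reloads it later, so the plan survives truncation or even a fresh session. The file path and keys here are illustrative.

```python
# Minimal scratchpad: state lives on disk, not in the context window.
import json
import os
import tempfile

class Scratchpad:
    def __init__(self, path):
        self.path = path

    def write(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

    def read(self, key, default=None):
        return self._load().get(key, default)

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

pad = Scratchpad(os.path.join(tempfile.gettempdir(), "agent_scratchpad.json"))
pad.write("plan", ["gather sources", "draft summary", "verify citations"])

# ...later, possibly after the context window was compacted...
print(pad.read("plan"))
```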

Select: Pull In Only What’s Relevant

The constraint it addresses is that more context isn’t better, and the model needs the right information rather than all available information.

The most important technique here is Retrieval-Augmented Generation, or RAG. Instead of stuffing all the knowledge into the context window, we store it externally in a searchable database. At query time, we retrieve only the chunks most relevant to the current question and inject those into the context, giving the model targeted knowledge without the noise of everything else.

Selection also applies to tools. When an agent has dozens of available tools, listing every tool description in every prompt wastes tokens and confuses the model. A better approach is to retrieve only the tool descriptions relevant to the current task.

The critical tradeoff with selection is precision. If the retrieval pulls in documents that are almost relevant but not quite, they become distractors that add tokens and push important context into low-attention zones. The retrieval step itself has to be good, or the whole strategy backfires.
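
The selection principle can be sketched with a toy retriever. Real systems use embeddings and vector similarity; this version scores documents by simple keyword overlap, which is enough to show the shape of the idea.

```python
# Toy retrieval: score documents by keyword overlap with the query and
# inject only the top-k into the context. Real systems use embeddings.

def score(query: str, doc: str) -> int:
    """Number of query words that also appear in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def select_top_k(query, docs, k=2):
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

docs = [
    "refund policy allows returns within 30 days",
    "shipping takes 5 business days",
    "gift cards cannot be refunded under the refund policy",
]
print(select_top_k("what is the refund policy", docs))
```

The shipping document scores zero and never enters the context, which is precisely the token savings selection is after; the risk is that a bad scorer would let it in as a distractor.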

Compress: Keep Only What You Need

The constraint it addresses is context rot and the escalating cost of attention as token counts grow.

As agent workflows span dozens or hundreds of steps, the context window fills up with accumulated conversation history and tool outputs. Compression strategies reduce this bulk while trying to preserve the essential information.

Conversation summarization is the most common approach. Claude Code, for instance, triggers an “auto-compact” process when the context hits 95% capacity, summarizing the entire interaction history into a shorter form. Cognition, the company behind the Devin coding agent, trained a separate, dedicated model specifically for summarization at agent-to-agent boundaries. The fact that they built a separate model just for this step tells us how consequential bad compression can be, since a specific decision or detail that gets summarized away is gone permanently.

Simpler forms of compression include trimming (removing older messages from the history) and tool output compression (reducing verbose search results or code outputs to their essentials before they enter the context).
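
A sketch of trimming with a compaction trigger looks like this. The word-based "token" budget and the summary string are simplifications; a production system would measure real tokens and use an LLM for the summary step.

```python
# Trimming with a compaction trigger: when history exceeds a threshold,
# older messages are replaced by a crude summary (a real system would
# generate the summary with an LLM).

CAPACITY = 20   # pretend token budget, counted in words for simplicity
TRIGGER = 0.95  # compact at 95% capacity

def word_count(messages):
    return sum(len(m.split()) for m in messages)

def compact(messages):
    if word_count(messages) < TRIGGER * CAPACITY:
        return messages
    keep = messages[-2:]  # keep the most recent turns verbatim
    summary = f"[summary of {len(messages) - 2} earlier messages]"
    return [summary] + keep

history = ["one two three four", "five six seven", "eight nine", "ten eleven twelve"]
print(word_count(history))  # 12: under threshold, nothing happens
history = compact(history + ["thirteen fourteen fifteen sixteen seventeen eighteen nineteen"])
print(history)              # oldest three messages collapsed into a summary
```

Whatever detail lived in those three collapsed messages is gone for good, which is the information-loss tradeoff discussed below.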

Isolate: Split Context Across Agents

The constraint it addresses is attention dilution and context poisoning, which occur when too many types of information compete in one window.

Instead of one agent trying to handle everything in a single bloated context window, this strategy splits the work across multiple specialized agents, each with its own clean, focused context. A “researcher” agent gets a context loaded with search tools and retrieved documents, while a “writer” agent gets a context loaded with style guides and formatting rules, so neither is distracted by the other’s information.

Anthropic demonstrated this with their multi-agent research system, where a lead Opus 4 agent delegated sub-tasks to Sonnet 4 sub-agents. The system achieved a 90.2% improvement over a single Opus 4 agent on research tasks, despite using the same underlying model family. The entire performance gain came from how context was managed, not from a more powerful model.
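
The isolation pattern can be sketched as two specialists with deliberately separate contexts. `run_agent` here is a stand-in for a real model call; the point is only that each role sees a small, focused context rather than one shared, bloated one.

```python
# Isolation sketch: each specialist agent gets only the context relevant
# to its role. `run_agent` stands in for a real model call.

def run_agent(role: str, context: list[str], task: str) -> str:
    return f"{role} handled '{task}' with {len(context)} context items"

research_context = ["search tool spec", "retrieved doc A", "retrieved doc B"]
writing_context = ["style guide", "formatting rules"]

findings = run_agent("researcher", research_context, "gather facts on topic X")
# Only the researcher's OUTPUT crosses the boundary, not its whole context.
draft = run_agent("writer", writing_context + [findings], "draft the report")
print(draft)
```

The handoff is the key design choice: the writer receives the researcher's findings, not the researcher's search tools or raw documents.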

See the diagram below:

Tradeoffs

These strategies are powerful, but they involve trade-offs with no universal right answers:

  • Compression versus information loss: Every time you summarize, you risk losing a detail that turns out to matter later. The more aggressively you compress, the more you save on tokens, but the higher the chance of permanently destroying something important.

  • Single agent versus multi-agent: Anthropic’s multi-agent results are impressive, but others, notably Cognition, have argued that a single agent with good compression delivers more stability and lower cost. Both sides are debating the same core question of how to manage context effectively, and the answer depends on task complexity, cost tolerance, and reliability requirements.

  • Retrieval precision versus noise: RAG adds knowledge, but imprecise retrieval adds distractors. If the documents you retrieve aren’t genuinely relevant, they consume tokens and push important content into low-attention positions, so the retrieval system itself has to be well-engineered, or RAG makes things worse.

  • Cost versus richness: Every token costs money and processing time. The disproportionate scaling of attention means longer contexts get expensive fast, and context engineering is partly an economics problem of figuring out where the return on additional tokens stops being worth the cost.

Conclusion

The core takeaway is that the model is only as good as the context it receives. Working with LLMs effectively requires thinking about the entire system around the model, not just the model itself.

As models get more powerful, context engineering becomes more important. When the model is capable enough, most failures stop being intelligence failures and start being context failures, where the model could have gotten it right but didn’t have what it needed or had too much of what it didn’t need.

The strategies are evolving, and best practices are being revised as new models ship. However, the underlying constraints of finite attention, positional bias, and statelessness are architectural.

EP209: 12 Claude Code Features Every Engineer Should Know

2026-04-04 23:30:42

Turn cloud logs into real security signals (Sponsored)

This guide from Datadog provides best practices on how to use Cloud SIEM to detect threats, investigate incidents, and reduce blind spots across cloud and Kubernetes environments.

You’ll learn how to:

  • Analyze CloudTrail, GCP audit, and Azure logs for suspicious activity

  • Detect authentication anomalies and common attack patterns

  • Monitor Kubernetes audit logs for lateral movement and misuse

  • Correlate signals across services to accelerate investigations

Get the ebook


This week’s system design refresher:

  • 12 Claude Code Features Every Engineer Should Know

  • How Agentic RAG Works

  • How does a REST API work?

  • 7 Key Load Balancer Use Cases

  • Our New Book on Behavioral Interviews Is Now Available on Amazon!


12 Claude Code Features Every Engineer Should Know

  1. CLAUDE.md: A project memory file that defines custom rules and conventions. Claude reads it at the start of every session.

  2. Permissions: Control which tools Claude can and can't use.

  3. Plan Mode: Claude plans before it acts. You can review the plan before any code changes.

  4. Checkpoints: Automatic snapshots of your project to revert to if something goes wrong.

  5. Skills: Reusable instruction files Claude follows automatically.

  6. Hooks: Run custom shell scripts on lifecycle events like PreToolUse or PostToolUse.

  7. MCP: Connect Claude to any external tools like databases and third-party services.

  8. Plugins: Extend Claude with third-party integrations containing skills, MCPs, and hooks.

  9. Context: Feed Claude what it needs and manage the current context window with /context.

  10. Slash Commands: Create shortcuts for tasks you run often. Type / and pick from your saved commands.

  11. Compaction: Compress long conversations to save tokens.

  12. Subagents: Spawn parallel agents for complex tasks. Divide large multi-step workflows and run them simultaneously.

Over to you: Which Claude Code feature do you use the most? Any features you wish were on this list?


How Agentic RAG Works

A traditional RAG system uses simple retrieval, offers limited adaptability, and relies on static knowledge, making it a poor fit for dynamic, real-time information.

Agentic RAG improves on this by introducing AI agents that can make decisions, select tools, and even refine queries for more accurate and flexible responses. Here’s how Agentic RAG works on a high level:

  1. The user query is directed to an AI Agent for processing.

  2. The agent uses short-term and long-term memory to track query context. It also formulates a retrieval strategy and selects appropriate tools for the job.

  3. The data fetching process can use tools such as vector search, multiple agents, and MCP servers to gather relevant data from the knowledge base.

  4. The agent then combines retrieved data with a query and system prompt. It passes this data to the LLM.

  5. The LLM processes the optimized input to answer the user’s query.
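
The five steps above can be sketched as a single loop. Every component here (the tools, the `choose_tool` heuristic, and the `llm` function) is a simplified stand-in; a real agent would delegate tool selection to the model itself.

```python
# High-level sketch of the agentic RAG loop. All tools and `llm` are
# stand-ins for real components.

def vector_search(query):
    return [f"doc relevant to '{query}'"]

def web_search(query):
    return [f"web result for '{query}'"]

TOOLS = {"vector_search": vector_search, "web_search": web_search}

def choose_tool(query: str) -> str:
    # A real agent would ask the LLM; this is a trivial heuristic.
    return "web_search" if "latest" in query else "vector_search"

def llm(prompt: str) -> str:
    return f"answer based on: {prompt[:60]}..."

def agentic_rag(query: str, memory: list[str]) -> str:
    tool_name = choose_tool(query)       # step 2: strategy + tool selection
    retrieved = TOOLS[tool_name](query)  # step 3: fetch relevant data
    prompt = "\n".join(["SYSTEM: answer from context", *memory, *retrieved, query])
    answer = llm(prompt)                 # steps 4-5: assemble input, generate
    memory.append(f"user asked: {query}")  # update short-term memory
    return answer

memory = []
print(agentic_rag("what changed in the latest release", memory))
```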


Stop babysitting your agents. (Sponsored)

Unblocked gives Cursor, Codex, Claude and Copilot the organizational knowledge to generate mergeable code without the back and forth. It pulls context from across your engineering stack, resolves conflicts, and cuts the rework cycle by delivering only what agents need for the task at hand.

Unblock your agents


How does a REST API work?

What are its principles, methods, constraints, and best practices? I hope the diagram below gives you a quick overview.


7 Key Load Balancer Use Cases

  1. Traffic Distribution: Load Balancers help evenly distribute traffic among multiple server instances.

  2. SSL Termination: Load Balancers can offload the responsibility of SSL termination from the backend servers, thereby reducing their workload.

  3. Session Persistence: Load Balancers ensure that all requests from a user hit the same instance to maintain session persistence.

  4. High Availability: Improves the system’s availability by rerouting traffic away from failed or unhealthy servers to healthy ones.

  5. Scalability: Load Balancers facilitate horizontal scaling when additional instances are added to the server pool to handle increased traffic.

  6. DDoS Mitigation: Load Balancers can help mitigate the impact of DDoS attacks by rate limiting requests or distributing them across a wider surface.

  7. Health Monitoring: Load Balancers also monitor the health and performance of server instances and remove failed or unhealthy servers from the pool.

Over to you: Which other load balancer use case will you add to the list?


Our New Book on Behavioral Interviews Is Now Available on Amazon!

The book is written by Steve Huynh and published by ByteByteGo. Steve is a former principal engineer at Amazon. His ability to break down complex interview dynamics into clear, actionable advice made this book possible. Still, it took us two years to get it ready.

Here's what's inside:

  • 130+ interview questions, from the most common to the ones that catch candidates off guard

  • 72 example stories showing what strong answers look like, from entry level to principal

  • Clear guidance on what interviewers look for, including key signals and red flags

  • High-Signal Storytelling, a framework to build a story bank for any behavioral interview

  • A practical prep plan and interview-day techniques for follow-ups and unexpected questions

Order your copy on Amazon

Our New Book on Behavioral Interviews Is Now Available on Amazon

2026-04-03 23:31:34

The book is written by Steve Huynh and published by ByteByteGo. Steve is a former principal engineer at Amazon. His ability to break down complex interview dynamics into clear, actionable advice made this book possible. Still, it took us two years to get it ready.

Check It Out Now

Here’s what’s inside:

- 130+ interview questions, from the most common to the ones that catch candidates off guard

- 72 example stories showing what strong answers look like, from entry level to principal

- Clear guidance on what interviewers look for, including key signals and red flags

- High-Signal Storytelling, a framework to build a story bank for any behavioral interview

- A practical prep plan and interview-day techniques for follow-ups and unexpected questions

Note: the book will also be available in India in a week or two.

Check It Out Now

Database Performance Strategies and Their Hidden Costs

2026-04-02 23:31:36

A feature is deployed, and the database queries run well. The team is happy with the results. However, six months later, the main table has grown from 50,000 rows to 5 million, and the same query now takes eight seconds.

Then, someone adds an index, and read latency drops to milliseconds, which seems like a clear win. But a week later, the nightly data import is running 40% slower than before. Fixing one problem created another.

This is the central challenge of database performance.

Every optimization helps one thing and can potentially hurt something else. Indexes speed up reads but slow down writes. Caching reduces database load but introduces stale data. Denormalization makes queries faster but complicates updates.

The real challenge isn’t knowing the strategies, but understanding what each strategy costs and deciding which tradeoffs a given application can afford. In this article, we’ll go through the major strategies for improving database performance along with their benefits and trade-offs.

Queries and Indexes

Read more