
How LinkedIn Feed Uses LLMs to Serve 1.3 Billion Users

2026-04-13 23:31:28

How to stop babysitting your agents (Sponsored)

Agents can generate code. Getting it right for your system, team conventions, and past decisions is the hard part. You end up babysitting the agent and watching the token costs climb.

More MCPs, rules, and bigger context windows give agents access to information, but not understanding. The teams pulling ahead have a context engine to give agents only what they need for the task at hand.

Join us for a FREE webinar on April 23 to see:

  • Where teams get stuck on the AI maturity curve and why common fixes fall short

  • How a context engine solves for quality, efficiency, and cost

  • Live demo: the same coding task with and without a context engine

If you want to maximize the value you get from AI agents, this one is worth your time.

Register now


LinkedIn used to run five separate systems just to decide which posts to show you. One tracked trending content. Another did collaborative filtering. A third handled embedding-based retrieval.

Each had its own infrastructure, its dedicated team, and its own optimization logic. The setup worked, but when the Feed team wanted to improve one part, they’d break another. Therefore, they made a radical bet and ripped out all five systems, replacing them with a single LLM-powered retrieval model. That solved the complexity problem, but it raised new questions, such as:

  • How do you teach an LLM to understand structured profile data?

  • How do you make a transformer serve predictions in under 50 milliseconds for 1.3 billion users?

  • How do you train the model when most of the data is noise?

In this article, we will look at how the LinkedIn engineering team rebuilt the Feed and the challenges they faced.

Disclaimer: This post is based on publicly shared details from the LinkedIn Engineering Team. Please comment if you notice any inaccuracies.

Five Librarians, One Library

For years, LinkedIn’s Feed retrieval relied on what engineers call a heterogeneous architecture. When you opened the Feed, content came from multiple specialized sources running in parallel.

  • A chronological index of network activity.

  • Trending posts by geography.

  • Collaborative filtering based on similar members.

  • Industry-specific pipelines.

  • Several embedding-based retrieval systems.

Each maintained its own infrastructure, index structure, and optimization strategy.

See the diagram below:

This architecture surfaced diverse, relevant content. But optimizing one retrieval source could degrade another, and no team could tune across all sources simultaneously. Holistic improvement was nearly impossible.

So the Feed team asked a simple question. What if they replaced all of these sources with a single system powered by LLM-generated embeddings?

Under the hood, this works through a dual encoder architecture. A shared LLM converts both members and posts into vectors in the same mathematical space. The training process pushes member and post representations close together when there’s genuine engagement, and pulls them apart when there isn’t. When you open your Feed, the system fetches your member embedding and runs a nearest-neighbor search against an index of post embeddings, retrieving the most relevant candidates in under 50 milliseconds.
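To make the retrieval step concrete, here is a minimal sketch in Python. The embeddings are toy hand-picked 3-dimensional vectors and the index is a plain dictionary (hypothetical values; the production system uses LLM-generated embeddings and an approximate nearest-neighbor index):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(member_vec, post_index, k=2):
    """Return the k posts whose embeddings lie closest to the member's."""
    scored = sorted(post_index.items(),
                    key=lambda kv: cosine(member_vec, kv[1]),
                    reverse=True)
    return [post_id for post_id, _ in scored[:k]]

# Toy index: post id -> embedding (hypothetical values)
index = {
    "post_ml":   [0.9, 0.1, 0.0],
    "post_grid": [0.8, 0.2, 0.1],
    "post_cats": [0.0, 0.1, 0.9],
}
member = [1.0, 0.0, 0.0]          # member embedding
print(retrieve(member, index))    # most similar posts first
```

Training pushes engaged member/post pairs toward high cosine similarity; at serving time only the cheap nearest-neighbor lookup remains.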

However, the real power comes from what the LLM brings to those embeddings. Traditional keyword-based systems rely on surface-level text overlap. If your profile says “electrical engineering” and a post is about “small modular reactors,” a keyword system misses the connection.

An LLM-based system understands that these topics are related because the model carries world knowledge from pretraining. It knows that electrical engineers often work on power grid optimization and nuclear infrastructure. This is especially powerful for cold-start scenarios, when a new member joins with just a profile headline. The LLM can infer likely interests without waiting for engagement history to accumulate.

The benefits compounded downstream. Instead of receiving candidates from disparate sources with different biases, the ranking layer now receives a coherent candidate set selected through the same semantic similarity. Ranking became easier, and each optimization to the ranking model became more effective.

But replacing five systems with one LLM created a new problem. LLMs expect text, and recommendation systems run on structured data and numbers.

The Model Is Only As Good As Its Input

To feed structured data into an LLM, LinkedIn built a “prompt library” that transforms structured features into templated text sequences. For posts, it includes author information, engagement counts, and post text. For members, it incorporates profile information, skills, work history, and a chronologically ordered sequence of posts they’ve previously engaged with. Think of it as prompt engineering for recommendation systems.

The most striking example is what happened with numerical features. Initially, LinkedIn passed raw engagement counts directly into prompts. For example, a post with 12,345 views would appear as “views:12345” in the text. The model treated those digits like any other text tokens. When the team measured the correlation between item popularity counts and embedding similarity scores, they found it was essentially zero (-0.004). Popularity is one of the strongest relevance signals in recommendation. And the model was completely ignoring it.

The problem is fundamental. LLMs don’t understand magnitude. They process “12345” as a sequence of digit tokens, not as a quantity.

The fix was quite simple. Instead of passing raw counts, LinkedIn converted them into percentile buckets wrapped in special tokens. This meant that “Views:12345” became <view_percentile>71</view_percentile>, indicating this post sits in the 71st percentile of view counts. Most values between 1 and 100 get processed by the LLM as a single unit rather than a multi-digit sequence, giving the model a stable, learnable vocabulary for quantity. The model can learn that anything above 90 means “very popular” without trying to parse arbitrary digit sequences.
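The bucketing idea can be sketched in a few lines, assuming a hypothetical sorted population of view counts to rank against (the exact percentile computation and token vocabulary LinkedIn uses are not public):

```python
import bisect

def to_percentile_token(value, sorted_population, feature="view"):
    """Map a raw count to a percentile bucket wrapped in special tokens,
    giving the LLM a small, learnable vocabulary instead of raw digits."""
    rank = bisect.bisect_right(sorted_population, value)
    pct = round(100 * rank / len(sorted_population))
    return f"<{feature}_percentile>{pct}</{feature}_percentile>"

# Hypothetical population of view counts observed on the platform
views = sorted([10, 50, 120, 800, 2_000, 5_000, 9_000, 12_345, 40_000, 100_000])

print(to_percentile_token(12_345, views))
# e.g. <view_percentile>80</view_percentile> for this toy population
```

The same transformation applies to engagement rates, recency signals, and affinity scores: any quantity the model would otherwise see as an opaque digit string.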

The correlation between popularity features and embedding similarity jumped 30x. Recall@10, which measures whether the top 10 retrieved posts are actually relevant, improved by 15%. LinkedIn applied the same strategy to engagement rates, recency signals, and affinity scores.

Less Data, Better Model

When building the member’s interaction history for training, LinkedIn initially included everything. Every post that was shown to a member went into the sequence, whether they engaged with it or scrolled past. The idea was that more data should mean a better model.

However, this didn’t turn out to be the case. Including scrolled-past posts not only made model performance worse, but it also made training significantly more expensive. GPU compute for transformer models scales quadratically with context length.

When the team filtered to include only positively-engaged posts, the results improved across every dimension.

  • Memory footprint per sequence dropped by 37%.

  • The system could process 40% more training sequences per batch.

  • Training iterations ran 2.6x faster.

The reason comes down to signal clarity. A scrolled-past post is ambiguous. Maybe the post was irrelevant. Maybe the member was busy. Maybe the headline was mildly interesting, but not enough to stop for. Posts you actively chose to engage with are a much cleaner learning target.

The gains compounded due to this change. Better signal quality meant faster training. Faster training meant more experimentation. More experimentation meant better hyperparameter tuning. When a single change improves both quality and efficiency, the benefits multiply through the entire development cycle.

The training strategy had one more clever element. LinkedIn used two types of negative examples:

  • Easy negatives were randomly sampled posts never shown to a member, providing a broad contrastive signal.

  • Hard negatives were posts actually shown but not engaged with, the almost-right cases where the model must learn the nuanced distinction between merely relevant and genuinely valuable.

The difficulty of the negative examples shapes what the model learns. Easy negatives teach broad distinction, whereas hard negatives teach the fine-grained ones. Using both together is a common and effective pattern across retrieval systems, and at LinkedIn, adding just two hard negatives per member improved recall by 3.6%.
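Assembling both kinds of negatives can be sketched as follows, with hypothetical member/post data structures (the real pipeline operates on logged impressions at far larger scale):

```python
import random

def build_negatives(member_id, impressions, engagements, corpus,
                    n_easy=4, n_hard=2, seed=0):
    """Easy negatives: random posts never shown to the member.
    Hard negatives: posts shown but not engaged with."""
    rng = random.Random(seed)
    shown = impressions[member_id]
    engaged = engagements[member_id]
    hard_pool = [p for p in shown if p not in engaged]
    easy_pool = [p for p in corpus if p not in shown]
    hard = rng.sample(hard_pool, min(n_hard, len(hard_pool)))
    easy = rng.sample(easy_pool, min(n_easy, len(easy_pool)))
    return easy, hard

corpus = [f"post_{i}" for i in range(10)]
impressions = {"alice": {"post_0", "post_1", "post_2"}}   # shown to alice
engagements = {"alice": {"post_0"}}                        # engaged with

easy, hard = build_negatives("alice", impressions, engagements, corpus)
print(hard)  # the shown-but-ignored posts: the "almost right" cases
```

The contrastive loss then pushes the member embedding away from both pools, with the hard pool forcing the finer-grained separation.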

With retrieval producing high-quality candidates, the next question was how to rank them. LinkedIn’s answer was to stop treating each post as an isolated decision.

The Feed Is a Story, Not a Snapshot

Traditional ranking models evaluate each member-post pair independently. This works, but it misses something fundamental about how professionals consume content.

LinkedIn built a Generative Recommender (GR) model that treats your Feed interaction history as a sequence. Instead of scoring each post in isolation, GR processes more than a thousand of a user’s historical interactions to understand temporal patterns and long-term interests.

The practical difference matters. If the user engages with machine learning content on Monday, distributed systems on Tuesday, and opens LinkedIn again on Wednesday, a sequential model understands these aren’t random events. They’re the continuation of a learning trajectory. A traditional pointwise model sees three independent decisions, whereas the sequential model sees the story.

The GR model uses a transformer with causal attention, meaning each position in the history can only attend to previous positions, mirroring how you actually experienced content over time. Recent posts might matter more for predicting immediate interests, but a post from two weeks ago might suddenly become relevant if recent activity suggests renewed interest.
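The causal constraint itself is just a lower-triangular mask over attention scores. A toy sketch:

```python
def causal_mask(seq_len):
    """Boolean mask: position i may attend to position j only when j <= i,
    so each interaction in the history sees only what came before it."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(["x" if ok else "." for ok in row])
# Each row exposes one more position: the model only ever sees the past.
```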

See the diagram below that shows the transformer architecture:

One of the most practical architectural decisions is what LinkedIn calls late fusion. Not every feature benefits from full self-attention. Count features and affinity signals carry a strong independent signal, and running them through the transformer would inflate computational cost quadratically without clear benefit. Instead, these features are concatenated with the transformer output after sequence processing. This results in rich sequential understanding from the transformer, plus contextual signals that drive relevance, without the cost of processing them through self-attention.
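The shape of late fusion is simple to sketch: the transformer's sequence summary and the independently strong features are concatenated just before the prediction head (the vectors here are hypothetical placeholders):

```python
def late_fusion(seq_representation, count_features):
    """Concatenate the transformer's sequence summary with features that
    carry strong independent signal, skipping self-attention for them."""
    return seq_representation + count_features

seq_repr = [0.12, -0.33, 0.81]   # hypothetical transformer output
counts = [0.71, 0.05]            # e.g. a view percentile and an affinity score
fused = late_fusion(seq_repr, counts)
print(len(fused))                # combined vector fed into the prediction head
```

The count features never pass through the quadratic-cost attention layers; they only widen the input to the comparatively cheap head.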

The serving challenge is equally important. Processing 1,000+ historical interactions through multiple transformer layers for every ranking request is expensive. LinkedIn’s solution is shared context batching. The system computes the user’s history representation once, then scores all candidates in parallel using custom attention masks.
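The idea behind shared context batching can be sketched as follows, with a cheap stand-in for the expensive transformer pass (a hypothetical toy representation, not LinkedIn's implementation):

```python
def encode_history(interactions):
    """Stand-in for the expensive transformer pass over 1,000+ interactions.
    In the real system this runs once per request, not once per candidate."""
    n = len(interactions)
    return [sum(interactions) / n, max(interactions), min(interactions)]

def score_candidates(history_repr, candidates):
    """Cheap dot-product scoring reuses the shared history representation."""
    return [sum(h * c for h, c in zip(history_repr, cand))
            for cand in candidates]

history_repr = encode_history([0.25, 0.5, 1.0, 0.25])   # computed once
cands = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]              # scored in a batch
print(score_candidates(history_repr, cands))
```

Amortizing the history encoding across all candidates is what keeps per-request latency flat as the candidate set grows.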

On top of the transformer, a Multi-gate Mixture-of-Experts (MMoE) prediction head routes different engagement predictions like clicks, likes, comments, and shares through specialized gates while sharing the same sequential representations underneath.

See the diagram below that shows a typical Mixture-of-Experts architecture.

This lets the model handle multiple prediction tasks without duplicating the expensive transformer computation. Together, shared context batching and the MMoE head are what make the sequential model viable at production scale.

Making It All Work at Scale

Even the best model is useless without the infrastructure to serve it. LinkedIn’s historical ranking models ran on CPUs. Transformers are fundamentally different, with self-attention scaling quadratically with sequence length and massive parameter counts requiring GPU memory. At LinkedIn’s scale, cost-per-inference determines whether sophisticated AI models can serve every member, or only high-engagement users.

The team invested heavily in custom infrastructure on both sides. For training, a custom C++ data loader eliminates Python’s multiprocessing overhead, custom GPU routines reduced metric computation from a bottleneck to negligible overhead, and parallelized evaluation across all checkpoints cut pipeline time substantially. For serving, a disaggregated architecture separates CPU-bound feature processing from GPU-heavy model inference, and a custom Flash Attention variant called GRMIS delivered an additional 2x speedup over PyTorch’s standard implementation.

See the diagram below that shows the GR Infrastructure Stack

Freshness required its own solution.

Three continuously running background pipelines keep the system current, capturing platform activity, generating updated embeddings through LLM inference servers, and ingesting them into a GPU-accelerated index.

Each pipeline optimizes independently, while the end-to-end system stays fresh within minutes. LinkedIn’s models are also regularly audited to ensure posts from different creators compete on an equal footing, with ranking relying on professional signals and engagement patterns, never demographic attributes.

Conclusion

There are some takeaways:

  • Replacing five retrieval systems with one trades resilience for simplicity.

  • LLM-based embeddings are richer but more expensive than lightweight alternatives.

  • The bottleneck is rarely the model architecture. It’s everything around it.

The infrastructure investment represents an effort most teams can’t replicate. And this approach leans on LinkedIn’s rich text data. For primarily visual platforms, the calculus would be different.

The next time you open LinkedIn and see a post from someone you don’t follow, on a topic you didn’t search for, but it’s exactly what you needed to read, that’s all of this working together under the hood.

References:

EP210: Monolithic vs Microservices vs Serverless

2026-04-11 23:30:48

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.

QA Wolf takes testing off your plate. They can get you:

  • Unlimited parallel test runs for mobile and web apps

  • 24-hour maintenance and on-demand test creation

  • Human-verified bug reports sent directly to your team

  • Zero flakes guarantee

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


This week’s system design refresher:

  • Monolithic vs Microservices vs Serverless

  • CLI vs MCP

  • Comparing 5 Major Coding Agents

  • Essential AWS Services Every Engineer Should Know

  • JWT Visualized


Monolithic vs Microservices vs Serverless

A monolith is usually one codebase, one database, and one deployment. For a small team, that’s often the simplest way to build and ship quickly. The problem arises when the codebase grows. A tiny fix in the cart code requires redeploying the whole app, and one bad release can take down everything with it.

Microservices try to solve that by breaking the system into separate services. Product, Cart, and Order run on their own, scale separately, and often manage their own data. That means you can ship changes to Cart without affecting the rest of the system.

But now you are dealing with multiple moving parts. You generally need service discovery, distributed tracing, and request routing between services.

Serverless is a different model. Instead of managing servers, you write functions that run when something triggers them, and the cloud provider handles the scaling. In many cases, you only pay when those functions actually run.

However, in serverless, cold starts can add latency, debugging across lots of stateless functions can get messy, and the more you build around one cloud’s runtime, the harder it gets to switch later.

Most production systems don't use just one approach. There's usually a monolith at the core, and over time teams spin up a few services where they need independent scaling or faster deploys. Serverless tends to show up later for things like notifications or background jobs.


CLI vs MCP

AI agents need to talk to external tools, but should they use CLI or MCP?

Both call the same APIs under the hood. The difference is how the agent invokes them.

Here's a side-by-side comparison across 6 dimensions:

  1. Token Cost: MCP loads the full JSON schema (tool names, descriptions, field types) into the context window before any work begins. CLI needs no schema, which saves context window space.

  2. Native Knowledge: LLMs were trained on billions of CLI examples. MCP schemas are custom JSON the model encounters for the first time at runtime.

  3. Composability: CLI tools chain with Unix pipes. Something like gh | jq | grep runs in a single LLM call. MCP has no native chaining. The agent must orchestrate each tool call separately.

  4. Multi-User Auth: CLI agents inherit a single shared token. You can't revoke one user without rotating everyone's key. MCP supports per-user OAuth.

  5. Stateful Sessions: CLI spawns a new process and TCP connection per command. MCP keeps a persistent server with connection pooling.

  6. Enterprise Governance: CLI's only audit trail is ~/.bash_history. MCP provides structured audit logs, access revocation, and monitoring built into the protocol.

Over to you: For which use cases do you prefer CLI over MCP, or vice versa?


Comparing 5 Major Coding Agents

The diagram below compares the 5 leading agents across interface, model, context window, autonomy, and more.

Here's what the landscape tells us:

  1. The terminal is the new IDE. Most coding agents now live in your terminal, not inside an editor. The command line is back.

  2. Context windows are getting massive. We've gone from 8K tokens to 1M in just two years. Agents can now reason over entire codebases in a single prompt.

  3. Autonomy is a spectrum. Some agents run fully async in the background. Others keep you in the loop on every edit. Teams are still figuring out how much to delegate.

  4. Open source is gaining ground. The open-source coding agent ecosystem is maturing fast, giving teams full control over their toolchain.

  5. Pricing varies wildly. From completely free (Gemini CLI, Deep Agents) to $15 per 1M output tokens. Check the cost row before you commit.

There is no single winner. The best agent depends on your workflow, budget, and how much autonomy you're comfortable with.

Over to you: Which coding agent is your daily driver in 2026?


Essential AWS Services Every Engineer Should Know

AWS has 200+ services, but most production systems only use a small subset. In many setups, a request ends up going through API Gateway, then an ALB, executes on Lambda or ECS, reads from DynamoDB, and gets cached in ElastiCache.

Each service on its own is straightforward. Deciding where it actually fits is where things get tricky.

EC2 and S3 are usually the starting point for a lot of people. But when things break, the focus shifts to services that didn’t get much attention early on, like CloudWatch for observability, IAM for access control, and KMS for encryption.

Networking tends to be where things get confusing. VPC, subnets, security groups, Route 53, and CloudFront run behind everything. When something is off, the errors don’t always help much.

Database choices are not easy to reverse later. RDS, DynamoDB, and Aurora solve different problems, and changing direction means redesigning a lot of what you've already built. It’s similar with the integration layer. SQS, SNS, and EventBridge each handle a different pattern (queuing vs fan-out vs event routing), and choosing the wrong one causes problems you notice when the system is under load.

SageMaker and Bedrock are newer services, but they're already part of the stack at many companies. SageMaker is for training and hosting models, and Bedrock is for calling foundation models directly.

CloudFormation lets you define infrastructure as code, and CodePipeline handles CI/CD. Once set up, deployments run without manual steps.


JWT Visualized

Imagine you have a special box called a JWT. Inside this box, there are three parts: a header, a payload, and a signature.

The header is like the label on the outside of the box. It tells us what type of box it is and how it's secured. It's usually written in a format called JSON, which is just a way to organize information using curly braces { } and colons : .

The payload is like the actual message or information you want to send. It could be your name, age, or any other data you want to share. It's also written in JSON format, so it's easy to understand and work with.

Now, the signature is what makes the JWT secure. It's like a special seal that only the sender knows how to create. The signature is created using a secret code, kind of like a password. This signature ensures that nobody can tamper with the contents of the JWT without the sender knowing about it.

When you want to send the JWT to a server, you put the header, payload, and signature inside the box. Then you send it over to the server. The server can easily read the header and payload to understand who you are and what you want to do.
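The mechanics above can be sketched with the common HS256 (HMAC-SHA256) signing scheme, using only the Python standard library; the helper names are ours:

```python
import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    """JWTs use URL-safe base64 with padding stripped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(payload: dict, secret: bytes) -> str:
    """Assemble header.payload.signature, signing with a shared secret."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = (b64url(json.dumps(header, separators=(",", ":")).encode())
                     + "."
                     + b64url(json.dumps(payload, separators=(",", ":")).encode()))
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return signing_input + "." + b64url(sig)

def verify_jwt(token: str, secret: bytes) -> bool:
    """Recompute the signature; any tampering with header or payload fails."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

token = make_jwt({"sub": "alice", "role": "admin"}, b"s3cret")
print(verify_jwt(token, b"s3cret"))       # True
print(verify_jwt(token, b"wrong-key"))    # False: wrong "seal" is rejected
```

In production you would use a vetted library (and check the `alg` and expiry claims) rather than rolling your own, but the three-part structure is exactly this.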

Over to you: When should we use JWT for authentication? What are some other authentication methods?

Must-Know Cross-Cutting Concerns in API Development

2026-04-09 23:30:31

What do authentication, logging, rate limiting, and input validation have in common?

The obvious answer is that they’re all important parts of an API. But the real answer is deeper: none of them belong to any single endpoint, and none show up in typical product requirements. For all practical purposes, they are invisible to users when they work and catastrophic when they’re missing. And the hardest part about all of them is making sure they’re applied uniformly across every single route an API exposes.

This family of problems has a name. They’re called cross-cutting concerns, and they’re the invisible layer that separates a collection of API endpoints from a production-ready system.

In this article, we will learn about these key concerns and their trade-offs in detail.

What Makes a Concern “Cross-Cutting”

Read more

How Spotify Ships to 675 Million Users Every Week Without Breaking Things

2026-04-08 23:30:20

Unlock access to the data your product needs (Sponsored)

Most tools are still locked to their own database, blind to everything users already have in Slack, GitHub, Salesforce, Google Drive, and dozens of other apps. That's the ceiling on what you can build.

WorkOS Pipes removes it. One API call connects your product to the apps your users live in. Pull context from their tools, cross-reference data across silos, power AI agents that act across services. All with fresh, managed credentials you never have to think about.

Turn data to insight →


Every Friday morning, a team at Spotify takes hundreds of code changes written by dozens of engineering teams and begins packaging them into a single app update. That update will eventually reach more than 675 million users on Android, iOS, and Desktop. They do this every single week. And somehow, more than 95% of those releases ship to every user without a hitch.

The natural assumption is that they’re either incredibly careful and therefore slow, or incredibly fast and therefore reckless. The truth is neither.

How do you ship to 675 million users every week, with hundreds of changes from dozens of teams running on thousands of device configurations, without breaking things?

The answer is not to test really hard. Spotify built a release architecture where speed and safety reinforce each other. In this article, we will take a look at this process in detail and distill the lessons.

Disclaimer: This post is based on publicly shared details from the Spotify Engineering Team. Please comment if you notice any inaccuracies.

The Two-Week Journey of a Spotify Release

To see how this works, let us follow a single release from code merge to production.

Spotify practices trunk-based development, which means that all developers merge their code into a single main branch as soon as it’s tested and reviewed. There are no long-lived feature branches where code sits in isolation for weeks. Everyone pushes to the same branch continuously, which keeps integration problems small but requires discipline and solid automated testing.

See the diagram below that shows the concept of trunk-based development:

Each release cycle starts on a Friday morning. The version number gets bumped on the main branch. From that point, nightly builds start going out to Spotify employees and a group of external alpha testers. During this first week, teams develop and merge new code freely. Bug reports flow in from internal and alpha users. Crash rates and other quality metrics are tracked for each build, both automatically and by human review. When a crash or issue crosses a predefined severity threshold, a bug ticket gets created automatically. When something looks suspicious but falls below that threshold, the Release Manager can create one manually.

On the Friday of the second week, the release branch gets cut, meaning a separate copy of the codebase is created specifically for this release. This is the key moment in the release cycle. From this point, only critical bug fixes are allowed on the release branch. Meanwhile, the main branch keeps moving. New features and non-critical fixes continue to merge there, destined for next week’s release. This separation is the mechanism that lets Spotify develop at full speed while simultaneously stabilizing what’s about to ship.

Teams then perform regression testing, checking that existing features still work correctly after the new changes, and report their results. Teams with high confidence in their automated tests and pre-merge routines can opt out of manual testing entirely. Beta testers receive builds from the more stable release branch, providing additional real-world runtime over the weekend.

By Monday, the goal is to submit the app to the stores. By Tuesday, if the app store review passes and quality metrics look good, Spotify rolls it out to 1% of users. By Wednesday, if nothing alarming surfaces, they roll out to 100%.

The flow below shows all the steps in a typical release process:

As an example, for version 8.9.2, which carried the Audiobooks feature launch in new markets, this timeline played out almost exactly as planned. What made that possible was everything happening behind the timeline.

Rings of Exposure: Catching Bugs Where They’re Cheapest to Fix

The code doesn’t go from a developer’s laptop to 675 million users in one jump. It passes through concentric rings of users, and each ring exists to catch a specific category of failure.

  • The first ring is Spotify’s own employees. They run nightly builds from the main branch, using the app the way real users do. This catches obvious functional bugs early. Even a crash that only affects a small number of employees gets investigated, because a bug that appears minor internally could signal a much larger problem once it hits millions of devices.

  • The second ring is external alpha testers. These users introduce more device diversity and real-world usage patterns that the internal team may not have anticipated. They’re running builds that are still being actively developed, so rough edges are expected, but the data they generate is invaluable.

  • The third ring is beta testers, who receive builds from the release branch rather than the main branch. These builds are expected to be more stable. Beta users provide additional runtime over weekends and evenings, and their feedback either builds confidence that the release is solid or surfaces issues that slipped through the first two rings.

  • The fourth ring is the 1% production rollout. Real users, real devices, real conditions. Spotify’s user base is large enough that even 1% provides statistically meaningful data. If a severe issue appears during this phase, the rollout is paused immediately, and the responsible team starts working on a fix.

  • The fifth and final ring is the 100% rollout. Only after the 1% rollout looks clean does the release go out to everyone.

For reference, the Audiobooks launch in version 8.9.2 shows how this system works at an even more granular level.

The Audiobooks feature didn’t just pass through these five rings of app release. It had its own layered rollout on top of that. The feature code had been sitting in the app for multiple releases already, hidden behind a backend feature flag. It was turned on for most employees first. The team watched for any crash, no matter how small, that might indicate trouble. Only after the app release itself reached a sufficient user base did the Audiobooks team begin gradually enabling the feature for real users in specific markets, using the same backend flag to control the percentage.

See the diagram below that shows the concept of a feature flag:

This separation between deploying code and activating a feature is a powerful pattern in the Spotify release process. It allows code to sit in the app, baking in production conditions invisibly, and get turned on later. If something goes wrong after activation, the feature can be turned off without shipping a new release. At Spotify’s scale, feature flags are a core safety mechanism, though managing hundreds of them across a large organization, each with per-market and per-user-percentage controls, is its own engineering challenge.
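A percentage-based feature flag of this kind is often implemented by hashing the user id into a stable bucket, so raising the rollout percentage only ever adds users and never flips anyone off. A sketch under that assumption (not Spotify's actual implementation):

```python
import hashlib

def flag_enabled(user_id: str, flag: str, rollout_pct: int,
                 market: str = "", enabled_markets: frozenset = frozenset()) -> bool:
    """Deterministic bucketing with an optional per-market gate:
    the same user always lands in the same bucket for a given flag."""
    if enabled_markets and market not in enabled_markets:
        return False
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in 0..99
    return bucket < rollout_pct

print(flag_enabled("user-42", "audiobooks", 100, "US", frozenset({"US"})))  # True
print(flag_enabled("user-42", "audiobooks", 0, "US", frozenset({"US"})))    # False
print(flag_enabled("user-42", "audiobooks", 50, "DE", frozenset({"US"})))   # False
```

Because the check is a backend call, turning a feature off is a config change, not an app release.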

The Release Manager also made a deliberate coordination decision for 8.9.2. Since the Audiobooks feature was a high-stakes launch with marketing events already scheduled, another major feature that had been planned for the same release was rescheduled to the following week. Fewer variables in a single release means easier diagnosis if something goes wrong. That kind of judgment call is one of the things that separates release management from pure automation.

From Jira to a Release Command Center

The multi-ring system generates a lot of data: crash rates, bug tickets, sign-off statuses, build verification results, and app store review progress. Someone has to make sense of all of it, and this wasn’t an easy task.

Before the Release Manager Dashboard existed, everything lived in Jira. The Release Manager had to jump between tickets, check statuses across multiple views, and verify conditions manually, all while answering questions from teams on Slack. It was easy to miss a small detail, and a missed detail could mean extra work or a bug slipping through.

So the Release team built a dedicated command center with clear goals:

  • Optimize for the Release Manager’s workflow

  • Minimize context switching

  • Reduce cognitive load

  • Enable fast and accurate decisions

The result was the Release Manager Dashboard, built as a plugin on Backstage, Spotify’s internal developer portal.

It pulls and aggregates data from around 10 different backend systems into a single view. For each platform (Android, iOS, Desktop), the dashboard shows blocking bugs, the latest build status, automated test results, crash rates normalized against actual usage (so a crash rate is meaningful whether 1,000 or 1,000,000 people are using the build), team sign-off progress, and rollout state. Everything is color-coded:

  • Green means ready to advance

  • Yellow means something needs attention

  • Red means there’s a problem requiring action

Here’s an example of how the dashboard appears:

The dashboard also surfaces release criteria as a visible checklist:

  • All commits on the release branch are included in the latest build and passing tests

  • No open blocking bug tickets

  • All teams signed off

  • Crash rates below defined thresholds

  • Sufficient real-world usage of the build

When everything goes green, the release is ready to advance.

The dashboard got off to a rocky start, however. The first version was slow and expensive. Every page reload triggered queries to all 10 of the source systems it depended on, causing long load times and high costs. The Spotify engineering team noted that each reload cost about as much as a decent lunch in Stockholm. After switching to caching and pre-aggregating data every five minutes, load time dropped to eight seconds, and the cost became negligible.

The Robot: Automating the Predictable, Keeping Humans for the Ambiguous

The dashboard gave the Release Manager the information to make fast decisions.

However, by analyzing the time-series data the dashboard generated, the team noticed that a lot of the time in the release cycle wasn’t spent on hard decisions, but waiting.

The biggest time sinks were testing and fixing bugs (unavoidable), waiting for app store approval (outside Spotify’s control), and delays from manually advancing a release when a step was completed outside working hours. That last one alone could cost up to 12 hours. If the app store approved a build at 11 PM, the release just sat there until someone woke up and clicked “next.”

Therefore, the team built what they called “the Robot.”

It’s a backend service that models the release process as a state machine: a set of defined stages with specific conditions that must be met before moving to the next one. The Robot tracks seven states. The five states on the normal path forward are release branched, final release candidate (the build that will actually ship), submitted for app store review, rolled out to 1%, and rolled out to 100%. Two additional states handle problems: the rollout gets paused, or the release gets cancelled entirely.

See the diagram below:

The Robot continuously checks whether the conditions for advancing to the next state are met. If manual testing is signed off, no blocking bugs are open, and automated tests pass on the latest commit on the release branch, the Robot automatically submits the build for app store review without human intervention. If the app store approves the build at 3 AM, the Robot initiates the 1% rollout immediately instead of waiting for someone to show up at the office.
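The Robot’s polling loop can be sketched in a few lines of Python. This is an illustrative sketch, not Spotify’s code: the state names follow the article, while the condition signals and their shape are hypothetical.

```python
# Hypothetical sketch of a Robot-style release state machine.
# State names follow the article; the `signals` dict is illustrative.

RELEASE_STATES = [
    "release_branched",
    "final_release_candidate",
    "submitted_for_review",
    "rolled_out_1_percent",
    "rolled_out_100_percent",
]

def can_advance(state: str, signals: dict) -> bool:
    """Check whether the conditions for leaving `state` are currently met."""
    conditions = {
        "release_branched": signals["tests_green"] and signals["manual_signoff"],
        "final_release_candidate": not signals["blocking_bugs"],
        "submitted_for_review": signals["store_approved"],
        "rolled_out_1_percent": signals["crash_rate"] < signals["crash_threshold"],
    }
    return conditions.get(state, False)

def tick(state: str, signals: dict) -> str:
    """One polling cycle: advance to the next state if conditions are met."""
    if state in ("paused", "cancelled") or state == RELEASE_STATES[-1]:
        return state  # problem states and the terminal state never advance
    if can_advance(state, signals):
        return RELEASE_STATES[RELEASE_STATES.index(state) + 1]
    return state
```

In the real system, the signals would come from the same backend systems the dashboard aggregates, polled continuously, which is what lets a 3 AM app store approval trigger the 1% rollout with no human involved.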

The result was an average reduction of about eight hours per release cycle.

However, the Robot doesn’t make the hard calls. It doesn’t decide whether a crash affecting users in a specific region is severe enough to block a release. It doesn’t decide whether a bug in a new feature like Audiobooks, with marketing events already scheduled, should delay the entire release or just the feature rollout. It doesn’t negotiate with feature teams about timing. Those decisions require judgment, context, and sometimes difficult conversations. The Release Manager handles all of them.

This split is deliberate. Predictable transitions that depend on rule-checks get automated. Ambiguous decisions that require coordination and judgment stay with humans.

Conclusion

Spotify ships weekly to 675 million users through a strong release architecture. Layered exposure catches bugs where they’re cheapest to fix, and centralized tooling turns scattered data into fast decisions. Automation handles the predictable so humans can focus on the ambiguous.

The key lesson here is that speed and safety aren’t opposites. At Spotify, each one enables the other. A weekly cadence means each release carries fewer changes. Fewer changes mean less risk per release. Less risk means shipping with confidence.

Since a cancelled release only costs one week, not a month or a quarter, teams are more willing to kill a bad release rather than push it through and hope for the best.


Nextdoor’s Database Evolution: A Scaling Ladder

2026-04-07 23:32:00

New Year, New Metrics: Evaluating AI Search in the Agentic Era (Sponsored)

Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you access to an exact framework to evaluate AI search and retrieval.

What you’ll get:

  • A four-phase framework for evaluating AI search

  • How to build a golden set of queries that predicts real-world performance

  • Metrics and code for measuring accuracy

Go from “looks good” to proven quality.

Learn how to run an eval


Nextdoor operates as a hyper-local social networking service that connects neighbors based on their geographic location.

The platform allows people to share local news, recommend local businesses, and organize neighborhood events. Since the platform relies on high-trust interactions within specific communities, the data must be both highly available and extremely accurate.

However, as the service scaled to millions of users across thousands of global neighborhoods, the underlying database architecture had to evolve from a simple setup into a sophisticated distributed system.

This engineering journey at Nextdoor highlights a fundamental rule of system design.

Every performance gain introduces a new requirement for data integrity. The team followed a predictable progression, moving from a single database instance to a complex hierarchy of connection poolers, read replicas, versioned caches, and background reconcilers. In this article, we will look at how the Nextdoor engineering team handled this evolution and the challenges they faced.

Disclaimer: This post is based on publicly shared details from the Nextdoor Engineering Team. Please comment if you notice any inaccuracies.

The Limits of the “Big Box”

In the early days, Nextdoor relied on a single PostgreSQL instance to handle every post, comment, and neighborhood update.

For many growing platforms, this is the most logical starting point. It is simple to manage, and PostgreSQL provides a robust engine capable of handling significant workloads. However, as more neighbors joined and the volume of simultaneous interactions grew, the team hit a wall that was not related to the total amount of data stored but to the connection limit.

PostgreSQL uses a process-per-connection model. In other words, every time an application worker wants to talk to the database, the server creates a completely new process to handle that request. If an application has five thousand web workers trying to access the database at the same time, the server must manage five thousand separate processes. Each process consumes a dedicated slice of memory and CPU cycles just to exist.

Managing thousands of processes creates a massive overhead for the operating system. The server eventually spends more time switching between these processes than it does running the actual queries that power the neighborhood feed. This is often the point where vertical scaling, or buying a larger server with more cores, starts to show diminishing returns. The overhead of the “process-per-connection” model remains a bottleneck regardless of how much hardware is thrown at the problem.

To solve this, Nextdoor introduced a layer of middleware called PgBouncer. This is a connection pooler that sits between the application and the database. Instead of every application worker maintaining its own dedicated line to the database, they all talk to PgBouncer.

  • The Request Phase: A web worker requests a connection from PgBouncer to execute a quick query.

  • The Assignment Phase: PgBouncer assigns an idle connection from its pre-established pool rather than forcing the database to create a new process.

  • The Execution Phase: The query runs against the database using that shared connection.

  • The Release Phase: The worker finishes its task, and the connection returns to the pool immediately for the next worker to use.

This allows thousands of application workers to share a few hundred “warm” database connections. This effectively removed the connection bottleneck and allowed the primary database to focus entirely on data processing.
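The pooling pattern itself fits in a few lines of Python. This is a toy illustration of the idea, not PgBouncer: `connect` stands in for whatever creates a real database connection.

```python
import queue

class ConnectionPool:
    """Minimal sketch of PgBouncer-style pooling: many workers share a
    small set of pre-established connections, instead of each worker
    forcing the database to spawn a new process."""

    def __init__(self, size, connect):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())  # warm connections created up front

    def acquire(self):
        # Blocks until an idle connection frees up, rather than opening
        # a brand-new connection per request.
        return self._idle.get()

    def release(self, conn):
        # The connection goes straight back to the pool for reuse.
        self._idle.put(conn)
```

The key property is that the number of real database connections is fixed at pool creation time, no matter how many workers call `acquire`.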

Dividing the Labor and the “Lag” Problem

Once connection management was stable, the next bottleneck appeared in the form of read traffic.

In a social network like Nextdoor, the ratio of people reading the feed compared to people writing a post is heavily skewed. For every one person who saves a new neighborhood update, hundreds of others might view it. A single database server must handle both the “Writes” and the “Reads” at the same time. This creates resource contention where heavy read queries can slow down the ability of the system to save new data.

The solution was to move to a Primary-Replica architecture. In this setup, one database server is designated as the Primary. It is the only server allowed to modify or change data. Several other servers, known as Read Replicas, maintain copies of the data from the Primary. All the “Read” traffic from the application is routed to these replicas, while only the “Write” traffic goes to the Primary.

See the diagram below:

This separation of labor allows for massive horizontal scaling of reads. However, this introduces the challenge of Asynchronous Replication. The Primary database sends its changes to the replicas using a stream of logs. It takes time for a new post saved on the Primary to travel across the network and appear on the replicas. This delay is known as replication lag.

See the diagram below that shows the difference between synchronous and asynchronous replication:

To solve the issue of a neighbor making a post and then seeing it disappear upon a refresh, Nextdoor uses Time-Based Dynamic Routing. This is a smart routing logic that ensures users always see the results of their own actions. Here’s how it works:

  • The Write Marker: When a user performs a write action, like posting a comment, the application notes the exact timestamp of that event.

  • The Protected Window: For a specific period after that write, often a few seconds, the system treats that specific user as sensitive.

  • Dynamic Routing: During this window, all read requests from that user are dynamically routed to the Primary database instead of a replica.

  • The Handover: Once the time window expires and the system is confident the replicas have caught up with the Primary, the user’s traffic is routed back to the replicas to save resources.

This ensures that while the general neighborhood sees eventually consistent data, the person who made the change always sees strongly consistent data.
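The routing logic above can be sketched as follows. The names and the five-second window are assumptions for illustration; Nextdoor’s actual window and implementation are not public.

```python
import time

REPLICATION_SAFETY_WINDOW = 5.0  # seconds; illustrative value

_last_write_at = {}  # user_id -> timestamp of that user's last write

def record_write(user_id, now=None):
    """The Write Marker: note when this user last changed data."""
    _last_write_at[user_id] = time.time() if now is None else now

def route_read(user_id, now=None):
    """Route reads to the primary during the protected window after a
    user's write, so they always see their own changes despite lag."""
    now = time.time() if now is None else now
    last = _last_write_at.get(user_id)
    if last is not None and now - last < REPLICATION_SAFETY_WINDOW:
        return "primary"   # user recently wrote: read-your-writes
    return "replica"       # everyone else gets eventually consistent reads
```

Everyone else’s traffic stays on the replicas, so the primary only absorbs the small fraction of reads from recently active writers.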


Why writing code isn’t the hard part anymore (Sponsored)

Coding is no longer the bottleneck, it’s prod.

With the rise in AI coding tools, teams are shipping code faster than they can operate it. And production work still means jumping between fragmented tools, piecing together context from systems that don’t talk to each other, and relying on the few engineers who know how everything connects.

Leading teams like Salesforce, Coinbase, and Zscaler cut investigation time by over 80% with Resolve AI, using multi-agent investigation that works across code, infrastructure, and telemetry.

Learn how AI-native engineering teams are implementing AI in their production systems

Get the free AI for Prod ebook ➝


The High-Speed Library

Even with multiple replicas, hitting a database for every single page load is an expensive operation.

Databases must read data from a disk or a large memory pool and often perform complex joins between different tables to assemble a single record. To provide the millisecond response times neighbors expect, Nextdoor implemented a caching layer using Valkey. This is an open-source high-performance data store that holds information in RAM for near-instant access.

The team uses a Look-aside Cache pattern. When the application needs data, it follows a specific sequence:

  • The Cache Check: The application looks for the data in Valkey using a unique key.

  • The Cache Hit: If the data is found, it is returned instantly to the user without touching the database.

  • The Cache Miss: If the data is missing, the application queries the PostgreSQL database to find the truth.

  • The Population Step: The application takes the database result, saves a copy in Valkey for future requests, and then returns it to the user.
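The four steps above can be sketched in Python. Plain dictionaries stand in for Valkey and PostgreSQL; the real clients differ, but the control flow is the same.

```python
cache = {}                                          # stands in for Valkey
database = {"post:1": "Garage sale on Saturday"}    # stands in for PostgreSQL

def get_post(key):
    # 1. The Cache Check
    if key in cache:
        return cache[key]    # 2. The Cache Hit: no database round trip
    # 3. The Cache Miss: fall back to the source of truth
    value = database[key]
    # 4. The Population Step: save a copy for future requests
    cache[key] = value
    return value
```

Note that once a value is populated, later reads serve the cached copy even if the database row changes underneath, which is exactly the staleness problem the versioning and CDC machinery described below exists to solve.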

Efficiency is vital when managing a cache at this scale. RAM is much more expensive than disk storage, so the data must be as small as possible.

Nextdoor uses a binary serialization format called MessagePack. Instead of storing data as a bulky text format like JSON, they convert it into a highly compressed binary format that is much faster for the computer to parse.

MessagePack is particularly useful for Nextdoor because it supports schema evolution. If the engineering team adds a new field to a neighbor’s profile, the older cached data can still be read without crashing the application. For even larger pieces of data, they use Zstd compression. By combining these two tools, Nextdoor reduces the memory footprint of its cache servers.

Versioning and Atomic Updates

Caching creates a serious problem when the cache starts lying. If the database is updated but the cache is not refreshed, users see old, incorrect information. Most simple caching strategies rely on a “Time to Live,” or TTL: a timer that tells the cache to delete an entry after a few minutes. For a real-time social network, waiting several minutes for a post to update is not an acceptable solution.

Nextdoor built a sophisticated versioning engine to ensure the cache stays up to date. They added a special column called system_version to their database tables and used PostgreSQL Triggers to manage this number. For reference, a trigger is a small script that runs automatically inside the database whenever a row is touched. Every time a post is updated, the trigger increments the version number. This ensures that the database remains the ultimate source of truth regarding which version of a post is the newest.

When the application tries to update the cache, it does not just overwrite the old data. It uses a Lua script executed inside Valkey. This script performs an atomic compare-and-set operation that works as follows:

  • The Metadata Fetch: The script retrieves the version number currently stored in the cache entry.

  • The Version Comparison: It compares the version to the version number of the new update being sent by the application.

  • The Conditional Write: If the new version is strictly greater than the cached version, the update is saved.

  • The Rejection: If the cached version is already equal to or higher than the new update, the script rejects the change entirely.

This prevents “race conditions.” Imagine two different servers trying to update the same post at the same time. Without this logic, an older update could arrive a millisecond later and overwrite a newer update. This would leave the cache permanently out of sync with the database. By using Lua, the entire process of checking the version and updating the data happens as a single, unbreakable step that cannot be interrupted.

CDC and Reconciliation

Even with versioning and Lua scripts, errors can occur.

A network partition might prevent a cache update from reaching Valkey, or an application process might crash before it can finish the population step. Nextdoor needed a final safety net to catch these discrepancies. They implemented Change Data Capture, also known as CDC, using a tool called Debezium.

See the diagram below:

CDC works by “listening” to the internal logs of the PostgreSQL database. Specifically, it watches the Write-Ahead Log, where every single change is recorded before it is committed. Every time a change happens in the database, Debezium captures that event and turns it into a message in a data stream. A background service known as the Reconciler watches this stream.

The reconciliation flow provides a “self-healing” mechanism for the entire setup:

  • The Database Update: A user updates their neighborhood bio in the Primary PostgreSQL database.

  • The Log Capture: Debezium detects the new log entry and publishes a change event message.

  • The Reconciler Action: The background service receives this message and identifies which cache key needs to be corrected.

  • The Invalidation: The service tells the cache to delete the old entry. The next time a neighbor requests that bio, the application will experience a “Cache Miss” and fetch the perfectly fresh data from the database.

This process provides eventual consistency. While the primary cache update might fail for a fraction of a second, the CDC Reconciler will eventually detect the change and fix the cache. It acts like a detective that constantly audits the system to ensure the fast truth in the cache matches the real truth in the database.
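The Reconciler’s core job, turning change events into cache invalidations, can be sketched in a few lines. The event shape below is a simplified stand-in for what Debezium actually emits.

```python
def reconcile(change_events, cache):
    """Consume Debezium-style change events and invalidate the affected
    cache keys, so the next read misses and refetches fresh data.
    The event shape and key format here are illustrative."""
    for event in change_events:
        cache_key = f"{event['table']}:{event['row_id']}"
        cache.pop(cache_key, None)  # delete if present; ignore if absent
```

Because invalidation is idempotent, the Reconciler can safely reprocess events after a crash or a replayed stream without corrupting anything.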

Sharding

There comes a point where even the most optimized single Primary database cannot handle the volume of incoming writes. When a platform processes billions of rows, the hardware itself reaches physical limits. This is when Nextdoor moves to the final rung of the ladder: sharding.

Sharding is the process of breaking a single, massive table into smaller pieces and spreading them across entirely different database clusters. Nextdoor typically shards data by a unique identifier such as a Neighborhood ID.

  • The Cluster Split: All data for Neighborhoods 1 through 500 might live on Cluster A, while Neighborhoods 501 through 1,000 live on Cluster B.

  • The Shard Key: The application uses the neighborhood_id to know exactly which database cluster to talk to for any given request.

Sharding allows for much greater scaling because we can keep adding more clusters based on growth. However, it comes at a high cost in complexity. Once we shard a database, we can no longer easily perform a “Join” between data on two different shards.
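The shard-key lookup can be sketched like this. The ranges mirror the article’s example; real systems often use hashing or a lookup service rather than hard-coded ranges.

```python
# Illustrative range-based shard map, mirroring the article's example.
SHARD_RANGES = [
    (range(1, 501), "cluster_a"),      # Neighborhoods 1-500
    (range(501, 1001), "cluster_b"),   # Neighborhoods 501-1000
]

def cluster_for(neighborhood_id):
    """The Shard Key: map a neighborhood_id to the cluster that owns it."""
    for ids, cluster in SHARD_RANGES:
        if neighborhood_id in ids:
            return cluster
    raise KeyError(f"no shard for neighborhood {neighborhood_id}")
```

The routing itself is trivial; the hard part is everything this function implies, such as the cross-shard joins and transactions the application can no longer do in a single query.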

Conclusion

The journey of Nextdoor’s database shows that great engineering is rarely about choosing the most complex tool first. It is about a disciplined progression.

They started with a single server and added connection pooling when the lines got too long. They added replicas when the read traffic became too heavy. Finally, they built a world-class versioned caching system to provide the speed neighbors expect without sacrificing the accuracy they require.

The takeaway is that complexity must be earned. Each layer of the scaling ladder solves one problem while introducing a new challenge in data consistency. By building robust safety nets such as versioning and reconciliation, the Nextdoor engineering team ensured that its system could grow without losing the trust of the communities it serves.


A Guide to Context Engineering for LLMs

2026-04-06 23:30:52

The workshop for teams drowning in observability tools (Sponsored)

Five vendors, rising costs, and you still can’t tell why something broke.

Sentry’s Lazar Nikolov sits down with Recurly’s Chris Barton to talk through what observability consolidation actually looks like in practice: how to evaluate your options, where AI fits in, and how to think about cost when you’re ready to simplify.

Register your spot today


Giving an LLM more information can make it dumber. A 2025 research study by Chroma tested 18 of the most powerful language models available, including GPT-4.1, Claude, and Gemini, and found that every single one performed worse as the amount of input grew.

The degradation wasn’t minor, either. Some models held steady at 95% accuracy and then nosedived to 60% once the input crossed a certain length.

This finding busts one of the most common myths about working with LLMs that more context is always better. The reality is that LLMs have architectural blind spots that make what you put in front of them, and how you structure it, far more important than how much you include.

The discipline of getting this right is called context engineering.

In this article, we’ll look at how LLMs actually process the information you give them, what context engineering is, and the strategies that can help with it.

Key Terminologies

Before we go further, there are three terms that come up constantly when talking about LLMs. Getting clear on these first will make everything that follows much easier to reason about.

  • Tokens: They are the units LLMs think in. They aren’t full words, but rather chunks of text that average roughly three-quarters of a word each. The word “context” is one token, while the word “engineering” gets split into two. Every piece of text the model processes, from your question to its instructions to any documents you’ve included, is measured in tokens.

  • Context Window: It is the total number of tokens the model can see at once during a single interaction. Everything has to fit inside this window: the system instructions that define the model’s behavior, the conversation history, any external documents or data you’ve injected, and your actual question. Modern models advertise context windows ranging from 128,000 to over 2 million tokens. That sounds enormous, but as we’ll see, bigger isn’t straightforwardly better.

  • Attention: This is the mechanism the model uses to figure out which tokens matter to which other tokens. Before generating each new token of its response, the model compares it against every other token currently in the context window. This gives LLMs their ability to connect ideas across long stretches of text, but it’s also the source of their most important limitations.

How LLMs Process Context

When we send text to an LLM, it doesn’t read from top to bottom the way a human would. The attention mechanism compares every token against every other token to compute relationships, which means the model can, in principle, connect an idea from the first sentence of the input to one in the last sentence. However, this power comes with two critical costs.

  • The first is computational. Doubling the number of tokens in the context window roughly quadruples the computation required, so longer contexts are disproportionately slower and more expensive.

  • The second cost is more consequential. Attention isn’t distributed evenly across the context window. Research has consistently shown that LLMs pay the most attention to tokens at the beginning and end of the input, with a significant drop-off in the middle. This is known as the “lost in the middle” problem, and research has found that accuracy can drop by over 30% when relevant information is placed in the middle of the input compared to the beginning or end.

See the diagram below that shows the attention curve:

This isn’t a bug in any particular model, but rather a structural property of how transformers (the neural network architecture that powers virtually all modern LLMs) encode the position of tokens.

The positional encoding method used in most modern LLMs (called Rotary Position Embedding, or RoPE) introduces a decay effect that makes tokens far from both the start and end of the sequence land in a low-attention zone. Newer models have reduced the severity, but no production model has fully eliminated it.

The practical implication is that the position of information in the input matters as much as the information itself. If we paste a long document into an LLM, the model is most likely to miss information buried in the middle pages.

Why More Context Can Hurt

The uneven attention distribution is one problem, but there’s a broader pattern that compounds it, known as context rot.

Context rot is the degradation of LLM performance as input length increases, even on simple tasks. The Chroma research team’s 2025 study tested 18 frontier models and found that this degradation isn’t gradual: models can maintain near-perfect accuracy up to a certain context length, and then performance drops off a cliff, varying by model and by task in ways that make it impossible to reliably predict when you’ll hit a breaking point.

Why does this happen?

Every token you add to the context window draws from a finite attention budget. Irrelevant information buries important information in low-attention zones, and content that sounds related but isn’t actually useful confuses the model’s ability to identify what’s relevant. The model doesn’t get smarter with more input; it gets distracted.

On top of this, LLMs are stateless. They have zero memory between calls, and each interaction starts completely fresh. When there is a multi-turn conversation with an LLM like ChatGPT, and it seems to “remember” what we said earlier, that’s because the system is re-injecting the conversation history into the context window each time. The model itself remembers nothing, which means someone, or some system, has to decide for every single call what information to include, what to leave out, and how to structure it.
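That re-injection is simple to sketch. The role/content message shape below follows the common chat-API convention; it is illustrative, not any specific vendor’s API.

```python
def build_messages(history, user_input):
    """LLMs remember nothing between calls, so the application re-sends
    the full conversation history on every request. Message shape
    follows the common role/content convention."""
    return (
        [{"role": "system", "content": "You are a helpful assistant."}]
        + history                                        # prior turns, re-injected
        + [{"role": "user", "content": user_input}]      # the new turn
    )
```

Every turn grows `history`, which is precisely why the write, select, compress, and isolate strategies discussed later exist: someone has to decide what stays in that list.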

There’s also a meaningful gap between marketing and reality. Models advertise million-token context windows, and they pass simple benchmarks at those lengths. However, the effective context length, where the model actually uses information reliably, is often much smaller. Passing a “needle in a haystack” test (finding one planted sentence in a long document) is very different from reliably synthesizing information scattered across hundreds of pages

Defining Context Engineering

Context engineering is the practice of designing, assembling, and managing the entire information environment an LLM sees before it generates a response. It goes beyond writing a single good instruction to orchestrating everything that fills the context window, so the model has exactly what it needs for the task at hand and nothing more.

To understand what this involves, it helps to see what actually competes for space inside a context window. There are six types of context in a typical LLM call:

  • System instructions (the behavioral rules, persona, and guidelines the model follows)

  • User input (your actual question or command)

  • Conversation history (the short-term memory of the current session)

  • Retrieved knowledge (documents, database results, or API responses pulled in from external sources)

  • Tool descriptions (definitions of tools the model can call and how to use them)

  • Tool outputs (results returned from previous tool calls)

The user’s actual question is often a tiny fraction of the total token count.

The rest is infrastructure, and that infrastructure is what context engineering designs.
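A minimal sketch of that assembly step, with assumed component names, which also respects the attention curve by placing instructions first and the question last:

```python
def assemble_context(parts):
    """Concatenate the six context types into one prompt. Order is
    deliberate: instructions go first and the question last, where
    attention is strongest; bulky material sits in between.
    Component names here are illustrative."""
    order = [
        "system_instructions",
        "tool_descriptions",
        "retrieved_knowledge",
        "conversation_history",
        "tool_outputs",
        "user_input",
    ]
    # Skip components that are empty or absent for this call.
    return "\n\n".join(parts[name] for name in order if parts.get(name))
```

In a real system, each of these slots would be filled dynamically per call, which is exactly the orchestration work context engineering describes.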

This also clarifies how context engineering differs from prompt engineering. Prompt engineering asks, “How do I phrase my instruction to get the best result?” Context engineering asks, “What does the model need to see right now, and how do I assemble all of it dynamically?”

Prompt engineering is one component within context engineering, focused on the instruction layer, while context engineering encompasses the full information system around the model. As Andrej Karpathy put it in a widely referenced post, context engineering is the “delicate art and science of filling the context window with just the right information for the next step.”

Two people using the same model can get wildly different results. The model is the same, but the context is different, and context engineering is what makes the difference.

Core Strategies

Developers have converged on four broad strategies for managing context, categorized as write, select, compress, and isolate. Each one is a direct response to a specific constraint we’ve already covered.

Write: Save Context Externally

The constraint it addresses is that the context window is finite, and statelessness means information is lost between calls.

Instead of trying to keep everything inside the context window, save important information to external storage and bring it back when needed. This takes two main forms.

  • The first is scratchpads, where an agent saves intermediate plans, notes, or reasoning steps to external storage during a long-running task. Anthropic’s multi-agent research system does exactly this. The lead researcher agent writes its plan to external memory at the start of a task, because if the context window exceeds 200,000 tokens, it gets truncated and the plan would be lost.

  • The second form is long-term memory, which involves persisting information across sessions. ChatGPT auto-generates user preferences from conversations, Cursor and Windsurf learn coding patterns and project context, and Claude Code uses CLAUDE.md files as persistent instruction memory. All of these systems treat external storage as the real memory layer, with the context window serving as a temporary workspace.

Select: Pull In Only What’s Relevant

The constraint it addresses is that more context isn’t better, and the model needs the right information rather than all available information.

The most important technique here is Retrieval-Augmented Generation, or RAG. Instead of stuffing all your knowledge into the context window, we store it externally in a searchable database. At query time, retrieve only the chunks most relevant to the current question and inject those into the context, giving the model targeted knowledge without the noise of everything else.

Selection also applies to tools. When an agent has dozens of available tools, listing every tool description in every prompt wastes tokens and confuses the model. A better approach is to retrieve only the tool descriptions relevant to the current task.

The critical tradeoff with selection is precision. If the retrieval pulls in documents that are almost relevant but not quite, they become distractors that add tokens and push important context into low-attention zones. The retrieval step itself has to be good, or the whole strategy backfires.
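The select step can be sketched with a toy keyword-overlap retriever. Production systems use embedding similarity rather than word overlap, but the contract is the same: given a query and a corpus, return only the top-k most relevant chunks.

```python
def retrieve_top_k(query, chunks, k=2):
    """Toy retriever: rank chunks by how many query words they share.
    A stand-in for embedding-based retrieval; the interface is the point."""
    q_terms = set(query.lower().split())
    scored = sorted(
        chunks,
        key=lambda c: len(q_terms & set(c.lower().split())),  # overlap score
        reverse=True,
    )
    return scored[:k]  # only the best matches enter the context window
```

Everything hinges on the quality of that ranking: if near-miss chunks score highly, they become exactly the distractors described above.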

Compress: Keep Only What You Need

The constraint it addresses is the context rot and the escalating cost of attention across more tokens.

As agent workflows span dozens or hundreds of steps, the context window fills up with accumulated conversation history and tool outputs. Compression strategies reduce this bulk while trying to preserve the essential information.

Conversation summarization is the most common approach. Claude Code, for instance, triggers an “auto-compact” process when the context hits 95% capacity, summarizing the entire interaction history into a shorter form. Cognition, the company behind the Devin coding agent, trained a separate, dedicated model specifically for summarization at agent-to-agent boundaries. The fact that they built a separate model just for this step tells us how consequential bad compression can be, since a specific decision or detail that gets summarized away is gone permanently.

Simpler forms of compression include trimming (removing older messages from the history) and tool output compression (reducing verbose search results or code outputs to their essentials before they enter the context).
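The simplest of these, trimming, can be sketched in a few lines. This is an illustration of the idea, not any product’s implementation; the marker message is an assumed convention.

```python
def compact_history(history, max_tokens, count_tokens):
    """Drop the oldest messages until the history fits the token budget,
    leaving a short marker so the model knows context was trimmed.
    `count_tokens` is injected so any tokenizer can be plugged in."""
    trimmed = list(history)
    dropped = 0
    while trimmed and sum(count_tokens(m) for m in trimmed) > max_tokens:
        trimmed.pop(0)   # oldest message goes first
        dropped += 1
    if dropped:
        trimmed.insert(0, f"[{dropped} earlier messages removed]")
    return trimmed
```

Summarization replaces the `pop` with an LLM call that condenses the dropped messages, trading extra cost and risk of information loss for retained meaning.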

Isolate: Split Context Across Agents

The constraint it addresses is that of attention dilution and context poisoning when too many types of information compete in one window.

Instead of one agent trying to handle everything in a single bloated context window, this strategy splits the work across multiple specialized agents, each with its own clean, focused context. A “researcher” agent gets a context loaded with search tools and retrieved documents, while a “writer” agent gets a context loaded with style guides and formatting rules, so neither is distracted by the other’s information.

Anthropic demonstrated this with their multi-agent research system, where a lead Opus 4 agent delegated sub-tasks to Sonnet 4 sub-agents. The system achieved a 90.2% improvement over a single Opus 4 agent on research tasks, despite using the same underlying model family. The entire performance gain came from how context was managed, not from a more powerful model.
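A bare-bones sketch of the orchestration pattern might look like the following. The `SubAgent` class and `run` method are hypothetical placeholders for real model calls; the point is structural: each agent carries its own system prompt and history, and only that context is ever sent to the model on its behalf.

```python
from dataclasses import dataclass, field

@dataclass
class SubAgent:
    """Each sub-agent owns an isolated context: its own system prompt,
    tools, and message history, invisible to its siblings."""
    role: str
    system_prompt: str
    history: list = field(default_factory=list)

    def run(self, task: str) -> str:
        # Placeholder for a model call; only this agent's own context
        # (system_prompt + history + task) would be serialized and sent.
        self.history.append(task)
        return f"[{self.role}] handled: {task}"

def orchestrate(task: str) -> str:
    """Lead agent splits the work; each sub-agent gets a clean window."""
    researcher = SubAgent("researcher", "You search and cite sources.")
    writer = SubAgent("writer", "You follow the style guide.")
    findings = researcher.run(f"research: {task}")
    # The writer receives only the distilled findings, never the
    # researcher's raw tool outputs or accumulated history.
    return writer.run(f"write using: {findings}")
```

The handoff between agents is itself a compression boundary, which is why, as noted above, Cognition trained a dedicated summarization model for exactly that step.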


Tradeoffs

These strategies are powerful, but they involve trade-offs with no universal right answers:

  • Compression versus information loss: Every time you summarize, you risk losing a detail that turns out to matter later. The more aggressively you compress, the more you save on tokens, but the higher the chance of permanently destroying something important.

  • Single agent versus multi-agent: Anthropic’s multi-agent results are impressive, but others, notably Cognition, have argued that a single agent with good compression delivers more stability and lower cost. Both sides are debating the same core question of how to manage context effectively, and the answer depends on task complexity, cost tolerance, and reliability requirements.

  • Retrieval precision versus noise: RAG adds knowledge, but imprecise retrieval adds distractors. If the documents you retrieve aren’t genuinely relevant, they consume tokens and push important content into low-attention positions, so the retrieval system itself has to be well-engineered, or RAG makes things worse.

  • Cost versus richness: Every token costs money and processing time. The disproportionate scaling of attention means longer contexts get expensive fast, and context engineering is partly an economics problem of figuring out where the return on additional tokens stops being worth the cost.

Conclusion

The core takeaway is that the model is only as good as the context it receives. Working with LLMs effectively requires thinking about the entire system around the model, not just the model itself.

As models get more powerful, context engineering becomes more important. When the model is capable enough, most failures stop being intelligence failures and start being context failures, where the model could have gotten it right but didn’t have what it needed or had too much of what it didn’t need.

The strategies are evolving, and best practices are being revised as new models ship. However, the underlying constraints of finite attention, positional bias, and statelessness are architectural.

References