2026-02-13 00:30:55
Software architecture patterns are reusable solutions to common problems that occur when designing software systems. Think of them as blueprints that have been tested and proven effective by countless developers over many years.
When we build applications, we often face similar challenges, such as how to organize code, how to scale systems, or how to handle communication between different parts of an application. Architecture patterns provide us with established approaches to solve these challenges.
Learning about architecture patterns offers several key benefits.
First, it increases our productivity because we do not need to invent solutions from scratch for every project.
Second, it improves our code quality by following proven approaches that make systems more maintainable and easier to understand.
Third, it enhances communication within development teams by providing a common vocabulary to discuss design decisions.
In this article, we will explore the essential architecture patterns that every software engineer should understand. We will look at how each pattern works, when to use it, what performance characteristics it has, and see practical examples of each.
2026-02-12 00:30:38
Join us for Sonar Summit on March 3rd, a global virtual event bringing together the brightest minds in software development.
In a world increasingly shaped by AI, it’s more crucial than ever to cut through the noise and amplify the ideas and practices that lead to truly good code. We created Sonar Summit to help you navigate the future with the clarity and knowledge you need to build better software, faster.
OpenAI recently launched ChatGPT Atlas, a web browser where the LLM acts as your co-pilot across the internet. You can ask questions about any page, have ChatGPT complete tasks for you, or let it browse in Agent mode while you work on something else.
Delivering this experience wasn’t trivial. ChatGPT Atlas needed to start instantly and stay responsive even with hundreds of tabs open. To make development faster and avoid reinventing the wheel, the team built on top of Chromium, the engine that powers many other modern browsers.
However, Atlas is not just another Chromium-based browser with a different skin. Most Chromium-based browsers embed the web engine directly into their application, which creates tight coupling between the UI and the rendering engine. This architecture works fine for traditional browsing, but it makes certain capabilities extremely difficult to achieve.
Therefore, OpenAI’s solution was to build OWL (OpenAI’s Web Layer), an architectural layer that runs Chromium as a separate process, thereby unlocking capabilities that would have been nearly impossible otherwise.
In this article, we learn how the OpenAI Engineering Team built OWL and the technical challenges they faced around rendering and inter-process communication.
Disclaimer: This post is based on publicly shared details from the OpenAI Engineering Team. Please comment if you notice any inaccuracies.
Chromium was the natural choice as the web engine for Atlas. Chromium provides a state-of-the-art rendering engine with strong security, proven performance, and complete web compatibility. It powers many modern browsers, including Chrome, Edge, and Brave. Furthermore, Chromium benefits from continuous improvements by a global developer community. For any team building a browser today, Chromium is the logical starting point.
However, using Chromium comes with significant challenges. The OpenAI Engineering Team had ambitious goals that were difficult to achieve with Chromium’s default architecture:
First, they wanted instant startup times. Users should see the browser interface immediately, not after waiting for everything to load.
Second, they needed rich animations and visual effects for features like Agent mode, which meant using modern native frameworks like SwiftUI and Metal rather than Chromium’s built-in UI system.
Third, Atlas needed to support hundreds of open tabs without degrading performance.
Chromium has strong opinions about how browsers should work. It controls the boot sequence, the threading model, and how tabs are managed.
While OpenAI could have made extensive modifications to Chromium itself, this approach had problems. Making substantial changes to Chromium’s core would mean maintaining a large set of custom patches. Every time a new Chromium version was released, merging those changes would become increasingly difficult and time-consuming.
There was also a cultural consideration. OpenAI has an engineering principle called “shipping on day one,” where every new engineer makes and merges a code change on their first afternoon. This practice keeps development velocity high and helps new team members feel immediately productive. However, Chromium takes hours to download and build from source. Making this requirement work with traditional Chromium integration seemed nearly impossible.
OpenAI needed a different approach to integrate Chromium that would enable rapid experimentation, faster feature delivery, and maintain their engineering culture.
With the largest catalog of AI apps and agents in the industry, Microsoft Marketplace is a single source for cloud and AI needs. For software companies, Marketplace is how you connect your solution to millions of global buyers 24/7, helping you reach new customers and sell with the power of Microsoft.
Publish your solution to the Microsoft Marketplace and grow pipeline with trials and product-led sales. Plus, you can simplify sales operations by streamlining terms, payouts, and billing.
Expand your product reach with Microsoft Marketplace
The answer was OWL, a new architectural layer that fundamentally changes how Chromium integrates with the browser application.
The key tenet behind the architecture is that instead of embedding Chromium inside the Atlas application, OpenAI runs Chromium’s browser process outside the main Atlas application process.
In this architecture, Atlas is the OWL Client, and the Chromium browser process is the OWL Host. These two components communicate through IPC using Mojo, which is Chromium’s own message-passing system. OpenAI wrote custom Swift and TypeScript bindings for Mojo, allowing their Swift-based Atlas application to call Chromium functions directly.
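To make the shape of this split concrete, here is a loose, hypothetical analogy in Python rather than OWL’s actual Mojo/Swift interface: an engine running in its own process, driven by a thin client over an IPC channel. All names and the message format below are illustrative.

```python
# A minimal sketch of the "engine in a separate process" pattern.
# This is an illustrative analogy, not OWL's actual Mojo/Swift interface.
from multiprocessing import Process, Pipe

def engine_host(conn):
    """Plays the role of the OWL Host: owns the heavy engine state."""
    pages = {}
    while True:
        msg = conn.recv()                      # blocking IPC read
        if msg["op"] == "navigate":
            pages[msg["tab"]] = msg["url"]     # pretend to load the page
            conn.send({"ok": True, "tab": msg["tab"], "url": msg["url"]})
        elif msg["op"] == "shutdown":
            conn.send({"ok": True})
            break

class EngineClient:
    """Plays the role of the OWL Client: a thin API inside the app process."""
    def __init__(self):
        self.conn, child_conn = Pipe()
        self.proc = Process(target=engine_host, args=(child_conn,))
        self.proc.start()                      # engine boots in its own process

    def navigate(self, tab, url):
        self.conn.send({"op": "navigate", "tab": tab, "url": url})
        return self.conn.recv()                # reply over the same channel

    def shutdown(self):
        self.conn.send({"op": "shutdown"})
        self.conn.recv()
        self.proc.join()

if __name__ == "__main__":
    client = EngineClient()
    print(client.navigate("tab-1", "https://example.com"))
    client.shutdown()
```

The real system replaces this toy pipe with Mojo-based IPC and typed Swift/TypeScript bindings, but the division of responsibilities is the same: the client stays thin and responsive, and the engine does the heavy lifting elsewhere.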
See the diagram below:
The OWL client library exposes a clean Swift API that abstracts several key concepts:
Session: Configures and controls the Chromium host globally
Profile: Manages browser state for a specific user profile (bookmarks, history, etc.)
WebView: Controls individual web pages, handling navigation, zoom, and input
WebContentRenderer: Forwards input events into Chromium and receives feedback
LayerHost/Client: Exchanges compositing information between Atlas UI and Chromium
Additionally, OWL provides service endpoints for managing high-level features like bookmarks, downloads, extensions, and autofill.
One of the most complex aspects of OWL is rendering.
How do you display web content that Chromium generates in one process within Atlas windows that exist in another process?
OpenAI solved this using a technique called layer hosting. Here is how it works:
On the Chromium side, web content is rendered to a CALayer, which is a macOS graphics primitive. This layer has a unique context ID.
On the Atlas side, an NSView (a window component) embeds this layer using the private CALayerHost API. The context ID tells Atlas which layer to display.
See the diagram below:
The result is that pixels rendered by Chromium in the OWL process appear seamlessly in Atlas windows. The GPU compositor handles this efficiently because both processes can share graphics memory. Multiple tabs can share a single compositing container. When you switch tabs, Atlas simply swaps which WebView is connected to the visible container.
This technique also works for special UI elements like dropdown menus from select elements or color pickers. These render in separate pop-up widgets in Chromium, each with its own rendering surface, but they follow the same delegated rendering model.
OpenAI also uses this approach selectively to project elements of Chromium’s native UI into Atlas. This is useful for quickly bootstrapping features like permission prompts without building complete replacements in SwiftUI. The technique borrows from Chromium’s existing infrastructure for installable web applications on macOS.
User input requires careful handling across the process boundary. Normally, Chromium’s UI layer translates platform events like mouse clicks or key presses from macOS NSEvents into Blink’s WebInputEvent format before forwarding them to web page renderers.
In the OWL architecture, Chromium runs without visible windows, so it never receives these platform events directly. Instead, the Atlas client library performs the translation from NSEvents to WebInputEvents and forwards the already-translated events to Chromium over IPC.
See the diagram below:
From there, events follow the same lifecycle they would normally follow for web content. If a web page indicates it did not handle an event, Chromium returns it to the Atlas client. When this happens, Atlas resynthesizes an NSEvent and gives the rest of the application a chance to handle the input. This allows browser-level keyboard shortcuts and gestures to work correctly even though the web engine is in a separate process.
Atlas includes an agentic browsing feature where ChatGPT can control the browser to complete tasks. This capability poses unique challenges for rendering, input handling, and data storage.
The computer use model that powers Agent mode expects a single screenshot of the browser as input. However, some UI elements, like dropdown menus, render outside the main tab bounds in separate windows. To solve this, Atlas composites these pop-up windows back into the main page image at their correct coordinates in Agent mode. This ensures the AI model sees the complete context in a single frame.
For input events, OpenAI applies a strict security principle. Agent-generated events route directly to the web page renderer and never pass through the privileged browser layer. This preserves the security sandbox even under automated control. The system prevents AI-generated events from synthesizing keyboard shortcuts that would make the browser perform actions unrelated to the displayed web content.
Agent mode also supports ephemeral browsing sessions. Instead of using the user’s existing Incognito profile, which could leak state between sessions, OpenAI uses Chromium’s StoragePartition infrastructure to create isolated, in-memory data stores. Each agent session starts completely fresh. When the session ends, all cookies and site data are discarded. You can run multiple logged-out agent sessions simultaneously, each in its own browser tab, with complete isolation between them.
The OWL architecture delivers several critical benefits that enable OpenAI’s product goals.
Atlas achieves fast startup because Chromium boots asynchronously in the background while the Atlas UI appears nearly instantly. Users see pixels on screen within milliseconds, even though the web engine may still be initializing.
The application is simpler to develop because Atlas is built almost entirely in SwiftUI and AppKit. This creates a unified codebase with one primary language and technology stack, making it easier for developers to work across the entire application.
Process isolation means that if Chromium’s main thread hangs, Atlas remains responsive. If Chromium crashes, Atlas stays running and can recover. This separation protects the user experience from issues in the web engine.
OpenAI maintains a much smaller diff against upstream Chromium because they are not modifying Chromium’s UI layer extensively. This makes it easier to integrate new Chromium versions as they are released.
Most importantly for developer productivity, most engineers never need to build Chromium locally. OWL ships internally as a prebuilt binary, so Atlas builds completely in minutes rather than hours.
Every architectural decision involves trade-offs:
Running two separate processes uses more memory than a monolithic architecture.
The IPC layer adds complexity that must be maintained.
Cross-process rendering could potentially add latency, although OpenAI mitigates this through efficient use of CALayerHost and GPU memory sharing.
However, OpenAI determined that these trade-offs were worthwhile. The benefits of stability, developer productivity, and architectural flexibility outweigh the costs. The clean separation between Atlas and Chromium creates a foundation that will support future innovation, particularly for agentic use cases.
OWL is not just about building a better browser today.
It creates infrastructure for the future of AI-powered web experiences. The architecture makes it easy to run multiple isolated agent sessions, add new AI capabilities, and experiment with novel interactions between users, AI, and web content. The built-in sandboxing for agent actions provides security by design rather than as an afterthought.
Building ChatGPT Atlas required rethinking fundamental assumptions about browser architecture. By running Chromium outside the main application process and creating the OWL integration layer, the OpenAI Engineering Team solved multiple challenges simultaneously. They achieved instant startup, maintained developer productivity, enabled rich UI capabilities, and built a strong foundation for agentic browsing.
2026-02-11 00:30:21
Monster SCALE Summit is a virtual conference all about extreme-scale engineering and data-intensive applications. Engineers from Discord, Disney, LinkedIn, Uber, Pinterest, Rivian, ClickHouse, Redis, MongoDB, ScyllaDB + more will be sharing 50+ talks on topics like:
Distributed databases
Streaming and real-time processing
Intriguing system designs
Approaches to a massive scaling challenge
Methods for balancing latency/concurrency/throughput
Infrastructure built for unprecedented demands
Don’t miss this chance to connect with 20K of your peers designing, implementing, and optimizing data-intensive applications – for free, from anywhere.
LinkedIn serves hundreds of millions of members worldwide, delivering fast experiences whether someone is loading their feed or sending a message. Behind the scenes, this seamless experience depends on thousands of software services working together. Service Discovery is the infrastructure system that makes this coordination possible.
Consider a modern application at scale. Instead of building one massive program, LinkedIn breaks functionality into tens of thousands of microservices. Each microservice handles a specific task like authentication, messaging, or feed generation. These services need to communicate with each other constantly, and they need to know where to find each other.
Service discovery solves this location problem. Instead of hardcoding addresses that can change as servers restart or scale, services use a directory that tracks where every service currently lives. This directory maintains IP addresses and port numbers for all active service instances.
At LinkedIn’s scale, with tens of thousands of microservices running across global data centers and handling billions of requests each day, service discovery becomes exceptionally challenging. The system must update in real time as servers scale up or down, remain highly reliable, and respond within milliseconds.
In this article, we learn how LinkedIn built and rolled out Next-Gen Service Discovery, a scalable control plane supporting app containers in multiple programming languages.
Disclaimer: This post is based on publicly shared details from the LinkedIn Engineering Team. Please comment if you notice any inaccuracies.
For the past decade, LinkedIn used Apache Zookeeper as the control plane for service discovery. Zookeeper is a coordination service that maintains a centralized registry of services.
In this architecture, Zookeeper allowed server applications to register their endpoint addresses in a custom format called D2, which stands for Dynamic Discovery. The system stored the configuration about how RPC traffic should flow as D2 configs and served them to application clients. The application servers and clients formed the data plane, handling actual inbound and outbound RPC traffic using LinkedIn’s Rest.li framework, a RESTful communication system.
Here is how the system worked:
The Zookeeper client library ran on all application servers and clients.
The Zookeeper ensemble took direct write requests from application servers to register their endpoint addresses as ephemeral nodes called D2 URIs.
Ephemeral nodes are temporary entries that exist only while the connection remains active.
Zookeeper performed health checks on these connections to keep the ephemeral nodes alive.
Zookeeper also took direct read requests from application clients to set watchers on the server clusters they needed to call. When updates happened, clients would read the changed ephemeral nodes.
Despite its simplicity, this architecture had critical problems in three areas: scalability, compatibility, and extensibility. Earlier benchmark tests projected that the system would reach capacity in early 2025.
LLMs are powerful—but without fresh, reliable information, they hallucinate, miss context, and go out of date fast. SerpApi gives your AI applications clean, structured web data from major search engines and marketplaces, so your agents can research, verify, and answer with confidence.
Access real-time data with a simple API.
The key problems with Zookeeper are as follows:
The control plane operated as a flat structure handling requests for hundreds of thousands of application instances.
During deployments of large applications with many calling clients, the D2 URI ephemeral nodes changed frequently. This led to read storms with huge fanout from all the clients trying to read updates simultaneously, causing high latencies for both reads and writes.
Zookeeper is a strong consistency system, meaning it enforces strict ordering over availability. All reads, writes, and session health checks go through the same request queue. When the queue had a large backlog of read requests, write requests could not be processed. Even worse, all sessions would be dropped due to health check timeouts because the queue was too backed up. This caused ephemeral nodes to be removed, resulting in capacity loss of application servers and site unavailability.
The session health checks performed on all registered application servers became unscalable with fleet growth. As of July 2022, LinkedIn had about 2.5 years of capacity left with a 50 to 100 percent yearly growth rate in cluster size and number of watchers, even after increasing the number of Zookeeper hosts to 80.
Since D2 entities used LinkedIn’s custom schemas, they were incompatible with modern data plane technologies like gRPC and Envoy.
The read and write logic in application containers was implemented primarily in Java, with a partial and outdated implementation for Python applications. When onboarding applications in other languages, the entire logic needed to be rewritten from scratch.
The lack of an intermediary layer between the service registry and application instances prevented the development of modern centralized RPC management techniques like centralized load balancing.
It also created challenges for integrating with new service registries to replace Zookeeper, such as Etcd with Kubernetes, or any new storage system that might have better functionality or performance.
The LinkedIn Engineering Team designed the new architecture to address all these limitations. Unlike Zookeeper handling read and write requests together, Next-Gen Service Discovery consists of two separate paths: Kafka for writes and Service Discovery Observer for reads.
Kafka takes in application server writes and periodic heartbeats through Kafka events called Service Discovery URIs. Kafka is a distributed streaming platform capable of handling millions of messages per second. Each Service Discovery URI contains information about a service instance, including service name, IP address, port number, health status, and metadata.
Service Discovery Observer consumes the URIs from Kafka and writes them into its main memory. Application clients open bidirectional gRPC streams to the Observer, sending subscription requests using the xDS protocol. The Observer keeps these streams open to push data and all subsequent updates to application clients instantly.
The xDS protocol is an industry standard created by the Envoy project for service discovery. Instead of clients polling for updates, the Observer pushes changes as they happen. This streaming approach is far more efficient than the old polling model.
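As a rough illustration of the write path, an application server might publish its announcement and periodic heartbeats along these lines. The topic name, payload schema, and heartbeat cadence below are assumptions for the sketch, not LinkedIn’s published contract.

```python
# Hypothetical sketch of an application server announcing itself over Kafka.
# Topic name, payload fields, and cadence are illustrative, not LinkedIn's schema.
import json
import time

from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def announce(service_name, ip, port, healthy=True):
    uri = {
        "service": service_name,
        "ip": ip,
        "port": port,
        "healthy": healthy,
        "timestamp": time.time(),
    }
    # The Observer consumes these events from Kafka and updates its in-memory cache.
    producer.send("service-discovery-uris", value=uri)

announce("profile-service", "10.0.0.12", 8443)   # initial announcement
for _ in range(3):                               # heartbeats keep the entry fresh
    time.sleep(30)
    announce("profile-service", "10.0.0.12", 8443)
producer.flush()
```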
D2 configs remain stored in Zookeeper. Application owners run CLI commands that use the Config Service to update the D2 configs and convert them into xDS entities.
The Observer consumes the configs from Zookeeper and distributes them to clients in the same way as the URIs.
The Observer is horizontally scalable and written in Go, chosen for its high concurrency capabilities.
It can process large volumes of client requests, dispatch data updates, and consume URIs for the entire LinkedIn fleet efficiently. As of today, one Observer can maintain 40,000 client streams while sending 10,000 updates per second and consuming 11,000 Kafka events per second.
With projections of fleet size growing to 3 million instances in the coming years, LinkedIn will need approximately 100 Observers.
Here are some key improvements that the new architecture provided in comparison to Zookeeper:
LinkedIn prioritized availability over consistency because service discovery data only needs to eventually converge. Some short-term inconsistency across servers is acceptable, but the data must be highly available to the huge fleet of clients. This represents a fundamental shift from Zookeeper’s strong consistency model.
Multiple Observer replicas reach eventual consistency after a Kafka event is consumed and processed on all replicas. Even when Kafka experiences significant lag or goes down, Observer continues serving client requests with its cached data, preventing cascading failures.
LinkedIn can further improve scalability by separating dedicated Observer instances. Some Observers can focus on consuming Kafka events as consumers, while other Observers serve client requests as servers. The server Observers would subscribe to the consumer Observers for cache updates.
Next-Gen Service Discovery supports the gRPC framework natively and enables multi-language support.
Since the control plane uses the xDS protocol, it works with open-source gRPC and Envoy proxy. Applications not using Envoy can leverage open-source gRPC code to directly subscribe to the Observer. Applications onboarding the Envoy proxy get multi-language support automatically.
Adding Next-Gen Service Discovery as a central control plane between the service registry and clients enables LinkedIn to extend to modern service mesh features. These include centralized load balancing, security policies, and transforming endpoint addresses between IPv4 and IPv6.
LinkedIn can also integrate the system with Kubernetes to leverage application readiness probes. This would collect the status and metadata of application servers, converting servers from actively making announcements to passively receiving status probes, which is more reliable and better managed.
Next-Gen Service Discovery Observers run independently in each fabric. A fabric is a data center or isolated cluster. Application clients can be configured to connect to the Observer in a remote fabric and be served with the server applications in that fabric. This supports custom application needs or provides failover when the Observer in one fabric goes down, ensuring business traffic remains unaffected.
See the diagram below:
Application servers can also write to the control plane in multiple fabrics. Cross-fabric announcements are appended with a fabric name suffix to differentiate from local announcements. Application clients can then send requests to application servers in both local and remote fabrics based on preference.
See the diagram below:
Rolling out Next-Gen Service Discovery to hundreds of thousands of hosts without impacting current requests required careful planning.
LinkedIn needed the service discovery data served by the new control plane to exactly match the data on Zookeeper. They needed to equip all application servers and clients companywide with related mechanisms through just an infrastructure library version bump. They needed central control on the infrastructure side to switch Next-Gen Service Discovery read and write on and off by application. Finally, they needed good central observability across thousands of applications on all fabrics for migration readiness, results verification, and troubleshooting.
The three major challenges were as follows:
First, service discovery is mission-critical, and any error could lead to severe site-wide incidents. Since Zookeeper was approaching capacity limits, LinkedIn needed to migrate as many applications off Zookeeper as quickly as possible.
Second, application states were complex and unpredictable. Next-Gen Service Discovery Read required client applications to establish gRPC streams. However, Rest.li applications that had existed at the company for over a decade were in very different states regarding dependencies, gRPC SSL, and network access. Compatibility with the control plane for many applications was unpredictable without actually enabling the read.
Third, read and write migrations were coupled. If the write was not migrated, no data could be read on Next-Gen Service Discovery. If the read was not migrated, data was still read on Zookeeper, blocking the write migration. Since read path connectivity was vulnerable to application-specific states, the read migration had to start first. Even after client applications migrated for reads, LinkedIn needed to determine which server applications became ready for Next-Gen Service Discovery Write and prevent clients from regressing to read Zookeeper again.
LinkedIn implemented a dual mode strategy where applications run both old and new systems simultaneously, verifying the new flow behind the scenes.
To decouple read and write migration, the new control plane served a combined dataset of Kafka and Zookeeper URIs, with Kafka as the primary source and Zookeeper as backup. When no Kafka data existed, the control plane served Zookeeper data, mirroring what clients read directly from Zookeeper. This enabled read migration to start independently.
In Dual Read mode, an application client reads data from both Next-Gen Service Discovery and Zookeeper, keeping Zookeeper as the source of truth for serving traffic. Using an independent background thread, the client tries to resolve traffic as if it were served by Next-Gen Service Discovery data and reports any errors.
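A minimal sketch of the Dual Read idea, with illustrative names: Zookeeper-backed data keeps serving real traffic while a background thread checks whether the new data would have produced the same resolution.

```python
# Illustrative sketch of Dual Read: Zookeeper data stays authoritative while a
# background thread checks whether the new control plane would have agreed.
import logging
import threading

log = logging.getLogger("dual-read")

def resolve(uris, service_name):
    """Pick healthy endpoints for a service from a snapshot of discovery data."""
    return sorted(u["address"] for u in uris.get(service_name, []) if u["healthy"])

def dual_read_resolve(zk_uris, nextgen_uris, service_name):
    primary = resolve(zk_uris, service_name)      # source of truth for real traffic

    def shadow_check():
        shadow = resolve(nextgen_uris, service_name)
        if shadow != primary:
            # Mismatches feed the Dual Read metrics instead of affecting traffic.
            log.warning("mismatch for %s: zk=%s nextgen=%s",
                        service_name, primary, shadow)

    threading.Thread(target=shadow_check, daemon=True).start()
    return primary
```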
LinkedIn built comprehensive metrics to verify connectivity, performance, and data correctness on both the client side and Observer side. On the client side, connectivity and latency metrics watched for connection status and data latencies from when the subscription request was sent to when data was received. Dual Read metrics compared data received from Zookeeper and Next-Gen Service Discovery to identify mismatches. Service Discovery request resolution metrics showed request status, identical to Zookeeper-based metrics, but with a Next-Gen Service Discovery prefix to identify whether requests were resolved by Next-Gen Service Discovery data and catch potential errors like missing critical data.
On the Observer side, connection and stream metrics watched for client connection types, counts, and capacity. These helped identify issues like imbalanced connections and unexpected connection losses during restart. Request processing latency metrics measured time from when the Observer received a request to when the requested data was queued for sending. The actual time spent sending data over the network was excluded since problematic client hosts could get stuck receiving data and distort the metric. Additional metrics tracked Observer resource utilization, including CPU, memory, and network bandwidth.
See the diagram below:
With all these metrics and alerts, before applications actually used Next-Gen Service Discovery data, LinkedIn caught and resolved numerous issues, including connectivity problems, reconnection storms, incorrect subscription handling logic, and data inconsistencies, avoiding many companywide incidents. After all verifications passed, applications were ramped to perform Next-Gen Service Discovery read-only.
In Dual Write mode, application servers reported to both Zookeeper and Next-Gen Service Discovery.
On the Observer side, Zookeeper-related metrics monitored potential outages, connection losses, or high latencies by watching connection status, watch status, data received counts, and lags. Kafka metrics monitored potential outages and high latencies by watching partition lags and event counts.
LinkedIn calculated a URI Similarity Score for each application cluster by comparing data received from Kafka and Zookeeper. A 100 percent match could only be reached if all URIs in the application cluster were identical, guaranteeing that Kafka announcements matched existing Zookeeper announcements.
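To make the URI Similarity Score concrete, one plausible formulation (an assumption, since the exact formula is not public) is the overlap between the endpoint sets announced to Kafka and to Zookeeper for a cluster:

```python
# One plausible way to compute a URI Similarity Score per application cluster.
# The exact formula is an assumption; the stated goal is a 100% score only when
# the Kafka and Zookeeper announcements are identical.
def uri_similarity(kafka_uris: set[str], zookeeper_uris: set[str]) -> float:
    if not kafka_uris and not zookeeper_uris:
        return 100.0
    union = kafka_uris | zookeeper_uris
    intersection = kafka_uris & zookeeper_uris
    return 100.0 * len(intersection) / len(union)

# Example: one endpoint missing from Kafka drags the score below 100%.
print(uri_similarity({"10.0.0.1:8443", "10.0.0.2:8443"},
                     {"10.0.0.1:8443", "10.0.0.2:8443", "10.0.0.3:8443"}))  # ~66.7
```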
Cache propagation latency is measured as the time from when data was received on the Observer to when the Observer cache was updated.
Resource propagation latency is measured as the time from when the application server made the announcement to when the Observer cache was updated, representing the full end-to-end write latency.
On the application server side, a metric tracked the server announcement mode to accurately determine whether the server was announcing to Zookeeper only, dual write, or only Next-Gen Service Discovery. This allowed LinkedIn to understand if all instances of a server application had fully adopted a new stage.
See the diagram below:
LinkedIn also monitored end-to-end propagation latency, measuring the time from when an application server made an announcement to when a client host received the update. They built a dashboard to measure this across all client-server pairs daily, monitoring for P50 less than 1 second and P99 less than 5 seconds. P50 means that 50 percent of clients received the propagated data within that time, and P99 means 99 percent received it within that time.
The safest approach for write migration would be waiting until all client applications are migrated to Next-Gen Service Discovery Read and all Zookeeper-reading code is cleaned up before stopping Zookeeper announcements. However, with limited Zookeeper capacity and the urgency to avoid outages, LinkedIn needed to begin write migration in parallel with client application migration.
LinkedIn built cron jobs to analyze Zookeeper watchers set on the Zookeeper data of each application and list the corresponding reader applications. A watcher is a mechanism where clients register interest in data changes. When data changes, Zookeeper notifies all watchers. These jobs generated snapshots of watcher status at short intervals, catching even short-lived readers like offline jobs. The snapshots were aggregated into daily and weekly reports.
These reports identified applications with no readers on Zookeeper in the past two weeks, which LinkedIn set as the criteria for applications becoming ready to start Next-Gen Service Discovery Write. The reports also showed top blockers, meaning reader applications blocking the most server hosts from migrating, and top applications being blocked, identifying the largest applications unable to migrate, and which readers were blocking them.
This information helped LinkedIn prioritize focus on the biggest blockers for migration to Next-Gen Service Discovery Read. Additionally, the job could catch any new client that started reading server applications already migrated to Next-Gen Service Discovery Write and send alerts, allowing prompt coordination with the reader application owner for migration or troubleshooting.
The Next-Gen Service Discovery system achieved significant improvements over the Zookeeper-based architecture.
The system now handles the company-wide fleet of hundreds of thousands of application instances in one data center with data propagation latency of P50 less than 1 second and P99 less than 5 seconds. The previous Zookeeper-based architecture experienced high latency and unavailability incidents frequently, with data propagation latency of P50 less than 10 seconds and P99 less than 30 seconds.
This represents a tenfold improvement in median latency and a sixfold improvement in 99th percentile latency. The new system not only safeguards platform reliability at massive scale but also unlocks future innovations in centralized load balancing, service mesh integration, and cross-fabric resiliency.
Next-Gen Service Discovery marks a foundational transformation in LinkedIn’s infrastructure, changing how applications discover and communicate with each other across global data centers. By replacing the decade-old Zookeeper-based system with a Kafka and xDS-powered architecture, LinkedIn achieved near real-time data propagation, multi-language compatibility, and true horizontal scalability.
2026-02-10 00:31:03
Free trials help AI apps grow, but bots and fake accounts exploit them. They steal tokens, burn compute, and disrupt real users.
Cursor, the fast-growing AI code assistant, uses WorkOS Radar to detect and stop abuse in real time. With device fingerprinting and behavioral signals, Radar blocks fraud before it reaches your app.
You open an app with one specific question in mind, but the answer is usually hidden in a sea of reviews, photos, and structured facts. Modern content platforms are information-rich, yet surfacing direct answers can still be a challenge. A good example is Yelp business pages. Imagine you are deciding where to go and you ask, “Is the patio heated?” The page might contain the answer in a couple of reviews, a photo caption, or an attribute field, but you still have to scan multiple sections to piece it together.
A common way to solve this is to integrate an AI assistant inside the app. The assistant retrieves the right evidence and turns it into a single direct answer with citations to the supporting snippets.
This article walks through what it takes to ship a production-ready AI assistant using Yelp Assistant on business pages as a concrete case study. We’ll cover the engineering challenges, architectural trade-offs, and practical lessons from the development of the Yelp Assistant.
Note: This article is written in collaboration with Yelp. Special thanks to the Yelp team for sharing details with us about their work and for reviewing the final article before publication.
To deliver answers that are both accurate and cited, we cannot rely on an LLM’s internal knowledge alone. Instead, we use Retrieval-Augmented Generation (RAG).
RAG decouples the problem into two distinct phases: retrieval and generation, supported by an offline indexing pipeline that prepares the knowledge store.
The development of a RAG system starts with an indexing pipeline, which builds a knowledge store from raw data offline. Upon receiving a user query, the retrieval system scans this store using both lexical search for keywords and semantic search for intent to locate the most relevant snippets. Finally, the generation phase feeds these snippets to the LLM with strict instructions to answer solely based on the provided evidence and to cite specific sources.
Citations are typically produced by having the model output citation markers that refer to specific snippets. For example, if the prompt includes snippets with IDs S1, S2, and S3, the model might generate “Yes, the patio is heated” and attach markers like [S1] and [S3]. A citation resolution step then maps those markers back to the original sources, such as a specific review excerpt, photo caption, or attribute field, and formats them for the UI. Finally, citations are verified to ensure every emitted citation maps to real retrievable content.
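A minimal sketch of that citation-resolution step, with an assumed marker format and snippet structure:

```python
# Minimal sketch of citation resolution: map markers like [S1] in the model
# output back to the snippets included in the prompt, and drop any marker
# that does not correspond to real retrievable content.
import re

snippets = {
    "S1": {"type": "review", "text": "They turned the patio heaters on at night."},
    "S2": {"type": "attribute", "text": "Outdoor seating: yes"},
    "S3": {"type": "photo_caption", "text": "Heated patio with string lights"},
}

answer = "Yes, the patio is heated [S1][S3]."

def resolve_citations(answer: str, snippets: dict) -> tuple[str, list[dict]]:
    markers = re.findall(r"\[(S\d+)\]", answer)
    # Verification: keep only markers that map to snippets we actually retrieved.
    resolved = [snippets[m] for m in markers if m in snippets]
    clean_text = re.sub(r"\s*\[S\d+\]", "", answer).strip()
    return clean_text, resolved

text, sources = resolve_citations(answer, snippets)
print(text)     # "Yes, the patio is heated."
print(sources)  # the review and photo caption backing the claim
```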
While this system is enough for a prototype, a production system requires additional layers for reliability, safety, and performance. The rest of this article uses the Yelp Assistant as a case study to explore the real-world engineering challenges of building this at scale and the mitigations to solve them.
AI can write code in seconds. You’re the one who gets paged at 2am when production breaks.
As teams adopt agentic workflows, features change faster than humans can review them. When an AI-written change misbehaves, redeploying isn’t fast enough, rollbacks aren’t clean, and you’re left debugging decisions made by your AI overlord.
In this tech talk, we’ll show FeatureOps patterns to stay in control at runtime, stop bad releases instantly, limit blast radius, and separate deployment from exposure.
Led by Alex Casalboni, Developer Advocate at Unleash, who spent six years at AWS seeing the best and worst of running applications at scale.
A robust data strategy determines what content the assistant can retrieve and how quickly it stays up to date. The standard pipeline consists of three stages, beginning with data sourcing, where we select the necessary inputs (like reviews or business hours) and define update contracts. Next is ingestion, which cleans and transforms these raw feeds into a trusted canonical format. Finally, indexing transforms these records into retrieval-ready documents using keyword or vector search signals so the system can filter to the right business scope.
Setting up a data pipeline for a demo is usually simple. For example, Yelp’s early prototype relied on ad hoc batch dumps loaded into a Redis snapshot, which effectively treated each business as a static bundle of content.
In production, this approach collapses because content changes continuously and the corpus grows without bound. A stale answer regarding operating hours is worse than no answer at all, and a single generic index struggles to find specific needle-in-the-haystack facts as the data volume explodes. To meet the demands of high query volume and near real-time freshness, Yelp evolved their data strategy through four key architectural shifts.
Treating every data source as real-time makes ingestion expensive to operate while treating everything as a weekly batch results in stale answers. Yelp set explicit freshness targets based on the content type. They implemented streaming ingestion for high-velocity data like reviews and business attributes to ensure updates appear within 10 minutes. Conversely, they used a weekly batch pipeline for slow-moving sources like menus and website text. This hybrid approach ensures a user asking “Is it open?” gets the latest status without wasting resources streaming static content.
Not all questions should be answered the same way. Some require searching through noisy text while others require a single precise fact. Treating everything as generic text makes retrieval unreliable; it allows anecdotes in reviews to override canonical fields like operating hours.
Yelp replaced the single prototype Redis snapshot with two distinct production stores. Unstructured content like reviews and photos serves through search indices to maximize relevance. Structured facts like amenities and hours live in a Cassandra database using an Entity-Attribute-Value layout.
This separation prevents hallucinated facts and makes schema evolution significantly simpler. Engineers can add new attributes such as EV charging availability without running migrations.
Photos can be retrieved using only captions, only image embeddings, or a combination of both. Caption-only retrieval fails when captions are missing, too short, or phrased differently than the user’s question. Embedding-only retrieval can miss literal constraints like exact menu item names or specific terms the user expects to match.
Yelp bridged this gap by implementing hybrid retrieval. The system ranks photos using both caption text matches and image embedding similarity. If a user asks about a heated patio, the system can retrieve relevant evidence whether the concept is explicitly written as “heaters” in the caption or simply visible as a heat lamp in the image itself.
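A hedged sketch of what hybrid ranking can look like: blend a lexical caption score with image-embedding similarity so that either signal alone can surface a relevant photo. The weights and scoring functions here are assumptions, not Yelp’s production ranker.

```python
# Illustrative hybrid scoring for photo retrieval: a photo can rank highly from
# its caption text, its image embedding, or both. Weights are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def caption_score(query: str, caption: str) -> float:
    """Crude lexical overlap; a real system would use its search engine's scorer."""
    q, c = set(query.lower().split()), set(caption.lower().split())
    return len(q & c) / len(q) if q else 0.0

def hybrid_score(query, query_embedding, photo, w_text=0.5, w_image=0.5):
    return (w_text * caption_score(query, photo["caption"])
            + w_image * cosine(query_embedding, photo["embedding"]))

# A photo with no useful caption can still rank if its embedding matches the
# "heated patio" concept; a photo with a literal caption match ranks even higher.
```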
Splitting data across search indices and databases improves quality but can hurt latency. A single answer might require a read for hours, a query for reviews, and another query for photos. These separate network calls add up and force the assistant logic to manage complex data fetching.
Yelp solved this by placing a Content Fetching API in front of all retrieval stores. This abstraction handles the complexity of parallelizing backend reads and enforcing latency budgets. The result is a consistent response format that keeps the 95th percentile latency under 100 milliseconds and decouples the assistant logic from the underlying storage details. The following figure summarizes the data sources and any special handling for each one.
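As a rough asyncio sketch of what such a fetching layer does, assuming hypothetical backend functions and a 100 ms budget: fan out to the stores in parallel, enforce a single latency budget, and return whatever evidence arrived in time.

```python
# Sketch of a content-fetching layer: parallel backend reads under one latency
# budget. The backend functions and the 100 ms budget are illustrative.
import asyncio

async def fetch_reviews(biz_id):
    ...  # query the review search index

async def fetch_attributes(biz_id):
    ...  # read structured facts from the attribute store

async def fetch_photos(biz_id):
    ...  # query the photo index

async def fetch_content(biz_id, budget_s=0.1):
    tasks = {
        "reviews": asyncio.create_task(fetch_reviews(biz_id)),
        "attributes": asyncio.create_task(fetch_attributes(biz_id)),
        "photos": asyncio.create_task(fetch_photos(biz_id)),
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=budget_s)
    for task in pending:
        task.cancel()  # a slow backend cannot blow the overall budget
    return {name: task.result()
            for name, task in tasks.items()
            if task in done and task.exception() is None}

# evidence = asyncio.run(fetch_content("some-business-id"))
```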
Prototypes often prioritize simplicity by relying on a single large model for everything. The backend stuffs all available content such as menus and reviews into one massive prompt, forcing the model to act as a slow and expensive retrieval engine. Yelp followed this pattern in early demos. If a user asked, “Is the patio heated?”, the model had to read the entire business bundle to find a mention of heaters.
While this works for a demo, it collapses under real traffic. Excessive context leads to less relevant answers and high latency, while the lack of guardrails leaves the system vulnerable to adversarial attacks and out-of-scope questions that waste expensive compute.
To move from a brittle prototype to a robust production system, Yelp deconstructed the monolithic LLM into several specialized models to ensure safety and improve retrieval quality.
Yelp separated “finding evidence” from “writing the answer.” Instead of sending the entire business bundle to the model, the system queries near real-time indices to retrieve only the relevant snippets. For a question like “Is the patio heated?”, the system retrieves specific reviews mentioning “heaters” and the outdoor seating attribute. The LLM then generates a concise response based solely on that evidence, citing its sources.
Retrieval alone isn’t enough if you search every source by default. Searching menus for “ambiance” questions or searching reviews for “opening hours” introduces noise that confuses the model.
Yelp fixed this with a dedicated selector. A Content Source Selector analyzes the intent and outputs only the relevant stores. This enables the system to route inputs like “What are the hours?” to structured facts and “What is the vibe?” to reviews.
This routing also serves as conflict resolution if sources disagree. Yelp found it works best to default to authoritative sources like business attributes or website information for objective facts, and to rely on reviews for subjective, experience-based questions.
Users rarely use search-optimized keywords. They ask incomplete questions such as “vibe?” or “good for kids?” that fail against exact-match indices.
Yelp introduced a Keyword Generator, a fine-tuned GPT-4.1-nano model, that translates user queries into targeted search terms. For example, “vegan options” might generate keywords like “plant-based” or “dairy-free”. When the user’s prompt is broad, the Keyword Generator is trained to emit no keywords to avoid producing misleading keywords.
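The call itself can be as small as the sketch below, using the OpenAI Python client; the fine-tuned model identifier and prompt are hypothetical stand-ins, not Yelp’s internal ones.

```python
# Sketch of a query-expansion step. The fine-tuned model name is hypothetical;
# Yelp's actual model identifiers and prompt are internal.
from openai import OpenAI

client = OpenAI()

def generate_keywords(question: str) -> list[str]:
    response = client.chat.completions.create(
        model="ft:gpt-4.1-nano:keyword-generator",   # hypothetical identifier
        messages=[
            {"role": "system",
             "content": "Return comma-separated search keywords for the question, "
                        "or an empty string if the question is too broad."},
            {"role": "user", "content": question},
        ],
    )
    text = response.choices[0].message.content.strip()
    return [k.strip() for k in text.split(",") if k.strip()]

# generate_keywords("vegan options") might return ["plant-based", "dairy-free"]
```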
Before any retrieval happens, the system must decide if it should answer. Yelp uses two classifiers: Trust & Safety to block adversarial inputs and Inquiry Type to redirect out-of-scope questions like “Change my password” to the correct support channels.
Building this pipeline required a shift in training strategy. While prompt engineering a single large model works for prototypes, it proved too brittle for production traffic where user phrasing varies wildly. Yelp adopted a hybrid approach:
Fine-tuning for question analysis: They fine-tuned small and efficient models (GPT-4.1-nano) for the question analysis steps including Trust and Safety, Inquiry Type, and Source Selection. These small models achieved lower latency and higher consistency than prompting a large generic model.
Prompting for final generation: For the final answer where nuance and tone are critical, they stuck with a powerful generic model (GPT-4.1). Small fine-tuned models struggled to synthesize multiple evidence sources effectively, making the larger model necessary for the final output.
Prototypes usually handle each request as one synchronous blocking call. The system fetches content, builds a prompt, waits for the full model completion, and then returns one response payload. This workflow is simple but generally not optimized for latency or cost. Consequently, it becomes slow and expensive at scale.
Yelp optimized serving to reduce latency from over 10 seconds in prototypes to under 3 seconds in production. Key techniques include:
Streaming: In a synchronous prototype, users stare at a blank screen until the full answer is ready. Yelp migrated to FastAPI to support Server-Sent Events (SSE), allowing the UI to render text token-by-token as it generates. This significantly reduced the perceived wait time (Time-To-First-Byte). A minimal sketch combining streaming with early stopping appears after this list.
Parallelism: Serial execution wastes time. Yelp built asynchronous clients to run independent tasks concurrently. Question analysis steps run in parallel, as do data fetches from different stores (Lucene for text, Cassandra for facts).
Early Stopping: If the Trust & Safety classifier flags a request, the system immediately cancels all downstream tasks. This prevents wasting compute and retrieval resources on blocked queries.
Tiered Models: Running a large model for every step is slow and expensive. By restricting the large model (GPT-4o) to the final generation step and using fast fine-tuned models for the analysis pipeline, Yelp reduced costs and improved inference speed by nearly 20%.
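Here is the minimal FastAPI sketch referenced above, combining SSE streaming with early stopping; the safety check and token generator are stand-in stubs rather than Yelp’s actual services.

```python
# Minimal sketch of SSE streaming with early stopping in FastAPI.
# check_trust_and_safety() and generate_answer_tokens() are stand-in stubs.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def check_trust_and_safety(question: str) -> bool:
    return "ignore previous instructions" not in question.lower()  # stub

async def generate_answer_tokens(question: str):
    for token in ["Yes, ", "the ", "patio ", "is ", "heated."]:      # stub
        await asyncio.sleep(0.05)
        yield token

@app.get("/answer")
async def answer(question: str):
    async def event_stream():
        # Early stopping: a failed safety check short-circuits retrieval and generation.
        if not await check_trust_and_safety(question):
            yield "data: Sorry, I can't help with that.\n\n"
            return
        async for token in generate_answer_tokens(question):
            yield f"data: {token}\n\n"                 # one SSE event per token
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```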
Together, these techniques helped Yelp build a faster, more responsive system. At p50, the latency breakdown is:
Question analysis: ~1.4s
Retrieval: ~0.03s
Time to first byte: ~0.9s
Full answer generation: ~3.5s
In a prototype, evaluation is usually informal where developers try a handful of questions and tweak prompts until the result feels right. This approach is fragile because it only tests anticipated cases and often misses how real users phrase ambiguous queries. In production, failures show up as confident hallucinations or technically correct but unhelpful replies. Yelp observed this directly when their early prototype voice swung between overly formal and casual depending on slight wording changes.
A robust evaluation system must separate quality into distinct dimensions that can be scored independently. Yelp defined six primary dimensions. They rely on an LLM-as-a-judge system where a specialized grader evaluates a single dimension using a strict rubric. For example, the Correctness grader reviews the answer against retrieved snippets and assigns a label like “Correct” or “Unverifiable”.
The key learning from Yelp is that subjective dimensions like Tone and Style are difficult to automate reliably. While logical metrics like Correctness are easy to judge against evidence, tone is an evolving contract between the brand and the user. Rather than forcing an unreliable automated judge early, Yelp tackled this by co-designing principles with their marketing team and enforcing them via curated few-shot examples in the prompt.
Most teams can get a grounded assistant to work for a demo. The difficult part is engineering a system that stays fresh, fast, safe, and efficient under real traffic. Below are the key lessons from the journey to production.
1. Retrieval is never done. Keyword retrieval is often the fastest path to a shippable product because it leverages existing search infrastructure. However, in production, new question types and wordings keep appearing. These will expose gaps in your initial retrieval logic. You must design retrieval so you can evolve it without rewriting the whole pipeline. You start with keywords for high-precision intents (brands, locations, technical terms, many constraints), then add embeddings for more exploratory questions, and keep tuning based on log failures.
2. Prompt bloat silently erases cost wins. As you fix edge cases regarding tone, refusals, and citation formatting, the system prompt inevitably grows. Even if you optimize your retrieved context, this prompt growth can erase those savings. Treat prompts as code. Version them, review them, and track token counts and cost impact. Prefer modular prompt chunks and assemble them dynamically at runtime. Maintain an example library and retrieve only the few-shot examples that match the current case. Do not keep every example in the static prompt. Yelp relies on dynamic prompt composition that includes only the relevant instructions and examples for the detected question type. This keeps the prompt lean and focused. (A sketch of this dynamic composition appears after this list.)
3. Build modular guardrails. After launch, users will push every boundary. They ask for things you did not design for, try to bypass instructions, and shift their behavior over time. This includes unsafe requests, out-of-scope questions, and adversarial prompts. Trying to catch all of this with a single “safety check” becomes impossible to maintain. Instead, split guardrails into small tasks. Each task should have a clear decision and label set. Run these checks in parallel and give them the authority to cancel downstream work. If a check fails, the system should return immediately with the right response without paying for retrieval or generation.
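The sketch referenced in lesson 2: dynamic prompt composition that assembles only the instruction chunks and few-shot examples matching the detected question type. Chunk names and the example library here are illustrative.

```python
# Illustrative dynamic prompt composition: include only the instruction chunks
# and few-shot examples relevant to the detected question type.
BASE_RULES = "Answer only from the provided snippets and cite them like [S1]."

CHUNKS = {
    "hours":   "Prefer the structured hours attribute over reviews.",
    "vibe":    "Summarize the overall sentiment from reviews; stay neutral in tone.",
    "refusal": "If the evidence does not answer the question, say so briefly.",
}

EXAMPLE_LIBRARY = {
    "hours": ["Q: Is it open on Sundays? A: Yes, 10am-6pm [S2]."],
    "vibe":  ["Q: What's the vibe? A: Reviewers describe it as cozy and quiet [S1][S4]."],
}

def compose_prompt(question_type: str, snippets: list[str]) -> str:
    parts = [BASE_RULES, CHUNKS["refusal"]]
    if question_type in CHUNKS:
        parts.append(CHUNKS[question_type])
    parts.extend(EXAMPLE_LIBRARY.get(question_type, [])[:2])  # only matching examples
    parts.append("Snippets:\n" + "\n".join(snippets))
    return "\n\n".join(parts)

print(compose_prompt("vibe", ["[S1] Super cozy spot", "[S4] Quiet on weeknights"]))
```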
The Yelp Assistant on business pages is built as a multi-stage evidence-grounded system rather than a monolithic chatbot. The key takeaway is that the gap between a working prototype and a production assistant is substantial. Closing this gap requires more than just a powerful model. It requires a complete engineering system that ensures data stays fresh, answers remain grounded, and behavior stays safe.
Looking ahead, Yelp is focused on stronger context retention in longer multi-turn conversations, better business-to-business comparisons, and deeper use of visual language models to reason over photos more directly.
2026-02-08 00:30:23
If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guarantee
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
9 AI Concepts Explained in 7 minutes (YouTube video)
The Evolution of AI in Software Development
Git pull vs. git fetch
Agentic Browsers Workflow
[Subscriber Exclusive] Become an AI Engineer - Cohort 4
AI has fundamentally changed how engineers code. This shift can be described in three waves.
General-purpose LLMs (chat assistants)
Engineers treated general-purpose LLMs like a coding partner: you copied code into ChatGPT, asked why it was wrong, and manually applied the fix. This helped engineers move faster, but the workflow was slow and manual.
Coding LLMs (autocompletes)
Tools like Copilot and Cursor Tab brought AI into the editor. As you type, a coding model suggests the next few tokens and you accept or reject. It speeds up typing, but it cannot handle repo-level tasks.
Coding Agents
Coding agents handle tasks end-to-end. You ask “refactor my code”, and the agent searches the repo, edits multiple files, and iterates until tests pass. This is where most capable tools such as Claude Code and OpenAI’s Codex focus today.
Over to you: What do you think will be the next wave?
If you have ever mixed up “git pull” and “git fetch”, you’re not alone; even experienced developers get these two commands wrong. They sound similar, but under the hood, they behave very differently.
Let’s see how each command updates your repository:
Initial state: Your local repo is slightly behind the remote. The remote has new commits (R3, R4, R5), while your local “main” still ends at L3.
What git fetch actually does: git fetch downloads the new commits without touching your working branch. It only updates “origin/main”.
Think of it as: “Show me what changed, but don’t apply anything yet.”
What git pull actually does: git pull is a combination of “fetch + merge” commands. It downloads the new commits and immediately merges them into your local branch.
This is the command that updates both “origin/main” and your local “main”.
Think of it as: “Fetch updates and apply them now.”
Over to you: Which one do you use more often, “git pull” or “git fetch”?
Agentic browsers embed an agent that can read webpages and take actions in your browser.
Most agentic browsers have four major layers; a minimal loop tying them together is sketched after the list below.
Perception layer: Converts the current UI into model input. It starts with an accessibility tree snapshot. If the tree is incomplete or ambiguous, the agent takes a screenshot, sends it to a vision model (for example, Gemini Pro) to extract UI elements into a structured form, then uses that result to decide the next action.
Reasoning layer: Uses specialized agents for read-only browsing, navigation, and data entry. Separating roles improves reliability and lets you apply safety rules per agent.
Security layer: Enforces domain allowlisting and deterministic boundaries, such as restricted actions and confirmation steps, to reduce prompt injection risk.
Execution layer: Runs browser tools (click, type, upload, navigate, screenshot, tab operations) and refreshes state after each step.
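A heavily simplified sketch of how the four layers fit together in one loop; every class and function below is a hypothetical stub, not any specific browser’s API.

```python
# Highly simplified agent loop tying the four layers together.
# Everything below is a hypothetical stub, not a real browser API.
from dataclasses import dataclass

ALLOWED_DOMAINS = {"example.com"}             # security layer: domain allowlist
RESTRICTED_ACTIONS = {"purchase", "delete"}   # deterministic boundary: confirm first

@dataclass
class Action:
    name: str
    domain: str
    target: str = ""

class StubBrowser:
    def accessibility_snapshot(self):          # perception layer, cheap path
        return {"elements": ["search box", "results list"]}
    def screenshot(self):                      # perception layer, vision fallback
        return b"...png bytes..."
    def execute(self, action):                 # execution layer
        print(f"executing {action.name} on {action.domain}")

def decide(task, state):                       # reasoning layer (stubbed)
    return Action(name="done", domain="example.com")

def run_agent(browser, task, max_steps=20):
    for _ in range(max_steps):
        state = browser.accessibility_snapshot()          # perception
        action = decide(task, state)                      # reasoning
        if action.domain not in ALLOWED_DOMAINS:          # security: allowlist
            return "blocked: domain not allowed"
        if action.name in RESTRICTED_ACTIONS:             # security: confirmation
            return "paused: waiting for user confirmation"
        browser.execute(action)                           # execution
        if action.name == "done":
            return "task complete"
    return "step limit reached"

print(run_agent(StubBrowser(), "find opening hours"))
```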
Over to you: Do you think agentic browsers are reliable enough to be used at scale?
We are excited to announce Cohort 4 of Becoming an AI Engineer.
Because you’re part of this newsletter community, you get an exclusive discount not available anywhere else.
A one-time 40% discount. The code expires next Friday.
Use code: BBGNL
This is not just another course about AI frameworks and tools. Our goal is to help engineers build the foundation and end-to-end skill set needed to thrive as AI engineers.
Here’s what makes this cohort special:
Learn by doing: Build real-world AI applications, not just watch videos.
Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.
Live feedback and mentorship: Get direct feedback from instructors and peers.
Community driven: Learning alone is hard. Learning with a community is easy!
We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.
If you want to start learning AI from scratch, this is the perfect time to begin.
Dates: Feb 21 – March 29, 2026