
Scalability Patterns for Modern Distributed Systems

2025-11-14 00:30:22

When we talk about scalability in system design, we’re talking about how well a system can handle growth.

A scalable system can serve more users, process more data, and handle higher traffic without slowing down or breaking. It means you can increase the system’s capacity and throughput by adding resources (like servers, databases, or storage) while keeping performance, reliability, and cost under control. Think of scalability as a measure of how gracefully a system grows. A small application running on one server might work fine for a few thousand users, but if a million users arrive tomorrow, it could start to fail under pressure.

A scalable design allows you to add more servers or split workloads efficiently so that the system continues to perform well, even as demand increases. There are two main ways to scale a system: vertical scaling and horizontal scaling. Here’s what they mean:

  • Vertical scaling (or scaling up) means upgrading a single machine by adding more CPU, memory, or storage to make it stronger. This is simple to do but has limits. Eventually, a single machine reaches a maximum capacity, and scaling further becomes expensive or impossible.

  • Horizontal scaling (or scaling out) means adding more machines to share the workload. Instead of one powerful server, we have many regular servers working together. This approach is more flexible and is the foundation of most modern large-scale systems like Google, Netflix, and Amazon.

However, scalability is not just about adding hardware. It’s about designing software and infrastructure that can make effective use of that hardware. Poorly written applications or tightly coupled systems may not scale even if you double or triple the number of servers.

When evaluating scalability, we also need to look beyond simple averages. Metrics like p95 and p99 latency show how the slowest 5% or 1% of requests perform. These tail latencies are what users actually feel during peak times, and they reveal where the system truly struggles under load.

Similarly, error budgets help teams balance reliability and innovation. They define how much failure is acceptable within a given time period. For example, a 99.9% uptime target still allows about 43 minutes of downtime per month. Understanding these numbers helps teams make practical trade-offs instead of chasing perfection at the cost of progress.
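To make these numbers concrete, here is a minimal Python sketch. The latency samples are randomly generated stand-ins for real request logs; it computes p95/p99 by the nearest-rank method and the downtime allowed by a 99.9% uptime target:

```python
import random

# Hypothetical latency samples in milliseconds; in practice these would
# come from real request logs.
samples = sorted(random.uniform(20, 400) for _ in range(10_000))

def percentile(sorted_values, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * pct / 100))
    return sorted_values[index]

print(f"p50: {percentile(samples, 50):.1f} ms")
print(f"p95: {percentile(samples, 95):.1f} ms")  # the slowest 5% sit above this
print(f"p99: {percentile(samples, 99):.1f} ms")  # the slowest 1% sit above this

# Error budget: a 99.9% uptime target over a 30-day month.
minutes_per_month = 30 * 24 * 60
print(f"allowed downtime: {minutes_per_month * (1 - 0.999):.1f} minutes")  # ~43.2
```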

In this article, we will look at the top scalability patterns and their pros and cons.

Stateless Services

Read more

How Tinder Decomposed Its iOS Monolith App Handling 70M Users

2025-11-13 00:30:47

One major reason AI adoption stalls? Training. (Sponsored)

AI implementation often goes sideways due to unclear goals and the lack of a clear framework. This AI Training Checklist from You.com pinpoints common pitfalls and guides you to build a capable, confident team that can make the most out of your AI investment.

What you’ll get:

  • Key steps for building a successful AI training program

  • Guidance on overcoming employee resistance and fostering adoption

  • A structured worksheet to monitor progress and share across your organization

Set your AI initiatives on the right track.

Get the Checklist


Disclaimer: The details in this post have been derived from the details shared online by the Tinder Engineering Team. All credit for the technical details goes to the Tinder Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Tinder’s iOS app may look simple to its millions of users, but behind that smooth experience lies a complex codebase that must evolve quickly without breaking.

Over time, as Tinder kept adding new features, its iOS codebase turned into a monolith. In other words, it became a single, massive block of code where nearly every component was intertwined with others.

At first, this kind of structure is convenient. Developers can make changes in one place, and everything compiles together. However, as the app grew, the monolith became a bottleneck. Even small code changes required lengthy builds and extensive testing because everything was connected. Ownership boundaries became blurred: when so many teams touched the same code, it was hard to know who was responsible for what. Over time, making progress felt risky because each update could easily break something unexpected.

The root of the problem lies in how iOS applications are built. Ultimately, an iOS app compiles into a single binary artifact, meaning all modules and targets must come together in one final build.

In Tinder’s case, deep inter-dependencies between those targets stretched what engineers call the critical path: the longest chain of dependent tasks, which determines how long a build takes. Because so many components depended on one another, Tinder’s build system could not take full advantage of modern multi-core machines.

See the diagram below:

In simple terms, the system could not build parts of the app in parallel, forcing long waits and limiting developer productivity.

The engineering team’s goal was clear: flatten the build graph. This means simplifying and reorganizing the dependency structure so that more components can compile independently. By shortening the critical path and increasing parallelism, Tinder hoped to dramatically reduce build times and restore agility to its development workflow.
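As a rough illustration of why the critical path matters, the sketch below computes it for a toy build graph. The target names and build times are invented, not Tinder’s; with unlimited parallelism, total build time equals the critical path, so removing edges (flattening) directly shortens builds.

```python
from functools import lru_cache

# Hypothetical build graph: each target maps to the targets it depends on.
# An edge A -> B means "A cannot start building until B is done".
deps = {
    "App": ["Feature", "Networking"],
    "Feature": ["UIKitExtensions", "Networking"],
    "Networking": ["Core"],
    "UIKitExtensions": ["Core"],
    "Core": [],
}
build_seconds = {"App": 60, "Feature": 90, "Networking": 40,
                 "UIKitExtensions": 30, "Core": 50}

@lru_cache(maxsize=None)
def critical_path(target):
    """Longest chain of dependent build times ending at `target`."""
    return build_seconds[target] + max(
        (critical_path(d) for d in deps[target]), default=0
    )

print(critical_path("App"))  # 60 + 90 + 40 + 50 = 240 seconds
```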

See the diagram below that tries to demonstrate this concept:

In this article, we take a detailed look at how Tinder decomposed its monolith and the challenges it faced along the way.

Strategy: Modularizing with Compiler-Driven Planning

After identifying the core issue with the monolith, Tinder needed a methodical way to separate its massive iOS codebase into smaller, more manageable parts. Trying to do this by hand would have been extremely time-consuming and error-prone. Every file in the app was connected to many others, so removing one piece without understanding the full picture could easily break something else.

To avoid this, the Tinder engineering team decided to use the Swift compiler, which already understood how everything in the codebase was connected. Each time an iOS app is built, the compiler analyzes every file, keeping track of which files define certain functions or classes and which other files use them. These relationships are known as declarations and references.

For example, if File A defines a class and File B uses that class, the compiler knows there is a dependency from B to A. In simple terms, the compiler already has a map of how different parts of the app talk to each other. Tinder realized this built-in knowledge could be used as a blueprint for modularization.

By extracting these declarations and reference relationships, the team could build a dependency graph. This is a type of network diagram that visually represents how code files depend on each other. In this graph, each file in the app becomes a node, and each connection or dependency becomes a link (often called an edge). If File A imports something from File B, then a link is drawn from A to B.

This graph gave Tinder a clear and accurate picture of the monolith’s structure. Instead of relying on guesswork, they could now see which parts of the code were tightly coupled and which could safely be separated into independent modules.
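A toy version of this graph construction, with hypothetical Swift file names and reference pairs standing in for real compiler output, might look like:

```python
from collections import defaultdict

# Hypothetical compiler output: (referencing file, declaring file) pairs,
# i.e., "MatchList.swift uses a symbol declared in User.swift".
references = [
    ("ProfileView.swift", "User.swift"),
    ("MatchList.swift", "User.swift"),
    ("MatchList.swift", "ProfileView.swift"),
]

depends_on = defaultdict(set)   # file -> files it depends on
dependents = defaultdict(set)   # file -> files that rely on it
for src, dst in references:
    depends_on[src].add(dst)
    dependents[dst].add(src)

all_files = set(depends_on) | set(dependents)
# "Leaf" in Tinder's sense: a file no other file relies on, hence safe to move.
leaves = [f for f in all_files if not dependents[f]]
print(leaves)  # ['MatchList.swift']
```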

Execution: Phased Leaf-to-Root Extraction

After building the dependency graph, Tinder needed a way to separate the monolith without breaking the app. The graph made it clear which files were independent and which were deeply connected to others.

To make progress safely, the Tinder engineering team divided the work into phases. In each phase, they moved a group of files that had the fewest dependencies. These files were known as leaf nodes in the dependency graph. A leaf node is a file that no other files rely on, which makes it much easier to move without disrupting other parts of the system.

Starting with these leaf nodes helped the team limit the blast radius, which means reducing the potential side effects of each change. Since these files were less connected, moving them first carried a lower risk of introducing build errors or breaking functionality. This approach also simplified code reviews, because each phase affected only a small and manageable part of the codebase.

Once one phase was complete and verified to build correctly, Tinder moved on to the next set of files. Each successful phase made the monolith smaller and cleaner, allowing the next steps to be faster. Over time, this created a clear sense of progress, with the app continuously improving rather than waiting for a big final milestone.
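Continuing the toy example above, the phased peeling itself is a small loop: each phase takes every file that no remaining file relies on, then repeats on what is left.

```python
# Toy map: file -> set of files that depend on it.
dependents = {
    "User.swift": {"ProfileView.swift", "MatchList.swift"},
    "ProfileView.swift": {"MatchList.swift"},
    "MatchList.swift": set(),
}

def extraction_phases(dependents):
    """Peel the graph in phases: each phase moves files no remaining file relies on."""
    remaining = set(dependents)
    phases = []
    while remaining:
        movable = {f for f in remaining if not (dependents[f] & remaining)}
        if not movable:  # a dependency cycle would stall the process
            raise RuntimeError(f"cycle among: {remaining}")
        phases.append(sorted(movable))
        remaining -= movable
    return phases

for i, phase in enumerate(extraction_phases(dependents), start=1):
    print(f"Phase {i}: {phase}")
# Phase 1: ['MatchList.swift']
# Phase 2: ['ProfileView.swift']
# Phase 3: ['User.swift']
```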

By the time they reached the fourth phase, the Tinder team had already completed more than half of the entire decomposition work. This showed that the strategy naturally tackled the simpler and more independent files first, leaving the more complex, interdependent ones for later.

Source: Tinder Engineering Blog

What Moving a File Actually Means

Breaking a monolith into smaller modules may sound straightforward, but each “move” in the process has a ripple effect across the rest of the codebase. When the Tinder engineering team moved a file from the monolith into a new subtarget (which is essentially a separate Swift module), several adjustments had to be made to ensure the app still built and functioned correctly.

The process usually requires four main types of updates:

  • Dependencies: Every Swift module depends on others for certain features or utilities. When a file moves into a new module, Tinder had to update the list of module dependencies on both sides: the module receiving the file and the remaining monolith. This step ensured that each module still had access to the code it relied on.

  • Imports: In Swift, files use import statements to bring in functionality from other modules. When a file was moved out of the monolith, the import paths often changed. The team had to carefully update these statements wherever the file’s functions or classes were used, so everything continued to compile correctly.

  • Access Control: Swift limits how and where certain pieces of code can be accessed through visibility rules such as private, internal, and public. Once a file was extracted into a different module, its classes and methods were no longer visible to the rest of the app unless the team increased their access control level. This meant changing access modifiers to allow cross-module visibility, which required careful judgment to avoid exposing too much.

  • Dependency Injection Adjustments: The decomposition broke some of the shortcuts that had accumulated in the monolith, such as singletons or nearby function calls that assumed everything lived in the same codebase. To restore proper communication between modules, Tinder used dependency injection, a method that passes dependencies into a component rather than letting it reach out and find them on its own. This made the architecture cleaner and more modular, but required additional setup during extraction.
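The dependency injection point is easiest to see in code. Tinder’s codebase is Swift, but the idea is language-agnostic; here is a minimal Python sketch with invented class names:

```python
# Before: code reaches out to a global singleton, which silently assumes
# everything lives in the same module.
class Analytics:
    shared = None  # set somewhere at app startup

    def track(self, event: str) -> None:
        print(f"tracked: {event}")

class ProfileScreenCoupled:
    def on_like(self):
        Analytics.shared.track("like")  # hidden dependency

# After: the dependency is passed in, so ProfileScreen can live in its own
# module and be tested with a fake Analytics.
class ProfileScreen:
    def __init__(self, analytics: Analytics):
        self.analytics = analytics

    def on_like(self):
        self.analytics.track("like")

screen = ProfileScreen(analytics=Analytics())
screen.on_like()  # tracked: like
```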

In practice, moving a single file often required changes to several others. On average, each file extraction involved edits to about fifteen different files across the codebase. This might include fixing imports, updating dependencies, adjusting access levels, or modifying initialization code.

While the automation tools handled much of this mechanical work, engineers still needed to review every change to ensure that the refactoring preserved behavior and followed coding standards.

The Role of Automation

Breaking down a large monolith by hand is not only slow but also highly error-prone.

For Tinder, with thousands of interconnected files, doing this work manually would have been nearly impossible within a reasonable timeframe. Each time a file was moved, engineers would have needed to rebuild the app to check for new compile errors, fix those issues, and rebuild again. The number of required builds would have grown rapidly as more files were moved, creating what engineers call a quadratic growth problem, in which total effort grows roughly with the square of the number of files rather than in proportion to it.

To overcome this, the Tinder engineering team invested heavily in automation. They built tools that could perform the extraction process automatically, following the dependency graph that had already been created. These tools could:

  • Apply the Graph Plan in Bulk: The automation system took the planned sequence of file moves directly from the dependency graph and applied it automatically. It could move dozens of files at once, update dependencies, fix import statements, and adjust access controls without manual intervention.

  • Handle Merge Conflicts and Rebases Smoothly: In a large, active codebase, multiple teams commit changes every day. This can easily lead to merge conflicts, where different changes overlap in the same files. The automation system was designed to rerun cleanly after such changes. Instead of engineers fixing conflicts manually, the tool could simply replay the transformation from scratch, producing the same consistent result each time.

This automation completely changed the timeline of the project.

Tinder was able to fully decompose its iOS monolith in less than six months, whereas doing it manually was estimated to take around twelve years. The graph below shows the distribution of time spent on the decomposition effort.

The tools transformed what would have been a slow, iterative process into one that operated almost in constant time per phase, meaning each phase took roughly the same amount of effort regardless of its size.

Conclusion

Tinder’s journey from a massive iOS monolith to a modular, maintainable architecture is one of the most striking examples of how thoughtful engineering, automation, and discipline can transform development speed and reliability at scale. By relying on the compiler’s own knowledge to map dependencies, Tinder created a scientific, repeatable process for breaking the monolith apart without risking stability.

The results were remarkable.

  • Over 1,000 files were successfully extracted into modular targets without a single P0 incident, which means no critical production failures occurred during the entire effort.

  • Build times for the iOS app dropped by an impressive 78 percent, allowing engineers to iterate faster and ship new features with far greater confidence.

  • This technical progress was supported by a key policy shift: new files could no longer be added to the old monolith target. Instead, all new feature work had to be developed within the modular structure.

This change went beyond technology and tools. It created a cultural shift. Engineers now had fewer places to hide quick fixes or shortcuts, and higher standards of modularity, clarity, and ownership became the new normal across the team.

For other engineering organizations, Tinder’s experience offers several practical lessons.

  • Treat the compiler as a planner by using it to extract a file-level dependency graph and design modularization phases based on leaf removal.

  • Expect systematic fix-ups such as dependency updates, imports, access control changes, and dependency injection adjustments, and budget accordingly.

  • Automate wherever possible, since automation not only speeds up the work but also prevents errors and merge conflicts.

  • Finally, lock in the gains with clear policies that maintain the new modular structure and continuously measure the impact of those changes.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How Spotify Built Its Data Platform To Understand 1.4 Trillion Data Points

2025-11-12 00:31:32

Must-Read Books for High-Performance System Design (Sponsored)

Access 4 PDFs on building and optimizing data-intensive applications.

Read experts’ proven strategies for optimizing data-intensive applications, database performance, and latency:

  • Designing Data-Intensive Applications by Martin Kleppmann: Discover new ways of thinking about your distributed data system challenges, with actionable insights for building scalable, high-performance solutions.

  • Latency by Pekka Enberg: Learn how to expertly diagnose latency problems and master the low-latency techniques that have been predominantly “tribal knowledge” until now.

  • Database Performance at Scale: A Practical Guide: Discover new ways to optimize database performance and avoid common pitfalls, based on learnings from thousands of real-world database use cases.

  • ScyllaDB in Action by Bo Ingram: A practical guide to everything you need to know about ScyllaDB, from your very first queries to running it in a production environment.

Whether you’re working on large-scale systems or designing distributed data architectures, these books will prepare you for what’s next.

Access Books for Free


Disclaimer: The details in this post have been derived from the details shared online by the Spotify Engineering Team. All credit for the technical details goes to the Spotify Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Every day, Spotify processes around 1.4 trillion data points. These data points come from millions of users around the world listening to music, creating playlists, and interacting with the app in different ways. Handling this volume of information is not something a few ad hoc systems can manage. It requires a robust, well-designed data platform that can reliably collect, process, and make sense of all this information.

From payments to personalized recommendations to product experiments, almost every decision Spotify makes depends on data. This makes its data platform one of the most critical parts of the company’s overall technology stack.

Spotify did not build this platform overnight. In the early years, data systems were more improvised. As the company grew, the number of users increased, and the complexity of business decisions became greater. Different teams began collecting data for their own needs. Over time, this led to a growing need for a centralized, structured, and productized platform that could support the entire company, not just individual teams.

The shift toward a formal data platform was driven by both business and technical factors:

  • Business drivers: Spotify needed high-quality, reliable data to support its core functions, such as financial reporting, advertising, product experimentation, and personalized recommendations. Data had to be consistent and trustworthy so that teams could make important decisions with confidence.

  • Technical drivers: The sheer scale of the data meant the company needed strong infrastructure that could collect events from millions of devices, process them quickly, and make them available in usable form. It also required clear data ownership, easy searchability, built-in quality checks, and compliance with privacy regulations.

This evolution happened organically as Spotify’s products matured. With each new feature and business need, new data requirements emerged. Over time, this learning process shaped a platform that now supports everything from real-time streaming analytics to large-scale experimentation and machine learning applications.

In this article, we will look at how Spotify built its data platform and the challenges it faced along the way.


State of Trust: AI-driven attacks are getting more sophisticated (Sponsored)

AI-driven attacks are getting bigger, faster, and more sophisticated—making risk much more difficult to contain. Without automation to respond quickly to AI threats, teams are forced to react without a plan in place.

This is according to Vanta’s newest State of Trust report, which surveyed 3,500 business and IT leaders across the globe.

One big change since last year’s report? Teams falling behind AI risks—and spending way more time and energy proving trust than building it.

  • 61% of leaders spend more time proving security rather than improving it

  • 59% note that AI risks outpace their expertise

  • But 95% say AI is making their security teams more effective

Get the full report to learn how organizations are navigating these changes, and what early adopters are doing to stay ahead.

Download the report


Platform Composition and Evolution

In its early years, Spotify’s data operations were run by a single team. At one point, this team was managing Europe’s largest Hadoop cluster. For reference, Hadoop is an open-source framework used to store and process very large amounts of data across many computers.

At that stage, most data work was still centralized, and many processes were built manually or handled through shared infrastructure.

As Spotify grew, this model became too limited. The company needed more specialized tools and teams to handle the increasing scale and complexity. Over time, Spotify moved from that single Hadoop cluster to a multi-product data platform team. This new structure allowed them to separate the platform into clear functional areas, each responsible for a specific part of the data journey.

At the core of Spotify’s platform are three main building blocks that work together:

  • Data Collection: This part of the platform focuses on how data is gathered from millions of clients around the world. These “clients” include mobile apps, desktop apps, web browsers, and backend services. Every time a user plays a song, skips a track, adjusts the volume, or interacts with the app, an event is recorded. Spotify uses specialized tools and event delivery systems to collect these events in real time. This ensures that the data entering the platform is structured, consistent, and ready to be processed further.

  • Data Processing: Once the data is collected, it must be cleaned, transformed, and organized so it can be used effectively. This happens through data pipelines, which are automated workflows that process large amounts of information on a fixed schedule or in real time. These pipelines might do things like aggregating how many times a track was played in a day, linking a user’s activity to recommendation systems, or preparing data for financial reports. By scheduling and running thousands of pipelines, Spotify can provide up-to-date and accurate data to every team that needs it.

  • Data Management: This part ensures that the data is secure, private, and trustworthy. It includes data attribution, privacy controls, and security mechanisms to comply with regulations and internal governance rules. Data management also deals with data integrity, which means making sure the data is correct, consistent, and not corrupted during collection or processing.

These three areas are not isolated. They are deeply interconnected, forming a platform that is reliable, searchable, and easy to build on.

See the diagram below that shows the key components of the data platform:

For example, once data is collected and processed, it becomes searchable and can be used directly in other systems. One of the most important systems it powers is Spotify’s experimentation platform, called Confidence. This platform allows teams to run A/B tests and other experiments at scale, ensuring new product features are backed by real data before being fully launched.

Data Collection - Event Delivery at Scale

One of the most impressive parts of Spotify’s data platform is its data collection system.

Every time a user interacts with the app, whether by hitting play, searching for a song, skipping a track, or creating a playlist, an “event” is generated. Spotify’s data collection system is responsible for capturing and delivering all of these events reliably and at a massive scale.

Spotify collects more than one trillion events every day. This is an extraordinary amount of data flowing in constantly from mobile apps, desktop clients, web players, and other connected devices. To handle this, the architecture of the data collection system has evolved through multiple iterations over the years. Early versions were much simpler, but as the user base and product features expanded, the system had to be redesigned to keep up with growing scale and complexity.

How Developers Work with Event Data

At Spotify, product teams don’t need to build custom infrastructure to collect events. Instead, they use client SDKs that make it easy to define what events should be collected.

Here’s how the workflow looks:

  • A team defines a new event schema, which describes what kind of data will be collected and in what format. For example, a schema might specify that each “play” event should include the user ID (or an anonymous identifier), the song ID, the timestamp, and the device type.

  • Once the schema is defined, the platform automatically deploys all the infrastructure needed to handle that event. This includes:

    • Pub/Sub queues to reliably stream data as it arrives

    • Anonymization pipelines to remove or protect sensitive user information

    • Streaming jobs to process and route the data further down the platform

  • If a schema changes (for example, if a team adds a new field like “playlist ID”), the system automatically redeploys the affected components so the infrastructure stays in sync.
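As a rough sketch of what defining such a schema might look like (the fields follow the “play” example above, but Spotify’s actual SDK and schema format are internal):

```python
from dataclasses import dataclass, asdict
import json, time

# Hypothetical schema for a "play" event, mirroring the fields described above.
@dataclass(frozen=True)
class PlayEvent:
    user_id: str        # or an anonymous identifier
    song_id: str
    timestamp_ms: int
    device_type: str    # "mobile", "desktop", "web", ...

def emit(event: PlayEvent) -> None:
    """Stand-in for the SDK call that publishes to the Pub/Sub queue."""
    print(json.dumps(asdict(event)))

emit(PlayEvent("anon-123", "track-42", int(time.time() * 1000), "mobile"))
```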

See the diagram below:

This level of automation is made possible through Kubernetes Operators. An operator is a special kind of software that manages complex applications running on Kubernetes, which is the container orchestration system Spotify uses to run its services.

In simple terms, operators allow Spotify to treat data infrastructure as code, so changes are applied quickly and reliably without manual work.

Built-in Privacy and Security

Handling user data at this scale comes with serious privacy responsibilities. Spotify builds anonymization directly into its pipelines to ensure that sensitive information is protected before it ever reaches downstream systems.

They also use internal key-handling systems, which help control how and when certain pieces of data can be accessed or decrypted. This is essential for compliance with privacy regulations like GDPR and for maintaining user trust.

Centralization and Self-Service

A key strength of Spotify’s data collection system is its ownership model. Instead of making the central infrastructure team responsible for every change, Spotify has designed the platform so that product teams can manage most of their own event data.

This means a team can:

  • Add or modify event schemas

  • Deploy event pipelines

  • Make small adjustments to how their data is processed

They can do all this without depending on the central platform team. This balance between centralization and self-service helps the platform scale to thousands of active users inside the company while keeping operational overhead low.

Breadth of Event Types

The platform currently handles around 1,800 different event types, each capturing different kinds of user interactions and system signals.

There are dedicated teams responsible for:

  • Maintaining the event delivery infrastructure

  • Managing the client SDKs that developers use

  • Building “journey datasets”, which combine multiple event streams into structured, meaningful timelines

  • Supporting the underlying infrastructure that keeps the system running smoothly

This massive, well-structured data collection layer forms the foundation of Spotify’s entire data platform. Without it, the rest of the platform (processing, management, analytics, and experimentation) would not be possible. It ensures that the right data is captured, secured, and made available at the right time for everything from recommendations to business decisions.

Data Management and Data Processing

Once data is collected, Spotify needs to transform it into something meaningful and trustworthy. This is where data processing and management come into play. At Spotify’s scale, this is a massive and complex operation that must run reliably every hour of every day.

Spotify runs more than 38,000 active data pipelines on a regular schedule. Some run hourly, while others run daily, depending on the business need. A data pipeline is essentially an automated workflow that moves and transforms data from one place to another.

For example:

  • A pipeline might take raw event data from user streams and aggregate it into daily summaries of how many times each song was played.

  • Another pipeline might prepare datasets that support recommendation algorithms.

  • Yet another might generate financial reports.

Operating this many pipelines requires a strong focus on scalability (handling growth efficiently), traceability (understanding where data comes from and how it changes), searchability (finding the right datasets quickly), and regulatory compliance (meeting privacy and data retention requirements).

The Execution Stack

To run these pipelines, Spotify uses a scheduler that automatically triggers workflows at the right times. These workflows are executed on:

  • BigQuery: Google’s cloud data warehouse, which allows fast analysis of very large datasets.

  • Flink or Dataflow: frameworks for processing data streams or large batches of data in parallel.

Most pipelines are written using Scio, a Scala library built on top of Apache Beam. Apache Beam is a framework that allows developers to write data processing jobs once and then run them on different underlying engines like Flink or Dataflow. This gives Spotify flexibility and helps standardize development across teams.
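Scio itself is Scala, but the same Beam programming model exists in Python. A toy version of the daily play-count pipeline from the examples above might look like this; the events are created in memory here, whereas a real pipeline would read from the event delivery system:

```python
import apache_beam as beam

# Toy event stream standing in for real collected events.
events = [
    {"song_id": "track-1", "date": "2025-11-11"},
    {"song_id": "track-1", "date": "2025-11-11"},
    {"song_id": "track-2", "date": "2025-11-11"},
]

with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(events)
        | "KeyBySongAndDay" >> beam.Map(lambda e: ((e["song_id"], e["date"]), 1))
        | "CountPlays" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
        # (('track-1', '2025-11-11'), 2)  (('track-2', '2025-11-11'), 1)
    )
```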

Each pipeline in Spotify’s platform produces a data endpoint. An endpoint is essentially the final dataset that other teams and systems can use. These endpoints are treated like products with well-defined characteristics:

  • Explicit schemas so everyone knows exactly what data is included and in what format.

  • Multi-partitioning to organize data efficiently, often by time or other logical dimensions.

  • Retention policies that specify how long the data is stored.

  • Access Control Lists (ACLs) that define who can view or modify the data.

  • Lineage tracking that records where the data came from and what transformations were applied.

  • Quality checks to catch errors early and ensure the data is trustworthy.

Platform as Code

Spotify has built the platform in a way that allows engineers to define their pipelines and endpoints as code. This means all configuration and resource definitions are stored in the same place as the pipeline’s source code.

As mentioned, the system uses custom Kubernetes Operators to manage this. When developers push updates to their code repositories, the operators automatically deploy the necessary resources. This code ownership model ensures that the team responsible for a pipeline also controls its configuration and lifecycle, reducing dependency on centralized teams.
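As a loose illustration of “platform as code”, an endpoint definition living next to the pipeline source might resemble the following. Every field name here is invented, since Spotify’s resource schema is internal; the operators watch such definitions and reconcile the actual cloud resources to match.

```python
# Hypothetical endpoint definition kept next to the pipeline's source code.
ENDPOINT = {
    "name": "daily-track-plays",
    "schema": {"song_id": "STRING", "date": "DATE", "plays": "INT64"},
    "partitioning": ["date"],          # multi-partitioning, here by day
    "retention_days": 365,             # how long partitions are kept
    "acl": {"readers": ["analytics"], "owners": ["data-platform"]},
    "quality_checks": ["non_empty", "row_count_within_7d_average"],
}
```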

With tens of thousands of pipelines, keeping everything healthy and efficient is a major priority. Spotify has a strong operations and observability layer that includes:

  • Alerts for late pipelines, long-running jobs, or failures.

  • Endpoint health monitoring to make sure datasets are fresh and accurate.

  • Backstage integration to provide a single interface where teams can view and manage their data resources.

Backstage is an internal developer portal that Spotify open-sourced. It brings together tools for monitoring, cost analysis, quality assurance, and documentation. Instead of searching across many systems, engineers can manage everything from one central place.

Conclusion

Spotify’s data platform is a great example of how a company can evolve its infrastructure to meet growing business and technical demands. What began as a small team managing an on-premises Hadoop cluster has grown into a platform run by more than a hundred engineers, operating entirely on Google Cloud.

This transformation did not happen overnight. Spotify aligned its organizational needs with its technical investments, defined clear goals, and built strong feedback channels with its internal users. By starting small, iterating, and learning from each stage of growth, they created a platform that balances centralized infrastructure with self-service capabilities for product teams. Other organizations can take valuable lessons from this journey.

A dedicated data platform becomes essential when:

  • Teams need searchable and democratized data across business and engineering functions.

  • Financial reporting and operational metrics require predictable, reportable pipelines.

  • Data quality and trust are critical for decision-making.

  • Experimentation and development need efficient workflows with strong tooling.

  • Machine learning initiatives depend on well-organized and structured datasets.

References:



How Uber Built a Conversational AI Agent For Financial Analysis

2025-11-11 00:31:11

Stream Smarter with Amazon S3 + Redpanda (Sponsored)

Join us live on November 12 for a Redpanda Tech Talk with AWS experts exploring how to connect streaming and object storage for real-time, scalable data pipelines. Redpanda’s Chandler Mayo and AWS Partner Solutions Architect Dr. Art Sedighi will show how to move data seamlessly between Redpanda Serverless and Amazon S3 — no custom code required. Learn practical patterns for ingesting, exporting, and analyzing data across your streaming and storage layers. Whether you’re building event-driven apps or analytics pipelines, this session will help you optimize for performance, cost, and reliability.

Sign Up Now


Disclaimer: The details in this post have been derived from the details shared online by the Uber Engineering Team. All credit for the technical details goes to the Uber Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

For a company operating at Uber’s scale, financial decision-making depends on how quickly and accurately teams can access critical data. Every minute spent waiting for reports can delay decisions that impact millions of transactions worldwide.

Uber Engineering Team recognized that their finance teams were spending a significant amount of time just trying to retrieve the right data before they could even begin their analysis.

Historically, financial analysts had to log into multiple platforms like Presto, IBM Planning Analytics, Oracle EPM, and Google Docs to find relevant numbers. This fragmented process created serious bottlenecks. Analysts often had to manually search across different systems, which increased the risk of using outdated or inconsistent data. If they wanted to retrieve more complex information, they had to write SQL queries. This required deep knowledge of data structures and constant reference to documentation, which made the process slow and prone to errors.

In many cases, analysts submitted requests to the data science team to get the required data, which introduced additional delays of several hours or even days. By the time the reports were ready, valuable time had already been lost.

For a fast-moving company, this delay in accessing insights can limit the ability to make informed, real-time financial decisions.

Uber Engineering Team set out to solve this. Their goal was clear: build a secure and real-time financial data access layer that could live directly inside the daily workflow of finance teams. Instead of navigating multiple platforms or writing SQL queries, analysts should be able to ask questions in plain language and get answers in seconds.

This vision led to the creation of Finch, Uber’s conversational AI data agent. Finch is designed to bring financial intelligence directly into Slack, the communication platform already used by the company’s teams. In this article, we will look at how Uber built Finch and how it works under the hood.

What is Finch?

To solve the long-standing problem of slow and complex data access, the Uber Engineering Team built Finch, a conversational AI data agent that lives directly inside Slack. Instead of logging into multiple systems or writing long SQL queries, finance team members can simply type a question in natural language. Finch then takes care of the rest.

See the comparison table below that shows how Finch stands out from other AI finance tools.

At its core, Finch is designed to make financial data retrieval feel as easy as sending a message to a colleague. When a user types a question, Finch translates the request into a structured SQL query behind the scenes. It identifies the right data source, applies the correct filters, checks user permissions, and retrieves the latest financial data in real time.

Security is built into this process through role-based access controls (RBAC). This ensures that only authorized users can access sensitive financial information. Once Finch retrieves the results, it sends the response back to Slack in a clean, readable format. If the data set is large, Finch can automatically export it to Google Sheets so that users can work with it directly without any extra steps.

For example, the user might ask: “What was the GB value in US&C in Q4 2024?”

Finch quickly finds the relevant table, builds the appropriate SQL query, executes it, and returns the result right inside Slack. The user gets a clear, ready-to-use answer in seconds instead of spending hours searching, writing queries, or waiting for another team.
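A heavily simplified sketch of that request path follows, with an invented table, role map, and a hard-coded stand-in for the LLM’s SQL generation step; Uber’s actual prompt, schema, and RBAC internals are not public.

```python
import sqlite3

# Role-based access control: which tables each role may query (invented).
ALLOWED_TABLES = {"finance_team": {"bookings_mart"}}

def to_sql(question: str):
    # Stand-in for the LLM step: the real system generates SQL from the
    # question plus schema metadata.
    return (
        "SELECT SUM(gross_bookings) FROM bookings_mart "
        "WHERE region = 'US&C' AND quarter = '2024-Q4'",
        "bookings_mart",
    )

def answer(question: str, role: str) -> list:
    sql, table = to_sql(question)
    if table not in ALLOWED_TABLES.get(role, set()):
        raise PermissionError(f"{role} may not query {table}")
    conn = sqlite3.connect("marts.db")  # curated single-table data mart
    try:
        return conn.execute(sql).fetchall()  # rows formatted for Slack upstream
    finally:
        conn.close()
```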

Finch Architecture Overview

The design of Finch is centered on three major goals: modularity, security, and accuracy in how large language models generate and execute queries.

Uber Engineering Team built the system so that each part of the architecture can work independently while still fitting smoothly into the overall data pipeline. This makes Finch easier to scale, maintain, and improve over time.

The diagram below shows the key components of Finch:

At the foundation of Finch is its data layer. Uber uses curated, single-table data marts that store key financial and operational metrics. Instead of allowing queries to run on large, complex databases with many joins, Finch works with simplified tables that are optimized for speed and clarity.

To make Finch understand natural language better, the Uber Engineering Team built a semantic layer on top of these data marts. This layer uses OpenSearch to store natural language aliases for both column names and their values. For example, if someone types “US&C,” Finch can map that phrase to the correct column and value in the database. This allows the model to do fuzzy matching, meaning it can correctly interpret slightly different ways of asking the same question. This improves the accuracy of WHERE clauses in the SQL queries Finch generates, which is often a weak spot in many LLM-based data agents.
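Here is a toy stand-in for that semantic layer, using a plain dictionary plus difflib in place of OpenSearch; the aliases and column names are invented:

```python
import difflib

# Toy alias index: natural language phrases -> (column, value) pairs.
ALIASES = {
    "us&c": ("region", "US&C"),
    "us and canada": ("region", "US&C"),
    "gross bookings": ("metric", "gross_bookings"),
}

def resolve(term: str):
    """Fuzzy-match a user phrase to a (column, value) pair for the WHERE clause."""
    match = difflib.get_close_matches(term.lower(), list(ALIASES), n=1, cutoff=0.6)
    if not match:
        raise KeyError(f"no mapping for {term!r}")
    return ALIASES[match[0]]

print(resolve("US and Canada"))  # ('region', 'US&C')
print(resolve("gros bookings"))  # typo still resolves to ('metric', 'gross_bookings')
```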

Finch’s architecture combines several key technologies that work together to make the experience seamless for finance teams.

  • Generative AI Gateway: This is Uber’s internal infrastructure for accessing multiple large language models, both self-hosted and third-party. It allows the team to swap models or upgrade them without changing the overall system.

  • LangChain and LangGraph: These frameworks are used to orchestrate specialized agents inside Finch, such as the SQL Writer Agent and the Supervisor Agent. Each agent has a specific role, and LangGraph coordinates how the agents work together in sequence to understand a question, plan the query, and return the result.

  • OpenSearch: This is the backbone of Finch’s metadata indexing. It stores the mapping between natural language terms and the actual database schema. This makes Finch much more reliable when handling real-world language variations.

  • Slack SDK and Slack AI Assistant APIs: These enable Finch to connect directly with Slack. The APIs allow the system to update the user with real-time status messages, offer suggested prompts, and provide a smooth, chat-like interface. This means analysts can interact with Finch as if they were talking to a teammate.

  • Google Sheets Exporter: For larger datasets, Finch can automatically export results to Google Sheets. This removes the need to copy data manually and allows users to analyze results with familiar spreadsheet tools.

Finch Agentic Workflow

One of the most important elements of Finch is how its different components work together to handle a user’s query.

Uber Engineering Team designed Finch to operate through a structured orchestration pipeline, where each agent in the system has a clear role:

  • The process starts when a user enters a question in Slack. This can be something as simple as “What were the GB values for US&C in Q4 2024?”

  • Once the message is sent, the Supervisor Agent receives the input. Its job is to figure out what type of request has been made and route it to the right sub-agent. For example, if it’s a data retrieval request, the Supervisor Agent will send the task to the SQL Writer Agent.

  • After routing, the SQL Writer Agent fetches metadata from OpenSearch, which contains mappings between natural language terms and actual database columns or values. This step is what allows Finch to correctly interpret terms like “US&C” or “gross bookings” without the user needing to know the exact column names or data structure.

  • Next, Finch moves into query construction. The SQL Writer Agent uses the metadata to build the correct SQL query against the curated single-table data marts. It ensures that the right filters are applied and the correct data source is used. Once the query is ready, Finch executes it.

  • While this process runs in the background, Finch provides live feedback through Slack. The Slack callback handler updates the user in real time, showing messages like “identifying data source,” “building SQL,” or “executing query.” This gives users visibility into what Finch is doing at each step.

  • Finally, once the query is executed, the results are returned directly to Slack in a structured, easy-to-read format. If the data set is too large, Finch automatically exports it to Google Sheets and shares the link.

  • Users can also ask follow-up questions like “Compare to Q4 2023,” and Finch will refine the context of the conversation to deliver updated results.

See the diagram below that shows the data agent’s context building flow:
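Stripping away the LLM calls and Slack plumbing, the routing step at the heart of this workflow reduces to something like the sketch below; the keyword-based classifier is a crude stand-in for the Supervisor Agent’s model call.

```python
def supervisor(message: str) -> str:
    """Decide which sub-agent should handle the request."""
    if any(w in message.lower() for w in ("what was", "what were", "compare")):
        return "sql_writer"
    return "document_reader"

def sql_writer(message: str) -> str:
    # fetch metadata from the alias index, build SQL, execute, format
    return "SELECT ... -- built from metadata for: " + message

AGENTS = {"sql_writer": sql_writer,
          "document_reader": lambda m: "doc lookup: " + m}

question = "What were the GB values for US&C in Q4 2024?"
print(AGENTS[supervisor(question)](question))
```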

Finch’s Accuracy and Performance Evaluation

For Finch to be useful at Uber’s scale, it must be both accurate and fast. A conversational data agent that delivers wrong or slow answers would quickly lose the trust of financial analysts.

Uber Engineering Team built Finch with multiple layers of testing and optimization to ensure it performs consistently, even when the system grows more complex. There were two main areas:

Continuous Evaluation

Uber continuously evaluates Finch to make sure each part of the system works as expected. Here are the key evaluation steps:

  • This starts with sub-agent evaluation, where agents like the SQL Writer and Document Reader are tested against “golden queries.” These golden queries are the correct, trusted outputs for a set of common use cases. By comparing Finch’s output to these expected answers, the team can detect any drop in accuracy.

  • Another key step is checking the Supervisor Agent routing accuracy. When users ask questions, the Supervisor Agent decides which sub-agent should handle the request. Uber tests this decision-making process to catch issues where similar queries might be routed incorrectly, such as confusing data retrieval tasks with document lookup tasks.

  • The system also undergoes end-to-end validation, which involves simulating real-world queries to ensure the full pipeline works correctly from input to output. This helps catch problems that might not appear when testing components in isolation.

  • Finally, regression testing is done by re-running historical queries to see if Finch still returns the same correct results. This allows the team to detect accuracy drift before any model or prompt updates are deployed.
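A golden-query check of the kind described above can be as small as the sketch below; the query, expected SQL, and generate_sql hook are hypothetical stand-ins for Finch’s SQL Writer Agent.

```python
GOLDEN = {
    "What was the GB value in US&C in Q4 2024?":
        "SELECT SUM(gross_bookings) FROM bookings_mart "
        "WHERE region = 'US&C' AND quarter = '2024-Q4'",
}

def normalize(sql: str) -> str:
    return " ".join(sql.lower().split())

def test_sql_writer(generate_sql) -> None:
    """Flag accuracy drift by comparing generated SQL against trusted outputs."""
    failures = [
        q for q, expected in GOLDEN.items()
        if normalize(generate_sql(q)) != normalize(expected)
    ]
    assert not failures, f"accuracy drift on: {failures}"

# e.g., run test_sql_writer(sql_writer_agent) before any model or prompt update
```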

Performance Optimization

Finch is built to deliver answers quickly, even when handling a large volume of queries.

Uber Engineering Team optimized the system to minimize database load by making SQL queries more efficient. Instead of relying on one long, blocking process, Finch uses multiple sub-agents that can work in parallel, reducing latency.

To make responses even faster, Finch pre-fetches frequently used metrics. This means that some common data is already cached or readily accessible before users even ask for it, leading to near-instant responses in many cases.
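A minimal sketch of that pre-fetch idea, using a simple TTL cache (the real system’s cache policy and keys are not described in the source):

```python
import time

CACHE: dict[str, tuple[float, object]] = {}
TTL_SECONDS = 300

def get_metric(key: str, compute):
    """Return a cached metric if fresh; otherwise compute and cache it."""
    now = time.monotonic()
    hit = CACHE.get(key)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                # near-instant response
    value = compute()                # otherwise run the query
    CACHE[key] = (now, value)
    return value

# A background job can call get_metric for popular keys ahead of demand.
```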

Conclusion

Finch represents a major step in how financial teams at Uber access and interact with data.

Instead of navigating multiple platforms, writing complex SQL queries, or waiting for data requests to be fulfilled, analysts can now get real-time answers inside Slack using natural language. By combining curated financial data marts, large language models, metadata enrichment, and secure system design, the Uber Engineering Team has built a solution that removes layers of friction from financial reporting and analysis.

The architecture of Finch shows a thoughtful balance between innovation and practicality. It uses a modular agentic workflow to orchestrate different specialized agents, ensures accuracy through continuous evaluation and testing, and delivers low latency through smart performance optimizations. The result is a system that not only works reliably at scale but also fits seamlessly into the daily workflow of Uber’s finance teams.

Looking ahead, Uber plans to expand Finch even further. The roadmap includes deeper FinTech integration to support more financial systems and workflows across the organization. For executive users like the CEO and CFO, Uber Engineering Team is introducing a human-in-the-loop validation system, where a “Request Validation” button will allow critical answers to be reviewed by subject matter experts before final approval. This will increase trust in Finch’s responses for high-stakes decisions.

The team is also working to support more user intents and specialized agents, expanding Finch beyond simple data retrieval into richer financial use cases such as forecasting, reporting, and automated analysis. As these capabilities grow, Finch will evolve from being a helpful assistant into a central intelligence layer for Uber’s financial operations.

See the diagram below that shows a glimpse of Finch’s Intent Flow Future State:

References:



EP188: Servers You Should Know in Modern Systems

2025-11-09 00:30:48

The Developer’s Guide to MCP Auth (Sponsored)

Securely authorizing access to an MCP server is complex. You need PKCE, scopes, consent flows, and a way to revoke access when needed.

Learn from WorkOS how to implement OAuth 2.1 in a production-ready setup, with clear steps and examples.

Read the guide →


This week’s system design refresher:

  • Design a Web Crawler: FAANG Interview Question (Youtube video)

  • The AI Engineering Cohort 2 Starts Today!

  • Servers You Should Know in Modern Systems

  • What is Prompt Engineering? (Youtube video)

  • The Building Blocks of Modern Networking

  • Network Services That Power Modern Connectivity

  • SPONSOR US


Design a Web Crawler: FAANG Interview Question


The AI Engineering Cohort 2 Starts Today!

This is a live, cohort-based course created in collaboration with best-selling author Ali Aminian and published by ByteByteGo.

Here’s what makes this cohort special:

  • Learn by doing: Build real world AI applications, not just by watching videos.

  • Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

  • Live feedback and mentorship: Get direct feedback from instructors and peers.

  • Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect time to begin.

Check it out here


Servers You Should Know in Modern Systems


What is Prompt Engineering?


The Building Blocks of Modern Networking

Every modern network, from home Wi-Fi to global cloud infrastructure, is built on a few essential components. Here’s a quick overview of the Building Blocks of Modern Networking:

  • Core Networking: Switches connect devices within a local network. Every office has dozens of these. Routers move packets between different networks. Your gateway to the internet and beyond. SD-WAN is how modern companies connect branch offices. Software-defined, flexible, way cheaper.

    DNS translates domain names to IP addresses. DHCP hands out IP addresses automatically. NTP keeps clocks synchronized across all systems.

  • Network Security: Firewalls are your first line of defense. Next-gen versions can inspect traffic at the application level. VPNs create encrypted tunnels for remote access and site-to-site connections. Remote work runs on this. IDS/IPS detects and blocks malicious traffic before it reaches your servers.

  • Delivery (Traffic management): Load Balancers distribute requests across multiple servers. One server goes down? Users never notice. Reverse Proxy sits in front of your backend servers, handling SSL termination and caching. API Gateway manages all your API traffic.

  • Identity & Trust: Identity Provider is your single source of truth for user authentication. Think Okta, Azure AD, Auth0. RADIUS/AAA handles network device authentication. PKI manages digital certificates and encryption keys. HTTPS wouldn’t exist without it.

  • Operations: SIEM collects and analyzes security events from across your entire infrastructure. NMS monitors network health and performance. Alerts you before users start complaining.

  • Edge: Access Points provide WiFi coverage. IoT Gateway connects sensors, cameras, and smart devices to your network. The bridge between operational tech and IT.

  • Infrastructure: NFV runs network functions as software instead of dedicated hardware. Virtual firewalls, virtual routers, virtual everything.

Over to you: What component from this list do you want a deep dive on next?


Network Services That Power Modern Connectivity

Every time you open a browser, send an email, or connect to a VPN, these network services quietly make it possible.

  • DNS: Resolves domain names to IP addresses so users can reach websites without memorizing numbers.

  • DHCP: Automatically assigns IPs and network settings to devices joining the network.

  • NTP: Keeps clocks synchronized across systems to ensure consistent logs and authentication.

  • SSH: Enables secure remote login and encrypted file transfers over port 22.

  • RDP: Allows remote desktop access to Windows systems through port 3389.

  • Mail (SMTP submission): Sends emails securely from clients to mail servers.

  • HTTPS / HTTP3 (QUIC): Secures web, API, and app communication over encrypted channels.

  • LDAP (over TLS): Acts as a central directory for enterprise logins and access control.

  • OAuth 2.0 / OpenID Connect: Powers modern authentication flows like “Sign in with Google.”

  • MySQL / PostgreSQL / Oracle: Handle backend data storage and retrieval for web and mobile applications.

  • WireGuard / IPsec: Create encrypted tunnels for private and remote network access.
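As a tiny illustration of the DNS and HTTPS entries above, Python’s standard library can exercise both without any extra tooling:

```python
import socket

# DNS: resolve a name to its addresses using only the standard library.
for *_, sockaddr in socket.getaddrinfo("example.com", 443, proto=socket.IPPROTO_TCP):
    print(sockaddr[0])

# HTTPS reachability: open a TCP connection to port 443.
with socket.create_connection(("example.com", 443), timeout=3):
    print("tcp/443 reachable")
```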

Over to you: When something breaks, which protocol do you check first, DNS, DHCP, or HTTPS?


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

Last Chance to Enroll | Become an AI Engineer | Cohort 2

2025-11-08 00:30:37

After the incredible success of our first cohort, with nearly 500 participants, we’re thrilled to announce the launch of Cohort 2 of Become an AI Engineer! Our second cohort begins in less than one day.

Check it out Here

This is not just another course about AI frameworks and tools. Our goal is to help engineers build the foundation and end-to-end skill set needed to thrive as AI engineers.

Here’s what makes this cohort special:

  • Learn by doing: Build real world AI applications, not just by watching videos.

  • Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.

  • Live feedback and mentorship: Get direct feedback from instructors and peers.

  • Community driven: Learning alone is hard. Learning with a community is easy!

We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.

If you want to start learning AI from scratch, this is the perfect time to begin.

Check it out Here