
RSS preview of the ByteByteGo blog

Inside Airbnb’s AI-Powered Pipeline to Migrate Tests: Months of Work in Days

2025-06-24 23:30:34

DevOps Roadmap: Future-proof Your Engineering Career (Sponsored)

Full-stack isn't enough anymore. Today's top developers also understand DevOps.

Our actionable roadmap cuts straight to what matters.

Built for busy coders, this step-by-step guide maps out the essential DevOps skills that hiring managers actively seek and teams desperately need.

Stop feeling overwhelmed and start accelerating your market value. Join thousands of engineers who've done the same.

GRAB YOUR FREE ROADMAP

This guide was created exclusively for ByteByteGo readers by TechWorld with Nana


Disclaimer: The details in this post have been derived from the articles/videos shared online by the Airbnb Engineering Team. All credit for the technical details goes to the Airbnb Engineering Team. The links to the original articles and videos are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Code migrations are usually a slow affair. Dependencies change, frameworks evolve, and teams get stuck rewriting thousands of lines that don’t even change product behavior. 

That was the situation at Airbnb. 

Thousands of React test files still relied on Enzyme, a tool that hadn’t kept up with modern React patterns. The goal was clear: move everything to React Testing Library (RTL). However, with over 3,500 files in scope, the effort appeared to be a year-long grind of manual rewrites.

Instead, the team finished it in six weeks.

The turning point was the use of AI, specifically Large Language Models (LLMs), not just as assistants, but as core agents in an automated migration pipeline. By breaking the work into structured, per-file steps, injecting rich context into prompts, and systematically tuning feedback loops, the team transformed what looked like a long, manual slog into a fast, scalable process.

This article unpacks how that migration happened. It covers the structure of the automation pipeline, the trade-offs behind prompt engineering vs. brute-force retries, the methods used to handle complex edge cases, and the results that followed. 


Where Fintech Engineers Share How They Actually Build (Sponsored)

Built by developers, for developers, fintech_devcon is the go-to technical conference for engineers and product leaders building next-generation financial infrastructure.

  • Why attend? It’s dev-first, focused on deep, educational content (with no sales pitches). Hear from builders at Wise, Block, Amazon, Adyen, Plaid, and more.

  • What will you learn? Practical sessions on AI, payment flows, onboarding, dev tools, security, and more. Expect code, architecture diagrams, and battle-tested lessons.

  • When and where? Happening in Denver, August 4–6. Use code BBG25 to save $195.

Still on the fence? Watch past sessions, including Kelsey Hightower’s phenomenal 2024 keynote.

See the agenda


The Need for Migration

Enzyme, adopted in 2015, provided fine-grained access to the internal structure of React components. This approach matched earlier versions of React, where testing internal state and component hierarchy was a common pattern.

By 2020, Airbnb had shifted all new test development to React Testing Library (RTL). 

RTL encourages testing components from the perspective of how users interact with them, focusing on rendered output and behavior, not implementation details. This shift reflects modern React testing practices, which prioritize maintainability and resilience to refactoring.

However, thousands of existing test files at Airbnb were still using Enzyme. Migrating them introduced several challenges:

  • Different testing models: Enzyme relies on accessing component internals. RTL operates at the DOM interaction level. Tests couldn’t be translated line-for-line and required structural rewrites.

  • Risk of coverage loss: Simply removing legacy Enzyme tests would leave significant gaps in test coverage, particularly for older components no longer under active development.

  • Manual effort was prohibitive: Early projections estimated over a year of engineering time to complete the migration manually, which was too costly to justify.

The migration was necessary to standardize testing across the codebase and support future React versions, but it had to be automated to be feasible.

Migration Strategy and Proof of Concept

The first indication that LLMs could handle this kind of migration came during a 2023 internal hackathon. A small team tested whether a large language model could convert Enzyme-based test files to RTL. Within days, the prototype successfully migrated hundreds of files. The results were promising in terms of accuracy as well as speed.

That early success laid the groundwork for a full-scale solution. In 2024, the engineering team formalized the approach into a scalable migration pipeline. The goal was clear: automate the transformation of thousands of test files, with minimal manual intervention, while preserving test intent and coverage.

To get there, the team broke the migration process into discrete, per-file steps that could be run independently and in parallel. Each step handled a specific task, like replacing Enzyme syntax, fixing Jest assertions, or resolving lint and TypeScript errors. When a step failed, the system invoked an LLM to rewrite the file using contextual information.

This modular structure made the pipeline easy to debug, retry, and extend. More importantly, it made it possible to run migrations across hundreds of files concurrently, accelerating throughput without sacrificing quality.

Pipeline Design and Techniques

Here are the key components of the pipeline design and the various techniques involved:

1 - Step-Based Workflow

To scale migration reliably, the team treated each test file as an independent unit moving through a step-based state machine. This structure enforced validation at every stage, ensuring that transformations passed real checks before advancing.

Each file advanced through the pipeline only if the current step succeeded. If a step failed, the system paused progression, invoked an LLM to refactor the file based on the failure context, and then re-validated before continuing.

Key stages in the workflow included:

  • Enzyme refactor: Replaced Enzyme-specific API calls and structures with RTL equivalents.

  • Jest fixes: Addressed changes in assertion patterns and test setup to ensure compatibility with RTL.

  • Lint and TypeScript checks: Ensured the output aligned with Airbnb’s static analysis standards and type safety expectations.

  • Final validation: Confirmed the migrated test behaved as expected, with no regressions or syntax issues.

This approach worked for the following reasons: 

  • State transitions made progress measurable. Every file had a clear status and history across the pipeline.

  • Failures were contained and explainable. A failed lint check or Jest test didn’t block the entire process, just the specific step for that file.

  • Parallel execution became safe and efficient. The team could run hundreds of files through the pipeline concurrently without bottlenecks or coordination overhead.

  • Step-specific retries became easy to implement. When errors showed up consistently at one stage, fixes could target that layer without disrupting others.

This structured approach provided a foundation for automation to succeed at scale. It also set up the necessary hooks for advanced retry logic, context injection, and real-time debugging later in the pipeline.
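
To make this concrete, here is a minimal sketch of such a per-file state machine in TypeScript. The step names mirror the stages above, while runStep and llmRefactor are hypothetical stand-ins for the real validation and LLM calls, not Airbnb's actual implementation.

// Minimal sketch of a per-file, step-based migration state machine.
// runStep and llmRefactor are hypothetical placeholders, not Airbnb's code.

type Step = 'enzyme-refactor' | 'jest-fixes' | 'lint-ts-checks' | 'final-validation';

interface StepResult {
  ok: boolean;
  errors: string[];
}

const STEPS: Step[] = ['enzyme-refactor', 'jest-fixes', 'lint-ts-checks', 'final-validation'];

// Validate one step for one file (e.g. run jest, eslint, tsc). Stubbed out here.
async function runStep(step: Step, source: string): Promise<StepResult> {
  return { ok: true, errors: [] };
}

// Ask the LLM to rewrite the file using the failure context. Stubbed out here.
async function llmRefactor(step: Step, source: string, errors: string[]): Promise<string> {
  return source;
}

async function migrateFile(source: string, maxRetries = 3): Promise<{ source: string; failedAt?: Step }> {
  for (const step of STEPS) {
    let attempt = 0;
    let result = await runStep(step, source);
    // On failure, pause progression, let the LLM refactor, then re-validate.
    while (!result.ok && attempt < maxRetries) {
      source = await llmRefactor(step, source, result.errors);
      result = await runStep(step, source);
      attempt++;
    }
    if (!result.ok) return { source, failedAt: step }; // file needs attention at this step
  }
  return { source }; // all steps passed
}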

2 - Retry Loops and Dynamic Prompting

Initial experiments showed that deep prompt engineering only went so far.

Instead of obsessing over the perfect prompt, the team leaned into a more pragmatic solution: automated retries with incremental context updates. The idea was simple. If a migration step failed, try again with better feedback until it passed or hit a retry limit.

At each failed step, the system fed the LLM:

  • The latest version of the file

  • The validation errors from the failed attempt

This dynamic prompting approach allowed the model to refine its output based on concrete failures, not just static instructions. Instead of guessing at improvements, the model had specific reasons why the last version didn’t pass.

Each step ran inside a loop runner, which retried the operation up to a configurable maximum. This was especially effective for simple to mid-complexity files, where small tweaks (like fixing an import, renaming a variable, or adjusting test structure) often resolved the issue.

This worked for the following reasons:

  • The feedback loop wasn’t manual. It ran automatically and cheaply at scale.

  • Most files didn't need many tries. Many succeeded after one or two attempts.

  • There was no need to perfectly tune the initial prompt. The system learned through failure.

Retrying with context turned out to be a better investment than engineering the “ideal” prompt up front. It allowed the pipeline to adapt without human intervention and pushed a large portion of files through successfully with minimal effort.
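
As a rough illustration of such a loop runner, the sketch below retries a step while feeding the model the latest version of the file plus the validation errors from the previous attempt. The validate and callLlm functions are hypothetical stand-ins, not Airbnb's code.

// Sketch of dynamic prompting on retry: each attempt sees the latest file
// contents and the concrete validation errors from the last run.
// validate and callLlm are hypothetical placeholders.

async function retryWithFeedback(
  initialSource: string,
  validate: (src: string) => Promise<string[]>, // returns validation errors; empty array = pass
  callLlm: (prompt: string) => Promise<string>,
  maxAttempts = 5,
): Promise<string | null> {
  let source = initialSource;
  let errors = await validate(source);

  for (let attempt = 0; errors.length > 0 && attempt < maxAttempts; attempt++) {
    const prompt = [
      'Fix this migrated test so it passes validation.',
      `CURRENT FILE:\n${source}`,
      `VALIDATION ERRORS FROM LAST ATTEMPT:\n${errors.join('\n')}`,
    ].join('\n\n');

    source = await callLlm(prompt);
    errors = await validate(source);
  }
  return errors.length === 0 ? source : null; // null means give up after the retry limit
}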

3 - Rich Prompt Context

Retry loops handled the bulk of test migrations, but they started to fall short when dealing with more complex files: tests with deep indirection, custom utilities, or tightly coupled setups. These cases needed more than just brute-force retries. They needed contextual understanding.

To handle these, the team significantly expanded prompt inputs, pushing token counts into the 40,000 to 100,000 range. Instead of a minimal diff, the model received a detailed picture of the surrounding codebase, testing patterns, and architectural intent.

Each rich prompt included:

  • The component source code being tested

  • The test file targeted for migration

  • Any validation errors from previous failed attempts

  • Sibling test files from the same directory to reflect team-specific patterns

  • High-quality RTL examples taken from the same project

  • Relevant import files and utility modules

  • General migration guidelines outlining preferred testing practices

const prompt = [
  'Convert this Enzyme test to React Testing Library:',
  `SIBLING TESTS:\n${siblingTestFilesSourceCode}`,
  `RTL EXAMPLES:\n${reactTestingLibraryExamples}`,
  `IMPORTS:\n${nearestImportSourceCode}`,
  `COMPONENT SOURCE:\n${componentFileSourceCode}`,
  `TEST TO MIGRATE:\n${testFileSourceCode}`,
].join('\n\n');

Source: Airbnb Engineering Blog

The key insight was choosing the right context files, pulling in examples that matched the structure and logic of the file being migrated. Adding more tokens didn’t help unless those tokens carried meaningful, relevant information.

By layering rich, targeted context, the LLM could infer project-specific conventions, replicate nuanced testing styles, and generate outputs that passed validations even for the hardest edge cases. This approach bridged the final complexity gap, especially in files that reused abstractions, mocked behavior indirectly, or followed non-standard test setups.

4 - Systematic Cleanup From 75% to 97%

The first bulk migration pass handled 75% of the test files in under four hours. That left around 900 files stuck. These were too complex for basic retries and too inconsistent for a generic fix. Handling this long tail required targeted tools and a feedback-driven cleanup loop.

Two capabilities made this possible:

Migration Status Annotations

Each file was automatically stamped with a machine-readable comment that recorded its migration progress. 

These markers helped identify exactly where a file had failed, whether in the Enzyme refactor, Jest fixes, or final validation.

// MIGRATION STATUS: {"enzyme":"done","jest":{"passed":8,"failed":2}}

Source: Airbnb Engineering Blog

This gave the team visibility into patterns: common failure points, repeat offenders, and areas where LLM-generated code needed help.
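
A small script along these lines could roll the annotations up into a failure report. This is a hedged sketch that assumes Node.js and a conventional test-file naming pattern; it is not Airbnb's actual tooling.

// Sketch: scan test files for MIGRATION STATUS annotations and tally where
// files are stuck. Directory layout and file extensions are assumptions.
import * as fs from 'node:fs';
import * as path from 'node:path';

function collectTestFiles(dir: string, out: string[] = []): string[] {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) collectTestFiles(full, out);
    else if (/\.test\.(ts|tsx|js|jsx)$/.test(entry.name)) out.push(full);
  }
  return out;
}

function failureReport(root: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const file of collectTestFiles(root)) {
    const match = fs.readFileSync(file, 'utf8').match(/MIGRATION STATUS: (\{.*\})/);
    if (!match) continue;
    const status = JSON.parse(match[1]);
    // Attribute the file to the first stage that has not fully passed.
    if (status.enzyme !== 'done') counts.enzyme = (counts.enzyme ?? 0) + 1;
    else if (status.jest && status.jest.failed > 0) counts.jest = (counts.jest ?? 0) + 1;
  }
  return counts;
}

console.log(failureReport(process.cwd()));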

Step-Specific File Reruns

A CLI tool allowed engineers to reprocess subsets of files filtered by failure step and path pattern:

$ llm-bulk-migration --step=fix-jest --match=project-abc/**

Source: Airbnb Engineering Blog

This made it easy to focus on fixes without rerunning the full pipeline, which accelerated feedback and kept the scope of each change isolated.

Structured Feedback Loop

To convert failure patterns into working migrations, the team used a tight iterative loop:

  • Sample 5 to 10 failing files with a shared issue

  • Tune prompts or scripts to address the root cause

  • Test the updated approach against the sample

  • Sweep across all similar failing files

  • Repeat the cycle with the next failure category

This method wasn’t theoretical. In practice, it pushed the migration from 75% to 97% completion in just four days. For the remaining ~100 files, the system had already done most of the work. LLM outputs weren’t usable as-is, but served as solid baselines. Manual cleanup on those final files wrapped up the migration in a matter of days, not months.

The takeaway was that brute force handled the bulk, but targeted iteration finished the job. Without instrumentation and repeatable tuning, the migration would have plateaued far earlier.

Conclusion

The results validated both the tooling and the strategy. The first bulk run completed 75% of the migration in under four hours, covering thousands of test files with minimal manual involvement. 

Over the next four days, targeted prompt tuning and iterative retries pushed completion to 97%. The remaining ~100 files, representing the final 3%, were resolved manually using LLM-generated outputs as starting points, cutting down the time and effort typically required for handwritten migrations.

Throughout the process, the original test intent and code coverage were preserved. The transformed tests passed validation, matched behavioral expectations, and aligned with the structural patterns encouraged by RTL. Even for complex edge cases, the baseline quality of LLM-generated code reduced the manual burden to cleanup and review, not full rewrites.

In total, the entire migration was completed in six weeks, with only six engineers involved and modest LLM API usage. Compared to the original 18-month estimate for a manual migration, the savings in time and cost were substantial.

The project also highlighted where LLMs excel:

  • When the task involves repetitive transformations across many files.

  • When contextual cues from sibling files, examples, and project structure can guide generation.

  • When partial automation is acceptable, and post-processing can clean up the edge cases.

Airbnb now plans to extend this framework to other large-scale code transformations, such as library upgrades, testing strategy shifts, or language migrations. 

The broader conclusion is clear: AI-assisted development can reduce toil, accelerate modernization, and improve consistency when structured properly, instrumented well, and paired with domain knowledge.

References:


Jobright Agent : The First AI that hunts jobs for you

Job hunting can feel like a second full-time job—hours each day scrolling through endless listings, re-typing the same forms, tweaking your resume, yet still hearing nothing back.

What if you had a seasoned recruiter who handled 90% of the grunt work and lined up more interviews for you? That’s the experience with Jobright Agent:

  • Scan 400K+ fresh postings every morning and line up your best matches before you’re even awake.

  • One-click apply—tailors your resume, writes a fresh cover letter, fills out the forms, and hits submit.

  • Track every application and recommend smart next moves to land more interviews.

  • Stay by your side, cheering you on and guiding you when it matters most.

Watch How the Agent Works


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

Object Oriented Design Interview Book is here — now available on Amazon!

2025-06-23 23:31:08

*BIG* announcement: Our new book, Object Oriented Design Interview, is available on Amazon!

Check it out Now!

What's inside?

- An insider's take on what interviewers really look for and why.

- A 4-step framework for solving any object-oriented design interview question.

- 11 real object-oriented design interview questions with detailed solutions.

- 133 detailed diagrams explaining system architectures and workflows.

Table of Contents:

Chapter 1 What is an Object-Oriented Design (OOD) Interview?

Chapter 2 A Framework for the OOD Interview

Chapter 3 OOP Fundamentals

Chapter 4 Parking Lot System

Chapter 5 Movie Ticket Booking System

Chapter 6 Unix File Search System

Chapter 7 Vending Machine System

Chapter 8 Elevator System

Chapter 9 Grocery Store System

Chapter 10 Tic-Tac-Toe Game

Chapter 11 Blackjack Game

Chapter 12 Shipping Locker System

Chapter 13 Automated Teller Machine (ATM) System

Chapter 14 Restaurant Management System

The digital version will be available on the ByteByteGo website in 1–2 weeks. The print edition will also be available in India in a few days.

Check it out on Amazon now!

EP168: AI Vs Machine Learning Vs Deep Learning Vs Generative AI

2025-06-21 23:30:15

✂️ Cut your QA cycles down to minutes with QA Wolf (Sponsored)

If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.

QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.

QA Wolf takes testing off your plate. They can get you:

  • Unlimited parallel test runs for mobile and web apps

  • 24-hour maintenance and on-demand test creation

  • Human-verified bug reports sent directly to your team

  • Zero flakes guarantee

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

Schedule a demo to learn more


This week’s system design refresher:

  • AI Vs Machine Learning Vs Deep Learning Vs Generative AI

  • How a SQL Query Executes in a Database

  • Top 20 AI Agent Concepts You Should Know

  • How RabbitMQ Works

  • Hiring Now

  • SPONSOR US


AI Vs Machine Learning Vs Deep Learning Vs Generative AI

  1. Artificial Intelligence (AI)
    It is the overarching field focused on creating machines or systems that can perform tasks typically requiring human intelligence, such as reasoning, learning, problem-solving, and language understanding. AI consists of various subfields, including ML, NLP, Robotics, and Computer Vision.

  2. Machine Learning (ML)
    It is a subset of AI that focuses on developing algorithms that enable computers to learn from and make decisions based on data.

    Instead of being explicitly programmed for every task, ML systems improve their performance as they are exposed to more data. Common applications include spam detection, recommendation systems, and predictive analytics.

  3. Deep Learning
    It is a specialized subset of ML that utilizes artificial neural networks with multiple layers to model complex patterns in data.

    Neural networks are computational models inspired by the human brain’s network of neurons. Deep neural networks can automatically discover representations needed for feature detection. Use cases include image and speech recognition, NLP, and autonomous vehicles.

  4. Generative AI
    It refers to AI systems capable of generating new content, such as text, images, music, or code, that resembles the data they were trained on. Many of them rely on the Transformer architecture.

    Notable generative AI models include GPT for text generation and DALL-E for image creation.

Over to you: What else will you add to understand these concepts better?


Level Up Your API Stack with Postman (Sponsored)

Your API workflow is changing whether you like it or not. Postman just dropped features built by devs like you to help you stay ahead of the game.

Postman’s POST/CON 25 product reveals include real-time production visibility with Insights, tighter spec workflows with Spec Hub + GitHub Sync, and AI-assisted debugging that actually works.

Think native integrations that plug directly into your stack—VS Code, GitHub, Slack—plus workflow orchestration without infrastructure headaches.

Get the full technical breakdown and see what your API development could look like.

Get all the details


How a SQL Query Executes in a Database

STEP 1
The query string first reaches the Transport Subsystem of the database. This subsystem manages the connection with the client. Also, it performs authentication and authorization checks, and if everything looks fine, it lets the query go to the next step.

STEP 2
The query now reaches the Query Processor subsystem, which has two parts: Query Parser and Query Optimizer.

The Query Parser breaks down the query into sub-parts (such as SELECT, FROM, WHERE). It checks for any syntax errors and creates a parse tree.

Then, the Query Optimizer goes through the parse tree, checks for semantic errors (for example, if the “users” table exists or not), and finds out the most efficient way to execute the query.

The output of this step is the execution plan.

STEP 3
The execution plan goes to the Execution Engine. This plan is made up of all the steps needed to execute the query.

The Execution Engine takes this plan and coordinates the execution of each step by calling the Storage Engine. It also collects the results from each step and returns a combined or unified response to the upper layer.

STEP 4
The Execution Engine sends low-level read and write requests to the Storage Engine based on the execution plan.

This is handled by the various components of the Storage Engine, such as the transaction manager (for transaction management), lock manager (acquires necessary locks), buffer manager (checks if data pages are in memory), and recovery manager (for rollback or recovery).
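
To see the optimizer's output from Step 2 for yourself, most databases let you ask for the execution plan directly. Below is a small sketch using the pg client for PostgreSQL; the connection string and the users table are placeholders.

// Sketch: inspect the execution plan the query optimizer produces, using the
// pg PostgreSQL client. The connection string and users table are placeholders.
import { Client } from 'pg';

async function showPlan(): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  // EXPLAIN returns the planner's chosen steps (scans, joins, sorts) without
  // running the query; EXPLAIN ANALYZE would also execute it and report timings.
  const result = await client.query(
    "EXPLAIN SELECT id, email FROM users WHERE created_at > now() - interval '7 days'"
  );
  for (const row of result.rows) {
    console.log(row['QUERY PLAN']);
  }

  await client.end();
}

showPlan().catch(console.error);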

Over to you: What else will you add to understand the execution of an SQL Query?


Top 20 AI Agent Concepts You Should Know

  1. Agent: An autonomous entity that perceives, reasons, and acts in an environment to achieve goals.

  2. Environment: The surrounding context or sandbox in which the agent operates and interacts.

  3. Perception: The process of interpreting sensory or environmental data to build situational awareness.

  4. State: The agent’s current internal condition or representation of the world.

  5. Memory: Storage of recent or historical information for continuity and learning.

  6. Large Language Models: Foundation models powering language understanding and generation.

  7. Reflex Agent: A simple type of agent that makes decisions based on predefined “condition-action” rules.

  8. Knowledge Base: Structured or unstructured data repository used by agents to inform decisions.

  9. CoT (Chain of Thought): A reasoning method where agents articulate intermediate steps for complex tasks.

  10. ReACT: A framework that combines step-by-step reasoning with direct environmental actions.

  11. Tools: APIs or external systems that agents use to augment their capabilities.

  12. Action: Any task or behavior executed by the agent as a result of its reasoning.

  13. Planning: Devising a sequence of actions to reach a specific goal.

  14. Orchestration: Coordinating multiple steps, tools, or agents to fulfill a task pipeline.

  15. Handoffs: The transfer of responsibilities or tasks between different agents.

  16. Multi-Agent System: A framework where multiple agents operate and collaborate in the same environment.

  17. Swarm: Emergent intelligent behavior from many agents following local rules without central control.

  18. Agent Debate: A mechanism where agents argue opposing views to refine or improve outcomes.

  19. Evaluation: Measuring the effectiveness or success of an agent’s actions and outcomes.

  20. Learning Loop: The cycle where agents improve performance by continuously learning from feedback or outcomes.
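
To tie a few of these concepts together (ReACT, tools, actions, and memory), here is a toy reason-act loop. The llmDecide function and the search tool are hypothetical placeholders; a real agent would delegate the reasoning step to an LLM.

// Toy ReACT-style loop: alternate between a reasoning step and a tool action,
// feeding each observation back into the agent's scratchpad (its short-term memory).
// llmDecide and the search tool are placeholders, not a real framework.

type Decision = { action: 'search'; input: string } | { finalAnswer: string };

// Placeholder tool: a real agent would call an API here.
const searchTool = async (query: string): Promise<string> => `top result for "${query}"`;

// Placeholder reasoning step: a real agent would prompt an LLM with the
// question plus the scratchpad and parse its chosen action.
async function llmDecide(question: string, scratchpad: string[]): Promise<Decision> {
  return scratchpad.length === 0
    ? { action: 'search', input: question }
    : { finalAnswer: `Answer based on: ${scratchpad[scratchpad.length - 1]}` };
}

async function runAgent(question: string, maxSteps = 5): Promise<string> {
  const scratchpad: string[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const decision = await llmDecide(question, scratchpad);       // reason
    if ('finalAnswer' in decision) return decision.finalAnswer;   // stop condition
    const observation = await searchTool(decision.input);         // act
    scratchpad.push(`search(${decision.input}) -> ${observation}`); // remember
  }
  return 'No answer within the step budget';
}

runAgent('What is the capital of France?').then(console.log);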

Over to you: Which other AI agent concept will you add to the list?


How RabbitMQ Works

RabbitMQ is a message broker that enables applications to communicate by sending and receiving messages through queues. It helps decouple services, improve scalability, and handle asynchronous processing efficiently.

Here’s how it works:

  1. A producer (usually an application or service) sends messages to the RabbitMQ broker, which manages message routing and delivery.

  2. Within the broker, messages are sent to an exchange, which determines how they should be routed based on the type of exchange: Direct, Topic, or Fanout.

  3. Bindings connect exchanges to queues using a binding key, which defines the rules for routing messages (for example, exact match or pattern-based).

  4. Direct exchanges route messages to queues that match the routing key exactly, as shown with Queue 1.

  5. Topic exchanges use patterns to route messages to matching queues.

  6. Fanout exchanges broadcast messages to all bound queues, regardless of routing keys.

  7. Finally, messages are pulled from the queues by a consumer, which processes them and can pass the results to other systems.
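
As a minimal sketch of this flow using the Node.js amqplib client, where the exchange, queue, routing key, and broker URL are illustrative placeholders:

// Sketch: a producer publishing through a direct exchange and a consumer
// reading from a bound queue, using amqplib. All names are placeholders.
import amqp from 'amqplib';

async function main(): Promise<void> {
  const conn = await amqp.connect('amqp://localhost');
  const ch = await conn.createChannel();

  // Direct exchange: messages go to queues whose binding key exactly matches
  // the routing key used at publish time.
  await ch.assertExchange('orders', 'direct', { durable: false });
  const { queue } = await ch.assertQueue('order-created', { durable: false });
  await ch.bindQueue(queue, 'orders', 'order.created');

  // Producer side: publish a message with a routing key.
  ch.publish('orders', 'order.created', Buffer.from(JSON.stringify({ id: 42 })));

  // Consumer side: pull messages from the queue and acknowledge them.
  await ch.consume(queue, (msg) => {
    if (msg) {
      console.log('received:', msg.content.toString());
      ch.ack(msg);
    }
  });
}

main().catch(console.error);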

Over to you: What else will you add to the RabbitMQ process flow?


Hiring Now

We collaborate with Jobright.ai (an AI job search copilot trusted by 500K+ tech professionals) to curate this job list.

This Week’s High-Impact Roles at Fast-Growing AI Startups

  • Senior / Staff Software Engineer, Data Platform at Waabi (California, USA)

    • Yearly: $155,000 - $240,000

    • Waabi is an artificial intelligence company that develops autonomous driving technology for the transportation sector.

  • Senior Full Stack Engineer at Proton.ai (US)

    • Yearly: $60,000 - $90,000

    • Proton.ai is an AI-powered sales platform for distributors to gain millions of revenue and reclaim market share.

  • Software Engineer - Frontend UI at Luma AI (Palo Alto, CA)

    • Yearly: $220,000 - $280,000

    • Luma AI is a generative AI startup that enables users to transform text descriptions into corresponding 3D models.

High Salary SWE Roles this week

Today’s latest ML positions


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

A Guide to Database Transactions: From ACID to Concurrency Control

2025-06-19 23:30:36

Modern applications don’t operate in a vacuum. Every time a ride is booked, an item is purchased, or a balance is updated, the backend juggles multiple operations (reads, writes, validations) often across different tables or services. These operations must either succeed together or fail as a unit. 

That’s where transactions step in.

A database transaction wraps a series of actions into an all-or-nothing unit. Either the entire thing commits and becomes visible to the world, or none of it does. In other words, the goal is to have no half-finished orders, no inconsistent account balances, and no phantom bookings. 
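
As a minimal sketch, assuming a PostgreSQL database reached through the pg client and a placeholder accounts table, an all-or-nothing transfer looks like this:

// Sketch of an all-or-nothing transfer: either both UPDATEs commit together,
// or ROLLBACK undoes everything. The connection string and accounts table are placeholders.
import { Client } from 'pg';

async function transfer(fromId: number, toId: number, amount: number): Promise<void> {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    await client.query('BEGIN');
    await client.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [amount, fromId]);
    await client.query('UPDATE accounts SET balance = balance + $1 WHERE id = $2', [amount, toId]);
    await client.query('COMMIT');   // both writes become visible atomically
  } catch (err) {
    await client.query('ROLLBACK'); // neither write survives a failure midway
    throw err;
  } finally {
    await client.end();
  }
}

transfer(1, 2, 50).catch(console.error);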

However, maintaining correctness gets harder when concurrency enters the picture. 

This is because transactions don’t run in isolation. Real systems deal with dozens, hundreds, or thousands of simultaneous users. And every one of them expects their operation to be successful. Behind the scenes, the database has to balance isolation, performance, and consistency without grinding the system to a halt.

This balancing act isn’t trivial. Here are a few cases:

  • One transaction might read data that another is about to update. 

  • Two users might try to reserve the same inventory slot. 

  • A background job might lock a record moments before a customer clicks "Confirm." 

Such scenarios can result in conflicts, race conditions, and deadlocks that stall the system entirely.

In this article, we break down the key building blocks that make transactional systems reliable in the face of concurrency. We will start with the fundamentals: what a transaction is, and why the ACID properties matter. We will then dig deeper into the mechanics of concurrency control (pessimistic and optimistic) and understand the trade-offs related to them.

What is a Database Transaction?

Read more

How the Google Cloud Outage Crashed the Internet

2025-06-17 23:30:25

7 Key Insights from the State of DevSecOps Report (Sponsored)

Datadog analyzed data from tens of thousands of orgs to uncover 7 key insights on modern DevSecOps practices and application security risks.

Highlights:

  • Why smaller container images reduce severe vulns

  • How runtime context helps you prioritize critical CVEs

  • The link between deploy frequency and outdated dependencies

Plus, learn proven strategies to implement infrastructure as code, automated cloud deploys, and short-lived CI/CD credentials.

Get the report


Disclaimer: The details in this post have been derived from the details shared online by the Google Engineering Team. All credit for the technical details goes to the Google Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

On June 12, 2025, a significant portion of the internet experienced a sudden outage. What started as intermittent failures on Gmail and Spotify soon escalated into a global infrastructure meltdown. For millions of users and hundreds of companies, critical apps simply stopped working.

At the heart of it all was a widespread outage in Google Cloud Platform (GCP), which serves as the backend for a vast ecosystem of digital services. The disruption began at 10:51 AM PDT, and within minutes, API requests across dozens of regions were failing with 503 errors. Over a few hours, the ripple effects became undeniable.

Among consumer platforms, the outage took down:

  • Spotify (approximately 46,000 user reports on Downdetector).

  • Snapchat, Discord, Twitch, and Fitbit: users were unable to stream, chat, or sync their data.

  • Google Workspace apps (including Gmail, Calendar, Meet, and Docs). These apps power daily workflows for hundreds of millions of users.

The failure was just as acute for enterprise and developer tools:

  • GitLab, Replit, Shopify, Elastic, LangChain, and other platforms relying on GCP services saw degraded performance, timeouts, or complete shutdowns.

  • Thousands of CI/CD pipelines, model serving endpoints, and API backends stalled or failed outright.

  • Vertex AI, BigQuery, Cloud Functions, and Google Cloud Storage were all affected, halting data processing and AI operations.

In total, more than 50 distinct Google Cloud services across over 40 regions worldwide were affected. 

Perhaps the most significant impact came from Cloudflare, a company often viewed as a pillar of internet reliability. While its core content delivery network (CDN) remained operational, Cloudflare's authentication systems, reliant on Google Cloud, failed. This led to issues with session validation, login workflows, and API protections for many of its customers. 

The financial markets also felt the impact of this outage. Alphabet (Google’s parent) saw its stock fall by nearly 1 percent. The logical question that arose from this incident is as follows: How did a platform built for global scale suffer such a cascading collapse? 

Let’s understand more about it.


Special Event: Save 20% on Top Maven Courses (Sponsored)

Your education is expiring faster than ever. What you learned in college won’t help you lead in the age of AI.

That's why Maven specializes in live courses with practitioners who have actually done the work and shipped innovative products:

  • Shreyas Doshi (Product leader at Stripe, Twitter, Google) teaching Product Sense

  • Hamel Husain (renowned ML engineer, Github) teaching AI evals

  • Aish Naresh Reganti (AI scientist at AWS) teaching Agentic AI

  • Hamza Farooq (Researcher at Google) teaching RAG

This week only: Save 20% on Maven’s most popular courses in AI, product, engineering, and leadership to accelerate your career.

Explore Event (Ends Sunday)


Inside the Outage

To understand how such a massive outage occurred, we need to look under the hood at a critical system deep inside Google Cloud’s infrastructure. It’s called Service Control.

The Key System: Service Control

Service Control is one of the foundational components of Google Cloud's API infrastructure. 

Every time a user, application, or service makes an API request to a Google Cloud product, Service Control sits between the client and the backend. It is responsible for several tasks such as:

  • Verifying if the API request is authorized.

  • Enforcing quota limits (how many requests can be made).

  • Checking various policy rules (such as organizational restrictions).

  • Logging, metering, and auditing requests for monitoring and billing.

The diagram below shows how Service Control works at a high level:

In short, Service Control acts as the gatekeeper for nearly all Google Cloud API traffic. If it fails, most of Google Cloud fails with it.

The Faulty Feature

On May 29, 2025, Google introduced a new feature into the Service Control system. This feature added support for more advanced quota policy checks, allowing finer-grained control over how quota limits are applied.

The feature was rolled out across regions in a staged manner. However, it contained a bug that introduced a null pointer vulnerability in a new code path that was never exercised during rollout. The feature relied on a specific type of policy input to activate. Because that input had not yet been introduced during testing, the bug went undetected.

Critically, this new logic was also not protected by a feature flag, which would have allowed Google to safely activate it in a controlled way. Instead, the feature was present and active in the binary, silently waiting for the right (or in this case, wrong) conditions to be triggered.
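
As a generic illustration of the two missing safeguards, gating a new code path behind a flag and null-checking its input, consider the sketch below. This is not Google's code; all names are invented.

// Generic illustration (not Google's code) of the two missing safeguards:
// a feature flag guarding the new code path, and a null check on policy input.

interface QuotaPolicy {
  limits?: Record<string, number>; // may be blank or missing in a malformed policy
}

const featureFlags = { advancedQuotaChecks: false }; // rolled out gradually, off by default

function checkQuota(policy: QuotaPolicy | null, usage: Record<string, number>): boolean {
  // Old, battle-tested behavior stays in effect unless the flag is enabled.
  if (!featureFlags.advancedQuotaChecks) return true;

  // New path: handle missing data gracefully instead of dereferencing a null.
  if (!policy || !policy.limits) {
    console.warn('Quota policy missing or malformed; skipping advanced checks');
    return true;
  }
  return Object.entries(policy.limits).every(([key, limit]) => (usage[key] ?? 0) <= limit);
}

// Even with the new path enabled, a malformed policy no longer crashes the caller.
featureFlags.advancedQuotaChecks = true;
console.log(checkQuota({}, { requestsPerMinute: 100 })); // true, with a warning logged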

The Triggering Event

Those conditions arrived on June 12, 2025, at approximately 10:45 AM PDT, when a new policy update was inserted into Google Cloud’s regional Spanner databases. This update contained blank or missing fields that were unexpected by the new quota checking logic.

As Service Control read this malformed policy, the new code path was activated. The result was a null pointer error getting triggered, causing the Service Control binary to crash in that region.

Since Google Cloud’s policy and quota metadata is designed to replicate globally in near real-time as per Spanner’s key feature, the corrupted policy data was propagated to every region within seconds. 

Here’s a representative diagram of how replication works in Google Spanner:

As each regional Service Control instance attempted to process the same bad data, it crashed in the same way. This created a global failure of Service Control.

Since this system is essential for processing API requests, nearly all API traffic across Google Cloud began to fail, returning HTTP 503 Service Unavailable errors.

The speed and scale of the failure were staggering. One malformed update, combined with an unprotected code path and global replication of metadata, brought one of the most robust cloud platforms in the world to a standstill within minutes.

How Google Responded

Once the outage began to unfold, Google’s engineering teams responded with speed and precision. Within two minutes of the first crashes being observed in Service Control, Google’s Site Reliability Engineering (SRE) team was actively handling the situation. 

Here is the sequence of events that followed:

The Red Button Fix

Fortunately, the team that introduced the new quota checking feature had built in a safeguard: an internal “red-button” switch. This kill switch was designed to immediately disable the specific code path responsible for serving the new quota policy logic. 

While not a complete fix, it offered a quick way to bypass the broken logic and stop the crash loop.

The red-button mechanism was activated within 10 minutes of identifying the root cause. By 40 minutes after the incident began, the red-button change had been rolled out across all regions, and systems began to stabilize. Smaller and less complex regions recovered first, as they required less infrastructure coordination.

This kill switch was essential in halting the worst of the disruption. However, because the feature had not been protected by a traditional feature flag, the issue had already been triggered in production globally before the red button could be deployed. 

Delayed Recovery in us-central1

Most regions began to recover relatively quickly after the red button was applied. However, one region (us-central1), located in Iowa, took significantly longer to stabilize.

The reason for this delay was a classic case of the “herd effect.” 

As Service Control tasks attempted to restart en masse, they all hit the same underlying infrastructure: the regional Spanner database that held policy metadata. Without any form of randomized exponential backoff, the system became overwhelmed by a flood of simultaneous requests. Rather than easing into recovery, it created a new performance bottleneck.
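
A standard mitigation for this kind of thundering-herd restart is randomized exponential backoff, where each retry waits a random delay up to an exponentially growing cap. The sketch below is a generic illustration, not Google's implementation.

// Generic randomized exponential backoff ("full jitter"): each retry waits a
// random delay up to an exponentially growing cap, so a fleet of restarting
// tasks spreads its load instead of hitting the backend simultaneously.

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 6,
  baseMs = 500,
  capMs = 60_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;          // give up after the last attempt
      const delay = Math.random() * Math.min(capMs, baseMs * 2 ** attempt);
      await sleep(delay);                                  // jittered wait before retrying
    }
  }
}

// Example: wrapping a (placeholder) startup call that reads policy metadata.
withBackoff(() => Promise.resolve('service control task started')).then(console.log);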

Google engineers had to carefully throttle task restarts in us-central1 and reroute some of the load to multi-regional Spanner databases to alleviate pressure. This process took time. Full recovery in us-central1 was not achieved until approximately 2 hours and 40 minutes after the initial failure, well after other regions had already stabilized.

Communication Breakdown

While the technical team worked to restore service, communication with customers proved to be another challenge.

Because the Cloud Service Health dashboard itself was hosted on the same infrastructure affected by the outage, Google was unable to immediately post incident updates. The first public acknowledgment of the problem did not appear until nearly one hour after the outage began. During that period, many customers had no clear visibility into what was happening or which services were affected.

To make matters worse, some customers relied on Google Cloud monitoring tools, such as Cloud Monitoring and Cloud Logging, that were themselves unavailable due to the same root cause. This left entire operations teams effectively blind, unable to assess system health or respond appropriately to failing services.

The breakdown in visibility highlighted a deeper vulnerability: when a cloud provider's observability and communication tools are hosted on the same systems they are meant to monitor, customers are left without reliable status updates in times of crisis.

The Key Engineering Failures

The Google Cloud outage was not the result of a single mistake, but a series of engineering oversights that compounded one another. Each failure point, though small in isolation, played a role in turning a bug into a global disruption.

Here are the key failures that contributed to the entire issue:

  • The first and most critical lapse was the absence of a feature flag. The new quota-checking logic was released in an active state across all regions, without the ability to gradually enable or disable it during rollout. Feature flags are a standard safeguard in large-scale systems, allowing new code paths to be activated in controlled stages. Without one, the bug went live in every environment from the start.

  • Second, the code failed to include a basic null check. When a policy with blank fields was introduced, the system did not handle the missing values gracefully. Instead, it encountered a null pointer exception, which crashed the Service Control binary in every region that processed the data.

  • Third, Google’s metadata replication system functioned exactly as designed. The faulty policy data propagated across all regions almost instantly, triggering the crash everywhere. The global replication process had no built-in delay or validation checkpoint to catch malformed data before it reached production.

  • Fourth, the recovery effort in the “us-central1” region revealed another problem. As Service Control instances attempted to restart, they all hit the backend infrastructure at once, creating a “herd effect” that overwhelmed the regional Spanner database. Because the system lacked appropriate randomized exponential backoff, the recovery process generated new stress rather than alleviating it.

  • Finally, the monitoring and status infrastructure failed alongside the core systems. Google’s own Cloud Service Health dashboard went down during the outage, and many customers could not access logs, alerts, or observability tools that would normally guide their response. This created a critical visibility gap during the peak of the crisis.

Conclusion

In the end, it was a simple software bug that brought down one of the most sophisticated cloud platforms in the world. 

What might have been a minor error in an isolated system escalated into a global failure that disrupted consumer apps, developer tools, authentication systems, and business operations across multiple continents. This outage is a sharp reminder that cloud infrastructure, despite its scale and automation, is not infallible. 

Google acknowledged the severity of the failure and issued a formal apology to customers. In its public statement, the company committed to making improvements to ensure such an outage does not happen again. The key actions Google has promised are as follows:

  • Prevent the API management system from crashing in the presence of invalid or corrupted data.

  • Introduce safeguards to stop metadata from being instantly replicated across the globe without proper testing and monitoring.

  • Improve error handling in core systems and expand testing to ensure invalid data is caught before it can cause failure.

Reference:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected]




EP167: Top 20 AI Concepts You Should Know

2025-06-14 23:30:22

WorkOS: Scoped Access and Control for AI Agents (Sponsored)

AI agents can trigger tools, call APIs, and access sensitive data.
Failing to control access creates real risk.

Learn how to:

  • Limit what agents can do with scoped tokens

  • Define roles and restrict permissions

  • Log activity for debugging and review

  • Secure credentials and secrets

  • Detect and respond to suspicious behavior

Real teams are applying these practices to keep agent workflows safe, auditable, and aligned with least-privilege principles.

Secure your agent workflows today


This week’s system design refresher:

  • Top 20 AI Concepts You Should Know

  • The AI Application Stack for Building RAG Apps

  • Shopify Tech Stacks and Tools

  • Our new book, Mobile System Design Interview, is available on Amazon!

  • Featured Job

  • Other Jobs

  • SPONSOR US


Top 20 AI Concepts You Should Know

  1. Machine Learning: Core algorithms, statistics, and model training techniques.

  2. Deep Learning: Hierarchical neural networks learning complex representations automatically.

  3. Neural Networks: Layered architectures efficiently model nonlinear relationships accurately.

  4. NLP: Techniques to process and understand natural language text.

  5. Computer Vision: Algorithms interpreting and analyzing visual data effectively.

  6. Reinforcement Learning: Agents learn optimal actions through trial and error, guided by rewards from the environment.

  7. Generative Models: Creating new data samples using learned data.

  8. LLM: Generates human-like text after pre-training on massive text corpora.

  9. Transformers: Self-attention-based architecture powering modern AI models.

  10. Feature Engineering: Designing informative features to improve model performance significantly.

  11. Supervised Learning: Learns to map inputs to outputs from labeled training data.

  12. Bayesian Learning: Incorporates uncertainty using probabilistic modeling approaches.

  13. Prompt Engineering: Crafting effective inputs to guide generative model outputs.

  14. AI Agents: Autonomous systems that perceive, decide, and act.

  15. Fine-Tuning Models: Customizes pre-trained models for domain-specific tasks.

  16. Multimodal Models: Processes and generates across multiple data types like images, videos, and text.

  17. Embeddings: Transforms input into machine-readable vector formats.

  18. Vector Search: Finds similar items using dense vector embeddings.

  19. Model Evaluation: Assessing predictive performance using validation techniques.

  20. AI Infrastructure: Deploying scalable systems to support AI operations.

Over to you: Which other AI concept will you add to the list?


The AI Application Stack for Building RAG Apps

  1. Large Language Models
    These are the core engines behind Retrieval-Augmented Generation (RAG), responsible for understanding queries and generating coherent and contextual responses. Some common LLM options are OpenAI GPT models, Llama, Claude, Gemini, Mistral, DeepSeek, Qwen 2.5, Gemma, etc.

  2. Frameworks and Model Access
    These tools simplify the integration of LLMs into your applications by handling prompt orchestration, model switching, memory, chaining, and routing. Common tools are Langchain, LlamaIndex, Haystack, Ollama, Hugging Face, and OpenRouter.

  3. Databases
    RAG applications rely on storing and retrieving relevant information. Vector databases are optimized for similarity search, while relational options like Postgres offer structured storage. Tools are Postgres, FAISS, Milvus, pgVector, Weaviate, Pinecone, Chroma, etc.

  4. Data Extraction
    To populate your knowledge base, these tools help extract structured information from unstructured sources like PDFs, websites, and APIs. Some common tools are Llamaparse, Docking, Megaparser, Firecrawl, ScrapeGraph AI, Document AI, and Claude API.

  5. Text Embeddings
    Embeddings convert text into high-dimensional vectors that enable semantic similarity search, which is a critical step for connecting queries with relevant context in RAG. Common tools are Nomic, OpenAI, Cognita, Gemini, LLMWare, Cohere, JinaAI, and Ollama.
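
Putting a few of these layers together, a toy RAG flow might look like the sketch below. It uses the OpenAI Node SDK for embeddings and generation, with an in-memory cosine-similarity search standing in for a vector database; the model names and documents are illustrative choices.

// Toy RAG pipeline: embed a handful of documents, retrieve the most similar one
// for a question, and pass it as context to a chat model. The in-memory cosine
// search stands in for a real vector database; model names are illustrative.
import OpenAI from 'openai';

const client = new OpenAI(); // expects OPENAI_API_KEY in the environment

const docs = [
  'Invoices are due within 30 days of issue.',
  'Refunds are processed within 5 business days.',
];

const cosine = (a: number[], b: number[]): number => {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return dot / (norm(a) * norm(b));
};

async function embed(texts: string[]): Promise<number[][]> {
  const res = await client.embeddings.create({ model: 'text-embedding-3-small', input: texts });
  return res.data.map((d) => d.embedding);
}

async function answer(question: string): Promise<string> {
  const docVectors = await embed(docs);
  const [queryVector] = await embed([question]);

  // Retrieval: pick the document with the highest cosine similarity to the query.
  let best = 0;
  for (let i = 1; i < docs.length; i++) {
    if (cosine(docVectors[i], queryVector) > cosine(docVectors[best], queryVector)) best = i;
  }

  // Generation: answer the question grounded in the retrieved context.
  const completion = await client.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: `Context:\n${docs[best]}\n\nQuestion: ${question}` }],
  });
  return completion.choices[0].message.content ?? '';
}

answer('How long do refunds take?').then(console.log).catch(console.error);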

Over to you: What else will you add to the list to build RAG apps?


Shopify Tech Stacks and Tools

Shopify handles scale that would break most systems.

On a single day (Black Friday 2024), the platform processed 173 billion requests, peaked at 284 million requests per minute, and pushed 12 terabytes of traffic every minute through its edge.

These numbers aren’t anomalies. They’re sustained targets that Shopify strives to meet. Behind this scale is a stack that looks deceptively simple from the outside: Ruby on Rails, React, MySQL, and Kafka.

But that simplicity hides sharp architectural decisions, years of refactoring, and thousands of deliberate trade-offs.

In this newsletter, we map the tech stack powering Shopify from:

  • the modular monolith that still runs the business,

  • to the pods that isolate failure domains,

  • to the deployment pipelines that ship hundreds of changes a day.

It covers the tools, programming languages, and patterns Shopify uses to stay fast, resilient, and developer-friendly at incredible scale.

A huge thank you to Shopify’s world-class engineering team for sharing their insights and for collaborating with us on this deep technical exploration.

🔗 Dive into the full newsletter here.


Our new book, Mobile System Design Interview, is available on Amazon!

Book author: Manuel Vicente Vivo

What’s inside?

  • An insider's take on what interviewers really look for and why.

  • A 5-step framework for solving any mobile system design interview question.

  • 7 real mobile system design interview questions with detailed solutions.

  • 24 deep dives into complex technical concepts and implementation strategies.

  • 175 topics covering the full spectrum of mobile system design principles.

Table Of Contents
Chapter 1: Introduction
Chapter 2: A Framework for Mobile System Design Interviews
Chapter 3: Design a News Feed App
Chapter 4: Design a Chat App
Chapter 5: Design a Stock Trading App
Chapter 6: Design a Pagination Library
Chapter 7: Design a Hotel Reservation App
Chapter 8: Design the Google Drive App
Chapter 9: Design the YouTube App
Chapter 10: Mobile System Design Building Blocks
Quick Reference Cheat Sheet for MSD Interview

Check it out on Amazon now


Featured Job

Founding Engineer @dbdasher.ai

Location: Remote (India)

Role Type: Full-time

Compensation: Highly Competitive

Experience Level: 2+ years preferred

About dbdasher.ai: dbdasher.ai is a well-funded, high-ambition AI startup on a mission to revolutionize how large enterprises interact with data. We use cutting-edge language models to help businesses query complex datasets with natural language. We’re already working with two pilot customers - a publicly listed company and a billion-dollar private enterprise and we’re just getting started.

We’re building something new from the ground up. If you love solving hard problems and want to shape the future of enterprise AI tools, this is the place for you.

About the Role: We’re hiring a Founding Engineer to join our early team and help build powerful, user-friendly AI-driven products from scratch. You’ll work directly with the founders to bring ideas to life, ship fast, and scale systems that power real-world business decisions.

If you are interested, apply here or email Rishabh at [email protected]


Other Jobs

We collaborate with Jobright.ai (an AI job search copilot trusted by 500K+ tech professionals) to curate this job list.

This Week’s High-Impact Roles at Fast-Growing AI Startups

  • Senior Software Engineer, Search Evaluations at OpenAI (San Francisco, CA)

    • Yearly: $245,000 - $465,000

    • OpenAI creates artificial intelligence technologies to assist with tasks and provide support for human activities.

  • Staff Software Engineer, ML Engineering at SmarterDx (United States)

    • Yearly: $220,000 - $270,000

    • SmarterDx is a clinical AI company that develops automated pre-bill review technology to assist hospitals in analyzing patient discharges.

  • Software Engineering Manager, Core Platform at Standard Bots (New York, NY)

    • Yearly: $220,000 - $240,000

    • Standard Bots offers advanced automation solutions, including the RO1 robot, to help businesses streamline their operations.

High Salary SWE Roles this week

Today’s latest ML positions - hiring now!


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected]