Blog of ByteByteGo

Must-Know Message Broker Patterns

2026-01-09 00:30:49

Modern distributed systems rely on message brokers to enable communication between independent services.

However, using message brokers effectively requires understanding common architectural patterns that solve recurring challenges. This article introduces seven essential patterns that help developers build reliable, scalable, and maintainable systems using message brokers.

These patterns address three core categories of problems:

  • Ensuring data consistency across services

  • Managing workload efficiently

  • Gaining visibility into the messaging infrastructure

Whether we’re building an e-commerce platform, a banking system, or any distributed application, these patterns provide proven solutions to common challenges.

In this article, we will look at each of these patterns in detail and understand the scenarios where they help the most.

Ensuring Data Consistency

Read more

How AI Transformed Database Debugging at Databricks

2026-01-07 00:31:08

New Year, New Metrics: Evaluating AI Search in the Agentic Era (Sponsored)

Most teams pick a search provider by running a few test queries and hoping for the best – a recipe for hallucinations and unpredictable failures. This technical guide from You.com gives you access to an exact framework to evaluate AI search and retrieval.

What you’ll get:

  • A four-phase framework for evaluating AI search

  • How to build a golden set of queries that predicts real-world performance

  • Metrics and code for measuring accuracy

Go from “looks good” to proven quality.

Learn how to run an eval


Disclaimer: The details in this post have been derived from the details shared online by the Databricks Engineering Team. All credit for the technical details goes to the Databricks Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Databricks is a cloud platform that helps companies manage all their data in one place. It combines the best features of data warehouses and data lakes into a lakehouse architecture, which means you can store and work with any type of data.

Recently, Databricks built an internal AI-powered agentic platform that reduced database debugging time by up to 90% across thousands of OLTP instances spanning hundreds of regions on multiple cloud platforms.

The AI agent interprets questions, executes diagnostic steps, and debugs issues by retrieving key metrics and logs and automatically correlating signals. This makes the lives of Databricks engineers easier: they can now ask questions about the health of their services in natural language without needing to reach out to on-call engineers on storage teams.

The great part was that this platform evolved from a hackathon project into a company-wide tool that unifies metrics, tooling, and expertise for managing databases at scale. In this article, we will look at how the Databricks engineering team built this platform and the challenges faced along the way.

The Pre-AI Workflow and Pain Points

In the pre-AI workflow, Databricks engineers had to manually jump between multiple tools whenever they had to debug a database problem. Here’s how the workflow ran:

  • Engineers would first open Grafana to examine performance metrics and charts that showed how the database was behaving over time.

  • Next, they would switch to Databricks’ internal dashboards to understand which client applications were running and how much workload they were generating on the database.

  • Engineers would then run command-line interface commands to inspect InnoDB status, which provides a detailed snapshot of MySQL’s internal state, including active transactions, I/O operations, and any deadlocks.

  • Finally, engineers would log into their cloud provider’s console to download slow query logs that revealed which database queries were taking an unusually long time to execute.

The first attempt to alleviate this problem was made during a company-wide hackathon, during which developers built a simple prototype that unified a few core database metrics and dashboards into a single view. The results were promising. However, before writing more code, Databricks took a research-driven approach by actually observing on-call engineers during real debugging sessions and conducting interviews to understand their challenges firsthand.

The first major problem was fragmented tooling, where each debugging tool worked in complete isolation without any integration or ability to share information with other tools. This lack of integration meant engineers had to manually piece together information from multiple disconnected sources, which made the entire debugging process slow and prone to human error.

The second major problem was that engineers spent most of their incident response time gathering context rather than actually fixing the problem. Context gathering involved figuring out what had recently changed in the system, determining what “normal” baseline behavior looked like, and tracking down other engineers who might have relevant knowledge.

The third major problem was that engineers lacked clear guidance during incidents about which mitigation actions were safe to take and which would actually be effective. Without clear runbooks or automated guidance, engineers would either spend a lot of time investigating to ensure they fully understood the situation or they would wait for senior experts to become available and tell them what to do.

Evolution Through Iteration

Databricks didn’t build its AI debugging platform in one shot; the team went through multiple versions.

The first version they built was a static agentic workflow that simply followed a pre-written debugging Standard Operating Procedure, which is essentially a step-by-step checklist of what to do. This first version failed because engineers didn’t want to follow a manual checklist, but wanted the system to automatically analyze their situation and give them a diagnostic report with immediate insights about what was wrong.

Learning from this failure, Databricks built a second version focused on anomaly detection, which could automatically identify unusual patterns or behaviors in the database metrics. However, while the anomaly detection system successfully surfaced relevant problems, it still fell short because it only told engineers “here’s what’s wrong” without providing clear guidance on what to do next to fix those problems.

The breakthrough came with the third version, which was an interactive chat assistant that fundamentally changed how engineers could debug their databases. This chat assistant codifies expert debugging knowledge, meaning it captures the wisdom and experience of senior database engineers and makes it available to everyone through conversation. Unlike the previous versions, the chat assistant can answer follow-up questions, allowing engineers to have a back-and-forth dialogue rather than just receiving a one-time report.

This interactive nature transforms debugging from a series of isolated manual steps into a continuous, conversational process where the AI guides engineers through the entire investigation.

See the evolution journey in the diagram below:

Platform Foundation Architecture

Before adding AI to its debugging platform, the Databricks engineering team realized it first needed a solid architectural foundation to make the AI integration meaningful, because any agent would need to handle region-specific and cloud-specific logic.

This was a difficult problem since Databricks operates thousands of database instances across hundreds of regions, eight regulatory domains, and three clouds. The team recognized that without building this strong architectural foundation first, trying to add AI capabilities would run into unavoidable roadblocks. Some of the problems were as follows:

  • The first problem that would occur without this foundation is context fragmentation, where all the debugging data would be scattered across different locations, making it impossible for an AI agent to get a complete picture of what’s happening.

  • The second problem would be unclear governance boundaries, meaning it would be extremely difficult to ensure that the AI agent and human engineers stay within their proper permissions and don’t accidentally access or modify things they shouldn’t.

  • The third problem would be slow iteration loops, where inconsistent ways of doing things across different clouds and regions would make it very hard to test and improve the AI agent’s behavior.

To support this complexity, the platform is built on three core architectural principles that work together to create a unified, secure, and scalable system.

Global Storex Instance

The first principle is a central-first sharded architecture, which means there’s one central “brain” (called Storex) that coordinates many regional pieces of the system.

This global Storex instance acts like a traffic controller, providing engineers with a single unified interface to access all their databases, no matter where those databases are physically located. Even though engineers interact with one central system, the actual sensitive data stays local in each region, which is crucial for meeting privacy and regulatory requirements.

This architecture ensures compliance across eight different regulatory domains, which are different legal jurisdictions that have their own rules about where data can be stored and who can access it.

Fine-Grained Access Control

The second principle is fine-grained access control, which means the platform has very precise and detailed rules about who can do what. Access permissions are enforced at multiple levels, such as:

  • The Team Level: Determines which teams can access what.

  • The Resource Level: Determines which specific databases or systems.

  • The RPC Level: Determines which specific operations or function calls.

This multi-layered permission system ensures that both human engineers and AI agents only perform actions they’re authorized to do, preventing accidental or unauthorized changes.

Unified Orchestration

The third principle is unified orchestration, which means the platform brings together all the existing infrastructure services under one cohesive system.

This orchestration creates consistent abstractions, which means engineers can work with databases the same way whether they’re on AWS in Virginia, Azure in Europe, or Google Cloud in Asia. By providing these consistent abstractions, the platform eliminates the need for engineers to learn and handle cloud-specific or region-specific differences in how things work.

AI Agent Implementation

The Databricks engineering team built a lightweight framework for its AI agent, inspired by two existing technologies: MLflow’s prompt optimization tools and DSPy.

The key innovation of this framework is that it decouples (separates) the prompting from the tool implementation, meaning engineers can change what the AI says without having to rewrite how the underlying tools work. Engineers define tools by writing simple classes in Scala with function signatures that describe what each tool does, rather than having to write complex instructions for the AI. Each tool just needs a short docstring description, and the large language model can automatically figure out three important things: what format of input the tool needs, what structure the output will have, and how to interpret the results.

See the diagram below:

This design enables rapid iteration, meaning engineers can quickly experiment with different prompts and swap tools in and out without having to modify the underlying infrastructure that handles parsing data, connecting to the LLM, or managing the conversation state.
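To make the idea concrete, here is a rough Python sketch of what a docstring-driven tool definition could look like. The actual framework uses Scala classes and internal Databricks services; every name below (Tool, register_tool, get_innodb_status) is hypothetical.

```python
# Hypothetical sketch of the "docstring-as-tool-spec" idea. The real framework
# uses Scala classes; names and structure here are illustrative only.
import inspect
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Tool:
    name: str
    description: str   # docstring the LLM reads to decide when/how to call it
    signature: str     # function signature the LLM uses to format arguments
    run: Callable[..., dict]


def register_tool(registry: Dict[str, Tool], fn: Callable[..., dict]) -> None:
    """Register a plain function as an agent tool using only its docstring."""
    registry[fn.__name__] = Tool(
        name=fn.__name__,
        description=inspect.getdoc(fn) or "",
        signature=str(inspect.signature(fn)),
        run=fn,
    )


def get_innodb_status(instance_id: str) -> dict:
    """Return a snapshot of MySQL InnoDB state (transactions, I/O, deadlocks)
    for the given database instance."""
    # Placeholder: a real tool would call internal services to fetch this data.
    return {"instance": instance_id, "deadlocks": 0, "active_transactions": 12}


registry: Dict[str, Tool] = {}
register_tool(registry, get_innodb_status)

# The prompt sent to the LLM is generated from the registry, so prompts and
# tool implementations can evolve independently.
tool_manifest = json.dumps(
    [{"name": t.name, "signature": t.signature, "description": t.description}
     for t in registry.values()],
    indent=2,
)
print(tool_manifest)
```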

Agent Decision Loop

The AI agent operates in a continuous decision loop that determines what actions to take based on the user’s needs. A short code sketch of this loop follows the steps below.

  • First, the user’s input goes to the Storex Router, which is like a switchboard that directs the request to the right place.

  • Second, the LLM Endpoint (the large language model) generates a response based on what the user asked and the current context of the conversation.

  • Third, if the LLM determines it needs more information, it executes a Tool Call to retrieve data like database metrics, logs, or configuration details.

  • Fourth, the LLM Response processes the output from the tool, interpreting what the data means in the context of the user’s question.

  • Fifth, the system either loops back to step 2 to gather more information with additional tool calls or it produces a final User Response if it has everything needed to answer the question.
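Here is that sketch; call_llm and the tool registry are hypothetical placeholders, not Databricks APIs.

```python
# Minimal sketch of the decision loop described above.
def run_agent(user_input: str, call_llm, tools: dict, max_steps: int = 8) -> str:
    context = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        # Step 2: the LLM responds based on the question and current context.
        reply = call_llm(context)
        if reply.get("type") == "tool_call":
            # Step 3: fetch metrics, logs, or configuration via the requested tool.
            result = tools[reply["tool"]].run(**reply["arguments"])
            # Step 4: feed the tool output back so the LLM can interpret it.
            context.append({"role": "tool", "content": str(result)})
            continue  # Step 5: loop back for more information if needed.
        return reply["content"]  # Final user response.
    return "Stopped: too many tool calls without a final answer."
```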

Validation Framework

Databricks built a validation framework to ensure that as they improve the AI agent, they don’t accidentally make it worse or introduce bugs (called “regressions”).

The framework captures snapshots of production state, which are like frozen moments in time that record what the databases looked like, what problems existed, and what the correct diagnosis should be. The snapshots include database schemas (the structure of the data), physical database info (hardware and configuration details), metrics like CPU usage and IOPS (input/output operations per second), and the expected diagnostic outputs that represent the “correct answer”. These snapshots are then replayed through the agent, meaning the system feeds old problems to the new version of the AI to see how it handles them. A separate “judge” LLM scores the agent’s responses on two key criteria: accuracy (did it identify the problem correctly) and helpfulness (did it provide useful guidance to the engineer).

See the diagram below:

All of these test results are stored in Databricks tables so the team can analyze trends over time and understand whether their changes are actually improving the agent.
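A hedged sketch of the replay-and-judge idea, with illustrative names only (the actual validation framework is internal to Databricks):

```python
# Snapshot replay plus LLM-as-judge scoring; names are illustrative.
from dataclasses import dataclass
from statistics import mean


@dataclass
class Snapshot:
    schema: dict            # database schema at capture time
    physical_info: dict     # hardware / configuration details
    metrics: dict           # e.g., CPU utilization, IOPS
    expected_diagnosis: str # the "correct answer" for this incident


def evaluate_agent(agent, judge_llm, snapshots: list[Snapshot]) -> dict:
    scores = []
    for snap in snapshots:
        # Replay the frozen production state through the new agent version.
        diagnosis = agent(snap.schema, snap.physical_info, snap.metrics)
        # A separate judge LLM scores accuracy and helpfulness (0.0 to 1.0 each).
        scores.append(judge_llm(diagnosis, snap.expected_diagnosis))
    return {
        "accuracy": mean(s["accuracy"] for s in scores),
        "helpfulness": mean(s["helpfulness"] for s in scores),
    }
    # In the real system, per-run results are written to Databricks tables
    # so regressions can be tracked over time.
```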

Multi-Agent Specialization

Rather than building one giant AI agent that tries to do everything, Databricks’ framework enables them to create specialized agents that each focus on different domains or areas of expertise.

They have a system and database issues agent that specializes in low-level technical problems with the database software and hardware. They have a client-side traffic patterns agent that specializes in understanding how applications are using the database and whether unusual workload patterns are causing problems.

The framework allows them to easily create additional domain-specific agents as they identify new areas where specialized knowledge would be helpful. Each agent builds deep expertise in its particular area by having prompts, tools, and context specifically tailored to that domain, rather than being a generalist.

These specialized agents can collaborate with each other to provide complete root cause analysis, where one agent might identify a traffic spike and another might correlate it with a specific database configuration issue.

Conclusion

The results of Databricks’ AI-assisted debugging platform have been transformative across multiple dimensions.

The platform achieved up to a 90% reduction in debugging time, turning what were once hours-long investigations into tasks that can be completed in minutes. Perhaps most remarkably, new engineers with zero context can now jump-start a database investigation in under 5 minutes, something that was previously nearly impossible without significant training and experience. The platform has achieved company-wide adoption across all engineering teams, demonstrating its value beyond just the database specialists who originally needed it.

The user feedback has been quite positive, with engineers pointing out that they no longer need to remember where various query dashboards are located or spend time figuring out where to find specific information. Multiple engineers described the platform as a big change in developer experience.

Looking forward, the platform lays the foundation for AI-assisted production operations, including automated database restores, production query optimization, and configuration updates. The architecture is designed to extend beyond databases to other infrastructure components, promising to transform how Databricks operates its entire cloud infrastructure at scale.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How Google’s Tensor Processing Unit (TPU) Works?

2026-01-06 00:31:12

4 Key Insights for Scaling LLM Applications (Sponsored)

LLM workflows can be complex, opaque, and difficult to secure. Get the latest ebook from Datadog for practical strategies to monitor, troubleshoot, and protect your LLM applications in production. You’ll get key insights into how to overcome the challenges of deploying LLMs securely and at scale, from debugging multi-step workflows to detecting prompt injection attacks.

Download the eBook


Disclaimer: The details in this post have been derived from the details shared online by the Google Engineering Team. All credit for the technical details goes to the Google Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them

When DeepMind’s AlphaGo defeated Go world champion Lee Sedol in March 2016, the world witnessed a big moment in artificial intelligence. The match was powered by hardware Google had been running in production for over a year, but had never publicly acknowledged.

The Tensor Processing Unit, or TPU, represented something more profound than just another fast chip. It marked a fundamental shift in computing philosophy: sometimes doing less means achieving more.

Since 2015, Google’s TPU family has evolved through seven generations, scaling from single chips serving image recognition queries to 9,216-chip supercomputers training the largest language models in existence. In this article, we look at why Google built custom silicon and how it works, revealing the physical constraints and engineering trade-offs involved.

The Need for TPU

In 2013, Google’s infrastructure team ran a calculation. If Android users adopted voice search at the scale Google anticipated, using it for just three minutes per day, the computational demand would require doubling the company’s entire global data center footprint.

This was a problem with no obvious solution at the time. Building more data centers filled with traditional processors was economically unfeasible. More critically, Moore’s Law had been slowing for years. For decades, the semiconductor industry had relied on the observation that transistor density doubles roughly every two years, delivering regular performance improvements without architectural changes. However, by 2013, this trend was weakening. Google couldn’t simply wait for Intel’s next generation of CPUs to solve its problem.

The root cause of this situation was architectural. Traditional computers follow the Von Neumann architecture, where a processor and memory communicate through a shared bus. To perform any calculation, the CPU must fetch an instruction, retrieve data from memory, execute the operation, and write results back. This constant transfer of information between the processor and memory creates what computer scientists call the Von Neumann bottleneck.

The energy cost of moving data across this bus often exceeds the energy cost of the computation itself. For example, imagine a chef preparing a meal but having to walk to a distant pantry for each ingredient. The cooking takes seconds, but the walking consumes hours. For general-purpose computing tasks like word processing or web browsing, this design makes sense because workloads are unpredictable. However, neural networks are different.

Deep learning models perform one operation overwhelmingly: matrix multiplication. A neural network processes information by multiplying input data by learned weight matrices, adding bias values, and applying activation functions. This happens billions of times for a single prediction. Modern language models with hundreds of billions of parameters require hundreds of billions of multiply-add operations per query. Critically, these operations are predictable, parallel, and deterministic.
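As a tiny concrete illustration, a single layer boils down to one matrix multiply plus a bias and an activation; the shapes below are arbitrary.

```python
import numpy as np

# One layer of a neural network: multiply inputs by a weight matrix,
# add a bias, apply a non-linear activation. Shapes are illustrative.
x = np.random.rand(1, 512)          # input activations
W = np.random.rand(512, 1024)       # learned weights
b = np.random.rand(1024)            # learned biases

y = np.maximum(x @ W + b, 0.0)      # ReLU(xW + b): ~512 * 1024 multiply-adds
```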

CPUs devote significant processing power to features like branch prediction and out-of-order execution, designed to handle unpredictable code. Graphics Processing Units, or GPUs, improved matters with thousands of cores working in parallel, but they still carried architectural overhead from their graphics heritage. Google’s insight was to build silicon that does only what neural networks need and strip away everything else.

The Systolic Array: A Different Way to Compute

The heart of the TPU is an architecture called a systolic array. The name originates from the Greek word for heartbeat, referencing how data pulses rhythmically through the chip. To understand why this matters, consider how different processors approach the same task.

  • A CPU operates like a single worker running back and forth between a water well and a fire, filling one bucket at a time.

  • A GPU deploys thousands of workers making the same trips simultaneously. Throughput increases, but the traffic between the well and the fire becomes chaotic and energy-intensive.

  • A systolic array takes a fundamentally different approach. The workers form a line and pass buckets hand to hand. Water flows through the chain without anyone returning to the source until the job is complete.

In a TPU, the workers are simple multiply-accumulate units arranged in a dense grid. The first-generation TPU used a 256 by 256 array, meaning 65,536 calculators operating simultaneously. Here’s how computation proceeds:

  • Neural network weights are loaded into each calculator from above and remain stationary.

  • Input data flows in from the left, one row at a time.

  • As data passes through each calculator, it is multiplied by the resident weight.

  • The product adds to a running sum, then passes rightward to the next calculator.

  • Partial results accumulate and flow downward.

  • Final results emerge from the bottom after all calculations are complete.

See the diagram below:

This design means data is read from memory once but used thousands of times as it traverses the array. Traditional processors must access memory for nearly every operation. The systolic array eliminates this bottleneck. Data moves only between spatially adjacent calculators over short wires, dramatically reducing energy consumption.
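To make the mapping concrete, here is a toy Python model of a weight-stationary array. It is not cycle-accurate (real hardware also skews inputs in time so every cell works on each cycle), but it shows how stationary weights, rightward-flowing activations, and downward-flowing partial sums reproduce a matrix multiplication.

```python
import numpy as np

def systolic_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Simplified weight-stationary systolic array: weights stay put,
    activations move right, partial sums move down. The time-skewed
    pipelining of real hardware is omitted for clarity."""
    K, N = W.shape                    # grid of K x N multiply-accumulate cells
    M = X.shape[0]
    Y = np.zeros((M, N))
    for m in range(M):                # one input row at a time
        psum = np.zeros(N)            # partial sums flowing down each column
        for i in range(K):            # activation X[m, i] moves right along row i
            for j in range(N):
                psum[j] += X[m, i] * W[i, j]   # MAC against the resident weight
        Y[m] = psum                   # results emerge from the bottom
    return Y

X = np.random.rand(4, 8)
W = np.random.rand(8, 16)
assert np.allclose(systolic_matmul(X, W), X @ W)
```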

The numbers make a strong case for this approach.

  • TPU v1’s 256 by 256 array could perform 65,536 multiply-accumulate operations per clock cycle. Running at 700 MHz, this delivered 92 trillion 8-bit operations per second while consuming just 40 watts.

  • A contemporary GPU might perform tens of thousands of operations per cycle, while the TPU performed hundreds of thousands.

  • More than 90 percent of the silicon performed useful computation, compared to roughly 30 percent in a GPU.

The trade-off here is absolute specialization. A systolic array can only perform matrix multiplications efficiently. It cannot render graphics, browse the web, or run a spreadsheet. Google accepted this limitation because neural network inference is fundamentally matrix multiplication repeated many times.

The Supporting Architecture

The systolic array requires carefully orchestrated support components to achieve its performance. Each piece solves a specific bottleneck in the pipeline from raw data to AI predictions.

Let’s look at the most important components:

The Matrix Multiply Unit

The Matrix Multiply Unit, or MXU, is the systolic array itself.

TPU v1 used a single 256-by-256 array operating on 8-bit integers. Later versions shifted to 128 by 128 arrays using Google’s BFloat16 format for training workloads, then returned to 256 by 256 arrays in v6 for quadrupled throughput. The weight-stationary design minimizes data movement, which is the primary consumer of energy in computing.

Unified Buffer

The Unified Buffer provides 24 megabytes of on-chip SRAM, serving as a high-speed staging area between slow external memory and the hungry MXU.

This buffer stores input activations arriving from the host computer, intermediate results between neural network layers, and final outputs before transmission. Since this memory sits directly on the chip, it operates at a higher bandwidth than external memory. This difference is critical for keeping the MXU continuously fed with data rather than sitting idle waiting for memory access.

Vector Processing Unit

The Vector Processing Unit handles operations that the MXU cannot. This includes activation functions like ReLU, sigmoid, and tanh.

Neural networks require non-linearity to learn complex patterns. Without it, multiple layers would collapse mathematically into a single linear transformation. Rather than implementing these functions in software, the TPU has dedicated hardware circuits that compute activations in a single cycle. Data typically flows from the MXU to the VPU for activation processing before moving to the next layer.

Accumulators

Accumulators collect the 32-bit results flowing from the MXU.

When multiplying 8-bit inputs, products are 16-bit, but accumulated sums grow larger through repeated addition. Using 32-bit accumulators prevents overflow during the many additions a matrix multiplication requires. The accumulator memory totals 4 megabytes across 4,096 vectors of 256 elements each.

Weight FIFO Buffer

The Weight FIFO buffer stages weights between external memory and the MXU using a technique called double-buffering.

The MXU holds two sets of weight tiles: one actively computing while the other loads from memory. This overlap completely hides memory latency, ensuring the computational units never wait for data.

High Bandwidth Memory

High Bandwidth Memory evolved across TPU generations.

The original v1 used DDR3 memory delivering 34 gigabytes per second. Modern Ironwood TPUs achieve 7.4 terabytes per second, a 217-fold improvement. HBM accomplishes this by stacking multiple DRAM dies vertically with thousands of connections between them, enabling bandwidth impossible with traditional memory packaging.

The Precision Advantage

TPUs gain significant efficiency through quantization, using lower-precision numbers than traditional floating-point arithmetic. This choice has big hardware implications that cascade through the entire design.

Scientific computing typically demands high precision. Calculating pi to ten decimal places requires careful representation of very small differences. However, neural networks operate differently. They compute probabilities and patterns. For example, whether a model predicts an image is 85 percent likely to be a cat versus 85.3472 percent likely makes no practical difference to the classification.

A multiplier circuit’s silicon area scales roughly with the square of the operand bit width. An 8-bit multiplier requires roughly 64 units of silicon area, whereas a 32-bit floating-point multiplier (which multiplies 24-bit mantissas) requires about 576 units. This relationship explains why TPU v1 could pack 65,536 multiply-accumulate units into a modest chip while a GPU contains far fewer floating-point units. More multipliers mean more parallel operations per cycle.

The first TPU used 8-bit integers for inference, reducing memory requirements by four times compared to 32-bit floats. A 91-megabyte model becomes 23 megabytes when quantized. Research demonstrated that inference rarely needs 32-bit precision. The extra decimal places don’t meaningfully affect predictions.
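A minimal sketch of the general idea behind post-training INT8 quantization follows (a simple symmetric, per-tensor scheme for illustration; not Google’s exact method).

```python
import numpy as np

# Toy illustration of post-training quantization: map float32 weights to
# int8 with a single per-tensor scale. Sizes are illustrative.
weights_fp32 = np.random.randn(1024, 1024).astype(np.float32)

scale = np.abs(weights_fp32).max() / 127.0   # symmetric quantization scale
weights_int8 = np.clip(np.round(weights_fp32 / scale), -128, 127).astype(np.int8)

print(weights_fp32.nbytes / 1e6, "MB ->", weights_int8.nbytes / 1e6, "MB")  # 4x smaller
# At inference time: y ~ (x_int8 @ w_int8) * (x_scale * w_scale)
```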

Training requires more precision because small gradient updates accumulate over millions of iterations. Google addressed this by inventing BFloat16, or Brain Floating Point 16. This format keeps the same 8-bit exponent as a 32-bit float but uses only 7 bits for the mantissa. The key insight was that neural networks are far more sensitive to dynamic range, controlled by the exponent, than to precision, controlled by the mantissa. BFloat16 therefore provides the wide dynamic range of a 32-bit float with half the bits, enabling efficient training without the overflow problems that plagued alternative 16-bit formats.

See the diagram below:

Modern TPUs support multiple precision modes.

  • BFloat16 for training.

  • INT8 for inference, which runs twice as fast on TPU v5e.

  • The newest FP8 format.

Ironwood is the first TPU with native FP8 support, avoiding the emulation overhead of earlier generations.

Evolution Journey

TPU development follows a clear trajectory.

Each generation increased performance while improving energy efficiency. The evolution reveals how AI hardware requirements shifted as models scaled.

  • TPU v1 launched secretly in 2015, focusing exclusively on inference. Built on 28-nanometer process technology and consuming just 40 watts, it delivered 92 trillion 8-bit operations per second. The chip connected via PCIe to standard servers and began powering Google Search, Photos, Translate, and YouTube before anyone outside Google knew it existed. In March 2016, TPU v1 powered AlphaGo’s victory over Lee Sedol, proving that application-specific chips could beat general-purpose GPUs by factors of 15 to 30 times in speed and 30 to 80 times in power efficiency.

  • TPU v2 arrived in 2017 with fundamental architecture changes to support training. Replacing the 256 by 256 8-bit array with two 128 by 128 BFloat16 arrays enabled the floating-point precision training requires. Adding High Bandwidth Memory, 16 gigabytes at 600 gigabytes per second, eliminated the memory bottleneck that limited v1. Most importantly, v2 introduced the Inter-Chip Interconnect, custom high-speed links connecting TPUs directly to each other. This enabled TPU Pods where 256 chips operate as a single accelerator delivering 11.5 petaflops.

  • TPU v3 in 2018 doubled performance to 420 teraflops per chip and introduced liquid cooling to handle increased power density. Pod size expanded to 1,024 chips, exceeding 100 petaflops, enough to train the largest AI models of that era in reasonable timeframes.

  • TPU v4 in 2021 brought multiple innovations. SparseCores accelerated embedding operations critical for recommendation systems and language models by five to seven times using only 5 percent of the chip area. Optical Circuit Switches enabled dynamic network topology reconfiguration. Instead of fixed electrical cables, robotic mirrors steer beams of light between fibers. This allows the interconnect to route around failures and scale to 4,096-chip Pods approaching one exaflop. The 3D torus topology, with each chip connected to six neighbors instead of four, reduced communication latency for distributed training.

  • Ironwood, or TPU v7, launched in 2025 and represents the most significant leap. Designed specifically for the age of inference, where deploying AI at scale matters more than training, each chip delivers 4,614 teraflops with 192 gigabytes of HBM at 7.4 terabytes per second bandwidth.

Conclusion

TPU deployments demonstrate practical impact across diverse applications.

For reference, a single TPU processes over 100 million Google Photos per day. AlphaFold’s solution to the 50-year protein folding problem, earning the 2024 Nobel Prize in Chemistry, ran on TPUs. Training PaLM, a 540-billion-parameter language model, across 6,144 TPU v4 chips achieved 57.8 percent hardware utilization over 50 days, remarkable efficiency for distributed training at that scale. Beyond Google, TPUs power Anthropic’s Claude assistant, Midjourney’s image generation models, and numerous research breakthroughs.

However, TPUs aren’t universally superior. They excel at large-scale language model training and inference, CNNs and Transformers with heavy matrix operations, high-throughput batch processing, and workloads prioritizing energy efficiency. On the other hand, GPUs remain better choices for PyTorch-native development, since running PyTorch on TPUs requires the PyTorch/XLA bridge, which adds some friction. Small batch sizes, mixed AI and graphics workloads, multi-cloud deployments, and rapid prototyping also often favor GPUs.

TPUs represent a broader industry shift toward domain-specific accelerators.

The general-purpose computing model, where CPUs run any program reasonably well, hits physical limits when workloads scale to trillions of operations per query. Purpose-built silicon that sacrifices flexibility for efficiency delivers order-of-magnitude improvements that no amount of general-purpose processor optimization can match.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

EP196: Cloud Load Balancer Cheat Sheet

2026-01-04 00:31:02

Cut Code Review Time & Bugs in Half (Sponsored)

Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and flagging the potential impact of every pull request.

Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.

CodeRabbit has so far reviewed more than 10 million PRs, is installed on 2 million repositories, and is used by 100 thousand open-source projects. CodeRabbit is free for all open-source repos.

Get Started Today


This week’s system design refresher:


Cloud Load Balancer Cheat Sheet

Efficient load balancing is vital for optimizing the performance and availability of your applications in the cloud.

However, managing load balancers can be overwhelming, given the various types and configuration options available.

In today's multi-cloud landscape, mastering load balancing is essential to ensure seamless user experiences and maximize resource utilization, especially when orchestrating applications across multiple cloud providers. Having the right knowledge is key to overcoming these challenges and achieving consistent, reliable application delivery.

In selecting the appropriate load balancer type, it's essential to consider factors such as application traffic patterns, scalability requirements, and security considerations. By carefully evaluating your specific use case, you can make informed decisions that enhance your cloud infrastructure's efficiency and reliability.

This Cloud Load Balancer cheat sheet can help simplify the decision-making process and guide you toward the most effective load balancing strategy for your cloud-based applications.

Over to you: What factors do you believe are most crucial in choosing the right load balancer type for your applications?


How CQRS Works?

CQRS (Command Query Responsibility Segregation) separates write (Command) and read (Query) operations for better scalability and maintainability.

Here’s how it works (a minimal code sketch follows the steps):

  1. The client sends a command to update the system state. A Command Handler validates and executes logic using the Domain Model.

  2. Changes are saved in the Write Database and can also be saved to an Event Store. Events are emitted to update the Read Model asynchronously.

  3. The projections are stored in the Read Database. This database is eventually consistent with the Write Database.

  4. On the query side, the client sends a query to retrieve data.

  5. A Query Handler fetches data from the Read Database, which contains precomputed projections.

  6. Results are returned to the client without hitting the write model or the write database.
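As mentioned above, here is a minimal in-memory sketch of these steps. The projection update is synchronous here for brevity, whereas a real system would propagate events asynchronously and be only eventually consistent.

```python
# Toy CQRS sketch: "databases" are plain dicts/lists, all names are illustrative.
import uuid

write_db: dict[str, dict] = {}      # write model (source of truth)
event_store: list[dict] = []        # optional event log
read_db: dict[str, dict] = {}       # precomputed projections for queries


def handle_place_order_command(customer: str, amount: float) -> str:
    """Command side: validate, persist, emit an event."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    order_id = str(uuid.uuid4())
    write_db[order_id] = {"customer": customer, "amount": amount, "status": "PLACED"}
    event = {"type": "OrderPlaced", "order_id": order_id,
             "customer": customer, "amount": amount}
    event_store.append(event)
    project(event)                   # asynchronous in a real system
    return order_id


def project(event: dict) -> None:
    """Update the read model from the emitted event."""
    if event["type"] == "OrderPlaced":
        read_db[event["order_id"]] = {"customer": event["customer"], "status": "PLACED"}


def handle_get_order_query(order_id: str) -> dict:
    """Query side: read from the projection only, never from the write model."""
    return read_db[order_id]


oid = handle_place_order_command("alice", 42.0)
print(handle_get_order_query(oid))
```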

Over to you: What else will you add to understand CQRS?


How does Docker Work?

Docker’s architecture is built around three main components that work together to build, distribute, and run containers.

  1. Docker Client
    This is the interface through which users interact with Docker. It sends commands (such as build, pull, run, push) to the Docker Daemon using the Docker API.

  2. Docker Host
    This is where the Docker Daemon runs. It manages images, containers, networks, and volumes, and is responsible for building and running applications.

  3. Docker Registry
    The storage system for Docker images. Public registries like Docker Hub or private registries allow pulling and pushing images.

Over to you: Do you use Docker in your projects?


6 Practical AWS Lambda Application Patterns You Must Know

AWS Lambda pioneered the serverless paradigm, allowing developers to run code without provisioning, managing, or scaling servers. Let’s look at a few practical application patterns you can implement using Lambda; a sample handler sketch follows the list.

  1. On-Demand Media Transformation
    Whenever a user requests an image from S3 in a format that isn’t available, an on-demand transformation can be done using AWS Lambda.

  2. Multiple Data Format from Single Source
    AWS Lambda can work with SNS to create a layer where data can be processed in the required format before sending to the storage layer.

  3. Real-time Data Processing
    Create a Kinesis stream and corresponding Lambda function to process different types of data (clickstream, logs, location tracking, or transactions) from your application.

  4. Change Data Capture
    Amazon DynamoDB can be integrated with AWS Lambda to respond to database events (inserts, updates, and deletes) in the DynamoDB streams.

  5. Serverless Image Processing
    Process and recognize images in a serverless manner using AWS Lambda. Integrate with AWS Step Functions for better workflow management.

  6. Automated Stored Procedure
    Invoke Lambda as a stored procedure to trigger functionality before/after some operations are performed on a particular database table.
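As promised above, here is a hedged sketch of pattern 4 (Change Data Capture): a Lambda handler attached to a DynamoDB stream. The table contents and the downstream action are made up for illustration.

```python
import json


def handler(event, context):
    """Lambda handler for DynamoDB Streams: react to inserts, updates, deletes."""
    records = event.get("Records", [])
    for record in records:
        action = record["eventName"]              # INSERT | MODIFY | REMOVE
        keys = record["dynamodb"].get("Keys", {})
        if action == "INSERT":
            new_image = record["dynamodb"].get("NewImage", {})
            # e.g., push the new row to a search index or emit a notification
            print("New item:", json.dumps(new_image))
        elif action == "REMOVE":
            print("Deleted item keys:", json.dumps(keys))
    return {"processed": len(records)}
```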

Over to you: Have you used AWS Lambda in your project?


Containerization Explained: From Build to Runtime

“Build once, run anywhere.” That’s the promise of containerization, and here’s how it actually works:

Build Flow: Everything starts with a Dockerfile, which defines how your app should be built. When you run docker build, it creates a Docker Image containing:

  • Your code

  • The required dependencies

  • Necessary libraries

This image is portable. You can move it across environments, and it’ll behave the same way, whether on your local machine, a CI server, or in the cloud.

Runtime Architecture: When you run the image, it becomes a Container, an isolated environment that executes the application. Multiple containers can run on the same host, each with its own filesystem, process space, and network stack.

The Container Engine (like Docker, containerd, CRI-O, or Podman) manages:

  • The container lifecycle

  • Networking and isolation

  • Resource allocation

All containers share the Host OS kernel, sitting on top of the hardware. That’s how containerization achieves both consistency and efficiency: lightweight like processes, but isolated like VMs.

Over to you: When deploying apps, do you prefer Docker, containerd, or Podman, and why?


🚀 Learn AI in the New Year: Become an AI Engineer Cohort 3 Now Open

After the amazing success of Cohorts 1 and 2 (with close to 1,000 engineers joining and building real AI skills), we are excited to announce the launch of Cohort 3 of Become an AI Engineer!

Check it out Here


Message Brokers 101: Storage, Replication, and Delivery Guarantees

2026-01-02 00:33:32

A message broker is a middleware system that facilitates asynchronous communication between applications and services using messages.

At its core, a broker decouples producers of information from consumers, allowing them to operate independently without direct knowledge of each other. This decoupling is foundational to modern distributed architectures, where services communicate through the broker rather than directly with one another, enabling them to evolve independently without tight coupling.

To understand this in practice, consider an order-processing service that places an “Order Placed” message on a broker. Downstream services such as inventory, billing, and shipping will get that message from the broker when they are ready to process it, rather than the order service calling each one synchronously. This approach eliminates the need for the order service to know about or wait for these downstream systems.
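A toy in-memory broker makes the decoupling concrete: the order service only publishes, and each downstream service consumes at its own pace. A production system would use Kafka, RabbitMQ, SQS, or similar rather than this sketch.

```python
from collections import deque


class Broker:
    """Toy broker: each subscriber gets its own queue (fan-out), so consumers
    process messages independently and at their own pace."""

    def __init__(self):
        self.subscribers: dict[str, dict[str, deque]] = {}   # topic -> {consumer: queue}

    def subscribe(self, topic: str, consumer: str) -> None:
        self.subscribers.setdefault(topic, {})[consumer] = deque()

    def publish(self, topic: str, message: dict) -> None:
        for queue in self.subscribers.get(topic, {}).values():
            queue.append(message)          # producer returns immediately

    def poll(self, topic: str, consumer: str):
        queue = self.subscribers[topic][consumer]
        return queue.popleft() if queue else None


broker = Broker()
for service in ("inventory", "billing", "shipping"):
    broker.subscribe("orders", service)

broker.publish("orders", {"type": "OrderPlaced", "order_id": "o-123", "items": ["book"]})

# Each service consumes whenever it is ready, without the order service waiting.
print(broker.poll("orders", "billing"))
print(broker.poll("orders", "shipping"))
```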

Message brokers are not merely pipes for data transmission. They are sophisticated distributed databases specialized for functionalities such as stream processing and task distribution. The fundamental value proposition of a message broker lies in its ability to introduce a temporal buffer between distinct systems. By allowing a producer to emit a message without waiting for a consumer to process it, the broker facilitates temporal decoupling. This ensures that a spike in traffic at the ingress point does not immediately overwhelm downstream services.

In this article, we will look at how message brokers work in detail and explore the various patterns they enable in distributed system design.

Fundamental Terms

Read more

OpenAI CLIP: The Model That Learnt Zero-Shot Image Recognition Using Text

2025-12-30 00:30:45

If Your API Isn’t Fresh, Your Agents Aren’t Either. (Sponsored)

In the agentic era, outdated retrieval breaks workflows. This API Benchmark Report from You.com shows how each major search API performs to reveal which can best answer real-world, time-sensitive queries.

What’s inside:

  • Head-to-head benchmarks comparing You.com, Google SerpAPI, Exa, and Tavily across accuracy, latency, and cost

  • Critical performance data to identify which APIs best handle time-sensitive queries

  • A data-driven analysis of the Latency vs. Accuracy trade-off to help you select the best retrieval layer for enterprise agents

Curious who performed best?

Get the 2025 API Benchmark Report


Disclaimer: The details in this post have been derived from the details shared online by the OpenAI Engineering Team. All credit for the technical details goes to the OpenAI Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Imagine teaching a computer to recognize objects not by showing it millions of labeled photos, but by letting it browse the internet and learn from how people naturally describe images. That’s exactly what OpenAI’s CLIP does, and it represents a fundamental shift in how we teach machines to understand visual content.

CLIP (Contrastive Language-Image Pre-training) is a neural network that connects vision and language. Released in January 2021, it can classify images into any categories you want without being specifically trained for that task. Just tell it what you’re looking for in plain English, and it can recognize it. This “zero-shot” capability makes CLIP different from almost every computer vision system that came before it.

In this article, we will look at how CLIP works and the problems it tries to solve.

The Problem CLIP Solves

Traditional computer vision followed a rigid formula. If you want a model to distinguish cats from dogs, you need thousands of labeled photos. For different car models, you need another expensive dataset. For reference, ImageNet, one of the most famous image datasets, required over 25,000 workers to label 14 million images.

This approach created three major problems:

  • First, datasets were expensive and time-consuming to build.

  • Second, models became narrow specialists. An ImageNet model could recognize 1,000 categories, but adapting it to new tasks required collecting more data and retraining.

  • Third, models could “cheat” by optimizing for specific benchmarks.

For example, a model achieving 76% accuracy on ImageNet might drop to 37% on sketches of the same objects, or plummet to 2.7% on slightly modified images. Models learned ImageNet’s quirks rather than truly understanding visual concepts.

CLIP’s approach is radically different. Instead of training on carefully labeled datasets, it learns from 400 million image-text pairs collected from across the internet. These pairs are everywhere online: Instagram photos with captions, news articles with images, product listings with descriptions, and Wikipedia entries with pictures. People naturally write text that describes, explains, or comments on images, creating an enormous source of training data.

However, CLIP doesn’t try to predict specific category labels. Instead, it learns to match images with their corresponding text descriptions. During training, CLIP sees an image and a huge batch of text snippets (32,768 at a time). Its job is to determine which text snippet best matches the image.

Think of it as a massive matching game. For example, we show the system a photo of a golden retriever playing in a park. Among 32,768 text options, only one is correct: maybe “a golden retriever playing fetch in the park.” The other 32,767 options might include “a black cat sleeping,” “a mountain landscape at sunset,” “a person eating pizza,” and thousands of other descriptions. To consistently pick the right match across millions of such examples, CLIP must learn what objects, scenes, actions, and attributes look like and how they correspond to language.

By solving this matching task over and over with incredibly diverse internet data, CLIP develops a deep understanding of visual concepts and their linguistic descriptions. For example, it might learn that furry, four-legged animals with wagging tails correspond to words like “dog” and “puppy”. It might learn that orange and pink skies over water relate to “sunset” and “beach.” In other words, it builds a rich mental model connecting the visual and linguistic worlds.


👋 Goodbye low test coverage and slow QA cycles (Sponsored)

Bugs sneak out when less than 80% of user flows are tested before shipping. However, getting that kind of coverage (and staying there) is hard and pricey for any team.

QA Wolf’s AI-native solution provides high-volume, high-speed test coverage for web and mobile apps, reducing your organization’s QA cycle to minutes.

They can get you:

  • 80% automated E2E test coverage in weeks—not years

  • Unlimited parallel test runs

  • 24-hour maintenance and on-demand test creation

  • Zero flakes, guaranteed

The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.

With QA Wolf, Drata’s team of engineers achieved 4x more test cases and 86% faster QA cycles.

⭐ Rated 4.8/5 on G2

Schedule a demo to learn more


The Technical Foundation

Under the hood, CLIP uses two separate neural networks working in tandem: an image encoder and a text encoder.

The image encoder takes raw pixels and converts them into a numerical vector (called an embedding). The text encoder takes words and sentences and also outputs a vector. The key insight is that both encoders output vectors in the same dimensional space, making them directly comparable.

Initially, these encoders may produce completely random, meaningless vectors. For example, an image of a dog might become [0.2, -0.7, 0.3, ...] while the text “dog” becomes [-0.5, 0.1, 0.9, ...]. These numbers have no relationship whatsoever. But here’s where training works its magic.

The training process uses what’s called a contrastive loss function. This is simply a mathematical way of measuring how wrong the model currently is. For correct image-text pairs (like a dog image with “dog playing fetch”), the loss function says these embeddings should be very similar. For incorrect pairs (like a dog image with “cat sleeping”), it says they should be very different. The loss function produces a single number representing the total error across all images and texts in a batch.

See the diagram below:

Then comes backpropagation, the fundamental learning mechanism in neural networks. It calculates exactly how each weight in both encoders should change to reduce this error. The weights update slightly, and the process repeats millions of times with different batches of data. Gradually, both encoders learn to produce similar vectors for matching concepts. For example, images of dogs start producing vectors near where the text encoder puts the word “dog”.

In other words, through the continuous pressure to match correct pairs and separate incorrect ones across millions of diverse examples, the encoders evolve to speak the same language.
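The commonly used simplified form of this symmetric contrastive loss can be written in a few lines of PyTorch (the real model also learns the temperature; it is fixed here for brevity).

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of N matching image-text pairs:
    row i of image_emb should match row i of text_emb and nothing else."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # N x N similarity matrix
    targets = torch.arange(len(logits))               # correct pairs are on the diagonal
    loss_i = F.cross_entropy(logits, targets)         # match each image to its text
    loss_t = F.cross_entropy(logits.t(), targets)     # match each text to its image
    return (loss_i + loss_t) / 2


# Example with random "embeddings" standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```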

Zero-Shot Classification in Action

Once CLIP is trained, its zero-shot capabilities become evident. Suppose we want to classify images as containing either dogs or cats. We don’t need to retrain CLIP or show it labeled examples.

Instead, we can simply take the image and pass it through the image encoder to get an embedding. Next, we can take the text “a photo of a dog” and pass it through the text encoder to get another embedding. Then we can take the text “a photo of a cat” and get a third embedding. Finally, we compare which text embedding is closer to the image embedding; the closer one is the answer.

See the diagram below:

CLIP is essentially asking: “Based on everything learned from the internet, would this image more likely appear with text about dogs or text about cats?”

Since it learned from such diverse data, this approach works for nearly any classification task you can describe in words.

Want to classify types of food? Use “a photo of pizza,” “a photo of sushi,” “a photo of tacos” as your categories. Need to analyze satellite imagery? Try “a satellite photo of a forest,” “a satellite photo of a city,” “a satellite photo of farmland.” Working with medical images? You could use “an X-ray showing pneumonia” versus “an X-ray of healthy lungs.” You just change the text descriptions. No retraining required.

This flexibility is transformative. Traditional models needed extensive labeled datasets for each new task. CLIP can tackle new tasks immediately, limited only by your ability to describe categories in natural language.
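For example, using the open-source CLIP package released alongside the model, zero-shot classification looks roughly like this; "photo.jpg" and the label texts are placeholders.

```python
# Zero-shot classification with the open-source CLIP package from the
# openai/CLIP repository (see its README for installation instructions).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
labels = ["a photo of a dog", "a photo of a cat"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)      # similarity of the image to each text
    probs = logits_per_image.softmax(dim=-1).squeeze()

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Changing the task is just a matter of changing the strings in `labels`; no retraining is involved.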

Design Choices That Made CLIP Possible

CLIP’s success wasn’t just about the core idea. OpenAI made two critical technical decisions that made training computationally feasible.

  • First, they chose contrastive learning over the more obvious approach of training the model to generate image captions. Early experiments tried teaching systems to look at images and produce full text descriptions word by word, similar to how language models generate text. While intuitive, this approach proved incredibly slow and computationally expensive. Generating entire sentences requires much more computation than simply learning to match images with text. Contrastive learning turned out to be 4 to 10 times more efficient for achieving good zero-shot performance.

  • Second, they adopted Vision Transformers for the image encoder. Transformers, the architecture behind GPT and BERT, had already revolutionized natural language processing. Applying them to images (treating image patches like words in a sentence) provided another 3x computational efficiency gain over traditional convolutional neural networks like ResNet.

Combined, these choices meant CLIP could be trained on 256 GPUs for two weeks, similar to other large-scale vision models of the time, rather than requiring astronomically more compute.

Conclusion

OpenAI tested CLIP on over 30 different datasets covering diverse tasks: fine-grained classification, optical character recognition, action recognition, geographic localization, and satellite imagery analysis.

The results validated CLIP’s approach. While matching ResNet-50’s 76.2% accuracy on standard ImageNet, CLIP outperformed the best publicly available ImageNet model on 20 out of 26 transfer learning benchmarks. More importantly, CLIP maintained strong performance on stress tests where traditional models collapsed. On ImageNet Sketch, CLIP achieved 60.2% versus ResNet’s 25.2%. On adversarial examples, CLIP scored 77.1% compared to ResNet’s 2.7%.

However, the model still struggles with some things, such as:

  • Tasks requiring precise spatial reasoning or counting. It also has difficulty with very fine-grained distinctions, like differentiating between similar car models or aircraft variants where subtle details matter.

  • When tested on handwritten digits from the MNIST dataset (a task considered trivial in computer vision), CLIP achieved only 88% accuracy, well below the 99.75% human performance.

  • CLIP exhibits sensitivity to how you phrase your text prompts. Sometimes it requires trial and error (“prompt engineering”) to find wording that works well.

  • CLIP inherits biases from its internet training data. The way we phrase categories can dramatically influence model behavior in problematic ways.

However, despite the limitations, CLIP demonstrates that the approach powering recent breakthroughs in natural language processing (learning from massive amounts of internet text) can transfer to computer vision. Just as GPT models learned to perform diverse language tasks by training on internet text, CLIP learned diverse visual tasks by training on internet image-text pairs.

Since its release, CLIP has become foundational infrastructure across the AI industry. It’s fully open source, catalyzing widespread adoption. Modern text-to-image systems like Stable Diffusion and DALL-E use CLIP-like models to understand text prompts. Companies employ it for image search, content moderation, and recommendations.

References:


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].