2025-12-24 00:30:27
Real-time isn’t just about speed. It’s about instant, fresh, and reliable responses at scale.
This definitive Redis guide breaks down how to architect a real-time data layer that keeps user experiences snappy, AI agents responsive, and data up to date across your stack.
Inside, you’ll learn:
How to get your apps from “fast” to truly real-time
The role of Redis in low-latency caching, vector search, AI agent memory, and streaming workloads
Real-world patterns from companies using Redis to cut latency, reduce drop-offs, and keep users in flow
Note: This article was written in collaboration with the Shopify engineering team, who shared details about their Black Friday Cyber Monday preparation work with us and reviewed the final article before publication. All credit for the technical details shared in this article goes to the Shopify engineering team.
Black Friday Cyber Monday (BFCM) 2024 was massive for Shopify. The platform processed 57.3 petabytes of data, handled 10.5 trillion database queries, and peaked at 284 million requests per minute on its edge network. On app servers alone, they handled 80 million requests per minute while pushing 12 terabytes of data every minute on Black Friday.
Here’s the interesting part: this level of traffic is now the baseline for Shopify. And BFCM 2025 was even bigger, serving 90 petabytes of data, handling 1.75 trillion database writes, and peaking at 489 million requests per minute. This is why Shopify rebuilt its entire BFCM readiness program from scratch.
The preparation involved thousands of engineers working for nine months, running five major scale tests.
In this article, we will look at how Shopify prepared for success during the “Super Bowl of commerce.”
Shopify’s BFCM preparation started in March with a multi-region strategy on Google Cloud.
The engineering team organized the work into three parallel tracks that influence one another:
Capacity Planning involves modeling traffic patterns using historical data and merchant growth projections. The team submits these estimates to their cloud providers early so the providers can ensure they have enough physical infrastructure available. This planning defines how much computing power Shopify needs and where it needs to be located geographically.
The Infrastructure Roadmap is where the team reviews their technology stack, evaluates what architectural changes are needed, and identifies system upgrades required to hit their target capacity. This track helps sequence all the work ahead. Importantly, Shopify never uses BFCM as a release deadline. Every architectural change and migration happens months before the critical window.
Risk Assessments use “What Could Go Wrong” exercises to document failure scenarios. The team sets escalation priorities and generates inputs for what they call Game Days. This intelligence helps them test and harden systems well in advance.
These three tracks constantly feed into each other. For example, risk findings might reveal capacity gaps the team didn’t account for. Infrastructure changes might introduce new risks that need assessment. In other words, it’s a continuous feedback loop.
To assess risks properly, the Shopify engineering team runs Game Days. These are chaos engineering exercises that intentionally simulate production failures at the BFCM scale.
The team started hosting Game Days in early spring. This involves deliberately injecting faults into the systems to test how they respond under failure conditions. Think of it like a fire drill, but for software.
During these Game Days, the engineering team focuses extra attention on what they call “critical journeys”. These are the most business-critical paths through their platform: checkout, payment processing, order creation, and fulfillment. If these break during BFCM, merchants lose sales immediately.
Critical Journey Game Days run cross-system disaster simulations. Here are some of the aspects the team commonly tests:
The team tests search and pages endpoints while randomizing navigation to mimic real user behavior. They inject network faults and latency to see what happens when services can’t communicate quickly.
They bust caches to create realistic load patterns instead of the artificially fast responses you get when everything is cached.
Frontend teams run bug bashes during these exercises. They identify regressions, test critical user flows, and validate that the user experience holds up under peak load conditions.
These exercises build muscle memory for incident response by exposing gaps in operational playbooks and monitoring tools.
Most importantly, Shopify closes those gaps well ahead of BFCM instead of discovering them when merchants need the platform most. All findings from Game Days feed into what Shopify calls the Resiliency Matrix. This is centralized documentation that tracks vulnerabilities, incident response procedures, and fixes across the entire platform.
The Resiliency Matrix includes five key components.
First is service status, showing the current operational state of all critical services.
Second is failure scenarios that document how things can break and what the impact would be.
Third is recovery procedures, listing expected recovery time objectives and detailed runbooks for fixing issues.
Fourth is operational playbooks with step-by-step incident response guides.
Fifth is on-call coverage showing team schedules and PagerDuty escalation paths.
The Matrix becomes the roadmap for system hardening before BFCM. Teams update it continuously throughout the year, documenting resilience improvements as they go.
Game Days test components in isolation, but Shopify also needs to know if the entire platform can handle BFCM volumes. That’s where load testing comes in.
The engineering team built a tool called Genghis that runs scripted workflows mimicking real user behavior. It simulates browsing, adding items to the cart, and going through checkout flows. The tool gradually ramps up traffic until something breaks, which helps the team find their actual capacity limits.
Tests run on production infrastructure simultaneously from three Google Cloud regions: us-central, us-east, and europe-west4. This simulates global traffic patterns accurately. Genghis also injects flash sale bursts on top of baseline load to test peak capacity scenarios.
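Genghis is an internal Shopify tool, but the idea translates to open-source load generators. Below is a minimal sketch of a similar scripted workflow using Locust; the endpoints, weights, and numbers are placeholders for illustration, not Shopify’s actual configuration.

```python
# A sketch of a Genghis-style scripted workflow using the open-source Locust
# framework. Endpoint paths and traffic numbers are hypothetical.
from locust import HttpUser, task, between

class Shopper(HttpUser):
    # Pause 1-3 seconds between actions to mimic a real user browsing
    wait_time = between(1, 3)

    @task(3)
    def browse(self):
        # Weighted 3x: most simulated users just browse product pages
        self.client.get("/products")

    @task(1)
    def checkout_flow(self):
        # Add an item to the cart, then start checkout
        self.client.post("/cart/add", json={"item_id": 42, "quantity": 1})
        self.client.get("/checkout")

# Ramp up gradually until something breaks, e.g.:
#   locust -f load_test.py --host https://staging.example.com \
#          --users 10000 --spawn-rate 50 --headless
```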
Shopify pairs Genghis with Toxiproxy, an open-source framework they built for simulating network conditions. Toxiproxy injects network failures and partitions that prevent services from reaching each other. For reference, a network partition is when two parts of your system lose the ability to communicate, even though both are still running.
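For illustration, here is a rough sketch of injecting a latency fault through Toxiproxy’s HTTP control API (default port 8474 per the project’s documentation). The proxy name, ports, and values below are assumptions for the example, not Shopify’s setup.

```python
# Create a Toxiproxy proxy in front of a downstream service, then add a
# latency "toxic" so every response is delayed by roughly one second.
import requests

TOXIPROXY = "http://localhost:8474"

# Proxy sits between the application and the downstream dependency
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "redis_proxy",
    "listen": "127.0.0.1:26379",
    "upstream": "127.0.0.1:6379",
}).raise_for_status()

# Latency toxic: ~1000ms delay with ~250ms jitter on downstream traffic
requests.post(f"{TOXIPROXY}/proxies/redis_proxy/toxics", json={
    "name": "slow_down",
    "type": "latency",
    "stream": "downstream",
    "toxicity": 1.0,
    "attributes": {"latency": 1000, "jitter": 250},
}).raise_for_status()
```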
During tests, teams monitor dashboards in real time and are ready to abort if systems begin to degrade. Multiple teams coordinate to find and fix bottlenecks as they emerge.
When load testing reveals limits, teams have three options:
Horizontal scaling means adding more instances of the application.
Vertical scaling means giving each instance more resources, such as CPU and memory.
Optimizations mean making architecture-level changes that improve performance, ranging from better database queries to performance tuning in every consuming layer, all the way up to the frontend.
These decisions set the final BFCM capacity and drive optimization work across Shopify’s entire stack. The key insight is that the team cannot wait until BFCM to discover the capacity limits. It takes months of preparation to scale infrastructure and optimize code.
BFCM tests every system at Shopify, but 2025 presented a unique challenge. Part of their infrastructure had never experienced holiday traffic, which creates a problem: how do you prepare for peak load when you have no historical data to model from?
In 2024, Shopify’s engineering team rebuilt its entire analytics platform. They created new ETL pipelines. ETL stands for Extract, Transform, Load, which is the process of pulling data from various sources, processing it, and storing it somewhere useful. They also switched the persistence layer and replaced their legacy system with completely new APIs.
This created an asymmetry. The ETL pipelines ran through BFCM 2024, so the team had one full season of production data showing how those pipelines perform under holiday load. But their API layer launched after peak season ended. They were preparing for BFCM on APIs that had never seen holiday traffic.
This matters a lot because during BFCM, merchants obsessively check their analytics. They want real-time sales numbers, conversion rates, traffic patterns, and data about popular products. Every single one of these queries hits the API layer. If those APIs can’t handle the load, merchants lose visibility during their most critical sales period.
Shopify ran Game Days specifically for the analytics infrastructure. These were controlled experiments designed to reveal failure modes and bottlenecks. The team simulated increased traffic loads, introduced database latency, and tested cache failures to systematically map how the system behaves under stress.
The results showed four critical issues that needed fixes:
First, the ETL pipelines needed Kafka partition increases to maintain data freshness during traffic spikes. Apache Kafka is a distributed streaming platform that handles real-time data flows. More partitions mean more parallel processing, which keeps data fresh for the APIs to serve.
Second, the API layer memory usage required optimization. The team found this through profiling, which means measuring exactly how the code uses memory. Each API request was using too much memory. Under high load, this would cause out-of-memory errors, slower response times, or complete crashes.
Third, connection timeouts needed tuning to prevent pool exhaustion. A connection pool is a set of reusable database connections. Creating new connections is expensive, so applications reuse them. The problem was that timeouts were too long, meaning connections would get stuck waiting. Under high load, you run out of available connections, and new requests start failing. Shopify tuned the timeouts to release connections faster (see the illustrative pool configuration after this list).
Fourth, the team split API requests through a different load balancer approach. Originally, API requests would all enqueue to one region, which added latency and load. By scaling up the secondary region’s cluster and updating the load balancing policy, they better distributed the work and prevented API servers from being overwhelmed.
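To make the connection-pool fix concrete, here is an illustrative Python sketch using SQLAlchemy’s pool settings. Shopify’s actual stack and values are not public, so treat every number below as a placeholder that simply shows which knobs matter.

```python
# Illustrative only: pool sizing and timeout knobs in SQLAlchemy.
# Connection string and values are placeholders, not Shopify's configuration.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://app:secret@db-host/analytics",
    pool_size=20,        # steady-state connections kept open per process
    max_overflow=10,     # extra connections allowed during bursts
    pool_timeout=2,      # seconds to wait for a free connection before failing fast
    pool_recycle=300,    # recycle connections periodically to avoid stale ones
    pool_pre_ping=True,  # verify a connection is alive before handing it out
)
```

Shorter timeouts mean a request either gets a connection quickly or fails fast, instead of piling up behind a stuck connection and exhausting the pool.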
Beyond the performance fixes, the team validated alerting and documented response procedures. Their teams were trained and prepared to handle failures during the actual event.
Game Days and load testing prepare individual components, but scale testing is different. It validates the entire platform working together at BFCM volumes, revealing issues that only surface when everything runs at capacity simultaneously.
From April through October, Shopify ran five major scale tests at their forecasted traffic levels, specifically their peak p90 traffic assumptions. In statistics, p90 means the 90th percentile: here, the traffic level that the team expects to be exceeded only 10% of the time.
Here are the details of those scale tests:
The first two tests validated baseline performance against 2024’s actual numbers.
Tests three through five ramped up to 2025 projections, targeting 150% of last year’s load.
By the fourth test, Shopify hit 146 million requests per minute and over 80,000 checkouts per minute. On the final test of the year, they tested their p99 scenario, which reached 200 million requests per minute.
These tests are extraordinarily large, and therefore, Shopify runs them at night and coordinates with YouTube because the tests impact shared cloud infrastructure. The team tested resilience, not just raw load capacity. They executed regional failovers, evacuating traffic from core US and EU regions to validate their disaster recovery procedures actually work.
Shopify ran four types of tests:
Architecture scale-up tests validated that their infrastructure handles planned capacity.
Load tests during normal operations established baseline performance at peak load.
Load tests with failover validated disaster recovery and cross-region failover capabilities.
Game Day simulations tested cross-system resilience through chaos engineering.
The team simulated real user behavior, such as storefront browsing and checkout, admin API traffic from apps and integrations, analytics and reporting loads, and backend webhook processing. They also tested critical scenarios like sustained peak load, regional failover, and cascading failures where multiple systems fail simultaneously.
Each test cycle identified issues that would never appear under steady-state load, and the team fixed each issue as it emerged. Some of the key issues were as follows:
Scale Tests 1 and 2 revealed that under heavy load, core operations threw errors, and checkout queues backed up.
Scale Test 3 validated key migrations and confirmed that regional routing behaved as expected after infrastructure changes.
Scale Test 4 hit limits that triggered an unplanned failover, identifying priority issues in test traffic routing and discovering delays when bringing regions back online during rebalancing.
Scale Test 5 performed a full dress rehearsal and was the only test run during North American business hours to simulate real BFCM conditions. All the other tests ran at night.
Mid-program, Shopify made an important shift. They added authenticated checkout flows to their test scenarios. Modeling real logged-in buyers exposed rate-limiting code paths that anonymous browsing never touches. Even though authenticated flows were a small percentage of traffic, they revealed bottlenecks that would have caused problems during the real event.
BFCM preparation gets Shopify ready, but operational excellence keeps them steady when traffic actually spikes.
The operational plan coordinates engineering teams, incident response, and live system tuning. Here are the key components of this plan:
The plan for BFCM weekend includes real-time monitoring with dashboard visibility across all regions and automated alerts.
For incident response, Incident Manager OnCall teams provide 24/7 coverage with clear escalation paths.
Merchant communications ensure stores get status updates and notifications about any issues.
Live optimization allows system tuning based on real-time traffic patterns as they develop.
After BFCM ends, the post-mortem process correlates monitoring data with actual merchant outcomes to understand what worked and what needs improvement.
The philosophy is simple: preparation gets you ready, but operational excellence keeps you steady.
Shopify’s 2025 BFCM readiness program shows what systematic preparation looks like at scale. Thousands of engineers worked for nine months, running five major scale tests that pushed their infrastructure to 150% of expected load. They executed regional failovers, ran chaos engineering exercises, documented system vulnerabilities, and hardened systems with updated runbooks before merchants needed them.
What makes this different from typical pre-launch preparation is the systematic approach. Most companies load test once, maybe twice, fix critical bugs, and hope for the best. Shopify spent nine months continuously testing, finding breaking points, fixing issues, and validating that the fixes actually work.
Also, the tools Shopify built aren’t temporary BFCM scaffolding. The Resiliency Matrix, Critical Journey Game Days, and real-time adaptive forecasting became permanent infrastructure improvements. They make Shopify more resilient every day, not just during peak season.
To provide a visualization of BFCM, Shopify also launched an interesting pinball game to showcase the Shopify Live Globe. The game runs at 120fps in the browser with a full 3D environment, a physics engine, and VR support. Behind the scenes, it is a three.js app built with react-three-fiber. Every merchant sale shows up a few seconds later on this globe. Everyone can check out the game and the visualization on the Shopify Live Globe homepage.
Get your product in front of more than 1,000,000 tech professionals.
Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.
Space Fills Up Fast - Reserve Today
Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].
2025-12-23 00:30:45
Static training data can’t keep up with fast-changing information, leaving your models to guess. We recommend this technical guide from You.com, which gives developers the code and framework to connect GenAI apps to the live web for accurate, real-time insights.
What you’ll get:
A step-by-step Python tutorial to integrate real-time search with a single GET request
The exact code logic to build a “Real-Time Market Intelligence Agent” that automates daily briefings
Best practices for optimizing latency, ensuring zero data retention, and establishing traceability
Turn “outdated” into “real-time.”
For a long time, AI systems were specialists confined to a single sense. For example:
Computer vision models could identify objects in photographs, but couldn’t describe what they saw.
Natural language processing systems could write eloquent prose but remained blind to images.
Audio processing models could transcribe speech, but had no visual context.
This fragmentation represented a fundamental departure from how humans experience the world. Human cognition is inherently multimodal. We don’t just read text or just see images. We simultaneously observe facial expressions while listening to the tone of voice. We connect the visual shape of a dog with the sound of a bark and the written word “dog.”
To create AI that truly operates in the real world, these separated sensory channels needed to converge.
Multimodal Large Language Models represent this convergence. For example, GPT-4o can respond to voice input in as little as 232 milliseconds, matching human conversation speed. Google’s Gemini can process an entire hour of video in a single prompt.
These capabilities emerge from a single unified neural network that can see, hear, and read simultaneously.
But how does a single AI system understand such fundamentally different types of data? In this article, we try to answer this question.
What if you could spend most of your IT resources on innovation, not maintenance?
The latest report from the IBM Institute for Business Value explores how businesses are using intelligent automation to get more out of their technology, drive growth, and cut the cost of complexity.
The core breakthrough behind multimodal LLMs is quite simple. Every type of input, whether text, images, or audio, gets converted into the same type of mathematical representation called embedding vectors. Just as human brains convert light photons, sound waves, and written symbols into uniform neural signals, multimodal LLMs convert diverse data types into vectors that occupy the same mathematical space.
Let us consider a concrete example. A photograph of a dog, the spoken word “dog,” and the written text “dog” all get transformed into points in a high-dimensional mathematical space. These points cluster together, close to each other, because they represent the same concept.
This unified representation enables what researchers call cross-modal reasoning. The model can understand that a barking sound, a photo of a golden retriever, and the sentence “the dog is happy” all relate to the same underlying concept. The model doesn’t need separate systems for each modality. Instead, it processes everything through a single architecture that treats visual patches and audio segments just like text tokens.
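As a toy illustration of this shared space, the sketch below compares hand-made embedding vectors with cosine similarity. The vectors are invented for the example; real models use hundreds or thousands of dimensions, but the geometry works the same way.

```python
# Toy illustration of a shared embedding space: vectors for the same concept
# sit close together, unrelated concepts sit far apart. Values are made up.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

dog_photo  = np.array([0.91, 0.10, 0.38])  # embedding from a vision encoder
dog_text   = np.array([0.88, 0.15, 0.40])  # embedding of the word "dog"
pasta_text = np.array([0.05, 0.92, 0.30])  # embedding of "a plate of pasta"

print(cosine_similarity(dog_photo, dog_text))    # high: same concept (~0.99)
print(cosine_similarity(dog_photo, pasta_text))  # low: unrelated concepts (~0.26)
```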
The diagram below shows a high-level view of how a multimodal LLM works:
Modern multimodal LLMs consist of three essential components working together to process diverse inputs.
The first component handles the translation of raw sensory data into initial mathematical representations.
Vision Transformers process images by treating them like sentences, dividing photographs into small patches and processing each patch as if it were a word.
Audio encoders convert sound waves into spectrograms, which are visual-like representations showing how frequencies change over time.
These encoders are typically pre-trained on massive datasets to become highly skilled at their specific tasks.
The second component acts as a bridge. Even though both encoders produce vectors, these vectors exist in different mathematical spaces. In other words, the vision encoder’s representation of “cat” lives in a different geometric region than the language model’s representation of the word “cat.”
Projection layers align these different representations into the shared space where the language model operates. Often, these projectors are surprisingly simple, sometimes just a linear transformation or a small two-layer neural network. Despite their simplicity, they’re crucial for enabling the model to understand visual and auditory concepts.
The third component is the core LLM, such as GPT or LLaMA.
This is the “brain” that does the actual reasoning and generates responses. It receives all inputs as sequences of tokens, whether those tokens originated from text, image patches, or audio segments.
The language model treats them identically, processing everything through the same transformer architecture that powers text-only models. This unified processing is what allows the model to reason across modalities as naturally as it handles pure text.
See the diagram below that shows the transformer architecture:
The breakthrough that enabled modern multimodal vision came from a 2020 paper with a memorable title: “An Image is Worth 16x16 Words.” This paper introduced the idea of processing images exactly like sentences by treating small patches as tokens.
The process works through several steps:
First, the image gets divided into a grid of fixed-size patches, typically 16x16 pixels each.
A standard 224x224 pixel image becomes exactly 196 patches (a 14x14 grid), each representing a small square region.
Each patch is flattened from a 2D grid into a 1D vector of numbers representing pixel intensities.
Positional embeddings are added so the model knows where each patch came from in the original image.
These patch embeddings flow through transformer layers, where attention mechanisms allow patches to learn from each other.
The attention mechanism is where understanding emerges. A patch showing a dog’s ear learns it connects to nearby patches showing the dog’s face and body. Patches depicting a beach scene learn to associate with each other to represent the broader context of sand and water. By the final layer, these visual tokens carry rich contextual information. The model doesn’t just see “brown pixels” but understands “golden retriever sitting on beach.”
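To make the patch pipeline concrete, here is a minimal PyTorch sketch of the steps listed above, from splitting a 224x224 image into 16x16 patches to adding positional embeddings. The weights are random stand-ins rather than a trained Vision Transformer.

```python
# Patchify an image, flatten each patch, project to the model dimension,
# and add learnable positional embeddings (ViT-style, random weights).
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch, dim = 16, 768

# Extract non-overlapping 16x16 patches: 14 x 14 = 196 of them
patches = image.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

to_embedding = nn.Linear(3 * patch * patch, dim)          # flattened patch -> model dimension
pos_embedding = nn.Parameter(torch.zeros(1, 196, dim))    # one position per patch

tokens = to_embedding(patches) + pos_embedding            # ready for the transformer layers
print(tokens.shape)  # torch.Size([1, 196, 768])
```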
The second critical innovation was CLIP, developed by OpenAI. CLIP revolutionized how vision encoders are trained by changing the fundamental objective. Instead of training on labeled image categories, CLIP was trained on 400 million pairs of images and their text captions from the internet.
CLIP uses a contrastive learning approach. Given a batch of image-text pairs, it computes embeddings for all images and all text descriptions. The goal is to maximize the similarity between embeddings of correct image-text pairs while minimizing similarity between incorrect pairings. An image of a dog should produce a vector close to the caption “a dog in the park” but far from “a plate of pasta.”
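The sketch below shows the heart of this contrastive objective in PyTorch. The encoders are replaced by random embeddings; the point is the symmetric loss that rewards correct image-text pairs and penalizes mismatches.

```python
# CLIP-style contrastive objective: for each image, its own caption should be
# the most similar text in the batch, and vice versa. Embeddings are random
# stand-ins for real encoder outputs.
import torch
import torch.nn.functional as F

batch = 8
image_emb = F.normalize(torch.randn(batch, 512), dim=-1)  # from the image encoder
text_emb  = F.normalize(torch.randn(batch, 512), dim=-1)  # from the text encoder

temperature = 0.07
logits = image_emb @ text_emb.T / temperature   # (batch, batch) similarity matrix

# Row i's correct caption is column i, and vice versa
labels = torch.arange(batch)
loss_img_to_text = F.cross_entropy(logits, labels)
loss_text_to_img = F.cross_entropy(logits.T, labels)
loss = (loss_img_to_text + loss_text_to_img) / 2
print(loss.item())
```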
Audio presents unique challenges for language models.
Unlike text, which naturally divides into discrete words, or images, which can be divided into spatial patches, sound is continuous and temporal. For example, a 30-second audio clip sampled at 16,000 Hz contains 480,000 individual data points. Feeding this massive stream of numbers directly into a transformer is computationally prohibitive. The solution requires converting audio into a more tractable representation.
The key innovation is transforming audio into spectrograms, which are essentially images of sound. The process involves several mathematical transformations:
The long audio signal gets sliced into tiny overlapping windows, typically 25 milliseconds each.
A Fast Fourier Transform extracts which frequencies are present in each window
These frequencies are mapped onto the mel scale, which matches human hearing sensitivity by giving more resolution to lower frequencies
The result is a 2D heat map where time runs along one axis, frequency along the other, and color intensity represents volume
This mel-spectrogram looks like an image to the AI model. For a 30-second clip, this might produce an 80x3,000 grid, which is essentially a visual representation of acoustic patterns that can be processed similarly to photographs.
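Here is a short sketch of that transformation using librosa, with Whisper-style settings assumed (16 kHz audio, 25 ms windows, 10 ms hops, 80 mel bins); the file name is a placeholder.

```python
# Convert raw audio into a log-mel spectrogram: an "image of sound" with
# time on one axis, mel frequency on the other, and intensity as color.
import librosa

y, sr = librosa.load("clip.wav", sr=16_000)    # resample to 16 kHz

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=400,        # ~25 ms analysis window
    hop_length=160,   # ~10 ms step between windows
    n_mels=80,        # 80 mel-frequency bins
)
log_mel = librosa.power_to_db(mel)             # compress dynamic range

# For a 30-second clip this is roughly an 80 x 3000 grid
print(log_mel.shape)
```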
Once audio is converted to a spectrogram, models can apply the same techniques used for vision. The Audio Spectrogram Transformer divides the spectrogram into patches, just as an image is divided. For example, models like Whisper, trained on 680,000 hours of multilingual audio, excel at this transformation.
Training a multimodal LLM typically happens in two distinct stages.
The first stage focuses purely on alignment, teaching the model that visual and textual representations of the same concept should be similar. During this stage, both the pre-trained vision encoder and the pre-trained language model remain frozen. Only the projection layer’s weights get updated through training.
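A minimal PyTorch sketch of this stage-one setup is shown below: both pre-trained components are frozen and only the projector trains. The modules here are simple stand-ins, not real pre-trained weights.

```python
# Stage-one alignment: freeze the vision encoder and language model,
# train only the small projection layer that bridges their spaces.
import torch
import torch.nn as nn

vision_encoder = nn.Linear(768, 768)     # placeholder for a pre-trained ViT
language_model = nn.Linear(4096, 4096)   # placeholder for a pre-trained LLM

# Freeze everything except the projector
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False

# Projector: often just a linear layer or a small two-layer MLP
projector = nn.Sequential(
    nn.Linear(768, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

image_features = vision_encoder(torch.randn(2, 768))   # frozen forward pass
visual_tokens = projector(image_features)              # only these weights learn
print(visual_tokens.shape)  # torch.Size([2, 4096])
```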
Alignment alone isn’t sufficient for practical use. A model might describe what’s in an image but fail at complex tasks like “Why does the person look sad?” or “Compare the two charts”.
Visual instruction tuning addresses this by training the model to follow sophisticated multimodal instructions.
During this stage, the projection layer continues training and the language model is also updated, often using parameter-efficient methods. The training data shifts to instruction-response datasets formatted as conversations.
An important innovation here was using GPT-4 to generate synthetic training data. Researchers fed GPT-4 textual descriptions of images and prompted it to create realistic conversations about those images. Training on this synthetic but high-quality data effectively distills GPT-4’s reasoning capabilities into the multimodal model, teaching it to engage in nuanced visual dialogue rather than just describing what it sees.
Multimodal LLMs achieve their remarkable capabilities through a unifying principle. By converting all inputs into sequences of embedding vectors that occupy a shared mathematical space, a single transformer architecture can reason across modalities as fluidly as it processes language alone.
The architectural innovations powering this capability represent genuine advances: Vision Transformers treating images as visual sentences, contrastive learning aligning modalities without explicit labels, and cross-attention enabling selective information retrieval across different data types.
The future points toward any-to-any models that can both understand and generate all modalities. In other words, a model that outputs text, generates images, and synthesizes speech in a single response.
2025-12-21 00:31:00
If slow QA processes bottleneck you or your software engineering team and you’re releasing slower because of it — you need to check out QA Wolf.
QA Wolf’s AI-native service supports web and mobile apps, delivering 80% automated test coverage in weeks and helping teams ship 5x faster by reducing QA cycles to minutes.
QA Wolf takes testing off your plate. They can get you:
Unlimited parallel test runs for mobile and web apps
24-hour maintenance and on-demand test creation
Human-verified bug reports sent directly to your team
Zero flakes guaranteed
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.
This week’s system design refresher:
Evolution of HTTP
System Performance Metrics Every Engineer Should Know
Why Is Nginx So Popular?
Network Debugging Commands Every Engineer Should Know
Hub, Switch, & Router Explained
SPONSOR US
The Hypertext Transfer Protocol (HTTP) has evolved over the years to meet the needs of modern applications, from simple text delivery to high-performance, real-time experiences.
Here is how HTTP has progressed:
HTTP/0.9: Built to fetch simple HTML documents with a single GET request.
HTTP/1.0: Added headers and status codes to support richer interactions, but every request still required a new connection.
HTTP/1.1: Introduced persistent connections and more methods, making the web faster and more efficient for everyday browsing.
HTTP/2: Solved performance bottlenecks with multiplexing, enabling multiple requests to share one connection.
HTTP/3 (QUIC): Shifted to UDP with QUIC to reduce latency and improve reliability, especially for mobile and real-time apps.
Over to you: Are you already taking advantage of HTTP/3 in your projects?
Code from Claude is about to hit prod, but it doesn’t have to be painful.
Engineering teams at Coinbase, Toast, Gametime, MSCI, and Zscaler use Resolve AI to resolve incidents, optimize costs, and build with production context using AI that works across code, infra, and telemetry.
The results mean 70% faster MTTR, 30% fewer engineers pulled in per incident, and thousands of saved engineering hours. Imagine what you could ship with that time in 2026.
Learn more about AI for prod, workflow-autonomous multi-agent systems, and how you can cut orchestration tax, improve investigations, and shift engineering time from grunt work to great work.
Your API is slow. But how slow, exactly? You need numbers. Real metrics that tell you what's actually broken and where to fix it.
Here are the four core metrics every engineer should know when analyzing system performance:
Queries Per Second (QPS): How many incoming requests your system handles per second. Your server gets 1,000 requests in one second? That's 1,000 QPS. Sounds straightforward until you realize most systems can't sustain their peak QPS for long without things starting to break.
Transactions Per Second (TPS): How many completed transactions your system processes per second. A transaction includes the full round trip, i.e., the request goes out, hits the database, and comes back with a response.
TPS tells you about actual work completed, not just requests received. This is what your business cares about.
Concurrency: How many simultaneous active requests your system is handling at any given moment. You could have 100 requests per second, but if each takes 5 seconds to complete, you're actually handling 500 concurrent requests at once.
High concurrency means you need more resources, better connection pooling, and smarter thread management.
Response Time (RT): The elapsed time from when a request starts until the response is received. Measured at both the client level and server level.
A simple relationship ties them all together: QPS = Concurrency ÷ Average Response Time
More concurrency or lower response time = higher throughput.
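As a quick sanity check, here is the formula applied to the numbers from the concurrency example above.

```python
# QPS = Concurrency / Average Response Time
concurrency = 500            # simultaneous in-flight requests
avg_response_time_s = 5.0    # average response time in seconds

qps = concurrency / avg_response_time_s
print(qps)  # 100.0

# Halving response time doubles throughput at the same concurrency
print(concurrency / (avg_response_time_s / 2))  # 200.0
```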
Over to you: When you analyze performance, which metric do you look at first, QPS, TPS, or Response Time?
Apache dominated web servers for 20 years, then Nginx showed up and changed everything. Now Nginx powers some of the largest sites on the internet, including Netflix, Airbnb, Dropbox, and WordPress.com. Not because it's newer or trendier, but because it solves problems that Apache couldn't handle efficiently.
Here’s what makes Nginx so popular:
High-Performance Web Server
Reverse Proxy & Load Balancer
Caching Layer
SSL Termination (Offloading)
Over to you: What’s your primary use for Nginx today, web server, reverse proxy, or load balancer?
When someone says “It’s a network issue,” these commands help you find what’s wrong fast.
ping: Checks if the destination responds and reports the round-trip time for basic reachability.
traceroute / tracert: Shows each hop on the path so you can see where packets slow down or stop.
mtr / pathping: Continuously measures latency and loss per hop to catch intermittent issues.
ip addr, ip link / ipconfig /all: Prints local IPs, MACs, and interface status so you can verify the machine’s network identity.
ip route: Reveals the routing table to confirm which gateway and next hop the system will use.
ip neigh: Displays IP-to-MAC entries to detect duplicates or stale ARP records on the LAN.
ss -tulpn: Lists listening sockets and PIDs so you can confirm a service is actually bound to the expected port.
dig: Resolves DNS records to verify the exact IPs clients will connect to.
curl -I: Fetches only HTTP(S) headers to check status codes, redirects, and cache settings.
tcpdump / tshark: Captures packets so you can inspect real traffic and validate what’s sent and received.
iperf3: Measures end-to-end throughput between two hosts to separate bandwidth limits from app issues.
ssh: Opens a secure shell on the remote machine to run checks and apply fixes directly.
sftp: Transfers files securely so you can pull logs or push artifacts during an incident.
nmap: Scans open ports and probes versions to confirm which services are exposed and responding.
Over to you: What's your go-to command when debugging network issues?
Every home and office network relies on these three devices, hub, switch, and router, yet their roles are often mixed up.
A hub operates at Layer 1 (Physical Layer). It’s the simplest of the three: it doesn’t understand addresses or data types. When a packet arrives, it simply broadcasts it to every connected device, creating one big collision domain. That means all devices compete for the same bandwidth, making hubs inefficient in modern networks.
A switch works at Layer 2 (Data Link Layer). It learns MAC addresses and forwards frames only to the correct destination device. Each port on a switch acts as its own collision domain, improving efficiency and speeding up communication within a LAN.
A router operates at Layer 3 (Network Layer). It routes packets based on IP addresses and connects different networks together, for example, your home network to the Internet. Each router interface forms a separate broadcast domain, keeping local and external traffic isolated.
Understanding how these three layers work together is the foundation of every modern network, from your home Wi-Fi to the global Internet backbone.
Over to you: How do you usually figure out whether a network issue is caused by the router or the switch?
2025-12-19 00:31:12
In a monolithic application, a function call is a local, in-memory process. Aside from a catastrophic hardware failure or a process crash, the execution of a function is essentially guaranteed. If the process is alive, the call succeeds.
However, in distributed systems, this guarantee does not hold. Components communicate over physical networks that are inherently unreliable. This reality is captured in the “Fallacies of Distributed Computing,” specifically the first fallacy: “The network is reliable”. In truth, it is not. A request sent from Service A to Service B may fail not because Service B is broken, but simply because the communication medium momentarily faltered.
This creates a need for defensive programming patterns, and one of the primary mechanisms we use is the Retry pattern. By automatically retrying a failed operation, a system can trade latency for availability, turning what would have been a failed user request into a successful one.
However, retries are both essential and dangerous in distributed systems. On the one hand, they transform unreliable networks into reliable ones. But on the other hand, indiscriminate retries can lead to latency amplification, resource exhaustion, and cascading failures that can take down entire platforms.
In this article, we will explore the retry pattern in depth and understand when and how to use it safely and effectively.
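As a preview of the pattern, here is a minimal Python sketch of retries with exponential backoff and jitter. The operation and exception types are placeholders; production code would also cap total retry budgets and pair retries with circuit breakers to avoid the amplification problems mentioned above.

```python
# Retry pattern sketch: exponential backoff capped at a maximum delay, with
# full jitter so many clients don't retry in lockstep and amplify load.
import random
import time

class TransientError(Exception):
    """Stand-in for timeouts, connection resets, HTTP 503s, and similar."""

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up: let the caller (or a circuit breaker) decide
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))

# Example usage with a flaky placeholder operation
def flaky_call():
    if random.random() < 0.7:
        raise TransientError("service B did not respond in time")
    return "ok"

print(call_with_retries(flaky_call))
```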
2025-12-18 00:30:16
Code reviews are critical but time-consuming. CodeRabbit acts as your AI co-pilot, providing instant code review comments and flagging the potential impact of every pull request.
Beyond just flagging issues, CodeRabbit provides one-click fix suggestions and lets you define custom code quality rules using AST Grep patterns, catching subtle issues that traditional static analysis tools might miss.
CodeRabbit has so far reviewed more than 10 million PRs, is installed on 2 million repositories, and is used by 100,000 open-source projects. CodeRabbit is free for all open-source repos.
Disclaimer: The details in this post have been derived from the details shared online by the Meta Engineering Team. All credit for the technical details goes to the Meta Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
When Meta announced in Q2 2025 that its new Generative Ads Model (GEM) had driven a 5% increase in ad conversions on Instagram and a 3% increase on Facebook Feed, the numbers might have seemed modest.
However, at Meta’s scale, these percentages translate to billions of dollars in additional revenue and represent a fundamental shift in how AI-powered advertising works.
GEM is the largest foundation model ever built for recommendation systems. It has been trained at the scale typically reserved for large language models like GPT-4 or Claude. Yet here’s the paradox: GEM is so powerful and computationally intensive that Meta can’t actually use it directly to serve ads to users.
Instead, the company developed a teacher-student architecture that lets smaller, faster models benefit from GEM’s intelligence without inheriting its computational cost.
In this article, we look at how the Meta engineering team built GEM and the challenges they overcame.
Bugs sneak out when less than 80% of user flows are tested before shipping. However, getting that kind of coverage (and staying there) is hard and pricey for any team.
QA Wolf’s AI-native solution provides high-volume, high-speed test coverage for web and mobile apps, reducing your organization’s QA cycle to minutes.
They can get you:
80% automated E2E test coverage in weeks—not years
Unlimited parallel test runs
24-hour maintenance and on-demand test creation
Zero flakes, guaranteed
The benefit? No more manual E2E testing. No more slow QA cycles. No more bugs reaching production.
With QA Wolf, Drata’s team of engineers achieved 4x more test cases and 86% faster QA cycles.
⭐ Rated 4.8/5 on G2
Every day, billions of users scroll through Facebook, Instagram, and other Meta platforms, generating trillions of potential ad impression opportunities. Each impression represents a decision point: which ad, from millions of possibilities, should be shown to this specific user at this particular moment? Getting this wrong means wasting advertiser budgets on irrelevant ads and annoying users with content they don’t care about. Getting it right creates value for everyone involved.
Traditional ad recommendation systems struggled with this in several ways. Some systems treated each platform separately, which meant that insights about user behavior on Instagram couldn’t inform predictions on Facebook. This siloed approach missed valuable cross-platform patterns. Other systems tried to treat all platforms identically, ignoring the fact that people interact with Instagram Stories very differently from how they browse Facebook Feed. Neither approach was optimal.
The data complexity also compounds these challenges in the following ways:
Meaningful signals like clicks and conversions are extremely sparse compared to total impression volume.
User features are dynamic and constantly changing.
The system must process multimodal inputs, including text, images, video, and complex behavioral sequences.
Traditional models had severe memory limitations, typically only considering a user’s last 10 to 20 actions.
GEM’s goal was to create a unified intelligence that understands users holistically across Meta’s entire ecosystem, learning from long behavioral histories and complex cross-platform patterns while maintaining the nuance needed to optimize for each specific surface and objective.
GEM’s architecture processes user and ad information through three complementary systems, each handling a different aspect of the prediction problem.
The first system handles what Meta calls non-sequence features, which are essentially static attributes and their combinations. These include user demographics like age and location, user interests, ad characteristics like format and creative content, and advertiser objectives.
The challenge here isn’t just knowing these individual features but understanding how they interact. For example, a 25-year-old tech worker has very different purchasing patterns than a 25-year-old teacher, even if they share some interests. The system needs to learn which combinations of features actually matter.
GEM uses an enhanced version of the Wukong architecture with stackable factorization machines that can scale both vertically for deeper interactions and horizontally for broader feature coverage. This architecture works through multiple stacked layers, where each successive layer learns increasingly complex patterns from the simpler patterns discovered by previous layers. For instance, an early layer might discover the basic pattern that young professionals respond well to tech product ads. A layer deeper in the stack builds on this by learning that young professionals in urban areas who show interest in fitness respond especially well to smart wearable ads. An even deeper layer might refine this further, discovering that this combination works best specifically when those ads emphasize data tracking features rather than fashion elements.
The second system handles sequence features, which capture the timeline of user behavior. A user’s actions don’t exist in isolation. They tell a story with order and meaning. Someone who clicked on home workout content, then searched for gyms nearby, then viewed several gym websites, then researched membership costs is clearly on a specific journey. Traditional architectures struggled to process long sequences efficiently because the computational cost grows rapidly with sequence length.
GEM overcomes this with a pyramid-parallel structure. Think of it as processing your behavior history in chunks at the bottom level, then combining those chunks into broader patterns at middle levels, and finally synthesizing everything into a complete journey understanding at the top level. Multiple chunks can be processed simultaneously rather than sequentially, which dramatically improves efficiency.
The breakthrough here is scale. GEM can now analyze thousands of your past actions rather than just the most recent handful. This extended view reveals patterns that shorter windows simply cannot capture, like the progression from casual interest to serious purchase intent that might develop over months.
See the diagram below:
The third system, called InterFormer, handles cross-feature learning by connecting your static profile with your behavioral timeline. This is where GEM’s intelligence really becomes evident. Previous approaches would compress your entire behavior history into a compact summary vector (like reducing an entire novel to a single rating). This compression inevitably loses critical details about your journey.
InterFormer takes a different approach using an interleaving structure. It alternates between layers that focus purely on understanding your behavior sequence and layers that connect those behaviors to your profile attributes.
The first sequence layer might identify that you’ve shown increasing interest in fitness over time.
The first cross-feature layer then considers how your age, income, and location context shape what that fitness interest means.
The second sequence layer re-examines your behavior with these new insights and might notice that your fitness research intensified after a gym opened near your workplace.
The second cross-feature layer then makes even deeper connections about purchase intent and timing.
This alternating process continues through multiple layers, with each cycle refining understanding without losing access to the complete behavioral record.
Despite GEM’s obvious strengths, Meta faced a fundamental engineering challenge in using GEM.
GEM is enormous and trained using thousands of GPUs over extended periods. Running GEM directly for every ad prediction would be impossibly slow and expensive. When a user scrolls through Instagram, the system needs to make ad decisions in tens of milliseconds. GEM simply cannot operate at that speed while serving billions of users simultaneously.
Meta’s solution was a teacher-student architecture where GEM acts as the master teacher that trains hundreds of smaller, faster Vertical Models (VMs) that actually serve ads in production. These VMs are specialized for specific contexts like Instagram Stories click prediction or Facebook Feed conversion prediction. Each VM is lightweight enough to make predictions in milliseconds, but they’re much smarter than they would be if trained independently because they learn from GEM.
The knowledge transfer happens through two strategies. Direct transfer works when a VM operates in the same domain where GEM was trained, with similar data and objectives. GEM can teach these models directly. Hierarchical transfer applies when VMs work in specialized areas quite different from GEM’s training domain. In these cases, GEM first teaches medium-sized domain-specific foundation models for areas like Instagram or Facebook Marketplace. These domain models then teach the even smaller VMs. The knowledge flows down through levels, getting adapted and specialized at each stage.
Meta employs three sophisticated techniques to maximize transfer efficiency:
Knowledge distillation with Student Adapter: Student models learn to replicate GEM’s reasoning process, not just final predictions. The Student Adapter refines GEM’s predictions using recent ground-truth data, adjusting for timing delays and domain-specific differences (a generic sketch of the distillation idea follows this list).
Representation learning: Creates a shared conceptual framework between teacher and students. GEM learns to encode information in ways that transfer well across different model sizes, adding no computational overhead during ad serving.
Parameter sharing: This lets VMs selectively incorporate specific components directly from GEM. Small VMs stay fast while borrowing GEM’s sophisticated components for complex user understanding tasks.
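To ground the distillation idea, here is a generic PyTorch sketch in which a student model learns from both ground-truth labels and the teacher’s softened predictions. This is textbook distillation, not Meta’s exact GEM recipe; the Student Adapter and other specifics are not public.

```python
# Generic knowledge distillation for a click/conversion predictor:
# blend a hard loss (actual labels) with a soft loss (teacher's probabilities
# at a higher temperature, which exposes more of the teacher's "reasoning").
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy(
        torch.sigmoid(student_logits / temperature),
        torch.sigmoid(teacher_logits / temperature),
    )
    return alpha * hard + (1 - alpha) * soft

student_logits = torch.randn(4, requires_grad=True)   # small vertical model (VM)
teacher_logits = torch.randn(4)                        # frozen teacher predictions
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])            # observed outcomes

loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(loss.item())
```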
Together, these three techniques achieve twice the effectiveness of standard knowledge distillation alone. The continuous improvement cycle works like this:
Users interact with fast VMs in real time
Their engagement data flows back into Meta’s data pipelines
GEM periodically re-trains on this fresh data, updated knowledge transfers to VMs through the post-training techniques, and
Improved VMs get deployed to production.
This cycle repeats continuously, with GEM getting smarter and VMs getting regular intelligence updates.
Building GEM required Meta to rebuild its training infrastructure from the ground up.
The challenge was training a model at LLM scale, but for the fundamentally different task of recommendation rather than language generation. The company achieved a 23x increase in effective training throughput while using 16x more GPUs and simultaneously improving hardware efficiency by 1.43x.
This required innovations across multiple areas. Multi-dimensional parallelism orchestrates how thousands of GPUs work together, splitting the model’s dense components using techniques like Hybrid Sharded Distributed Parallel while handling sparse components like embedding tables through a combination of data and model parallelism. The goal was to ensure every GPU stayed busy with minimal idle time waiting for communication from other GPUs.
System-level optimizations pushed GPU utilization even higher:
Custom GPU kernels designed for variable-length user sequences, fusing operations to reduce memory bandwidth bottlenecks.
PyTorch 2.0 graph-level compilation automates optimizations like activation checkpointing and operator fusion.
Memory compression, including FP8 quantization to reduce the footprint without impacting accuracy.
NCCLX communication collectives that handle inter-GPU communication without consuming the main compute resources.
The efficiency gains extended beyond raw training speed.
Meta reduced job startup time by 5x through optimizations in trainer initialization, data reader setup, and checkpointing. They cut PyTorch 2.0 compilation time by 7x using intelligent caching strategies. These might seem like minor details, but when you’re training models that cost millions of dollars in compute resources, every percentage point of efficiency improvement matters enormously.
The result is a training system that can iterate rapidly on GEM, incorporating new data and architectural improvements at a pace that would have been impossible with previous infrastructure. This enables Meta to keep GEM at the frontier of recommendation AI while controlling costs enough to make the massive investment worthwhile.
Meta’s roadmap for GEM extends well beyond its current capabilities.
The next major evolution involves true multimodal learning, where GEM processes text, images, audio, and video together rather than treating them as separate input streams. This will enable an even richer understanding of both user preferences and ad creative effectiveness across all content types. The company is also exploring inference-time scaling, which would allow the system to dynamically allocate more computational resources to difficult predictions while handling straightforward cases more efficiently.
Perhaps most ambitiously, Meta envisions a unified engagement model that ranks both organic content and ads using the same underlying intelligence. This would fundamentally change how advertising integrates into social feeds, potentially creating more seamless experiences where ads feel like natural content recommendations rather than interruptions. On the advertiser side, GEM’s intelligence will enable more sophisticated agentic automation, where AI systems can manage and optimize campaigns with minimal human intervention while achieving better results.
References:
Meta’s Generative Ads Model (GEM): The Central Brain Accelerating Ads Recommendation AI Innovation
Wukong: Towards a Scaling Law for Large-Scale Recommendation
2025-12-17 00:30:58
Enterprise customers expect SSO, Directory Sync, RBAC, and Audit Logs, but building and maintaining that infrastructure slows teams down and pulls focus from core product work.
WorkOS provides these features through simple APIs and a hosted Admin Portal that integrates with every identity provider. You get production-ready enterprise capabilities without owning the complexity yourself.
Trusted by OpenAI, Cursor, Vercel, 1000+ more. Your first million MAUs are free.
Disclaimer: The details in this post have been derived from the details shared online by the LinkedIn Engineering Team. All credit for the technical details goes to the LinkedIn Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Recruiting is a profession that demands both strategic thinking and meticulous attention to detail. Recruiters must make high-value decisions about which candidates are the best fit for a role, but they also spend countless hours on repetitive pattern recognition tasks. Sorting through hundreds of resumes, evaluating qualifications against job requirements, and drafting personalized outreach messages are all essential activities. However, they also consume enormous amounts of time that could otherwise be spent on relationship-building and strategic hiring decisions.
LinkedIn’s Hiring Assistant represents a new approach to solving this challenge.
Rather than replacing recruiters, this AI agent is designed to handle the repetitive, time-consuming aspects of the recruiting workflow, freeing professionals to focus on what they do best: connecting with people and making critical hiring choices.
The most labor-intensive parts of recruiting fall into three main categories.
First, sourcing candidates requires searching through LinkedIn’s network of over 1.2 billion profiles to identify qualified individuals.
Second, evaluating candidates involves carefully reading resumes and profiles to assess whether each person meets the specific requirements of a role.
Third, engaging candidates means drafting and sending personalized communications to potential hires, answering their questions, and maintaining ongoing dialogue throughout the hiring process.
To address these challenges, LinkedIn built the Hiring Assistant with three core capabilities.
The system delivers value at scale by efficiently searching across billions of profiles and handling enterprise-level workloads reliably.
It enables interactive communication by understanding recruiter intent through natural conversation, asking clarifying questions when needed, and adapting its behavior based on real-time feedback.
Lastly, it also features continuous learning by improving over time based on observing what recruiters do, learning individual preferences, and remembering past interactions and decisions.
In this article, we will look at the architecture and technical building blocks of LinkedIn’s Hiring Assistant.
This holiday season, the equation is simple: everyone gets a better deal with Verizon. Best devices. Best plans. Add that to an award-winning network, and you have the best deals. Period.
Unbeatable Deal: Switch to Verizon and get four lines on Unlimited Welcome for $25 per line/month (on Auto Pay plus taxes & fees) and get four of the newest, premium devices like the iPhone 17 Pro, Samsung Galaxy S25+, or Google Pixel 10 Pro XL all on Verizon.
Enjoy flexibility and save money this holiday season because every dollar you spend matters.
Explore Holiday Deals. See here for full terms.
At its core, the Hiring Assistant is built on what LinkedIn calls a “plan-and-execute” architecture as shown in the diagram below:
To understand why this matters, it helps to know what they avoided. A simpler approach, known as ReAct, would have the AI try to handle everything at once in a single continuous loop. While straightforward, this method runs into problems when tasks get complex. Large language models, the AI systems that power tools like this, can become unreliable when asked to juggle too many things simultaneously.
See the diagram below for the ReAct pattern:
Instead, LinkedIn split the work into two distinct phases:
The Planner acts as the strategic thinker. When a recruiter makes a request, the Planner examines it from a high level, breaks the work into smaller, manageable steps, and creates a structured plan for what needs to happen. Think of it as a project manager outlining the approach before any actual work begins.
The Executor then takes over. It works through the plan step by step, using available tools to complete each task. For each step, the Executor runs its own loop of reasoning and action, figuring out what needs to happen and then making it happen.
This divide-and-conquer strategy brings several advantages:
First, it makes the system more reliable. Breaking complex recruiting workflows into discrete steps means the AI is less likely to get confused or make mistakes.
Second, it allows for better cost management. LinkedIn can use more powerful AI models for complex reasoning tasks while deploying simpler, cheaper models for straightforward steps.
Third, tasks are far more likely to be completed successfully when they are well-defined and manageable in scope.
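To make the split concrete, here is a minimal Python sketch of the plan-and-execute idea. Every name in it (the Planner, the Executor, the tool functions) is a hypothetical stand-in for illustration, not LinkedIn's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    description: str  # what this step should accomplish
    tool: str         # which tool the Executor should use

class Planner:
    """Breaks a high-level recruiter request into discrete, manageable steps."""
    def plan(self, request: str) -> list[Step]:
        # In the real system an LLM would produce this plan;
        # here we hard-code a plausible decomposition.
        return [
            Step(f"Gather job requirements for: {request}", tool="intake"),
            Step("Search for matching profiles", tool="sourcing"),
            Step("Evaluate candidates against qualifications", tool="evaluation"),
        ]

class Executor:
    """Works through the plan step by step, one small reason-and-act loop per step."""
    def __init__(self, tools: dict[str, Callable[[str], str]]):
        self.tools = tools

    def execute(self, steps: list[Step]) -> list[str]:
        results = []
        for step in steps:
            # Each step runs in isolation, keeping the model's context narrow.
            results.append(self.tools[step.tool](step.description))
        return results

if __name__ == "__main__":
    tools = {
        "intake": lambda task: f"[intake] {task}",
        "sourcing": lambda task: f"[sourcing] {task}",
        "evaluation": lambda task: f"[evaluation] {task}",
    }
    plan = Planner().plan("Senior backend engineer, Toronto")
    for result in Executor(tools).execute(plan):
        print(result)
```

Even in this toy version the benefit is visible: the Planner never touches tools, and each Executor step only has to reason about one narrow task at a time.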
Beyond the plan-and-execute design, the Hiring Assistant uses a message-driven architecture.
Each recruiter gets their own individual instance of the assistant, complete with its own identity and mailbox. Everything works through asynchronous messages, much like email. When a recruiter asks the assistant to find candidates, they do not have to sit and wait for results. The assistant receives the message, processes it in the background, and sends updates when ready.
This asynchronous approach is what enables the assistant to work at scale. While a recruiter focuses on other tasks, their assistant can be searching through millions of profiles, evaluating candidates, and preparing recommendations, all without requiring constant attention or supervision.
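A rough way to picture the mailbox model is with asyncio queues, as in the sketch below. The class and method names are assumptions made purely for illustration; they are not LinkedIn's API.

```python
import asyncio

class AssistantInstance:
    """One per recruiter: owns its own identity and its own mailbox."""
    def __init__(self, recruiter_id: str):
        self.recruiter_id = recruiter_id
        self.mailbox: asyncio.Queue[str] = asyncio.Queue()

    async def send(self, message: str) -> None:
        # The recruiter (or another agent) drops a message and moves on;
        # nobody blocks waiting for the work to finish.
        await self.mailbox.put(message)

    async def run(self) -> None:
        # Background loop: pull messages and process them asynchronously.
        while True:
            message = await self.mailbox.get()
            await self.handle(message)
            self.mailbox.task_done()

    async def handle(self, message: str) -> None:
        await asyncio.sleep(0.1)  # stand-in for long-running sourcing work
        print(f"[{self.recruiter_id}] finished: {message}")

async def main():
    assistant = AssistantInstance("recruiter-42")
    worker = asyncio.create_task(assistant.run())
    await assistant.send("Find senior data engineers in Berlin")
    await assistant.send("Re-evaluate new applicants for the open role")
    await assistant.mailbox.join()  # only the demo waits; real callers don't
    worker.cancel()

asyncio.run(main())
```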
The Hiring Assistant operates in two complementary modes, each designed for different stages of the recruiting process:
Interactive Mode: When recruiters first start a new project, they work with the assistant in interactive mode. This feels like having a conversation with a colleague. Recruiters can clarify what kind of person they are looking for, refine job requirements, and get immediate feedback on their requests. The assistant shows its reasoning as it works, making the process transparent. This builds trust because recruiters can see exactly what the system is doing and correct course quickly if something seems off.
Asynchronous Mode: Once the recruiter and assistant are aligned on what success looks like, the system shifts into asynchronous mode. This is where the real power of automation comes into play. The assistant works autonomously in the background, running large-scale searches across millions of profiles, continuously updating candidate pipelines, and evaluating new applicants as they appear.
LinkedIn describes this as a “source while you sleep” capability.
The assistant can review thousands of candidates overnight, a task that would take a human recruiter weeks to complete manually.
Yet even in this autonomous mode, humans remain in control of important decisions. The assistant surfaces candidates and provides recommendations, but recruiters make the final calls about who to contact and ultimately hire. This balance between automation and human judgment is central to how the system is designed.
The Hiring Assistant is built on top of LinkedIn’s broader agent platform, a foundation of reusable components that can power any AI agent product across the company. This approach means the LinkedIn engineering team does not have to reinvent the wheel each time it builds a new intelligent system.
At the user-facing level, a client-side SDK embeds the assistant directly into recruiter workflows. This SDK creates dynamic interfaces that adapt based on what the AI needs at any given moment. It supports multiple input methods, including chat, voice, and typing assistance, while logging all interactions for future analysis and improvement.
Connecting this interface to backend services is a GraphQL API, which delivers data in structured packages called view models that contain everything needed to display information on screen. LinkedIn calls this an agent-driven UI: the AI itself determines what recruiters see, dynamically adjusting the interface as tasks progress.
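For illustration, a view model in this sense might look like the dataclass below; the fields are hypothetical, chosen only to show the "everything needed to render" idea.

```python
from dataclasses import dataclass, field

@dataclass
class CandidateCardViewModel:
    # A view model bundles exactly what the UI needs to render one card,
    # so the client does not have to stitch data together itself.
    headline: str
    match_summary: str
    evidence_snippets: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)  # e.g. "message", "add_to_pipeline"

card = CandidateCardViewModel(
    headline="Staff ML Engineer, 9 yrs experience",
    match_summary="Meets 5 of 6 qualifications",
    evidence_snippets=["Led migration to a feature store at previous role"],
    actions=["message", "add_to_pipeline"],
)
print(card)
```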
Rather than the traditional request-response pattern where you ask a question and wait for an answer, the system uses a push-based, event-driven architecture. It works as follows:
The user interface subscribes to updates from the agent, and when something changes, the agent publishes that update. This means the interface refreshes automatically without users needing to manually reload anything.
Long-running AI tasks are delivered through streaming responses. Instead of waiting for a complete answer, recruiters see the AI’s reasoning unfold in real time, with results appearing as soon as they become available.
If a recruiter is logged in on multiple devices, cross-session synchronization keeps everything in sync. An action taken on a phone immediately reflects on a desktop browser.
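Here is a small Python approximation of that push model: the agent publishes updates, every open session subscribes, and long-running work streams partial results as they become ready. The names are illustrative assumptions, not the actual GraphQL subscription layer.

```python
import asyncio

class AgentUpdates:
    """Minimal pub/sub hub: the agent publishes, UI sessions subscribe."""
    def __init__(self):
        self.subscribers: list[asyncio.Queue] = []

    def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue()
        self.subscribers.append(q)
        return q

    async def publish(self, update: str) -> None:
        for q in self.subscribers:
            await q.put(update)

async def agent_work(updates: AgentUpdates) -> None:
    # Long-running task that streams partial results instead of one final answer.
    for chunk in ["Parsing requirements...", "Searching profiles...", "Shortlist ready"]:
        await asyncio.sleep(0.05)
        await updates.publish(chunk)

async def ui_session(name: str, updates: AgentUpdates) -> None:
    # Every open session (phone, desktop) receives the same stream of updates.
    q = updates.subscribe()
    for _ in range(3):
        print(f"{name} received: {await q.get()}")

async def main():
    updates = AgentUpdates()
    await asyncio.gather(
        ui_session("phone", updates),
        ui_session("desktop", updates),
        agent_work(updates),
    )

asyncio.run(main())
```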
At the center of the Hiring Assistant sits what LinkedIn calls the supervisor agent. If the overall system is a team, the supervisor is the team leader who makes sure everyone works together effectively.
See the diagram below:
The supervisor handles several critical responsibilities:
It oversees workflow management for the entire hiring process, ensuring tasks move forward in the right sequence.
When a recruiter sends a message or request, the supervisor receives it and routes it to the appropriate sub-agent for handling.
It also makes judgment calls about task prioritization, deciding what requires human input versus what can be safely automated.
Beyond just delegating work, the supervisor coordinates between different sub-agents to ensure they work together smoothly. It actively observes the environment, watching for changes like new candidate activity or application submissions, and triggers appropriate actions in response.
The supervisor also manages the human-in-the-loop aspect of the system. It knows which decisions are significant enough to require human approval and surfaces those moments to recruiters.
All communication, whether from users or between sub-agents, flows through the supervisor. It serves as the central hub that keeps the entire operation organized and aligned with recruiter goals.
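A stripped-down sketch of the supervisor's routing and approval logic might look like this. The routing table, task names, and approval rule are hypothetical, included only to show the central-hub pattern.

```python
class Supervisor:
    """Central hub: routes every message to the right sub-agent and
    decides which steps need a human in the loop."""

    # Hypothetical routing table and approval rule, for illustration only.
    ROUTES = {
        "define_role": "intake_agent",
        "find_candidates": "sourcing_agent",
        "assess_candidate": "evaluation_agent",
        "contact_candidate": "outreach_agent",
    }
    NEEDS_HUMAN_APPROVAL = {"contact_candidate"}

    def __init__(self, agents: dict):
        self.agents = agents  # name -> callable sub-agent

    def handle(self, task_type: str, payload: dict) -> str:
        if task_type in self.NEEDS_HUMAN_APPROVAL and not payload.get("approved"):
            return f"waiting for recruiter approval before '{task_type}'"
        agent = self.agents[self.ROUTES[task_type]]
        return agent(payload)

agents = {
    "intake_agent": lambda p: f"intake captured: {p}",
    "sourcing_agent": lambda p: f"sourcing started for: {p}",
    "evaluation_agent": lambda p: f"evaluation queued for: {p}",
    "outreach_agent": lambda p: f"outreach sent: {p}",
}
supervisor = Supervisor(agents)
print(supervisor.handle("find_candidates", {"title": "Backend Engineer"}))
print(supervisor.handle("contact_candidate", {"candidate_id": "abc"}))
print(supervisor.handle("contact_candidate", {"candidate_id": "abc", "approved": True}))
```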
The Hiring Assistant divides recruiting work among several specialized sub-agents, each focused on a specific part of the workflow. This modular design allows each component to excel at its particular task while working together as a cohesive system. Let’s look at the various sub-agents in detail:
The intake agent serves as the starting point for every hiring project.
It gathers job requirements from recruiters, confirming essential details like job title, location, and seniority level. When information is missing, the agent leverages LinkedIn’s Economic Graph (a digital map of the global economy) to intelligently fill in gaps. The agent then generates specific qualifications based on successful past hires and industry knowledge, creating a clear framework for evaluating candidates.
Finding the right candidates is perhaps the most knowledge-intensive part of recruiting, and the sourcing agent approaches this challenge with multiple strategies.
It creates search queries using traditional Boolean logic (AND, OR, NOT operators), generates AI-powered queries based on hiring requirements, and draws on historical recruiter search patterns as starting points. Importantly, customer data never crosses company boundaries, maintaining strict data isolation.
What sets this agent apart is its integration with LinkedIn’s Economic Graph.
This gives it access to insights about top locations, job titles, and skills for specific talent pools. It can identify which candidates are actively looking or were recently hired, understand talent flow patterns between companies and industries, spot fast-growing companies and skill sets, flag companies experiencing layoffs, and highlight opportunities at top schools or companies with open positions. These insights help the agent find hidden gems that might otherwise be overlooked, going well beyond simple keyword matching.
The sourcing agent also implements a closed feedback loop. It combines sourcing with evaluation results, using AI reasoning to refine queries based on which candidates prove to be good matches. This allows the system to balance precision (finding exactly the right candidates) with liquidity (finding enough candidates), continuously improving the quality and volume of results over time.
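The Boolean-query side and the closed feedback loop can be sketched roughly as follows. Both functions are simplified stand-ins; the real system uses AI reasoning and Economic Graph signals rather than this keyword heuristic.

```python
def boolean_query(must: list[str], should: list[str], exclude: list[str]) -> str:
    """Build a simple Boolean search string from qualification keywords."""
    parts = [" AND ".join(must)]
    if should:
        parts.append("(" + " OR ".join(should) + ")")
    query = " AND ".join(parts)
    if exclude:
        query += " NOT (" + " OR ".join(exclude) + ")"
    return query

def refine_query(must: list[str], evaluations: list[tuple[str, bool]]) -> list[str]:
    """Closed feedback loop: terms that only show up in poorly rated matches
    are dropped, trading a little precision for more liquidity.
    This heuristic is purely illustrative."""
    rejected_only = {term for term, good in evaluations if not good}
    accepted = {term for term, good in evaluations if good}
    return [t for t in must if t not in (rejected_only - accepted)]

must = ["Kafka", "Scala", "stream processing"]
print(boolean_query(must, should=["Flink", "Spark"], exclude=["intern"]))
# Suppose evaluation showed the "Scala" matches were consistently weak:
print(refine_query(must, [("Scala", False), ("Kafka", True)]))
```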
Reading resumes and assessing qualifications is one of the most time-consuming tasks for recruiters.
The evaluation agent tackles this by reading candidate profiles and resumes, comparing them against job qualifications, and providing structured recommendations backed by evidence. It shows why a candidate may or may not match requirements, rather than simply offering a yes or no answer.
LinkedIn engineered this agent to address several complex challenges.
Before any evaluation begins, recruiters must review and approve the qualifications being used.
Safety checks ensure these qualifications follow responsible AI policies. The agent searches through profiles and resumes for specific evidence demonstrating how candidates meet each qualification, surfacing this evidence to recruiters for review.
To ensure accuracy, LinkedIn built quality benchmarks for testing the evaluation agent across different scenarios.
They developed custom AI models specifically optimized for qualification evaluation, as general-purpose models could not achieve the necessary combination of accuracy and speed. Using techniques like speculative decoding and custom serving infrastructure, these fine-tuned models can evaluate candidates in seconds rather than minutes, fast enough to support real-time, conversational refinement of requirements.
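Conceptually, the evaluation output is a structured, evidence-backed verdict per qualification, something like the sketch below. The keyword matching here is only a placeholder for the fine-tuned models LinkedIn actually uses.

```python
from dataclasses import dataclass

@dataclass
class QualificationResult:
    qualification: str
    met: bool
    evidence: str  # snippet from the profile or resume that supports the call

def evaluate(profile_text: str, qualifications: list[str]) -> list[QualificationResult]:
    """Toy stand-in for the evaluation model: look for each qualification
    in the profile and attach the matching evidence for the recruiter to review."""
    results = []
    for q in qualifications:
        sentences = [s.strip() for s in profile_text.split(".") if q.lower() in s.lower()]
        results.append(QualificationResult(
            qualification=q,
            met=bool(sentences),
            evidence=sentences[0] if sentences else "no supporting evidence found",
        ))
    return results

profile = "Built a Kubernetes platform for 200 services. Mentored junior engineers."
for result in evaluate(profile, ["Kubernetes", "Terraform"]):
    print(result)
```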
Once promising candidates are identified, the outreach agent handles communication.
It writes personalized messages, sends initial outreach and follow-ups, and replies to candidate questions using job-specific FAQs defined during intake. The agent can even schedule phone screenings directly through messaging, streamlining coordination.
Supporting the interview process, the screening agent prepares tailored interview questions based on hiring requirements and candidate profiles.
It can transcribe and summarize screening conversations while capturing notes and insights. Importantly, recruiters maintain full control, able to take over conversations at any time or guide the process as needed.
The learning agent enables the system to improve over time.
It analyzes recruiter actions such as which candidates they message or add to pipelines, learning from both explicit feedback and implicit behavioral signals. The agent updates job qualifications based on these patterns, but any suggested changes must be reviewed and approved by recruiters before being applied. This ensures the assistant adapts while keeping humans in control.
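A toy version of this loop, with the human-approval gate made explicit, might look as follows; the signal types and the frequency threshold are assumptions for illustration.

```python
from collections import Counter

def suggest_qualification_updates(actions: list[dict], current: list[str]) -> list[str]:
    """Look at implicit signals (who the recruiter messaged or pipelined)
    and suggest skills that keep appearing but aren't yet qualifications.
    A suggestion is only a suggestion until a recruiter approves it."""
    skill_counts = Counter(
        skill
        for action in actions
        if action["type"] in {"messaged", "added_to_pipeline"}
        for skill in action["candidate_skills"]
    )
    return [s for s, n in skill_counts.most_common() if n >= 2 and s not in current]

def apply_if_approved(current: list[str], suggestion: str, approved: bool) -> list[str]:
    # Human-in-the-loop gate: nothing changes without recruiter sign-off.
    return current + [suggestion] if approved else current

actions = [
    {"type": "messaged", "candidate_skills": ["Go", "gRPC"]},
    {"type": "added_to_pipeline", "candidate_skills": ["Go", "Kubernetes"]},
    {"type": "rejected", "candidate_skills": ["PHP"]},
]
qualifications = ["Distributed systems"]
for suggestion in suggest_qualification_updates(actions, qualifications):
    qualifications = apply_if_approved(qualifications, suggestion, approved=True)
print(qualifications)
```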
Finally, the cognitive memory agent gives the assistant persistent memory across interactions.
It remembers past conversations, preferences, and decisions, helping personalize recommendations over time. All memory data remains scoped to the individual recruiter’s environment with strong privacy protections.
This data is never used to train AI models, ensuring customer information stays secure and confidential.
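The scoping idea can be pictured as a per-recruiter keyed store, as in this conceptual sketch; it is not how LinkedIn actually persists memory.

```python
class CognitiveMemory:
    """Per-recruiter memory store. Every read and write is scoped to one
    recruiter, and nothing here ever feeds model training (illustrative only)."""
    def __init__(self):
        self._store: dict[str, list[str]] = {}

    def remember(self, recruiter_id: str, fact: str) -> None:
        self._store.setdefault(recruiter_id, []).append(fact)

    def recall(self, recruiter_id: str, keyword: str) -> list[str]:
        # Only this recruiter's memories are ever searched.
        return [f for f in self._store.get(recruiter_id, []) if keyword.lower() in f.lower()]

memory = CognitiveMemory()
memory.remember("recruiter-42", "Prefers candidates open to hybrid work")
memory.remember("recruiter-42", "Rejected candidates without production ML experience")
print(memory.recall("recruiter-42", "hybrid"))
print(memory.recall("recruiter-7", "hybrid"))  # a different recruiter sees nothing
```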
Building an AI agent that operates at scale requires a comprehensive approach to quality that ensures the system behaves safely, responsibly, and effectively.
The LinkedIn engineering team built its quality framework on two complementary pillars:
Product policy serves as the rails that keep the system on track. These policies set clear boundaries for safety, compliance, and legal standards while defining expected agent behavior. They establish minimum quality thresholds that must be met.
To enforce these standards, LinkedIn employs AI-powered judges that evaluate different aspects of quality. Some judges check for coherence, asking whether outputs make logical sense. Others verify factual accuracy, ensuring the system does not generate false or misleading information.
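Conceptually, each judge returns a verdict and an output only ships if every judge passes, as in this simplified sketch. The coherence check below is a placeholder; the real judges are themselves AI models.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    judge: str
    passed: bool
    note: str

def coherence_judge(output: str) -> JudgeVerdict:
    # Stand-in check: a real judge would be an LLM scoring logical consistency.
    passed = len(output.split()) > 3 and not output.endswith("...")
    note = "output reads as a complete thought" if passed else "output looks truncated"
    return JudgeVerdict("coherence", passed, note)

def policy_gate(output: str, judges) -> bool:
    """An output ships only if every judge passes, enforcing the minimum
    quality thresholds described by the product-policy pillar."""
    verdicts = [judge(output) for judge in judges]
    for v in verdicts:
        print(f"[{v.judge}] passed={v.passed}: {v.note}")
    return all(v.passed for v in verdicts)

print(policy_gate("This candidate meets 4 of 5 qualifications.", [coherence_judge]))
```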
Human alignment acts as the compass, ensuring the assistant moves toward genuinely valuable outcomes.
This pillar is grounded in human-validated data, including annotated datasets where people label examples, and real recruiter activity. When a recruiter messages a candidate or adds them to a pipeline, the system treats this as a strong positive signal.
Over time, the assistant learns to recommend candidates matching these recruiter-validated patterns. Human alignment also serves to validate whether product policies are actually working in practice.
LinkedIn’s Hiring Assistant demonstrates a pragmatic approach to building enterprise-grade AI agents.
By adopting a plan-and-execute architecture, the system breaks complex recruiting workflows into manageable steps, improving reliability and reducing errors. The message-driven design allows each recruiter to have their own assistant instance that works asynchronously in the background, enabling true scale.
The division of labor among specialized sub-agents ensures that each component can focus on what it does best, from sourcing and evaluation to outreach and screening. Integration with LinkedIn’s Economic Graph provides market intelligence that goes beyond simple keyword matching, helping uncover candidates who might otherwise be overlooked.
Perhaps most importantly, the system balances automation with human judgment. The quality framework keeps the assistant safe and aligned with real hiring outcomes, while the learning agent ensures continuous improvement based on individual recruiter preferences.
References:
Building the agentic future of recruiting: how we engineered LinkedIn’s Hiring Assistant
Under the hood: The tech behind the first agent from LinkedIn