2026-02-27 00:30:39
If a database stores data on three servers in three different cities, and we write a new value, when exactly is that write “done”? Does it happen when the first server saves it? Or when all three have it? Or when two out of three confirm?
The answer to this question is quite important. Consider a simple bank transfer where we move $500 from savings to checking. We see the updated balance on our phone. But our partner, checking the same account from their laptop in another city, still sees the old balance. For a few seconds, the household has two different versions of the truth. For a like count on social media, that kind of temporary disagreement is harmless. For a bank balance, it’s a different story.
The guarantee that every reader sees the most recent write, no matter where or when they read, is what distributed systems engineers call strong consistency. It sounds straightforward. Making it work across machines, data centers, and continents is one of the hardest problems in distributed systems, because it requires those machines to coordinate, and coordination has a cost governed by physics.
In a previous article, we looked at eventual consistency. In this article, we will look at what strong consistency actually means, how systems deliver it, and what it really costs.
2026-02-26 00:30:28
Context engineering is the new critical layer in every production AI app, and Redis is the real-time context engine powering it. Redis gathers, syncs, and serves the right mix of memory, knowledge, tools, and state for each model call, all from one unified platform. Search across RAG, short- and long-term memory, and structured and unstructured data without stitching together a fragile multi-tool stack. With 30+ agent framework integrations across OpenAI, LangChain, Bedrock, NVIDIA NIM, and more, Redis fits the stack your teams are already building on. Accurate, reliable AI apps that scale. Built on one platform.
Every time we open X (formerly Twitter) and scroll through the “For You” tab, a recommendation system is deciding which posts to show and in what order. This recommendation system works in real-time.
In the world of social media, this is a big deal: any added latency while the feed is assembled directly degrades the user experience.
Until now, the internal workings of this recommendation system were more or less a mystery. However, recently, the xAI engineering team open-sourced the algorithm that powers this feed, publishing it on GitHub under an Apache-2.0 license. It reveals a system built on a Grok-based transformer model that has replaced nearly all hand-crafted rules with machine learning.
In this article, we will look at what the algorithm does, how its components fit together, and why the xAI Engineering Team made the design choices they did.
Disclaimer: This post is based on publicly shared details from the xAI Engineering Team. Please comment if you notice any inaccuracies.
When you request the For You feed in X, the algorithm draws from two separate sources of content:
The first source is called in-network content. These are posts from accounts you already follow. If you follow 200 people, the system looks at what those 200 people have posted recently and considers them as candidates for your feed.
The second source is called out-of-network content. These are posts from accounts you do not follow. The algorithm discovers them by searching across a global pool of posts using a machine learning technique called similarity search. The idea behind this is that if your past behavior suggests you would find a post interesting, that post becomes a candidate even if you have never heard of the author.
Both sets of candidates are then merged into a single list, scored, filtered, and ranked. The top-ranked posts are what you see when you open the app.
The diagram below shows the overall architecture of the system built by the xAI engineering team:
The codebase is organized into four main directories, each representing a distinct part of the system. The entire codebase is written in Rust (62.9%) and Python (37.1%).
Home Mixer is the orchestration layer. It acts as the coordinator that calls the other components in the right order and assembles the final feed. It does not do the heavy ML work itself; it just manages the pipeline.
When a request comes in, Home Mixer kicks off several stages in sequence:
Fetching user context
Retrieving candidate posts
Enriching those posts with metadata
Filtering out the ineligible ones
Scoring the survivors
Selecting the top results and running final checks
The server exposes a gRPC endpoint called ScoredPostsService that returns the ranked list of posts for a given user.
Thunder is an in-memory post store and real-time ingestion pipeline. It consumes post creation and deletion events from Kafka and maintains per-user stores for original posts, replies, reposts, and video posts.
When the algorithm needs in-network candidates, it queries Thunder, which can return results in sub-millisecond time because everything lives in memory rather than in an external database. Thunder also automatically removes posts that are older than a configured retention period, keeping the data set fresh.
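Thunder’s retention behavior can be modeled in a few lines. The toy Python below is only a sketch of the idea; the real store is a Rust service fed by Kafka events, and all names here are invented for illustration:

```python
import time

class InMemoryPostStore:
    """Toy per-user in-memory post store with a retention window."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.posts_by_user = {}  # user_id -> list of (timestamp, post)

    def add_post(self, user_id, post, now=None):
        now = time.time() if now is None else now
        self.posts_by_user.setdefault(user_id, []).append((now, post))

    def recent_posts(self, user_id, now=None):
        """Return posts inside the retention window, evicting the rest."""
        now = time.time() if now is None else now
        kept = [(ts, p) for ts, p in self.posts_by_user.get(user_id, [])
                if now - ts <= self.retention_seconds]
        self.posts_by_user[user_id] = kept
        return [p for _, p in kept]

store = InMemoryPostStore(retention_seconds=3600)
store.add_post("alice", "older-post", now=0)
store.add_post("alice", "newer-post", now=3000)
```

Because everything lives in a plain in-process data structure, a lookup is just a list scan, which is why Thunder can answer in sub-millisecond time.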
Phoenix is the ML brain of the system. It has two jobs: retrieving out-of-network candidates from the global corpus, and ranking every candidate with a transformer model.
Phoenix uses a two-tower model to find out-of-network posts:
One tower (the User Tower) takes your features and engagement history and encodes them into a mathematical representation called an embedding.
The other tower (the Candidate Tower) encodes every post into its own embedding.
Finding relevant posts then becomes a similarity search. The system computes a dot product between your user embedding and each candidate embedding and retrieves the top-K most similar posts. If you are unfamiliar with dot products, the core idea is that two embeddings that “point in the same direction” in a high-dimensional space produce a high score, meaning the post is likely relevant to you.
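As a rough sketch, here is what dot-product retrieval looks like in plain Python. The embeddings and the full top-K scan are toy stand-ins; at X’s scale the real system would use an approximate nearest-neighbor index rather than scoring every post:

```python
def dot(u, v):
    """Dot product: large when the two vectors point the same way."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_embedding, candidate_embeddings, k):
    """Score each candidate by its dot product with the user embedding
    and return the indices of the k highest-scoring candidates."""
    scores = [dot(user_embedding, c) for c in candidate_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

user = [1.0, 0.0, 1.0]
candidates = [
    [1.0, 0.0, 1.0],  # points the same way as the user: score 2.0
    [0.0, 1.0, 0.0],  # orthogonal to the user: score 0.0
    [0.5, 0.0, 0.5],  # same direction, smaller magnitude: score 1.0
]
top = retrieve_top_k(user, candidates, k=2)
```

The orthogonal candidate scores zero and drops out, while the two aligned candidates are ranked by how strongly they point the user’s way.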
See the diagram below that shows the concept of embeddings:
Once candidates have been retrieved from both Thunder and Phoenix’s retrieval step, Phoenix runs a Grok-based transformer model to predict how likely you are to engage with each post.
See the diagram below that shows the concept of a transformer model:
The transformer implementation is ported from the Grok-1 open source release by xAI, adapted for recommendation use cases. It takes your engagement history and a batch of candidate posts as input and outputs a probability for each type of engagement action.
The Candidate Pipeline is a reusable framework that defines the structure of the whole recommendation process.
It provides traits (interfaces, in Rust terminology) for each stage of the pipeline:
Source (fetch candidates)
Hydrator (enrich candidates with extra data)
Filter (remove ineligible candidates)
Scorer (compute scores)
Selector (sort and pick the top candidates)
SideEffect (run asynchronous tasks like caching and logging).
The framework runs independent stages in parallel where possible and includes configurable error handling. This modular design makes it straightforward for the xAI Engineering Team to add new data sources or scoring models without rewriting the pipeline logic.
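The stage contract is easy to picture in code. The sketch below renders a subset of it in Python (the actual framework defines Rust traits; the stage names come from the article, but the method names, signatures, and toy implementations are illustrative, not xAI’s API):

```python
from abc import ABC, abstractmethod

class Source(ABC):
    @abstractmethod
    def fetch(self, query):
        """Return candidate posts for this request."""

class Filter(ABC):
    @abstractmethod
    def keep(self, query, candidate):
        """Return True if the candidate is still eligible."""

class Scorer(ABC):
    @abstractmethod
    def score(self, query, candidate):
        """Return a numeric relevance score."""

class Selector(ABC):
    @abstractmethod
    def select(self, scored):
        """Sort scored candidates and pick the top ones."""

def run_pipeline(query, sources, filters, scorer, selector):
    """Minimal pipeline driver: source -> filter -> score -> select."""
    candidates = [c for s in sources for c in s.fetch(query)]
    for f in filters:
        candidates = [c for c in candidates if f.keep(query, c)]
    scored = [(scorer.score(query, c), c) for c in candidates]
    return selector.select(scored)

# Toy implementations to show the flow end to end.
class ListSource(Source):
    def __init__(self, posts): self.posts = posts
    def fetch(self, query): return list(self.posts)

class AlreadySeenFilter(Filter):
    def keep(self, query, candidate): return candidate not in query["seen"]

class LengthScorer(Scorer):
    def score(self, query, candidate): return len(candidate)

class TopK(Selector):
    def __init__(self, k): self.k = k
    def select(self, scored):
        return [c for _, c in sorted(scored, reverse=True)[: self.k]]

result = run_pipeline(
    query={"seen": {"post-a"}},
    sources=[ListSource(["post-a", "longer-post-b", "post-c"])],
    filters=[AlreadySeenFilter()],
    scorer=LengthScorer(),
    selector=TopK(2),
)
```

Because each stage only depends on the interface, swapping in a new source or scorer is a local change, which is the property the xAI team was after.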
Here is the full sequence that runs every time you open the For You feed:
Query Hydration: The system fetches your recent engagement history (what you liked, replied to, and reposted) and your metadata, such as your following list.
Candidate Sourcing: Thunder provides recent posts from accounts you follow. Phoenix Retrieval provides ML-discovered posts from the global corpus.
Candidate Hydration: Each candidate post is enriched with additional information: its text and media content, the author’s username and verification status, video duration if applicable, and subscription status.
Pre-Scoring Filters: Before any scoring happens, the system removes posts that are duplicates, too old, or authored by you; posts from accounts you have blocked or muted; posts containing keywords you have muted; posts you have already seen; and ineligible subscription content.
Scoring: The remaining candidates pass through multiple scorers in sequence. First, the Phoenix Scorer gets ML predictions from the transformer. Then, the Weighted Scorer combines those predictions into a single relevance score. Next, an Author Diversity Scorer reduces the score of posts from repeated authors so your feed is not dominated by one person. Finally, an OON (out-of-network) Scorer adjusts scores for posts from accounts you do not follow.
Selection: Posts are sorted by their final score, and the top K are selected.
Post-Selection Filters: A final round of checks removes posts that have been deleted, flagged as spam, or identified as containing violent or graphic content. A conversation deduplication filter also ensures you do not see multiple branches of the same reply thread.
The Phoenix transformer predicts probabilities for a wide range of user actions: liking, replying, reposting, quoting, clicking, visiting the author’s profile, watching a video, expanding a photo, sharing, dwelling (spending time reading), following the author, marking “not interested,” blocking the author, muting the author, and reporting the post.
Each of these predicted probabilities is multiplied by a weight and then summed to produce a final score. Positive actions like liking, reposting, and sharing carry positive weights. Negative actions like blocking, muting, and reporting carry negative weights. This means that if the model predicts you are likely to block the author of a post, that post’s score gets pushed down significantly. The formula is simple:
Final Score = sum of (weight for action * predicted probability of that action)
This multi-action prediction approach is more nuanced than a single “relevance” score because it lets the system distinguish between content you would enjoy and content you would find annoying or harmful.
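The formula above can be sketched in a few lines. Note that the weight values and predicted probabilities below are made up for illustration; the real weights are tuned by X and are not part of the public description:

```python
# Hypothetical action weights: positive actions pull the score up,
# negative actions push it down.
WEIGHTS = {
    "like": 1.0, "repost": 1.5, "share": 2.0,      # positive engagement
    "block": -8.0, "mute": -4.0, "report": -10.0,  # negative signals
}

def final_score(predicted_probabilities):
    """Final Score = sum over actions of weight(action) * P(action)."""
    return sum(WEIGHTS[action] * p
               for action, p in predicted_probabilities.items())

# Same positive engagement, but the second post is likely to get the
# author blocked, so its score is pushed far down.
likely_enjoyed = final_score({"like": 0.6, "repost": 0.2, "block": 0.01})
likely_blocked = final_score({"like": 0.6, "repost": 0.2, "block": 0.30})
```

Even though both posts have the same predicted like and repost probabilities, the high block probability drives the second score negative.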
There are five architectural choices worth understanding from xAI’s recommendation system:
Instead of humans deciding which signals matter (post length, hashtag count, time of day), the Grok-based transformer learns what matters directly from user engagement sequences. This simplifies the data pipelines and serving infrastructure.
When the transformer scores a batch of candidate posts, each post can only “attend to” (or look at) the user’s context. It cannot attend to the other candidates in the same batch. This design choice ensures that a post’s score does not change depending on which other posts happen to be in the same batch. It makes scores consistent and cacheable, which is important at the scale X operates at.
Both the retrieval and ranking stages use multiple hash functions for embedding lookup, a common technique for mapping an effectively unbounded, constantly changing set of users and posts into fixed-size embedding tables.
Rather than collapsing everything into a single relevance number, the model predicts probabilities for many distinct actions. This gives the Weighted Scorer fine-grained control over what the feed optimizes for.
The Candidate Pipeline framework separates the pipeline’s execution logic from the business logic of individual stages. This makes it easy to add a new data source, swap in a different scoring model, or insert a new filter without touching the rest of the system.
References:
2026-02-25 00:30:15
Monster SCALE Summit is a new virtual conference all about extreme-scale engineering and data-intensive applications.
Join us on March 11 and 12 to learn from engineers at Discord, Disney, LinkedIn, Uber, Pinterest, Rivian, ClickHouse, Redis, MongoDB, ScyllaDB and more. A few topics on the agenda:
What Engineering Leaders Get Wrong About Scale
How Discord Automates Database Operations at Scale
Lessons from Redesigning Uber’s Risk-as-a-Service Architecture
Scaling Relational Databases at Nextdoor
How LinkedIn Powers Recommendations to Billions of Users
Powering Real-Time Vehicle Intelligence at Rivian with Apache Flink and Kafka
The Data Architecture behind Pinterest’s Ads Reporting Services
Bonus: We have 500 free swag packs for attendees. And everyone gets 30-day access to the complete O’Reilly library & learning platform.
Uber’s infrastructure runs on thousands of microservices, each making authorization decisions millions of times per day. This includes every API call, database query, and message published to Kafka. To make matters more interesting, Uber needs these decisions to happen in microseconds to have the best possible user experience.
Traditional access control could not handle the complexity. For instance, you might say “service A can call service B” or “employees in the admin group can access this database.” While these rules work for small systems, they fall short when you need more control. For example, what if you need to restrict access based on the user’s location, the time of day, or relationships between different pieces of data?
Uber needed a better approach. They built an attribute-based access control system called Charter to evaluate complex conditions against attributes pulled from various sources at runtime.
In this article, we will look at how the Uber engineering team built Charter and the challenges they faced.
Disclaimer: This post is based on publicly shared details from the Uber Engineering Team. Please comment if you notice any inaccuracies.
Before diving into ABAC, you need to understand how Uber thinks about authorization. Every access request can be broken down into a simple question:
Can an Actor perform an Action on a Resource in a given Context?
Let’s understand each component of this statement:
Actor represents the entity making the request. At Uber, this could be an employee, a customer, or another microservice. Uber uses the SPIFFE format to identify actors. An employee might be identified as spiffe://personnel.upki.ca/eid/123456, where 123456 is their employee ID. A microservice running in production would be identified as spiffe://prod.upki.ca/workload/service-foo/production.
Action describes what the actor wants to do. Common actions include create, read, update, and delete, often abbreviated as CRUD. Services can also define custom actions like invoke for API calls, subscribe for message queues, or publish for event streams.
A resource is the object being accessed. Uber represents resources using UON, which stands for Uber Object Name. This is a URI-style format that looks like uon://service-name/environment/resource-type/identifier. For example, a specific table in a database might be uon://orders.mysql.storage/production/table/orders.
The host portion of the UON is called the policy domain. This acts as a namespace for grouping related policies and configurations.
As mentioned, Uber built a centralized service called Charter to manage all authorization policies. Think of Charter as a policy repository where administrators define who can access what. This approach offers several advantages over having each service implement its own authorization logic.
See the diagram below:
Policies stored in Charter are distributed to individual services. Each service includes a local library called authfx that evaluates these policies.
The architecture works as follows:
Policy authors create and update policies in Charter
Charter stores these policies in a database
A unified configuration distribution system pushes policy updates to all relevant services
Services use the authfx library to evaluate policies for incoming requests
Authorization decisions are made locally within each service
SerpApi turns live search engines into APIs, returning clean JSON for results, reviews, prices, locations, and more. Use it to ground your app or LLMs with real-world data from Google, Maps, Amazon, and beyond, without maintaining scrapers.
The simplest form of policy at Uber connects actors to resources through actions.
A basic policy might look like this in YAML format:
```yaml
file_type: policy
effect: allow
actions:
  - invoke
resource: "uon://service-foo/production/rpc/foo/method1"
associations:
  - target_type: WORKLOAD
    target_id: "spiffe://prod.upki.ca/workload/service-bar/production"
```

This policy translates to: “Allow service-bar to invoke method1 of service-foo.” Another example shows how employees can be granted access:
```yaml
file_type: policy
effect: allow
actions:
  - read
  - write
resource: "uon://querybuilder/production/report/*"
associations:
  - target_type: GROUP
    target_id: "querybuilder-development"
```

This policy means: “Allow employees in the querybuilder-development group to read and write query reports.”
These basic policies work well for straightforward authorization scenarios. However, real-world requirements are often more complex.
Uber encountered several limitations with the basic policy model.
For example, consider a payment support service. Customer support representatives need to access payment information to help customers. However, for privacy and compliance reasons, support reps should only access payment data for customers in their assigned region. The basic policy syntax can only specify that a representative can access a payment profile by its UUID. It cannot express the requirement that the rep’s region must also match the customer’s region.
Another example involves employee data. An employee information service needs to allow employees to view and edit their own profiles. It should also allow their managers to access their profiles. The basic policy model cannot express this “self or manager” relationship because it would require checking whether the actor’s employee ID matches either the resource’s employee ID or the resource’s manager ID.
A third scenario involves data analytics. Some reports should only be accessible to users who belong to multiple specific groups simultaneously. The existing model supported checking if a user belonged to any group in a list, but not whether they belonged to all groups in a list.
In a nutshell, Uber needed a way to incorporate additional context and attributes into authorization decisions.
ABAC extends the basic policy model by adding conditions. A condition is a Boolean expression that evaluates to true or false based on attributes. If a permission includes a condition, that permission only grants access when the condition evaluates to true.
Attributes are characteristics of actors, resources, actions, or the environment. For example:
An actor might have attributes like location, department, or role.
A resource might have attributes such as owner, sensitivity level, or creation date.
The environment might provide attributes like current time, day of the week, or request IP address.
Attribute Stores are the sources that provide attribute values at authorization time. In formal authorization terminology, these are called Policy Information Points or PIPs. When evaluating a condition, the authorization engine queries the appropriate attribute store to fetch the necessary values.
The enhanced policy model adds an optional condition field to each permission. Here’s an example:
```yaml
actions: [create, delete, read, update]
resource: "uon://payments.svc/production/payment/*"
associations:
  - target_type: EMPLOYEE
    condition:
      expression: "resource.paymentType == 'credit card' && actor.location == resource.paymentLocation"
    effect: ALLOW
```

This policy allows employees to perform CRUD operations on payment records, but only when two conditions are met: the payment type is a credit card, and the employee’s location matches the payment’s location.
When ABAC is enabled, the authorization architecture includes additional components.
The authfx library now includes an authorization engine that coordinates policy evaluation. When a request arrives, the engine first checks if the basic requirements are met: does the actor match, does the action match, does the resource match? If those checks pass and a condition exists, the engine moves to condition evaluation.
The authorization engine interacts with an expression engine that evaluates the condition expression. The expression engine identifies which attributes are needed and requests them from the appropriate attribute stores. See the diagram below:
Uber defined four types of attribute store interfaces:
ActorAttributeStore fetches attributes about the actor making the request. This might include their employee ID, group memberships, location, or department.
ResourceAttributeStore fetches attributes about the resource being accessed. This could include the resource’s owner, creation date, sensitivity classification, or any custom business attributes.
ActionAttributeStore fetches attributes related to the action being performed, though this is used less frequently than actor and resource attributes.
EnvironmentAttributeStore fetches contextual attributes like the current timestamp, day of week, or request metadata.
Each attribute store must implement a SupportedAttributes() function that declares which attributes it can provide. This enables the authorization engine to pre-compile condition expressions and validate that all required attributes are available. At runtime, when an attribute value is needed, the engine calls the appropriate method on the corresponding store.
See the code snippet below:
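The original snippet is not reproduced here, but the contract can be sketched in Python. Uber’s services are written in Go and Java, and apart from the SupportedAttributes() idea described above, the class, method, and data names below are invented for illustration:

```python
class ActorAttributeStore:
    """Sketch of an attribute store providing attributes about the
    actor making a request."""

    # Hard-coded stand-in for a real source such as an employee directory.
    DIRECTORY = {
        "spiffe://personnel.upki.ca/eid/123456": {
            "actor.location": "US",
            "actor.adgroup": ["querybuilder-development"],
        }
    }

    def supported_attributes(self):
        """Declare which attributes this store can provide, so the
        engine can pre-compile and validate condition expressions."""
        return {"actor.location", "actor.adgroup"}

    def get(self, actor_id, attribute):
        """Fetch one attribute value for an actor at evaluation time."""
        return self.DIRECTORY[actor_id][attribute]

store = ActorAttributeStore()
location = store.get("spiffe://personnel.upki.ca/eid/123456", "actor.location")
```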

The design allows a single service to use multiple attribute stores, and a single attribute store can be shared across multiple services for reusability.
To represent conditions based on attributes, Uber needed an expression language. Rather than inventing a new language from scratch, the engineering team evaluated existing open-source options.
They selected the Common Expression Language (CEL), developed by Google. CEL offered several advantages:
First, it has a simple, familiar syntax similar to other programming languages.
Second, it supports multiple data types, including strings, numbers, booleans, and lists.
Third, it includes built-in functions for string manipulation, arithmetic operations, and boolean logic.
CEL also provides macros that are particularly useful for working with collections. For instance, you can write actor.groups.exists(g, g == 'admin') to check if the actor belongs to a group called “admin.”
The performance characteristics of CEL were excellent. Expression evaluation typically takes only a few microseconds. Both Go and Java implementations of CEL are available, meeting Uber’s backend service requirements. Additionally, both implementations support lazy attribute fetching, meaning they only request the attribute values actually needed to evaluate the expression, improving efficiency.
A sample CEL expression looks like this:
```
resource.paymentType == 'credit card' && actor.location == resource.paymentLocation
```

This expression is evaluated against attribute values fetched at runtime to produce a true or false result.
To illustrate the practical benefits of ABAC, consider how Uber manages authorization for Apache Kafka topics.
Uber uses thousands of Kafka topics for event streaming across its platform. Each topic needs access controls to specify which services can publish messages and which can subscribe to receive messages. The Kafka infrastructure team is responsible for managing these policies.
With basic policies, the Kafka team would need to create individual policies for every topic. Given the sheer volume of topics, this would be impractical and time-consuming.
Uber has a service called uOwn that tracks ownership and roles for technological assets. Each Kafka topic can have roles assigned directly or inherited through the organizational hierarchy. One such role is “Develop,” which indicates responsibility for developing and maintaining that topic.
Using ABAC, the Uber engineering team created a single generic policy that applies to all Kafka topics:
```yaml
effect: allow
actions: [admin]
resource: "uon://topics.kafka/production/*"
associations:
  - target_type: EMPLOYEE
    condition:
      expression: "actor.adgroup.exists(x, x in resource.uOwnDevelopGroups)"
```

Source: Uber Engineering Blog
The wildcard in the resource pattern means this policy applies to every Kafka topic. The condition checks whether the actor belongs to any Active Directory group that has the Develop role for the requested topic.
An attribute store plugin retrieves the list of groups with the Develop role for each topic from uOwn. This information becomes the resource.uOwnDevelopGroups attribute. When an employee attempts to perform an admin action on a topic, the authorization engine evaluates whether that employee belongs to one of the authorized groups.
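Stripped of CEL syntax, the condition reduces to a set-membership check. A Python equivalent, with made-up group names:

```python
def develop_role_allows(actor_adgroups, uown_develop_groups):
    """Python equivalent of the CEL condition
    actor.adgroup.exists(x, x in resource.uOwnDevelopGroups):
    true if the actor belongs to any AD group that holds the
    Develop role for the requested topic."""
    return any(group in uown_develop_groups for group in actor_adgroups)

# Hypothetical group names; in production, the topic's Develop groups
# arrive via the uOwn attribute-store plugin at evaluation time.
allowed = develop_role_allows(
    actor_adgroups=["kafka-core-dev", "payments-eng"],
    uown_develop_groups={"kafka-core-dev", "streaming-platform"},
)
```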
This solution saved the Kafka team enormous effort. Instead of managing thousands of individual policies, they maintain one generic policy. As ownership changes in uOwn, authorization automatically adjusts without any policy updates.
The implementation of ABAC delivered multiple benefits across Uber’s infrastructure.
Authorization policies became more precise and fine-grained. Decisions could now consider any relevant attribute rather than just basic identity and group membership. This enabled security policies that more accurately reflected business requirements.
The system became more dynamic. When attribute values change in source systems like uOwn or employee directories, authorization decisions automatically adapt. No code deployment or policy update is required. This agility is critical in a fast-moving organization.
Scalability improved dramatically. A single well-designed ABAC policy can govern authorization for thousands or even millions of resources.
Centralization through Charter made policy management easier. Rather than scattering authorization logic across hundreds of services, security teams can audit and manage policies in one place.
Performance remained excellent. Despite the added complexity of condition evaluation and attribute fetching, authorization decisions are still completed in microseconds due to local evaluation and on-demand attribute fetching.
Also, most importantly, ABAC separated policy from code. System owners can change authorization policies without building and deploying new code. This separation of concerns allows security policies to evolve independently from application logic.
Since implementing ABAC, 70 Uber services have adopted attribute-based policies to meet their specific authorization requirements. The framework provides a unified approach across diverse use cases, from protecting microservice endpoints to securing database access to managing infrastructure resources.
References:
2026-02-24 00:30:39
To scale with LLMs, you need to know how to monitor them effectively. In this eBook, get practical strategies to monitor, debug, and secure LLM-powered applications. From tracing multi-step workflows and detecting prompt injection attacks to evaluating response quality and tracking token usage, you’ll learn best practices for integrating observability into every layer of your LLM stack.
When we talk about large language models “learning,” we can end up creating a misleading impression. The word “learning” suggests something similar to human learning, complete with understanding, reasoning, and insight.
However, that’s not what happens inside these systems. LLMs don’t learn the way you learned to code or solve problems. Instead, they follow repetitive mathematical procedures billions of times, adjusting countless internal parameters until they become very good at mimicking patterns in text.
This distinction matters more than you might think because it changes the way LLMs generate their answers.
Understanding how LLMs actually work helps you know when to trust their outputs and when to be skeptical. It reveals why they can write convincing essays about topics they don’t fully understand, and why they sometimes fail in surprising ways.
In this article, we’ll explore three core concepts that have a key impact on the working of LLMs: loss functions (how we measure failure), gradient descent (how we make improvements), and next-token prediction (what LLMs actually do).
Before an LLM can learn anything, we need a way to measure how badly it’s performing. This measurement is called a loss function.
Think of it as a scoring system that provides a single number representing how wrong the model is. The higher the number, the worse the performance. The goal of training is to make this number as small as possible.
However, you can’t just pick any measurement and expect it to work. A good loss function must satisfy three critical requirements:
First, it must be specific. It needs to measure something concrete and not vague. If someone told you to “build an intelligent computer,” you’d struggle because intelligence itself is hard to define. Would a system that passes an IQ test count? Probably not, since computers have passed IQ tests for over a decade without being useful for much else. For LLMs, we pick something very specific, such as predicting the next word in a sequence correctly. This is concrete and measurable.
Second, the loss function must be computable. The computer needs to calculate it quickly and repeatedly. We can’t measure abstract qualities like “creativity” or “hard work” because these aren’t things you can easily quantify with the data available during training. However, you can measure whether a predicted word matches the actual next word in your training data. That’s a simple comparison that computers handle effortlessly.
Third, the loss function must be smooth. This is the trickiest requirement to grasp. Smoothness means the function’s output should change gradually as inputs change, without sudden jumps or breaks. Imagine walking down a gentle slope versus walking down a staircase. The slope is smooth because your altitude changes continuously. Stairs are not smooth because you suddenly drop from one step to the next.
Why does smoothness matter?
The training algorithm needs to figure out which direction to adjust the model’s parameters. If the loss function jumps around wildly, the algorithm can’t determine whether it’s moving in the right direction. Interestingly, accuracy (counting correct predictions) isn’t smooth because you can’t have partial predictions. You either got 47 or 48 predictions right, not 47.3. This is why LLMs actually optimize for something called cross-entropy loss instead, which is smooth and works better mathematically, even though accuracy is what we ultimately care about.
The crucial point to understand here is that LLMs are scored on matching patterns in their training data, not on being truthful or correct. If false information appears frequently in training data, the model gets rewarded for reproducing it. This fundamental design choice explains why LLMs can confidently state things that are completely wrong.
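The smoothness argument is easy to see in code. The sketch below compares cross-entropy loss with accuracy for a single prediction, where p_correct is the probability the model assigned to the true next token:

```python
import math

def cross_entropy(p_correct):
    """Loss for one prediction: -log of the probability assigned to
    the true next token. Smooth: a slightly better probability always
    means a slightly lower loss."""
    return -math.log(p_correct)

def accuracy(p_correct, threshold=0.5):
    """A step function: the prediction either counts as right or it
    doesn't. Nudging p_correct from 0.49 to 0.51 jumps the value from
    0 to 1, giving the training algorithm no direction to follow."""
    return 1.0 if p_correct > threshold else 0.0

loss_worse = cross_entropy(0.49)
loss_better = cross_entropy(0.51)  # slightly better prediction, slightly lower loss
```

Cross-entropy rewards every small improvement, while accuracy only moves at the threshold, which is exactly why training optimizes the former.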
Many developer tools promise context-aware AI, but having data access doesn’t automatically mean agents know when to use it.
Real context requires understanding. Unblocked synthesizes knowledge from your codebase, PRs, discussions, docs, project trackers, and runtime signals. It connects past decisions to current work, resolves conflicts between outdated docs and actual practice, respects data permissions, and surfaces what matters for the task at hand.
With Unblocked:
Coding agents like Cursor, Claude, and Copilot generate output that aligns with your actual architecture and conventions
Code review focuses on real bugs rather than stylistic nits
You find instant answers without interrupting teammates
Once the loss function is decided, we need a process to actually improve the model. This is where gradient descent comes in.
Gradient descent is the algorithm that figures out how to adjust the billions of parameters inside a neural network to reduce the loss.
See the diagram below:
Imagine you have a ball sitting somewhere on a hilly landscape. The ball’s position represents the model’s current parameter values. The height of the ground beneath the ball represents the loss function’s output. Valleys represent low loss (good performance), and peaks represent high loss (bad performance). The goal is to get the ball to the lowest valley possible.
The process follows these steps:
Start with the ball at a random position on the landscape
Look at the slope directly around the ball to determine which direction is downhill
Roll the ball a tiny distance in that downhill direction
Repeat this process billions of times until the ball settles in a valley
Each adjustment is incredibly small. We’re not throwing the ball or making dramatic changes, but nudging it slightly based on the local slope. The “gradient” in gradient descent refers to this slope measurement, which tells you both the direction and steepness of the decline.
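The ball-on-a-landscape steps above can be sketched in a few lines of plain Python, using a made-up one-dimensional loss (x − 3)² whose only valley sits at x = 3 (real models adjust billions of parameters at once, but the loop is the same idea):

```python
def loss(x):
    return (x - 3.0) ** 2        # a one-dimensional "landscape"; the valley is at x = 3

def gradient(x):
    return 2.0 * (x - 3.0)       # the slope of the loss at x

x = -4.0                          # start the "ball" at an arbitrary position
learning_rate = 0.1               # how far to nudge per step

for step in range(200):
    x -= learning_rate * gradient(x)   # roll a tiny distance downhill

print(round(x, 4))                # the ball has settled very close to 3.0
```

Each iteration only looks at the local slope, which is the greediness described next.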
This approach uses a greedy algorithm, meaning it only considers the immediate next step without looking ahead. Picture walking downhill in thick fog where you can only see your feet. We can tell which direction slopes downward right where we’re standing, but we can’t see if there’s a deeper valley just beyond a small uphill section. The ball might settle in a minor dip when a much better solution exists nearby.
Why use such a limited approach?
This is because the alternative is computationally impossible. An LLM might have hundreds of billions of parameters. Evaluating all possible future states to find the absolute best solution would take longer than the lifespan of the universe. Gradient descent is practical because each step is simple and cheap to compute, even though we need billions of them.
Modern LLMs use a variation called Stochastic Gradient Descent, or SGD. The word “stochastic” means random. Instead of calculating loss across all your training data at once (which would require impossible amounts of memory), SGD uses small random batches of data. This makes training feasible with massive datasets. If we have a billion training examples, we can take a billion small steps using different random samples, which actually works better than trying to process everything at once.
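The batching idea can be shown with a toy sketch (plain Python fitting a single made-up parameter, nothing like a real LLM in scale): each step estimates the gradient from a small random sample instead of the whole dataset.

```python
import random

random.seed(0)
# Toy dataset: y = 2x plus a little noise.
data = [(x, 2.0 * x + random.uniform(-0.1, 0.1)) for x in range(100)]

w = 0.0            # the single parameter to learn (true value is 2.0)
lr = 1e-4
batch_size = 8

for step in range(2000):
    batch = random.sample(data, batch_size)   # small random batch, not the full dataset
    # Gradient of mean squared error with respect to w, averaged over the batch.
    grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= lr * grad

print(round(w, 2))  # close to 2.0
```

Each step sees only 8 of the 100 examples, yet the noisy per-batch gradients still average out to the right direction, which is why SGD scales to datasets far too large to fit in memory.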
Now we get to what LLMs actually train on. Despite their ability to write essays, explain concepts, and hold conversations, LLMs are trained on one simple task: predict the next word in a sequence.
Take the sentence “The cat sat on the mat.” During training, the model doesn’t see the whole sentence at once. Instead, it trains on overlapping segments:
Input: “The” → Predict: “cat” → If correct, gain a point
Input: “The cat” → Predict: “sat” → If correct, gain a point
Input: “The cat sat” → Predict: “on” → If correct, gain a point
Input: “The cat sat on” → Predict: “the” → If correct, gain a point
Input: “The cat sat on the” → Predict: “mat” → If correct, gain a point
This process repeats billions of times across trillions of words from the internet, books, articles, and other text sources. Every time the model predicts correctly, gradient descent adjusts its parameters to make similar predictions more likely in the future. Every time it predicts incorrectly, the parameters adjust to make that mistake less likely.
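The sliding-window examples above can be generated mechanically. Here is a minimal sketch that splits on spaces for readability (real models use subword tokenization, so this is a simplification):

```python
def next_token_examples(sentence: str):
    """Turn one sentence into (input, target) training pairs."""
    tokens = sentence.split()
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]

for context, target in next_token_examples("The cat sat on the mat"):
    print(f"Input: {context!r} -> Predict: {target!r}")
# A six-word sentence yields five training pairs, from
# ("The", "cat") up to ("The cat sat on the", "mat").
```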
But why does this simple task produce such convincing outputs? The answer lies in how context narrows down possibilities.
Consider predicting the next word in this sequence: “I love to eat.” Without more context, it could be almost any food. But add more information: “I love to eat something for breakfast.” Now you’re narrowed down to breakfast foods like eggs, cereal, pancakes, or toast. Add even more: “I love to eat something for breakfast with chopsticks.” Now you’re thinking about foods eaten with chopsticks at breakfast, perhaps rice or noodles. Include geography: “I love to eat something for breakfast with chopsticks in Tokyo.” The possibilities narrow further to Japanese breakfast items.
LLMs excel at this pattern recognition because they process billions of these associations during training. They learn which words tend to follow others in different contexts. The more context we provide, the better their predictions become. This is why longer prompts often produce better results.
The transformer architecture that powers modern LLMs has a critical advantage over older approaches. It can process all these training examples in parallel rather than one at a time. This parallelization is why we can now train models on datasets that would take you multiple lifetimes to read. It’s the breakthrough that made current LLMs possible.
Next-token prediction through pattern matching produces impressive results. LLMs can write in different styles, translate languages, explain complex topics, and generate code. They spot subtle patterns across billions of examples that humans would never notice. For most common tasks, this approach works quite well.
However, pattern matching is not reasoning, and this creates predictable failure modes.
Consider what happens when you ask an LLM a question with a false premise. The model doesn’t stop to verify whether the premise is true. Instead, it pattern-matches to the kind of answer that usually follows such questions in its training data. The response can sound authoritative and detailed while explaining something that isn’t true. In other words, the model is trained to continue patterns in text, not to fact-check or apply logical reasoning.
This problem extends to situations where training data is scarce. Suppose you ask an LLM to write code in Python. It will likely produce excellent results because massive amounts of Python code exist in its training data. However, ask it to write the same code in an obscure programming language, and it starts making confident mistakes. It might use operators that don’t exist in that language or call functions with the wrong number of arguments. The model extrapolates common programming patterns from popular languages, assuming they apply everywhere. With insufficient training examples to learn otherwise, these extrapolations lead to errors.
Perhaps most tellingly, LLMs fail at variations of problems they’ve seen before. There’s a famous logic puzzle about transporting a cabbage, a goat, and a wolf across a river with specific constraints about which items can’t be left alone together. LLMs solve this puzzle easily because it appears many times in their training data. However, if you slightly modify the constraints, the model often continues using the original solution. It doesn’t reason through the new logical requirements. Instead, it pattern-matches to the familiar puzzle and reproduces the memorized answer.
This happens because of how transformers work internally. When the model sees text that looks very similar to something in its training data, it does a fuzzy match and retrieves the known answer. This is efficient for common problems but fails when those small differences actually matter.
The core issue is that LLMs are optimized to reproduce patterns from their training data, not to be truthful, logical, or correct. When training data contains errors (and internet data contains many), models learn to reproduce those errors. When training data contains biases, models learn those too. When a task requires actual reasoning rather than pattern matching, the illusion can break down.
Understanding the mechanics of LLM training helps you use these tools more effectively.
LLMs are sophisticated pattern-matching systems that predict tokens through billions of small parameter adjustments. They’re not reasoning engines, and they don’t truly understand the text they generate.
This knowledge suggests several practical guidelines:
Use LLMs for tasks that are well-represented in their training data. They excel at common programming problems, generating content in standard formats, and answering frequently asked questions. They’re powerful productivity tools that can save enormous amounts of time on routine work.
However, be skeptical when dealing with novel problems, unusual edge cases, or domains where accuracy is critical.
Always verify outputs for important use cases. Don’t assume that confident-sounding responses are correct. The training process optimizes for sounding like training data, not for being right.
Most importantly, remember that LLMs are tools with specific capabilities and specific limitations. They’re remarkable at what they do, which is identifying and reproducing patterns in text. However, pattern matching, no matter how sophisticated, is not the same as reasoning, understanding, or intelligence. Knowing this difference helps you leverage their strengths while avoiding their weaknesses.
2026-02-22 00:30:27
npx workos launches an AI agent, powered by Claude, that reads your project, detects your framework, and writes a complete auth integration directly into your existing codebase. It’s not a template generator. It reads your code, understands your stack, and writes an integration that fits.
Then it typechecks and builds, feeding any errors back to itself to fix. Just run npx workos, from WorkOS.
This week’s system design refresher:
What Is Redis Really About? Why Is It So Popular? (YouTube video)
RabbitMQ vs Kafka vs Pulsar
What Are Agent Skills Really About? (YouTube video)
REST vs GraphQL
LAST CALL FOR ENROLLMENT: Become an AI Engineer - Cohort 4
RabbitMQ, Kafka, and Pulsar all move messages, but they solve very different problems under the hood.
This diagram looks simple, but it hides three very different mental models for building distributed systems.
RabbitMQ is a classic message broker. Producers publish to exchanges, exchanges route messages to queues, and consumers compete to process them.
Messages are pushed, acknowledged, and then gone. It’s great for task distribution, request handling, and workflows where “do this once” really matters.
Kafka flips the model. It’s not a queue, it’s a distributed log. Producers append events to partitions. Data stays there based on retention, not consumption. Consumers pull data using offsets and can replay everything.
This is why Kafka works so well for event streaming, analytics, and pipelines where multiple teams need the same data at different times.
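One way to feel the queue-versus-log distinction is a toy in-memory sketch (plain Python standing in for both systems; this is not real RabbitMQ or Kafka client code):

```python
from collections import deque

# Queue semantics (RabbitMQ-style): deliver, acknowledge, delete.
queue = deque(["order-1", "order-2", "order-3"])
processed = [queue.popleft() for _ in range(len(queue))]
print(processed)       # every message handled once
print(list(queue))     # [] -- messages are gone after consumption

# Log semantics (Kafka-style): events are retained; consumers track offsets.
log = ["order-1", "order-2", "order-3"]

def read_from(offset):
    return log[offset:]          # reading never removes anything

print(read_from(0))    # one team reads everything today
print(read_from(0))    # another team replays the same events tomorrow
```

The second model is what makes "multiple teams, same data, different times" natural: the data's lifetime is a retention policy, not a side effect of being read.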
Pulsar tries to combine both worlds. Brokers handle serving traffic, while BookKeeper stores data in a durable ledger. Consumers track position with cursors instead of offsets.
This separation lets Pulsar scale storage and compute independently and support both streaming and queue-like patterns.
Choosing between them isn’t about “which is faster” or “which is popular.” It’s about how you want data to flow, how long it should live, and how many times it needs to be read.
Join us on February 24, 2026 (AMER) / February 25, 2026 (EMEA & APJ) for a free live webinar where we’ll unveil how Intelligent Observability can help you build smarter automations. During the event, you’ll see our new agentic platform in action—an essential tool if you’re working with AI Agents. We’ll also share key updates on our innovations in APM, Infrastructure, and the latest advancements in OpenTelemetry support. This is your opportunity to explore cutting-edge solutions designed to empower your work and streamline your operations.
With REST, the server decides the response shape. You call “/v1/articles/123” and you get whatever that endpoint returns. If you need related data, you make another request. If the payload is larger than needed, you live with over-fetching.
HTTP gives you great primitives though: clear resource boundaries, URL-based versioning, and native caching via ETag, Cache-Control, and CDNs.
With GraphQL, the client decides the response shape. You send a single query describing exactly what fields you want. Behind the scenes, a GraphQL gateway fans out to multiple services, runs resolvers, and aggregates the response.
The complexity shifts from the client to the server. Caching still exists, but it usually lives at the application layer (persisted queries, response caching), not automatically at the HTTP layer.
Neither approach is “better” by default. REST optimizes for simplicity, cacheability, and clear ownership of resources. GraphQL optimizes for flexibility, client-driven data needs, and aggregation across services.
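A toy sketch of the server-shaped versus client-shaped contrast (plain Python dictionaries standing in for both; real GraphQL parses a query language and runs resolvers, so the field list here is only an illustration):

```python
# What an endpoint like GET /v1/articles/123 might return: the server picks
# the shape, so the client receives every field whether it needs it or not.
ARTICLE = {
    "id": 123,
    "title": "Strong Consistency",
    "body": "...thousands of characters...",
    "author_id": 42,
    "tags": ["databases", "distributed-systems"],
}

def rest_get_article(article_id):
    return dict(ARTICLE)               # full payload, fixed shape (over-fetching)

# A GraphQL-style resolver lets the client name exactly the fields it wants.
def graphql_get_article(article_id, fields):
    return {f: ARTICLE[f] for f in fields}

print(sorted(rest_get_article(123)))                 # every field, every time
print(graphql_get_article(123, ["id", "title"]))     # only what was asked for
```

The trade-off in the text falls out directly: the REST response is trivially cacheable because its shape never varies, while the GraphQL response depends on what each client requested.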
Over to you: What signals tell you REST is enough, and when GraphQL becomes worth it?
Enrollment for our upcoming Become an AI Engineer - Cohort 4 is closing soon, and classes officially begin on February 21.
Get 40% off your registration cost with code: BBGNL
This is not just another course about AI frameworks and tools. Our goal is to help engineers build the foundation and end-to-end skill set needed to thrive as AI engineers.
Here’s what makes this cohort special:
Learn by doing: Build real-world AI applications instead of just watching videos.
Structured, systematic learning path: Follow a carefully designed curriculum that takes you step by step, from fundamentals to advanced topics.
Live feedback and mentorship: Get direct feedback from instructors and peers.
Community-driven: Learning alone is hard. Learning with a community is easy!
We are focused on skill building, not just theory or passive learning. Our goal is for every participant to walk away with a strong foundation for building AI systems.
If you want to start learning AI from scratch, this is the perfect time to begin.