
RSS preview of the ByteByteGo blog

The Architecture Behind Open-Source LLMs

2026-03-03 00:30:50

npx workos: An AI Agent That Writes Auth Directly Into Your Codebase (Sponsored)

npx workos launches an AI agent, powered by Claude, that reads your project, detects your framework, and writes a complete auth integration directly into your existing codebase. It’s not a template generator. It reads your code, understands your stack, and writes an integration that fits.

The WorkOS agent then typechecks and builds, feeding any errors back to itself to fix.

See how it works →


In December 2024, DeepSeek released V3 with the claim that they had trained a frontier-class model for $5.576 million. They used an attention mechanism called Multi-Head Latent Attention that slashed memory usage. An expert routing strategy avoided the usual performance penalty. Aggressive FP8 training cut costs further.

Within months, Moonshot AI’s Kimi K2 team openly adopted DeepSeek’s architecture as their starting point, scaled it to a trillion parameters, invented a new optimizer to solve a training stability challenge that emerged at that scale, and made the result competitive across major benchmarks.

Then, in February 2026, Zhipu AI’s GLM-5 integrated DeepSeek’s sparse attention mechanism into their own design while contributing a novel reinforcement learning framework.

This is how the open-weight ecosystem actually works: teams build on each other’s innovations in public, and the pace of progress compounds. To understand why, you need to look at the architecture.

In this article, we will cover various open-source models and the engineering bets that define each one.

The Common Skeleton

Every major open-weight LLM released at the frontier in 2025 and 2026 uses a Mixture-of-Experts (MoE) transformer architecture.

See the diagram below that shows the concept of the MoE architecture:

The reason is that dense transformers activate all parameters for every token. To make a dense model smarter, you add more parameters, and the computational cost scales linearly with them. With hundreds of billions of parameters, this becomes prohibitive.

MoE solves this by replacing the monolithic feed-forward layer in each transformer block with multiple smaller “expert” networks and a learned router that decides which experts handle each token. The result is a model that can, for example, store the knowledge of 671 billion parameters but only compute 37 billion per token.

This is why two numbers matter for every model:

  • Total parameters (memory footprint, knowledge capacity)

  • Active parameters (inference speed, per-token cost)

Think of a specialist hospital with 384 doctors on staff, but only 8 in the room for any given patient. You benefit from the knowledge of 384 specialists while only paying for 8 at a time. The triage nurse (the router) decides who gets called.

That’s also why a trillion-parameter model and a 235-billion-parameter model cost roughly the same per query. For example, Kimi K2 activates 32 billion parameters per token, while Qwen3 activates 22 billion. In other words, you’re comparing the active counts, not the totals.
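The routing idea can be sketched in a few lines of Python. This is an illustrative toy with random weights and made-up dimensions, not any model’s actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" is a feed-forward block; here, just one weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1  # learned in practice

def moe_layer(x):
    """Send token x to its top-k experts; the other experts never run."""
    logits = x @ router                     # affinity score per expert
    top = np.argsort(logits)[-top_k:]       # pick the k highest-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                    # softmax over the chosen experts
    # Compute cost scales with top_k, while capacity scales with n_experts.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

out = moe_layer(rng.standard_normal(d_model))
```

The key property is in the last line: only `top_k` of the `n_experts` matrices are ever multiplied, which is why active parameters, not total parameters, drive per-token cost.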


Granola MCP (Sponsored)

Take your meeting context to new places

If you’re already using Claude or ChatGPT for complex work, you know the drill: you feed it research docs, spreadsheets, project briefs... and then manually copy-paste meeting notes to give it the full picture.

What if your AI could just access your meeting context automatically?

Granola’s new Model Context Protocol (MCP) integration connects your meeting notes to your AI app of choice.

Ask Claude to review last week’s client meetings and update your CRM. Have ChatGPT extract tasks from multiple conversations and organize them in Linear. Turn meeting insights into automated workflows without missing a beat.

Perfect for engineers, PMs, and operators who want their AI to actually understand their work.

-> Try the MCP integration for free here or use the code BYTEBYTEGO

Try 1 month for free


The Open Weight Reality

Almost every model marketed as “open source” is technically open weight: the trained parameters are public, but the training data and often the full training code are not. In traditional software, by contrast, “open source” means the code is available, modifiable, and redistributable.

What does this mean in practice?

You can download, fine-tune, and commercially deploy all six of these models. However, you cannot see or audit their training data, and you cannot reproduce their training runs from scratch. For most engineering teams, the first part is what matters. But the distinction is worth knowing.

The license landscape also varies. DeepSeek V3 and GLM-5 use the MIT license. Qwen3 and Mistral Large 3 use Apache 2.0. Both are fully permissive for commercial use. Kimi K2 uses a modified MIT license. Llama 4 uses a custom community license that restricts usage for companies with over 700 million monthly users and prohibits using the model to train competing models.

Transparency also varies. Some teams publish detailed technical reports with architecture diagrams, ablation studies, and hyperparameters. Others provide weights and a blog post with less architectural detail. More transparency enables the “borrow and build” dynamic described above, and is a large part of why the ecosystem compounds so quickly.

The Attention Bet

Every time a model generates a token, it needs to “remember” keys and values for all previous tokens in the conversation. This storage, called the KV-cache, grows linearly with sequence length and becomes a memory bottleneck for long contexts. Different models use three different strategies to deal with it.

  • Grouped-Query Attention (GQA) shares key-value pairs across groups of query heads. It’s the industry default, offering straightforward implementation and moderate memory savings. Qwen3 and Llama 4 both use GQA.

  • Multi-Head Latent Attention (MLA) compresses key-value pairs into a low-dimensional latent space before caching, then decompresses when needed. It was introduced in DeepSeek V2 and used in both DeepSeek V3 and Kimi K2. MLA saves more memory than GQA but adds computational overhead for the compress/decompress step.

  • Sparse Attention skips attending to all previous tokens and instead selects the most relevant ones. DeepSeek introduced DeepSeek Sparse Attention (DSA) in V3.2, and GLM-5 openly adopted DSA in its architecture. Since sparse attention optimizes the attention layers while MoE optimizes the feed-forward layers, the two techniques compound. Therefore, GLM-5 benefits from both.

See the diagram below that shows DeepSeek’s Multi-Head Latent Attention approach:

The tradeoff comes down to what matters most in your deployment. GQA is simpler and proven. MLA is more memory-efficient but more complex to engineer. Sparse attention reduces compute for long contexts but requires careful design to avoid missing important tokens. The strategy a model chooses depends on whether the bottleneck is memory, compute, or context length.
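Back-of-the-envelope arithmetic shows why the choice matters. The configuration below is hypothetical, not any specific model’s:

```python
def kv_cache_gib(seq_len, n_layers, cached_dims_per_token, bytes_per_value=2):
    """KV-cache size = tokens * layers * cached dimensions * precision."""
    return seq_len * n_layers * cached_dims_per_token * bytes_per_value / 2**30

# Hypothetical 128k-context model: 60 layers, 64 attention heads of dim 128.
seq_len, n_layers, n_heads, head_dim = 128_000, 60, 64, 128

mha = kv_cache_gib(seq_len, n_layers, 2 * n_heads * head_dim)  # full K and V
gqa = kv_cache_gib(seq_len, n_layers, 2 * 8 * head_dim)        # 8 shared KV heads
mla = kv_cache_gib(seq_len, n_layers, 512)                     # one 512-dim latent

print(f"MHA: {mha:.0f} GiB, GQA: {gqa:.0f} GiB, MLA: {mla:.0f} GiB")
# → MHA: 234 GiB, GQA: 29 GiB, MLA: 7 GiB
```

Even with invented numbers, the ordering is the point: sharing KV heads cuts the cache by roughly the grouping factor, and caching a compressed latent cuts it further still, at the price of the decompress step.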

The Sparsity Bet

The six models range from 16 to 384 experts, reflecting a fundamental disagreement about how far sparsity should be pushed.

At a fixed compute budget, increasing the number of experts can improve both training and validation loss. However, that gain comes with increased infrastructure complexity: more total experts means more total parameters stored in memory. Kimi K2’s trillion parameters require a multi-GPU cluster regardless of how few experts fire per token. By contrast, Llama 4 Scout’s 109 billion total parameters can fit on a single high-memory server.

Two other design choices stand out:

  • First, the shared expert question. DeepSeek V3, Llama 4, and Kimi K2 include a shared expert that processes every token, providing a baseline capability floor. Qwen3’s technical report notes that, unlike their earlier Qwen2.5-MoE, they dropped the shared expert, but doesn’t disclose why. There is no consensus in the field on whether shared experts are worth the compute cost.

  • Second, Llama 4 takes a unique approach. Rather than making every layer MoE, Llama 4 alternates between dense and MoE layers, and routes each token to only one expert (plus the shared expert) rather than eight. This means fewer active experts per token, but each expert is larger.

See the diagram below that shows Llama’s approach:

The Training Bet

Architecture determines capacity, but training determines what a model actually does with it.

Pre-training, where the model learns by predicting the next token across trillions of tokens, gives the model its base knowledge. The scale varies (14.8 trillion tokens for DeepSeek V3, up to 36 trillion for Qwen3), but the approach is similar. Post-training is where models diverge, and it’s now the primary differentiator.

Reinforcement learning with verifiable rewards checks whether the model’s output is objectively correct.

Did the code compile? Is the math answer right? The model is rewarded for correct outputs and penalized for wrong ones. This was the breakthrough behind DeepSeek R1, and elements of it were distilled into DeepSeek V3.
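What “verifiable” means in practice can be made concrete with two toy reward checks. These stand-in functions are illustrative, not DeepSeek’s actual pipeline:

```python
def math_reward(model_answer: str, ground_truth: float) -> float:
    """Reward +1 if the model's numeric answer matches, -1 otherwise."""
    try:
        return 1.0 if abs(float(model_answer) - ground_truth) < 1e-6 else -1.0
    except ValueError:
        return -1.0  # unparseable output is also penalized

def code_reward(source: str) -> float:
    """Reward +1 if the candidate program at least parses as valid Python."""
    try:
        compile(source, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return -1.0

assert math_reward("42", 42.0) == 1.0
assert math_reward("forty-two", 42.0) == -1.0
assert code_reward("def f(x):\n    return x + 1\n") == 1.0
assert code_reward("def f(x) return x") == -1.0
```

The appeal is that the reward signal is computed, not judged: no human labeler or reward model needs to decide whether the output was good.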

Distillation from a larger teacher trains a massive model and uses its outputs to teach smaller ones. Llama 4 co-distilled from Behemoth, a 2-trillion-parameter teacher model, during pre-training itself. Qwen3 distills from its flagship down to smaller models in the family.

See the diagram below that shows Qwen’s post-training flow:

Synthetic agentic data involves building simulated environments loaded with real tools like APIs, shells, and databases, then rewarding the model for completing tasks in those environments. For example, Kimi K2’s technical report describes a large-scale pipeline that systematically generates tool-use demonstrations across simulated and real-world environments.

Novel RL infrastructure can be a contribution in itself. GLM-5 developed “Slime,” a new asynchronous reinforcement learning framework that improves training throughput for post-training, enabling more iterations within the same compute budget.

Training stability also deserves attention here. At this scale, a single training crash can waste days of GPU time. To counter this, Kimi K2 developed the MuonClip optimizer specifically to prevent exploding attention logits, enabling them to train on 15.5 trillion tokens without a single loss spike. DeepSeek V3 similarly reported zero irrecoverable loss spikes across its entire training run. These engineering contributions may prove more reusable than any particular architectural choice.
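MuonClip’s exact mechanics are described in the K2 technical report. As a generic illustration of the failure mode it targets, the sketch below shows how attention logits can grow far beyond what softmax tolerates, and how a tanh soft cap (a simpler, different technique used by some other models) keeps them bounded:

```python
import numpy as np

def soft_cap(logits, cap=50.0):
    """Smoothly squash logits into (-cap, cap); gradients stay finite."""
    return cap * np.tanh(np.asarray(logits) / cap)

# Query/key vectors whose magnitudes have drifted upward during training:
q = np.full(128, 8.0)
k = np.full(128, 8.0)
raw = q @ k            # 128 * 64 = 8192: exp(8192) overflows inside softmax
capped = soft_cap(raw) # ~50.0: safe to exponentiate
```

The general lesson is the same one both teams report: at trillion-parameter scale, keeping intermediate quantities bounded is an engineering problem in its own right.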

Conclusion

Architectures are converging. Everyone is building MoE transformers. Training approaches are diverging, with teams placing different bets on reinforcement learning, distillation, synthetic data, and new optimizers.

The specific models covered here will be overtaken in months. However, the framework for evaluating them likely won’t change.

The important questions stay the same. What is the active parameter count, not just the total? What attention tradeoff did the team make, and does it match your context-length needs? How many experts fire per token, and can your infrastructure hold the total? How was the model post-trained, and does that approach align with your use case? What does the license actually permit?


EP204: 11 Ways To Use AI To Increase Your Productivity

2026-03-01 00:30:26

Unblocked: Context that saves you time and tokens (Sponsored)

AI coding tools are fast, capable, and completely context-blind. Even with rules, skills, and MCP connections, they generate code that misses your conventions, ignores past decisions, and breaks patterns. You end up paying for that gap in rework and tokens.

Unblocked changes the economics.

It builds organizational context from your code, PR history, conversations, docs, and runtime signals. It maps relationships across systems, reconciles conflicting information, respects permissions, and surfaces what matters for the task at hand. Instead of guessing, agents operate with the same understanding as experienced engineers.

You can:

  • Generate plans, code, and reviews that reflect how your system actually works

  • Reduce costly retrieval loops and tool calls by providing better context up front

  • Spend less time correcting outputs for code that should have been right in the first place

See how it works


This week’s system design refresher:

  • 11 Ways To Use AI To Increase Your Productivity

  • AI Topics to Learn before Taking AI/ML Interviews

  • PostgreSQL versus MySQL

  • Why AI Needs GPUs and TPUs

  • Network Protocols Explained


11 Ways To Use AI To Increase Your Productivity

AI is changing how we work. People who use AI get more done in less time. You do not need to code. You need to know which tool to use and when.

For example, instead of reading long technical blogs, you can upload them to Google’s NotebookLM and ask it to summarize the key points.

Or you can use Otter.ai to turn meeting transcripts into action items, decisions, and highlights.

Here is a list of 19 tools that can speed up your daily workflow across different areas. Save this for the next time you feel stuck getting started.

Over to you: What’s the underrated AI tool that others might not know about?


AI Topics to Learn before Taking AI/ML Interviews

AI interviews often test fundamentals, not tools. This visual splits them into two buckets that show up repeatedly: Traditional AI and Modern AI.

Traditional AI focuses on fundamental ML topics, mostly from before neural networks became dominant.

Modern AI focuses on neural network foundations and newer concepts like transformers, RAG, and post-training.

Interviewers generally expect you to know both. They expect you to explain how they work, when they break, and the trade-offs.

Use this as a checklist, and make sure you can explain each topic clearly before your next AI interview.

Over to you: which topic here do you find hardest to explain under interview pressure? What else is missing?


PostgreSQL versus MySQL

Built using the C language, PostgreSQL uses a process-based architecture.

You can think of it like a factory with a manager (Postmaster) coordinating specialized workers. Each connection gets its own process and shares a common memory pool. Background workers handle tasks like writing data, vacuuming, and logging independently.

MySQL takes a thread-based approach. Imagine a single multi-tasking brain.

It uses a layered design with one server handling multiple connections through threads. The magic happens using pluggable storage engines (such as InnoDB, MyISAM) that you can swap based on your needs.

Over to you: Which database do you prefer?


Why AI Needs GPUs and TPUs

When AI processes data, it’s essentially multiplying massive arrays of numbers. This can mean billions of calculations happening simultaneously.

CPUs handle such calculations largely sequentially. To perform any calculation, the CPU must fetch an instruction, retrieve data from memory, execute the operation, and write the result back. This constant transfer of information between the processor and memory is inefficient. It’s like having one very smart person solve a giant puzzle alone.

GPUs change the game with parallel processing. They split the work across hundreds of cores, reducing processing time to milliseconds.

TPUs take this even further with a systolic array architecture built specifically for matrix math. Each unit in a TPU multiplies its stored weight by incoming data, adds the product to a running sum flowing through the array, and passes both values to its neighbor. This cuts down on I/O costs and reduces processing times even further.
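The accumulation pattern can be sketched as a one-column simulation. This is an illustration of the data flow, not real TPU microcode:

```python
def systolic_column(weights, activations):
    """One column of a weight-stationary systolic array.

    Each cell holds a fixed weight, multiplies the activation passing
    through it, and adds the product to the partial sum flowing onward.
    No intermediate value ever travels back to main memory.
    """
    partial_sum = 0.0
    for w, a in zip(weights, activations):
        partial_sum += w * a  # multiply-accumulate inside the cell
    return partial_sum

# The column computes a dot product purely by local data movement:
result = systolic_column([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
# → 32.0 (1*4 + 2*5 + 3*6)
```

A full array is a grid of such columns working in lockstep, which is why matrix multiplication, the core of neural network inference, maps onto it so well.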

Over to you: What else will you add to explain the need for GPUs and TPUs to run AI workloads?


Network Protocols Explained

Every time you type a URL and hit Enter, half a dozen network protocols quietly come to life. We usually talk about HTTP or HTTPS, but that’s just the tip of the iceberg. Under the hood, the web runs on a carefully layered stack of protocols, each solving a very specific problem.

At the top is HTTP. It defines the request-response model that powers browsers, APIs, and microservices. Simple, stateless, and everywhere.

When security is added, HTTP becomes HTTPS, wrapping every request in TLS so data is encrypted and the server is authenticated before anything meaningful is exchanged.

Before any of that can happen, DNS steps in. Humans think in domain names, machines think in IP addresses. DNS bridges that gap, resolving names into routable IPs so packets know where to go.

Then comes the transport layer. TCP sets up a connection, performs the three-way handshake, retransmits lost packets, and ensures everything arrives in order. It’s reliable, but that reliability comes with overhead.

UDP skips all of that. No handshakes, no guarantees, just fast datagrams. That’s why it’s used for streaming, gaming, and newer protocols like QUIC.

At the bottom sits IP. This is the postal system of the internet. It doesn’t care about reliability or order. Its only job is to move packets from one network to another through routers, hop by hop, until they reach the destination.

Each layer is deliberately limited in scope. DNS doesn’t care about encryption. TCP doesn’t care about HTTP semantics. IP doesn’t care if packets arrive at all. That separation is exactly why the internet scales.

Over to you: When something breaks, which layer do you usually blame first, DNS, TCP, or the application itself?

Strong Consistency In Databases: Promises and Costs

2026-02-27 00:30:39

If a database stores data on three servers in three different cities, and we write a new value, when exactly is that write “done”? Does it happen when the first server saves it? Or when all three have it? Or when two out of three confirm?

The answer to this question is quite important. Consider a simple bank transfer where we move $500 from savings to checking. We see the updated balance on our phone. But our partner, checking the same account from their laptop in another city, still sees the old balance. For a few seconds, the household has two different versions of the truth. For something similar to a like count on social media, that kind of temporary disagreement is harmless. For a bank balance, it’s a different story.

The guarantee that every reader sees the most recent write, no matter where or when they read, is what distributed systems engineers call strong consistency. It sounds straightforward. Making it work across machines, data centers, and continents is one of the hardest problems in distributed systems, because it requires those machines to coordinate, and coordination has a cost governed by physics.

In a previous article, we looked at eventual consistency. In this article, we will look at what strong consistency actually means, how systems deliver it, and what it really costs.

What Strong Consistency Actually Promises

Read more

The Algorithm That Powers Your X (Twitter) Post

2026-02-26 00:30:28

The right data. The right time. (Sponsored)

Context engineering is the new critical layer in every production AI app, and Redis is the real-time context engine powering it. Redis gathers, syncs, and serves the right mix of memory, knowledge, tools, and state for each model call, all from one unified platform. Search across RAG, short- and long-term memory, and structured and unstructured data without stitching together a fragile multi-tool stack. With 30+ agent framework integrations across OpenAI, LangChain, Bedrock, NVIDIA NIM, and more, Redis fits the stack your teams are already building on. Accurate, reliable AI apps that scale. Built on one platform.

Explore Redis for AI


Every time we open X (formerly Twitter) and scroll through the “For You” tab, a recommendation system is deciding which posts to show and in what order. This recommendation system works in real-time.

In the world of social media, this is a big deal because any latency issues can cause user dissatisfaction.

Until now, the internal workings of this recommendation system were more or less a mystery. However, recently, the xAI engineering team open-sourced the algorithm that powers this feed, publishing it on GitHub under an Apache-2.0 license. It reveals a system built on a Grok-based transformer model that has replaced nearly all hand-crafted rules with machine learning.

In this article, we will look at what the algorithm does, how its components fit together, and why the xAI Engineering Team made the design choices they did.

Disclaimer: This post is based on publicly shared details from the xAI Engineering Team. Please comment if you notice any inaccuracies.

The Big Picture

When you request the For You feed in X, the algorithm draws from two separate sources of content:

  • The first source is called in-network content. These are posts from accounts you already follow. If you follow 200 people, the system looks at what those 200 people have posted recently and considers them as candidates for your feed.

  • The second source is called out-of-network content. These are posts from accounts you do not follow. The algorithm discovers them by searching across a global pool of posts using a machine learning technique called similarity search. The idea behind this is that if your past behavior suggests you would find a post interesting, that post becomes a candidate even if you have never heard of the author.

Both sets of candidates are then merged into a single list, scored, filtered, and ranked. The top-ranked posts are what you see when you open the app.

The Four Core Components

The diagram below shows the overall architecture of the system built by the xAI engineering team:

The codebase is organized into four main directories, each representing a distinct part of the system. The entire codebase is written in Rust (62.9%) and Python (37.1%).

Home Mixer

Home Mixer is the orchestration layer. It acts as the coordinator that calls the other components in the right order and assembles the final feed. It does not do the heavy ML work itself; it manages the pipeline.

When a request comes in, Home Mixer kicks off several stages in sequence:

  • Fetching user context

  • Retrieving candidate posts

  • Enriching those posts with metadata

  • Filtering out the ineligible ones

  • Scoring the survivors

  • Selecting the top results and running final checks

The server exposes a gRPC endpoint called ScoredPostsService that returns the ranked list of posts for a given user.

Thunder

Thunder is an in-memory post store and real-time ingestion pipeline. It consumes post creation and deletion events from Kafka and maintains per-user stores for original posts, replies, reposts, and video posts.

When the algorithm needs in-network candidates, it queries Thunder, which can return results in sub-millisecond time because everything lives in memory rather than in an external database. Thunder also automatically removes posts that are older than a configured retention period, keeping the data set fresh.

Phoenix

Phoenix is the ML brain of the system. It has two jobs:

Job 1: Retrieval

Phoenix uses a two-tower model to find out-of-network posts:

  • One tower (the User Tower) takes your features and engagement history and encodes them into a mathematical representation called an embedding.

  • The other tower (the Candidate Tower) encodes every post into its own embedding.

Finding relevant posts then becomes a similarity search. The system computes a dot product between your user embedding and each candidate embedding and retrieves the top-K most similar posts. If you are unfamiliar with dot products, the core idea is that two embeddings that “point in the same direction” in a high-dimensional space produce a high score, meaning the post is likely relevant to you.

See the diagram below that shows the concept of embeddings:
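In numpy, the retrieval step reduces to one matrix-vector product. The embeddings below are random toys; a production system would use an approximate nearest-neighbor index rather than this brute-force scan:

```python
import numpy as np

rng = np.random.default_rng(42)
dim, n_posts, k = 64, 10_000, 5

user_emb = rng.standard_normal(dim)              # User Tower output
post_embs = rng.standard_normal((n_posts, dim))  # Candidate Tower outputs

scores = post_embs @ user_emb                    # one dot product per post
top_k = np.argpartition(scores, -k)[-k:]         # k best candidates, unordered
top_k = top_k[np.argsort(scores[top_k])[::-1]]   # sorted high to low
```

Because the two towers never interact until the final dot product, all post embeddings can be precomputed offline, and only the user embedding needs to be computed per request.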

Job 2: Ranking

Once candidates have been retrieved from both Thunder and Phoenix’s retrieval step, Phoenix runs a Grok-based transformer model to predict how likely you are to engage with each post.

See the diagram below that shows the concept of a transformer model:

The transformer implementation is ported from the Grok-1 open source release by xAI, adapted for recommendation use cases. It takes your engagement history and a batch of candidate posts as input and outputs a probability for each type of engagement action.

Candidate Pipeline

The Candidate Pipeline is a reusable framework that defines the structure of the whole recommendation process.

It provides traits (interfaces, in Rust terminology) for each stage of the pipeline:

  • Source (fetch candidates)

  • Hydrator (enrich candidates with extra data)

  • Filter (remove ineligible candidates)

  • Scorer (compute scores)

  • Selector (sort and pick the top candidates)

  • SideEffect (run asynchronous tasks like caching and logging).

The framework runs independent stages in parallel where possible and includes configurable error handling. This modular design makes it straightforward for the xAI Engineering Team to add new data sources or scoring models without rewriting the pipeline logic.
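The repo defines these stages as Rust traits; the Python sketch below mimics the same contract to show how the framework’s execution logic stays agnostic to individual stages. Class and method names here are hypothetical analogues, not the repo’s actual API:

```python
from abc import ABC, abstractmethod

class Source(ABC):
    @abstractmethod
    def fetch(self, query) -> list: ...       # produce candidates

class Filter(ABC):
    @abstractmethod
    def keep(self, candidate) -> bool: ...    # drop ineligible candidates

class Scorer(ABC):
    @abstractmethod
    def score(self, candidate) -> float: ...  # contribute to ranking

def run_pipeline(query, sources, filters, scorers, top_k=10):
    """Execution logic lives here; business logic lives in the stages."""
    candidates = [c for s in sources for c in s.fetch(query)]
    candidates = [c for c in candidates if all(f.keep(c) for f in filters)]
    ranked = sorted(candidates,
                    key=lambda c: sum(s.score(c) for s in scorers),
                    reverse=True)
    return ranked[:top_k]
```

Adding a new data source or scoring model then means writing one new class, not touching `run_pipeline`.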

The Pipeline Step by Step

Here is the full sequence that runs every time you open the For You feed:

  • Query Hydration: The system fetches your recent engagement history (what you liked, replied to, and reposted) and your metadata, such as your following list.

  • Candidate Sourcing: Thunder provides recent posts from accounts you follow. Phoenix Retrieval provides ML-discovered posts from the global corpus.

  • Candidate Hydration: Each candidate post is enriched with additional information: its text and media content, the author’s username and verification status, video duration if applicable, and subscription status.

  • Pre-Scoring Filters: Before any scoring happens, the system removes posts that are duplicates, too old, or authored by you; posts from accounts you have blocked or muted; posts containing keywords you have muted; posts you have already seen; and ineligible subscription content.

  • Scoring: The remaining candidates pass through multiple scorers in sequence. First, the Phoenix Scorer gets ML predictions from the transformer. Then, the Weighted Scorer combines those predictions into a single relevance score. Next, an Author Diversity Scorer reduces the score of posts from repeated authors so your feed is not dominated by one person. Finally, an OON (out-of-network) Scorer adjusts scores for posts from accounts you do not follow.

  • Selection: Posts are sorted by their final score, and the top K are selected.

  • Post-Selection Filters: A final round of checks removes posts that have been deleted, flagged as spam, or identified as containing violent or graphic content. A conversation deduplication filter also ensures you do not see multiple branches of the same reply thread.

How Scoring Works

The Phoenix transformer predicts probabilities for a wide range of user actions: liking, replying, reposting, quoting, clicking, visiting the author’s profile, watching a video, expanding a photo, sharing, dwelling (spending time reading), following the author, marking “not interested,” blocking the author, muting the author, and reporting the post.

Each of these predicted probabilities is multiplied by a weight and then summed to produce a final score. Positive actions like liking, reposting, and sharing carry positive weights. Negative actions like blocking, muting, and reporting carry negative weights. This means that if the model predicts you are likely to block the author of a post, that post’s score gets pushed down significantly. The formula is simple:

Final Score = sum of (weight for action * predicted probability of that action)

This multi-action prediction approach is more nuanced than a single “relevance” score because it lets the system distinguish between content you would enjoy and content you would find annoying or harmful.
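The formula is easy to make concrete. The weights and probabilities below are invented for illustration; X’s real values live in the open-sourced repo:

```python
# Hypothetical per-action weights: positive actions up, negative actions down.
WEIGHTS = {
    "like": 1.0, "reply": 2.0, "repost": 1.5, "share": 1.5,
    "dwell": 0.5, "block": -8.0, "mute": -5.0, "report": -10.0,
}

def final_score(predicted: dict) -> float:
    """Weighted sum of the model's per-action engagement probabilities."""
    return sum(WEIGHTS[action] * p for action, p in predicted.items())

probs = {"like": 0.30, "reply": 0.05, "repost": 0.10, "share": 0.02,
         "dwell": 0.60, "block": 0.01, "mute": 0.02, "report": 0.001}
score = final_score(probs)
```

Note how a post the model thinks you would enjoy but also likely report can still end up with a low score: the negative weights act as a built-in brake on engagement bait.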

Conclusion

There are five architectural choices worth understanding from xAI’s recommendation system:

  • Instead of humans deciding which signals matter (post length, hashtag count, time of day), the Grok-based transformer learns what matters directly from user engagement sequences. This simplifies the data pipelines and serving infrastructure.

  • When the transformer scores a batch of candidate posts, each post can only “attend to” (or look at) the user’s context. It cannot attend to the other candidates in the same batch. This design choice ensures that a post’s score does not change depending on which other posts happen to be in the same batch. It makes scores consistent and cacheable, which is important at the scale X operates at.

  • Both the retrieval and ranking stages use multiple hash functions for embedding lookup.

  • Rather than collapsing everything into a single relevance number, the model predicts probabilities for many distinct actions. This gives the Weighted Scorer fine-grained control over what the feed optimizes for.

  • The Candidate Pipeline framework separates the pipeline’s execution logic from the business logic of individual stages. This makes it easy to add a new data source, swap in a different scoring model, or insert a new filter without touching the rest of the system.


How Uber Reinvented Access Control for Microservices

2026-02-25 00:30:15

Don’t miss out: your free pass to Monster SCALE Summit is waiting! 50+ engineering talks on AI, databases, Rust, and more. (Sponsored)

Monster SCALE Summit is a new virtual conference all about extreme-scale engineering and data-intensive applications.

Join us on March 11 and 12 to learn from engineers at Discord, Disney, LinkedIn, Uber, Pinterest, Rivian, ClickHouse, Redis, MongoDB, ScyllaDB and more. A few topics on the agenda:

  • What Engineering Leaders Get Wrong About Scale

  • How Discord Automates Database Operations at Scale

  • Lessons from Redesigning Uber’s Risk-as-a-Service Architecture

  • Scaling Relational Databases at Nextdoor

  • How LinkedIn Powers Recommendations to Billions of Users

  • Powering Real-Time Vehicle Intelligence at Rivian with Apache Flink and Kafka

  • The Data Architecture behind Pinterest’s Ads Reporting Services

GET YOUR FREE TICKET

Bonus: We have 500 free swag packs for attendees. And everyone gets 30-day access to the complete O’Reilly library & learning platform.


Uber’s infrastructure runs on thousands of microservices, each making authorization decisions millions of times per day. This includes every API call, database query, and message published to Kafka. To make matters more interesting, Uber needs these decisions to happen in microseconds to have the best possible user experience.

Traditional access control could not handle the complexity. For instance, you might say “service A can call service B” or “employees in the admin group can access this database.” While these rules work for small systems, they fall short when you need more control. For example, what if you need to restrict access based on the user’s location, the time of day, or relationships between different pieces of data?

Uber needed a better approach. They built an attribute-based access control system called Charter to evaluate complex conditions against attributes pulled from various sources at runtime.

In this article, we will look at how the Uber engineering team built Charter and the challenges they faced.

Disclaimer: This post is based on publicly shared details from the Uber Engineering Team. Please comment if you notice any inaccuracies.

Understanding the Authorization Request

Before diving into ABAC, you need to understand how Uber thinks about authorization. Every access request can be broken down into a simple question:

Can an Actor perform an Action on a Resource in a given Context?

Let’s understand each component of this statement:

  • Actor represents the entity making the request. At Uber, this could be an employee, a customer, or another microservice. Uber uses the SPIFFE format to identify actors. An employee might be identified as spiffe://personnel.upki.ca/eid/123456, where 123456 is their employee ID. A microservice running in production would be identified as spiffe://prod.upki.ca/workload/service-foo/production.

  • Action describes what the actor wants to do. Common actions include create, read, update, and delete, often abbreviated as CRUD. Services can also define custom actions like invoke for API calls, subscribe for message queues, or publish for event streams.

  • A resource is the object being accessed. Uber represents resources using UON, which stands for Uber Object Name. This is a URI-style format that looks like uon://service-name/environment/resource-type/identifier. For example, a specific table in a database might be uon://orders.mysql.storage/production/table/orders.

The host portion of the UON is called the policy domain. This acts as a namespace for grouping related policies and configurations.

The Charter System

As mentioned, Uber built a centralized service called Charter to manage all authorization policies. Think of Charter as a policy repository where administrators define who can access what. This approach offers several advantages over having each service implement its own authorization logic.

See the diagram below:

Policies stored in Charter are distributed to individual services. Each service includes a local library called authfx that evaluates these policies.

The architecture works as follows:

  • Policy authors create and update policies in Charter

  • Charter stores these policies in a database

  • A unified configuration distribution system pushes policy updates to all relevant services

  • Services use the authfx library to evaluate policies for incoming requests

  • Authorization decisions are made locally within each service


Turn Search Engines Into APIs for Your App (Sponsored)

SerpApi turns live search engines into APIs, returning clean JSON for results, reviews, prices, locations, and more. Use it to ground your app or LLMs with real-world data from Google, Maps, Amazon, and beyond, without maintaining scrapers.

Try for Free


Basic Policies

The simplest form of policy at Uber connects actors to resources through actions.

A basic policy might look like this in YAML format:

file_type: policy
effect: allow
actions:
  - invoke
resource: "uon://service-foo/production/rpc/foo/method1"
associations:
  - target_type: WORKLOAD
    target_id: "spiffe://prod.upki.ca/workload/service-bar/production"

This policy translates to: “Allow service-bar to invoke method1 of service-foo.” Another example shows how employees can be granted access:

file_type: policy
effect: allow
actions:
  - read
  - write
resource: "uon://querybuilder/production/report/*"
associations:
  - target_type: GROUP
    target_id: "querybuilder-development"

This policy means: “Allow employees in the querybuilder-development group to read and write query reports.”

These basic policies work well for straightforward authorization scenarios. However, real-world requirements are often more complex.
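The matching logic behind such policies can be sketched in a few lines. The following Python is an illustrative simplification, not Uber's actual authfx library; the policy fields mirror the YAML examples above, and wildcard resources are handled with glob-style matching:

```python
from fnmatch import fnmatch

# Simplified sketch of local policy evaluation (not Uber's real authfx code).
# Policies mirror the two YAML examples above.
POLICIES = [
    {
        "effect": "allow",
        "actions": ["invoke"],
        "resource": "uon://service-foo/production/rpc/foo/method1",
        "targets": ["spiffe://prod.upki.ca/workload/service-bar/production"],
    },
    {
        "effect": "allow",
        "actions": ["read", "write"],
        "resource": "uon://querybuilder/production/report/*",
        "targets": ["querybuilder-development"],
    },
]

def is_allowed(actor, action, resource):
    """Return True if any allow policy matches actor, action, and resource."""
    for policy in POLICIES:
        if (policy["effect"] == "allow"
                and action in policy["actions"]
                and fnmatch(resource, policy["resource"])  # wildcard resources
                and actor in policy["targets"]):
            return True
    return False  # default deny

print(is_allowed("spiffe://prod.upki.ca/workload/service-bar/production",
                 "invoke", "uon://service-foo/production/rpc/foo/method1"))  # True
print(is_allowed("querybuilder-development", "delete",
                 "uon://querybuilder/production/report/weekly"))  # False
```

The default-deny fallthrough is the important design choice: access is granted only when a policy explicitly allows it.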

Why ABAC Became Necessary

Uber encountered several limitations with the basic policy model.

For example, consider a payment support service. Customer support representatives need to access payment information to help customers. However, for privacy and compliance reasons, support reps should only access payment data for customers in their assigned region. The basic policy syntax can only specify that a representative can access a payment profile by its UUID. It cannot express the requirement that the rep’s region must also match the customer’s region.

Another example involves employee data. An employee information service needs to allow employees to view and edit their own profiles. It should also allow their managers to access their profiles. The basic policy model cannot express this “self or manager” relationship because it would require checking whether the actor’s employee ID matches either the resource’s employee ID or the resource’s manager ID.

A third scenario involves data analytics. Some reports should only be accessible to users who belong to multiple specific groups simultaneously. The existing model supported checking if a user belonged to any group in a list, but not whether they belonged to all groups in a list.

In a nutshell, Uber needed a way to incorporate additional context and attributes into authorization decisions.

Introducing Attributes and Conditions

ABAC extends the basic policy model by adding conditions. A condition is a Boolean expression that evaluates to true or false based on attributes. If a permission includes a condition, that permission only grants access when the condition evaluates to true.

Attributes are characteristics of actors, resources, actions, or the environment. For example:

  • An actor might have attributes like location, department, or role.

  • A resource might have attributes such as owner, sensitivity level, or creation date.

  • The environment might provide attributes like current time, day of the week, or request IP address.

Attribute Stores are the sources that provide attribute values at authorization time. In formal authorization terminology, these are called Policy Information Points or PIPs. When evaluating a condition, the authorization engine queries the appropriate attribute store to fetch the necessary values.

The enhanced policy model adds an optional condition field to each permission. Here’s an example:

actions: [create, delete, read, update]
resource: "uon://payments.svc/production/payment/*"
associations:
  - target_type: EMPLOYEE
condition:
  expression: "resource.paymentType == 'credit card' && actor.location == resource.paymentLocation"
effect: ALLOW

This policy allows employees to perform CRUD operations on payment records, but only when two conditions are met: the payment type is a credit card, and the employee’s location matches the payment’s location.

The Technical Architecture of ABAC

When ABAC is enabled, the authorization architecture includes additional components.

The authfx library now includes an authorization engine that coordinates policy evaluation. When a request arrives, the engine first checks if the basic requirements are met: does the actor match, does the action match, does the resource match? If those checks pass and a condition exists, the engine moves to condition evaluation.

The authorization engine interacts with an expression engine that evaluates the condition expression. The expression engine identifies which attributes are needed and requests them from the appropriate attribute stores. See the diagram below:

Uber defined four types of attribute store interfaces:

  • ActorAttributeStore fetches attributes about the actor making the request. This might include their employee ID, group memberships, location, or department.

  • ResourceAttributeStore fetches attributes about the resource being accessed. This could include the resource’s owner, creation date, sensitivity classification, or any custom business attributes.

  • ActionAttributeStore fetches attributes related to the action being performed, though this is used less frequently than actor and resource attributes.

  • EnvironmentAttributeStore fetches contextual attributes like the current timestamp, day of week, or request metadata.

Each attribute store must implement a SupportedAttributes() function that declares which attributes it can provide. This enables the authorization engine to pre-compile condition expressions and validate that all required attributes are available. At runtime, when an attribute value is needed, the engine calls the appropriate method on the corresponding store.

See the code snippet below:
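As a rough illustration, an attribute store could look like the following Python sketch. The actual authfx interfaces are in Go and Java and are internal to Uber, so the names and shapes here are assumptions:

```python
from abc import ABC, abstractmethod

class ResourceAttributeStore(ABC):
    """Illustrative sketch of an attribute store (a Policy Information Point).
    Uber's real interfaces are Go/Java; names here are hypothetical."""

    @abstractmethod
    def supported_attributes(self):
        """Declare which attributes this store provides, so the engine can
        pre-compile conditions and validate attribute availability."""

    @abstractmethod
    def get_attribute(self, name, resource_id):
        """Fetch one attribute value for a resource at evaluation time."""

class PaymentAttributeStore(ResourceAttributeStore):
    """Hypothetical store backing the payment policy shown earlier."""

    def __init__(self, payments_db):
        self.db = payments_db  # e.g. keyed by payment ID

    def supported_attributes(self):
        return {"paymentType", "paymentLocation"}

    def get_attribute(self, name, resource_id):
        return self.db[resource_id][name]

store = PaymentAttributeStore(
    {"pay-1": {"paymentType": "credit card", "paymentLocation": "US"}})
print(store.get_attribute("paymentType", "pay-1"))  # credit card
```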

The design allows a single service to use multiple attribute stores, and a single attribute store can be shared across multiple services for reusability.

Choosing an Expression Language

To represent conditions based on attributes, Uber needed an expression language. Rather than inventing a new language from scratch, the engineering team evaluated existing open-source options.

They selected the Common Expression Language (CEL), developed by Google. CEL offered several advantages:

  • First, it has a simple, familiar syntax similar to other programming languages.

  • Second, it supports multiple data types, including strings, numbers, booleans, and lists.

  • Third, it includes built-in functions for string manipulation, arithmetic operations, and boolean logic.

CEL also provides macros that are particularly useful for working with collections. For instance, you can write actor.groups.exists(g, g == 'admin') to check if the actor belongs to a group called “admin.”

The performance characteristics of CEL were excellent. Expression evaluation typically takes only a few microseconds. Both Go and Java implementations of CEL are available, meeting Uber’s backend service requirements. Additionally, both implementations support lazy attribute fetching, meaning they only request the attribute values actually needed to evaluate the expression, improving efficiency.

A sample CEL expression looks like this:

resource.paymentType == 'credit card' && actor.location == resource.paymentLocation

This expression is evaluated against attribute values fetched at runtime to produce a true or false result.
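In production this expression would be compiled and evaluated by Google's cel-go or cel-java libraries. As a rough stand-in, the same condition written directly in Python shows how fetched attribute values feed into a boolean decision:

```python
# Python stand-in for the CEL condition above (real services use cel-go or
# cel-java). Attribute dicts play the role of values fetched from stores.
def evaluate_condition(actor, resource):
    return (resource["paymentType"] == "credit card"
            and actor["location"] == resource["paymentLocation"])

actor = {"location": "US"}
print(evaluate_condition(actor, {"paymentType": "credit card",
                                 "paymentLocation": "US"}))   # True
print(evaluate_condition(actor, {"paymentType": "credit card",
                                 "paymentLocation": "EU"}))   # False
```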

Real-World Application: Kafka Topic Management

To illustrate the practical benefits of ABAC, consider how Uber manages authorization for Apache Kafka topics.

Uber uses thousands of Kafka topics for event streaming across its platform. Each topic needs access controls to specify which services can publish messages and which can subscribe to receive messages. The Kafka infrastructure team is responsible for managing these policies.

With basic policies, the Kafka team would need to create individual policies for every topic. Given the sheer volume of topics, this would be impractical and time-consuming.

Uber has a service called uOwn that tracks ownership and roles for technological assets. Each Kafka topic can have roles assigned directly or inherited through the organizational hierarchy. One such role is “Develop,” which indicates responsibility for developing and maintaining that topic.

Using ABAC, the Uber engineering team created a single generic policy that applies to all Kafka topics:

effect: allow
actions: [admin]
resource: "uon://topics.kafka/production/*"
associations:
  - target_type: EMPLOYEE
condition:
  expression: "actor.adgroup.exists(x, x in resource.uOwnDevelopGroups)"

Source: Uber Engineering Blog

The wildcard in the resource pattern means this policy applies to every Kafka topic. The condition checks whether the actor belongs to any Active Directory group that has the Develop role for the requested topic.

An attribute store plugin retrieves the list of groups with the Develop role for each topic from uOwn. This information becomes the resource.uOwnDevelopGroups attribute. When an employee attempts to perform an admin action on a topic, the authorization engine evaluates whether that employee belongs to one of the authorized groups.
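The CEL exists macro in that policy reduces to a simple set-membership check. A Python stand-in (the group names below are made up for illustration):

```python
# Stand-in for the CEL macro:
#   actor.adgroup.exists(x, x in resource.uOwnDevelopGroups)
# resource_develop_groups would be supplied by the uOwn-backed attribute store.
def has_develop_role(actor_adgroups, resource_develop_groups):
    return any(g in resource_develop_groups for g in actor_adgroups)

print(has_develop_role(
    ["payments-eng", "kafka-orders-dev"],            # actor's AD groups
    ["kafka-orders-dev", "kafka-orders-oncall"]))    # groups with Develop role
# True
```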

This solution saved the Kafka team enormous effort. Instead of managing thousands of individual policies, they maintain one generic policy. As ownership changes in uOwn, authorization automatically adjusts without any policy updates.

Conclusion

The implementation of ABAC delivered multiple benefits across Uber’s infrastructure.

  • Authorization policies became more precise and fine-grained. Decisions could now consider any relevant attribute rather than just basic identity and group membership. This enabled security policies that more accurately reflected business requirements.

  • The system became more dynamic. When attribute values change in source systems like uOwn or employee directories, authorization decisions automatically adapt. No code deployment or policy update is required. This agility is critical in a fast-moving organization.

  • Scalability improved dramatically. A single well-designed ABAC policy can govern authorization for thousands or even millions of resources.

  • Centralization through the Charter made policy management easier. Rather than scattering authorization logic across hundreds of services, security teams can audit and manage policies in one place.

  • Performance remained excellent. Despite the added complexity of condition evaluation and attribute fetching, authorization decisions are still completed in microseconds due to local evaluation and on-demand attribute fetching.

  • Also, most importantly, ABAC separated policy from code. System owners can change authorization policies without building and deploying new code. This separation of concerns allows security policies to evolve independently from application logic.

Since implementing ABAC, 70 Uber services have adopted attribute-based policies to meet their specific authorization requirements. The framework provides a unified approach across diverse use cases, from protecting microservice endpoints to securing database access to managing infrastructure resources.


How Large Language Models Learn

2026-02-24 00:30:39

Overcome the challenges of deploying LLMs securely and at scale (Sponsored)

To scale with LLMs, you need to know how to monitor them effectively. In this eBook, get practical strategies to monitor, debug, and secure LLM-powered applications. From tracing multi-step workflows and detecting prompt injection attacks to evaluating response quality and tracking token usage, you’ll learn best practices for integrating observability into every layer of your LLM stack.

Download the eBook


When we talk about large language models “learning,” we can end up creating a misleading impression. The word “learning” suggests something similar to human learning, complete with understanding, reasoning, and insight.

However, that’s not what happens inside these systems. LLMs don’t learn the way you learned to code or solve problems. Instead, they follow repetitive mathematical procedures billions of times, adjusting countless internal parameters until they become very good at mimicking patterns in text.

This distinction matters more than you might think because it changes the way LLMs generate their answers.

Understanding how LLMs actually work helps you know when to trust their outputs and when to be skeptical. It reveals why they can write convincing essays about topics they don’t fully understand, and why they sometimes fail in surprising ways.

In this article, we’ll explore three core concepts that have a key impact on the working of LLMs: loss functions (how we measure failure), gradient descent (how we make improvements), and next-token prediction (what LLMs actually do).

The Foundation: Loss Functions

Before an LLM can learn anything, we need a way to measure how badly it’s performing. This measurement is called a loss function.

Think of it as a scoring system that provides a single number representing how wrong the model is. The higher the number, the worse the performance. The goal of training is to make this number as small as possible.

However, you can’t just pick any measurement and expect it to work. A good loss function must satisfy three critical requirements:

  • First, it must be specific. It needs to measure something concrete and not vague. If someone told you to “build an intelligent computer,” you’d struggle because intelligence itself is hard to define. Would a system that passes an IQ test count? Probably not, since computers have passed IQ tests for over a decade without being useful for much else. For LLMs, we pick something very specific, such as predicting the next word in a sequence correctly. This is concrete and measurable.

  • Second, the loss function must be computable. The computer needs to calculate it quickly and repeatedly. We can’t measure abstract qualities like “creativity” or “hard work” because these aren’t things you can easily quantify with the data available during training. However, you can measure whether a predicted word matches the actual next word in your training data. That’s a simple comparison that computers handle effortlessly.

  • Third, the loss function must be smooth. This is the trickiest requirement to grasp. Smoothness means the function’s output should change gradually as inputs change, without sudden jumps or breaks. Imagine walking down a gentle slope versus walking down a staircase. The slope is smooth because your altitude changes continuously. Stairs are not smooth because you suddenly drop from one step to the next.

Why does smoothness matter?

The training algorithm needs to figure out which direction to adjust the model’s parameters. If the loss function jumps around wildly, the algorithm can’t determine whether it’s moving in the right direction. Interestingly, accuracy (counting correct predictions) isn’t smooth because you can’t have partial predictions. You either got 47 or 48 predictions right, not 47.3. This is why LLMs actually optimize for something called cross-entropy loss instead, which is smooth and works better mathematically, even though accuracy is what we ultimately care about.

The crucial point to understand here is that LLMs are scored on matching patterns in their training data, not on being truthful or correct. If false information appears frequently in training data, the model gets rewarded for reproducing it. This fundamental design choice explains why LLMs can confidently state things that are completely wrong.


Unblocked: The context layer your AI tools are missing (Sponsored)

Many developer tools promise context-aware AI, but having data access doesn’t automatically mean agents know when to use it.

Real context requires understanding. Unblocked synthesizes knowledge from your codebase, PRs, discussions, docs, project trackers, and runtime signals. It connects past decisions to current work, resolves conflicts between outdated docs and actual practice, respects data permissions, and surfaces what matters for the task at hand.

With Unblocked:

  • Coding agents like Cursor, Claude, and Copilot generate output that aligns with your actual architecture and conventions

  • Code review focuses on real bugs rather than stylistic nits

  • You find instant answers without interrupting teammates

See how Unblocked works


The Process: Gradient Descent

Once the loss function is decided, we need a process to actually improve the model. This is where gradient descent comes in.

Gradient descent is the algorithm that figures out how to adjust the billions of parameters inside a neural network to reduce the loss.

See the diagram below:

Imagine you have a ball sitting somewhere on a hilly landscape. The ball’s position represents the model’s current parameter values. The height of the ground beneath the ball represents the loss function’s output. Valleys represent low loss (good performance), and peaks represent high loss (bad performance). The goal is to get the ball to the lowest valley possible.

The process follows these steps:

  • Start with the ball at a random position on the landscape

  • Look at the slope directly around the ball to determine which direction is downhill

  • Roll the ball a tiny distance in that downhill direction

  • Repeat this process billions of times until the ball settles in a valley

Each adjustment is incredibly small. We’re not throwing the ball or making dramatic changes, but nudging it slightly based on the local slope. The “gradient” in gradient descent refers to this slope measurement, which tells you both the direction and steepness of the decline.

This approach uses a greedy algorithm, meaning it only considers the immediate next step without looking ahead. Picture walking downhill in thick fog where you can only see your feet. You can tell which direction slopes downward right where you’re standing, but you can’t see if there’s a deeper valley just beyond a small uphill section. The ball might settle in a minor dip when a much better solution exists nearby.

Why use such a limited approach?

This is because the alternative is computationally impossible. An LLM might have hundreds of billions of parameters. Evaluating all possible future states to find the absolute best solution would take longer than the lifespan of the universe. Gradient descent is practical because each step is simple and cheap to compute, even though we need billions of them.

Modern LLMs use a variation called Stochastic Gradient Descent, or SGD. The word “stochastic” means random. Instead of calculating loss across all your training data at once (which would require impossible amounts of memory), SGD uses small random batches of data. This makes training feasible with massive datasets. If we have a billion training examples, we can take a billion small steps using different random samples, which actually works better than trying to process everything at once.
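The whole loop fits in a few lines on a toy problem. This sketch fits a single parameter w to data from y = 3x by sampling one random example per step, a batch of size one standing in for SGD's mini-batches (an LLM runs the same loop over billions of parameters):

```python
import random

# Toy stochastic gradient descent: fit w so that w*x approximates y = 3x.
# Each step measures the local slope on one random example and takes a
# tiny step downhill, exactly the "nudge the ball" loop described above.
random.seed(0)
data = [(x, 3.0 * x) for x in range(1, 11)]

w = 0.0      # the single "parameter" (an LLM has billions)
lr = 0.005   # step size: each adjustment is deliberately small

for step in range(500):
    x, y = random.choice(data)     # random mini-batch of size 1
    grad = 2 * (w * x - y) * x     # slope of the loss (w*x - y)^2 w.r.t. w
    w -= lr * grad                 # roll the ball a little downhill

print(round(w, 2))  # converges near 3.0
```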

The LLM Secret: Next-Token Prediction

Now we get to what LLMs actually train on. Despite their ability to write essays, explain concepts, and hold conversations, LLMs are trained on one simple task: predict the next word in a sequence.

Take the sentence “The cat sat on the mat.” During training, the model doesn’t see the whole sentence at once. Instead, it trains on overlapping segments:

Input: “The” → Predict: “cat” → If correct, gain a point

Input: “The cat” → Predict: “sat” → If correct, gain a point

Input: “The cat sat” → Predict: “on” → If correct, gain a point

Input: “The cat sat on” → Predict: “the” → If correct, gain a point

Input: “The cat sat on the” → Predict: “mat” → If correct, gain a point

This process repeats billions of times across trillions of words from the internet, books, articles, and other text sources. Every time the model predicts correctly, gradient descent adjusts its parameters to make similar predictions more likely in the future. Every time it predicts incorrectly, the parameters adjust to make that mistake less likely.
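The training pairs above can be generated mechanically from any text: every prefix of the token sequence is paired with the token that follows it.

```python
# Generate the next-token training pairs for "The cat sat on the mat",
# exactly as listed above: each prefix predicts the following token.
tokens = "The cat sat on the mat".split()

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(" ".join(context), "->", target)
# The -> cat
# The cat -> sat
# The cat sat -> on
# The cat sat on -> the
# The cat sat on the -> mat
```

A six-token sentence yields five training examples; trillions of tokens of text yield trillions of them, with no human labeling required.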

But why does this simple task produce such convincing outputs? The answer lies in how context narrows down possibilities.

Consider predicting the next word in this sequence: “I love to eat.” Without more context, it could be almost any food. But add more information: “I love to eat something for breakfast.” Now you’re narrowed down to breakfast foods like eggs, cereal, pancakes, or toast. Add even more: “I love to eat something for breakfast with chopsticks.” Now you’re thinking about foods eaten with chopsticks at breakfast, perhaps rice or noodles. Include geography: “I love to eat something for breakfast with chopsticks in Tokyo.” The possibilities narrow further to Japanese breakfast items.

LLMs excel at this pattern recognition because they process billions of these associations during training. They learn which words tend to follow others in different contexts. The more context we provide, the better their predictions become. This is why longer prompts often produce better results.

The transformer architecture that powers modern LLMs has a critical advantage over older approaches. It can process all these training examples in parallel rather than one at a time. This parallelization is why we can now train models on datasets that would take you multiple lifetimes to read. It’s the breakthrough that made current LLMs possible.

Why This Is Amazing But Also Has Problems

Next-token prediction through pattern matching produces impressive results. LLMs can write in different styles, translate languages, explain complex topics, and generate code. They spot subtle patterns across billions of examples that humans would never notice. For most common tasks, this approach works quite well.

However, pattern matching is not reasoning, and this creates predictable failure modes.

Consider what happens when you ask an LLM a question with a false premise. The model doesn’t stop to verify whether the premise is true. Instead, it might pattern-match to find the appropriate answer based on its training data. The answer can sound authoritative and detailed, but it might explain something that isn’t true. In other words, the model is trained to continue patterns in text, but not to fact-check or apply logical reasoning.

This problem extends to situations where training data is scarce. Suppose you ask an LLM to write code in Python. It will likely produce excellent results because massive amounts of Python code exist in its training data. However, ask it to write the same code in an obscure programming language, and it starts making confident mistakes. It might use operators that don’t exist in that language or call functions with the wrong number of arguments. The model extrapolates common programming patterns from popular languages, assuming they apply everywhere. With insufficient training examples to learn otherwise, these extrapolations lead to errors.

Perhaps most tellingly, LLMs fail at variations of problems they’ve seen before. There’s a famous logic puzzle about transporting a cabbage, a goat, and a wolf across a river with specific constraints about which items can’t be left alone together. LLMs solve this puzzle easily because it appears many times in their training data. However, if you slightly modify the constraints, the model often continues using the original solution. It doesn’t reason through the new logical requirements. Instead, it pattern-matches to the familiar puzzle and reproduces the memorized answer.

This happens because of how transformers work internally. When the model sees text that looks very similar to something in its training data, it does a fuzzy match and retrieves the known answer. This is efficient for common problems but fails when those small differences actually matter.

The core issue is that LLMs are optimized to reproduce patterns from their training data, not to be truthful, logical, or correct. When training data contains errors (and internet data contains many), models learn to reproduce those errors. When training data contains biases, models learn those too. When a task requires actual reasoning rather than pattern matching, the illusion can break down.

Conclusion

Understanding the mechanics of LLM training helps you use these tools more effectively.

LLMs are sophisticated pattern-matching systems that predict tokens through billions of small parameter adjustments. They’re not reasoning engines, and they don’t truly understand the text they generate.

This knowledge suggests several practical guidelines:

  • Use LLMs for tasks that are well-represented in their training data. They excel at common programming problems, generating content in standard formats, and answering frequently asked questions. They’re powerful productivity tools that can save enormous amounts of time on routine work.

  • However, be skeptical when dealing with novel problems, unusual edge cases, or domains where accuracy is critical.

  • Always verify outputs for important use cases. Don’t assume that confident-sounding responses are correct. The training process optimizes for sounding like training data, not for being right.

Most importantly, remember that LLMs are tools with specific capabilities and specific limitations. They’re remarkable at what they do, which is identifying and reproducing patterns in text. However, pattern matching, no matter how sophisticated, is not the same as reasoning, understanding, or intelligence. Knowing this difference helps you leverage their strengths while avoiding their weaknesses.