
How to Scale An API

2026-01-30 00:30:59

In modern software development, APIs serve as the critical communication layer between clients and backend services.

Whether we are building a web application, mobile app, or any internet-based system, the API layer acts as the primary interface through which clients access functionality and data. As our applications grow and attract more users, the ability to scale this API layer becomes increasingly important for maintaining performance and delivering a positive user experience.

API scalability refers to the system’s ability to handle increasing amounts of traffic and requests without degrading performance. As applications gain popularity, they inevitably face surges in user demand. Without proper scaling mechanisms, these traffic spikes can lead to slow response times, timeouts, or even complete system failures.

In this article, we will learn how to scale APIs effectively using different strategies.

Why API Scalability Matters

Read more

How to Write High-Performance Code

2026-01-29 00:30:44

😘 Kiss bugs goodbye with fully automated end-to-end test coverage (Sponsored)

Bugs sneak out when less than 80% of user flows are tested before shipping. However, getting that kind of coverage (and staying there) is hard and pricey for any team.

QA Wolf’s AI-native service provides high-volume, high-speed test coverage for web and mobile apps, reducing your organization’s QA cycle to less than 15 minutes.

With QA Wolf, engineering teams move faster, releases stay on track, and testing happens automatically, so developers can focus on building, not debugging.

Drata’s team of 80+ engineers achieved 4x more test cases and 86% faster QA cycles.

⭐ Rated 4.8/5 on G2.

Get Started for Free


We’ve all been there. Our code works perfectly, passes all tests, and does exactly what it’s supposed to do. Then we deploy it to production and realize it takes 10 seconds to load a page when users expect instant results. Or worse, it works fine with test data but crawls to a halt with real-world volumes.

The common reaction is to think about optimizing later, or leaving performance tuning for experts. Both assumptions are wrong. The truth is that writing reasonably fast code doesn’t require advanced computer science knowledge or years of experience. It requires developing an intuition about where performance matters and learning some fundamental principles.

Many developers have heard the famous quote about premature optimization being “the root of all evil.” However, this quote from Donald Knuth is almost always taken out of context. The full statement reads: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%”.

This article is about that critical 3%, where we’ll explore how to estimate performance impact, when to measure, what to look for, and practical techniques that work across different programming languages.

Learning to Estimate

One of the most valuable skills in performance-aware development is the ability to estimate rough performance costs before writing code. We don’t need precise measurements at this stage; we just need to understand orders of magnitude.

Think of computer operations as existing in different speed tiers. At the fastest tier, we have CPU cache access, which happens in nanoseconds. These are operations where the data is already sitting right next to the processor, ready to be used. One tier slower is accessing main memory (RAM), which takes roughly 100 times longer than cache access. Moving down the hierarchy, reading from an SSD might take 40,000 times longer than a cache access. Network operations take even longer, and traditional spinning disk seeks can be millions of times slower than working with cached data.

This matters because, when designing a system that needs to process a million records, the architecture should look completely different depending on whether that data comes from memory, disk, or a network call. A simple back-of-the-envelope calculation can tell us whether a proposed solution will take seconds, minutes, or hours.

Here’s a practical example. Suppose we need to process one million user records. If each record requires a network call to a database, and each call takes 50 milliseconds, we’re looking at 50 million milliseconds, or about 14 hours. However, if we can batch those requests and fetch 1000 records per call, suddenly we only need 1000 calls, which takes about 50 seconds.
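
As a quick sanity check, this kind of estimate fits in a few lines of Python. The 50-millisecond latency and the batch size of 1000 are just the assumptions from the example above:

# Back-of-the-envelope estimate for processing one million records.
# The 50 ms per-call latency and 1000-record batch size are illustrative assumptions.
records = 1_000_000
latency_per_call_s = 0.050  # 50 ms per network round trip

naive_calls = records                      # one network call per record
naive_time_s = naive_calls * latency_per_call_s
print(f"One call per record: {naive_time_s / 3600:.1f} hours")   # ~13.9 hours

batch_size = 1000
batched_calls = records // batch_size      # 1000 calls in total
batched_time_s = batched_calls * latency_per_call_s
print(f"Batched (1000/call): {batched_time_s:.0f} seconds")      # ~50 seconds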

Measure Before We Optimize

Our intuition about performance bottlenecks is usually wrong. We might spend days optimizing a function we think is slow, only to discover through profiling that some completely different part of the code is the actual problem.

This is why the main rule of performance optimization is to measure first and optimize second. Modern programming languages and platforms provide excellent profiling tools that show us exactly where our program spends its time. These tools track CPU usage, memory allocations, I/O operations, and lock contention in multi-threaded programs.

The basic profiling approach is straightforward.

  • First, we run our program under a profiler using realistic workloads, not toy examples. The profile shows us which functions consume the most time.

  • Sometimes, however, we encounter what’s called a flat profile. This is when no single function dominates the runtime. Instead, many functions each contribute a small percentage. This situation is actually common in mature codebases where the obvious bottlenecks have already been fixed. When facing a flat profile, our strategy shifts. We look for patterns across multiple functions, consider structural changes higher up in the call chain, or focus on accumulating many small improvements rather than one big win.

The key point is that we should let data guide our optimization decisions.
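
A minimal illustration with Python’s built-in cProfile module (most languages have an equivalent profiler); the process_records function here is just a stand-in for real application code:

import cProfile
import pstats

def parse(line):
    return line.split(",")

def process_records(lines):
    # Stand-in workload: parse each line and count its fields.
    return sum(len(parse(line)) for line in lines)

lines = ["a,b,c"] * 100_000

profiler = cProfile.Profile()
profiler.enable()
process_records(lines)
profiler.disable()

# Show the five entries with the most cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)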

Algorithmic Wins

The most important performance improvements almost always come from choosing better algorithms and data structures. A better algorithm can provide a 10x or 100x speedup, dwarfing any micro-optimization we make.

Consider a common scenario: we have two lists and need to find which items from the first list exist in the second. The naive approach uses nested loops. For each item in list A, we scan through all of list B looking for a match. If each list has 1000 items, that’s potentially one million comparisons. This is an O(N²) algorithm, meaning the work grows with the square of the input size.

A better approach converts list B into a hash table, then looks up each item from list A. Hash table lookups are typically O(1), constant time, so now we’re doing 1000 lookups instead of a million comparisons. The total work is O(N) instead of O(N²). For our 1000-item lists, this could mean finishing in milliseconds instead of seconds.
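
A small sketch of the difference in Python:

# O(N^2): for each item in list_a, scan all of list_b ("in" on a list is a linear scan).
def common_items_naive(list_a, list_b):
    return [x for x in list_a if x in list_b]

# O(N): build a hash set from list_b once, then do constant-time lookups.
def common_items_fast(list_a, list_b):
    b_set = set(list_b)
    return [x for x in list_a if x in b_set]

list_a = list(range(1000))
list_b = list(range(500, 1500))
assert common_items_naive(list_a, list_b) == common_items_fast(list_a, list_b)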

Another common algorithmic improvement involves caching or precomputation. If we’re calculating the same value repeatedly, especially inside a loop, we should calculate it once and store the result.

The key to spotting algorithmic problems in a profile is looking for functions that consume most of the runtime and contain nested loops, or that repeatedly search or sort the same data. Before we dive into optimizing such code, we should step back and ask whether there is a fundamentally different approach that does less work.

Memory Matters

Modern CPUs are incredibly fast, but they can only work on data that’s in their small, ultra-fast caches. When the data they need isn’t in cache, they have to fetch it from main memory, which is much slower. This difference is so significant that the layout of our data in memory often matters more than the algorithmic complexity of our code.

See the diagram below:

The fundamental principle here is locality: data that is accessed together should be stored together in memory. When the CPU fetches data from memory, it doesn’t fetch just one byte. It fetches an entire cache line, typically 64 bytes. If our related data is scattered across memory, we waste cache lines and constantly fetch new data. If it’s packed together, we get multiple pieces of related data in each cache line.

Consider two ways to store a list of user records:

  • We could have an array of pointers, where each pointer points to a user object allocated somewhere else in memory.

  • Or we could have a single contiguous array where all user objects are stored sequentially.

The first approach means that accessing each user requires chasing a pointer, and each user object might be on a different cache line. The second approach means that once we fetch the first user, the next several users are likely already in cache.

This is why arrays and vectors typically outperform linked lists for most operations, even though linked lists have theoretical advantages for insertions and deletions. The cache efficiency of sequential access usually dominates.

Reducing the memory footprint also helps. Smaller data structures mean more fit in cache. If we’re using a 64-bit integer to store values that never exceed 255, we’re wasting memory. Using an 8-bit type instead means we can fit eight times as many values in the same cache line. Similarly, removing unnecessary fields from frequently accessed objects can have a measurable impact.

The access pattern matters too. Sequential access through an array is much faster than random access. If we’re summing a million numbers stored in an array, it’s fast because we access them sequentially and the CPU can predict what we’ll need next. If those same numbers are in a linked list, each access requires chasing a pointer to an unpredictable location, destroying cache efficiency.

The practical takeaway is to prefer contiguous storage (arrays, vectors) over scattered storage (linked lists, maps) when performance matters. Keep related data together, and access it sequentially when possible.
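
Python manages memory for us, so the effect is less direct than in C or C++, but a rough sketch of the same idea compares a contiguous numerical array with a list of separately allocated objects. NumPy is an assumption here, and part of the gap comes from interpreter overhead rather than cache behavior alone:

import time
import numpy as np

n = 10_000_000

contiguous = np.arange(n, dtype=np.int64)   # one contiguous block of 64-bit integers
scattered = list(range(n))                  # separately allocated Python int objects

start = time.perf_counter()
total_c = contiguous.sum()
print(f"contiguous sum: {time.perf_counter() - start:.3f} s")

start = time.perf_counter()
total_s = sum(scattered)
print(f"scattered sum:  {time.perf_counter() - start:.3f} s")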

Reduce Allocations

Every time we allocate memory, there’s a cost. The memory allocator has to find available space, bookkeeping data structures need updating, and the new object typically needs initialization. Later, when we’re done with the object, it needs cleanup or destruction. In garbage-collected languages, the collector has to track these objects and eventually reclaim them, which can cause noticeable pauses.

Beyond the allocator overhead, each allocation typically ends up on a different cache line. If we’re creating many small objects independently, they’ll be scattered across memory, hurting cache efficiency as discussed in the previous section.

Common sources of excessive allocation include creating temporary objects inside loops, repeatedly resizing containers as we add elements, and copying data when we could move it or just reference it.

Container pre-sizing is an effective technique. If we know we’ll eventually need a vector with 1000 elements, we should reserve that space upfront. Otherwise, the vector might allocate space for 10 elements, then 20, then 40, and so on as we add elements, copying all existing data each time.

Reusing objects is another straightforward win. If we’re creating and destroying the same type of temporary object thousands of times in a loop, we should create it once before the loop and reuse it on each iteration, clearing it as needed. This is especially important for complex objects like strings, collections, or protocol buffers.
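
Python lists do not expose an explicit reserve call the way C++ vectors do, but both ideas can still be sketched: allocate up front when the final size is known, and reuse one scratch buffer across iterations instead of creating a new object each time:

# Growing incrementally: the list reallocates and copies as it grows.
def build_incremental(n):
    result = []
    for i in range(n):
        result.append(i * 2)
    return result

# Pre-sizing: allocate the full list once, then fill it in place.
def build_presized(n):
    result = [0] * n
    for i in range(n):
        result[i] = i * 2
    return result

# Reuse: one scratch buffer cleared each iteration instead of a fresh list per batch.
def process_batches(batches):
    scratch = []
    totals = []
    for batch in batches:
        scratch.clear()
        scratch.extend(x * 2 for x in batch)
        totals.append(sum(scratch))
    return totals

assert build_incremental(10) == build_presized(10)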

Modern languages support “moving” data instead of copying it. Moving transfers ownership of the data without duplicating it. When we’re passing large objects between functions and don’t need to keep the original, moving is much cheaper than copying.

Different container types have different allocation characteristics. Maps and sets typically allocate memory for each element individually. Vectors and arrays are allocated in bulk. When performance matters, and we’re choosing between containers with similar capabilities, the allocation pattern can be the deciding factor.

Avoid Unnecessary Work

The fastest code is code that never runs. Before we optimize how we do something, we should ask whether we need to do it at all or whether we can do it less frequently.

Some common strategies are as follows:

  • Creating fast paths for common cases is a powerful technique. In many systems, 80% of cases follow a simple pattern while 20% require complex handling. If we can identify the common case and create an optimized path for it, most operations become faster even though the complex case remains unchanged. For example, if most strings in our system are short, we can optimize specifically for short strings while keeping the general implementation for longer ones.

  • Precomputation and caching exploit the idea that calculating something once is cheaper than calculating it repeatedly. If we have a value that doesn’t change during a loop, we should calculate it before the loop, not on every iteration. If we have an expensive calculation whose result might be needed multiple times, we should cache it after computing it the first time.

  • Lazy evaluation defers work until we know it’s actually needed. If we’re building a complex report that the user might never request, we shouldn’t generate it eagerly during initialization. Similarly, if we’re processing a sequence of items and might exit early based on the first few items, we shouldn’t process all items upfront.

  • Bailing out early means checking cheap conditions before expensive ones and returning as soon as we have an answer. If we’re validating user input and the first validation check fails, we shouldn’t proceed with expensive downstream processing. This simple principle can eliminate huge amounts of unnecessary work (see the sketch after this list).

  • Finally, there’s the choice between general and specialized code. General-purpose libraries handle all possible cases, which often makes them slower than code written for one specific case. If we’re in a performance-critical section and using a heavyweight general solution, we should consider whether a lightweight specialized implementation would suffice.
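
A minimal sketch of the fast-path and early-bailout ideas from the list above; the short-string threshold and the validation rules are invented for illustration:

def normalize(s):
    # Fast path: most inputs are short, ASCII, already lowercase, and need no work.
    if len(s) < 16 and s.isascii() and s.islower():
        return s
    # Slow path: the general implementation handles everything else.
    return s.strip().lower()

def expensive_checks(order):
    return True  # placeholder for costly downstream validation (inventory, fraud, ...)

def validate_order(order):
    # Bail out on cheap checks before doing anything expensive.
    if not order.get("items"):
        return False
    if order.get("total", 0) <= 0:
        return False
    # Expensive checks only run for orders that already look plausible.
    return expensive_checks(order)

assert normalize("hello") == "hello"
assert not validate_order({"items": [], "total": 10})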

Practical Tips for Everyday Coding

Some performance considerations should become second nature as we write code, even before we profile.

  • When designing APIs, we should think about efficient implementation. Processing items in batches is usually more efficient than processing them one at a time. Allowing callers to pass in pre-allocated buffers can eliminate allocations in hot paths. We should avoid forcing features on callers who don’t need them. For example, if most users of a data structure don’t need thread-safety, we shouldn’t build synchronization into the structure itself. Let those who need it add synchronization externally.

  • We should watch out for hidden costs. Logging statements can be expensive even when logging is disabled, because checking whether logging is enabled still requires work. If we have a logging function called millions of times, we should consider removing it or checking the logging level once outside the hot path. Similarly, error checking inside tight loops can add up. When possible, we should validate inputs at module boundaries rather than checking the same conditions repeatedly.

  • Every language has performance characteristics for its standard containers and libraries. For example, in Python, list comprehensions and built-in functions are typically faster than manual loops. In JavaScript, typed arrays are much faster than regular arrays for numeric data. In Java, ArrayList almost always beats LinkedList. In C++, vector is usually the right choice over list. Learning these characteristics for our chosen language pays dividends.

  • String operations are quite common and often performance-sensitive. In many languages, concatenating strings in a loop is expensive because each concatenation creates a new string object. String builders or buffers solve this problem by accumulating the result efficiently and creating the final string once.
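
For example, in Python the usual fix is to collect the pieces in a list and join them once; other languages offer StringBuilder-style buffers for the same purpose:

# Potentially quadratic: each += may copy the whole accumulated string.
def build_report_slow(rows):
    report = ""
    for row in rows:
        report += f"{row['name']},{row['value']}\n"
    return report

# Linear: accumulate the parts, then create the final string once.
def build_report_fast(rows):
    parts = [f"{row['name']},{row['value']}\n" for row in rows]
    return "".join(parts)

rows = [{"name": f"item{i}", "value": i} for i in range(1000)]
assert build_report_slow(rows) == build_report_fast(rows)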

Knowing When to Stop

Not all code needs optimization. We shouldn’t optimize everything, just the parts that matter. The famous Pareto principle applies here: roughly 20% of our code typically accounts for 80% of the runtime. Our job is to identify and optimize that critical 20%.

Performance isn’t the only concern. Code readability also matters because other developers need to maintain our code. Overly optimized code can be difficult to understand and modify. Developer time has value, too. Spending a week optimizing code that saves one second per day might not be worthwhile.

The right approach is to write clear, correct code first. Then we should measure to find real bottlenecks, followed by optimizing the critical paths until we hit our performance targets. At that point, we stop. Additional optimization has diminishing returns and increasing costs in terms of code complexity.

That said, some situations clearly demand performance attention. Code that runs millions of times, like functions in the hot path of request handling, deserves optimization. User-facing operations should feel instant because perceived performance affects user satisfaction. Background processing at scale can consume enormous resources if inefficient, directly impacting operational costs. Resource-constrained environments like mobile devices or embedded systems need careful optimization to function well.

Conclusion

Writing fast code requires developing intuition about where performance matters and learning fundamental principles that apply across languages and platforms.

The journey starts with estimation. Understanding the rough costs of common operations helps us make smart architectural decisions before writing code. Measurement shows us where the real problems are, often surprising us by revealing bottlenecks we never suspected. Better algorithms and data structures provide order-of-magnitude improvements that dwarf micro-optimizations. Memory layout and allocation patterns matter more than many developers realize. And often, the easiest performance win is simply avoiding unnecessary work.

The mindset shift is from treating performance as something we fix later to something we consider throughout development. We don’t need to obsess over every line of code, but we should develop awareness of the performance implications of our choices.

How Google Manages Trillions of Authorizations with Zanzibar

2026-01-28 00:30:31

WorkOS Pipes: Ship Third-Party Integrations Without Rebuilding OAuth (Sponsored)

Connecting user accounts to third-party APIs always comes with the same plumbing: OAuth flows, token storage, refresh logic, and provider-specific quirks.

WorkOS Pipes removes that overhead. Users connect services like GitHub, Slack, Google, Salesforce, and other supported providers through a drop-in widget. Your backend requests a valid access token from the Pipes API when needed, while Pipes handles credential storage and token refresh.

Simplify integrations today →


Sometime before 2019, Google built a system that manages permissions for billions of users while maintaining both correctness and speed.

When you share a Google Doc with a colleague or make a YouTube video private, a complex system works behind the scenes to ensure that only the right people can access the content. That system is Zanzibar, Google’s global authorization infrastructure that handles over 10 million permission checks every second across services like Drive, YouTube, Photos, Calendar, and Maps.

In this article, we will look at the high-level architecture of Zanzibar and understand the valuable lessons it provides for building large-scale systems, particularly around the challenges of distributed authorization.

See the diagram below that shows the high-level architecture of Zanzibar.

Disclaimer: This post is based on publicly shared details from the Google Engineering Team. Please comment if you notice any inaccuracies.

The Core Problem: Authorization at Scale

Authorization answers a simple question: Can this particular user access this particular resource? For a small application with a few users, checking permissions is straightforward. We might store a list of allowed users for each document and check if the requesting user is on that list.

The challenge multiplies at Google’s scale. For reference, Zanzibar stores over two trillion permission records and serves them from dozens of data centers worldwide. A typical user action might trigger tens or hundreds of permission checks. When you search in Google Drive, the system must verify your access to every result before displaying it. Any delay in these checks directly impacts user experience.

Beyond scale, authorization systems also face a critical correctness problem that Google calls the “new enemy” problem. Consider the scenario where we remove someone from a document’s access list, then add new content to that document. If the system uses stale permission data, the person who was just removed might still see the new content. This happens when the system doesn’t properly track the order in which you made changes.

Zanzibar solves these challenges through three key architectural decisions:

  • A flexible data model based on tuples.

  • A consistency protocol that respects causality.

  • A serving layer optimized for common access patterns.


AI code review with the judgment of your best engineer. (Sponsored)

Unblocked is the only AI code review tool that has deep understanding of your codebase, past decisions, and internal knowledge, giving you high-value feedback shaped by how your system actually works instead of flooding your PRs with stylistic nitpicks.

Try Unblocked for free


The Data Model

Zanzibar represents all permissions as relation tuples, which are simple statements about relationships between objects and users. A tuple follows this format: object, relation, user. For example, “document 123, viewer, Alice” means Alice can view document 123.

See the diagram below:

This tuple-based approach differs from traditional access control lists that attach permissions directly to objects. Tuples can refer to other tuples. Instead of listing every member of a group individually on a document, we can create one tuple that says “members of the Engineering group can view this document.” When the Engineering group membership changes, the document permissions automatically reflect those changes.

The system organizes tuples into namespaces, which are containers for objects of the same type. Google Drive might have separate namespaces for documents and folders, while YouTube has namespaces for videos and channels. Each namespace defines what relations are possible and how they interact.

Zanzibar’s clients can use a configuration language to specify rules about how relations are composed. For instance, a configuration might state that all editors are also viewers, but not all viewers are editors.

See the code snippet below that shows the configuration language approach for defining the relations.

name: "doc"
relation { name: "owner" }
relation {
  name: "editor"
  userset_rewrite {
    union {
      child { _this {} }
      child { computed_userset { relation: "owner" } }
    }
  }
}
relation {
  name: "viewer"
  userset_rewrite {
    union {
      child { _this {} }
      child { computed_userset { relation: "editor" } }
    }
  }
}

Source: Zanzibar Research Paper

These rules, called userset rewrites, let the system derive complex permissions from simple stored tuples. For example, consider a document in a folder. The folder has viewers, and you want those viewers to automatically see all documents in the folder. Rather than duplicating the viewer list on every document, you write a rule saying that to check who can view a document, look up its parent folder, and include that folder’s viewers. This approach enables permission inheritance without data duplication.

The configuration language supports set operations like union, intersection, and exclusion. A YouTube video might specify that its viewers include direct viewers, plus viewers of its parent channel, plus anyone who can edit the video. This flexibility allows diverse Google services to specify their specific authorization policies using the same underlying system.
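
To make the rewrite rules concrete, here is a deliberately tiny sketch, not Zanzibar’s actual implementation, of how a check could expand “viewer includes editor includes owner” over stored tuples:

# Stored relation tuples: (object, relation, user).
TUPLES = {
    ("doc:123", "owner", "alice"),
    ("doc:123", "viewer", "bob"),
}

# Simplified userset rewrites: a relation also includes members of these relations.
REWRITES = {
    "editor": ["owner"],    # every owner is also an editor
    "viewer": ["editor"],   # every editor is also a viewer
}

def check(obj, relation, user):
    if (obj, relation, user) in TUPLES:
        return True
    # Recursively include computed usersets, e.g. viewer -> editor -> owner.
    return any(check(obj, parent, user) for parent in REWRITES.get(relation, []))

assert check("doc:123", "viewer", "alice")    # granted via owner -> editor -> viewer
assert check("doc:123", "viewer", "bob")      # granted directly
assert not check("doc:123", "editor", "bob")  # viewer does not imply editor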

Handling Consistency with Ordering

The “new enemy” problem shows why distributed authorization is harder than it appears. When you revoke someone’s access and then modify content, two separate systems must coordinate:

  • Zanzibar for permissions

  • Application for content

Zanzibar addresses this through tokens called zookies. When an application saves new content, it requests an authorization check from Zanzibar. If authorized, Zanzibar returns a zookie encoding the current timestamp, which the application stores with the content.

Later, when someone tries to view that content, the application sends both the viewer’s identity and the stored zookie to Zanzibar. This tells Zanzibar to check permissions using data at least as fresh as that timestamp. Since the timestamp came from after any permission changes, Zanzibar will see those changes when performing the check.

This protocol works because Zanzibar uses Google Spanner, which provides external consistency.

If event A happens before event B in real time, their timestamps reflect that ordering across all data centers worldwide through Spanner’s TrueTime technology.

The zookie protocol has an important property. It specifies the minimum required freshness, not an exact timestamp. Zanzibar can use any timestamp equal to or fresher than required, enabling performance optimizations.
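
A schematic sketch of the zookie flow; the function names and signatures here are invented stand-ins for the application and Zanzibar APIs:

def save_document_content(doc_id, user, new_content, zanzibar, store):
    # 1. A content-change check returns a zookie encoding the evaluation timestamp.
    allowed, zookie = zanzibar.content_change_check(obj=doc_id, user=user)
    if not allowed:
        raise PermissionError("user may not edit this document")
    # 2. Store the zookie alongside the content version it protects.
    store.save(doc_id, new_content, zookie)

def read_document(doc_id, viewer, zanzibar, store):
    content, zookie = store.load(doc_id)
    # 3. The stored zookie forces the check to use data at least as fresh as the
    #    content itself, which closes the "new enemy" window.
    if not zanzibar.check(obj=doc_id, relation="viewer", user=viewer,
                          at_least_as_fresh=zookie):
        raise PermissionError("access denied")
    return content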

The Architecture: Distribution and Caching

Zanzibar runs on over 10,000 servers organized into dozens of clusters worldwide. Each cluster contains hundreds of servers that cooperate to answer authorization requests. The system replicates all permission data to more than 30 geographic locations, ensuring that checks can be performed close to users.

When a check request arrives, it goes to any server in the nearest cluster, and that server becomes the coordinator for the request. Based on the permission configuration, the coordinator may need to contact other servers to evaluate different parts of the check. These servers might recursively contact additional servers, particularly when checking membership in nested groups.

For instance, checking if Alice can view a document might require verifying whether she is an editor (which implies viewer access), whether her group memberships grant access, and whether the document’s parent folder grants access. Each of these checks can execute in parallel on different servers, which then combine the results.

The distributed nature of this processing can create potential hotspots. Popular content generates many concurrent permission checks, all targeting the same underlying data. Zanzibar employs several techniques to mitigate these hotspots:

  • First, the system maintains a distributed cache across all servers in a cluster. Using consistent hashing, related checks route to the same server, allowing that server to cache results and serve subsequent identical checks from memory. The cache keys include timestamps, so checks at the same time can share cached results.

  • Second, Zanzibar uses a lock table to deduplicate concurrent identical requests. When multiple requests for the same check arrive simultaneously, only one actually executes the check. The others wait for the result, then all receive the same answer. This prevents flash crowds from overwhelming the system before the cache warms up.

  • Third, for exceptionally hot items, Zanzibar reads the entire permission set at once rather than checking individual users. While this consumes more bandwidth for the initial read, subsequent checks for any user can be answered from the cached full set.

The system also makes intelligent choices about where to evaluate checks. The zookie flexibility mentioned earlier allows Zanzibar to round evaluation timestamps to coarse boundaries, such as one-second or ten-second intervals. This quantization means that many checks evaluate at the same timestamp and can share cache entries, dramatically improving hit rates.
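
A single-process sketch of how timestamp quantization and request deduplication might combine; the helpers are invented, and the real system does this across a distributed cache and lock table:

import threading
import time

QUANTUM_S = 1.0   # round evaluation timestamps to one-second boundaries (illustrative)

cache = {}
inflight = {}     # lock table: one evaluation per identical in-flight check
inflight_guard = threading.Lock()

def check_with_cache(obj, relation, user, evaluate):
    quantized_ts = int(time.time() // QUANTUM_S)
    key = (obj, relation, user, quantized_ts)

    if key in cache:                 # checks at the same quantized time share entries
        return cache[key]

    with inflight_guard:             # concurrent identical checks share one lock
        lock = inflight.setdefault(key, threading.Lock())
    with lock:
        if key not in cache:         # only the first caller actually evaluates
            cache[key] = evaluate(obj, relation, user)
    return cache[key]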

Handling Complex Group Structures

Some scenarios involve deeply nested groups or groups with thousands of subgroups. Checking membership by recursively following relationships becomes too slow when these structures grow large.

Zanzibar includes a component called Leopard that maintains a denormalized index precomputing transitive group membership. Instead of following chains like “Alice is in Backend, Backend is in Engineering,” Leopard stores direct mappings from users to all groups they belong to.

Leopard uses two types of sets: one mapping users to their direct parent groups, and another mapping groups to all descendant groups. Therefore, checking if Alice belongs to Engineering becomes a set intersection operation that executes in milliseconds.

Leopard keeps its denormalized index consistent through a two-layer approach. An offline process periodically builds a complete index from snapshots. An incremental layer watches for changes and applies them on top of the snapshot. Queries combine both layers for consistent results.
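
As a toy illustration of the idea (not Leopard’s actual data structures), a transitive membership check reduces to a set intersection once the index is denormalized:

# Denormalized index, built offline from the raw membership tuples.
USER_PARENT_GROUPS = {
    "alice": {"backend"},   # Alice's direct groups
}
GROUP_DESCENDANTS = {
    "engineering": {"engineering", "backend", "frontend"},   # all nested subgroups
}

def is_member(user, group):
    # Instead of walking the group graph, intersect two precomputed sets.
    return bool(USER_PARENT_GROUPS.get(user, set()) & GROUP_DESCENDANTS.get(group, set()))

assert is_member("alice", "engineering")
assert not is_member("alice", "marketing")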

Performance Optimization

Zanzibar’s performance numbers show how heavily the system is optimized for the common case. Around 99% of permission checks use moderately stale data, served entirely from local replicas. These checks have a median latency of 3 milliseconds and reach the 95th percentile at 9 milliseconds. The remaining 1% requiring fresher data have a 95th percentile latency of around 60 milliseconds due to cross-region communication.

Writes are slower by design, with a median latency of 127 milliseconds reflecting distributed consensus costs. However, writes represent only 0.25% of operations.

Zanzibar employs request hedging to reduce tail latency. After sending a request to one replica and receiving no response within a specified threshold, the system sends the same request to another replica and uses whichever response arrives first. Each server tracks latency distributions and automatically tunes parameters like default staleness and hedging thresholds.

Isolation and Reliability

Operating a shared authorization service for hundreds of client applications requires strict isolation between clients. A misbehaving or unexpectedly popular feature in one application should not affect others.

Zanzibar implements isolation at multiple levels. Each client has CPU quotas measured in generic compute units. If a client exceeds its quota during periods of resource contention, its requests are throttled, but other clients continue unaffected. The system also limits the number of concurrent requests per client and the number of concurrent database reads per client.

The lock tables used for deduplication include the client identity in their keys. This ensures that if one client creates a hotspot that fills its lock table, other clients’ requests can still proceed independently.

These isolation mechanisms proved essential in production. When clients launch new features or experience unexpected usage patterns, the problems remain contained. Over three years of operation, Zanzibar has maintained 99.999% availability, meaning less than two minutes of downtime per quarter.

Conclusion

Google’s Zanzibar represents five years of evolution in production, serving hundreds of applications and billions of users. The system demonstrates that authorization at massive scale requires more than just fast databases. It demands careful attention to consistent semantics, intelligent caching and optimization, and robust isolation between clients.

Zanzibar’s architecture offers insights applicable beyond Google’s scale. The tuple-based data model provides a clean abstraction unifying access control lists and group membership. Separating policy configuration from data storage makes it easier to evolve authorization logic without migrating data.

The consistency model demonstrates that strong guarantees are achievable in globally distributed systems through careful protocol design. The zookie approach balances correctness with performance by giving the system flexibility within bounds.

Most importantly, Zanzibar illustrates optimizing for observed behavior rather than theoretical worst cases. The system handles the common case (stale reads) extremely well while supporting the uncommon case (fresh reads) adequately. The sophisticated caching strategies show how to overcome normalized storage limitations while maintaining correctness.

For engineers building authorization systems, Zanzibar provides a comprehensive reference architecture. Even at smaller scales, the principles of tuple-based modeling, explicit consistency guarantees, and optimization through measurement remain valuable.


How Cursor Shipped its Coding Agent to Production

2026-01-27 00:30:35

New report: 96% of devs don’t fully trust AI code (Sponsored)

AI is accelerating code generation, but it’s creating a bottleneck in the verification phase. Based on a survey of 1,100+ developers, Sonar’s newest State of Code report analyzes the impact of generative AI on software engineering workflows and how developers are adapting to address it.

Survey findings include:

  • 96% of developers don’t fully trust that AI-generated code is functionally correct, yet only 48% always check it before committing

  • 61% agree that AI often produces code that looks correct but isn’t reliable

  • 24% of a developer’s work week is spent on toil work

Download survey


On October 29, 2025, Cursor shipped Cursor 2.0 and introduced Composer, its first agentic coding model. Cursor claims Composer is 4x faster than similarly intelligent models, with most turns completing in under 30 seconds. For more clarity and detail, we worked with Lee Robinson at Cursor on this article.

Shipping a reliable coding agent requires a lot of systems engineering. Cursor’s engineering team has shared technical details and challenges from building Composer and shipping their coding agent into production. This article breaks down those engineering challenges and how they solved them.

What is a Coding Agent?

To understand coding agents, we first need to look at how AI coding has evolved.

AI in software development has evolved in three waves. First, we treated general-purpose LLMs like a coding partner. You copied code, pasted it into ChatGPT, asked for a fix, and manually applied the changes. It was helpful, but disconnected.

In the second wave, tools like Copilot and Cursor Tab brought AI directly into the editor. To power these tools, specialized models were developed for fast, inline autocomplete. They helped developers type faster, but they were limited to the specific file being edited.

More recently, the focus has shifted to coding agents that handle tasks end-to-end. They don’t just suggest code: they can search your repo, edit multiple files, run terminal commands, and iterate on errors until the build and tests pass. We are currently living through this third wave.


A coding agent is not a single model. It is a system built around a model with tool access, an iterative execution loop, and mechanisms to retrieve relevant code. The model, often referred to as an agentic coding model, is a specialized LLM trained to reason over codebases, use tools, and work effectively inside an agentic system.

It is easy to confuse agentic coding models with coding agents. The agentic coding model is like the brain. It has the intelligence to reason, write code, and use tools. The coding agent is the body. It has the “hands” to execute tools, manage context, and ensure it reaches a working solution by iterating until the build and tests pass.

AI labs first train an agentic coding model, then wrap it in an agent system, also known as a harness, to create a coding agent. For example, OpenAI Codex is a coding agent environment powered by the GPT-5.2-Codex model, and Cursor’s coding agent can run on multiple frontier models, including its own agentic coding model, Composer. In the next section, we take a closer look at Cursor’s coding agent and Composer.


Web Search API for Your AI Applications (Sponsored)

LLMs are powerful—but without fresh, reliable information, they hallucinate, miss context, and go out of date fast. SerpApi gives your AI applications clean, structured web data from major search engines and marketplaces, so your agents can research, verify, and answer with confidence.

Access real-time data with a simple API.

Try for Free


System Architecture

A production-ready coding agent is a complex system composed of several critical components working in unison. While the model provides the intelligence, the surrounding infrastructure is what enables it to interact with files, run commands, and maintain safety. The figure below shows the key components of Cursor’s agent system.

Router

Cursor integrates multiple agentic models, including its own specialized Composer model. For efficiency, the system offers an “Auto” mode that acts as a router. It dynamically analyzes the complexity of each request to choose the best model for the job.

LLM (agentic coding model)

The heart of the system is the agentic coding model. In Cursor’s agent, that model can be Composer or any other frontier coding model picked by the router. Unlike a standard LLM trained just to predict the next token of text, this model is trained on trajectories: sequences of actions that show the model how and when to use available tools to solve a problem.

Creating this model is often the heaviest lift in building a coding agent. It requires massive data preparation, training, and testing to ensure the model doesn’t just write code, but understands the process of coding (e.g., “search first, then edit, then verify”). Once this model is ready and capable of reasoning, the rest of the work shifts to system engineering to provide the environment it needs to operate.

Tools

Composer is connected to a tool harness inside Cursor’s agent system, with more than ten tools available. These tools cover the core operations needed for coding such as searching the codebase, reading and writing files, applying edits, and running terminal commands.

Context Retrieval

Real codebases are too large to fit into a single prompt. The context retrieval system searches the codebase to pull in the most relevant code snippets, documentation, and definitions for the current step, so the model has what it needs without overflowing the context window.

Orchestrator

The orchestrator is the control loop that runs the agent. The model decides what to do next and which tool to use, and the orchestrator executes that tool call, collects the result such as search hits, file contents, or test output, rebuilds the working context with the new information, and sends it back to the model for the next step. This iterative loop is what turns the system from a chatbot into an agent.

One common way to implement this loop is the ReAct pattern, where the model alternates between reasoning steps and tool actions based on the observations it receives.
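
A stripped-down sketch of such a loop; the model and tool interfaces here are invented placeholders rather than Cursor’s actual APIs:

def run_agent(task, model, tools, max_steps=20):
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # The model decides the next action: call a tool or finish.
        action = model.next_action(context)
        if action.type == "finish":
            return action.summary
        # The orchestrator executes the tool call and feeds the observation back.
        result = tools[action.tool_name](**action.arguments)
        context.append({"role": "tool", "name": action.tool_name, "content": result})
    raise RuntimeError("agent did not converge within the step budget")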

Sandbox (execution environment)

Agents need to run builds, tests, linters, and scripts to verify their work. However, giving an AI unrestricted access to your terminal is a security risk. To solve this, tool calls are executed in a Sandbox. This secure and isolated environment uses strict guardrails to ensure that the user’s host machine remains safe even if the agent attempts to run a destructive command. Cursor offers the flexibility to run these sandboxes either locally or remotely on a cloud virtual machine.

Note that these are the core building blocks you will see in most coding agents. Different labs may add more components on top, such as long-term memory, policy and safety layers, specialized planning modules, or collaboration features, depending on the capabilities they want to support.

Production Challenges

On paper, tools, memory, orchestration, routing, and sandboxing look like a straightforward blueprint. In production, the constraints are harsher. A model that can write good code is still useless if edits do not apply cleanly, if the system is too slow to iterate, or if verification is unsafe or too expensive to run frequently.

Cursor’s experience highlights three engineering hurdles that general-purpose models do not solve out of the box: reliable editing, compounded latency, and sandboxing at scale.

Challenge 1: The “Diff Problem”

General-purpose models are trained primarily to generate text. They struggle significantly when asked to perform edits on existing files.

This is known as the “Diff Problem.” When a model is asked to edit code, it has to locate the right lines, preserve indentation, and output a rigid diff format. If it hallucinates line numbers or drifts in formatting, the patch fails even when the underlying logic is correct. Worse, a patch can apply incorrectly, which is harder to detect and more expensive to clean up. In production, incorrect edits are often worse than no edits because they reduce trust and increase cleanup time.

A common way to mitigate the diff problem is to train on edit trajectories. For example, you can structure training data as triples like (original_code, edit_command, final_code), which teaches the model how an edit instruction should change the file while preserving everything else.

Another critical step is teaching the model to use specific editing tools such as search and replace. Cursor emphasized that these two tools were significantly harder to teach than other tools. To solve this, they ensured their training data contained a high volume of trajectories specifically focused on search and replace tool usage, forcing the model to over-learn the mechanical constraints of these operations. Cursor utilized a cluster of tens of thousands of GPUs to train the Composer model, ensuring these precise editing behaviors were fundamentally baked into the weights.
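
A minimal sketch of why search-and-replace editing is so strict: an edit only applies if the search text matches the file exactly once, which is the kind of mechanical constraint the model has to learn. This is an illustrative tool, not Cursor’s implementation:

def apply_search_replace(file_text, search, replace):
    count = file_text.count(search)
    if count == 0:
        raise ValueError("search text not found; the edit must quote the file exactly")
    if count > 1:
        raise ValueError("search text is ambiguous; the edit must include more context")
    return file_text.replace(search, replace, 1)

original = "def total(items):\n    return sum(items)\n"
edited = apply_search_replace(
    original,
    search="    return sum(items)\n",
    replace="    return sum(item.price for item in items)\n",
)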

Challenge 2: Latency Compounds

In a chat interface, a user might tolerate a short pause. In an agent loop, latency compounds. A single task might require the agent to plan, search, edit, and test across many iterations. If each step takes a few seconds, the end-to-end time quickly becomes frustrating.

Cursor treats speed as a core product strategy. To make the coding agent fast, they have employed three key techniques:

  • Mixture of Experts (MoE) architecture

  • Speculative decoding

  • Context compaction

MoE architecture: Composer is a MoE language model. MoE modifies the Transformer by making some feed-forward computation conditional. Instead of sending every token through the same dense MLP, the model routes each token to a small number of specialized MLP experts.

MoE can improve both capacity and efficiency by activating only a few experts per token, which can yield better quality at similar latency, or similar quality at lower latency, especially at deployment scale. However, MoE often introduces additional engineering challenges and complexity. If every token goes to the same expert, that expert becomes a bottleneck while others sit idle. This causes high tail latency.

Teams typically address this with a combination of techniques. During training they add load-balancing losses to encourage the router to spread traffic across experts. During serving, they enforce capacity limits and reroute overflow. At the infrastructure level, they reduce cross-GPU communication overhead by batching and routing work to keep data movement predictable.

Speculative Decoding: Generation is sequential. Agents spend a lot of time producing plans, tool arguments, diffs, and explanations, and generating these token by token is slow. Speculative decoding reduces latency by using a smaller draft model to propose tokens that a larger model can verify quickly. When the draft is correct, the system accepts multiple tokens at once, reducing the number of expensive decoding steps.

Since code has a very predictable structure, such as imports, brackets, and standard syntax, waiting for a large model like Composer to generate every single character is inefficient. Cursor confirms they use speculative decoding and trained specialized “draft” models that predict the next few tokens rapidly. This allows Composer to generate code much faster than the standard token-by-token generation rate.
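
A toy sketch of the accept loop at the core of speculative decoding. The draft_model and target_model are placeholder interfaces, and this greedy variant is a simplification; real systems accept or reject draft tokens against the target model’s probability distribution:

def speculative_decode(prefix, draft_model, target_model, k=4, max_new_tokens=64):
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        draft = draft_model.propose(tokens, k)        # k cheap draft tokens
        # One pass of the big model over prefix + draft yields its own greedy
        # prediction at each draft position, plus one extra token (k + 1 total).
        verified = target_model.greedy_verify(tokens, draft)
        accepted = 0
        while accepted < k and draft[accepted] == verified[accepted]:
            accepted += 1                             # accept the agreeing prefix
        tokens.extend(draft[:accepted])
        # At the first disagreement (or after accepting all k), take the target
        # model's own token so progress is always made.
        tokens.append(verified[accepted])
    return tokens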

Context Compaction: Agents also generate a lot of text that is useful once but costly to keep around, such as tool outputs, logs, stack traces, intermediate diffs, and repeated snippets. If the system keeps appending everything, prompts bloat and latency increases.

Context compaction addresses this by summarizing the working state and keeping only what is relevant for the next step. Instead of carrying full logs forward, the system retains stable signals like failing test names, error types, and key stack frames. It compresses or drops stale context, deduplicates repeated snippets, and keeps raw artifacts outside the prompt unless they are needed again. Many advanced coding agents like OpenAI’s Codex and Cursor rely on context compaction to stay fast and reliable when reaching the context window limit.

Context compaction improves both latency and quality. Fewer tokens reduce compute per call, and less noise reduces the chance the model drifts or latches onto outdated information.
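
A small sketch of the idea applied to a noisy test log: keep the stable signals and drop the bulk. The keywords and limits here are arbitrary choices for illustration:

def compact_tool_output(log_text, max_lines=20):
    # Keep only lines that carry durable signal for the next step.
    keywords = ("FAILED", "Error", "error:", "Traceback", "assert")
    signal = [line for line in log_text.splitlines() if any(k in line for k in keywords)]
    kept = signal[:max_lines]
    dropped = len(log_text.splitlines()) - len(kept)
    return "\n".join(kept + [f"... ({dropped} less relevant lines omitted)"])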

Put together, these three techniques target different sources of compounded latency. MoE reduces per-call serving cost, speculative decoding reduces generation time, and context compaction reduces repeated prompt processing.

Challenge 3: Sandboxing at Scale

Coding agents do not only generate text. They execute code. They run builds, tests, linters, formatters, and scripts as part of the core loop. That requires an execution environment that is isolated, resource-limited, and safe by default.

In Cursor’s flow, the agent provisions a sandboxed workspace from a specific repository snapshot, executes tool calls inside that workspace, and feeds results back into the model. At a small scale, sandboxing is mostly a safety feature. At large scale, it becomes a performance and infrastructure constraint.

Two major issues dominate when training the model:

  • Provisioning time becomes the bottleneck. The model may generate a solution in milliseconds, but creating a secure, isolated environment can take much longer. If sandbox startup dominates, the system cannot iterate quickly enough to feel usable.

  • Concurrency makes startup overhead a bottleneck at scale. Spinning up thousands of sandboxes at once is challenging, and it becomes even more so during training. Teaching the model to call tools at scale requires running hundreds of thousands of concurrent sandboxed coding environments in the cloud.

These challenges pushed the Cursor team to build custom sandboxing infrastructure. They rewrote their VM scheduler to handle bursty demand, like when an agent needs to spin up thousands of sandboxes in a short time. Cursor treats sandboxes as core serving infrastructure, with an emphasis on fast provisioning and aggressive recycling so tool runs can start quickly and sandbox startup time does not dominate the time to a verified fix.

For safety, Cursor defaults to a restricted Sandbox Mode for agent terminal commands. Commands run in an isolated environment with network access blocked by default and filesystem access limited to the workspace and /tmp/. If a command fails because it needs broader access, the UI lets the user skip it or intentionally re-run it outside the sandbox.

The key takeaway is to not treat sandboxes as just containers. Treat them like a system that needs its own scheduler, capacity planning, and performance tuning.

Conclusion

Cursor shows that modern coding agents are not just better text generators. They are systems built to edit real repositories, run tools, and verify results. Cursor paired a specialized MoE model with a tool harness, latency-focused serving, and sandboxed execution so the agent can follow a practical loop: inspect the code, make a change, run checks, and iterate until the fix is verified.

Cursor’s experience shipping Composer to production points to three repeatable lessons that matter for most coding agents:

  1. Tool use must be baked into the model. Prompting alone is not enough for reliable tool calling inside long loops. The model needs to learn tool usage as a core behavior, especially for editing operations like search and replace where small mistakes can break the edit.

  2. Adoption is the ultimate metric. Offline benchmarks are useful, but a coding agent lives or dies by user trust. A single risky edit or broken build can stop users from relying on the tool, so evaluation has to reflect real usage and user acceptance.

  3. Speed is part of the product. Latency shapes daily usage. You do not need a frontier model for every step. Routing smaller steps to fast models while reserving larger models for harder planning turns responsiveness into a core feature, not just an infrastructure metric.

Coding agents are still evolving, but the trend is promising. With rapid advances in model training and system engineering, we are moving toward a future where they become much faster and more effective.

EP199: Behind the Scenes: What Happens When You Enter Google.com

2026-01-25 00:30:29

Debugging Next.js in Production (Sponsored)

Next.js makes it easy to ship fast, but once your app is in production it can be hard to tell where errors, slow requests, or hydration issues are really coming from.

Join Sentry’s hands-on workshop where Sergiy Dybskiy will dive into how these problems show up in real apps and how to connect what users experience with what’s happening under the hood. 🚀

Register for the workshop


This week’s system design refresher:

  • What Are AI Agents & How Do They Work? (YouTube video)

  • Behind the Scenes: What Happens When You Enter Google.com

  • Understanding the Linux Directory Structure

  • Symmetric vs. Asymmetric Encryption

  • Network Troubleshooting Test Flow

  • We’re hiring at ByteByteGo


What Are AI Agents & How Do They Work?


Behind the Scenes: What Happens When You Enter Google.com

Most of us hit Enter and expect the page to load instantly, but under the hood, a surprisingly intricate chain of events fires off in milliseconds.

Here’s a quick tour of what actually happens:

  1. The journey starts the moment you type “google.com” into the address bar.

  2. The browser checks everywhere for a cached IP: Before touching the network, your browser looks through multiple cache layers, including the browser cache, OS cache, router cache, and even your ISP’s DNS cache.
    A cache hit means an instant IP address. A miss kicks off the real journey.

  3. Recursive DNS resolution begins: Your DNS resolver digs through the global DNS hierarchy:
    - Root servers
    - TLD servers (.com)
    - Authoritative servers for google.com

  4. A TCP connection is established: Your machine and Google’s server complete the classic TCP 3-way handshake:
    - SYN → SYN/ACK → ACK
    Only after the connection is stable does the browser move on.

  5. The TLS handshake wraps everything in encryption: By the end of this handshake, a secure HTTPS tunnel is ready.

  6. The actual HTTP request finally goes out: Google processes the request and streams back HTML, CSS, JavaScript, and all the assets needed to build the page.

  7. The rendering pipeline kicks in:
    Your browser parses HTML into a DOM tree, CSS into a CSSOM tree, merges them into the Render Tree, and then:
    - Lays out elements
    - Loads and executes JavaScript
    - Repaints the screen

  8. The page is fully loaded.

Over to you: What part of this journey was most surprising the first time you learned how browsers work?


Understanding the Linux Directory Structure

  • The root directory “/” is the starting point of the entire filesystem. From there, Linux organizes everything into specialized folders.

  • “/boot” stores the bootloader and kernel files; without it, the system can’t start.

  • “/dev” holds device files that act as interfaces to hardware.

  • “/usr” contains system resources, libraries, and user-level applications.

  • “/bin” and “/sbin” store essential binaries and system commands needed during startup or recovery.

  • User-related data sits under “/home” for regular users and “/root” for the root account.

  • System libraries that support core binaries live in “/lib” and “/lib64”.

  • Temporary data is kept in “/tmp”, while “/var” tracks logs, caches, and frequently changing files.

  • Configuration files live in “/etc”, and runtime program data lives in “/run”.

  • Linux also exposes virtual filesystems through “/proc” and “/sys”, giving you insight into processes, kernel details, and device information.

  • For external storage, “/media” and “/mnt” handle removable devices and temporary mounts, and “/opt” is where optional third-party software installs itself.

Over to you: Which Linux directory took you the longest to fully understand, and what finally made it click for you?


Symmetric vs. Asymmetric Encryption

Symmetric and asymmetric encryption often get explained together, but they solve very different problems.

  • Symmetric encryption uses a single shared key. The same key encrypts and decrypts the data. It’s fast, efficient, and ideal for large amounts of data. That’s why it’s used for things like encrypting files, database records, and message payloads.

    The catch is key distribution: both parties must already have the secret, and sharing it securely is hard.

  • Asymmetric encryption uses a key pair. A public key that can be shared with anyone, and a private key that stays secret. Data encrypted with the public key can only be decrypted with the private key.

    This removes the need for secure key sharing upfront, but it comes at a cost. It’s slower and computationally expensive, which makes it impractical for encrypting large payloads.

    That’s why asymmetric encryption is usually used for identity, authentication, and key exchange, not bulk data.
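
A common pattern that combines the two is hybrid encryption: encrypt the payload with a fast symmetric key, then protect that key with the recipient’s public key. A minimal sketch using the Python cryptography package:

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Symmetric: fast, good for bulk data, but both sides need the same key.
data_key = Fernet.generate_key()
ciphertext = Fernet(data_key).encrypt(b"large message payload ...")

# Asymmetric: slower, but lets us share the symmetric key without a prior secret.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = private_key.public_key().encrypt(data_key, oaep)

# Receiver: unwrap the symmetric key with the private key, then decrypt the bulk data.
recovered_key = private_key.decrypt(wrapped_key, oaep)
assert Fernet(recovered_key).decrypt(ciphertext) == b"large message payload ..."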

Over to you: What’s the most common misunderstanding you’ve seen about encryption in system design?


Network Troubleshooting Test Flow

Most network issues look complicated, but the troubleshooting process doesn’t have to be.

A reliable way to diagnose problems is to test the network layer by layer, starting from your own machine and moving outward until you find exactly where things break.

That’s exactly why we put together this flow: a structured, end-to-end checklist that mirrors how packets actually move through a system.


We’re Hiring

I am hiring for 2 roles: Technical Deep Dive Writer (System Design or AI Systems), and Lead Instructor (Building the World’s Most Useful AI Cohort). Job descriptions below.

1. Technical Deep Dive Writer

ByteByteGo started with a simple idea: explain system design clearly. Over time, it has grown into one of the largest technical education platforms for engineers, reaching millions of engineers every month. We believe it can become much bigger and more impactful.

This role is for someone exceptional who wants to help build that future by producing the highest-quality technical content on the internet.

You will work very closely with me to produce deep, accurate, and well structured technical content. The goal is not volume. The goal is to set the quality bar for how system design and modern AI systems are explained at scale.

The role is to turn technical knowledge into world class technical writing.

What you will do:

  • Turn complex systems into explanations that are precise, readable, and memorable.

  • Create clear technical diagrams that accurately represent system architecture and tradeoffs.

  • Collaborate directly with tech companies like Amazon, Shopify, Cursor, Yelp, etc.

  • Continuously raise the bar for clarity, correctness, and depth.

Who we are looking for

  • 5+ years of experience building large scale systems

  • Ability to explain complex ideas without oversimplifying

  • Strong ownership mindset and pride in craft

Role Type: Part time remote (10-20 hours per week), with the possibility of converting to full time

Compensation: Competitive

This is not just a writing role. It is a chance to help build the most trusted technical education brand in the industry.

How to apply: If you are interested, please send your resume and a previous writing sample to [email protected]


2. Lead Instructor, Building the World’s Most Useful AI Cohort

Cohort Course Name: Building Production AI Systems

This cohort is focused on one of the hardest problems in modern engineering: taking AI systems from impressive demos to reliable, secure, production-ready systems used by real users.

Our learners already understand generative AI concepts. What they want and need is engineering rigor. How to evaluate models properly. How to ship safely. How to scale without blowing up cost or reliability. How to operate AI systems in the real world.

This role is for someone exceptional who has done this work in production and wants to help shape the future of AI engineering education.

Role Type: Part-time, remote (20+ hours per week), with the possibility of converting to full time

Compensation: Very Competitive

Responsibilities

  • Develop and maintain a curriculum centered on production AI

  • Design labs and assignments, keeping them runnable with modern tooling

  • Teach live sessions (lecture + hands-on lab)

  • Run weekly office hours

  • Provide clear feedback on assignments and async questions

Required expertise

Production AI Engineering

  • You have shipped and maintained AI features used by thousands of users. You have “war stories” regarding outages, cost spikes, or quality regressions.

  • Deep understanding of the FastAPI + Pydantic + Celery/Redis stack for handling asynchronous AI tasks.

  • Can articulate the tradeoffs between latency and cost, and between reliability and velocity.

AI Evals

  • Experience evaluating AI systems such as LLMs, RAG pipelines, agents, and image/video generation models

  • Ability to explain standard metrics and design eval datasets (golden, adversarial, regression)

  • Ability to implement scoring (rules, rubrics, LLM-as-judge with guardrails)

  • Familiarity with industry eval patterns and frameworks (e.g., OpenAI Evals)

AI security and guardrails

  • Core risks: prompt injection, insecure output handling, model DoS, and supply chain risks.

  • Threat Modeling: Experience mapping threats to taxonomies like MITRE ATLAS.

  • Implementing input sanitization and output validation to prevent prompt injection and model DoS.

Deployment and optimization

  • Infrastructure: Comfort with Docker, Kubernetes, and hybrid routing (using a mix of self-hosted and managed APIs like Azure OpenAI or Bedrock).

  • Optimization: Hands-on experience with quantization (FP8/INT8), prompt caching, pruning, distillation, distributed inference, efficient attention variants, and batching strategies to maximize throughput.

  • Serving Engines like vLLM or NVIDIA TensorRT-LLM.

  • Comfortable with production deployment patterns (containerization, staged rollouts, canaries).

  • Backend: Python (Expert), FastAPI, Pydantic v2, and asynchronous programming.

Monitoring and observability

  • Ability to teach tracing and quality monitoring

Desirable Technical Stack Knowledge

  • Frameworks: LangGraph, CrewAI, or Haystack.

  • Databases: Vector DBs (Pinecone, Weaviate, Qdrant) and their indexing strategies (HNSW, IVFFlat).

  • Observability: LangSmith, Honeycomb, or Arize Phoenix.

This is not just a teaching role. It is a chance to help scale the most popular AI cohort and define how production-grade AI engineering is taught.

Let’s build the most popular AI cohort together.

How to Apply

Send your resume and a short note on why you’re excited about this role to [email protected]
