2026-02-06 00:30:40
Authentication serves as the first line of defense in ensuring the security of applications and the sensitive data they handle. Whether it’s a personal banking app, a corporate platform, or an e-commerce website, effective authentication mechanisms are needed to verify the identity of users and safeguard their access to resources.
Without proper authentication, applications are vulnerable to unauthorized access, data breaches, and malicious attacks, potentially resulting in significant financial loss, reputational damage, and privacy violations.
In addition to security, authentication plays a critical role in the user experience. By identifying users, applications can provide personalized services, remember user preferences, and enable functionalities like Single Sign-On (SSO) across platforms.
With evolving threats, implementing secure and efficient authentication is more challenging than ever. Developers must navigate between competing priorities such as security (ensuring protection against different attack types like session hijacking, token theft, and replay attacks), scalability (supporting millions of users without compromising performance), and user experience (maintaining ease of use while applying strong security measures).
To tackle these challenges, developers rely on various authentication techniques. In this article, we’ll explore multiple authentication techniques used in applications and understand their advantages and disadvantages.
2026-02-05 00:32:00
Most AI code review tools analyze the diff. Sometimes the file, occasionally the repo.
Experienced engineers work differently. They remember that Slack thread that explains why this database pattern exists. They know David on the platform team has strong opinions about error handling. They’ve internalized dozens of unwritten conventions.
Unblocked is the only AI code review tool that uses deep insight into your codebase, docs, and discussions to give high-signal feedback based on how your system actually works – instead of flooding your PR with stylistic nitpicks.
“Unblocked has reversed my AI fatigue completely. The level of precision is wild.” - Senior developer, Clio
Prompt engineering is the process of crafting instructions that guide AI language models to generate desired outcomes. At first glance, it might seem straightforward. We simply tell the AI what we want, and it delivers. However, anyone who has worked with these models quickly discovers that writing effective prompts is more challenging than it appears.
The ease of getting started with prompt engineering can be misleading.
While anyone can write a prompt, not everyone can write one that consistently produces high-quality results. Think of it as the difference between being able to communicate and being able to communicate effectively. The fundamentals are accessible, but mastery requires practice, experimentation, and understanding how these models process information.
In this article, we will look at the core techniques and best practices for prompt engineering. We will explore different prompting approaches, from simple zero-shot instructions to advanced chain-of-thought reasoning.
A prompt typically consists of several components:
The task description explains what we want the model to do, including any role or persona we want it to adopt.
The context provides necessary background information. Examples demonstrate the desired behavior or format.
Finally, the concrete task is the specific question to answer or action to perform.
Most model APIs allow us to split prompts into system prompts and user prompts.
System prompts typically contain task descriptions and role-playing instructions that shape how the model behaves throughout the conversation.
On the other hand, user prompts contain the actual task or question. For instance, if we are building a chatbot that helps buyers understand property disclosures, the system prompt might instruct the model to act as an experienced real estate agent, while the user prompt contains the specific question about a property.
See the diagram below:
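For illustration, here is that same split expressed in the role-based message format most chat APIs accept. The wording below is a sketch of the real estate scenario above and is not tied to any specific provider or model.

```python
# Illustrative system/user prompt split in the common role-based chat format.
# The content of both messages is a made-up example, not a required template.
messages = [
    {
        "role": "system",
        "content": (
            "You are an experienced real estate agent. Help buyers understand "
            "property disclosures in plain language and flag potential risks."
        ),
    },
    {
        "role": "user",
        "content": (
            "The disclosure mentions 'prior foundation repair'. "
            "What should I ask the seller before making an offer?"
        ),
    },
]
```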
Clarity is the key factor in effective prompting. Just as clear communication helps humans understand what we need, specific and unambiguous instructions help AI models generate appropriate responses. We should explain exactly what we want, define any scoring systems or formats we expect, and eliminate assumptions about what the model might already know.
Context is equally important. Providing relevant information helps models perform better and reduces hallucinations. If we want the model to answer questions about a research paper, including that paper in the context will significantly improve response quality. Without sufficient context, the model must rely on its internal knowledge, which may be outdated or incorrect.
Meet GitHub Copilot. Accelerate software innovation on any platform or code repository with GitHub Copilot, the agentic AI software development tool that meets you where you are.
With GitHub Copilot your team can:
Plan, build, and deploy with AI-powered workflows.
Use agentic capabilities to tackle hard tasks: spec-driven development, docs generation, testing, and app modernization/migration.
Integrate GitHub Copilot anywhere: your teams, your toolchain, with flexible plans for agentic workflows.
In-context learning is the fundamental mechanism that makes prompt engineering work.
This term refers to a model’s ability to learn new behaviors from examples provided in the prompt itself, without requiring any updates to the model’s weights. When we show a model examples of how to perform a task, it can adapt its responses to match the pattern we have demonstrated.
Models are typically better at understanding instructions at the beginning and end of prompts compared to the middle. This phenomenon, often described as models getting “lost in the middle” (and commonly probed with “needle in a haystack” tests), means we should place the most important information at strategic positions in our prompts.
The number of examples needed depends on both the model’s capability and the task’s complexity. Stronger models generally require fewer examples to understand what we want. For simpler tasks, powerful models might not need any examples at all. For domain-specific applications or complex formatting requirements, providing several examples can make a significant difference.
Let’s look at some key prompting techniques:
Zero-shot prompting means giving the model instructions without providing any examples. In this approach, we simply describe what we want, and the model attempts to fulfill the request based on its training.
This technique works best for straightforward tasks where the desired output is clear from the instructions alone. For example, “Translate the following text to French” or “Summarize this article in three sentences” are both effective zero-shot prompts.
The main advantage of zero-shot prompting is efficiency. It uses fewer tokens, which reduces costs and latency. The prompts are also simpler to write and maintain. However, zero-shot prompting has limitations. When we need specific formatting or behavior that differs from the model’s default responses, zero-shot prompts may not be sufficient.
Best practices for zero-shot prompting include being as explicit as possible about what we want, specifying the output format clearly, and stating any constraints or requirements upfront. If the model’s initial response is not what we expected, we should revise the prompt to add more detail rather than immediately jumping to few-shot examples.
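For instance, a zero-shot summarization prompt might be assembled like the sketch below; the article text is a placeholder, and the constraints simply illustrate being explicit about format and scope.

```python
# Zero-shot prompt sketch: explicit task, output format, and constraints, no examples.
article_text = "..."  # placeholder for the article to summarize

prompt = (
    "Summarize the following article in exactly three sentences. "
    "Use plain language and do not add information that is not in the article.\n\n"
    f"Article:\n{article_text}"
)
print(prompt)
```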
Few-shot prompting involves providing examples that demonstrate how we want the model to respond. One-shot prompting uses a single example, while few-shot typically means two to five or more examples.
This technique is valuable when we need specific formatting or when the desired behavior might be ambiguous from instructions alone. For instance, if we are building a bot to talk to young children and want it to respond to questions about fictional characters in a particular way, showing an example helps the model understand the expected tone and approach.
Consider this comparison. Without an example, if a child asks, “Will Santa bring me presents on Christmas?”, a model might explain that Santa Claus is fictional. However, if we provide an example like “Q: Is the tooth fairy real? A: Of course! Put your tooth under your pillow tonight,” the model learns to maintain the magical perspective appropriate for young children.
The number of examples matters. More examples generally lead to better performance, but we are limited by context length and cost considerations. For most applications, three to five examples strike a good balance. We should experiment to find the optimal number for our specific use case.
When formatting examples, we can save tokens by choosing efficient structures. For instance, “pizza -> edible” uses fewer tokens than “Input: pizza, Output: edible” while conveying the same information. These small optimizations add up, especially when working with multiple examples.
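A few-shot prompt using that compact arrow format might be assembled like this sketch; the labels and items are illustrative only.

```python
# Few-shot prompt sketch using the token-efficient "input -> output" format.
examples = [
    "pizza -> edible",
    "rock -> inedible",
    "apple -> edible",
]

prompt = (
    "Classify each item as edible or inedible, following the examples.\n"
    + "\n".join(examples)
    + "\nchalk -> "   # the new input the model should complete
)
print(prompt)
```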
Chain-of-thought prompting, often abbreviated as CoT, involves explicitly asking the model to think step by step before providing an answer. This technique encourages systematic problem-solving and has been shown to significantly improve performance on complex reasoning tasks.
The simplest implementation is adding phrases like “think step by step” or “explain your reasoning” to our prompts. The model then works through the problem methodically, showing its reasoning process before arriving at a conclusion.
CoT often improves model performance across various benchmarks, particularly for mathematical problems, logic puzzles, and multi-step reasoning tasks. CoT also helps reduce hallucinations because the model must justify its answers with explicit reasoning steps.
We can implement CoT in several ways. Zero-shot CoT simply adds a reasoning instruction to our prompt. We can also specify the exact steps we want the model to follow, or provide examples that demonstrate the reasoning process. Which variation to use depends on the specific application and how much control we need over the reasoning structure.
The trade-off with CoT is increased latency and cost. The model generates more tokens as it works through its reasoning, which takes more time and increases API costs. For complex tasks where accuracy is critical, this trade-off is usually worthwhile.
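A zero-shot CoT prompt might look like the following sketch; the math question is just an illustrative placeholder.

```python
# Zero-shot chain-of-thought sketch: the added instruction asks the model to
# show its reasoning before committing to a final answer.
question = (
    "A train travels 120 km in 1.5 hours, then 80 km in 1 hour. "
    "What is its average speed for the whole trip?"
)

prompt = (
    f"{question}\n\n"
    "Think step by step: lay out each calculation before giving the final answer, "
    "then state the result on its own line prefixed with 'Answer:'."
)
print(prompt)
```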
Role prompting assigns a specific persona or area of expertise to the model. By telling the model to adopt a particular role, we influence the perspective and style of its responses.
For example, if we ask a model to score a simple essay like “Summer is the best season. The sun is warm. I go swimming. Ice cream tastes good in summer,” it might give a low score based on general writing standards. However, if we first instruct the model to adopt the persona of a first-grade teacher, it will evaluate the essay from that perspective and likely assign a higher, more appropriate score.
Role prompting is particularly effective for customer service applications, educational content, creative writing, and any scenario where the context or expertise level matters. The model can adjust its vocabulary, level of detail, and approach based on the assigned role.
When using role prompting, we should be specific about the role and any relevant characteristics. Rather than just saying “act as a teacher,” we might say “act as an encouraging first-grade teacher who focuses on effort and improvement.” The more specific we are, the better the model can embody that perspective.
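For the essay-scoring scenario above, a role prompt might look like this sketch; the exact wording is illustrative.

```python
# Role prompting sketch: a specific persona in the system prompt shapes how
# the first-grader's essay is evaluated.
system_prompt = (
    "You are an encouraging first-grade teacher who focuses on effort and "
    "improvement. Score essays from 1 to 5 relative to what a six-year-old "
    "can reasonably write, and explain the score in one kind sentence."
)
user_prompt = (
    "Score this essay: 'Summer is the best season. The sun is warm. "
    "I go swimming. Ice cream tastes good in summer.'"
)
print(system_prompt, user_prompt, sep="\n\n")
```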
Prompt chaining involves breaking complex tasks into smaller, manageable subtasks, each with its own prompt. Instead of handling everything in one giant prompt, we create a series of simpler prompts and chain them together.
Consider a customer support chatbot. The process of responding to a customer request can be decomposed into two main steps:
Classify the intent of the request
Generate an appropriate response based on that intent
The first prompt focuses solely on determining whether the customer needs billing help, technical support, account management, or general information. Based on that classification, we then use a second, specialized prompt to generate the actual response.
This approach offers several benefits. Each prompt is simpler to write and maintain. We can monitor and debug each step independently. We can use different models for different steps, perhaps using a faster, cheaper model for intent classification and a more powerful model for response generation. We can also execute independent steps in parallel when possible.
The main drawback is increased perceived latency for end users. With multiple steps, users wait longer to see the final output. However, for complex applications, the improved reliability and maintainability often outweigh this concern.
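A minimal sketch of this two-step chain is shown below; call_llm is a stand-in for whichever model API is actually used, and the intent labels and prompt wording are assumptions for illustration.

```python
# Prompt chaining sketch for the support bot: step 1 classifies intent,
# step 2 uses a specialized prompt for that intent.
def call_llm(prompt: str) -> str:
    # Stand-in for the real model call (replace with your provider's API).
    raise NotImplementedError("wire this up to your model API")

INTENTS = ["billing", "technical_support", "account_management", "general"]

RESPONSE_PROMPTS = {
    "billing": "You are a billing specialist. Resolve this request:\n{request}",
    "technical_support": "You are a support engineer. Troubleshoot this issue:\n{request}",
    "account_management": "You are an account manager. Help with this request:\n{request}",
    "general": "You are a helpful assistant. Answer this question:\n{request}",
}

def handle_request(request: str) -> str:
    # Step 1: a small, cheap prompt whose only job is classification.
    intent = call_llm(
        f"Classify this request as one of {INTENTS}. Reply with the label only.\n{request}"
    ).strip()
    # Step 2: a specialized prompt generates the actual customer-facing response.
    template = RESPONSE_PROMPTS.get(intent, RESPONSE_PROMPTS["general"])
    return call_llm(template.format(request=request))
```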
Some best practices for effective prompting are as follows:
Be Clear and Specific: Ambiguity is the enemy of effective prompting. We should eliminate all uncertainty about what we want the model to do. If we want the model to score essays, we need to specify the scoring scale. Should it use 1 to 5 or 1 to 10? Are fractional scores allowed? What should the model do if it is uncertain about a score?
Provide Sufficient Context: Context helps models generate accurate, relevant responses. If we want the model to answer questions about a document, including that document in the prompt is essential. Without it, the model can only rely on its training data, which may lead to outdated or incorrect information.
Specify Output Format: We should explicitly state how we want the model to respond. Do we want a concise answer or a detailed explanation? Should the output be formatted as JSON, a bulleted list, or a paragraph? Should the model include preambles, or should it get straight to the point?
Use Examples Strategically: Examples are powerful tools for reducing ambiguity, but they come with a cost in terms of tokens and context length. We should provide examples when the desired format or behavior is not obvious from instructions alone. For straightforward tasks, examples may not be necessary.
Iterate and Experiment: Prompt engineering is iterative. We rarely write the perfect prompt on the first try. We should start with a basic prompt, test it, observe the results, and refine based on what we learn.
Version Prompts: We should version our prompts so we can track changes over time. Using consistent evaluation data allows us to compare different prompt variations objectively. We should test prompts not just in isolation but in the context of the complete system to ensure that improvements in one area do not create problems elsewhere.
Some common pitfalls that should be avoided when writing prompts are as follows:
Being Too Vague: One of the most common mistakes is assuming the model understands our intent without explicit explanation. Vague prompts like “write something about climate change” leave too much open to interpretation. Do we want a scientific explanation, a persuasive essay, a news article, or a social media post? What length? What perspective? The model will make its own choices, which may not align with what we actually want.
Overcomplicating Prompts: While clarity and detail are important, we can go too far in the other direction. Overly complex prompts with excessive instructions, too many examples, or convoluted logic can confuse the model rather than help it. We should aim for the simplest prompt that achieves our goal. If a zero-shot prompt works well, there is no need to add examples. If three examples are sufficient, five may not improve results.
Ignoring Output Format: Failing to specify the output format can cause problems, especially when model outputs feed into other systems. If we need structured data but do not request it explicitly, the model might generate unstructured text that requires additional parsing or cleaning. This adds complexity and potential points of failure to our application.
Not Testing Sufficiently: A single successful output does not mean the prompt is reliable. We should test prompts with various inputs, including edge cases and unusual scenarios. What works for typical cases might fail when inputs are slightly different or unexpected. Building a small evaluation dataset and testing systematically helps identify weaknesses before they become problems in production.
Effective prompt engineering combines clear communication, strategic use of examples, and systematic experimentation.
The core techniques we have explored, including zero-shot prompting, few-shot prompting, chain-of-thought reasoning, role prompting, and prompt chaining, provide a solid foundation for working with AI language models.
The key principles remain consistent across different models and applications:
Be specific and clear in our instructions.
Provide sufficient context for the model to work with.
Specify the output format we need.
Use examples when they add value and iterate based on results.
2026-02-04 00:30:36
Cut through the noise with this engineer-friendly guide to Kubernetes observability. Save this reference for fast-track access to essential kubectl commands and critical metrics, from disk I/O and network latency to real-time cluster events. Perfect for scaling, debugging, and tuning your workloads without sifting through endless docs.
Digital services require accurate extraction of information from user-submitted documents such as identification cards, driver’s licenses, and vehicle registration certificates. This process is essential for electronic know-your-customer (eKYC) verification. However, the diversity of languages and document formats across Southeast Asia makes this task particularly challenging.
The Grab Engineering Team faced significant obstacles with traditional Optical Character Recognition (OCR) systems, which struggled to handle the variety of document templates. While powerful proprietary Large Language Models (LLMs) were available, they often failed to adequately understand Southeast Asian languages, produced errors and hallucinations, and suffered from high latency. Open-source Vision LLMs offered better efficiency but lacked the accuracy required for production deployment.
This situation prompted Grab to fine-tune existing models and eventually build a lightweight, specialized Vision LLM from the ground up. In this article, we will look at the complete architecture, the technical decisions made, and the results achieved.

Disclaimer: This post is based on publicly shared details from the Grab Engineering Team. Please comment if you notice any inaccuracies.
Before diving into the solution, it helps to understand what a Vision LLM is and how it differs from traditional text-based language models.
A standard LLM processes text inputs and generates text outputs. A Vision LLM extends this capability by enabling the model to understand and process images. The architecture consists of three essential components working together:
The first component is the image encoder. This module processes an image and converts it into a numerical format that computers can work with. Think of it as translating visual information into a structured representation of numbers and vectors.
The second component is the vision-language projector. This acts as a bridge between the image encoder and the language model. It transforms the numerical representation of the image into a format that the language model can interpret and use alongside text inputs.
The third component is the language model itself. This is the familiar text-processing model that takes both the transformed image information and any text instructions to generate a final text output. In the case of document processing, this output would be the extracted text and structured information from the document.
See the diagram below:
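To make the three components concrete, here is a toy sketch in PyTorch. The module choices and dimensions are illustrative stand-ins, not Grab’s actual architecture, and a real language decoder would use causal masking.

```python
# Toy Vision LLM: image encoder -> projector -> language model -> text scores.
import torch
import torch.nn as nn

class ToyVisionLLM(nn.Module):
    def __init__(self, patch_dim=196, vision_dim=256, lm_dim=512, vocab_size=1000):
        super().__init__()
        # 1) Image encoder: turns image patches into numerical features.
        self.image_encoder = nn.Sequential(nn.Linear(patch_dim, vision_dim), nn.GELU())
        # 2) Vision-language projector: maps vision features into the LM's space.
        self.projector = nn.Linear(vision_dim, lm_dim)
        # 3) Language model: consumes projected image tokens alongside text tokens.
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # simplified, no causal mask
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_patches, text_ids):
        vision_tokens = self.projector(self.image_encoder(image_patches))
        text_tokens = self.text_embed(text_ids)
        sequence = torch.cat([vision_tokens, text_tokens], dim=1)
        return self.lm_head(self.lm(sequence))

model = ToyVisionLLM()
logits = model(torch.randn(1, 4, 196), torch.randint(0, 1000, (1, 8)))
print(logits.shape)  # (1, 12, 1000): a score for every vocabulary token at each position
```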
Engineering teams at Coinbase, MSCI, and Zscaler have at least one thing in common: they use Resolve AI’s AI SRE to make MTTR 5x faster and increase dev productivity by up to 75%.
When it comes to production issues, the numbers hurt: 54% of significant outages cost more than $100,000, and downtime costs the Global 2000 roughly $400 billion annually.
It’s why eng teams leverage our AI SRE to correlate code, infrastructure, and telemetry and provide real-time root cause analysis, prescriptive remediation, and continuous learning.
Time to try an AI SRE? This guide covers:
The ROI of an AI SRE
Whether you should build or buy
How to assess AI SRE solutions
Grab evaluated several open-source models capable of performing OCR and Key Information Extraction (KIE). The options included Qwen2-VL, miniCPM, Llama 3.2 Vision, Pixtral 12B, GOT-OCR2.0, and NVLM 1.0.
After thorough evaluation, Grab selected Qwen2-VL 2B as the base multimodal LLM. This decision was driven by several critical factors:
First, the model size was appropriate. With 2 billion parameters, it was small enough to allow full fine-tuning on GPUs with limited VRAM resources. Larger models would have required more expensive infrastructure and longer training times.
Second, the model offered good Southeast Asian language support. The tokenizer showed efficiency for languages like Thai and Vietnamese, indicating decent native vocabulary coverage. A tokenizer is the component that breaks text into smaller units (tokens) that the model can process. Efficient tokenization means the model can represent these languages without wasting capacity.
Third, and perhaps most importantly, Qwen2-VL supports dynamic resolution. Unlike models that require fixed-size image inputs, this model can process images in their native resolution. This capability is critical for OCR tasks because resizing or cropping images can distort text characters, leading to recognition errors. Preserving the original resolution maintains text integrity and improves accuracy.
Initial benchmarking of Qwen2-VL and miniCPM on Grab’s dataset revealed low accuracy, primarily due to the limited coverage of Southeast Asian languages. This finding motivated the team to pursue fine-tuning to improve OCR and KIE accuracy.
However, training LLMs is both data-intensive and GPU resource-intensive, which brings up two important questions: how to use open-source and internal data effectively, and how to customize the model to reduce latency while maintaining high accuracy.
Grab developed two approaches to generate training data for the model:
The first approach involved creating synthetic training data. Grab extracted Southeast Asian language text content from Common Crawl, a large online text corpus that contains data from across the internet. Using an in-house synthetic data pipeline, the team generated text images by rendering this content in various fonts, backgrounds, and augmentations.
The resulting dataset included text in Bahasa Indonesia, Thai, Vietnamese, and English. Each image contained a paragraph of random sentences extracted from the corpus. This synthetic approach offered several advantages. It allowed controlled generation of training examples, enabled the creation of unlimited variations, and ensured coverage of different visual styles and document conditions.
The second approach leveraged real documents collected by Grab. Experiments showed that applying document detection and orientation correction significantly improved OCR and information extraction.
To generate a preprocessing dataset, Grab built Documint, an internal platform that creates an auto-labelling and preprocessing framework for document understanding.
Documint prepares high-quality, labelled datasets through various submodules that execute the full OCR and KIE task. The team used this pipeline with a large volume of Grab-collected cards and documents to extract training labels. Human reviewers then refined the data to achieve high label accuracy.
Documint consists of four main modules:
The detection module identifies the document region from a full picture.
The orientation module determines the correction angle needed, such as 180 degrees if a document is upside down.
The OCR module extracts text values in an unstructured format.
Finally, the KIE module converts the unstructured text into structured JSON values.

Grab conducted the model development in three distinct phases, each building on the lessons learned from the previous phase:
The first attempt at fine-tuning involved a technique called Low-Rank Adaptation, or LoRA.
This method is efficient because it updates only a small portion of the model’s parameters rather than retraining the entire model. Specifically, LoRA adds small trainable matrices to the model while keeping most of the original weights frozen. This approach minimizes computational resource requirements and reduces training time.
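For illustration, this is roughly what attaching LoRA adapters looks like with the Hugging Face PEFT library. The base model name and target modules here are assumptions for the sketch, not Grab’s actual training configuration.

```python
# Minimal LoRA sketch with Hugging Face PEFT: small trainable matrices are added
# while the original weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")  # placeholder base model
config = LoraConfig(
    r=8,                                   # rank of the added low-rank matrices
    lora_alpha=16,                         # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],   # assumed attention projections to adapt
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # only a small fraction of weights are trainable
```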
Grab trained the model on curated document data that included various document templates in multiple languages. The performance showed promise for documents with Latin scripts. The LoRA fine-tuned Qwen2-VL 2B achieved high field-level accuracy for Indonesian documents.
However, the fine-tuned model struggled with two categories of documents:
First, it had difficulty with documents containing non-Latin scripts, such as Thai and Vietnamese.
Second, it performed poorly on unstructured layouts with small, dense text.
The experiments revealed a key limitation. While open-source Vision LLMs often have extensive multilingual text corpus coverage for the language model decoder’s pre-training, they lack visual examples of text in Southeast Asian languages during vision encoder training. The language model might understand Thai text, but the vision encoder had never learned to recognize what Thai characters look like in images. This insight drove the decision to pursue full parameter fine-tuning.
Drawing from the Large Language and Vision Assistant (LLAVA) methodology, Grab implemented a two-stage training approach:
In Stage 1, called continual pre-training, the team trained only the vision components of the model using synthetic OCR datasets created for Bahasa Indonesia, Thai, Vietnamese, and English. This stage helped the model learn the unique visual patterns of Southeast Asian scripts. During this stage, the language model remained frozen, meaning its weights were not updated.
In Stage 2, called full-parameter fine-tuning, Grab fine-tuned the entire model. This included the vision encoder, the projector, and the language model. The team used task-specific document data for this training. All components of the model were now trainable and could be optimized together for the document extraction task.
The results were significant. For example, the Thai document accuracy increased by 70 percentage points from the baseline. Vietnamese document accuracy rose by 40 percentage points from the baseline. Indonesian documents saw a 15 percentage point improvement, and Philippine documents improved by 6 percentage points.
The fully fine-tuned Qwen2-VL 2B model delivered substantial improvements, especially on documents that the LoRA model had struggled with.
While the 2B model succeeded, full fine-tuning pushed the limits of available GPUs.
To optimize resource usage and create a model perfectly tailored to their needs, Grab decided to build a lightweight Vision LLM with approximately 1 billion parameters from scratch.
The strategy involved combining the best components from different models. Grab took the powerful vision encoder from the larger Qwen2-VL 2B model, which had proven effective at understanding document images. The team paired it with the compact and efficient language decoder from the Qwen2.5 0.5B model. They connected these components with an adjusted projector layer to ensure seamless communication between the vision encoder and language decoder.
This combination created a custom Vision LLM with approximately 1 billion parameters, optimized for both training and deployment.
Grab trained this new model using a comprehensive four-stage process:
Stage 1 focused on projector alignment. The first step was to train the new projector layer to ensure the vision encoder and language decoder could communicate effectively. Without proper alignment, the language model would not be able to interpret the vision encoder’s outputs correctly.
Stage 2 involved vision tower enhancement. The team trained the vision encoder on a vast and diverse set of public multimodal datasets. These datasets covered tasks like visual question answering, general OCR, and image captioning. This stage improved the model’s foundational visual understanding across various scenarios.
Stage 3 centered on language-specific visual training. Grab trained the model on two types of synthetic OCR data specific to Southeast Asian languages. This stage proved critical. Without it, performance on non-Latin documents dropped by as much as 10 percentage points. This stage ensured the vision encoder could recognize the specific visual characteristics of Thai, Vietnamese, and other regional scripts.
Stage 4 completed the process with task-centric fine-tuning. The team performed full-parameter fine-tuning on the custom 1B model using the curated document dataset. This final stage optimized the entire system for the specific production use case of document information extraction.
The final 1B model achieved remarkable results across two key metrics: accuracy and latency.
For accuracy, the model performed comparably to the larger 2B model, staying within a 3 percentage point accuracy gap across most document types. The model also maintained strong generalization when trained on quality-augmented datasets, meaning it could handle variations it had not seen during training.
For latency, the results were even more impressive. The 1B model achieved 48 percent faster processing at the P50 latency (median response time), 56 percent faster at P90 latency (90th percentile), and 56 percent faster at P99 latency (99th percentile, representing worst-case scenarios).
These latency improvements are particularly important. Grab identified that one of the biggest weaknesses of external APIs like ChatGPT or Gemini was the P99 latency, which can easily be 3 to 4 times higher than the P50 latency. This variability would not be acceptable for large-scale production rollouts where consistent performance is essential.
The project yielded several important insights that can guide similar efforts.
Full parameter fine-tuning proved superior to LoRA for specialized, non-Latin script domains. While LoRA is efficient, it cannot match the performance gains of updating all model parameters when dealing with significantly different data distributions.
Lightweight models can be highly effective. A smaller model of approximately 1 billion parameters, built from scratch and trained comprehensively, can achieve near state-of-the-art results. This validates the approach of custom architecture over simply using the largest available model.
The choice of base model matters significantly. Starting with a model that has native support for target languages is crucial for success. Trying to force a model to learn languages it was not designed for leads to suboptimal results.
Data quality plays a critical role. Meticulous dataset preprocessing and augmentation are as important as model architecture in achieving consistent and accurate results. The effort invested in building Documint and creating synthetic datasets directly contributed to the final model’s success.
Finally, native resolution support is transformative for OCR tasks. A model that can handle dynamic image resolutions preserves text integrity and dramatically improves OCR capabilities. This feature prevents the distortion that occurs when images are resized to fit fixed input dimensions.
Grab’s journey of building a Vision LLM demonstrates that specialized Vision LLMs can effectively replace traditional OCR pipelines with a single, unified, highly accurate model. This opens new possibilities for document processing at scale.
The project shows that with strategic training approaches, high-quality data preparation, and thoughtful model architecture decisions, smaller specialized models can outperform larger general-purpose alternatives. The resulting system processes documents faster and more accurately than previous solutions while using fewer computational resources.
Grab continues to enhance these capabilities. The team is developing Chain of Thought-based OCR and KIE models to strengthen generalization and tackle even more diverse document scenarios. They are also extending support to all Grab markets, bringing advanced document processing to Myanmar, Cambodia, and beyond.
2026-02-03 00:31:12
One of the clearest AI predictions for 2026: models won’t be the bottleneck—context will. As AI agents pull from vector stores, session state, long-term memory, SQL, and more, finding the right data becomes the hard part. Miss critical context and responses fall apart. Send too much and latency and costs spike.
Context engines emerge as the fix. A single layer to store, index, and serve structured and unstructured data, across short- and long-term memory. The result: faster responses, lower costs, and AI apps that actually work in production.
When we interact with modern large language models like GPT, Claude, or Gemini, we are witnessing a process fundamentally different from how humans form sentences. While we naturally construct thoughts and convert them into words, LLMs operate through a cyclical conversion process.
Understanding this process reveals both the capabilities and limitations of these powerful systems.
At the heart of most modern LLMs lies an architecture called a transformer. Introduced in 2017, transformers are sequence prediction algorithms built from neural network layers. The architecture has three essential components:
An embedding layer that converts tokens into numerical representations.
Multiple transformer layers where computation happens.
An output layer that converts results back into text.
See the diagram below:
Transformers process all words simultaneously rather than one at a time, enabling them to learn from massive text datasets and capture complex word relationships.
In this article, we will look at how the transformer architecture works in a step-by-step manner.
Before any computation can happen, the model must convert text into a form it can work with. This begins with tokenization, where text gets broken down into fundamental units called tokens. These are not always complete words. They can be subwords, word fragments, or even individual characters.
Consider this example input: “I love transformers!” The tokenizer might break this into: [“I”, “ love”, “ transform”, “ers”, “!”]. Notice that “transformers” became two separate tokens. Each unique token in the vocabulary gets assigned a unique integer ID:
“I” might be token 150
“love” might be token 8942
“transform” might be token 3301
“ers” might be token 1847
“!” might be token 254
These IDs are arbitrary identifiers with no inherent relationships. Tokens 150 and 151 are not similar just because their numbers are close. The overall vocabulary typically contains 50,000 to 100,000 unique tokens that the model learned during training.
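As a quick sketch, the tiktoken library shows this text-to-ID mapping in practice; the exact splits and IDs shown above are illustrative and vary from tokenizer to tokenizer.

```python
# Tokenization sketch: text in, integer token IDs out.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("I love transformers!")
print(ids)                             # integer IDs; exact values depend on the tokenizer
print([enc.decode([i]) for i in ids])  # the token strings those IDs stand for
```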
Neural networks cannot work directly with token IDs because they are just fixed identifiers. Each token ID gets mapped to a vector, a list of continuous numbers usually containing hundreds or thousands of dimensions. These are called embeddings.
Here is a simplified example with five dimensions (real models may use 768 to 4096):
Token “dog” becomes [0.23, -0.67, 0.45, 0.89, -0.12]
Token “wolf” becomes [0.25, -0.65, 0.47, 0.91, -0.10]
Token “car” becomes [-0.82, 0.34, -0.56, 0.12, 0.78]
Notice how “dog” and “wolf” have similar numbers, while “car” is completely different. This creates a semantic space where related concepts cluster together.
Why do we need multiple dimensions? With just one number per word, we quickly run into contradictions. For example:
“stock” equals 5.2 (financial term)
“capital” equals 5.3 (similar financial term)
“rare” equals -5.2 (antonym of “stock” in its everyday sense of common or standard)
“debt” equals -5.3 (antonym of “capital”)
Now “rare” and “debt” end up with nearly identical values, implying they are closely related, which makes no sense. Hundreds of dimensions allow the model to represent complex relationships without such contradictions.
In this space, we can perform mathematical operations. The embedding for “king” minus “man” plus “woman” approximately equals “queen.” These relationships emerge during training from patterns in text data.
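We can check this intuition with the toy vectors above and cosine similarity, a standard way to measure how close two embeddings are.

```python
# Cosine similarity over the illustrative 5-dimensional embeddings from above.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog  = np.array([0.23, -0.67, 0.45, 0.89, -0.12])
wolf = np.array([0.25, -0.65, 0.47, 0.91, -0.10])
car  = np.array([-0.82, 0.34, -0.56, 0.12, 0.78])

print(cosine(dog, wolf))  # close to 1.0: related concepts cluster together
print(cosine(dog, car))   # much lower: unrelated concepts sit far apart
```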
Transformers do not inherently understand word order. Without additional information, “The dog chased the cat” and “The cat chased the dog” would look identical because both contain the same tokens.
The solution is positional embeddings. Every position gets mapped to a position vector, just like tokens get mapped to meaning vectors.
For the token “dog” appearing at position 2, it might look like the following:
Word embedding: [0.23, -0.67, 0.45, 0.89, -0.12]
Position 2 embedding: [0.05, 0.12, -0.08, 0.03, 0.02]
Combined (element-wise sum): [0.28, -0.55, 0.37, 0.92, -0.10]
This combined embedding captures both the meaning of the word and its position in the sequence. This is also what flows into the transformer layers.
The transformer layers implement the attention mechanism, which is the key innovation that makes these models so powerful. Each transformer layer operates using three components for every token: queries, keys, and values. We can think of this as a fuzzy dictionary lookup where the model compares what it is looking for (the query) against all possible answers (the keys) and returns weighted combinations of the corresponding values.
Let us walk through a concrete example. Consider the sentence: “The cat sat on the mat because it was comfortable.”
When the model processes the word “it,” it needs to determine what “it” refers to. Here is what happens:
First, the embedding for “it” generates a query vector asking essentially, “What noun am I referring to?”
Next, this query is compared against the keys from all previous tokens. Each comparison produces a similarity score. For example:
“The” (article) generates score: 0.05
“cat” (noun) generates score: 8.3
“sat” (verb) generates score: 0.2
“on” (preposition) generates score: 0.03
“the” (article) generates score: 0.04
“mat” (noun) generates score: 4.1
“because” (conjunction) generates score: 0.1
The raw scores are then converted into attention weights that sum to 1.0. For example:
“cat” receives attention weight: 0.75 (75 percent)
“mat” receives attention weight: 0.20 (20 percent)
All other tokens: 0.05 total (5 percent combined)
Finally, the model takes the value vectors from each token and combines them using these weights. For example:
Output = (0.75 × Value_cat) + (0.20 × Value_mat) + (0.03 × Value_the) + ...
The value from “cat” contributes 75 percent to the output, “mat” contributes 20 percent, and everything else is nearly ignored. This weighted combination becomes the new representation for “it” that captures the contextual understanding that “it” most likely refers to “cat.”
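As a rough numeric sketch of that weighting step, using the similarity scores above and toy value vectors: note that a plain softmax of these raw scores concentrates even more heavily on “cat” than the illustrative 75/20 split in the text, since real attention also rescales scores before the softmax.

```python
# Attention sketch for the token "it": softmax the scores, then take a
# weighted combination of the value vectors.
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "because"]
scores = np.array([0.05, 8.3, 0.2, 0.03, 0.04, 4.1, 0.1])  # query-key similarities

weights = np.exp(scores) / np.exp(scores).sum()             # attention weights, sum to 1.0
print(dict(zip(tokens, weights.round(3))))                  # "cat" dominates, "mat" second

values = np.random.randn(len(tokens), 5)                    # toy 5-dim value vectors
new_repr = weights @ values                                 # weighted combination
print(new_repr)  # the new, context-aware representation for "it"
```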
This attention process happens in every transformer layer, but each layer learns to detect different patterns.
Early layers learn basic patterns like grammar and common word pairs. When processing “cat,” these layers might heavily attend to “The” because they learn that articles and their nouns are related.
Middle layers learn sentence structure and relationships between phrases. They might figure out that “cat” is the subject of “sat” and that “on the mat” forms a prepositional phrase indicating location.
Deep layers extract abstract meaning. They might understand that this sentence describes a physical situation and implies the cat is comfortable or resting.
Each layer refines the representation progressively. The output of one layer becomes the input for the next, with each layer adding more contextual understanding.
Importantly, only the final transformer layer needs to predict an actual token. All intermediate layers perform the same attention operations but simply transform the representations to be more useful for downstream layers. A middle layer does not output token predictions. Instead, it outputs refined vector representations that flow to the next layer.
This stacking of many layers, each specializing in different aspects of language understanding, is what enables LLMs to capture complex patterns and generate coherent text.
After flowing through all layers, the final vector must be converted to text. The unembedding layer compares this vector against every token embedding and produces scores.
For example, to complete “I love to eat,” the unembedding might produce:
“pizza”: 65.2
“tacos”: 64.8
“sushi”: 64.1
“food”: 58.3
“barbeque”: 57.9
“car”: -12.4
“42”: -45.8
These raw scores (called logits) get converted to probabilities using softmax:
“pizza”: 28.3 percent
“tacos”: 24.1 percent
“sushi”: 18.9 percent
“food”: 7.2 percent
“barbeque”: 6.1 percent
“car”: 0.0001 percent
“42”: 0.0000001 percent
Tokens with similar scores (65.2 versus 64.8) receive similar probabilities (28.3 versus 24.1 percent), while low-scoring tokens get near-zero probabilities.
The model does not simply pick the highest-probability token. Instead, it typically samples from this distribution at random. Think of a roulette wheel where each token gets a slice proportional to its probability. Pizza gets 28.3 percent, tacos get 24.1 percent, and 42 gets a microscopic slice.
The reason for this randomness is that always picking a specific value like “pizza” would create repetitive, unnatural output. Random sampling weighted by probability allows selection of “tacos,” “sushi,” or “barbeque,” producing varied, natural responses. Occasionally, a lower-probability token gets picked, leading to creative outputs.
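Here is a small sketch of that softmax-and-sample step using the scores above. The percentages quoted in the text correspond to a softer, temperature-scaled distribution, so a plain softmax comes out more peaked, but the mechanics are the same.

```python
# Turn unembedding scores into probabilities, then sample one token.
import numpy as np

tokens = ["pizza", "tacos", "sushi", "food", "barbeque", "car", "42"]
scores = np.array([65.2, 64.8, 64.1, 58.3, 57.9, -12.4, -45.8])

probs = np.exp(scores - scores.max())   # subtract the max for numerical stability
probs /= probs.sum()                    # softmax: probabilities sum to 1.0
print(dict(zip(tokens, probs.round(4))))

rng = np.random.default_rng()
print(rng.choice(tokens, p=probs))      # weighted "roulette wheel" sampling
```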
The generation process repeats for every token. Let us walk through an example where the initial prompt is “The capital of France.” Here’s how different cycles go through the transformer:
Cycle 1:
Input: [“The”, “capital”, “of”, “France”]
Process through all layers
Sample: “is” (80 percent)
Output so far: “The capital of France is”
Cycle 2:
Input: [“The”, “capital”, “of”, “France”, “is”] (includes new token)
Process through all layers (5 tokens now)
Sample: “Paris” (92 percent)
Output so far: “The capital of France is Paris”
Cycle 3:
Input: [“The”, “capital”, “of”, “France”, “is”, “Paris”] (6 tokens)
Process through all layers
Sample: “.” (65 percent)
Output so far: “The capital of France is Paris.”
Cycle 4:
Input: [“The”, “capital”, “of”, “France”, “is”, “Paris”, “.”] (7 tokens)
Process through all layers
Sample: [EoS] token (88 percent)
Stop the loop
Final output: “The capital of France is Paris.”
The [EoS] or end-of-sequence token signals completion. Each cycle processes all previous tokens. This is why generation can slow as responses lengthen.
This is called autoregressive generation because each output depends on all previous outputs. If an unusual token gets selected (perhaps “chalk” with 0.01 percent probability in “I love to eat chalk”), all subsequent tokens will be influenced by this choice.
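A minimal sketch of this loop follows, with a canned stand-in for the model so the cycle structure is visible; a real model would run the full transformer stack and sample at each step.

```python
# Autoregressive generation sketch: feed the growing sequence back in until
# an end-of-sequence token appears.
EOS = "[EoS]"

def model_step(tokens):
    # Stand-in for "process through all layers + sample"; returns a canned
    # continuation for this example prompt, purely for illustration.
    script = {4: "is", 5: "Paris", 6: ".", 7: EOS}
    return script.get(len(tokens), EOS)

tokens = ["The", "capital", "of", "France"]
while True:
    next_token = model_step(tokens)   # each cycle re-processes all previous tokens
    if next_token == EOS:
        break
    tokens.append(next_token)

print(" ".join(tokens))               # The capital of France is Paris .
```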
The transformer flow operates in two contexts: training and inference.
During training, the model learns language patterns from billions of text examples. It starts with random weights and gradually adjusts them. Here is how training works:
Training text: “The cat sat on the mat.”
Model receives: “The cat sat on the”
With random initial weights, the model might predict:
“banana”: 25 percent
“car”: 22 percent
“mat”: 3 percent (correct answer has low probability)
“elephant”: 18 percent
The training process calculates the error (mat should have been higher) and uses backpropagation to adjust every weight:
Embeddings for “on” and “the” get adjusted
Attention weights in all 96 layers get adjusted
Unembedding layer gets adjusted
Each adjustment is tiny (0.245 to 0.247), but it accumulates across billions of examples. After seeing “sat on the” followed by “mat” thousands of times in different contexts, the model learns this pattern. Training takes weeks on thousands of GPUs and costs millions of dollars. Once complete, weights are frozen.
During inference, the transformer runs with frozen weights:
User query: “Complete this: The cat sat on the”
The model processes the input with its learned weights and outputs: “mat” (85 percent), “floor” (8 percent), “chair” (3 percent). It samples “mat” and returns it. No weight changes occur.
The model used its learned knowledge but did not learn anything new. The conversations do not update model weights. To teach the model new information, we would need to retrain it with new data, which requires substantial computational resources.
See the diagram below that shows the various steps in an LLM execution flow:
The transformer architecture provides an elegant solution to understanding and generating human language. By converting text to numerical representations, using attention mechanisms to capture relationships between words, and stacking many layers to learn increasingly abstract patterns, transformers enable modern LLMs to produce coherent and useful text.
This process involves seven key steps that repeat for every generated token: tokenization, embedding creation, positional encoding, processing through transformer layers with attention mechanisms, unembedding to scores, sampling from probabilities, and decoding back to text. Each step builds on the previous one, transforming raw text into mathematical representations that the model can manipulate, then back into human-readable output.
Understanding this process reveals both the capabilities and limitations of these systems. In essence, LLMs are sophisticated pattern-matching machines that predict the most likely next token based on patterns learned from massive datasets.
2026-02-01 00:31:08
Richard Socher and Bryan McCann are among the most-cited AI researchers in the world. They just released 35 predictions for 2026. Three that stand out:
The LLM revolution has been “mined out” and capital floods back to fundamental research
“Reward engineering” becomes a job; prompts can’t handle what’s coming next
Traditional coding will be gone by December; AI writes the code and humans manage it
This week’s system design refresher:
HTTP/2 over TCP vs HTTP/3 over QUIC
How Cursor Agent Works
How Git Really Stores Your Data
How NAT Works
Building a Computer Vision App on Ring APIs
We’re hiring at ByteByteGo
HTTP/2 vs HTTP/3 looks like an HTTP upgrade. It’s actually a transport-layer rethink.
HTTP/2 fixed a big problem in HTTP/1.1: too many connections. It introduced multiplexing, allowing multiple requests and responses to share a single connection. On paper, that sounds ideal.
But under the hood, HTTP/2 still runs on TCP. All streams share the same TCP connection, the same ordering, and the same congestion control. When a single TCP packet is lost, TCP pauses delivery until it’s retransmitted.
Since packets can carry data from multiple streams, one loss ends up blocking all streams. That’s TCP head-of-line blocking. Multiplexed at the HTTP layer, serialized at the transport layer.
HTTP/3 takes a different approach. Instead of TCP, it runs over QUIC, which is built on UDP. QUIC moves multiplexing down into the transport layer itself.
Each stream is independent, with its own ordering and recovery. If a packet is lost, only the affected stream waits. The others keep flowing. Same idea at the HTTP layer. Very different behavior on the wire.
HTTP/2: multiplexing above TCP
HTTP/3: multiplexing inside the transport
Over to you: Have you actually seen TCP head-of-line blocking show up in real systems, or is it mostly theoretical in your experience?
Cursor recently shipped Composer, its agentic coding model, and shared that the agent can be ~4× faster!
We worked with the Cursor team, particularly Lee Robinson, to understand how the system is put together, and what drives the speed.
A coding agent is a system that can take a task, explore a repo, edit multiple files, and iterate until the build and tests pass.
Inside Cursor, a router first picks a suitable coding model (including Composer) to handle the request.
The system then starts a loop: retrieve the most relevant code (context retrieval), use tools to open and edit files, and run commands in a sandbox. Once the tests pass, the task is complete.
Cursor uses three key techniques to keep this loop fast:
Mixture-of-Experts (MoE): A sparse MoE architecture activates only a subset of model weights per token.
Speculative decoding: a smaller model drafts multiple tokens at once, then a larger model verifies them in parallel to reduce latency.
Context compaction: summarize older steps and keep only the active working set so the prompt stays relevant and short as iterations continue.
Ever wondered what actually happens inside Git when you run commands like add, commit, or checkout? Most developers use Git every day, but very few know what’s going on under the hood.
Git has two layers:
Porcelain (user-facing commands): add, commit, checkout, rebase, etc.
Plumbing (low-level building blocks): hash-object, cat-file, read-tree, update-index, and more.
When you trigger a Git command:
Your porcelain command is translated by Git
It calls lower-level plumbing operations
Plumbing writes directly into the .git directory (Git’s entire internal database)
Inside the .git directory: Git stores everything it needs to reconstruct your repo.
objects/ : all file content and metadata stored by hash
refs/ : branches and tags
index : staging area
config : repo configuration
HEAD : current branch pointer
The .git folder is your repository. If you delete it, the project loses its entire history.
Everything in Git is built from just four objects:
blob : file contents
tree : directories
commit : metadata + parents
tag : annotated reference
Over to you: Which Git command has confused you the most in real-world projects?
Every device in your home probably shares the same public IP, yet each one browses, streams, and connects independently.
Ever wondered how that’s even possible?
That magic is handled by NAT (Network Address Translation), one of the silent workhorses of modern networking. It’s the reason IPv4 hasn’t run out completely, and why your router can hide dozens of devices behind a single public IP.
The Core Idea: Inside your local network, devices use private IP addresses that never leave your home or office. Your router, however, uses a single public IP address when talking to the outside world.
NAT rewrites each outbound request so it appears to come from that public IP address, assigning a unique port mapping for every internal connection.
Outbound NAT (Local to Internet)
When a device sends a request:
NAT replaces the private IP address with the public one
Assigns a unique port so it can track the connection
Sends the packet out to the internet as if it originated from the router
Reverse NAT (Internet to Local)
When the response returns:
NAT checks its translation table
Restores the original private IP address and port
Delivers the packet to the correct device on the local network
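A toy sketch of the translation table at the heart of this process is shown below; it is purely illustrative, since real NAT runs inside the router’s packet-forwarding path, and the addresses and ports are made up.

```python
# Toy NAT table: map (private_ip, private_port) <-> public port.
import itertools

PUBLIC_IP = "203.0.113.7"            # assumed public address of the router
_next_port = itertools.count(40000)  # hand out a unique public port per connection
table = {}                           # public_port -> (private_ip, private_port)

def outbound(private_ip, private_port):
    public_port = next(_next_port)
    table[public_port] = (private_ip, private_port)
    return PUBLIC_IP, public_port            # what the internet sees as the source

def inbound(public_port):
    return table[public_port]                # restore the original private destination

src = outbound("192.168.1.23", 51544)
print(src)                 # ('203.0.113.7', 40000)
print(inbound(src[1]))     # ('192.168.1.23', 51544)
```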
Ring just announced a new Appstore. For the first time, third party developers can request early access to Ring APIs.
This changes Ring from a closed product into a programmable platform.
We are one of the first teams working with early Ring API access.
We explored what developers can build with Ring event data and how quickly we can take it to production.
We built a Driveway Derby Detector. Here is how it works at a high level:
We registered our endpoints and received client credentials for Developer APIs (Self-serve through developer.amazon.com/ring)
When the camera detects motion, we get notified on the webhook (< 30 min integration)
We pull the associated video clips (< 30 min integration)
We run the clip through a YOLO-based object detection model (YMMV based on your application)
We emit the data from the model to a DynamoDB database
We wrote an application that creates visuals with various graphs to flag the wild drivers in our family who speed into our driveway
If you want to try this yourself, you can request early access here
I am hiring for 2 roles: Technical Deep Dive Writer (System Design or AI Systems), and Lead Instructor (Building the World’s Most Useful AI Cohort).
We are looking for exceptional people who love teaching and enjoy breaking down complex ideas. You will work very closely with me to produce deep, accurate, and well structured technical content. The goal is not volume. The goal is to set the quality bar for how system design and modern AI systems are explained.
If you are interested, please send your resume along with a short note on why you are excited about the role to [email protected]
Job descriptions are below.
Technical Deep Dive Writer
Lead Instructor, Building the World’s Most Popular AI Cohort
2026-01-30 00:30:59
In modern software development, APIs serve as the critical communication layer between clients and backend services.
Whether we are building a web application, mobile app, or any internet-based system, the API layer acts as the primary interface through which clients access functionality and data. As our applications grow and attract more users, the ability to scale this API layer becomes increasingly important for maintaining performance and delivering a positive user experience.
API scalability refers to the system’s ability to handle increasing amounts of traffic and requests without degrading performance. As applications gain popularity, they inevitably face surges in user demand. Without proper scaling mechanisms, these traffic spikes can lead to slow response times, timeouts, or even complete system failures.
In this article, we will learn how to scale APIs effectively using different strategies.