EP193: Database Types You Should Know in 2025

2025-12-14 00:30:43

8 Insights into Real-World Cloud Security Postures (Sponsored)

To better understand the vulnerabilities and threats facing modern DevOps organizations, Datadog analyzed security posture data from a sample of thousands of organizations that use AWS, Azure, or Google Cloud.

In this report, you’ll gain valuable cloud security insights based on this research including:

  • How long-lived credentials create opportunities for attackers to breach cloud environments

  • Adoption of proactive cloud security mechanisms such as S3 Public Access Block or IMDSv2 in AWS

  • Most common risks when using managed Kubernetes distributions

Read the report


This week’s system design refresher:

  • Transformers Step-by-Step Explained (YouTube video)

  • Database Types You Should Know in 2025

  • Apache Kafka vs. RabbitMQ

  • The HTTP Mindmap

  • How DNS Works

  • SPONSOR US


Transformers Step-by-Step Explained (Attention Is All You Need)


Database Types You Should Know in 2025

There’s no such thing as a one-size-fits-all database anymore. Modern applications rely on multiple database types, from real-time analytics to vector search for AI. Knowing which type to use can make or break your system’s performance.

  • Relational: Traditional row-and-column databases, great for structured data and transactions.

  • Columnar: Optimized for analytics, storing data by columns for fast aggregations.

  • Key-Value: Stores data as simple key–value pairs, enabling fast lookups.

  • In-memory: Stores data in RAM for ultra-low latency lookups, ideal for caching or session management.

  • Wide-Column: Handles massive amounts of semi-structured data across distributed nodes.

  • Time-series: Specialized for metrics, logs, and sensor data with time as a primary dimension.

  • Immutable Ledger: Ensures tamper-proof, cryptographically verifiable transaction logs.

  • Graph: Models complex relationships, perfect for social networks and fraud detection.

  • Document: Flexible JSON-like storage, great for modern apps with evolving schemas.

  • Geospatial: Manages location-aware data such as maps, routes, and spatial queries.

  • Text-search: Full-text indexing and search with ranking, filters, and analytics.

  • Blob: Stores unstructured objects like images, videos, and files.

  • Vector: Powers AI/ML apps by enabling similarity search across embeddings.
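To make the vector type concrete, here is a minimal sketch of the similarity search a vector database performs internally, using plain NumPy cosine similarity. The embeddings and query are made-up examples; real systems add approximate nearest-neighbor indexes (such as HNSW or IVF) to scale this to millions of embeddings.

```python
import numpy as np

# Toy "index" of stored embeddings (rows) and a query vector; values are arbitrary.
embeddings = np.random.rand(1000, 384)          # 1,000 stored items, 384-dimensional vectors
query = np.random.rand(384)

# Cosine similarity between the query and every stored embedding.
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
scores = embeddings @ query / norms

top_k = np.argsort(scores)[::-1][:5]            # indices of the 5 most similar items
print(top_k, scores[top_k])
```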

Over to you: Which database type do you think will grow fastest in the next 5 years?


Apache Kafka vs. RabbitMQ

Kafka and RabbitMQ both handle messages, but they solve fundamentally different problems. Understanding the difference matters when designing distributed systems.

Kafka is a distributed log. Producers append messages to partitions. Those messages stick around based on retention policy, not because someone consumed them. Consumers pull messages at their own pace using offsets. You can rewind, replay, reprocess everything. It is designed for high throughput event streaming where multiple consumers need the same data independently.

RabbitMQ is a message broker. Producers publish messages to exchanges. Those exchanges route to queues based on binding keys and patterns (direct, topic, fanout). Messages get pushed to consumers and then deleted once acknowledged. It is built for task distribution and traditional messaging workflows.
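Here is a minimal, library-free sketch of the two consumption models described above: a Kafka-style log keeps every message and lets each consumer track its own offset (so it can rewind and replay), while a RabbitMQ-style queue deletes a message once a consumer acknowledges it. This is a conceptual illustration, not either system's actual implementation.

```python
# Kafka-style log: messages are appended and retained; consumers track their own offsets.
class Log:
    def __init__(self):
        self.records = []                        # retained per retention policy, not per consumption

    def append(self, msg):
        self.records.append(msg)

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]   # each consumer pulls from its own offset

# RabbitMQ-style queue: the broker pushes a message and deletes it once acknowledged.
from collections import deque

class Queue:
    def __init__(self):
        self.pending = deque()

    def publish(self, msg):
        self.pending.append(msg)

    def deliver(self):
        return self.pending[0] if self.pending else None    # pushed to one consumer

    def ack(self):
        self.pending.popleft()                               # gone for everyone after the ack

log = Log()
log.append("order-created")
print(log.read(offset=0))         # any consumer can re-read from offset 0 at any time

q = Queue()
q.publish("send-welcome-email")
msg = q.deliver()
q.ack()                           # the task is removed; there is no replay
```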

The common mistake is using Kafka like a queue or RabbitMQ like an event log. They’re different tools built for different use cases.

Over to you: If you had to explain when NOT to use Kafka, what would you say?


The HTTP Mindmap

HTTP has evolved from HTTP/1.1 to HTTP/2, and now HTTP/3, which uses the QUIC protocol over UDP for improved performance. Today, it’s the backbone of almost everything on the internet, from browsers and APIs to streaming, cloud, and AI systems.

At the foundation, we have underlying protocols. TCP/IP for IPv4 and IPv6 traffic. Unix domain sockets for local communication. HTTP/3 running over UDP instead of TCP. These handle the actual data transport before HTTP even comes into play.

Security wraps around everything. HTTPS isn’t optional anymore. WebSockets power real-time connections. Web servers manage workloads. CDNs distribute content globally. DNS resolves everything to IPs. Proxies (forward, reverse, and API gateways) route, filter, and secure traffic in between.

Web services exchange data in different formats: REST with JSON, SOAP for enterprise systems, RPC for direct calls, and GraphQL for flexible queries. Crawlers and bots index the web, guided by robots.txt files that set the boundaries.

The network world connects everything: LANs, WANs, and protocols like FTP for file transfers, IMAP/POP3 for email, and BitTorrent for peer-to-peer communication. For observability, packet capture tools like Wireshark and tcpdump, along with OpenTelemetry, let developers peek under the hood to understand performance, latency, and behavior across the stack.

Over to you: HTTP has been evolving for 30+ years. What do you think the next big shift will be?


How DNS Works

You type a domain name and hit enter, but what actually happens before that webpage loads is more complex than most people realize. DNS is the phonebook of the internet, and every request you make triggers a chain of lookups across multiple servers.

Step 1: Someone types bytebytego.com into their browser and presses enter.

Step 2: Before doing anything, the browser looks for a cached IP address. Operating system cache gets checked too.

Step 3: Cache miss triggers a DNS query. The browser sends a query to the configured DNS resolver, usually provided by your ISP or a service like Google DNS or Cloudflare.

Step 4: Resolver checks its own cache.

Step 5-6: If the resolver doesn’t have the answer cached, it asks the root servers, “Where can I find .com?” For bytebytego.com, the root server responds with the address of the .com TLD name server.

Step 7-8: Resolver queries the .com TLD server. TLD server returns the authoritative server address.

Step 9-10: This server has the actual A/AAAA record mapping the domain to an IP address. The resolver finally gets the answer: 172.67.21.11 for bytebytego.com.

Step 11-12: The IP gets cached at the resolver level for future lookups, and returned to the browser.

Step 13-14: The browser stores this for its own future use, and uses the IP to make the actual HTTP request.

Step 15: The web server returns the requested content.

All this happens in milliseconds, before your first page even starts loading.
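If you want to watch this chain yourself, dig +trace bytebytego.com walks the root, TLD, and authoritative servers explicitly. Below is a minimal sketch of step 3, the stub query an application sends to its configured recursive resolver, using the dnspython package (an assumption on our part; the post does not prescribe any tool). The resolver performs steps 5 through 12 on our behalf.

```python
import dns.resolver   # pip install dnspython

# Ask the configured recursive resolver (ISP, Google DNS, Cloudflare, etc.) for the A record.
answer = dns.resolver.resolve("bytebytego.com", "A")
for record in answer:
    print("IP:", record.address)

# The TTL tells caches (resolver, OS, browser) how long they may reuse this answer.
print("TTL:", answer.rrset.ttl, "seconds")
```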

Over to you: Which DNS tools or commands do you rely on most, dig, nslookup, or something else?


Can a web server provide real-time updates?

An HTTP server cannot automatically initiate a connection to a browser. As a result, the web browser is the initiator. What should we do next to get real-time updates from the HTTP server?

Both the web browser and the HTTP server could be responsible for this task.

  • Web browsers do the heavy lifting: short polling or long polling. With short polling, the browser keeps sending requests until it gets the latest data. With long polling, the HTTP server doesn’t return a response until new data has arrived.

  • HTTP server and web browser cooperate: WebSocket or SSE (server-sent events). In both cases, the HTTP server can push the latest data directly to the browser once the connection is established. The difference is that SSE is uni-directional, so the browser cannot send new requests over that connection, while WebSocket is full-duplex, so the browser can keep sending new requests.
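As a concrete illustration of the SSE option, here is a minimal server sketch using Flask (an assumption; any framework that can stream a response works the same way). The server holds the connection open and keeps pushing events; the browser consumes them with EventSource.

```python
import json, time
from flask import Flask, Response

app = Flask(__name__)

@app.route("/updates")
def updates():
    def stream():
        while True:
            payload = json.dumps({"ts": time.time()})   # stand-in for real application data
            yield f"data: {payload}\n\n"                 # SSE wire format: "data: <payload>\n\n"
            time.sleep(1)
    return Response(stream(), mimetype="text/event-stream")

# Browser side (JavaScript): new EventSource("/updates").onmessage = e => console.log(e.data);
```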

Over to you: of the four solutions (long polling, short polling, SSE, WebSocket), which ones are commonly used, and for what use cases?


SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].

How OpenAI, Gemini, and Claude Use Agents to Power Deep Research

2025-12-13 00:30:57

Power your company’s IT with AI (Sponsored)

What if you could spend most of your IT resources on innovation, not maintenance?

The latest report from the IBM Institute for Business Value explores how businesses are using intelligent automation to get more out of their technology, drive growth, and cut the cost of complexity.

Get the insights


Disclaimer: The details in this post have been derived from the details shared online by OpenAI, Gemini, xAI, Perplexity, Microsoft, Qwen, and Anthropic Engineering Teams. All credit for the technical details goes to OpenAI, Gemini, xAI, Perplexity, Microsoft, Qwen, and Anthropic Engineering Teams. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Deep Research has become a standard capability across modern LLM platforms.

ChatGPT, Gemini, and Claude all support tasks that run for long periods of time and gather information from large portions of the public web.

A typical deep research request may involve dozens of searches, several rounds of filtering, and the careful assembly of a final, well-structured report. For example, a query like “list 100 companies working on AI agents in 2025” does not rely on a single search result. It activates a coordinated system that explores a wide landscape of information over 15 to 30 minutes before presenting a final answer.

This article explains how these systems work behind the scenes.

We will walk through the architecture that enables Deep Research, how different LLMs implement it, how agents coordinate with one another, and how the final report is synthesized and validated before being delivered to the user.

High-Level Architecture

Deep Research systems are built from AI agents that cooperate with each other. In this context, an AI agent is a service driven by an LLM that can accept goals, design workflows to achieve those goals, and interact with its environment through tools such as web search or code execution.

See the diagram below to understand the concept of an AI Agent:

At a high level, the architecture begins with the user request. The user’s query is sent into a multi-agent research system. Inside this system, there is usually an orchestrator or lead agent that takes responsibility for the overall research strategy.

The orchestrator receives the query, interprets what the user wants, and then creates a plan for how to answer the question. That plan is broken into smaller pieces and delegated to multiple sub-agents. The most common sub-agents are “web search” agents. Each of these is instructed to search the web for a specific part of the overall topic or a particular sub-task, such as one region, one time period, or one dimension of the question.

Once the web agents finish their work, they return two things:

  • The content they have extracted. This typically takes the form of text snippets, summaries, or key facts.

  • Citations that record exactly where that content came from, such as URLs and page titles.

These results then move into what we can call the “synthesizer” flow. This stage often contains two agents: a synthesizer agent and a citations agent. In some systems, the orchestrator itself also acts as the synthesizer, so a separate agent is not required.

The synthesizer agent takes all the content returned by the web agents and converts it into the final research report. It organizes the information into sections, resolves overlaps, and builds a coherent narrative. The citations agent then reads through the synthesized report and makes sure that each statement is supported by the correct sources. It inserts citations in the right locations in the text, so that the final report is thoroughly backed by the underlying material.

After this synthesis and citation process is complete, the synthesizer (or orchestrator) returns the final, fully cited research report to the user.
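To make the flow concrete, here is a heavily simplified sketch of the orchestrator, sub-agent, and synthesizer roles described above. The functions are placeholders standing in for LLM calls and web searches; no provider's actual API is shown.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(query: str) -> list[str]:
    # Orchestrator: break the query into sub-tasks (an LLM call in a real system).
    return [f"{query} - landscape", f"{query} - key players", f"{query} - recent funding"]

def web_search_agent(sub_task: str) -> dict:
    # Sub-agent: search, read pages, and return findings plus citations (stubbed here).
    return {"task": sub_task,
            "content": f"Findings for '{sub_task}'",
            "citations": [f"https://example.com/{abs(hash(sub_task)) % 1000}"]}

def synthesize(packets: list[dict]) -> str:
    # Synthesizer: merge the packets into one report; a citations pass attaches sources.
    sections = [f"{p['content']} [{', '.join(p['citations'])}]" for p in packets]
    return "\n\n".join(sections)

def deep_research(query: str) -> str:
    sub_tasks = plan(query)
    with ThreadPoolExecutor() as pool:        # sub-agents run in parallel
        packets = list(pool.map(web_search_agent, sub_tasks))
    return synthesize(packets)

print(deep_research("companies working on AI agents in 2025"))
```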

Anthropic has published a high-level diagram of its “Advanced Research” mode, which illustrates such a multi-agent research system in action. It shows the lead agent, the various sub-agents, and the data flowing between them through planning, research, and synthesis.

The Current Landscape of Research Agents

Although the broad idea behind Deep Research is shared across platforms, each major provider implements its own variations.

OpenAI Deep Research

OpenAI’s deep research agent is built around a reasoning model that uses reinforcement learning.

The model is trained to plan multi-step research tasks, decide when to search, when to read, and how to combine information into a final answer. The use of reinforcement learning helps the agent improve over time by rewarding good sequences of tool calls and research decisions.

Gemini Deep Research

Google DeepMind’s Gemini Deep Research system is built on top of the Gemini model, which is multimodal. That means the same system can reason over text, images, and other types of inputs.

For deep research, this allows Gemini to integrate information from documents, web pages, and other media into a combined response. Gemini’s agent uses its planning ability to decide what to look for, how to structure the research, and how to bring everything together into one report.

Claude Advanced Research

Anthropic’s advanced research system uses a clearly defined multi-agent architecture. There is a lead agent that orchestrates several sub-agents running in parallel. Each sub-agent is asked to explore a specific part of the problem space.

For complex topics, this design allows Claude to divide the subject into multiple angles and explore them at the same time, then bring the results back to the orchestrator for synthesis.

Perplexity Deep Research

Perplexity’s deep research agent uses an iterative information retrieval loop.

Instead of a single pass of search and summary, it repeatedly adjusts its retrieval based on new insights discovered along the way.

Perplexity also uses a hybrid architecture that can autonomously select the best underlying models for different parts of the task. For example, one model might be better at summarization while another is better at search interpretation, and the system can route work accordingly.

Grok DeepSearch

Grok DeepSearch is built around a segment-level processing pipeline.

Content is processed in segments, and each segment passes through a credibility assessment stage. Additionally, Grok uses a sparse attention mechanism that allows it to perform concurrent reasoning across multiple pieces of text.

The system can also dynamically allocate resources, switching between retrieval and analysis modes as needed, all inside a secure sandbox environment.

Microsoft Copilot Researcher and Analyst

Microsoft has introduced two related reasoning agents:

  • Researcher is focused on complex, multi-step research tasks that combine web information with a user’s work data. It uses sophisticated orchestration and search capabilities to handle multi-stage questions.

  • Analyst is an advanced data analytics agent that can interpret and transform raw data into useful insights. It uses a chain-of-thought reasoning approach to break down analytical problems, apply appropriate operations, and present the results.

Both Researcher and Analyst are designed to work securely over enterprise data and the public web.

Qwen Deep Research

Alibaba’s Qwen Deep Research is an advanced agent that supports dynamic research blueprinting.

It can generate an initial research plan, then refine that plan interactively. Qwen’s architecture supports concurrent task orchestration, which means that retrieval, validation, and synthesis of information can happen in parallel. This allows the system to retrieve data, verify it, and integrate it into the final output efficiently.

User Query and Initial Planning

The entire deep research workflow starts with a single user query.

Users can phrase requests in many different ways. Some users write very vague prompts such as “tell me everything about AI agents,” while others provide highly detailed, focused instructions. The system must be able to handle this variability and translate the query into a precise, machine-executable research plan.

This initial stage is critical. It converts the user’s often broad or ambiguous request into a clear strategy with specific steps. The quality of the final report is directly tied to the quality of this plan. If the plan is incomplete or misinterprets the user’s intent, the resulting research will miss key information or go in the wrong direction.

See the diagram below:

Different systems handle this planning phase in different ways.

Interactive Clarification (OpenAI)

Some architectures, such as OpenAI’s Deep Research, use an interactive clarification approach. Here, the agent does not immediately start a long research process. Instead, it may ask the user follow-up questions. These questions are designed to refine the research scope, clarify the objectives, and confirm exactly what information the user cares about.

For example, if the user asks for a comparison of technologies, the agent might ask whether the user wants only recent developments, whether specific regions should be included, or whether certain constraints apply. This conversational back-and-forth continues until the agent has a crisp understanding of the user’s needs, at which point it commits to the full research process.

Autonomous Plan Proposal (Gemini)

Other systems, such as Google’s Gemini, take a different path. Rather than asking the user follow-up questions by default, Gemini can autonomously generate a comprehensive multi-step plan based on its interpretation of the initial query. This plan outlines the sub-tasks and research angles the system intends to explore.

Gemini then presents this proposed plan to the user for review and approval. The user can read the plan, make edits, add constraints, or remove unwanted sub-tasks. Once the user is satisfied and approves the plan, the system begins the research process.

Sub-Agent Delegation and Parallel Execution

Once the plan is ready, the system moves from strategy to execution. Instead of a single agent performing all steps, the lead agent delegates work to multiple sub-agents that “work for” it.

The diagram below from Anthropic shows how the lead agent assigns work to specialized agents that run in parallel and then gather results back into a central synthesis process.

Task Delegation and Sub-Agent Specialization

The lead agent delegates each sub-task using a structured API call. Technically, this means the orchestrator calls another service (the sub-agent) with a payload that contains everything the sub-agent needs:

  • A precise prompt that explains its specific research goal, such as “Investigate the financial performance of NVIDIA in Q4 2024.”

  • Any constraints, such as time ranges, data sources, or limits on how many pages to read.

  • Access permissions and tool configuration, so the sub-agent knows which tools it can use.
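A hypothetical payload for such a delegation call might look like the following; the field names are purely illustrative, not any vendor's actual schema.

```python
sub_agent_task = {
    "prompt": "Investigate the financial performance of NVIDIA in Q4 2024.",
    "constraints": {
        "time_range": "2024-10-01/2025-01-31",
        "max_pages": 20,
        "preferred_sources": ["quarterly reports", "earnings call transcripts"],
    },
    "tools": ["web_search", "browse"],          # which tools this sub-agent may call
    "return_format": {"content": "text", "citations": "list of URLs"},
}
```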

Sub-agents are often specialized rather than fully general. While some systems may have general-purpose “research agents,” it is more common to see a pool of agents tuned for particular functions. Examples include:

  • A web search agent specialized in forming effective search queries, interacting with search engines, and interpreting result snippets.

  • A data analysis agent that has access to a code interpreter and can perform statistical analyses, process CSV files, or generate simple visualizations.

By using specialized agents, the system can apply the best tool and approach to each part of the plan, which improves both the accuracy and efficiency of the overall research.

Parallel Execution and Tool Use

A key benefit of this architecture is parallel execution. Since sub-agents are separate services, many of them can run at the same time. One sub-agent might be researching market trends, another might be gathering historical financial data, and a third might be investigating competitor strategies, all in parallel.

However, not all tasks run simultaneously. Some tasks must wait for others to complete. The orchestrator keeps track of dependencies and triggers sub-agents when their inputs are ready.

To interact with the outside world, sub-agents use tools. The agents themselves do not have direct access to the web or files. Instead, they issue tool calls that the system executes on their behalf.

Common tools include:

  • Search tool: The agent calls something like web_search(query="analyst ratings for Microsoft 365 Copilot"). The system sends this query to an external search engine API (such as Google or Bing) and returns a list of URLs and snippets.

  • Browser tool: After receiving search results, the agent can call browse(url="...") to fetch the full content of a webpage. The browser tool returns the page text, which the agent then processes.

  • Code interpreter tool: For numerical or data-heavy tasks, the agent can write Python code and execute it in a secure, sandboxed environment. The code interpreter might read CSV data, compute averages, or run basic analyses. The agent then reads the output and incorporates the findings into its report.
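Under the hood, tool use usually means the agent emits a tool name plus arguments and the host system executes the call on its behalf. A minimal dispatch sketch, with stubbed tool implementations, could look like this:

```python
def web_search(query: str) -> list[dict]:
    # Stub: a real system would call a search engine API here.
    return [{"url": "https://example.com/result", "snippet": f"Snippet about {query}"}]

def browse(url: str) -> str:
    # Stub: a real system would fetch and clean the page text here.
    return f"Full text of {url}"

TOOLS = {"web_search": web_search, "browse": browse}

def execute_tool_call(name: str, arguments: dict):
    # The agent only produces (name, arguments); the host runs the call and returns the result.
    return TOOLS[name](**arguments)

print(execute_tool_call("web_search", {"query": "analyst ratings for Microsoft 365 Copilot"}))
```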

Information Retrieval and Contextual Awareness

As a sub-agent receives data from tools, it must constantly evaluate whether the information is relevant to its goal. This involves:

  • Checking whether the source is authoritative or credible.

  • Cross-referencing facts across multiple pages when possible.

  • Noticing when initial search results are weak and adjusting the query.

For example, if a search returns mostly irrelevant marketing pages, the agent might refine the query with more specific terms or filters. It might add keywords like “PDF,” “quarterly report,” or a specific year to narrow the results.

When the agent finds useful content, it extracts the relevant snippets and stores them along with their original URLs. This pairing of content and citation is essential because it ensures that every piece of information used later in the synthesis stage is traceable back to its source.

Each sub-agent maintains its own short-term memory or “context” of what it has seen so far. This memory allows it to build a coherent understanding of its sub-task and avoid repeating work. When the sub-agent finishes its assignment, it returns a well-structured packet of information that includes both the findings and their citations.

The output of the entire retrieval phase is not yet a single document. Instead, it is a collection of these self-contained information packets from all sub-agents, each focused on a different part of the research problem.

See the diagram below:

Synthesis and Report Generation

Once all sub-agents return their results, the system enters the synthesis phase. At this point, the system has a large set of fragmented insights, each tied to a specific part of the research plan. The objective is to transform these pieces into a unified report.

See the diagram below:

Content Aggregation and Thematic Analysis

The orchestrator or synthesizer agent begins by collecting all information packets. It performs a high-level analysis to identify themes, overlaps, and logical connections. For example, insights about market adoption may complement insights about customer sentiment, and both may feed into a broader section of the report.

The synthesizer then constructs a narrative outline for the final document. It decides the structure that best fits the material, whether chronological, thematic, or based on a problem and solution. Redundant information from multiple sub-agents is merged into a single, clean statement.

Narrative Generation and the Citation Process

With the outline ready, the agent begins writing the report. It incorporates extracted facts, creates transitions between sections, and maintains a consistent tone. As it writes, each claim is connected to its source. Some systems assign this step to a dedicated citation agent that reviews the draft and inserts citations in the correct locations.

This stage is important because it prevents hallucinations and ensures that every assertion in the final report can be traced back to a verified source.

The outcome is a polished research document supported by citations and, when needed, a formal bibliography.

Conclusion

Deep Research systems rely on multi-agent architectures that coordinate planning, parallel exploration, and structured synthesis.

Specialized sub-agents retrieve information, evaluate it, and return detailed findings. The orchestrator or synthesizer then turns this distributed knowledge into a coherent and well-cited report. As LLMs improve in planning, reasoning, and tool use, these systems will continue to become more capable, more reliable, and more comprehensive.

References:



Must-Know System Performance Strategies

2025-12-12 00:31:24

When we start building software, we often think of performance as simply how fast our application runs.

We might equate performance to making a function run faster or optimizing a short piece of code. However, as we move into professional software development and system architecture, we must adopt a more strategic and precise definition of what performance truly is.

We must realize that system performance is not just an abstract idea of “speed”. Instead, it is a formal, measurable quality defined by industry standards.

This standardized quality attribute is called Performance Efficiency. Performance is formally defined as the degree to which a software system or component meets its responsiveness and throughput requirements within the limits of its available resources.

In simple terms, performance is a strategic ratio: it measures the useful work we get done compared to the resources (like time, CPU, and memory) we use up while operating under a specific workload. A high-performing system maximizes work output while minimizing resource waste.
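As a back-of-the-envelope illustration of that ratio (with made-up numbers, not figures from this article):

```python
requests_completed = 12_000      # useful work done in the observation window
window_seconds = 60              # wall-clock duration of the window
cpu_seconds_used = 45            # CPU time consumed across all cores

throughput = requests_completed / window_seconds     # 200 requests per second
efficiency = requests_completed / cpu_seconds_used   # ~267 requests per CPU-second

print(f"throughput = {throughput:.0f} req/s, efficiency = {efficiency:.0f} req per CPU-second")
```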

In this article, we will look at system performance in detail, understand how it can be measured, and investigate key strategies that can be used to improve the performance of a system on different levels.

Measuring Performance

Read more

The $250 Million Paper

2025-12-11 00:30:49

Build product instead of babysitting prod. (Sponsored)

Engineering teams at Coinbase, MSCI, and Zscaler have at least one thing in common: they use Resolve AI’s AI SRE to make MTTR 5x faster and increase dev productivity by up to 75%.

When it comes to production issues, the numbers hurt: 54% of significant outages exceed $100,000 in losses, and downtime costs the Global 2000 roughly $400 billion annually.

It’s why eng teams leverage our AI SRE to correlate code, infrastructure, and telemetry, and provide real-time root cause analysis, prescriptive remediation, and continuous learning.

Time to try an AI SRE? This guide covers:

  1. The ROI of an AI SRE

  2. Whether you should build or buy

  3. How to assess AI SRE solutions

Get the free guide


Recently, a twenty-four-year-old researcher named Matt Deitke received a $250 million offer from Meta’s Superintelligence Lab. While the exact details behind this offer are not public, many people believe his work on multimodal models, especially the paper called Molmo, played a major role. Molmo stands out because it shows how to build a strong vision language model from the ground up without relying on any closed proprietary systems. This is rare in a landscape where most open models indirectly depend on private APIs for training data.

This article explains what Molmo is, why it matters, and how it solves a long-standing problem in vision language modeling. It also walks through the datasets, training methods, and architectural decisions that make Molmo unique.

The Core Problem Molmo Solves

Vision language models, or VLMs, are systems like GPT-4o or Google Gemini that can understand images and text together. We can ask them to describe a picture, identify objects, answer questions about a scene, or perform reasoning that requires both visual and textual understanding.

See the diagram below:

Many open weight VLMs exist today, but most of them rely on a training approach called distillation. In distillation, a smaller student model learns by imitating the outputs of a larger teacher model.

The general process looks like this:

  • The teacher sees an image.

  • It produces an output, such as “A black cat sitting on a red couch.”

  • Researchers collect these outputs.

  • The student is trained to reproduce the teacher’s answers.

Developers may generate millions of captions using a proprietary model like GPT-4 Vision, then use those captions as training data for an “open” model. This approach is fast and inexpensive because it avoids large-scale human labeling. However, it creates several serious problems.

  • The first problem is that the result is not truly open. If the student model was trained on labels from a private API, it cannot be recreated without that API. This creates permanent dependence.

  • The second problem is that the community does not learn how to build stronger models. Instead, it learns how to copy a closed model’s behavior. The foundational knowledge stays locked away.

  • The third problem is that performance becomes limited. A student model rarely surpasses its teacher, so the model inherits the teacher’s strengths and weaknesses.

This is similar to copying a classmate’s homework. It might work for the moment, but we do not gain the underlying skill, and if the classmate stops helping, we are stuck.

Molmo was designed to break this cycle. It is trained entirely on datasets that do not rely on existing VLMs. To make this possible, the authors also created PixMo, a suite of human-built datasets that form the foundation for Molmo’s training.

The PixMo Datasets

PixMo is a collection of seven datasets, all created without using any other vision language model. The goal of PixMo is to provide high-quality, VLM-free data that allows Molmo to be trained from scratch.

There are three main components of PixMo datasets:

PixMo Cap: Dense Captions

For pre-training, Molmo needed rich descriptions of images. Typing long captions is slow and tiring, so the researchers used a simple but powerful idea. They asked annotators to speak about each image for sixty to ninety seconds instead of typing.

People naturally describe more when speaking. The resulting captions averaged one hundred and ninety-six words. This is far more detailed than typical datasets like COCO, where captions average around eleven words. The audio was then transcribed, producing a massive dataset of high-quality text. The audio recordings also serve as proof that real humans generated the descriptions.

These long captions include observations about lighting, background blur, object relationships, subtle visual cues, and even artistic style. This gives Molmo a deeper understanding of images than what short captions can provide.

PixMo Points: Pointing and Grounding

PixMo Points may be the most innovative dataset in the project. The team collected more than 2.3 million point annotations.

Each annotation is simply a click on a specific pixel in the image. Since pointing is faster than drawing bounding boxes or segmentation masks, the dataset could scale very quickly.

The point annotations teach Molmo three important abilities:

  • Pointing to objects.

  • Counting objects by pointing to each one.

  • Providing visual explanations by showing where the evidence is located.

This dataset helps the model connect language to precise spatial areas, making it better at grounding its understanding in the image.

PixMo AskModelAnything: Instruction Following

This dataset provides pairs of questions and answers about images. It was created through a human-supervised process.

The steps are:

  • A human annotator writes a question about the image.

  • A language-only LLM produces a draft answer based on OCR text and the dense caption.

  • The annotator reviews the answer.

  • The annotator may accept it, reject it, or ask for a revised version.

Since the LLM only sees text and never sees the image itself, the dataset remains VLM-free. Every answer is human-approved.

Molmo Architecture and Training

Molmo uses the common structure seen in most vision language models.

A Vision Transformer acts as the vision encoder. A large language model handles reasoning and text generation. A connector module aligns visual and language features so both parts can work together.

Although this architecture is standard, Molmo’s training pipeline contains several smart engineering decisions.

One important idea is the overlapping multi-crop strategy. High-resolution images are too large to be processed at once, so they are split into smaller square crops. This can accidentally cut important objects in half. Molmo solves this by letting the crops overlap so that any object on the border of one crop appears fully in another. This helps the model see complete objects more consistently.
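A simplified sketch of such an overlapping crop scheme is shown below. The crop size, overlap, and edge handling are illustrative; they are not Molmo's published parameters.

```python
import numpy as np

def overlapping_crops(image: np.ndarray, crop: int = 336, overlap: int = 56):
    # Crops are taken with a stride smaller than the crop size, so an object cut by one
    # crop boundary appears whole in a neighbouring crop. Edge padding is omitted here.
    stride = crop - overlap
    h, w = image.shape[:2]
    crops = []
    for top in range(0, max(h - crop, 0) + 1, stride):
        for left in range(0, max(w - crop, 0) + 1, stride):
            crops.append(image[top:top + crop, left:left + crop])
    return crops

crops = overlapping_crops(np.zeros((1024, 1024, 3)))   # dummy high-resolution image
print(len(crops), crops[0].shape)                      # 9 crops of shape (336, 336, 3)
```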

Another improvement comes from the quality of PixMo Cap. Many VLMs need a long connector training stage using noisy web data. Molmo does not need this stage because the PixMo Cap dataset is strong enough to train directly. This makes training simpler and reduces noise.

The authors also designed an efficient method for images with multiple questions. Instead of feeding the same image through the model again and again, they encode the image once and then process all questions in a single masked sequence. This reduces training time by more than half.

Molmo also uses text-only dropout during pre-training. Sometimes the model does not get to see the text tokens, so it must rely more heavily on visual information. This prevents it from predicting the next token simply through language patterns and strengthens true visual understanding.

Each of these choices supports the others and increases the value obtained from the dataset.

Molmo’s Advantages

The key advantages provided by Molmo are as follows:

Strong understanding of superior data

The spoken word captions in PixMo Cap give Molmo a richer base of visual knowledge.

Instead of learning from short captions like “A brown dog catching a frisbee,” Molmo sees long descriptions that mention lighting conditions, camera angle, background blur, object texture, emotional cues, and implied motion. This leads to deeper and more detailed visual representations.

New reasoning abilities through pointing

The PixMo Points dataset unlocks new forms of reasoning.

For example, when asked “How many cars are in this image?” many VLMs simply guess a number. Molmo performs a step-by-step process. It points to each car one by one and then gives the final count. This makes the reasoning visible and easy to verify. It also makes errors easier to fix and opens the door to future systems that require pixel-level instructions, such as robots.

Better synergy between data and training

Molmo’s success comes from a combination of high-quality data and a training pipeline built to maximize that data.

Overlapping crops help preserve detail. Efficient batching uses more instruction data in less time. Text-only dropout forces the model to learn from the image. High-quality captions reduce the need for noisy training stages.

These elements reinforce one another and create a cleaner, more effective approach to multimodal training.

Conclusion

The Molmo and PixMo project shows that it is possible to build a powerful vision language model without copying proprietary systems.

It demonstrates that high-quality human data can outperform synthetic data produced by closed models. It also highlights how thoughtful dataset design can simplify training and improve results at the same time.

This first principles approach may be one reason why the work attracted strong interest from a major AI lab. It offers a way for the research community to build strong, reproducible, and truly open multimodal models.



Dropbox Multimedia Search: Making File Search More Useful

2025-12-10 00:30:52

How to stop bots from abusing free trials (Sponsored)

Free trials help AI apps grow, but bots and fake accounts exploit them. They steal tokens, burn compute, and disrupt real users.

Cursor, the fast-growing AI code assistant, uses WorkOS Radar to detect and stop abuse in real time. With device fingerprinting and behavioral signals, Radar blocks fraud before it reaches your app.

Start protecting your app for free →


Disclaimer: The details in this post have been derived from the details shared online by the Dropbox Engineering Team. All credit for the technical details goes to the Dropbox Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

You’re racing against a deadline, and you desperately need that specific image from last month’s campaign or that video clip from a client presentation. You know it exists somewhere in your folders, but where? Was it in that project folder? A shared team drive? Or nested somewhere three levels deep in an old archive?

We’ve all been in this situation at some point, as this is the daily reality for knowledge workers who lose countless hours hunting for the right files within their cloud storage.

The problem becomes even more frustrating with multimedia content. While documents often have descriptive titles and searchable text inside them, images and videos typically come with cryptic default names like IMG_6798 or VID_20240315. Without meaningful labels, these files become nearly impossible to locate unless you manually browse through folders or remember exactly where you saved them.

Dropbox solved this problem by building multimedia search capabilities into Dropbox Dash, their universal search and knowledge management platform.

The challenge their engineering team faced wasn’t just about finding a file anymore. It’s about finding what’s inside that file. And when the folder structure inevitably breaks down, when files get moved or renamed by team members, or when you simply can’t recall the location of what you need, traditional filename-based search falls short.

In this article, we’ll explore how the Dropbox engineering team implemented multimedia search features and the technical challenges they faced along the way.

Challenges of Multimedia Search

Building a search feature for images, videos, and audio files presents a fundamentally different set of problems compared to searching through text documents.

Some of the key challenges are as follows:

  • Storage Costs: The sheer size difference is significant. Image files average about 3X larger than typical documents, while video files clock in at roughly 13X larger. These size differences directly translate to increased storage demands and costs.

  • Compute Intensity: Beyond storage, multimedia files require substantially more processing power to extract useful features. The complexity goes beyond just handling larger files. Unlike text documents, multimedia search needs visual previews at multiple resolutions to be useful, dramatically increasing computational requirements.

  • Ranking Relevance: Dropbox Dash already operated a sophisticated multi-phase retrieval and ranking system optimized for textual content. Extending this to multimedia meant indexing entirely new types of signals, creating query plans that leverage these signals effectively, and handling edge cases to avoid irrelevant results appearing at the top.

  • Preview Generation Dilemma: Users need visual previews to quickly identify the right file, and they need these previews in multiple resolutions for a smooth experience. However, only a small fraction of indexed files actually get viewed during searches. Pre-generating previews for everything would be extremely wasteful, but generating them on demand during searches introduces latency challenges that could frustrate users.

The Dropbox engineering team had to ensure their solution supported seamless browsing, filtering, and previewing of media content directly within Dash. This meant confronting higher infrastructure costs, stricter performance requirements, and adapting systems originally designed for text-based retrieval.

The Architecture

To deliver fast and accurate multimedia search while keeping costs manageable, the Dropbox engineering team designed a solution built on three core pillars:

  • A metadata-first indexing pipeline

  • Intelligent location-aware search

  • A preview generation system that creates visuals only when needed

Indexing Pipeline for Metadata

The foundation of multimedia search begins with indexing, the process of cataloging files so they can be quickly retrieved later. Dropbox made a critical early decision to index lightweight metadata rather than performing deep content analysis on every single file. This approach dramatically reduces computational costs while still enabling effective search.

Before building this multimedia search capability, Dropbox had intentionally avoided downloading or storing raw media blobs to keep storage and compute costs low. However, this meant their existing search index lacked the necessary features to support rich, media-specific search experiences. To bridge this gap, the team added support for ingesting multimedia blob content to extract the required features. Importantly, they retain the raw content not just for preview generation, but also to enable computing additional features in the future without needing to re-ingest files.

To power this indexing pipeline, Dropbox leveraged Riviera, its existing internal compute framework that already processes tens of petabytes of data daily for Dropbox Search. By reusing proven infrastructure, the team gained immediate benefits in scalability and reliability without building something entirely from scratch.

The indexing process extracts several key pieces of information from each multimedia file. This includes basic details like file path and title, EXIF data such as camera metadata, timestamps, and GPS coordinates, and even third-party preview URLs when available from applications like Canva.

See the diagram below:

The data flows through the system in the following way:

  • Raw files are stored in a blob store

  • Riviera extracts features and metadata from these files

  • Information flows through third-party connectors

  • Kafka message brokers transport the data

  • Transformers process and structure the information

  • Finally, everything populates the search index

This metadata-first approach provides a lightweight foundation for search functionality while keeping processing overhead minimal. The team plans to selectively incorporate deeper content analysis techniques like semantic embeddings and optical character recognition in future iterations, but starting simple allowed them to ship faster.

Geolocation-Aware Retrieval System

Another feature Dropbox built into multimedia search is the ability to find photos and videos based on where they were taken. This geolocation-aware system works through a process called reverse geocoding.

See the diagram below:

During indexing, when a file contains GPS coordinates in its metadata, Dropbox converts those coordinates into a hierarchical chain of location IDs. For example, a photo taken in San Francisco would generate a chain linking San Francisco to California to the United States. This hierarchy is crucial because it enables flexible searching at different geographic levels.

At query time, when a user searches for something like “photos from California,” the system identifies that “California” is a geographic reference, validates it against a cached mapping of location IDs, and retrieves all photos tagged with that location or any of its child locations, like San Francisco. Since the set of known geographic locations is of manageable size, Dropbox caches the entire location mapping at service startup, making lookups extremely fast.
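A toy version of this hierarchy and lookup might look like the following; the IDs and structure are illustrative, not Dropbox's actual schema.

```python
# Parent links: city -> state -> country.
PARENT = {"san_francisco": "california", "california": "united_states", "united_states": None}

def descendants(location_id: str) -> set[str]:
    # All locations whose parent chain includes location_id, plus the location itself.
    result = {location_id}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

# Each indexed photo carries the location chain derived from its GPS EXIF data.
photos = [
    {"name": "IMG_6798.jpg", "locations": {"san_francisco", "california", "united_states"}},
    {"name": "IMG_0001.jpg", "locations": {"united_states"}},
]

query_locations = descendants("california")
matches = [p["name"] for p in photos if p["locations"] & query_locations]
print(matches)    # ['IMG_6798.jpg']
```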

This approach elegantly handles the challenge of location-based search without requiring users to remember exact locations or browse through folder structures organized by place.

Just-In-Time Preview Generation

The most interesting architectural decision Dropbox made was generating previews on demand rather than pre-computing them for all files. This choice directly addresses the preview generation dilemma mentioned earlier.

The rationale was straightforward. Dropbox ingests files at a rate roughly three orders of magnitude higher than users query for them. Pre-computing previews for every single multimedia file would be prohibitively expensive, especially since only a small fraction of indexed files actually get viewed during searches.

Instead, when a search returns results, the system generates preview URLs that the frontend can fetch. These URLs point to a preview service built on top of Riviera that generates thumbnails and previews in multiple resolutions on the fly. To avoid repeatedly generating the same preview, the system caches them for 30 days, striking a balance between storage costs and performance.
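In spirit, the read path is a classic generate-on-miss cache with a time-to-live, sketched below. The 30-day TTL matches the article; everything else (names, in-memory dictionary cache) is illustrative.

```python
import time

PREVIEW_TTL_SECONDS = 30 * 24 * 3600      # cache generated previews for 30 days
_preview_cache: dict[tuple[str, str], tuple[float, bytes]] = {}

def render_preview(file_id: str, resolution: str) -> bytes:
    # Stub for the preview service built on Riviera: decode the media, resize, re-encode.
    return f"thumbnail:{file_id}@{resolution}".encode()

def get_preview(file_id: str, resolution: str) -> bytes:
    key = (file_id, resolution)
    cached = _preview_cache.get(key)
    if cached and time.time() - cached[0] < PREVIEW_TTL_SECONDS:
        return cached[1]                               # cache hit: no recomputation
    preview = render_preview(file_id, resolution)      # cache miss: generate just in time
    _preview_cache[key] = (time.time(), preview)
    return preview

print(get_preview("file_123", "256x256"))
```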

See the diagram below:

The team optimized for speed by running preview URL generation in parallel with other search operations like ranking results, checking permissions, and fetching additional metadata. This parallelization significantly reduces overall response time. When users want to see more detail about a specific file, such as camera information or exact timestamps, the system fetches this metadata on demand through a separate endpoint, keeping the initial search response lean and fast.

See the diagram below:

Technical Trade-Offs and Design Decisions

Building the multimedia search feature required the Dropbox engineering team to make deliberate choices about where to invest resources and where to optimize for efficiency.

Cost vs. Performance Decisions

The team made three key trade-offs to balance system performance with infrastructure costs.

  • First, they chose metadata-only indexing initially, deferring expensive content analysis techniques like OCR and semantic embeddings to future iterations. This allowed them to ship faster while keeping compute costs manageable.

  • Second, they shifted the compute from the write path to the read path by generating previews just-in-time rather than during ingestion.

  • Finally, they implemented selective ingestion that currently covers 97% of media files, with ongoing work to optimize handling of edge cases.

Reusing What Works

Rather than building everything from scratch, Dropbox maximized code reusability wherever possible. They leveraged the existing Riviera framework for consistency with their established infrastructure and reused the Dropbox preview service that was already battle-tested. The team also shared frontend components between Dropbox and Dash, ensuring a consistent user experience across both platforms.

A critical organizational decision was establishing clear API boundaries between different systems. This separation allowed multiple teams to work in parallel rather than sequentially, significantly accelerating development timelines without creating integration headaches later.

Conclusion

Building a multimedia search for Dropbox Dash showcases how thoughtful engineering can solve complex problems without over-engineering the solution. By starting with lightweight metadata indexing, deferring expensive operations to query time, and leveraging existing infrastructure wherever possible, the Dropbox engineering team created a scalable system that balances performance with cost efficiency.

The development process itself offers valuable lessons. When faced with interdependencies that could have slowed progress, the team temporarily proxied Dropbox Search results through a custom endpoint during UX development. This workaround allowed frontend work to proceed in parallel while the backend infrastructure was being built, dramatically accelerating the overall timeline.

Performance monitoring played a crucial role in refining the system. The team added latency tracking for preview generation, used instrumentation to identify bottlenecks, and implemented aggressive concurrency improvements based on the metrics they gathered. This data-driven approach to optimization ensured they focused efforts where they would have the most impact.

As mentioned, Dropbox plans to enhance multimedia search with semantic embeddings and optical character recognition, bringing even deeper content understanding to the platform. The architecture they’ve built maintains clear paths for these future enhancements without requiring fundamental redesigns.

References:



How Reddit Migrated Comments Functionality from Python to Go

2025-12-08 00:30:37

Unwrap Unbeatable Holiday Deals with Verizon (Sponsored)

Reliability shouldn’t cost extra—and Verizon proves it this holiday season. Switch to Verizon and get four lines on Unlimited Welcome for $25 per line/month (with Auto Pay, plus taxes and fees) and everyone gets one of the hottest devices, all on them. No trade-in required. Devices include:

Everyone gets a better deal—flexibility, savings, and support with no extra cost.

Explore Holiday Deals and see here for terms.


Disclaimer: The details in this post have been derived from the details shared online by the Reddit Engineering Team. All credit for the technical details goes to the Reddit Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

When you upvote a clever comment on Reddit or reply to a discussion thread, you’re interacting with their Comments model. This model is probably the most important and high-traffic model in Reddit’s architectural setup.

Reddit’s infrastructure was built around four Core Models: Comments, Accounts, Posts, and Subreddits.

These models power virtually everything users do on the platform. For years, all four models were served from a single legacy Python service, with ownership awkwardly split across different teams. By 2024, this monolithic architecture had become a problem:

  • The service suffered from recurring reliability and performance issues.

  • Maintaining it had become increasingly difficult for all teams involved.

  • Ownership responsibilities were unclear and fragmented.

In 2024, the Reddit engineering team decided to break up this monolith into modern, domain-specific Go microservices.

They chose comments as their first migration target because it represented Reddit’s largest dataset and handled the highest write throughput of any core model. If they could successfully migrate comments, they would prove their approach could handle anything.

In this article, we will look at how Reddit carried out this migration and the challenges it faced.

The Easy Part: Migrating Read Operations

Before diving into the complex scenario, it’s worth understanding how Reddit approached the simpler part of this migration: read endpoints.

When you view a comment, that’s a read operation. The server fetches data from storage and returns it to you without changing anything.

Reddit used a testing technique called “tap compare” for read migrations. The concept is straightforward:

  • A small percentage of traffic gets routed to the new Go microservice.

  • The new service generates its response internally.

  • Before returning anything, it calls the old Python endpoint to get that response too.

  • The system compares both responses and logs any differences.

  • The old endpoint’s response is what actually gets returned to users.
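A stripped-down sketch of that read-path tap compare might look like this, written from the new service's perspective. The service calls are stubs, and the names and logging are illustrative, not Reddit's actual code (which is in Go and Python).

```python
import logging

def legacy_python_read(comment_id: str) -> dict:
    return {"id": comment_id, "body": "hello", "score": 42}       # stub: the old endpoint

def go_service_read(comment_id: str) -> dict:
    # The new service builds its own response first...
    candidate = {"id": comment_id, "body": "hello", "score": 42}  # stub for the new implementation
    # ...then fetches the legacy response, compares the two, and logs any drift.
    legacy = legacy_python_read(comment_id)
    if candidate != legacy:
        logging.warning("tap-compare mismatch for %s: %s vs %s", comment_id, candidate, legacy)
    return legacy             # during validation, users always receive the legacy response

print(go_service_read("t1_abc123"))
```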

This approach meant that if the new service had bugs, users never saw them. The team got to validate their new code in production with real traffic while maintaining zero risk to user experience.

The Hard Part: Migrating Write Operations

Write operations are an entirely different challenge. When you post a comment or upvote one, you’re modifying data.

Reddit’s comment infrastructure doesn’t just save your action to one place. It writes to three distinct datastores simultaneously:

  • Postgres: The primary database where all comment data lives permanently.

  • Memcached: A caching layer that speeds up reads by keeping frequently accessed comments in fast memory.

  • Redis: An event store for CDC (Change Data Capture) events that notify other services whenever a comment changes.

The CDC events were particularly critical. Reddit guarantees 100% delivery of these events because downstream systems across the platform depend on them. Miss an event, and you could break features elsewhere.

The team couldn’t simply use basic tap compare for writes because of a fundamental constraint: comment IDs must be unique. You can’t write the same comment twice to the production database because the unique key constraint would reject it.

But without writing to production, how do you validate that your new implementation works correctly?

The Sister Datastore Solution

Reddit’s engineering team came up with a solution they called “sister datastores”. They created three completely separate datastores that mirrored their production infrastructure (Postgres, Memcached, and Redis). The critical difference was that only the new Go microservice would write to these sister stores.

Here’s how the dual-write flow worked:

  • A small percentage of write traffic is routed to the Go microservice.

  • Go calls the legacy Python service to perform the real production write.

  • Users see their comments posted normally (Python is still handling the actual work).

  • Go performs its own write to the completely isolated sister datastores.

  • After both writes are complete, the system compares production data against sister data.

This comparison happened across all three datastores. The Go service would query both production and sister instances, compare the results, and log any differences. The beauty of this approach was that even if Go’s implementation had bugs, those bugs would only affect the isolated sister datastores, never touching real user data.
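A schematic version of that dual-write-and-compare flow, with every datastore reduced to a dictionary, might look like the sketch below. It is purely illustrative of the idea, not how Reddit's Go and Python services are actually structured.

```python
production = {"postgres": {}, "memcached": {}, "redis_events": {}}   # written by the legacy Python path
sister     = {"postgres": {}, "memcached": {}, "redis_events": {}}   # written only by the new service

def legacy_write(comment_id: str, body: str) -> None:
    production["postgres"][comment_id] = body
    production["memcached"][comment_id] = body
    production["redis_events"][comment_id] = {"type": "comment_created", "id": comment_id}

def new_service_write(comment_id: str, body: str) -> None:
    legacy_write(comment_id, body)                    # the real production write still goes through Python
    sister["postgres"][comment_id] = body             # the new code path writes only to isolated sister stores
    sister["memcached"][comment_id] = body
    sister["redis_events"][comment_id] = {"type": "comment_created", "id": comment_id}

def compare(comment_id: str) -> list[str]:
    # Datastores where the production write and the sister write disagree.
    return [store for store in production
            if production[store].get(comment_id) != sister[store].get(comment_id)]

new_service_write("t1_xyz789", "great post!")
print(compare("t1_xyz789") or "all three datastores match")
```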

The Scale of Verification

The verification process was substantial. Reddit migrated three write endpoints:

  • Create Comment: Posting new comments

  • Update Comment: Editing existing ones

  • Increment Comment Properties: Actions like upvoting

Each endpoint wrote to three datastores, and the data had to be verified across two different service implementations. Three endpoints times three datastores meant nine separate comparison paths running simultaneously, each requiring careful validation and bug fixing.

But even this wasn’t enough. Early in the migration, the team discovered serialization problems. Serialization is the process of converting data structures into a format that can be stored or transmitted. Different programming languages serialize data differently. When Go wrote data to the datastores, Python services sometimes couldn’t deserialize (read back) that data correctly.

To catch these problems, the team added another verification layer.

They ran all tap comparisons through actual CDC event consumers in the legacy Python service. This meant Python code would attempt to deserialize and process events written by Go. If Python could successfully read and handle these events, they knew cross-language compatibility was working. This end-to-end verification ensured not just that Go wrote correct data, but that the entire ecosystem could consume it.
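
A small Go sketch of that idea, assuming a hypothetical verification endpoint backed by the legacy Python consumer: pin the event schema to explicit JSON field names and primitive types, then confirm that the Python side can actually deserialize and process what Go emitted.

```go
package cdccheck

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// CommentEvent pins down explicit JSON field names and primitive types so the
// Python consumer's deserializer sees exactly what it expects. The schema
// here is invented for illustration.
type CommentEvent struct {
	EventID   string `json:"event_id"`
	CommentID string `json:"comment_id"`
	Action    string `json:"action"`     // e.g. "created", "updated"
	CreatedAt int64  `json:"created_at"` // Unix seconds, not a language-specific datetime
}

// VerifyWithLegacyConsumer serializes the event the way the Go service would
// and asks the legacy Python consumer (via a hypothetical test endpoint) to
// deserialize and process it.
func VerifyWithLegacyConsumer(ev CommentEvent, legacyURL string) error {
	payload, err := json.Marshal(ev)
	if err != nil {
		return err
	}
	resp, err := http.Post(legacyURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("legacy consumer rejected event %s: %d", ev.EventID, resp.StatusCode)
	}
	return nil
}
```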

Challenges With Different Languages

Migrating between programming languages introduced unexpected complications beyond serialization.

One major issue involved database interactions. The Python monolith uses an ORM (Object-Relational Mapping), a tool that generates database queries from application code. Reddit’s Go services don’t use an ORM and instead write direct database queries.

It turned out that Python’s ORM had hidden optimizations that the team didn’t fully understand. When they started ramping up the Go service, it put unexpected pressure on the Postgres database. The same operations that ran smoothly in Python were causing performance issues in Go.

Fortunately, they caught this early and optimized their Go queries. They also established better monitoring for database resource utilization. This experience taught them that future migrations would need careful attention to database access patterns, not just application logic.
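
The sketch below shows the kind of discipline direct SQL requires once an ORM is no longer doing invisible work: reusing prepared statements, selecting only the needed columns, and bounding the connection pool. The specific optimizations Reddit’s Python ORM provided are not public, so treat these as illustrative examples rather than the actual fixes.

```go
package commentsql

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver; any driver would do
)

type Store struct {
	db         *sql.DB
	getComment *sql.Stmt
}

func NewStore(dsn string) (*Store, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// Pool limits an ORM or framework might have set implicitly.
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(10)

	// Prepare once and reuse on every request, instead of re-parsing SQL.
	stmt, err := db.PrepareContext(context.Background(),
		`SELECT id, body FROM comments WHERE id = $1`)
	if err != nil {
		return nil, err
	}
	return &Store{db: db, getComment: stmt}, nil
}

// Get selects only the columns it needs rather than the full row.
func (s *Store) Get(ctx context.Context, id string) (string, string, error) {
	var cid, body string
	err := s.getComment.QueryRowContext(ctx, id).Scan(&cid, &body)
	return cid, body, err
}
```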

The Race Condition Problem

Another tricky issue was race conditions in the tap compare logs.

The team started seeing mismatches that didn’t make sense. They would spend hours investigating, only to discover that the “bug” wasn’t a bug at all, but a timing problem.

Here’s an example scenario:

  • User updates a comment, changing the text to “hello”

  • Go writes “hello” to the sister datastore

  • Go calls Python to write “hello” to production

  • In those milliseconds, another user edits the same comment to “hello again”

  • When Go compares its write against production, they don’t match

These timing-based false positives made debugging difficult.

Was a mismatch caused by a real bug in the Go implementation, or just unlucky timing?

The team developed custom code to detect and ignore race condition mismatches. For future migrations, they plan to implement database versioning, which would let them compare only updates that happened from the same logical change.

Interestingly, this problem was specific to certain datastores:

  • Redis event store: No race condition issues because they used unique event IDs

  • Postgres and Memcached: Race conditions were common and needed special handling
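
A short sketch of how such versioning could separate real bugs from race-condition noise. The VersionedComment type and the comparison policy are hypothetical; they only illustrate the idea of comparing like-for-like versions.

```go
package compare

import "log"

// VersionedComment carries a version counter that both the production and
// sister write paths would increment on every change (a hypothetical scheme).
type VersionedComment struct {
	ID      string
	Body    string
	Version int64
}

// Compare reports a mismatch only when both sides claim to describe the same
// version of the comment. If the versions differ, another write landed in
// between, so the difference is treated as a probable race and skipped.
func Compare(prod, sister VersionedComment) {
	if prod.Version != sister.Version {
		log.Printf("skipping comment %s: versions %d vs %d (probable race)",
			prod.ID, prod.Version, sister.Version)
		return
	}
	if prod.Body != sister.Body {
		log.Printf("real mismatch for comment %s at version %d", prod.ID, prod.Version)
	}
}
```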

Testing Strategy and Comment Complexity

Much of the migration time was spent manually reviewing tap compare logs in production.

When differences appeared, engineers would investigate the code, fix issues, and verify that those specific mismatches stopped appearing. Since tap compare logs only capture differences, once a problem was fixed, those logs would disappear.

This production-heavy testing approach worked, but it was time-consuming. The team realized they needed more comprehensive local testing before deploying to production. Part of the challenge was the sheer complexity of comment data.

A comment might seem like simple text, but Reddit’s comment model includes numerous variations:

  • Simple text vs rich text formatting vs media content

  • Photos and GIFs with different dimensions and content types

  • Subreddit-specific workflows (some use Automod requiring approval states)

  • Various types of awards that can be attached

  • Different moderation and approval states

All these variations create thousands of possible combinations for how a single comment can be represented in the system. The initial testing strategy covered common use cases locally, then relied on “tap compare” in production to surface edge cases. For future migrations, the team plans to use real production data to generate comprehensive test cases before ever deploying to production.
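
As a sketch of what production-informed local testing could look like, the table-driven test below enumerates a few comment variations. The fields and render function are invented for illustration; in practice the cases would be generated from sampled production data rather than written by hand.

```go
package comments

import "testing"

// Comment captures a handful of the many dimensions a real comment has.
type Comment struct {
	Body     string
	RichText bool
	HasMedia bool
	Awards   int
	Approved bool
}

// render stands in for whatever transformation is being migrated.
func render(c Comment) string {
	if c.RichText {
		return "<p>" + c.Body + "</p>"
	}
	return c.Body
}

func TestCommentVariations(t *testing.T) {
	// In practice these cases would be generated from sampled production
	// data so that rare combinations are covered before deployment.
	cases := []Comment{
		{Body: "plain text"},
		{Body: "formatted", RichText: true},
		{Body: "gif", HasMedia: true},
		{Body: "awarded", Awards: 3, Approved: true},
	}
	for _, c := range cases {
		if got := render(c); got == "" {
			t.Errorf("empty render for %+v", c)
		}
	}
}
```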

Why Go Instead of Python Microservices?

An important question comes up here: if language differences caused so many problems, why not just create Python microservices instead?

Just sticking to Python would have avoided serialization issues and database access pattern changes entirely.

The answer reveals Reddit’s strategic infrastructure direction. Reddit’s infrastructure organization has made a strong commitment to Go for several reasons:

  • Concurrency advantages: For high-traffic services, Go can run fewer pods while achieving higher throughput than Python.

  • Existing ecosystem: Go is already widely used across Reddit’s infrastructure.

  • Better tooling: The existing Go support makes development easier and more consistent.

The engineering team considered only Go for this migration. From their perspective, the strategic long-term benefits of standardizing on Go outweighed the short-term challenges of cross-language compatibility.

Conclusion

The migration succeeded completely. All comment endpoints now run on the new Go microservice with zero disruption to users. Comments became the first of Reddit’s four core models to fully escape the legacy monolith.

The primary goal was to maintain performance parity while improving reliability, but the migration delivered an unexpected bonus: all three migrated write endpoints saw their p99 latency cut in half. P99 latency measures how long the slowest 1% of requests take, which matters because those slow requests represent the worst user experience.

The improvements were substantial:

  • The legacy Python service occasionally had latency spikes reaching 15 seconds

  • New Go service shows consistently lower and more stable latency

  • Typical latency stays well under 100 milliseconds

See the charts below that show the latency improvements for various scenarios:

The migration also provided some valuable lessons for future work:

  • Database versioning is essential for handling race conditions properly by tracking which version of data is being compared

  • Comprehensive local testing informed by real production data will reduce debugging time in production

  • Database monitoring matters when changing how services access data, not just when changing application logic

  • End-to-end verification must include actual downstream consumers, not just byte-level data comparison

  • Custom tooling helps automate parts of the manual review process (like their race condition detection code)

As they continue migrating the remaining core models (Accounts have been completed, while Posts and Subreddits are in progress), these lessons will make each subsequent migration smoother.

SPONSOR US

Get your product in front of more than 1,000,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing [email protected].