Vibe Coding Vibe Coding Tools

2025-05-07 08:00:01

I want to show the steps of how I created a vibe coding tool using a vibe coding tool.

  1. initial prompt
    build a chat-based agentic web app builder
    - chat UI in the left panel, where people can ask the agent to make changes to the code base, left panel can be hidden or shown by clicking the button on the top
    - the right panel can either be the preview or a code editor with a file explorer (switcher on the top right)
    - top left menu shows credit used, settings and help
    
  2. popup menu
    clicking to show or hide a popup menu, items in the menu are:
    - go to dashboard
    - credit usage
    - settings
    - appearance
    - help
    
  3. connect supabase
    Please connect my supabase project `TinyLove`
    
  4. add login
    Add login
    
  5. add api key
    Show me the secrets form
    
  6. connect chat api
    process user's request from the chat panel using openai, store the response in supabase and show the user-facing reply in the chat thread
    - prepend a system prompt to the prompt sent to openai
    - extract actions and user-facing reply from the response, and save both to supabase
    
  7. stripe
    Let’s connect Stripe to my project. We will begin with secure payment processing.
    

answers to follow-up questions:

Are you looking to implement one-time payments or recurring subscriptions?
Do you want to track payment data in your Supabase database?
Will payments require user authentication, or should guests be able to make purchases too?
---
1. one-time payment for getting 100 credits at $10, tied to user id
2. yes
3. yes, guests not allowed
  1. fix panel dragging
    in index.tsx, make split between chat panel and main view draggable, make main view responsive
    
  2. try to fix supabase RLS error
    Supabase Error: {
      code: "42501",
      details: null,
      hint: null,
      message: 'new row violates row-level security policy for table "chat_messages"'
    }
    

    It didn’t work out so I disabled it on Supabase for now.

  3. code editor
    implement the code editor using @monaco-editor/react
    
  4. fetch file tree from api
    connect filetree to api endpoint 'api.mindynode.com/agent/{project}/workspace' to fetch the filetree json, use mocking for the api
    
  5. refactor suggestion (prompted by Lovable)
    Refactor src/components/editor/CodeEditor.tsx into smaller files without breaking any functionality. Split it into at least a FileTree component and an Editor component, as it's currently over 220 lines long.
    
  6. preview
    implement the preview, it should show an iframe, use src='news.mindynode.com' for mocking the preview
    
  7. credit logic
    implement credit logic in supabase edge function 'chat', before calling LLM, check the user's credit, decrease credit by 1 if it's bigger than 0, otherwise skip calling LLM and return an error message 'not enough credit'
    

Next Step

I’m working on the backend part, which basically connects openhand’s execution loop and adds an API layer over the file system and git of its /workspace, so stay tuned

Stanford CS336 Language Modeling from Scratch | Spring 2025 | GPUs

2025-05-01 08:00:01


So hopefully everyone’s having a good time with assignment one. It’s due tonight. Let us know if you need an extension. Assignment two is coming out soon. We’re putting the finishing touches on some of the Triton stuff. Hopefully you’ll enjoy it. You’ll get to implement Flash Attention 2 or parts of Flash Attention 2, which I think will be nice.

So today we’re going to talk about GPUs. GPUs are the thing that makes our language models go. So they’re pretty critical to get right. If you haven’t really studied the hardware that makes your models run, they can seem pretty mysterious. So my goal today is to try to make CUDA and GPUs less magic. One of the things that I want to demystify— you don’t have to understand the plot. There’s a lot on the slide, I know. Why do GPUs get slow? They get slow in very mysterious ways. I will try to talk through this plot near towards the end of the lecture. As you increase the size of your matrix multiplies, you might expect it either gets slower or faster or whatever; you get these very unpredictable looking wavelike patterns, and you’re like, why is my GPU fast at certain multiples of certain numbers and slow at others? It’s very mysterious. We’ll try to understand that.

The other thing is we would like to understand how to make fast algorithms. I think almost all of you have heard of flash attention. It’s the thing that makes much longer context possible by very cleverly computing the attention operation inside a transformer. And so maybe you would like to come up with new algorithms or new implementations like flash attention— what primitives and what components do we need to understand in order to be able to do that? So those are kind of the two learning goals of today. The first one is by the end of the lecture, you should feel comfortable with GPUs; you should understand how they work. The second one is you should feel comfortable accelerating certain parts of your algorithms. If you make a new architecture, you should hopefully feel like you can try to accelerate that with CUDA.

Because hardware is not necessarily the domain in which I work, there are special resources that I have to give a lot of credit to, especially Horace He’s blog where he’s got a lot of fun GPU facts that you can learn about. For example, why are matrix multiplies that are filled with zeros faster than ones that are not filled with zeros? You can learn by going to his blog. There are also other resources that I’ve drawn from, like the CUDA mode group and the nice TPU book from Google. If this topic interests you, I’d encourage you to go and look at those resources to learn more because this is, in some ways, like a shallow but hopefully complete coverage of the hardware.

So today we’re only going to focus on the non-parallel parts of the hardware stack. We’re going to study the GPU like a single accelerator in depth, how they work, and some important parts. I’m also going to talk very briefly about TPUs because, in some ways, they’re very similar conceptually to a GPU. And so my discussion here is going to carry over. Then once we understand kind of the hardware and execution model of the GPU, we’re going to try to understand what makes GPUs go fast on certain workloads and what makes them slow. We’re going to understand the performance.

In the last part, this is going to be almost like a hands-on piece. I’m going to try to walk through flash attention. I’m going to take all the lessons that we’ve learned and try to walk you through flash attention, saying see here’s how it all comes together. So that’s the last part of today’s lecture.

Many of you have taken an NLP course, and these days in an NLP course, I think you teach some amount of scaling laws, and you’ve probably seen this. This is just setting the context. We know that having more compute is helpful for training large language models. This is a pre-training scaling chart, but you could replace this with an inference scaling chart if you would like. It’s generally agreed upon that the more compute you have, the more processing you can do on your data. You can ingest more data, you can train larger models; all of those lead to improved performance.

You might think of course, deep learning is really important, but what’s really driven performance is faster hardware, better utilization, improved parallelization. So that’s kind of setting the stage of why hardware is important to understand. And of course, once you think about compute scaling, you ask, okay, how do we get compute scaling? How do we get our models to train faster? In the early days of semiconductor scaling, if you were thinking about CPUs, how do they get faster? They would scale under something called Dennard scaling. With Moore’s Law, you would sort of double the amount of transistors on a chip every year, and if you have this doubling, you have Dennard scaling where smaller and smaller transistors can be driven at faster and faster clock speeds with lower and lower power, which in turn gives you more performance.

In the 1980s to 2000s, this sort of tapped out. You can kind of see in this chart by Hennessy and Patterson that single-thread performance— that’s the blue dots here— basically started to taper out. Of course, the number of transistors didn’t really start falling off. You did have chips with higher and higher transistor densities, but that wasn’t helpful. It wasn’t giving you higher throughput on single threads. This means that we can’t just do computation faster in absolute terms. What we have to make up for it with is parallel scaling. The story of scaling for deep learning and neural networks is going from single-thread scaling, which is just doing your computation faster in absolute terms, to parallel scaling where you have a lot of workloads that are all computed at once.

This is one of my favorite compute scaling charts by Bill Dally in his keynote, where he’s showing the super-exponential increase in the number of integer operations per second, going from the earliest K20s to the H100. It’s kind of like this really remarkable exponential or super-exponential curve. We have to really understand how to take advantage of this curve in order to really get the most out of our language model. That’s going to be our goal.

I’ve already hinted at this important difference. CPUs are something that I think everyone is familiar with once you start doing programming. It’s this execution model of you have a program; it goes through and in a single thread, it executes step by step what’s happening. In order to support that execution model, what do you need? Well, you need big control units. You just need to generally run these things very quickly because you have a lot of branching and a lot of conditional control logic. A CPU is going to dedicate a lot of its chip towards large control branch prediction, and it’s going to run these very quickly because it doesn’t have that many threads. There are CPUs with lots of cores now, but compared to a GPU, it’s almost nothing.

In contrast, the GPU has tons and tons of compute units, ALUs. There’s the little green boxes, and there are much smaller amounts of the chip dedicated to control. There’s a little bit of control logic orchestrating tons of compute units operating in parallel. This is kind of the picture of what is being emphasized in a CPU versus a GPU. But if you look at what the design goals are, they are designed for very different goals. You can think about CPUs as optimizing for latency. I want to finish my tasks as quickly as possible. If I have tasks T1 through T4 here on the right side, in a CPU, I’m going to try to finish each task as quickly as possible. And if you want any one of these tasks to be finished quickly, T1’s going to complete really quickly.

In a GPU, you’re optimizing for high throughput. I don’t care about latency; I just want all of my tasks that I have in aggregate to complete as quickly as possible. To support that, maybe you have lots of threads, and these threads can go to sleep and wake up very quickly. In the end, you finish all of your workload T1 through T4 before the CPU one does, even though individually all of these have sort of higher latency. They have different design principles and design goals.

A GPU has a pretty different anatomy. I don’t know if you’ve all ever looked at what a GPU layout diagram looks like. I’ll actually show you the chip figures in a moment here. The core idea, and this is an important concept behind a GPU, is that a GPU contains many SMs (streaming multiprocessors). A streaming multiprocessor can be thought of as an atomic unit when you’re programming in something like Triton. You’re going to operate at the level of an SM, and each SM contains many SPs (streaming processors). A streaming processor is going to execute a bunch of threads in parallel. One way to think about it is an SM has a bunch of control logic. It can decide what to execute. It can do, for example, branching. SPs are going to operate to take the same instruction and apply it to many different pieces of data. You can do tons and tons of parallel computation under this model.

An SM is sort of each granular unit of control. SPs can do a lot of computation individually. If you look at an A100, which is the previous generation GPU at this point, you’ve got 128 SMs; that’s a lot more cores than most CPUs have. Each of these SMs is going to have a very large number of SPs and specialized matrix multiply units inside them. That’s kind of the compute model. Was there a question? Sorry.

Yeah, going back to the slide before, the GPU one. So is this GPU the same as that GPU? The question was, is this GPU the same as that GPU? Yes, this is a cartoon version of this. You can kind of think of each row as being an SM. It’s got its own control units. Each green block might be one of these green blocks here, like an FP32 processing unit inside of it. Each SM can operate the various pieces that it owns, like the tensor cores, to do computation.

Cool. Okay. There are two important things. You think of GPU as computers; they compute, but actually computation is only one of the two important things we have to keep track of. Memory is arguably more important at this point, and it will continue to be more important in terms of the performance profiles of how we run our programs on the GPU. To understand memory, you kind of have to understand the physical layout of the GPU and the chip because when you’re operating at such fast speeds, the physical proximity of the memory starts to matter quite a bit. I will show you how things are laid out and how that relates to how you should think about memory access and performance.

The closer a piece of memory is to each SM, the faster it’s going to be. There are going to be certain very very fast kinds of memory like L1 and shared memory that live inside the SM. That’s going to be really fast. Things like registers, things you’re reading and writing very frequently, you’re going to want to put into the L1 and shared memory. As you can see, there are these green areas which are SMs, and then there are these blue areas. This is the GPU chip. These are L2 memory that’s kind of right next to the SM. They’re not inside the SM, but they’re physically still quite close. They’re still a factor of 10 slower, but they’re still reasonably fast.

Outside of the chip itself, this is sort of a— I think this is like a 3090 card or something like this, or maybe a PCIe A100. Oh, this is a PCIe A100. You’ve got your GPU here, and you’ve got actually DRAM sort of living next to the chip. It has to actually go physically outside of the chip and connect. You can kind of see on this chip diagram here, these yellow connectors at the edges. These are HBM connectors. They connect to the DRAM chips that are outside of the actual GPU. You can kind of see the speed that it takes to access these. On-chip memory is much faster— like 20 clock cycles to access something from there, whereas it’s going to take something like 200 or 300 clock cycles to access something from the L2 cache or global memory.

This factor of 10 is going to hurt you real bad. If you have a piece of computation that requires you to access global memory, it might mean that you actually run out of work to do on your SM. You’ve multiplied all the matrices; you’ve run out, now you just have to idle. Utilization won’t be good, and this will be a key theme in thinking about memory. In some sense, the key to thinking about how GPUs work, and in assignment two, you’re going to be writing high-performance code for a GPU. You have to think about the execution model of how a GPU actually executes things. This is somewhat complicated, but not insanely so.

There are three granularities of things that you need to think about. There are blocks, warps, and threads, and that’s the order in which the granularity narrows down. Blocks are kind of these big groups of threads, and each block is going to be assigned to an SM. Think about each SM as a worker, it’s its own autonomous unit, and a block is going to be assigned to an SM to process. This is each granular unit.

Now, within these blocks, there are a whole bunch of threads. Each thread is a piece of task that needs to be done. When these threads execute, they’re going to execute in groups. This is a thing called a warp. You take a block, which is a collection of threads, and you’re going to take threads from that block, and they’re going to execute in groups of 32 consecutively numbered threads each time. That’s called warps. You can see in this diagram what’s happening. You’ve got a bunch of blocks; each block is assigned to a different SM. Within each block, there are many different warps, and each warp is going to consist of a whole bunch of threads, and all of these threads are going to execute the same instruction on different data. This is kind of the execution model.
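
To attach names to those pieces, here is a minimal sketch using Numba's CUDA JIT (Numba is my choice of framework, not something from the lecture). The launch creates a grid of blocks, each block gets scheduled onto an SM, and the threads inside a block run in warps of 32:

```python
import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    # Each thread computes a global index from its block index and its
    # index within the block, then handles exactly one element.
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < x.size:
        x[i] += 1.0

x = cuda.to_device(np.zeros(10_000, dtype=np.float32))
threads_per_block = 128                               # 128 threads = 4 warps of 32
blocks = (x.size + threads_per_block - 1) // threads_per_block
add_one[blocks, threads_per_block](x)                 # each block lands on some SM
```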

It seems probably mysterious what these blocks and warps and threads are. They will have important implications for our performance in how we design things like CUDA kernels later. Hopefully, you can remember this. I’ll refresh your memory as we go. Hopefully that’s clear.

That was the kind of logical execution model of a GPU. If you understand that, you understand how GPUs execute things. There’s also a logical memory model of a GPU. Now I’m not showing you the physical hardware; this is just how you think about the programming of a GPU. There are registers; these are really fast for storing single numbers. You’ve got local memory, shared memory, and global memory, and that increases in sort of memory hierarchy, gets slower and slower. Your code can write to global memory; it can also write to constant memory, which is not something that’s used too often. Each thread can access its own register and shared memory, but information that goes across blocks needs to be written to global memory. This is quite important.

Whenever you write a thread that executes something, ideally, it’s operating on the same small amount of data. You load that small amount of data into shared memory; all the threads are very happy accessing that shared memory. It terminates, it’s done. That would be a great execution model. Instead, if you have a thread that needs to access data all over the place, that’s going to have to access global memory, which is very slow. This theme will come back as we talk about different ways of operating on a GPU.

Hopefully that’s clear. That’s kind of the very high-level overview of a GPU. If you have questions about how any of that works, feel free to ask me as I go on.

Okay, so here’s a side thread. Last year I didn’t cover this because I think resources on TPU were a little thin. The nice TPU book or internet website I mentioned at the start of the lecture came out, and that has a lot of nice details. I talked to a few Google people about the TPU, and at a high level, it’s very similar to a GPU. I want to talk for a moment about TPUs. You may never operate on a TPU, but it’s important to understand that these alternative accelerators operate in many ways very similarly.

Here’s a diagram of what a TPU looks like. There’s something called a tensor core, and mentally you can think of a tensor core as being similar to an SM or streaming multiprocessor. Each of these are kind of its own atomic unit that can operate on data. There’s a scalar unit, which is basically a control unit, and it can also do CPU-like arbitrary things. You’ve got a vector unit that can operate on vectors. If you’ve got a vector and you want to operate entry-wise on it, that’s a good place to do it. Then it’s got a very big specialized part of the chip dedicated to just doing matrix multiplies called the MXU. It’s got very fast memory: vector memory (VMEM) and SMEM. Both of these are very fast on-chip or on-tensor-core memory, and then there’s high bandwidth memory that lives outside of the chip.

Hopefully you see the similarities to an SM. There’s slow memory outside, very fast memory inside, and there’s specialized hardware to do matrix multiplication. The core structure is very much the same. The difference is— I’ll talk about this in the parallelism lecture next week. How the accelerators are connected together is a little bit different. I also didn’t talk about warps or any of that other stuff here. Tensor cores are very simple because they’re optimized to just do matrix multiplies. Unlike the GPU, they don’t attempt to do anything but that. That’s in some ways very simple— much simpler in architecture but conceptually doing the same thing.

Yes. Is it called a tensor core because it’s in some ways optimized for general tensors, or is it just matrices it works on? The question was, is it called a tensor core because it can operate on arbitrary tensors. It can operate on arbitrary tensors, but the operation that the MXU performs is a matrix multiply, so it would always be like a batched matrix multiply operating on a tensor. It’s kind of both a yes and a no answer, if that makes sense. They operate on tensors, but the operations they always perform are matrix multiplies, not more complicated tensor operations.

The reason the GPU has been so successful is that it scales up really easily. If you want more processing power, just add more SMs. You don’t have to worry about driving the clock faster and getting more heat dissipation problems. Programming-wise, CUDA is intimidating, but it’s not as horrendous to program because of its programming model. The way it works is within each SM; you have threads, and they execute the same instruction on many different pieces of data. That’s conceptually easy to reason about. You can think through what that means, and especially it’s nice if you’re operating over a matrix and doing simple operations. It’s a simple model.

Finally, each of these threads are very lightweight, and they can be stopped and started at any time. If you need to wait for another thread or if you need to evict something and start another process, all these threads are very lightweight. So this just means that there’s not much state associated with the threads, and they can be stopped and started, allowing GPUs to get high utilization within each SM.

GPUs are graphics processing units. For much of its life, in the early days, it was not used to do scientific computing. Researchers figured out how to use early NVIDIA GPUs to do fast matrix multiplies. This is one of the early papers on doing fast matrix multiplies with graphics hardware. It shows how you can hack things like the texture buffer to do matrix multiplies. Even without specific support for matrix operations, researchers figured out how to do it. Now, especially in this day and age, NVIDIA and others have realized matrix multiplies are special. If you’re doing deep learning, most of your workload is matrix multiplies.

Matrix multiplies are in some sense blessed operations; this is a chart showing the number of teraflops per second by different generations of NVIDIA GPUs. The orange line is your matrix multiply FLOPS, the performance you can get if you’re doing matrix multiplies. The blue line is your non-matrix multiply FLOPS. You see a big gap at V100s when they started putting in tensor cores— specialized hardware to do matrix multiplies. You see this gigantic gap in matrix multiply performance relative to non-matrix multiply performance. If you’re going to design any neural architecture, you have to have most of your workload be matrix multiplies because that’s the thing that’s orders of magnitude faster than any other operation you’re going to be able to do on a GPU.

If you make a non-matrix multiply-based neural network, you’re going to be in big trouble. The last thing I want you to understand as general facts is that matrix multiplies are fast, but it’s important to remember the relative scaling of the different components of the GPU. This is a very nice chart showing how quickly different components of the GPU or different components of what we call the language model training stack are scaling.

The blue line is the connectivity from the GPU to the host, like the server it’s attached to. You can use PCIe, NVLink, and all these fancy interconnects. They are growing, but they’re growing slowly. This chart is showing normalized scaling, bandwidth relative to the first generation of interconnects. The green line is the global memory speed; you go from GDDR to HBM2E, and that’s much faster— this is log scale, it’s 100x faster— but this is still slow scaling. The gray line is compute scaling; this is the number of floating-point operations if you’re considering matrix FLOPS. This shows how fast compute has been scaling, and this is astoundingly fast.

In the early days of scaling, your problems were FLOPS-based. You just didn’t have enough FLOPS to do your matrix multiplications. But now, all the way to the right with the H100s— these are astoundingly fast GPUs— your bottlenecks are probably going to end up being memory because memory is not growing as fast. As we go into the future, this is not really going to change. DRAM is very hard to scale, and you’re going to keep getting this bigger gap. If you’re designing hardware-efficient algorithms, you’re going to have to think more about memory. We’re going to keep a lookout on that. I’ll keep emphasizing this; it’s one of the important themes in GPUs.

I’ve been throwing lots of GPU facts at you, especially if you haven’t seen this recently and maybe this is kind of new. So just to recap, GPUs are these massively parallel processing systems. They have the same instructions applied across many different threads, and they have these things called SMs that are kind of like cores, and there are many, many of them in a GPU. Compute and matrix multiplies have scaled really fast, and they have scaled faster than memory. That is an important part of the characteristics to think about with GPUs, but there is some fast memory. It’s not like everything is slow and there’s nothing we can do. There’s the memory hierarchy, right? Some kinds of memory are very very fast, other kinds of memory are slow, and so if we exploit this hierarchy, maybe we can get things that are really really fast. So those are kind of the things to remember about the GPU, and if you remember these facts, you’re going to be able to think pretty cleanly about the performance components that I’m going to talk about next.

Any questions before I move on to the next part? Okay, cool. So now you all are GPU experts, and what we would like to do is we would like to make machine learning workloads go very fast on a GPU. I’m going to start with this chart, and one of our goals will be to understand what this chart exactly is. I think it’ll be a good puzzle to get us motivated. And so here, what we are doing is we are multiplying square matrices together, right? So the x-axis is the size of my square matrix multiplies. You know the y-axis here, this is the number of operations per second that I’m doing. So you can kind of think of this as hardware utilization on the y-axis, right?

As I get bigger and bigger matrices, I’m going to get better and better hardware utilization because I have more work to do. That overwhelms the overhead of sort of launching jobs and things like this. But there are all these weird things that are happening, right? You see one, two, three, four different lines, right? Each of these lines are kind of wavy in a way that’s kind of unpredictable, right? And so we would like to kind of understand what exactly is going on with these lines. And by the end of this section, my promise is that you will kind of understand exactly each one of these phenomena. You’ll be able to say, “Yeah, that plot looks totally normal. That is a natural thing for a GPU to do.”
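
To make the setup concrete, here is a rough sketch of how you could produce a plot like this yourself, assuming PyTorch on a CUDA device; the dtype, size range, and iteration count are my own illustrative choices, not the lecture's:

```python
import time
import torch

def matmul_tflops(n, dtype=torch.bfloat16, iters=20):
    """Rough throughput for an n x n square matmul, in TFLOP/s."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters
    return 2 * n**3 / elapsed / 1e12   # a matmul does ~2*n^3 floating-point ops

# Sweep sizes; plotting TFLOP/s against n shows the wave-like utilization pattern.
for n in range(256, 4097, 64):
    print(n, round(matmul_tflops(n), 1))
```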

Okay, so the very first part, right, is if you look at that plot, you will notice that it looks a little bit like this, right? If you’ve taken a systems hardware course, you should remember this as kind of the roofline model. The roofline model basically says if we’re looking at throughput or utilization, what we’re going to find is there’s two regimes. There’s going to be a regime that is sort of memory limited, right? That is on the left side of this curve in the green over here. Then there’s a part that is throughput limited on the right side. In some sense, you can kind of think of it as on the right side we are fully utilizing our compute units. All the matrix multiply units are multiplying all the time. On the diagonal here, we just have some sort of memory bottleneck, and so our ability to do computation is limited by the amount of intensity that we have, the amount of flops per byte that we have.

We want to avoid being in this left side region where we’re memory bound, and we would like to be on this right side where we’re getting in some sense full utilization of all of our compute units. So that’s in some sense the goal, and hopefully this roofline model looks something like this. Right? Like we’ve got sort of this diagonal part, and then we’ve got this flat part all the way at the top here. So that’s one part of the mystery.
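
In equation form, the roofline just says attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch, plugging in roughly A100-class numbers as an assumption:

```python
def attainable_tflops(flops_per_byte,
                      peak_tflops=312.0,        # assumed: roughly A100 BF16 tensor-core peak
                      bandwidth_tb_per_s=2.0):  # assumed: roughly A100 HBM bandwidth
    # Memory-limited on the diagonal, compute-limited on the flat roof.
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)

# The ridge point: the arithmetic intensity where the two regimes meet.
ridge_flops_per_byte = 312.0 / 2.0   # ~156 FLOPs per byte for these assumed numbers
```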

This turns out to be kind of complex, right? The simple way to say this is let’s make sure that we’re not accessing memory unnecessarily, right? We have as few memory accesses to slow global memory as possible. But it turns out that in order to do that, we need a large array of tricks. There’s a lot of different things that you could do that would mess you up, that would make you very slow. The first one’s not a memory bottleneck. I’ll just mention it. It doesn’t come up too often. We’ll get it out of the way, and then we’ll talk about the remaining five items that in some sense are really core to thinking about GPU performance.

Okay, so the first thing that I want to talk about is conditionals. So as I said before, GPUs, their execution model is something called SIMT, right? Single Instruction Multiple Thread. Every thread in a warp is going to execute the same instruction, and it’s going to do so on different data. So what happens if I write a piece of code that looks like this? I have an if statement, and if the thread index is less than four, do something. If the thread index is greater than or equal to four, then do something else. Right? I have this very simple conditional model. If I run this on the GPU, what’s going to happen is that I’m going to run the instruction on four of my threads. I will actually pause my other four threads which are supposed to be executing the else part.

Then these other four threads will come live, and they will execute X, and my original four threads will go to sleep. I will just alternate executing each of these instructions. Why is that? I can’t execute A and X at the same time on these different threads. As I said again, every thread has to execute the same instruction. So conditional statements within a single warp can be really damaging because they will force you to pause any of the threads that are not doing exactly the main sort of control flow execution.
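
Here is a minimal sketch of that kind of divergent branch, written with Numba's CUDA JIT (again my choice of framework, not the lecture's); within the single warp, the two branches execute one after the other rather than in parallel:

```python
import numpy as np
from numba import cuda

@cuda.jit
def divergent(out):
    i = cuda.threadIdx.x
    if i < 4:
        out[i] = 1.0   # threads 0-3 execute this while threads 4-31 are paused
    else:
        out[i] = 2.0   # then threads 4-31 execute this while threads 0-3 are paused

out = cuda.to_device(np.zeros(32, dtype=np.float32))
divergent[1, 32](out)   # one block of 32 threads = exactly one warp
print(out.copy_to_host())
```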

Okay, so that was the only non-memory thing that I wanted to mention. It should be kind of obvious that you should probably not be putting conditionals into your massively parallel compute unit. But once we’ve gotten that out of the way, sort of the other tricks that we need to consider are all kind of memory-based. The first thing I want to mention is lower precision. This is a big trick. This is an important trick. You should do it all the time. Going back to this plot from Bill Dally’s keynote.

There’s a sleight of hand here. This looks really good because the numbers are going up and up and up. If you look at what’s driving GPU progress over all these years, you actually kind of see that it’s number representations. You go from FP32 to FP16 to INT8 and so on. You get many orders of magnitude gains from just having lower and lower precision in your GPU operations. Let me sort of clarify why that’s so important, right? If you have fewer bits in all the things that you’re computing and your weights and so on, you have many fewer bits to move, right? So even if you’re accessing these bits from global memory, they become much much less of a concern.

So let’s just give a simple example and let’s just think about kind of the arithmetic intensity of a simple element-wise operation. I’m going to do a ReLU. So that’s x = max(0, x), and I’m going to do that on a vector of size n. Let’s say naively I’m going to do this in float 32. So how many memory accesses do I have? I have to read my x, I have to write the result of the max with zero, and that’s all in float 32. So that’s kind of eight bytes, right?

How many operations do I do? Well, I have to do X less than zero. So that’s one comparison operation. I do one flop, right? So I do, you know, eight bytes per single floating-point operation. If I do this in float 16 now, well, you know, I haven’t changed the flops intensity here, but I’ve halved the memory access. And so now I have four bytes per flop, right? In some sense, I’ve like gotten double the memory bandwidth for free, assuming that I can get away with float 16.
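
Spelling that bookkeeping out as a quick sketch, assuming one read and one write per element:

```python
def relu_arithmetic_intensity(bytes_per_value):
    bytes_moved = 2 * bytes_per_value   # read x[i] once, write the result once
    flops = 1                           # one max/compare per element
    return flops / bytes_moved          # FLOPs per byte of global memory traffic

print(relu_arithmetic_intensity(4))   # float32: 0.125 FLOPs/byte, i.e. 8 bytes per FLOP
print(relu_arithmetic_intensity(2))   # float16: 0.25 FLOPs/byte, i.e. 4 bytes per FLOP
```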

This is a key part of how a lot of things are designed. Part of the assignment is going to be you’re going to try and play with various mixed precision or low precision training and other kinds of things. A key part here is that not all the parts of your network and your training algorithm should be put into low precision, right? So let me give you an example of matrix multiplies. In matrix multiplies that are mixed precision, what you would do is you would have your inputs be 16 bit. So these are low precision, and then you’re going to do your multiplication in full 32 bit. That’s useful because the intermediate computations, as you’re accumulating partial sums, you would like that to be in high precision. So you’re accumulating this with an FP32 accumulator, and then your tensor core will return an FP32 result, which you can downcast if you would like back into 16 bit.

So we have our inputs in 16 bit, but things like the accumulation, we might want to do in 32 bit. There are lots of different operations that can use 16-bit storage, and there are operations that might need more precision, which you want to keep in FP32. And you might have operations that need more range, like exp functions. If you don’t have sort of the dynamic range, they might blow up or zero out. So you might want to put those in BF16. There’s a lot of careful engineering that has to happen in order to make sure that you know these models are actually stable when they’re being trained with lower precision. But if you can do it, that’s really great because you’ve basically doubled the throughput of your bottleneck going from 32 to 16 bit, right? If your memory is your bottleneck.
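
In PyTorch, the usual way to get this pattern (16-bit inputs, with the accumulation kept in higher precision by the tensor cores) is autocast. A minimal sketch, where the layer and shapes are just placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder layer
x = torch.randn(32, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)                  # matmul takes bf16 inputs; accumulation stays in fp32
    loss = y.square().mean()
loss.backward()                   # backward runs outside the autocast region, as usual
```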

Okay, the other one, and I think this is kind of what a lot of people think of when they say, “I’m going to write a CUDA kernel” or something. Operator fusion is kind of both very intuitive and a fun, natural one to think about. One mental model of how a GPU works and how memory works is this kind of fun diagram of a factory. Imagine you have a factory, and your factory is your compute part, right? It takes in little box widgets and then outputs little triangle widgets. If you grow your compute, but your conveyor belt, which takes things from memory to compute, has finite bandwidth, you’re not going to be able to use your second factory, right? You’re still capped by the speed at which you can transfer things from memory to compute.

You’ve got this bottleneck. Now, of course, you already knew that, right? I’ve been sort of hammering in the memory bottleneck thing. But I think one insidious way in which you can incur a ton of overhead without really realizing it is kind of this left-hand side computation pattern, right? Imagine the left side of this plot is where the memory is. The right side is your compute unit. To do computation, I start with a square, and I move my squares from my memory to my compute. I do some operation. I turn them into triangles. Now, I ship my triangles back to memory. Then, okay, I realize I need triangles again. So I ship them back into the compute unit. Now the triangles become circles, and so on and so forth. I send my compute sort of back and forth to memory. You might call this kind of a very naive approach.

If you were just doing operations naively on the GPU and just shipping the results straight back to global memory, this is what you’d end up with. If you count the number of times a piece of data went back and forth, this is pretty terrible. You’ve incurred tons of memory overhead. Now you should be able to realize that if you look at the right side, well this compute, there’s no dependency, so I should be able to go square to triangle to circle to rectangle and ship the rectangle back. I can just keep everything in the compute unit the whole time, right?

That’s the right-hand side diagram, and this is the mental model of a fused kernel. You have a bunch of operations that are going to happen on a piece of data in sequence. Instead of writing it back into storage, what I’m going to do is I’m going to do all the computation as much as I can in one place, and then only when I have to ship it back to memory. So that’s this idea of kernel fusion.

There are some very simple examples of how if you write some naive code, you might get a naive set of launches. Here’s an example. I wrote a little neural network module. Let’s say I write a neural network module that takes in X and produces sin^2(X) and cos^2(X), right? Simple code. Now if I run this, you know the computation graph in PyTorch is going to look something like this, and it’s going to launch a whole bunch of CUDA kernels. It’s going to launch, take in X, and it’ll launch a CUDA kernel to compute sin(X). It’ll launch one to compute cos(X), then sin^2(X) and cos^2(X), and finally sin^2(X) plus cos^2(X), right?

There’s a bunch of back and forth that has to happen in order to do this computation. It’s exactly the left-hand side figure that I showed you before. If you were a little smarter, right, and you either wrote your own CUDA kernel or you used something like Torch Compile, you can easily realize that those five operations don’t really depend on very much; they use only a little bit of memory. So you can fuse them into a single operation that does everything on the GPU on a single thread without sending things back to global memory. Right?

Really easy fusion operations like this can be done automatically by compilers. I just mentioned Torch Compile. If you aren’t already doing this, you should strongly consider using Torch Compile everywhere. We’ll show you in the assignment Torch Compile as well. It’s pretty nice.
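
Here is roughly what that sin-squared-plus-cos-squared example looks like with torch.compile doing the fusion; a minimal sketch, with the module name being my own:

```python
import torch

class SinCosSquared(torch.nn.Module):
    def forward(self, x):
        # In eager mode this launches separate kernels for sin, cos, the two squares, and the add.
        return torch.sin(x) ** 2 + torch.cos(x) ** 2

model = SinCosSquared().cuda()
fused = torch.compile(model)          # the element-wise chain gets fused into one kernel

x = torch.randn(1_000_000, device="cuda")
y = fused(x)                          # one launch, no intermediate round trips to global memory
```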

Okay, so I’ve gone through precision and fusion. If anyone has questions, let me know before I move on to recomputation and other kinds of tricks that we can do on the GPU.

Another thing that we can do is called recomputation. Recomputation is this idea of sort of spending more compute to avoid having to do memory access, right? Remember back to your original backpropagation lecture. This one’s actually from CS221. What do we do? Well, we take our inputs at the very bottom. These are the yellow ones. Then we propagate activations upwards. Those are also the yellow values on the tree. Then we compute the Jacobians backwards. Those are the green values on the edges.

To compute my gradients, I’m going to propagate. You multiply the Jacobian and the activations. I’m going to propagate the gradients backward, right? Well, if you think about it, those yellow values after the forward pass have to be stored, right? And then they’re stored, and then they have to be taken from global memory where I stored them and put them into the compute units. Mechanically, that’s how it has to happen. But that might actually be a ton of memory inputs and outputs happening. Instead, you might actually be able to avoid this.

Let me give you an example of how recomputation can speed things up. Here’s another sort of silly function that I might write. I’m just going to stack three sigmoids on top of each other. You can look at the left. That’s the forward graph. That should be exactly your mental model of three sigmoids on top of each other. Now, the computation graph for this, I’m going to compute the sigmoids, and I’m going to store S1 and S2, which are the activations of the sigmoids, and I have my outputs.

That’s my sort of forward pass. Now, the backward pass in this is kind of terrible. When I do my backward graph, I need to go and take S1 and S2 and I need to take the gradients coming sort of backwards into this outbox and then push it into this backward computation, and I’ll get the gradient of X. I need to have three memory reads, one memory write in order to compute the backward pass. For the forward pass, I need to do one memory read of X, and I need to do three memory writes for S1, S2, and out. Hopefully that’s clear.

This is a decent amount of memory reads and writes: I have to do eight of them, and I have very low arithmetic intensity because I have no matrix multiplies at all. The idea of recomputation is to say I don’t want to store those activations at all. I’m not going to put them into memory. I’m just going to recompute them on the fly in my backward pass. Now in my new forward pass, I don’t store S1 and S2. I take X as input. I compute my sigmoids, and I get my output.

Now that’s one memory read for X and one memory write for out. Now in my backward pass, I don’t have activations anymore. So what I’m going to do is I’m going to get both D_out, which is the backward signal coming in from above, and then X, which is my input. I’m going to take two of those, which are two memory reads. On the fly in my SM, in my local memory, I’m going to compute each of these sigmoids, and I’m going to put them into the backward graph. I’m going to recompute S1, S2, and out on the fly inside my local memory. Because I do that, there’s no global memory reads happening here, and then I have one memory write, which is D_X.

Now if you compare the two, I have 5 out of 8 of the memory access for the exact same computation. The price that we paid is that I’m going to have to recompute these three sigmoids. But if you were running sort of idle anyway because you were memory capped, this is a great trade-off. You would be very happy with this because now you’ve traded compute, which you have too much of, for memory bandwidth which you had too little of.

This is one great way of trading one thing you need for another thing that you have. Of course, this is the same trick as gradient checkpointing, recomputing activations for memory savings, but it’s being done for different reasons. Here it’s for execution speed, not because you’re running out of memory. It’s the same technique, but for different goals.
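
In PyTorch, torch.utils.checkpoint gives you exactly this recompute-during-backward behavior. A minimal sketch of the three-sigmoid example under that assumption:

```python
import torch
from torch.utils.checkpoint import checkpoint

def three_sigmoids(x):
    return torch.sigmoid(torch.sigmoid(torch.sigmoid(x)))

x = torch.randn(1024, device="cuda", requires_grad=True)

# Forward stores only x and the output; the intermediate sigmoids (S1, S2)
# are recomputed on the fly during backward instead of being saved.
out = checkpoint(three_sigmoids, x, use_reentrant=False)
out.sum().backward()
```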

This one I think is actually kind of a really interesting one and not one that I knew until I started really looking into how the hardware model of a GPU and DRAM works. The slow memory, the global memory called DRAM in a GPU, is actually very very slow. In order to make it faster, there are certain optimizations that are being done at the hardware level. One of the optimizations that’s done at a hardware level for DRAM is that when you go and read a piece of memory, you don’t actually get just that value back. You actually get a whole chunk of the memory back, and this is called burst mode.

Let’s say I went on and tried to read the very first value of this big memory block. Instead of just the memory giving me back zero, it would actually give me back 0, 1, 2, 3, right? It would give me back four values at once. It would be like, “Here you go. I’m sure you’ll need the 1, 2, and 3 too in the future.” Each address space is cut up into what’s called burst sections, and then you’re given the entire burst section rather than just what you looked for.

This might seem very mystifying—like why would the memory give you three extra bytes for free when you’re just asking for one? There’s a very interesting hardware reason, which is that when you’re addressing into the memory, you know, in order to send the signal out from the memory, those bytes have to be moved to an amplifier. That’s the slow step. Once you’ve done that, you can get many many bytes for free. That’s why this burst section thing exists. It’s kind of masking this more expensive step of actually moving where the data is stored to this amplifier.

Regardless, this kind of means that we might be able to significantly accelerate our memory access if the pattern of memory access is good. If I want to read this entire block over here, if I access it in random order, then I’m going to have to basically query a number of times equal roughly to the length of my query. But if I check the very first value, then I’m going to get all this entire burst section at once. If I go and check number four, I’ll get this burst section, the second burst section at once.

I can basically get four times the throughput if I’m really clever about my memory accesses and only access just the bits I need from each burst section. This is called memory coalescing. If all the threads in a warp fall within the same burst, then basically the sort of smart hardware and programming model will group those queries. Instead of querying 0, 1, 2, 3, it will group them and say, “Just give me zero,” and then I will be able to read out all the 0, 1, 2, 3 at once from this kind of burst-mode DRAM. Remember that a warp is 32 sort of numbered threads, and so memory accesses from a warp happen together. When these warps are reading into these kind of burst sections, there are optimizations that can be done so that you’re getting all four bytes at once rather than getting one of them at a time individually. That will 4x the throughput that you have on your memory.

These are kind of very simple things, but they’re actually very important. Imagine I’m going to do matrix multiplications. This is a core thing that you’re going to have to do a ton if you were to sort of implement a neural network really from scratch in CUDA. In this case, imagine I’m going to read my matrices in one of two ways. I can read it by traversing the rows, right? Each thread is going to traverse the row. Or I can sort of read it in column order. Each thread is going to go down a column, right?

It turns out that this left one, where you’re going across different rows, is going to be quite slow because the memory reads are not going to be coalesced. Whereas if you’re going to this right side where each of the threads are going down, they’re incrementing in rows, then these memory reads will be coalesced. You can think about it for a moment why this is true. When I first looked at this diagram, I was like, “Isn’t it reversed?” It’s actually not. This is the correct one.

The way to think about this on the right-hand side diagram over here, I’m going to have a series of threads that’s trying to access, you know, left to right. So each thread is going to try to load the very first element. In the next time step, I’m going to load the element from this column, the second column, and then the third column and the fourth column, and so on. So if that happens, what happens at time step one? At time step one, my first thread loads this point, and then the second thread loads this point, and then this point and that point, right? Those can’t be coalesced at all. They’re reading different burst sections.

That means that I have to read this entire chunk of memory in order to perform any sort of an operation. Instead, if I was going in the column direction, all the threads will be reading within the single burst section. Only one memory read operation needs to be performed, and you get all the memory at once. This is a very low-level optimization, but it is very important. If your memory traversal order is all wrong, you will actually get much slower memory accesses than you really want.
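
You can see a shadow of this access-pattern cost from PyTorch without writing a kernel. A rough sketch; the exact numbers depend on the GPU, and PyTorch's transpose-copy kernel is itself tiled, so treat it purely as an illustration:

```python
import time
import torch

a = torch.randn(8192, 8192, device="cuda")

def bench(fn, iters=50):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(bench(lambda: a.clone()))            # contiguous copy: nicely coalesced reads
print(bench(lambda: a.t().contiguous()))   # transposed copy: strided reads, worse coalescing
```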

Okay? So then that brings us to kind of the very last and kind of big one. This is the idea of tiling. Tiling is the idea that you would like to group together memory accesses in order to minimize the amount of global memory access that we have to do. To explain this one, I’m going to try to go through this example of a matrix multiply. Hopefully, I’ll be able to sort of explain to you why a naive algorithm for doing matrix multiply is going to be very problematic. Then afterward, I’m going to give you a tiled version of the same idea, and hopefully you’ll be able to see why that’s going to reduce the number of global memory reads that you have to do.

Let’s start with this very simple matrix multiply algorithm. I’ve got my M matrix on the left side. I’ve got my N matrix on the top. In order to compute the matrix-matrix product, right, I’m going to have to traverse over the rows of M and the columns of N and then take the inner product and store that into the corresponding entry of this P matrix. I’ve written out here each of the threads, thread (0,0), (0,1), (1,0), (1,1), corresponding to where they’re storing their outputs, and the access order in which they access each of the individual elements.

Now notice here that what’s going to happen is that the memory access here is not coalesced; the rows of M are going to be accessed in a non-coalesced order, and I have repeated memory accesses. I’ve got M00 being accessed in the first thread, M00 being accessed here again, N00 and N10 being accessed in two different threads, you know, so these values are being kind of read over and over from global memory into many different threads. And so this is going to be potentially very slow. So there’s a question of can we avoid having too many global memory reads and writes. What I would ideally like to do, right? So let me explain kind of the ideal outcome first and then I’ll explain the algorithm. The ideal outcome is that I would like to spend one sort of chunk of time loading pieces from global memory to shared memory where things are fast. I want to do a ton of computation in shared memory and then I want to kind of be done with that piece of data. Right? That’s the ideal outcome. I’ve minimized my global memory accesses.

So now how can I do this in this matrix multiply world? So now what I’m going to do is I’m going to take my matrices both the M matrix and the N matrix and I’m going to cut them up right into tiles. So here I’ve cut this up into 2x2 tiles. So I’ve got a 2x2 M tile and a 2x2 N tile, right? So I’ve got basically smaller submatrices within each of the matrices. And now imagine that my shared memory is big enough to be able to fit these submatrices, right? Within each of these SM. So now this gives a very simple algorithm with which we can do computation.

So what I’m going to do is I’m going to first load, let’s say, this M00 tile on the top left over here, and I’m going to also load my N00 tile into shared memory here. Right? So now I have these partial sums that I can compute. I can take the inner products of the rows of the M00 tile with the columns of the N00 tile and increment them into P00. I can do the same with all the different entries that I can fill out over here. Then once I’m completely done sort of processing these two tiles, I can load a new tile over here. And then I can repeat that computation with the next M tile and the corresponding N tile loaded into shared memory. And then I can sort of increment my partial sums in P.

Right? So now I’ve really sort of consolidated and reduced the amount of global memory access I have to do. Right? I load as much memory as I can at once into shared memory. I do all of my sort of submatrix computations on that tile that I can and then I move on to the next one. Right? And of course the other nice thing is that because I’m loading an entire tile, I can traverse these submatrices in whatever order I want, like column major or row major. And so I can coalesce all the memory accesses whenever I’m loading a tile from global to shared memory.
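
Here is the tiled idea written out in plain Python/NumPy as a sketch; the slicing stands in for loading a tile into shared memory, and a real CUDA or Triton kernel would do all the partial-sum work for a tile pair before touching global memory again:

```python
import numpy as np

def tiled_matmul(M, N, T):
    """P = M @ N, accumulated tile by tile (assumes square n x n inputs with T dividing n)."""
    n = M.shape[0]
    P = np.zeros((n, n), dtype=M.dtype)
    for i in range(0, n, T):
        for j in range(0, n, T):
            for k in range(0, n, T):
                # "Load" one M tile and one N tile, then do every multiply-add
                # that pair of tiles can contribute to this block of P.
                P[i:i+T, j:j+T] += M[i:i+T, k:k+T] @ N[k:k+T, j:j+T]
    return P

A = np.random.randn(256, 256).astype(np.float32)
B = np.random.randn(256, 256).astype(np.float32)
assert np.allclose(tiled_matmul(A, B, T=64), A @ B, atol=1e-2)
```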

So there’s kind of wins all around here when we tile our accesses. So we can do a little bit of tiling math. So we’ve got, let’s say, a matrix A, a matrix B, and a matrix C. So let’s say the full matrices, these are square matrices of size N. And let’s say I have a tile of size T.

Oh yes, question. On the previous slide, are you loading M00 again? So in that case I just wrote it for completeness, but M00, let’s say, is just stored in shared memory. Let’s just keep it cached; I won’t load it again. That’s definitely just there for completeness. Not that you would actually discard and reload the tile again. That would be kind of insane. Cool.

Okay. And so we can kind of do very simple tiling math to think about what’s happening. So let’s say I’m going to do an n by n matrix multiply, right? So if I do a non-tiled matrix multiply, if I’m just going over rows and columns, then every input every time I process it has to come from global memory. So each input is read sort of n times from global memory, right? So each of these is read sort of n times. If I do a tiled matrix multiply, well, you know, the global reads are operating over a tile. So I’m reading each input n over t times from global memory and I’m reading t times within each tile, right?

Of course, I’m doing matrix-matrix multiplies so I can’t reduce the total number of reads; I have to read all the matrix elements, but I can shift the reads into basically fast shared memory, right? So I do t times memory reads into shared memory and n over t times from global memory, and that’s great because if we have big shared memory that can store big tiles, that’s a factor of t reduction in the total amount of data that has to come from global memory.

Right? So tiling can be really powerful when you’re operating over matrices and you can move things into shared memory. Tiling is quite complex. This is the source of many confusing things about GPU and matrix multiply performance. One thing that can happen once we start tiling, you start asking about discretization. So imagine I have a tile size of 128. That seems like a nice good round tile size. But then, you know, when I have a full matrix of 256 size, that’s great. That’s a 2x2 tile. Things load nicely.

Now, let’s say I have size 257 on the column side. Now, this is a bad time because I need to have six tiles in order to cover this matrix. And the two tiles on the right are very sparse. There’s just not much stuff in there, right? And the problem with this is that each tile is going to be assigned to an SM, right? So each of these tiles is going to be a block, and each thread is going to be operating within each tile. So those two tiles on the right, they’re not going to be doing very much at all, right? Those SMs are going to be basically sitting idle.

And if you were kind of compute capped, you would have wanted to more evenly distribute the load between SM, right? So you have to basically optimize your tile sizes to try to avoid these kinds of scenarios. But in reality, there’s a lot of complex things that go into setting the tile size. Remember you have to coalesce your memory accesses. So you have to think carefully about that. You have to not exceed your shared memory size, right? So the tiles can’t be too big.

And you have to divide the matrix dimension hopefully evenly, or as close to evenly as possible, so you don’t end up with this situation of sort of an underutilized SM at the very end here. Yes, so say you have smaller sizes: would GPUs do something like prefetching the tile beforehand, and if so, at what level would that happen?

Yeah. So you’re asking about whether or not you can overlap memory reads and computation. And yeah, that’s naturally done in GPUs like they’re always trying to use the available bandwidth. As long as shared memory is available, they can go and put things into it. The issue is that whenever you’re effectively utilizing your SMs, you’re basically maxed out on your shared memory, right? That’s the bottlenecked resource, and so there is no place to prefetch in some sense.

Cool. Okay. And the other thing that is very complex (we're really getting into the weeds here) is the interaction between tiling and burst sections. So imagine I have a matrix layout that's kind of like this, where I have my nice burst sections, and each burst section lines up nicely with a tile. So to read this tile, all I have to do is fetch four different burst sections and I've gotten this entire tile.

Now imagine what happens if I add one extra element, and with the way the matrix is laid out, my burst sections no longer line up with my tiles; they flow over. So now, when I load my tile, I load this first part and that's really great: I get the entire first row as one burst section. But the second row now belongs to two different burst sections, so I have to do two reads to get that second row, and so on and so forth.

So I've essentially doubled the number of memory accesses because I added a single extra element at the very end, which knocked my burst sections out of alignment with my tile layout. And so basically, if your tiles or your matrix sizes aren't multiples of your burst section, you can easily end up with situations like this where the rows don't line up with the burst sections and you've doubled the amount of memory access that you have to do.

And the way to get around this is you have to do padding to be able to kind of get nice round matrix sizes so that your burst sections line up with the size of your tiles. Right? So this is getting very into the weeds here. But if you really want to squeeze out all the performance from your matrix multiplies, these are the kinds of things you have to think about, right? And you will get bitten by this if you’re not thinking about it.
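
As a concrete illustration of that padding fix, here is a tiny sketch (the 128 tile/burst width and the matrix shape are assumptions for illustration) that zero-pads a matrix so both dimensions become multiples of the tile width:

import numpy as np

def pad_to_multiple(x, multiple):
    """Zero-pad a 2D array so both dimensions are multiples of `multiple`."""
    rows, cols = x.shape
    pad_r = (-rows) % multiple
    pad_c = (-cols) % multiple
    return np.pad(x, ((0, pad_r), (0, pad_c)))

A = np.random.rand(257, 300)       # awkward, misaligned sizes
A_padded = pad_to_multiple(A, 128)
print(A_padded.shape)              # (384, 384): rows now start on aligned boundaries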

And of course, things like torch.compile and all the CUDA optimizations for matrix multiplies are doing exactly the kinds of stuff that I just talked about, right? That's how you get better performance.

And so all this matrix complexity ends up in situations like this. I'm reading Andrej's tweet here: the most dramatic optimization to nanoGPT is to increase the vocab size from 50257 to 50304, the nearest multiple of 64, which gives you much higher occupancy. Careful with your powers of two, right? So that's a 25% speedup from adding, what is it, 47 dimensions to your vocab. How does that happen, right?

And so that kind of brings us back to the mystery. I was dragging you through all the GPU details in the hopes that you'll have a full understanding of the performance characteristics. But in some sense, the payoff is that I now get to explain to you how this chart comes to be, and at the end you won't find matrix multiply performance to be so mysterious or scary.

So the very first part is very simple; we understand compute intensity, right? This is exactly the roofline that I pointed out at the very beginning. So up until here, which is about 1536, there's just not enough matrix multiply work to do. Just loading the matrix and doing the very basic I/O that you have to do becomes the bottleneck below this point, so throughput falls through to the floor.

Below this point, you just don't have enough memory bandwidth to support your compute units. Now on the right side here, in theory, if I draw the upper envelope, this is the maximum achievable performance. So it's possible up here to saturate all of my compute units and get really great performance. But if you mess up your matrix sizing, you can end up in these really weird places, and within each one of these you can end up in a weird trough.

And so we're going to think a little bit about why you can end up in all these different places. The very first thing, this first line here, is a tiling alignment issue. So if you look at the multiples here, I've colored each of these lines based on the divisibility of the matrix size, and this is the size by which it's divisible. So if it's divisible by 32, you're in good shape; you're in these purple dots up here. If you're divisible by 16, you're actually still up here.

There are two colors up there. And then if you're green, k equals 8, you're up here. If you're orange, you're at k equals 2. And if you're at k equals 1, you're all the way down here. If your dimension isn't divisible by anything (don't pick prime dimensions), you're not going to get very good throughput on your matrix multiplies.

And a big part of this is once you get to kind of k equals 2 and k equals 1, you are basically forcing the situation where you can no longer read tiles in the sort of nicely aligned way with your burst reads. And that’s going to lead to some serious issues. So, that’s kind of a problem.

But then, okay. So that’s one part of the mystery, but I think another part of the mystery remains. Within this orange line, I think if you zoom into here, you see this giant drop, right, from this point all the way down to this point where you’re just kind of wondering what happened here? How could I lose so much performance increasing my dimension by two?

And so let's just look at these numbers. I think this is a fun puzzle, so I'm just going to walk you through it. This happens when you transition from size 1792 to 1794; let's use 1794 here, just so that it's still a multiple of two. Well, why does that happen?

Okay. Well, let's say that we're using a tile size of 256 by 128. That's a pretty natural size; as a fun fact, the matrix multiply units in these GPUs naturally operate on matrices of roughly size 128, so 256 by 128 is a very nice tile size. So how many tiles are there? Well, there are seven times 14 tiles, because we're dividing each dimension of the matrix by the corresponding tile dimension. That's a total of 98 different tiles.

And if we increase this dimension just slightly, well, we're going to have to round up each one of our tile counts. And so we're going to have a lot more tiles, 120 of them, right? So we've increased the number of tiles by quite a bit. Not only did we significantly increase the number of tiles, and some of them have lower utilization, which is bad, but actually even worse: an A100 has 108 SMs, right?

And if you go all the way back to the GPU execution model, SMs can execute in parallel; they're the execution units. And so when you have 98 blocks, they all go and run, right? You can dispatch them all at once onto the 108 SMs, all the SMs are running, and you get great utilization. Once you go to 120 tiles, now you've got more tiles than SMs. So 108 of those will execute, and then you go back and say, all right, I've got some more to run at very low utilization. You execute the remaining 12 and wait for those to complete, and that's going to be really bad.

So if you look at your utilization, you get good utilization for a while, then you drop off a cliff, and then you sort of finish up your job, right? This is something called wave quantization. Ideally, the number of tiles is either much bigger than the number of SMs, or at least not like this, where you're just barely over the SM count and you've caused this quantization error.
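
Here is a quick back-of-the-envelope sketch of that wave quantization effect (the 256 by 128 tile and the 108 SMs follow the example above; the utilization estimate is a rough simplification of my own):

import math

def waves(n_rows, n_cols, tile=(256, 128), num_sms=108):
    tiles = math.ceil(n_rows / tile[0]) * math.ceil(n_cols / tile[1])
    full_waves, tail = divmod(tiles, num_sms)
    # During a tail wave, only `tail` of the SMs have any work to do.
    est_utilization = tiles / (num_sms * math.ceil(tiles / num_sms))
    return tiles, full_waves, tail, est_utilization

print(waves(1792, 1792))  # (98, 0, 98, ~0.91): everything fits in one partial wave
print(waves(1794, 1794))  # (120, 1, 12, ~0.56): one full wave plus a nearly idle one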

Cool. All right. I know these are low-level details, but in many ways, I've been saying through many classes that language models and deep learning are about attention to detail. And these kinds of details are the things that allow people to scale up LMs to really large sizes and get great performance.

So it’s worth knowing even if you’re not a person that’s going to do systems engineering. So, what were the tricks, right? Key ideas here. First one is you got to reduce the amount of memory accesses, right? So there’s lots of ways to do it. You can do coalescing so that you can sort of reuse reads that you’re getting for free. You can do fusion so that you can fuse multiple operations together and avoid unnecessary reads and writes.

You can move memory to shared memory. So you know even if you’re going to do reads they’re going to be from much faster memory. And that’s going to be sort of tiling tricks that you can do. And then finally you can kind of trade memory for other resources that you do have, right? So you can trade it for compute which is going to be recomputation or you can trade it for just numerical precision or stability which is going to be quantization.

So there's a whole bag of tricks that you have in order to get performance out, right? There are lots of things you can do; you just have to be really mindful of the role that memory plays in the performance of a GPU. That's the key thing for getting the most out of it.

Cool. Any questions on that before I sort of move to the final part with flash attention?

Okay, good. All right, so now I’m going to put it all together, right? Like I’m going to try to make it so that all the tricks that I taught you aren’t these like random disconnected facts about GPUs. They’re kind of part of the standard performance optimization toolkit and flash attention and flash attention 2 will hopefully teach you how that all comes together to build one of the foundations, I guess, of modern high performance transformers.

So flash attention, we know that it dramatically accelerates attention. Most of you probably know that that's done through some CUDA kernel magic, but maybe you don't know all the details, right? So what the paper shows is: the baseline is attention on an unoptimized PyTorch transformer implementation, and if you fuse the kernel and do a few other things, you can get significant speedups.

From the paper, they say they apply two established techniques, tiling and recomputation, to overcome the technical challenge of computing exact attention with sub-quadratic HBM accesses, right? So it's not sub-quadratic computation, because you can't do that; you have to compute attention in general. But they're going to get sub-quadratic accesses to the high-bandwidth or global memory, right?

And so that’s really the key—if your memory is the bottleneck, you want to make that not quadratic so that at least you can pay for quadratic cost with your compute rather than with your memory.

So just for a really quick recap, at this point you’ve implemented attention many, many times in many classes. So it’s going to be three different matrix multiplies. You’ve got a K, Q, and V with a softmax in between. The matrix multiplies are pretty simple; they can be done with tiling. I’ve shown you examples like that.

What’s different about attention? Well, there’s a softmax thing that’s going to be the real tricky bit. And then once we can deal with the softmax, all of the sort of matrix multiply things I was talking about will just come into play. The matrix multiply, as I said before, is exactly what I taught you.

So if you look at figure one from the flash attention paper, this is really just a simple tiled matrix multiply, right? You see the K matrix and the Q matrix cut up into small blocks. Small blocks are copied to SRAM, multiplied, and then accumulated, or they're sent to HBM where you do the softmaxes and then multiply with V, right? So in terms of the K, Q, V matrix multiplies, this is all really simple.

But now we have to think about the softmax, right? Like what’s going on with the softmax? So the key thing here is the softmax. Sorry, I’m going to roll back one step. So the issue with the softmax—what’s the problem with the softmax? It’s a global operation, right? The softmax in attention operates row by row. You have to sum the entire row, right?

To compute the normalizing term of the softmax you need the whole row, and that's very problematic. If I have tiles, ideally I want to do everything within the tiles; I don't ever want to have to write back to the big matrix. And so I need a softmax that can be computed online within each tile, right? I want to do as much computation within each tile as possible.

So the key thing here is to use what’s called the online softmax. And so what is that? If you have a stream of values, right, normally the batch version of the softmax, you take all of your x1 through xn and you would exponentiate them, sum them, and you would divide them, right? That’s what you would do in your normal softmax.

And then you would maybe compute the maximum value and subtract it in order to make this numerically stable, right? So this is the standard numerically stable softmax on the left side. The online softmax, which I've taken from Milakov and Gimelshein (2018), realizes that, via a sort of telescoping-sum argument, you can pull out the current running normalizer term and the current top term, e to the xi minus the max of the xk's seen so far, right?

So what you’re going to do is you’re going to maintain your current max that you’ve seen over x1 through xj which is my current iteration and then I’m also going to maintain sort of this correction term. If my max updates, this is going to basically correct my max, and then I’m going to add my sort of new term over here.

Right? So this d of j is going to track, online, the top term of this equation, term two over here. And then at the end I can also compute the normalizer and get the normalized yi that I want, right? This final d is itself the normalization term that I need.

So the key thing here is that this can be done online. I don’t need the x1 through xn up front. All I need is sort of the stream of x1 through xn. And that’s really key because I can now compute the softmax tile by tile. Right? Within each tile, I can run this algorithm and that will let me compute kind of the partial softmax for that tile.
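
Here is a minimal Python sketch of that online softmax (the variable names are mine: m is the running max, d is the running denominator; this is the streaming recurrence described above, not code from the paper):

import math

def online_softmax(xs):
    m = float("-inf")   # running max over x_1 .. x_j
    d = 0.0             # running sum of exp(x_i - m)
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum if the max moved, then add the new term.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # A second pass normalizes; in the tiled setting this is the final division.
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 3.0, -2.0, 0.5]))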

And then I can sort of write back if I need to all the components that I’m keeping track of. And that’s all that I kind of need in order to do this computation. Right? So I never have to materialize the full n squared matrix in order to compute the softmax. And so that’s basically it. But once you have that, you know, you’ve put it all together, and you can get the forward pass of flash attention.

And if you go and look at the flash attention 2 paper, which is going to be a thing that we’re going to ask you to implement, you’re going to be following through these steps here. You’re going to see exactly this idea. So first you’re going to have your KQ matrix multiply and this is going to be tiled. So these are little tiled chunks and they’re going to be multiplied.

And how am I going to compute the softmax? Well, I’m going to maintain sort of a running value of these sort of exponentiated sums. And then I’m going to keep incrementally updating it and correcting for the maximum terms. And by doing that I can compute all the necessary quantities kind of tile by tile, sort of going from one tile to another. And then just multiply once again with tiles with V in the end and that will give me sort of my full softmax output, right?
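
Here is a NumPy sketch of that tile-by-tile forward pass in the spirit of flash attention (this is not the actual kernel; the block size, variable names, and the check against a plain softmax are my own illustration):

import numpy as np

def tiled_attention(Q, K, V, block=64):
    n, d = Q.shape
    O = np.zeros((n, d))
    for i in range(0, n, block):                 # loop over query tiles
        q = Q[i:i+block]
        m = np.full(q.shape[0], -np.inf)         # running row max
        l = np.zeros(q.shape[0])                 # running row denominator
        acc = np.zeros((q.shape[0], d))          # running unnormalized output
        for j in range(0, n, block):             # stream over key/value tiles
            s = q @ K[j:j+block].T / np.sqrt(d)  # scores for this pair of tiles
            m_new = np.maximum(m, s.max(axis=1))
            scale = np.exp(m - m_new)            # rescale old stats if the max moved
            p = np.exp(s - m_new[:, None])
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ V[j:j+block]
            m = m_new
        O[i:i+block] = acc / l[:, None]          # final normalization per query tile
    return O

n, d = 256, 32
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, n, d))
S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), reference)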

Yes, so we won’t be able to compute that output until we compute the multiplication across all tiles, right? So we do have to double back on each tile. So the question was, you can’t compute this until you are done with all the tiles. Yes, that’s correct.

But let's say I do all the tiles once, right? Like I do all n squared tiles. At that point I have all the components that I need in order to directly output the softmax, and I don't have to redo any computation, because I already have the normalizer terms. By going through each of these tiles, at the end I've built up l of N, which is the sum of all the exponentiated terms. So I already have that in my shared memory for this last tile.

And then that allows me to exponentiate and divide and then return all the components. Okay. So the backward pass, I’m not going to cover. You can do recomputation tile by tile which will allow you to avoid storing the softmax. Remember, I always want to avoid storing anything that’s of size n squared.

And so here I’ve been sort of clever with the tiles so that I don’t have to store any of the n squared components when I’m computing, for example, the softmax. But in the backwards pass, if I store the activations, that’s already something that’s n squared sized, right? So I don’t want to store my n squared activations. I’m going to have to recompute it on the fly tile by tile when I do the backwards pass.

Right. So that’s a really key other trick that they do in order to make the backwards pass possible. But otherwise it’s fairly standard. It’s really the same thing as computing the gradients, just tile by tile and doing that computation.

So okay, that brings us to the end here. Hopefully you’ve kind of seen how all of the pieces I talked about tiling and coalescing and recomputation come together to give you flash attention and all these really cool things that make your transformers go much faster.

So to recap for the whole lecture, right? Hardware is kind of the thing that has really powered all of the language models that we have today. And so if you really want to leverage your hardware, you have to understand the low-level details. I think all the systems advances really engage with a lot of the concepts that I taught today.

And the current GPU scaling, you know, that plot is really the one you should remember. It really incentivizes and encourages you to think about memory movement. Right? The memory movement is the bottleneck in all of this. And so you don’t want to just think about, oh how do I reduce the number of flops? That’s important too. Really, you really have to think about, okay, how do I make my memory movements more efficient?

And then finally, if you have to do a certain amount of computation, well, to optimize things, the way to do it is to optimize your data movement, right? To be able to avoid as much movement from the high bandwidth memory or the global memory as possible. You want to reduce that and have everything in the very fast shared memory, and that leads to good performance on things like flash attention.

Thanks, everyone.


This is an experimental rewrite

So hopefully everyone’s having a good time with assignment one. It’s due tonight, so let us know if you need an extension. Assignment two is coming out soon; we’re just putting the finishing touches on some of the Triton content. Hopefully, you’ll enjoy it! You’ll get to implement Flash Attention 2 or parts of it, which I think will be quite nice.

Today, we’re going to talk about GPUs, the essential components that drive our language models. Understanding them is critical. For those of you who haven’t studied the hardware behind your models, GPUs can seem quite mysterious. My goal today is to demystify CUDA and GPUs. One key aspect I want to clarify—while you don’t need to understand the entire plot on the slide, it’s important to grasp why GPUs can slow down. As the size of your matrix multiplies increases, you may expect consistent performance; however, you may notice unpredictable wave-like patterns. You might wonder why your GPU is fast for certain multiples of certain numbers but slow for others. We’ll explore that together.

Additionally, I want to discuss how to create fast algorithms. Many of you have likely heard of Flash Attention. It enables longer context processing by cleverly computing the attention operation within a transformer. Some of you may want to develop new algorithms or implementations like Flash Attention, which leads to essential questions: What primitives and components do we need to understand to do that? These are the two learning goals for today. By the end of the lecture, you should feel comfortable with GPUs and understand how they work. Additionally, you should feel equipped to accelerate parts of your algorithms; if you create a new architecture, you will hopefully feel confident enhancing it with CUDA.

Since hardware isn't my primary focus, I owe a lot of credit to special resources, particularly Horace He's blog, which has numerous fun GPU facts. For instance, you can learn why matrix multiplies filled with zeros are faster than those that are not. I've also drawn from other resources, like the CUDA MODE group and Google's TPU book. If you're interested in this topic, I encourage you to check out these resources for more insights, as this overview provides a somewhat shallow yet complete coverage of the hardware.

Today, we will focus solely on the non-parallel aspects of the hardware stack. We’ll dive deep into how GPUs work as individual accelerators and discuss their important components. I’ll also touch briefly on TPUs, as they are conceptually similar to GPUs. Once we comprehend the hardware and execution model of GPUs, we’ll analyze what makes GPUs fast on certain workloads and what slows them down. We’ll explore their performance comprehensively.

In the last segment of our discussion, we’ll engage in a hands-on piece. I’ll guide you through Flash Attention, integrating all the lessons we’ve learned and demonstrating how it all comes together. That’s the final portion of today’s lecture.

Many of you have taken an NLP course, and it’s likely that some amount of scaling laws is covered in those classes. This chart serves to set the context. We recognize that having more computational power is beneficial for training large language models. While this is a pre-training scaling chart, you could easily replace it with an inference scaling chart. Generally speaking, the more compute you have, the more processing you can perform on your data. You can ingest larger datasets and train bigger models, all of which contribute to improved performance.

You might think deep learning is crucial, but what’s truly driven performance is faster hardware, better utilization, and improved parallelization. This highlights why it’s essential to understand hardware. Once we consider compute scaling, we must ask: how do we achieve compute scaling? How can we train our models more quickly? In the initial days of semiconductor scaling, when focusing on CPUs, the performance boost came from something called Dennard scaling. With Moore’s Law, you would double the number of transistors on a chip each year. This doubling meant that smaller transistors could be driven at higher clock speeds with lower power consumption, resulting in greater performance.

However, from the 1980s to the 2000s, this trend plateaued. As shown in Hennessy and Patterson’s chart, single-thread performance—represented by the blue dots—began to taper off. The number of transistors may not have fallen, and while you did have chips with increased transistor densities, they didn’t translate into higher throughput for single threads. This indicates that we can no longer solely rely on absolute computational speed; we have to compensate through parallel scaling. The scaling narrative for deep learning and neural networks has transitioned from single-thread performance—which focuses on completing tasks more quickly—to parallel scaling, where multiple workloads are processed simultaneously.

One compelling chart illustrating compute scaling is by Bill Dally in his keynote, showcasing the super-exponential rise in integer operations per second, moving from the earliest K20s to the H100. This remarkable curve underscores the need to harness this growth to maximize the potential of our language models.

I’ve pointed out this crucial difference before: CPUs are something with which most programmers are familiar. They operate on an execution model where a program steps through instructions in a single thread. To support this, they require robust control units and generally need to execute instructions quickly due to branching and conditional control logic. A CPU dedicates significant chip space to large control units for branch prediction, executing quickly because it handles relatively few threads—and although there are CPUs with many cores now, they are still limited compared to a GPU.

Conversely, a GPU contains numerous compute units (ALUs) and allocates smaller parts of the chip to control logic. While there is some control logic managing many compute units operating in parallel, the emphasis differs significantly between the two architectures. CPUs focus on optimizing latency, aiming to finish tasks as quickly as possible. For example, if I have tasks T1 through T4 on the right side, a CPU strives to complete each task swiftly, with T1 finishing first.

In a GPU, however, the optimization target is high throughput. Latency is less critical; the goal is for all tasks to complete quickly in aggregate. This might involve many threads that can rapidly go to sleep and wake up. Ultimately, the GPU completes the entire workload (T1 through T4) before the CPU does, though each individual task may have a higher latency. These distinct design principles highlight the differences between CPUs and GPUs.

A GPU also has a markedly different architecture. If you’ve looked at a GPU layout diagram, you’ll notice that a GPU encompasses multiple SMs (streaming multiprocessors), which can be viewed as atomic units when programming with something like Triton. Each SM contains many SPs (streaming processors), which execute multiple threads in parallel. An SM has control logic for basic decision-making, like branching, while SPs carry out the same instruction across various data pieces, enabling extensive parallel computation under this model.

Each SM serves as a basic control unit, while SPs carry out substantial computations independently. For instance, the A100 GPU—now a previous generation—boasts 108 SMs, far exceeding the core count of most CPUs. Each SM features numerous SPs and specialized matrix multiply units, exemplifying the compute model.

Possible image caption: Diagram of an A100 GPU layout showing SMs and SPs.

Was there a question? Sorry.

Someone asked about the slide before GPUs. Is this GPU the same as that GPU? Yes, this is a cartoon version. You can think of each row as being an SM with its own control units, while each green block perhaps represents an SP32 processing unit. Each SM manages its own components, like tensor cores, to perform computations.

Cool. There are two essential aspects to consider. Although GPUs are primarily for computation, memory is arguably even more crucial regarding the performance profiles of how we run our programs on them. To understand memory, we need to explore the physical arrangement of the GPU chip, as proximity plays a significant role in speed. I will illustrate how things are organized and how it relates to memory access and performance.

The closer a memory segment is to each SM, the faster it will be. There are fast memory types, like L1 and shared memory, that reside within the SM. Registers, as well as frequently read and written pieces of data, should ideally stay in this fast on-SM memory. As demonstrated in the chip layout, green areas represent SMs while blue areas indicate L2 memory, which is near the SM. While L2 memory is slower (by a factor of 10), it remains relatively quick.

Outside the chip is your DRAM, which connects to the GPU—this particular chip diagram illustrates HBM connectors linking it to external DRAM. Accessing off-chip memory takes significantly longer—about 200 or 300 clock cycles—compared to 20 clock cycles for on-chip memory. This discrepancy in access time can negatively impact performance. If your computation requires accessing global memory, you may find that your SM runs out of tasks and idles, leading to poor utilization. This theme is critical when considering memory usage.

In essence, understanding how GPUs execute tasks will be key to writing high-performance code for a GPU in assignment two. This execution model isn’t overly complicated but requires some knowledge.

There are three granularity levels to consider: blocks, warps, and threads. Blocks are large groups of threads assigned to an SM. Think of each SM as a worker and each block as a collective ready to be processed by it.

Within these blocks, many threads exist. When executing, these threads work in groups called warps, where 32 consecutively numbered threads execute together. This diagram shows multiple blocks assigned to different SMs, with each block containing various warps. Each warp consists of numerous threads executing the same instruction on different data, illustrating the GPU’s execution model.

While blocks, warps, and threads may seem complex, they significantly impact our performance when designing CUDA kernels. This is vital to remember, and I’ll reiterate it as we proceed.

That outlines the logical execution model of a GPU, and if you grasp that, you will comprehend how GPUs execute tasks. But there’s also a logical memory model for a GPU. Without displaying the physical hardware, this encompasses how you program for a GPU. You have registers for quick single-number storage, along with local memory, shared memory, and global memory, which progressively gets slower. Your code can write to global memory and constant memory, although the latter is seldom used. Each thread has access to its own register and shared memory, but information shared across blocks must be written to global memory—this is quite important.

Ideally, your threads operate on a small amount of data stored in shared memory. If all threads efficiently access that shared memory, they will complete their tasks quickly. However, if a thread needs to pull data from various locations, it will have to rely on the significantly slower global memory. This theme will recur as we discuss different GPU operation strategies.

Hopefully, that’s clear. This overview provides a high-level understanding of a GPU. If you have any questions about any of this, please feel free to ask as we continue.

Now, let’s take a slight detour. Last year, I didn’t cover TPUs because there was limited information available. However, the nice TPU book or website I mentioned earlier was released and contains much useful content. After discussing it with a few people from Google, I found that TPUs are quite similar to GPUs, which makes them relevant even if you may never work directly with them.

Here's a diagram of what a TPU looks like. There's something called a tensor core, which you can think of as similar to an SM (streaming multiprocessor). Each serves as an atomic unit that processes data. There is also a scalar unit, functioning as the control unit capable of arbitrary operations, and a vector unit for entry-wise operations on vectors. Most importantly, the TPU features a specialized section of the chip dedicated to matrix multiplies, known as the MXU, alongside fast vector memory and scalar memory (SMEM).

Possible image caption: TPU architecture with tensor core and other components.

You should notice the similarities to an SM: external slow memory, very fast internal memory, and dedicated hardware for matrix multiplication. The underlying structure is essentially the same. However, I’ll explain the variance in how the accelerators function in the parallelism lecture next week. I won’t discuss warps or any of the GPU specifics here; tensor cores are much simpler since they focus solely on matrix multiplications, making their architecture straightforward while serving similar purposes.

Someone asked whether a tensor core is optimized for general tensors. The question was whether it operates on arbitrary tensors or just specific types. Indeed, it can process arbitrary tensors; however, the computations performed by the MXU are matrix multiplies, which means it typically deals with batch matrix multiplies acting on tensors. So, in a way, it’s a yes and no answer.

The GPU has seen immense success partly because it scales easily. If more processing power is needed, you merely add more SMs. There’s no need to push clock speeds higher and face the accompanying heat problems. Even though CUDA programming may seem intimidating, its model helps make tasks easier to conceptualize. Each SM contains threads executing the same instruction on numerous data pieces, which is straightforward to reason about—especially when working with matrices and simple operations.

In addition, these threads are lightweight, meaning they can be paused and resumed when needed. If one thread must wait or if you need to evict a task and start a new one, lightweight threads allow for high utilization within each SM.

Historically, GPUs were focused on graphics processing, and in their early days, they weren’t utilized for scientific computing. Researchers discovered how to leverage early NVIDIA GPUs for rapid matrix multiplies, as shown in one of the first papers exploring fast matrix computations with graphics hardware. They even figured out how to use texture buffers for matrix multiplication. Now, NVIDIA and others recognize that matrix multiplies are special operations vital for deep learning workloads.

Matrix multiplies can be considered privileged operations; the chart illustrates the teraflops per second across different generations of NVIDIA GPUs. The orange line indicates FLOPS from matrix multiplication, while the blue line illustrates non-matrix multiply performance. Notice the significant performance gap in the V100s, which introduced tensor cores—specialized hardware for matrix operations. If you design any neural architecture, ensuring that most of your workload consists of matrix multiplies is essential, as they are orders of magnitude faster than non-matrix multiply tasks.

Creating a neural network based on non-matrix multiply operations could lead to significant challenges. It’s also essential to grasp how the various GPU components scale relative to one another. This chart nicely depicts the scaling speed among different parts of what we call the language model training stack.

The blue line shows the connection speeds from the GPU to the host server, such as PCIe and NVLink. While these connectivity options are improving, they are doing so slowly. The green line illustrates global memory speed, moving from GDDR to HBM2E, which is significantly faster—100x faster in logarithmic scaling—yet still not scaling quickly. The gray line represents compute scaling—the number of floating-point operations considering matrix FLOPS—showing how quickly compute capabilities are growing.

In the past, your main constraints would have been FLOPS; there simply weren’t enough to perform needed matrix multiplications. Now, with cutting-edge H100s, you’re likely facing memory as a bottleneck since it’s not growing as rapidly. This trend is unlikely to change, as DRAM scaling presents significant challenges. Thus, when designing memory-efficient algorithms, it’s imperative to prioritize memory considerations. This is a recurring theme crucial for GPU performance.

I’ve shared many GPU insights today; if any of this seems new to you, let’s recap. GPUs are vast parallel processing systems applying the same instructions across numerous threads, featuring many SMs. Compute and matrix multiplication capabilities have advanced rapidly, outpacing memory improvements—a key aspect of GPU performance characteristics to keep in mind. However, not all memory is slow; there’s a hierarchy. Fast memory exists alongside slower options, and leveraging this hierarchy could lead to enhanced performance.

If these facts resonate with you, you will better understand the performance aspects I’ll discuss next.

Are there any questions before I transition to the next segment? Okay, excellent. Now that you’re all well-versed in GPUs, our next goal is making machine learning workloads run efficiently on them. I’m going to start with this chart, aiming to clarify its meanings and encourage us to think critically. We’re multiplying square matrices here. The x-axis shows the size of our square matrix multiplies, while the y-axis represents the operations per second executed—essentially, GPU utilization.

As the matrices grow larger, GPU utilization tends to improve since additional work offsets overhead from launching jobs. However, you might notice distinct, wavy lines representing various performance behaviors—unpredictable and complex patterns. We aim to decode what’s happening with these lines, and by the end of this section, I promise you will understand each of these phenomena clearly. You’ll be able to analyze this plot and recognize it as a typical GPU behavior.

The first observation in analyzing this plot draws parallels to the roofline model, familiar to those who’ve taken systems hardware courses. The roofline model suggests two main regimes regarding throughput or utilization. The left side of this curve, indicated in green, denotes a region that is memory-limited, while the right side reveals a throughput-limited area.

In essence, the right side reflects fully utilized compute units, with all matrix multiply units constantly working. The diagonal represents a memory bottleneck, where achievable throughput hinges on the arithmetic intensity of the operations, measured in FLOPs per byte.

Our objective is to avoid that left side region, where performance is constrained by memory, aiming instead for the right side, where we achieve optimal utilization of compute units. In summary, the goal is to maintain minimal memory access interruptions and to manage global memory accesses wisely.

This effort, however, is complex. While we want to minimize unnecessary memory accesses, we must employ a variety of techniques to ensure optimal performance. The first point to touch upon is conditionals. As mentioned earlier, the execution model for GPUs is SIMT—Single Instruction Multiple Thread. If you write a code block that includes an if statement with different instructions for different thread indices, it will create execution delays in the warp.

The threads executing opposite instructions will pause until their turn arrives. This means GPUs struggle with conditional statements because of how the simultaneous execution model operates—it can severely hinder performance.

So let’s give a simple example and think about the arithmetic intensity of a basic element-wise operation. For instance, let’s consider the equation (X = \max(0, X)) applied to a vector of size (n). If we do this naively with 32-bit floating-point values, how many memory accesses do we encounter? First, I need to read my (X); then, I need to write the result when (X) is less than zero. Altogether, that amounts to eight bytes, right?

Now, how many operations do I perform? I have one comparison operation for checking if (X) is less than zero, which counts as a single floating-point operation (FLOP). Thus, my ratio is eight bytes per single floating-point operation. If I were to change this to 16-bit floating-point values, my FLOP intensity remains constant, but my memory access is effectively halved. So now I’m at four bytes per FLOP. In a sense, I’ve gained double the memory bandwidth for free, presuming that I can effectively work with 16-bit floating points.
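
Here is a tiny bookkeeping sketch of that arithmetic-intensity calculation (the element count is arbitrary; the point is just the bytes-per-FLOP ratio):

def bytes_per_flop(n, bytes_per_element):
    bytes_moved = 2 * n * bytes_per_element   # one read plus one write per element
    flops = n                                 # one comparison per element for max(0, x)
    return bytes_moved / flops

print(bytes_per_flop(n=1_000_000, bytes_per_element=4))  # 8.0 bytes per FLOP in fp32
print(bytes_per_flop(n=1_000_000, bytes_per_element=2))  # 4.0 bytes per FLOP in fp16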

This principle is central to the design of many components. For your assignment, you will experiment with mixed precision or low precision training among other variations. It's a crucial point to understand that not all parts of your network and training algorithm should be converted to low precision. For example, when dealing with matrix multiplies in mixed precision, you would typically use 16-bit inputs, while the accumulation is conducted in full 32-bit precision; this is beneficial because, when adding up partial sums, you want to stay in high precision.

Therefore, your calculations are maintained in 32-bit formats, allowing the tensor core to return a 32-bit result, which you can choose to downcast back to 16-bit if desired. While inputs may be in 16-bit format, operations involving accumulation might need to remain in 32-bit. Some operations may require more precision, like certain functions where the range is essential to avoid automatic blow-ups or zeroing out. In such cases, you might prefer to utilize BF16. Careful engineering is vital for ensuring that your models are stable when trained at lower precision levels. Successfully achieving this can effectively double the throughput of your bottleneck by transitioning from 32-bit to 16-bit under memory constraints.
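
Here is a hedged PyTorch sketch of that mixed-precision pattern, assuming a CUDA device is available (the shapes are arbitrary, and autocast is just one way to express it): the inputs to the matmul are cast to bf16 while the accumulation stays in fp32, and reductions are kept in fp32.

import torch

x = torch.randn(1024, 1024, device="cuda")
w = torch.randn(1024, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = x @ w                        # inputs cast to bf16; accumulation happens in fp32
    print(y.dtype)                   # torch.bfloat16: result comes back in low precision
    loss = y.float().pow(2).mean()   # keep the reduction in fp32 for stability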

Another concept people often associate with writing CUDA kernels is operator fusion—a straightforward yet intuitive approach. Visualize a factory as a mental model, representing your compute section. The factory takes in small box widgets and outputs small triangle widgets. If you increase your computation capacity but your conveyor belt, which represents memory bandwidth, remains finite, you won’t be able to fully utilize your additional compute units.

You already recognize the memory bottleneck, but what’s less apparent is how easy it can be to incur substantial overhead with the naive left-hand computation pattern. For instance, if I start with squares in memory, I would move them to the compute unit for processing, convert them to triangles, and then send them back to memory. If I then realize I need triangles again, I’d have to bring them back to the compute unit, where they transform into circles, and so on. This back-and-forth approach can lead to significant inefficiencies.

This naive method results in an excessive number of memory accesses. In contrast, the right-hand diagram illustrates a more efficient computation model, where data remains in the compute unit throughout successive operations, like transitioning from squares to triangles to circles and then to rectangles before returning the final result to memory. This strategy embodies the concept of kernel fusion, where multiple operations occur sequentially on a single piece of data, minimizing unnecessary memory writes.

Here’s a practical example. Imagine I create a neural network module that takes input (X) and produces (\sin^2(X)) and (\cos^2(X)). In PyTorch, the computation graph is likely to spawn several CUDA kernels: one kernel for (\sin(X)), another for (\cos(X)), followed by kernels for (\sin^2(X)), (\cos^2(X)), and finally for computing (\sin^2(X) + \cos^2(X)). This generates multiple trips back and forth in memory, mirroring the inefficiencies described in the left-hand diagram.

However, with a bit of foresight, either by crafting your own CUDA kernel or utilizing frameworks like Torch Compile, you can realize that these five operations have little dependency and only occupy a small amount of memory. Thus, you can unify them into a single operation that executes all computations on the GPU within a single thread, avoiding unnecessary global memory transfers. Simple fusion operations like this can be automatically handled by compilers. Keep in mind that using Torch Compile could significantly streamline your processes—it’s quite beneficial, and we’ll demonstrate its use in the assignment.
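
Here is a sketch of that fusion example using torch.compile, assuming PyTorch 2.x and a CUDA device (the shape is arbitrary): the chain of pointwise ops can be fused into a single kernel instead of making several round trips through global memory.

import torch

def sin_cos_squared(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2   # eager mode: several kernels and memory trips

fused = torch.compile(sin_cos_squared)              # compiles the pointwise chain into one kernel

x = torch.randn(4096, 4096, device="cuda")
print((fused(x) - sin_cos_squared(x)).abs().max())  # ~0: same math, far fewer memory round trips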

Now that we’ve discussed precision and fusion, are there any questions before I continue onto recomputation and other GPU optimization techniques?

Another effective strategy is recomputation, which involves investing more compute resources to reduce memory access. Reflecting back on your backpropagation lecture, we begin by propagating inputs at the base, progressing activations upwards, followed by computing Jacobians backwards. To compute gradients, you would multiply the Jacobian values with the activations, then propagate the gradients back up.

After the forward pass, those activation values must be stored in memory, creating frequent demands for data retrieval from global memory. Instead, you might skip storing these activations altogether, opting to recompute them on the fly during the back pass.

Here's an illustration using a function of three stacked sigmoids. For the forward graph, let's assume my operations yield activations (S1) and (S2) along with my outputs. During the backward graph, I would conventionally store (S1) and (S2), leading to multiple memory accesses. However, if I don't store them and simply compute these values on the fly as needed, I significantly reduce the overall memory accesses from eight to just one read for input (X) and one memory write for the output.

So, by sacrificing the storage of activations and instead creating them in real-time during the backward pass, we optimize memory bandwidth utilization without compromising performance. This swapping of compute resources for memory access is extremely valuable, leveraging a system that may already be idling due to memory constraints, a trade-off that can lead to optimal execution speeds.

This technique is essentially gradient checkpointing, but here the goal is to speed up execution rather than simply to manage memory usage.
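
Here is a hedged sketch of that recomputation idea using PyTorch's activation checkpointing utility, assuming a CUDA device (the three stacked sigmoids mirror the example above; the shapes are arbitrary):

import torch
from torch.utils.checkpoint import checkpoint

def three_sigmoids(x):
    return torch.sigmoid(torch.sigmoid(torch.sigmoid(x)))

x = torch.randn(4096, 4096, device="cuda", requires_grad=True)

# The intermediate activations are not stored; they are recomputed during backward.
y = checkpoint(three_sigmoids, x, use_reentrant=False)
y.sum().backward()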

There's also something particularly interesting about how slow global memory, the DRAM on a GPU, really is; to speed it up, the hardware employs an optimization known as burst mode. When you request a single value from a large memory block, instead of receiving just that value, you get an entire chunk. If you ask for the first value in a memory block, for instance, you might get back values 0, 1, 2, and 3; essentially, you receive a burst section's worth of data.

This can seem counterintuitive, but the reasoning lies in the physical requirements of addressing memory, which necessitates moving the requested data to an amplifier—this process incurs latency. Subsequent requests operate more efficiently as you gain access to multiple bytes without additional delays. Essentially, if your memory access patterns are optimal, burst mode allows significant acceleration for your memory interactions.

If your access patterns are poor and you read memory essentially at random, you waste most of each burst and performance suffers, so reading along burst sections is the smarter way to retrieve memory. If multiple threads in a warp access addresses within the same burst section, the hardware can combine these requests into one efficient transaction.

For example, during matrix multiplications, how you read matrices affects speed—if you traverse rows individually, you generate non-coalesced memory reads, leading to slower performance. In contrast, if you read in column order, you’re set to achieve coalesced reads since all threads will pull from within the same burst section, resulting in better memory throughput.
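
As a toy illustration of why traversal order matters, here is a small sketch that counts the memory transactions one warp of 32 threads would need under an assumed 128-byte burst section (the sizes are illustrative, not a model of any specific GPU):

def transactions_per_warp(stride_elements, element_bytes=4, burst_bytes=128, warp=32):
    addresses = [t * stride_elements * element_bytes for t in range(warp)]
    return len({addr // burst_bytes for addr in addresses})

print(transactions_per_warp(stride_elements=1))     # 1: fully coalesced, one burst serves the warp
print(transactions_per_warp(stride_elements=1024))  # 32: every thread lands in a different burst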

Therefore, memory traversal order is crucial; improper patterns can lead to significant inefficiencies. This brings us to a significant concept: tiling. Tiling involves clustering memory accesses to minimize global memory operations during calculations.

Let’s consider a naive matrix multiplication algorithm. When trying to compute the product of two matrices, you necessarily traverse the rows of (M) and columns of (N), accumulating results in (P). However, this method generates repeated global memory accesses for certain values, creating performance issues.

My ideal solution involves offloading pieces of data from global memory to shared memory, where they can be accessed more efficiently. In practice, I’d divide both matrices (M) and (N) into tiles—submatrices small enough to fit in shared memory.

Upon loading, I compute partial sums from those tiles entirely in shared memory and only return results back to global memory when finished processing. This rules out excessive global memory overhead, as it allows for a streamlined operation where tiles can be accessed in any order, benefiting from efficient memory coalescing.

This optimizes memory interactions significantly. For a general (N \times N) matrix multiplication, a non-tiled approach requires (N) reads and writes from global memory, while a tiled approach can drastically reduce the total number of reads based on tile size.

Tiling development, while potent, comes with its own complexities. For instance, a poor selection of tile sizes can lead to inefficient SM utilization. If your matrix dimensions don’t align with your tiling strategy, you may end up with sparse tiles that underutilize processing resources in your SMs.

Adapting your tile sizes and avoiding these situations without overstepping shared memory limits involves careful consideration of your overall matrix dimensions and memory accesses.

To clarify your question about overlapping memory reads and compute, yes, it’s a built-in aspect of GPU architecture. GPUs constantly strive to maximize available bandwidth by utilizing shared memory effectively, but when fully utilizing your compute units, achieving further pre-fetching can become limited.

Finally, it’s important to understand how memory coalescing interacts with tiling. If a tile size aligns well with your burst sections, you can process multiple requests simultaneously. However, if your tiles spill over into different burst sections, accessing them becomes less efficient, necessitating additional reads—compromising the speed benefit of tiling.

Essentially, I’ve doubled the number of memory accesses because I’ve added an extra element at the end, which altered the alignment of my burst section and layout. If your tiles or matrix sizes aren’t multiples of your burst section, you can easily end up in situations where the rows don’t align with the burst section, resulting in an increase in the amount of memory access required.

To solve this problem, you need to implement padding to achieve nice round matrix sizes that align with your burst sections, right? I know this gets deep into the technical details, but if you want to maximize the performance of your matrix multiplications, these are critical considerations. You’ll encounter issues if you overlook them.

Of course, tools like Torch Compile and the various CUDA optimizations for matrix multiplications are designed to handle these specific challenges, right? That’s the key to achieving better performance.

This complexity surrounding matrices often leads to scenarios like the one in Andrej's tweet. The most significant optimization for nanoGPT was simply increasing the vocab size from 50257 to 50304, which is the nearest multiple of 64. This adjustment enhanced the occupancy, showcasing how just a small tweak—like adding 47 dimensions to your vocabulary—can lead to a remarkable 25% speed-up.

This brings us back to the mystery I aimed to clarify by dragging you through all the GPU intricacies. By the end, you’ll have a far better understanding of performance factors and will find matrix multiplication performance much less daunting.

The first part of this explanation is simple: compute intensity. This directly corresponds to the roofline I mentioned earlier. Up until about 1536, there’s insufficient matrix multiplication work to be done; just loading the matrices and performing basic I/O becomes a bottleneck below this threshold. Consequently, throughput suffers significantly.

Beyond this point, the memory bandwidth fails to support your compute units adequately. On the right side, in theory, if I draw the maximum achievable performance envelope, it’s possible to fully saturate all computing units and achieve impressive performance. However, if you misalign your matrix sizes, you may end up in some perplexing spots where performance dips occur.

Let’s think a bit about why there are so many different performance levels. The first line here illustrates a tiling alignment issue. I’ve colored each line according to the divisibility of the matrix size. If it’s divisible by 32, you’re in good shape, as represented by the purple dots. If it’s divisible by 16, you still remain in a good zone.

There are two colors to observe: the green for (k = 8) and orange for (k = 2). If (k = 1), then your performance drops down significantly. Avoid prime dimensions at all costs, as these won’t yield good matrix multiplication throughput.

A big issue comes when you reach (k = 2) or (k = 1)—you’ll find that reading tiles no longer aligns nicely with your burst reads, leading to serious performance problems.

Another layer of this mystery involves the significant drop represented by the orange line. If you look here, you see a giant dip in performance, raising the question: how could there be such a loss after only increasing the dimension by two?

Let’s dissect this puzzle: this performance issue arises when transitioning from size 1792 to 1794. To illustrate, let’s assume a tile size of 256x128, which is a natural choice given that matrix multiply units in GPUs are designed for around 128. So, at 256 x 128, there are seven times 14 tiles, totaling 98 different tiles.

By increasing the size by just one, you would need to round up each coordinate. This results in a total of 120 tiles, which significantly increases the number of tiles. Here’s the catch: if you’re running on an A100 GPU with 108 SMs, it can execute these tiles in parallel.

When there are 98 tiles, all of them can be dispatched at once and every SM runs, maximizing utilization. However, once the number of tiles exceeds the number of SMs, the situation changes: the first 108 tiles execute in one full wave, and the remaining 12 run in a second wave that leaves most SMs idle.

This situation is known as wave quantization. Ideally, your tile sizes should be larger than the number of SMs, or they should not be close to the SM count to avoid creating this kind of quantization error.

I know these are low-level details, but staying attuned to such specifics is crucial. Many aspects of deep learning, particularly in scaling language models, hinge on attention to detail.

To summarize some key strategies: first, reduce memory accesses. There are several techniques—you can implement coalescing to reuse reads, or fusion to combine multiple operations and avoid unnecessary memory operations.

Additionally, transferring memory to shared memory streamlines access since it’s much faster. Consider utilizing tiling tricks and trading memory for computational resources, like through recomputation to save on memory usage or enhancing numerical precision through quantization.

There are multiple strategies at your disposal to maximize performance. Remember to keep a sharp focus on the critical role memory plays in GPU performance.

Are there any questions about this before I move on to the final section regarding Flash Attention?

Alright, let’s synthesize everything we’ve discussed. I aim to show you how the various strategies I’ve taught aren’t random facts; they’re integral to the standard optimization toolkit for performance, particularly in Flash Attention and its iterations.

Flash Attention significantly speeds up the attention mechanism, and while many recognize it results from CUDA kernel optimizations, the specifics may not be clear to everyone. The paper explains that they utilize established techniques, such as tiling and recomputation, to tackle the challenge of computing exact attention with sub-quadratic high-bandwidth memory accesses.

The key takeaway is that if memory acts as the bottleneck, minimizing memory access helps manage computational costs.

To recap, you’ve implemented attention numerous times—typically involving three matrix multiplications for the keys, queries, and values, with a softmax in between. The matrix multiplication itself is straightforward and can effectively be handled using tiling.

The tricky component will be dealing with the softmax, as it’s a global operation needing row-wise summation. Ideally, all operations should occur within the tiles to avoid writing back data to the larger matrix. This is where online softmax computation comes in.

Online softmax allows calculations to be executed tile by tile without needing the entire dataset upfront. It utilizes a running total for normalization, which means computations can be managed effectively in each tile.

Thus, this system allows you to calculate the partial softmax for that tile without the necessity of processing the full n squared matrix.

Finally, in the backward pass, it’s necessary to use recomputation tile by tile, ensuring we refrain from storing any n squared data until needed.

This method is crucial for maintaining performance efficiencies, making it feasible to compute gradients without compromising computational resources.

And with that, we’ve covered how all these elements, from tiling to coalescing to recomputation, converge to optimize Flash Attention, enhancing transformer performance considerably.

To wrap up, hardware advancements are the underpinning of modern language models. Understanding low-level details is essential for leveraging these advancements, and the GPU scaling plot we discussed earlier reflects the importance of optimizing memory movement.

It’s pivotal to consider how to make memory interactions more efficient, which ultimately leads to improved performance, especially in systems like Flash Attention. Thanks, everyone.

A Fortune’s Foundation Solver

2025-04-27 08:00:01

I enjoy playing Fortune’s Foundation and using it as a benchmark for language models’ instruction following. Naturally I wanted to try solving it with code. Solving a solitaire-style game is new to me. The first thing I did was a BFS over game states. Early on there are only a few valid moves that change the decks into a new unvisited state, but I soon realized this is not the case as the game progresses. For a relatively empty queue state, there are so many valid moves that the number of states explodes quickly. The next thing I tried was sorting the states in the queue by a heuristic game-state score: straight cards like 3-4-5 are a positive; a blocking card in the minor arcana is a negative; more empty queues is a positive.

Fortune’s Foundation

The naive BFS with sorting still couldn’t reach a solution before running out of Node’s default 4 GB of memory. I tried a few optimizations:

  • Use a priority queue to speed up fetching the best state.
  • Use dense string representation for the solution path.
  • Use A star search where the score = g + h, where g is the number of moves and h is the heuristic score.
  • For every 10k game states explored, limit the queue size to 1k by popping the min-heap 1k times. (263s)
  • For every 10k game states explored, limit the queue size to 100. (35s)
  • For every 10k game states explored, limit the queue size to 50. (31s)

Here’s an interactive solution that I made with about 60% vibe coding; I had to fix a few quite significant bugs around moving cards.

You can click the Next button and step through the solution shown below.

You can also play it yourself by clicking a card and then clicking a target slot to move the card there.

Step 1: queue,0:queue,5
Step 2: queue,10:queue,6
Step 3: queue,0:queue,7
Step 4: queue,10:queue,2
Step 5: queue,10:queue,0
Step 6: queue,10:queue,5
Step 7: queue,10:queue,6
Step 8: queue,5:queue,10
Step 9: queue,5:queue,6
Step 10: queue,4:queue,5
Step 11: queue,10:queue,6
Step 12: queue,0:queue,10
Step 13: queue,0:queue,4
Step 14: queue,10:queue,4
Step 15: queue,2:queue,3
Step 16: queue,2:queue,10
Step 17: queue,3:queue,10
Step 18: queue,6:queue,10
Step 19: queue,6:queue,10
Step 20: queue,6:queue,10
Step 21: queue,6:queue,10
Step 22: queue,2:queue,0
Step 23: queue,2:queue,10
Step 24: queue,1:queue,10
Step 25: queue,2:queue,10
Step 26: queue,2:queue,4
Step 27: queue,9:queue,4
Step 28: queue,1:queue,7
Step 29: queue,3:queue,1
Step 30: queue,2:queue,4
Step 31: queue,3:queue,2
Step 32: queue,6:queue,2
Step 33: queue,3:queue,6
Step 34: queue,3:queue,0
Step 35: queue,5:queue,3
Step 36: queue,6:queue,5
Step 37: queue,6:queue,5
Step 38: queue,6:queue,5
Step 39: queue,6:queue,8
Step 40: queue,5:queue,6
Step 41: queue,5:queue,6
Step 42: queue,9:queue,6
Step 43: queue,9:queue,5
Step 44: queue,9:queue,2
Step 45: queue,9:queue,10
Step 46: queue,8:queue,0
Step 47: queue,8:queue,0
Step 48: queue,8:queue,5
Step 49: queue,7:queue,0
Step 50: queue,7:queue,0
Step 51: queue,10:queue,7
Step 52: queue,10:queue,7
Step 53: queue,10:queue,7
Step 54: queue,2:queue,3
Step 55: queue,2:queue,3
Step 56: queue,1:queue,3
Step 57: queue,1:queue,2
Step 58: queue,7:queue,0
Step 59: queue,6:queue,0
Step 60: queue,1:queue,0

Here’s the main loop:

// A* main loop: score = g (moves so far) + h (heuristic state score).
// `queue` is treated as a priority queue: pop() is assumed to return the
// best-scoring state (a binary min-heap in the real solver, simplified here).
let queue = [];
queue.push([getStateScore(initialState), initialState, []]);
const visited = new Set();
while (queue.length > 0) {
    const [score, currentState, currentPath] = queue.pop();
    const currentHash = hashState(currentState);
    if (visited.has(currentHash)) {
        continue;
    }
    visited.add(currentHash);
    if (isGoalState(currentState)) {
        return currentPath;
    }
    const validMoves = getValidMoves(currentState);

    for (const move of validMoves) {
        const nextState = applyMove(currentState, move);
        const nextStateHash = hashState(nextState);

        if (!visited.has(nextStateHash)) {
            const nextG = currentPath.length + 1; // g: number of moves taken so far
            const newPath = [...currentPath, `${move.fromType},${move.fromIndex}:${move.toType},${move.toIndex}`];
            queue.push([nextG + getStateScore(nextState), nextState, newPath]);
        }
    }
}

Stanford CS336 Language Modeling from Scratch - Spring 2025 - Mixture of experts

2025-04-24 08:00:01

Stanford CS336 Language Modeling from Scratch - Spring 2025 - Mixture of experts

So, we’ll get started. Today, we’re going to cover a mixture of experts. Last year, this was kind of a fun bonus lecture that I threw together. But this year, thanks to lots of people doing research, this has become a much more critical lecture. So I’ve added a lot of the recent developments, and at the end, we’ll try to walk through DeepSeek V3 and try to understand what all the components that make up a state-of-the-art open-source system or at least on the architecture side look like.

So mixture of experts is how a lot of the most modern high-performance systems today are built and deployed. There was the funny NVIDIA leak that potentially revealed GPT-4 as a mixture-of-experts model (GPT-MoE-1.8T). But more broadly, others like Grok, DeepSeek, and Llama 4 have all adopted a mixture of experts architecture, and it seems like at this point in 2025, the advantage of mixtures of experts over dense architectures is very much clear. At almost all compute scales, training a mixture of experts model, if you do it well, is going to give you benefits over a dense model, and so everyone seems to be doing it in both the East and the West. This will be an important thing to understand if you’re trying to build the best model that you can for the FLOPS that you have.

So mixture of experts is very simple. It’s a terribly named concept. You might hear “mixture of experts” and think, “Oh, there must be experts specialized for different domains doing different things: a coding expert, an English expert, a languages expert.” However, it is very far from that mental model. A mixture of experts is a type of fancy architecture that has several subcomponents called experts that are activated sparsely. In particular, when you think about mixture of experts, you should be thinking about the MLPs. This is where all the action is.

So a standard architecture and a mixture-of-experts architecture are similar in almost all their components except for one. If you look at this slide over here, this shows the components of a standard transformer. You’ve got your self-attention, and you’ve got your FFN. If you zoom in, in a dense model the feed-forward component just exists as one big block. In a sparse model, you take this FFN and split it up, or copy it, depending on how you set things up, into multiple copies of your FFN, your fully connected network, and you have a router that picks a smaller number of those in each forward pass or at each inference step.

So this is the basic idea behind it, and we’re going to replace this one big feed forward on the left side with a selector layer and many smaller ones. What’s the advantage of this? Well, if it’s sparsely activated, that is, let’s say it only picks one expert and an expert is the same size as your dense FFN, then the FLOPS between the left side and the right side, the dense model and the sparse model, have the same FLOPS. They’re doing the same matrix multiplies as you do your forward pass.

You have more parameters without affecting your FLOPS. If you’re a believer that what matters is having more parameters to, for example, memorize facts about the world, this is a great architecture. You can kind of see the intuition behind it. Hopefully, that’s all very clear.

So you might wonder, okay, it makes sense that you can get more parameters per FLOPS, but does that translate to actually better performance for the models that you’re training? There’s been, I think at this point, many papers showing that at the same FLOP count, at the same training amount of FLOPS, you get better performance out of a mixture of experts than out of a dense model.

This is a nice paper to reference. Today I’m going to go over a couple of the classic Google papers that put this field together, and this is one of them, by Fedus et al. in 2022. They show that if you match the training FLOPS, that is, the same amount of compute used for training, then as you increase the number of experts, the training loss of your language model just keeps going down and down. More experts mean better results.

Of course, the experts aren’t free; you need to store the memory for these experts. When you do parallelism, you’re going to have to think about routing your data into 256 separate experts, so there are going to be system complexities. But if you’re only thinking about FLOPS, this is a great chart to see because you have the same FLOPS, but you’ve gotten free test loss here. As you train longer, the model with 128 experts gets better perplexity faster.

Hopefully, that’s quite clear. You might say, well, this is a 2022 paper. Is this true on modern architectures at modern scales? It continues to be very much true. AI2 had a very nice paper, OLMoE, which did a whole bunch of ablations and carefully controlled comparisons of dense versus mixture-of-experts architectures, and they see exactly the same thing. Here on the left side, this is still from Fedus et al.: you see the 7x speedup from having many experts. On the right side, this is the OLMoE comparison: the pink curve is the mixture of experts, and the teal one is dense. The training loss for the dense model goes down much more slowly than for the mixture.

Hopefully, I have sold you on the value of learning this kind of slightly new architecture. We’re going to pay a price for all of this, but at least at the FLOPS level, this looks very compelling.

There was a question here: although this part is a pretty cheap computation, it can affect the actual process pretty badly, since loading data in and out has its costs. So, the question was, in the last lecture I mentioned that even operations with negligible FLOPS can be really significant in wall-clock time. Is anything here going to look like that?

I think one of the drawbacks, and why this isn’t the standard thing taught in, say, CS224N, is that there are significant systems complexities to making this thing efficient. It’s possible to make these things very efficient, especially if each expert lives on a separate device, so that you’re routing data to different places. You can be very efficient when you do that, but it’s not easy. There are a lot of infrastructural concerns, and you’re going to see a lot of complexities in getting this to work. But when it does work, you’re putting all of your FLOPS to use.

The last one I wanted to show is something that a lot of the companies really love because you get to present plots that look very compelling. This was from the DeepSeek V2 paper. The X-axis is a little bit of sleight of hand: it shows only activated parameters, right? So this is only the parameters that are used for computation; you ignore all the deactivated experts. The Y-axis is MMLU performance. We see DeepSeek V2 with very few activated parameters achieving really good MMLU performance. If you’re interested in both training and inference FLOPS, activated parameters are the name of the game. You get really good performance here.

This is not just an ablation. This is a real system that someone spent a lot of money to train and deployed out in the wild. You’ll see this pattern recur in other examples as well.

The systems aspect also provides another axis of parallelism. I’m going to get into parallelism in much more detail in the systems lectures when I talk about how you’re going to take your model and cut it up into many small pieces and lay them out across many different devices.

When you have experts, there’s a very natural way to parallelize at the expert level. You have multiple different feed-forward blocks. You can take each of these experts and put them on different devices, right? Because experts are sparsely activated, all you have to do is take your token and route it to the appropriate device, and the computation will happen on that device. It’s a natural cutting point to shard your model into different devices. This is called expert parallelism, and this is another reason why they’re very popular. If you really want to parallelize really big models, this is a thing that you’re going to have to do.

Interestingly enough, a lot of this work was developed at Google, and many of the frontier closed labs were doing it, but the open results very frequently came from China. Qwen and DeepSeek were doing a lot of work last year, and it’s only really recently that Western open-source groups started to do more, like Mixtral and Grok.

Now Llama has adopted this architecture as well. Llama 4 just got released as the latest and greatest. It is also a sparse model, and I’ll talk about Llama 4 as I go through the lecture.

As I said before, one of the starting points here is that some of the Chinese groups, like Qwen and DeepSeek, have done some impressive benchmarking and evaluation of these results. Qwen 1.5 was one of the first models with large-scale testing and documentation. They took a Qwen 1.5 dense model and used a nice upcycling trick to turn it into a mixture of experts, which is a clever way to take a dense model and convert it. They showed significant gains in compute efficiency while decreasing the number of activated parameters relative to their 7B model.

DeepSeek, which is now famous, originally was not quite as well-known when these papers were released. They did foundational work in the open-source world. A big part of this lecture is going to trace the trajectory of the DeepSeek MOE architecture.

If you look at their original DeepSeek MoE paper, you’ll see very nice comparisons showing what happens when you train a dense model with a particular amount of FLOPS, versus a really naive MoE that doesn’t do smart routing, versus a smarter switch-style router. You’ll see these carefully controlled comparisons showing that as you go from dense to sparse, all the benchmark metrics improve for a fixed amount of FLOPS.

This is very consistent, and DeepSeek V3 is something that almost everyone is aware of. This model is in some sense a culmination of this line of work. However, if you had been following this branch of neural networks and language modeling, you would have known about DeepSeek long before V3 became popular. At the very end of this lecture, you’ll see that DeepSeek V3 is not very different architecturally from the very earliest DeepSeek models. They had nailed the architecture way back when they were training much smaller two billion parameter models.

They really just got the engineering right to create something remarkably good, which is their V3 model.

I have spent quite a few minutes trying to hype up these models, and they really are worth hyping up. However, there’s a question of why they haven’t been more popular. Why isn’t this the standard thing we teach in NLP and language modeling classes?

It’s just that they’re very complex and messy. I’m hoping that over the next few years they’ll get simplified, but they still remain quite intricate. One issue is that the infrastructure is very complex, and the biggest advantages really show up when you’re doing multi-node training. When you have to split up your model anyway, it makes sense to shard experts across different devices. That’s a natural thing to do, but until you reach that point, they may not be as effective.

Some of the earlier Google papers talk about this trade-off, where they say actually when you get these really big models you have to split up, then experts become uniquely good. There are also other things that are really tricky. If you think about it carefully, the decision of which expert you route tokens to is a very difficult thing to learn.

In deep learning, we prefer differentiable objectives—smooth things we can take gradients of. However, routing decisions are not differentiable because we have to pick and commit to a particular expert. If we do that, we face a very tricky optimization problem. The training objectives required to make that work are either heuristic or unstable.

We have to carefully engineer these factors to get them to work. These are two reasons why you may not want to pursue this normally.

The classic design you should think of involves taking densely connected layers like the FFNs, splitting them up, or copying them, and having sparse routing decisions among them. Of course, you could do the same thing with a sparsely routed attention layer. Some people have taken this approach. However, it is rare to see this in major model releases.

I think I’ve seen people talking on the internet saying this approach is very unstable and difficult to train consistently. I haven’t seen ablations to back that up, but certainly, very few people have trained those kinds of models with attention mechanisms.

Now that I’ve told you about the basic architecture, it’s quite simple: you have a router of some kind, you route, and then you have different MLPs. What are the things that might vary across different designs? You might ask how we route. The routing function is an obviously important choice.

How many experts and how big should the experts be? That’s another choice. The final one is how would we train this router? This non-differentiable objective seems very difficult to train. These are very important design questions, and we’re going to cover each one, hopefully detailing the design space of all these aspects.

If you have any questions before I delve into each of these different subcomponents, now is the time.

If you’re interested in a broad overview of at least circa 2022, there’s a really nice survey or review paper by Fedus et al. (2022) that covers many of these aspects. Many of my figures are credited to that paper.

When we think about how we’re going to route or essentially match tokens to experts, this is the core component, because tokens are going to be coming in. You have your sequence that you’re processing, and those sequences are going to be assigned to experts. Not all experts will process every token, which is the whole point of sparsity.

So you can ask how these routing decisions are made. You can have three kinds of choices. You can have token choice, where each token has a routing preference for different experts, and I will choose the top K experts for each token. Or I can have expert choice, where each expert has a rank preference over tokens, and then I’m going to choose the top K tokens for each expert. This has the benefit of being balanced over experts.

Lastly, you could solve a complicated optimization problem to ensure that the mapping between experts and tokens is somehow balanced. This is global assignment. Almost all the methods do token choice top K. In the early days, people tried many different implementations spanning the whole design space of token routers.

If you look at the big releases, they have converged to basically one class of routing mechanisms: token choice top K. Each token rank-orders experts by affinity, and then there’s a top-K choice of experts for each token. I’ll keep referring to the OLMoE paper throughout this lecture because they have a series of nice ablations.

They compare token choice routing versus expert choice routing, and validation loss shows that token choice behaves much better and has faster loss decay.

The question is, is this function a function of the token itself, or its position? It’s a function of the hidden state, meaning the token gets processed with position embeddings and so forth, and the hidden state will come in and be processed by the MLP.

For the other two choices, when you say it’s more balanced across the experts, it still pertains to the current token sequence, but it forces the tokens to be more evenly distributed. It’s the same set of tokens; what differs is the selection: in token choice, each token takes its top-K experts, while in expert choice, each expert takes its top-K tokens.

Expert choice balances the utilization of different experts with respect to tokens. There are various trade-offs at play in this routing.

You asked how does a token know which expert is the best? That is the role of the router. I will give you the router equation, but to spoil it a little, routers are much more lightweight than you think. Your token, let’s say, is represented by vector X, which is your hidden residual stream coming in.

X is going to get multiplied by a matrix W, and then you take a sigmoid or something. That’s the score. It’s a vector inner product, similar to an attention operation.

The choice of K, such as whether K is 1, is a hyperparameter, and different work uses different values. I will talk about this again, but to give you the high-level intuition, the argument the earliest MoE papers made was that K should be at least two to ensure exploration. If you do K equals 1, you may overly exploit a single choice and miss out on exploring others. With K equals 2, the second arm can provide exploration information.

K equals 2 was the canonical choice and continues to be popular. That would double the FLOPS. When people talk about results, they usually mention the number of activated parameters, accounting for the fact that running two MLPs per token requires more compute.

When K is greater than one, do we combine the outputs of different experts? Yes, the outputs get combined right away: like in the diagram, the router routes to two MLPs up top, and their outputs combine right after. The aggregation happens just as a sum.

The variant people commonly use in order to implement a high-performance system is top K. Top K routing is what is mostly used in token choice routing. The residual stream inputs go into the router, which acts similar to an attention operation: it performs a linear inner product and then a softmax before picking the top K most highly activated experts, whose outputs are gated.

Depending on the implementation, you might weight the outputs based on this router weight, or you might just output the weighted average or straight sum. Many papers and methods use top K, including Switch Transformer, GShard, Grok, and the DeepSeek variants, with different top K implementations.

A surprising fact is that you don’t even need a sophisticated router. You can just use a hashing function at the bottom to map inputs onto experts. Even with hashing, without any semantic information, you can still see performance gains, which is quite remarkable.

Some early work explored using RL for routing behavior. Although RL is great for learning discrete decisions, the cost of doing this is prohibitive, and the stability issues may deter researchers. There have been papers exploring solutions to linear assignment problems or optimal transport issues that are elegant but may not offer practicable benefits to offset the costs.

Now, I can point at a slide to discuss routing in detail. This is the top K routing that almost everyone has converged to. This routing method is used in DeepSeek V1 to V2, and Qwen and Grok do almost exactly the same.

Instead of having a softmax at the bottom, DeepSeek V3 uses a modified approach; it’s a minor difference. Let’s walk through what’s happening. At the very bottom, we have our input u_t, the token’s hidden state, and I need to determine which experts are activated.

To do this, similar to attention, I take my U input and compute the inner products with the learned vectors for each expert. These vectors represent the experts and indicate their activation direction. I calculate the inner products for expert and input affinity and compute a softmax to identify the best experts for each token.

After normalizing, I apply a top K function to select the K best weights, zeroing out the others before aggregating the outputs and adding that to my original residual stream to return it.
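As a rough sketch of that forward pass, here is a minimal PyTorch-style implementation of generic top-K token-choice routing (not DeepSeek’s actual code; the class and parameter names are made up):

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        # One learned vector e_i per expert; the router score is just an inner product.
        self.expert_embed = nn.Parameter(torch.randn(n_experts, d_model) * 0.02)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, u):                       # u: (tokens, d_model) residual stream
        scores = torch.softmax(u @ self.expert_embed.T, dim=-1)   # s_{i,t}, pre-top-K softmax
        gates, idx = torch.topk(scores, self.k, dim=-1)           # keep top-K scores, zero the rest
        out = torch.zeros_like(u)
        for slot in range(self.k):              # plain loops for clarity; real kernels batch this
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(u[mask])
        return u + out                          # add back the residual stream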

The mechanics of this routing process are straightforward, but learning it well can be quite complex. The benefit of the softmax is that it tends to push towards a single maximum without being a hard max, which is essential to shaping the routing behavior.

I’m having difficulty finding the intuition for combining the softmax with the top K selections. One way to think is that the softmax helps average the outputs later, ensuring they sum to one. The softmax is essentially a normalization operation designed to create a weighted average at the top.

People might wonder why not use the softmax alone instead of top K. Using the softmax everywhere would lose the efficiency aspect, since too many experts would activate during training. It’s essential to maintain a sparse number of activated experts during both training and inference, and that’s why this gymnastics is required to uphold sparsity in the activated experts. Top K, right? Okay. Yes. From the back.

Yeah. So, because you’re doing softmax first and then the top K to get the weights, you no longer have the guarantee that the weights sum to one.

So, the question was, yeah, so the question was if you softmax first, you no longer sum to one. And yes, that’s absolutely right. You no longer sum to one. And in some ways, there’s no requirement that you have to sum to one because the next layer can magnify it back up. There are layer norms everywhere. It’s not as if it has to sum to one. But I think that is the reason why some of the other architectures basically move the location of the softmax. There’s a kind of aesthetic choice about whether you really want that weight to be normalized to one or not.

Yes. So I was wondering how the E vector here relates to the weights of the feed-forward. Okay. So the question was whether and how the E vectors relate to the feed-forward. They’re not really tied in any way. The E vectors are just learned vectors; think of the E as parameters for the router, right? They’re just separate objects from the FFN.

Yeah, I was just wondering how this compares to sampling from the softmax. Great. The question was about how it compares to sampling from the softmax. You can sample from the softmax, and some methods actually do a kind of soft sampling from the softmax. Specifically, one of the Google papers has a procedure where they take the top element of the softmax and then they randomly sample the second element proportional to the remainder of the softmax. And that gives you more exploration, which is good, but the drawback of that is that if you don’t sample at test time, now you’ve got a train-test mismatch.

Okay. Yes. Why not just re-normalize after the top K? Why not just re-normalize after top K was the question. Is that right? Some models do that. Some models do re-normalize after the top K, but that’s kind of a choice. Some architectures don’t do that; some architectures do. It doesn’t actually matter because the scale can basically be adjusted post hoc, right? So there’s no reason why it has to sum to one after the G operation.

Cool. Oh, sorry. Yes, the bias term, that’s the u_t up there. So the question was whether the first term of the sum, if G is approximating a probability vector, could be seen as an expectation of FFN, plus u. Actually, this is not an expectation of FFN, because each FFN_i is a different function and the gates are sparse. It’s a weighted selection operation over K of the capital-N different FFNs, and then the u_t at the very end, if you remember the transformer, is the residual stream, right? I’m adding back the inputs because I want a sort of identity connection throughout.

Okay. Oh, there’s another question. Why does the router have such a basic parameterization? What happens if you put more weights into your router function? The question was why is the router so basic? It seems like if you’re going to have experts, it’s important to route to the right experts. So why don’t you do that? I think there have been some ablations in some of the earlier Google papers on having MLP routers and more sophisticated things.

I think the sort of complex answer here is that systems concerns weigh heavily. If you’re using a lot of flops to make routing decisions, you have to pay for those flops, and so you have to get performance improvements in just the routing. And I think the one other thing to appreciate here is that there are really big limits to how well you can route because the learning process for this routing thing is actually pretty dicey.

How are you going to get gradients for which routers are good or bad? Well, the only thing you have is if you have the top two, then you can compare those two things that you have and you can push the gradients into S of T because your G is a weight, and then the S of T might inform your inner products. But that’s a very indirect way to be learning your affinity. So even if you make it complex, there’s no guarantee that you’re going to really learn the optimal router.

Great. Okay. So I think one of the great innovations of DeepSeek, which was very quickly adopted by all the other Chinese LLM releases, is this idea of shared experts and fine-grained experts.

The basic structure that was originally proposed is to take your dense architecture and kind of copy the experts over. So in this case, if you have top two routing, you’re going to have twice the activated parameters of your original dense model. You take your model and you copy it over and you activate K equals 2. This is kind of what you might think of as the vanilla or basic model that you might start with.

People realized fairly quickly that having lots of experts is good. The logical next step is that you want lots of experts, but you don’t want to pay the parameter cost for having lots of experts. DeepSeek basically argued that the right thing to do was to cut each expert up into smaller pieces.

Remember last lecture I was telling you that the kind of golden rule in some sense is to have your hidden layer and then multiply that by four, and that will give you your projection layer. Now what you would do is instead of multiplying by, let’s say, four, you might multiply by two. Now you have smaller matrices and more fine-grained experts. You can have twice as many of them, and you can take that logic much more to the extreme. You can quadruple or multiply by eight and keep decreasing the size of your projection dimension, leading to fine-grained experts.
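As a back-of-the-envelope sketch (the dimensions are made up, chosen only to roughly mirror the six-routed-plus-two-shared, quarter-sized configuration discussed later), fine-grained experts keep the activated FLOPS roughly fixed while multiplying the total parameter count:

d_model = 1024
d_ff_dense = 4 * d_model          # the usual 4x rule of thumb for a dense FFN

# Slice each expert to 1/4 of the dense FFN, activate 8 of them, keep 64 in total.
d_ff_expert = d_ff_dense // 4
n_active, n_total = 8, 64

dense_params = 2 * d_model * d_ff_dense                 # up- and down-projection
active_params = n_active * 2 * d_model * d_ff_expert    # paid in FLOPS on every token
total_params = n_total * 2 * d_model * d_ff_expert      # paid in memory

print(dense_params, active_params, total_params)
# 8388608 16777216 134217728 -> about 2x the active FLOPS, 16x the parameters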

There are drawbacks, which I’ll talk about later. It’s not free, so you have to be very careful about how you structure these things. The other thing that has been studied and noted is that maybe it’s helpful to have at least some MLP that can capture shared structure.

Maybe there’s just processing that always needs to happen no matter which token you’re processing. In that case, it seems kind of wasteful to do all this routing work and to have parameters spread out everywhere when we can just have one shared expert or a few shared experts whose job it is to handle all the shared processing that’s needed.

And so they’re shared experts. This setup of using fine-grained experts plus shared experts originally came out in DeepSeek, although I think the original inspiration came from DeepSpeed and Qwen and others. Almost all of the open releases since DeepSeek have adopted some of these innovations because it’s quite clear that fine-grained experts in particular are really useful.

That’s kind of a no-brainer at this point to do. One of the things I really like about reading DeepSeek papers is that they do ablations. It’s not like a sales tech report; they actually care about whether or not their methods work. They have this lovely ablation in the DeepSeek paper where they show that the blue bar here is G-Shard. This is a very basic vanilla implementation.

You can have one shared expert; that’s the orange bar, and it gives you a big boost on some tasks and no boosts on others. You can have fine-grained experts; that’s the green and orange bars, and you get further boosts from that. If you compare the blue to the orange, composing all these differences gives you quite a big boost over others.

We can see that more experts and shared experts generally seem to help. Okay. Yes. Question. When it says seven out of something, does that mean it’s doing like top seven? Yes. Sorry, I should have explained that. That’s right. X out of Y means X activated out of Y total routed experts.

That’s right. And so you can kind of see the pattern here as well. As you increase the number of experts, you often also increase the number of activated experts. Especially if you’re doing fine-grained experts, flops-wise, it’s free, because each expert is now smaller.

Okay. So OLMoE has corroborating evidence that shows nicely that these things work. The bottom one I think I’ll start with because it’s more decisive. It shows fine-grained experts going from 8 to 32 to 64, mirroring in some sense the DeepSeek ablations. You see very clear trends in losses and other kinds of metrics showing improvements going from 8 to 32 to 64. Fine-grained experts are great.

Shared experts, which is purple versus teal at the very top: you actually don’t see any gains, at least in the OLMoE setup. They end up with no shared experts, even though the DeepSeek paper seemed to show gains. So that result is maybe more mixed, given this third-party replication of these ideas.

At this point, you might be wondering what common configurations are. I think I’m going to take the page out of last lecture’s playbook of looking at a lot of the recent releases, looking at what people do and trying to talk a little about the patterns that have arisen.

Some of the early Google papers, such as GShard, Switch Transformer, and ST-MoE, had really large numbers of routed experts. There were lots of interesting things going on in those papers; I’d encourage you to read them. Some of them happened in LSTMs and other architectures. Regardless, very quickly there was a period of 8 to 16 experts, like Mixtral, DBRX, and Grok with two active experts. Those worked reasonably well, but then DeepSeek v1 came out.

That has the prototypical configuration I told you about: fine-grained experts, 64 of them, six actively routed, two shared experts. Take that last column with a grain of salt because I had to back them out from config files and things like that, so I’m not 100% sure about the exact ratios here.

We’ve then got essentially Qwen 1.5, DeepSeek V3, and MiniMax. These are Chinese models that follow essentially in the same footsteps as DeepSeek v1. The specific numbers are different, but they use fine-grained experts and they often have shared experts. They’re very similar to this original DeepSeek configuration.

OLMoE, MiniMax, and Llama are very recent; they definitely do all this fine-grained expert stuff. Llama 4 also uses a shared expert, and you see variations in configuration, but what’s basically shared is this fine-grained expert idea, especially for the big models like Llama 4 and DeepSeek, which use very large numbers of routed or total experts.

Yes. Can you explain what the ratios represent? The ratio is representing roughly how much each expert is sliced relative to having just the standard dense configuration. In terms of hyperparameters, if you’re following the rule of thumb, your hidden dimension and sort of your projection from your MLP should be about 1 to 4 or 1 to 2.6 if you’re doing a gated network.

By looking at the hidden layers of these architectures, you can kind of see how many times they sliced up that original feed-forward size.

For those experts, does that mean they’re still increasing the total parameter count by that factor? That’s right. You can think of this as roughly having 16 normally sized experts, so they have more parameters than the dense equivalent. They have six routed plus two shared, so eight total active experts at any time, each of which is quarter-sized.

You should think of them as roughly double the FLOPS of a dense equivalent. Some arithmetic, but hopefully the math is clear and consistent. As for some of the exotic ratios, I’m not quite sure why they’re that way, but they are very precisely whole numbers when you take the ratios between the FFNs and the implied hyperparameters.

I think those are exactly the split counts of how much they were sliced, but I’m not sure why they have one over 14. I mean, does it do you ever project to smaller dimension because that ratio is so small in the MLP?

So yeah. Oh, that’s why you’re asking like do they down project? Yeah, that’s right. In some of them, they are actually smaller. I don’t remember which models in particular, but in some of them, I do remember they were actually down projected.

Yes. What is the intuition for wanting more than one shared expert? Yeah, I mean, it does kind of seem like there was a period where some of the Chinese LM companies tried many shared experts, and then people have come back to zero or one. If you look at the OLMoE ablations, it’s not quite clear that even one shared expert is decisively useful.

I think the original motivation was that then you have equally sized experts. These are both one-quarter sized experts and now you have eight active experts total, so you can keep the sizes consistent. Otherwise, I don’t really see a particular justification for why it should be two smaller ones versus one larger one.

Okay, cool. So then hopefully you get a sense of how the routing works for a lot of these and how it’s all set up. The forward pass hopefully you fully understand.

Now we need to think about training, and training is pretty gnarly. The major challenge I foreshadowed earlier is that when we train, we cannot turn on all the experts, because if we do that, then we pay the full FLOPS cost of all the experts. Having a model that’s 256 times more expensive to train is a total no-go, so we need to train with sparsity, but sparse gating decisions are obviously not differentiable.

We now have a kind of annoying RL-ish problem. So we could do any of these things like RL to optimize gating policies. We could do bandit-inspired things, randomization to explore, or we can just have some heuristics that try to balance things out, like put some loss terms in there and hope things work out.

Having gone through deep learning classes of many kinds, you can kind of guess internally which one people use in practice. I’ll talk about each one of these three in turn.

Okay, so RL, I think, is one of the earliest things that people tried. It’s probably the most principled thing you can do in this space. You have a non-differentiable routing decision. Well, think of that as a policy, throw RL at it, and then solve the problem.

Unfortunately, it’s not better than a lot of the other things that you can do. There is a paper by Clark et al. (2022) exploring various scaling-related questions for routed models. They do have an RL baseline that I was able to dig up, but unfortunately, it’s not really that much better than using hashing for decisions.

They were really interested in benchmarking a method called S-BASE, which is a linear-assignment kind of method, and that handily beats doing RL. In practice, the gradient variance and complexity mean that RL is pretty finicky to use, and to my knowledge, no one at scale has really used an RL-based approach to optimize these gating decisions.

A thing that has been done much more at scale is stochastic approximation of various kinds, where you add a bit of perturbation. Here’s an example from Shazeer et al. in 2017. This is one of the early papers where they’re still going to do top K routing: they keep the top K elements of this H of X operation and then softmax that to get the gate.

What we’re going to do to get this H of X operation is as follows. We’re going to have our original linear affinity. This is identical to what we were doing before. We were basically just computing our inputs X and a sort of learned weight for each gate.

This part is the same, but I’m actually now going to jitter it a little bit. I’m going to add normal noise and then I’m going to pick sort of a W noise scale that’s learned. This thing is going to control how much noise to inject into this process. You can think of this as a stochastic exploration policy.

By manipulating W noise in particular ways, like kneeling it down or doing various things, I can control the exploration-exploitation trade-offs that this is going to have. This is going to give you one solution to the explore-exploit dilemma. If you’re noising things up, each expert might randomly get some other tokens that it wasn’t expecting to get.

It’ll lead to experts that are less specialized but maybe a little bit more robust. That seems generally quite nice. Of course, the stochasticity also means you don’t get as much specialization, which leads to a loss of efficiency. There’s another approach that people have done where they multiply the router logits or add a multiplicative perturbation to the router logits, with the goal of getting less brittle experts.
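A minimal sketch of that noisy top-K gating (following the Shazeer et al. formulation described above; the weight names and shapes are placeholders):

import torch
import torch.nn.functional as F

def noisy_top_k_gates(x, w_gate, w_noise, k: int, training: bool = True):
    """Noisy top-K gating: jitter the router logits, then keep only the top K."""
    clean_logits = x @ w_gate                              # (tokens, n_experts)
    if training:
        noise_scale = F.softplus(x @ w_noise)              # learned, per-expert noise level
        logits = clean_logits + torch.randn_like(clean_logits) * noise_scale
    else:
        logits = clean_logits                              # no exploration at test time
    top_vals, top_idx = logits.topk(k, dim=-1)
    gates = torch.full_like(logits, float('-inf')).scatter(-1, top_idx, top_vals)
    return F.softmax(gates, dim=-1)                        # zeros outside the top K

# Hypothetical shapes: 4 tokens, d_model=16, 8 experts, top-2 routing.
x = torch.randn(4, 16)
g = noisy_top_k_gates(x, torch.randn(16, 8), torch.randn(16, 8), k=2)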

But this jitter process was kind of removed in some of the later papers because they found it just didn’t work as well as some of the heuristic loss-based approaches. This stochastic routing trick was tried in early Google papers, but it’s generally been abandoned by a lot of the people training these models.

Okay. So yes, for the stochastic approach, what problem does that solve? Because we’re still taking the top K, so we still can’t differentiate backwards, right?

If you think about this, the question was we still can’t differentiate because we’re taking the top K. If you change your interpretation of the problem a little bit, you can see it as a bandit problem.

It has the same structure where you know you pull a bandit arm and you don’t see any of the other arms. You can’t allocate your resources efficiently. If you pull some of the other ones at random, now you’ve got enough data to be able to do some optimization.

This jittering is similar in spirit to an epsilon-greedy style exploration where you’re randomly pulling some of the other arms with some probability, where the probability itself depends on how confident you are about this routing decision. That’s the intuition, and then, of course, that’s going to give you some way of getting some signal back.

The thing that in practice people have ended up with is that we don’t do any of that. We don’t do RL; we don’t do stochastic exploration. But we rely on really another mechanism to keep things reasonable. If we’re doing top two routing, technically speaking, we do get some signal in the gradient descent process because we can compare the top two experts that we did evaluate.

It’s possible to do some optimization, but when we drop all the other constraints, the big issue that arises is that you just end up picking one expert all the time, and that expert is good at everything, and all the other experts are terrible. You end up in this local minimum where you’ve routed all of your tokens to one expert all the time.

So really the key game becomes how we get out of that local minimum, and loss balancing or balancing losses is the key trick to get out of this. This is important to understand because this is the loss that mostly everyone uses to train. If you were zoning out earlier, you probably should pay attention to this particular set of equations here.

This is originally from the Switch Transformer by Fedus et al. (2022), and they add this particular loss where they loop over each of the experts and take an inner product between the vector F and the vector P.

What are these vectors? F is for each of the experts; this is the fraction of the tokens that were allocated to expert I. You can think of this as a probability vector telling you what fraction of your tokens in your batch or whatever the unit is did you route to expert I.

Now P of I is the fraction of the router probability that was allocated to expert I. The router probability is the original softmaxed routing decision that I was sort of intending to send. This measures P of I is the intended probability from the router, and then F of I is the actual routing decision made by the top K method.

One thing to look at here is the derivative of this loss with respect to P of I. The loss is linear in P of I, and you’ll see that the strongest down-weighting happens on the biggest experts with the biggest allocations.

It’s actually proportional to the amount of tokens you get. You’ll be pushed downwards more strongly if you received more tokens. This is the basic behavior of this loss, and almost everybody uses this kind of F.P trick to balance tokens across different units.
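Here is a minimal sketch of that balancing loss in the f·P form just described (the alpha coefficient and the N scaling follow the Switch Transformer convention; the argument names are assumptions):

import torch

def load_balancing_loss(router_probs, expert_idx, n_experts: int, alpha: float = 0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_probs: (tokens, n_experts) softmax outputs of the router.
    expert_idx:   (tokens,) the expert each token was actually sent to (top-1 here).
    """
    # f_i: fraction of tokens actually routed to expert i (not differentiable).
    f = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    # P_i: average router probability assigned to expert i (differentiable).
    P = router_probs.mean(dim=0)
    return alpha * n_experts * torch.sum(f * P)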

The basic unit that you might want to balance over initially is batches. You might want each batch to get allocated evenly to experts, but you might also have other kinds of balancing that you want to do. DeepSeek does exactly this.

I’ll talk about all the variants they’ve thrown in, but the first thing is per-expert balancing per batch. Each batch they want to make sure experts get an even number of tokens. This is from the DeepSeek paper, and this looks very familiar to you.

This is exactly the same F.P inner product structure as before. P of I is defined a little differently; that’s S of I of T, but that should be familiar from earlier. That’s the softmax pre-top K, right? So hopefully this looks good to you. The other thing you might want is to balance across experts.

That’s all well and good, but you might also want to think about systems concerns because you’re going to shard your experts onto different devices, and you might want to balance per device. You might have another loss that’s essentially the same structure, but instead of summing which tokens go to which experts, you might measure which tokens go to which devices.

That’s going to be a different F that’s measured over device groups rather than over each expert. Now you can set up a different loss to balance over devices. If you optimize this, you’re naturally going to learn routing functions that ensure each GPU, each TPU, or whatever you have, has an even number of tokens, leading to even utilization. That would be great from a systems perspective.

Basically, everyone does this kind of thing. DeepSeek V3 actually kind of innovates a little bit. This is cool, and I don’t think I’ve seen this before. It’s one of the first things in the world that doesn’t actually come from Google, really. They have gotten rid of the per-expert balancing term entirely.

Instead, what they now do is they take their softmax scores and add a little fudge factor B of I, where B of I is a little fudge factor score for each expert. Expert I might get upped or downed. If an expert isn’t getting enough tokens, it’s going to be given a higher B of I, allowing it to grab more tokens.

The way this works is that they’re going to learn B of I through a simple online gradient scheme, online learning. They’re going to measure at each batch what each of the experts are getting, like are they getting an even number of tokens? If they’re not getting enough tokens, they add a gamma learning rate to B of I, making it higher. If they’re getting too many tokens, they’re going to subtract gamma, making that expert slightly less attractive.

They’re just learning little offsets for each of the S of I. Notice here, you’re only using the B of I to make the routing decisions. You’re not actually sending it over as part of your gating weights. That’s a somewhat important thing to do. They call this auxiliary loss-free balancing.
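A minimal sketch of that bias update (following the description above; gamma is the bias update rate, and the per-batch token counting is an assumption about the exact bookkeeping):

import torch

def update_routing_bias(b, tokens_per_expert, gamma: float = 0.001):
    """Auxiliary-loss-free balancing: nudge per-expert biases toward even load.

    b:                 (n_experts,) bias added to router scores for top-K selection only.
    tokens_per_expert: (n_experts,) how many tokens each expert received this batch.
    """
    target = tokens_per_expert.float().mean()
    # Under-loaded experts get a higher bias, over-loaded experts a lower one.
    return b + gamma * torch.sign(target - tokens_per_expert.float())

# At routing time the bias only affects which experts are picked, not the gate values:
# topk_idx = torch.topk(scores + b, k, dim=-1).indices
# gates    = scores.gather(-1, topk_idx)   # gating weights use the unbiased scores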

If you go and read the DeepSeek V3 paper, which all of you should because it’s a really nice paper, they’ll make a big deal about how this makes training stable, great, wonderful. Of course, you keep reading the section and they’re like, actually, for each sequence maybe we still want to be balanced, and this doesn’t work well enough, so we’ve added the heuristic loss back.

They do have something called a complementary sequence-wise auxiliary loss that’s basically exactly the auxiliary loss they decided they needed because what they wanted to do was balance load across experts at a per-sequence level rather than a per-batch level.

I’m not sure why they do this particular thing rather than another bias-style trick, but that’s just what they do in DeepSeek V3. So it’s not fully auxiliary-loss-free, as they’d like you to believe.

Okay. Oh yes. Question. This is a bit of an unfair question, but if we didn’t have to worry about systems optimizations, do you think the performance of this model would be a lot better, or would it stay roughly the same?

If we didn’t consider systems optimization, would the performance of this model be better or stay the same? When you say this model, do you mean DeepSeek V3 or MoEs in general? So are you asking whether, if we ignore systems concerns, it could still be good? Would the performance on downstream tasks, for example, be better than what we have right now? Yeah. So the follow-up was: if I didn’t have to balance, I could just set roughly equal numbers of tokens for every expert. That’s right. That’s right. Well, the per-expert balancing term, right, is not a systems concern. You still want to do this, because if you don’t, what you’ll find is (I’m going to keep referring to the OLMoE paper because they have so many ablations) that they have a really nice ablation where they get rid of exactly this. What they find is that, basically, early on in training, the model just picks one or two experts, and all the other experts are dead. The router never sends anything to them. So you’re just wasting memory at that point, right? You’ve lost performance for free; you’ve effectively gotten a smaller model. And so even if you ignore all the other device-balancing and parallelism concerns, you’ve gotten a worse model because you didn’t properly allocate your experts. It’s the same idea as wanting to effectively use all of your parameters. You still want to do expert load balancing.

Sorry, say device. What does device refer to? Yeah, actually, so normally this would refer to GPU or TPU. There is a subtlety. I’ll talk about this maybe in the very last or second to last slide. There are more sophisticated and cool versions of this where you try to balance things to minimize communication costs as well. And so there’s broader notions of device, like one rack or whatever else, but here it usually refers to GPU.

Yes, going back to the fact that hashing as a routing algorithm seems to improve performance: is there intuition for that? Because that’s effectively just randomly choosing one of the feed-forward networks to send the token through. So why does having multiple copies, each of which gets less data, make performance better? Yes, the question was why does hashing do anything at all? I don’t have a really precise intuition for this, but you can make arguments either way. One is, even if you’re hashing, the same tokens are going to go to the same experts, and so each expert will still get some deterministic subset of the inputs. There’s some specialization that can still occur; it’s just non-semantic, non-learned. If your distribution is Zipfian, a word like “the” might dominate one expert, and so you might still get actual semantic specialization where one expert is effectively dominated by very frequent tokens. What about a purely random routing function that doesn’t depend on the input at all? Yeah, I would bet that that would be really terrible. I have never run or seen that, but yes, I think that would be horrible. Good.

Yes. So for an LM you have many layers, right? Many transformer blocks. I think in the lecture you mentioned that each expert... okay, so you might have 32 layers and 64 experts each. That’s a lot of GPUs. Or I wonder if experts are bundled together on a single GPU. Is that the question? Like, won’t you need lots of GPUs if you have lots of layers and lots of experts? Yes, if you exclusively gave a GPU to a single expert, that would be kind of crazy. But you would shard things so that each GPU holds enough of these units to effectively use its memory, right? The name of the game in parallelism is that you always want to use up all of your memory, because that’s one of your resources. You don’t want to parallelize more than you have to.

Cool. Okay. Excellent. Oh, okay. I did put the ablation in here. Yeah. So, this is exactly what happens to the question of what happens if you don’t do expert balancing loss. I think the great picture to see is this bottom left one. If you don’t do load balancing, you know, what are the tokens assigned to which expert? You see the pink and the yellow expert; they just kind of take over. They take up about 50% of the tokens. All the other experts are dead. They do nothing, right? And so you’ve wasted the majority of your experts at this point. Six out of eight of your experts. And you’ve created a two-expert model unintentionally. That gives you worse losses as seen on the top right, the teal lines. Of course, maybe that’s still better than the dense model because at least you’ve got two experts going. But you could have done better, right, counterfactually speaking.

Okay. So, I won’t go quite as deep as I could into the system side because I haven’t really started to cover the core systems concepts necessary for you to deeply appreciate a lot of the parallelism concerns like the hierarchy of communication speeds in a data center and so on. But really, as I said before, one thing to keep in mind is just how nicely it can fit into devices. The thing that people say is expert parallelism involves sending one or a few experts onto each device. What happens when you are basically processing a token? Well, you would hit the router, and after the router, you now have picked a few experts. And so now you would have a collective communication call, like an all-to-all communication dispatch that would send the tokens to the relevant devices. The feed forwards would compute their outputs, and then you would return the tokens to sort of where they belong. Or you would combine, I guess, multiple experts, and so you would need another sort of collective communication call. If your feed-forward computations are sort of big and beefy enough, you can kind of pay for the cost of basically doing this expert parallelism.

One of the things that’s nice about this is that it’s another form of parallelism in your toolkit. You’ve got on the right side data parallelism, model parallelism of two or three different kinds, and then you’ve got expert parallelism. You can combine all of them to come up with sort of ways of trading off all the resources you have: the communication speed, the amount of data that you have, your batch size, your number of experts, and your memory. I’m not going to go into too much detail about how specifically this is going to help, but keep in mind that this gives you another sort of tool in your expert toolkit.

Another thing that is also useful: let’s say you have multiple experts on a single device. The computations are sparse; say token one gets multiplied by expert zero’s weights, the second by expert one’s, and the third by expert two’s. This is really three matrix multiplies that are small and sparse, and you might hope that modern GPUs can take advantage of these kinds of sparse matrix multiplications. And that’s exactly right. If you lay out your experts correctly and the weights are fused in the right way, then modern sparse matrix multiply engines can make sure that you’re not wasting any FLOPS in doing this as one big matrix multiply. Modern libraries like MegaBlocks can take advantage of this device-level sparsity support to do multiple expert computations all at once. This is yet another advantage that you get.

One fun side thing, which maybe isn’t mysterious to you all anymore because you’ve sort of grown up in the era of GPT-4. When the GPT-4 API first came out, it was kind of mysterious to me because when you set the temperature to zero, you kind of got different responses even though it was supposed to be deterministic. Lots of people speculated about why would that be. I’m not saying this is the answer to that reason, but there is actually an interesting source of randomness. So, think about what happens. You’re going to route your tokens to experts, right? And experts live in different devices. It could be that you have a lot of examples. You’re going to batch your queries when you’re processing them. And so if you’ve batched your queries, these tokens are going to get routed into different experts. So imagine you’ve got this batch to process and you’ve got a bunch of experts, but for whatever reason, this batch really loves expert number three. All the tokens go to expert number three. So now what happens? Well, the device for expert number three doesn’t have enough memory to load all of those tokens. And then what happens is what people call token dropping. This happens at training time as well. You often have what’s called a load factor where you’re controlling the maximum number of allowed tokens. And if the router just allocates too many tokens to an expert, you just drop those tokens off either for systems reasons or because you’re just worried that that expert is going to take over, at least in training time. So now this token has gotten dropped, and it’s not going to get anything at all. The MLP is just going to do a zero computation, and the residual connection is just going to pass things straight forward. And then you’re going to return an output. If your token got dropped, you’re going to get a different result than if your token didn’t get dropped. Based on who else is in your batch, this can induce stochasticity both at training time and inference time, which is kind of an interesting thing that you don’t normally think about because you almost never think about cross-batch effects when doing inference.
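
To make the capacity idea concrete, here is a minimal sketch (not from the lecture) of top-1 routing with a capacity factor; the names `capacity_factor` and `keep_mask` are my own, and real implementations use vectorized scatter operations rather than a Python loop.

```python
import torch

def route_with_capacity(router_logits, capacity_factor=1.25):
    """Toy top-1 routing with capacity-based token dropping.

    router_logits: [num_tokens, num_experts]
    Returns each token's chosen expert and a mask of tokens that were kept.
    """
    num_tokens, num_experts = router_logits.shape
    # Each expert may take at most this many tokens from the batch.
    capacity = int(capacity_factor * num_tokens / num_experts)

    expert_choice = router_logits.argmax(dim=-1)
    keep_mask = torch.zeros(num_tokens, dtype=torch.bool)
    load = torch.zeros(num_experts, dtype=torch.long)

    for t in range(num_tokens):          # tokens past an expert's capacity get dropped
        e = expert_choice[t]
        if load[e] < capacity:
            load[e] += 1
            keep_mask[t] = True

    # Dropped tokens skip the expert MLP entirely; the residual stream just
    # carries their hidden state forward unchanged.
    return expert_choice, keep_mask
```

Because `keep_mask` depends on every other token in the batch, the same token can be kept in one batch and dropped in another, which is exactly the cross-batch stochasticity described above.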

Okay, so that’s kind of the main bits of the basic components of building the system. A fun side thing: if you were to actually go out tomorrow and try to train one of these, I think the systems side would make you a little bit sad, but the other thing that would make you sad is probably the stability side of things. These models have this property that sometimes they’ll just kind of blow up on you if you try to fine-tune them. They’re very difficult to fine-tune, and they’ll sometimes blow up on you. Barret Zoph and others really studied this; there’s a whole paper, the one I’m referencing here, whose entire purpose is to stabilize training. There are a couple of tricks that I’ll mention that I think are relevant and that people do. The first one concerns the router softmax; this goes back to last lecture about stability, right? What did I say about stability? The thing to be afraid of is the softmaxes. Softmaxes are always where you want to be afraid. So they do all the router computations in float32 just to be safe. Sometimes, they also add an auxiliary z-loss. Hopefully you remember from last lecture: you take the log of the sum of the exponentiated values in the softmax, square that, and add that as an extra loss. This is going to keep the normalizer values near one, which is nice for stability. This is actually one of the places where z-loss was used early on, before it got more popular for training models. You can kind of see the effects here if you look at the losses; I think the second plot here is the one to look at. If you remove the z-loss from your routing function, you see these giant loss spikes in your validation loss where the model just kind of goes a little bit crazy for a couple of iterations and then gets pulled back. Of course, it still trains okay, but you are better off having the z-loss than not having it. There is a pretty noticeable gap in the validation loss by the end here, right?
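
As a concrete reference, here is a minimal sketch (my own, not the paper's code) of the router z-loss just described; the coefficient is a placeholder, not a value from the lecture.

```python
import torch

def router_z_loss(router_logits, coeff=1e-3):
    """z-loss on the router: log-sum-exp of the logits, squared, averaged over tokens.
    Keeping this small keeps the softmax normalizer near one."""
    # Do the computation in float32 for stability, as described above.
    z = torch.logsumexp(router_logits.float(), dim=-1)   # [num_tokens]
    return coeff * (z ** 2).mean()
```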

Other things that can happen: of course, you want to fine-tune and RLHF your model if you’re going to ship and release it. This turns out to be kind of problematic. Some of the earlier work, when people were starting to do this, was back in the BERT and T5 era, and there was a lot of fine-tuning going on. One of the things people saw was that a lot of overfitting happens if you fine-tune sparse models. You see this big gap between train and val, right? This blue and orange line, whereas the dense model, this green and red line, has a smaller train-test gap. There were a lot of worries about overfitting because you have these gigantic-parameter models that you’re fine-tuning on small data. One of the solutions proposed at the time, though I don’t think this is very popular as far as I understand, is to architect your model such that not every layer is an MoE layer; you alternate dense layers and sparse layers. Then you can just fine-tune the dense layers, and that will still be fine, right? That behaves just like a dense model.

Another solution, which we saw in the DeepSeek MOE paper, is to use a lot of data. If overfitting is a problem, we have access to lots and lots of SFT data, so just shovel all of those in. In the case of DeepSeek, they used 1.4 million training examples; maybe then you’re not quite as worried about these overfitting concerns. The last thing I’ll end with, which is a trick in the toolkit that people have done and seen, is upcycling. The idea is to take a dense model, like the one over here, take its MLP and make a bunch of copies of it, maybe perturb them, add a router that’s initialized from scratch, and then just train it as an MoE from that point on. You just initialize the experts from a dense model. This trick is called upcycling, and people have shown that if you can get it to work, it is a very cost-effective way of training. It is great for inference because not every MLP is going to be active at inference time. So, you effectively get a much larger parameter model without doing the training of a much larger parameter model. Several people have succeeded at this. MiniCPM, which I’ll mention again in the scaling laws lecture, is a Chinese open LLM effort that basically tried to build really good small language models. They succeeded at taking a dense model and upcycling it. You can see that their numbers get significantly better in the last two rows, right?
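
A minimal sketch of the upcycling initialization, under my own naming; the perturbation and the subsequent MoE training loop are omitted.

```python
import copy
import torch.nn as nn

def upcycle_mlp(dense_mlp: nn.Module, num_experts: int, d_model: int):
    """Copy a trained dense MLP into every expert and attach a freshly
    initialized router; training then continues as an MoE."""
    experts = nn.ModuleList([copy.deepcopy(dense_mlp) for _ in range(num_experts)])
    router = nn.Linear(d_model, num_experts, bias=False)   # trained from scratch
    nn.init.normal_(router.weight, std=0.02)
    return experts, router
```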

The dense models get a pretty non-trivial bump in performance. Qwen, which I mentioned at the start of this lecture: one of their earliest MoE attempts took one of their dense models and built an upcycled MoE from it. They got fairly significant performance gains relative to smaller models at the time; they got models on par with their 7B models with a 2.7 billion active parameter model.

To wrap up, I want to walk through the DeepSeek architecture at the very end here. Hopefully this will give you a sense of two things. First, I want you to understand the DeepSeek V3 architecture setup and all the changes they made, because that’s an example of a modern high-performance open-source system. I also want you to appreciate that architectures don’t change that much. DeepSeek v1 is not that new; it’s maybe a year and a half, maybe two years old, and they basically nailed the architecture at that point. I want you to see what they changed from that very early attempt to their big training run. This is the very first starting point. I’m calling it DeepSeek v1, but actually the right way to refer to it is DeepSeekMoE; it’s a 16 billion parameter model with 2.8 billion of those parameters active. You’ve seen this diagram already: this is the two shared plus 64 fine-grained experts, of which about six are active at a time. The routing you’ve already seen; I presented it in the middle of the lecture. This is the very standard top K routing where the softmax is at the bottom, before the top K selection. For balancing at training time, all they do is add this auxiliary loss balancing term, right?
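
For reference, here is a minimal sketch of that “softmax at the bottom, then top K” gating; shapes and names are illustrative rather than DeepSeek’s actual code.

```python
import torch
import torch.nn.functional as F

def topk_gating_softmax_first(h, router_weight, k):
    """Softmax over all expert affinities first, then keep the top-k as gates.

    h: [num_tokens, d_model], router_weight: [num_experts, d_model]
    """
    scores = F.softmax(h @ router_weight.T, dim=-1)     # [num_tokens, num_experts]
    gate_vals, expert_idx = scores.topk(k, dim=-1)      # top-k probabilities become the gates
    # Note: the selected gates are not renormalized to sum to one here;
    # that change shows up later in v3.
    return gate_vals, expert_idx
```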

Both the expert and device level balancing terms, right? So hopefully you remember those from earlier. That’s DeepSeek v1. They saw how effective their model was. To add some more context, DeepSeek originally had a dense model, then they had this MoE model, and that model was remarkably good. So when they went to v2, they went straight to a big MoE, and now this is a 236 billion parameter model, of which 21 billion of those parameters are active. You need a lot of memory, but your flops consumption for inferring this model is not so bad now. The architecture is identical; I copied literally the same figure because the architecture is literally the same, minus changes to the number of experts that are active. We’ve got some new things happening, but not too many new things. The top-K selector is the same; the equation from before, this previous equation, is still how they do things. They have a very clever trick that they add on.

At the beginning, I was going to ask: what’s the drawback of having fine-grained experts? Why can’t I have, I don’t know, 1024 fine-grained experts or 2048 fine-grained experts? The problem is that when you shard your experts very finely and have a lot of active experts, you’re going to have to route to those experts, right? Your communication costs potentially grow, and if you’re very fragmented, you might have to send a lot of tokens to a lot of devices. The clever thing they come up with is to say: I’m not going to route each token to the top K experts naively, which might force me to send my tokens to lots of devices. Instead, I’m going to first pick the top M devices. So I do my normal scoring calculation, but I first subset the set of allowed devices to the top M. Once I’ve picked my devices, I pick the top K experts for each token within those devices. So now I’ve restricted the devices, and this really controls the communication cost. This gives you more efficient training when you’re scaling up to these gigantic sizes. You need to start really engaging with the systems aspect of things when you’re training a 236 billion parameter model.
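
Here is a rough sketch of device-limited routing in that spirit; the contiguous expert-per-device layout and the max-based device score are my assumptions, not the exact DeepSeek v2 recipe.

```python
import torch

def device_limited_topk(scores, experts_per_device, m_devices, k_experts):
    """Pick the top-M devices per token first, then the top-K experts among
    only the experts living on those devices.

    scores: [num_tokens, num_experts]; experts are laid out contiguously per device.
    Assumes m_devices * experts_per_device >= k_experts.
    """
    num_tokens, num_experts = scores.shape
    num_devices = num_experts // experts_per_device
    per_device = scores.reshape(num_tokens, num_devices, experts_per_device)

    # Score each device by its best expert for this token, keep only the top-M devices.
    device_scores = per_device.max(dim=-1).values                 # [num_tokens, num_devices]
    top_devices = device_scores.topk(m_devices, dim=-1).indices   # [num_tokens, M]

    # Mask out experts that live on any other device, then do ordinary top-K.
    mask = torch.full((num_tokens, num_devices, experts_per_device), float("-inf"))
    idx = top_devices.unsqueeze(-1).expand(-1, -1, experts_per_device)
    mask.scatter_(1, idx, 0.0)
    limited = (per_device + mask).reshape(num_tokens, num_experts)
    gate_vals, expert_idx = limited.topk(k_experts, dim=-1)
    return gate_vals, expert_idx
```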

The other thing that reflects the systems concerns at this scale is that they add a communication balancing loss. One way of thinking about things is, you know, for an expert, there are kind of inputs and outputs. The inputs are the token that comes in, and you route to your expert. The outputs are you have to bring the tokens back where they belong. If a batch belongs on this device, it has to go back where the original device was. We have to think about both the input communication cost and the output communication cost. They add a balancing loss to try to balance out the output communication cost as well, not just the input side. That’s a minor note, but you can kind of see their attention to detail on trying to make sure all the different systems aspects are properly taken care of.

Finally, we get to the big DeepSeek v3—sorry, that should say v3 not v2 up there—671 billion parameters, of which 37 are active. Once again, exactly the same figure because the architecture itself doesn’t change. That’s stayed the same since DeepSeek MOE, right? If it works, don’t change it. They do change a couple of things. Maybe they were, you know, hearing you all say, “Why don’t you normalize to one?” So, you know, they’ve normalized the gate to one. They’ve moved the softmax normalizer operation up there. They are not actually exponentiating the gating decisions. They’re actually taking sigmoids, which is a sort of softer, more nicely behaved operation than the softmax. They have some changes here, but conceptually this is still the same as the top K routing decision. You hopefully see very similar things happening.
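
A minimal sketch of that v3-style gate, with illustrative names: sigmoid affinities, top-k selection, then normalization of the selected gates so they sum to one.

```python
import torch

def topk_gating_sigmoid(h, router_weight, k):
    """Sigmoid affinities instead of a softmax, then top-k, then normalize to one."""
    affinities = torch.sigmoid(h @ router_weight.T)      # [num_tokens, num_experts]
    gate_vals, expert_idx = affinities.topk(k, dim=-1)
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)   # gates now sum to 1
    return gate_vals, expert_idx
```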

In terms of the losses, they’ve gone to the auxiliary-loss-free trick, where a per-expert bias on the routing scores is incremented or decremented based on the expert’s load. They also have a sequence-wise auxiliary loss. Just to add some context on why you would want to balance different experts within a single sequence: the thing they’re very concerned about is that at training time it’s fine to not have a sequence-wise balancing loss, but at inference time it might be the case that someone sends you very out-of-distribution sequences, and that might overwhelm certain experts, right? At inference time, you can’t control which sequences you get, so you might want stronger balancing that operates at the single-sequence level rather than the overall batch level.
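
A sketch of how such a bias-based, auxiliary-loss-free balancer can work; the update rule and the step size `gamma` here are illustrative, not the paper’s exact values.

```python
import torch

def select_with_bias(affinities, bias, k):
    """The bias only affects which experts get selected, not the gate values themselves."""
    _, expert_idx = (affinities + bias).topk(k, dim=-1)
    gate_vals = affinities.gather(-1, expert_idx)
    return gate_vals, expert_idx

def update_bias(bias, expert_idx, num_experts, gamma=1e-3):
    """After each step, push the bias down for overloaded experts and up for underloaded ones."""
    load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    return bias - gamma * torch.sign(load - load.mean())
```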

Okay, and in the... oh, sorry, yes. The question is: does v3 still do the top M devices, does it keep that v2 improvement? Yeah, they keep the top M improvement. They do not keep, for example, the communication loss. So they’ve jettisoned some things, but top M is a clever idea; they keep it.

Yeah. But it’s not like they always add things; they have removed some of the things. In the last two or so minutes of the class, I’m going to go over the non-core parts of DeepSeek v3, because I think we’re already at the point where I’ve explained most of DeepSeek v3; I might as well go through the rest of it. You all know how that works. They have a clever optimization for the attention piece called MLA, or multi-head latent attention. You actually already know all the ingredients you need to understand this, because at the end of the last lecture I talked about GQA and MQA, right? Those are inference optimizations aimed at reducing the size of the KV cache.

The DeepSeek folks take a different approach to optimizing this. Instead of reducing the number of heads, they’re actually going to project the heads into a lower dimensional space. You have your inputs H of T, and instead of generating the K’s and V’s directly from these H of T’s, what I’m going to do is generate a low-dimensional C. You can think of this as a compressed version of H. This C is going to be smaller and easier to cache. I’m just going to cache these C’s. Whenever I need these K’s and V’s, I can sort of up-project from this KV conceptually speaking. Then I can take the inner products with the Q’s, right? You can see how this would be a KV cache savings if I only have to save the C instead of the higher dimensional H of T. That’s exactly the idea. You take your H of T, project it into a lower dimensional C, and then up-project this back into the K’s and V’s. If the C’s are small, you’ve compressed the KV cache. That’s good.
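
Here’s a toy sketch of that compress-then-expand structure; the dimensions and module names are mine, and real MLA has extra details (like the RoPE handling mentioned below).

```python
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Cache one small latent c_t per token instead of full per-head K and V."""
    def __init__(self, d_model=4096, d_latent=512, d_head=128, n_heads=32):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # h_t -> c_t
        self.up_k = nn.Linear(d_latent, d_head * n_heads, bias=False)  # c_t -> keys
        self.up_v = nn.Linear(d_latent, d_head * n_heads, bias=False)  # c_t -> values

    def compress(self, h_t):
        # Only this small c_t goes into the KV cache.
        return self.down(h_t)

    def expand(self, c_cached):
        # Recreate keys and values on the fly from the cached latents.
        return self.up_k(c_cached), self.up_v(c_cached)
```

With these toy sizes you cache 512 numbers per token instead of 2 × 32 × 128 = 8192, which is where the KV-cache savings come from (the actual DeepSeek dimensions differ).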

In terms of the computation, if you’re thinking about flops, you might think this is not good, because I have to multiply by an extra matrix, the up-projection W_UK. I didn’t have this matrix before; that’s an extra matrix multiply I have to pay for. The clever thing here is to remember that on the other side, K gets dotted with Q; that Q·K is an inner product in the attention operation, right? Q itself is produced by its own projection matrix W_Q. The trick is that you can merge this W_UK and this W_Q together into one matrix. I haven’t gotten extra matrix multiplies; I’ve just merged the new matrix multiply into my other one. This is just associativity; I can merge the two. They also compress the queries for memory savings during training, but that one is not quite as necessary because it doesn’t interact with the KV cache.
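
Written out in generic notation (not the paper’s exact symbols), the associativity argument is just:

```latex
q_t^\top k_s
  = \left(W^{Q} h_t\right)^\top \left(W^{UK} c_s\right)
  = h_t^\top \underbrace{\left(W^{Q}\right)^\top W^{UK}}_{\text{merge and precompute once}} \; c_s
```

So at inference time you never materialize the keys at all: you apply the merged matrix to the query-side input and take inner products directly with the cached latents c_s.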

I’m only going to mention this last one in passing because it is a subtlety, but it’s kind of a clever subtlety that you realize. This original trick, the sort of thing that I just described at the top, is not compatible with rope. The reason is that, you know, the rope matrices, you know, basically you have the Q’s and the K’s, and you rotate each of those Q’s and K’s by multiplying with a rotation matrix RQ and RK. But if you do that, these RQs and RKs are in between the query projection and this latent vector up projection matrix. Since I can’t reorder these matrix multiplies, rope kind of gets in the way. They still have a solution of basically doing rope on non-compressed dimensions. That’s kind of a side point; I think it’s not quite as important. You can look at the paper if you’re super interested.

The other thing they do, and this is the last thing I promise, is they have a minor change in their loss function called MTP where they predict multiple tokens in parallel. Normally, you have your inputs, you shift them to the left by one. You’re predicting one token in the future, and then your transformer is going to predict all those tokens. That’s your normal transformer loss. Before you make those predictions, you can take the hidden state; you can pass it to a very lightweight one-layer transformer, and that model can predict one token in the future. The model is not just predicting the next token; it’s predicting two tokens into the future. Hopefully, that all makes sense. This is just a small lightweight model that can do that. You can sort of see the architecture right here. The one thing that is kind of disappointing that I learned as I was researching for this lecture is that they only do MTP with one token ahead. Even though they have this very complicated diagram of how they could do it for many tokens, it turns out it’s only done for one token.

Okay, so now I’m all done. We’re kind of now at the core of how you would build and deploy a really high-performance large-scale system. MoEs take advantage of the sparsity idea: you don’t need all of the parameters all the time. Discrete routing is the real big challenge, and I think this is one of the big reasons why it didn’t immediately catch on; it’s very scary to have to optimize these top K routing decisions. But heuristics somehow seem to work, right? They just do. There’s a lot of empirical evidence now that, at least for flop-constrained settings, it’s just a good idea. It’s cost-effective. Do it. So definitely worth learning.

Thanks a lot for listening.

Jeff Dean’s talk at ETH Zurich in April 2025 on important trends in AI

2025-04-22 08:00:01

Jeff Dean’s talk at ETH Zurich in April 2025 on important trends in AI


All right, welcome everyone. Great to see a full house. It is my great pleasure to introduce Jeff Dean, who is Google’s chief scientist. He joined Google in 1999, where he’s been building, co-designing, and co-implementing the pillars of Google’s distributed technology with systems like MapReduce, Bigtable, Spanner, and TensorFlow, more recently, Pathways.

In 2011, he co-founded the Google Brain team, and since then, his focus and research have been on systems and applications for AI. Today, he’s going to tell us about important trends in AI, and I should also mention he’s won many awards. He’s the recipient of the ACM Prize in Computing, the IEEE John von Neumann Medal, and the Mark Weiser Award, and he’s an ACM Fellow, among many others. So, we are very excited to have you here, in case you can’t tell by the turnout, and very much looking forward to your talk. So, a warm welcome to Jeff Dean.

Thank you so much for the delightful introduction. I’m really excited to be here, and I’m going to talk to you today about important trends in AI. How do we get to where we are with the current state of what models can do? What can we do now that sort of the field has advanced to the current level? And how can we shape what we want AI to do in the future? This is joint work with many people at Google and elsewhere, so it’s not all my work. Many of it is collaborative work; some of it is not necessarily my work, but I think it’s an important set of work to discuss.

Okay, so some observations, most of which are probably reasonably obvious to you. Most importantly, machine learning has really changed our expectations of what we think computers are capable of doing. If you think back 10 years ago, computers could barely see with the rudimentary computer vision performance. Speech recognition worked but not super well. Language understanding in terms of language models was somewhat limited in capabilities.

What we’ve seen over the last 12, 13, 14 years is that increasing scale of compute used to train the models, the data, and the model size increases generally delivers better results. There’s an almost truism to that in many ways, where we’ve seen this over and over again over the last 15 years: bigger models and more data give you better performance in problems we actually care about in terms of capabilities of computers.

Algorithmic and model architecture improvements have also been really important in this, so it’s not just about throwing more hardware at the problem. Algorithmic and model architecture improvements have actually been more significant than just the hardware improvements we’ve seen in the last decade. As a result of all of this, the computations we want to run on computing hardware are really changing. How we think about building the computer hardware to run the applications of today and tomorrow is really shifting from traditional CPU-based computation.

First, I’m going to go through a section that is a whirlwind: roughly one slide per advance.

So, a whirlwind of one or two slides per particular technique that has been really influential in getting modern models to how they came to be, and let’s just launch right into that. It’s going to be mostly chronological but not quite.

A key building block from the last century is neural networks. Almost all of the advances you see in machine learning at the largest scale, and in the capabilities you see computers have, are based on neural network-based computation. These are made up of artificial neurons, loosely based on how real neurons behave in some ways, but they are very imperfect reproductions of how we understand real neurons to behave. There is a lot we don’t understand, but they are one of the underlying building blocks.

Another key building block is backpropagation as a way to optimize the weights of the neural network. By essentially backpropagating errors from the output the model gave you to the output you wanted, backpropagation gives a very effective algorithm for updating the weights of a neural network to minimize errors on training data. Because of the generalization properties of neural networks, you can then generalize to problems or particular examples the neural network has not seen.

These two things are key to a lot of the deep learning revolution: backpropagation and neural nets. One of the things that I and some other people worked on in 2012 was this notion that maybe if we were to train really big neural networks, they would be even better than small ones. We had this hypothesis and in 2012 we decided it would be kind of fun to train a very large neural network and see if we could do it using an unsupervised learning algorithm.

We trained this large neural network that was about 60 times bigger than the previously largest known neural network in 2012, using 16,000 CPU cores. At that time, we didn’t have GPUs in our data centers; we had a lot of regular CPUs. What we saw was that this unsupervised training objective, followed by supervised training, actually gave a 70% relative improvement in the less hotly contested ImageNet 22K category. Most of the ImageNet results you hear about are in the 1000-category section. This was more interesting, perhaps because it has 22,000 very fine-grained categories.

This was a significant advance and supported our hypothesis that larger models are more capable if you put sufficient training computation behind them. As part of that work, we developed our first large-scale neural network infrastructure project. This was called DistBelief, partly because it was distributed over many machines as a distributed computing system, but also because our colleagues didn’t think it was going to work. It was a little bit of a play on words.

When training these large models, and the model doesn’t fit on a single computer, there are a few different ways to imagine parallelizing that computation. The first is to take your model, which typically in a neural net has many layers of neurons, and slice them both vertically and horizontally to produce pieces of the model on each computer while managing communication between the edges crossing between the different splits made in your model. The other thing you can do is data parallelism, where now you have many copies of the underlying model on different machines, perhaps combined with model parallelism, with each copy being on many machines.

Then, you partition the data you’re training on across those different model replicas. In the case of what we were doing in DistBelief, we had a centralized system that could accept gradient updates from different replicas of the model and apply them to the parameters. We did this in a way that was not mathematically correct, because we were doing it completely asynchronously. Different model replicas would compute on a bit of data and send a gradient, based on the parameters and training data for that batch, back to the parameter server. By then, the parameters had moved because other model replicas had applied their gradients in the interim, which is clearly not mathematically correct according to the gradient descent algorithm, but it works.
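
A toy sketch of that asynchronous parameter-server pattern (my own simplification; the real system sharded the parameters across many server machines):

```python
import threading
import numpy as np

class ParameterServer:
    """Holds the parameters; replicas fetch them and send gradients back asynchronously."""
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.lock = threading.Lock()

    def fetch(self):
        return self.params.copy()

    def apply_gradient(self, grad, lr=0.01):
        with self.lock:                 # updates from different replicas interleave freely
            self.params -= lr * grad

def replica_loop(ps, data_shard, compute_grad):
    for batch in data_shard:
        params = ps.fetch()             # these may already be stale...
        grad = compute_grad(params, batch)
        ps.apply_gradient(grad)         # ...by the time this gradient is applied
```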

That’s nice, and it enabled us to scale to very large models, even using CPUs. In 2013, we used that framework to scale up training of dense representations of words using a word embedding model called Word2Vec. One of the things that is really useful coming out of this work is that having a representation of a word that is a high-dimensional vector gives you two nice properties if you train it in particular ways. One way to train it is by taking the representation, the vector representing the middle word, and trying to predict the nearby words from that representation.

Another version is taking all the surrounding words and trying to predict the middle word, but they both work kind of roughly equally well. When you train embedding vectors for words in this way, you find you can represent words with these high-dimensional vectors that have two nice properties. One is that nearby words in this high-dimensional space, after you train on lots of data, tend to be related because you nudged all the words related to cats, pumas, and tigers into the same part of the thousand-dimensional space.

The other interesting thing is that directions are meaningful in this space. To transform a male version of a word to a female version, you go in the same direction, regardless of whether the words are king and queen, man and woman, bull and cow, or various other examples. Linguistic properties emerge from the training process in the directions between different points in the space.
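
To make the “directions are meaningful” point concrete, here’s a tiny sketch (my own) of the classic analogy lookup, assuming you already have a dictionary of trained word vectors:

```python
import numpy as np

def analogy(embeddings: dict[str, np.ndarray], a: str, b: str, c: str) -> str:
    """Return the vocabulary word closest to vec(b) - vec(a) + vec(c),
    e.g. king - man + woman is expected to land near queen."""
    target = embeddings[b] - embeddings[a] + embeddings[c]

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    candidates = (w for w in embeddings if w not in {a, b, c})
    return max(candidates, key=lambda w: cos(embeddings[w], target))

# analogy(vecs, "man", "king", "woman") would be expected to return "queen".
```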

In 2014, three of my colleagues—Ilya Sutskever, Oriol Vinyals, and Quoc Le—developed a model called sequence-to-sequence learning with neural networks. The idea here is you have some input sequence and you want to predict an output sequence from that input sequence. A classic case is translation, where you have the English sentence and then, using the representation you’ve built up by processing the input English sentence one word at a time, you now have a dense representation that you start to decode into the French sentence.

By processing lots of language sentence pairs of English and French, you essentially learn to do a language translation system purely from this kind of sequence-to-sequence based neural network. If you use that to initialize the state of the neural decoder when starting to translate, it actually works, and you scale up the LSTMs to show that it can work better and better.

In about 2013, I started to get worried because as we were making bigger and bigger neural networks for things like speech, vision, and language, I began to calculate that if speech recognition starts to work better, people might use it and that might be problematic if we want to serve many users in the system. I did rough calculations and determined that if 100 million of our users started talking to their phones for three minutes a day, and at that time the models were big enough that they couldn’t run on devices, they had to run in our data center.

We had a better speech model that would reduce the error rate by 40%, which is a significant improvement, and we knew it was going to be even better if we could serve it to a lot of people. However, my calculations indicated that serving those 100 million people for three minutes a day would require doubling the number of computers Google had, just to roll out that improvement in the speech recognition model. And this is just one of our many products.

I started talking to some of our colleagues in our technical infrastructure group who had hardware expertise, and we decided it would be sensible to build more customized hardware for neural network inference. This was the genesis of the tensor processing unit (TPU) line. The first version was specialized for inference only, using reduced precision and operating with only 8-bit integer operations in its multiplier. The target was to build something really good at low precision linear algebra, which would be useful for serving a lot of different kinds of neural network-based models without needing all the complex features of modern CPUs, like branch predictors or caches.

Fast forward: the team produced a TPU that was 15 to 30 times faster than contemporary CPUs and GPUs for these kinds of tasks, and 30 to 80 times more energy-efficient. By the way, this is now the most cited paper in ISCA’s 50-year history, which is quite impressive since it was only published in 2017. This really started our foray into more specialized compute for machine learning models.

Then we considered scaling up and focusing on training, not just inference. That’s when we began thinking about systems that resemble machine learning supercomputers, with high-speed interconnect between many chips densely connected by custom high-speed interconnect. We have done six generations of TPU pods that are great for both inference and training. These connect thousands of chips together. The initial pod had 256, then 1000, then 4000, and the most recent ones have been around eight or nine thousand chips, all connected with custom high-speed networks.

Since version 4, they have featured a really exotic optical network. You can take a rack of 64 chips and connect it to another rack of 64 chips, using optical switching and mirror movements to make them function as though they’re next to one another on the data center floor, even if they’re not. You can read about that in the ISCA paper.

We announced the latest version last week—Ironwood. We’ve stopped naming them with numbers, which confuses me, but Ironwood has a fairly large pod size. It’s got 9216 chips, each of which can perform 4614 teraflops, totaling 42.5 exaflops in one of these pods, with reduced precision floating points. This is 8-bit floating point precision, quite a boost from the previous generation.

Compared to the first training pod, it represents about a 3600x increase in compute capability in the pod over seven years. With lots of clever circuit design, shrinking fab processes, and lower precision operations than the original TPUv2, we’re achieving about a 30x improvement in energy efficiency per flop compared to the first training pod of 2018.

Another trend that’s important is that open-source tools for machine learning have enabled a broader community to participate in improving those tools and using them to tackle machine learning problems across various disciplines. TensorFlow, which we released in 2015, PyTorch, which came in 2016, and Jax, another Google-developed open-source framework with a more functional style, emerged around 2017 or 18. These three packages have significantly pushed the field forward in terms of accessibility and standardization.

In 2017, some of my colleagues observed that in a recurrent model, you have a sequential process of absorbing one token at a time and updating the internal state of the model before advancing to the next one. This inherent sequential step limits parallelism and efficiency in learning from large amounts of data. They proposed saving all the internal states and developing a mechanism called attention to refer back to all the states you went through to alleviate this.

This is a hugely influential paper because it demonstrated that, with 10 to 100 times less compute and 10 times smaller models, you could achieve better performance than the state-of-the-art LSTM and other model architectures at the time. This log-scale difference has been significant. Nearly all modern large language models you hear about use transformers as the underlying model architecture, with variations.
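
As a reference point, here is a minimal scaled dot-product attention in plain NumPy (a sketch of the core operation only, ignoring masking, multiple heads, and batching):

```python
import numpy as np

def attention(Q, K, V):
    """Minimal scaled dot-product attention: every position looks back at the
    saved states of all positions in one parallel step, instead of a sequential
    recurrence. Shapes: Q, K are [n, d_k]; V is [n, d_v]."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # [n, n] pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                         # weighted mix of values
```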

This was not new in 2018 but really came into vogue then, realizing that language modeling at scale can be done with self-supervised data. You can use any piece of text to predict other parts of the text, generating large amounts of training data. This is a major reason these language models have become so good—more text to train on equals improved quality. There are various training objectives; the first is autoregressive, where you look at the prefix of words and predict the next word.

Many models today follow this approach, letting you create training puzzles. For instance, “Zurich is blank.” The model uses the context to predict the missing word. You can also employ fill-in-the-blank style training examples, creating diverse training examples from the same text. Both training objectives are useful, but autoregressive ones are more common, especially in applications like chatbots, where only past context is available.

In 2021, other colleagues of mine developed a method to map image tasks into a transformer-based model. Prior to that, most people used convolutional neural networks of some form. Essentially, they were able to take an image, break it into patches, and similarly to how Word2Vec embeds words into dense representations, represent those patches with high-dimensional vectors that incorporate aspects like color and orientation.

Then, you feed these patch representations into the transformer model. Instead of using word embeddings for the input, you use patch embeddings, allowing you to handle image data. As you’ll see, when training multimodal models, you can combine text and images, embedding visual patches with a visual model and text patches with a part of a text model.
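
A toy sketch of that patch-embedding step (my own simplification; it assumes the image dimensions are multiples of the patch size and that `W_proj` is a learned projection):

```python
import numpy as np

def patch_embed(image, patch_size, W_proj):
    """Cut the image into patches, flatten each patch, and project it to the model
    dimension, analogous to a word embedding.

    image: [H, W, C]; W_proj: [patch_size*patch_size*C, d_model].
    """
    H, W, C = image.shape
    patches = []
    for i in range(0, H, patch_size):
        for j in range(0, W, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size, :].reshape(-1))
    patches = np.stack(patches)          # [num_patches, patch_size^2 * C]
    return patches @ W_proj              # [num_patches, d_model] -> fed into the transformer
```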

The attention operation in the transformer attends to relevant parts of the image when asked what’s in it. For example, it’s focused on the airplane or the dog, but when faced with confounding elements, the attention is less focused, scanning over the entire image to gather visual clues to predict the correct response. This has been hugely influential in unifying transformers for text with those for images.

Another innovation came in 2017, when I and some colleagues developed a way to create sparse models that have a large capacity but activate only a small portion of the model for each token or example. In our original paper, we used around 48 experts per layer but would activate just two. This allows the model to maintain a large capacity while only selectively using portions based on what’s relevant, enhancing efficiency.

The choice of which experts to activate is learned end-to-end through backpropagation, enabling the model to handle various contexts, like dates and times or geographical locations. We achieved an 8x reduction in training compute cost for the same accuracy, or major improvements in accuracy for the same training cost. When you encounter graphs comparing compute budgets and accuracy scores, you want to line things up horizontally to illustrate less compute needed for the same accuracy.

We’ve continued to conduct substantial work on sparse models because we see it as a vital direction for models with large capacity that require activation of a small percentage of the model.

In 2018, we began rethinking software abstractions for large distributed machine learning computations. We aimed to train models at a larger scale, connecting together many TPU pods in software. Each smaller box with yellow dots represents a TPU pod, and we wanted to enable seamless connectivity among many of these and have the distributed system manage the right sort of communication mechanism for when one of these chips needs to talk to another. So when two yellow chips in the same small box need to talk to each other, you use the very high-speed TPU network.

When the chip in the upper left box needs to talk to one in the pod in the same building, it will use the data center network within that building. If it needs to talk across buildings, it will use the network that goes between buildings in the same data center facility. You can even have TPU pods connected together in different regions via larger wide area network links (that big orangey-red arrow). Having this nice, scalable software simplifies running these large-scale computations.

So in fact, one of the abstractions that pathways gives to the sort of machine learning developer researcher is you just have a single Python process and Jax has a notion of devices. So normally if you’re just running on a single machine with say four TPU chips in it, it shows up as a process with four chips. But what Pathways does when you run it under Jax with Pathways underneath it, all the chips in this entire training job just show up as devices for Jax.

So you have a single Python process, and it looks like you just have a single sea of, say, 10,000 or 20,000 TPU devices; you can run computations on that, and Pathways takes care of mapping that computation onto the actual physical devices. One of the things we did just last week was make the Pathways system, which we’ve used internally for six years now, available for cloud customers using our cloud TPU products.
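
To give a feel for that abstraction, here’s a small JAX sketch; on a single host it just lists the local chips, and the claim that an entire multi-pod job shows up the same way is the Pathways behavior described above, not something this snippet demonstrates by itself.

```python
import numpy as np
import jax
import jax.numpy as jnp

# Every accelerator chip visible to this one Python process shows up as a JAX device.
devices = jax.devices()
print(f"{len(devices)} devices visible to this one Python process")

# Shard an array across all visible devices; the runtime maps the computation
# onto the physical chips, wherever they happen to live.
mesh = jax.sharding.Mesh(np.array(devices), axis_names=("data",))
sharding = jax.sharding.NamedSharding(mesh, jax.sharding.PartitionSpec("data"))
x = jax.device_put(jnp.ones((len(devices) * 128, 256)), sharding)
y = jax.jit(lambda a: jnp.tanh(a).sum())(x)
```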

Another observation by some colleagues of mine was that thinking longer at inference time is very useful. In the same way that your third grade math teacher told you to show your work when you were solving problems, because you were more likely to get the sequence of steps right and solve the problem correctly, it turns out large language models are the same way. If you just give them an example problem: Shawn has five toys; for Christmas he got two toys each from his mom and his dad; how many toys does he have now? The answer is nine. That’s the one-shot example in the input.

Now you’re asked a new problem. John takes care of 10 dogs. Each dog takes half an hour a day to walk and takes care of the business. How many hours a week does he spend taking care of dogs? Then the model got this particular problem wrong. It said 50. That’s not correct. But if you encourage the model to show its work by in the one example problem you’ve given it, actually show it that hey, this is kind of the sequence of steps to work out the problem. Sean started with five toys. If he got two toys each from his mom and his dad, then he has four more toys. 5 plus 4 is nine. The answer is nine.

So that seems very simple, but it actually turns out that this tremendously helps models become more accurate because they are now encouraged to think through the steps in order to solve the problem in a finer grain way. You see that as the model scale improves, the solve rate goes up somewhat if you just use standard prompting but goes up dramatically when you use chain of thought prompting. This is for like a benchmark of like roughly eighth grade math level problems. So prompting the model to show its work improves the accuracy on reasoning tasks.

You can think of this as also a way of using more compute at inference time, because now it has to produce all these extra tokens in order to actually get to the right format of answer. In 2014, Geoffrey Hinton, Oriol Vinyals, and I developed a technique called distillation, distilling the knowledge in a neural network. The idea was you have a really good model and you want to put its knowledge into a different model, typically a smaller one.

So the typical way you’re training the small model is let’s say you’re doing next token prediction. So the prefix you see is perform the concerto for blank and the true next word is violin. So you can train your language model with that objective and if you guess violin correctly, great. If you guess it wrong, then you get some back propagation error from the training objective. It turns out that works okay. But if you can use your teacher model to give you not just the correct answer, but a distribution over what it thinks are good answers for this question for this particular word, it gives you a much richer signal of training.

Think of the loss you get for the original hard label of just violin: you get a zero for everything except violin, and then you get a one. But here the distribution of probabilities is violin 0.4, piano 0.2, trumpet 0.01, and airplane is extremely unlikely in this circumstance. A concerto for airplane, I don’t know, I guess you could have one, but it’s unlikely. That really rich gradient signal is something you can use to inject much more knowledge into every training example for the smaller model, and it enables you to get to convergence much more quickly.

If you look at some of these comparisons, this is a speech-based setting where you have a training frame accuracy, but what you really care about is the test frame accuracy of did you predict the sound in this frame of audio correctly? The baseline with 100% of the training data gets 58.9% on the test frame accuracy. If you strip the training set down to only 3% of the training data, then your training frame accuracy actually goes up because your model overfits to the very small number of training examples you have. But your test frame accuracy plummets because now you’re in an overfitting regime and you can’t do very well on new test examples you’ve never seen before.

But if you use these soft targets produced by the distillation process and use only 3% of the training data, what you see is you get pretty good training frame accuracy, but you get almost as accurate at the test frame accuracy with only 3% of the data. This is a really nice property because it means you can suddenly transfer the knowledge of a large neural network into a small neural network and make it almost as accurate as the large one.
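
A common way to write this down is a temperature-softened KL term mixed with the usual hard-label loss; this sketch is a standard formulation, not code from the talk, and `temperature` and `alpha` are placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Mix the hard-label cross-entropy with a KL term that pushes the student
    toward the teacher's full distribution (violin 0.4, piano 0.2, ...)."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```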

This was rejected from NeurIPS 2014. We published it in a workshop and put it on arXiv, and it now has 24,000 citations. In 2022, some colleagues and I looked at different ways of mapping computation onto our TPU pods for doing efficient inference. There are a whole bunch of variations one can consider. Do you keep the weights stationary in one of the dimensions of the network? Do you keep them stationary in both dimensions, so that your weights are now spread across a two-dimensional grid? Or do you gather the weights and bring them to where they’re needed? The details aren’t that important, but there are a bunch of different ways of doing it.

One of the things that is true is the right choices for how to do this actually depend on a lot of different factors. One is what is your batch size, which can have a lot of influence on whether one of these three techniques is actually better. Latency constraints can also have a big effect. So if you think about this, we have these three different techniques: weight stationary, weight gathered, and XY weight gathered, and there’s even another one XYZ weight gathered. What you see is the little dotted things at the bottom of these techniques are the best to do at varying different batch sizes and that the right answer changes as you change the batch size.

That also means your floating-point utilization of your hardware also changes depending on your strategy. The right answer depends on how large your batch size is. At very small batch size, you want to use a 2D weight gathered in this case. At larger batch size, a 2D weight stationary at small sizes, and a 2D weight gathered at larger. It’s just to say that there’s a lot of complicated choices in how you decide how to partition a model and do inference at scale.

In 2023, some colleagues of mine developed a technique called speculative decoding. The idea here is we’re going to use a small drafter model, maybe 10 to 20 times smaller than the larger model, with the idea being that many things are actually quite predictable by a small model. We can sequentially predict from the very small drafter model much more rapidly than we can sequentially predict from the very large model.

We’re going to predict the next K tokens with the small model, and then we’re going to ask the large model to predict K tokens in a row. We can advance this generation by as many tokens as match in the prefix of size K. Essentially, if you do this with just the large slow model, it’s going to trundle along predicting one word at a time. But if you do this with the drafter model, you see the drafter is predicting four or five words at a time and then the larger model is trying to predict and will advance as many as the words match that the drafter model created for you.
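
Here’s a greedy-decoding sketch of that accept-the-matching-prefix loop; `drafter_next` and `target_next` are hypothetical callables, and in practice the large model scores all K draft positions in a single forward pass rather than one call per position.

```python
def speculative_decode(prompt, drafter_next, target_next, k=4, max_new=64):
    """Greedy speculative decoding sketch: draft k tokens cheaply, verify with the
    large model, and accept the longest matching prefix."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. Draft k tokens with the small, fast model.
        draft = []
        for _ in range(k):
            draft.append(drafter_next(tokens + draft))

        # 2. Verify with the big model: accept the longest matching prefix.
        accepted = 0
        for i in range(k):
            if target_next(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]

        # 3. The big model always contributes at least one token, so we make progress.
        tokens.append(target_next(tokens))
    return tokens
```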

By doing size K predictions for K words, you essentially amortize the memory overhead of bringing in the weights of the model in order to then predict K words instead of just one. There’s an awful lot of things that have happened all kind of combining together to really improve the quality of models that people are seeing today. Better accelerator hardware. That’s true in TPUs, but also Nvidia GPUs have gotten a lot better in recent years for machine learning focused applications as well.

Software abstractions are really important because they enable you to have these nice layers where you can focus a lot on the performance and the abstractions provided by those things and then people on top can build useful things without necessarily having to think about the details as much underneath those abstractions. Model architectures have seen huge improvements, in particular transformers, visual transformers, and are really heavily used in the most modern models.

Training algorithms, unsupervised and self-supervised learning, asynchronous training, distillation, and I didn’t talk about supervised fine-tuning after you’ve pre-trained your model or RL from human feedback or other kinds of computational feedback. That’s a super important aspect: chain of thought, speculative decoding, and inference time compute scaling. All of these are really important in the modern era.

Now I’m going to talk a little bit about the Gemini models that we’ve been training and how most of these innovations are used in various iterations of the Gemini models. Gemini is really a project that started as a collaboration between Google DeepMind, Google Research, and the rest of Google. We started this in February 2023 with our goal being to train the best multimodal models in the world and use them across Google.

There are all kinds of ways in which these models can help various Google products. They’re also available externally through our cloud APIs. This is kind of a timeline of what we’ve been up to since February 2023. We released Gemini 1.0 in December 2023, followed soon thereafter by Gemini 1.5 and so on. One of the things we wanted was to make these models multimodal from the very beginning because we felt like just text models were not as useful as models that could sort of understand language, understand visual inputs, understand audio, and also produce all those things.

The initial versions of the model did not produce audio as output, but they could take audio, video, images, and text as input and produce images and text as output. We’ve since added the ability to produce audio output as well. Gemini 1.5 introduced this very long context length so that you can provide inputs that are millions of tokens in length.

Think about a thousand-page document; that is about a million tokens. So you can now put 50 research papers or a very long book or multiple books into the context window. One of the nice things about the input data in the model, particularly transformer models, because of the attention mechanism, is that information is very clear to the model. Unlike training data where you’ve sort of trained on trillions of tokens, and you’ve optimized your billions or tens of billions of parameters of weights with those trillions of tokens, you’ve kind of stirred them all together and you’ve lost a little bit of the fidelity of the exact pieces of information there.

In the context window, that information is very clear to the model and enables it to sort of extract, summarize, and reason over that data much more capably than other kinds of data. In Gemini 2.0, as I said, these models build on a lot of these innovations. We use TPUs, we do cross data center training across metropolitan areas, using pathways, using Jax on top of that, the distributed representations of words and image data is super important, transformers, sparse mixture of experts, and distillation, and a lot more things besides.

But really these all kind of come together in our model training recipe and our model serving recipes. Just about a month ago, we released Gemini 2.5 Pro, which is our most recent model. This has been pretty well received because it has a significant leap forward in some of our various benchmarks that it performs on. It’s gotten a lot better at coding compared to our previous Gemini models.

Actually, there’s an arena for comparing model quality across different models run by LMArena, a Berkeley-affiliated group of grad students. They let users enter a prompt, then pick two random models behind the scenes, and then they show the output from both models to the user anonymously. So you don’t know which model is which. And then you’re asked which output you like better.

It’s sort of a head-to-head competition of language models, and through thousands of trials like this, you can actually get a very good sense of the strength of models, at least in terms of how well the answers reflect what people using this LM arena like. We found it pretty useful. It does correlate quite well with the strength of the models.

This has a pretty significant ELO improvement over our previous models. It’s actually done pretty well on a whole bunch of independent evaluations that people do across the web, and on various academic benchmarks on the left there. We are sadly number four on New York Times connections. So we’ll have to work on that. But in general, this set of leaderboards covers quite a broad set of areas. Some of these are coding related, some are math related, some are sort of multimodal related.

We really try to focus on making good general-purpose models that are effective at a lot of different things. Users are generally enjoying this. Some of this is a little over-the-top phrase, but people do seem to like it. In particular, the long context abilities are really good for coding, particularly now that the reasoning capabilities of the model are also greatly improved.

Having a million or two million tokens of context enables you to put large code bases entirely into the context window and then ask the model to do fairly complicated things like can you please refactor this for me or can you introduce a new feature that has this capability. It also enables you to process other kinds of data. For example, this bottom person has a dataset of a thousand poems, 230,000 tokens, and then asked a bunch of stuff which requires reasoning over all those poems. They were quite impressed by that because I guess that’s hard.

One of the things we really focus on is the ELO score I mentioned from LMArena. A higher ELO score means a more capable, higher-quality model as judged by those users. On the x-axis, there’s the cost of a whole bunch of different kinds of commercial models. Importantly, the x-axis is a log scale, so don’t miss that important point.

Just emphasizing the point, where you want to be is as far up and to the right as you possibly can. We produce a series of different models with different quality and cost trade-offs. Our flash models over to the right are generally quite cheap. They are about 15 cents per million tokens. Our most recent 2.5 Pro model is more expensive because it’s a much heavier weight model, which costs more for us to run it, but it’s still quite affordable for the quality you get.

Generally, we like to see that we have a variety of offerings on the Pareto frontier of this quality-cost trade-off. We are going to work to keep pushing up and to the right there as much as we possibly can.

Gemini is a pretty large-scale effort. If you look at the Gemini 1.5 paper, we do have quite a few authors. It’s very hard to write a short paper if you have to list all your authors. Truly, it’s a large-scale team effort and everyone here contributed tremendously to this. One of the things we’ve had to figure out was how can we best structure this so we can have that many people effectively contributing to a single model project.

Some of the structuring techniques we use are to have different areas that people loosely affiliate with. Some people are much more focused on the pre-training process or on data or on safety or values. Not to say that these are very hard boundaries, but generally some people have some affiliation with some of these more than others.

There are overall tech leads of the project, which include myself, Oriol Vinyals, and Noam Shazeer. We have a really capable program management and product management team. Although Gemini is kind of a model creation effort, it does have a lot of product implications, because we want to release that model into lots of different surfaces at Google. Interacting with all those other teams about what features they need, where they see the model perform well, and, more importantly, where it is not performing well, and getting feedback from them is something that’s really important.

We kind of have three broad categories of these different areas: model development, pre-training where you’re training on a large corpus of text and other multimodal data; post-training where you’ve finished pre-training the model on lots of data and now you’re trying to coax the model into behaving in certain ways with relatively small amounts of data using things like reinforcement learning or supervised fine-tuning.

On-device models are another important aspect; we have Gemini models running on phones that have a slightly different character than some of the larger data center-based ones. The core areas are kind of the ones that crosscut most aspects of Gemini: training data evaluations, infrastructure, the codebase for research and for model expressing, the production model training, and inference systems.

Serving is really important for long-term research within Gemini. There’s also a lot of research that happens outside of Gemini, and we sort of keep an eye on that kind of work, and our colleagues will say, “Hey, we have something that might be sensible to consider for the next generation of Gemini.” Capabilities are generally about particular narrower aspects of the model: can we make it safe and behave well? Is it really good for coding? Can we make it good at vision tasks in particular or audio tasks in particular?

Agent behavior is now a very important aspect of what we’re doing. Internationalization is crucial because we want this thing to work well in hundreds of languages, not just five. These are kind of broad areas. We have roughly a third of our people in the San Francisco Bay Area. I’m based in Mountain View. About a third are in London, and a third are in a bunch of other places including Zurich, New York City, Paris, Boston, Bangalore, Tel Aviv, and Seattle, which are some of the bigger concentrations of people not in the first two areas.

Time zones are really annoying. The golden hours between the California West Coast and London, Europe during the workday are relatively limited. It’s maybe two or three hours a day that you really have sensible meeting times for both sides. Past that, one side is like, I don’t know, our poor Bangalore colleagues are never in golden hours with anyone else. But it is a worldwide effort. There are some benefits to having people all around the world because when the model is training, there’s always someone awake and sort of paying attention to a large-scale training run.

Often, you might fire off a question to a colleague in London, and they are not there, but when you wake up in the morning, you know they’ve answered and done a bunch of work on your behalf. There are benefits, but distributed work is challenging. One of the ways we’ve been able to make this work is we have lots of large and small discussions and information sharing conducted in virtual Google chat spaces. I’m in 200 of these.

I wake up brushing my teeth and get probably seven alerts while I’m brushing my teeth in the morning because my London colleagues are busy at work and excited about sharing things in various chat rooms. We have a slightly formalized request-for-comments process, which is really a one- to ten-page document about some piece of work, a thread of work, results that have been obtained, or experiments someone is thinking about running.

People will give feedback in Google Docs style. We have a slightly formalized way for some of these to say, yes, we think this should make it into the next generation of our model training, or the new recipe. We have leaderboards and common baselines to enable good data-driven decision-making about how to improve the model. There are many rounds of experimentation, lots of experiments at small scale. You want to advance the smaller scale experiments that seem promising to the next scale to see if the results kind of hold up and are on trend.

Every so often, every few weeks, you incorporate the successful experiments demonstrated at the largest scale into a new candidate baseline. You run that candidate baseline, see if it’s better than the previous baseline, and check whether it has any unexpected interactions among the few things you piled in there. And then you repeat. That’s how we do it, particularly for pre-training recipe development.

I mentioned scaling of people, but there’s also scaling of computing hardware, which is quite annoying. I’ll give you just one example: silent data corruption. Despite the best efforts, given the scale of these ML systems and the size of the training jobs, you will get hardware errors that sometimes are not detected by the hardware itself. Because it’s a very large, tightly coupled system, the incorrect computations from one buggy chip can then spread to the entire model. These errors non-deterministically produce incorrect results; they can happen on particular pieces of hardware, or on any piece of hardware randomly due to things like background radiation. They become worse at scale, and with synchronous stochastic gradient descent a bad result can spread.

One of the things we do as we’re training is monitor the norm of our gradients, and if we see large spikes in that, we get concerned. Is the concern justified? We don’t know; it’s certainly a large gradient relative to the ones we’ve seen recently. You can also get anomalies with no silent data corruption error at all. The first one we saw was actually a silent data corruption error, and the way we detect that is we rewind a few steps and replay in a deterministic manner. If we see the same result, then it must be in the data; it’s probably not a hardware failure. If we see a different answer, though, that’s concerning, because everything is supposed to be deterministic when we replay.

In this case, we did see an anomaly in the gradient, but when we replayed it, we saw that the same large gradient value occurred in the replay as well. You can also detect SDCs if you just happen to replay without an anomaly. That is probably the low bits of your exponent getting flipped by an error rather than the high bits. The high bits being flipped is bad, because then all of a sudden you have 10 to the 12th in the gradient when you expected a 7.
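
To make that detect-and-replay idea concrete, here is a minimal Python sketch. The spike threshold, the `replay_fn` interface, and the single-number comparison are hypothetical simplifications; real systems compare far more state than one gradient norm.

```python
import numpy as np

def gradient_spike(grad_norm, recent_norms, spike_factor=10.0):
    """Flag a step whose gradient norm is much larger than the recent average."""
    return grad_norm > spike_factor * np.mean(recent_norms)

def looks_like_sdc(replay_fn, checkpoint_step, batch, original_grad_norm):
    """Rewind to a checkpoint and deterministically replay the same batch.
    Same gradient norm  -> the spike likely came from the data, not the hardware.
    Different norm      -> a silent data corruption (hardware) error is suspected,
    since the replay is supposed to be bit-for-bit deterministic."""
    replayed_norm = replay_fn(checkpoint_step, batch)
    return replayed_norm != original_grad_norm
```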

I’m going to skip that and give you some examples of what these models can do. They can help fix bugs in your code, which is nice. This person uploaded their entire codebase and all the issues, and it identified the urgent one: some handler was being called twice, so the fix added a flag recording whether the handler had been called, and only calls it if it hasn’t been.

In-context learning: Kalamang is a language spoken by about 200 people in the world. There’s a woman who wrote a PhD thesis on a grammar of Kalamang. There’s effectively no written internet training data on Kalamang. But what we’ve observed is that if you put this book into the model’s context and then ask it to translate English to Kalamang or Kalamang to English, it can actually do about as well as a human language learner who has been given the grammar book and a dictionary for Kalamang.

That’s kind of nice because it shows in-context learning at the level of: I put in a 400-page PhD thesis about a topic the model has no idea about, and it is actually able to make sense of Kalamang and translate it.

Then there’s converting a video of a bookshelf to JSON, which is kind of fun. You might not have thought of that as an input method, but you can do that, and it works well. Video understanding and summarization: you can put in fairly long videos; a million tokens is about two hours of video. The prompt is: in a table, please write the sport, the team and athletes involved, the year, and a short description of why each of these moments in sports is so iconic. The model gets to see the pixels of the video and the audio track.

It’s about an 11-minute video, I think. The output of the model is that table, which is probably more structured data extraction than you thought you might be able to get out of in-context video. I think people are not yet clued into the fact that you can take multimodal data like that and do pretty interesting things.

Digitization of historical data. You can take weather data that looks like that from 100 years ago and just say, “Please give it to me in JSON,” and it will do that. They’ve done it for 144 tables and that cost them 10 p. But now they’re able to actually sort of unlock all this weather data.

Code generation via high-level language. Here’s the prompt we’re going to give to our Gemini 2.5 model. P5JS to explore a Mandelbrot set. That’s the prompt. Oh, can’t. I’m so sad. Why is it not able to do that? It was working before. Oh, I’m not on the Wi-Fi. It’s true. I’m not. Well, anyway, it makes a really nice interactive visual Mandelbrot explorer like that.

Now that we have these models, what will this all mean for us in society? I think it’s a really important set of topics. So I and eight other co-authors recently got together and wrote this paper called Shaping AI’s Impact on Billions of Lives. We’re a bunch of computer scientists and people with machine learning backgrounds from academia, big tech companies, and startups, and we wanted to propose what the impact of AI on the world could be given directed research and policy efforts.

A lot of people in this space are thinking about what will happen with AI if we’re laissez-faire. Will we all be doomed or will we have incredible advances? I think really a pragmatic approach is to say let’s, as society and machine learning researchers and practitioners and experts, all work together to try to shape things so that we get the best aspects of AI and minimize the downsides.

Really, that was what this paper was intended to be: a discussion of how we might do that collectively. We interviewed 24 different experts in seven different fields, including employment, education, healthcare, information, and media. We talked to former President Barack Obama; Sal Khan in education; John Jumper, whom we spoke to before he won the Nobel Prize (he won it later); Neal Stephenson; Dario Amodei; and Bob Wachter. We uncovered five guidelines for AI for public good.

I will skip over the rest of this, but you can see shapingai.com. There’s an arXiv paper linked from that site that I think is a pretty nice discussion of what will happen, or could happen, in a bunch of different areas, including employment, education, and healthcare. It’s pretty important for us to all work together to get this right.

With that, I will conclude by saying we also proposed some nice milestones of what people should work on in some of these areas. These models are becoming incredibly powerful and useful tools and I think you’re going to see continued improvement in this as there’s more investment and more people in the field doing research and those advances get incorporated into the leading models. You’re going to see even more capable models.

It’s going to have a dramatic impact in a lot of areas, and it’s going to potentially make really deep expertise available to a lot of people across a lot of different areas. That expertise being widely available, and delivered well, is one of the things that is both most exciting and also kind of disconcerting to some people. I think our AI-assisted future is really bright.

Thank you. [Applause]

Thank you very much for the great talk. A little token of appreciation from the department. Thank you so much. Some chocolates and a systems group t-shirt. I love coming to Switzerland because I get chocolate and a t-shirt. Thank you very much.

And we’ll now proceed to the Q&A. We have one mic and we have one cube that we can toss around. We’ve discussed that we’ll sort of also try to prioritize students especially for questions. If you can raise your hands if you have questions and you can point in a general area.

And my aim is probably not that great. Ah, nice. Well done. [Applause]

Hi. Thank you so much, and especially for the last paper you presented. (Oh yeah, hold it up to your mouth. Like this. Yeah, perfect. There we go.) So, thank you for the talk and especially the last paper. It’s very important, I think, so I’d like to ask about that point a bit.

AI safety is definitely on our minds, I think, and it’s super unclear, especially from outside the big research labs, what would even be positive and what would be really impactful. So, from the perspective of making sure everything goes well and stays in human control: what would you do in the area of AI safety as, say, a PhD student starting a thesis, a professor with a bunch of research grant money, or even a startup you could start this year?

Yes, exactly. I mean I think AI safety is a pretty broad topic. I think there’s a bunch of concerns about the increasing capabilities of these models being able to enable people to do things that they wouldn’t otherwise be able to do that are somewhat nefarious or undesirable from a societal perspective. So I think some of that can be addressed with some technical means, but I also think that there’s going to need to be policy-based and regulatory-based things that impose some restrictions on some aspects of that.

One of the topics that we covered in the paper was about misinformation and public discourse. There, I think you know there’s clearly an ability for AI models to create more realistic misinformation in the world and enable people to create it at mass scale with lower costs. Misinformation is not a new thing; you could always create it, but now you have these tools that enable sort of more realistic and more rapid creation. So that is definitely an issue.

I think there’s a corresponding research question of how do you detect this information that is perhaps generated by a different machine learning model. There’s also some questions about how do you turn the problem onto a more positive spin. One of the things we’ve suggested in the paper was there’s actually some early evidence that AI models can be used to enable more constructive discourse in online forums.

That’s an area where I think looking at how could AI models encourage more positive conversations, identify misinformation in the flows of conversations that people are having with each other, these are some things that I think are pretty interesting. There’s a whole bunch of ideas in that paper that I think are worthy of study, and I don’t think the solution is necessarily going to be purely technical for all these problems.

Thank you. Yep. And send the cube over to him, but we’ll take someone else for the moment if that’s okay. Sure. Yes. Where was the question here? I thought there was one over here. Yeah, there we go. Should I? Yep. All right.

So, when I go to social networks, I’m very hyped, right? I see messages like the ones you showed: these LLMs are truly incredible. However, in my day-to-day work, when I try to use AI or LLMs, I’m often disappointed. Who needs more training? Is it the LLM that needs more training, or is it me, asking the wrong way?

It’s an excellent question. I suspect the answer is a bit of both, right? First, the arc of progress in these models has gotten quite steep. The Gemini models from eight months ago are not nearly as good as the Gemini models now. Sometimes people develop an impression of what the models are capable of from a previous experience of asking them to do something complicated, and it failed miserably.

But now that might be something that is on the border of possibility or actually will work really well there. So I think part of it is looking at what the current models can do, not what the ones of ancient history eight months ago can do. Another aspect is becoming familiar with how to coax the models to do what you want. It’s quite interesting that with a one-page carefully crafted prompt you can almost create a completely different application of a general model than if you craft a different one-page prompt.

You know, one one-page prompt might say: can you take this video content and please make me an educational game that reflects the concepts explored in the lecture video? And it will actually, in some cases, create a fully working software-based game that highlights the concepts in an arbitrary lecture or scientific video. It doesn’t always work, but that is kind of at the frontier of possibilities now; it might work 30% of the time or something.

But also, more training for the models will help because then the models are going to get better and I think you’re seeing this from Gemini 1 to 1.5 to 2 to 2.5 a lot of progress and I suspect Gemini 3.0 models and beyond will be substantially better than the current ones. That’s a general trend in the industry; the models are becoming better.

Thank you for your talk. I noticed on your slide where you summarized all of the innovations in AI, you listed hardware, you listed algorithms, you listed all the improvements, but data was absent. There are lots of concerns in the field that data might be the new bottleneck. I’m curious about your personal opinion on this. Is it a bottleneck? And if not, how do people get by? How do we get past scraping all of the internet?

I guess I didn’t list data, but it has been really important. It’s just there’s not like a specific artifact generally to point to in a lot of the data-related work. It’s really about curation of high-quality data that we spend a lot of time on, say within the Gemini project. I think there’s concerns I’ve heard of about running out of high-quality data in order to improve the capabilities of these models.

I find that not very credible at the moment because, first, there’s an awful lot of data we’re not training on right now. If you think about all the video data in the world, we’re training on some video data, but it’s a very tiny fraction of, say, the YouTube corpus. That’s only some of the video in the world. So, I don’t think we’re running close to running out of raw data.

The other thing I would say, as an ML research problem, is that there’s a whole bunch of work we can do to get more quality improvement from the model per unit of training, or per token of training data. We were discussing this in a session earlier: say you have a two-sentence description of how to add numbers together. The model is just trained to absorb that by predicting the next token, but that doesn’t generally mean it has actually learned the algorithm for adding two numbers together in a deep, algorithmic way. It has a next-token predictor for predicting the rule, but in some sense it’s oblivious to the actual algorithm.

If you think about what you would really want the model to be able to do, it would be to read that algorithm and then build a representation internally that enables it to run that algorithm when it needs to. That would be extracting way more value out of those 15 tokens than it currently does. I think there’s lots of room to go.

In the era when image convolutional neural networks were improving, people were training on a million images with a thousand categories, and one of the ways they would make the models more powerful was to make many passes over that training data. The textual data corpus we have is large enough that we’re not able to computationally afford lots and lots of passes over it, but with improving hardware capabilities you might be able to make 50 passes over the data instead of three, and that would probably improve the quality of the models, though we don’t know by how much.

Thanks a lot for the super interesting talk. Where in your personal life or work do you use AI most, and where do you use it least because it doesn’t work yet? What has surprised you on both ends of the capability spectrum, in your work as a researcher and leader of a research lab?

I think where I personally use it, and where many of my colleagues use it, is helping to write some bits of code. I often tend to ask it to do things that are not super complicated. With the more capable models, I should start venturing out, as this gentleman perhaps should too, and raise my expectations of what the model can do.

It will sort of do a reasonable job of writing sort of test cases for code I’ve written or extensions of things that are straightforward. I’ve used it to generate images for various kinds of things. I think I used it for this kind of thing. I use it to summarize papers or I put in a large piece of textual content and ask it questions about that. More and more you’re seeing people integrate the use of these models into things they find that they’re able to do that are useful for them.

I think that’s sort of the general trend in society. Where doesn’t it work? I’ve asked it to do more complicated coding questions and sometimes it works, sometimes it doesn’t. Then you’re like, okay I understand why it didn’t work because that’s pretty complicated and it would have taken me a long time to figure out, so thanks.

Thank you for your presentation; it was super interesting. I was wondering, for upcoming research, what would be the most interesting part to focus on? Is improving transformers for computer vision more important, or AI safety with regard to preventing hallucination in large language models? What is the most important part that you are going to focus on?

I think one of the beauties of this field is that there isn’t just one important problem; there are many important problems. One of the meta things I do when I’m trying to think about research topics is to try to pick something where, if I make progress on it, or we as a collective set of colleagues make progress on it, something important will be advanced. You want to avoid the sort of incremental things where, even if the best possible outcome happens, not much is advanced.

All the areas you mentioned and like 50 other ones besides are really important. Other ones that I’m personally thinking about are: how can we have much more efficient inference hardware? How can you have much larger context windows for these models than a million tokens? How do you identify higher quality data? How do you scale infrastructure? How do you do asynchronous training in a better way in a distributed fashion with low bandwidth between the systems?

How do you get more exotic, sparser model structures than just branching out to a few experts and coming back together, which seems relatively simple compared to truly sparse, interesting model structures? There are like 50 other ideas I could rattle off. You should pick something you’re really excited about and that you think will matter.

One more question. Yeah, one more question. Oh, I don’t know. You pick. How about we get one farther in the back because we have ignored the back? The gentleman in the black t-shirt there, and it’s close enough to throw.

Hi. Thank you very much for the presentation; it was incredible. My question is about the next challenge. I see that these models are gradually getting better and better on all the benchmarks, but is there some sort of binary challenge, some outcome they are not yet able to achieve? I don’t know, formal reasoning, some activity we could call the next breakthrough?

I think one thing that’s not quite a discrete step, but that I think is going to be very hard, relates to what we’re going to want the models to do: operate somewhat autonomously and carry out fairly complicated things you ask of them with relative independence. Can you go off and plan me a visit to Zurich for two days, because I have a couple of extra days and I want to do some fun stuff?

That is a little ambiguous; it might require the model to use some tools to figure out, well, what is Zurich like and what could I do there? What you’re seeing is that the models are capable of breaking down complex things into a few steps, maybe doing some limited amount of tool use, and chaining some things together in order to do those relatively simple tasks. But you’re not seeing models able to take a very complicated thing and break it down into 50 substeps on their own, or use many, many complicated tools to accomplish some major piece of work that might take you two months.

There’s a huge, vast difference between where we are now, where the model can do those three-, four-, or five-step tasks with maybe 60 to 70% accuracy, and being able to do a month of work in a thousand steps with 95% accuracy. That is where people would like to get these systems, but there is a very vast gulf between where we are now and what one imagines would be possible, which is definitely not now.

That’s maybe a sort of continuum rather than a single thing that suddenly now you can do this, but you will see more and more capabilities of the models as they can do 10-step tasks with 90% accuracy as an intermediate point. Thank you very much. Let’s thank Jeff one more time for his talk. [Applause]


This is an experimental rewrite

[Music]

Host: All right, welcome everyone! It’s great to see a full house. I’m thrilled to introduce Jeff Dean, Google’s chief scientist. He joined Google in 1999, where he has played a key role in the development of foundational technologies like MapReduce, Bigtable, Spanner, and more recently, TensorFlow and Pathways.

In 2011, Jeff co-founded the Google Brain team, and since then, his research has focused on AI systems and applications. Today, he’ll be discussing important trends in AI. I should also mention that Jeff has received numerous awards, including the ACM Prize in Computing, the IEEE John von Neumann Medal, and the Mark Weiser Award, and he’s an ACM Fellow, among many other honors. We’re very excited to have you here, Jeff, and we look forward to your talk. So let’s give a warm welcome to Jeff Dean!

Jeff Dean: Thank you so much for that kind introduction. I’m really excited to be here today to talk about significant trends in AI. We’ll cover how we arrived at our current understanding of what AI models can do, what advancements we’ve made, and how we can shape the future of AI. It’s worth noting that this work is the result of collaboration with many talented individuals at Google and beyond.

Okay, let’s dive in. Some observations I’m about to share might be quite familiar to you. Most importantly, machine learning has transformed our expectations of what computers can achieve. If you look back 10 years, computers had very basic capabilities in computer vision, speech recognition wasn’t very accurate, and language models had limited functionality.

Over the past 12 to 14 years, we’ve observed that as we increase the scale of computation used to train models, the amount of data and the size of the models, we generally see better results. It’s almost a truism at this point: bigger models and more data yield improved performance in tasks we care about regarding computer capabilities.

That said, it’s crucial to note that advancements in algorithms and model architectures have also played a significant role. This means it’s not just about scaling up hardware but that algorithmic developments and architectural improvements are often more decisive than hardware enhancements over the past decade. Consequently, the way we think about the computations we want to run on hardware is shifting, moving away from traditional CPU-centric computation.

Jeff Dean: Now, I will take you through a whirlwind review, with one slide per major advancement. I’ll likely need to relaunch Chrome soon, but let’s not pause for that right now.

So let’s jump into this rapid overview of pivotal techniques that shaped modern models—but note that this will be mostly chronological, though not strictly.

A key foundational component from the last century is neural networks. Almost every major advancement you see in machine learning, especially at large scale, stems from neural network-based computation. These networks consist of artificial neurons, loosely modeled on how biological neurons function, though the analogy is not perfectly accurate. There’s still much we do not understand about them, but they represent one of the core building blocks.

Another critical building block is backpropagation, a mechanism to optimize the weights of a neural network. By backpropagating the errors from the model’s output to the desired output, backpropagation provides a powerful way to adjust the weights and minimize errors on training data. Thanks to the generalization capabilities of neural networks, they can also perform well on unseen examples.

These two elements, neural networks and backpropagation, are fundamental to the deep learning revolution. In 2012, some colleagues and I hypothesized that training larger neural networks might yield even better performance than smaller ones. We decided to test this idea by training a particularly large neural network and employing an unsupervised learning algorithm.

We trained a neural network 60 times larger than any known network at that time, leveraging 16,000 CPU cores. Back then, we didn’t have GPUs in our data centers, only CPUs. What we discovered was that by using this unsupervised training objective followed by supervised training, we got a 70% relative improvement in performance on the less commonly used ImageNet 22K category. This category is interesting because it includes 22,000 very fine-grained categories, unlike the 1,000-category version most people are familiar with.

This outcome not only supported our initial hypothesis that larger models could be more capable given sufficient training computation, but also led to the development of our first large-scale neural network infrastructure project, aptly named DistBelief. The name reflects both its distributed nature across many machines and the skepticism from some of our colleagues who doubted it would succeed.

When it comes to training large models that can’t fit on a single machine, there are several ways to parallelize the computations. The first method involves partitioning the model itself, both vertically and horizontally, distributing pieces across different computers while managing communications between the model splits. Another approach is data parallelism, where multiple copies of the same model exist on different machines, possibly combined with model parallelism, where each copy operates on multiple machines.

In our DistBelief project, we used a centralized parameter server to accept gradient updates from different model replicas. This was done asynchronously; each model replica processes a bit of data, computes gradients based on its parameters and training data, and sends them back to the parameter server. The challenge here was that by the time a gradient arrived, the parameters had already changed due to updates from other model replicas, which deviated from the mathematically correct gradient descent algorithm, but it worked nonetheless.
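
Here is a toy sketch of that asynchronous parameter-server pattern, assuming a user-supplied `grad_fn`; it only illustrates the fetch-compute-push loop and the staleness it tolerates, not the actual DistBelief implementation.

```python
import threading
import numpy as np

class ParameterServer:
    """Toy parameter server: replicas fetch parameters, compute gradients on
    their own data shard, and push updates without waiting for each other,
    so the gradients they push may be slightly stale."""
    def __init__(self, dim, lr=0.01):
        self.params = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def fetch(self):
        with self.lock:
            return self.params.copy()

    def push_gradient(self, grad):
        with self.lock:
            # Applied against possibly newer params than the replica saw.
            self.params -= self.lr * grad

def replica_loop(ps, data_shard, grad_fn, steps=100):
    """One model replica: repeatedly fetch, compute, and push."""
    for _ in range(steps):
        params = ps.fetch()                 # may be stale by the time we push
        grad = grad_fn(params, data_shard)  # grad_fn is supplied by the user
        ps.push_gradient(grad)
```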

This setup proved effective and enabled us to scale up to very large models even with CPUs. In 2013, we applied that framework to enhance training dense representations of words through a word embedding model called Word2Vec. This work illustrated how representing a word as a high-dimensional vector could yield two beneficial properties if trained correctly.

One method involves taking the representation of a middle word and predicting nearby words, while another looks at surrounding words to predict the middle one. Both methods yield similar results. By training word embedding vectors in this way, we discovered that words closely situated in this high-dimensional space tended to be semantically related—similar words would cluster together, like “cats,” “pumas,” and “tigers.”

Another intriguing discovery from this approach is that the directional relationships within this space are meaningful. For example, transforming a male-associated word to its female counterpart consistently follows the same directional path, regardless of the specific pairings—such as “king” and “queen” or “man” and “woman.” This reflects that linguistic properties emerge as a result of the training process in the relationships between different points in the space.
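 
A small sketch of how that directional regularity is typically checked, assuming you already have a dictionary `E` of trained embeddings; the words and embeddings here are placeholders, not a specific trained model.

```python
import numpy as np

def most_similar(query_vec, embeddings, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to query_vec."""
    best_word, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = np.dot(query_vec, vec) / (np.linalg.norm(query_vec) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# Given trained embeddings E, the analogy test looks like:
# E["king"] - E["man"] + E["woman"] should land closest to E["queen"]:
# answer = most_similar(E["king"] - E["man"] + E["woman"], E,
#                       exclude={"king", "man", "woman"})
```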

In 2014, my colleagues Ilia Sutskever, Oriol Vinyals, and Quoc Le developed a model called sequence-to-sequence learning with neural networks. The concept is simple: you take an input sequence and aim to predict an output sequence from it. A classic example is translation, where you input an English sentence and use the dense representation built from processing that sentence word by word to then decode it into the French counterpart.

When trained on a substantial number of sentence pairs, like English to French, you create a translation system based purely on this sequence-to-sequence neural network model. Initializing the neural decoder from the state produced by the encoder makes the system effective, and it scales well as you use larger LSTMs and more data.
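
A minimal sketch of the encode-then-decode loop described above; `encoder` and `decoder` are hypothetical objects standing in for the trained LSTMs, and only greedy decoding is shown.

```python
def seq2seq_translate(encoder, decoder, source_tokens, bos, eos, max_len=50):
    """Encode the source sentence into a fixed state, then decode the target
    one token at a time, feeding each prediction back in (greedy decoding)."""
    state = encoder.init_state()
    for tok in source_tokens:          # read the English sentence word by word
        state = encoder.step(tok, state)

    output, prev = [], bos
    for _ in range(max_len):
        prev, state = decoder.step(prev, state)   # predict the next target token
        if prev == eos:
            break
        output.append(prev)
    return output
```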

In 2013, I began to feel the pressure of increasing model sizes as we worked on applications like speech recognition and text generation. I calculated that if speech recognition improved significantly, it could overwhelm our resources, especially if 100 million users started interacting with their devices for approximately three minutes daily.

At that juncture, I estimated that deploying a superior speech model, anticipated to lessen error rates by 40%, would necessitate doubling Google’s computer fleet merely to implement that improvement.

This led me to consult colleagues in our technical infrastructure team who had hardware experience, and together we decided it would be prudent to develop specialized hardware for neural network inference. Thus, the tensor processing unit (TPU) line was born. The first TPU version was designed solely for inference, optimizing for reduced precision and executing 8-bit integer operations. The goal was to create highly efficient hardware for linear algebra operations without needing the intricate features typical of modern CPUs.
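
As a rough illustration of what 8-bit integer inference involves, here is a simple symmetric quantization sketch in numpy; the scaling scheme is the textbook version, not necessarily what the TPU hardware does.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization: map float weights/activations to int8."""
    scale = max(np.max(np.abs(x)) / 127.0, 1e-12)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(a_q, a_scale, b_q, b_scale):
    """Do the multiply-accumulate in integer arithmetic, then rescale to float."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)   # wide accumulator
    return acc.astype(np.float32) * (a_scale * b_scale)
```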

Fast forward, and our latest TPU generation has demonstrated performance up to 15 to 30 times faster compared to conventional CPUs and GPUs in these tasks, with energy efficiency increases ranging from 30 to 80 times. Interestingly, our TPU paper has gained substantial recognition, becoming the most cited in the 50-year history of ISCA since its publication in 2017.

Further, we began contemplating scaling for training, not just inference. This idea evolved into creating machine learning supercomputers with high-speed interconnections among numerous chips, resulting in six generations of TPU pods optimized for both training and inference.

These TPU pods connect thousands of chips; the initial pod housed 256 chips, which grew to 4000 in some of the latest iterations—currently, we’re operating around eight to nine thousand chips, all linked by custom high-speed networks.

Since version four, we’ve incorporated an innovative optical network. You can connect racks of 64 chips in distant locations, functioning seamlessly as if they are adjacent to each other within the data center.

We recently unveiled the latest version, Ironwood, which has abandoned numerical naming for clarity. Ironwood offers a substantial pod size with 9216 chips, each capable of executing 4614 teraflops. In total, this pod achieves 42.5 exaflops using reduced precision floating points. This represents a roughly 3600x increase in computational capacity over the span of seven years.

This incredible boost is thanks to strategic circuit design advancements, optimizing fabrication processes, and lowering precision operations compared to the original TPUv2, allowing for about a 30x improvement in energy efficiency per floating-point operation compared to our initial training pod from 2018.

Moreover, another significant trend is the emergence of open-source tools for machine learning, which have empowered a broader community to both improve and utilize these tools for diverse machine learning challenges. TensorFlow, released in 2015, PyTorch, which debuted in 2016, and Jax—another open-source framework from Google—emerged around 2017 or 2018. Together, these frameworks have propelled the field forward in terms of accessibility and standardization.

In 2017, some colleagues noted that in recurrent models, the sequential process of absorbing one token at a time limited learning efficiency and parallelism. They proposed saving all internal states while developing a mechanism known as attention, which refers back to all previous states.

This influential paper illustrated that, utilizing 10 to 100 times less compute with 10 times smaller models, you could achieve better performance than existing architectures like LSTMs at that time. This breakthrough has enabled nearly all contemporary large language models to adopt transformers as a foundational architecture, often with various enhancements.

While this concept was not entirely new in 2018, it gained traction as the realization emerged that language modeling at scale could leverage self-supervised data. You can use any piece of text to predict other parts, creating vast amounts of training data. This innovation is a major factor in the quality and effectiveness of these language models—more text leads to improved results.

Different training objectives can be employed, one of which is autoregressive training, where the model looks at the prefix of words and predicts the subsequent word. Many of today’s models operate on this principle, creating training examples like, “Zurich is _____.” The model fills in the blank using context.

Another approach involves fill-in-the-blank training, which generates diverse training scenarios from the same text. While both training objectives are valuable, autoregressive methods tend to be more prevalent, particularly in applications such as chatbots, which only have access to past contextual information during interactions.
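
A small sketch of how the two objectives turn the same sentence into training examples; the word-level tokenization and the `<BLANK>` marker are simplifications.

```python
def autoregressive_examples(tokens):
    """Next-token prediction: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def fill_in_the_blank_examples(tokens, blank="<BLANK>"):
    """Masked prediction: hide one token at a time and predict it from both sides."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[:i] + [blank] + tokens[i + 1:]
        examples.append((context, target))
    return examples

# autoregressive_examples(["Zurich", "is", "a", "city"]) yields
# (["Zurich"], "is"), (["Zurich", "is"], "a"), (["Zurich", "is", "a"], "city").
```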

In 2021, my colleagues developed a way to apply transformer models to image tasks, transitioning from the previously dominant convolutional neural networks. They innovatively dissected an image into patches, representing these patches with high-dimensional vectors similar to Word2Vec’s approach with words.

This transformation enables patch representations to be fed into the transformer model, allowing the handling of image data through patch embeddings rather than solely word embeddings. As you will see, when training multimodal models, you can integrate text and images, enabling visual patches to work alongside text patches.
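
A minimal numpy sketch of turning an image into patch embeddings, assuming a learned `projection` matrix; real ViT implementations add position embeddings and a class token, which are omitted here.

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size, projection):
    """Split an image (H, W, C) into non-overlapping patches, flatten each one,
    and project it to the model dimension, like a word embedding for pixels."""
    h, w, c = image.shape
    patches = []
    for i in range(0, h - h % patch_size, patch_size):
        for j in range(0, w - w % patch_size, patch_size):
            patch = image[i:i + patch_size, j:j + patch_size, :].reshape(-1)
            patches.append(patch)
    patches = np.stack(patches)     # (num_patches, patch_size * patch_size * C)
    return patches @ projection     # (num_patches, model_dim)
```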

The attention operation within the transformer remarkably attends to pertinent areas of an image. For instance, when asked about the contents of an image, it can focus on details like an airplane or a dog. However, in the presence of distracting elements, it broadens its attention, scanning the entirety of the image for visual clues that help generate the correct predictions. This pivotal innovation has unified transformer capabilities across textual and visual data.

Another development occurred in 2017 when some colleagues and I created a mechanism for sparse models. These models possess large capacity but only activate a fraction of the model for each token or example. Initially, we used around 48 experts per layer but activated just two at any given time. This architecture allows the model to retain substantial capacity while efficiently utilizing a small subset relevant to the task.

The activation of the appropriate experts is learned end-to-end through backpropagation, enabling the model to manage varied contexts, like handling dates or geographical locations. This method allowed us to achieve an 8x reduction in training compute cost for equivalent accuracy, or significant accuracy gains at the same computational expense. When you see graphs that compare compute budgets to accuracy scores, reading them horizontally shows how much less compute is needed to reach the same accuracy.
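
A toy sketch of top-2 expert routing for a single token; `experts` is a list of hypothetical callables, and real systems add load-balancing losses and batched dispatch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_layer(token, gate_weights, experts, k=2):
    """Route a token to its top-k experts (here 2 of many) and combine their
    outputs weighted by the gate; the other experts are never evaluated."""
    scores = softmax(token @ gate_weights)        # one score per expert
    top_k = np.argsort(scores)[-k:]               # indices of the k best experts
    gate = scores[top_k] / scores[top_k].sum()    # renormalize over the chosen ones
    return sum(g * experts[i](token) for g, i in zip(gate, top_k))
```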

We are continuing to explore sparse models’ potential because we believe it to be a crucial avenue for developing models with substantial capacity while only activating a minimal portion relevant to the current task.

In 2018, we also began rethinking the software abstractions necessary for large-scale distributed machine learning. Our goal was to connect multiple TPU pods together and streamline the training processes. Each small box with yellow dots in our diagram represents a TPU pod; our objective was to facilitate seamless integration among these components.

This distributed system manages communication effectively, ensuring that chips within the same pod can utilize the high-speed TPU network, while those needing to connect across pods within the same building, or even different regions, use appropriate networks for efficient data transfer.

The Pathways framework simplifies this by allowing the machine learning developer or researcher to operate with a single Python process. When using Jax, devices can be abstracted seamlessly. For instance, when using four TPUs in a single machine, they are recognized as a cohesive unit. However, under Pathways with Jax, all devices across the training task appear as a comprehensive array of 10,000 or 20,000 TPU devices.

This capability simplifies computation management, with Pathways automatically mapping operations onto the actual hardware. Just last week, we made the Pathways system, which we’ve utilized internally for six years, available for cloud customers through our cloud TPU offerings.
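
A minimal JAX sketch of the single-Python-process view described above; the mesh shape and batch sizes are illustrative, and this is ordinary JAX sharding rather than the internal Pathways API.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# jax.devices() returns whatever accelerators the runtime exposes; under a
# Pathways-style backend that one list can span many pods, but this code is unchanged.
devices = np.array(jax.devices())
mesh = Mesh(devices.reshape(-1), axis_names=("data",))

# Shard a batch across the "data" axis of the device mesh.
batch = jnp.ones((len(devices) * 8, 128))
sharded_batch = jax.device_put(batch, NamedSharding(mesh, P("data", None)))

# jit-compiled computations on the sharded array run in parallel across devices.
mean_per_feature = jax.jit(lambda x: x.mean(axis=0))(sharded_batch)
```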

Additionally, some colleagues observed that letting the model think longer at inference time can be beneficial. Just as your third-grade math teacher advised you to show your work to increase the likelihood of solving problems correctly, large language models can benefit from a similar approach. For example, consider a problem framed like this: “Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?” The model needs to calculate that the answer is nine.

In contrast, when posed with a new problem, like “John takes care of ten dogs, each requiring thirty minutes a day. How many hours does he spend weekly on this?”, the model initially responded incorrectly. However, if encouraged to show its reasoning, it could follow the pattern of the worked example: “Shawn started with five toys; if he received two from both parents, that totals four additional toys. Therefore, 5 plus 4 equals 9. The answer is nine.”

Jeff Dean: It might seem simple, but this actually greatly enhances the models’ accuracy. Now, they are encouraged to think through the steps to solve problems in a more detailed way. You can see that as the model’s scale improves, the problem-solving rate increases somewhat with standard prompting, but it skyrockets when you use chain-of-thought prompting. This is particularly evident with benchmark tests that cover roughly eighth-grade math problems. So, prompting the model to show its reasoning improves accuracy on reasoning tasks.
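
As a rough illustration, here are two few-shot prompts that differ only in whether the exemplar shows its reasoning; the wording is a paraphrase of the standard examples, not the exact prompts used in the talk.

```python
# Standard few-shot prompt: the exemplar gives only the final answer.
standard_prompt = """Q: Shawn has five toys. For Christmas, he got two toys each
from his mom and dad. How many toys does he have now?
A: 9

Q: John takes care of 10 dogs. Each dog takes 0.5 hours a day to walk.
How many hours a week does he spend on this?
A:"""

# Chain-of-thought prompt: the exemplar shows its work, nudging the model
# to reason step by step before answering the new question.
cot_prompt = """Q: Shawn has five toys. For Christmas, he got two toys each
from his mom and dad. How many toys does he have now?
A: Shawn started with 5 toys. He got 2 toys each from his mom and dad,
which is 4 more toys. 5 + 4 = 9. The answer is 9.

Q: John takes care of 10 dogs. Each dog takes 0.5 hours a day to walk.
How many hours a week does he spend on this?
A:"""
```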

You can also view this as a strategy for utilizing more computational resources during inference, since it requires the model to generate extra tokens to produce the correct answer format. Back in 2014, Geoffrey Hinton, Oriol Vinyals, and I developed a technique known as distillation, which transfers knowledge from one neural network to another, typically a smaller model.

In the classic approach, you’d train a small model using next token prediction. For instance, if the input is “perform the concerto for _____,” the expected word is “violin.” When training your language model with this objective, if it predicts “violin” correctly, that’s great. If it guesses incorrectly, you get a back-propagation error from the training objective. While this method works decently, using the teacher model to offer not just the correct answer but a probability distribution of what constitutes good answers for any given word delivers a richer training signal.

Instead of just receiving a binary signal for “violin,” where it’s correct only once, that distribution—like “violin, 0.4; piano, 0.2; trumpet, 0.01; airplane, unlikely”—provides a far richer gradient signal. This allows you to inject more knowledge into each training example for the smaller model, enabling it to reach convergence more quickly.

As you can see from some comparisons in a speech-based setting, training frame accuracy is important, but what really matters is the test frame accuracy—did the model correctly predict the sound in a frame of audio? The baseline with 100% of the training data achieves 58.9% on test frame accuracy. However, if you reduce the training set to only 3%, the training frame accuracy might actually increase due to overfitting to the very limited examples, but your test frame accuracy would plummet, rendering it ineffective for unseen test cases.

When you implement soft targets generated through the distillation process with just 3% of the training data, you still get decent training frame accuracy and nearly equivalent test frame accuracy. This trait is advantageous because it means you can transfer the knowledge from a large neural network to a smaller one, maintaining nearly the same level of accuracy.
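
A small sketch of a distillation loss that blends the hard label with the teacher's softened distribution; the temperature, weighting, and T-squared scaling follow the usual formulation and are illustrative choices, not the exact recipe from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    e = np.exp(z - np.max(z))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """Blend cross-entropy on the one-hot label with cross-entropy against the
    teacher's softened distribution ("violin 0.4, piano 0.2, ...")."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    p_student = softmax(student_logits)

    soft_loss = -np.sum(p_teacher * np.log(p_student_T + 1e-9)) * (T * T)
    hard_loss = -np.log(p_student[hard_label] + 1e-9)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```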

Interestingly, this approach was initially rejected from NeurIPS 2014, but we published it in a workshop, and it now has 24,000 citations. In 2022, some colleagues and I investigated different strategies for mapping computation onto our TPU pods for efficient inference. There are many variations one could consider, such as whether to keep the weights stationary across various dimensions of the network.

While the details vary, it’s clear that the appropriate choices depend on numerous factors, including batch size, which significantly influences which technique works best. Techniques like weight stationary, weight gathered, and variations of these can greatly affect performance based on batch size.

For instance, at small batch sizes, a 2D weight-gathered approach might be most effective, while at larger batch sizes, a weight-stationary method could work better. This complexity highlights the importance of choosing efficient strategies for model partitioning and inference at scale.

In 2023, some of my colleagues developed a technique known as speculative decoding. This involves utilizing a smaller drafter model—10 to 20 times smaller than the larger model—since many tasks can be effectively predicted by a smaller model. We can promptly predict the next K tokens with the drafter model, and then the larger model makes predictions for K tokens in succession as well.
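
A rough sketch of the draft-then-verify loop, with `drafter` and `verifier` as hypothetical model wrappers; it uses greedy acceptance for clarity, whereas the published method accepts or rejects draft tokens probabilistically.

```python
def speculative_decode(drafter, verifier, prompt, k=4, max_new=64):
    """Draft k tokens with the small model, then have the large model check the
    same positions in one pass; keep the drafted tokens it agrees with and let
    it supply the first token where they diverge."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = drafter.greedy_continuation(tokens, k)   # k cheap guesses
        # One pass of the big model over prompt + draft yields its own
        # prediction at every draft position.
        big = verifier.predictions_at(tokens, draft)     # len(draft) tokens
        n_ok = 0
        for d, v in zip(draft, big):
            if d != v:
                break
            n_ok += 1
        tokens += draft[:n_ok]
        if n_ok < len(draft):
            tokens.append(big[n_ok])   # take the big model's token at the divergence
    return tokens
```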

By doing this, you’ve amortized the memory overhead of loading the model weights, allowing for K predictions instead of just one. Many developments have combined to significantly enhance model quality in recent times. We’ve seen progress in better accelerator hardware, notably with TPUs and Nvidia GPUs optimizing for machine learning applications.

Software abstractions play a crucial role too, as they allow easier building of useful applications without needing to delve too deeply into underlying details. Model architectures, particularly transformers and visual transformers, are now integral to modern models. There have also been significant advancements in training algorithms: unsupervised and self-supervised learning, distillation, supervised fine-tuning, and reinforcement learning.

Next, I’ll discuss the Gemini models we’ve been training and how many of these innovations are reflected in various iterations. Gemini represents a collaborative effort across Google DeepMind, Google Research, and the broader Google team, which we started in February 2023. Our objective is to create the best multimodal models in the world to integrate across various Google products.

Here’s a timeline of our progress since February 2023, culminating in the December release of Gemini 1.0, followed swiftly by Gemini 1.5. From the outset, we aimed to make these models multimodal, recognizing that models limited solely to text would not be as beneficial as those capable of understanding and generating language, audio, visual inputs, and more.

Initially, the model could process audio, video, images, and text as input, producing images and text as outputs, and we later added audio output capabilities. Gemini 1.5 introduced an extended context length, enabling input of up to millions of tokens.

To illustrate, imagine processing a thousand-page document; that translates to roughly a million tokens. This allows the model to handle multiple long research papers or entire books within the context window, with the attention mechanism keeping all of that information directly accessible to the model.

In Gemini 2.0, we build on numerous innovations. We leverage TPUs, utilize cross-data-center training, apply Pathways and Jax, focus on distributed representations for words and image data, and integrate sparse mixtures of experts alongside distillation techniques.

Just a month ago, we released Gemini 2.5 Pro, which has received positive feedback due to significant improvements across various benchmarks, especially in coding tasks compared to earlier Gemini models. The model evaluation landscape incorporates user feedback through platforms like LMArena, which lets users compare outputs anonymously and register preferences; this provides valuable insight into model strengths.

This evaluation method aligns well with independent assessments across the web and academic benchmarks. Currently, we’re only in fourth place on the New York Times Connections benchmark, which indicates areas needing improvement. Nonetheless, our goal is to deliver general-purpose models effective across a wide array of tasks, including coding and reasoning, enhancing user experience.

Providing a million or two million tokens of context enables the embedding of large codebases entirely within the context window. The model can then be tasked with complex operations, such as refactoring or introducing new features. One user was able to take a dataset of a thousand poems—230,000 tokens—and ask the model to perform reasoning tasks over them, yielding impressive results.

An important metric we focus on is the Elo score from LMArena. A higher Elo score indicates a more capable and higher-quality model from users’ perspectives. The comparison includes various commercial models, with the x-axis displayed on a logarithmic scale, highlighting how much performance matters along the right-hand side.

We offer a variety of models that cater to different quality and cost trade-offs. Our flash models are cost-effective, priced at around 15 cents per million tokens. The newer 2.5 Pro model is more expensive due to its increased complexity, but still reasonably priced given the quality it provides.

Ultimately, our goal is to keep progressing towards the upper-right corner of the quality-cost trade-off in our model offerings. The Gemini initiative remains a large-scale project, with contributions from numerous authors. Structuring such broad efforts requires delineation of roles across areas like pre-training, safety, values, and more, coordinating smoothly to enhance our model capabilities.

We rely on effective communication, using platforms such as Google Chat to facilitate ongoing collaboration across different regions. Despite time zone challenges, the global team structure has advantages, with team members always available to monitor large-scale training runs and contribute insights based on their work while others rest.

In addition to structured discussions and feedback via Google Docs, we maintain common baselines and leaderboards to fuel data-driven decisions about model improvement. Experimentation at varying scales is crucial, moving successful small-scale trials into larger scale evaluations to test trends.

We monitor for silent data corruption during training, aware that hardware errors can emerge in our ML systems, potentially affecting overall computations. Monitoring gradient norms helps us identify anomalies—if a problematic gradient emerges, we can rewind and replay computations to check for data issues versus hardware errors.

Let me share some examples of what these models are capable of. They can assist in fixing bugs in codebases effectively, as seen when one user uploaded their entire repository, allowing the model to pinpoint urgent issues.

In-context learning is another fascinating aspect. For instance, there’s a language called Kalamang, spoken by a mere 200 individuals globally. One researcher wrote a PhD thesis on its grammar, but no internet training data exists for it. Interestingly, when this thesis is provided as input, the model can achieve translation accuracy comparable to a novice human language learner, thanks to the understanding fostered by the grammar and dictionary provided.

Speaker 1: With that video of a bookshelf to JSON, it’s kind of fun. You might not have thought of that as an input method, but you can do that. It’s actually quite useful.

Speaker 1: When it comes to video understanding and summarization, you can input fairly long videos—about a million tokens translates to roughly two hours of video. The prompt I would use is in a table: “Please write the sport, the team, the athletes involved, the year, and a short description of why each of these moments in sports is so iconic.” The model gets to see both the pixels of the video and the audio track.

Speaker 1: For instance, consider an 11-minute video that the model analyzes. The output is structured data extraction, which might be more text extraction than what you initially thought you could achieve from in-context video. I think many people still aren’t fully aware of the interesting possibilities of taking multimodal data like this.

Speaker 1: Let’s talk about the digitization of historical data. You can take weather data from 100 years ago and simply ask, “Please give it to me in JSON.” The model can handle that. They did it for 144 tables, and it only cost them 10 p. Now they’re able to unlock all this historical weather data.

Speaker 1: Now, regarding code generation via high-level languages, here’s the prompt we’re going to give our Gemini 2.5 model: “P5JS to explore a Mandelbrot set.” Oh, wait! I can’t do that right now. I’m so sad. It was working before, but oh, I’m not on Wi-Fi. That’s true. Anyway, it generates a really nice interactive visual Mandelbrot explorer when connected.

Speaker 1: Now that we have these models, what will it all mean for us in society? This raises a really important set of topics. I, along with eight other co-authors, recently wrote a paper titled “Shaping AI’s Impact on Billions of Lives.” We are a group of computer scientists and machine learning experts from academia, big tech, and startups, and we wanted to explore the potential impact of AI on the world through directed research and policy efforts.

Speaker 1: Many people in this field are contemplating what will happen with AI if we take a laissez-faire approach. Will we all be doomed, or will we see incredible advances? A pragmatic approach would be to collaborate as a society—machine learning researchers, practitioners, and experts—to shape the future, maximizing the benefits of AI while minimizing the downsides.

Speaker 1: This paper is intended to be a collective discussion on how we might achieve that. We interviewed 24 different experts in seven fields: employment, education, healthcare, information, and media. Noteworthy individuals included former President Barack Obama, Sal Khan in education, and John Jumper, who later won a Nobel Prize. We uncovered five guidelines for AI for public good.

Speaker 1: I won’t delve further into the paper, but you can visit shapingai.com, where there’s an archive paper that nicely discusses potential impacts in various areas, including employment, education, and healthcare. It’s critical that we all collaborate to get this right.

Speaker 1: To conclude, we also proposed some important milestones for research in these areas. These models are becoming increasingly powerful and useful tools. As more investments pour in and more researchers join the field, you’ll see continuous improvements, leading to even more capable models.

Speaker 1: This progress will have a dramatic impact in numerous fields, potentially making deep expertise widely available. That wide availability of expertise is both exciting and a bit concerning to some people, so it has to be done well. I genuinely believe our AI-assisted future looks bright.

Audience: [Applause]

Host: Thank you very much for the great talk! We have a little token of appreciation from the department: some chocolates and a systems group t-shirt.

Jeff Dean: I love coming to Switzerland because I get chocolate and a t-shirt. Thank you so much.

Host: Now, let’s proceed to the Q&A session. We have a mic and a tossing cube for questions. We’ll prioritize students for asking questions, so please raise your hands if you have one and point in a general direction.

Host: My throwing aim might not be great, but let’s try! Ah, well done! [Applause]

Audience Member 1: Hi! Thank you for your presentation, especially for discussing that last paper. AI safety is definitely at the forefront of our minds, but it seems unclear from an outsider’s perspective—especially for big research labs—what would be considered positive and impactful. If you were a PhD student starting a thesis, a professor with grant money, or if you could acquire a startup this year, what would you focus on in AI safety?

Speaker 1: That’s an excellent question. AI safety is quite broad. There are concerns about the increasing capabilities of these models enabling people to engage in nefarious actions that would be undesirable from a societal viewpoint. While some of these issues can be addressed technically, policy-based and regulatory measures will also be essential.

Speaker 1: One topic we explored in the paper was misinformation and public discourse. AI models can generate increasingly realistic misinformation and allow mass production of it more cheaply. While misinformation isn’t new, these tools make it easier to create quickly and effectively.

Speaker 1: There’s also an interesting research question about how to detect misinformation produced by AI. We suggested that AI can actually enable more constructive discourse in online forums. Looking at how AI can promote positive conversations and identify misinformation in discussions is intriguing and worth studying.

Audience Member 2: Thank you! I’ll pass the cube to the next person.

Audience Member 3: Currently, when I visit social networks, I feel hyped by claims about LLMs being incredible. But in my daily work when I try to use AI or LLMs, I’m often disappointed. Who needs more training? Is it me, or is the LLM just not trained well enough?

Speaker 1: That’s a great question! I suspect the answer is a bit of both. The progress in these models has been steep. The Gemini models from eight months ago can’t compete with today’s versions. Sometimes users form opinions based on their past experiences with older models, which might have failed.

Speaker 1: It’s important to remember that the current models may excel at tasks that previously seemed impossible. Additionally, becoming familiar with how to effectively prompt the models is crucial. A thoughtfully crafted prompt can lead to significantly different outcomes.

Speaker 1: For example, a one-page prompt might ask, “Can you take this video content and create an educational game that reflects the concepts explored?” In some cases, it will generate a fully functional game based on the lecture’s materials. It doesn’t always work, but it’s on the frontier of what’s possible; it might succeed around 30% of the time.

Speaker 1: More training for the models will also contribute to improvement. You’re noticing substantial advancements from Gemini 1 to 1.5, 2, and now to 2.5. I expect Gemini 3.0 and beyond will be even better. This trend in the industry shows continual improvements in models.

Audience Member 4: Thank you for your talk! On your slide summarizing innovations in AI, you listed hardware and algorithmic improvements, but data was absent. There are concerns that data might become the new bottleneck. What's your take on this?

Speaker 1: I should have mentioned data; it is indeed crucial. Much of the data work doesn't produce a specific artifact you can point to. Instead, it's about curating high-quality data, which we focus on in the Gemini project.

Speaker 1: Although some worries exist about running out of high-quality data for improving model capabilities, I find such concerns hard to justify. There is an immense volume of data we are not utilizing. For instance, while we’ve trained on certain video data, it represents a tiny portion of the overall YouTube corpus and far less than the total video data available.

Speaker 1: As a machine learning research problem, there’s also substantial work left to improve the quality obtainable from each training token. For instance, if a model learns from just a two-sentence description of how to add numbers, it may not genuinely grasp the underlying algorithm.

Speaker 1: Ideally, a model would be capable of reading and developing an internal representation that allows it to execute an algorithm when required, thus extracting more value from the training data.

Speaker 1: Consider the era of improving convolutional neural networks, where researchers trained on a million images across a thousand categories. They’d often bolster model power by making multiple passes over the training set. While we have a large corpus of textual data, our computational limitations have prevented repeated passes. However, with advancing hardware, making additional passes could yield significant improvements in model quality, though the exact impact remains uncertain.

Audience Member 5: Thank you for your engaging presentation! I’m curious: where in your personal or professional life do you find AI most useful, and where does it fall short? Are there any surprises on both ends?

Speaker 1: Personally, I use AI for tasks like coding assistance. I often have it handle relatively straightforward requests. As the models become more capable, I should explore more uses that challenge what they can do.

Speaker 1: The models generally do a decent job of generating test cases for the code I’ve written or extending straightforward code. I also utilize it for generating images or summarizing papers. It’s fascinating to see how these models have become integrated into tasks that genuinely help.

Speaker 1: On the flip side, when I request complex coding solutions, the outcomes can vary widely. I understand why they sometimes fail; really complicated requests can be challenging for anyone.

Audience Member 6: Thank you for the super interesting talk! For your upcoming research, what area do you find most intriguing? Is it enhancing transformers for computer vision, or focusing on AI safety to prevent hallucinations in large language models?

Speaker 1: The field is beautiful in that it encompasses many significant challenges. My approach to selecting research topics is to focus on those where progress will yield substantial advancements.

Speaker 1: The areas you mentioned, plus many others, are critical. I’m personally interested in topics like creating more efficient inference hardware, developing larger context windows, identifying higher-quality data, scaling infrastructure, and enhancing asynchronous training in distributed networks.

Speaker 1: Also, exploring more exotic, sparser model structures could lead to groundbreaking advances. There are numerous ideas worth pursuing, and I encourage you to choose a topic that excites you and holds the potential for real impact.

Audience Member 7: One more question, please!

Speaker 1: Sure! Let’s pick someone from further back—we’ve neglected that area.

Audience Member 8: Hi! Thank you very much for the incredible presentation! I’d like to know what the next challenge is. These models are improving steadily across benchmarks, but is there a specific outcome they still struggle with? Perhaps formal reasoning or some other breakthrough activity?

Speaker 1: Great question! While it’s not precisely a discrete challenge, one significant hurdle is the need for models to operate autonomously in a more complex manner. We want them to undertake relatively complicated tasks with a good amount of independence.

Speaker 1: For instance, could the model plan a two-day visit to Zurich, suggesting activities based on what it learns about the city? That's a task riddled with ambiguity, requiring tools to gather information about Zurich and assemble potential plans.

Speaker 1: Right now, models can handle simpler tasks—breaking down complex tasks into a few steps with some limited tool use—but they struggle when faced with intricate challenges that involve many elements to process over time.

Speaker 1: There’s a vast gap between current capabilities, like the ability to manage three to five steps with around 60-70% accuracy, versus effectively managing a hundred tasks over a lengthy period with high reliability. Bridging that gap is a major goal going forward.

Speaker 1: So while there isn't one singular breakthrough, we'll undoubtedly witness gradual improvements, enabling models to handle longer multi-step tasks with increasing accuracy along the way.

Host: Thank you very much! Let's give another round of applause for Jeff and his talk. [Applause]

399 The Electrical Revolution, World Wars, and Energy Crises: Zhou Xiaokang on a Century of Bosch's Ups and Downs

2025-04-22 08:00:01

399 The Electrical Revolution, World Wars, and Energy Crises: Zhou Xiaokang on a Century of Bosch's Ups and Downs

Today we're going to talk about German industry as it rose out of the Second Industrial Revolution, and in particular a company like Bosch, which in many ways embodies one path of technological growth running from the nineteenth century to the present. Our listeners will also be familiar with Kang's background in energy research.

In fact, before moving into the energy industry, Kang spent a long time doing corporate strategy consulting, so today's topic is really back on your home turf.

Yes, I've been following companies and technologies for many, many years. Before I went into energy and commodities trading, I spent a long stretch in strategy and management consulting. During that period I was fascinated by all kinds of corporate case studies, especially the failures, which tend to draw the most attention.

But companies that have been through a century, or even several centuries, have gone through enormous ups and downs along the way. It's easy for us to say that crisis contains opportunity and opportunity contains crisis, but for them it is a very, very thick book. The book is an abstraction, of course; written out as corporate history it would run to many volumes.

Bosch, which we're discussing today, is a topic I find especially interesting. Why? Because it stacks several things at once. First, it's a German company. We know the first wave of the Industrial Revolution was launched by Britain; the second wave was driven by Germany and the United States.

Why? Because the Second Industrial Revolution was essentially about the prime mover: the shift from the steam engine to the electric motor and the internal combustion engine. So it was really a double revolution, two sides of one coin, and it brought enormous change, including the main means of transport we use today, which emerged during that revolution. First came the automobile, then the airplane, and both are closely tied to today's subject, Bosch.

And as everyone knows, I love studying energy; it's my home field, and I can hardly get through three sentences without mentioning oil and gas. Talking about internal combustion engines means talking about oil, so of course I very much want to walk through this history rather than only discussing today's transition between old and new energy. We are indeed seeing Chinese new-energy vehicles take very significant market share around the world, and in particular making inroads into Europe.

We often ask whether old Europe's carmakers are still up to it. From today's discussion of Bosch you'll see how they have historically responded to wave after wave of challenges. They have waded through ditch after ditch; they're well used to storms and heavy seas.

Since we're talking about this company, let's start with what its main businesses and products actually are. As I said, I'm a holdout for traditional energy, quite conservative in some respects; my own car is a conventional gasoline car. When you take a gasoline car in for servicing, there are minor services and major services, and quite a few parts may get swapped out.

You'll find that many of the parts that come off are Bosch-branded. Take one of the most important components on an internal combustion engine, the spark plug: it is one of the key products Bosch was built on. Bosch even named its in-house company magazine after the spark plug, which tells you how important that one component has been in Bosch's history.

So although the company doesn't build complete vehicles itself, it is deeply involved in the auto industry, including parts shared by gasoline and new-energy vehicles, such as windshield wipers, and things you can't see but that matter enormously for driving safety. The "intelligence" we now talk about in cars has already gone through several waves of development, and it still depends on key chips.

Today almost no car lacks ABS, the anti-lock braking system, and almost none lacks ESP, the electronic stability program, right? Bosch was an early pioneer of both. For the automakers, the focus is design and production, or more precisely final assembly, plus marketing, after-sales service and so on; but upstream, these major component giants work hand in hand with the automakers to keep pushing the car forward, so that an industry more than a century old can keep making our lives, and our mobility, easier and more pleasant in the twenty-first century.

The automobile may have many fathers, because it was developed in many places, so each country has its own "father of the automobile." But it is proudly said that the car has only one hometown, and that hometown is Stuttgart. To a large extent, Bosch rose together with Benz.

Benz occupies a very prominent place in automotive history: Karl Benz himself, and not only him but others whose names we associate with it, such as Gottlieb Daimler and Wilhelm Maybach. Bosch was not one of those founders of the Benz companies, but it certainly has its own page; the histories of these two companies, and of the other key firms of the German auto industry, were really written together.

Together they all flow into the history of the automobile industry. So how should we understand the relationship between component suppliers and vehicle manufacturers? Suppliers play an absolutely critical role in the automotive supply chain. Sometimes the core technology is held not by the automaker but by the supplier, and a healthy interaction between the two can produce excellent results.

Take a classic example. Early on, Bosch technologies like ABS and ESP appeared on Mercedes's higher-end models; when they had just been introduced the cost was high, and they represented premium equipment for premium cars. But over time they had to trickle down, which required these key components to come down to a reasonable cost.

At the same time, an automaker has to design its own product line carefully; it can't allow too many cases where a lower-tier model upstages a higher-tier one. How a traditional carmaker manages that trickle-down requires cooperation with its suppliers. Conversely, when an automaker runs into trouble, that can open up huge commercial opportunities for the supplier.

For example, in the late 1990s Mercedes tried to enter the compact-car market with a model we all know today, the A-Class. In 1997 it rolled over during road testing. To deal with the rollover and the vehicle's center-of-gravity problem, Mercedes ultimately had to halt A-Class sales and announced that ESP would be fitted to every car.

That obviously created enormous room for Bosch: in one stroke, a feature that had sat at the very top of the range was brought down to an entry-level model. It's a case where a crisis facing the automaker opened up huge space for the supplier. The typical pattern is that once a feature becomes the de facto industry standard, the component makers enjoy their sweet spot.

So suppliers often win through long accumulation followed by a sudden breakthrough. A technology becomes an item on the spec sheet only after a long cycle of R&D, trials and rollout, and in the end it is the strategic partnership between automakers and suppliers that makes the car market so rich and varied.

Since you've mentioned Germany as the hometown of the automobile, and Stuttgart in particular as the shared hometown of the twins Mercedes and Bosch: I remember that many episodes ago we also talked about the peculiar relationship between German industry and its cities, one industrial town after another, each city with its own signature brand. That seems to be something distinctively German.

Right. We've had Lu Dapeng on the show; he wrote 《德意志贵族》, a book on the German aristocracy. Remember that there were many free cities and many noble territories; for a long time German history and geography were highly fragmented, so small regions developed their own distinctive character.

Another important reference is our friend Mr. Gao, who has talked on this show about the nineteenth century. Bosch's rise came in the second half of that century, as the economy of the Rhine region in western Germany took off and industries began to cluster and upgrade. What fed into that? In 1871 the Second German Empire was founded, and once internal trade barriers were broken down, economic development received a further push.

Of course, that was still some fifteen years before Bosch itself was founded. For many people, the most intuitive sense of Germany's regional character actually comes from Bundesliga broadcasts. We have Bayer, which we call "the pharma works," right? And once the talk turns to cars, the first place mentioned is Wolfsburg, Volkswagen's headquarters, and Munich goes without saying, it's tied to BMW.

Then there's the famous MAN in commercial vehicles. This whole contiguous region is actually not far from Stuttgart. Historically the area was known as the Duchy of Swabia; today the administrative region is Baden-Württemberg, and its largest city and capital is Stuttgart.

In fact Bosch's founder, Robert Bosch, was not originally from Stuttgart. Both sides of his family, paternal and maternal, ran inns, and his hometown was near a place called Ulm. Today Ulm doesn't strike us as an especially important German city, but it was once counted the second-largest city of the Holy Roman Empire.

Swabia lies close to the Black Forest and sits relatively near the upper reaches of two rivers, Europe's two mother rivers, the Rhine and the Danube. That made the trade routes through the area very dense, so innkeeping was good business and the family accumulated some wealth. Gradually, in the nineteenth century, they began shifting from business along the trade routes toward the towns, reflecting the further development of German cities.

When it came to the next generation, Robert Bosch did not carry on the trade his ancestors had pursued for generations. He made a very important turn: he trained as a technician and learned a craft, moving from the innkeeping service trade into skilled manual work.

That reflects the take-off of the German economy in the mid-nineteenth century. The take-off did not begin with the kind of central planning we are familiar with, with someone assigning each place what to do; it grew out of each region's traditional locational advantages.

The industry that grew up in Swabia, and Stuttgart in particular, was nothing like the Rhineland, or a place we know very well, Essen, home of Krupp. That was heavy industry on a grand scale, whereas industry here was much smaller. In that situation, if you need a prime mover, what do you choose? Something with relatively modest size and power output but much greater flexibility in where it can be deployed.

Even today Stuttgart is a city of only a few hundred thousand people, and yet it is headquarters to top global car companies. You can imagine that in its earlier industrial development it never produced those gigantic integrated combines. So what did emerge here? Thinking about a new generation of power, a revolution that ultimately produced the internal combustion engine.

Today the typical internal combustion engines are gasoline and diesel engines, but the earliest "gasoline" engines did not actually burn gasoline; they burned coal gas. For a while these engines were rather big and clumsy, not closely tied to the automobile at all, but installed at fixed sites to supply power to a fairly small local area.

So what was the crux for the gasoline engine? Scientifically, it was at that time an offspring of thermodynamics; technically, for the engine to run its reciprocating cycle, you need ignition. And so a crucial component appeared inside it: the spark plug.

The spark plug, and ignition generally, was one of the two main businesses on which Bosch was founded. Why do I say two? Because in the earliest days Bosch had another line of work. Young European aristocrats, the leisured class, liked to take a grand tour when they came of age, and Robert Bosch took the family's money, bought a ship ticket, and made his grand tour to the American continent.

That went a bit beyond the traditional grand tour. Right: instead of touring the Mediterranean sights of Greece and Rome, he went in search of industrial technology. In America he had the chance to work at Menlo Park, in Edison's famous laboratory, and so he experienced the fruits of early electrification first-hand.

When he returned to Germany, there was a business right in front of him: wiring up households and installing the earliest electrical devices, mainly light bulbs. Was he the only one doing it? Not at all; plenty of people were in the same trade. Einstein's family famously did the same kind of work as the Bosch firm, installing street lighting street by street.

Evidently the Einsteins didn't do so well at it, and Einstein had no wish to take over the family business; he went off to do science instead. The Bosch firm did all right. It did face competition from a very large German company, AEG, but Bosch used its regional moat to keep the company going, and the wiring business sustained it until the new line of business emerged.

That new business was ignition. A company approached him at the time, one that still exists today, a German engine maker that is now a leader in diesel engines; back then it had begun applying the gasoline engine after Otto's invention.

It put a request to Robert Bosch: can you build something like this for us? The technology was still at a very early stage; basically, whoever had the nerve to try would be allowed to try, and if you could make it work, they would give it a trial. That was the exploratory phase.

The device turned out to work well, and right there in Stuttgart the era-changing figures we mentioned, Karl Benz, Gottlieb Daimler, Wilhelm Maybach, were starting to build their automobiles. What a car needs is a high-revving, compact gasoline engine, which places much higher demands on ignition. A stationary engine can drag along a battery, but cars of the day had very limited power, so hauling around a big, crude storage battery was not realistic.

Could ignition be supplied mechanically and electromagnetically instead? That was the technical challenge of the day, and fortunately Mr. Bosch developed the technology. The pace of iteration was remarkably fast: it started with low-voltage ignition, and after a few years of iteration, by the beginning of the twentieth century, high-voltage spark-plug ignition had been invented.

That dovetailed perfectly with the rapid growth of the auto industry. In museums today you can still see vintage cars, and hardly any two of them are alike; each was the handcrafted product of a workshop.

What drove the car forward after that was, again, the leisured class treating the automobile as a plaything, and, crucially, motorsport. In the early twentieth century racing hugely advanced automotive performance; you could say it was one of the most important driving forces of the era. To race, you need a more powerful engine.

One cylinder isn't enough, so you add several, and you want the engine to rev very fast. That dramatically raised the demands on the spark plug. At the same time, the arrival of multi-cylinder gasoline engines greatly expanded the market: where there had been one cylinder there were now several, so sales multiplied several times over, and Bosch's business took off like wildfire.

Then, in the first half of the twentieth century, country after country gradually adopted American-style mass production, turning the car into a commodity that ordinary people could buy. At that point you only had to count the cylinders to know that Bosch's business was bound to grow fast.

So Bosch also caught the wave of large-scale car production in Germany, or more accurately across Europe, because its products already had a strong quality advantage. It wasn't distributed only in Germany; it was sold across the ocean and in other countries as well.

Substitute products did appear in other countries, but throughout, Bosch's products retained a quality edge, even though they were more expensive, and that laid a solid foundation for accumulating profits. In this wave, Bosch was not a particularly old firm; standing at the turn of the century, it was only about ten years old.

Right? It quickly rode the crest of the wave and became a globalized enterprise. In the earliest years of the twentieth century, and more precisely in that happy era before the First World War, people were extremely optimistic about global economic exchange and trade.

By then Bosch had already stepped into the tide of globalization and taken the shape of an early global company. Yes, but in the blink of an eye the challenge arrived, and the biggest challenge was the First World War. The war can be seen as a process of de-globalization, one that continued right through the interwar period.

Today everyone's biggest worry is trade wars. In the 1930s, the world's trade wars were arguably even more extreme than today's: besides tariff barriers there were plenty of non-tariff barriers. But none of that was anywhere near as bad as the First World War.

Right, a company might simply lose its dedicated plants. So what happened to Bosch? During the First World War its businesses in the United States were confiscated as enemy property. That was not uncommon. Today there are two companies with essentially the same name and different suffixes: the American MSD (Merck Sharp & Dohme) and the German Merck. MSD was originally Merck's American business; they share the same ancestor. Many German firms were seized as enemy property in the First World War. And later there was the soft drink Fanta, right? That came out of Coca-Cola's German company.

The North American market Bosch had worked so hard to open up, and remember that was the world's largest car market, was simply gone because of the war. At home there were plenty of challenges too. Although the fighting of the First World War did not take place on German soil, Bosch still faced nationalist sentiment in other countries; after the war, entering other European markets became very difficult. And the German economy was in poor shape for a long time afterwards: in the Weimar era, a mass of economic problems made the 1920s an extremely turbulent decade.

For Bosch, an important transformation was to diversify: if a business makes money, do a few more of them. So it developed many consumer products. Some of the Bosch appliances we know today were in fact launched in the 1920s to cope with the postwar slump. Within a short time it brought out products including hair dryers for beauty and hair care. A company whose automotive products probably appealed only to men became one that women welcomed too.

Still, the field Bosch was built on was the internal combustion engine, and another important track it chose at the time was the diesel engine. Diesel engines existed by the end of the nineteenth century and were already widely used by the early twentieth, but for the diesel to become a mature, reliable engine type, some key components still needed their own revolution, and Bosch poured R&D and innovation into that. By 1926-27 it had produced a very good diesel injection pump. Gasoline and diesel engines work in completely different ways: the diesel, named after Diesel, uses compression ignition. That is, it has no spark plug; the combustible mixture is ignited in the cylinder directly by compression.

So the fuel has to be injected into the engine's cylinder at very high pressure. That component can be regarded as a key part of the diesel engine, and Bosch innovated and brought the device to market. This track has benefited Bosch right up to the present, and it helped the company survive the German economic doldrums of the 1920s.

At the end of the 1920s it also seized the moment and made good use of the Great Depression. There is a famous scene of Germans in Berlin buying shares on the New York stock market, only to run into the great crash of 1929, when the stocks collapsed. On the surface those speculators ended badly, but there were also successful bets. Bosch used the opportunity to buy back its American holdings cheaply and, through a chain of arrangements, placed them under subsidiaries in neutral countries to conceal its direct ownership, having been burned once in the First World War. Of course, they were traced and seized again in the Second World War, so its fate was troubled either way. Running a business in North America was anything but easy.

If you look at the first half of the twentieth century, Germany's fate as a country was extremely complicated, and for German companies even more so. In such an extreme environment, how do you see that period, say the events of the 1930s and 1940s, in terms of what they meant for Bosch as an industrial enterprise, or the intellectual legacy they left it afterwards?

Here we have to dig down to the roots and look closely at where this company came from and what its DNA is, and where your bottom line lies in an extreme era. What we see is that much of our picture of Germans, or of German companies, is quite stereotyped. Our image leans heavily toward a highly Prussianized Germany, but Prussia does not stand for all of Germany.

As we said earlier, Germany was highly decentralized. The southwest was relatively prosperous, with modern industry that grew gradually out of urban craft trades, completely different from Prussia, which rested on its Junker aristocracy. As for the company itself, it benefited from the broader economic and social development of Germany in the second half of the nineteenth century. And what was one of the most important events of that century? The revolutions of 1848. They brought a progressive, social-reform orientation, a sensitivity to issues like the labor movement and the distribution of wealth, especially in places like Stuttgart, which in 1848 was one of the main strongholds of the democratic and social movements in that part of Germany.

Robert Bosch himself was involved in these currents early on, so his understanding of war, as it played out in his business practice, was rather distinctive. On one hand, he recognized that war was highly damaging to the company's global business; it essentially slammed the door shut. On the other hand, as a German he was swept along by the times. Robert Bosch died in 1942, when the Second World War was already under way. During the First World War, unlike later, Bosch essentially did not build its fortunes on arms production; even in a market where profits were hard to come by, Robert Bosch put what he earned toward social stability. With public funds limited, he made donations for infrastructure and set up dedicated foundations, all during the First World War.

Then, once the Nazis came to power, much of what followed was done out of necessity. Here I should mention Mr. Lu again: Lu Dapeng translated a book on corruption and anti-corruption under the Nazis, 《纳粹的腐败与反腐败》. If you insisted on seeing everything in black and white, you probably could not survive; in that process you inevitably had some dealings with them. But you'll find that at first the Nazis did not like a company like Bosch. Why? Because its overall leanings were relatively left-wing, and, as we said, it did not throw itself directly into the militarist rearmament drive; it was involved on the procurement side, but still at a level that was commercially defensible. Nor was it a major Nazi backer; the Nazis in fact rather disliked it, because when asked for money it refused.

Moreover, it did its utmost to protect the Jewish employees in the company, which meant that after the war it was not singled out for reckoning, and that was an important factor in its rapid postwar revival.

You've just described their wartime conduct in general terms, but I think our listeners are still very curious about what Bosch actually worked on when it served Germany's war production in the Second World War. A simple example: the internal combustion engine is used in cars and also in aircraft. Aviation in the 1920s and 1930s was fiercely competitive, especially air racing, and Germany produced an outstanding fighter, the Bf 109, which used a Daimler-Benz engine. It was a fighter powered by an inverted-V engine, with high-voltage spark plugs and fuel injection, so in the air war with Britain, whose fighters used carburetors, the British aircraft could be less reliable than the Messerschmitt in certain maneuvers.

In addition, during the Second World War Bosch moved into military electronics. Back in the 1920s people had wanted some entertainment in their cars, and the earliest form of that was the car radio. As electronics advanced, Bosch found applications for it and expanded through acquisitions. During the war it carried out R&D in this area as well, laying the groundwork for broader lines of business afterwards. So we cannot avoid mentioning Bosch's record in this period.

After the Second World War, did Bosch's development differ markedly from the prewar era? I think the change was enormous, for a simple reason: this year marks the 80th anniversary of the war's end in 1945, which means that with 1945 as the dividing line, Bosch's earlier history is shorter than its later history. The earlier period had innovative products and the explosive, from-nothing growth of the auto industry, so building the business, at home and then globally, was relatively easy; call it a phase of natural, explosive development. After 1945, though, the environment became more complex than before the war, with randomness layered on top of economic, political and social cycles, forcing the company to confront challenges, even crises, at any moment.

Bosch's postwar development is a textbook case of continually spotting problems and intervening in crises, constantly adjusting business strategy, management and R&D. At any step, a failure to adjust in time could have had an enormous impact on the company's development, even its survival.

Also, before the Second World War most countries had only the first shoots of an auto industry, with a few in the lead, such as Germany itself and the United States. Bosch was, you could say, born international: its early products had almost no competitors and could sweep the field on quality and distinctiveness. So before the First World War it entered those markets easily and rode a wave of globalization.

But after the First World War, because Germany had been a belligerent, its assets were confiscated as enemy property. Although it took advantage of a commercial opportunity after 1929 to buy its North American business back, it lost it again soon after the Second World War began. So after 1945, when it tried to re-enter those markets, especially the developed ones, it met strong resistance: those countries had built up their own auto industries, with their own giants and supply chains, and getting back in was very hard.

After the war, Bosch's share of the German domestic market was very high; Germany's share of total revenue far exceeded that of overseas markets, turning a born-global company into one overwhelmingly dependent on the German market. From 1963 onward, however, the company's main operating indicators began to show early signs of decline, and without timely intervention the whole company's development would have stalled. The chairman of the board of management at the time, Mr. Merkle, who would lead the company for twenty years, spotted the problem; the tensions included a relatively saturated domestic market and a relatively narrow product range.

So Bosch began innovating its product lines and, with labor costs rising at home, considered whether to follow the German car industry into emerging markets. All of this was being laid out in the 1960s, including how to use acquisitions to bring the brands and patented technologies it had once held in North America back under the Bosch umbrella and reclaim the North American market.

In the 1960s and 1970s, certain developments gave Bosch an excellent opening: the environmental movement and the energy crises. Now we're getting to territory I know well. Bosch had many technologies in reserve, and in the challenging environment of those decades it had the chance to seize them. In the 1960s, oil prices were not yet that high, but the biggest challenge came from the environment: vehicle emissions were already causing serious air pollution in many cities, even infamous public-health incidents, such as the photochemical smog of Los Angeles, which was caused by car exhaust.

Vehicle emissions arise from every component, from the powertrain to the exhaust system. Starting from the very first link in that chain, ignition, Bosch could bring the whole working process into its innovation and offer products accordingly. For example, in the 1960s it introduced exhaust-gas sensors, and, using the electronics developed after the war, it began to optimize and control the operation of the whole engine. The product innovations of this period formed a new product philosophy at Bosch, so that by the early 1970s the company had clearly made environmental protection and energy saving its main directions of development.

Another very important point: the family of diesel technologies launched in the 1920s in response to weak demand allowed Bosch to roll out a whole series of diesel powertrain technologies during the fuel-saving wave. Those technologies first flourished in Europe. In the 1973 wave of oil embargoes against the West, Europe was hurt no less than the United States; indeed, with its heavier dependence on energy imports, Europe's tolerance for high oil prices was far weaker than America's.

So Europe moved faster than the United States in pushing energy-saving technology. We keep talking about the two oil crises of the 1970s; what actually happened in the real world was somewhat worse than even Merkle had foreseen. In Europe, passenger-car buyers were quite fond of diesels; many models later introduced into China from Europe had diesel versions in their home markets that never came to China. But that kind of powertrain was never very popular in North America, where owners still preferred gasoline engines, especially big-displacement ones.

After oil prices fell in 1986, these stubborn habits meant German cars did not capture North American market share as quickly as we might assume; instead, sales of traditional American cars rebounded. That dynamic persisted into the late 1990s, which is why, in our memory of twenty-some years ago, American firms, GM and Ford, still wielded great influence in the second half of the 1990s, and General Motors still topped the Fortune Global 500. But they arguably failed to account for the next round of oil price rises, and after 2005, and especially in 2008, they paid dearly for it.

The seeds of that problem were planted back in the mid-1980s. There is also the matter of different road environments. Cars run on roads, and consumers' buying and driving habits interact broadly with local society, economic geography and infrastructure, producing a fairly entrenched culture of car use. Technologies that worked well and were very popular in Germany were not so welcome in the United States, and carmakers there were not so willing to fit them.

For instance, by the 1970s German cars had begun fitting anti-lock braking systems, ABS. It was especially useful in Germany. Why? Because the German autobahn has no general speed limit, whereas American highways do, so Americans did not see ABS as a necessity. Fitment rates in the US were very low; even by the early 1990s the share of American cars with ABS was not high. That greatly constrained the market for Bosch, the pioneer of ABS: you could pour enormous effort into promoting it and it still might not catch on in America, which had a very large impact.

All of this made the 1980s a business environment quite unlike what people imagine. After the 1970s and 1980s, it wasn't only component makers thinking about how to raise productivity and cut costs; the OEMs were thinking the same way. They wanted to diversify the suppliers in their supply chains, to drive down supplier costs in order to compete, and to avoid being unable to adjust when something went wrong. So they began to diversify, which was a huge challenge for a company like Bosch. Traditionally, cooperation among German firms was close; giants like Bosch and the big automakers even sat on each other's supervisory boards, so their collaboration was very good.

From the 1970s onward, though, the automakers started bringing in new competitors, some of them very tough: firms from outside the industry, or low down in its rankings, but whose parent companies were giants, began muscling into automotive. Here we have to mention another German giant, Siemens. Other competitors you might be able to ignore; Siemens you clearly could not, because it was simply too strong, and it had formidable expertise in electronics.

So when Bosch was trying to catch up with or lead the industry in electronics, facing a competitor of that caliber was frightening. The prospect of losing half or more of its market share on its home turf forced Bosch to look for markets abroad. In the 1960s and 1970s Bosch had already begun to consider building plants in southern Europe, in countries with a labor-cost advantage, while also taking advantage of German policy: after 1963 Germany faced a fairly serious labor shortage and began bringing in foreign workers.

In some southern European countries, workers could go to Germany to work during the agricultural off-season on short-term labor visas. Later, companies turned those short-term visas into de facto long-term employment. Later still the scope widened: Yugoslavia in the Balkans, though a socialist country, could send workers to Germany on labor visas; and later workers from North Africa, Tunisia and Morocco, and from Turkey could obtain guest-worker visas to work in Germany and ease the labor shortage. These workers entered Bosch's factories too, helping Bosch cope with rising labor costs.

With Bosch's fairly strong on-the-job training, these workers were gradually developed from unskilled labor into skilled workers, and even into technicians with solid skills. Even so, by the early 1990s Bosch was still a company whose main production capacity and workforce were inside Germany. The landmark event that put Bosch on the fast track of globalization was the labor dispute of 1993, the worst Bosch had seen in decades, which even forced large-scale layoffs; that was one consequence of German reunification.

On the whole, Robert Bosch, as we said, was an entrepreneur with social-democratic leanings who was more willing to negotiate with his employees and the unions, and negotiation became part of Bosch's corporate DNA. But after the 1970s tensions sharpened, and Bosch adopted more and more modern management methods, so the friction quietly accumulated. From 1993 onward Bosch began closing plants and making large-scale layoffs, and in the process it lost many highly skilled production workers. That lowered costs, but it was a real hindrance to maintaining the company's technical capability.

A technology company like this needs a large base of solidly skilled front-line technicians holding things up, so the company began to reflect: have we gone a bit too far? They began to question whether the model in place since Merkle became chairman of the board of management in 1963, of forcefully introducing modern management to keep cutting costs and raising efficiency, should give way to an approach better able to reconcile conflicts. That reflection lasted well over a decade, and it bore fruit.

Because when the much larger global financial crisis hit in 2008, the American Big Three ultimately survived only through mass layoffs, plant closures, bankruptcy protection and government rescue, which stands in contrast to how Bosch coped. Solving the problem came down to finding new sources of growth, and those came largely from emerging markets. From the 1990s onward in particular, Bosch greatly intensified its cultivation of the Chinese market. Indeed, the rise of the German carmakers since the 1990s has been tied to sales in China; the same is true of Volkswagen.

Here we should say a little about how Bosch entered China. As an old German firm, Bosch came to China very early, opening a representative office back in 1909, in the vintage-car era. But for a long time it had no investment and no plants in China; its return had to wait until 1975. Why 1975? The Cultural Revolution was not yet over, but China had already regained its lawful seat at the United Nations and begun establishing diplomatic relations with some Western countries, and economic and trade exchanges with the West had resumed in the atmosphere of adjustment and consolidation of the late Cultural Revolution.

In 1975 Beijing hosted a German technology exhibition. As a flagship German company Bosch naturally took part, and its main exhibits were diesel-related products. At that time China had no passenger-car market to speak of, but demand for commercial vehicles such as trucks was enormous, and raising the technical level of trucks even mattered for national defense. So products like Bosch's diesel injection pumps were bound to attract China's attention.

As China began reform and opening-up at the end of the 1970s, Bosch signed production licenses with two Chinese companies in 1984 to manufacture its signature diesel injection pumps. You can see it moved from commercial vehicles toward passenger cars. The passenger-car side had to wait until after 1984, when Volkswagen and China joined hands and set up a joint-venture plant, Shanghai Volkswagen; only then did Bosch gradually follow.

Even then, Bosch still had not set up a joint-venture factory of its own in China; that had to wait until 1993. The timing is interesting: there were really two waves, one in the mid-1980s and one in the mid-1990s. Bosch always had enormous interest in the Chinese market, but it entered with a very cautious posture.

Here we have to mention Bosch's partner Volkswagen. VW's North American business in the 1980s was not smooth sailing; the pressure was enormous. Especially after 1986, oil prices fell and the Deutsche Mark appreciated sharply after the Plaza Accord, and in 1988 Volkswagen stopped producing cars in North America. That made the Chinese market all the more important.

Building on the founding of Shanghai Volkswagen in 1984, negotiations began in the late 1980s for a joint venture with FAW, so by the early 1990s Volkswagen held a special position in China: as a foreign company it had two joint-venture plants. The second, FAW-Volkswagen, built the Audi 100, which had far more of an official-car aura than the Santana, and because it was positioned higher, it carried far more equipment. Many Bosch products began to be introduced.

In 1993 Bosch began expanding its business in China, and you can see the connection to the problems Bosch was facing at home in the late 1980s and early 1990s; the two threads of history link together. In April 1994 Bosch announced six joint-venture agreements in China, with total investment of 330 million US dollars, a milestone in its Chinese investment. Later, as we know, China's auto industry went through roughly fifteen years of long development: by 2008-09 annual sales had reached the ten-million-vehicle threshold, later passed twenty million, and China became the world's largest car market.

Bosch's business in China grew into a very prominent market along with the sharp rise in China's car ownership. You could even say that in the years after 2008, when the global auto industry was in a deep slump, the rapid growth of the Chinese market greatly lifted the company's results and helped it overcome the effects of the financial crisis relatively quickly.

You mentioned Merkle, the leader who ran Bosch for twenty years and steered it out of trouble during the oil crises of the 1970s. What strategy did he actually put forward, and how did it shape the later Bosch? In short, it was about strategic foresight, and foresight requires top management to be able to sum up the problems they currently face.

In the early 1970s, for example, Merkle, then at the helm of Bosch, gave a talk at headquarters specifically for senior managers on the Bosch group's business strategy. Note that this was February 1970, still a few years before the first oil crisis peaked in 1973. That is foresight. But foresight alone would tempt us to hang all the credit on one far-sighted individual; what we want to stress today is communication: turning individual will into organizational will and organizational action. What matters most is whether consensus can be built inside the company through good channels of exchange.

In that report Merkle summed up Bosch's corporate strategy in five points, which still repay reading today. The first was to respond actively to economic fluctuations. No need to belabor this: the fluctuations we face now come not only from the normal business cycle but from the uncertainty of national economic policies, and even, a few years back, from an unprecedented challenge like the pandemic.

The second was diversification, which was then treated as a key way to broaden the company's revenue. So in the 1970s, beyond its traditional automotive components, Bosch began looking for room in consumer markets, even at the risk of antitrust scrutiny allying itself with a major competitor: the partnership it built with Siemens to develop that business, which you can still see in many Bosch products today.

The third was globalization, pursued in the context of the 1970s. We often say 1973 was a turning point for globalization: that was when the dollar crisis and the Nixon shock hit, a crisis for the capitalist world globally.

Meanwhile emerging countries had a powerful desire to develop, and the Soviet-bloc countries were also seeking change. To raise globalization as a concept against the backdrop of the 1970s showed remarkable foresight on the part of these multinationals. The fourth was decentralization of authority. In the 1970s many companies moved to decentralized management, abandoning the old model of centralized, unified control. Even though the man at the top had enormous influence, he realized that without delegating authority the organization would ossify. After twenty years of postwar growth these companies had become behemoths, and gaining flexibility required decentralized management.

The last was stronger financial discipline: not relying on borrowing, but steadily increasing the independence of business units, with each independent unit responsible for its own profit and loss, to avoid tying up too much of the company's capital. Seen from the 1970s these five points showed great foresight, and they were also very practical. Stories like this kept repeating later: once you taste the benefits and the method proves effective, you keep using it, and a virtuous, self-reinforcing cycle forms.

Form and content were unified here. Today some companies are keen on top-down directives as a display of authority rather than focusing on the efficiency of communication. This is genuine wisdom worth borrowing from Bosch: what kind of communication actually translates corporate strategy into coherent, unified action.

Standing in 2025 and looking at all this, whether in the auto industry or manufacturing more broadly, I've often heard a narrative that today's world is about the rise of China and the United States, while the traditional carmaking powers, Japan and Europe, look relatively weak. Is that a new challenge for them? Of course it is, and these are troubled times: the complexity of the environment has only grown over the past four or five years, with no relief at all.

These new shifts may have very long-lasting effects: the pandemic in 2020, the disruption of global supply chains in 2021, the Russia-Ukraine war in 2022, and the energy pressure on Europe. Companies that hold key positions in important industrial sectors carry a very heavy load. A few years ago we still thought the German economy was a stagnant pond, but now we see Germany loosening its fiscal stance, something long treated as sacrosanct there. Germany taking the lead in loosening fiscal discipline may feel like rain after a long drought.

Think about it: this is the outcome those firms had been waiting for through the events of the past, and whatever the reason, it has arrived. One thing we must take seriously is that our stereotypes have been reinforcing themselves in a loop over the past few years. Traditionally, for example, we looked up to foreign companies because their products stood at a technological height and could not be replaced, so looking up was natural.

Once you too have affordable substitutes and can genuinely compete at the same level of quality, the industrial organization and management they relied on gets gradually demystified, and you lose your sense of awe. The upward gaze disappears, and with swelling confidence some even start looking down on them; I think that attitude is unnecessary.

Looking at these companies' century-plus histories, you can appreciate the challenges they faced as they grew: world wars that hit their production bases, countless economic crises with completely different internal and external causes, and major technological shifts that could upend entire sectors. They have been through countless changes of management alone, something unimaginable for many domestic companies that rely on traditional management or remain family businesses.

So today we should not grow complacent over partial successes; there is still an enormous amount to learn from how these companies grew. Many thanks to Kang for using Bosch as a lens to review the situations European industrial giants have faced from the nineteenth century to the twenty-first, and the lessons they offer Chinese companies that have grown so fast in the decades since reform and opening-up. As Kang said earlier, it is not that we have finished learning; the world still has a great deal to teach us, especially lessons as valuable as these.

That's all for this episode. Thanks for listening, see you next time. Bye-bye.