Blog of Daniel Lemire

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

Processors are getting wider

2025-09-01 22:26:18

Our processors execute instructions based on a clock. Thus, a 4 GHz processor has 4 billion cycles per second. It is difficult to increase the clock frequency of our processors. If you go much beyond 5 GHz, your processor is likely to overheat or otherwise fail.

So, how do we go faster? Modern processors can execute multiple instructions simultaneously: this is sometimes called superscalar execution. Most processors can handle 4 instructions per cycle or more. A recent Apple processor can easily sustain over 8 instructions per cycle.

However, the number of instructions executed depends on the instructions themselves. There are inexpensive instructions (like additions) and more costly ones (like integer division). The less costly the instructions, the more can be executed per cycle.

A processor has several execution units. With four execution units capable of performing additions, you might execute 4 additions per cycle. More execution units allow more instructions per cycle.
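To see what this means in practice, consider summing an array. With a single accumulator, each addition depends on the previous one; with several independent accumulators, the processor can dispatch additions to distinct execution units. A minimal C++ sketch (ignoring the loop remainder):

#include <cstddef>
#include <cstdint>

// One accumulator: each addition depends on the previous one,
// so at most one addition completes per cycle.
uint64_t sum_chained(const uint64_t *data, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

// Four independent accumulators: the processor can schedule up to
// four additions per cycle on distinct execution units.
uint64_t sum_parallel(const uint64_t *data, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3; // leftover elements omitted for brevity
}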

Typically, x86-64 processors (Intel/AMD) can retire at most one multiplication instruction per cycle, making multiplications relatively expensive compared to additions. In contrast, recent Apple processors can retire two multiplications per cycle.

The latest AMD processors (Zen 5) have three execution units capable of performing multiplications, potentially allowing 3 multiplications per cycle in some cases. Based solely on execution units, a Zen 5 processor could theoretically retire 3 additions and 3 multiplications per cycle.

But that is not all. I only counted conventional multiplications on general-purpose 64-bit registers. The Zen 5 has four execution units for 512-bit registers, two of which can perform multiplications. These 512-bit registers allow us to do many multiplications at once, by packing several values in each register.
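To illustrate, AVX-512 intrinsics let a single instruction multiply eight pairs of 64-bit integers at once. A minimal sketch (it requires a processor with AVX-512DQ support, such as Zen 5):

#include <immintrin.h>
#include <cstdint>

// Multiply eight pairs of 64-bit integers with one instruction (vpmullq).
void multiply_eight(const uint64_t *a, const uint64_t *b, uint64_t *out) {
    __m512i va = _mm512_loadu_si512(a); // load 8 x 64-bit integers
    __m512i vb = _mm512_loadu_si512(b);
    __m512i product = _mm512_mullo_epi64(va, vb); // 8 multiplications at once
    _mm512_storeu_si512(out, product);
}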

Generally, our general-purpose processors are getting wider: they can retire more instructions per cycle. That is not the only possible design. Indeed, these wider processors require many more transistors. Instead, you could use these transistors to build more processors. And that is what many people expected: they expected that our computers would contain many more general-purpose processors.

A processor design like the AMD Zen 5 is truly remarkable. It is not simply a matter of adding execution units. You have to bring the data to these units, you have to schedule the computations, and you have to handle the branches.

What this means for programmers is that even when you do not use parallelism explicitly, your code executes in a parallel manner under the hood.

Debugging C++ with Visual Studio Code under macOS

2025-08-25 03:51:49

My favorite text editor is Visual Studio Code. I estimate that it is likely the most popular software development environment. Many major software corporations have adopted Visual Studio Code.

The naming is a bit strange because Visual Studio Code has almost nothing to do with Visual Studio, except for the fact that it comes from Microsoft. For short, we often call it ‘VS Code’, although I prefer to spell out the full name.

Visual Studio Code has an interesting architecture. It is largely written in TypeScript (so basically JavaScript) on top of Electron. Electron itself is made of the Node.js runtime environment together with the Chromium Web engine. Electron provides an all-purpose approach to building desktop applications using JavaScript or TypeScript.

In Electron, Node.js is itself based on the Google V8 engine for JavaScript execution. Thus, Microsoft is building both on a community-supported runtime (Node.js) and on a stack of Google software.

What is somewhat remarkable is that even though Visual Studio Code runs on a JavaScript engine, it is generally quite fast. On my main laptop, it starts up in about 0.1s. I rarely notice any lag. It is almost always snappy, whether I use it on macOS or under Windows. Under Windows, Visual Studio Code feels faster than Visual Studio. Yet Visual Studio is written in C# and C++, languages that allow much better optimization, in principle. What makes it work is all the optimization effort that went into V8, Chromium, and Node.js.

Visual Studio Code also seems almost infinitely extensible. It is highly portable and it has great support for terminals. Coupled with Microsoft Copilot, you get a decent AI experience for when you need to do some vibe coding. I also love the ‘Remote - SSH’ extension, which allows you to connect to a remote server by ssh and work as if it were the local machine.

When I do system programming, I usually code in C or C++ using CMake as my build system. In my opinion, CMake is a great build system. I combine it with CPM for handling my dependencies.
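As an illustration, a minimal CMakeLists.txt using CPM might look as follows (the project name and the dependency are placeholders):

cmake_minimum_required(VERSION 3.20)
project(myproject CXX)

# CPM is a single script that you copy into your repository.
include(cmake/CPM.cmake)

# Fetch a dependency by GitHub shorthand and version tag.
CPMAddPackage("gh:fmtlib/fmt#10.2.1")

add_executable(myprogram src/main.cpp)
target_link_libraries(myprogram PRIVATE fmt::fmt)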

Microsoft makes available a useful extension for CMake users called CMake Tools. I do not have much use for it, but on the rare occasions when I need to launch a debugger to do non-trivial work, it is handy.

For the most part, my debugging usage under Linux is simple:

  1. I open the repository containing the CMake project. It might then ask me for a compiler, but I leave it unspecified; it seems to work fine.
  2. I select the target I want to run/debug by pressing F1 and selecting CMake: Set Build Target. If I just type the name of the target (e.g., the executable file), it seems to work.
  3. In the text editor, I click to the left of a line where I want the debugger to stop. You can also add conditional stops.
  4. I click on the little ‘bug’ icon in the bar at the bottom of the Visual Studio Code window.

It tends to just work.

For some projects, you want to pass CMake some configuration flags. You can just create a JSON file settings.json inside the subdirectory .vscode. The file contains a JSON object, and you can add a cmake.configureArgs entry with the desired settings, such as…

{"cmake.configureArgs": ["-DSIMDJSON_DEVELOPER_MODE=ON"]}

The settings.json file has many other uses. You can set preferences for the user interface, you can exclude files from the search tool, you can configure linting, and so forth. You can also check the settings.json file into version control with your project so that everyone gets the same preferences.
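For example, a settings.json combining a few of these preferences might look as follows (the values are merely illustrative):

{
  "cmake.configureArgs": ["-DSIMDJSON_DEVELOPER_MODE=ON"],
  "files.exclude": { "**/build": true },
  "search.exclude": { "**/node_modules": true },
  "editor.formatOnSave": true
}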

Unfortunately, under macOS, my debugging experience is not as smooth. The issue likely comes from the fact that macOS defaults to LLVM (clang) rather than GCC for its C and C++ compilers.

Thus under macOS, I add two non-obvious steps.

  1. I install an extension called CodeLLDB by Vadim Chugunov.
  2. I create a file called launch.json inside the subdirectory .vscode:
    {"configurations": [
    {
      "name": "Launch (lldb)",
      "type": "lldb",
      "request": "launch",
      "program": "${command:cmake.launchTargetPath}",
      "cwd": "${workspaceFolder}",
    }]
    }

A button “Launch (lldb)” should appear at the bottom of the Visual Studio Code window. Pressing it should launch the debugger. Everything else then works as it does under Linux. It just tends to work.

Visual Studio Code is an instance of a tool that never does anything quite perfectly. It expects you to edit JSON files by hand. It expects you to find the right extension by yourself. The debugging environment is fine, but you won’t write love letters about it. But the whole package of Visual Studio Code succeeds brilliantly. Everything tends to be just good enough so that you can get your work done with minimal fuss.

The web itself relies on generic technologies (HTML, CSS, JavaScript) which, though individually imperfect, form a coherent and adaptable whole. Visual Studio Code reflects this philosophy: it does not try to do everything perfectly, but it provides a platform where each developer can build their own workflow. This modularity, combined with a clean interface and an active community, explains why it has become one of my favourite tools.

Predictable memory accesses are faster

2025-08-16 05:42:56

Loading data from memory often takes several nanoseconds. While the processor waits for the data, it may be forced to wait without performing useful work. Hardware prefetchers in modern processors anticipate memory accesses by loading data into the cache before it is requested, thereby optimizing performance. Their effectiveness varies depending on the access pattern: sequential reads benefit from efficient prefetching, unlike random accesses.

To test the impact of prefetchers, I wrote a Go program that uses a single array access function. The execution time is measured to compare performance. I start with a large array of 32-bit integers (64 MiB).

  1. Sequential access: I read every eighth integer in order.
  2. Random access: I read every eighth integer in random order.
  3. Backward access: I read every eighth integer starting from the end.
  4. Interleaved access: I read every eighth integer, starting from the first, then the middle one, then the second, then the one after the middle one, and so forth.
  5. Bouncing access: I read every eighth integer, starting from the first, then the last, then the second, then the second to last, and so forth.
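The interleaved and bouncing orders are the least obvious. Here is a minimal sketch of how such index sequences can be generated (I show it in C++ for concreteness, though my benchmark is written in Go):

#include <cstddef>
#include <vector>

// Interleaved order: first, middle, second, one past the middle, ...
// (when n is odd, the last element is omitted for simplicity)
std::vector<size_t> interleaved_order(size_t n) {
    std::vector<size_t> order;
    order.reserve(n);
    size_t half = n / 2;
    for (size_t i = 0; i < half; i++) {
        order.push_back(i);        // element from the first half
        order.push_back(half + i); // element from the second half
    }
    return order;
}

// Bouncing order: first, last, second, second to last, ...
std::vector<size_t> bouncing_order(size_t n) {
    std::vector<size_t> order;
    order.reserve(n);
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        order.push_back(lo++);
        order.push_back(hi--);
    }
    if (lo == hi) {
        order.push_back(lo); // middle element when n is odd
    }
    return order;
}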

I skip integers that are not at an index divisible by eight: I do so to minimize ‘cache line’ effects. The code looks as follows:

type DataStruct struct {
    a, b, c, d, e, f, g, h uint32
}

var arr []DataStruct // the large array (64 MiB)
var indices []int    // visit order: sequential, random, backward, ...
var sum uint32

for j := 0; j < arraySize; j++ {
    sum += arr[indices[j]].a // accessing only the first field
}

Running the program on my Apple laptop, I find that every pattern is much faster than pure random access. It serves to illustrate how good our processors are at predicting data accesses.

My Go program is available.

Why do we need SIMD instructions?

2025-08-10 05:49:16

Last week, I was chatting with a student and I was explaining what SIMD instructions were. I was making the point that, in practice, all modern processors have SIMD instructions or the equivalent. Admittedly, some small embedded processors do not, but they lack many other standard features as well. SIMD stands for Single Instruction, Multiple Data, a type of parallel computing architecture that allows a single instruction to process multiple data elements simultaneously. For example, you can compare 16 bytes with 16 other bytes using a single instruction.

Suppose you have the following string: stuvwxyzabcdefgh. You want to know whether the string contains the character ‘e’. What you can do with SIMD instructions is load the input string in a register, and then compare it (using a single instruction) with the string eeeeeeeeeeeeeeee. The result would be something equivalent to 0000000000001000, indicating that there is, indeed, a letter e in the input.
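In terms of code, the comparison might look as follows. I use x64 SSE2 intrinsics for this sketch; ARM processors offer equivalent (NEON) instructions:

#include <emmintrin.h> // SSE2 intrinsics
#include <cstdint>

// Compare 16 input bytes against the character 'e' using a single
// comparison instruction; the result is a 16-bit mask, one bit per byte.
uint32_t match_e(const char *input) {
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i *>(input));
    __m128i target = _mm_set1_epi8('e');        // the string "eeeeeeeeeeeeeeee"
    __m128i eq = _mm_cmpeq_epi8(chunk, target); // byte-wise equality
    return static_cast<uint32_t>(_mm_movemask_epi8(eq)); // 1 bit per match
}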

Our programming languages tend to abstract away these SIMD instructions, and it is perfectly possible to have a long career in the software industry without even knowing what SIMD is. In fact, I suspect that most programmers do not know about SIMD instructions. If you are programming web applications in JavaScript, it is not likely to come up as a topic. (Fun fact: there was an attempt to bring SIMD to JavaScript through the SIMD.js API.)

Yet if SIMD is everywhere but few people know about it, is it even needed?

Suppose that you are looking for the first instance of a given character in a string. In C or C++, you might implement a function like so:

const char* naive_find(const char* start, const char* end,
                       char character) {
    while (start != end) {
        if (*start == character) {
            return start;
        }
        ++start;
    }
    return end;
}

The naive_find function searches for the first occurrence of a specific character within a range of characters defined by two pointers, start and end. It takes as input a pointer to the beginning of the range (start), a pointer to the end (end), and the character to find (character). The function iterates through the range character by character using a while loop, checking at each step if the current character (*start) matches the target character. If a match is found, the function returns a pointer to that position. Otherwise, it increments start to move to the next character. If no matching character is found before reaching end, the function returns end, indicating that the character was not found in the specified range. My function is not Unicode-aware, but it is still fairly generic.

What is wrong with this function? As implemented, it might require about 6 CPU instructions per character. Indeed, you have to compare the pointers, dereference the pointer, compare the result, increment the pointer, and so forth. Either you or the compiler can improve this number somewhat, but that is the basic picture. Unfortunately, your processor may not be able to retire more than 6 instructions per cycle; indeed, it is likely that your processor cannot even sustain 6 instructions per cycle.

Thus, naively implemented, a simple search for a character in a string will run at the speed of your processor or less: if your processor runs at 4 GHz, you will run through the string at 4 GB/s. Importantly, that’s likely true irrespective of whether the string is small and fits in CPU cache, or whether it is large and located outside of the CPU cache.

Is that a problem? Isn’t 4 GB/s very fast? Well. It is slower than a disk. The disk in my aging PlayStation 5 has a bandwidth of 5 GB/s. You can go to Amazon and order a disk with a 15 GB/s bandwidth.

Instead, let us compare against the ‘find’ function that we implemented in the simdutf library, using SIMD instructions. The performance you get depends on the processor and the SIMD instructions it supports. Let me use my Apple M4 processor as a reference. It has relatively weak SIMD support with only 16-byte SIMD registers. It pales in comparison to recent AMD processors (Zen 5) which have full support for 64-byte SIMD registers. Still, we can use about 4 instructions per block of 16 bytes. That’s over 20 times fewer instructions per input character. For strings of a few kilobytes or more, I get the following speeds.

naive search     4 GB/s
simdutf::find    110 GB/s

That is, the simdutf::find function is more than 20 times faster because it drastically reduces the number of required instructions.
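To give a flavour of how such a function can work, here is a simplified sketch using SSE2 intrinsics and a GCC/Clang builtin; the actual simdutf implementation is more sophisticated:

#include <emmintrin.h> // SSE2 intrinsics

// Simplified SIMD character search: examine 16 bytes per iteration.
const char *simd_find(const char *start, const char *end, char character) {
    __m128i target = _mm_set1_epi8(character);
    while (start + 16 <= end) {
        __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i *>(start));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, target));
        if (mask != 0) {
            // index of the first matching byte within the 16-byte block
            return start + __builtin_ctz(static_cast<unsigned>(mask));
        }
        start += 16;
    }
    // fall back to the scalar loop for the final partial block
    while (start != end) {
        if (*start == character) {
            return start;
        }
        ++start;
    }
    return end;
}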

Given our current CPU designs, I believe SIMD instructions are effectively a requirement to achieve decent performance (i.e., to process data faster than it can be read from a disk) on common tasks like a character search.

The source code of my benchmark is available. You might also be interested in the simdutf library, which offers many fast string functions.

Innovation begins with consumers, not academia

2025-07-16 23:58:54

“All innovation comes from industry is just wrong, universities invented many useful things.”

But that’s not the argument. Nobody thinks that Knuth contributed nothing to software programming.

Rather, the point is about the direction of the arrow. It is not

academia → industry → consumers

This is almost entirely wrong. I am not saying that the arrows do not exist… but it is a complex network where academia is mostly on the receiving end. Academia adapts to changes in society. It is rarely the initiator as far as technological innovation goes.

But let me clarify that academia does, sometimes, initiate innovation. It happens. But, more often, innovation actually starts with consumers (not even industry).

Take the mobile revolution. It was consumers who took their iPhone to work and installed an email client on it. And they changed the nature of work, creating all sorts of new businesses around mobile computing.

You can build some kind of story to pretend that the iPhone was invented by a professor… but it wasn’t.

Also, it wasn’t invented by Steve Jobs. Not really. Jobs paid close attention to consumers and what they were doing, and he adapted the iPhone. A virtuous circle arose.

So innovation works more like this…

academia ← industry ← consumers

If we are getting progress in AI right now, it is because consumers are adopting ChatGPT, Claude and Grok. And the way people are using these tools is pushing industry to adapt.

Academia is almost nowhere to be seen. It will come last. In the coming years, you will see new courses about how to build systems based on large language models. This will be everywhere after everyone in industry has adopted it.

And we all know this. You don’t see software engineers going back to campus to learn about how to develop software systems in this new era.

Look, they are still teaching UML on campus. And the only way it might die is that it is getting difficult to find a working copy of Rational Rose.

In any case, the fact that innovation is often driven by consumers largely explains why free-market economies like the United States are where innovation comes from. You can have the best universities in the world, and the most subsidized industry you can imagine… without consumers, you won’t innovate.

Rebels on campus

2025-07-15 23:49:23

« Normal science, the activity in which most scientists inevitably spend almost all their time, is predicated on the assumption that the scientific community knows what the world is like. Normal science often suppresses fundamental novelties because they are necessarily subversive of its basic commitments. As a puzzle-solving activity, normal science does not aim at novelties of fact or theory and, when successful, finds none. » Thomas Kuhn

The linear model of innovation is almost entirely backward. This model describes progress like so: University professors and their students develop new ideas; these ideas are then taken up by industry, which deploys them.

You can come up with stories that are supportive of this model… But on the ground, we are still fighting to get UML and the waterfall model off the curriculum. Major universities still forbid the use of LLMs in software courses (as if they could).

Universities are almost constantly behind. Not only are they behind, they often promote old, broken ideas. Schools still teach about the ‘Semantic Web’ in 2025.

Don’t get me wrong. The linear model can work, sometimes. It obviously can. But there are preconditions, and these preconditions are rarely met.

Part of the issue is ‘peer review’ which has grown to cover everything. ‘Peer review’ means ‘do whatever your peers are doing and you will be fine’. It is fundamentally reactionary.

Innovations still emerge from universities, but through people who are rebels. They either survive the pressure of peer review, or are just wired differently.

Regular professors are mostly conservative forces. To be clear, I do not mean ‘right wing’. I mean that they are anchored in old ideas and they resist new ones.

Want to see innovation on campus? Look for the rebels.