2025-08-10 05:49:16
Last week, I was chatting with a student and I was explaining what SIMD instructions were. I was making the point that, in practice, all modern processors have SIMD instructions or the equivalent. Admittedly, some small embedded processors do not, but they lack many other standard features as well. SIMD stands for Single Instruction, Multiple Data, a type of parallel computing architecture that allows a single instruction to process multiple data elements simultaneously. For example, you can compare 16 bytes with 16 other bytes using a single instruction.
Suppose you have the following string: stuvwxyzabcdefgh. You want to know whether the string contains the character ‘e‘. What you can do with SIMD instructions is load the input string in a register, and then compare it (using a single instruction) with the string eeeeeeeeeeeeeeee. The result would be something equivalent to 0000000000001000 indicating that there is, indeed, a letter e in the input.
Our programming languages tend to abstract away these SIMD instructions, and it is perfectly possible to have a long career in the software industry without even knowing what SIMD is. In fact, I suspect that most programmers do not know about SIMD instructions. If you are programming web applications in JavaScript, it is not likely to come up as a topic. (Fun fact: there was an attempt to introduce SIMD in JavaScript, the SIMD.js API.)
Yet if SIMD is everywhere but few people know about it, is it even needed?
Suppose that you are looking for the first instance of a given character in a string. In C or C++, you might implement a function like so:
const char* naive_find(const char* start, const char* end, char character) {
  while (start != end) {
    if (*start == character) {
      return start;
    }
    ++start;
  }
  return end;
}
The naive_find function searches for the first occurrence of a specific character within a range of characters defined by two pointers, start and end. It takes as input a pointer to the beginning of the range (start), a pointer to the end (end), and the character to find (character). The function iterates through the range character by character using a while loop, checking at each step if the current character (*start) matches the target character. If a match is found, the function returns a pointer to that position. Otherwise, it increments start to move to the next character. If no matching character is found before reaching end, the function returns end, indicating that the character was not found in the specified range. My function is not Unicode-aware, but it is still fairly generic.
What is wrong with this function? As implemented, it might require about 6 CPU instructions per character. Indeed, you have to compare the pointers, dereference the pointer, compare the result, increment the pointer, and so forth. Either you or the compiler can improve this number somewhat, but that’s the basic result. Unfortunately, your processor may not be able to retire more than 6 instructions per cycle; in fact, it likely cannot even sustain 6 instructions per cycle.
Thus, naively implemented, a simple search for a character in a string will run at the speed of your processor or less: if your processor runs at 4 GHz, you will run through the string at 4 GB/s. Importantly, that’s likely true irrespective of whether the string is small and fits in CPU cache, or whether it is large and located outside of the CPU cache.
Is that a problem? Isn’t 4 GB/s very fast? Well, it is slower than a disk. The disk in my aging PlayStation 5 has a bandwidth of 5 GB/s. You can go to Amazon and order a disk with a 15 GB/s bandwidth.
Instead, let us compare against the ‘find’ function that we implemented in the simdutf library, using SIMD instructions. The performance you get depends on the processor and the SIMD instructions it supports. Let me use my Apple M4 processor as a reference. It has relatively weak SIMD support with only 16-byte SIMD registers. It pales in comparison to recent AMD processors (Zen 5) which have full support for 64-byte SIMD registers. Still, we can use about 4 instructions per block of 16 bytes. That’s over 20 times fewer instructions per input character. For strings of a few kilobytes or more, I get the following speeds.
function | speed |
---|---|
naive search | 4 GB/s |
simdutf::find | 110 GB/s |
That is, the simdutf::find function is more than 20 times faster because it drastically reduces the number of required instructions.
Given our current CPU designs, I believe SIMD instructions are effectively a requirement to achieve decent performance (i.e., to process data faster than it can be read from a disk) on common tasks like a character search.
The source code of my benchmark is available. You might also be interested in the simdutf library, which offers many fast string functions.
2025-07-16 23:58:54
“All innovation comes from industry is just wrong, universities invented many useful things.”
But that’s not the argument. Nobody thinks that Knuth contributed nothing to software programming.
Rather the point is that the direction of the arrow is almost entirely wrong. It is not
academia → industry → consumers
I am not saying that the arrows do not exist… but it is a complex network where academia is mostly on the receiving end. Academia adapts to changes in society. It is rarely the initiator as far as technological innovation goes.
But let me clarify that academia does, sometimes, initiate innovation. It happens. But, more often, innovation actually starts with consumers (not even industry).
Take the mobile revolution. It was consumers who took their iPhone to work and installed an email client on it. And they changed the nature of work, creating all sorts of new businesses around mobile computing.
You can build some kind of story to pretend that the iPhone was invented by a professor… but it wasn’t.
Also, it wasn’t invented by Steve Jobs. Not really. Jobs paid close attention to consumers and what they were doing, and he adapted the iPhone. A virtuous circle arose.
So innovation works more like this…
academia ← industry ← consumers
If we are getting progress in AI right now, it is because consumers are adopting ChatGPT, Claude and Grok. And the way people are using these tools is pushing industry to adapt.
Academia is almost nowhere to be seen. It will come last. In the coming years, you will see new courses about how to build systems based on large language models. This will be everywhere after everyone in industry has adopted it.
And we all know this. You don’t see software engineers going back to campus to learn about how to develop software systems in this new era.
Look, they are still teaching UML on campus. And the only reason it might die is that it is getting difficult to find a working copy of Rational Rose.
In any case, the fact that innovation is often driven by consumers largely explains why free market economies like the United States are where innovation comes from. You can have the best universities in the world and the most subsidized industry you can imagine… without consumers, you won’t innovate.
2025-07-15 23:49:23
“Normal science, the activity in which most scientists inevitably spend almost all their time, is predicated on the assumption that the scientific community knows what the world is like. Normal science often suppresses fundamental novelties because they are necessarily subversive of its basic commitments. As a puzzle-solving activity, normal science does not aim at novelties of fact or theory and, when successful, finds none.” — Thomas Kuhn
The linear model of innovation is almost entirely backward. This model describes progress like so: University professors and their students develop the new ideas, these ideas are then taken up by industry which deploys them.
You can come up with stories that are supportive of this model… But on the ground, we are still fighting to get UML and the waterfall model off the curriculum. Major universities still forbid the use of LLMs in software courses (as if they could).
Universities are almost constantly behind. Not only are they behind, they often promote old, broken ideas. Schools still teach about the ‘Semantic Web’ in 2025.
Don’t get me wrong. The linear model can work, sometimes. It obviously can. But there are preconditions, and these preconditions are rarely met.
Part of the issue is ‘peer review’ which has grown to cover everything. ‘Peer review’ means ‘do whatever your peers are doing and you will be fine’. It is fundamentally reactionary.
Innovations still emerge from universities, but through people who are rebels. They either survive the pressure of peer review, or are just wired differently.
Regular professors are mostly conservative forces. To be clear, I do not mean ‘right wing’. I mean that they are anchored in old ideas and they resist new ones.
Want to see innovation on campus? Look for the rebels.
2025-07-15 07:12:29
One of my most popular blog posts of all time is Data alignment for speed: myth or reality? According to my dashboard, hundreds of people a week still load the old blog post. A few times a year, I get an email from someone who disagrees.
The blog post makes a simple point: programmers are often told to worry about ‘unaligned loads’ for performance, but you should generally not worry about alignment when optimizing your code.
An unaligned load occurs when a processor attempts to read data from memory at an address that is not properly aligned. Most computer architectures require data to be accessed at addresses that are multiples of the data’s size (e.g., 4-byte data should be accessed at addresses divisible by 4). For example, a 4-byte integer or float should be loaded from an address like 0x1000 or 0x1004 (aligned), but if the load is attempted from 0x1001 (not divisible by 4), it is unaligned. In some conditions, an unaligned load can crash your system, and it generally leads to undefined behavior in C or C++.
Related to this alignment issue is that data is typically organized in cache lines (64 bytes or 128 bytes on most systems) that are loaded together. If you load data randomly from memory, you might touch two cache lines, which could cause an additional cache miss. If you need to load data spanning two cache lines, there might be a penalty (say one cycle) as the processor needs to access the two cache lines and reassemble the data. Further, there is also the concept of a page of memory (4 kB or more). Accessing an additional page could be costly, and you typically want to avoid accessing more pages than you need to. However, you have to be somewhat unlucky to frequently cross two pages with one load operation.
How can you end up with unaligned loads? It often happens when you access low-level data structures, assigning meaning to some bytes. For example, you might load a binary file from disk, and its format might say that all the bytes after the first one are 32-bit integers. Without copying the data, it could be difficult to align it. You might also be packing data: imagine that you have a pair of values, one that fits in a byte and the other that requires 4 bytes. You could pack these values using 5 bytes, instead of 8 bytes.
There are cases where you should worry about alignment: if you are crafting your own memory copy function, if you want to be standard compliant (in C/C++), or if you need atomic operations (for multithreaded code). You might also encounter 4K aliasing, an issue Intel describes where arrays stored in memory at locations that are nearly a multiple of 4 KB can mislead the processor into thinking data is being written and then immediately read.
However, my general point is that it is unlikely to be a performance concern.
I decided to run a new test, given that I had not revisited this problem since 2012. Back then I used a hash function; this time, I use SIMD-based dot products with either ARM NEON intrinsics or AVX2 intrinsics. I build two large arrays of 32-bit floats and compute the scalar product. That is, I multiply the elements pairwise and sum the products. The arrays fit in a megabyte so that we are not RAM limited.
I run benchmarks on an Apple M4 processor as well as on an Intel Ice Lake processor.
On the Apple M4… we can barely see an effect from alignment (about 10%).
Byte Offset | ns/float | ins/float | instruction/cycle |
---|---|---|---|
0 | 0.059 | 0.89 | 2.93 |
1 | 0.064 | 0.89 | 2.75 |
2 | 0.062 | 0.89 | 2.82 |
3 | 0.064 | 0.89 | 2.69 |
4 | 0.062 | 0.89 | 2.84 |
5 | 0.064 | 0.89 | 2.75 |
6 | 0.062 | 0.89 | 2.69 |
7 | 0.064 | 0.89 | 2.75 |
And we cannot see much of an effect on the Intel Ice Lake processor.
Byte Offset | ns/float | ins/float | instruction/cycle |
---|---|---|---|
0 | 0.086 | 0.38 | 1.36 |
1 | 0.087 | 0.38 | 1.36 |
2 | 0.087 | 0.38 | 1.36 |
3 | 0.087 | 0.38 | 1.36 |
4 | 0.086 | 0.38 | 1.36 |
5 | 0.087 | 0.38 | 1.36 |
6 | 0.086 | 0.38 | 1.36 |
7 | 0.086 | 0.38 | 1.36 |
Using the 512-bit registers from AVX-512 does not change the conclusion.
My point is not that you cannot somehow detect the performance difference due to alignment in some tests. My point is that it is simply not something that you should generally worry about as far as performance goes.
2025-07-13 01:19:16
Studying productivity is challenging. About 15-20 years ago, I was obsessed over my own productivity.
I created a spying agent to monitor my activities and time spent. It would record which window was open, which web sites were loaded, and so forth.
I discovered that I spent over 40% of my day on email, which was shocking. However, this didn’t improve my productivity.
Why is it so hard? The issue is that you’re likely not measuring what you think. Productivity is value per unit of time, but defining “value” is problematic.
Long-term, value follows a Pareto distribution, where most value is created in unpredictable, short bursts.
For instance, a sudden idea in the shower might lead to an hour of work equivalent in value to the rest of the month’s efforts.
Does it mean that you should slack off and wait for the brilliant insight? No. That’s the problem with the Pareto distribution in general.
Maybe you are running a company and figure that 20% of your employees do 80% of the work. So you fire the 80% that are less productive. And what happens? Maybe you find out that Joe, who seemed unproductive, was holding your business together and you need to rehire him quickly.
The fundamental issue is that you have limited knowledge. You do not know where the value lies when you are in the middle of it. So by slacking off, you are likely to just greatly diminish the probability that you will have a sudden burst of high productivity.
Long term, you can probably identify the less productive activities. Maybe you have been going to these meetings for two years now, and nothing got done.
But that side project of yours, the one that looks like a waste of time, could truly be the most important work you could be doing.
So you should be humble and patient. Don’t waste time micromanaging yourself or others. You know less than you think.
2025-07-10 04:33:11
The Apple M2, introduced in 2022, and the Apple M4, launched in 2024, are both ARM-based system-on-chip (SoC) designs featuring unified memory architecture. That is, they use the same memory for both graphics (GPU) and main computations (CPU). The M2 processor relies on LPDDR5 memory whereas the M4 relies on LPDDR5X which should provide slightly more bandwidth.
The exact bandwidth you get from an Apple system depends on your configuration. But I am interested in single-core random access performance. To measure this performance, I construct a large array of indexes. These indexes form a random loop: starting from any element, if you read its value, treat it as an index, and move to that index, and so forth, you will visit each and every element of the large array. This type of benchmark is often described as ‘pointer chasing’ since it simulates what happens when your software is filled with pointers to data structures which are themselves made of pointers, and so forth.
When loading any value from memory, there is a latency of many cycles. Thankfully, modern processors can sustain many such loads at the same time. How many depends on the processor, but modern processors can sustain tens of memory requests at any given time. This phenomenon is part of what we call memory-level parallelism: the ability of the memory subsystem to sustain many tasks at once.
Thus we can split the pointer-chasing benchmark into lanes. Instead of starting at just one place, you can start at two locations at once, one at the ‘beginning’ and the other at the midpoint. And so forth. I refer to the number of such divisions as the number of ‘lanes’: one lane, two lanes, and so forth. Obviously, the more lanes you have, the faster you can go. From how fast you can go, you can estimate the effective bandwidth by assuming that each hit in the array is equivalent to loading a cache line (128 bytes). The array is made of over 33 million 64-bit words.
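The construction can be sketched as follows (a portable illustration of the technique, not my exact benchmark code; the helper names are my own, and a real benchmark would time the chase and use a much larger array):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Build a 'next' array that forms one big random cycle over n elements:
// following next[] from any starting point visits every element once.
std::vector<uint64_t> make_random_cycle(size_t n, uint64_t seed) {
  std::vector<uint64_t> perm(n);
  std::iota(perm.begin(), perm.end(), 0);
  std::mt19937_64 rng(seed);
  std::shuffle(perm.begin(), perm.end(), rng);
  std::vector<uint64_t> next(n);
  for (size_t i = 0; i < n; i++) {
    next[perm[i]] = perm[(i + 1) % n]; // chain the permutation into a loop
  }
  return next;
}

// Chase the cycle with several independent 'lanes' so that many memory
// requests can be in flight at once; the checksum keeps the work live.
uint64_t chase(const std::vector<uint64_t>& next, size_t lanes,
               size_t steps) {
  std::vector<uint64_t> cursor(lanes);
  for (size_t l = 0; l < lanes; l++) {
    cursor[l] = l * (next.size() / lanes); // spread the starting points
  }
  uint64_t checksum = 0;
  for (size_t s = 0; s < steps; s++) {
    for (size_t l = 0; l < lanes; l++) { // independent loads per lane
      cursor[l] = next[cursor[l]];
      checksum += cursor[l];
    }
  }
  return checksum;
}
```

With one lane, every load depends on the previous one, so you pay the full memory latency each time; with more lanes, the processor can overlap the loads.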
I run my benchmarks on two processors (Apple M2 and Apple M4). I have to limit the number of lanes since beyond a certain point, there is too much noise. A maximum of 28 lanes works well.
Maybe unsurprisingly, I find that the difference between the M4 and the M2 is not enormous (about 15%). Both processors can visibly sustain 28 lanes.