2025-09-30 05:28:33
In software, we represent real numbers as binary floating-point numbers: effectively, a fixed-precision integer (the significand) multiplied by a power of two. Thus we do not represent the number ‘pi’ exactly, but we get a very close approximation: 3.141592653589793115997963468544185161590576171875. The first 16 digits are exact.
You can represent the approximation of pi as 7074237752028440 times 2 to the power -51. It is a bit annoying to represent numbers in this manner, so let us use the hexadecimal float notation.
In the hexadecimal float notation, the number 1 would be 0x1.0 and the number two would be 0x2.0. It works just like decimal notation, except that each digit represents a power of sixteen: the value one half would be 0x0.8, and the infinite string 0x0.ffffff… would be 1.
Let us go back to the number pi. We write the significand as a hexadecimal number, 0x1921fb54442d18, and we insert a period right after the first digit: 0x1.921fb54442d18. It is a number in the range [1,2): specifically, 0x1921fb54442d18 times 2 to the power -52.
To get the number pi, we need to multiply by 2, which we do by appending ‘p+1’ at the end: 0x1.921fb54442d18p+1.
Of course, you do not compute any of this by hand. In modern C++, you just do:
std::print("Pi (hex): {:a}\n", std::numbers::pi);
If you are using 64-bit floating point numbers, then you can go from about -1.7976931348623157e+308 to 1.7976931348623157e308 using the exponential notation where 1.79769e308 means 1.79769 times 10 to the power 308. In hexadecimal notation, the largest value is 0x1.fffffffffffffp+1023. The integer value 0x1fffffffffffff is 9007199254740991 and the largest value is 9007199254740991 times 2 to the power 1023-52.
Numbers outside of this range are represented by an infinite value. Indeed, our computers have a notion of infinity. And they know that 1 over infinity is 0. Thus in modern C++, the following would print zero.
double infinity = std::numeric_limits<double>::infinity();
std::print("Zero: {}\n", 1 / infinity);
You might be surprised to find that if you type in a string value that is larger than the maximal value that can be represented, you do not get an error or an infinite value. For example, the number represented by the string 1.7976931348623158e308 is larger than the largest value that can be represented by a floating-point type, but it is not infinite. The following C++ code will print ‘true’:
std::print("{}\n", 1.7976931348623158e308 == std::numeric_limits<double>::max());
So what is the smallest number represented as a string that will map to infinity? Let us go back to the hexadecimal representation.
By default, we always round to the nearest value. Thus all number strings between 0x1.ffffffffffffe8p+1023 and 0x1.fffffffffffff8p+1023 round to 0x1.fffffffffffffp+1023. If you are right at 0x1.fffffffffffff8p+1023, then we ‘round to the nearest even’ value, which is 0x2.0p+1023, a value too large to be represented. Thus the number 0x1.fffffffffffff8p+1023 is right at the border with the infinite values.
In decimal, this value is 179769313486231580793728971405303415079934132710037826936173778980444968292764750946649017977587207096330286416692887910946555547851940402630657488671505820681908902000708383676273854845817711531764475730270069855571366959622842914819860834936475292719074168444365510704342711559699508093042880177904174497792.0.
If you type this string as a number constant, many compilers (irrespective of the programming language) will complain and reject it. Any number string just slightly smaller should be fine.
2025-09-22 03:10:18
When I was an undergraduate student, I discovered symbolic algebra. It was great! Instead of solving for a variable by hand, I could just put all the equations in the machine and get the result.
I soon found that symbolic algebra did not turn me into a genius mathematician. It could do the boring work, but I often found myself “getting stuck.” I would start from a problem, throw it at the machine and get back a mess that did not get me closer to the solution.
Over time, I realized that these tools had an undesirable effect on me. You see, human beings are lazy by nature. If you can avoid having to think hard about an issue, you will.
And I fear that much of the same is happening with large language models. Why think it through? Why read the documentation? Let us just have the machine try and try again until it succeeds.
It works. Symbolic algebra can solve lots of interesting mathematical problems. Large language models can solve even more problems.
But if you just sit there and mindlessly babysit a large language model, where are your new skills going to come from? Where are your deep insights going to come from?
I am not being a Luddite. I encourage you to embrace new tools when you can. But you should not dispense with doing the hard work.
2025-09-08 03:44:56
Suppose that you have a long string and you want to insert line breaks every 72 characters. You might need to do this if you need to write a public cryptographic key to a text file.
A simple C function ought to suffice. I use the letter K to indicate the length of the lines. I copy from an input buffer to an output buffer.
void insert_line_feed(const char *buffer, size_t length, int K,
                      char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  size_t next_line_feed = K;
  while (input_pos < length) {
    output[0] = buffer[input_pos];
    output++;
    input_pos++;
    next_line_feed--;
    if (next_line_feed == 0) {
      output[0] = '\n';
      output++;
      next_line_feed = K;
    }
  }
}
This character-by-character process might be inefficient. To go faster, we might call memcpy to copy blocks of data.
void insert_line_feed_memcpy(const char *buffer, size_t length, int K,
                             char *output) {
  if (K == 0) {
    memcpy(output, buffer, length);
    return;
  }
  size_t input_pos = 0;
  while (input_pos + K < length) {
    memcpy(output, buffer + input_pos, K);
    output += K;
    input_pos += K;
    output[0] = '\n';
    output++;
  }
  memcpy(output, buffer + input_pos, length - input_pos);
}
The memcpy function is likely to be turned into just a few instructions. For example, if you compile for a recent AMD processor (Zen 5), it might generate only two instructions (two vmovups) when the length of the lines (K) is 64.
Can we do better?
In general, I expect that you cannot do much better than using the memcpy function. Compilers are simply great at optimizing it.
Yet it might be interesting to explore whether deliberate use of SIMD instructions could speed up this code. SIMD (Single Instruction, Multiple Data) instructions process multiple data elements simultaneously with a single instruction; optimized memcpy implementations already rely on them. We can access SIMD instructions through intrinsic functions: compiler-provided functions that map directly to processor-specific instructions while keeping the code readable.
Let me focus on AVX2, the instruction set supported by effectively all x64 (Intel and AMD) processors. We can load and store 32-byte registers. Thus we need a function that takes a 32-byte register and inserts a line-feed character at some position (N) in it. When N is 16 or greater, the function uses a shuffle mask from a precomputed shuffle_masks array to reorder the input bytes: entries equal to 0x80 mark the insertion point, and a comparison with 0x80 produces the mask used to blend in a vector of newline characters. When N is less than 16, the function first shifts the input vector right by one byte, using _mm256_alignr_epi8 and _mm256_blend_epi32, to align the data correctly before applying the shuffle mask and inserting the newline.
inline __m256i insert_line_feed32(__m256i input, int N) {
  __m256i line_feed_vector = _mm256_set1_epi8('\n');
  __m128i identity =
      _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
  if (N >= 16) {
    __m128i maskhi = _mm_loadu_si128(shuffle_masks[N - 16]);
    __m256i mask = _mm256_set_m128i(maskhi, identity);
    __m256i lf_pos = _mm256_cmpeq_epi8(mask, _mm256_set1_epi8(0x80));
    __m256i shuffled = _mm256_shuffle_epi8(input, mask);
    return _mm256_blendv_epi8(shuffled, line_feed_vector, lf_pos);
  }
  // Shift input right by 1 byte
  __m256i shift = _mm256_alignr_epi8(
      input, _mm256_permute2x128_si256(input, input, 0x21), 15);
  input = _mm256_blend_epi32(input, shift, 0xF0);
  __m128i masklo = _mm_loadu_si128(shuffle_masks[N]);
  __m256i mask = _mm256_set_m128i(identity, masklo);
  __m256i lf_pos = _mm256_cmpeq_epi8(mask, _mm256_set1_epi8(0x80));
  __m256i shuffled = _mm256_shuffle_epi8(input, mask);
  return _mm256_blendv_epi8(shuffled, line_feed_vector, lf_pos);
}
Can we go faster by using such a fancy function? Let us test it out. I wrote a benchmark using a large input string on an Intel Ice Lake processor with GCC 12.
method | throughput | instructions
character-by-character | 1.0 GB/s | 8.0 ins/byte
memcpy | 11 GB/s | 0.46 ins/byte
AVX2 | 16 GB/s | 0.52 ins/byte
The handcrafted AVX2 approach is faster in my tests than the memcpy approach, despite executing more instructions per byte. However, the handcrafted AVX2 approach writes the data to memory using fewer store instructions.
2025-09-01 22:26:18
Our processors execute instructions based on a clock. Thus, a 4 GHz processor has 4 billion cycles per second. It is difficult to increase the clock frequency of our processors. If you go much beyond 5 GHz, your processor is likely to overheat or otherwise fail.
So, how do we go faster? Modern processors can execute multiple instructions simultaneously: this is sometimes called superscalar execution. Most processors can handle 4 instructions per cycle or more. A recent Apple processor can easily sustain over 8 instructions per cycle.
However, the number of instructions executed depends on the instructions themselves. There are inexpensive instructions (like additions) and more costly ones (like integer division). The less costly the instructions, the more can be executed per cycle.
A processor has several execution units. With four execution units capable of performing additions, you might execute 4 additions per cycle. More execution units allow more instructions per cycle.
Typically, x86-64 processors (Intel/AMD) can retire at most one multiplication instruction per cycle, making multiplications relatively expensive compared to additions. In contrast, recent Apple processors can retire two multiplications per cycle.
The latest AMD processors (Zen 5) have three execution units capable of performing multiplications, potentially allowing 3 multiplications per cycle in some cases. Based solely on execution units, a Zen 5 processor could theoretically retire 3 additions and 3 multiplications per cycle.
But that is not all. I only counted conventional multiplications on general-purpose 64-bit registers. The Zen 5 has four execution units for 512-bit registers, two of which can perform multiplications. These 512-bit registers allow us to do many multiplications at once, by packing several values in each register.
Our general-purpose processors are getting wider: they can retire more instructions per cycle. That is not the only possible design. These wider processors require many more transistors; you could instead use those transistors to build more processors. And that is what many people expected: they expected that our computers would contain many more general-purpose processors.
A processor design like the AMD Zen 5 is truly remarkable. It is not simply a matter of adding execution units. You have to bring the data to these units, order the computations, and handle the branches.
What this means for programmers is that even when you do not use parallelism explicitly, your code executes in a parallel manner under the hood.
2025-08-25 03:51:49
My favorite text editor is Visual Studio Code. I estimate that it is likely the most popular software development environment. Many major software corporations have adopted Visual Studio Code.
The naming is a bit strange because Visual Studio Code has almost nothing to do with Visual Studio, except for the fact that it comes from Microsoft. For short, we often call it ‘VS Code’ although I prefer to spell out the full name.
Visual Studio Code has an interesting architecture. It is largely written in TypeScript (so basically JavaScript) on top of Electron. Electron itself combines the Node.js runtime environment with the Chromium web engine, and it provides a general-purpose approach to building desktop applications using JavaScript or TypeScript.
In Electron, Node.js is itself based on the Google V8 engine for JavaScript execution. Thus, Microsoft is building both on a community-supported runtime (Node.js) and on a stack of Google software.
What is somewhat remarkable is that even though Visual Studio Code runs on a JavaScript engine, it is generally quite fast. On my main laptop, it starts up in about 0.1s. I rarely notice any lag. It is almost always snappy, whether I use it on macOS or under Windows. Under Windows, Visual Studio Code feels faster than Visual Studio. Yet Visual Studio is written in C# and C++, languages that allow much better optimization, in principle. What makes it work is all the optimization work that went into v8, Chromium, Node.js.
Visual Studio Code also seems almost infinitely extensible. It is highly portable and has great support for terminals. Coupled with Microsoft Copilot, you get a decent AI experience for when you need to do some vibe coding. I also love the ‘Remote SSH’ extension, which allows you to connect to a remote server by ssh and work as if it were the local machine.
When I do system programming, I usually code in C or C++ using CMake as my build system. In my opinion, CMake is a great build system. I combine it with CPM for handling my dependencies.
Microsoft makes available a useful extension for CMake users called CMake Tools. I do not have much use for it but on the rare occasions when I need to launch a debugger to do non-trivial work, it is handy.
For the most part, my debugging usage under Linux is simple: it tends to just work.
For some projects, you want to pass CMake some configuration flags. You can just create a JSON file settings.json inside the subdirectory .vscode. The JSON file contains a JSON object, and you can just add a cmake.configureArgs with special settings, such as…
{
  "cmake.configureArgs": ["-DSIMDJSON_DEVELOPER_MODE=ON"]
}
The settings.json has many other uses. You can set preferences for the user interface, you can exclude files from the search tool, you can configure linting and so forth. You can also check in the file settings.json with your project under version control so that everyone gets the same preferences.
Unfortunately, under macOS, my debugging experience is not as smooth. The issue likely comes from the fact that macOS defaults to LLVM (Clang) instead of GCC for its C and C++ compilers.
Thus under macOS, I add two non-obvious steps. First, I install the CodeLLDB extension by Vadim Chugunov. Second, I add a launch configuration in the file .vscode/launch.json:
{
  "configurations": [
    {
      "name": "Launch (lldb)",
      "type": "lldb",
      "request": "launch",
      "program": "${command:cmake.launchTargetPath}",
      "cwd": "${workspaceFolder}"
    }
  ]
}
A button “Launch (lldb)” should appear at the bottom of the Visual Studio Code window. Pressing it should launch the debugger. Everything else is then like when I am under Linux. It just tends to work.
Visual Studio Code is an instance of a tool that never does anything quite perfectly. It expects you to edit JSON files by hand. It expects you to find the right extension by yourself. The debugger environment is fine, but you won’t write love letters about it. But the whole package of Visual Studio Code succeeds brilliantly. Everything tends to just be good enough so that you can get your work done with minimal fuss.
The web itself relies on generic technologies (HTML, CSS, JavaScript) which, though individually imperfect, form a coherent and adaptable whole. Visual Studio Code reflects this philosophy: it does not try to do everything perfectly, but it provides a platform where each developer can build their own workflow. This modularity, combined with a clean interface and an active community, explains why it has become one of my favourite tools.
2025-08-16 05:42:56
Loading data from memory often takes several nanoseconds. While the processor waits for the data, it may be forced to wait without performing useful work. Hardware prefetchers in modern processors anticipate memory accesses by loading data into the cache before it is requested, thereby optimizing performance. Their effectiveness varies depending on the access pattern: sequential reads benefit from efficient prefetching, unlike random accesses.
To test the impact of prefetchers, I wrote a Go program that accesses an array using different index patterns and measures the execution time to compare performance. I start with a large array of 32-bit integers (64 MiB).
I skip integers that are not at an index divisible by eight, to minimize ‘cache line’ effects. The code looks as follows:
type DataStruct struct {
    a, b, c, d, e, f, g, h uint32 // 32 bytes: one accessed field per eight integers
}

var arr []DataStruct
for j := 0; j < arraySize; j++ {
    sum += arr[indices[j]].a // Accessing only the first field
}
Running the program on my Apple laptop, I find that every predictable access pattern is much faster than pure random access. It serves to illustrate how good our processors are at predicting data accesses.