I'm not talking about skills or knowledge, nor am I trying to sell the world on the idea of performance optimization. Our world already prioritizes speeding up everything. Optimizing code performance is hard work because it's a problem whose very nature dictates the use of brute force—an exhaustive search of options—and there's nothing you can do about it.
The article you're reading is, in part, a reflection on the frustrations I experience with code optimization. But I'll also try to offer practical advice that, I hope, will brighten the path for those pursuing optimization.
I'm currently working on an article about integer formatting, which discusses the design of a very specific algorithm. I still haven't finished it because I need to make about five decisions related to the algorithm. I don't know how they interact with each other, and I need to analyze 25 options to confidently say which one is the best. Several of my projects have stalled for the same reason—I lack the willpower to implement a dozen different combinations of code options.
Weeding out "obviously" suboptimal approaches is practically a heuristic. I'm comforted by the thought that I'm more familiar with x86-64 processors than most, but even these processors still manage to spring surprises on me from time to time. A naive algorithm might prove more useful than expected due to vectorization. And cleverly written code might fail due to a branch misprediction or a bug in the store-to-load forwarding mechanism that speeds up data reads performed after writes.
Optimization requires repeated trial and error. I don't like the saying, "Intuition won't help here—profile the code," because this phrase seems to imply that profiling is a viable substitute for theoretical calculations, although in fact it isn't. But I can't say that profiling is completely unnecessary. I often joke that it perf report's my go-to disassembler.
But what's worse is that you can't trust even "obviously" good code. In the previous article , I optimized the code by replacing a single-pass linear algorithm with one with greater-than-linear complexity. This isn't a unique case. Just yesterday, I saw someone optimize Barrett's best-in-class algorithm by dividing numbers as values of type [ ] double, rounding the resulting number, and calculating the remainder based on the division result. This seems stupid and will likely never be useful. But it was.
It's worth noting, however, that the good thing about optimization is that the work can be divided among several people who test different options. Open-source projects, in particular, take advantage of this successfully, as the people working on them have different skills and knowledge, and they pay attention to different ideas.
When working on your own projects, try to leverage what's already been done. Specifically, you can talk to colleagues or read about how others have solved similar problems.
Here's another example. Imagine a program that executes either action A or action B n times, depending on probability p. If p differs significantly from 1/2, the specifics of branch prediction suggest that it's better to implement the choice using if. If p is close to 1/2, branch prediction will perform poorly, and an approach that doesn't use branching is better. Here, not only the comparative performance Aand B, but also the "cost" of incorrect branch prediction plays a role, and this can depend not only on the CPU itself but also on the specific code running on it.
Ideally, you should have a set of testing programs that plot graphs and automatically find optimal parameter values, although setting up such a set of programs can be tedious. With this approach, constantly restarting checks is both cheaper and less emotionally taxing. Even if an automated check takes half an hour, you can spend that time doing other things.
For example, this happens when two lookup tables ( LUTs ) don't fit in the cache together, even though one of them fits. Sometimes this can be fixed by splitting the computation into multiple passes, where each pass only accesses auxiliary data that fits in the cache. This doesn't necessarily mean performing two passes over the entire data, doubling the memory bandwidth. The data can be split into chunks and performing two passes on one of the chunks, which will improve performance if, for example, the chunk fits in the L3 cache. But sometimes this approach isn't feasible, and then I literally bang my head against the wall.
Register limitations are even worse, as they're a problem unique to the instruction set architecture (ISA), not the microarchitecture . The hardware has plenty of registers; they're simply not accessible to user-written code. Here, you could try dividing the data between general-purpose registers and vector registers, and this approach works well as long as you rarely cross the line between GPR and SIMD. But if that happens, you might want to consider a career change .
It doesn't have to be this way. Field-programmable gate arrays ( FPGAs ) allow you to design your own hardware (well, sort of). Alternative approaches, like interaction networks , offer the chance to make software-defined operations as efficient as those typically implemented in hardware. But that's not like the world we live in, where Intel regularly adds useful instructions to AVX-512 and then abandons them. So I have to choose between processors that support FP16 vp2intersector FP16 . As a result, I not only have to benchmark different code, I also have to test it on different CPUs to decide which EC2 instance to deploy it on.
The only advice I can give here is to strive for the best possible result, even if it's worse than the theoretical optimum. Reduce the size of one of the LUTs by moving some calculations to runtime, rewrite a chunk of assembly code to improve register management, and if all else fails, accept that you'll have to make a choice.
let condition1 = HashSet::from([a, b]).contains(&c);
let condition2 = a == c || b == c;
But compilers aren't going to optimize the first example down to the second ( the JVM JIT, in some cases, isn't ). They don't reason using abstractions, and they certainly don't care about your own abstractions. This applies not only to high-level code: LLVM doesn't even understand that the bitwise AND operator is an intersection.
The task where compilers truly shine isn't optimization. They transform code written in high-level languages into abstractions that require no additional computational resources to process. But they aren't known for their resourcefulness or wit. Compilers are optimal transpilers. With rare exceptions, they generate the exact code the programmer wrote in the source code. They allow you to write assembly language using the syntax and capabilities of Rust or C++. But don't forget that the construct you write arr.map(|x| x / c)will cause [unclear idiv] without performing obvious precalculations like [unclear] libdivide.
Sometimes I think that the flag -O2should perhaps be renamed to -fzero-cost-abstractions.
This might give the impression that I'm arguing that compilers are only good at routine support tasks, but they're not good at those either. For example, they can be terrible at register allocation. If a piece of code that's executed infrequently needs a lot of registers, the GCC compiler accommodates this need by flushing variables accessed by the frequently executed section of code at every iteration, not just upon entry to the rarely executed section. Clang handles this simple example better, but it also misbehaves in more complex cases.
This leads to the following conclusion: never blindly trust the compiler. Always check the disassembly results of your code, use a profiler like [profile] perfthat analyzes the code at the instruction level. And don't hesitate to use this information to influence the compiler and nudge it in the right direction if it leads to noticeable improvements.
Despite compilers' obvious flaws, they don't allow us to correct them when they make mistakes. There's no way to give a compiler both optimized assembly code and its equivalent C code, and have it use the former in normal operation and the latter in special cases. Custom calling conventions are largely unsupported, as is the choice between code with and without jumps, and many other assembly tricks. There is such a thing as low-level intrinsics, but LLVM and rustc, when working with them, try to show off their intelligence by rewriting them. This sometimes leads to a performance hit, leaving no choice but to apply an optimization barrier.
Equivalent graphs ( e-graphs ), as popularized by Cranelift , are attempting to solve this problem, but as far as I know, there hasn't been much progress in this area yet. I'm still hoping for the best, though.
Apple Silicon has nothing like that. I have absolutely no idea how to work with M1+ processors. Here's an optimization guide for Apple Silicon processors, which is only 169 pages long and reads like it was written for beginners, not professionals. It looks like a textbook you'd find on Hacker News, not something I'd be interested in. It only contains rough timing and throughput estimates for certain instruction categories, and there's disappointingly little tabular data. There's no mention of micro-op fusion, nor is there any information about ports. Dougall Johnson's research is extremely helpful, but it only covers M1 processors, not newer CPUs, and it also leaves many questions unanswered.
Even Apple's fork of LLVM doesn't have instruction scheduling annotations for Apple Silicon. How can I write efficient code when Apple hasn't bothered to tune its own compiler? Optimizing code for such a platform is 90% reverse engineering and 10% writing meaningful code, and writing meaningful code is already a difficult task.
The proper solution to this problem is intellectual property theft, but I'm not allowed to say that, so I'll remain silent. Oops.
Optimization allows you to save time, and time is the only resource that people never have enough of.
The article you're reading is, in part, a reflection on the frustrations I experience with code optimization. But I'll also try to offer practical advice that, I hope, will brighten the path for those pursuing optimization.
Compatibility
Certain optimizations can only work in conjunction with each other, while others, when combined, actually degrade performance rather than improve it. Being an expert in this field means knowing the existing optimization methods. And an "optimization master" is someone who knows exactly which ones to choose.I'm currently working on an article about integer formatting, which discusses the design of a very specific algorithm. I still haven't finished it because I need to make about five decisions related to the algorithm. I don't know how they interact with each other, and I need to analyze 25 options to confidently say which one is the best. Several of my projects have stalled for the same reason—I lack the willpower to implement a dozen different combinations of code options.
Weeding out "obviously" suboptimal approaches is practically a heuristic. I'm comforted by the thought that I'm more familiar with x86-64 processors than most, but even these processors still manage to spring surprises on me from time to time. A naive algorithm might prove more useful than expected due to vectorization. And cleverly written code might fail due to a branch misprediction or a bug in the store-to-load forwarding mechanism that speeds up data reads performed after writes.
Optimization requires repeated trial and error. I don't like the saying, "Intuition won't help here—profile the code," because this phrase seems to imply that profiling is a viable substitute for theoretical calculations, although in fact it isn't. But I can't say that profiling is completely unnecessary. I often joke that it perf report's my go-to disassembler.
But what's worse is that you can't trust even "obviously" good code. In the previous article , I optimized the code by replacing a single-pass linear algorithm with one with greater-than-linear complexity. This isn't a unique case. Just yesterday, I saw someone optimize Barrett's best-in-class algorithm by dividing numbers as values of type [ ] double, rounding the resulting number, and calculating the remainder based on the division result. This seems stupid and will likely never be useful. But it was.
It's worth noting, however, that the good thing about optimization is that the work can be divided among several people who test different options. Open-source projects, in particular, take advantage of this successfully, as the people working on them have different skills and knowledge, and they pay attention to different ideas.
When working on your own projects, try to leverage what's already been done. Specifically, you can talk to colleagues or read about how others have solved similar problems.
Consistency
A variation of this are algorithms that have a threshold boundary. Here, we not only make optimization decisions but also need to select parameters, continuing to apply trial and error. For example:- Hybrid sorting algorithms are able to switch between different implementations due to large Big O constants.
- The FFT (Fast Fourier Transform) algorithm can switch between recursive and iterative approaches to make better use of the processor cache.
- When working with data of varying density, sets with different structures may prove optimal. These may include a bit set, a hash set, or a complementary hash set.
Here's another example. Imagine a program that executes either action A or action B n times, depending on probability p. If p differs significantly from 1/2, the specifics of branch prediction suggest that it's better to implement the choice using if. If p is close to 1/2, branch prediction will perform poorly, and an approach that doesn't use branching is better. Here, not only the comparative performance Aand B, but also the "cost" of incorrect branch prediction plays a role, and this can depend not only on the CPU itself but also on the specific code running on it.
Ideally, you should have a set of testing programs that plot graphs and automatically find optimal parameter values, although setting up such a set of programs can be tedious. With this approach, constantly restarting checks is both cheaper and less emotionally taxing. Even if an automated check takes half an hour, you can spend that time doing other things.
Incompatibility
The worst example of an optimization that is incompatible with something is one that does not work because of some external constraint.For example, this happens when two lookup tables ( LUTs ) don't fit in the cache together, even though one of them fits. Sometimes this can be fixed by splitting the computation into multiple passes, where each pass only accesses auxiliary data that fits in the cache. This doesn't necessarily mean performing two passes over the entire data, doubling the memory bandwidth. The data can be split into chunks and performing two passes on one of the chunks, which will improve performance if, for example, the chunk fits in the L3 cache. But sometimes this approach isn't feasible, and then I literally bang my head against the wall.
Register limitations are even worse, as they're a problem unique to the instruction set architecture (ISA), not the microarchitecture . The hardware has plenty of registers; they're simply not accessible to user-written code. Here, you could try dividing the data between general-purpose registers and vector registers, and this approach works well as long as you rarely cross the line between GPR and SIMD. But if that happens, you might want to consider a career change .
It doesn't have to be this way. Field-programmable gate arrays ( FPGAs ) allow you to design your own hardware (well, sort of). Alternative approaches, like interaction networks , offer the chance to make software-defined operations as efficient as those typically implemented in hardware. But that's not like the world we live in, where Intel regularly adds useful instructions to AVX-512 and then abandons them. So I have to choose between processors that support FP16 vp2intersector FP16 . As a result, I not only have to benchmark different code, I also have to test it on different CPUs to decide which EC2 instance to deploy it on.
The only advice I can give here is to strive for the best possible result, even if it's worse than the theoretical optimum. Reduce the size of one of the LUTs by moving some calculations to runtime, rewrite a chunk of assembly code to improve register management, and if all else fails, accept that you'll have to make a choice.
Compilers
Many people are accustomed to the idea that "compilers are smarter than people." But nothing could be further from the truth. Any developer can understand that the following two examples are (expectedly) equivalent:let condition1 = HashSet::from([a, b]).contains(&c);
let condition2 = a == c || b == c;
But compilers aren't going to optimize the first example down to the second ( the JVM JIT, in some cases, isn't ). They don't reason using abstractions, and they certainly don't care about your own abstractions. This applies not only to high-level code: LLVM doesn't even understand that the bitwise AND operator is an intersection.
The task where compilers truly shine isn't optimization. They transform code written in high-level languages into abstractions that require no additional computational resources to process. But they aren't known for their resourcefulness or wit. Compilers are optimal transpilers. With rare exceptions, they generate the exact code the programmer wrote in the source code. They allow you to write assembly language using the syntax and capabilities of Rust or C++. But don't forget that the construct you write arr.map(|x| x / c)will cause [unclear idiv] without performing obvious precalculations like [unclear] libdivide.
Sometimes I think that the flag -O2should perhaps be renamed to -fzero-cost-abstractions.
This might give the impression that I'm arguing that compilers are only good at routine support tasks, but they're not good at those either. For example, they can be terrible at register allocation. If a piece of code that's executed infrequently needs a lot of registers, the GCC compiler accommodates this need by flushing variables accessed by the frequently executed section of code at every iteration, not just upon entry to the rarely executed section. Clang handles this simple example better, but it also misbehaves in more complex cases.
This leads to the following conclusion: never blindly trust the compiler. Always check the disassembly results of your code, use a profiler like [profile] perfthat analyzes the code at the instruction level. And don't hesitate to use this information to influence the compiler and nudge it in the right direction if it leads to noticeable improvements.
Despite compilers' obvious flaws, they don't allow us to correct them when they make mistakes. There's no way to give a compiler both optimized assembly code and its equivalent C code, and have it use the former in normal operation and the latter in special cases. Custom calling conventions are largely unsupported, as is the choice between code with and without jumps, and many other assembly tricks. There is such a thing as low-level intrinsics, but LLVM and rustc, when working with them, try to show off their intelligence by rewriting them. This sometimes leads to a performance hit, leaving no choice but to apply an optimization barrier.
Equivalent graphs ( e-graphs ), as popularized by Cranelift , are attempting to solve this problem, but as far as I know, there hasn't been much progress in this area yet. I'm still hoping for the best, though.
Documentation
The uops.info website provides execution time and port data for every instruction on x86 processors and includes information on many Intel and AMD processors. Agner Fogh wrote a code optimization guide for x86 processors and published his own tables. The Intel Developer's Manual contains over 5,000 pages, describing not only the instruction set but also many of the internal mechanisms of Intel processors.Apple Silicon has nothing like that. I have absolutely no idea how to work with M1+ processors. Here's an optimization guide for Apple Silicon processors, which is only 169 pages long and reads like it was written for beginners, not professionals. It looks like a textbook you'd find on Hacker News, not something I'd be interested in. It only contains rough timing and throughput estimates for certain instruction categories, and there's disappointingly little tabular data. There's no mention of micro-op fusion, nor is there any information about ports. Dougall Johnson's research is extremely helpful, but it only covers M1 processors, not newer CPUs, and it also leaves many questions unanswered.
Even Apple's fork of LLVM doesn't have instruction scheduling annotations for Apple Silicon. How can I write efficient code when Apple hasn't bothered to tune its own compiler? Optimizing code for such a platform is 90% reverse engineering and 10% writing meaningful code, and writing meaningful code is already a difficult task.
The proper solution to this problem is intellectual property theft, but I'm not allowed to say that, so I'll remain silent. Oops.
Results
Performance optimization is a complex task. The point is, anyone attempting it needs the following:- Manually explore dozens of options without going crazy.
- Messing around with imperfect tools. (Profilers and MCAs are useful, but they remain toys whose capabilities don't match the true complexity of what they're used to study.)
- Trying to squeeze square shapes into round holes until they fit. Trying to combine incompatible optimizations.
- Confronting corporate greed and cultural indifference.
Optimization allows you to save time, and time is the only resource that people never have enough of.