Would Stabilizer actually help on modern hardware? #5
Comments
I do not have good numbers for this, since Stabilizer currently does not entirely work. But I think a good way to see the problem is to run the same benchmark a number of times and look at the average. Then recompile with a minor code change in an unrelated part of the code (like adding some code in error handling that never runs during benchmarks). If you do this with a few changes and rounds of benchmarking, you'll likely notice that your average is sometimes very different even though there should be absolutely no change to the computational load.
For example, in zlib-ng we see the average decompression speed change when the code changes are only in the compression code. There is no interaction between these code paths, and the code for each is far enough apart in memory that they never end up in the same cache line. On old machines this would have meant there is no way they could affect each other at all. But this is a big issue with benchmarking on modern machines. What I observe:
These are problems that are affected by memory placement and cache alignment.
In my experience these problems only get worse as CPUs get more advanced and cache hierarchies get deeper. Older in-order CPUs, with fewer clever tricks like automatic prefetching and caching, were a lot easier to get repeatable benchmarks from. Many benchmarks are very hard to get repeatable results from across multiple commits of code changes, so a proposed speedup sometimes shows up as a big slowdown during PR review/testing.
Doing 100 runs, discarding the 60 slowest ones and averaging only the 40 fastest ones seems to mitigate some of the OS-caused variation, like memory placement and interrupts. Stabilizer, when working properly, should be able to at least nearly eliminate the variation caused by the OS, the linker's decisions and code size changes in unrelated code.
I am not sure what I could add to the readme to illustrate this better, feel free to make suggestions. But I hope this helps you and others gain a little deeper knowledge of how and why Stabilizer is important. I tried to keep this summary easy to read and understand, as it gets very technical when you start to look at individual CPU core designs, for example.
PS: This summary is based on my own research and experience with benchmarking. I have done a lot of research into this myself, both in a professional capacity and out of general interest, but this is not a peer-reviewed research article. 😉 |
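The "average of the 40 fastest out of 100 runs" filter described above can be sketched roughly as follows. This is only an illustration: the function name `fastest_mean` and the `keep` parameter are made up here, and the timings are assumed to be in any consistent unit where lower means faster.

```python
from statistics import mean


def fastest_mean(timings, keep=40):
    """Average only the `keep` fastest (smallest) timings.

    Hypothetical helper: the slowest runs are discarded on the assumption
    that they are dominated by OS noise such as interrupts.
    """
    if keep <= 0:
        raise ValueError("keep must be positive")
    if len(timings) < keep:
        raise ValueError("need at least `keep` timings")
    fastest = sorted(timings)[:keep]  # ascending order, so fastest come first
    return mean(fastest)
```

With 100 measured run times in a list, `fastest_mean(times)` gives the filtered average. Note that this only filters out run-to-run noise within one binary; it does nothing about layout differences between two builds, which is the part Stabilizer is meant to address.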
Thanks for your summary. This is really helpful. |
@pca006132 Yes, that is possible, but it is quite advanced manual hacking and I'd only really consider it for really small projects. Imagine doing that with Chrome or Firefox for example 😄
Something I did not explain in the above post is the CPU's TLB (translation lookaside buffer). The effect of a TLB miss is most easily compared to a cache line miss. Both are affected by code/data placement in memory, and both have a performance penalty that often is not the same across multiple compilations (mainly due to code size changes). Aligning functions to a page boundary would mean that each function occupies a separate TLB entry, so while this would avoid the guesswork of which functions share a TLB entry, it would also run the risk of running out of TLB entries entirely, potentially leading to even more unpredictable benchmarks.
So here too Stabilizer would be beneficial, since it would automatically randomize code and data placement, thus ensuring that all functions and data stores would be roughly equally "badly" placed in memory on average. This is still not perfect, but a very good approximation. |
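To put rough numbers on the page-alignment concern above, here is a small back-of-the-envelope sketch. The 4 KiB page size and the example function sizes are assumptions for illustration, not measurements from any real binary, and real instruction TLB sizes vary by CPU.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_packed(sizes, page_size=PAGE_SIZE):
    """Pages touched when functions are laid out back to back,
    starting at a page boundary."""
    total = sum(sizes)
    return -(-total // page_size)  # ceiling division


def pages_page_aligned(sizes, page_size=PAGE_SIZE):
    """Pages touched when every function starts on its own page."""
    return sum(-(-size // page_size) for size in sizes)
```

For 200 hypothetical functions of 300 bytes each, the packed layout fits in 15 pages while the page-aligned layout needs 200, one translation per function. If the instruction TLB has fewer entries than that, the aligned layout would keep evicting entries.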
Yes, this approach would probably not scale if we align all functions to a page boundary, but it might (just might) make our results a bit more deterministic if we align only the really hot functions, e.g. the tight compression loop you mentioned. Anyway, I guess I need to try Stabilizer later when I have time, to learn from the actual benchmarks. |
I have no numbers either, but Stabilizer should greatly hinder compiler optimizations. For example, code randomization de facto disables inlining, which AFAIK is one of the most important optimization techniques because it implicitly enables interprocedural optimization. I can even imagine that it's possible to construct a patch that makes the code run slower under Stabilizer, while without Stabilizer there would be a speedup. Would Stabilizer with this limitation still be useful for you, @Dead2? |
@magras I am unclear on what you are suggesting here. Are you suggesting making the code running with Stabilizer artificially even slower than currently? That would probably defeat the purpose of using Stabilizer for benchmarking though, so I think I am missing something here. Or are you suggesting making it runtime selectable whether to run with/without Stabilizer enabled perhaps? I think what @pca006132 was suggesting with linker scripts for function alignments is more of an alternative to Stabilizer rather than a modification for it. |
I'm sorry, I'll try again with more context. Let me explain how Stabilizer achieves code location randomization. Stabilizer's pass runs before clang's optimization passes and modifies almost every function call to load the target address from a table located right after the end of the calling function. It's similar, but not exactly equivalent, to a PLT. Let's assume there is no actual code relocation and no additional runtime cost associated with it, just the code transformation I described above. Now there is a benchmark measuring this function's performance:
and there is an optimized version of
I believe all of the big three compilers will optimize the patched version to
Yes, this is an artificially constructed situation. But I have doubts about the Stabilizer design, because with Stabilizer we are measuring the performance of code that is very different from the actual release version. These doubts make it hard for me to stay motivated. It's probably still worth fixing the zlib-ng benchmark crashes and getting actual numbers and first-hand experience with Stabilizer, but... |
@magras Ah, now I understand what you mean. So the ideal method would probably be for Stabilizer to hook in at some point after the optimizer and inliner have already been run (completely or partially), and only then rewrite function calls and returns. The actual compiler-side implementation of this is beyond me as you know, for now at least. I think this would clearly be a great benefit. What we really want to benchmark is the code changes (or possibly the optimization flags), and the best way to do that is of course to benchmark an application that is as close to the "release" build as possible. Small question: could we run certain optimization passes directly before/during Stabilizer, for example just the inliner? Then we would make our changes on the already merged functions afterwards. IDK whether it would be possible to have that run before Stabilizer, or whether that would break things and require a rewrite (with regard to different kinds of IR etc). |
@Dead2, it's possible to tap into different stages of the compiler, but it might reduce the effect of Stabilizer. I'll explain what I mean in the next post (the point about micro benchmarks). Btw, I'm not an expert either. I learned LLVM while studying Stabilizer's code.
Inlining isn't the only problem. Right now I know only one technique to fix issues caused by code relocation: deoptimization. TLS, global variables, bulk copy of constants: they are all deoptimized. I think their impact is much smaller than that of inlining, but I believe I can construct an analogous example for TLS, which would be much closer to a real optimization in real code. There are costs associated with Stabilizer's runtime too. Every function call starts with trampoline (push actual function address to the stack and
I'm not sure how to reorder or duplicate a builtin pass in LLVM, but it is probably achievable. There shouldn't be problems with IR. It might change the calculated inlining costs, but that is probably fine. |
Let me paint a broader picture of how I see Stabilizer.
Hence in my opinion Stabilizer was designed for big projects with long-running performance tests. I don't have such projects. My main grudge is with rerandomization. It breaks the compiler's assumption that function and global variable addresses are stable, and that causes all the trouble. I have been thinking about abandoning code relocation at runtime and randomizing the layout only at compile time. Probably with linker scripts, but I have never used them and don't know how hard it would be to generate them with a random layout. The obvious advantage of this approach is that we would be measuring the production code without any additional overhead. But of course there are downsides too:
@Dead2, would this approach work for your projects? |
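As a rough illustration of the compile-time idea above, a small script could shuffle the function order once per build and hand the result to the linker. This is only a sketch under assumptions: the helper names and the symbol names in the usage note are made up, and it uses a plain ordering file (one symbol per line) rather than a full linker script. As far as I know, lld accepts such a file via `--symbol-ordering-file` when the code is built with `-ffunction-sections`, but please check your linker's documentation.

```python
import random


def random_symbol_order(symbols, seed):
    """Return the symbols in a pseudo-random order that is reproducible
    for a given seed, so a particular layout can be rebuilt later."""
    rng = random.Random(seed)
    shuffled = list(symbols)  # copy, leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled


def write_ordering_file(symbols, seed, path):
    """Write one symbol per line in shuffled order (hypothetical helper)."""
    order = random_symbol_order(symbols, seed)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(order) + "\n")
    return order
```

For example, `write_ordering_file(["inflate_fast", "deflate_slow", "crc32"], seed=7, path="order.txt")` with placeholder symbol names. Building and benchmarking with several different seeds would then sample several layouts of otherwise identical production code.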
Hi, I saw this fork of the original Stabilizer repo and am very interested in it. I wanted to know how much of a difference this makes on modern hardware with much larger caches, higher associativity and better hardware prefetchers. Putting some figures in the readme would help others know whether this is still important nowadays, and might attract others to contribute.