C++ Beyond the Syllabus #3: Micro-Benchmarking with Quick Bench
You’ve identified a section of code you’d like to speed up… now what?
Familiarity with micro-benchmarking will make you significantly more effective.
Micro-benchmarking allows you to do two main things:
- Compare performance of different code implementations
- Identify what functions/lines use the most CPU time
In this post, we’ll explore Quick Bench, a web-based tool designed to compare different C++ code implementations.
In school, we are taught to compare algorithms based on their Big-O complexity. In industry, code performance is far more nuanced.
Quick Bench helps you get a feel for the constant factor between two or more implementations with the same time complexity.
The action described by objective #2 is called profiling, which is super interesting but outside the scope of this article. Here are a few tools to use for profiling:
- `perf` (Linux)
- `gprof` (GNU)
- `TCMalloc` and `gperftools` (Linux, Windows, older macOS versions)
- Valgrind (GNU)
- Intel VTune Profiler (Linux, Windows, online tool)
- `magic-trace` (Linux)
Before we dive into Quick Bench, let’s clarify a few things micro-benchmarking is not good for…
- absolute runtime estimates for code snippets
- universal claims like “implementation A is always faster than implementation B”
- insights into entire-application performance or large code segments
If you haven’t already, check out last week’s post, Intro to Benchmarking & Macro-Benchmarking Deep Dive, to see how macro-benchmarking can help you narrow down on smaller code segments to micro-benchmark.
One of the best ways to learn the thing is to do the thing! You can follow along at quick-bench.com.
Setting Your Configuration
When you first open up Quick Bench, you’ll see something like this.
You will want to configure your Quick Bench environment to match your production environment. Instructions on how to find all of this information for your setup can be found in the article linked below.
A few additional configurations:
- `STL = libc++(LLVM)` — I use macOS.
- `Disassembly = None` — we won’t be looking into assembly code today.
- `optim = O3` — we want our benchmarking environment to have the same compiler optimizations as our production environment (usually `-O3` or `-Ofast`)… the one exception to this is that we don’t want the compiler optimizing away the code we’re trying to benchmark 😅 — we’ll address this throughout the post.
You can check out the article linked above for a refresher on (dis)assembly and compiler optimizations.
Library Syntax
Let’s start by breaking down the default example code as seen below.
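If you don’t have quick-bench.com open, here’s roughly what the default example looks like (reproduced from memory, so small details may differ from the current default). The (A)–(D) comments map to the breakdown that follows:

```cpp
#include <string>

static void StringCreation(benchmark::State& state) {  // (A) function definition
  for (auto _ : state) {                                // (B) state loop
    std::string created_string("hello");
    benchmark::DoNotOptimize(created_string);           // (C) do-not-optimize specifier
  }
}
BENCHMARK(StringCreation);                              // (D) benchmark registration

static void StringCopy(benchmark::State& state) {
  std::string x = "hello";                              // setup outside the loop: not measured
  for (auto _ : state) {
    std::string copy(x);
  }
}
BENCHMARK(StringCopy);
```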
This example consists of four main components:
(A) — Function Definition
- All code to be benchmarked must live in some function that is later registered (see (D)).
- The `state` argument here contains any inputs you may pass into the function. As discussed in the Arguments section below, you can instruct Quick Bench to use certain inputs to aid your testing.
(B) — State Loop
- The state loop runs many times, repeatedly measuring the contained code.
- You can imagine implicit `start` and `end` timestamps being taken at the start and end of each loop iteration.
- Anything outside of the loop is not measured.
(C) — Do Not Optimize Specifier
- Recall that compiler optimizations may completely remove unused code while generating assembly instructions. This line tricks the compiler into thinking the variable `created_string` is actually used somewhere.
- I’ll go into more detail on this in the Optimizations section below.
(D) — Benchmark Registration
- You can include helper functions and develop alternate code implementations within the code editor.
- Benchmarking will only be performed on a function once it is registered via this syntax.
Interpreting Results
We can run a benchmark of the default example code by clicking the large green `Run Benchmark` button. From there, we see the following results.
These results show the `StringCreation` benchmark ran faster than the `StringCopy` benchmark. But what actually happened when we ran the benchmark? And how should we correctly interpret the results?
When we clicked `Run Benchmark`, the Quick Bench backend got to work. It sent our code to a cluster of AWS servers and ran a few things…
For each registered function, Quick Bench ran the state loop many times and discarded the first few iterations so the cache could “warm up”. The results we see are an average over the remaining iterations.
Remember how I mentioned benchmarking isn’t great for absolute time estimates?
One reason for this is that our production environment usually isn’t the same as the benchmarking environment. In the case of Quick Bench, there could be hundreds of other users running their benchmarks at the same time as you, causing the AWS cluster to be slower than your production environment would be.
Alternatively, there could be no other users running benchmarks, and any absolute time estimates could yield faster results than your production app, which might have a lot of moving components that aren’t taken into account when micro-benchmarking.
Quick Bench accounts for this by only showing relative comparisons. Alongside your benchmarks, it will run a `Noop` (no-operation) benchmark that looks something like this (you can run this yourself to fact-check! 🤓):
```cpp
static void ExampleNoop(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(0);  // roughly a single operation per iteration
  }
}
BENCHMARK(ExampleNoop);
```
The expected absolute time of a `Noop` is the time we’d expect the CPU to complete a single operation. Quick Bench uses this `Noop` time as the common denominator when measuring benchmarks. Go ahead and click `Show Noop bar` at the bottom of the results and then hover over the blue bar representing `StringCreation`.

You’ll see that our `StringCreation` benchmark was about 2x faster than `StringCopy` and about 4x slower than `Noop`.
If you’re like me, you might have a few questions:
- Why don’t we need `benchmark::DoNotOptimize(copy);` included in our `StringCopy` function?
- With full optimizations enabled (our `optim = O3` setting), is the string literal constructor (`StringCreation`) really over 2x faster than the string copy constructor (`StringCopy`)?
These are some good questions… let’s dive into optimizations and compiler configurations a bit more.
Optimizations & Compiler Configurations
Recall how I set the compiler configurations to use `LLVM`, `Clang 14.0`, `C++20`, and `O3` to match my local development environment. With these configurations, we saw `StringCreation` perform 2.2x faster than `StringCopy`.
If we try out `Clang 17.0`, the most recent supported version, we get different results:

What?!? How can `StringCopy` now take effectively no CPU time?

Well, it seems likely that somewhere between Clang versions `14.0` and `17.0` the compiler developers added support to optimize away unused strings created by the copy constructor.

If we simply add in the missing `benchmark::DoNotOptimize(copy);` line mentioned above, we see much more believable results:
This is why it’s so important to…
- Instruct the compiler not to optimize anything that might remove the core of what you aim to benchmark, and
- Make sure your benchmark environment configuration matches your production environment.
Now that we’ve discussed this `benchmark::DoNotOptimize` function a few times, you’re probably wondering what it actually does.
If we had to guess, we might say this function performs some read operation on the argument. Then, the compiler wouldn’t optimize away the creation of the variable, because it is accessed later on.
The problem here is that we don’t want there to be an unnecessary read of the argument from within the state loop, because it would skew the measurements! As an extreme example, imagine passing some very large object that doesn’t fit in the cache as an argument to that implementation of `benchmark::DoNotOptimize`.

Quick Bench gets around this by just tricking the compiler into thinking the variable has some side effects, which the compiler itself might not know about. The actual nuances of the function are beyond the scope of this article, but it is super interesting, so I’d definitely recommend checking out `benchmark::DoNotOptimize`’s source code and Chandler Carruth’s explanation of it from CppCon 2015.
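To give a flavor of the trick (this is a rough sketch of the underlying idea, not the library’s actual source), an empty inline-assembly statement that takes the value as an input is enough to make the compiler assume the value might be observed:

```cpp
// Rough sketch only: the empty asm body is opaque to the compiler, so it must
// assume `value` (and memory) could be read or modified here, and therefore it
// cannot delete the code that produced the value. The real library version is
// more carefully tuned and portable.
template <class T>
inline void SketchDoNotOptimize(T const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}
```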
Arguments & Randomization
So we’ve covered most of the basics, but what we haven’t touched on so far are the arguments to our benchmark functions and randomization.
Let’s get to it…
There are a handful of ways to pass integer arguments to your functions in Quick Bench. Let’s start with a new code example measuring the benefits of `std::vector::reserve` (Quick Bench link):
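If you’re not following along with the link, the benchmark looks roughly like this (a sketch: the names and the (0)/(A)/(B) labels match the notes below, but the linked code may differ in small details):

```cpp
#include <vector>

static void VecPushBack(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    for (int i = 0; i < state.range(0); ++i) {  // (B) read the argument
      v.push_back(i);
    }
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(VecPushBack)->Arg(4);  // (A) pass 4 into the benchmark

static void VecReserve(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    v.reserve(state.range(0));  // (0) the only difference: reserve up front
    for (int i = 0; i < state.range(0); ++i) {  // (B)
      v.push_back(i);
    }
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(VecReserve)->Arg(4);  // (A)
```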
Let’s go over a few important notes about this example…
- Point (0), on line 20, marks the only difference between `VecPushBack` and `VecReserve`: the latter calls `std::vector::reserve` with the end size of the vector, which avoids any resizing while inserting elements.
- At the points labeled (A), we pass the value `4` into our benchmark functions.
- The points labeled (B) access the argument via the `state.range(0)` syntax.
In Quick Bench, an argument passed into a registered benchmark function results in a single call to that function, within which the state loop runs many times with the same argument.
When we run this benchmark…
We see `VecReserve` is faster than `VecPushBack`, which matches what we’d expect.

You can also see the argument passed into the benchmarks in each of the column labels (format `RegisteredFunctionName/Arg`).
This was a pretty simple example that only compared implementations for a single vector size.
If we were truly benchmarking this functionality, we’d want to test a variety of different inputs. To do this, we can chain arguments together on each benchmark call.
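That chained syntax looks something like this (a sketch reusing the vector benchmarks from above; pick whatever sizes make sense for your use case):

```cpp
BENCHMARK(VecPushBack)->Arg(4)->Arg(8)->Arg(16)->Arg(32);
BENCHMARK(VecReserve)->Arg(4)->Arg(8)->Arg(16)->Arg(32);
```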
On the left of the above figure, you can see the chained argument syntax. Keep in mind that each individual argument passed in with this syntax results in a separate call to the registered function. Within each of those independent calls, the argument will be the same for all iterations of the state loop.
On the right, we now see a graph of the inputs to each benchmark. If you would rather view these results as a bar chart with information like “variation X is Y times faster than variation Z”, you can toggle that where it says `Line` on the bottom right.
The chained argument syntax shown above is pretty repetitive, though. We can replace it with an equivalent and more succinct version like this:
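Here’s a sketch of that range-based registration, again reusing the vector benchmarks from above:

```cpp
// Equivalent to ->Arg(4)->Arg(8)->Arg(16)->Arg(32)
BENCHMARK(VecPushBack)->RangeMultiplier(2)->Range(4, 32);
BENCHMARK(VecReserve)->RangeMultiplier(2)->Range(4, 32);
```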
This syntax says that beginning from `4`, the lower bound of our range, we multiply each argument by `2`, the range multiplier, to generate the next argument. We continue this up through `32`, the (inclusive) upper bound of the range.
This is all great, but how can we randomize inputs?
Let’s consider a new example more suitable for this use case.
Here’s one implementation for a benchmark comparing the square root computation of an argument with `std::sqrt` versus a custom-made `std::pow` solution. For randomization, we’re simply using the C standard library’s `rand()` function.
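A minimal sketch of that approach (the `Sqrt` and `Pow` names match the discussion below; the exact details of the linked code may differ):

```cpp
#include <cmath>
#include <cstdlib>

static void Sqrt(benchmark::State& state) {
  const double input = rand();  // randomized once, when the function is called
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::sqrt(input));
  }
}
BENCHMARK(Sqrt);

static void Pow(benchmark::State& state) {
  const double input = rand();  // likely a different value than Sqrt received
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::pow(input, 0.5));
  }
}
BENCHMARK(Pow);
```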
This does indeed randomize the “argument” used for each of the registered functions, but as we mentioned earlier, these functions are only called once, exposing a few problems with this solution:
- We have no guarantee the random arguments will be the same for each registered function.
- The same input is used for all iterations of the state loop.
For all we know, `Sqrt` might be calculating the square root of some large prime number like `100,000,007` while `Pow` is simply calculating the square root of `4`.
Below is an alternate solution using some new syntax designed for complex arguments (Quick Bench link).
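Here’s a sketch of the shape of that solution (helper names like `RandomArguments` are mine; see the Quick Bench link for the exact code):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

// (A) generate 20 random inputs, shared by every registered benchmark
static const std::vector<int64_t> kInputs = [] {
  std::vector<int64_t> inputs;
  for (int i = 0; i < 20; ++i) inputs.push_back(rand());
  return inputs;
}();

// (B) generator: adds each random number as an argument to the benchmark
static void RandomArguments(benchmark::internal::Benchmark* b) {
  for (int64_t v : kInputs) b->Arg(v);
}

static void Sqrt(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::sqrt(static_cast<double>(state.range(0))));
  }
}

static void Pow(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::pow(static_cast<double>(state.range(0)), 0.5));
  }
}

// (C) register each benchmark and apply the generator from (B)
BENCHMARK(Sqrt)->Apply(RandomArguments);
BENCHMARK(Pow)->Apply(RandomArguments);
```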
With this new syntax…
- Box (A) contains code to generate 20 random numbers.
- Box (B) has a generator function, which takes the registered benchmark function as an argument and then adds the 20 random numbers as arguments to that benchmark.
- Boxes labeled (C) register each benchmark function and apply the generator function from box (B).
Based on our results above, we can conclude the `std::sqrt` solution is indeed faster than our custom `std::pow` solution.
What else?
We could go on for hours talking about other cool features, tips, and tricks for using Quick Bench. Stay tuned for more articles discussing some of these features in depth! 🤓
In the meantime, you can check out the documentation for Google Benchmark, the library Quick Bench is built on top of. It has tips and tricks for custom statistics, multi-threaded benchmarking, CPU timers, asymptotic complexity, and so much more!
What’s next?
Did you find this article useful? Give it a clap!
Want to learn about other C++ features and tools that aren’t always taught in school?
Subscribe to have C++ Beyond the Syllabus delivered directly to your inbox every week :) 🤓
Sources & More Info
- CppCon 2015: Chandler Carruth “Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!”
- GitHub — google/benchmark
- A cool list of Quick Bench links to interesting benchmarks (GitHub — mathieumallet)
Collaborators
A special thanks to a few friends who took time to edit and review this post: