C++ Beyond the Syllabus #3: Micro-Benchmarking with Quick Bench
You’ve identified a section of code you’d like to speed up… now what?
Familiarity with micro-benchmarking will make you significantly more effective.
Micro-benchmarking allows you to do two main things:
- Compare performance of different code implementations
- Identify what functions/lines use the most CPU time
In this post, we’ll explore Quick Bench, a web-based tool designed to compare different C++ code implementations.
In school, we are taught to compare algorithms based on their Big-O complexity. In industry, code performance is far more nuanced.
Quick Bench helps you get a feel for the constant factor between two or more implementations with the same time complexity.
The action described by objective #2 is called profiling, which is super interesting but outside the scope of this article. Here are a few tools to use for profiling:
- `perf` (Linux)
- `gprof` (GNU)
- `TCMalloc` and `gperftools` (Linux, Windows, older macOS versions)
- Valgrind (GNU)
- Intel VTune Profiler (Linux, Windows, online tool)
- `magic-trace` (Linux)
Before we dive into Quick Bench, let’s clarify a few things micro-benchmarking is not good for…
- absolute runtime estimates for code snippets
- universal claims like “implementation A is always faster than implementation B”
- insights into entire-application performance or large code segments
If you haven’t already, check out last week’s post, Intro to Benchmarking & Macro-Benchmarking Deep Dive, to see how macro-benchmarking can help you narrow down on smaller code segments to micro-benchmark.
One of the best ways to learn the thing is to do the thing! You can follow along at quick-bench.com.
Setting Your Configuration
When you first open up Quick Bench, you’ll see something like this.
You will want to configure your Quick Bench environment to match your production environment. Instructions on how to find all of this information for your setup can be found in the article linked below.
A few additional configurations:
- `STL = libc++(LLVM)` — I use macOS.
- `Disassembly = None` — we won’t be looking into assembly code today.
- `optim = O3` — we want our benchmarking environment to have the same compiler optimizations as our production environment (usually `-O3` or `-Ofast`)… the one exception to this is that we don’t want the compiler optimizing away the code we’re trying to benchmark 😅 — we’ll address this throughout the post.
You can check out the article linked above for a refresher on (dis)assembly and compiler optimizations.
Library Syntax
Let’s start by breaking down the default example code as seen below.
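If you don’t have quick-bench.com open, here’s roughly what the default example looks like (reproduced from memory, so small details may differ from the current default). The (A)–(D) comments map to the breakdown that follows:

```cpp
#include <string>

static void StringCreation(benchmark::State& state) {  // (A) function definition
  for (auto _ : state) {                                // (B) state loop
    std::string created_string("hello");
    benchmark::DoNotOptimize(created_string);           // (C) do-not-optimize specifier
  }
}
BENCHMARK(StringCreation);                              // (D) benchmark registration

static void StringCopy(benchmark::State& state) {
  std::string x = "hello";                              // setup outside the loop: not measured
  for (auto _ : state) {
    std::string copy(x);
  }
}
BENCHMARK(StringCopy);
```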
This example consists of four main components:
(A) — Function Definition
- All code to be benchmarked must live in some function that is later registered (see (D)).
- The `state` argument here contains any inputs you may pass into the function. As discussed in the Arguments section below, you can instruct Quick Bench to use certain inputs to aid your testing.
(B) — State Loop
- The state loop runs many times, repeatedly measuring the contained code.
- You can imagine implicit `start` and `end` timestamps being taken at the start and end of each loop iteration.
- Anything outside of the loop is not measured.
(C) — Do Not Optimize Specifier
- Recall that compiler optimizations may completely remove unused code while generating assembly instructions. This line tricks the compiler into thinking the variable `created_string` is actually used somewhere.
- I’ll go into more detail on this in the Optimizations section below.
(D) — Benchmark Registration
- You can include helper functions and develop alternate code implementations within the code editor.
- Benchmarking will only be performed on a function once it is registered via this syntax.
Interpreting Results
We can run a benchmark of the default example code by clicking the large green `Run Benchmark` button. From there, we see the following results.
These results show the `StringCreation` benchmark ran faster than the `StringCopy` benchmark. But what actually happened when we ran the benchmark? And how should we correctly interpret the results?
When we clicked `Run Benchmark`, the Quick Bench backend got to work. It sent our code to a cluster of AWS servers and ran a few things…
For each registered function, Quick Bench ran the state loop many times and discarded the first few iterations so the cache could “warm up”. The results we see are an average over the remaining iterations.
Remember how I mentioned benchmarking isn’t great for absolute time estimates?
One reason for this is that our production environment usually isn’t the same as the benchmarking environment. In the case of Quick Bench, there could be hundreds of other users running their benchmarks at the same time as you, causing the AWS cluster to be slower than your production environment would be.
Alternatively, there could be no other users running benchmarks, and any absolute time estimates could yield faster results than your production app, which might have a lot of moving components that aren’t taken into account when micro-benchmarking.
Quick Bench accounts for this by only showing relative comparisons. Alongside your benchmarks, it will run a `Noop` (no-operation) benchmark that looks something like this (you can run this yourself to fact-check! 🤓):
```cpp
static void ExampleNoop(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(0);  // roughly a single operation per iteration
  }
}
BENCHMARK(ExampleNoop);
```
The expected absolute time of a `Noop` is the time we’d expect the CPU to complete a single operation. Quick Bench uses this `Noop` time as the common denominator when measuring benchmarks. Go ahead and click `Show Noop bar` at the bottom of the results and then hover over the blue bar representing `StringCreation`.

You’ll see that our `StringCreation` benchmark was about 2x faster than `StringCopy` and about 4x slower than `Noop`.
If you’re like me, you might have a few questions:
- Why don’t we need `benchmark::DoNotOptimize(copy);` included in our `StringCopy` function?
- With full optimizations enabled (our `optim = O3` setting), is the string literal constructor (`StringCreation`) really over 2x faster than the string copy constructor (`StringCopy`)?
These are some good questions… let’s dive into optimizations and compiler configurations a bit more.
Optimizations & Compiler Configurations
Recall how I set the compiler configurations to use `LLVM`, `Clang 14.0`, `C++20`, and `O3` to match my local development environment. With these configurations, we saw `StringCreation` perform 2.2x faster than `StringCopy`.
If we try out `Clang 17.0`, the most recent supported version, we get different results:

What?!? How can `StringCopy` now take effectively no CPU time?

Well, it seems likely that somewhere between Clang versions `14.0` and `17.0` the compiler developers added support to optimize away unused strings created by the copy constructor.

If we simply add in the missing `benchmark::DoNotOptimize(copy);` line mentioned above, we see much more believable results:
This is why it’s so important to…
- Instruct the compiler not to optimize anything that might remove the core of what you aim to benchmark, and
- Make sure your benchmark environment configuration matches your production environment.
Now that we’ve discussed this `benchmark::DoNotOptimize` function a few times, you’re probably wondering what it actually does.
If we had to guess, we might say this function performs some read operation on the argument. Then, the compiler wouldn’t optimize away the creation of the variable, because it is accessed later on.
The problem here is that we don’t want there to be an unnecessary read of the argument from within the state loop, because it would skew the measurements! As an extreme example, imagine passing some very large object that doesn’t fit in the cache as an argument to that implementation of `benchmark::DoNotOptimize`.

Quick Bench gets around this by just tricking the compiler into thinking the variable has some side effects, which the compiler itself might not know about. The actual nuances of the function are beyond the scope of this article, but it is super interesting, so I’d definitely recommend checking out `benchmark::DoNotOptimize`’s source code and Chandler Carruth’s explanation of it from CppCon 2015.
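To give a flavor of the trick (this is a rough sketch of the underlying idea, not the library’s actual source), an empty inline-assembly statement that takes the value as an input is enough to make the compiler assume the value might be observed:

```cpp
// Rough sketch only: the empty asm body is opaque to the compiler, so it must
// assume `value` (and memory) could be read or modified here, and therefore it
// cannot delete the code that produced the value. The real library version is
// more carefully tuned and portable.
template <class T>
inline void SketchDoNotOptimize(T const& value) {
  asm volatile("" : : "r,m"(value) : "memory");
}
```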
Arguments & Randomization
So we’ve covered most of the basics, but what we haven’t touched on so far are the arguments to our benchmark functions and randomization.
Let’s get to it…
There are a handful of ways to pass integer arguments to your functions in Quick Bench. Let’s start with a new code example measuring the benefits of `std::vector::reserve` (Quick Bench link):
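If you’re not following along with the link, the benchmark looks roughly like this (a sketch: the names and the (0)/(A)/(B) labels match the notes below, but the linked code may differ in small details):

```cpp
#include <vector>

static void VecPushBack(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    for (int i = 0; i < state.range(0); ++i) {  // (B) read the argument
      v.push_back(i);
    }
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(VecPushBack)->Arg(4);  // (A) pass 4 into the benchmark

static void VecReserve(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int> v;
    v.reserve(state.range(0));  // (0) the only difference: reserve up front
    for (int i = 0; i < state.range(0); ++i) {  // (B)
      v.push_back(i);
    }
    benchmark::DoNotOptimize(v);
  }
}
BENCHMARK(VecReserve)->Arg(4);  // (A)
```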
Let’s go over a few important notes about this example…
- Point (0), on line 20, marks the only difference between `VecPushBack` and `VecReserve`: the latter calls `std::vector::reserve` with the end size of the vector, which avoids any resizing while inserting elements.
- At the points labeled (A), we pass the value `4` into our benchmark functions.
- The points labeled (B) access the argument via the `state.range(0)` syntax.
In Quick Bench, an argument passed into a registered benchmark function results in a single call to that function, within which the state loop runs many times with the same argument.
When we run this benchmark…
We see `VecReserve` is faster than `VecPushBack`, which matches what we’d expect.

You can also see the argument passed into the benchmarks in each of the column labels (format `RegisteredFunctionName/Arg`).
This was a pretty simple example that only compared implementations for a single vector size.
If we were truly benchmarking this functionality, we’d want to test a variety of different inputs. To do this, we can chain arguments together on each benchmark call.
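That chained syntax looks something like this (a sketch reusing the vector benchmarks from above; pick whatever sizes make sense for your use case):

```cpp
BENCHMARK(VecPushBack)->Arg(4)->Arg(8)->Arg(16)->Arg(32);
BENCHMARK(VecReserve)->Arg(4)->Arg(8)->Arg(16)->Arg(32);
```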
On the left of the above figure, you can see the chained argument syntax. Keep in mind that each individual argument passed in with this syntax results in a separate call to the registered function. Within each of those independent calls, the argument will be the same for all iterations of the state loop.
On the right, we now see a graph of the inputs to each benchmark. If you would rather view these results as a bar chart with information like “variation X is Y times faster than variation Z”, you can toggle that where it says `Line` on the bottom right.
The chained argument syntax shown above is pretty repetitive, though. We can replace it with an equivalent and more succinct version like this:
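Here’s a sketch of that range-based registration, again reusing the vector benchmarks from above:

```cpp
// Equivalent to ->Arg(4)->Arg(8)->Arg(16)->Arg(32)
BENCHMARK(VecPushBack)->RangeMultiplier(2)->Range(4, 32);
BENCHMARK(VecReserve)->RangeMultiplier(2)->Range(4, 32);
```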
This syntax says that beginning from `4`, the lower bound of our range, we multiply each argument by `2`, the range multiplier, to generate the next argument. We continue this up through `32`, the (inclusive) upper bound of the range.
This is all great, but how can we randomize inputs?
Let’s consider a new example more suitable for this use case.
Here’s one implementation for a benchmark comparing the square root computation of an argument with `std::sqrt` versus a custom-made `std::pow` solution. For randomization, we’re simply using the C standard library’s `rand()` function.
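A minimal sketch of that approach (the `Sqrt` and `Pow` names match the discussion below; the exact details of the linked code may differ):

```cpp
#include <cmath>
#include <cstdlib>

static void Sqrt(benchmark::State& state) {
  const double input = rand();  // randomized once, when the function is called
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::sqrt(input));
  }
}
BENCHMARK(Sqrt);

static void Pow(benchmark::State& state) {
  const double input = rand();  // likely a different value than Sqrt received
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::pow(input, 0.5));
  }
}
BENCHMARK(Pow);
```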
This does indeed randomize the “argument” used for each of the registered functions, but as we mentioned earlier, these functions are only called once, exposing a few problems with this solution:
- We have no guarantee the random arguments will be the same for each registered function.
- The same input is used for all iterations of the state loop.
For all we know, `Sqrt` might be calculating the square root of some large prime number like `100,000,007` while `Pow` is simply calculating the square root of `4`.
Below is an alternate solution using some new syntax designed for complex arguments (Quick Bench link).
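Here’s a sketch of the shape of that solution (helper names like `RandomArguments` are mine; see the Quick Bench link for the exact code):

```cpp
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <vector>

// (A) generate 20 random inputs, shared by every registered benchmark
static const std::vector<int64_t> kInputs = [] {
  std::vector<int64_t> inputs;
  for (int i = 0; i < 20; ++i) inputs.push_back(rand());
  return inputs;
}();

// (B) generator: adds each random number as an argument to the benchmark
static void RandomArguments(benchmark::internal::Benchmark* b) {
  for (int64_t v : kInputs) b->Arg(v);
}

static void Sqrt(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::sqrt(static_cast<double>(state.range(0))));
  }
}

static void Pow(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(std::pow(static_cast<double>(state.range(0)), 0.5));
  }
}

// (C) register each benchmark and apply the generator from (B)
BENCHMARK(Sqrt)->Apply(RandomArguments);
BENCHMARK(Pow)->Apply(RandomArguments);
```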
With this new syntax…
- Box (A) contains code to generate 20 random numbers.
- Box (B) has a generator function, which takes the registered benchmark function as an argument and then adds the 20 random numbers as arguments to that benchmark.
- Boxes labeled (C) register each benchmark function and apply the generator function from box (B).
Based on our results above, we can conclude the `std::sqrt` solution is indeed faster than our custom `std::pow` solution.
What else?
We could go on for hours talking about other cool features, tips, and tricks for using Quick Bench. Stay tuned for more articles discussing some of these features in depth! 🤓
In the meantime, you can check out the documentation for Google Benchmark, the library Quick Bench is built on top of. It has tips and tricks for custom statistics, multi-threaded benchmarking, CPU timers, asymptotic complexity, and so much more!
What’s next?
Did you find this article useful? Give it a clap!
Want to learn about other C++ features and tools that aren’t always taught in school?
Subscribe to have C++ Beyond the Syllabus delivered directly to your inbox every week :) 🤓
Sources & More Info
- CppCon 2015: Chandler Carruth “Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!”
- GitHub — google/benchmark
- A cool list of Quick Bench links to interesting benchmarks (GitHub — mathieumallet)
Collaborators
A special thanks to a few friends who took time to edit and review this post: