C++ Beyond the Syllabus #2: Intro to Benchmarking & Macro-Benchmarking Deep Dive

Jared Miller
8 min read · Jun 20, 2024


Write with intention, optimize with precision. Benchmarking helps you measure what matters.

If performance didn’t matter, you probably wouldn’t be using C++ (or C). When developing…

  • automated trading systems, you want to be faster than your competitors;
  • video games, you want to minimize lag;
  • data center management systems, you want to minimize energy consumption.

The most important requirement for a software system is usually that it produces the correct output. This is straightforward to test: a good test suite verifies that input x_i results in output f(x_i).
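
For instance, a minimal correctness check might look like the following sketch (addOne is a stand-in function invented purely for illustration):

```cpp
#include <cassert>

// Hypothetical function under test: f(x) = x + 1.
int addOne(int x) { return x + 1; }

int main() {
    // Correctness is binary: each assertion either holds or it doesn't.
    assert(addOne(41) == 42);
    assert(addOne(-1) == 0);
}
```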

Performance is more subjective. We need to answer some questions, which typically don’t have clear-cut answers…

How performant does it need to be?

The CTO of a large HFT firm once told me (paraphrasing)

the most valuable decision an engineer can make is that some code doesn’t need to be optimized or written in the first place.

Almost every application will have a critical path. In HFT, this critical path often flows from some input (a stock price update, for example), which causes us to trigger (decide to trade) and send an order to the exchange. We usually want this critical path to be fast, consistent, or have some other measurable performance quality depending on the trading strategy employed by the application.

Similar to the CTO’s comment above, it’s important to recognize that most code outside the critical path should be performant, but does not need a crazy emphasis on latency.

Find the code paths where latency is important and benchmark those.

What metrics do we care about?

A senior engineer once told me a motto of his (paraphrasing)

in trading, money is made and lost in the outliers.

Let’s say you wrote a lightning-fast trading engine with a tight latency distribution during normal market operations (great job!), but it develops severe tail latency during high-volume periods (yikes!). The trading engine’s value will probably degrade significantly during a meme stock rally or earnings calls, when volume is high and latency matters the most.

It would be smart to identify and address the latency contributors.

Identify latency contributors

Typical latency contributors include unnecessary copies, allocations, and resource contention. While you might be able to pinpoint code using bubble sort or repeatedly allocating many MBs of memory, it may be harder to identify other contributors. So how can we monitor those?

I’m glad you asked! Enter: benchmarking — a common practice in industry to identify and monitor latency-sensitive code.

We’ll split this introduction into macro- and micro-benchmarking. Both are important.

Generally, we macro-benchmark entire code paths to identify problem areas and conditions of concern. Then, we micro-benchmark and performance tune smaller, latency-sensitive critical sections.

Macro-Benchmarking

The content in this section is based on Bryce Adelstein-Lelbach’s CppCon 2015 talk: “Benchmarking C++ Code” and my own experiences.

The Macro-Benchmarking Process: (1) Pick a logical process to benchmark. (2) Decide what metrics matter. (3) Take the measurements. (4) Automate testing. (5) Create production monitoring systems.
The Macro-Benchmarking Process

Process

1. Decide what logical processes are worth benchmarking.

Maybe you want to measure…

  • the tick-to-trigger time in an automated trading application
  • the latency to render a frame in a video game
  • how long an update queue takes to process each event
  • how many copies of an element you make throughout a process

Each of these can be measured via macro-benchmarks: quantitative metrics of large-scale code performance.

For the remainder of this section, let’s use the trading application example and aim to measure the tick-to-trigger time. Before adding any benchmarking, this (very simplified) process in our trading application might look something like this…

Simplified Trading Application System Diagram with a circular loop between the Exchange and the Trading Application. The Trading Application consists of a MarketDataHandler, which normalizes market data from the Exchange, and a MainEngine, which decides if the application should trigger on the market data and conditionally sends an order back to the exchange.
Simplified Trading Application System Diagram

This process initiates from some event in the market (a tick), like a large trade of Tesla stock, for example. Our black-box trading application may decide to send an order to the exchange (a trigger) via the function shouldWeTrigger(). If we decide to trigger, the application sends a message to the exchange to place an order.
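
In code, that path might look roughly like the sketch below. Only shouldWeTrigger() comes from the description above; every other type and function name is an illustrative placeholder:

```cpp
#include <cstdint>

// Sketch of the simplified trading application before any benchmarking.
// All names except shouldWeTrigger() are illustrative placeholders.
struct RawPacket  { std::uint64_t payload; };
struct MarketData { double price; };
struct Order      { double price; };

// MarketDataHandler: normalize the raw exchange packet.
MarketData normalize(const RawPacket& pkt) {
    return MarketData{static_cast<double>(pkt.payload)};
}

// MainEngine: decide whether this tick is worth trading on.
bool shouldWeTrigger(const MarketData& tick) { return tick.price > 0.0; }

// Send the order message back to the exchange (stubbed out here).
void sendOrder(const Order&) {}

// The tick-to-trigger path: input tick -> decision -> (maybe) an order.
void onMarketData(const RawPacket& pkt) {
    MarketData tick = normalize(pkt);
    if (shouldWeTrigger(tick)) {
        sendOrder(Order{tick.price});
    }
}
```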

2. Decide what metrics matter.

We need to decide what metrics matter and our requirements for those metrics. Usually this will be determined via a combination of technical and business needs:

  • What constitutes the start and end bounds of each individual tick-to-trigger latency measurement in our processing path?
  • How fast does our trading application need to trigger to beat our competitors to the exchange?

For our example trading application, we will likely decide that the tick-to-trigger measurement should start when we first receive the market data and end after we have sent our order to the exchange. This accounts for our entire processing path, plus any network latency involved in sending the actual order message.

Simplified Trading Application System Diagram with Macro-Benchmarking Start & End Points Identified
Simplified Trading Application System Diagram with Macro-Benchmarking Start & End Points Identified

Let’s say the analysts on our team have decided the average tick-to-trigger latency needs to be in the single digit microseconds (arbitrary, but pretty standard latency) to beat our competitors. To make sure that we are not sending orders to the exchange on very stale data, they decide our tail latency (99th percentile) needs to be sub-50 microseconds.

In practice, code latency distributions should roughly follow a normal distribution. And generally, the tighter the distribution, the better. To test data for normal distributions, we can do something called a mean-median test. To do this, we’ll likely want some downstream process to verify:

Mean-Median Test to Check For a Normal Distribution: |mean - median| / max(mean, median) < 0.01
Mean-Median Test to Check For a Normal Distribution

For the requirements above, we know we will need some data structure to track the mean, median and 99th percentile of our metric. We might also want to track standard deviation to determine the tightness of our distribution.
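
Here is a rough sketch of what that data structure could compute, assuming latencies are recorded in nanoseconds (the LatencyStats type and helper names are made up for illustration):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary statistics over recorded tick-to-trigger latencies (in nanoseconds).
struct LatencyStats {
    double mean, median, p99, stddev;
};

// Assumes samples is non-empty. Sorting a copy gives us the median and the
// 99th percentile directly.
LatencyStats summarize(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / n;

    double sqDiff = 0.0;
    for (double s : samples) sqDiff += (s - mean) * (s - mean);

    return LatencyStats{
        mean,
        samples[n / 2],                                     // median (close enough for a sketch)
        samples[static_cast<std::size_t>(0.99 * (n - 1))],  // 99th percentile
        std::sqrt(sqDiff / n)                               // population standard deviation
    };
}

// The mean-median test from the formula above: flags skewed distributions.
bool roughlyNormal(const LatencyStats& s) {
    return std::abs(s.mean - s.median) / std::max(s.mean, s.median) < 0.01;
}
```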

3. Actually take the measurements.

There are tons of libraries and customizable solutions for this. Some common libraries include:

  • the std::chrono library to measure benchmark start and end times down to the nanosecond
  • boost::accumulators::accumulator_set to accumulate data points and extract summary statistics (great for our trading application example! See the sketch after this list.)
  • gperftools (Google's performance tools, which bundle TCMalloc) to track allocation data broken down by the allocating function, provide CPU utilization stats, and more
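
Here is a minimal sketch combining the first two of these: std::chrono timestamps feeding a Boost.Accumulators set that tracks the mean and a right-tail quantile. It assumes Boost.Accumulators is available; the cache size and function names are arbitrary:

```cpp
#include <chrono>

#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics.hpp>  // mean, tail_quantile, etc.

namespace acc = boost::accumulators;

// One accumulator for the tick-to-trigger metric: the mean plus a right-tail
// quantile estimator so we can read off the 99th percentile.
using LatencyAccumulator =
    acc::accumulator_set<double, acc::stats<acc::tag::mean,
                                            acc::tag::tail_quantile<acc::right>>>;

// The cache size for the tail estimator is arbitrary here.
LatencyAccumulator gTickToTrigger(acc::tag::tail<acc::right>::cache_size = 100000);

void recordTickToTrigger(std::chrono::steady_clock::time_point start,
                         std::chrono::steady_clock::time_point end) {
    gTickToTrigger(std::chrono::duration<double, std::nano>(end - start).count());
}

double tickToTriggerMean() { return acc::mean(gTickToTrigger); }
double tickToTriggerP99() {
    return acc::quantile(gTickToTrigger, acc::quantile_probability = 0.99);
}
```

recordTickToTrigger() would be fed timestamps taken at the start and end points identified above.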

Some custom solutions for various specialized use cases…

  • augment copy constructors to count object copies (see the sketch after this list)
  • take timestamps in mutex/lock_guard acquire functions to measure wait times
  • measure processing queue size when adding and removing tasks
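
As an example of the first of these, a rough sketch of counting copies by augmenting a copy constructor might look like this (the TickMessage type and counter are illustrative):

```cpp
#include <atomic>
#include <cstddef>

// Illustrative message type whose copies we want to count. In practice you
// might compile the counter in only for benchmark or debug builds.
struct TickMessage {
    static std::atomic<std::size_t> copyCount;

    double price = 0.0;

    TickMessage() = default;
    TickMessage(const TickMessage& other) : price(other.price) {
        copyCount.fetch_add(1, std::memory_order_relaxed);
    }
    TickMessage& operator=(const TickMessage& other) {
        price = other.price;
        copyCount.fetch_add(1, std::memory_order_relaxed);
        return *this;
    }
    TickMessage(TickMessage&&) = default;             // moves stay free
    TickMessage& operator=(TickMessage&&) = default;
};

std::atomic<std::size_t> TickMessage::copyCount{0};
```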

A super simplified system diagram of our trading application with macro-benchmarking might look something like this (please excuse the funky combination of pseudocode and real C++ syntax)…

Simplified Trading Application System Diagram with Macro-Benchmarking Pseudocode
Simplified Trading Application System Diagram with Macro-Benchmarking Pseudocode
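
In text form, and reusing the illustrative names from the sketches above, the instrumented tick-to-trigger path might look roughly like this:

```cpp
#include <chrono>

// Instrumented version of onMarketData() from the earlier sketch. normalize(),
// shouldWeTrigger(), sendOrder(), and recordTickToTrigger() are the illustrative
// pieces defined above.
void onMarketData(const RawPacket& pkt) {
    // Start: the moment we first see the market data.
    const auto start = std::chrono::steady_clock::now();

    MarketData tick = normalize(pkt);
    if (shouldWeTrigger(tick)) {
        sendOrder(Order{tick.price});

        // End: the order message has been handed off to the exchange.
        const auto end = std::chrono::steady_clock::now();
        recordTickToTrigger(start, end);
    }
}
```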

4. Automate testing with unit & regression tests.

Unless a code change involves adding additional computational work, you probably want to make sure your code does not degrade in performance across change sets. Likewise, you’ll probably want to make sure any intended performance improvements do actually result in measured performance improvements.

We discussed above how functionality tests are trivial — they either pass or they fail. Performance-related tests are harder.

We can create a performance test suite with the following properties:

  • Run the code many times, with randomized inputs generated to fall within the parameters of expected production inputs, recording timestamps for each run. (Make sure to run enough iterations for statistical significance, if that is a requirement of your test suite! A sketch of such a loop follows this list.)
  • Perform mean-median tests to check that each metric falls into a normal distribution.
  • Compare against a log file of historical test results to ensure performance hasn’t degraded.
  • If all of these tests pass, add the data from this test to the log file of historical results. Otherwise, output a warning.
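
Here is a rough sketch of such a test loop. It reuses the illustrative pieces sketched earlier (RawPacket, onMarketData, LatencyStats, summarize, roughlyNormal) and two hypothetical helpers around a historical results log (loadHistoricalP99, appendResult):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helpers around a historical results log file.
double loadHistoricalP99();              // read the baseline p99 from the log
void appendResult(const LatencyStats&);  // append this run's stats to the log

bool runTickToTriggerPerfTest(std::size_t iterations, std::size_t warmup) {
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> priceDist(100.0, 200.0);

    std::vector<double> samplesNs;
    samplesNs.reserve(iterations);

    for (std::size_t i = 0; i < iterations + warmup; ++i) {
        // Generate the randomized input *before* taking the start timestamp.
        RawPacket pkt{static_cast<std::uint64_t>(priceDist(rng))};

        const auto start = std::chrono::steady_clock::now();
        onMarketData(pkt);
        const auto end = std::chrono::steady_clock::now();

        // Discard warm-up iterations so we measure the hot-cache path.
        if (i >= warmup) {
            samplesNs.push_back(
                std::chrono::duration<double, std::nano>(end - start).count());
        }
    }

    const LatencyStats stats = summarize(samplesNs);
    const bool normal = roughlyNormal(stats);                           // mean-median test
    const bool noRegression = stats.p99 <= 1.10 * loadHistoricalP99();  // 10% tolerance

    if (normal && noRegression) appendResult(stats);  // record only healthy runs
    return normal && noRegression;
}
```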

5. Create monitoring systems for production.

Performance monitoring is important in production. It allows you to identify when performance has decreased or recovered. Paired with other information (user logs, market activity, etc.), you can learn trends and conditions that indicate a performance decrease is likely to happen in the future.

One common production setup for this is reporting macro-benchmark timestamps to an external database (not from within the critical path!) and viewing them in an online GUI like Grafana.
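
A rough sketch of that hand-off is below. Writing to a local file stands in for the external database a dashboard like Grafana would read from, and in a real system the hot-path enqueue would likely be a lock-free ring buffer rather than a mutex-protected queue:

```cpp
#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <thread>

// Hand measurements off to a background thread so the critical path only pays
// for an enqueue; the background thread does the actual reporting.
class LatencyReporter {
public:
    LatencyReporter() : worker_([this] { drain(); }) {}

    ~LatencyReporter() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Called from the hot path: just enqueue and notify.
    void report(double latencyNs) {
        { std::lock_guard<std::mutex> lk(m_); queue_.push_back(latencyNs); }
        cv_.notify_one();
    }

private:
    // Background thread: write queued measurements out for the dashboard.
    void drain() {
        std::ofstream out("tick_to_trigger.log", std::ios::app);
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !queue_.empty()) {
            cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
            while (!queue_.empty()) {
                out << queue_.front() << '\n';
                queue_.pop_front();
            }
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::deque<double> queue_;
    bool done_ = false;
    std::thread worker_;
};
```

The report() call is the only cost paid on the critical path; everything else happens on the background thread.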

Here is what our trading application system diagram might look like after adding tests and production monitoring functionality…

Simplified Trading Application System Diagram with Tests & Production Monitoring Systems for Macro-Benchmarks
Simplified Trading Application System Diagram with Tests & Production Monitoring Systems for Macro-Benchmarks

Common Macro-Benchmarking Pitfalls

Make sure you are macro-benchmarking metrics that matter. Don’t waste developer time optimizing processes that don’t need to be fast.

Most code paths run on either a “hot” cache or a “cold” cache, not both.

  • If your production code path is common, it will likely have a hot cache. Make sure to discard the first few iterations of each benchmark test, so the cache can warm up.
  • If your production code path only runs at startup or is very infrequent compared to other large logical processes in the application, you may be more interested in running on a cold cache. If this is the case, you can force your benchmark tests to wipe the cache between iterations (see the sketch below).
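
One crude but common way to approximate a cold cache between iterations is to stream through a scratch buffer much larger than the last-level cache. A rough sketch (the 64 MiB size is an assumption; pick something comfortably larger than your LLC):

```cpp
#include <cstddef>
#include <vector>

// Touch a buffer much larger than the last-level cache so the benchmarked
// data is (mostly) evicted before the next iteration.
void evictCaches() {
    static std::vector<char> scratch(64 * 1024 * 1024);
    volatile char sink = 0;
    for (std::size_t i = 0; i < scratch.size(); i += 64) {  // one touch per cache line
        scratch[i] = static_cast<char>(i);
        sink = sink + scratch[i];
    }
    (void)sink;
}
```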

When writing tests…

  • Randomize inputs. Otherwise, the compiler may fold away work on constant inputs and the CPU’s branch predictor will learn your fixed input pattern, so your measurements won’t be representative of production.
  • Generate any randomized inputs prior to taking a “start” timestamp.
  • Have large sample sizes.
  • Do not disable compiler optimizations during testing. You want your benchmarks to resemble production performance. We’ll discuss cases where special handling for this is required in the micro-benchmarking deep dive next week.

Most of the time, you don’t need to follow a strict scientific process to gain valuable insights, but any hard conclusions should probably be statistically sound.

Micro-Benchmarking

With our findings from macro-benchmarking, we can detect inefficiencies and begin more targeted investigations with micro-benchmarking.

Neither form of benchmarking has a single correct method, though there are definitely incorrect methods.

While macro-benchmarking is rather straightforward, micro-benchmarking is a heck of a lot harder to get right. Next week, we’ll learn to avoid some common mistakes while using micro-benchmarking and uncover code inefficiencies via a short case study.

What’s next?

Did you find this article useful? Give it a clap!

Want to learn about other C++ features and tools that aren’t always taught in school?

Subscribe to have C++ Beyond the Syllabus delivered directly to your inbox every week :) 🤓

Sources

  • Bryce Adelstein-Lelbach, “Benchmarking C++ Code”, CppCon 2015

Collaborators

A special thanks to a few friends who took time to edit and review this post:

Written by Jared Miller

A C++ Software Engineer in high frequency trading. Excited about low-latency code, distributed systems, and education technology.
