Benchmark Calibration
- Peter
- Oct 14, 2024
- 7 min read
I believe it is necessary to take a closer look at my benchmarking suite. Three things have led to this:
- Some previous experiments were looking for small performance differences (4% or maybe even less).
- Some previous results indicated small performance improvements, but these might just be inconsistent benchmark performance.
- Some experiments showed sensitivity to ambient temperature.
In light of this I decided it would be worthwhile trying to measure the accuracy and reliability of my benchmarking suite so that I can be more confident in my conclusions.
Setup
For this exercise I reset my Linux PC to stock BIOS settings and ran the entire test suite from start to finish 6 times in one day. The first session began early in the day when my office was quite cool. By the time I started running session 4 the sun had moved around and had significantly increased the temperature in my office (hotter than I would be comfortable working in). Before I started session 5, I ran my office air conditioner for 30 minutes to cool the room down to a comfortable temperature (23°C). For the 6th and final session, I kept the air conditioning running but also removed the case side panel to help try and reduce component temperatures even further.
This means that in general, Session 1 was "cool", Sessions 2-4 were in progressively hotter environments, Session 5 was "cool" again, and Session 6 possibly enjoyed the best cooling of all.
Some benchmarks such as mypy are executed 5 times with identical parameters. Other benchmarks like ripgrep are executed more than a dozen times, but with a different parameter each time.
Limitations
I hypothesize that some benchmarks may suffer from interference from the following:
- Internet-connected services consuming CPU resources or memory bandwidth.
- Background services that I cannot disable due to corporate policy (e.g., packagekit, falcon-sensor).
- Component temperatures affected by the previous benchmark (e.g., increased temperatures from a mypy run affecting whichever benchmark follows it).
Results
Mypy
Session | Mypy Duration (s) | Increase from Session 5 Run 3 |
|---|---|---|
Session 1 | 276.96 | +1.6% |
Session 1 | 277.17 | +1.6% (median) |
Session 1 | 277.94 | +1.9% |
Session 1 | 275.28 | +0.9% (fastest) |
Session 1 | 278.62 | +2.2% |
Session 2 | 277.27 | +1.7% (median) |
Session 2 | 279.10 | +2.3% |
Session 2 | 276.73 | +1.5% |
Session 2 | 275.45 | +1.0% (fastest) |
Session 2 | 279.38 | +2.4% |
Session 3 | 278.17 | +2.0% (median) |
Session 3 | 277.49 | +1.8% |
Session 3 | 279.51 | +2.5% |
Session 3 | 275.69 | +1.1% (fastest) |
Session 3 | 278.17 | +2.0% |
Session 4 | 278.04 | +2.0% |
Session 4 | 277.37 | +1.7% |
Session 4 | 278.47 | +2.1% |
Session 4 | 277.71 | +1.8% (median) |
Session 4 | 276.63 | +1.4% (fastest) |
Session 5 | 276.07 | +1.2% |
Session 5 | 277.87 | +1.9% |
Session 5 | 272.70 | +0.0% (fastest overall) |
Session 5 | 279.21 | +2.4% |
Session 5 | 277.38 | +1.7% (median) |
Session 6 | 276.93 | +1.5% |
Session 6 | 277.41 | +1.7% |
Session 6 | 278.15 | +2.0% |
Session 6 | 278.82 | +2.2% |
Session 6 | 273.92 | +0.4% |
When looking at the individual run results for each of the sessions, we notice some interesting patterns.
- The fastest run for each session is most commonly the fourth, sometimes the fifth or third, but never the first.
- Session 5 (the first with a cool environment) had the fastest run at 272.70, but the very next run of that session was one of the slowest at 279.21.
- Based on the above two points, it appears there is measurable interference that is not related to temperature.
- However, there also appears to be measurable interference from increased temperature, since Session 4 could only get within 1.4% of the fastest time.
- It seems possible that we would not reach the fastest possible time for a single configuration within 5 runs, and more may be required to hit the best possible time.
Even though Session 5 (cool) has a much better fastest time than Session 4 (hot), if I were only reporting the median time as I have been for previous experiments, Session 5's reported time would be +1.7% and Session 4's reported time would be +1.8%. Session 1 would have the best reported time at +1.6%. All of these reported times are more than 1.5% above the best known time for each configuration.
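The median-versus-fastest gap can be checked directly from the table above. A minimal sketch using the Session 4 and Session 5 mypy durations:

```python
from statistics import median

# Mypy durations (seconds) taken from the table above.
session_4 = [278.04, 277.37, 278.47, 277.71, 276.63]
session_5 = [276.07, 277.87, 272.70, 279.21, 277.38]

# 272.70 (Session 5, Run 3) is the fastest run overall.
best_overall = min(session_4 + session_5)

for name, runs in [("Session 4", session_4), ("Session 5", session_5)]:
    med, best = median(runs), min(runs)
    print(f"{name}: median +{(med / best_overall - 1) * 100:.1f}%, "
          f"fastest +{(best / best_overall - 1) * 100:.1f}%")
```

This reproduces the numbers in the table: Session 5's median (+1.7%) lands right next to Session 4's (+1.8%) even though its fastest run is 1.4 percentage points better.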
Pytest - Empty Test
Session | Duration (s) | Increase from Session 2 Run 5 |
|---|---|---|
Session 1 | 31.00 | +0.9% |
Session 1 | 30.94 | +0.7% (median) |
Session 1 | 30.84 | +0.4% |
Session 1 | 31.01 | +1.0% |
Session 1 | 30.83 | +0.4% (fastest) |
Session 2 | 30.95 | +0.7% (median) |
Session 2 | 31.17 | +1.5% |
Session 2 | 30.86 | +0.5% |
Session 2 | 30.95 | +0.8% |
Session 2 | 30.72 | +0.0% (fastest overall) |
Session 3 | 31.00 | +0.9% |
Session 3 | 30.86 | +0.5% (fastest) |
Session 3 | 30.91 | +0.6% (median) |
Session 3 | 30.96 | +0.8% |
Session 3 | 30.89 | +0.6% |
Session 4 | 30.85 | +0.5% |
Session 4 | 30.92 | +0.7% |
Session 4 | 30.78 | +0.2% (fastest) |
Session 4 | 30.90 | +0.6% |
Session 4 | 30.86 | +0.5% (median) |
Session 5 | 30.84 | +0.4% |
Session 5 | 31.10 | +1.3% |
Session 5 | 30.92 | +0.7% |
Session 5 | 30.79 | +0.2% (fastest) |
Session 5 | 30.86 | +0.5% (median)
Session 6 | 30.98 | +0.9% |
Session 6 | 30.85 | +0.4% |
Session 6 | 30.79 | +0.2% (fastest) |
Session 6 | 31.00 | +0.9% |
Session 6 | 30.86 | +0.5% (median) |
Unlike mypy, the pytest-empty benchmark appears unlikely to be affected much by poor cooling or any other kind of interference. For most sessions, the median time was within 0.3% of the fastest time. If I was reporting on median times, Sessions 4, 5 and 6 would tie for first place and Session 2's fastest overall time of 30.72 would be hidden as it tied for last place with Session 1.
It seems reasonable to conclude that this benchmark is accurate to within 1%.
Pytest - Single Tests
The pytest-single-test benchmark is a little different to the pytest-empty benchmark in that instead of 5 runs of the same task, it runs 110 times with a different parameter each time (each parameter is the name of an individual unit test suite). I won't show a full table of results here, but I did analyze the results and made the following observations:
- The smallest gap between the fastest "best time" and the slowest "best time" for any parameter was 0.29%.
- The largest gap was 1.88%.
- The average gap across all parameters was 0.96%.
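The per-parameter gap statistics above can be computed with a few lines of Python. A sketch, assuming the results are stored as a mapping from parameter name to each session's best time; the parameter names and values here are hypothetical, not my actual data:

```python
# Hypothetical data: for each parameter (a unit test suite name), the
# best time observed in each of the six sessions, in seconds.
best_times = {
    "test_foo": [1.52, 1.50, 1.53, 1.54, 1.51, 1.50],
    "test_bar": [2.10, 2.08, 2.12, 2.11, 2.09, 2.10],
}

gaps = []
for param, times in best_times.items():
    fastest, slowest = min(times), max(times)
    gaps.append((slowest / fastest - 1) * 100)  # gap as a percentage

print(f"smallest gap: {min(gaps):.2f}%")
print(f"largest gap:  {max(gaps):.2f}%")
print(f"average gap:  {sum(gaps) / len(gaps):.2f}%")
```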
Also, the following table shows how many times each session achieved the fastest "best time" or the slowest "best time" across the 110 parameters:
Session | Fastest | Slowest |
|---|---|---|
Session 1 | 10 | 18 |
Session 2 | 37 | 15 |
Session 3 | 3 | 26 |
Session 4 | 13 | 20 |
Session 5 | 29 | 18 |
Session 6 | 18 | 13 |
I'm trying to convince myself that Sessions 3 and 4 had fewer wins due to hot environments; however, Session 4 was the hottest yet fared substantially better than Session 3. Session 1 was also quite cool and should have had more fastest best-times if temperature were the dominating factor.
Ripgrep
The ripgrep benchmark is similar to the pytest-single-test benchmark in that it has a suite of parameters and it runs exactly once with each parameter. However, unlike pytest-single-test, the variation in execution time across my six calibration sessions is massive:
- The smallest gap between the fastest "best time" and the slowest "best time" was 3.61%.
- The largest gap was 10.36%.
- The average gap was 7.38%.
Here is the table showing how many times each session achieved the fastest "best time" or the slowest "best time" across the 16 parameters:
Session | Fastest | Slowest |
|---|---|---|
Session 1 | 1 | 2 |
Session 2 | 3 | 2 |
Session 3 | 5 | 2 |
Session 4 | 0 | 1 |
Session 5 | 0 | 6 |
Session 6 | 5 | 3 |
I don't see much evidence that this benchmark was affected by thermals; rather, there appears to have been some other kind of interference that disproportionately affected Session 5.
The margin of error on this benchmark is huge: a 7.38% average difference between best times means that, in its current form, this benchmark is useless for measuring most performance tweaks.
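To make that concrete: any measured improvement smaller than the benchmark's observed spread cannot be distinguished from noise. A rough sketch, treating the 7.38% average gap reported above as a simple noise floor:

```python
# Average best-time spread observed across the six sessions (%), from
# the ripgrep calibration results above, used here as a crude noise floor.
NOISE_FLOOR_PCT = 7.38

def is_distinguishable(improvement_pct: float) -> bool:
    """Treat a measured improvement as real only if it clearly exceeds
    the benchmark's observed session-to-session spread."""
    return improvement_pct > NOISE_FLOOR_PCT

print(is_distinguishable(4.0))   # a typical tweak of interest: lost in the noise
print(is_distinguishable(10.0))  # only very large effects are detectable
```

This is deliberately crude (a proper treatment would compare distributions across many runs), but it shows why a ~4% effect is invisible on this benchmark as it stands.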
Git Status
The Git Status benchmark is simple again like Mypy, but collects 10 samples instead of 5 since it is so fast.
This benchmark is also quite inaccurate. Some observations:
- Even though Session 2 Run 2 was the fastest, Session 2's median was 3.1% higher.
- Compared to that run, Session 1's best time was +2.2%, Session 3's +0.9%, Session 4's +2.4%, Session 5's +1.7%, and Session 6's +2.8%.
- The median time for each session was at least 0.7% higher than that session's best time, and often 1.4% higher or more.
Git Log
While this benchmark wasn't as bad as Git Status, Session 5's best time was 3.2% slower than the fastest "best time", and median times were often 1.4% or more above the best time.
Final Thoughts
I believe it would be a worthwhile exercise to run this experiment again with more iterations of each un-parameterized benchmark, to determine the probability of achieving a "fast" time for a given number of iterations, where "fast" is defined as being within, say, 1% of the fastest known time for the configuration. This is especially needed for fast benchmarks like Ripgrep and Git Status, which show a large amount of variance in execution times. Ideally I want to find a number of iterations that is 99% likely to get within 1% of the theoretical "best time" for a given configuration.
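One rough way to size that iteration count: assume each run independently lands within 1% of the best time with some probability p (which the re-run experiment would estimate). The chance that at least one of n runs is "fast" is 1 - (1 - p)^n, which can be solved for n directly. A sketch, with p as an assumed placeholder value rather than a measured one:

```python
import math

# Assumed probability that any single run lands within 1% of the best
# known time; in practice this would be estimated from calibration data.
p_fast = 0.3

target = 0.99  # want a 99% chance of at least one "fast" run

# P(at least one fast run in n runs) = 1 - (1 - p)^n >= target
#   => n >= log(1 - target) / log(1 - p)
n = math.ceil(math.log(1 - target) / math.log(1 - p_fast))
print(f"runs needed: {n}")
```

The independence assumption is questionable given the warm-up effect noted above (the first run is never the fastest), so the real number would likely need a few extra warm-up iterations on top.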
For the parameterized benchmarks I probably need to have fewer parameters, but execute each one multiple times to ensure I am getting close to the "best time" for each parameter.
Due to the consistent level of interference, it seems that reporting the fastest time for each configuration may be necessary to properly distinguish between better and worse configurations. The theoretical advantage of reporting the median time is that a configuration with consistently good run times will be reported as superior to a configuration that gets "lucky" with a single good run but has more slower runs; however, my benchmarks are so controlled that this is probably not applicable - the poorer run times are most likely due to interference.