Measuring Performance
- Peter
- Mar 27, 2024
- 4 min read
Updated: Mar 27, 2024
While the title of this blog is "Mac vs PC", it should be understood in the context of my specific work situation: I want to prove that a custom-built PC can be superior in my workplace.
What Questions am I Trying to Answer?
Before getting into benchmarks, it's useful to think about what questions I'm trying to answer, both for myself and for anyone I'm sharing the results with.
For myself, I would like to know:
In general, can a PC match the performance of my current M2 MBP, the next-gen M3 MBP, and what I assume will be an M4 MBP arriving later this year?
Are there any particular tools or scenarios where the PC or Mac stands out as superior?
Are there any tools or scenarios where the PC or Mac is vastly inferior?
In fine-grained detail, which PC components have the biggest impact on performance? (This is to help identify which components are worth spending extra money on for higher-performance parts.)
Other people would probably want to know the answers to these questions:
Is the PC overall substantially faster than the Mac?
Is the MBP a very fast work computer, or just a fast laptop?
Is performance a compelling reason to switch to a different device?
Are there specific scenarios or workloads where the Mac is vastly inferior?
Can the Mac's performance be noticeably improved by getting a higher-spec'd device (e.g. more RAM)?
Therefore, I will need to find or construct a tool that can run a suite of benchmarks using my workplace's codebase and developer tools, and that makes it easy to compare performance between different machines or machine configurations, across a variety of tools.
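As a rough sketch of what the core of such a tool might look like (the benchmark name and command below are placeholders, not my workplace's real tooling), each benchmark can be a named shell command that gets timed over several runs:

```python
import statistics
import subprocess
import sys
import time

def run_benchmark(name, command, runs=5):
    """Time a shell command over several runs and report summary statistics."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(command, capture_output=True, check=True)
        timings.append(time.perf_counter() - start)
    return {
        "name": name,
        "mean_s": statistics.mean(timings),
        "stdev_s": statistics.stdev(timings) if runs > 1 else 0.0,
    }

# Time a trivial command as a smoke test; a real suite would invoke
# mypy, pytest, ripgrep, etc. against the monorepo.
result = run_benchmark("noop", [sys.executable, "-c", "pass"], runs=3)
print(result["name"], result["mean_s"] > 0)
```

Recording mean and standard deviation per benchmark is what makes later cross-machine comparison meaningful, since single runs are noisy.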
What to Benchmark
The benchmarking suite needs to include individual measurements for as many as possible of the following:
Running mypy over the entire codebase. (Unfortunately, the mypy daemon is very unreliable, and caching doesn't work due to some of our custom plugins.)
Running single unit tests. This is the bread and butter of writing Python code here, and the state of affairs right now is that running a single test takes 15-30 seconds, meaning you can test at most two changes per minute. Ouch. Ideally I want to pick a number of "single tests" and measure all of them, so that my benchmark isn't skewed by someone else making an unrelated change to one test.
Running test suites. This is not something I currently do, but it would present immense value if I could run relevant test suites before opening a PR to ensure that I haven't introduced an unexpected test failure.
Starting/Hot-Reloading the Python web app: I'm not sure if this is a symptom of Django or just an unavoidable part of having such a massive Python monolith, but starting or hot-reloading our web app backend takes 20 seconds or more. (By the way, this is atrocious when you consider that a massive TypeScript app can hot-reload in a few seconds, allowing for much greater engineer productivity.)
Starting/Hot-Reloading the JS frontend: Our product has a relatively small frontend due to the industry we're working in, but it would still be nice to compare the difference here. Especially because JS compilation may be able to take better advantage of a PC's higher CPU thread count.
Ripgrep. By now it is very common for developers to use ripgrep for code searching, so I want to know how quickly ripgrep can respond to some representative queries.
git status/log/checkout. We have a big git monorepo - big in the sense that it's bigger than the average open source project. It contains millions of lines of source code and often feels sluggish, so it would be good to know how much the PC is helping to make the repo feel lightweight again.
PyLSP Go To Definition. Similar to ripgrep, I often use this feature to navigate through the codebase, so I want to know whether this action is faster.
Vim - Go to Tag. This is one of my other primary code-navigation tools. I use Gutentags to keep my tags file up-to-date, which happens in the background automatically, so performance isn't an issue there. But because our codebase is quite massive, the tags file ends up being 100MB, and jumping to a tag definition is sometimes noticeably less than "instant".
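The measurements above could be encoded as a simple table of named commands that the harness iterates over. Every path, test name, and query below is an illustrative guess, not the real monorepo's:

```python
# Hypothetical benchmark definitions; each entry maps a benchmark name to
# the command the harness would time. All values here are placeholders.
BENCHMARKS = {
    "mypy-full": ["mypy", "."],
    "unit-test-single": ["pytest", "tests/test_example.py::test_one"],
    "ripgrep-query": ["rg", "-n", "SomeClassName", "."],
    "git-status": ["git", "status"],
}

for name, cmd in BENCHMARKS.items():
    print(f"{name}: {' '.join(cmd)}")
```

Keeping the definitions as data rather than code makes it easy to add or retire benchmarks as the tooling changes.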
Ideally, it will also provide a reproducible mechanism for adding "typical load" to the device, so that the measurements are closer to real-world performance, where someone will have many browser tabs open, the Slack app, and in many cases a music player such as Spotify.
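One way to approximate that "typical load" reproducibly (rather than literally opening browser tabs and Spotify) is to run a few CPU-burning worker processes in the background while the benchmarks execute. This is only a crude stand-in for real application load, and the worker/duration parameters are my own invention:

```python
import multiprocessing
import time

def _busy_worker(duration):
    """Burn CPU for `duration` seconds to simulate background load."""
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        sum(i * i for i in range(1000))

def with_background_load(duration, workers=2):
    """Start `workers` CPU-burning processes; the caller joins them when done."""
    procs = [multiprocessing.Process(target=_busy_worker, args=(duration,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    return procs

if __name__ == "__main__":
    procs = with_background_load(0.2, workers=2)
    # ... run the benchmark suite here while the load is active ...
    for p in procs:
        p.join()
    print("background load finished")
```

Real apps also consume RAM and I/O, not just CPU, so a fuller simulation would add memory pressure too; but even this keeps the "loaded" condition identical across machines.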
Analyzing Results
Finally, the benchmarking tool needs to make it easy to compare results across two vectors:
Two different devices or device configurations running the same version of the benchmark suite (same version of our monorepo).
The same device running two different versions of the benchmark suite (so that I can update the benchmark suite periodically and get a picture of how tool performance has shifted over time).
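Along either vector, the comparison itself reduces to computing relative timings between two result sets. A minimal sketch, with made-up numbers purely to show the shape of the output:

```python
def compare(baseline, candidate):
    """Ratio of candidate time to baseline time per benchmark.

    Values below 1.0 mean the candidate run was faster.
    """
    return {name: candidate[name] / baseline[name]
            for name in baseline if name in candidate}

# Illustrative numbers only, not real measurements.
mac_times = {"mypy-full": 120.0, "git-status": 1.5}
pc_times = {"mypy-full": 90.0, "git-status": 1.0}
print(compare(mac_times, pc_times))
```

The same function serves both comparison vectors: pass in two devices' results for the same suite version, or one device's results from two suite versions.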
Off-The-Shelf Tools Considered
Hyperfine is a nifty little tool for benchmarking CLI commands, and makes it easy to perform multiple runs, remove interference from caches, or parameterize the command (test with a range of inputs). However, it doesn't achieve the kind of macro-benchmarking that I need for this project (multiple distinct benchmarks). It is likely, however, that I will make use of its standalone scripts for generating charts.
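Even if hyperfine can't drive the whole suite, its `--export-json` output is easy to consume from a wrapper script. A sketch of pulling mean timings out of an export (the embedded JSON mimics hyperfine's export shape, with made-up values):

```python
import json

# A payload in the shape of hyperfine's --export-json output;
# the command and timings are made up for illustration.
EXPORT = """
{"results": [
  {"command": "rg -n SomeClassName .", "mean": 0.42, "stddev": 0.03}
]}
"""

def mean_times(export_json):
    """Map each benchmarked command to its mean runtime in seconds."""
    data = json.loads(export_json)
    return {r["command"]: r["mean"] for r in data["results"]}

print(mean_times(EXPORT))
```

This suggests a middle ground: let hyperfine handle the per-command runs and statistics, and have the macro-benchmark tool orchestrate it and aggregate the JSON exports.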