Comparing Benchmark Results with Elixir

In my last two posts I discussed the tools available to profile Elixir code using my implementation of the Game of Life. I surveyed Elixir profiling tools in order to direct optimization of my Game of Life program. The profilers will tell me which parts of my program to look at so that I can make informed changes to the program. But, in order to evaluate the changes I need to run my program, as a benchmark, and evaluate the change in performance.

Comparing Benchmark Results

This is a topic that I have struggled with for years in my professional life. I run a benchmark several times and get varying results. Then, I make a change to my program and I want to decide if the change causes an improvement in the benchmark score. I rerun the benchmark several times and again get varying results. How do I compare these results? I can compare means, but is that accurate? How do I tell if the mean of the second run is large enough to be meaningful? How do I know if it is “in the noise?”

Here’s an example, I ran my game of life program 5 times like this

mix run -e "Profile.run_test"

and recorded the following run times

This kind of variation has always confused me.

I started digging into the topic of benchmarking and comparing results. I was inspired by Aysylu Greenberg’s Strange Loop 2014 talk “Benchmarking: You’re Doing It Wrong”. The thing in this talk the struck me the most in this talk was the idea that I should run the benchmark enough times for the mean to stabilize. I thought this would solve all my problems and started to try it out. I ended up using the benchfella Mix task (more on that in a future post). I added code to benchfella compute the means of all runs and keep running until the mean stabilized. However, I found that I still two runs would stabilize to two different means without changing anything in my program! That I was still stuck with variation and mean stabilization wasn’t enough.

Using Statistical Hypothesis testing

In order to implement the mean stabilization I had to brush up a bit on statistics. And after I found that mean stabilization didn’t solve all my problems I continued looking at other statistical methods to help handle this situation.

At first, I had trouble looking specifically for information about benchmarking but then I changed my search. I looked for statistical methods for determining if a change in behavior is significant. This search turned up basic hypothesis testing. I read a few articles on the topic and concluded that I should be able to use this for comparing benchmark results. After some more searching I decided to use Student’s t-test to compare results. I followed ‘Student’s’ t Test (For Independent Samples) as a guide.

The article, and other statistics literature, describes data in terms of samples from populations. In my case a population coresponded to a version of the program or benchmark. I considered two populations at a time. That is the base version of benchmark and a modified version, the version cotaining the optimizations I wanted to evaluate. Samples are the measurements taken from each run.

I’m going to skip over most of the theory and focus on the practical applications. But if you’re interested in the theory there are many fine resources available if you search for them.

So, I started by running the base version a few times and collecting the time to run. Then, I applied the optimization and collected a few more samples. When I automate this process I hope to have a more precise definition of the number of times to run.

Given the sample measurements from each version of the benchmark I could compute some simple statistics like the sample means and standard deviations. Given these values I was able to compute a statistic called t. The t statistic needs to be interpreted by looking its value up in a table such as T Distribution Critical Values Table. I looked for a row in the table that matched the degrees of freedom of my experiment. The degrees of freedom are defined by the number of total samples taken from all populations minus 2. That is, if I took 5 samples from the base version and 6 samples from the modified version then THE degrees of freedom would be

        5 + 6 - 2 = 9

and I would look at the row 9 in the table. In row 9 I found a list of values and searched for the last value in the row that is less than or equal to our t value. So, for example if I had

        t = 4.3

I would select the column labeled 0.001. Which would mean I have p < 0.001 or that I have 99.9% statistical confidence that the difference in means of the two populations is statistically signicant.

Puting it all together

I decided to start with a simple spreadsheet to calculate the statistics. If it pans out I will implement Mix tasks (possible as part of benchfella) to automate the process. You can download and review my spreadsheet template here. You can use it by filling your sample data into columns A and C. Here’s an image showing a filled in spreadsheet:

Glider t test

Next I decided toto test out the process and the spreadsheet in order check that it matched my expectations. I setup two tests:

Compare two sets of runs of the same version of the benchmark. I epxected no statistically significant difference between the two runs.
Compare two different versions of the benchmark. One version is the base version and the second version is modified in a way in which we can reason that it should be faster. I expected a statistically significant difference between the two runs.

The example in the image above shows the results from test 1. I collected two sets of runs of the base version of the glider test. In column A I put the results from 17 runs of base version. Then in column B I put results from another 20 runs. The spreadsheet computed the values t and df and I was able to look up p from the T Distribution Critical Values Table. I found that p > 0.05 which means I can have no confidence that the difference in means between column A and column B is statistically significant. This matched my expectation for test 1.

Next I modified the benchmark with a very simple change to be faster. I wanted to make sure I could reason about the change and be sure that the benchmark would be faster. I modified the benchmark to run only 4 generations for the Game of Life instead of the 5 generations run in the base version. This does 20% less work and therefore we expect it to run faster. I ran 13 runs of this version of the benchmark and filled in the spreadsheet to compute the comparison:

Glider t test

The results show the values for t and df which imply p < 0.001 meaning that I could assume at least 99.9% confidence that the difference between the means was significant. This matched my expectations for test 2.

At this point I had put together a process for comparing benchmark runs and I had put it though a two basic tests.

Next Steps

In my next Learning Elixir post I will discuss ways of automating this process using Mix tasks. Then, I’ll work through a series of optimizations of Elixir code, using the process to evaluate performance at each step.

Joseph Kain

Comparing Benchmark Results with Elixir