Last week I built a process for comparing benchmark results. This week I applied that system to my Game of Life implementation. I must admit that I had already done some of the optimization work before putting this tool together. Not having a process for deciding whether an optimization was really a win was the motivation for going back and building the tools. So, over the last week I automated my process for benchmark comparison and started testing my so-called optimizations.
Automating the Collection Process With Mix
If I was going to start optimizing I needed a convenient way to run the benchmark and collect the results. I decided to add a Mix task. I started by writing a simple task named lifebench to run my benchmark 10 times and write the results to a file. You can read over the code on github. I didn’t use the simpler name bench as it conflicts with the wonderful benchfella task, which I already include in my project.
At the time I included my lifebench task directly in my Game of Life project. This certainly wasn’t the right place for it; I did it for simplicity while working out the details. The next step will be to extract the code and generalize it into either a new package, or potentially integrate it into benchfella. I treated this work as a development spike. Once I’ve worked out the details I can go back and write a cleaner version.
Writing the lifebench task was a good learning experience. The actual mechanics of running a function 10 times and writing the data to a file were straightforward, but I learned a few things about Mix tasks along the way, including how to compile the project before starting the task.
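As a rough sketch of the shape such a task takes (the Life.run_benchmark function and the module internals here are illustrative assumptions, not the actual code — see the github repo for that):

```elixir
defmodule Mix.Tasks.Lifebench do
  use Mix.Task

  @shortdoc "Run the Game of Life benchmark and record the timings"
  @runs 10

  def run([path]) do
    # Compile and start the project so we benchmark the latest code.
    Mix.Task.run("app.start")

    times =
      for _ <- 1..@runs do
        # :timer.tc returns {microseconds, result}; keep only the time.
        {usec, _result} = :timer.tc(fn -> Life.run_benchmark() end)
        usec
      end

    # One line of space-separated run times, as in the results files below.
    File.write!(path, Enum.join(times, " ") <> "\n")
  end
end
```

Invoking `mix lifebench notes/base` would then append a results file like the one shown below.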
When I was done I could collect simple results files like this:
14920470 15010309 14921889 15186957 14811559 14783809 15121291 14931534 14929569 15176096
The file simply contains the time for each run of the benchmark. Later I may need more data, but for my purposes this was enough.
Automating the Comparison Process With Mix
After automating collection I was able to gather results, but I still needed to automate the comparison process. Last week I built the process for comparing benchmark results as a spreadsheet, but using it was a little cumbersome. It meant collecting results and pasting them into a spreadsheet. Including the results here in a blog post meant capturing a screenshot. Including the results in my notes was an even bigger hassle. I wanted an easy-to-use, plain-text solution to lower the overhead of benchmarking, comparing results, and writing about the process.
I wrote another Mix task, this time named lifebench.cmp, which you can find in the github repo. It takes two of the result files produced by lifebench, compares them, and reports on the statistical significance. This task implemented the process I described last week.
This was a fun task to write, as it was a challenge to pose the comparison as a parse-and-transform style of execution. But I was very happy with the end result.
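The core computation can be sketched as follows — this is a simplified illustration, not the actual lifebench.cmp code: parse each results file, then compute a pooled two-sample t statistic with n1 + n2 - 2 degrees of freedom (18 for two runs of 10):

```elixir
defmodule BenchCompare do
  # Parse a results file produced by lifebench:
  # whitespace-separated run times in microseconds.
  def read_times(path) do
    path
    |> File.read!()
    |> String.split()
    |> Enum.map(&String.to_integer/1)
  end

  def mean(xs), do: Enum.sum(xs) / length(xs)

  # Corrected (n - 1) sample variance.
  def variance(xs) do
    m = mean(xs)
    Enum.sum(Enum.map(xs, fn x -> (x - m) * (x - m) end)) / (length(xs) - 1)
  end

  # Pooled two-sample t statistic; degrees of freedom = n1 + n2 - 2.
  def t_statistic(xs, ys) do
    {n1, n2} = {length(xs), length(ys)}
    pooled =
      ((n1 - 1) * variance(xs) + (n2 - 1) * variance(ys)) / (n1 + n2 - 2)

    (mean(xs) - mean(ys)) / :math.sqrt(pooled * (1 / n1 + 1 / n2))
  end
end
```

Run against the two sample sets shown below, this produces the same t statistic the task reports.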
I used the statistics package for some of the stats, though I found that I had to reimplement some things in order to get accurate results.
Statistics.stdev, when used on samples from a population, computes the uncorrected sample standard deviation, while I needed the corrected version. I was able to write the following implementation of the corrected sample standard deviation in Elixir:
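A sketch of that computation — dividing the summed squared deviations by n - 1 rather than n (Bessel's correction) — written pipeline-style:

```elixir
defmodule Stats do
  # Corrected sample standard deviation: divide by n - 1, not n.
  def corrected_stdev(samples) do
    n = length(samples)
    mean = Enum.sum(samples) / n

    samples
    |> Enum.map(fn x -> (x - mean) * (x - mean) end)
    |> Enum.sum()
    |> Kernel./(n - 1)
    |> :math.sqrt()
  end
end
```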
I’m still trying to decide if I like this style of using |> inside longer expressions.
The Statistics.Distributions module includes the t-distribution; however, the comments say that it is an approximation. I found that it was a bit too inaccurate for my needs. For example, one call to the CDF returned 1.0044. But this should have been the probability that a random sample from the t-distribution with 18 degrees of freedom is less than 3.057950289696126. A probability of 1.0044? That just can’t be.
Now, I can’t do any better at implementing the CDF for the t-distribution than Max Sharples has done in the Statistics module. As he points out in the comments, better math support is required to implement the CDF properly.
However, I could simply automate the lookup of critical t values from a table. The tables are reasonably accurate, and the critical values are all I really need for my purposes. I quickly put this together by importing a table and writing a set of functions to search through it. When this was done, it worked just fine for
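A sketch of how such a lookup can work, using a few rows of the standard one-tailed critical-value table (the module name and the subset of entries here are illustrative; the real task embeds a fuller table):

```elixir
defmodule TTable do
  # degrees of freedom => %{p value => critical t} (one-tailed).
  @table %{
    17 => %{0.05 => 1.740, 0.01 => 2.567, 0.005 => 2.898},
    18 => %{0.05 => 1.734, 0.01 => 2.552, 0.005 => 2.878},
    19 => %{0.05 => 1.729, 0.01 => 2.539, 0.005 => 2.861}
  }

  # Return the smallest tabulated p whose critical value the observed
  # t statistic exceeds, or nil if even p = 0.05 is not reached.
  def p_bound(t, df) do
    @table
    |> Map.fetch!(df)
    |> Enum.sort_by(fn {p, _crit} -> p end)
    |> Enum.find_value(fn {p, crit} -> if abs(t) > crit, do: p end)
  end
end
```

With t = 2.901 at 18 degrees of freedom this reports p < 0.005, matching the first comparison below; the 0.900 from the second comparison exceeds no critical value at all.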
Comparing Benchmark Results
With my tasks put together I could finally compare benchmark results. I collected a couple of different results based on some optimizations I was working on (more details later). Here’s an example of a successful optimization.
base:       consolidated protocols:
14920470    14693408
15010309    14795711
14921889    14649470
15186957    14683630
14811559    14904153
14783809    14985803
15121291    14870837
14931534    14933960
14929569    14834048
15176096    14776376

$ mix lifebench.cmp notes/base notes/consolidated
14979348.3 -> 14812739.6 (-1.11%) with p < 0.005
t = 2.9010263661034448, 18 degrees of freedom
Consolidated protocols are a statistically significant 1.11% improvement in runtime. That is, it’s a win.
I tested another potential optimization:
base:       precomputed:
14920470    15451593
15010309    15004750
14921889    14722077
15186957    14619587
14811559    14962463
14783809    14758489
15121291    14815883
14931534    15061693
14929569    14875596
15176096    14731563

$ mix lifebench.cmp notes/base-ce19ac6 notes/precomputed
14979348.3 -> 14900369.4 (-0.53%) with p < 1
t = 0.9003947139724158, 18 degrees of freedom
Here the statistics do not support a significant change in performance from this code change, so I discarded the optimization.
Next Week’s Post
Next week I’ll discuss the optimizations I performed in more detail. I’ll continue to use the lifebench and lifebench.cmp tasks in order to measure and compare the results accurately.
I’m also working on extracting lifebench and lifebench.cmp into standalone Mix tasks that you and I can use with other projects. I’ll post an announcement and link once they are ready.