Over the last few weeks I’ve been writing a series on Optimizing Elixir code for my Game of Life. Actually, so far I’ve only written about tools - available profiling tools and the tools I’m building for Comparing Benchmark Results. It’s time to finally start applying these tools and optimizing the code!
Integrate bmark
The first thing I did was to replace the lifebench and lifebench.cmp tasks with my standalone bmark tool. lifebench and bmark are similar, but I’ve made a few improvements to the reporting in bmark that I want to incorporate. I added bmark to my mix.exs file. At the moment it references bmark via a local path, but once I fix up a few remaining issues I’ll put it on github.
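Roughly, the dependency entry in mix.exs looks like this - the ../bmark path is my guess at a typical local checkout location, so treat it as a placeholder:
defp deps do
  [
    # Local path dependency for now; this becomes a github dependency
    # once bmark is published.
    {:bmark, path: "../bmark"}
  ]
end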
I added a bmark-based benchmark using the existing Profile module. For new code I would forgo the Profile module and write the benchmarking code directly in the _bmark.ex file. Here’s my bmark:
defmodule LifeBench do
  use Bmark

  bmark :glider do
    Profile.run_test
  end
end
After writing this I uncovered the first of several bugs in bmark. It’s good to run it through a somewhat realistic case like my Game of Life. A GenServer call times out because the benchmark takes too long to run. After a quick fix I run into the second bug: I need to create a directory named results to store the results file. If the directory doesn’t exist the benchmark is run but the results are lost. Another quick fix and I’m on my way.
Baseline
Now that I have bmark ready I can start measuring. I want to measure in the production environment so I start by setting
export MIX_ENV=prod
in my shell. You can assume I’m running production from now on.
I collect baseline results by running mix bmark:
lifebench.glider.results
15050451
15282246
15095683
15049299
15292709
15018099
15009078
15266227
15146941
15111238
I want to save these in a base directory. Hmm, I need to manually copy results to base, and I’ll need to do this copy every time. For the moment I choose to do it manually, but I’ll want to automate this soon as it will be tedious and error prone.
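For now the manual step is just a copy in the shell, something like:
$ cp -r results base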
Consolidating Protocols in Elixir
Several weeks ago Johan Wärlander suggested that I consolidate the protocols used in my Game of Life implementation in order to reduce time spent in functions like 'Elixir.Code':'ensure_compiled?'. He points out that protocol consolidation should be a part of preparing an Elixir application for a production release. Before Johan’s comment I hadn’t heard of protocol consolidation, so I read up a bit on it. José Valim, in a github issue, described the process in terms of generating fast dispatches (without lookups) when all protocol implementations are known. This works well in the case of packaging a release. Protocol consolidation can be invoked by mix as
$ mix compile.protocols
which writes out new binary files containing the consolidated protocols. But, this requires that the application be run as
$ elixir -pa _build/MIX_ENV/consolidated -S mix run
to set the load path to include the consolidated protocol files. I consolidated the protocols and reran the benchmark:
lifebench.glider.results: lifebench.glider.results:
15050451 14887006
15282246 15028554
15095683 14762181
15049299 14905275
15292709 14577768
15018099 14883831
15009078 14795362
15266227 14897746
15146941 14763634
15111238 14904190
15132197.1 -> 14840554.7 (-1.93%) with p < 0.0005
t = 5.60108220481427, 18 degrees of freedom
Another problem with bmark: I use the basename of the results files for the header in the report. But, in my setup with separate results directories the files have the same name and so I can’t tell the two columns apart. Another fix to bmark and I have:
base/lifebench.glider.results: consolidate/lifebench.glider.results:
15050451 14887006
15282246 15028554
15095683 14762181
15049299 14905275
15292709 14577768
15018099 14883831
15009078 14795362
15266227 14897746
15146941 14763634
15111238 14904190
15132197.1 -> 14840554.7 (-1.93%) with p < 0.0005
t = 5.60108220481427, 18 degrees of freedom
Ok, so the report looks nice and this is a performance win! Protocol consolidation has lowered the run time by almost 2%.
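As an aside, the report’s statistics are consistent with a standard two-sample pooled Student’s t-test (my assumption about what bmark computes): with two runs of 10 samples each the degrees of freedom are n1 + n2 - 2 = 10 + 10 - 2 = 18, and t = (mean1 - mean2) / (s_p * sqrt(1/n1 + 1/n2)), where s_p is the pooled standard deviation.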
Going forward I need to continue to run with protocol consolidation, but this isn’t a code change so it doesn’t “stick”. I need to make sure that I
- Remember to compile the protocols before each profile run
- Remember to run with the right command line to enable the use of the consolidated protocols.
This is a lot to remember, so I’d better write myself a script to do it.
#!/bin/sh
if [ -z "$1" ]; then
    echo "Usage: bmark.sh <name>"
    exit 1
fi

export MIX_ENV=prod
mkdir -p results
mix compile
mix compile.protocols
elixir -pa _build/prod/consolidated/ -S mix bmark
mv results $1
I also decided to go ahead and save the results directory under a name given as a command line argument. And I set the production environment in the script just in case I start a new shell and forget to set MIX_ENV.
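A run for, say, the consolidation experiment then looks like this (the directory name is just whatever label I pass in):
$ ./bmark.sh consolidate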
With my new script ready I can move ahead to the next optimization.
Count Live Neighbors
Next we profile with ExProf:
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- --- ---- [----------]
'Elixir.Access':get/2 419073 0.53 85973 [ 0.21]
'Elixir.Enumerable.List':reduce/3 657360 0.54 86879 [ 0.13]
'Elixir.Access':'impl_for!'/1 419073 0.57 91691 [ 0.22]
'Elixir.Life':'-list_of_neighbors/3-fun-0-'/6 369765 0.60 96884 [ 0.26]
code_server:call/2 419077 1.59 256773 [ 0.61]
maps:put/3 49302 12.09 1949883 [ 39.55]
'Elixir.Access':impl_for/1 419073 14.65 2363776 [ 5.64]
maps:find/2 419067 63.10 10178202 [ 24.29]
I see that the profile hasn’t really changed that much. There is probably less runtime compilation but I still see time spent in maps:find/2. My first question is what is maps:find/2? It comes from Erlang, but when is it used? As a guess, I look at Elixir.Access for Map and find the get function. The get function is what is used to implement the [] operator on Map (or anything that implements the Access protocol). I see that Elixir.Access.get for Map calls :maps.find(key, map) to do the Map lookup. So, we know that we are spending our time looking up values from Maps.
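Roughly, the Map implementation of the Access protocol looks like this - a sketch from memory of the Elixir source at the time, not a verbatim copy:
defimpl Access, for: Map do
  def get(map, key) do
    # The [] operator on a Map boils down to an Erlang :maps.find/2 call
    case :maps.find(key, map) do
      {:ok, value} -> value
      :error -> nil
    end
  end
end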
Also, I see 'Elixir.Life':'-list_of_neighbors/3-fun-0-'/6 in the profile, which I know is part of count_live_neighbors/3, and I know that this path does a lot of map lookups. I decided to try optimizing this code a little. In hindsight the optimizations I tried didn’t make sense, but I’ll show them here and show how the bmark.cmp process can give a more formal way of making the determination - remember that code performance can often be hard to predict or reason about without hard information.
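For reference, the comparison reports in this post come from bmark’s compare task; I believe the invocation is just the two results files, along the lines of:
$ mix bmark.cmp base/lifebench.glider.results consolidate/lifebench.glider.results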
Looking at list_of_neighbors/3, I wonder how fast the for comprehension is. I could try rewriting the function using a module attribute instead of generating the dx, dy values with the comprehension.
I rewrote this:
defp list_of_neighbors(x, y, board) do
  for dx <- [-1, 0, 1], dy <- [-1, 0, 1], {dx, dy} != {0, 0}, do: board[{x + dx, y + dy}]
end
as
@deltas [
  {-1, -1}, { 0, -1}, { 1, -1},
  {-1,  0},           { 1,  0},
  {-1,  1}, { 0,  1}, { 1,  1}
]

defp list_of_neighbors(x, y, board) do
  Enum.map(@deltas, fn ({dx, dy}) -> board[{x + dx, y + dy}] end)
end
and then measured the results:
consolidate/lifebench.glider.results: precomputed/lifebench.glider.results:
14887006 14642711
15028554 14897523
14762181 14698632
14905275 14978539
14577768 14775842
14883831 14751767
14795362 14859778
14897746 14872981
14763634 14648105
14904190 14634486
14840554.7 -> 14776036.4 (-0.44%) with p < 1
t = 1.1845853684535732, 18 degrees of freedom
Sadly, this is not a win. I decided to take this one step further and try to reduce the work in count_live_neighbors/3 by taking the several pipeline steps and combining them into a single fold operation. I started with:
defp count_live_neighbors(x, y, board) do
  list_of_neighbors(x, y, board)
  |> Enum.map(&state_as_int/1)
  |> Enum.sum
end

defp list_of_neighbors(x, y, board) do
  Enum.map(@deltas, fn ({dx, dy}) -> board[{x + dx, y + dy}] end)
end
and rewrote it as:
defp count_live_neighbors(x, y, board) do
  List.foldl(@deltas, 0, fn ({dx, dy}, acc) ->
    acc + (board[{x + dx, y + dy}] |> state_as_int)
  end)
end
I measured again:
consolidate/lifebench.glider.results: foldl/lifebench.glider.results:
14887006 14742553
15028554 14879462
14762181 14712122
14905275 14925838
14577768 14772430
14883831 14604655
14795362 14682689
14897746 14799597
14763634 14634907
14904190 14979943
14840554.7 -> 14773419.6 (-0.45%) with p < 1
t = 1.220154927852377, 18 degrees of freedom
But still no win. Was it right to push through the first failed optimization this way? I think it was; sometimes it can lead to better ideas or a code structure that lends itself to further optimization. But in this case it just didn’t pay off. I discarded these changes.
As I said, in hindsight the optimizations I tried didn’t make sense. This is because I didn’t change the number of calls to maps:find/2, which clearly dominates the profile. Instead, I was only changing the structure of the code around these calls. But, I find that kind of insight hard to see at the outset.
Concurrent Computation
Elixir is all about concurrency so I had been itching to try out a parallel version. I was pretty confident that a concurrent version would see a benefit from parallel execution on my 4 core system. But, I had wanted to save this optimization until the end. I didn’t envision changing too much in converting to a concurrent version and wanted to make sure I was starting with the best version of the base code.
My version of The Game of Life has a Board module, and that module has a map function which I use to apply updates to each cell in the game. Building a concurrent version of the Game of Life was a simple matter of changing Board.map into a parallel map. Here’s the original version:
def map(board, f) do
  board
  |> Map.keys
  |> Enum.map(fn (key) -> { key, f.(key, board[key]) } end)
  |> List.foldr(Map.new, fn ({key, value}, acc) -> Map.put(acc, key, value) end)
end
The board itself is stored as a map. I use the existing Enum.map by creating a List of all the keys in my map, then map over the list, and then consolidate the results back into a new Map using List.foldr. I wrote up a naive concurrent version starting a process for each invocation of the function f:
def map(board, f) do
  board
  |> Map.keys
  |> pmap(fn (key) -> { key, f.(key, board[key]) } end)
  |> List.foldr(Map.new, fn ({key, value}, acc) -> Map.put(acc, key, value) end)
end

def pmap(list, f) do
  list
  |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
  |> Enum.map(fn (task) -> Task.await(task) end)
end
Here I’ve replaced the use of Enum.map with my own pmap function that uses Task.async to run each function invocation in a separate process.
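As a quick sanity check, pmap should behave just like Enum.map - results come back in order because the tasks are awaited in the order they were started. A hypothetical session:
iex> pmap([1, 2, 3], fn (x) -> x * x end)
[1, 4, 9]
Benchmarking the naive version gives: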
consolidate/lifebench.glider.results: naive-pmap/lifebench.glider.results:
14887006 26889439
15028554 27339230
14762181 27067765
14905275 27026322
14577768 26777941
14883831 27135888
14795362 26867260
14897746 27028423
14763634 27354742
14904190 26881294
14840554.7 -> 27036830.4 (+82.18%) with p < 0.0005
t = 167.14386999331413, 18 degrees of freedom
This is a huge loss! Process creation in Erlang is supposed to be cheap, but based on the results I think that the balance of work between the function f and the process creation in Task.async is off. That f, which would be Life.apply_rules/3, does very little work compared to Task.async. While in absolute terms Task.async may be cheap, it is relatively expensive here and the cost outweighs the benefit.
But, still, a concurrent version should be able to win. I try restructuring a bit and run the pmap in chunks of 8:
def pmap(list, f) do
  list
  |> Enum.chunk(8, 8, [])
  |> Enum.flat_map(fn (chunk) ->
    chunk
    |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
    |> Enum.map(fn (task) -> Task.await(task) end)
  end)
end
consolidate/lifebench.glider.results: chunk-pmap/lifebench.glider.results:
14887006 24274268
15028554 24563751
14762181 24492221
14905275 24516553
14577768 24335224
14883831 24158102
14795362 24357174
14897746 24213098
14763634 24466586
14904190 24289248
14840554.7 -> 24366622.5 (+64.19%) with p < 0.0005
t = 163.96384320448385, 18 degrees of freedom
This looks like it is slightly less of a huge loss than the previous version, which I think means I still don’t have the balance right. So I increase the chunk size to 128:
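The only change from the previous version is the chunk size - a sketch, assuming everything else stays the same:
def pmap(list, f) do
  list
  |> Enum.chunk(128, 128, [])
  |> Enum.flat_map(fn (chunk) ->
    chunk
    |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
    |> Enum.map(fn (task) -> Task.await(task) end)
  end)
end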
consolidate/lifebench.glider.results: chunk-128-pmap/lifebench.glider.results:
14887006 6426990
15028554 6416149
14762181 6507946
14905275 6453309
14577768 6491314
14883831 6405073
14795362 6504260
14897746 6449789
14763634 6532929
14904190 6509800
14840554.7 -> 6469755.9 (-56.41%) with p < 0.0005
t = 203.48398714422547, 18 degrees of freedom
And finally we have our win!
Next steps
First of all, there are a few more things I want to clean up in bmark. I’m thinking that bmark should handle the protocol consolidation for the user. It doesn’t make sense to have to do this manually, though perhaps this could be controlled by a configuration option. There are also some other larger ideas that I will discuss in a future post.
Next week, I plan to continue with the profiling and optimization process.