Over the last few weeks I’ve been writing a series on Optimizing Elixir code for my Game of Life. Actually, so far I’ve only written about tools - available profiling tools and the tools I’m building for Comparing Benchmark Results. It’s time to finally start applying these tools and optimizing the code!
Integrate bmark
The first thing I did was to replace the lifebench and lifebench.cmp tasks with my standalone bmark tool. lifebench and bmark are similar, but I’ve made a few improvements to the reporting in bmark that I want to incorporate. I added bmark to my mix.exs file. At the moment it references bmark via a local path, but once I fix up a few remaining issues I’ll put it on github.
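Roughly, the dependency entry in mix.exs looks like this - the ../bmark path is my guess at a typical local checkout location, so treat it as a placeholder:
defp deps do
  [
    # Local path dependency for now; this becomes a github dependency
    # once bmark is published.
    {:bmark, path: "../bmark"}
  ]
end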
I added a bmark-based benchmark using the existing Profile module. For new code I would forgo the Profile module and write the benchmarking code directly in the _bmark.ex file. Here’s my bmark:
defmodule LifeBench do
  use Bmark

  bmark :glider do
    Profile.run_test
  end
end
After writing this I uncovered the first of several bugs in bmark. It’s good to run it through a somewhat realistic case like my Game of Life. A GenServer call times out because the benchmark takes too long to run. After a quick fix I run into the second bug: I need to create a directory named results to store the results file. If the directory doesn’t exist the benchmark is run but the results are lost. Another quick fix and I’m on my way.
Baseline
Now that I have bmark ready I can start measuring. I want to measure in the production environment so I start by setting
export MIX_ENV=prod
in my shell. You can assume I’m running production from now on.
I collect baseline results by running mix bmark:
lifebench.glider.results
15050451
15282246
15095683
15049299
15292709
15018099
15009078
15266227
15146941
15111238
I want to save these in a base directory. Hmm, I need to manually copy results to base, and I’ll need to do this copy every time. For the moment I choose to do it manually, but I’ll want to automate this soon as it will be tedious and error prone.
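For now the manual step is just a copy in the shell, something like:
$ cp -r results base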
Consolidating Protocols in Elixir
Several weeks ago Johan Wärlander suggested that I consolidate the protocols used in my Game of Life implementation in order to reduce time spent in functions like 'Elixir.Code':'ensure_compiled?'. He points out that protocol consolidation should be a part of preparing an Elixir application for a production release. Before Johan’s comment I hadn’t heard of protocol consolidation, so I read up a bit on it. José Valim, in a github issue, described the process in terms of generating fast dispatches (without lookups) when all protocol implementations are known. This works well in the case of packaging a release. Protocol consolidation can be invoked by mix as
$ mix compile.protocols
which writes out new binary files containing the consolidated protocols. But, this requires that the application be run as
$ elixir -pa _build/MIX_ENV/consolidated -S mix run
to set the load path to include the consolidated protocol files. I consolidated the protocols and reran the benchmark:
lifebench.glider.results: lifebench.glider.results:
15050451 14887006
15282246 15028554
15095683 14762181
15049299 14905275
15292709 14577768
15018099 14883831
15009078 14795362
15266227 14897746
15146941 14763634
15111238 14904190
15132197.1 -> 14840554.7 (-1.93%) with p < 0.0005
t = 5.60108220481427, 18 degrees of freedom
Another problem with bmark: I use the basename of the results files for the header in the report. But, in my setup with separate results directories the files have the same name and so I can’t tell the two columns apart. Another fix to bmark and I have:
base/lifebench.glider.results: consolidate/lifebench.glider.results:
15050451 14887006
15282246 15028554
15095683 14762181
15049299 14905275
15292709 14577768
15018099 14883831
15009078 14795362
15266227 14897746
15146941 14763634
15111238 14904190
15132197.1 -> 14840554.7 (-1.93%) with p < 0.0005
t = 5.60108220481427, 18 degrees of freedom
Ok, so the report looks nice and this is a performance win! Protocol consolidation has lowered the run time by almost 2%.
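As an aside, the report’s statistics are consistent with a standard two-sample pooled Student’s t-test (my assumption about what bmark computes): with two runs of 10 samples each the degrees of freedom are n1 + n2 - 2 = 10 + 10 - 2 = 18, and t = (mean1 - mean2) / (s_p * sqrt(1/n1 + 1/n2)), where s_p is the pooled standard deviation.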
Going forward I need to continue to run with protocol consolidation, but this isn’t a code change so it doesn’t “stick”. I need to make sure that I
- Remember to compile the protocols before each profile run
- Remember to run with the right command line to enable the use of the consolidated protocols.
This is a lot to remember, so I’d better write myself a script to do it.
#!/bin/sh
if [ -z "$1" ]; then
    echo "Usage: bmark.sh <name>"
    exit 1
fi

export MIX_ENV=prod
mkdir -p results
mix compile
mix compile.protocols
elixir -pa _build/prod/consolidated/ -S mix bmark
mv results $1
I also decided to go ahead and save the results directory under a name given as a command line argument. And I set the production environment in the script just in case I start a new shell and forget to set MIX_ENV.
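A run for, say, the consolidation experiment then looks like this (the directory name is just whatever label I pass in):
$ ./bmark.sh consolidate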
With my new script ready I can move ahead to the next optimization.
Count Live Neighbors
Next we profile with ExProf:
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- --- ---- [----------]
'Elixir.Access':get/2 419073 0.53 85973 [ 0.21]
'Elixir.Enumerable.List':reduce/3 657360 0.54 86879 [ 0.13]
'Elixir.Access':'impl_for!'/1 419073 0.57 91691 [ 0.22]
'Elixir.Life':'-list_of_neighbors/3-fun-0-'/6 369765 0.60 96884 [ 0.26]
code_server:call/2 419077 1.59 256773 [ 0.61]
maps:put/3 49302 12.09 1949883 [ 39.55]
'Elixir.Access':impl_for/1 419073 14.65 2363776 [ 5.64]
maps:find/2 419067 63.10 10178202 [ 24.29]
I see that the profile hasn’t really changed that much. There is probably less runtime compilation but I still see time spent in maps:find/2. My first question is what is maps:find/2? It comes from Erlang, but when is it used? As a guess, I look at Elixir.Access for Map and find the get function. The get function is what is used to implement the [] operator on Map (or anything that implements the Access protocol). I see that Elixir.Access.get for Map calls :maps.find(key, map) to do the Map lookup. So, we know that we are spending our time looking up values from Maps.
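Roughly, the Map implementation of the Access protocol looks like this - a sketch from memory of the Elixir source at the time, not a verbatim copy:
defimpl Access, for: Map do
  def get(map, key) do
    # The [] operator on a Map boils down to an Erlang :maps.find/2 call
    case :maps.find(key, map) do
      {:ok, value} -> value
      :error -> nil
    end
  end
end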
Also, I see 'Elixir.Life':'-list_of_neighbors/3-fun-0-'/6 in the profile, which I know is part of count_live_neighbors/3, and I know that this path does a lot of map lookups. I decided to try optimizing this code a little. In hindsight the optimizations I tried didn’t make sense, but I’ll show them here and show how the bmark.cmp process can give a more formal way of making the determination - remember that code performance can often be hard to predict or reason about without hard information.
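For reference, the comparison reports in this post come from bmark’s compare task; I believe the invocation is just the two results files, along the lines of:
$ mix bmark.cmp base/lifebench.glider.results consolidate/lifebench.glider.results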
Looking at list_of_neighbors/3, I wonder how fast the for comprehension is. I could try rewriting the function using a module attribute instead of generating the dx, dy values with the comprehension.
I rewrote this:
defp list_of_neighbors(x, y, board) do
  for dx <- [-1, 0, 1], dy <- [-1, 0, 1], {dx, dy} != {0, 0}, do: board[{x + dx, y + dy}]
end
as
@deltas [
  {-1, -1}, { 0, -1}, { 1, -1},
  {-1,  0},           { 1,  0},
  {-1,  1}, { 0,  1}, { 1,  1}
]

defp list_of_neighbors(x, y, board) do
  Enum.map(@deltas, fn ({dx, dy}) -> board[{x + dx, y + dy}] end)
end
and then measured the results:
consolidate/lifebench.glider.results: precomputed/lifebench.glider.results:
14887006 14642711
15028554 14897523
14762181 14698632
14905275 14978539
14577768 14775842
14883831 14751767
14795362 14859778
14897746 14872981
14763634 14648105
14904190 14634486
14840554.7 -> 14776036.4 (-0.44%) with p < 1
t = 1.1845853684535732, 18 degrees of freedom
Sadly, this is not a win. I decided to take this one step further and try to reduce the work in count_live_neighbors/3 by taking the several pipeline steps and combining them into a single fold operation. I started with:
defp count_live_neighbors(x, y, board) do
  list_of_neighbors(x, y, board)
  |> Enum.map(&state_as_int/1)
  |> Enum.sum
end

defp list_of_neighbors(x, y, board) do
  Enum.map(@deltas, fn ({dx, dy}) -> board[{x + dx, y + dy}] end)
end
and rewrote it as:
defp count_live_neighbors(x, y, board) do
  List.foldl(@deltas, 0, fn ({dx, dy}, acc) ->
    acc + (board[{x + dx, y + dy}] |> state_as_int)
  end)
end
I measured again:
consolidate/lifebench.glider.results: foldl/lifebench.glider.results:
14887006 14742553
15028554 14879462
14762181 14712122
14905275 14925838
14577768 14772430
14883831 14604655
14795362 14682689
14897746 14799597
14763634 14634907
14904190 14979943
14840554.7 -> 14773419.6 (-0.45%) with p < 1
t = 1.220154927852377, 18 degrees of freedom
But still no win. Was it right to push through the first failed optimization this way? I think it was; sometimes it can lead to better ideas or a code structure that lends itself to further optimization. But in this case it just didn’t pay off. I discarded these changes.
As I said, in hindsight the optimizations I tried didn’t make sense. This is because I didn’t change the number of calls to maps:find/2, which clearly dominates the profile. Instead, I was only changing the structure of the code around these calls. But, I find that kind of insight hard to see at the outset.
Concurrent Computation
Elixir is all about concurrency so I had been itching to try out a parallel version. I was pretty confident that a concurrent version would see a benefit from parallel execution on my 4 core system. But, I had wanted to save this optimization until the end. I didn’t envision changing too much in converting to a concurrent version and wanted to make sure I was starting with the best version of the base code.
My version of The Game of Life has a Board module, and that module has a map function which I use to apply updates to each cell in the game. Building a concurrent version of the Game of Life was a simple matter of changing Board.map into a parallel map. Here’s the original version:
def map(board, f) do
  board
  |> Map.keys
  |> Enum.map(fn (key) -> { key, f.(key, board[key]) } end)
  |> List.foldr(Map.new, fn ({key, value}, acc) -> Map.put(acc, key, value) end)
end
The board itself is stored as a map. I use the existing Enum.map by creating a List of all the keys in my map, then map over the list, and then consolidate the results back into a new Map using List.foldr. I wrote up a naive concurrent version starting a process for each invocation of the function f:
def map(board, f) do
  board
  |> Map.keys
  |> pmap(fn (key) -> { key, f.(key, board[key]) } end)
  |> List.foldr(Map.new, fn ({key, value}, acc) -> Map.put(acc, key, value) end)
end

def pmap(list, f) do
  list
  |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
  |> Enum.map(fn (task) -> Task.await(task) end)
end
Here I’ve replaced the use of Enum.map with my own pmap function that uses Task.async to run each function invocation in a separate process.
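As a quick sanity check, pmap should behave just like Enum.map - results come back in order because the tasks are awaited in the order they were started. A hypothetical session:
iex> pmap([1, 2, 3], fn (x) -> x * x end)
[1, 4, 9]
Benchmarking the naive version gives: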
consolidate/lifebench.glider.results: naive-pmap/lifebench.glider.results:
14887006 26889439
15028554 27339230
14762181 27067765
14905275 27026322
14577768 26777941
14883831 27135888
14795362 26867260
14897746 27028423
14763634 27354742
14904190 26881294
14840554.7 -> 27036830.4 (+82.18%) with p < 0.0005
t = 167.14386999331413, 18 degrees of freedom
This is a huge loss! Process creation in Erlang is supposed to be cheap, but based on the results I think that the balance of work between the function f and the process creation in Task.async is off. That f, which would be Life.apply_rules/3, does very little work compared to Task.async. While in absolute terms Task.async may be cheap, it is relatively expensive here and the cost outweighs the benefit.
But, still, a concurrent version should be able to win. I try restructuring a bit and run the pmap in chunks of 8:
def pmap(list, f) do
  list
  |> Enum.chunk(8, 8, [])
  |> Enum.flat_map(fn (chunk) ->
    chunk
    |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
    |> Enum.map(fn (task) -> Task.await(task) end)
  end)
end
consolidate/lifebench.glider.results: chunk-pmap/lifebench.glider.results:
14887006 24274268
15028554 24563751
14762181 24492221
14905275 24516553
14577768 24335224
14883831 24158102
14795362 24357174
14897746 24213098
14763634 24466586
14904190 24289248
14840554.7 -> 24366622.5 (+64.19%) with p < 0.0005
t = 163.96384320448385, 18 degrees of freedom
This looks like it is slightly less of a huge loss than the previous version, which I think means I still don’t have the balance right. So I increase the chunk size to 128:
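The only change from the previous version is the chunk size - a sketch, assuming everything else stays the same:
def pmap(list, f) do
  list
  |> Enum.chunk(128, 128, [])
  |> Enum.flat_map(fn (chunk) ->
    chunk
    |> Enum.map(fn (elem) -> Task.async(fn -> f.(elem) end) end)
    |> Enum.map(fn (task) -> Task.await(task) end)
  end)
end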
consolidate/lifebench.glider.results: chunk-128-pmap/lifebench.glider.results:
14887006 6426990
15028554 6416149
14762181 6507946
14905275 6453309
14577768 6491314
14883831 6405073
14795362 6504260
14897746 6449789
14763634 6532929
14904190 6509800
14840554.7 -> 6469755.9 (-56.41%) with p < 0.0005
t = 203.48398714422547, 18 degrees of freedom
And finally we have our win!
Next steps
First of all, there are a few more things I want to clean up in bmark. I’m thinking that bmark should handle the protocol consolidation for the user. It doesn’t make sense to have to do this manually, though perhaps this could be controlled by a configuration option. There are also some other larger ideas that I will discuss in a future post.
Next week, I plan to continue with the profiling and optimization process.