I have to admit that I almost didn’t get a post out this week. I’ve been trying to write about my Blocking Queue and ways of using it in a larger application. But I’ve been blocked on this topic for some time. Then @Lectrick asked about generating YAML with Elixir in a comment on my recent post on parsing YAML in Elixir. I wasn’t able to find a module for generating YAML, so I decided to write one.
Going through the YAML specification (more on this later) showed me that writing a YAML generator is more involved than I thought. But, I was able to get a good start on it in this post and hopefully I will be able to continue developing it over several more posts.
Anyway, I’ve decided to call the module Yamlix. Here’s how I went about starting it.
Getting started
First, I put together a basic Mix project:
$ mix new yamlix
* creating README.md
* creating .gitignore
* creating mix.exs
* creating config
* creating config/config.exs
* creating lib
* creating lib/yamlix.ex
* creating test
* creating test/test_helper.exs
* creating test/yamlix_test.exs
Your mix project was created successfully.
You can use mix to compile it, test it, and more:
cd yamlix
mix test
Run `mix help` for more commands.
Then I added a LICENSE file for the MIT license.
YAML Spec
There’s a specification for YAML 1.2 here. I’ll do my best to follow this specification.
Tests
I’m going to need lots of tests. I’ll start with a simple integration test and work my way down to unit tests.
defmodule YamlixTest do
  use ExUnit.Case

  test "it dumps integer scalars" do
    assert Yamlix.dump(5) == "--- 5\n...\n"
  end
end
Of course, this test fails
1) test it dumps scalars (YamlixTest)
test/yamlix_test.exs:4
** (UndefinedFunctionError) undefined function: Yamlix.dump/1
stacktrace:
(yamlix) Yamlix.dump(5)
test/yamlix_test.exs:5
because I haven’t written the dump function. dump/1 will be the public interface to Yamlix, so let’s write it. Section 3.1 of the YAML spec describes a process for dumping YAML. Following this process, dump should look something like this:
@spec dump(integer()) :: String.t()
def dump(scalar) do
  scalar |> represent() |> serialize() |> present()
end
The scalar (or any input) is converted to an internal representation in the form of a graph. The graph, in turn, is serialized and turned into a linear stream. That stream is presented as text forming the YAML output.
Now, I just need to write the functions represent, serialize, and present.
Of course, this test still doesn’t pass:
lib/yamlix.ex:7: warning: variable scalar is unused
Compiled lib/yamlix.ex
Generated yamlix app
1) test it dumps integer scalars (YamlixTest)
test/yamlix_test.exs:4
Assertion with == failed
code: Yamlix.dump(5) == "--- 5\n...\n"
lhs: nil
rhs: "--- 5\n...\n"
stacktrace:
test/yamlix_test.exs:5
So, let’s do something simple to make it pass. This will ensure we keep it passing as we write a more complete implementation:
defp present(_) do
  "--- " <>
    "5\n" <>
    "...\n"
end
Note, I still have the warning lib/yamlix.ex:7: warning: variable scalar is unused, which is great. Warnings remind me that I still have work to do. In this case I need to actually dump YAML for the passed-in scalar rather than just hard-code a result of “--- 5”.
To work through this I need another test:
test "it dumps string scalars" do
  assert Yamlix.dump("s") == "--- s\n...\n"
end
To pass both this test and the previous test we will convert scalar to a string and then insert it into the document. Conversion to string should happen in serialize. So we end up with:
@spec dump(integer()) :: String.t()
def dump(scalar) do
  scalar |> represent() |> serialize() |> present()
end

defp represent(scalar) do
  scalar
end

defp serialize(rep) do
  to_string(rep)
end

defp present(content) do
  "--- " <>
    content <>
    "\n...\n"
end
But at this point we have violated the type specification: the spec says dump takes an integer, and the new test passes it a string. We really need automated checking for this.
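Typespecs are not enforced at runtime, which is why the string test passes even though it breaks the spec. Here is a minimal sketch of that behavior. The module name SpecSketch is made up for illustration and is not part of Yamlix.

```elixir
defmodule SpecSketch do
  # The spec claims integers only, but nothing checks this when the code runs.
  @spec dump(integer()) :: String.t()
  def dump(scalar) do
    "--- " <> to_string(scalar) <> "\n...\n"
  end
end

# A string argument violates the spec, yet the call still succeeds.
IO.inspect(SpecSketch.dump("s"))
```

Catching this kind of mismatch takes a static analysis tool or an explicit guard such as is_integer/1.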
Extended tests
I’ve been carrying this script from project to project:
#!/bin/bash -e
mix compile
mix dialyzer
mix test
MIX_ENV=docs mix inch
mix docs
I’ll add this to the project along with Dialyxir and inch. For more information on working with these projects, see:
- My post on Dialyxir
- The inch_ex github page
Strange, Dialyzer passes my code:
Starting Dialyzer
dialyzer --no_check_plt --plt /Users/jkain/.dialyxir_core_17_1.0.4.plt -Wunmatched_returns -Werror_handling -Wrace_conditions -Wunderspecs /Users/jkain/Documents/Projects/elixir/yamlix/_build/dev/lib/yamlix/ebin
Proceeding with analysis... done in 0m0.69s
done (passed successfully)
I realize now that Dialyzer doesn’t analyze the tests, which is where the type violation is. I’ll just have to fix the problem anyway. We’ll do our best to handle any type, so we have this spec:
@spec dump(any) :: String.t()
def dump(scalar) do
  scalar |> represent() |> serialize() |> present()
end
Ok, so at this point we have a couple of basic end-to-end tests and the basic structure for our YAML generator. We have lots of test infrastructure to help us. The next step is to start working with serializing more types, and working through the features listed in the YAML spec.
Stepping back and looking at the design
At first I dove into the YAML spec. Section 3.2 describes the YAML information model and starts with the Representation Graph. The graph consists of Nodes and Tags. The Nodes simply represent data to be serialized. The Tag for a node contains metadata that describes the type of the data in the Node. The Tags will allow Yamlix to serialize different data types like Structs and Tuples. However, for the Tags to be useful the YAML parser has to recognize them.
I need to take a step back and think about this project. My original intent was that yamerl would parse the YAML generated by yamlix. But if yamlix generates Tags for Elixir-specific structures, yamerl won’t recognize them.
yamerl
I need to do a little research into yamerl to understand what Tags it does recognize and if it provides a way to extend the Tag support.
Based on the yamerl reference there does seem to be a way to provide a “[l]ist of Erlang modules to extend supported node types”. So, part of the yamlix project may need to include writing node modules for yamerl.
Based on the source files, yamerl contains support for:
- bool
- bool_ext (accepts more values for true (“y”, “Y”, etc.) and false)
- Erlang atoms
- Erlang functions
- float
- float_ext
- int
- int_ext (accepts more bases)
- IP Address
- map
- null
- Sequence
- Size
- str
- timestamp
Serializable types
From Elixir, the types that yamlix will serialize, initially, will be:
- Integers
- Floats
- Bool
- Atom
- Strings
- Lists
- Tuples
- Struct
- Maps
- Streams ?
I’m not sure how to handle Streams just yet. I think Yamlix.dump/1 should dump something for streams. But the best it can do is to dump a list and it must evaluate the entire stream in order to generate YAML. What happens when the YAML is read back in? It might make sense for a new stream to be created so that the types match with what was passed to Yamlix.dump/1 originally. I think it will take some thought and experimentation to design the right API for handling Streams.
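To make the stream problem concrete, here is a small sketch of what "evaluate the entire stream" means. StreamSketch and materialize/1 are hypothetical names used only for illustration. Yamlix has no such function yet.

```elixir
defmodule StreamSketch do
  # Hypothetical helper: force a lazy stream into a list so it could be
  # dumped like any other list. This never returns for an infinite stream.
  def materialize(stream) do
    Enum.to_list(stream)
  end
end

# A finite lazy stream is evaluated in full when it is materialized.
lazy = Stream.map(1..3, fn x -> x * 2 end)
IO.inspect(StreamSketch.materialize(lazy))

# An infinite stream has to be bounded first, or materializing never ends.
bounded = Stream.cycle([:a, :b]) |> Stream.take(3)
IO.inspect(StreamSketch.materialize(bounded))
```

The result is a plain list, so the information that the input was a stream is lost unless it is recorded some other way, for example in a tag.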
I’m going to prioritize the following order for supporting types:
1. Lists
2. Maps
3. Integers
4. Floats
5. Bool
6. Atom
7. Strings
8. Struct
9. Tuples
10. Streams
Items 1-7 are all supported by yamerl. Items 9-10 will require Elixir- or Yamlix-specific tags and extensions to yamerl for parsing. So, I’ll save them for the end.
Design
Even if I am using a BDD style of development, I think it will pay off to think through a design. The YAML spec recommends a design, which I have already started trying to follow. The steps should be:
- represent - Convert native Elixir data types into a graph of nodes representing the same data types. This means, for example, iterating over all pairs in a Map and recursively generating graph nodes for the data stored in the Map. Imagine a Map of Lists of more Maps.
- serialize - This step linearizes the graph and generates canonical string values. It arranges the nodes into a linear order that could be written out as YAML. If there are loops in the graph then aliases have to be built (to refer back to prior nodes).
- present - This is the process of writing out the serialized graph as a formatted string.
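The first two steps above can be sketched roughly as follows. This is only an illustration under my own assumptions: the module name PipelineSketch, the tagged tuples, and the event atoms are all made up. It ignores tags, aliases, and structs, and its scalars are limited to what to_string/1 accepts.

```elixir
defmodule PipelineSketch do
  # represent: recursively turn native data into a graph of tagged tuples.
  def represent(map) when is_map(map) do
    pairs = for {k, v} <- map, do: {represent(k), represent(v)}
    {:mapping, pairs}
  end

  def represent(list) when is_list(list) do
    {:sequence, Enum.map(list, &represent/1)}
  end

  # Everything else is treated as a scalar with a canonical string value.
  def represent(scalar), do: {:scalar, to_string(scalar)}

  # serialize: walk the graph depth-first and emit a flat list of events.
  def serialize({:scalar, value}), do: [{:scalar, value}]

  def serialize({:sequence, nodes}) do
    [:sequence_start] ++ Enum.flat_map(nodes, &serialize/1) ++ [:sequence_end]
  end

  def serialize({:mapping, pairs}) do
    events = Enum.flat_map(pairs, fn {k, v} -> serialize(k) ++ serialize(v) end)
    [:mapping_start] ++ events ++ [:mapping_end]
  end
end

# A Map holding a List: the nested graph becomes one linear event list.
%{"a" => [1, 2]}
|> PipelineSketch.represent()
|> PipelineSketch.serialize()
|> IO.inspect()
```

A present step would then walk the event list and write out indentation, dashes, and colons.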
In our current implementation our steps work like this:
- represent - This stage does nothing. It accepts only scalars and passes them through to the next step.
- serialize - This stage accepts only scalars and converts them to canonical strings.
- present - This stage writes out the YAML header and footer, and writes the string content received from serialize in between them.
Graph Representation
I think the next step should be to focus on building the graph representation of the input structure. Based on the spec’s figure 3.3 Representation Model, we can build a series of types:
- ScalarNode
  - Has the canonical string value
- SequenceNode
  - Has a list of Nodes for each value
- MappingNode
  - Has a map of key -> nodes
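As a rough idea of how these three types could look as Elixir structs, here is a sketch. The module name GraphSketch, the struct names, and the field names nodes and pairs are my own placeholders, not a settled design.

```elixir
defmodule GraphSketch do
  defmodule Scalar do
    # The canonical string value of a scalar, plus its tag.
    defstruct value: "", tag: ""
  end

  defmodule Sequence do
    # An ordered list of child nodes.
    defstruct nodes: [], tag: ""
  end

  defmodule Mapping do
    # A map from key nodes to value nodes.
    defstruct pairs: %{}, tag: ""
  end

  # Hand-built graph for the Elixir value %{"a" => [1, 2]}.
  def example do
    %Mapping{
      pairs: %{
        %Scalar{value: "a"} => %Sequence{
          nodes: [%Scalar{value: "1"}, %Scalar{value: "2"}]
        }
      }
    }
  end
end

IO.inspect(GraphSketch.example())
```

Keys are nodes too, since YAML allows any node as a mapping key.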
Since our test suite consists of two tests of scalars I’ll start with ScalarNode. I wrote up this module as a start:
defmodule RepresentationGraph do
  defmodule Node do
    defmodule Scalar do
      defstruct value: "", tag: ""
    end

    def new(scalar) do
      %Scalar{value: scalar, tag: ""}
    end

    def value(%Scalar{value: v, tag: _}) do
      v
    end
  end

  def represent(scalar) do
    Node.new(scalar)
  end
end
The RepresentationGraph.Node module will be an abstraction over the different specializations of nodes: scalar, sequence, and map. It provides the function Node.new/1 to create a new node and an accessor Node.value/1 to query the value from the Node. Currently, I support only one type of node, Node.Scalar, which holds a value and tag.
The RepresentationGraph module takes over the function represent/1 (formerly in Yamlix). Currently, it just creates a new Node.Scalar and returns it.
I use the RepresentationGraph set of modules like this:
defmodule Yamlix do
  alias RepresentationGraph, as: R
  alias RepresentationGraph.Node

  @spec dump(any) :: String.t()
  def dump(scalar) do
    scalar |> R.represent() |> serialize() |> present()
  end

  defp serialize(node) do
    to_string(Node.value(node))
  end

  defp present(content) do
    "--- " <>
      content <>
      "\n...\n"
  end
end
I now call RepresentationGraph.represent/1 to build a representation. Then, in serialize/1 I use Node.value/1 to extract the value when building the canonical string. present/1 is unchanged.
Next Steps
This has turned into a pretty long post and there is a lot more to do in Yamlix. The next steps are:
- Get this on GitHub
- Add support for maps - maps will require a recursive traversal through the Map which will help flesh out the design for Yamlix
- Work through the YAML spec
These are things I’ll work on in next week’s post.