Yamlix

I have to admit that I almost didn’t get a post out this week. I’ve been trying to write about my Blocking Queue and ways of using it in a larger application. But I’ve been blocked on this topic for some time. Then @Lectrick asked about generating YAML with Elixir in a comment on my recent post on parsing YAML in Elixir. I wasn’t able to find a module for generating YAML, so I decided to write one.

Going through the YAML specification (more on this later) showed me that writing a YAML generator is more involved than I thought. But I was able to get a good start on it in this post, and hopefully I will be able to continue developing it over several more posts.

Anyway, I’ve decided to call the module Yamlix. Here’s how I went about starting it.

Getting started

First, I put together a basic Mix project:

$ mix new yamlix
* creating README.md
* creating .gitignore
* creating mix.exs
* creating config
* creating config/config.exs
* creating lib
* creating lib/yamlix.ex
* creating test
* creating test/test_helper.exs
* creating test/yamlix_test.exs

Your mix project was created successfully.
You can use mix to compile it, test it, and more:

    cd yamlix
    mix test

Run `mix help` for more commands.

Then I added a LICENSE file with the MIT license.

YAML Spec

There’s a specification for YAML 1.2 here. I’ll do my best to follow this specification.

Tests

I’m going to need lots of tests. I’ll start with a simple integration test and work my way down to unit tests.

defmodule YamlixTest do
  use ExUnit.Case

  test "it dumps integer scalars" do
    assert Yamlix.dump(5) == "--- 5\n...\n"
  end
end

Of course, this test fails

  1) test it dumps scalars (YamlixTest)
     test/yamlix_test.exs:4
     ** (UndefinedFunctionError) undefined function: Yamlix.dump/1
     stacktrace:
       (yamlix) Yamlix.dump(5)
       test/yamlix_test.exs:5

because I haven’t written the dump function. dump will be the public interface to Yamlix, so let’s write it. Section 3.1 of the YAML spec describes a process for dumping YAML. Following this process, dump should look something like this:

@spec dump(integer()) :: String.t
def dump(scalar) do
  scalar |> represent |> serialize |> present
end

The scalar (or any input) is converted to an internal representation in the form of a graph. The graph, in turn, is serialized into a linear stream. That stream is then presented as text, forming the YAML output.

Now, I just need to write the functions represent, serialize, and present.

Of course, this test still doesn’t pass:

lib/yamlix.ex:7: warning: variable scalar is unused
Compiled lib/yamlix.ex
Generated yamlix app


  1) test it dumps integer scalars (YamlixTest)
     test/yamlix_test.exs:4
     Assertion with == failed
     code: Yamlix.dump(5) == "--- 5\n...\n"
     lhs:  nil
     rhs:  "--- 5\n...\n"
     stacktrace:
       test/yamlix_test.exs:5

So, let’s do something simple to make it pass. This will ensure we keep it passing as we write a more complete implementation:

defp present(_) do
  "--- " <>
  "5\n"  <>
  "...\n"
end

Note that I still have the warning lib/yamlix.ex:7: warning: variable scalar is unused, which is great. Warnings remind me that I still have work to do. In this case I need to actually dump YAML for the passed-in scalar rather than just hard-coding a result of “--- 5”.

To work through this I need another test:

test "it dumps string scalars" do
  assert Yamlix.dump("s") == "--- s\n...\n"
end

To pass both this test and the previous test we will convert scalar to a string and then insert it into the document. Conversion to string should happen in serialize. So we end up with:

@spec dump(integer()) :: String.t
def dump(scalar) do
  scalar |> represent |> serialize |> present
end

defp represent(scalar) do
  scalar
end

defp serialize(rep) do
  to_string(rep)
end

defp present(content) do
  "--- " <>
  content  <>
  "\n...\n"
end

But at this point we have violated the type specification: dump is declared to take an integer(), yet the new test passes it a string. We really need automated checking for this.

Extended tests

I’ve been carrying this script from project to project:

#!/bin/bash -e

mix compile
mix dialyzer
mix test
MIX_ENV=docs mix inch
mix docs

I’ll add this to the project along with Dialyxir and inch_ex. For more information on working with these projects, see:

My post on Dialyxir
The inch_ex github page

Strange, Dialyzer passes my code:

Starting Dialyzer
dialyzer --no_check_plt --plt /Users/jkain/.dialyxir_core_17_1.0.4.plt -Wunmatched_returns -Werror_handling -Wrace_conditions -Wunderspecs /Users/jkain/Documents/Projects/elixir/yamlix/_build/dev/lib/yamlix/ebin
  Proceeding with analysis... done in 0m0.69s
done (passed successfully)

I realize now that Dialyzer doesn’t analyze the tests, which is where the type violation is. I’ll just have to fix the problem anyway. We’ll do our best to handle any type, so we have this spec:

@spec dump(any) :: String.t
def dump(scalar) do
  scalar |> represent |> serialize |> present
end

Ok, so at this point we have a couple of basic end-to-end tests and the basic structure for our YAML generator. We have lots of test infrastructure to help us. The next step is to start working with serializing more types, and working through the features listed in the YAML spec.

Stepping back and looking at the design

At first I dove into the YAML spec. Section 3.2 describes the YAML information model and starts with the Representation Graph. The graph consists of Nodes and Tags. The Nodes simply represent data to be serialized. The Tag for a node contains metadata that describes the type of the data in the Node. The Tags will allow Yamlix to serialize different data types like Structs and Tuples. However, for the Tags to be useful the YAML parser has to recognize them.

I need to take a step back and think about this project. My original intent was that yamerl would parse the YAML generated by Yamlix. But if Yamlix generates Tags for Elixir-specific structures, yamerl won’t recognize them.

yamerl

I need to do a little research into yamerl to understand what Tags it does recognize and if it provides a way to extend the Tag support.

Based on the yamerl reference there does seem to be a way to provide a “[l]ist of Erlang modules to extend supported node types”. So, part of the Yamlix project may need to include writing node modules for yamerl.

Based on the source files, yamerl contains support for:

bool
bool_ext (accepts more values for true (“y”, “Y”, etc.) and false)
Erlang atoms
Erlang functions
float
float_ext
int
int_ext (accepts more bases)
IP Address
map
null
Sequence
Size
str
timestamp

Serializable types

From Elixir, the types that yamlix will serialize, initially, will be:

Integers
Floats
Bool
Atom
Strings
Lists
Tuples
Struct
Maps
Streams ?

I’m not sure how to handle Streams just yet. I think Yamlix.dump/1 should dump something for streams. But the best it can do is to dump a list and it must evaluate the entire stream in order to generate YAML. What happens when the YAML is read back in? It might make sense for a new stream to be created so that the types match with what was passed to Yamlix.dump/1 originally. I think it will take some thought and experimentation to design the right API for handling Streams.
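To make the trade-off a little more concrete, here is a minimal sketch of what dumping a stream would involve. It is not part of Yamlix: the module StreamSketch and the function to_dumpable/1 are hypothetical names I made up for illustration. The point is only that a lazy stream has to be evaluated in full before anything can be written, so an infinite stream would have to be bounded by the caller first.

```elixir
defmodule StreamSketch do
  # Hypothetical helper: force a lazy enumerable into a plain list that a
  # dumper could then treat like any other list.
  def to_dumpable(stream) do
    Enum.to_list(stream)
  end
end

# A finite stream is evaluated in full.
finite = Stream.map(1..3, fn x -> x * 2 end)
IO.inspect(StreamSketch.to_dumpable(finite))

# An infinite stream must be bounded first, or Enum.to_list/1 never returns.
infinite = Stream.iterate(1, fn x -> x + 1 end)
IO.inspect(infinite |> Stream.take(3) |> StreamSketch.to_dumpable())
```

Once the stream has been forced into a list, nothing in the data says it used to be a stream, which is why reading it back as a stream would need some kind of tag.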

I’m going to support the types in the following order of priority:

1. Lists
2. Maps
3. Integers
4. Floats
5. Bool
6. Atom
7. Strings
8. Struct
9. Tuples
10. Streams

Items 1-7 are all supported by yamerl. Items 9-10 will require Elixir- or Yamlix-specific tags and extensions to yamerl for parsing. So, I’ll save them for the end.

Design

Even though I am using a BDD style of development, I think it will pay off to think through a design. The YAML spec recommends a design which I have already started trying to follow. The steps should be:

represent - This step converts native Elixir data types into a graph of nodes representing the same data types. This means, for example, iterating over all pairs in a Map and recursively generating graph nodes for the data stored in the Map. Imagine a Map of Lists of more Maps.
serialize - This step linearizes the graph and generates canonical string values. It arranges the nodes into linear order that could be written out as YAML. If there are loops in the graph then aliases have to be built (to refer back to prior nodes).
present - This is the process of writing out the serialized graph as a formatted string.
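As a rough illustration of the represent step described above, the sketch below walks a nested structure recursively. This is an assumption about how it could look, not the design Yamlix ends up with: RepresentSketch and the tagged tuples {:scalar, _}, {:sequence, _} and {:mapping, _} are made-up names, and the sketch ignores tags, structs and tuples entirely.

```elixir
defmodule RepresentSketch do
  # Maps become mapping nodes; keys and values are represented recursively.
  # Structs are maps too but are not Enumerable, so a real implementation
  # would need a separate clause for them placed before this one.
  def represent(map) when is_map(map) do
    {:mapping, Enum.map(map, fn {k, v} -> {represent(k), represent(v)} end)}
  end

  # Lists become sequence nodes.
  def represent(list) when is_list(list) do
    {:sequence, Enum.map(list, &represent/1)}
  end

  # Anything else is treated as a scalar. to_string/1 handles integers,
  # floats, atoms and strings; it raises for tuples.
  def represent(scalar) do
    {:scalar, to_string(scalar)}
  end
end

# A Map holding a List that holds another Map.
IO.inspect(RepresentSketch.represent(%{"a" => [1, %{"b" => 2}]}))
```

The recursion bottoms out at scalars, so arbitrarily deep nesting of maps and lists is handled by the same three clauses.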

In our current implementation our steps work like this:

represent - This stage does nothing. It accepts only scalars and passes them through to the next step.
serialize - This stage accepts only scalars and converts them to canonical strings.
present - This stage writes out the YAML header and footer, and it writes the string content received from serialize in between them.

Graph Representation

I think the next step should be to focus on building the graph representation of the input structure. Based on the spec’s figure 3.3, Representation Model, we can build a series of types:

ScalarNode
- Has the canonical string value
SequenceNode
- Has a list of Nodes for each value
MappingNode
- Has a map of key -> nodes
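The three node types listed above could be sketched as plain structs. This is only one possible shape, under my own assumptions: the module name NodeSketch and the field names items and pairs are hypothetical and are not taken from the spec or from Yamlix.

```elixir
defmodule NodeSketch do
  defmodule Scalar do
    # value holds the canonical string form of the scalar.
    defstruct value: "", tag: ""
  end

  defmodule Sequence do
    # items is an ordered list of child nodes.
    defstruct items: [], tag: ""
  end

  defmodule Mapping do
    # pairs maps key nodes to value nodes.
    defstruct pairs: %{}, tag: ""
  end
end

alias NodeSketch.{Scalar, Sequence, Mapping}

# The list [1, 2] and the map %{"a" => 1}, built by hand.
seq = %Sequence{items: [%Scalar{value: "1"}, %Scalar{value: "2"}]}
map = %Mapping{pairs: %{%Scalar{value: "a"} => %Scalar{value: "1"}}}

IO.inspect(seq)
IO.inspect(map)
```

Giving every node a tag field leaves room for the type metadata the spec describes, even though nothing fills it in yet.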

Since our test suite consists of two tests of scalars I’ll start with ScalarNode. I wrote up this module as a start:

defmodule RepresentationGraph do
  defmodule Node do
    defmodule Scalar do
      defstruct value: "", tag: ""
    end

    def new(scalar) do
      %Scalar{value: scalar, tag: ""}
    end

    def value(%Scalar{value: v, tag: _}) do
      v
    end
  end

  def represent(scalar) do
    Node.new(scalar)
  end
end

The RepresentationGraph.Node module will be an abstraction over the different specializations of nodes: scalar, sequence, and map. It provides the function Node.new/1 to create a new node and an accessor Node.value/1 to query the value from the Node. Currently, I support only one type of node, Node.Scalar which holds a value and tag.

The RepresentationGraph module takes over the function represent/1 (formerly in Yamlix). Currently, it just creates a new Node.Scalar and returns it.

I use the RepresentationGraph set of modules like this:

defmodule Yamlix do
  alias RepresentationGraph, as: R
  alias RepresentationGraph.Node

  @spec dump(any) :: String.t
  def dump(scalar) do
    scalar |> R.represent |> serialize |> present
  end

  defp serialize(node) do
    to_string(Node.value(node))
  end

  defp present(content) do
    "--- " <>
    content  <>
    "\n...\n"
  end
end

I now call RepresentationGraph.represent/1 to build a representation. Then, in serialize/1, I use Node.value/1 to extract the value when building the canonical string. present/1 is unchanged.

Next Steps

This has turned into a pretty long post and there is a lot more to do in Yamlix. The next steps are:

Get this on GitHub
Add support for maps - maps will require a recursive traversal through the Map, which will help flesh out the design for Yamlix
Work through the YAML spec

These are things I’ll work on in next week’s post.

Joseph Kain