Yamlix: Indentation

In this post I’ll continue the Yamlix project that I’ve discussed previously in Yamlix and Yamlix - Lists and Maps.

This week I’ll start by going over the specification and looking for things that need to be implemented in Yamlix. Then, I’ll spend the rest of the post working on generating YAML for more complex structures and getting the indentation right.

If you want to look over the source you can find it in the Yamlix repo on GitHub.

The YAML 1.2 Specification

At this point we have a lot of basic support done. I should start working through the spec and making sure I’m conforming. I also need to implement proper formatting: formatting is definitely not correct for nested structures.

First, I’ll read through the spec and summarize what the different sections describe. Then I can put together a plan for implementation.

Section 1 - Introduction
Section 2 - Examples of YAML
Section 3 - Description of the dump and load processes and models
Section 4 - Syntax
- BNF Production rules
- Indentation and Context
- Naming conventions
Section 5 - Character usage
- Description of allowed characters, disallowed characters must be escaped
- It is recommended that we output UTF-8
- Indicator Characters - characters that indicate structure in the document
- Tabs and spaces are the only accepted white space characters
- Rules for escaping non-printable characters
Section 6 - Structure of YAML - spaces, lines, directives, node properties
- Indentation only with spaces (no tabs)
- Long lines can be folded
- Tag shorthand
- Nodes can have an anchor property
Section 7 - Flow style
- I don’t think Yamlix will emit flow style so I will skip over this
Section 8 - Block style
- Block scalars
- Block sequences - says that properties are not allowed, what exactly does this mean?
- Compact mapping sequences don’t allow properties, must use expanded forn with “?:” to include properties
Section 9 - YAML Character stream
- Concerns multiple documents, I don’t need this (at least not right now)
Section 10 - Recomended Schemas
- Seems like this is just for parsing

A TODO list

After reading the spec I put together a list of TODO items

Serialize to a tree using anchors and comparison based on section 3.2.2
Escape non-printable characters in scalars based on Section 5
Indent based on Section 6
Use tag shorthand based on Section 6, syntax 6.8.2
Generate anchors based on Section 6
Fold long scalars based on Section 6 and Section 8
Choose between compact and non-compact forms for mappings and sequences based on need for properties based on Section 8

And then we still have the following types left to support. These will require extension to yamerl to parse:

Struct
Tuples
Streams

I’ve filed GitHub issues for all of these.

Indentation

I’ll start working with indentation and of course, the first step is a test.

While writing this test I realized that when writing out a string like “1” I need to include quotes around the scalar value. But, for other strings I don’t. I’ve removed this from test so that I don’t have to fix this right now. But, I’ve filed another GitHub issues so I remember to come back to it later.

Now, for the test itself. The test is a little long but I think it’s intention is clear:

test "it indents" do
  complex = %{
    "key1" => [ "elem1", "elem2", "elem3" ],
    "key2" => %{ 1 => "a", 2 => "b" },
    "key3" => 4,
    "key4" => [
      %{ "aa" => "w", "bb" => [ "p", "q" ] },
      %{ "cc" => "y", "dd" => "z" }
    ],
    "key5" => 5
  }
  
  # Expected data generated by Ruby's YAML support
  assert Yamlix.dump(complex) == """
  ---
  key1:
  - elem1
  - elem2
  - elem3
  key2:
    1: a
    2: b
  key3: 4
  key4:
  - aa: w
    bb:
    - p
    - q
  - cc: y
    dd: z
  key5: 5
  ...
  """
end

I ported the complex data structure to Ruby (this just a matter of removing the % signs) and then used Ruby’s YAML generator to generate the expected data. Here’s the Ruby program:

require 'yaml'

complex = {
  "key1" => [ "elem1", "elem2", "elem3" ],
  "key2" => { 1 => "a", 2 => "b" },
  "key3" => 4,
  "key4" => [
    { "aa" => "w", "bb" => [ "p", "q" ] },
    { "cc" => "y", "dd" => "z" }
  ],
  "key5" => 5
}

puts YAML.dump(complex)

It’s nice to have a working YAML generator to use to generate reference YAML.

Of course, the new test fails:

    1) test it indents (YamlixTest)
       test/yamlix_test.exs:72
       Assertion with == failed
       code: Yamlix.dump(complex) == "---\nkey1:\n- elem1\n- elem2\n- elem3\nkey2:\n  1: a\n  2: b\nkey3: 4\nkey4:\n- aa: w\n  bb:\n  - p\n  - q\n- cc: y\n  dd: z\nkey5: 5\n...\n"
       lhs:  "--- \nkey1: \n- elem1\n- elem2\n- elem3\nkey2: \n1: a\n2: b\nkey3: 4\nkey4: \n- \naa: w\nbb: \n- p\n- q\n- \ncc: y\ndd: z\nkey5: 5\n...\n"
       rhs:  "---\nkey1:\n- elem1\n- elem2\n- elem3\nkey2:\n  1: a\n  2: b\nkey3: 4\nkey4:\n- aa: w\n  bb:\n  - p\n  - q\n- cc: y\n  dd: z\nkey5: 5\n...\n"
       stacktrace:
         test/yamlix_test.exs:85

Understanding the Test error

But this error is a little hard to read with the YAML written out all on one line. In other tests I could get by with the one-line output but it isn’t going to cut it for indentation. I added an IO.puts to the test to get a dump of Yamlix output:

---
key1:
- elem1
- elem2
- elem3
key2:
1: a
2: b
key3: 4
key4:
-
aa: w
bb:
- p
- q
-
cc: y
dd: z
key5: 5
...

So, we should characterize what’s working and what isn’t:

The keys from the top-level map (“key1”, “key2”, … “key5”) are indented correctly. This is because they need 0 spaces of identation.
“elem1”, “elem2”, and “elem3” are also indented correctly. This is because their level implies indentation of 2 spaces but the “- “ characters count toward that indentation. So this is correct by accident.
Keys “1” and “2” are not indented correctly. They are at the second level and should be indented with 2 spaces.
Keys “aa” and “cc” are in the wrong places. They should be on the preceeding lines following the “- “
Keys “bb” and “dd” are not indented correctly and should be indented 2 spaces.
Items “p” and “q” are at the third level and should be indented 4 spaces though the “- “ characters count toward the indentation.
“key5” is in the right place because it is at the top-level.

First, I need to decide where the indentation belongs in the code. Figure 3.1 “Processing Overview” from the YAML 1.2 Spec describes spacing and formatting in the presentation stage. I’ll follow this in Yamlix.

This will require some drastic changes from what we have now.

Implementing Indentation

First of all, serialize shouldn’t be responsbile for generating the strings. It’s role should be to convert the input graph into a tree (by removing any cycles using aliases). I’d like to deal with this later. The present function should then walk through the tree generating and format the output as it goes. This suggests that present should call to_string.

To begin with, I’ll move the call to to_string. This doesn’t really get us any closer to identing but it gives us a better abstraction.

defp serialize(graph) do
  graph
end

defp present(tree) do
  "--- " <>
  to_string(tree) <>
  "\n...\n"
end

Next, we need to traverse the tree in present. This function will start to grow so I will move it to its own module in lib/yamlix/presentation.ex

defmodule Presentation do
  def present(tree) do
    "--- " <>
    to_string(tree) <>
    "\n...\n"
  end
end

And I’ve updated the Yamlix module to call present from Presentation. Now, to make some functional changes. I need to traverse the tree, handle indentation, and generate YAML.

This was more challenging that I expected and took me several tries to get right. The tests I’ve written so far helped me to idenitfy problems and eventually to build up a working solution.

The frist thing I needed was a function to generate enough spaces for indentation:

defp indent(count) do
  String.duplicate(" ", count * 2)
end

The strategy that I employed with indent was to mantain a current indentation level as I traversed the tree. Whenever I descended into a child node in the tree I would increase the level. For example, I had a function like this:

defp block_sequence(%Node{value: list, tag: t}, level) do
  list |> List.foldl "\n", fn val, acc ->
    acc <> indent(n) <> "- " <> sequence_element(val, level + 1)
  end
end

to print out YAML for lists. block_sequence is passed a sequence node and the value n where n represents the indentation level. The YAML spec describes that for sequences like

key:
  key2:
  - a
  - b
  - c

The characters - are considered indentation. So the value a is a indentation level 2 (thought of as 4 spaces). So, when indenting the sequence element I use level. If the sequence element is some block form of its own it needs to be indented to the next level so I pass level + 1 to sequence_element

I ran into problems getting the indentation right between scalar nodes and block nodes in the sequence value.

I then tried a different approach. The YAML spec includes parameterized BNF production rules for all legal YAML. BNF stands for Backus-Naur Form which is a language for describing formal grammars. Here’s an example for block style sequences:

[183] l+block-sequence(n)      ::=  ( s-indent(n+m) c-l-block-seq-entry(n+m) )+
                                    /* For some fixed auto-detected m > 0 */
[184] c-l-block-seq-entry(n)   ::=  “-” /* Not followed by an ns-char */
                                    s-l+block-indented(n,block-in)
[185] s-l+block-indented(n,c)  ::=    ( s-indent(m)
                                       ( ns-l-compact-sequence(n+1+m)
                                       | ns-l-compact-mapping(n+1+m) ) )
                                    | s-l+block-node(n,c)
                                    | ( e-node s-l-comments )	 
[186] ns-l-compact-sequence(n) ::=  c-l-block-seq-entry(n)
                                    ( s-indent(n) c-l-block-seq-entry(n) )*

The block-in parameter is described as:

In block styles, indentation is used to delineate structure. To capture human perception of
indentation the rules require special treatment of the “-” character, used in block sequences.
Hence in some cases productions need to behave differently inside block sequences (block-in
context) and outside them (block-out context).

I started writing functions to match these production rules. I had a function l_block_sequence and s_l_block_indented and so on. The functions followed the production rules. There are many of these rules. I found this strategy a bit tedious and still prone to error.

I took a step back and thought a little more. The description of block-in above was actually really helpful. Especially the point that “in some cases productions need to behave differently inside block sequences”. I started rewriting to code, following my original strategy but applying lessons I’d learned while reading through the production rules. I ended up with:

defmodule Presentation do
  alias RepresentationGraph.Node
    
  def present(tree) do
    "--- " <>
    produce(tree) <>
    "...\n"
  end
  
  defp produce(%Node{value: list, tag: t}) when is_list(list) do
    block_sequence(%Node{value: list, tag: t}, 0)
  end
  defp produce(%Node{value: map, tag: t}) when is_map(map) do
    block_mapping(%Node{value: map, tag: t}, 0)
  end
  defp produce(node) do
    literal(node, 0) <> "\n"
  end
  
  defp block_sequence(%Node{value: list, tag: t}, n) do
    list |> List.foldl "\n", fn val, acc ->
      acc <> indent(n) <> "- " <> sequence_element(val, n + 1)
    end
  end
  
  defp sequence_element(%Node{value: list, tag: t}, n) when is_list(list) do
    block_sequence(%Node{value: list, tag: t}, n)
  end
  defp sequence_element(%Node{value: map, tag: t}, n) when is_map(map) do
    [key | keys] = Map.keys(map)
    
    mapping_pair(map, key, n) <> (
      keys |> List.foldl "", fn key, acc ->
        acc <> indent(n) <> mapping_pair(map, key, n)
      end
    )
  end
  defp sequence_element(node, _n) do
    literal(node, 0) <> "\n"
  end
  
  defp block_mapping(%Node{value: map, tag: t}, n) do
    Map.keys(map) |> List.foldl "\n", fn key, acc ->
      acc <> indent(n) <> mapping_pair(map, key, n)
    end
  end
  
  defp mapping_pair(map, key, n) do
    literal(key, 0) <> ":" <> mapping_value(Map.get(map, key), n)
  end
  
  defp mapping_value(%Node{value: list, tag: t}, n) when is_list(list) do
    block_sequence(%Node{value: list, tag: t}, n)
  end
  defp mapping_value(%Node{value: map, tag: t}, n) when is_map(map) do
    block_mapping(%Node{value: map, tag: t}, n + 1)
  end
  defp mapping_value(node, _n) do
    " " <> literal(node, 0) <> "\n" 
  end

  defp literal(%Node{value: val, tag: t}, n) do
    indent(n) <> tag_and_space(t) <> Kernel.to_string(val)
  end

  defp tag_and_space(t) do
    case t do
      "" -> ""
      tag -> tag <> " "
    end
  end

  @spec indent(non_neg_integer) :: String.t
  defp indent(0), do: ""
  defp indent(level) do
    String.duplicate(" ", level * 2)
  end
end

There’s a lot here. So let’s consider an example. Suppose we have a tree like:

  # This would be produced for [ %{ "a" => "b", "c" => "d"} ]
  %RepresentationGraph.Node {
    value: [
      %RepresentationGraph.Node{value: %{
        %RepresentationGraph.Node{value: "a"} => %RepresentationGraph.Node{value: "b"},
        %RepresentationGraph.Node{value: "c"} => %RepresentationGraph.Node{value: "d"}
        }
      },

    ]
  }

Then our tree begins with a list node so in present/1 the call to produce will match this function:

defp produce(%Node{value: list, tag: t}) when is_list(list) do
  block_sequence(%Node{value: list, tag: t}, 0)
end

and will call block_sequence with 0 indentation (the initial level):

defp block_sequence(%Node{value: list, tag: t}, n) do
  list |> List.foldl "\n", fn val, acc ->
    acc <> indent(n) <> "- " <> sequence_element(val, n + 1)
  end
end

The first thing to note here is that lists always start on a new line because the initial value in the fold is "\n". Each element is indented to level n followed by “- “. The elements of the sequence are emitted by sequence_element/2 at the next indentation level. The function sequence_element/2 has multiple heads to handle different node types. Now, in our example, the elements of the lists are mapping nodes. So we would emit those mapping nodes with this map variant for sequence_element:

defp sequence_element(%Node{value: map, tag: t}, n) when is_map(map) do
  [key | keys] = Map.keys(map)
  
  mapping_pair(map, key, n) <> (
    keys |> List.foldl "", fn key, acc ->
      acc <> indent(n) <> mapping_pair(map, key, n)
    end
  )
end

This function is tailored to emitting a mapping as part of a sequence in that it handles the first entry of the map as a special case which is alredy indented. The map keys are destructured into the first key and the remaining keys. The first key (“a” in the example) and it’s associated value (“b”) are not indented and are emitted by mapping_pair/3. The remaining keys and values are indented and emited using mapping_pair/3.

You may have noticied that there is a bug in this function: it does not work for empty maps because Map.keys(map) will be an empty list and won’t match [key | keys]. I’ll add this problem to the TODO list and we’ll come back to it later.

The function mapping_pair/3 which is used to print the key / value pair is:

defp mapping_pair(map, key, n) do
  literal(key, 0) <> ":" <> mapping_value(Map.get(map, key), n)
end

This function assumes that keys are scalars. This may be an incorrect assumption but it will serve us until I write better tests. I’ve filed an issue on GitHub to look into this futher.

The value can be any type of node and is handled by mapping_value/3 which uses the same strategy we’ve used before - it has multiple heads, one per node type. which correctly format the nodes based on their context within a mapping node. This strategy goes on recursively until the entire tree has been processed.

Generating YAML for Empty Maps

We wrote down a TODO item to handle empty maps in sequence_element/2. To make sure we do this correctly, we need a test. I started writing this test:

test "it dumps empty maps in a list" do
  list = [%{ "a" => "b"}, %{}]
  assert Yamlix.dump(list) == """
  --- 
  - a: b
  - ???
  ...
  """
end

but wasn’t sure what I should expect for the empty map. Should the sequence element be omitted altogether? I ended up checking what Ruby’s YAML generator does. This ruby program:

require 'yaml'

list = [{ "a" => "b"}, {}]
puts YAML.dump(list)

generates

---
- a: b
- {}

{} uses the YAML flow style as opposed to block style. For a non-empty map flow style could be {a : b, c : d}. It makes sense for Ruby to use this style for the empty map and I’ll do the same in Yamlix. So my test becomes:

  test "it dumps empty maps in a list" do
    list = [%{ "a" => "b"}, %{}]
    assert Yamlix.dump(list) == """
    --- 
    - a: b
    - {}
    ...
    """
  end

When I run the test I get a match error:

1) test it dumps empty maps in a list (YamlixTest)
   test/yamlix_test.exs:52
   ** (MatchError) no match of right hand side value: []
   stacktrace:
     (yamlix) lib/yamlix/presentation.ex:30: Presentation.sequence_element/2
     (yamlix) lib/yamlix/presentation.ex:22: anonymous fn/3 in Presentation.block_sequence/2
     (elixir) lib/list.ex:102: List."-foldl/3-lists^foldl/2-0-"/3
     (yamlix) lib/yamlix/presentation.ex:6: Presentation.present/1
     test/yamlix_test.exs:54

which is what we expected. To fix this we can use a case form in sequence_element/2 like this:

defp sequence_element(%Node{value: map, tag: t}, n) when is_map(map) do
  case Map.keys(map) do
    [] -> "{}\n"
    [key | keys] -> mapping_pair(map, key, n) <> (
      keys |> List.foldl "", fn key, acc ->
        acc <> indent(n) <> mapping_pair(map, key, n)
      end
    )
  end
end

The case matches on Map.keys(map) - the list of keys in the map. In the case of an empty list of keys, we just emit the empty list. In the case of a non-empty list, we have the same code from the original version. With this change the test passes.

Next Steps

At this point I’m pretty happy with indentation. The next steps from here would be to follow the TODO list (also GitHub issues)

Serialize to a tree using anchors and comparison based on section 3.2.2
Escape non-printable characters in scalars based on Section 5
Use tag shorthand based on Section 6, syntax 6.8.2
Generate anchors based on Section 6
Fold long scalars based on Section 6 and Section 8
Choose between compact and non-compact forms for mappings and sequences based on need for properties based on Section 8
Struct
Tuples
Streams
Presentation.mapping_pair/3 assumes mapping keys are scalars

I hope you’ll join me next when I work through a couple of these features.

Joseph Kain