Slax, an Elixir library for SAX parsing XML documents, is now available on Hex. This post will briefly cover what SAX parsing is and some of the tradeoffs associated with SAX parsing, then dive into how to use Slax.

What Is SAX Parsing?

The Simple API for XML (SAX) is an event-based approach to parsing XML documents. A document is parsed as a stream, and individual pieces of the document are passed to a callback function as they are encountered. This is in contrast to the Document Object Model (DOM) approach to parsing, where an entire document is loaded into a data structure that supports arbitrary queries against the document. (Those queries are typically expressed via XPath or XQuery.)

A SAX parser handling a basic XML document like:

<xml>
  <node name="test node" id="55">
    Value
  </node>
</xml>

would generate seven events (ignoring whitespace-only text between elements), and consequently invoke the callback function seven times. Those seven events are:

  1. the start of the document
  2. the start of the <xml> node
  3. the start of the <node> node (with name and id values captured as attributes)
  4. the text Value
  5. the end of the <node> node
  6. the end of the <xml> node
  7. the end of the document
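
One way to picture this sequence is as a fold over a list of events. The sketch below is only an illustration: the tuples are hypothetical stand-ins for the seven events (they are not Slax's event structs), and EventTrace is a made-up module whose callback records the kind of each event in the order it arrives.

```elixir
defmodule EventTrace do
  # Hypothetical event tuples standing in for what a SAX parser would emit
  # for the sample document; these are NOT Slax's event structs.
  def events do
    [
      :start_document,
      {:start_element, "xml", []},
      {:start_element, "node", [{"name", "test node"}, {"id", "55"}]},
      {:characters, "Value"},
      {:end_element, "node"},
      {:end_element, "xml"},
      :end_document
    ]
  end

  # The callback: receives one event plus the state accumulated so far.
  def callback(event, seen) when is_atom(event), do: [event | seen]
  def callback(event, seen), do: [elem(event, 0) | seen]

  # Feed the events to the callback one at a time, in document order.
  def run do
    events()
    |> Enum.reduce([], &callback/2)
    |> Enum.reverse()
  end
end

# The callback is invoked once per event, seven times in total.
IO.inspect(EventTrace.run())
```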

As a result of its streaming nature, a SAX parser's callback is responsible for maintaining its own state whenever data from, or knowledge of, previous events is required to process the current event. For example, a callback function would need to track whether the outer node was an <Author> or an <Illustrator> to properly extract author names from <Name> nodes in the following document:

<Book>
  <Title>The Gruffalo</Title>
  <Author>
    <Name>Julia Donaldson</Name>
    <DateOfBirth>1948-09-16</DateOfBirth>
  </Author>
  <Illustrator>
    <Name>Axel Scheffler</Name>
    <DateOfBirth>1957-??-??</DateOfBirth>
  </Illustrator>
</Book>
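
A minimal sketch of that kind of state tracking, assuming hypothetical tuple events rather than Slax's structs: the callback keeps a stack of open element names, so text is only treated as an author name when <Name> sits directly inside <Author>. The AuthorNames module and its event shapes are invented for illustration, and the <DateOfBirth> elements are left out for brevity.

```elixir
defmodule AuthorNames do
  # Hypothetical event tuples for the <Book> document; not Slax's event structs.
  def events do
    [
      {:start_element, "Book"},
      {:start_element, "Title"},
      {:characters, "The Gruffalo"},
      {:end_element, "Title"},
      {:start_element, "Author"},
      {:start_element, "Name"},
      {:characters, "Julia Donaldson"},
      {:end_element, "Name"},
      {:end_element, "Author"},
      {:start_element, "Illustrator"},
      {:start_element, "Name"},
      {:characters, "Axel Scheffler"},
      {:end_element, "Name"},
      {:end_element, "Illustrator"},
      {:end_element, "Book"}
    ]
  end

  # State is {stack_of_open_elements, author_names_found_so_far}.
  def callback({:start_element, name}, {stack, names}), do: {[name | stack], names}
  def callback({:end_element, _name}, {[_ | stack], names}), do: {stack, names}

  # Text counts as an author name only when the two innermost open
  # elements are <Name> directly inside <Author>.
  def callback({:characters, text}, {["Name", "Author" | _] = stack, names}),
    do: {stack, [text | names]}

  # Every other event leaves the state untouched.
  def callback(_event, state), do: state

  def run do
    {_stack, names} = Enum.reduce(events(), {[], []}, &callback/2)
    Enum.reverse(names)
  end
end

IO.inspect(AuthorNames.run())
# => ["Julia Donaldson"]
```

Without the stack, the illustrator's <Name> would be indistinguishable from the author's, because both produce identical element and character events.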

When To Use (And Not Use) SAX Parsing

DOM parsing provides a simpler metaphor for interacting with XML data than SAX parsing, allowing easier querying, iteration over a subset of nodes, etc. However, a DOM parser cannot return data to its caller until it has processed the entire document, making it unsuitable for use with streaming data. Additionally, the representation of the entire parsed document is held in memory; while this is not an issue for small or even medium-sized documents, it can be problematic for documents that reach hundreds of megabytes or even gigabytes in size.

SAX parsing addresses both of those concerns. The event-based approach to parsing pairs nicely with streaming data, and the memory usage of the parser is determined solely by what data the callback function decides to hold onto. There are tradeoffs associated with SAX parsing, of course, and it is not wholly superior to DOM parsing. Certain aspects of XML, such as validation, require scanning the entire document, which can be difficult to reconcile with the SAX approach. Additionally, parsing self-referential XML documents (e.g. when the document contains a glossary of repeated terms) can lead to complicated or confusing callback implementations.

How To Use Slax

To parse an XML document using Slax, simply pass a String or IO device to the Slax.parse/3 function, along with a parser module and (optional) initial state for the parser.

The parser module is expected to implement a handle/2 function that will receive a struct representing a SAX event and the current parser state. The return value of the handle/2 function becomes the new state of the parser and will be passed to the ensuing invocation of handle/2.

The IO device does not need to contain the entire document when Slax.parse/3 is invoked; Slax will repeatedly call IO.read/2 until :eof is received.
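
The read-until-:eof pattern can be illustrated with the standard library alone. The sketch below is not Slax's implementation; ChunkReader and the chunk size of 8 are assumptions made for illustration. It reads an in-memory StringIO device in small chunks, roughly as a streaming parser might before tokenising each chunk.

```elixir
defmodule ChunkReader do
  # Read `device` in chunks of up to `size` characters until IO.read/2
  # returns :eof, collecting the chunks in order.
  def read_all(device, size), do: read_all(device, size, [])

  defp read_all(device, size, acc) do
    case IO.read(device, size) do
      :eof -> Enum.reverse(acc)
      {:error, reason} -> raise "read failed: #{inspect(reason)}"
      chunk -> read_all(device, size, [chunk | acc])
    end
  end
end

# StringIO gives an IO device backed by a string, handy for trying this out.
{:ok, device} = StringIO.open("<xml><node>Value</node></xml>")
chunks = ChunkReader.read_all(device, 8)

# Joining the chunks gives back the original document.
IO.puts(Enum.join(chunks))
```

Because each chunk is requested only when the previous one has been consumed, the same loop works for a file or socket-backed device that is still being written to.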

Defining a Parser Module

The Slax.Parser module defines a behaviour which all parsers are expected to adopt. The behaviour's callbacks are init/1, handle/2 and finalize/1; of these, handle/2 and finalize/1 are required and init/1 is optional.

The intent of each function, in brief, is:

  • init/1 is used to construct the initial state of the parser. It receives any options passed to the Slax.parse/3 function and its return value becomes the parser’s initial state. If not implemented, any options passed to Slax.parse/3 become the parser’s initial state.
  • handle/2 is called by Slax for each event generated when parsing the given document. It receives a struct representing the event as well as the current parser state.
  • finalize/1 is called once the document has been fully parsed and receives the current parser state as its sole argument. The value returned from finalize/1 will be the value returned by Slax.parse/3, and as such is a convenient way to translate internal parser state into externally usable data.
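
To make the flow of state between these three callbacks concrete, here is a rough sketch of a driver that calls them in order. It is a hypothetical stand-in and not Slax's implementation: LifecycleSketch, NameCollector and the tuple events are all invented for illustration, and a real run would receive Slax's event structs instead.

```elixir
defmodule LifecycleSketch do
  # Hypothetical driver, NOT Slax's implementation: shows the order in which
  # the three callbacks run and how state flows between them.
  def run(events, parser, options) do
    state = parser.init(options)
    state = Enum.reduce(events, state, fn event, acc -> parser.handle(event, acc) end)
    parser.finalize(state)
  end
end

defmodule NameCollector do
  # Toy parser with the same three-callback shape; events are hypothetical
  # tuples rather than Slax structs.
  def init(_options), do: %{in_name: false, names: []}

  def handle({:start_element, "Name"}, state), do: %{state | in_name: true}
  def handle({:end_element, "Name"}, state), do: %{state | in_name: false}

  def handle({:characters, text}, %{in_name: true} = state),
    do: %{state | names: [text | state.names]}

  def handle(_event, state), do: state

  # Convert internal state into the value handed back to the caller.
  def finalize(state), do: Enum.reverse(state.names)
end

events = [
  {:start_element, "Name"},
  {:characters, "Julia Donaldson"},
  {:end_element, "Name"}
]

IO.inspect(LifecycleSketch.run(events, NameCollector, []))
# => ["Julia Donaldson"]
```

The internal map with its :in_name flag never leaves the parser; the caller only sees the list that finalize/1 builds from it.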

Slax also provides a macro to define default implementations for all functions, which allows a parser implementation to only provide handle/2 function heads that match events it is concerned with. (Parser state will remain unchanged should a call to handle/2 go unmatched.) A parser module can invoke the macro via use Slax.Parser. Additionally, default options can be specified via the :state keyword when invoking the macro (e.g. use Slax.Parser, state: []). These default options will be passed to the optional init callback unless options are provided to the Slax.parse/3 function.
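
For readers curious how a macro can supply such defaults, the following is one possible approach in plain Elixir. It is a sketch under assumptions and not Slax.Parser's source: DefaultsSketch, default_state/0 and the tuple events are hypothetical. The catch-all handle/2 clause is injected with @before_compile so that it lands after the module's own clauses.

```elixir
defmodule DefaultsSketch do
  # Hypothetical stand-in for a `use`-able parser macro; NOT Slax.Parser's source.
  defmacro __using__(opts) do
    quote do
      @before_compile DefaultsSketch

      # Options given to `use` are remembered as the default state.
      def default_state, do: unquote(Keyword.get(opts, :state))

      # Overridable defaults: options become the state, and the final
      # state is returned unchanged.
      def init(options), do: options
      def finalize(state), do: state
      defoverridable init: 1, finalize: 1
    end
  end

  # Appended after the module's own handle/2 clauses, so it only runs
  # when none of them matched: unmatched events leave state unchanged.
  defmacro __before_compile__(_env) do
    quote do
      def handle(_event, state), do: state
    end
  end
end

defmodule ElementCounter do
  use DefaultsSketch, state: 0

  # The only clause this parser cares about.
  def handle({:start_element, _name}, count), do: count + 1
end

state = ElementCounter.init(ElementCounter.default_state())
state = ElementCounter.handle({:start_element, "xml"}, state)
state = ElementCounter.handle({:characters, "Value"}, state)
IO.inspect(ElementCounter.finalize(state))
# => 1
```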

Slax uses 11 different structs to represent the various events that can be encountered during parsing, but most parsers will only be concerned with 3: StartElement, Characters and EndElement. Details about all 11 structs and their fields can be found in the documentation.

Putting everything together, a basic parser module that prints authors’ names from XML documents similar to what was shown above looks like:

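defmodule AuthorPrinter do
  @behaviour Slax.Parser

  # Pulls in default callback implementations, so only the events of
  # interest need matching handle/2 clauses.
  use Slax.Parser

  alias Slax.Event.{Characters, EndElement, StartElement}

  # Start outside of both <Author> and <Name>.
  def init(_args), do: %{in_author: false, in_name: false}

  # Track entering and leaving <Author> and <Name> elements.
  def handle(%StartElement{local_name: "Author"}, state), do: Map.put(state, :in_author, true)
  def handle(%EndElement{local_name: "Author"}, state), do: Map.put(state, :in_author, false)
  def handle(%StartElement{local_name: "Name"}, state), do: Map.put(state, :in_name, true)
  def handle(%EndElement{local_name: "Name"}, state), do: Map.put(state, :in_name, false)

  # Print text only when inside a <Name> that is inside an <Author>.
  def handle(%Characters{characters: name}, %{in_author: true, in_name: true} = state) do
    IO.puts(name)
    # IO.puts/1 returns :ok, so return the state explicitly.
    state
  end
end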

The parser represents its state as a Map and uses separate keys to track when it is inside an <Author> node and when it is inside a <Name> node. Any characters encountered when inside both <Author> and <Name> nodes are then written to stdout. It’s important to remember that the return value of the handle/2 function becomes the new state of the parser, which explains why the parser is careful to return the current state after printing the author’s name.

Concurrency

Because the handle/2 function is called in the same process that runs the Slax parser, and because the return value of that function is needed for the ensuing handle/2 call, the parser makes no progress until the handle/2 callback returns. Should a parser need to perform a long-running or asynchronous computation, a viable solution is to dispatch work to a GenServer (via cast, not call) from the parser’s handle/2 callback.
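
A minimal sketch of that dispatch pattern, assuming a hypothetical worker and tuple events: SlowWorker, DispatchingParser and the upcasing "work" are invented for illustration and are not part of Slax. The worker pid is carried in the parser state, and the handle/2-shaped function returns as soon as the cast has been sent.

```elixir
defmodule SlowWorker do
  use GenServer

  # Hypothetical worker: performs "slow" work outside the parsing process.
  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, [])

  # cast/2 returns :ok immediately, so the caller never waits for the work.
  def process(pid, text), do: GenServer.cast(pid, {:process, text})

  # A synchronous call, used only to fetch results once parsing is done.
  def results(pid), do: GenServer.call(pid, :results)

  @impl true
  def init(results), do: {:ok, results}

  @impl true
  def handle_cast({:process, text}, results) do
    # Stand-in for an expensive computation.
    {:noreply, [String.upcase(text) | results]}
  end

  @impl true
  def handle_call(:results, _from, results), do: {:reply, Enum.reverse(results), results}
end

defmodule DispatchingParser do
  # Shaped like a handle/2 callback, but matching a hypothetical tuple event.
  def handle({:characters, text}, %{worker: pid} = state) do
    :ok = SlowWorker.process(pid, text)
    state
  end

  def handle(_event, state), do: state
end

{:ok, pid} = SlowWorker.start_link()
state = DispatchingParser.handle({:characters, "value"}, %{worker: pid})
_state = DispatchingParser.handle({:end_element, "node"}, state)
IO.inspect(SlowWorker.results(pid))
# => ["VALUE"]
```

Using cast keeps the parser moving, at the cost of back-pressure: if events arrive faster than the worker can handle them, its mailbox grows, so a call (or a bounded queue) may still be the better choice for very large documents.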

More Information

More detailed information is available in the documentation.

The source code for Slax is available on GitHub, and bug reports, feature requests and suggestions for improvement are welcome.