Saxy

Saxy (Sá xị) is an XML SAX parser and encoder in Elixir that focuses on speed, usability and standard compliance.

Saxy complies with Extensible Markup Language (XML) 1.0 (Fifth Edition).

Feature highlights

  • An incredibly fast XML 1.0 SAX parser.
  • An extremely fast XML encoder.
  • Native support for streaming parsing large XML files.
  • Parse XML documents into simple DOM format.
  • Support for returning early from event handlers.

Installation

Add :saxy to your mix.exs.

def deps do
  [{:saxy, "~> 0.9.1"}]
end

Overview

Full documentation is available on HexDocs.

SAX parser

A SAX event handler implementation is required before starting parsing.

defmodule MyEventHandler do
  @behaviour Saxy.Handler

  def handle_event(:start_document, prolog, state) do
    IO.inspect("Start parsing document")
    {:ok, [{:start_document, prolog} | state]}
  end

  def handle_event(:end_document, _data, state) do
    IO.inspect("Finish parsing document")
    {:ok, [{:end_document} | state]}
  end

  def handle_event(:start_element, {name, attributes}, state) do
    IO.inspect("Start parsing element #{name} with attributes #{inspect(attributes)}")
    {:ok, [{:start_element, name, attributes} | state]}
  end

  def handle_event(:end_element, name, state) do
    IO.inspect("Finish parsing element #{name}")
    {:ok, [{:end_element, name} | state]}
  end

  def handle_event(:characters, chars, state) do
    IO.inspect("Receive characters #{chars}")
    {:ok, [{:characters, chars} | state]}
  end
end

Then start parsing XML documents with:

iex> xml = "<?xml version='1.0' ?><foo bar='value'></foo>"
iex> Saxy.parse_string(xml, MyEventHandler, [])
{:ok,
 [{:end_document},
  {:end_element, "foo"},
  {:start_element, "foo", [{"bar", "value"}]},
  {:start_document, [version: "1.0"]}]}
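
Because `MyEventHandler` prepends each event to the state, the result lists the newest event first. A minimal sketch of restoring document order, using the result above copied in as a literal so it runs without Saxy (the variable names are illustrative only):

```elixir
# Events as accumulated by MyEventHandler (newest first),
# copied from the parse result shown above.
events = [
  {:end_document},
  {:end_element, "foo"},
  {:start_element, "foo", [{"bar", "value"}]},
  {:start_document, [version: "1.0"]}
]

# Prepending is cheap for lists, so reverse once after parsing
# instead of appending inside the handler.
in_order = Enum.reverse(events)

# Collect element names in the order they were opened.
opened = for {:start_element, name, _attributes} <- in_order, do: name

IO.inspect(in_order)
IO.inspect(opened)
```

Alternatively, the handler could keep any other state shape (a map, a counter) if the full event log is not needed.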

Streaming parsing

Saxy also accepts a file stream as input:

stream = File.stream!("/path/to/file")

Saxy.parse_stream(stream, MyEventHandler, initial_state)

It also accepts any other stream, such as a transformed one:

stream = File.stream!("/path/to/file") |> Stream.filter(&(&1 != "\n"))

Saxy.parse_stream(stream, MyEventHandler, initial_state)

Partial parsing

Saxy can parse part of an XML document, and parse more of it later.

alias Saxy.Parser.Partial

xml = """
<?xml version='1.0' ?>
<foo bar='value'>
</foo>
"""
split_xml = String.split(xml, "\n")

{:ok, context} = Partial.init(MyEventHandler, initial_state)
{:ok, context} = Partial.parse(Enum.at(split_xml, 0), context)
{:ok, context} = Partial.parse(Enum.at(split_xml, 1), context)
{:ok, context} = Partial.parse(Enum.at(split_xml, 2), context)
{:ok, state} = Partial.finish(context)

Simple DOM format exporting

Sometimes it is more convenient to export the whole XML document into simple DOM format, in which each element is a 3-element tuple of the tag name, the attributes, and a list of its children.

The Saxy.SimpleForm module supports this:

Saxy.SimpleForm.parse_string(data)

{:ok,
 {"menu", [],
  [
    {"movie",
     [{"id", "tt0120338"}, {"url", "https://www.imdb.com/title/tt0120338/"}],
     [{"name", [], ["Titanic"]}, {"characters", [], ["Jack &amp; Rose"]}]},
    {"movie",
     [{"id", "tt0109830"}, {"url", "https://www.imdb.com/title/tt0109830/"}],
     [
       {"name", [], ["Forest Gump"]},
       {"characters", [], ["Forest &amp; Jenny"]}
     ]}
  ]}}
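
Since simple form is plain tuples, lists, and binaries, it can be walked with ordinary pattern matching. The following is a sketch only: `SimpleFormWalk` and `texts/2` are hypothetical names, not part of Saxy, and the input is a trimmed copy of the menu above written as a literal.

```elixir
defmodule SimpleFormWalk do
  # Hypothetical helper, not part of Saxy.

  # Returns the text of every element named `name`, in document order.
  def texts({tag, _attributes, children}, name) do
    own = if tag == name, do: [text_of(children)], else: []
    own ++ Enum.flat_map(children, &texts(&1, name))
  end

  # Text children are plain binaries and contain no elements.
  def texts(text, _name) when is_binary(text), do: []

  # Joins the direct text children of an element.
  defp text_of(children) do
    children |> Enum.filter(&is_binary/1) |> Enum.join()
  end
end

menu =
  {"menu", [],
   [
     {"movie", [{"id", "tt0120338"}], [{"name", [], ["Titanic"]}]},
     {"movie", [{"id", "tt0109830"}], [{"name", [], ["Forest Gump"]}]}
   ]}

names = SimpleFormWalk.texts(menu, "name")
IO.inspect(names)
```

The sketch assumes children are either element tuples or binaries, as in the output above.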

xmerl format exporting

Saxy supports exporting to xmerl format, which you can then use with xmerl_xpath or SweetXml.

Note that xmerl format requires tag and attribute names to be atoms. By default Saxy uses String.to_existing_atom/1 to avoid creating atoms at runtime. You can override this behaviour by setting the :atom_fun option, for example to &String.to_atom/1.

iex> string = File.read!("/path/to/my.xml")
iex> Saxy.Xmerl.parse_string(string, atom_fun: &String.to_atom/1)
{:ok,
 {:xmlElement,
  :foo,
  :foo,
  [],
  {:xmlNamespace, [], []},
  [],
  1,
  [{:xmlAttribute, :bar, :bar, [], [], [], 1, [], 'value', :undefined}],
  [],
  [],
  [],
  :undeclared}}
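
The difference between the two conversion functions mentioned in the note on :atom_fun can be seen without Saxy. This sketch uses only the Elixir standard library; the tag name is made up and includes a unique integer so that it is very unlikely to exist as an atom already.

```elixir
# A made-up tag name that should not exist as an atom in this VM yet.
tag = "saxy_demo_tag_#{System.unique_integer([:positive])}"

# String.to_existing_atom/1 raises ArgumentError for unknown names,
# which protects the atom table from untrusted input.
existing =
  try do
    String.to_existing_atom(tag)
  rescue
    ArgumentError -> :not_an_existing_atom
  end

IO.inspect(existing)

# String.to_atom/1 always succeeds and creates the atom if needed.
created = String.to_atom(tag)

# Once the atom exists, the strict conversion succeeds as well.
IO.inspect(String.to_existing_atom(tag) == created)
```

Atoms are not garbage collected, so String.to_atom/1 is best kept for documents whose tag and attribute names you trust.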

XML builder

Saxy offers two APIs for building simple form and encoding it into an XML document.

Use Saxy.XML to build and compose XML simple form, then Saxy.encode!/2 to encode the built element into XML binary.

iex> import Saxy.XML
iex> element = element("person", [gender: "female"], "Alice")
{"person", [{"gender", "female"}], [{:characters, "Alice"}]}
iex> Saxy.encode!(element, [])
"<?xml version=\"1.0\"?><person gender=\"female\">Alice</person>"

See Saxy.XML for more XML building APIs.

Saxy also provides the Saxy.Builder protocol to help compose structs into simple form.

defmodule Person do
  @derive {Saxy.Builder, name: "person", attributes: [:gender], children: [:name]}

  defstruct [:gender, :name]
end

iex> jack = %Person{gender: :male, name: "Jack"}
iex> john = %Person{gender: :male, name: "John"}
iex> import Saxy.XML
iex> root = element("people", [], [jack, john])
iex> Saxy.encode!(root, [])
"<?xml version=\"1.0\"?><people><person gender=\"male\">Jack</person><person gender=\"male\">John</person></people>"

Benchmarking

Benchmarking XML parsing is hard, and the results depend heavily on the complexity of the document. In benchmarks Saxy is usually about 1.4 times faster than Erlsom. On deeply nested documents the gap widens to about 4.35 times.

As for the XML builder, Saxy is usually 4 times faster than xml_builder when encoding simple elements, and 17 times faster when encoding deeply nested elements.

The benchmark suite can be found in this repository.

Limitations

  • XSD is not supported.
  • DTD is not supported. When the parser encounters a <!DOCTYPE, it simply stops parsing.

Where did the name come from?

Sa xi Chuong Duong

☝️ Sa Xi, pronounced like sa-see, is an awesome soft drink made by Chuong Duong.

Contributing

If you have any issues or ideas, feel free to open an issue at https://github.com/qcam/saxy/issues.

To start developing:

  1. Fork the repository.
  2. Write your code and related tests.
  3. Create a pull request at https://github.com/qcam/saxy/pulls.