XML Module

Contents

Overview

The xml module implements a small and fast pure-Lua XML parser. It can be used in “SAX” or “DOM” mode, and the DOM mode supports parse-time pruning, allowing data to be very efficiently extracted from very large files.

local xml = require "xml"

Entity references and numeric character references in the XML are converted to utf-8 strings, assuming (but not verifying) that the document encoding is either utf-8 or a compatible one (e.g. us-ascii).

Functions

xml.SAX(text, fnTest, fnStart, fnEnd, fnComm, fnPI)

Parse XML text in SAX (callback) mode.

text can be a string or a 'read' function that can be called repeatedly to get more text. 'read' returns a string, or nil for end of stream.

Five callbacks can be provided:

The return value is true on success, nil, <error> on failure.

xml.DOM(text, [map])

Parse XML into a document tree.

map is a description of the elements to capture. If not given, the parser will capture all document contents. See Pruning for a description of other map values.

The DOM parser returns a tree of nodes that represent a document tree. CDATA nodes are represented by string values. By default, each returned element is described by a table with the following contents:

node._type  = element name
node.attr   = value of attribute named 'attr'
node[1...N] = child nodes 1...N

Comments are treated as nodes with _type == “_comment”, and processing instructions are treated as nodes with _type == “_pi_<name>”, where <name> is the name of the processing instruction.

Pruning

The DOM parser accepts a map value, analogous to a schema, that specifies how to treat elements within the document tree. The map is structured as a graph that describes which nodes to extract in the document, similar to how a regular expression can describe what text to extract from a string.

Maps can also be viewed as state machines. The parser keeps track of a “current” map as it parses the document. If and when it descends into child elements, the map describes what map to use as the “current” map for that child element.

Each node in the map may possess the following fields:

Here are some constructors for map nodes:

xml.CaptureAll() returns a map that will capture all descendants and any text.

xml.CaptureText() returns a map that will capture text, but no sub-elements.

xml.TextNode() returns a map that will capture text, and evaluate to string, not an array of strings.

xml.ByName(node) = capture values as described by 'node', but place them in a named field of the parent (versus another child). Any occurrences of the element will override any attribute of the same name. If the element occurs more than once in a parent element, the last occurrence overrides the preceding ones.

xml.ListByName(node) = capture values as described by 'node', but append them to an array stored in a named field of the parent. Any occurrences of the element will override any attribute of the same name.

xml.ByName and xml.ListByName can be composed with other constructors but not with themselves or each other.

Here are some pre-defined node types:

xml.STRING : this node describes an element that appears once in its parent element and contains a string. It will appear as a named field in the document tree with a string value.

xml.NUMBER : this node describes an element that appears once in its parent element and contains a number. It will appear as a named field in the document tree with a number value.

Refer to xml.lua for more node types.

The following example map structure will capture style elements from a well-formed XHTML document, discarding the body and other head elements:

local map = {
   html = xml.ByName {
      head = xml.ByName {
         title = xml.STRING,
         style = xml.STRING_LIST,
      }
   }
}

local tree = xml.DOM(text, map)

print(tree.html.head.title)  --> a string
print(#tree.html.head.style) --> number of <style> elements

Limitations