Gem #21: How to parse an XML text

Let's get started…

There are two main APIs for parsing an XML file. The first, the Document Object Model (DOM), reads the file and builds a tree in memory that represents the whole document. Because of the number of operations mandated by the specification, this tree is typically several times larger than the document itself, so the memory available on your machine can limit the size of the documents your application can read. On the other hand, the tree gives you a lot of flexibility in handling the document.

The other API (SAX) is based on callbacks, which are invoked as the various constructs are encountered while reading the XML file. It requires almost no memory, but leaves more of the processing work to your application. It is, however, very well suited to storing the XML data in an application-specific data structure. In fact, XML/Ada itself uses SAX to build the DOM tree.

In both cases, XML/Ada needs an object (an "input_source") to read the actual XML data. This data can come from a file on disk, from memory, from a socket, or from any other source you can imagine. XML/Ada is carefully constructed so that it does not require the whole document in memory and can read one character at a time, which makes it adaptable to any kind of input. This gem does not cover how to write your own input sources. Doing so is generally quite easy; the only difficulty is properly converting the bytes you read into Unicode characters.

Here is a small example of using the DOM API to create a tree in memory. In this example, we assume the most frequent case of an XML file on disk, and therefore use a File_Input as the input. The second object we need is the XML parser itself. To create a DOM tree, we need a Tree_Reader, or a type derived from it. As we will see later, this is in fact a SAX parser (that is, an event-based XML parser) whose callbacks are implemented to create the DOM tree. You can of course override its primitive operations if you want to do additional things (such as producing verbose output, redirecting error messages, or pre-processing the XML nodes).

with Input_Sources.File;  use Input_Sources.File;
with DOM.Readers;         use DOM.Readers;
with DOM.Core;            use DOM.Core;

procedure Read_XML_File (Filename : String) is
  Input  : File_Input;
  Reader : Tree_Reader;
  Doc    : Document;
begin
  Open (Filename, Input);
  Parse (Reader, Input);
  Close (Input);
  
  Doc := Get_Tree (Reader);
  ...
  Free (Reader);
end Read_XML_File;

The first three lines read the file into memory. The fourth line gets a handle on the tree itself, which you can then manipulate with the various subprograms found in the DOM.Core.* packages (and that are mandated by the W3C specifications). When we are done, we simply free the memory.
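As an illustration of manipulating the tree, here is a minimal sketch that lists every element with a given tag name. It assumes the Get_Elements_By_Tag_Name, Length, Item and Node_Name subprograms from DOM.Core.Documents and DOM.Core.Nodes; the procedure name List_Elements and its Tag parameter are hypothetical and chosen only for this example.

```ada
with Ada.Text_IO;         use Ada.Text_IO;
with Input_Sources.File;  use Input_Sources.File;
with DOM.Readers;         use DOM.Readers;
with DOM.Core;            use DOM.Core;
with DOM.Core.Documents;  use DOM.Core.Documents;
with DOM.Core.Nodes;      use DOM.Core.Nodes;

--  Hypothetical helper: prints the name of each element called Tag
procedure List_Elements (Filename : String; Tag : String) is
   Input  : File_Input;
   Reader : Tree_Reader;
   Doc    : Document;
   List   : Node_List;
   N      : Node;
begin
   --  Build the DOM tree exactly as in the example above
   Open (Filename, Input);
   Parse (Reader, Input);
   Close (Input);
   Doc := Get_Tree (Reader);

   --  Collect the matching elements; node lists are indexed from 0
   List := Get_Elements_By_Tag_Name (Doc, Tag);
   for Index in 1 .. Length (List) loop
      N := Item (List, Index - 1);
      Put_Line ("Found element " & Node_Name (N));
   end loop;

   --  The list and the tree are released separately
   Free (List);
   Free (Reader);
end List_Elements;
```

The list returned by Get_Elements_By_Tag_Name is freed on its own, before the reader that owns the tree.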

Various settings can be configured on the reader before we actually parse the XML stream, for instance whether it should support XML namespaces, whether it should validate the input, and so on.
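A minimal sketch of such settings follows. It assumes the Set_Feature procedure and the Validation_Feature and Namespace_Feature constants from Sax.Readers; the procedure name Read_With_Settings and the particular values chosen are only illustrative.

```ada
with Input_Sources.File;  use Input_Sources.File;
with Sax.Readers;         use Sax.Readers;
with DOM.Readers;         use DOM.Readers;

--  Hypothetical variant of Read_XML_File that configures the reader first
procedure Read_With_Settings (Filename : String) is
   Input  : File_Input;
   Reader : Tree_Reader;
begin
   Open (Filename, Input);

   --  Features must be set before Parse is called
   Set_Feature (Reader, Validation_Feature, False);  --  do not validate
   Set_Feature (Reader, Namespace_Feature, True);    --  handle namespaces

   Parse (Reader, Input);
   Close (Input);

   --  The tree would be retrieved with Get_Tree (Reader) as before
   Free (Reader);
end Read_With_Settings;
```

Since Tree_Reader is derived from the SAX reader type, the same calls should apply to any SAX parser you derive yourself.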

As we mentioned before, there exists a second, lower-level API called SAX, which is event-based. It defines one tagged type, a Reader, which has several primitive operations that act as callbacks. You can override the ones you need. In general, the result of calling them is to create an in-memory representation of the XML input (which is what the DOM interface does, really).

The following short example only detects the start of elements in the XML file, and prints their names on standard output. It is of little use in real applications, but is a good framework on which to base your own SAX parsers.

with Sax.Attributes;
with Sax.Readers;     use Sax.Readers;
with Unicode.CES;     use Unicode.CES;

package Debug_Parsers is
   type Debug_Reader is new Reader with null record;
   overriding procedure Start_Element
     (Handler       : in out Debug_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class);
end Debug_Parsers;

Here is the implementation of the Start_Element callback. We are assuming, in this simple example, that the console on which we are printing the output can accept Unicode characters (in fact, all Put_Line does is print a series of bytes, which the console interprets to render the proper Unicode glyphs).

with Ada.Text_IO;   use Ada.Text_IO;

package body Debug_Parsers is
   procedure Start_Element
     (Handler       : in out Debug_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class)
   is
   begin
      Put_Line ("Found start of " & Qname);
   end Start_Element;
end Debug_Parsers;

And finally here is a short example of a program using that parser. Notice how it closely mimics what we did for DOM (which is not so surprising, since, once again, the DOM parser itself is really a special implementation of a SAX parser).

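with Ada.Command_Line;    use Ada.Command_Line;
with Input_Sources.File;  use Input_Sources.File;
with Debug_Parsers;       use Debug_Parsers;

procedure Test_Sax is
  --  Assumption: the XML file name is passed as the first
  --  command-line argument
  Filename : constant String := Argument (1);
  Input    : File_Input;
  Reader   : Debug_Reader;
begin
  Open (Filename, Input);
  Parse (Reader, Input);
  Close (Input);
end Test_Sax;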
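Because SAX leaves the choice of data structure to the application, a derived reader can carry its own state. The sketch below counts elements in a document held in memory rather than on disk. It assumes the Input_Sources.Strings package, an Open variant that takes the string and an encoding, and the Unicode.CES.Utf8.Utf8_Encoding constant; Counting_Readers, Counting_Reader and Count_Elements are hypothetical names. The three units are shown together and need to be split into separate files (for instance with gnatchop) before compiling with GNAT.

```ada
with Sax.Attributes;
with Sax.Readers;
with Unicode.CES;

package Counting_Readers is
   --  Hypothetical reader whose only application-specific data is a counter
   type Counting_Reader is new Sax.Readers.Reader with record
      Count : Natural := 0;
   end record;

   overriding procedure Start_Element
     (Handler       : in out Counting_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class);
end Counting_Readers;

package body Counting_Readers is
   overriding procedure Start_Element
     (Handler       : in out Counting_Reader;
      Namespace_URI : Unicode.CES.Byte_Sequence := "";
      Local_Name    : Unicode.CES.Byte_Sequence := "";
      Qname         : Unicode.CES.Byte_Sequence := "";
      Atts          : Sax.Attributes.Attributes'Class)
   is
   begin
      --  Called once per opening tag: update the reader's own state
      Handler.Count := Handler.Count + 1;
   end Start_Element;
end Counting_Readers;

with Ada.Text_IO;            use Ada.Text_IO;
with Input_Sources.Strings;  use Input_Sources.Strings;
with Unicode.CES.Utf8;
with Counting_Readers;       use Counting_Readers;

procedure Count_Elements is
   Input  : String_Input;
   Reader : Counting_Reader;
begin
   --  Assumption: this Open variant takes the XML text and its encoding
   Open ("<a><b/><b/></a>", Unicode.CES.Utf8.Utf8_Encoding, Input);
   Parse (Reader, Input);
   Close (Input);

   Put_Line ("Elements seen:" & Natural'Image (Reader.Count));
end Count_Elements;
```

Only the declaration of the input and the call to Open differ from Test_Sax; the parser itself does not know where the characters come from.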