Yet More XML: with Prolog

I just saw Mark Nelson's "More on XML", with his account of how difficult Visual C++ and MSXML make it to extract a node not far below the root of an XML tree. Since Mark was good enough to show us his XML file, I tried the same thing with SWI-Prolog.

Here's Mark's XML file. From it, he wants the contents of the Title element:

<ISBNdb server_time="2009-03-19T02:01:00Z">
<BookList total_results="1" page_size="10" page_number="1" shown_results="1">
<BookData book_id="the_data_compression_book" isbn="1558514341">
<Title>The Data Compression Book</Title>
<TitleLong></TitleLong>
<AuthorsText>Mark Nelson, Jean-Loup Gailly, </AuthorsText>
<PublisherText publisher_id="m_t_books">M&amp;T Books</PublisherText>
</BookData>
</BookList>
</ISBNdb>

Now, SWI-Prolog has a library for parsing XML. I've used it for decoding Excel spreadsheets saved as XML, but that wasn't recently, so my memory of the library was patchy. But I knew it returns the parsed XML as a list of lists, and lists are a standard data type in Prolog. So I only needed to know how to load the library, how to invoke the parser, and how the lists it returns represent XML. Luckily, there is a very helpful recent posting about this on the SWI-Prolog mailing list, "R: [SWIPL] Working with strings", from Prolog super-expert Richard O'Keefe.

Let's try what Richard suggests. I load the library, then parse Mark's XML into a Prolog variable also named "XML", and display that. Good: everything works, and the variable seems to have listy things in it:

Welcome to SWI-Prolog (Multi-threaded, 32 bits, Version 5.6.64)
...Rest of banner...
1 ?- cd('c:/dobbs').
true.

2 ?- use_module(library(sgml)).
% library(option) compiled into swi_option 0.02 sec, 7,664 bytes
% library(sgml) compiled into sgml 0.03 sec, 38,328 bytes
true.

3 ?- load_xml_file('mark.xml',XML), write(XML).
[element(ISBNdb, [server_time=2009-03-19T02:01:00Z], [
, element(BookList, [total_results=1, page_size=10, page_number=1, shown_results=1], [
, element(BookData, [book_id=the_data_compression_book, isbn=1558514341], [
, element(Title, [], [The Data Compression Book]),
, element(TitleLong, [], []),
, element(AuthorsText, [], [Mark Nelson, Jean-Loup Gailly, ]),
, element(PublisherText, [publisher_id=m_t_books], [M&T Books]),
]),
]),
])]
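
A note for anyone puzzled by that output: write/1 prints atoms without quotes, so the tag name 'ISBNdb' looks like a variable and the line breaks between elements come out raw. Here is a minimal sketch of the same kind of parse printed with writeq/1, which keeps the quotes. It assumes a recent SWI-Prolog whose load_structure/3 accepts a string(...) source. The cut-down XML fragment and the predicate names sample_dom/1 and show_quoted/0 are mine, purely for illustration.

```prolog
:- use_module(library(sgml)).

% Hypothetical helper: parse a small inline fragment and return its DOM.
% The fragment is a cut-down stand-in for Mark's file, not the real thing.
sample_dom(DOM) :-
    XML = "<BookData isbn=\"1558514341\">\n<Title>The Data Compression Book</Title>\n</BookData>",
    load_structure(string(XML), DOM, [dialect(xml)]).

% Print the parsed term with quoting, so atoms such as 'BookData'
% are visibly atoms rather than looking like variables.
show_quoted :-
    sample_dom(DOM),
    writeq(DOM), nl.
```

Calling show_quoted. at the prompt prints the parsed term in a form that could be read back in as Prolog.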

I'm working from Wi-Fi in a library which will close shortly, so I'm going to be really hasty. Richard's posting tells me that the parser returns XML elements as structures holding a tag-name field, an attributes field, and a children field. An XML file will be a list that contains a top-level element, and possibly other stuff I've not had time to read about. Mark's file appears to have a top-level element called ISBNdb, with children that include a BookList element. Let's check that:

4 ?- load_xml_file('mark.xml',XML), XML=[element(_,_,Kids0)], member( element('BookList',_,Kids1), Kids0 ), write(Kids1).
[
, element(BookData, [book_id=the_data_compression_book, isbn=1558514341], [
, element(Title, [], [The Data Compression Book]),
, element(TitleLong, [], []),
, element(AuthorsText, [], [Mark Nelson, Jean-Loup Gailly, ]),
, element(PublisherText, [publisher_id=m_t_books], [M&T Books]),
]),
]

I did as before, but this time "unified" the top-level XML with a structure containing a new Prolog variable called Kids0. This is a kind of pattern-matching which puts the third field of the top-level element (the level-0 children) into Kids0. Then I used the built-in predicate "member" to search Kids0 for an element whose first field was 'BookList'. I put its children into Kids1, and displayed that. And one of those level-1 children is a BookData element.
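
To see the unification and the member search in isolation, here is a small sketch that needs no parser at all. The term in toy_dom/1 is a hand-written stand-in for what the parser returned, heavily trimmed, and both predicate names are my own inventions for this illustration.

```prolog
% A hand-written stand-in for the parser's output: element(Name, Attrs, Content)
% terms, with '\n' atoms where the file had line breaks between elements.
toy_dom([element('ISBNdb', [server_time='2009-03-19T02:01:00Z'],
                 ['\n',
                  element('BookList', [total_results='1'],
                          ['\n', element('BookData', [], []), '\n']),
                  '\n'])]).

% booklist_children(-Kids1): the content list of the BookList element.
booklist_children(Kids1) :-
    toy_dom(DOM),
    % Unification: DOM must be a one-item list holding an element/3 term;
    % the third argument of that term is bound to Kids0.
    DOM = [element(_, _, Kids0)],
    % member/2 tries each item of Kids0 in turn; the '\n' atoms cannot
    % unify with element('BookList', _, _), so they are passed over.
    member(element('BookList', _, Kids1), Kids0).
```

The point of the sketch is that the whitespace atoms never need special handling: they simply fail to match the element pattern.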

Now I'll iterate that, and write out the second-level children:

5 ?- load_xml_file('mark.xml',XML), XML=[element(_,_,Kids0)], member( element('BookList',_,Kids1), Kids0 ), member( element('BookData',_,Kids2), Kids1 ),write(Kids2).
[
, element(Title, [], [The Data Compression Book]),
, element(TitleLong, [], []),
, element(AuthorsText, [], [Mark Nelson, Jean-Loup Gailly, ]),
, element(PublisherText, [publisher_id=m_t_books], [M&T Books]),
]

And now the third-level children:

6 ?- load_xml_file('mark.xml',XML), XML=[element(_,_,Kids0)], member( element('BookList',_,Kids1), Kids0 ), member( element('BookData',_,Kids2), Kids1 ), member( element('Title',Attrs,Kids3), Kids2), write(Kids3).
[The Data Compression Book]

Lo and behold, there's the title!
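
The chain of member calls has an obvious pattern, so it can be folded into one recursive predicate that follows a list of tag names. This is only a sketch: descend/3 and path_dom/1 are names I made up, and the term in path_dom/1 is a hand-written, trimmed stand-in for the parsed file. Unlike the queries above, which matched the root element without naming it, this version starts from the top-level list and so names 'ISBNdb' too.

```prolog
% descend(+Content, +Names, -Kids)
% Follow a list of tag names down through nested element/3 terms,
% starting from a content list, and return the content of the last one.
descend(Content, [], Content).
descend(Content, [Name|Names], Kids) :-
    member(element(Name, _Attrs, Kids0), Content),
    descend(Kids0, Names, Kids).

% Hand-written stand-in for the parsed file, trimmed to the path of interest.
path_dom([element('ISBNdb', [],
            ['\n',
             element('BookList', [],
               ['\n',
                element('BookData', [],
                  ['\n',
                   element('Title', [], ['The Data Compression Book']),
                   '\n',
                   element('TitleLong', [], []),
                   '\n']),
                '\n']),
             '\n'])]).
```

With these loaded, the query path_dom(DOM), descend(DOM, ['ISBNdb','BookList','BookData','Title'], Kids). binds Kids to the one-item list holding the title. On the real file, the same call should work with the XML variable from load_xml_file in place of DOM.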

I suppose I'm just showing that if you're lucky enough to have the right libraries, and a language that handles lists nicely and is also interactive, it's easy to experiment and test your understanding of the data. Once that understanding is sorted out, you can go on to program a robust system for extracting stuff from it, including validity checks on list size and so on. Thanks for a nice example, Mark.
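
As a rough idea of what such checks might look like, here is one possible sketch, not a finished design. It insists on exactly one Title element whose content is a single atom, and raises an error otherwise. The predicate names, the error terms xml_element and single_text_title, and the choice of which conditions to check are all my own assumptions. It also assumes the parser's default of returning text as atoms, and a recent SWI-Prolog in which load_structure/3 takes either a file name or string(Text).

```prolog
:- use_module(library(sgml)).
:- use_module(library(error)).

% title_content(+DOM, -Content): the same chain of member/2 calls as in
% the last query above, except that the root element is looked up by name.
title_content(DOM, Content) :-
    member(element('ISBNdb', _, Kids0), DOM),
    member(element('BookList', _, Kids1), Kids0),
    member(element('BookData', _, Kids2), Kids1),
    member(element('Title', _, Content), Kids2).

% title_from_dom(+DOM, -Title): succeed only if exactly one Title element
% is found and its content is a single atom; otherwise raise an error.
title_from_dom(DOM, Title) :-
    findall(Content, title_content(DOM, Content), Contents),
    (   Contents = [[T]], atom(T)
    ->  Title = T
    ;   Contents == []
    ->  existence_error(xml_element, 'Title')
    ;   domain_error(single_text_title, Contents)
    ).

% book_title(+Source, -Title): parse Source as XML, then extract the title.
book_title(Source, Title) :-
    load_structure(Source, DOM, [dialect(xml)]),
    title_from_dom(DOM, Title).
```

With this loaded, book_title('mark.xml', Title). should give the title for Mark's file, while a file with no Title, an empty Title, or several books raises an error instead of quietly failing or returning the first match.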