Monkeys at Keyboards: Java-Fu
© Michael James Heron
Topic: Java Programming
Level: 3
Version: beta

The Holocaust was an obscene period in our nation's history. I mean in this century's history. But we all lived in this century. I didn't live in this century.
Vice President Dan Quayle

7 - XML and You


Chapter Objectives

By the end of this chapter, the reader will be able to:

  • parse XML documents using the SAX framework
  • use XML namespaces and attributes


7.1

Introduction

XML is not a component, nor a technology used for building components, yet it is a vital tool in developing truly interchangeable component solutions. The days of proprietary formats are fast fading away: word-processed documents are often stored in XML, databases usually provide a facility for import and export to a standard XML notation, and computer games are often configured using XML files. As a standard, it is fast taking hold throughout the computing world.

XML (which is an acronym for eXtensible Markup Language) is a platform- and implementation-independent data format. Data elements are written using a syntax that is somewhat similar to HTML (this is unsurprising, since both XML and HTML are descended from SGML). The key benefit of XML is that it both stores and describes the data in one neatly encapsulated package. As such, its role is often central in developing effective programming components.

Of necessity, this chapter will only be a cursory introduction to the XML format - the subject will be expanded upon in a later advanced text on the webpage.

7.2

My First XML Datatype

Let's consider a simple example of a data type - a book, such as one might find in a library (that's a place where they store books, for those of you unfamiliar with the idea). There are a range of different attributes that may be associated with such a data type. We may want the ISBN of the book... we may want the author. We may want the title. We may want the publisher. This is all simple enough - we know how to do this - after all, we've been doing this kind of thing for years.

The problem comes when we want to store such data. Let's say we have the following data definition:

Fig 7.1: A Book Datatype (ISBN, title, author, publisher, pages)

We can easily save this to a flat file:

012345678
Java-Fu
Michael Heron
Monkeys at Keyboards
12345

Or we can easily serialize an appropriate object to a binary file (as we discussed in the last chapter). Both of these techniques are simple, but they suffer from some drawbacks.

The flat-file is not very adaptable... if I later want to include some information about the genre, I can't just slot it into the file and hope that everything still works:

012345678
Java-Fu
Michael Heron
Monkeys at Keyboards
12345
Fiction

Unless the IO code is changed, when it gets to the genre line it will be expecting the ISBN of the next book. In order to change the contents of the data-file, we must also change the code that reads it. Once we've changed the code, we can't read in an older version of the data file unless we also include some legacy code... supporting many older versions of a data file is very cumbersome.
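
To make that fragility concrete, here is a minimal sketch of the sort of positional reading code a flat file demands. The class and method names (FlatFileDemo, readIsbns) are invented for this illustration, and the code assumes exactly five lines per book in the order shown above. Feeding it a file that has gained a genre line shows every later record sliding out of position:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: a reader that assumes exactly five lines per book.
class FlatFileDemo {

    static final int LINES_PER_BOOK = 5; // ISBN, title, author, publisher, pages

    // Returns whatever the reader believes is the ISBN of each record.
    static List<String> readIsbns (List<String> lines) {
        List<String> isbns = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += LINES_PER_BOOK) {
            isbns.add (lines.get (i)); // the first line of each record "must" be the ISBN
        }
        return isbns;
    }

    public static void main (String[] args) {
        // Two books, each with an extra genre line the reader knows nothing about.
        List<String> withGenre = Arrays.asList (
            "012345678", "Java-Fu", "Michael Heron", "Monkeys at Keyboards", "12345", "Fiction",
            "2222222", "The Javanomicon", "Michael Heron", "Monkeys at Keyboards", "54321",
            "Cowboy Action Thriller");
        System.out.println (readIsbns (withGenre));
    }
}
```

Run against this data, the reader reports "Fiction" and "54321" as ISBNs - the code hasn't crashed, it has just quietly read the wrong lines.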

The data file must also be accompanied by a suitable format definition if someone else is to read the data into another application. For this example, it may be obvious as to what line of the file relates to which data element, but that becomes less obvious if you're storing a 3D array.

If you want to read a serialized object from one application into another, you must have the appropriate class file incorporated into the code, and you must also be using compatible versions of the serializer libraries. This means, for example, it's difficult to import a Java serialized file into a .NET application.

XML data-sets neatly solve both of these problems by providing a data definition that is both adaptable and self-describing. They combine the universal compatibility of a text-based flat file with the compactness of representation of an object definition.

But that's all just back-talk - let's see what an XML data type actually looks like:

<?xml version = "1.0"?>
<book>
<ISBN>012345678</ISBN>
<title>Java-Fu</title>
<author> Michael Heron</author>
<publisher>Monkeys at Keyboards</publisher>
<pages>12345</pages>
<genre>Fiction</genre>
</book>

Pretty simple, huh? You better believe it is! The only thing that doesn't read like butter is the very first line - this is a header that indicates which version of the XML specification is used. Beyond that, you can read the whole thing and understand which pieces of data relate to which attributes.

This particular example demonstrates the hierarchical nature of XML data definitions... in this example, a book element contains a number of sub-elements: ISBN, title, author, publisher, pages and genre. Each of these elements could be expanded to contain other elements:

<?xml version = "1.0"?>
<book>
<ISBN>012345678</ISBN>
<title>Java-Fu</title>

<author>
<forename>Michael</forename>
<surname>Heron</surname>
</author>

<publisher>Monkeys at Keyboards</publisher>
<pages>12345</pages>
<genre>Fiction</genre>
</book>

The book element is known as the root element - every XML document defines only one of these - it's a reference to a container within which all other elements will reside. If we wanted multiple books, we may define them in a library root element.

There are several things we must understand about XML - firstly, it is a very strict format, and it is case sensitive. All elements are contained between an opening tag and a closing tag. HTML allows for the closing tag to be omitted on certain occasions with minimal (if any) impact:

<p>This is a paragraph.
<p>This is another paragraph!

An XML document will not parse correctly if the closing tag is omitted. An XML document that does not conform to the strict syntax rules is said to be not well-formed, and is ineligible for structured parsing.
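
If you'd like to see that strictness in action, the following sketch jumps ahead slightly and borrows the parser classes that are introduced properly in section 7.4. The class name WellFormedCheck and its isWellFormed helper are made up for this example; the idea is simply that the parser throws a SAXException as soon as it meets a document that breaks the rules:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a piece of text is well-formed XML.
class WellFormedCheck {

    static boolean isWellFormed (String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.newSAXParser().parse (
                new ByteArrayInputStream (xml.getBytes (StandardCharsets.UTF_8)),
                new DefaultHandler()); // a handler that ignores every parsing event
            return true;
        }
        catch (SAXException ex) {
            return false; // the parser rejected the document
        }
        catch (ParserConfigurationException | IOException ex) {
            throw new IllegalStateException (ex); // not expected for in-memory input
        }
    }

    public static void main (String[] args) {
        System.out.println (isWellFormed ("<p>This is a paragraph.</p>"));
        System.out.println (isWellFormed ("<p>This is a paragraph.<p>This is another paragraph!"));
    }
}
```

The first call reports true and the second false - the HTML-style omission of the closing tags is enough to have the whole document thrown out.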

7.3

Parsing XML Documents

Before an XML document can be used within a Java program (or indeed, any program), it needs to be parsed - an XML document does nothing, it is purely a container for data. In order to turn that data into information, the document has to be rendered into a suitable internal representation.

There are two primary techniques for doing this - each is defined in a separate specification. The first technique is called SAX (Simple API for XML), and is an event-driven model. The second technique is called DOM (Document Object Model), and works by building up a tree representation of a particular XML document. We'll look at DOM in the next chapter.

SAX parsers are considerably more efficient than DOM parsers, especially where large-scale documents are concerned, because they do not need to hold the whole document in memory at once. They lack the flexibility that DOM parsers provide - as with most things in programming, the choice of one over the other should be primarily driven by the needs of the application as opposed to any ideological entrenchment.

7.4

Everything You Ever Wanted To Know About SAX...

We're going to be meeting a whole host of classes and packages for the first time in this chapter - don't be scared, I'll be here to hold your trembling hands. The benefit of exploring this kind of process is that it is generic - you don't really need to understand what is going on in the short-term - you just need to be able to apply it to your own programming solutions. Like double-buffering (see the Javanomicon chapter twenty-two, section eight), XML parsing is a portable technique that can be applied to multiple situations with little modification.

There are three new packages that contain all of the new classes we need. These packages are:

import org.xml.sax.*; 
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;

Any application that is going to be making use of SAX XML parsing will also need to import the java.io libraries - we need some way to read the file into the application, after all.

We'll look at parsing the simple XML document we defined above:

<?xml version = "1.0"?>
<book>
<ISBN>012345678</ISBN>
<title>Java-Fu</title>
<author> Michael Heron</author>
<publisher>Monkeys at Keyboards</publisher>
<pages>12345</pages>
<genre>Fiction</genre>
</book>

We'll read this from an XML file (in this case, called myBook.xml), and then create an instance of a suitable class from the data contained within.

We'll write a class that handles our XML parsing for us - the simplest way to do this is to make use of the DefaultHandler class contained within the org.xml.sax.helpers package:

import java.io.*; 
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;


public class SaxParser extends DefaultHandler {
}

The DefaultHandler class is a simple implementation of an interface called ContentHandler (along with a few related interfaces, such as ErrorHandler), which defines all of the methods that a class needs in order to receive events from a SAX parser. The DefaultHandler class does almost nothing with these methods - it just provides empty place-holders for the functionality that needs to be implemented by the developer.

The first new class we need to look at is called SAXParserFactory - this allows us to set up a default parser for an XML document. We don't create this using the new keyword - instead we make use of the static newInstance method of the SAXParserFactory class:

SAXParserFactory myParser = SAXParserFactory.newInstance(); 

By default, this factory produces non-validating versions of the standard XML parser - but that's not enough! No sir, we need to call the newSAXParser method on the object we just created - this returns another object of type SAXParser:

SAXParser bing = myParser.newSAXParser(); 

It is this SAXParser object that handles the hard work of doing the - well, hard work. It has a method called parse that is used for this - the version we will use takes two parameters. The first is a File object that refers to the file to be used, and the second is a DefaultHandler object that will receive the parsing events... coincidentally, that's exactly what our class is!

Both the newSAXParser method and the parse method carry with them a pair of mandatory acknowledgement (checked) exceptions that must be handled within a try-catch block: newSAXParser can throw ParserConfigurationException and SAXException, while parse can throw SAXException and IOException.

Fig 7.2: Mandatory Acknowledgement Exceptions

All of this setup goes into the constructor method of our parser class. We still haven't implemented all the required functionality of course - we have no code for handling what happens to each XML element as it is parsed. We'll come to that in due course!

In the mean-time, let's look at the basic skeleton of our SaxParser class:

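import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;


public class SaxParser extends DefaultHandler {

    public SaxParser() {
        SAXParserFactory myParser = SAXParserFactory.newInstance();
        File myFile = new File ("myBook.xml");
        try {
            // Creating the parser and parsing share one try block, so we never
            // call parse on a parser that failed to be created.
            SAXParser bing = myParser.newSAXParser();
            bing.parse (myFile, this);
        }
        catch (ParserConfigurationException ex) {
            System.err.println ("Could not configure the parser: " + ex.getMessage());
        }
        catch (SAXException ex) {
            System.err.println ("Could not parse the document: " + ex.getMessage());
        }
        catch (IOException ex) {
            System.err.println ("Could not read the file: " + ex.getMessage());
        }
    }

}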

Our class will now compile - huzzah for us!

Our next step is to actually parse some information from the XML document once we've made a connection to it.

7.5

Let's Talk About SAX baby

The ContentHandler interface carries with it a number of methods that need to be implemented - the DefaultHandler class provides placeholder implementations of all of them, but it is common for the developer to over-ride some of these methods. The ones we will over-ride in this chapter are startDocument, endDocument, startElement, endElement and characters:

Fig 7.3: Methods to be implemented

The namespace and simpleName parameters are not of interest to us in this section - this is only a brief introduction to XML parsing. We'll have cause to return to this subject later in the chapter.

We'll start off by simply echoing everything we parse from the XML file to the screen - just to make sure that what we're reading in is what we think. We'll over-ride the startDocument and endDocument methods to print some meaningful text to the screen when we begin and finish parsing:


public void startDocument() throws SAXException {
    System.out.println ("Start of Document");
}


public void endDocument() throws SAXException {
    System.out.println ("End of document");
}

Next, we'll output the opening tag whenever we encounter one:

public void startElement (String namespace, String simpleName, String qualifiedName,
        Attributes attr) throws SAXException {
    System.out.print ("<" + qualifiedName + ">");
}

And then whenever we see a closing tag, we output that too:

public void endElement (String namespace, String simpleName, String qualifiedName)
        throws SAXException {
    System.out.println ("</" + qualifiedName + ">");
}

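And finally, whenever we see any text between the tags, we output that as well:

public void characters (char[] buf, int offset, int len) throws SAXException {
    // buf may hold more than our text, so we only use the slice we are given
    String s = new String (buf, offset, len);
    System.out.print (s);
}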

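When we run our completed parser against myBook.xml, we get output along the following lines (the line breaks in the real output will differ slightly, because the whitespace between elements is also passed to the characters method):

Start of Document
<book><ISBN>012345678</ISBN><title>Java-Fu</title><author> Michael Heron</author>
<publisher>Monkeys at Keyboards</publisher><pages>12345</pages><genre>Fiction</genre>
</book>End of document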

Not bad at all!
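
For anyone who wants to try this without creating myBook.xml first, here is a self-contained sketch of the same echoing parser. It is a variation rather than the exact class built above: the class name EchoHandler and the echo helper are invented for this example, the XML comes from a string instead of a file, and the output is gathered into a StringBuilder so it can be inspected as well as printed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of the echoing parser that works on in-memory XML.
class EchoHandler extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();

    @Override
    public void startDocument() {
        out.append ("Start of Document\n");
    }

    @Override
    public void endDocument() {
        out.append ("End of document\n");
    }

    @Override
    public void startElement (String namespace, String simpleName, String qualifiedName,
            Attributes attr) {
        out.append ("<").append (qualifiedName).append (">");
    }

    @Override
    public void endElement (String namespace, String simpleName, String qualifiedName) {
        out.append ("</").append (qualifiedName).append (">");
    }

    @Override
    public void characters (char[] buf, int offset, int len) {
        out.append (buf, offset, len); // this includes any whitespace between elements
    }

    // Parses the given XML text and returns everything the handler echoed.
    static String echo (String xml) {
        EchoHandler handler = new EchoHandler();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse (new InputSource (new StringReader (xml)), handler);
        }
        catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException ("Could not parse XML", ex);
        }
        return handler.out.toString();
    }

    public static void main (String[] args) {
        System.out.print (echo ("<book><title>Java-Fu</title><pages>12345</pages></book>"));
    }
}
```

Because the handler methods don't throw anything themselves, the throws SAXException clause has been dropped from the over-ridden methods - Java allows an over-riding method to declare fewer exceptions than the original.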

We're only part of the way there, though - all we're doing at the moment is displaying a copy of the XML file to the screen - we still haven't managed to do anything productive yet.

The technique for parsing an XML file into a separate object isn't the most obvious, and in a number of ways violates good programming practice. We need a number of class-wide variables that are manipulated via separate methods - because of the event-driven nature of SAX XML parsing, we cannot rely on a proper chain of communication between methods.

The XML document we have is going to map neatly onto an object of our own - this is an instance of a standard class that contains attributes for each of the child elements for the book. The class itself is called Book.

Let's start off slowly, by simply creating an instance of the appropriate class whenever we encounter the starting tag book. We have a method called startElement that will serve us well for this:

Book tmpBook;

public void startElement (String namespace, String simpleName, String qualifiedName,
        Attributes attr) throws SAXException {
    if (qualifiedName.equals ("book")) {
        tmpBook = new Book();
    }
}

Next, we need to implement the functionality for setting the various attributes of the new book object. Here's where we hit upon the first real snag - we know when a tag is opened and we know when a tag is closed (the startElement and endElement methods are called, respectively). However, when startElement is called, we don't know what text follows the tag... and when endElement is called, we know which tag has closed, but not what text preceded it! It is only in the characters method that we have access to the actual text that was found between the tags.

We need to set up a variable to store the parameters that get passed into the characters method - we'll use a class-wide StringBuffer called contents for this.

Every time the characters method is called, we'll create a new string from the parameters that are passed, and then append this string to our contents StringBuffer. We append rather than assign because the parser is free to deliver the text of a single element in several separate calls:

public void characters (char[] buf, int offset, int len) throws SAXException {
    String s = new String (buf, offset, len);
    contents.append (s);
}

Whenever we encounter a new opening tag, we will completely clear what the contents variable currently has:

public void startElement (String namespace, String simpleName, String qualifiedName,
        Attributes attr) throws SAXException {
    contents = new StringBuffer();
    if (qualifiedName.equals ("book")) {
        tmpBook = new Book();
    }
}

When we encounter a closing tag, we then take the appropriate action. If, for example, the closing tag is title, then we call setTitle on our tmpBook, passing as a parameter whatever the value of our contents variable happens to be. If the closing tag is book, then we're done with our current instance of the object and add it to an ArrayList (another class-wide variable, called myBooks):

public void endElement (String namespace, String simpleName, String qualifiedName)
        throws SAXException {
    if (qualifiedName.equals ("book")) {
        myBooks.add (tmpBook);
    }
    else if (qualifiedName.equals ("ISBN")) {
        tmpBook.setISBN (contents.toString());
    }
    else if (qualifiedName.equals ("title")) {
        tmpBook.setTitle (contents.toString());
    }
    else if (qualifiedName.equals ("author")) {
        tmpBook.setAuthor (contents.toString());
    }
    else if (qualifiedName.equals ("genre")) {
        tmpBook.setGenre (contents.toString());
    }
    else if (qualifiedName.equals ("pages")) {
        // trim() guards against stray whitespace, which would make parseInt fail
        tmpBook.setPages (Integer.parseInt (contents.toString().trim()));
    }
}

And that's it - we have a fully featured XML parser that will read in the contents of an XML file and turn it into delicious, tasty objects. Arr, num num num.

But even better! We can include several instances of the object in our XML file, and it will still parse them beautifully into separate objects:

<?xml version='1.0' encoding='utf-8'?> 
<library>
<book>
<ISBN>012345678</ISBN>
<title>Component Based Solutions Module Book</title>
<author>Michael Heron</author>
<publisher>Monkeys at Keyboards</publisher>
<pages>12345</pages>
<genre>Fiction</genre>
</book>
<book>
<ISBN>2222222</ISBN>
<title>The Javanomicon</title>
<author>Michael Heron</author>
<publisher>Monkeys at Keyboards</publisher>
<pages>54321</pages>
<genre>Cowboy Action Thriller</genre>
</book>
</library>

We can prove the validity of our parsing by outputting the information at the appropriate juncture:


public void endDocument() throws SAXException {
    // myBooks needs to be declared as an ArrayList<Book> for this loop to compile
    for (Book book : myBooks) {
        System.out.println (book.getISBN());
        System.out.println (book.getTitle());
        System.out.println (book.getAuthor());
        System.out.println (book.getPages());
        System.out.println (book.getGenre());
    }
}
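
Pulling the pieces of this section together gives something like the following. Treat it as a sketch under a few assumptions rather than the definitive class: the Book class shown is a guess at a minimal shape (the setters and getters match the calls used above, nothing more), the parse helper and the in-memory XML string are conveniences invented for the example, and a StringBuilder stands in for the StringBuffer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Assumed minimal shape for the Book class; the real one may well differ.
class Book {
    private String isbn, title, author, genre;
    private int pages;

    void setISBN (String isbn) { this.isbn = isbn; }
    void setTitle (String title) { this.title = title; }
    void setAuthor (String author) { this.author = author; }
    void setGenre (String genre) { this.genre = genre; }
    void setPages (int pages) { this.pages = pages; }

    String getISBN() { return isbn; }
    String getTitle() { return title; }
    String getAuthor() { return author; }
    String getGenre() { return genre; }
    int getPages() { return pages; }
}

class BookHandler extends DefaultHandler {

    private final List<Book> myBooks = new ArrayList<>();
    private StringBuilder contents = new StringBuilder();
    private Book tmpBook;

    @Override
    public void startElement (String namespace, String simpleName, String qualifiedName,
            Attributes attr) {
        contents = new StringBuilder(); // discard text seen before this tag
        if (qualifiedName.equals ("book")) {
            tmpBook = new Book();
        }
    }

    @Override
    public void characters (char[] buf, int offset, int len) {
        contents.append (buf, offset, len); // text may arrive in several chunks
    }

    @Override
    public void endElement (String namespace, String simpleName, String qualifiedName) {
        String text = contents.toString().trim(); // trimming is an addition of this sketch
        if (qualifiedName.equals ("book")) { myBooks.add (tmpBook); }
        else if (qualifiedName.equals ("ISBN")) { tmpBook.setISBN (text); }
        else if (qualifiedName.equals ("title")) { tmpBook.setTitle (text); }
        else if (qualifiedName.equals ("author")) { tmpBook.setAuthor (text); }
        else if (qualifiedName.equals ("genre")) { tmpBook.setGenre (text); }
        else if (qualifiedName.equals ("pages")) { tmpBook.setPages (Integer.parseInt (text)); }
    }

    // Parses XML text (rather than a file) and returns the books found in it.
    static List<Book> parse (String xml) {
        BookHandler handler = new BookHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse (new InputSource (new StringReader (xml)), handler);
        }
        catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException ("Could not parse XML", ex);
        }
        return handler.myBooks;
    }

    public static void main (String[] args) {
        String xml = "<library>"
            + "<book><ISBN>012345678</ISBN><title>Java-Fu</title><pages>12345</pages></book>"
            + "<book><ISBN>2222222</ISBN><title>The Javanomicon</title><pages>54321</pages></book>"
            + "</library>";
        for (Book book : parse (xml)) {
            System.out.println (book.getTitle() + " (" + book.getPages() + " pages)");
        }
    }
}
```

Returning the list from a helper method, rather than printing from endDocument, is a design choice made here so that the calling code decides what to do with the books once parsing has finished.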

Our implementation of XML parsing is fairly robust - we won't encounter any problems if we add a spurious tag:

<book>
<ISBN>2222222</ISBN>
<title>The Javanomicon</title>
<author>Michael Heron</author>
<publisher>Monkeys at Keyboards</publisher>
<pages>54321</pages>
<bing>bong</bing>
<genre>Cowboy Action Thriller</genre>
</book>

Or if our specification for a particular instance of data is incomplete:

<book>
<ISBN>2222222</ISBN>
<title>The Javanomicon</title>
<author>Michael Heron</author>
<genre>Cowboy Action Thriller</genre>
</book>

This demonstrates one of the key benefits of XML - its adaptability. Of course, nothing will happen within our program if we add new elements, and if we remove elements our internal representation will be incomplete - but our parsing routine will not be the weak point in our chain.
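
The same behaviour can be seen in miniature with the sketch below. To keep it short it uses a stand-in for the Book-building parser: a hypothetical FieldCollector class that stores the text of any element it recognises in a map and ignores the rest. The names FieldCollector, KNOWN and collect are all invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the Book parser: known fields go into a map.
class FieldCollector extends DefaultHandler {

    private static final Set<String> KNOWN = new HashSet<>(
        Arrays.asList ("ISBN", "title", "author", "publisher", "pages", "genre"));

    private final Map<String, String> fields = new TreeMap<>();
    private StringBuilder contents = new StringBuilder();

    @Override
    public void startElement (String namespace, String simpleName, String qualifiedName,
            Attributes attr) {
        contents = new StringBuilder();
    }

    @Override
    public void characters (char[] buf, int offset, int len) {
        contents.append (buf, offset, len);
    }

    @Override
    public void endElement (String namespace, String simpleName, String qualifiedName) {
        // Anything we don't recognise (book, bing, ...) is silently skipped.
        if (KNOWN.contains (qualifiedName)) {
            fields.put (qualifiedName, contents.toString().trim());
        }
    }

    static Map<String, String> collect (String xml) {
        FieldCollector handler = new FieldCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse (new InputSource (new StringReader (xml)), handler);
        }
        catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException ("Could not parse XML", ex);
        }
        return handler.fields;
    }

    public static void main (String[] args) {
        // A book with a spurious bing element and no pages element.
        Map<String, String> fields = collect ("<book><title>The Javanomicon</title>"
            + "<bing>bong</bing><genre>Cowboy Action Thriller</genre></book>");
        System.out.println ("bing stored?  " + fields.containsKey ("bing"));
        System.out.println ("pages stored? " + fields.containsKey ("pages"));
        System.out.println ("title:        " + fields.get ("title"));
    }
}
```

Neither the extra element nor the missing one stops the parse - the spurious tag never makes it into the map, and the absent pages value is simply not there for the program to use.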

7.6

Attributes

So now we can parse out a simple XML document, and turn it into an internal object representation - and as I'm sure you can see, it's not all that difficult. However, there is an aspect of XML design we haven't discussed yet - that of attributes.

From basic HTML, you will undoubtedly recall that a tag itself is only part of any HTML document... tags usually carry with them some information. For example:

<a href = "bing.html">click here</a>

This is an attribute, and defines information that is relevant to either the element itself, or to the system attempting to parse it. There is no hard and fast rule as to when you must use attributes, and when you must use child elements. The following two XML specifications contain exactly the same information:

Fig 7.4: Attributes versus elements

Attribute values must always be contained within quotes - it doesn't matter if they are single or double.

The XML for attributes is very simple, especially if you have some basic experience with HTML... however, there is a complication that arises when parsing - no longer is the information contained within tags... instead, it is actually a part of the starting tag and must be parsed out when startElement is called. The Attributes parameter that gets passed as the fourth argument contains all the information we need.

Let's take a simple example - parsing out one single attribute (called sex) from a student element in an XML document.

Whenever our parser hits the student tag, it's going to get an Attributes object passed to the startElement method - this object will contain the name of each attribute, and its value.

Of course, we only want to make use of the value if the attribute is actually present - the Attributes object itself is never null (if there are no attributes, it is simply empty), but getValue will return null for an attribute that isn't there:

public void startElement (String namespace, String simpleName, String qualifiedName,
        Attributes attr) throws SAXException {
    String sex = attr.getValue ("sex");
    if (sex != null) {
        System.out.println (sex);
    }
}

Wow, that's it? It sure is!
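
Here is a runnable sketch of the same idea. The XML string is made up for the purpose (a class element holding a few student elements, some with a sex attribute and some without), as are the class name SexAttributeHandler and its parse helper - none of them come from the example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that records the sex attribute wherever it appears.
class SexAttributeHandler extends DefaultHandler {

    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement (String namespace, String simpleName, String qualifiedName,
            Attributes attr) {
        // attr is never null; getValue returns null when the attribute is absent.
        String sex = attr.getValue ("sex");
        if (sex != null) {
            values.add (sex);
        }
    }

    static List<String> parse (String xml) {
        SexAttributeHandler handler = new SexAttributeHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse (new InputSource (new StringReader (xml)), handler);
        }
        catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException ("Could not parse XML", ex);
        }
        return handler.values;
    }

    public static void main (String[] args) {
        // Invented sample data: note the mix of double and single quotes.
        String xml = "<class>"
            + "<student sex=\"female\">Ada</student>"
            + "<student>Sam</student>"
            + "<student sex='male'>Bob</student>"
            + "</class>";
        System.out.println (parse (xml));
    }
}
```

The class element and the second student carry no sex attribute, so they contribute nothing to the list - only the two values that are really present in the document are collected.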

But this particular parsing routine requires the data to be in a definite format - usually this will be the case, but sometimes you will want to be a little more forgiving. For example, you may want to separate out attribute and element values to ensure they don't overlap... maybe you'd like to keep a HashMap of all attributes and their values. You may not know in advance what the attributes are going to be, but you can write a flexible parsing routine for dealing with them all.

The Attributes object itself contains a lookup table - you can step over every element in this table and get the attribute name and the attribute value with a few simple method calls:

public void startElement (String namespace, String simpleName, String qualifiedName,
        Attributes attr) throws SAXException {
    // Requires: import java.util.HashMap;
    HashMap<String, String> myMap = new HashMap<String, String>();
    // An element with no attributes gives a length of zero, so the loop is skipped
    for (int i = 0; i < attr.getLength(); i++) {
        String aName = attr.getQName (i);
        String value = attr.getValue (i);
        myMap.put (aName, value);
    }
}
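
As written, the map above is thrown away as soon as startElement returns. One way to hold on to the information, sketched below, is to keep a single class-wide map for the whole document. The key format (element@attribute), the class name AttributeCollector and the sample XML are all assumptions made for this example; a TreeMap is used only so that the entries come out in a predictable order.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that gathers every attribute in a document.
class AttributeCollector extends DefaultHandler {

    // Keys take the form element@attribute, for example a@href.
    private final Map<String, String> attributes = new TreeMap<>();

    @Override
    public void startElement (String namespace, String simpleName, String qualifiedName,
            Attributes attr) {
        for (int i = 0; i < attr.getLength(); i++) {
            attributes.put (qualifiedName + "@" + attr.getQName (i), attr.getValue (i));
        }
    }

    static Map<String, String> collect (String xml) {
        AttributeCollector handler = new AttributeCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse (new InputSource (new StringReader (xml)), handler);
        }
        catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException ("Could not parse XML", ex);
        }
        return handler.attributes;
    }

    public static void main (String[] args) {
        // Invented sample document; only the href attribute echoes the text above.
        String xml = "<page>"
            + "<a href = \"bing.html\" title='Bing'>click here</a>"
            + "<img src='bong.png'/>"
            + "</page>";
        for (Map.Entry<String, String> entry : collect (xml).entrySet()) {
            System.out.println (entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Note that if the same element name turns up twice with the same attribute, the later value overwrites the earlier one - fine for a demonstration, but a real application would want a richer key.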

7.7

XML Namespaces

Earlier we looked at the parameters passed into the startElement and endElement methods, and found there was only one that we were really interested in - the name of the element. Our knowledge of XML has a fairly major gap in that a well-formed document may contain a number of name collisions. For example:

<?xml version='1.0' encoding='utf-8'?> 
<library>
<book>
<name>My Book</name>
</book>
<author>
<name>Me</name>
</author>
</library>

In this document, name is used as part of two different elements, in two different contexts. One way to avoid this is to ensure that you always use a unique tag identifier for every element of data, but this has problems:

  • You may end up using a less descriptive name, purely for convenience's sake.
  • You may not be the one actually writing the XML file, and you can't ensure that an XML document you have to parse will adhere to your unique naming convention.

There are however ways to resolve the problem. The first way is brute force, and is to ensure that you maintain an internal representation of where you are within an XML document by making use of an appropriate data structure - in this case, a stack.

Every time we open a tag, we'll push it onto the stack. Every time we close a tag, we'll pop the last one off of the stack - in this way, we'll always know where we are in the XML document:

// myStack is a class-wide variable, e.g. a java.util.Stack
public void startElement (String namespace, String simpleName, String qualifiedName,
        Attributes attr) throws SAXException {
    myStack.push (qualifiedName);
}

public void endElement (String namespace, String simpleName, String qualifiedName)
        throws SAXException {
    Object ob = myStack.pop();
}

Then if we need to find what element we're currently working with, we just look at the top of the stack (and, for the enclosing elements, further down it) and then execute the appropriate code. For example, if the name element we're in sits within the author element, we set the name of the Author object. If it sits within the book element, we set the name of the Book object.
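
A minimal sketch of the stack approach, run against the library document above, might look like this. The class name PathHandler and the parseNames helper are invented for the example, an ArrayDeque is used as the stack, and plain strings stand in for the Book and Author objects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that uses a stack to tell the two name elements apart.
class PathHandler extends DefaultHandler {

    private final Deque<String> myStack = new ArrayDeque<>();
    private StringBuilder contents = new StringBuilder();
    private String bookName;
    private String authorName;

    @Override
    public void startElement (String namespace, String simpleName, String qualifiedName,
            Attributes attr) {
        myStack.push (qualifiedName);
        contents = new StringBuilder();
    }

    @Override
    public void characters (char[] buf, int offset, int len) {
        contents.append (buf, offset, len);
    }

    @Override
    public void endElement (String namespace, String simpleName, String qualifiedName) {
        myStack.pop();                  // remove the element that is closing
        String parent = myStack.peek(); // the enclosing element, or null at the root
        if (qualifiedName.equals ("name")) {
            if ("book".equals (parent)) {
                bookName = contents.toString();
            }
            else if ("author".equals (parent)) {
                authorName = contents.toString();
            }
        }
    }

    // Returns the book name followed by the author name.
    static String[] parseNames (String xml) {
        PathHandler handler = new PathHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse (new InputSource (new StringReader (xml)), handler);
        }
        catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException ("Could not parse XML", ex);
        }
        return new String[] { handler.bookName, handler.authorName };
    }

    public static void main (String[] args) {
        String[] names = parseNames ("<library><book><name>My Book</name></book>"
            + "<author><name>Me</name></author></library>");
        System.out.println ("Book: " + names[0]);
        System.out.println ("Author: " + names[1]);
    }
}
```

Only the immediate parent is consulted here, which is enough for this document - a deeper or more ambiguous structure would mean walking further down the stack.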

There is another way though, and that is through the use of XML namespaces - these allow the XML designer to ensure that each element has a unique identifier, even if they have the same name.

The standard for XML namespaces is that they should take the form of a full URL - this URL will never be used for anything (although many companies actually do use their own internal intranets and store information about the relevant XML document at the namespace URL), it's just used as an identifier. The identifier is applied to the element where it is defined, and all child elements of that element.

A namespace is defined using the xmlns attribute in the opening tag:

<?xml version='1.0' encoding='utf-8'?> 
<library>
<book xmlns="http://www.monkeys-at-keyboards.com/xml/books">
<name>My Book</name>
</book>
<author xmlns="http://www.monkeys-at-keyboards.com/xml/author">
<name>Me</name>
</author>
</library>

Before we can make use of namespaces within our SAX parser, we need to call the method setNamespaceAware on the SAXParserFactory object, before we ask it for a parser:

SAXParserFactory myParser = SAXParserFactory.newInstance(); 
myParser.setNamespaceAware (true);

Now, when startElement and endElement are called, our namespace parameter will contain the appropriate value - we can use the value of this parameter to choose an appropriate course of action:

public void endElement (String namespace, String simpleName, String qualifiedName)
        throws SAXException {
    if (qualifiedName.equals ("name")) {
        // The namespace must match the xmlns value in the document exactly
        if (namespace.equals ("http://www.monkeys-at-keyboards.com/xml/books")) {
            myBook.setName (contents.toString());
        }
        else {
            myAuthor.setName (contents.toString());
        }
    }
}
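
For completeness, here is a self-contained sketch of namespace-aware parsing against the document above. As before, the class name NamespaceHandler and the parseNames helper are inventions for this example, and plain strings stand in for the Book and Author objects. It tests the simpleName parameter rather than qualifiedName, on the assumption that once namespace awareness is switched on it is the prefix-free name we want to compare.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that uses namespaces to tell the two name elements apart.
class NamespaceHandler extends DefaultHandler {

    static final String BOOKS_NS = "http://www.monkeys-at-keyboards.com/xml/books";
    static final String AUTHOR_NS = "http://www.monkeys-at-keyboards.com/xml/author";

    private StringBuilder contents = new StringBuilder();
    private String bookName;
    private String authorName;

    @Override
    public void startElement (String namespace, String simpleName, String qualifiedName,
            Attributes attr) {
        contents = new StringBuilder();
    }

    @Override
    public void characters (char[] buf, int offset, int len) {
        contents.append (buf, offset, len);
    }

    @Override
    public void endElement (String namespace, String simpleName, String qualifiedName) {
        if (simpleName.equals ("name")) {
            if (namespace.equals (BOOKS_NS)) {
                bookName = contents.toString();
            }
            else if (namespace.equals (AUTHOR_NS)) {
                authorName = contents.toString();
            }
        }
    }

    // Returns the book name followed by the author name.
    static String[] parseNames (String xml) {
        NamespaceHandler handler = new NamespaceHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware (true); // must happen before newSAXParser()
            factory.newSAXParser()
                .parse (new InputSource (new StringReader (xml)), handler);
        }
        catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException ("Could not parse XML", ex);
        }
        return new String[] { handler.bookName, handler.authorName };
    }

    public static void main (String[] args) {
        String xml = "<library>"
            + "<book xmlns=\"" + BOOKS_NS + "\"><name>My Book</name></book>"
            + "<author xmlns=\"" + AUTHOR_NS + "\"><name>Me</name></author>"
            + "</library>";
        String[] names = parseNames (xml);
        System.out.println ("Book: " + names[0]);
        System.out.println ("Author: " + names[1]);
    }
}
```

Unlike the code above, this version checks both namespaces explicitly rather than treating "anything that isn't a book" as an author, so a name element from some third namespace is left alone.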

7.8

Conclusion

This has been a very simple introduction to XML, and one of the standard frameworks for parsing such documents. Of necessity, we have left out huge amounts of what is available as part of the XML specification - it's a rich and varied subject, and one on which it is worth spending more time than is available in this module.

The SAX parser model is simple and efficient - however, for XML documents with a large and complex structure it is often somewhat less convenient, since the handler has to keep track of where it is in the document for itself. The DOM model is more complex and considerably less efficient, but offers a range of benefits that cannot be realised with the more primitive SAX model.



© 2004-2006 Michael James Heron