Monkeys at Keyboards: Java-Fu © Michael James Heron
Topic: Java Programming Level: 3 Version: beta
8 - XML and DOM Parsing
Chapter Objectives
By the end of this chapter, the reader will be able to:
In the next chapter, we will look at applying our new-found SAX parsing skills to building a useful XML-based JavaBean. Our efforts will produce a seed, which we will nurture until it grows into a beautiful flower of beautiful beauty. Before we do so, we can usefully spend some time looking into the alternative XML parsing routines available within Java.
Our discussion about XML is in many ways a departure from the central theme of the module - the study of XML itself has little role in the development of component based programming solutions. In fact, it's possible to develop a component that never touches upon XML - but it does make things easier, particularly when it comes to making effective and adaptable data structures. In chapter eleven, we'll look at one area of component based programming that shows directly why XML is a powerful technology (ironically, we won't need to do anything with XML to be able to use it in a component), but in this chapter we're going to look at the Document Object Model, which can be used to represent an XML document in code.
In the last chapter, we looked at how to write a parser routine that would let us read an XML document - we didn't however look at how we could write one to a file. With SAX, there's no simple way of doing this - SAX is an inherently one-way parsing model, and it offers no way of internally representing the structure of an XML document... and correspondingly, it offers no way of writing such a document to a file. DOM solves this by allowing for an internal representation of the structure of an XML document to be stored in code, and written out easily without tiresome file IO routines. So, enough preamble! Let's get on to the good stuff!
At its simplest, the Document Object Model is a simple 'tree' data structure. In our studies of Java last year, we never looked at how powerful this kind of data structure could be, mainly because its usual applications are considerably beyond the complexity of program we can reasonably expect to develop in the course of a second year programming module. There are many kinds of tree structures, but they all share a number of common characteristics. Tree structures are made up of a number of linked nodes, which are related to each other through child and parent relationships (much like with inheritance). Each child has at most one parent, and a parent can have a number of different children (how many depends on the kind of tree structure). Nodes without a parent are called root nodes, and nodes without any children are called leaf nodes:
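To make the vocabulary concrete, here is a minimal sketch of a general tree node. The TreeNode class and its method names are invented for illustration and are not part of any Java library - the DOM classes we meet below provide their own equivalents.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a general tree node; not part of the DOM API.
class TreeNode {
    private final String name;
    private TreeNode parent;                              // at most one parent
    private final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // Link a child to this node and return the child so calls can be chained.
    TreeNode addChild(TreeNode child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    // A root node has no parent; a leaf node has no children.
    boolean isRoot() {
        return parent == null;
    }

    boolean isLeaf() {
        return children.isEmpty();
    }

    // Count this node plus every node beneath it.
    int size() {
        int total = 1;
        for (TreeNode child : children) {
            total += child.size();
        }
        return total;
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode("root");
        TreeNode branch = root.addChild(new TreeNode("branch"));
        branch.addChild(new TreeNode("leaf"));
        System.out.println(root.getName() + " holds " + root.size() + " nodes");
    }
}
```

Notice that size is already recursive - a pattern we will lean on when we come to search a tree.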
The DOM model represents an XML document as a tree structure, with each element of the XML document having a corresponding node in the structure. The tree has structural integrity, so that the relationship between XML elements is preserved after the document has been parsed into a Java program - contrast this with a SAX parser model, which requires a programmer to enforce some kind of relationship on individual XML elements.
As with most of the tools we discuss throughout this module, we need to make use of a number of Java classes to develop our application - we need to import the following classes into any application designed for DOM parsing:
The first new class we're going to be introduced to is the Document class, which represents an entire XML document. The easiest way to envisage it is that the empty Document object represents the root node of a tree - as we parse an XML document, the appropriate child nodes are added. Similar to the way we use the newInstance method of a SAXParserFactory to create our SAX parser, we need to make use of the newInstance method of a DocumentBuilderFactory to create a DOM parser:
Then, we call the newDocumentBuilder method on our factory to create a new DocumentBuilder - this process can throw up a ParserConfigurationException, so we must try and catch that:
We then call the parse method on our DocumentBuilder and put the results into the Document object we created earlier. One of the stranger things about the Document Object Model is that it throws up SAX exceptions during the course of its parsing - quite unusual when you consider how different the two approaches are. The parse method throws a SAXException (and an IOException if the document cannot be read), so it is necessary to try and catch these before we can actually compile our program... our current skeleton code looks like this:
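A minimal sketch of such a skeleton follows. The class name DomSkeleton and its parse method are invented for illustration, and the sketch reads its XML from a String rather than a file so that it can be run without any supporting files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical skeleton; the class and method names are assumptions.
class DomSkeleton {

    // Parse XML text into a Document, or return null if parsing fails.
    static Document parse(String xml) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            // For a file on disk, builder.parse(new File("myBook.xml")) also works.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException e) {
            System.err.println("Could not create a parser: " + e.getMessage());
        } catch (SAXException e) {
            System.err.println("Document could not be parsed: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("Document could not be read: " + e.getMessage());
        }
        return null;
    }

    public static void main(String[] args) {
        Document document = parse("<library><book/></library>");
        System.out.println(document.getDocumentElement().getTagName());
    }
}
```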
So far, so good - the process we've gone through is pretty similar to what we did for parsing a document with SAX. We even use setNamespaceAware on the factory to ensure that our parser deals with namespaces, just as we did with our SAX parser. What we end up with after the parse call is a Document object that contains a representation of the structure of an XML document - notice that we don't need to write any parsing routines to handle this - it's all done for us. However, in order to actually get information out of the Document, we need to traverse the nodes of our tree. We're using a slightly different version of the myBook.xml file that we saw in the last chapter. It looks like this:
After parsing, this is represented as the following tree structure:
Each tagged node within a Document is known as an Element, and we can get the root Element of our Document through the use of a method called getDocumentElement. We store this in a variable of the generic Element type, although Element itself is actually an interface. Polymorphism allows us to use this as a general 'catch all':
Above, we talked about the idea of child nodes in a tree - obviously when we parse a document we cannot be sure how many elements, if any, it is going to have. In some implementations of the tree data type, a node can have at most two child elements. That is not the case with the tree data structure we are using here - it can have any number of child elements. Luckily, there is a method that will give us all of the children of a particular Element - it is called getChildNodes, and it returns an object of type NodeList:
We can use the getLength method on the NodeList to find out how many children, and we can use the item method to access a particular child (this will be returned as a Node):
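The following sketch puts getDocumentElement, getChildNodes, getLength and item together. The sample document and the names ChildWalk, rootOf and countElementChildren are invented for illustration. It also checks the type of each child, because the NodeList holds more than just the tagged elements.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example; names and sample XML are assumptions.
class ChildWalk {
    // Invented sample document, laid out over several lines like a typical file.
    static final String SAMPLE = "<library>\n  <book/>\n  <book/>\n</library>";

    // Parse the text and hand back the root element of the document.
    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Count only the children that are elements, skipping text nodes.
    static int countElementChildren(Element parent) {
        NodeList children = parent.getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Element root = rootOf(SAMPLE);
        System.out.println("All children: " + root.getChildNodes().getLength());
        System.out.println("Element children: " + countElementChildren(root));
    }
}
```

The two counts differ because the line breaks and indentation between the book tags are stored as text nodes of their own.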
Now, it's possible to explicitly travel down every node in a particular tree, but we're not particularly interested in doing that... we just want to know how we can get all of this information out of the XML document and into our application. From our rootElement, we have access to a method called getElementsByTagName - we pass the name of the element we want to get out, and it returns a NodeList of all the elements beneath it that match that name. For example, we can get a NodeList of all of the books in the XML document:
And then iterate over each of those, creating a new Book object for each:
We can 'drill down' into each node using getElementsByTagName to extract further information to configure our book object:
Here though, we hit upon a snag - each node has a method called getNodeValue which returns the value of the node (duh), but when we attempt to display that with a System.out.println:
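The steps so far can be sketched as follows. The two-book sample document and the names BookLookup, rootOf and titleNodeOf are invented for illustration - the real myBook.xml may well use different tags.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example; names and sample XML are assumptions.
class BookLookup {
    static final String SAMPLE =
            "<library><book><title>First Book</title></book>"
          + "<book><title>Second Book</title></book></library>";

    // Parse the text and hand back the root element of the document.
    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Drill down to the first title element beneath the given element.
    static Node titleNodeOf(Element parent) {
        return parent.getElementsByTagName("title").item(0);
    }

    public static void main(String[] args) {
        Element root = rootOf(SAMPLE);
        NodeList books = root.getElementsByTagName("book");
        for (int i = 0; i < books.getLength(); i++) {
            // Every entry in this list is an element, so the cast is safe.
            Element book = (Element) books.item(i);
            Node title = titleNodeOf(book);
            System.out.println(title.getNodeName() + " = " + title.getNodeValue());
        }
    }
}
```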
It just comes out as null! What gives? Well, you see... XML parsing can be tricky, because the way a document is parsed is not necessarily consistent with what you expect. For one thing, this single node (as far as we are aware) is actually made up of two nodes! The first node describes this particular node as being an Element, and getNodeValue is always null for an Element... it's the second node, a child of the first, that contains the actual text. On top of this, short of using some specialised techniques we won't discuss in this module, the DOM parsing routine will preserve the white space in an XML document within nodes of its own! We should write a method to extract this kind of information for us, ignoring the white space - it should step over every child of a node and append its value to a StringBuffer before giving us the results:
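One way such a method might look is sketched below. It is an assumption about the approach rather than a definitive implementation: it appends only the text children of the node and trims the surrounding white space from the result. The class name TextValue and the helper rootOf are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example; class and helper names are assumptions.
class TextValue {

    // Parse the text and hand back the root element of the document.
    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Gather the text held directly beneath a node, ignoring surrounding white space.
    static String parseValue(Node node) {
        StringBuffer buffer = new StringBuffer();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            // Only text nodes carry a value; child elements are skipped.
            if (child.getNodeType() == Node.TEXT_NODE) {
                buffer.append(child.getNodeValue());
            }
        }
        return buffer.toString().trim();
    }

    public static void main(String[] args) {
        Element title = rootOf("<title>\n  First Book\n</title>");
        System.out.println("[" + parseValue(title) + "]");
    }
}
```

More recent versions of the DOM API also offer a getTextContent method on Node, which gathers the text of a node and everything beneath it - writing our own routine keeps control over exactly which children count.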
All we need to do is pass a Node object to this method, and it will give us the value we need... nifty:
We can then repeat this process for the title:
Because our author name is broken up into two separate elements, we need to first query these individually, and then combine them into a single string to store in our Book object:
There's only one thing left - our modified myBooks.xml file has an attribute on the book element that indicates the location of the book in question - we can determine the state of that by using the getAttribute method on our element:
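Pulling the title, the two-part author name and the attribute together might look something like this sketch. The tag names forename and surname, the attribute name location, and the cut-down Book class are all assumptions made for illustration, since they depend on the exact layout of the XML file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example; tag, attribute and class names are assumptions.
class BookReader {
    // Minimal stand-in for a Book class.
    static class Book {
        String title;
        String author;
        String location;
    }

    static final String SAMPLE =
            "<book location=\"shelf\"><title>First Book</title>"
          + "<author><forename>Ann</forename><surname>Author</surname></author></book>";

    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Gather the text held directly beneath a node, ignoring white space.
    static String parseValue(Node node) {
        StringBuffer buffer = new StringBuffer();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.TEXT_NODE) {
                buffer.append(children.item(i).getNodeValue());
            }
        }
        return buffer.toString().trim();
    }

    // Text of the first element with the given tag (assumes one exists).
    static String textOf(Element parent, String tag) {
        return parseValue(parent.getElementsByTagName(tag).item(0));
    }

    static Book toBook(Element bookElement) {
        Book book = new Book();
        book.title = textOf(bookElement, "title");
        // The author is split over two elements, so query each and combine them.
        book.author = textOf(bookElement, "forename") + " " + textOf(bookElement, "surname");
        // getAttribute gives an empty string if the attribute is absent.
        book.location = bookElement.getAttribute("location");
        return book;
    }

    public static void main(String[] args) {
        Book book = toBook(rootOf(SAMPLE));
        System.out.println(book.title + " by " + book.author + " (" + book.location + ")");
    }
}
```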
This is of course the 'quick' way to parse such a document - after all, we're not taking advantage of the tree structure to map the XML representation onto our internal representation. So far, so mundane - after all, what's the point in using this new technique if we're already able to parse XML documents with SAX? The real power of DOM lies in its flexibility - because it maintains a representation of the XML document, it's possible for a developer to alter the contents of that document while it is stored in memory - effectively, the XML document can be changed easily... and then cleanly written out to an updated XML file.
An XML document is only of limited use if it cannot be changed - in real life, it must be able to adapt. What's the point of having a filing system if you have to ensure at design time that it contains absolutely everything that it would ever need to? There's no point of course, because it simply cannot be done. Really, the kind of representation system we've been using above doesn't allow us to make the most of XML - we still need to provide a 'translation layer' between our objects and the XML file. Really, we want to be able to make use of the tree structure directly so that we can benefit from its hierarchical relationships.
The process of walking through a tree structure is called traversal, and is one of the more fundamental aspects of the more complex types of software data representations. In this section, we'll look at using the tree representation we've built of our XML document as a database of sorts - we'll be able to query the contents as well as set new information as required. For this particular scenario, it's very much a case of overkill to use a tree structure, but for other more complex examples it's undoubtedly the simplest way to represent changing, hierarchical sets of data... breadth and depth first searches are a good example of this in practice.
The DOM libraries of Java provide us with a range of tools for simplifying the navigation of a tree structure - it's important to understand that these tools are not particular to XML parsing... they have a wider relevance that will hopefully become apparent in later years, but discussion of this is beyond the scope of this module. We'll make use of our parseValue method to ensure that we can easily extract the text contents of any given node, but beyond that we're going to use an entirely different coding framework to navigate the tree.
We're going to have to be a little more flexible in how we approach our XML documents in order to write this properly - after all, our examples so far have only been very simple - we haven't looked at XML elements that could contain multiple nodes (beyond white space), such as:
This particular document parses out as the following tree:
This will do for our example tree traversal - we'll look at how we can manipulate the structure of this XML document to give our tree multiple branches, or multiple leaves, or change the value we've ascribed to leaves or branches! Exciting stuff! We'll also look at how we can query specific nodes to find out their values, and how we can then take our manipulated data structure and write it out as a neatly formatted external XML file.
As with most things in programming, we're actually writing an application-specific traversal routine - although XML allows for maximum portability of data, we still need to know in advance what the structure of a file is going to be. In the case of our structure above, we know that a tree has a trunk, and that a trunk has a number of branches - each of these branches will have a number of leaves, as well as some descriptive text that goes with them. Each leaf also has descriptive text that goes along with it.
We want to develop an iterator structure that allows us to step through each of the nodes in the tree - because there's only going to be one trunk, we can safely take that as the root element. There may be any number of branches, so we need to implement a way to step over each of these. We also need to be able to query the elements that go with each branch.
What we really need is a method that will step through all of the elements of a particular node, looking for a specified tag name. If it doesn't find it within the children of the node, it should check the children of the children, and the children of the children of the children, and so on. Sounds like a job for recursion to me! We need to be able to specify a tag to search for, and a node that will act as a root element for that particular search... should it find it, it will return the relevant node... so that gives us our method signature:
We know how to get the children of a particular node - we use the getChildNodes method:
We'll need two temporary variables - one is a string to hold the tag name we've currently found, if any. The other is a Node that allows us to refer to a particular child in the NodeList:
The first thing we want to do is check to see if the node that was just passed in matches the tag we want:
The second thing we want to do is step over all of the children of a particular node, and check to see if any of those match the tag we're looking for - we will worry about the children of each node after we have that working. We get a particular node by using the item method, and we use the getLocalName method to return the tag (if any... remember, white space is also represented as a node, and a white space node has no tag, so getLocalName gives us null). This relies on the parser being namespace aware, as ours is. If the tag we've found is equal to the tag we're searching for, then we return that node:
Then, if we haven't found it there, we need to check the children of that node - here's where our recursion comes in:
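Assembled from these steps, the whole search might be sketched as follows. The method name findNode comes from the text, but the order of its parameters, the class name TreeSearch and the sample tree document are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example; the sample XML and helper names are assumptions.
class TreeSearch {
    static final String SAMPLE =
            "<tree>\n <trunk>\n  <branch>Lower branch\n   <leaf>Green leaf</leaf>\n"
          + "  </branch>\n </trunk>\n</tree>";

    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // getLocalName relies on this
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Return the first node named tag at or beneath root, or null if there is none.
    static Node findNode(String tag, Node root) {
        // Step one: does the node we were handed match?
        if (tag.equals(root.getLocalName())) {
            return root;
        }
        NodeList children = root.getChildNodes();
        String found;
        Node child;
        // Step two: check each direct child (white space nodes give a null name).
        for (int i = 0; i < children.getLength(); i++) {
            child = children.item(i);
            found = child.getLocalName();
            if (tag.equals(found)) {
                return child;
            }
        }
        // Step three: recurse into each child to search deeper levels.
        for (int i = 0; i < children.getLength(); i++) {
            Node result = findNode(tag, children.item(i));
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Node leaf = findNode("leaf", rootOf(SAMPLE));
        System.out.println(leaf.getLocalName() + ": " + leaf.getTextContent());
    }
}
```

Writing tag.equals(found) rather than found.equals(tag) means that a null name from a white space node is simply treated as a non-match instead of causing a NullPointerException.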
Then if we don't find anything, we just return a null pointer. This is a generic traversal method that will work for any tag in any XML document - the problem is that it is not very discriminating. It only returns the first node it finds. We also need a way to step through each of the branches in the tree. This is pretty easy to do - we just maintain an integer that indicates where we are, and then provide methods for moving on to the next and previous nodes. A very simple implementation of the nextNode method might look like this:
Alas, this won't actually work properly because of the accursed white space problem! So we need to base our counter on only those nodes that are genuine XML elements - those with tags:
Our previousNode method looks virtually identical, except that we subtract from the counter rather than add to it:
We also need to provide a firstNode method to ensure that we always begin at the first valid element:
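The three methods might fit together as in the sketch below. The method names firstNode, nextNode and previousNode come from the text; wrapping them in a class called NodeWalker, returning null when there is nowhere left to go, and the sample document are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical iterator over the tagged children of one node.
class NodeWalker {
    static final String SAMPLE =
            "<trunk>\n  <branch>one</branch>\n  <branch>two</branch>\n</trunk>";

    private final NodeList children;
    private int current = -1;          // index of the child we are standing on

    NodeWalker(Node parent) {
        children = parent.getChildNodes();
    }

    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // White space is stored as text nodes, so only elements count as tagged.
    private boolean isTagged(int index) {
        return children.item(index).getNodeType() == Node.ELEMENT_NODE;
    }

    // Start again from the first tagged child, or return null if there is none.
    Node firstNode() {
        current = -1;
        return nextNode();
    }

    // Move forward to the next tagged child; stay put and return null at the end.
    Node nextNode() {
        for (int i = current + 1; i < children.getLength(); i++) {
            if (isTagged(i)) {
                current = i;
                return children.item(i);
            }
        }
        return null;
    }

    // Move back to the previous tagged child; stay put and return null at the start.
    Node previousNode() {
        for (int i = current - 1; i >= 0; i--) {
            if (isTagged(i)) {
                current = i;
                return children.item(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        NodeWalker walker = new NodeWalker(rootOf(SAMPLE));
        for (Node n = walker.firstNode(); n != null; n = walker.nextNode()) {
            System.out.println(n.getTextContent());
        }
    }
}
```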
Now we've provided ourselves with an iterator structure that allows us to step through every tagged element of any given node, and then extract the information we require from each. This is all very generic, and applicable to any tree - but in order to browse our particular XML document, we need to tie it into the structure we have. Our XML document has a tree element, which contains a single trunk element - let's make that the base of our traversal. The code for parsing the XML document into a Document is the same as above:
We set our search at the first branch by using our newly minted firstNode method:
And we can extract all the branch node information from this by using findNode:
Or we can extract the information about a particular leaf:
But the traversal is only one part - we also need to be able to do something useful with the nodes we find. We already have a method that allows us to parse the text information from a node (our parseValue method), but we also need to be able to change the value of a node if we are going to make best use of our structure. For this, we use a similar routine to what we do for parsing information out of a node:
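A sketch of one possible setValue method follows. The method name, the decision to overwrite only the first text child, and the fallback of creating a text node when none exists are all assumptions made for illustration, as is the helper rootOf.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example; method and class names are assumptions.
class ValueSetter {

    static Element rootOf(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Replace the text of an element; assumes node is an element inside a document.
    static void setValue(Node node, String value) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                // Overwrite the first text child and leave everything else alone.
                child.setNodeValue(value);
                return;
            }
        }
        // No text child yet, so ask the owning document to create one.
        node.appendChild(node.getOwnerDocument().createTextNode(value));
    }

    public static void main(String[] args) {
        Element leaf = rootOf("<leaf>Green leaf</leaf>");
        setValue(leaf, "Brown leaf");
        System.out.println(leaf.getTextContent());
    }
}
```

For an element whose text is split across several text nodes (for example, one with child elements and white space between them) this simple version only changes the first piece, which may or may not be what a particular application wants.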
So, now we have an iterator structure, and a method for setting and getting the value of elements - obviously this has required quite a bit more code than our simple efforts before when we simply translated an XML document into an internal object representation - but this way we ensure that our hierarchical structure is mirrored, and that we can easily set and get information from each node. All that's left now is to write out our tree structure into a text file so that any changes we make to our XML can be stored between executions of the program.
There are three new packages we need to import before we can write out our tree structure as an XML file. These are:
The process we go through to set this up should be familiar by now - first we create a new instance of a suitable factory class - in this case, it is TransformerFactory:
Then we create an actual Transformer object using this class:
This throws a checked exception of type TransformerConfigurationException, so this must be tried and caught. The Transformer class handles converting the tree structure we have into a format suitable for writing to a file - it saves us having to write complex file IO routines ourselves. We then create a DOMSource object, passing our document as a parameter to its constructor:
Next, we need to create an instance of a class called StreamResult - this acts as a wrapper around a connection to some output device. It can be as simple as System.out:
Or we can actually pass a File object as the parameter to the constructor, which will set us up for writing out our tree structure to an external file:
Once we've created an appropriate StreamResult object, we do the actual transformation that will take our XML tree and write it to the provided output device... the transform method of our Transformer object does this for us. We pass two parameters - the first is the DOMSource object we created, and the second is the StreamResult object. This method throws a TransformerException, so this must be tried and caught before the program will compile:
Putting it all together gives us our writeFile method:
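A minimal sketch of such a method is given below. The method name writeFile comes from the text; its parameters, its boolean return value, and the helpers parse and toXmlString (which writes into a String so that the result is easy to inspect) are assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example; helper names and signatures are assumptions.
class XmlWriter {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Send the document to any StreamResult; returns true if it was written.
    static boolean write(Document document, StreamResult result) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer();
            DOMSource source = new DOMSource(document);
            transformer.transform(source, result);
            return true;
        } catch (TransformerConfigurationException e) {
            System.err.println("Could not create a transformer: " + e.getMessage());
        } catch (TransformerException e) {
            System.err.println("Could not write the document: " + e.getMessage());
        }
        return false;
    }

    // Write the document out to an external file.
    static boolean writeFile(Document document, File file) {
        return write(document, new StreamResult(file));
    }

    // Write the document into a String, which is handy for checking the output.
    static String toXmlString(Document document) {
        StringWriter writer = new StringWriter();
        write(document, new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) {
        Document document = parse("<tree><leaf>Green leaf</leaf></tree>");
        write(document, new StreamResult(System.out));
    }
}
```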
And that's it... a program that will read an XML document from a file into a tree based structure, allow us to traverse the nodes and set (and read) individual values, and then write out the modified tree to an XML document. Powerful stuff!
DOM parsing is no more difficult than SAX parsing - in fact, it's actually a lot easier. The difficulty comes when it's time to manipulate the structure that the DOM parser produces. Tree navigation can be tricky, and requires a lot of thought and planning before it can be accomplished successfully. DOM offers a number of benefits over SAX parsing - primarily because of the richer (if more complex) tree model. For one thing, DOM allows for an XML document to be stored and manipulated in code - elements can be queried and changed. SAX on the other hand is a one-way process... there's no possibility of stepping back through the document. DOM is considerably less efficient than SAX - it holds the whole document in memory, so for very large XML documents it can be especially unwieldy. However, the extra flexibility it provides in many cases will offset the efficiency loss. As in all things, let the application itself decide what the best strategy is for a particular problem.
Further Reading
The following table details further reading on the topic in this chapter, and also any external resources that you may find useful.
© 2004-2006 Michael James Heron