Monkeys at Keyboards: The Javanomicon © Michael James Heron
Topic: Java Programming | Level: 2 | Version: delta
20 - File I/O 2
Chapter Objectives
By the end of this chapter, the reader will be able to:
In the last chapter we looked at the simpler of the two main frameworks for implementing file IO in a Java program. In the rest of this chapter we'll continue our exploration of file IO and take a look at Random Access Files, and also how we can use Java to allow us to read Stream Based files from remote resources on the internet. By the end of this chapter you should be comfortable with the techniques for implementing random access files in your own Java programs, but it is important to realise that beyond simple applications and simple sets of data, it is often far more useful to implement such storage via a database. Attempting to implement your own database logic is likely to be a disheartening experience that will leave you bitter and unable to form meaningful relationships. With that in mind, let's get on to the good stuff!
In the last chapter, we met the File class very late on when we looked at the JFileChooser class. When we met it, it was simply an object that we used to get a file reference. In reality, it is a much more useful class than that. A File object is an abstract reference to a file that is stored physically on your hard drive. It provides a clean interface that allows you to query aspects of the file that would otherwise require quite complex code - the length of the file, the permission flags of the file, and so on. A file object can also serve as the basis of more complex file IO classes like FileReader and FileWriter. In the last chapter we created a link to a file on the hard-drive by passing a String parameter to the appropriate constructor. We could just as easily pass in a reference to a File object:
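As a minimal sketch of that idea, the snippet below builds a FileReader around a File object rather than a String. The class name FirstLineReader and the method firstLine are invented for this illustration and are not part of the chapter's own code.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper class; the names are assumptions made for this sketch.
class FirstLineReader {
    // Reads the first line of a text file, or returns null if the file is empty.
    static String firstLine(File source) throws IOException {
        // FileReader accepts a File object just as readily as a String path.
        try (BufferedReader in = new BufferedReader(new FileReader(source))) {
            return in.readLine();
        }
    }
}
```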
The constructor method for the File class requires a String parameter that indicates where the file is to be found:
The getSelectedFile method of JFileChooser returns a File object rather than a string representing the filename, so it is very rare that you will need to construct a new File object within your code in response to a user's actions. If the file does not currently exist on your hard-drive, creating a File object will not create the file. However, the File class does come with a range of useful methods, one of which is createNewFile:
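A small sketch of the two steps just described, under the assumption that we are happy to work with a path given as a String. The class and method names are made up for the example. Note that creating the File object touches nothing on disk; only createNewFile does.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical demo class; the names are assumptions made for this sketch.
class CreateFileDemo {
    // Creates an empty file at the given path if nothing is there yet.
    // Returns true if the file was created, false if it already existed.
    static boolean ensureExists(String path) throws IOException {
        File file = new File(path);   // just a reference, no disk access yet
        return file.createNewFile();  // this is the call that touches the disk
    }
}
```

Calling ensureExists twice with the same path should give true the first time and false the second, since the file already exists by then.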
Calling this method will create an empty file at the desired location - behaviour which is consistent with what FileWriter does when you create a new instance of that class. In fact, the File object provides a large range of useful methods. Some of the ones that you may find most useful are:
These methods all have a role to play in creating robust File IO routines in applications. For example, before attempting to create a FileReader object around a File returned by a JFileChooser's getSelectedFile method, you may want to check to see if it exists, and if it does you then check to see if you have the access you need to read it:
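One way such a check might look is sketched here. The helper SafeOpen and its method openIfReadable are hypothetical names, and returning null for an unusable file is simply a choice made to keep the example short; a real application would probably tell the user what went wrong.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical helper; the names are assumptions made for this sketch.
class SafeOpen {
    // Returns a reader for the file, or null if it is missing,
    // is not a plain file, or cannot be read by this program.
    static BufferedReader openIfReadable(File chosen) throws IOException {
        if (!chosen.exists()) {
            return null;
        }
        if (!chosen.isFile() || !chosen.canRead()) {
            return null;
        }
        return new BufferedReader(new FileReader(chosen));
    }
}
```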
An object of the File class allows us to do much more than simply ensure that file IO access doesn't go wrong. Files can point to either actual files or directories... those that point to directories have further useful functionality:
Already a great deal of file-manager type functionality is available to us through the medium of this class. File objects that point to actual files can be manipulated still further:
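To give a flavour of this file-manager style of use, here is a short sketch built on three File methods that certainly exist: list, renameTo and getParentFile. The class FileManagerSketch and its method names are invented for the example.

```java
import java.io.File;

// Hypothetical helper; the names are assumptions made for this sketch.
class FileManagerSketch {
    // Counts the entries directly inside a directory. Returns -1 if the
    // File does not point to a directory (list() returns null in that case).
    static int countEntries(File dir) {
        String[] names = dir.list();
        return names == null ? -1 : names.length;
    }

    // Renames a file within its current directory. Returns a File pointing
    // at the new name, or null if the rename did not succeed.
    static File rename(File original, String newName) {
        File target = new File(original.getParentFile(), newName);
        return original.renameTo(target) ? target : null;
    }
}
```

Both renameTo and delete report failure through a boolean return value rather than an exception, so it is worth checking what comes back.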
There is more that can be done with the File class... the interested student is directed towards the Java documentation for more details. Files provide a powerful interface allowing for sophisticated file-manager functionality. They are used all the way throughout the Java I/O class libraries to ensure consistency of access. However, they are not the only way we can access disk resources in Java.
Files are used to refer to local resources that are stored on a particular computer, or to those that are stored on a remote computer but are accessible through a UNC (Universal Naming Convention) reference. The URL class allows both local and network resources to be accessed. In this chapter we'll look solely at network resources, but in the next chapter we'll have cause to look at using a URL object to make a connection to locally stored media files. In order to use the URL class we must import the java.net package. If we don't, Java won't recognise the class when we attempt to use it... a horrible situation, the stuff of which nightmares are made! We create instances of the URL class in the same way we create instances of the File class - we pass in a string that indicates where the URL object is to point. In this case, it should point to a web location:
Using this syntax, we create a link to a remote resource... in this case, the main Java web-site. In the same way that we need a FileReader to read from a File, a URL is an abstract way of indicating a particular network location. We need to build up an actual connection if we want to read the contents of the URL. This is a somewhat convoluted procedure to begin with, but at a certain point it becomes indistinguishable from building up a connection to a local file. There is a mandatory acknowledgement exception that can occur when creating a URL object, so we must deal with that before we can proceed. The exception is a MalformedURLException:
Once we've done this, we can start to build up our connection. The URL object provides access to a method that makes a connection to the remote resource - this method is called openConnection, and it returns an object of type URLConnection:
The URLConnection object contains some useful methods... one of which is getLastModified, which returns a long representing the number of milliseconds since midnight GMT on January 1st, 1970 (a rather odd, but common metric). We can easily compare this against a stored value in our own programs to ensure we don't perform costly network access when we have no need. This URLConnection object provides us with another method that is used in building up our connection. This method is called getInputStream, and it returns an object of type InputStream:
We then wrap an InputStreamReader around this InputStream:
Once we get to this point, we are at a stage that is comparable with when we create a FileReader on a normal file... we then put a BufferedReader around our InputStreamReader and access the file in the same way we discussed in the last chapter. From that point onwards, we don't care that the file is stored on a website as opposed to on our hard-drives... the procedure for reading and parsing the contents is identical. Setting up a connection to a URL is a procedure that is somewhat more complicated than setting up a connection to a local file, as you can see. We are not quite layering objects in the same way we do when creating an instance of PrintWriter - instead we are chaining objects together: a method provides an object that provides a method we need to get another object... until we get to the point where we can start layering. We don't need to understand how all of this is working, or why we need particular objects - just as long as it actually does work, we can concentrate on the important thing, which is reading and parsing information from the remote resource. We won't look at the complexities of parsing out useful information from any given HTML file in this particular book - that is left as an exercise for the interested reader.
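The whole chain, from URL to BufferedReader, can be sketched in one method. The class UrlLineCounter is a made-up name, and returning -1 for a malformed address is just a convenience for the example. The URL(String) constructor is the one discussed in this chapter; recent JDKs mark it as deprecated, hence the annotation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper; the names are assumptions made for this sketch.
class UrlLineCounter {
    // Counts the lines of text available at the given address,
    // or returns -1 if the address is not a well-formed URL.
    @SuppressWarnings("deprecation") // URL(String) is deprecated in newer JDKs
    static int countLines(String address) throws IOException {
        URL url;
        try {
            url = new URL(address);
        } catch (MalformedURLException ex) {
            return -1;
        }
        // Chain: URL -> URLConnection -> InputStream -> InputStreamReader
        URLConnection connection = url.openConnection();
        InputStream raw = connection.getInputStream();
        InputStreamReader decoder = new InputStreamReader(raw);
        // From here on it is the same layering we used for local files.
        try (BufferedReader in = new BufferedReader(decoder)) {
            int lines = 0;
            while (in.readLine() != null) {
                lines++;
            }
            return lines;
        }
    }
}
```

Any address that Java has a protocol handler for will do, so the same method works for an http address or for a file: URL pointing at something on the local disk.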
For small applications (incidentally, like the applications you'll be writing throughout this book), stream based IO is usually an acceptable, even desirable, solution. It is simple and effective, and unless you are going to be working with a huge set of data and performing complex queries on that data, there is not a lot to be gained by the additional complexity represented by implementing random access routines. However, sometimes the data being accessed is so unacceptably large that it is simply an Affront To Decency to go through it all when looking for a single element of information. Consider if we had an application that stored a weather report for every hour of every day for the past ten years. Twenty four hours in a day, 365 days in a year, ten years worth of data gives us over 87,000 elements of data. Writing such an application is well within our capabilities at this point. Let's say each weather report is a mere kilobyte long and contains data such as wind speed, temperature, and so on. That seems like a reasonable assumption. This gives us a data set of about 87 megabytes. That's large, but not intractably so. We could write such a large data set as a single text file if we have some method of parsing it out - perhaps each report is separated by a semi-colon, and each element of data within the report is separated by a colon.
We could then read in this information from the file and use a StringTokenizer or the split method to break it up into the correct elements. However, to store this information in the computer's memory requires 87 Megabytes in addition to anything else being run on the computer. That seems like an unacceptable extravagance, and something that we shouldn't even consider doing. Most of the time, the data is purely historical and the user is only going to want to query a single element at a time. That leaves us with the option of storing the information on disk and simply returning the correct report when requested by the user. Now we see the problem - stream based files must be read line by line... if someone wanted element 48,000 then we'd need to read the 47,999 elements that came before it. That's really inefficient. It would be great if there was some way we could say to Java 'give me the 48,000th element' without going through all the other ones. Thankfully, that's exactly what random access files allow us to do. In order for a random access file to work, we need some way of computing the desired index from the information we actually have. It's a bit like storing information in an array - it's only useful if we know what index we want. If we're adding names and addresses to an address book, and the only information we have is the name then we have no way of computing an index and we'd need to do a linear search on the data file (unless the data-file is in a sorted order... more on this later). That gives us no real benefit over a stream file because we need to search every element in order anyway. Consider though our theoretical weather report application... if we want a report for a specific hour in a specific day in a specific year, then we can easily compute the index we want: (( number of days passed * reports in a day) + report number) -1. We subtract the one because, as with arrays, we start counting from zero. 
If we began keeping records at midnight on the 1st of January 1995, and we want the record for 4PM on February 3rd 2000, then we just need to compute the number of days that have passed since then:
And then the report number (the first report in the database was at midnight):
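Therefore the index we want is found with the formula ((1859 * 24) + 17) - 1: 1,859 days have passed (including the leap day in 1996), and 4PM is the seventeenth report of the day because midnight is the first. We want index 44632, which we then pull from our random access file.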
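This sort of date arithmetic is easy to get wrong by hand, so here is a sketch that hands it to the java.time library instead. The class ReportIndex and its method are invented names. It assumes one report per hour, counts the days with a calendar-aware method (so the leap day in 1996 is included), and treats midnight as report 1, which makes 4PM report 17.

```java
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Hypothetical helper; the names are assumptions made for this sketch.
class ReportIndex {
    static final int REPORTS_PER_DAY = 24;

    // Zero-based index of the hourly report taken at 'wanted', given that
    // the first report (index 0) was taken at midnight on the date of 'first'.
    static long indexOf(LocalDateTime first, LocalDateTime wanted) {
        long daysPassed = ChronoUnit.DAYS.between(
                first.toLocalDate(), wanted.toLocalDate());
        int reportNumber = wanted.getHour() + 1; // midnight is report 1
        return (daysPassed * REPORTS_PER_DAY + reportNumber) - 1;
    }
}
```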
The class we need to make use of for Random Access Files is called (funnily enough) RandomAccessFile. We pass a String parameter representing the file to be used as the first parameter, and a second String representing in what form we're going to open the file. RandomAccessFile objects can either be opened for reading only, or for reading and writing. We represent this as either the string "r" for read only, or "rw" for reading and writing:
RandomAccessFile objects have a method called seek that takes a parameter of type long that indicates how many bytes into the file we should move before we start reading. This is a much more efficient operation than actually reading up until that point - we just tell Java from where in the file we want to start. However, there is a catch to this, and dang it isn't there always? Actually working out how far we want to move through the file is a bit tricky and requires us to do some old fashioned arithmetic. Alas! We will refer to individual blocks of data within a file as elements, and coherent 'clumps' of elements as a record. For example, an address book data file will be made up of a number of records, each of which has a number of elements (such as name, address, telephone number, etc). Each of the standard data types in Java has a fixed amount of space it takes up in memory or on disk. This is a standard that we can rely on when it comes time to work out how many bytes we need to seek for a record.
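For reference, the sizes in question can be read straight from the wrapper classes, as in this small sketch. The class PrimitiveSizes is a made-up name, and the sample record of two ints and a double is only an assumption chosen to give a 4 + 4 + 8 layout.

```java
// Hypothetical reference class; the names are assumptions made for this sketch.
class PrimitiveSizes {
    // Bytes written to disk by the matching RandomAccessFile write method.
    static final int BYTE_SIZE = Byte.BYTES;        // writeByte: 1
    static final int SHORT_SIZE = Short.BYTES;      // writeShort: 2
    static final int CHAR_SIZE = Character.BYTES;   // writeChar: 2
    static final int INT_SIZE = Integer.BYTES;      // writeInt: 4
    static final int FLOAT_SIZE = Float.BYTES;      // writeFloat: 4
    static final int LONG_SIZE = Long.BYTES;        // writeLong: 8
    static final int DOUBLE_SIZE = Double.BYTES;    // writeDouble: 8
    static final int BOOLEAN_SIZE = 1;              // writeBoolean stores one byte

    // Size of an assumed sample record made of two ints and one double.
    static int sampleRecordSize() {
        return INT_SIZE + INT_SIZE + DOUBLE_SIZE;
    }
}
```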
Therefore, to work out the amount of disk-space one particular record takes up, we simply add together the fixed space requirements of each element of data. For records that are made up purely of primitive data types, this is very easy. For example, consider the following record:
This particular record would take up 4 + 4 + 8 bytes... sixteen bytes. So if we have a data file that is 400 bytes long, we know there must be 25 records (400 / 16). If we want record 1, we seek to position 0. If we want record 2, we seek to position 16, and so on. However, Strings (which are very commonly used), are objects and have no fixed length... the size of a string varies according to how many characters it contains. This means that any record containing strings is not going to follow our neat pattern. Consider a sample file for our record above:
We know if we seek to position 16, then we have found record two, and if we seek to position 32 then we've found record three, and so on. Consider if we had a similar data structure containing a String:
We have a problem because the strings have no fixed length. Michael takes up 7 bytes, and the age takes up 4... that particular record takes up 11 bytes. The next record (James) takes up 9 bytes. The next (Steven) takes up 10 bytes. We don't know how long any given record is going to be, so any attempt to use the seek method is just as likely to dump us in the middle of a record as it is to dump us at the start of one. This means that we cannot use a random access file with Strings as they are implemented as standard, because they have no fixed size. If we are to use strings in a random access file, we must ensure that they have a fixed length. Some programming languages, like Visual Basic, provide syntax for fixed length strings, but Java provides no such functionality - if we want it, we have to code it ourselves. We must pick a fixed size for a string, and ensure that no string written to the random access file is bigger or smaller than that fixed size. If it is larger, we need to clip it, or simply refuse to write that record to the file and give an error message instead. If it is smaller, we need to pad it with spaces or some other bunk character. Once that is done, we do know how much space any given string takes up - it takes up one byte for each character (for plain ASCII text, at least), as well as an additional two bytes at the start that hold the length of the string:
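The padding and clipping can be sketched in a few lines. The class FixedLength and its methods are hypothetical names, and the byte count only holds for plain ASCII text, since writeUTF uses more than one byte for other characters.

```java
// Hypothetical helper; the names are assumptions made for this sketch.
class FixedLength {
    // Pads with spaces or clips so the result is exactly 'size' characters.
    static String fix(String text, int size) {
        if (text.length() > size) {
            return text.substring(0, size);
        }
        StringBuilder padded = new StringBuilder(text);
        while (padded.length() < size) {
            padded.append(' ');
        }
        return padded.toString();
    }

    // Bytes that writeUTF uses for a fixed-size string of plain ASCII text:
    // a two-byte length prefix plus one byte per character.
    static int bytesOnDisk(int size) {
        return 2 + size;
    }
}
```

With a fixed size of 20, for instance, every name would occupy 22 bytes on disk regardless of whether it started life as Michael or James.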
As long as we know the fixed size of a string, we can use them in our random access files in the same way as intrinsic data types. For example:
Now that we know the length of a record, we know what values we need to feed our seek method to find any given record. Much like the standard File class, the RandomAccessFile class has a method called length that returns the length of its data file. As we stated above, the number of records in a file is equal to the length divided by the size of each individual record. We get the number of bytes to seek for a particular record by multiplying its position in the file (starting at 0) by the length of an individual record:
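Both calculations are one-liners, sketched here as a pair of helper methods. RecordSeeker is an invented name, and the record size is passed in as a parameter because it depends entirely on what your records contain.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper; the names are assumptions made for this sketch.
class RecordSeeker {
    // Number of whole records in the file, given a fixed record size in bytes.
    static long recordCount(RandomAccessFile file, int recordSize) throws IOException {
        return file.length() / recordSize;
    }

    // Moves the file pointer to the start of the record at the given
    // zero-based position.
    static void seekToRecord(RandomAccessFile file, long position, int recordSize)
            throws IOException {
        file.seek(position * recordSize);
    }
}
```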
If we want to move right to the end of the file (for example, for writing a new record):
This will tell Java to start right at the very end of the data file, and therefore we can start writing to the file without worrying about overwriting anything already present. The RandomAccessFile class provides a number of methods for writing data - it doesn't have a single consistent interface as does a PrintWriter. Instead, we must call the appropriate write method depending on the data we want to insert:
Writing Strings makes use of a method called writeUTF. UTF (Unicode Transformation Format) is a machine-readable, binary encoded form of Unicode that is ideally suited for dealing with String objects. We pass the string to be written as a parameter to this method:
Reading data works in the same way, with a selection of methods that map onto data types:
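To see the write and read methods working as a pair, here is a sketch that appends three values to the end of a file and then reads them straight back. The class RoundTrip and its method are made-up names; writeInt, writeDouble, writeUTF and their read counterparts are all genuine RandomAccessFile methods.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical demo class; the names are assumptions made for this sketch.
class RoundTrip {
    // Appends an int, a double and a string to the end of the file,
    // then seeks back and reads them in the same order.
    static String appendAndReadBack(File target, int count, double value, String label)
            throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(target, "rw")) {
            long start = file.length();
            file.seek(start);          // jump to the end before writing
            file.writeInt(count);
            file.writeDouble(value);
            file.writeUTF(label);

            file.seek(start);          // back to where this record began
            int c = file.readInt();
            double v = file.readDouble();
            String l = file.readUTF();
            return c + " " + v + " " + l;
        }
    }
}
```

The reads must happen in the same order as the writes. The file itself has no idea what types it contains, so reading a double where an int was written will cheerfully hand back nonsense.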
The RandomAccessFile object keeps track of where it is in the file after each read or write operation. It is not required that you seek after each read/write, since the object will handle that itself... you need to seek only when you wish to move to another record somewhere else in the file. But enough of this dry theory! Let's look at an example of Random Access Files in practice. We'll look at a simple database application that works on the Student record type shown above. This means we know in advance how big each record is going to be.
First, we need a student data type. Although we are writing individual elements to an underlying data file, it's still useful to have an actual Student class... so let's code that:
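A minimal sketch of such a class might look like the following. The field sizes of 20 and 100 characters come from the description of the class in this chapter; the method names, the private fix helper and the RECORD_SIZE constant are assumptions made for the example, and the record size is only right for plain ASCII text.

```java
// A sketch of the Student record type; member names are assumptions.
class Student {
    static final int NAME_SIZE = 20;
    static final int ADDRESS_SIZE = 100;
    // Two UTF strings (each with a 2-byte length prefix) plus a 4-byte int.
    static final int RECORD_SIZE = (2 + NAME_SIZE) + (2 + ADDRESS_SIZE) + 4;

    private final String name;
    private final String address;
    private final int age;

    Student(String name, String address, int age) {
        this.name = fix(name, NAME_SIZE);
        this.address = fix(address, ADDRESS_SIZE);
        this.age = age;   // primitives need no special handling
    }

    // Pads with spaces or clips so the text is exactly 'size' characters.
    private static String fix(String text, int size) {
        StringBuilder sb = new StringBuilder(text);
        sb.setLength(Math.min(sb.length(), size));   // clip if too long
        while (sb.length() < size) {
            sb.append(' ');                          // pad if too short
        }
        return sb.toString();
    }

    String getName() { return name; }
    String getAddress() { return address; }
    int getAge() { return age; }
}
```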
The benefit of having this particular class is that it handles the padding and clipping part of the requirements - we create an instance of the class passing in a name, an address and an age. If the name is less than 20 characters long, it adds a space until it isn't. If it's greater than 20 characters long, it sets the name as being the first 20 characters. It handles the address in a similar way, except it pads for 100 characters (or clips to 100 characters) appropriately. The age is a primitive data type, and so no special handling code needs to be added for this. The interface for our application is very simple - we provide two buttons. One says get and is used to get a record from our data file. The other is add and lets us add another record. There is a text component at the top for specifying which record we want to get, and a text area in the center that shows the details of the last record we loaded. When the add button is pressed, it flashes up a number of input dialog boxes for the details - this is absolutely disgusting from a user interface point of view, but we don't want to distract ourselves from the main purpose of the application by complicating the layout. The file we're using for our data is called students.dat. Let's look at how we implement our adding functionality first - after all, we need to have records in the data file before we can get them back out. Step one is to open up our file:
Then we need to get the information for a new record:
Then, we find the end of the file:
And then start writing our data:
And finally, we close the file:
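Pulled together into one method, the adding steps might look something like this sketch. StudentAppender is an invented name, the field order of name, address, age is an assumption, and the input dialogs are left out so that the file handling stands on its own.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper; the names and field order are assumptions.
class StudentAppender {
    static final int NAME_SIZE = 20;
    static final int ADDRESS_SIZE = 100;

    // Appends one fixed-size student record to the end of the data file.
    static void append(File dataFile, String name, String address, int age)
            throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(dataFile, "rw")) {
            file.seek(file.length());                  // find the end of the file
            file.writeUTF(fix(name, NAME_SIZE));
            file.writeUTF(fix(address, ADDRESS_SIZE));
            file.writeInt(age);
        }                                              // closed automatically here
    }

    // Clips to 'size' characters, then pads on the right with spaces.
    private static String fix(String text, int size) {
        String clipped = text.length() > size ? text.substring(0, size) : text;
        return String.format("%-" + size + "s", clipped);
    }
}
```

The try-with-resources block closes the file even if one of the writes throws an exception, which saves us from forgetting the close step.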
And that's it, a new record added to the data file. Next, we need to add the functionality for getting a record. The JTextField at the top of the application is called recordToGet, and the user types a number into that to get the corresponding record:
Then we seek to the appropriate location in the file:
And then read in the data:
And then display it in the details JTextArea:
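The getting steps can be sketched in the same way. StudentFetcher is a made-up name; the sketch assumes records are numbered from zero, that each one is 128 bytes (a 20 character name, a 100 character address and an int, written with plain ASCII text), and that the caller will put the returned text into the text area. Anything typed that isn't a number will cause Long.parseLong to throw a NumberFormatException.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper; the names and record layout are assumptions.
class StudentFetcher {
    // (2 + 20) bytes of name, (2 + 100) bytes of address, 4 bytes of age.
    static final int RECORD_SIZE = 128;

    // Returns a printable summary of the record at the typed zero-based
    // position, or null if the file holds no such record.
    static String fetch(File dataFile, String typedNumber) throws IOException {
        long position = Long.parseLong(typedNumber.trim());
        try (RandomAccessFile file = new RandomAccessFile(dataFile, "r")) {
            if (position < 0 || position >= file.length() / RECORD_SIZE) {
                return null;
            }
            file.seek(position * RECORD_SIZE);
            String name = file.readUTF().trim();      // strip the padding
            String address = file.readUTF().trim();
            int age = file.readInt();
            return name + "\n" + address + "\n" + age;
        }
    }
}
```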
And that's it! All of this occurs in the actionPerformed of this particular application. The code for the actionPerformed is as follows:
As you can see, the framework for implementing random access files is a little more complicated than the framework for stream-based IO... but once you've done the legwork it can make the file IO of some applications much more efficient.
There is nothing magically efficient about random access files - like most things in programming, code is only as efficient as its developer. It's possible to write a stream-based application that is faster than a random access application... it all depends on applying the right tool to the right problem. Stream based applications are simple to set up, and if you are simply depositing information between executions of an application, they may very well be perfectly suitable for your needs. Random access files are never going to be as quick as accessing data in memory, so for small sets of data it is often good to read the entire file as one operation and store it in memory for as long as the application is running. For huge sets of data, this is not acceptable (or sometimes even possible), and so it is necessary to implement some degree of random access - but it is very much dependent on whether knowing the number of a record is useful information. If I have a vast address book of 200 megabytes, and I want to find the number of everyone called Brian, then knowing how to find a record at a particular location is no help to me - instead, I need to be able to do queries on the whole data set. What random access files do give you is flexibility. Searching through a stream based file is like working with a string, whereas working with a random access file is like working with an array. If you are seriously looking for an efficient file IO structure, then many of the standard algorithms for searching and sorting arrays can be applied effectively to dealing with large amounts of data in random access files. We've already discussed how linear and binary searching work in arrays - the exact same technique can be fruitfully applied to random access files. It all comes down to what your efficiency requirements actually are. Linear searching in a random access file offers no real benefit over parsing a sequential file.
If, however, we could ensure that our random access file was stored in a sorted order, then we could apply a binary search routine to find our data in a quick and scalable manner. As we discussed in the previous chapter, the big bottleneck in File IO is the actual physical disk operation. It's usually not feasible to sort a file that is stored on disk... the number of accesses required (especially for the kind of large data-sets that benefit most from binary searching) is immense. However, we can provide a process for reading in all the elements of data, sorting them in memory, and then writing out a new random access file - that's not difficult, although it can be quite time-consuming. If you have a file that must be searched frequently (but only infrequently updated), you may want to consider the overall performance gain. On the other hand, if your data file is infrequently searched and frequently updated, there will be an overall performance loss. Sorting the data file is simple... we read in all of the elements and store them in an ArrayList. We then sort the ArrayList, and write out each of the elements in order to the random access file. Searching, though, is something we have no provided methods for - we'd need to implement our own algorithm to do this. It's exactly the same algorithm we discussed way back in chapter 5.11 - it's just applied to a file rather than an array. The process is simple - we start at the middle of the file. If the element we're searching for is greater than the element we have, then we search towards the end of the file. If the element is smaller, then we search towards the beginning of the file. We do this iteratively until we find what we are looking for, or have exhausted our search and come up empty. Implementing this kind of search requires three variables.
One keeps track of what index is the current left hand side of the searchable array/file, one keeps track of the right hand side, and the third keeps track of our middle pivot point... all of these are integers:
To begin with, left is set equal to 0... the very start of the array or file. Right is set to the last index in the array or file, which is the number of elements - 1:
Mid is then set to the sum of these divided by two (the middle):
Then we provide a do-while loop to handle the iteration. At each stage of the loop we check to see if our current element (indicated by mid) is what we're looking for. If it is, we return from the method. If it's not, we then check to see if the element we currently have is less than the one we're looking for. If it is, then we set left to equal mid + 1, and then mid to equal (left + right) / 2. For example:
To begin with, left = 0, right = 12, and mid = 6. We're looking for the element "Michael":
Michael is greater than Jim, and so we set Left to equal mid + 1:
And then set mid to equal (left + right) / 2
Tada! Found! We follow a similar procedure for checking towards the left, except that we change the state of the variable right to equal mid - 1. Let's say we were searching for the value Geoff
Geoff is less than James, and so we set right to equal mid - 1:
And then mid to equal (left + right) /2 as usual:
Geoff is greater than Colin, so we set left to equal mid+1:
And then set Mid to equal (left + right) / 2:
Tada! A mighty win for our binary searching system! The code for this is very simple... all we're doing is manipulating integer variables. It's the concept that is tricky rather than the syntax:
Putting all of this together into an actual binary searching program, we get the following:
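As a sketch of how the pieces might fit together for a file, the method below binary searches a data file of fixed-size records. The class FileBinarySearch is an invented name, and it assumes each record begins with a padded key written by writeUTF and that the file is already sorted on that key. It uses a plain while loop rather than a do-while so that an empty file falls straight through to the not-found case.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper; the names and record layout are assumptions.
class FileBinarySearch {
    // Searches a file of fixed-size records, sorted by a leading writeUTF
    // key, and returns the zero-based record index or -1 if it is absent.
    static long search(File sorted, int recordSize, String wanted) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(sorted, "r")) {
            long left = 0;
            long right = file.length() / recordSize - 1;
            while (left <= right) {
                long mid = (left + right) / 2;
                file.seek(mid * recordSize);
                String key = file.readUTF().trim();   // strip the padding
                int order = key.compareTo(wanted);
                if (order == 0) {
                    return mid;                       // found it
                }
                if (order < 0) {
                    left = mid + 1;                   // wanted lies towards the end
                } else {
                    right = mid - 1;                  // wanted lies towards the start
                }
            }
            return -1;                                // exhausted the search
        }
    }
}
```

Each pass through the loop costs one seek and one read, so even a file of a million records needs only around twenty disk accesses to find (or fail to find) a key.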
The lesson of all of this is - efficiency is not free. It comes at a price, and that price is how much time and effort you are willing to spend developing the data structures and file access routines for your applications.
We've covered a lot of File IO territory in the previous two chapters - we've introduced a number of powerful tools that allow us to finally develop genuinely useful applications of all kinds. There is more, so much more, to dealing with file IO... within this book we've looked at writing either simple stream files (text files) or random access files of our own standard. Parsing files of other formats is a topic worthy of a book in itself, and in many cases is not possible for us to even look at because of closed standards (such as Microsoft's doc format). The techniques we have covered in these chapters are both very powerful in their own right - there is no right or wrong answer to which technique should be applied in general. The application and the format of the data should drive your decision to go with one style of data access over the other. This is the case with so many things in programming - let your problem define your program, and you will find all is right with the universe.
Further Reading
The following table details further reading on the topic in this chapter, and also any external resources that you may find useful.
© 2004-2006 Michael James Heron