Monkeys at Keyboards: The Javanomicon © Michael James Heron
Topic: Java Programming Level: 2 Version: delta
21 - Case Study 6 - Flesch Readability Index
Chapter Objectives
By the end of this chapter, the reader will be able to:
In this chapter we're going to pull together two of the main themes of our first File IO chapter and build an application that reads in a Stream-Based file and parses it into a format suitable for manipulation. The scenario is that we are building an application to compute the Flesch Readability Index of any given text document. The Flesch Readability Index (henceforth referred to as FRI) is a numerical representation of how easy (or difficult) it is to read a particular piece of text. We are going to write an application with a simple interface that allows us to select a file to be analysed, and then reads it into our program, parses and computes the readability index, and then outputs that number along with a string representation of the text. The FRI is calculated according to the following formula:
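For reference, the Flesch Reading Ease score is conventionally published in the form below. We assume this standard version is the one the chapter works from:

```latex
\[
\mathrm{FRI} = 206.835
  - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right)
  - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right)
\]
```

Long sentences and long words both pull the score down, so the three quantities we need from any document are its word count, its sentence count and its syllable count.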
The higher the index, the more readable the document; the lower the index, the more difficult the document is to read. For the purposes of this application, we will count syllables by counting the groups of vowels within a word that are separated by consonants. Pairs of vowels and diphthongs are treated as a single vowel, and instances of the letter e at the end of a word are ignored. We will also be treating the letter y as a vowel. For example:
So, that's our scenario. Let's get cracking!
The interface for this case study is very simple. We require a label or a text area for when we output the string representation of the file we opened. We need a button that will display a JFileChooser dialog, and then another button that allows us to calculate the FRI. We will be using the BorderLayout manager for this, so let's have a look at our storyboard:
From this diagram, it should be a simple task to construct our interface code. We need to use File IO in this application, so we will have to make use of the free-standing application structure, which means extending a JFrame as opposed to a JApplet. In previous case studies we have had to draw out complicated diagrams that represented exactly where each component was going to go and how big it was going to be. When we discussed layout managers in chapter 14, we freed ourselves from this complicated routine. This allows us to concentrate on where components should go rather than how they actually get there. Our storyboard above shows a relationship between the main container, any panels, and the components. This alone is enough to allow us to construct the specified interface in code, as such:
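A minimal sketch of such an interface follows. The class name, the component names and the exact placement (text area in the centre, both buttons in a panel along the bottom) are assumptions rather than the chapter's own listing. To keep the layout code separate from the window, this sketch builds a content panel and drops it into a JFrame instead of extending JFrame directly:

```java
import java.awt.BorderLayout;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;

// Hypothetical view class; names and placement are assumptions.
class FleschViewSketch {
    final JTextArea output = new JTextArea(15, 40);
    final JButton openButton = new JButton("Open");
    final JButton analyseButton = new JButton("Analyse");

    // Builds the content panel: text area in the centre, buttons at the bottom.
    JPanel buildContent() {
        JPanel buttons = new JPanel();   // a JPanel uses FlowLayout by default
        buttons.add(openButton);
        buttons.add(analyseButton);

        JPanel content = new JPanel(new BorderLayout());
        content.add(new JScrollPane(output), BorderLayout.CENTER);
        content.add(buttons, BorderLayout.SOUTH);
        return content;
    }

    public static void main(String[] args) {
        // Swing components should be created on the event dispatch thread.
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Flesch Readability Index");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setContentPane(new FleschViewSketch().buildContent());
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```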
Every day, in every way, we're getting better and better! When we compile and run this application, we can see what our interface is going to look like:
It's not fancy, but it will do exactly what we need. In effect, we've completed the view part of our application. What is left now is the model and the controller.
Our application looks the part, but it doesn't actually do anything yet. We don't even have any instances of ActionListener registered for our buttons. Step one is updating what we've written so far a little so that we're implementing ActionListener, providing an empty actionPerformed method, and registering ActionListener objects for each of the buttons:
Now that we have this basic framework, we can concentrate on the controller logic, which will all go into the actionPerformed method. We don't have anything to compute the index yet, but we do know that when the open button is pressed we should flash up a JFileChooser dialog:
We need to refine this slightly so that the showOpenDialog call is triggered within an if structure that makes sure the user clicked the APPROVE_OPTION button. If they did, we'll put their selected file into a File object that represents the user's choice. We need to import java.io for this part. The File variable declaration should be placed at the top of the file with the definitions for the components:
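The controller steps so far might look something like the sketch below. The class and field names are hypothetical, and the buttons are declared here only so the example stands alone; in the real application they would be the ones already on the frame:

```java
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.File;
import javax.swing.JButton;
import javax.swing.JFileChooser;

// Hypothetical controller; field and button names are assumptions.
class FleschControllerSketch implements ActionListener {
    final JButton openButton = new JButton("Open");
    final JButton analyseButton = new JButton("Analyse");
    File chosenFile;   // the user's selection, or null if nothing has been chosen

    FleschControllerSketch() {
        // Register this object as the listener for both buttons.
        openButton.addActionListener(this);
        analyseButton.addActionListener(this);
    }

    @Override
    public void actionPerformed(ActionEvent e) {
        if (e.getSource() == openButton) {
            JFileChooser chooser = new JFileChooser();
            // Only keep the selection if the user confirmed the dialog.
            if (chooser.showOpenDialog(openButton) == JFileChooser.APPROVE_OPTION) {
                chosenFile = chooser.getSelectedFile();
            }
        }
        // The analyse button does nothing until the model has been written.
    }
}
```

If the user cancels the dialog, showOpenDialog returns a different value and chosenFile is left untouched.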
That's given us the framework we need to open a dialog and select a file... although as of yet we're not doing anything with it. The code for opening a file, reading its contents and parsing it is not really part of the controller component of this application - all that we should be providing at this point is the means for the user to interact with our model. We've allowed them to select their file, but we can't let them analyse it until we've written the code to do that. With that in mind, onto the model!
We're going to add a new object to our application - one that will take a File object as a constructor parameter and then analyse it according to the FRI. We'll call this class FleschComputer, because I think it sounds funny:
Within this class, we need to parse the provided file into a suitable string format, so we'll give ourselves another method called openFile that does just that. We already have a file reference, so we can use that as the basis for our FileReader object:
And then this myReader object becomes the basis for our BufferedReader:
And then we adopt the loop structure we discussed in our first chapter on File IO. We need a String variable that holds all of our input to date, and one that holds the current line. The line variable will be a local variable, and the allInput variable will be a class variable and so declared along with workingFile:
Now we have our data in a string format ready for manipulation, which means we no longer need our File object. Hrm... we really shouldn't have that as a class variable, since we only need it when we're opening a file. Instead, we'll pass it as a parameter to our method, thus improving efficiency a little bit. We'll then call openFile directly from the constructor method:
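Pulling the last few steps together, the model at this stage could be sketched as follows. The class name is hypothetical, the getText accessor is included only so the result can be inspected, and appending a space after each line (so that words either side of a line break stay separate) is an assumption:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical sketch of the model class after moving the File to a parameter.
class FleschFileSketch {
    private String allInput = "";   // everything read from the file so far

    FleschFileSketch(File workingFile) throws IOException {
        openFile(workingFile);
    }

    // Reads the file a line at a time and accumulates the text in allInput.
    private void openFile(File workingFile) throws IOException {
        try (FileReader myReader = new FileReader(workingFile);
             BufferedReader in = new BufferedReader(myReader)) {
            String line;
            while ((line = in.readLine()) != null) {
                // readLine strips the line terminator, so add a space back.
                allInput = allInput + line + " ";
            }
        }
    }

    String getText() {
        return allInput;
    }
}
```

The try-with-resources form closes both readers automatically, even if reading fails part-way through.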
Now comes the difficult bit - parsing this string into a readability index.
There are a number of things we need to work out in order to get our FRI value. Remember our calculation:
So we need to find the number of sentences, the number of words, and the number of syllables.
We'll provide a method that calculates each of these and returns the correct value as an integer:
Both the countWords and countSentences methods should be quite easy to code - a sentence ends with a full stop, an exclamation mark or a question mark. Actually, that's not quite right since it will mean ellipses (three full stops in a row) will be counted as three sentences. So the calculation is actually a little bit more complicated than that. First, let's deal with the easy bit:
Once we have the easy bit, we can worry about the ellipses. We know we have an ellipsis (or some other non-standard sentence ending) when we have found a full stop and the character after that one is also a full stop... so we simply don't count any of these towards the number of sentences. First we make sure that there is a next character:
And then we store the next character if there is one in a variable called next:
This gives us the following structure:
But that won't work quite right. Consider an ellipsis located in the middle of a string (starting at position 11):
We start at position 11, and find a full stop. Then we check the next character (at position 12), which is also a full stop. So we don't count it. We move to position 12, and find a full stop. We check the next character (at position 13), which is also a full stop. We don't count that one either. We move to position 13, and find a full stop. But the next character (at position 14) here isn't a full stop and so this character gets counted towards the number of sentences. Alas, this is not correct since it is still part of the same ellipsis. If there is no full stop as the next character, we also need to check to make sure there was no full stop as the previous character - only then can we be sure that ellipses are not counted as the end of a sentence:
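The complete check can be sketched as below. The class name is hypothetical, and using a space as a stand-in when there is no next or previous character is an assumption made to keep the boundary cases simple:

```java
// Hypothetical sketch of the sentence counter with ellipsis handling.
class SentenceCounterSketch {
    static int countSentences(String text) {
        int sentences = 0;
        for (int i = 0; i < text.length(); i++) {
            char current = text.charAt(i);
            if (current == '!' || current == '?') {
                sentences++;
            } else if (current == '.') {
                // Neighbouring characters, or a space when we are at either end.
                char next = (i + 1 < text.length()) ? text.charAt(i + 1) : ' ';
                char previous = (i > 0) ? text.charAt(i - 1) : ' ';
                // A full stop beside another full stop is part of an ellipsis.
                if (next != '.' && previous != '.') {
                    sentences++;
                }
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Wait... what? Yes."));   // prints 2
    }
}
```

Note that under this rule an ellipsis never counts as a sentence ending, even when it happens to finish one.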
So, that's the code we need to ensure the number of sentences is calculated. Now let's think about the number of words. This is a similar procedure, except that we count spaces as being the indicator between words. We won't worry about non-standard documents where there are no spaces between punctuation and letters, for example:
Our countWords method is going to be pretty simple:
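One way to write it is sketched here. The class name is hypothetical, and trimming the text first and adding one for the final word are assumptions about how the boundary cases are handled:

```java
// Hypothetical sketch of a space-based word counter.
class WordCounterSketch {
    static int countWords(String text) {
        String trimmed = text.trim();   // ignore leading and trailing spaces
        if (trimmed.isEmpty()) {
            return 0;
        }
        int words = 1;                  // the last word has no space after it
        for (int i = 0; i < trimmed.length(); i++) {
            if (trimmed.charAt(i) == ' ') {
                words++;
            }
        }
        return words;
    }
}
```

Because every space counts, a double space between words will inflate the total slightly; that is in keeping with our decision not to worry about non-standard documents.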
All this leaves us to do is calculate the number of syllables. This is a bit trickier. We can do this in the same way we have done above, by simply crunching through every letter... but the logic for doing this is quite complicated when it starts relating to the letter e at the end of a word. Instead, we'll use a StringTokenizer, as we discussed way back when at the dawn of time (in chapter eight). We'll tokenize the input based on a space, and then calculate the number of syllables in each word. First we need a method that tells us whether a letter is a vowel. This is a simple method:
And we would also benefit greatly from having a method that returned the number of syllables in a particular word. Let's start with a simple implementation of this that simply counts the number of vowels without any other considerations.
Then we can worry about discounting the letter e at the end of a word:
The next part is counting diphthongs as a single vowel... we use a similar technique for this as the one we used to avoid ellipses. All we need in this case though is to check the next letter - we want them to be counted at least once:
Finally, we need a little bit at the end to make sure every word has at least one syllable. Just before we return from the method:
And then we need a method that calculates the number of syllables for each word and totals it all up. Here is where our StringTokenizer comes in. We need to import java.util to get the tokenizer, but once we have it we are simply brimming over with Cosmic Power:
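The syllable rules built up over the last few steps can be sketched as a single class. The class and method names are hypothetical, and the sketch assumes each token is a plain word: punctuation attached to a word is not stripped, so a word such as "tree." is treated slightly differently from "tree":

```java
import java.util.StringTokenizer;

// Hypothetical sketch of the syllable-counting helpers.
class SyllableCounterSketch {
    // The letter y is treated as a vowel alongside a, e, i, o and u.
    static boolean isVowel(char c) {
        return "aeiouy".indexOf(Character.toLowerCase(c)) >= 0;
    }

    static int countSyllables(String word) {
        int syllables = 0;
        for (int i = 0; i < word.length(); i++) {
            if (!isVowel(word.charAt(i))) {
                continue;
            }
            boolean last = (i == word.length() - 1);
            // Ignore an e at the very end of the word.
            if (last && Character.toLowerCase(word.charAt(i)) == 'e') {
                continue;
            }
            // In a run of vowels, only the final vowel of the run is counted.
            if (!last && isVowel(word.charAt(i + 1))) {
                continue;
            }
            syllables++;
        }
        // Every word has at least one syllable.
        return Math.max(syllables, 1);
    }

    // Splits the text on spaces and totals the syllables of each word.
    static int countAllSyllables(String text) {
        StringTokenizer tokens = new StringTokenizer(text, " ");
        int total = 0;
        while (tokens.hasMoreTokens()) {
            total += countSyllables(tokens.nextToken());
        }
        return total;
    }
}
```

Under these rules "beauty" scores two (the run "eau", then "y") and "the" scores one, because its only vowel is a trailing e and the minimum then applies.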
And then, we put it all together! We have our formula, we have the methods that compute each element of it, so we'll have a final method: computeIndex that will return the reading index for this piece of text. This is going to be the interface to our class - the developer will create an instance passing in a File object, and then call computeIndex to get the FRI out of it. There is no need for the developer to ensure that the methods we coded above are chained together in an appropriate way:
However, this won't work quite correctly due to the way Java converts between ints and doubles... we need to give Java some guidance to help it come up with the correct calculation by casting all our ints to doubles:
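The effect of the casts is easiest to see side by side. In this sketch the class and method names are hypothetical, the counts are passed in as parameters so the example stands alone, and the coefficients are those of the standard Flesch Reading Ease formula, which we assume is the one in use here:

```java
// Hypothetical sketch contrasting integer division with the cast version.
class IndexSketch {
    // Without casts, each int division truncates before it reaches the formula.
    static double computeIndexTruncated(int words, int sentences, int syllables) {
        return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words);
    }

    // Casting to double first keeps the fractional part of each ratio.
    static double computeIndex(int words, int sentences, int syllables) {
        return 206.835
                - 1.015 * ((double) words / (double) sentences)
                - 84.6 * ((double) syllables / (double) words);
    }

    public static void main(String[] args) {
        // 100 words, 5 sentences, 150 syllables.
        System.out.println(computeIndexTruncated(100, 5, 150));   // roughly 101.9
        System.out.println(computeIndex(100, 5, 150));            // roughly 59.6
    }
}
```

With 150 syllables across 100 words, the integer version treats the ratio as 1 rather than 1.5, which is enough to shift the result by more than forty points.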
And having done this, we then want to hook it back into our application. When we press the analyse button, we want to create an instance of our FleschComputer class and then call computeIndex:
All that's left to do is display the text we parsed... we have no method for getting that text as yet, but we simply add a getText method to our FleschComputer:
And then when we're done computing, we call that method and put the returned value into our text area:
And that's us finished the core application!
There are a number of things we could do with our FleschComputer to improve it from an object oriented standpoint. For one thing, our application will only read a file from disk... but we have a nice text area there that we might want to enter things into for a quick analysis of their readability. We can handle that easily by adding an overloaded constructor method:
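A minimal sketch of the two constructors side by side follows. The class name is hypothetical and only the parts relevant to construction are shown:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical sketch of a model class with overloaded constructors.
class FleschInputSketch {
    private String allInput = "";

    // Original route: read the text from a file on disk.
    FleschInputSketch(File workingFile) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(workingFile))) {
            String line;
            while ((line = in.readLine()) != null) {
                allInput = allInput + line + " ";
            }
        }
    }

    // Overload: analyse text we already hold, such as a JTextArea's contents.
    FleschInputSketch(String text) {
        allInput = text;
    }

    String getText() {
        return allInput;
    }
}
```

A caller holding a text area could then write something like new FleschInputSketch(textArea.getText()) and use the object exactly as before.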
It's a good idea to make our class as general as possible to aid in reusability. Requiring file IO means that we restrict its use to applications - it's possible though that we might want to also provide a facility for applets to make use of the FleschComputer, and so a range of constructor methods are preferable. We could also benefit from providing a range of utility methods that provide information in a range of formats. Being able to compute the index itself is fine, but that requires that we understand what the numbers represent. Maybe we could also provide a method that computes the index but returns a meaningful string instead of a cold, impersonal number:
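Such a method might be sketched as below. The class and method names are hypothetical, and the bands and labels follow the commonly quoted interpretation table for the Flesch score rather than anything defined in this chapter, so treat them as illustrative:

```java
// Hypothetical sketch mapping an index to a descriptive label.
class ReadabilityLabelSketch {
    // Band boundaries are assumptions based on the commonly quoted table.
    static String describeIndex(double index) {
        if (index >= 90) return "Very easy";
        if (index >= 80) return "Easy";
        if (index >= 70) return "Fairly easy";
        if (index >= 60) return "Standard";
        if (index >= 50) return "Fairly difficult";
        if (index >= 30) return "Difficult";
        return "Very difficult";
    }
}
```

Inside the FleschComputer this would simply wrap computeIndex, returning the label for whatever number it produces.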
Returning to the idea of more constructors being better, we could also provide a constructor method that allowed the user to analyse a web page for readability - we discussed how to do that in our chapter on file IO. The difficulty with web pages lies in the large number of special control tags that are used - these are invisible to the reader, but would still be counted in our readability score. That's not an ideal situation. However, we can apply what we learned about regular expressions in chapter 8.9 and get rid of them before we analyse. An HTML tag is contained between a pair of angle brackets, so we can use the following regular expression to match them:
We can feed that into our replace method which will remove all instances of the tags:
It'll even get rid of closing tags! In fact, running it will show a problem - it will get rid of everything in the document! Alas! This is because the star metacharacter is greedy... it's not satisfied when it finds the first matching closing bracket... it continues all the way through the document until it finds the last closing bracket. Consider a simple example:
It finds the first matching element of the regular expression:
And then continues through the document matching every character (including closing brackets) to the period character. It's only when it finds the last bracket it will interpret it as a match:
To deal with this, we need to mark the star character as lazy - we do this with a ? symbol. A lazy expression will be happy with the first match it finds:
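The difference between the two forms can be seen in a short sketch. The class and method names are hypothetical, the patterns `<.*>` and `<.*?>` are assumed to be the ones under discussion, and String.replaceAll is used because it is the String method that interprets its first argument as a regular expression:

```java
// Hypothetical sketch comparing greedy and lazy tag stripping.
class TagStripSketch {
    // Greedy: .* runs on to the last closing bracket it can find.
    static String stripGreedy(String html) {
        return html.replaceAll("<.*>", "");
    }

    // Lazy: .*? stops at the first closing bracket.
    static String stripLazy(String html) {
        return html.replaceAll("<.*?>", "");
    }

    public static void main(String[] args) {
        String sample = "<b>Hello</b> world.";
        System.out.println("[" + stripGreedy(sample) + "]");   // prints [ world.]
        System.out.println("[" + stripLazy(sample) + "]");     // prints [Hello world.]
    }
}
```

The greedy version swallows "Hello" along with both tags, because everything from the first opening bracket to the last closing bracket is one match.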
Our replace method call won't get rid of all the stray characters, but it'll get rid of enough of them. The rest are left as an exercise for the reader. We can now provide a third constructor that allows for our computer to work on remote HTML resources:
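A constructor along those lines could be sketched as follows. The class name is hypothetical, taking a java.net.URL as the parameter type is an assumption, and the tag-stripping pattern is the lazy one from above:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

// Hypothetical sketch of a constructor that reads and cleans an HTML resource.
class FleschUrlSketch {
    private String allInput = "";

    FleschUrlSketch(URL address) throws IOException {
        StringBuilder raw = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(address.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                raw.append(line).append(' ');
            }
        }
        // Remove anything between angle brackets before analysis.
        allInput = raw.toString().replaceAll("<.*?>", "");
    }

    String getText() {
        return allInput;
    }
}
```

Since a URL can also point at a local file, the same constructor can be tried out without a network connection by passing the result of File.toURI().toURL().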
This harkens back to the idea of file parsing that we discussed briefly in chapter 12. There is no standard mechanism for parsing files... it's an application-dependent process. A good grounding in string parsing is required to open up the doors of functionality that are otherwise closed to us.
As you can see from this particular application, the file access part is only one small aspect. We read the file into a string, and then we have to do all the Back-Breaking Labour of actually parsing it into useful information. This is true for most applications involving file IO - the file is really just a slightly more complex kind of variable. This particular example reintroduced us to the StringTokenizer we discussed way back in chapter 8. The way we use this tokenizer is identical regardless of whether the original string came from a file or a JTextArea. Spending time practising string parsing will have an immediate effect on how easily you are able to parse stream-based IO files into a useful format.
Further Reading
The following table details further reading on the topic in this chapter, and also any external resources that you may find useful.
© 2004-2006 Michael James Heron