Monkeys at Keyboards: The Javanomicon
© Michael James Heron
Topic: Java Programming
Level: 2
Version: delta

The memory management on the PowerPC can be used to frighten small children.
Linus Torvalds

8 - String Theory



Chapter Objectives

By the end of this chapter, the reader will be able to:

  • Manipulate the state of a String through predefined methods.
  • Query the state of a string, and analyse its contents.
  • Break a string up into regular substrings through use of string tokenization.
  • Apply sophisticated pattern matching.


8.1

Introduction

Working with strings is a fundamental part of most programming languages. Some languages, like C, give little support for working with strings as a cohesive unit. Java on the other hand has extensive support that allows the developer to ignore the minutiae of string manipulation and concentrate instead on writing the application specific processing code of any given string based application.

Java provides two data types for dealing with textual data. The first of these is the primitive char data-type. The second of these is the comprehensive String class. We will look at both of these in the course of this chapter, as well as the concept of string tokenization which allows for strings to be parsed in a clean and effective way.

String tokenization is somewhat of an antiquated topic, as the StringTokenizer class that we'll be looking at has been superseded in many respects by trendier, easier versions. Again, I emphasise that this is not a book about Java - it's a book about learning how to program. String tokenization offers an excellent system for gaining an appreciation of many underlying data manipulation methods.

8.2

The Char Datatype

The char data type is one of Java's library of primitive data types. It is used to hold a single character, or a character code:

char middleInitial = 'J'; 
char lineBreak = '\n';

Since it is a primitive data type, variables of type char are very efficient in terms of their memory use and speed of access. They don't come with the overhead associated with a String variable.

In older C type languages - like C itself - strings of text are built up as arrays of characters:

char[] myName = { 'B', 'I', 'N', 'G', '!' }; 

Such representations require large amounts of processing to be useful - C provides a wealth of functions designed to make manipulating such arrays as painless as possible, but it is still an inelegant solution to something that is often a basic part of building modern applications.

The char data type holds a 16-bit Unicode character. Unicode is a numerical mapping system that maps all the standard alphanumerical symbols to unique whole number values. It doesn't matter what this precise mapping actually is - the only thing that is important is that internally, variables of type char are actually integer values that are mapped to characters using the Unicode standard.

For example, the Unicode number for the lower case letter a is 97. The following is a valid assignment in Java:

char bing; 
bing = 97;

We can display this variable using System.out.println:

System.out.println ("" + bing); 

When we do so, it's not the number that is displayed on the screen, it is the letter that the number represents:

a

However, if we attempt to assign the value of an int variable to a char, the compiler complains about a possible loss of precision (newer compilers word this as a 'possible lossy conversion from int to char'):

int i = 97; 
bing = i;

In order for this to work, we must cast the int into a char - essentially we tell Java not to worry about the loss of precision, we know what we're doing. We cast the int by providing the type of variable we want to cast it into:

int i = 97; 
bing = (char) i;

We can compactly write the code for displaying all the letters from a to z with a standard for loop:

char bing; 
int i;
for (i = 97; i < 123; i++) {
    bing = (char) i;
    System.out.print (" " + bing);
}

The ease with which char variables can be manipulated by integer values is a strength when performing low level string parsing. We'll see an example of this in the substitution cipher case study.

If we wish to avoid the whole issue of knowing what number corresponds to what symbol, we can simply assign the symbol itself. We surround the letter to be assigned in single quotes:

char bing = 'a'; 

When Java sees the 'a' notation, it works out for itself what the appropriate Unicode number is and assigns that to the variable.

Char data types have some very specialised uses which we will have cause to explore in the next chapter, but largely they are superseded as a method for manipulating strings of text by the String class.

Java tip

Char variables provide an easy mechanism for performing arithmetic on alphanumerical values - for example, if we want the letter ten places on from the current letter we can just add 10 to the value. This has enormous power when dealing with low level character parsing.
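
As a rough sketch of that kind of arithmetic, the class below moves a letter along by a number of places and builds the lower case alphabet using a char as the loop counter. The names CharArithmetic, shift and alphabet are invented for this example, and shift makes no attempt to wrap around when it runs off the end of the alphabet.

```java
public class CharArithmetic {

    // Moves a character along by the given number of places.
    // Assumption: the caller keeps the result inside a sensible range.
    static char shift (char letter, int places) {
        return (char) (letter + places);
    }

    // Builds "abc...xyz" by counting with a char instead of an int.
    static String alphabet() {
        StringBuilder letters = new StringBuilder();
        for (char c = 'a'; c <= 'z'; c++) {
            letters.append (c);
        }
        return letters.toString();
    }

    public static void main (String args[]) {
        System.out.println (shift ('a', 10));   // prints k
        System.out.println (alphabet());
    }
}
```

Note that the cast is still needed inside shift: adding an int to a char gives an int, and Java wants us to say out loud that we mean to squeeze it back into a char.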


8.3

The String Class

Java provides a powerful and versatile String class that deals with the storage and manipulation of whole strings of text. Whereas the char data-type is restricted to a single character, Strings can hold as many characters as are required.

Java provides some in-built syntactical support for dealing with Strings - to a degree, it's possible to ignore the fact that they are really objects and treat them like primitive data-types at least as far as the syntax goes. For example, Java provides two ways of creating string objects. One is the standard object instantiation syntax:

String myString = new String ("Hello!"); 

However, Java also allows us to declare a string in the same way we would declare a primitive variable:

String myString = "Hello"; 

Both of these lines of code do (almost) exactly the same thing - it is not the case that one creates a primitive data type and the other creates an object (as with int and Integer). This is a feature of Java itself and not of object orientation in general. This in-built support for simple string syntax is a potential source of confusion, especially since in most respects Strings behave just like any other objects.

There is some difference in the way Java treats these two statements, and we will discuss this later in this chapter.

We use the assignment operator to assign values to a string. Strings are immutable - they cannot be changed. The variable name of a string is a pointer to a location in memory. When a new value is assigned to a string, it creates a new string in another part of the computer's memory and points the variable name to this new string.

The previous string is then made available for garbage collection which we'll discuss in a few moments.

For example:

String myString = new String ("Hello Sailor!"); 

Fig 8.1: A created string

myString = new String ("Gnomes!"); 

Fig 8.2: A reassigned string

The contents of the string that myString pointed to are never changed - Java just changes where in memory the variable points.
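
One way to convince ourselves of this is to keep a second variable pointing at the original string before we reassign the first. The sketch below does just that - the class name Reassignment and the variable keepsOldValue are made up for the example.

```java
public class Reassignment {

    // Returns what each variable refers to after myString is reassigned.
    static String[] reassign() {
        String myString = "Hello Sailor!";
        String keepsOldValue = myString;   // second reference to the same string

        myString = "Gnomes!";              // points myString at a different string

        return new String[] { myString, keepsOldValue };
    }

    public static void main (String args[]) {
        String[] results = reassign();
        System.out.println (results[0]);   // prints Gnomes!
        System.out.println (results[1]);   // prints Hello Sailor!
    }
}
```

If assignment really altered the string in place, both variables would show the new text. They don't, because only the pointer moved.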

The addition operator works on Strings, but it does not perform a mathematical addition - instead it performs a concatenation. The string on the right hand side of the operator is appended to the end of the string on the left hand side of the operator:

String myString = "Hello " + "Sailor"; 
System.out.println (myString);

This section of code would print out:

Hello Sailor

Another area in which Java makes a special exception for strings is in the area of data-type conversion. If you attempt to add a string and a primitive data type together, Java will perform an automatic conversion so that what you end up with is a string containing the primitive variable:

String myString = "" + 10; 

This takes an empty string "" and attempts to add to it the number 10. After this operation, the variable myString actually holds the value "10".

This also works for objects - most classes in Java have a method called toString that is defined in them. This method simply returns a string that best represents the data contained within. Whenever an object is added to a string using the + operator, the method toString is called on the object and the return value of that method call is appended to the string.
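
To see both conversions side by side, here is a small sketch. The Gnome class is invented purely for illustration - any class that overrides toString would behave the same way.

```java
public class ConcatenationDemo {

    // A tiny class of our own, made up for this example.
    static class Gnome {
        private final String name;

        Gnome (String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return "Gnome named " + name;
        }
    }

    static String describe (Object thing) {
        // The + operator asks the object for its toString() text.
        return "Found: " + thing;
    }

    public static void main (String args[]) {
        String number = "" + 10;
        System.out.println (number);                          // prints 10
        System.out.println (describe (new Gnome ("Bing")));   // prints Found: Gnome named Bing
    }
}
```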

Java tip

The difference in how Java deals with string declarations opens up a design choice for the developer. Consistency is important (unless other considerations reduce it to a secondary consideration), so you have a choice - do you maintain consistency with the object model, or do you go for the memory efficiency of the specialised syntax? If you want object consistency, you should declare all your strings using the new keyword. If you want the extra efficiency, you should declare your strings using the = operator. There is no right or wrong answer to this - both ways have their benefits. It is important however that you don't randomly mix and match, as that can lead to bad practice due to the ways the two approaches differ.


8.4

Garbage Collection

As with most objects, assigning a value to a variable that already contains an object will result in the old contents being over-written. In some languages, it is necessary for the developer to explicitly indicate that objects are to be destroyed - if they are not explicitly destroyed then they will stay around in memory for as long as the program is executing. This leads to memory leaks where the application uses up more system resources than it needs to.

Java does not require the developer to do this - instead, Java has an automatic system that runs at irregular intervals. This system is called garbage collection. Java keeps track of which objects can still be reached from running code. A convenient way to picture this is as a count of how many references point at each object - this is known as an object's reference count. (Real Java virtual machines generally trace which objects are reachable rather than keeping a literal count, but the count is a useful mental model.) If the reference count for an object becomes zero (as in, the object is now completely inaccessible in code), the object becomes a candidate for garbage collection.

The variable name of an object does not point directly to its contents - instead, it points to the location of memory where the contents may be found. The reference count for a memory location increases by one every time a reference is made to that location in memory. For example:

ArrayList<String> bing; 

This declares a variable that can refer to an ArrayList - no ArrayList object actually exists until:

bing = new ArrayList<String>(); 

The variable bing points to the space in memory that was created above, so the reference count for that memory is incremented by one. If we later have:

ArrayList<String> bong = bing; 

The variable bong does not point to a copy of the memory location bing - it points to the same memory location. In this case, the reference count for the memory location is increased to two.
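
We can't easily watch the garbage collector at work, but we can check that bing and bong really do share one object. This is only a sketch - the class and method names are invented for the example.

```java
import java.util.ArrayList;

public class SharedReference {

    // Returns true if both variables refer to the very same list object.
    static boolean sameObject() {
        ArrayList<String> bing = new ArrayList<String>();
        ArrayList<String> bong = bing;
        return bing == bong;
    }

    // Adds through one variable and reads the size through the other.
    static int sizeSeenThroughBing() {
        ArrayList<String> bing = new ArrayList<String>();
        ArrayList<String> bong = bing;
        bong.add ("added through bong");
        return bing.size();
    }

    public static void main (String args[]) {
        System.out.println (sameObject());            // prints true
        System.out.println (sizeSeenThroughBing());   // prints 1
    }
}
```

Because there is only one list, it stays reachable for as long as either variable still points at it.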

The reference count is decreased when a reference ceases to exist - references cease to exist when objects are destroyed by garbage collection, or when they fall out of scope.

Literal reference counting is not a flawless system - it's possible to create a chain of objects, each referring to the other... these will ensure that the reference count of each object never falls to zero, even when there is no way of ever referring to the chain in your program. This is why Java's garbage collector works from reachability rather than raw counts: a chain of objects that nothing in your running code can reach is still a candidate for collection, even though its members refer to one another.

Whenever the garbage collection routine runs, all objects that have a reference count of zero are destroyed and the area of memory in which they are located is cleared up and made available for assignment to other objects.

This feature of Java frees the developer from having to worry about making sure their object creation and destruction model is airtight, and allows the developer to concentrate instead on writing code that meets the requirements of the problem.

Java tip

Garbage collection takes much of the burden for memory management off of the developer, but not all of it - the garbage collection routines will only kick in when an object has a zero reference count. If you are maintaining references to objects longer than you need, then the object will not be selected for garbage collection. For this reason, it is important that you choose the scope of your variables correctly.


8.5

String Comparison

Strings are objects, and as such the variable we use for them does not contain the string itself, but instead the memory location where that string may be found. This has implications for checking one string against another. Consider the following code:

String one = new String ("Hello"); 
String two = new String ("Hello");

Fig 8.3: Two strings

With these two statements, two separate String objects are created with two separate variable names. Their contents are identical, so it is not unreasonable to think that we could use the equivalence operator to determine if they have the same value:

if (one == two) { 
}

In the example above, this will not work. This statement checks to see if one and two are actually the same object, not objects with the same contents. Since both variables point to different locations in memory, the equivalence operator will return false. This can be quite confusing.

Here is where the difference in creating strings alluded to above comes into play - this is only true when strings are created using the new keyword. If we create strings by simply assigning a value, then Java points the variable to a string object held in a shared pool in memory. If an appropriate object doesn't already exist, Java will create one. If it does, it will point the variable at that object. So:

String one = "Hello"; 
String two = "Hello";

Fig 8.4: One string

With this set up, the equivalence operation will indeed return true, because both variables point to the same location of memory.

This can be a confusing distinction, but when doing day to day programming it is not often a real issue. We can ignore the whole problem by never using the equivalence operator on a string. Instead, we make use of the comparison methods provided by Java for the purpose of evaluating string similarity.
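
The three cases can be put side by side in a short sketch. The class and method names here are made up for the example - the behaviour they show is standard Java.

```java
public class StringEquality {

    static boolean sameObjectWithNew() {
        String one = new String ("Hello");
        String two = new String ("Hello");
        return one == two;        // two separate objects
    }

    static boolean sameObjectWithLiterals() {
        String one = "Hello";
        String two = "Hello";
        return one == two;        // one shared object
    }

    static boolean sameContents() {
        String one = new String ("Hello");
        String two = new String ("Hello");
        return one.equals (two);  // compares the letters, not the locations
    }

    public static void main (String args[]) {
        System.out.println (sameObjectWithNew());        // prints false
        System.out.println (sameObjectWithLiterals());   // prints true
        System.out.println (sameContents());             // prints true
    }
}
```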

The first of these methods is the equals method - this is a method provided by all Java objects and is used to compare one instance of a class against another to determine if their values are the same. The equals method returns true or false, and in the case of a String object, it is a case sensitive comparison:

String one = "HeLLo"; 
if (one.equals ("hello")) {
System.out.println ("Yep, those are equal by yimminy!");
}

In the case of the code above, the comparison will return false because although the letters are identical, the case is different.

If we simply want to find out if two strings have the same letters and don't care about the case, we use the equalsIgnoreCase method:

String one = "HeLLo"; 
if (one.equalsIgnoreCase ("hello")) {
System.out.println ("Yep, those are equal by yimminy!");
}

In this case, the comparison will return true because we have explicitly stated we are not interested in the case of the comparison.

There is a third method for providing comparisons, and this method is compareTo. This method returns an integer number - this number is either positive, negative, or zero (naturally).

String one = "a horrible thing"; 
String two = "b horrible thing";
int comparison = one.compareTo (two);

In this comparison, we'll call string one the victim and string two the aggressor - for no particular reason other than we have to explain this in general terms and I quite like the edge of drama it gives to the whole thing.

The compareTo method does a Unicode value comparison on each letter in a string, until it reaches a point where the two letters are not equal. Consider two strings:

Fig 8.5: A string with indexes

compareTo will first check the equivalence of the first two letters:

if ('H' == 'H') { 

Then the next two letters:

if ('E' == 'E') { 

And so on, until it finds two that don't match:

if ('P' == 'M') { 

When it does find an inequality, it sings a little hymn of joy and then returns a value. The value it returns depends on the Unicode comparison of the two. If the Unicode value of the victim letter is less than the Unicode value of the aggressor letter, then it returns a negative number.

If the victim letter is greater than the aggressor letter, it returns a positive number.

If both strings match all the way through and are the same length, it returns 0. (If one string simply runs out first, the shorter string counts as the lesser.)

The number that is returned is based on how far apart the letters are - note also that this is a case sensitive comparison.
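
Here is a small sketch of the victim and aggressor in action. The verdict method is our own invention - it just turns the sign of the number into a word so the result is easier to read.

```java
public class CompareDemo {

    // Turns the number from compareTo into a word.
    static String verdict (String victim, String aggressor) {
        int comparison = victim.compareTo (aggressor);
        if (comparison < 0) {
            return "before";
        }
        else if (comparison > 0) {
            return "after";
        }
        return "same";
    }

    public static void main (String args[]) {
        // 'P' is three places after 'M', so this prints 3.
        System.out.println ("HELP".compareTo ("HELM"));
        System.out.println (verdict ("a horrible thing", "b horrible thing"));   // prints before
        System.out.println (verdict ("HELP", "HELP"));                           // prints same
    }
}
```

In practice it is usually only the sign of the result that we care about, which is why verdict throws the actual number away.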

Finally, we have a method called endsWith that returns true if a string ends with a particular substring, and false otherwise:

String filePath = "C:\\temp\\someFile.java"; 
if (filePath.endsWith (".java") == true) {
System.out.println ("This is a java file.");
}

else {
System.out.println ("This isn't a java file.");
}

The natural companion to this method is startsWith, and that works just like you'd expect - it returns true if a string starts with a particular substring.

8.6

Extracting Information from Strings

Strings provide a number of methods for extracting information that exists as a subset of that which is contained in the String itself. Internally, individual characters in a String can be referred to with an index in the same way as elements of an array can be indexed:

Fig 8.6: A String

We can use the charAt method to retrieve a single letter from the string - this is returned as a variable of type char:

String str = "WELCOME TO STRINGS"; 
char letter;
letter = str.charAt (11);

With this example, we get the character that is found at position 11 in the string:

Fig 8.7: A String indexed on element 11

In this case, the char variable letter contains the single character 'S'.

There is also a substring method that returns a portion of a string - we can use this to extract any information that we need from its context. We pass two parameters to the substring method. The first is the index to extract from, and the second is the index to extract to. The first is inclusive, the second is not... so to get the substring 'TO' from the string above:

String extracted = str.substring (8, 10); 

Remember that the second parameter is not inclusive, so that 8 and 10 actually give the characters at indexes 8 and 9.

Fig 8.8: A String indexed on elements 8 and 9

At the end of this process, the string variable extracted contains the value 'TO'.

We can find where a particular substring is located within a larger string by using the indexOf method. This returns an integer that represents where the substring begins:

int start = str.indexOf ("TO"); 

Since the substring starts at position 8, the variable start has the value 8. If the substring is not contained within the larger string, the indexOf method will return -1.

There is a very similar method called lastIndexOf which will return the last instance of the substring within the string. For example, if we have the following text:

Fig 8.9: A different string

String str = "STRINGS AND STRINGS"; 
int firstInstance = str.indexOf ("STRINGS");
int lastInstance = str.lastIndexOf ("STRINGS");

In this case, firstInstance has the value 0. The variable lastInstance contains the value 12, where the last instance of the substring starts.

The length method of a string returns how many characters are contained within. Calling length on the earlier string "WELCOME TO STRINGS" would give 18 as a return value - there are eighteen characters (including spaces) in the complete string.
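
The query methods from this section can be gathered into one small sketch. The positions helper is made up for the example - it simply bundles the results of indexOf, lastIndexOf and length into an array so we can look at them together.

```java
import java.util.Arrays;

public class StringQueries {

    // Returns { first index of wanted, last index of wanted, length of text }.
    static int[] positions (String text, String wanted) {
        return new int[] {
            text.indexOf (wanted),
            text.lastIndexOf (wanted),
            text.length()
        };
    }

    public static void main (String args[]) {
        String str = "WELCOME TO STRINGS";
        System.out.println (str.charAt (11));          // prints S
        System.out.println (str.substring (8, 10));    // prints TO
        System.out.println (Arrays.toString (positions ("STRINGS AND STRINGS", "STRINGS")));
        System.out.println (Arrays.toString (positions (str, "GNOME")));
    }
}
```

The last line shows what happens when the substring is missing: both index methods return -1, while length is unaffected.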

We can combine these methods to allow us to easily extract substrings from larger strings without having to specify exact start and end parameters at design time.

String str = "This is an example string"; 
String substringToFind = "example";
int start = str.indexOf (substringToFind);
String extracted = str.substring (start, start + substringToFind.length());

This is a useful method for manipulating strings without requiring us to know in advance where the particular indexes of substrings are.
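
One thing the snippet above glosses over is what happens when the substring isn't there at all - indexOf returns -1, and passing that to substring throws an exception. A slightly more defensive sketch is below. The method names find and textAfter are our own, and returning an empty string for 'not found' is just one reasonable choice.

```java
public class Extraction {

    // Returns the wanted text if it appears in the larger string, or "" if not.
    static String find (String text, String wanted) {
        int start = text.indexOf (wanted);
        if (start == -1) {
            return "";
        }
        return text.substring (start, start + wanted.length());
    }

    // Returns everything that follows the marker, or "" if the marker is missing.
    static String textAfter (String text, String marker) {
        int start = text.indexOf (marker);
        if (start == -1) {
            return "";
        }
        // With one parameter, substring runs to the end of the string.
        return text.substring (start + marker.length());
    }

    public static void main (String args[]) {
        String str = "This is an example string";
        System.out.println (find (str, "example"));    // prints example
        System.out.println (textAfter (str, "an "));   // prints example string
    }
}
```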

8.7

String Manipulation

The String class gives us a family of methods for manipulating the contents of a string. Usually we use these for ensuring the internal consistency of a particular string in order to set it up for later processing.

For example, consider the following code that counts through a string and returns how many vowels may be found within:

String str = "String to be searched"; 
int vowels = 0;
char temp;
for (int i = 0; i < str.length(); i++) {
    temp = str.charAt (i);
    if (temp == 'a' || temp == 'A' || temp == 'e' || temp == 'E' || temp == 'i'
        || temp == 'I' || temp == 'o' || temp == 'O' || temp == 'u' || temp == 'U') {
        vowels += 1;
    }
}

Despite the fact we're only searching for five letters, we need to have ten conditions - two to deal with each vowel to handle there being instances of upper and lower case letters.

It would be much more convenient for coding purposes if we could ensure that the case is consistent. Java provides us with two methods to do this: toUpperCase and toLowerCase.

As mentioned previously, Strings are immutable - calling these methods does not affect the string itself. Instead, they return another String object containing the altered string:

String myString = "ChEck tHis fOr VoWels"; 
String lowerCaseString = myString.toLowerCase();

Once we've called this method, our string is in a consistent format (it's all lower case) and so we can write a much simpler if statement:

if (temp == 'a' || temp == 'e' || temp == 'i' || temp == 'o' || temp == 'u') 
{
vowels += 1;
}

This is obviously much more convenient.
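
Putting the pieces together, a complete vowel counter might look something like the sketch below. The class and method names are invented for the example - the important part is that the loop reads its characters from the lower case copy, not from the original string.

```java
public class VowelCounter {

    static int countVowels (String text) {
        // Work on a lower case copy so we only need five conditions.
        String lowerCaseString = text.toLowerCase();
        int vowels = 0;
        char temp;
        for (int i = 0; i < lowerCaseString.length(); i++) {
            temp = lowerCaseString.charAt (i);
            if (temp == 'a' || temp == 'e' || temp == 'i' || temp == 'o' || temp == 'u') {
                vowels += 1;
            }
        }
        return vowels;
    }

    public static void main (String args[]) {
        System.out.println (countVowels ("ChEck tHis fOr VoWels"));   // prints 5
    }
}
```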

Another method called trim can be used to strip leading and trailing white space from a string - any space that precedes the actual text and any space that trails the actual text is removed. Space between words is preserved:

String str = " this is a sentence with white space "; 
str = str.trim();
System.out.println (str);

After this, the output is:

this is a sentence with white space

We can then call toUpperCase to change the case of the whole string:

str = str.toUpperCase(); 

And the sentence becomes:

THIS IS A SENTENCE WITH WHITE SPACE

Finally, we can make use of the replace method of a string to replace instances of particular letters with instances of other letters:

String str = "Fuzzy Wuzzy Bear"; 
str = str.replace ('z', 'm');

This will replace every instance of the letter z with an instance of the letter m, leaving us with the string "Fummy Wummy Bear" stored in the variable str.

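There is a similar version of replace in newer versions of Java that takes whole strings rather than single characters. (A related method called replaceAll does the same job, but treats its first parameter as a regular expression rather than as plain text.)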

String str = "Bing Bang Blue"; 
str = str.replace ("Bang", "Bing");

This will then replace every instance of the string Bang with an instance of the string Bing. This is very useful for large scale replacements.
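
The sketch below lines up the different kinds of replacement. The wrapper methods are only there to give each kind a name - they are not part of the String class. The last one uses replaceAll, which reads its first parameter as a regular expression rather than as plain text.

```java
public class ReplaceDemo {

    static String replaceLetter (String text, char from, char to) {
        return text.replace (from, to);
    }

    static String replaceText (String text, String from, String to) {
        return text.replace (from, to);      // 'from' is taken literally
    }

    static String replacePattern (String text, String regEx, String to) {
        return text.replaceAll (regEx, to);  // 'regEx' is a regular expression
    }

    public static void main (String args[]) {
        System.out.println (replaceLetter ("Fuzzy Wuzzy Bear", 'z', 'm'));        // Fummy Wummy Bear
        System.out.println (replaceText ("Bing Bang Blue", "Bang", "Bing"));      // Bing Bing Blue
        System.out.println (replacePattern ("Bing Bang Blue", "B.ng", "Bong"));   // Bong Bong Blue

        // The same first parameter means different things to the two methods.
        System.out.println (replaceText ("a.b.c", ".", "-"));       // a-b-c
        System.out.println (replacePattern ("a.b.c", ".", "-"));    // -----
    }
}
```

The last two lines are the ones to watch out for: if we mean a literal full stop, replace is the safer tool.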

Java tip

Don't spend time developing solutions that involve case sensitivity - wherever you can, find ways to resolve any strings you have into a general base case. Even if you can't change the case of the original string, you can easily make a copy of it and manipulate that to suit your purposes.
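
A minimal sketch of that advice is below. The normalise and sameCommand methods are hypothetical helpers of our own - they reduce whatever the user typed to a trimmed, lower case copy and compare the copies, leaving the original strings untouched.

```java
public class BaseCase {

    // Returns a trimmed, lower case copy. The original string is not changed.
    static String normalise (String raw) {
        return raw.trim().toLowerCase();
    }

    // Compares two strings after resolving both to the same base case.
    static boolean sameCommand (String typed, String expected) {
        return normalise (typed).equals (normalise (expected));
    }

    public static void main (String args[]) {
        String typed = "  QUIT ";
        System.out.println (sameCommand (typed, "quit"));   // prints true
        System.out.println ("[" + typed + "]");             // original still has its spaces
    }
}
```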


8.8

String Tokenization

We've now looked at the basics of strings and string manipulation - now it's time to look at how we can start doing something useful with them.

It is a common task in computing to parse regular substrings out of a string - so much so that Java provides us with a mechanism for doing this. Whether dealing with strings of numbers, directories in a file name or even spaces between words - it is often required that we break a string up into smaller strings based on some constant symbol.

In certain kinds of strings, we are only interested in certain parts of the information - the rest of the string is there purely to give context to the things we are interested in. This is apparent even in this document - we are not interested in the spaces between words, we're only interested in the words themselves. However, without the spaces, it makes the task of parsing out the real information much more difficult - sodifficultthatitisfarmoretimeconsumingthanitisworth.

In technical jargon, the spaces are known as delimiters. They exist to break up the words. The words themselves are tokens. All we are interested in are the tokens - the delimiters are useful for providing context and a sense of separation, but they are not interesting in themselves.

Java provides a class that can be used to break a string into tokens based on some delimiter. What this delimiter is doesn't matter - it can be a space, a comma, or any other character we choose.

The class is called StringTokenizer and is found in java.util. To create a string tokenizer, we need two parameters. One is the string to tokenize, the other is the delimiter we are going to use to determine where tokens begin and end:

String theString = "10, 20, 30, 40, 50"; 
String theDelimiter = ", ";
StringTokenizer myTokenizer = new StringTokenizer (theString, theDelimiter);

The delimiter can be a single letter or symbol, or it can be set as multiple symbols:

String theDelimiter = ","; 
theDelimiter = ", ";

In the first instance, only a comma will be used as a delimiter. In the second, either a comma or a space will be a delimiter - each character in the delimiter string counts as a delimiter in its own right.

The tokenizer has a method called nextToken, and this returns the first token left in the string that we haven't previously parsed out. Tokenization is a one way process - once it's done, it's done. It works in a very similar way to the iterator we discussed in the last chapter.

Consider a string comprised of numbers separated by commas - as in the example above:

Fig 8.10: A delimited string

When we call nextToken for the first time, Java starts from the beginning of the string and searches through until it finds a valid delimiter:

Fig 8.11: First identified token

Once it has found this delimiter, it returns a string containing all of the non-delimiter characters it has found since its last delimiter (or since the start of the string if it hasn't yet found one). Since we haven't had a last delimiter here, it simply returns the whole string up to that point, minus the delimiter itself... so it returns the substring represented by 0 and 2:

"10"

The next time nextToken is called, Java starts reading through the string again from the character after the last delimiter until it finds another:

Fig 8.12: Second identified token

The tokenizer started reading again from position 3 (the one after the last delimiter), and the second token is found at position 4. It returns the substring represented by 3 and 5:

"20"

And so on through the string, until there are no more tokens. If nextToken is called on a tokenizer that has returned all the valid tokens, it will throw a NoSuchElementException.

Since we don't know in advance how many tokens are going to be found in a string, we need to write some handling code to deal with this. There is a method called hasMoreTokens in the tokenizer object that returns true if there is information left to parse and false if there isn't. We can place our nextToken statement in a loop based on this condition:

while (myTokenizer.hasMoreTokens()) { 
System.out.println (myTokenizer.nextToken());
}

Tokenization is most powerful when used in conjunction with an ArrayList or some other suitable data structure:

ArrayList <String> myTokens = new ArrayList <String>(); 
String temp;
while (myTokenizer.hasMoreTokens()) {
temp = myTokenizer.nextToken();
myTokens.add (temp);
}

At the end of this process here, the myTokens ArrayList contains each of the tokens that existed in the string. This lends itself to further processing as discussed in the chapter on arrays and ArrayLists.
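
For reference, here is the whole process as one runnable sketch, including the import statements it needs. The tokenize and total methods are our own wrappers, not part of the StringTokenizer class, and total assumes every token is a valid whole number.

```java
import java.util.ArrayList;
import java.util.StringTokenizer;

public class TokenizerDemo {

    // Collects every token in the text into an ArrayList.
    static ArrayList<String> tokenize (String text, String delimiters) {
        ArrayList<String> myTokens = new ArrayList<String>();
        StringTokenizer myTokenizer = new StringTokenizer (text, delimiters);
        while (myTokenizer.hasMoreTokens()) {
            myTokens.add (myTokenizer.nextToken());
        }
        return myTokens;
    }

    // Adds up a comma and space separated list of whole numbers.
    static int total (String numbers) {
        int sum = 0;
        for (String token : tokenize (numbers, ", ")) {
            sum += Integer.parseInt (token);
        }
        return sum;
    }

    public static void main (String args[]) {
        String theString = "10, 20, 30, 40, 50";
        System.out.println (tokenize (theString, ", "));   // comma or space
        System.out.println (tokenize (theString, ","));    // comma only - spaces stay in the tokens
        System.out.println (total (theString));            // prints 150
    }
}
```

The second call shows why the choice of delimiter matters: with only a comma as the delimiter, every token after the first starts with a space.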

Tokenization is a much more effective process than manually parsing out substrings, although this too can be done:


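// Requires: import java.util.ArrayList; import java.util.Iterator;
public static void main (String args[]) {
    int start;
    ArrayList<String> myList = new ArrayList<String>();
    Iterator<String> it;
    String myString = "String To Tokenize Using Custom Handling Code";
    String delimiter = " ";
    String temp;
    while (myString.length() > 0) {
        start = myString.indexOf (delimiter);
        if (start > 0) {
            // Everything before the delimiter is the next token.
            temp = myString.substring (0, start);
            myList.add (temp);
            myString = myString.substring (start + 1, myString.length());
        }
        else if (start == 0) {
            // The string begins with a delimiter, so step over it.
            myString = myString.substring (1, myString.length());
        }
        else {
            // No delimiters left - what remains is the final token.
            myList.add (myString);
            myString = "";
        }
    }

    it = myList.iterator();
    while (it.hasNext()) {
        System.out.println (it.next());
    }
}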

Obviously it is much more effective to use the tokenizer, and requires much less complicated parsing of strings.

Newer versions of Java provide an automated system for this with the split method of a String. This returns a String array containing all of the tokens in a particular string:


public class SplitExample {

public static void main (String args[]) {
String stringToSplit = "10, 20, 30, 40, 50";
String[] arrayOfTokens;
arrayOfTokens = stringToSplit.split (", ");
for (int i = 0; i < arrayOfTokens.length; i++) {
System.out.println (arrayOfTokens[i]);
}

}

}

There are however some problems with the split method - for one thing, it's a little bit more fragile than a StringTokenizer. A StringTokenizer will parse out multiple versions of the same delimiter without any problems... however, the split method will interpret the empty space between a pair of delimiters as a valid token. Consider the following assignment:

String stringToSplit = "10, , 20, , 30, 40, 50"; 

This will result in the following array:

Fig 8.13: Array of split tokens

It's unlikely that this is what we actually want, so we must perform some extra computation to ensure the integrity of our array... we must either eliminate all instances of doubled delimiters, or step over every element of the split array and remove any invalid elements.
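
The second of those options - stepping over the array and discarding the blanks - might be sketched as follows. The splitWithoutBlanks method is invented for this example, and it treats any token made up only of white space as invalid, which may or may not suit a given application.

```java
import java.util.ArrayList;

public class SplitCleanup {

    // Splits the text, then keeps only the tokens that contain something.
    static ArrayList<String> splitWithoutBlanks (String text, String regEx) {
        ArrayList<String> tokens = new ArrayList<String>();
        String[] arrayOfTokens = text.split (regEx);
        for (int i = 0; i < arrayOfTokens.length; i++) {
            if (arrayOfTokens[i].trim().length() > 0) {
                tokens.add (arrayOfTokens[i]);
            }
        }
        return tokens;
    }

    public static void main (String args[]) {
        String stringToSplit = "10, , 20, , 30, 40, 50";
        System.out.println (stringToSplit.split (", ").length);         // prints 7
        System.out.println (splitWithoutBlanks (stringToSplit, ", "));  // five real tokens
    }
}
```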

However, the split method also offers some new and powerful functionality for defining what actually constitutes a delimiter... this involves delving into a Dark Area of the Soul known as regular expressions.

Java tip

The StringTokenizer class is considered a legacy class by the Java developers - it is kept around for compatibility, but its use is discouraged in new code. It is covered in this book because of the excellent lessons it teaches regarding the manipulation of strings and ArrayLists. When writing code 'in anger', you should make use of the split method.


8.9

Regular Expressions

Regular expressions are a syntactical convention for matching strings based on the properties and the structure of the string as opposed to its exact contents. A regular expression can be as simple as a normal string of text (called a string literal), or a complex compound of strange and mystical characters.

Regular expressions are constructed from a range of special control codes - these are called metacharacters. One of these is the period character... this represents 'any character'.

One of the methods that makes use of regular expressions is the split method mentioned in the previous section... however, there is also a simpler method called matches. This is a comparison method that tells whether or not a string matches a given regular expression... you can think of it as a more powerful version of equals.

Consider a simple applet designed to let us search any string by any regular expression:

import java.awt.*;
import java.applet.*;
import java.awt.event.*;
import javax.swing.*;


public class RegularExpressions extends JApplet implements ActionListener {
    JLabel search, expression;
    JTextField stringToSearch, regExpression;
    JButton go;

    // Build the interface: two labelled text fields and a button.
    public void init() {
        setLayout (null);
        search = new JLabel ("String to search:");
        search.setBounds (30, 30, 130, 30);
        expression = new JLabel ("Regular expression:");
        expression.setBounds (30, 80, 130, 30);
        stringToSearch = new JTextField();
        stringToSearch.setBounds (150, 30, 250, 30);
        regExpression = new JTextField();
        regExpression.setBounds (150, 80, 250, 30);
        go = new JButton ("Go!");
        go.setBounds (200, 200, 80, 40);
        go.addActionListener (this);
        add (search);
        add (expression);
        add (stringToSearch);
        add (regExpression);
        add (go);
    }


    // Called when the button is pressed: test the string against the expression.
    public void actionPerformed (ActionEvent e) {
        boolean bing;
        String toSearch;
        String regEx;
        toSearch = stringToSearch.getText();
        regEx = regExpression.getText();
        bing = toSearch.matches (regEx);
        if (bing) {
            JOptionPane.showMessageDialog (null, "There is a match! We win!");
        }
        else {
            JOptionPane.showMessageDialog (null, "No match! Poor us!");
        }
    }

}

This will give us the following interface:

Fig 8.14: Applet interface

Now we can start experimenting, like scientists!

First, we try a simple match - we test the string "bing" against the regular expression "bing". Our applet finds a match. A mighty victory for us!

Then we try "Bing" against the regular expression "bing". No match is found. Alas - we are thwarted.

Here we're using string literals in the expression... however, we can be a bit more inventive. Let's try matching the string "bing" against the regular expression "bin.".

Even though there is no full-stop in the string to search, the period character is used as a 'stand-in' for any single character, and so there is a match. We also match with the strings "bine" and "bina", but not "binary", since the matches method needs the expression to account for the whole string.

We can use as many period characters in the regular expression as we like... we can match 'bing' against the expression 'b.n.' and it will still return true for the match.
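If you'd rather not fire up the applet for every experiment, the same tests can be run from a small console program. This is only a sketch - the class name MatchesDemo and the helper method check are made up for illustration, and the only library call involved is the matches method of String:

```java
public class MatchesDemo {

    // Hypothetical helper: true only if the WHOLE string fits the expression.
    static boolean check(String toSearch, String regEx) {
        return toSearch.matches(regEx);
    }

    public static void main(String[] args) {
        System.out.println(check("bing", "bing"));   // true  - identical literals
        System.out.println(check("Bing", "bing"));   // false - matching is case sensitive
        System.out.println(check("bing", "bin."));   // true  - the period stands in for 'g'
        System.out.println(check("bine", "bin."));   // true  - ...or for 'e'
        System.out.println(check("binary", "bin.")); // false - too many characters
        System.out.println(check("bing", "b.n."));   // true  - two stand-ins
    }
}
```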

There are many more metacharacters. Some of the more useful ones:

Fig 8.15: Regular Expression Symbols

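There is a small problem with the * and + metacharacters - they are greedy, and will match as many characters as they can while still allowing the rest of the expression to match, so they run on to the last possible matching character rather than stopping at the first. We can place the special ? character directly after them (as in *? or +?) to make them satisfied with the shortest match they can find.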
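The difference is easiest to see when there is more than one place where a match could end. The sketch below uses the replaceFirst method of String (which also takes a regular expression) to swap whatever was matched for a dash. The class name GreedyDemo and the sample text are invented for the purposes of the example:

```java
public class GreedyDemo {

    // Greedy: .* grabs as much as it can, so the match runs to the LAST X.
    static String greedy(String text) {
        return text.replaceFirst("a.*X", "-");
    }

    // Reluctant: .*? grabs as little as it can, so the match stops at the FIRST X.
    static String reluctant(String text) {
        return text.replaceFirst("a.*?X", "-");
    }

    public static void main(String[] args) {
        String text = "aXbXc";
        System.out.println(greedy(text));    // -c    ("aXbX" was matched)
        System.out.println(reluctant(text)); // -bXc  (only "aX" was matched)
    }
}
```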

Learning how to use regular expressions properly is not a task for the faint-hearted - some useful resources are included at the end of this chapter. However, let's look at an example where they could be useful.

Let's say we want to put some kind of validation check on a JTextField that prompts the user for their postcode. For the purposes of this example, we'll assume a simplified form of United Kingdom postcode: two letters followed by a number, an optional space, and then a number followed by two letters (real postcodes come in several other shapes too). Parsing that out manually is a bit of a chore... luckily a regular expression will simplify this for us.

First, we need to match any letter... we use the [a-z] notation for this (note that this covers lower-case letters only - [a-zA-Z] would accept either case). We need to do this exactly twice, so we expand this to use the repetition notation {2}:

String regEx = "[a-z]{2}";

We then need to match the number:

String regEx = "[a-z]{2}[0-9]";

This will then match any string that consists of two letters followed by a single digit. The second half of the code is a variation of this.

To deal with the optional space we need to indicate which characters we are willing to match (just a space), and how many times... we need to be able to match on no occurrences as well as any number of spaces (just to be nice)... this is handled by the string: [ ]*.

With these, we build up our full regular expression for validating a postcode:

String regEx = "[a-z]{2}[0-9][ ]*[0-9][a-z]{2}";
if (!postcode.matches (regEx)) {
JOptionPane.showMessageDialog (null, "Your postcode is invalid. Please re-enter.");
}
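To convince ourselves that the expression behaves as intended, we can wrap it in a small method and throw a few test strings at it. This is a minimal sketch for the simplified postcode shape described above - the class name PostcodeCheck and the method isValidPostcode are hypothetical, and the sample postcodes are invented:

```java
public class PostcodeCheck {

    // Two letters, a digit, any number of spaces, a digit, two letters.
    static final String POSTCODE_REGEX = "[a-z]{2}[0-9][ ]*[0-9][a-z]{2}";

    // Hypothetical helper: true if the whole string has the expected shape.
    static boolean isValidPostcode(String postcode) {
        return postcode.matches(POSTCODE_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isValidPostcode("ab1 2cd"));   // true
        System.out.println(isValidPostcode("ab12cd"));    // true  - the space is optional
        System.out.println(isValidPostcode("ab1   2cd")); // true  - extra spaces are tolerated
        System.out.println(isValidPostcode("AB1 2CD"));   // false - [a-z] is lower case only
        System.out.println(isValidPostcode("ab1 2c"));    // false - one letter short
    }
}
```

If upper-case input should be accepted too, one option is to call toLowerCase on the string before testing it, and another is to widen each [a-z] to [a-zA-Z].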

Regular expressions can be used in a number of String methods... most usefully, replaceAll, matches, and split.

We need to use regular expressions to solve the problem of inflexible delimiter parsing in the split method... we could solve our comma problem from earlier by using the following split method call:

arrayOfTokens = stringToSplit.split (",+");

This would then correctly treat any run of one or more commas as a single delimiter. Note that it has a different effect to passing the same delimiter string to a StringTokenizer, which would then tokenize on either a comma or a plus sign.
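The contrast is worth seeing side by side. In the sketch below the same two-character string, a comma followed by a plus sign, is handed first to split (where it is read as a regular expression meaning 'one or more commas') and then to a StringTokenizer (where it is read as a list of two separate delimiter characters). The class name, the method names and the sample data are all invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class SplitVersusTokenizer {

    // split treats ",+" as a regular expression: one or more commas.
    static String[] splitOnCommas(String text) {
        return text.split(",+");
    }

    // StringTokenizer treats each character of the delimiter string
    // as a delimiter in its own right.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String data = "1+2,,3,4";
        System.out.println(Arrays.toString(splitOnCommas(data))); // [1+2, 3, 4]
        System.out.println(tokenize(data, ",+"));                 // [1, 2, 3, 4]
    }
}
```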

Java tip

You must make use of regular expressions before the split method will precisely model the behaviour of the StringTokenizer. They are not directly compatible.


8.10

String Buffers

Appending to a String is a very costly procedure in Java. Because String objects are immutable, in order to add one string onto another, Java needs to create a separate String object that holds the appended contents of both.

There exists in the Java libraries another class called StringBuffer that is mutable, and can be used to perform many append operations without creating a new object for each one. It is created like any other object:

StringBuffer myBuffer = new StringBuffer ("Where are my pants?"); 

The append method of StringBuffer allows for new strings to be added to the end of what is already present in the object. The append method accepts parameters of any type, since it is overloaded.

myBuffer.append (" Oh yes, here they are!"); 

The StringBuffer class also supports the length method, the indexOf method, the replace method, and the substring method - this makes it a versatile replacement for the standard String object.

When it becomes necessary to return the contents of the StringBuffer as a simple String, the toString method can be used:

String finishedString = myBuffer.toString(); 
System.out.println (finishedString);

This outputs the following line:

Where are my pants? Oh yes, here they are!
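The other methods mentioned above can be tried out in the same way. The following is a rough sketch rather than a definitive tour - the class name BufferDemo and its two helper methods are invented for illustration, but every StringBuffer call used (append, indexOf, substring, replace and toString) is part of the standard class:

```java
public class BufferDemo {

    // Build the sentence from the text with a single mutable buffer.
    static String build() {
        StringBuffer myBuffer = new StringBuffer("Where are my pants?");
        myBuffer.append(" Oh yes, here they are!");
        return myBuffer.toString();
    }

    // append is overloaded, so ints, chars, booleans and doubles all work.
    static String overloaded() {
        StringBuffer buffer = new StringBuffer("Count: ");
        buffer.append(3).append(' ').append(true).append(' ').append(2.5);
        return buffer.toString();
    }

    public static void main(String[] args) {
        StringBuffer myBuffer = new StringBuffer(build());

        int start = myBuffer.indexOf("pants");                  // position of the word
        System.out.println(start);                              // 13
        System.out.println(myBuffer.substring(start, start + 5)); // pants

        // replace changes the buffer itself - no new object is created.
        myBuffer.replace(start, start + 5, "socks");
        System.out.println(myBuffer.toString()); // Where are my socks? Oh yes, here they are!

        System.out.println(overloaded());        // Count: 3 true 2.5
    }
}
```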

8.11

Conclusion

String parsing is an important part of developing modern programs. Java provides an extensive set of tools for making this task as painless as possible. Methods and classes are provided for dealing with comparing strings and manipulating strings, as well as parsing strings into regular tokens.

However, the language support for simplification of syntax when dealing with strings can easily lead to confusion, particularly as far as the implications of creating new strings are concerned. Students are advised to never rely on this particular facet of Java and instead concentrate on learning to use the predefined equivalence methods built into the class itself.

8.12

Reader Projects

Weather Data Analysis

A local scientific outfit has found a need for a piece of software to help interpret scientific data that is being received from a weather satellite orbiting over the earth. The interface to the satellite has already been dealt with, but there is a need for a piece of software to store the daily figures and calculate the averages.

Each day's data is stored as a string containing a series of comma delimited numbers. For example:

10, 15, 11, 9, 8, 11

The application should allow the user to enter a series of these strings. When the user has entered as many of these strings as required, they should be able to press a button that calculates the following information:

  • The smallest number in each string
  • The largest number in each string.
  • The average of all numbers stored.
  • Which day has the largest sum of numbers.

For example, if the user enters three strings of data:

5, 10, 10, 3, 20

10, 12, 30, 10

5, 11

The application should provide the following information:

  • The smallest number is 3 in string 1, 10 in string 2, and 5 in string 3.
  • The largest number is 20 in string 1, 30 in string 2, and 11 in string 3.
  • The average of the numbers is 11.45 (a total of 126 spread over 11 numbers).
  • Day 2 has the largest sum of numbers, with a total of 62.

Further Reading

The following table details further reading on the topic in this chapter, and also any external resources that you may find useful.

Resource - Description
Example Programs from this chapter - This is a zip file of all the programs shown in this chapter.
StringBuffer class - Java API documentation on the StringBuffer class.
String Class - Java API documentation on the String class.
Regular Expressions - Sun tutorial on regular expressions.


© 2004-2006 Michael James Heron