Home > Java, scala, Uncategorized > Java gotcha: Splitting a string

Java gotcha: Splitting a string


CSV, or comma separated value, files are the workhorse of flat file processing. They’re human readable, you can edit them in any spreadsheet editor worth its salt, they’re portable, and they’re easy to parse and read into your programs.  Unfortunately, text you might want to store in the columns might also have commas in it, which complicates the parsing significantly – instead of splitting on all of the commas, you have to know whether the comma is in a quoted string, in which case it doesn’t signify the end of a field … it gets messy, and you’re probably going to get it wrong if you roll your own.  For a quick and dirty script, sometimes it’s better to delimit the columns with a different character, one that comes up much less often in text.  One good choice is the pipe character, |.

Foolishly I processed lines of text like the following in Scala:


val FIELD_DELIMITER = "|";
val FILE_NAME_INDEX = 0
val AVG_COLOR_INDEX = 1
val THUMBNAIL_IMAGE_INDEX = 2


def parseLine(indexRow:String) = {
 Console.println("Row: " + indexRow)

 val entries = indexRow.split(FIELD_DELIMITER)
 // extract the file name, average color, thumbnail

}
 

Unfortunately, this code is subtly broken.  The problem is that the String split method does not accept a plain string on which to split – rather, it treats the string as a regular expression.  Since the pipe character has special meaning in regular expressions (namely “OR”), this has the effect of splitting the string on every character, rather than just at the pipe delimiters.

To get around this, you need to tell the regular expression engine that you want to treat the pipe as a literal, which you do by backslash escaping the pipe.  Unfortunately “\|” is not a valid string in Java, because it will attempt to interpret it as a control character (like the newline \n).  Instead, you need to backslash escape the backslash, leaving “\\|”

Conclusion:
Be careful when you’re using String.split in Java/Scala.  You have to be aware that it’s treating your string as a regular expression.  Read this StackOverflow discussion for a better understanding of the pros and cons of Java using regular expressions in many of its core String libraries.

  1. No comments yet.
  1. No trackbacks yet.

Leave a comment