Home > Java, scala, Uncategorized > Java gotcha: Splitting a string

Java gotcha: Splitting a string

CSV, or comma separated value, files are the workhorse of flat file processing. They’re human readable, you can edit them in any spreadsheet editor worth its salt, they’re portable, and they’re easy to parse and read into your programs.  Unfortunately, text you might want to store in the columns might also have commas in it, which complicates the parsing significantly – instead of splitting on all of the commas, you have to know whether the comma is in a quoted string, in which case it doesn’t signify the end of a field … it gets messy, and you’re probably going to get it wrong if you roll your own.  For a quick and dirty script, sometimes it’s better to delimit the columns with a different character, one that comes up much less often in text.  One good choice is the pipe character, |.

Foolishly I processed lines of text like the following in Scala:


def parseLine(indexRow:String) = {
 Console.println("Row: " + indexRow)

 val entries = indexRow.split(FIELD_DELIMITER)
 // extract the file name, average color, thumbnail


Unfortunately, this code is subtly broken.  The problem is that the String split method does not accept a plain string on which to split – rather, it treats the string as a regular expression.  Since the pipe character has special meaning in regular expressions (namely “OR”), this has the effect of splitting the string on every character, rather than just at the pipe delimiters.

To get around this, you need to tell the regular expression engine that you want to treat the pipe as a literal, which you do by backslash escaping the pipe.  Unfortunately “\|” is not a valid string in Java, because it will attempt to interpret it as a control character (like the newline \n).  Instead, you need to backslash escape the backslash, leaving “\\|”

Be careful when you’re using String.split in Java/Scala.  You have to be aware that it’s treating your string as a regular expression.  Read this StackOverflow discussion for a better understanding of the pros and cons of Java using regular expressions in many of its core String libraries.

  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: