Archive
Java gotcha: Splitting a string
CSV, or comma separated value, files are the workhorse of flat file processing. They’re human readable, you can edit them in any spreadsheet editor worth its salt, they’re portable, and they’re easy to parse and read into your programs. Unfortunately, text you might want to store in the columns might also have commas in it, which complicates the parsing significantly – instead of splitting on all of the commas, you have to know whether the comma is in a quoted string, in which case it doesn’t signify the end of a field … it gets messy, and you’re probably going to get it wrong if you roll your own. For a quick and dirty script, sometimes it’s better to delimit the columns with a different character, one that comes up much less often in text. One good choice is the pipe character, |.
Foolishly I processed lines of text like the following in Scala:
val FIELD_DELIMITER = "|"
val FILE_NAME_INDEX = 0
val AVG_COLOR_INDEX = 1
val THUMBNAIL_IMAGE_INDEX = 2

def parseLine(indexRow: String) = {
  Console.println("Row: " + indexRow)
  val entries = indexRow.split(FIELD_DELIMITER)
  // extract the file name, average color, thumbnail
}
Unfortunately, this code is subtly broken. The problem is that the String split method does not accept a plain string on which to split – rather, it treats the string as a regular expression. Since the pipe character has special meaning in regular expressions (namely “OR”), this has the effect of splitting the string on every character, rather than just at the pipe delimiters.
To get around this, you need to tell the regular expression engine that you want to treat the pipe as a literal, which you do by backslash escaping the pipe. Unfortunately “\|” is not a valid string literal in Java, because the compiler tries to interpret \| as an escape sequence (like the newline \n) and no such sequence exists. Instead, you need to backslash escape the backslash, leaving “\\|”. Alternatively, java.util.regex.Pattern.quote("|") builds the escaped pattern for you.
Conclusion:
Be careful when you’re using String.split in Java/Scala. You have to be aware that it’s treating your string as a regular expression. Read this StackOverflow discussion for a better understanding of the pros and cons of Java using regular expressions in many of its core String libraries.
How to use Java .properties files in Mule
Externalizing ports and IP addresses (or anything else for that matter) in Mule
Mule is a great piece of open source software known as an Enterprise Service Bus. It is designed to make it easy to integrate various systems which were not explicitly built to work with each other. For instance, it handles all the details of various transport mechanisms (e-mail, HTTP, TCP, UDP, files) that your data might be shuttled around in, as well as the transformers that convert data from one format to another (e.g., the bytes of a TCP packet, into a String, into an XML document, into a Java object representing that XML).
Mule services are configured via XML, specifically Spring framework configuration files. This post explains how to incorporate Java .properties files into that XML.
.properties files, for those who are unfamiliar, are a simple Key=Value storage mechanism in widespread use in Java development. From the Wikipedia explanation, here are a few lines of a .properties file:
website = http://en.wikipedia.org/
language = English
# The backslash below tells the application to continue reading
# the value onto the next line.
message = Welcome to \
Wikipedia!
The documentation for Mule / Spring advises that you break up configuration files into multiple files and then reassemble them as needed. In Mule’s case, this allows you to run one instance of Mule per small function, so that when a change is made you can restart just the piece that needs it rather than bringing down all the pieces. Furthermore, it is easier to reason about each modular piece when the configuration is broken down in this way.
Unfortunately, splitting the files up like this can easily cause a lot of duplication, especially of IP addresses and ports. If you are sending objects to the same IP address and port from multiple configuration files, you might end up with multiple instances of lines of configuration like
tcp:inbound-endpoint address="tcp://192.56.33.21:235"
Fortunately, by storing the IP addresses and ports in a .properties file, you can eliminate the duplication: the variables are changed in a single place and the change is reflected in all files referencing them. Additionally, if you name the variables properly, the impenetrable IP addresses instead become self-documenting strings:
tcp:inbound-endpoint address="${email.server.address}"
This ${x} syntax should be familiar to anyone who has used Ant in the past. This basically says, find the property with the key email.server.address and textually substitute its value here. This assumes that you have a .properties file with the line
email.server.address=tcp://192.56.33.21:235
For the purposes of this post, assume that the line is defined in a file called test.properties.
Unfortunately, the way to do this is not clearly defined in any document I’ve seen, which is why I want to explain how to do it here.
Existing information
The first result in Google for “mule java properties” is Mule: Configuring Properties, which is 5 years old and refers to Mule 1.5 (I figured this out the hard way). More up-to-date information turns up when searching for “configuring properties”, particularly Configuring Properties – Mule 2.x User Guide.
Here is the relevant information:
To load properties from a file, you can use the standard Spring element context:property-placeholder:
xmlns:context="http://www.springframework.org/schema/context"
xsi:schemaLocation="http://www.springframework.org/schema/context
http://www.springframework.org/schema/context/spring-context-2.5.xsd"
<context:property-placeholder location="smtp.properties"/>
(In other words you need to add those schema declarations at the top of your file, after the description field, before creating a context:property-placeholder element to tell Mule all the .properties files you want pulled in to your file).
Would that it were so easy.
If you change this example to the name of your .properties file, you will probably get a FileNotFoundException, even if you give the complete path to the .properties file.
A Fatal error: class path resource [C:/Mule/mule-standalone-2.2.1/conf/test.properties] cannot be opened because it does not exist
This message is extremely unhelpful because the file does exist at that exact location. After some digging, I found two workarounds: one is hinted at by the error message, the other is not.
Classpath
By default, the .properties file is searched for in the classpath that the Mule environment runs in. Thus you need to ensure that the folder in which your test.properties file is located is also on that classpath.
The wrapper.conf file in $MULE_HOME/conf is where the classpath is defined:
wrapper.java.classpath.1=%MULE_LIB%
wrapper.java.classpath.2=%MULE_EXE%/../conf
wrapper.java.classpath.3=%MULE_HOME%/lib/boot/*.jar
If you place your .properties files in any of those folders, they will be picked up. That’s probably not the ideal place for your files, however. If you wish to add an additional folder, you merely add another entry. For instance, if I store the .properties files in /Dev/Mule/Configs/Properties, I would add the line
wrapper.java.classpath.4=/Dev/Mule/Configs/Properties
Note that the classpath entries must be numbered consecutively; if you skip a number, Mule will not pick up on the change correctly.
You can make it explicit to the readers of your configuration file that you are including a .properties file that’s located on the classpath with the classpath prefix:
<context:property-placeholder location="classpath:test.properties"/>
Absolute locations
If you wish to specify the absolute path to a resource rather than relying on classpath resolution, you must prefix the path with file:///. So our previous example becomes
<context:property-placeholder location="file:///Dev/Mule/Configs/Properties/test.properties"/>
(You also need 3 slashes even if you’re on Windows)
Conclusion
It’s a very good idea to externalize ports and IP addresses from the XML configuration files that Mule needs to run. This allows you to make changes to the ports and IP addresses in one place rather than in all the files that reference them. It also allows you to associate a more meaningful name to the addresses than an IP address; it is self-documenting in that regard. Unfortunately the process for importing .properties files into your Mule configuration files is not well documented, which is what I am attempting to remedy here.
Android – disappearing emulator ? Restart adb server
A screen shot illustrating a running emulator that does not appear in the list of attached devices
An emulator within the devices window
To solve this, you should take the following steps:
# Device is running but not showing up
[497][nicholasdunn: /Users/nicholasdunn]$ adb devices
List of devices attached

# Kill and restart
[498][nicholasdunn: /Users/nicholasdunn]$ adb kill-server
[499][nicholasdunn: /Users/nicholasdunn]$ adb start-server
* daemon not running. starting it now *
* daemon started successfully *

# Device appears, but is listed as offline
[500][nicholasdunn: /Users/nicholasdunn]$ adb devices
List of devices attached
emulator-5554 offline

# One more invocation of adb devices should get it recognized
[501][nicholasdunn: /Users/nicholasdunn]$ adb devices
List of devices attached
emulator-5554 device
If this happens to you frequently (it does to me), you can create an alias within your .bash_profile file (~/.bash_profile):
alias adb-restart='adb kill-server; adb start-server; adb devices; adb devices'
Reload your .bash_profile file:
source ~/.bash_profile
You can then invoke it from the terminal by typing adb-restart. Sometimes one invocation of adb devices is enough to have the emulator show up as a device; other times it takes two. I’m not sure why that is. To be safe I’m including two in the alias.
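Rather than guessing how many adb devices calls are needed, the alias could be replaced by a small function that polls until a device is actually online. The following is only a sketch under a few assumptions: the names count_online_devices and adb_restart_and_wait are made up for illustration, and it assumes that adb devices prints one "serial state" pair per line, as in the session above.

```shell
#!/bin/sh

# Count the devices that are in the "device" (online) state.
# Reads `adb devices` output on stdin, so it can be tried without adb.
count_online_devices() {
    awk '$2 == "device" { n++ } END { print n + 0 }'
}

# Hypothetical helper: restart the adb server, then poll until at least
# one device is online or the attempts (default 5) run out.
adb_restart_and_wait() {
    attempts=${1:-5}
    adb kill-server
    adb start-server
    while [ "$attempts" -gt 0 ]; do
        if [ "$(adb devices | count_online_devices)" -gt 0 ]; then
            adb devices
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 1
    done
    echo "no online device found" >&2
    return 1
}
```

Put the functions in ~/.bash_profile in place of the alias and call adb_restart_and_wait from the terminal.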
git – how to easily remove ‘deleted’ files
git is a distributed version control system. There are lots of tutorials online to teach you the basics of this great system; that’s not the intent of this post. Rather, I want to share a neat trick I found on another site.
When you move files that have been checked into git without using the git mv command, as often happens when using an IDE and renaming a file, you are left with untracked files, and deleted files.
Nick@Macintosh-3 ~/Desktop/git_example$ mkdir src
Nick@Macintosh-3 ~/Desktop/git_example$ touch src/Hello.java
Nick@Macintosh-3 ~/Desktop/git_example$ git add src/
Nick@Macintosh-3 ~/Desktop/git_example$ git ci -m "Initial commit"
[master (root-commit) 97eb204] Initial commit
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 src/Hello.java
Nick@Macintosh-3 ~/Desktop/git_example master$ git status
# On branch master
nothing to commit (working directory clean)
Nick@Macintosh-3 ~/Desktop/git_example master$ ls src
Nick@Macintosh-3 ~/Desktop/git_example master$ mv src/Hello.java src/Goodbye.java
Nick@Macintosh-3 ~/Desktop/git_example master$ git status
# On branch master
# Changed but not updated:
#   (use "git add/rm <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   deleted:    src/Hello.java
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#   src/Goodbye.java
no changes added to commit (use "git add" and/or "git commit -a")
As soon as you delete the src/Hello.java and add the src/Goodbye.java file, git is smart enough to realize that you really have just renamed or moved the file:
Nick@Macintosh-3 ~/Desktop/git_example master$ git rm src/Hello.java
rm 'src/Hello.java'
Nick@Macintosh-3 ~/Desktop/git_example master$ git add src/Goodbye.java
Nick@Macintosh-3 ~/Desktop/git_example master$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   renamed:    src/Hello.java -> src/Goodbye.java
#
While this pattern is not too onerous when you move just a few files around, when you are in the midst of refactoring you could have multiple files moved into different directories, leading to a raft of these deleted files that need to be manually removed from git. Because they are no longer at the old location in the file system, we cannot use tab completion to help remove the old files; you must type out the full path to the file to be deleted. This can be a big pain.
For instance, imagine I have the following tree structure:
Nick@Macintosh-3 ~/Desktop/git_example master$ tree src/
src/
`-- org
    `-- example
        `-- nick
            |-- 1.java
            |-- 10.java
            |-- 2.java
            |-- 3.java
            |-- 4.java
            |-- 5.java
            |-- 6.java
            |-- 7.java
            |-- 8.java
            `-- 9.java
And we change the package name from an external, non-git process/program:
Nick@Macintosh-3 ~/Desktop/git_example master$ mv src/org/example/nick src/org/example/blog
Nick@Macintosh-3 ~/Desktop/git_example master$ git add src/org/example/blog/
Nick@Macintosh-3 ~/Desktop/git_example master$ git st
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   new file:   src/org/example/blog/1.java
#   new file:   src/org/example/blog/10.java
#   new file:   src/org/example/blog/2.java
#   new file:   src/org/example/blog/3.java
#   new file:   src/org/example/blog/4.java
#   new file:   src/org/example/blog/5.java
#   new file:   src/org/example/blog/6.java
#   new file:   src/org/example/blog/7.java
#   new file:   src/org/example/blog/8.java
#   new file:   src/org/example/blog/9.java
#
# Changed but not updated:
#   (use "git add/rm <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   deleted:    src/org/example/nick/1.java
#   deleted:    src/org/example/nick/10.java
#   deleted:    src/org/example/nick/2.java
#   deleted:    src/org/example/nick/3.java
#   deleted:    src/org/example/nick/4.java
#   deleted:    src/org/example/nick/5.java
#   deleted:    src/org/example/nick/6.java
#   deleted:    src/org/example/nick/7.java
#   deleted:    src/org/example/nick/8.java
#   deleted:    src/org/example/nick/9.java
#
It is a bit of a pain to remove all of the deleted files individually.
Fortunately, there is a shortcut.
Nick@Macintosh-3 ~/Desktop/git_example master$ git ls-files --deleted
src/org/example/nick/1.java
src/org/example/nick/10.java
src/org/example/nick/2.java
src/org/example/nick/3.java
src/org/example/nick/4.java
src/org/example/nick/5.java
src/org/example/nick/6.java
src/org/example/nick/7.java
src/org/example/nick/8.java
src/org/example/nick/9.java
Nick@Macintosh-3 ~/Desktop/git_example master$ git rm `git ls-files --deleted`
rm 'src/org/example/nick/1.java'
rm 'src/org/example/nick/10.java'
rm 'src/org/example/nick/2.java'
rm 'src/org/example/nick/3.java'
rm 'src/org/example/nick/4.java'
rm 'src/org/example/nick/5.java'
rm 'src/org/example/nick/6.java'
rm 'src/org/example/nick/7.java'
rm 'src/org/example/nick/8.java'
rm 'src/org/example/nick/9.java'
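One caveat with the backtick form is that the shell splits the list of paths on whitespace, so a deleted file with a space in its name would be passed to git rm as two separate arguments. A possible variant, sketched here with a made-up function name (remove_deleted_files) and assuming an xargs that supports the -0 option (both the GNU and BSD versions do), passes NUL-terminated paths instead:

```shell
#!/bin/sh

# Hypothetical helper: stage the removal of every tracked file that is
# missing from the working tree, even if its path contains spaces.
remove_deleted_files() {
    # Only call git rm when there is something to remove; some xargs
    # implementations run the command once even on empty input.
    if [ -n "$(git ls-files --deleted)" ]; then
        # -z / -0 delimit paths with NUL bytes instead of whitespace
        git ls-files --deleted -z | xargs -0 git rm --quiet --
    fi
}
```

Run it from inside the repository after moving files around; git status should then report the renames just as in the manual example.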
If this is a common enough use case, we can add aliases in the global ~/.gitconfig file to support this.
[alias]
    br = branch
    ci = commit
    co = checkout
    st = status
    # list files
    ls = ls-files
    # list deleted files
    lsd = ls-files --deleted
    # remove deleted files
    rmd = !git rm `git ls-files --deleted`
See the great wiki page on Aliases for information. The exclamation mark is used to indicate that the command invoked is a non-git one; you can execute arbitrary shell commands this way. For instance:
[alias]
    # A stupid thing to do, but illustrates that arbitrary unix commands can be executed in this manner
    cal = !cal

Nick@Macintosh-3 ~/Desktop/git_example master$ git cal
   September 2010
Su Mo Tu We Th Fr Sa
          1  2  3  4
 5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30
All credit to the git cheat sheet site that initially brought the ls-files --deleted trick to my attention. Hopefully the alias I have provided will be useful to some people.
Unix tip #3: Introduction to Find, Grep, Sed
public int randomValue() {
    // TODO: hook up the actual random number generator
    return 0;
}
The problem is that these TODOs more often than not get ignored, especially if you have to search through the code yourself to try to find all of the remaining tasks. Fortunately, certain programs (NetBeans and TextMate, to name two) can find instances of keywords indicating a task, extract the comments, and present them to you in a nice table view.
I’m going to step through the use of a few Unix tools that can be tied together to extract the data and create a similar view. In particular I will illustrate the use of find, grep, sed, and pipes.
The general steps I’ll be presenting are:
Step | Tools used |
1. Find all Java files | find |
2. Find each TODO item | grep |
3. Extract filename, line number, task | sed |
4. Format results of step 3 as an HTML table | find/grep/sed/shell script |
Finding instances of text with grep
In order to extract all of the TODO items from within our java files, we need a way of searching for matching text. grep is the tool to do that. Grep takes as input a list of files to search and a pattern to try to match against; it will then emit a set of lines matching the pattern.
For instance, to search for TODO or any version of that string (todo, ToDO), in all the .java files in the current directory, you would execute the following:
grep -i TODO *.java
Telephone.java: // TODO: Document
Telephone.java: // TODO: throw exception if precondition is violated
Note that the line numbers are omitted. If we want them, we use the -n flag:
grep -i -n TODO *.java
Telephone.java:20: // TODO: Document
Telephone.java:29: // TODO: throw exception if precondition is violated
If all we want to do is get a rough estimate as to how many documented TODOs we have, we can pipe the output of this command into the wc utility, which counts bytes, characters, or lines. We want the number of lines.
grep -i -n TODO *.java | wc -l
2
This works fine with a single directory of files, but it will not handle nested directories. For instance, if my directory structure looks like the following:
tree
.
|-- BalancedTernary.java
`-- Telephone.java

0 directories, 2 files
All of these files will be searched when grep is run. But if I introduce new files in subdirectories:
mkdir Subdir
echo "//TODO: Create this file" > Subdir/Test.java
tree
.
|-- BalancedTernary.java
|-- Subdir
|   `-- Test.java
`-- Telephone.java

1 directory, 3 files
The new Test.java will not be searched. In order to make grep search through all of the subdirectories (i.e., recursively), you can combine grep with another extremely useful Unix utility, find. Before moving on to find, I want to stress that grep is vital to anyone using a Unix based machine. See grep tutorials for many good examples of how to use grep.
Finding files with find
The find command is extremely useful. The man page describes find as
find – search for files in a directory hierarchy
There are a lot of arguments you can use, but to get started, the basic syntax is
find [<starting location>] -name <name pattern>
If the starting location is not provided, it is assumed to be the current directory (. in Unix terms). In all the examples that follow I will explicitly list the starting directory.
For instance, if we want to find all the files that end with the extension “.java” in the current working directory, we could run the following:
find . -name "*.java"
./BalancedTernary.java
./Subdir/Test.java
./Telephone.java
Note that we must enclose the pattern in quotes in this example in order to prevent the shell from trying to expand the * wildcard. If we don’t, the shell will convert the asterisk into a space delimited set of all the files/directories in the current folder, which will lead to an error
find . -name *.java
# expands to
find . -name BalancedTernary.java Telephone.java
find: Telephone.java: unknown option
Just as we can use the wc command to count the number of times a phrase appears in a file, we can use it to count the number of files matching a given pattern. That is because find outputs each matching file path to a separate line. Thus if we wanted to count the number of java files in all folders rooted in the current folder, we could do
find . -name "*.java" | wc -l
3
While I have only presented the -name flag, there are numerous other flags as well, such as whether the candidate is a file or directory (-type f or -type d respectively), whether the match is smaller, the same, or bigger than a given size (-size +100M == bigger than 100 megabytes), or when the file was last modified (find -newer ordinary_file would only accept files that have a modification time newer than that of ordinary_file). A great article for gaining more expertise is Mommy I found it! – 15 practical unix find commands.
Combining find with other commands
find becomes even more powerful when combined with the -exec option, which allows you to execute arbitrary commands for each file that matches the pattern. The syntax for doing that looks like
find [<starting location>] -name <name pattern> -exec <commands> {} \;
where the file path will be substituted for the {} characters. For instance, if we want to count the number of lines in each Java file, we could run
find . -name "*.java" -exec wc -l {} \;
23 ./BalancedTernary.java
1 ./Subdir/Test.java
88 ./Telephone.java
This has precisely the same effect as if we explicitly executed the wc -l command ourselves:
wc -l ./BalancedTernary.java
wc -l ./Subdir/Test.java
wc -l ./Telephone.java
As another example, we could back up all of the Java files in the directory by copying them and appending the suffix .bk to each:
find . -name "*.java" -exec cp {} {}.bk \;
Nick@Macintosh-3 ~/Desktop/Programming/Java/example$ ls
BalancedTernary.java     Subdir            Telephone.java.bk
BalancedTernary.java.bk  Telephone.java
To undo this, we could remove all of the files ending in .bk:
find . -name "*.bk" -exec rm {} \;
Combining find and grep
Since I started the article talking about grep, it’s only natural that you can combine grep with find, and it often pays to do so.
For instance, by combining the earlier grep command to find all TODO items with the find command to find all java files, we suddenly have a command which will traverse an arbitrarily nested directory structure and search all the files we are interested in.
find . -name "*.java" -exec grep -i -n TODO {} \;
1://todo: Create this file
20: // todo: Document
29: // todo: throw exception if precondition is violated
Note that we no longer have the filename prepended to the output; if we want it back we can add the -H flag.
find . -name "*.java" -exec grep -Hin TODO {} \;
./Subdir/Test.java:1://todo: Create this file
./Telephone.java:20: // todo: Document
./Telephone.java:29: // todo: throw exception if precondition is violated
In this last snippet I have combined the individual -H, -i and -n flags together into the shorter -Hin; this works identically to listing them separately. (Not all Unix commands work this way; check the man page if you’re unsure).
An alternate exec terminator: Performance considerations
I said earlier that the basic syntax for combining find with other commands is
find [<starting location>] -name <name pattern> -exec <commands> {} \;
The ; terminates the exec clause, but because the shell would otherwise interpret it as a command separator, it has to be backslash escaped. While researching this article I found a Unix/Linux “find” Command Tutorial that introduced me to an alternative syntax for terminating the -exec clause of the find command. By replacing the semicolon with a + sign, files are grouped together in batches and sent to the given command rather than executed one at a time. Let me illustrate:
# Executes the 'echo' command on each file individually
find . -exec echo {} \;
.
./BalancedTernary.java
./Subdir
./Subdir/Test.java
./table.html
./Telephone.java
./test.a

# Executes the 'echo' command on bundled groups of files
find . -exec echo {} +
. ./BalancedTernary.java ./Subdir ./Subdir/Test.java ./table.html ./Telephone.java ./test.a
This technique of grouping the files together can give a substantial performance boost when used with commands that accept multiple file arguments. For instance:
time find /Applications/ -name "*.java" -exec grep -i TODO {} \;
real 1m36.458s
user 0m3.912s
sys 0m10.933s

time find /Applications/ -name "*.java" -exec grep -i TODO {} +
real 0m39.060s
user 0m3.660s
sys 0m6.571s

# An alternate way of executing grep on batches of files at once
time find /Applications/ -name "*.java" -print0 | xargs -0 grep -i "TODO"
real 0m50.486s
user 0m4.230s
sys 0m7.924s
By replacing the semicolon with the plus sign, I gained almost a 2.5x speed increase. Again, this will only work with commands that accept a whole list of file arguments; the previous example with copy would fail miserably, because cp expects a single src/destination pair:
# Will not work!
find . -name "*.java" -exec cp {} {}.bk +
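If you still want batching for a command like cp, one common workaround is to hand each batch to a small inline shell that loops over the files itself. This is only a sketch; the function name backup_java_files is made up for illustration, and the starting directory defaults to the current one.

```shell
#!/bin/sh

# Hypothetical helper: copy every .java file under the given directory
# (default: current directory) to a sibling with a .bk suffix, letting
# find batch the files with the + terminator.
backup_java_files() {
    # The inline script receives each batch as its positional parameters.
    # The word "sh" after the script fills $0, so all paths land in "$@".
    find "${1:-.}" -name "*.java" -exec sh -c '
        for f in "$@"; do
            cp "$f" "$f.bk"
        done
    ' sh {} +
}
```

Calling backup_java_files with no argument backs up the current directory tree. cp still runs once per file, so the saving is only in how often find has to launch a new command.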
Converting results of find/grep into table form – Intro to sed, cut, and basename
In the last section, I showed how to combine find and grep. The output of the command will look something like this:
find . -name "*.java" -exec grep -Hin TODO {} +
./Subdir/Test.java:1://todo: Create this file
./Telephone.java:20: // todo: Document
./Telephone.java:29: // todo: throw exception if precondition is violated
The output has the path to the file, followed by a colon, the line number, another colon, and then the matching line in the input file that had the TODO in it. Let’s mimic the output of the TODO list in TextMate, which simply displays a two column table with file name and line number followed by the extracted comment. While we could use any programming language to do this text manipulation (Python springs to mind), I’m going to use a combination of sed and shell scripts to illustrate a few more powerful command line tools.
Recall that the output of our script so far looks like the following:
./Telephone.java:20: // todo: Document
In other words each line is in the form
relative/path/to/File:lineNumber:todo text
The colons delimiting the text allow us to split the constituent parts very easily. The command to do that is cut. With cut you specify the delimiter on which to split the text, and then which numbered fields you want (where fields are numbered 1 .. n)
As an example, here is code to extract the path (the first column of text):
find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 1
./Subdir/Test.java
./Telephone.java
./Telephone.java
This gives us the path, one per line. If we want to convert the relative path into just the name of the file, like the TextMate example does, we want to strip out all of the leading directories, leaving just the file name. While we could code up a regular expression to perform the substitution, I prefer to avoid doing more work than I need to. Instead I’ll use the basename command, which does that for us.
find . -name "*.java" -exec grep -Hin TODO {} + | basename `cut -d ":" -f 1`
Test.java
Telephone.java
Telephone.java
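The backtick form hands every path to a single basename call, and not every basename accepts that: as far as I know, the GNU coreutils version treats a second argument as a suffix and rejects a third unless given extra options. A more portable sketch calls basename once per line; the function name paths_to_basenames is made up for illustration.

```shell
#!/bin/sh

# Hypothetical helper: read one path per line on stdin and print just
# the file name of each, calling basename once per path.
paths_to_basenames() {
    while IFS= read -r path; do
        basename "$path"
    done
}
```

It slots in at the end of the pipeline: find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 1 | paths_to_basenames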
The line number, the second column of text, is just as easy to extract.
find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 2
1
20
29
The fact that the line of text extracted by grep could contain the colon character (and often will; I always write my TODOs as TODO: do x) means we have to be a bit smarter about how we use cut. If we assume that the text is just in the third column, we will lose the text if there are colons.
# Only taking the third column
echo "./Telephone.java:20: // todo: Document" | cut -d ":" -f 3
 // todo

# Taking all columns after and including the third column
echo "./Telephone.java:20: // todo: Document" | cut -d ":" -f 3-
 // todo: Document
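The shell’s own read builtin can do the same three-way split in one step: with the field separator set to a colon and three variable names, the last variable receives the rest of the line, colons included. A minimal sketch, with a made-up function name (split_grep_line) and assuming the file path itself contains no colon:

```shell
#!/bin/sh

# Hypothetical helper: split one "path:lineNumber:text" line from
# grep -Hn into its three parts and print each on its own line.
split_grep_line() {
    printf '%s\n' "$1" | {
        # IFS=: applies only to this read; "text" keeps any later colons
        IFS=: read -r file line text
        printf '%s\n%s\n%s\n' "$file" "$line" "$text"
    }
}
```

For example, split_grep_line "./Telephone.java:20: // todo: Document" prints the path, then 20, then the comment text with its leading space and inner colon intact.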
While this works, it’s not the neatest output. In particular we want to get rid of the leading white space; otherwise it will mess up the formatting in the HTML table. Performing text substitution is the job of the sed tool. sed stands for stream editor and it is capable of doing extremely heavy duty find and replace tasks. I don’t pretend to be an expert with sed and this article won’t make you one either, but hopefully I can at least illustrate its usefulness. For a more in depth tutorial, see Sed – An Introduction and Tutorial.
A common use case for sed, as I mentioned, is to replace text. The general pattern is
sed 's/regexpToReplace/textToReplaceItWith/[g]'
The s can be read as “substitute”, and the optional g stands for global. If you omit it, it will only replace the first instance of the regular expression match that it finds. The g makes it search for all matches in the text.
Thus to remove leading white space, we can use the expression sed 's/^[ <tab>]*//g'
where the ^ character indicates that it must match the start of the line, and the text within brackets are the characters that will be matched by the regular expression. The * means to match zero or more instances. In other words, this line says “match the start of the string and all spaces and tabs you can until reaching other text, and replace it with nothing”.
The above command is not strictly correct. We need to indicate to sed that we want to replace the tab character. Unlike many Unix utilities, sed does not allow you to use the character sequence \t to indicate the tab character. Instead you need a literal tab at that place in the command. The problem with doing this is that your shell might swallow the tab before it gets to the sed command. In bash, the default shell environment on the Mac, the tab key is interpreted as a command to auto complete what is being typed. If you press the tab key twice, the shell will print out all the possible autocompletions.
For instance,
$lp<tab><tab>
lp       lpc     lpmove     lppasswd  lpr   lprsetup.sh
lpadmin  lpinfo  lpoptions  lpq       lprm  lpstat
Here I started typing lp, hit tab twice, and the shell produced a list of all the commands it knew about (technically, those on the PATH environment variable). So we need a way to smuggle the tab key into the sed command without triggering the shell’s autocompletion. The way to do this is with the “verbatim” command sequence, which instructs the shell not to interpret certain keys and instead to treat them verbatim, as text.
To enter this temporary verbatim mode, you press Ctrl V (sometimes indicated as ^V online) followed by the key combination you want treated as text. Thus the real sed command to remove leading white space is sed 's/^[ ]*//'
$ sed 's/^[ ]*//'
    spaces
spaces
        tabs
tabs
      tabs and spaces
tabs and spaces
The above snippet illustrates that sed reads from standard input by default and thus can be used interactively to test the replacements you have specified. Again, in the above text it looks like I have a string of spaces, but it’s really <space><ctrl v><tab> within the brackets. From here on out I will put a \t to indicate a tab but you should realize that you need to do the ctrl v tab sequence I just described instead.
(Aside: I have read online that some versions of sed actually do support the \t character sequence to indicate tabs, but the default sed shipping with Mac OSX does not.)
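A way to sidestep the literal tab entirely is the POSIX character class [[:blank:]], which matches a space or a tab and which, to my knowledge, both the BSD sed on the Mac and GNU sed understand. A minimal sketch, with a made-up function name (strip_leading_ws):

```shell
#!/bin/sh

# Hypothetical helper: strip leading spaces and tabs from every line
# read on stdin, without needing a literal tab in the command.
strip_leading_ws() {
    # [[:blank:]] is the POSIX class for space and tab
    sed 's/^[[:blank:]]*//'
}
```

It can be tried interactively just like the command above, or fed from a pipe: printf '\t  hello\n' | strip_leading_ws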
sed – combine multiple commands into one
If you have a series of text replacements you want to do using sed, you can either pipe the chain of transformations from one sed invocation to another, or you can use the -e flag to chain them together.
echo "hello world" | sed 's/hello/goodbye/' | sed 's/world/frank/'
goodbye frank
echo "hello world" | sed -e 's/hello/goodbye/' -e 's/world/frank/'
goodbye frank
Note that you need the -e before the first sed expression as well; I naively tried to do
echo "hello world" | sed 's/hello/goodbye/' -e 's/world/frank/'
sed: -e: No such file or directory
sed: s/world/frank/: No such file or directory
Integrating sed with find and grep
Combining all of the above sed goodness with the previous code we have
find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 3- | sed 's/^[ \t]*//'
//todo: Create this file
// todo: Document
// todo: throw exception if precondition is violated
I don’t want the todo text in the comments, as it would be redundant. As such I will remove the double slashes followed by any white space followed by todo, followed by an optional colon, followed by any space.
find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 3- | sed -e 's/^[ \t]*//' -e 's/[\/*]*[ \t]*//' -e 's/TODO/todo/' -e 's/todo[:]*[ \t]*//'
Create this file
Document
throw exception if precondition is violated
This can be read as
s/^[ \t]*//    remove leading whitespace
s/[\/*]*       remove any number of forward slashes (/) or stars (*), which indicate the start of a comment
[ \t]*         remove whitespace
s/TODO/todo    convert uppercase TODO string into lower case
todo           remove the literal string 'todo'
[:]*           remove any colons that exist
[ \t]*         remove whitespace
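To try the chain on its own, without find and grep in the way, it can be wrapped in a small function and fed a few sample comment lines. This is a sketch under a couple of assumptions: the names TAB and clean_todo are invented here, and the tab is produced with printf instead of the Ctrl V Tab sequence.

```shell
# TAB and clean_todo are hypothetical names used only for this sketch.
TAB=$(printf '\t')

# The same four substitutions as in the pipeline, one per -e flag:
#   1. strip leading whitespace
#   2. strip the comment opener (slashes or stars) and the whitespace after it
#   3. normalize TODO to todo
#   4. strip "todo", any colons, and the whitespace that follows
clean_todo() {
  sed -e "s/^[ $TAB]*//" \
      -e "s/[\/*]*[ $TAB]*//" \
      -e 's/TODO/todo/' \
      -e "s/todo[:]*[ $TAB]*//"
}

printf '%s\n' \
  '//todo: Create this file' \
  '    // todo: Document' \
  '  /* TODO throw exception if precondition is violated' | clean_todo
```

Each substitution only replaces the first match on a line, so a second "todo" later in the comment text is left alone.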
We now have all the pieces we need to create our script.
Putting it all together
I’m going to show the script in its entirety without a huge amount of explanation. This post is more about the use of find/grep/sed than it is about shell scripting. I don’t claim to be an expert at writing shell scripts, so I wouldn’t be surprised if there’s a better way to do some of the following. It is not perfect; as the comments indicate, the sed command wouldn’t handle text like ToDo correctly. More importantly, there are some false positives in the lines it returns: lines containing things like toDouble match, because the case-insensitive grep finds the string ‘todo’ inside them. I’ll leave such improvements to the reader; if you do have any suggestions for the script, please add them to the comments below.
#!/bin/sh
# From http://www.linuxweblog.com/bash-argument-numbers-check
EXPECTED_ARGS=1
E_BADARGS=65

if [ $# -gt $EXPECTED_ARGS ]
then
  echo "Usage: ./extract [starting_directory]" >&2
  exit $E_BADARGS
fi

# By default, start in the current working directory, but if they provide
# an argument, use that instead.
if [ $# -eq $EXPECTED_ARGS ]
then
  startingDir=$1
else
  startingDir="."
fi

# Start creating the HTML document
echo "<html><head></head><body>"
echo "<table border=1>"
echo "<tr><td>Location</td><td>Comment</td></tr>"

# The output of the find command will look like
# ./Telephone.java:20: // todo: Document
# The while loop allows the script to read in the piped in results,
# one line at a time (-r keeps backslashes in the source line intact).
find "$startingDir" -name "*.java" -exec grep -Hin todo {} + |
while read -r data; do
  # The location of the file is the first argument
  fileLoc=`echo "$data" | cut -d ":" -f 1`
  fileName=`basename "$fileLoc"`
  # the line number is the second
  lineNumber=`echo "$data" | cut -d ":" -f 2`
  # all arguments after the second colon are the comment. Eliminate the TODO
  # text with a simple find and replace.
  # Note: only handles todo and TODO, would need some more logic to handle other cases
  # Note: each [ ] below should hold a space and a literal tab (Ctrl V, Tab)
  comment=`echo "$data" | cut -d ":" -f 3- | sed -e 's/^[ ]*//' -e 's/[\/*]*[ ]*//' -e 's/TODO/todo/' -e 's/todo[:]*[ ]*//'`
  echo "<tr>"
  # Escape the inner quotes so the href attribute is quoted in the HTML
  echo "  <td><a href=\"$fileLoc\">$fileName ($lineNumber)</a></td>"
  echo "  <td>$comment</td>"
  echo "</tr>"
done

# Finish off the HTML document
echo "</table>"
echo "</body></html>"
exit 0
If you save this script as a .sh file, you will need to make it executable before you can run it. From the terminal:
chmod +x extract.sh
# Extract all the TODO comments in the Applications folder, and save it as an html table
# Redirect the printed HTML to an HTML document
./extract.sh /Applications > table.html
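For readers who want to tackle the two limitations mentioned earlier (mixed-case ToDo and false positives such as toDouble), here is one possible direction. It is a sketch, not the script's actual code: the function names and the TAB variable are invented, and it reads from standard input so it can be tried without any Java files on disk.

```shell
# TAB, find_todo_lines and strip_todo_marker are hypothetical names
# used only for this sketch.
TAB=$(printf '\t')

# -w asks grep to match "todo" only as a whole word, so toDouble is skipped;
# -i keeps the match case-insensitive, so ToDo is still found.
find_todo_lines() {
  grep -inw todo
}

# Spelling out both cases of each letter makes the sed pattern match
# todo, TODO, ToDo, and so on, without a separate lowercasing step.
strip_todo_marker() {
  sed -e "s/^[ $TAB]*//" \
      -e "s/[\/*]*[ $TAB]*//" \
      -e "s/[Tt][Oo][Dd][Oo][:]*[ $TAB]*//"
}

# With input on stdin grep prints only "line:text", so the comment
# starts at field 2 here rather than field 3.
printf '%s\n' 'double x = a.toDouble();' '// ToDo: describe the class' |
  find_todo_lines | cut -d ":" -f 2- | strip_todo_marker
```

In the script itself this would presumably amount to adding w to the grep flags and swapping the last two sed expressions for the bracketed one, but I have not run that variant against the /Applications folder.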
The source code for the script is available on github. Running the script in my /Applications directory leads to the following HTML table:
Location | Comment |
Aquamacs (629) | return ((ObjectReference)val).toString(); // |
Aquamacs (633) | return val.toString(); // not correct in all cases |
Cycling (11) | support joint operations on more than one channel. |
Cycling (27) | what about objects with more than one input? |
Cycling (36) | improve feedback math — fixed point, like jit.wake? |
Cycling (277) | theta shift? |
Cycling (349) | double closest[] = new double[] {a[0].toDouble(), a[1].toDouble(), a[2].toDouble()}; |
Cycling (351) | double farthest[] = new double[] {a[0].toDouble(), a[1].toDouble(), a[2].toDouble()}; |
Cycling (5) | describe the class |
Cycling (22) | implement with a Vector to improve performance |
Cycling (8) | abort a thread if an incoming message arrives before completion |
Cycling (8) | have the search happen in a separate thread |
Cycling (9) | possible to separate the errors that results from not |
Cycling (191) | implement automatic replacement of shader name in prototype file |
PGraphicsOpenGL.java (738) | make this more efficient and just update a sub-part |
PGraphicsOpenGL.java (1165) | P3D overrides box to turn on triangle culling, but that’s a waste |
PGraphicsOpenGL.java (1180) | P3D overrides sphere to turn on triangle culling, but that’s a waste |
PGraphicsOpenGL.java (1508) | Should instead override textPlacedImpl() because createGlyphVector |
PGraphicsOpenGL.java (2207) | this expects a fourth arg that will be set to 1 |
PGraphicsOpenGL.java (2847) | not optimized properly, creates multiple temporary buffers |
PGraphicsOpenGL.java (2858) | is this possible without intbuffer? |
PGraphicsOpenGL.java (2870) | remove the implementation above and use setImpl instead, |
PGraphicsOpenGL.java (2978) | – extremely slow and not optimized. |
PGraphicsOpenGL.java (738) | make this more efficient and just update a sub-part |
PGraphicsOpenGL.java (1165) | P3D overrides box to turn on triangle culling, but that’s a waste |
PGraphicsOpenGL.java (1180) | P3D overrides sphere to turn on triangle culling, but that’s a waste |
PGraphicsOpenGL.java (1508) | Should instead override textPlacedImpl() because createGlyphVector |
PGraphicsOpenGL.java (2207) | this expects a fourth arg that will be set to 1 |
PGraphicsOpenGL.java (2847) | not optimized properly, creates multiple temporary buffers |
PGraphicsOpenGL.java (2858) | is this possible without intbuffer? |
PGraphicsOpenGL.java (2870) | remove the implementation above and use setImpl instead, |
PGraphicsOpenGL.java (2978) | – extremely slow and not optimized. |
The complete result can be found as another github gist.
Quick note: You have to be careful about what you echo in the shell. In an early version, I forgot to surround the text ($data) with quotes. This led to a problem when there were asterisks in the text, since the shell expanded the star into a list of all the files in the directory (aka file globbing). This is a relatively harmless problem; had the line had something like rm * instead, it would have been devastating. So make sure you surround your output text in quotes!
$ echo *
ApplicationTODO.html BlogPost.mkdown Find text.mkdown PGraphicsOpenGL.java TabTodo.java Test.html TodoTest.java appTable.html extract.sh tab tab.txt table body.html table.awk table.html table1.html test.java
$ echo "*"
*
Conclusion
I have introduced the find command and how it can be used to locate files or directories on disk with certain properties (name, last modified date, etc). I then showed how grep can be used to search the contents of a file or stream of content for matching regular expressions. Next I showed you how to combine find with arbitrary Unix commands, including grep, using the -exec option. Finally I tied all these concepts together by creating a simple script which searches through all of the Java files in a directory for those lines that have TODO in them, and creates an HTML table summarizing the location of each of these tasks, alongside the TODO item text.