Archive for the ‘unix’ Category

Pandoc – an essential tool for Markdown users

March 23, 2011

Pandoc is a great tool to convert between various text based formats. For instance, with a single input Markdown file, I can generate an HTML page of that document, a LaTeX document, and a beautifully typeset PDF.

I had trouble installing it on Mac OS X via MacPorts; a simpler solution for me was to download and install the Haskell package and then use the commands:

cabal update
cabal install pandoc

This assumes, of course, that the cabal program that the Haskell package installs is accessible from your path.

The next step for me was to install the excellent Pandoc TextMate bundle. This gives you the standard things like syntax highlighting of your document, as well as a variety of useful snippets. For instance, when I am in Pandoc mode and press ⌃ ⌥ ⌘ P, I get the following popup from which I can easily choose options via mouse or keyboard:

Easy way to preview your document in various output formats

Before you can start using the Pandoc TextMate bundle, you must ensure that the Pandoc executable is on the PATH exposed to TextMate, which is different from your global system PATH. In other words, just because you can execute pandoc in a shell and have it work doesn’t mean it will work in TextMate. For instance, on my computer, Pandoc is located in:

$ which pandoc
/Users/ndunn/Library/Haskell/bin/pandoc

Go to TextMate -> Preferences -> Advanced -> PATH and append :/Users/ndunn/Library/Haskell/bin to the end of the PATH variable.

Appending the Pandoc path to the PATH variable

Pandoc adds a few extensions to the Markdown syntax, which I really like. For instance, you can designate a section of text to be interpreted literally by surrounding it with lines of three ~ characters. Furthermore, you can specify what language the source code is in, and the Pandoc converter will syntax highlight it in the final document (assuming the correct extensions have been installed).
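For instance, a delimited block in a Pandoc Markdown source might look like the following (a sketch; the class name in the braces selects the highlighting language):

```
Here is the dummy implementation:

~~~ {.java}
public int randomValue() {
    // TODO: hook up the actual random number generator
    return 0;
}
~~~
```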

I like this setup because it allows you to specify the language of the block of text, which means that you can force TextMate to interpret it the same way. As I’ve blogged about previously, one can add source code syntax highlighting embedded in HTML documents. I added the following lines to my HTML language grammar in order to have a few different languages recognized and interpreted as source code within these delimited blocks.

Here is the relevant section:

    {   name = 'source.java';
            comment = 'Use Java grammar';
            begin = '~~~\s*{.java}';
            end = '~~~';
            patterns = ( { include = 'source.java'; } );
        },
        {   name = 'text.xml';
            comment = 'Use XML grammar';
            begin = '~~~\s*{.xml}';
            end = '~~~';
            patterns = ( { include = 'text.xml'; } );
        },
        {   name = 'source.shell';
            comment = 'Use Shell grammar';
            begin = '~~~\s*{.shell}';
            end = '~~~';
            patterns = ( { include = 'source.shell'; } );
        },
        {   name = 'source';
            begin = '~~~';
            end = '~~~';
            patterns = ( { include = 'source'; } );
        },

(One tricky bit to get used to is that you need at least one blank line between surrounding text and a ~~~ delimited block, or else the ~ characters are interpreted as strikeouts through the text.)

Here is a screenshot of this working in TextMate:

Syntax highlighting of sourcecode within the Pandoc document

Finally, just to get really meta on you, here’s a screenshot of the text of this document:

Text version of the document

followed by a screenshot of the HTML that Pandoc produces: HTML version of the document

followed by a screenshot of the PDF that LaTeX produced via Pandoc: PDF version of the document

I hope this has piqued your interest in Pandoc. I love the beautiful output of LaTeX but hate working with its syntax. With Pandoc I’m free to compose in Markdown, a language with a very lightweight syntax, and then convert into TeX when and if I want to.

ack – Better than grep?

December 28, 2010

I stumbled onto a really nice command line tool named ack while reading a Stack Overflow question yesterday.  Living at the domain betterthangrep.com, it purports to, well, be better than grep.  Or, as they put it:

ack is a tool like grep, designed for programmers with large trees of heterogeneous source code

I’ve written previously about how to combine find and grep, and really, ack exists to obviate the use of find and grep.  It ignores commonly ignored directories by default (e.g. all those .svn metadata folders that SVN insists on creating), and with a simple command line flag you can tell ack what sort of files you want searched.  Furthermore, because it recurses by default, you don’t need to use the find command to traverse the tree.

Using the todo example, a basic way of searching for the TODOs in all of our java files is to use the command

find . -name "*.java" -exec grep -i -n TODO {} \;

With ack, this is accomplished much more simply:

ack -i --java TODO

Furthermore, the matching results are highlighted right away, making it extremely apparent where the matches occur.

I’m going to start using this at work and see if it can replace my grep/find hackery.  Will let you know.  Very impressed so far.

If you want to give it a try, the easiest way to install it is with MacPorts:

port install p5-app-ack

Excel 2008 for Mac’s CSV export bug

December 6, 2010

I ran into this at work a few weeks ago and thought I’d share.

Excel 2008’s CSV export feature is broken.  For instance, enter the following fake data into Excel:

Row  Name  Age
0    Nick  23
1    Bill  48

Save As -> CSV file

Full list of choices

When you use standard unix commands to view the output, the results are all garbled.

[Documents]$ cat Workbook1.csv
1,Bill,48[Documents]$
$ wc -l Workbook1.csv
0 Workbook1.csv

What is the issue?  The file command reveals the problem:

$ file Workbook1.csv
Workbook1.csv: ASCII text, with CR line terminators

CR stands for carriage return, the ‘\r’ control character which, along with the newline character (‘\n’), is used to break up lines on Windows.  Unix-like OSes, including Mac OS X, expect a single ‘\n’ newline character to terminate lines.

How can we fix this?

dos2unix.

# convert the Workbook1.csv file into a Unix appropriate file
dos2unix Workbook1.csv WithUnixLineEndings.csv

If you don’t have dos2unix on your Mac, and you don’t want to install it, you can fake it with the tr command:

tr '\15' '\n' < Workbook1.csv # remove the carriage returns, replace with a newline
Row,Name,Age
0,Nick,23
1,Bill,48
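If you want to reproduce the failure without Excel, you can fabricate a CR-terminated file with printf (a sketch; the file names are arbitrary):

```shell
# build a file whose lines end in bare carriage returns, as Excel 2008 does
printf 'Row,Name,Age\r0,Nick,23\r1,Bill,48\r' > cr_endings.csv

wc -l < cr_endings.csv      # reports 0: there are no \n characters at all

# replace each carriage return with a newline
tr '\r' '\n' < cr_endings.csv > unix_endings.csv
wc -l < unix_endings.csv    # reports 3
```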
Very annoying that the Mac Excel doesn’t respect Unix line terminators.  Interestingly, I found a post that talks about ensuring that you choose a CSV file encoded for Mac, but that option seems missing from the Mac version itself.
If I’m missing something obvious, please correct me.

Bash: How to redirect standard error to standard out

November 9, 2010

Problem:

You have a program which is outputting information to standard error that you wish to search through.  When commands are chained together in Unix via the pipe operator, standard out is connected to standard in.  Thus you cannot easily search the contents of the standard error.  How can you find what you’re looking for?

Solution

The first solution is to save the standard error as a file, and search through the file.

command_producing_standard_error 2> stderr.txt; grep "search string" stderr.txt; rm stderr.txt

This works but you have to remember to remove the text file that’s created in the process.

A better solution, and one that allows you to use the standard error in an existing pipeline is to instead redirect standard error to standard out.

command_producing_standard_error 2>&1 | grep "search string"

Recall that 2 refers to standard error and 1 refers to standard out; those familiar with C/C++ should recognize ‘&’ as the address operator, and it serves a similar role here.  After this command, both the standard out and standard error are in one stream, standard out, and can be connected via the pipe (|) symbol to other programs, such as grep.
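You can see the difference with a pair of echo commands standing in for a real program (a self-contained sketch):

```shell
# without the redirection, only stdout enters the pipe; grep never sees the error line
{ echo "normal output"; echo "error output" >&2; } 2>/dev/null | grep -c error
# prints 0

# with 2>&1, stderr is merged into stdout before the pipe, so grep sees both lines
{ echo "normal output"; echo "error output" >&2; } 2>&1 | grep -c error
# prints 1
```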

This tip is modified from information found in the Bash Cookbook, in the recipe “Saving Output When Redirect Doesn’t Seem To Work”.  Additional solutions and discussion can be found on unix.stackexchange.com.


Quotes, quotes, quotes: A primer for the command line

October 25, 2010

In Bash programming, there are a lot of ways to get input into programs.  In particular, there are a slew of different quoting methods you should understand.  This article provides a quick reference to the differences between using no quotes, double quotes, single quotes, and backticks.

No quotes

Standard shell scripts assume arguments are space delimited.  You can iterate over elements in this way:

for i in Hi how are you; do echo $i; done
Hi
how
are
you

This is why it is a problem to have spaces in your file names.  For instance,

$ ls
with spaces.txt

$ cat with spaces.txt
cat: with: No such file or directory
cat: spaces.txt: No such file or directory

Here I naively typed with spaces.txt thinking the cat program could handle it.  Instead, cat saw two arguments: with, and spaces.txt.  In order to handle this, you can either escape the space,

$ cat with\ spaces.txt

or use the double quotes method.  (Note that if you use tab autocompletion, the backslash escape will be added automatically.)

Double quotes

Double quotes can be used when you want to group multiple space delimited words together as a single argument.  For instance

for i in "Hi how" "are you"; do echo $i; done
Hi how
are you

In the previous example, I could do

$ cat "with spaces.txt"

and the filename would be passed as a single unit to cat.

An important thing to note is that shell variables are expanded within double quotes.

name=Frank; echo "Hello $name"
Hello Frank

This is crucial to understand.  It also allows you to solve problems caused by having spaces in file names, especially when combined with the * globbing behavior of the shell.  For instance, let’s say we wanted to iterate over all the text files in a directory and do something to them.

$ ls
with spaces.txt   withoutspaces.txt
$ for i in *.txt; do cat $i; done
cat: with: No such file or directory
cat: spaces.txt: No such file or directory
# Surround the $i with quotes and our space problem is solved.
$ for i in *.txt; do cat "$i"; done

(Yes, I know iterating over and calling cat on each argument is silly, as cat can accept a list of files (e.g. *.txt).  But it illustrates the point that commands will be confused by spaces in file names, and that double quotes solve the problem.)

Double quotes are also handy when you need to embed single quotes in a string (you do not need to escape them):

$ echo "'Single quotes'"
'Single quotes'
$ echo "\"Escaped quotes\""
"Escaped quotes"

Double quotes are my default while I’m working in the terminal.

Single quotes

Single quotes act just like double quotes except that the text inside of them is interpreted literally; in other words, the shell does not attempt to do any more expansion or substitution.  For instance,

$ name=Frank; echo 'Hello $name'
Hello $name

This can save you some backslash escaping your normally would have to do.

Use it when:

  • You need double quotes embedded in your string
$ echo '"How are you doing?", she said'
"How are you doing?", she said
  • You do not need any literal single quotes in your string (it’s very difficult to get single quotes/apostrophe literals to appear in such a string)

Back ticks

Back ticks (`, the key to the left of the 1 and above the Tab key on a standard US keyboard) allow you to substitute in the output of another command.  For instance:

$ current_dir=`pwd`
$ echo $current_dir
/Users/nicholasdunn/Desktop/Scripts

Backticks are expanded within double quotes, but are treated as literal characters within single quotes:


$ echo "`pwd`"
/Users/nicholasdunn/Desktop/Scripts
$ echo '`pwd`'
`pwd`

Use when:

You want to capture the results of another command, usually for purposes of assigning a variable.
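For example (a sketch; note also that the POSIX $( ) form, which this post doesn’t cover, is equivalent to backticks and nests more cleanly):

```shell
# capture the output of a command into a variable
count=`ls | wc -l`
echo "Files here: $count"

# $( ) does the same job and can be nested without escaping
echo "Nested: $(echo "$(echo inner)")"
```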

Hopefully this brief tour through the different types of quotes in bash has been useful.


bpython – an excellent interpreter for python

October 12, 2010

If you use Python, you know that its interactive shell is a great way to test out ideas and iterate quickly.  Unfortunately, the basic interactive shell is very barebones – there is no syntax highlighting, autocompletion, or any of the features we come to expect from working in IDEs.  Fortunately if you’re on a Unix system, there is a great program called bpython which adds all of those missing features.

As you are typing, suggestions appear at the bottom. Press tab to take the suggestion

If you have easy_install, it’s the simplest thing in the world to install:

sudo easy_install bpython

I can’t recommend this product enough.  It’s free, so what’re you waiting for?


Mac OSX – copy terminal output to clipboard

October 12, 2010

Here’s a quick tip: If you want the results of some shell computation to be accessible to your clipboard (e.g. so you can paste the results into an e-mail or into some pastebin service), you can pipe the command into the `pbcopy` program.

echo "Hello world" | pbcopy
# "Hello world" is now in your clipboard

Apparently there is a way to do a similar thing on Ubuntu as well.


How to remove “smart” quotes from a text file

October 11, 2010

If you’ve copied and pasted text from Microsoft Word, chances are there will be the so-called smart quotes in that text. Some programs don’t handle these characters very well. You can turn them off in Word but if you’re trying to remedy the problem after the fact, sed is your old friend.  I’ll show you how to replace these curly quotes with the traditional straight quote.

Recall that you can do global find/replace by using sed.

sed 's/[”“]/"/g' File.txt

This won’t actually change the contents of the File, but you can save the results to a new file

sed 's/[”“]/"/g' File.txt > WithoutSmartQuotes.txt

If you wish to save the files in place, overwriting the original contents, you would do

sed -i ".bk" 's/[”“]/"/g' File.txt

This tells the sed command to make the change “in place”, while backing up the original file to File.txt.bk in case anything goes wrong.

To fix the smart quotes in all the text files in a directory, do the following:

for i in *.txt; do sed -i ".bk" 's/[”“]/"/g' "$i"; done

At the conclusion of the command, you will have double the number of text files in the directory, due to all the backup files. When you’ve concluded that the changes are correct (do a diff File.txt File.txt.bk to see the difference), you can delete all the backup files with rm *.bk.
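To sanity-check the substitution before running it over real files, you can feed it a fabricated line (a sketch; depending on your locale, each multi-byte smart quote may be replaced by one or more straight quotes, but either way the curly characters are gone):

```shell
# a fabricated line containing smart quotes
printf '“Smart quotes,” she said\n' | sed 's/[”“]/"/g'
```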

Unix tip #3: Introduction to Find, Grep, Sed

September 7, 2010

I’ve written a few times before about Unix command line tools and how learning them can make you a more efficient programmer.  Today I’m going to introduce a few essential tools in the Unix toolkit.  While programming, one often notes future improvements or tasks with the use of a TODO comment.  For instance, if you have a dummy implementation of a method, you might comment that you need to fill in the actual implementation later:

public int randomValue() {
    // TODO: hook up the actual random number generator
    return 0;
}

The problem is that these TODOs more often than not get ignored, especially if you have to search through the code yourself to find all of the remaining tasks.  Fortunately, certain programs (NetBeans and TextMate, for two examples) can find instances of keywords indicating a task, extract the comments, and present them to you in a nice table view.

I’m going to step through the use of a few Unix tools that can be tied together to extract the data and create a similar view.  In particular I will illustrate the use of find, grep, sed,  and pipes.

The general steps I’ll be presenting are:

Step                                        Tools used
1. Find all Java files                      find
2. Find each TODO item                      grep
3. Extract filename, line number, task      sed
4. Format results of step 3 as HTML table   find/grep/sed/shell script

Finding instances of text with grep

In order to extract all of the TODO items from within our java files, we need a way of searching for matching text. grep is the tool to do that. Grep takes as input a list of files to search and a pattern to try to match against; it will then emit a set of lines matching the pattern.

For instance, to search for TODO or any version of that string (todo, ToDO), in all the .java files in the current directory, you would execute the following:

grep -i TODO *.java
Telephone.java:    // TODO: Document
Telephone.java:     // TODO: throw exception if precondition is violated

Note that the line numbers are omitted. If we want them, we use the -n flag:

grep -i -n TODO *.java
Telephone.java:20:    // TODO: Document
Telephone.java:29:     // TODO: throw exception if precondition is violated

If all we want to do is get a rough estimate as to how many documented TODOs we have, we can pipe the result of this command into the wc utility, which counts bytes, characters, or lines. We want the number of lines.

grep -i -n TODO *.java | wc -l
       2

This works fine with a single directory of files, but it will not handle nested directories. For instance, if my directory structure looks like the following:

tree
.
|-- BalancedTernary.java
`-- Telephone.java

0 directories, 2 files

All of these files will be searched when grep is run. But if I introduce new files in subdirectories:

mkdir Subdir
echo "//TODO: Create this file" > Subdir/Test.java

tree
|-- BalancedTernary.java
|-- Subdir
|   `-- Test.java
`-- Telephone.java

1 directory, 3 files

The new Test.java will not be searched. In order to make grep search through all of the subdirectories (i.e., recursively), you can combine grep with another extremely useful Unix utility, find. Before moving on to find, I want to stress that grep is extremely useful and vital to anyone using a Unix based machine. See grep tutorials for many good examples of how to use grep.

Finding files with find

The find command is extremely useful. The man page describes find as

find – search for files in a directory hierarchy

There are a lot of arguments you can use, but to get started, the basic syntax is

find [<starting location>] -name <name pattern>

If the starting location is not provided, it is assumed to be the current directory (. in Unix terms). In all the examples that follow I will explicitly list the starting directory.

For instance, if we want to find all the files that end with the extension “.java” in the current working directory, we could run the following:

find . -name "*.java"
./BalancedTernary.java
./Subdir/Test.java
./Telephone.java

Note that we must enclose the pattern in quotes in this example in order to prevent the shell from trying to expand the * wildcard. If we don’t, the shell will convert the asterisk into a space delimited set of all the files/directories in the current folder, which will lead to an error

find . -name *.java # expands to find . -name BalancedTernary.java Telephone.java
find: Telephone.java: unknown option

Just as we can use the wc command to count the number of times a phrase appears in a file, we can use it to count the number of files matching a given pattern. That is because find outputs each matching file path to a separate line. Thus if we wanted to count the number of java files in all folders rooted in the current folder, we could do

find . -name "*.java" | wc -l
       3

While I have only presented the -name flag, there are numerous other flags as well, such as whether the candidate is a file or a directory (-type f or -type d respectively), whether the match is smaller, the same, or bigger than a given size (-size +100M means bigger than 100 megabytes), or when the file was last modified (find -newer ordinary_file would only accept files that have a modification time newer than that of ordinary_file). A great article for gaining more expertise is Mommy, I found it! – 15 practical unix find commands.
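A couple of those flags in action (a sketch; the paths and sizes here are hypothetical):

```shell
# list only directories, no more than two levels below the current one
find . -maxdepth 2 -type d

# list only regular files bigger than one megabyte
find . -type f -size +1M
```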

Combining find with other commands

find becomes even more powerful when combined with the -exec option, which allows you to execute arbitrary commands for each file that matches the pattern. The syntax for doing that looks like

find [<starting location>] -name <name pattern> -exec <commands> {} \;

where the file path will be substituted for the {} characters. For instance, if we want to count the number of lines in each Java file, we could run

find . -name "*.java" -exec wc -l {} \;
      23 ./BalancedTernary.java
       1 ./Subdir/Test.java
      88 ./Telephone.java

This has precisely the same effect as if we explicitly executed the wc -l command ourselves:

wc -l ./BalancedTernary.java
wc -l ./Subdir/Test.java
wc -l ./Telephone.java

As another example, we could backup all of the Java files in the directory by copying them and appending the suffix .bk to each

find . -name "*.java" -exec cp {} {}.bk \;
Nick@Macintosh-3 ~/Desktop/Programming/Java/example$ ls
BalancedTernary.java    Subdir                  Telephone.java.bk
BalancedTernary.java.bk Telephone.java

To undo this, we could remove all of the files ending in .bk:

find . -name "*.bk" -exec rm {} \;

Combining find and grep

Since I started the article talking about grep, it’s only natural that you can combine grep with find, and it often pays to do so.

For instance, by combining the earlier grep command to find all TODO items with the find command to find all java files, we suddenly have a command which will traverse an arbitrarily nested directory structure and search all the files we are interested in.

find . -name "*.java" -exec grep -i -n TODO {}  \;
1://todo: Create this file
20:    // todo: Document
29:     // todo: throw exception if precondition is violated

Note that we no longer have the filename prepended to the output; if we want it back we can add the -H flag.

find . -name "*.java" -exec grep -Hin TODO {} \;
./Subdir/Test.java:1://todo: Create this file
./Telephone.java:20:    // todo: Document
./Telephone.java:29:     // todo: throw exception if precondition is violated

In this last snippet I have combined the individual -H, -i, and -n flags into the shorter -Hin; this works identically to listing them separately. (Not all Unix commands work this way; check the man page if you’re unsure.)

An alternate exec terminator: Performance considerations

I said earlier that the basic syntax for combining find with other commands is

find [<starting location>] -name <name pattern> -exec <commands> {} \;

The ; terminates the -exec clause, but because the shell would otherwise interpret it as a command separator, it has to be backslash escaped. While researching this article I found a Unix/Linux “find” Command Tutorial that introduced me to an alternative syntax for terminating the -exec clause of the find command. By replacing the semicolon with a + sign, files are grouped together in batches and sent to the given command rather than processed one at a time. Let me illustrate:

# Executes the 'echo' command on each file individually
find . -exec echo {} \;
.
./BalancedTernary.java
./Subdir
./Subdir/Test.java
./table.html
./Telephone.java
./test.a

# Executes the 'echo' command on bundled groups of files
find . -exec echo {} +
. ./BalancedTernary.java ./Subdir ./Subdir/Test.java ./table.html ./Telephone.java ./test.a

This technique of grouping the files together can have a profound performance boost when used with commands that can handle space terminated arguments. For instance:

time find /Applications/ -name "*.java" -exec grep -i TODO {} \;
real    1m36.458s
user    0m3.912s
sys 0m10.933s

time find /Applications/ -name "*.java" -exec grep -i TODO {} +
real    0m39.060s
user    0m3.660s
sys 0m6.571s

# An alternate way of executing grep on batches of files at once #
time find /Applications/ -name "*.java" -print0 | xargs -0 grep -i "TODO"
real    0m50.486s
user    0m4.230s
sys 0m7.924s

By replacing the semicolon with the plus sign, I gained almost a 2.5x speed increase. Again, this will only work with commands that correctly handle whitespace separated arguments; the previous example with copy would fail miserably, because cp expects a single src/destination pair.

# Will not work!
find . -name "*.java" -exec cp {} {}.bk +
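If you do want cp to run once per file while still avoiding -exec ... \;, one workaround (not from the original post) is xargs with a replacement string. Note that -I forces one cp invocation per file, so this restores correctness, not the batching speed-up:

```shell
# copy each .java file to a .bk backup, one cp invocation per file
find . -name "*.java" -print0 | xargs -0 -I{} cp {} {}.bk
```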

Converting results of find/grep into table form – Intro to sed, cut, and basename

In the last section, I showed how to combine find and grep. The output of the command will look something like this:

find . -name "*.java" -exec grep -Hin TODO {} +
./Subdir/Test.java:1://todo: Create this file
./Telephone.java:20:    // todo: Document
./Telephone.java:29:     // todo: throw exception if precondition is violated

The output has the path to the file, followed by a colon, the line number, another colon, and then the matching line in the input file that had the TODO in it. Let’s mimic the output of the TODO list in TextMate, which simply displayed a two column table with file name and line number followed by the extracted comment. While we could use any programming language to do this text manipulation (Python springs to mind), I’m going to use a combination of sed and shell scripts to illustrate a few more powerful command line tools.

Recall that the output of our script so far looks like the following:

./Telephone.java:20: // todo: Document

In other words each line is in the form

relative/path/to/File:lineNumber:todo text

The colons delimiting the text allow us to split the constituent parts very easily. The command to do that is cut. With cut you specify the delimiter on which to split the text, and then which numbered fields you want (where fields are numbered 1 .. n)

As an example, here is code to extract the path (the first column of text):

find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 1
./Subdir/Test.java
./Telephone.java
./Telephone.java

This gives us the path, one per line. If we want to convert the relative path into just the name of the file, like the TextMate example does, we want to strip out all of the leading directories, leaving just the file name. While we could code up a regular expression to perform the substitution, I prefer to avoid doing more work than I need to. Instead I’ll use the basename command, which does that for us.
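basename on its own, for reference (a second argument strips a trailing suffix):

```shell
basename ./Subdir/Test.java
# Test.java

basename ./Subdir/Test.java .java
# Test
```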

find . -name "*.java" -exec grep -Hin TODO {} + | basename `cut -d ":" -f 1`
Test.java
Telephone.java
Telephone.java

The line number, the second column of text, is just as easy to extract.

find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 2
1
20
29

The fact that the line of text extracted by grep could contain the colon character (and often will; I always write my TODOs as TODO: do x) means we have to be a bit smarter about how we use cut. If we take just the third column, we will lose any text that follows an embedded colon.

# Only taking the third column
echo "./Telephone.java:20:    // todo: Document" | cut -d ":" -f 3
    // todo
# Taking all columns after and including the third column
echo "./Telephone.java:20:    // todo: Document" | cut -d ":" -f 3-
    // todo: Document

While this works, it’s not the neatest output. In particular we want to get rid of the leading white space; otherwise it will mess up the formatting in the HTML table. Performing text substitution is the job of the sed tool. sed stands for stream editor and it is capable of doing extremely heavy duty find and replace tasks. I don’t pretend to be an expert with sed and this article won’t make you one either, but hopefully I can at least illustrate its usefulness. For a more in depth tutorial, see Sed – An Introduction and Tutorial.

A common use case for sed, as I mentioned, is to replace text. The general pattern is

sed 's/regexpToReplace/textToReplaceItWith/[g]'

The s can be read as “substitute”, and the optional g stands for global. If you omit it, it will only replace the first instance of the regular expression match that it finds. The g makes it search for all matches in the text.

Thus to remove leading white space, we can use the expression sed 's/^[ <tab>]*//g'

where the ^ character indicates that it must match the start of the line, and the text within brackets are the characters that will be matched by the regular expression. The * means to match zero or more instances. In other words, this line says “match the start of the string and all spaces and tabs you can until reaching other text, and replace it with nothing”.

The above command is not strictly correct. We need to indicate to sed that we want to replace the tab character. Unlike many Unix utilities, sed does not allow you to use the character sequence \t to indicate the tab character. Instead you need a literal tab at that place in the command. The problem with doing this is that your shell might swallow the tab before it gets to the sed command. In bash, the default shell environment on the Mac, the tab key is interpreted as a command to auto complete what is being typed. If you press the tab key twice, the shell will print out all the possible autocompletions.

For instance,

$ lp<tab><tab>
lp           lpc          lpmove       lppasswd     lpr          lprsetup.sh
lpadmin      lpinfo       lpoptions    lpq          lprm         lpstat

Here I started typing lp, hit tab twice, and the shell produced a list of all the commands it knew about (technically, those on the PATH environment variable). So we need a way to smuggle the tab key into the sed command without triggering the shell’s autocompletion. The way to do this is with the “verbatim” key sequence, which instructs the shell not to interpret the next keystroke and instead to pass it through verbatim, as text.

To enter this temporary verbatim mode, you press Ctrl V (sometimes indicated as ^V online) followed by the key combination you want treated as text. Thus the real sed command to remove leading white space is sed 's/^[ ]*//'

$ sed 's/^[    ]*//'
     spaces
spaces
        tabs
tabs
           tabs and spaces
tabs and spaces

The above snippet illustrates that sed reads from standard input by default and thus can be used interactively to test the replacements you have specified. Again, in the above text it looks like I have a string of spaces, but it’s really <space><ctrl v><tab> within the brackets. From here on out I will put a \t to indicate a tab but you should realize that you need to do the ctrl v tab sequence I just described instead.

(Aside: I have read online that some versions of sed actually do support the \t character sequence to indicate tabs, but the default sed shipping with Mac OSX does not.)
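One portable way to sidestep the literal-tab dance entirely (an alternative the post doesn’t use) is the POSIX character class [[:space:]], which matches both spaces and tabs:

```shell
# strip leading spaces and tabs without typing a literal tab
printf '    spaces\n\ttab\n' | sed 's/^[[:space:]]*//'
# spaces
# tab
```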

sed – combine multiple commands into one

If you have series of text replacements you want to do using sed, you can either pipe the chain of transformations you want to do from one sed invocation to another, or you can use the -e flag to chain them together.

echo "hello world" | sed 's/hello/goodbye/' | sed 's/world/frank/'
goodbye frank
echo "hello world" | sed -e 's/hello/goodbye/' -e 's/world/frank/'
goodbye frank

Note that you need an -e before the first sed pattern as well; I naively tried to do

echo "hello world" | sed 's/hello/goodbye/' -e 's/world/frank/'
sed: -e: No such file or directory
sed: s/world/frank/: No such file or directory

Integrating sed with find and grep

Combining all of the above sed goodness with the previous code we have

find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 3- | sed 's/^[ \t]*//'
//todo: Create this file
// todo: Document
// todo: throw exception if precondition is violated

I don’t want the todo text in the comments, as it would be redundant. As such I will remove the double slashes followed by any white space followed by todo, followed by an optional colon, followed by any space.

find . -name "*.java" -exec grep -Hin TODO {} + | cut -d ":" -f 3- | sed -e 's/^[ \t]*//' -e 's/[\/*]*[ \t]*//' -e 's/TODO/todo/' -e 's/todo[:]*[ \t]*//'
 Create this file
 Document
 throw exception if precondition is violated

This can be read as

s/^[ \t]*//          remove leading whitespace
s/[\/*]*[ \t]*//     remove any number of forward slashes (/) or stars (*), which indicate the start of a comment, plus any whitespace after them
s/TODO/todo/         convert the uppercase TODO string into lower case
s/todo[:]*[ \t]*//   remove the literal string 'todo', any colons, and any trailing whitespace

We now have all the pieces we need to create our script.

Putting it all together

I’m going to show the script in its entirety without a huge amount of explanation. This post is more about the use of find/grep/sed than it is about shell scripting. I don’t claim to be an expert at writing shell scripts, so I wouldn’t be surprised if there’s a better way to do some of the following. It is not perfect; as the comments indicate, it wouldn’t handle text like ToDo correctly in the sed command. More importantly, there are some false positives in the lines it returns: things like toDouble match, because it contains the string ‘todo’. I’ll leave such improvements to the reader; if you do have any suggestions for the script, please add them to the comments below.
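For instance, one cheap fix for the toDouble false positives is grep’s -w flag, which only matches whole words. This is just a sketch of the idea, not wired into the script below:

```shell
# -w requires the match to be a whole word, so 'todo' inside an
# identifier such as toDouble no longer counts as a match
printf '// TODO: fix\ndouble x = a.toDouble();\n' | grep -inw todo
# prints: 1:// TODO: fix
```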

#!/bin/sh

# From http://www.linuxweblog.com/bash-argument-numbers-check
EXPECTED_ARGS=1
E_BADARGS=65
if [ $# -gt $EXPECTED_ARGS ]
then
  echo "Usage: ./extract [starting_directory]" >&2
  exit $E_BADARGS
fi

# By default, start in the current working directory, but if they provide
# an argument, use that instead.
if [ $# -eq $EXPECTED_ARGS ]
then
    startingDir="$1"
else
    startingDir="."
fi

# Start creating the HTML document
echo "<html><head></head><body>"
echo "<table border=1>"
echo "<tr><td>Location</td><td>Comment</td></tr>"

# The output of the find command will look like
# ./Telephone.java:20:    // todo: Document

find "$startingDir" -name "*.java" -exec grep -Hin todo {} + |
# Allows the script to read in piped in arguments
while read data; do

    # The location of the file is the first argument
    fileLoc=`echo "$data" | cut -d ":" -f 1`
    fileName=`basename "$fileLoc"`

    # the line number is the second
    lineNumber=`echo "$data" | cut -d ":" -f 2`

    # all arguments after the second colon are the comment.  Eliminate the TODO
    # text with a simple find and replace.
    # Note: only handles todo and TODO, would need some more logic to handle other cases
    comment=`echo "$data" | cut -d ":" -f 3- | sed -e 's/^[     ]*//' -e 's/[\/*]*[     ]*//' -e 's/TODO/todo/' -e 's/todo[:]*[     ]*//'`
    echo "<tr>"
    echo "  <td><a href=\"$fileLoc\">$fileName ($lineNumber)</a></td>"
    echo "  <td>$comment</td>"
    echo "</tr>"
done

# Finish off the HTML document
echo "</table>"
echo "</body></html>"

exit 0

If you save this script as a .sh file, you will need to make it executable before you can run it. From the terminal:

chmod +x extract.sh
# Extract all the TODO comments in the Applications folder, and save it as an html table
# Redirect the printed HTML to an HTML document
./extract.sh /Applications > table.html

The source code for the script is available on github. Running the script in my /Applications directory leads to the following HTML table:

Location Comment
Aquamacs (629) return ((ObjectReference)val).toString(); //
Aquamacs (633) return val.toString(); // not correct in all cases
Cycling (11) support joint operations on more than one channel.
Cycling (27) what about objects with more than one input?
Cycling (36) improve feedback math — fixed point, like jit.wake?
Cycling (277) theta shift?
Cycling (349) double closest[] = new double[] {a[0].toDouble(), a[1].toDouble(), a[2].toDouble()};
Cycling (351) double farthest[] = new double[] {a[0].toDouble(), a[1].toDouble(), a[2].toDouble()};
Cycling (5) describe the class
Cycling (22) implement with a Vector to improve performance
Cycling (8) abort a thread if an incoming message arrives before completion
Cycling (8) have the search happen in a separate thread
Cycling (9) possible to separate the errors that results from not
Cycling (191) implement automatic replacement of shader name in prototype file
PGraphicsOpenGL.java (738) make this more efficient and just update a sub-part
PGraphicsOpenGL.java (1165) P3D overrides box to turn on triangle culling, but that’s a waste
PGraphicsOpenGL.java (1180) P3D overrides sphere to turn on triangle culling, but that’s a waste
PGraphicsOpenGL.java (1508) Should instead override textPlacedImpl() because createGlyphVector
PGraphicsOpenGL.java (2207) this expects a fourth arg that will be set to 1
PGraphicsOpenGL.java (2847) not optimized properly, creates multiple temporary buffers
PGraphicsOpenGL.java (2858) is this possible without intbuffer?
PGraphicsOpenGL.java (2870) remove the implementation above and use setImpl instead,
PGraphicsOpenGL.java (2978) – extremely slow and not optimized.
PGraphicsOpenGL.java (738) make this more efficient and just update a sub-part
PGraphicsOpenGL.java (1165) P3D overrides box to turn on triangle culling, but that’s a waste
PGraphicsOpenGL.java (1180) P3D overrides sphere to turn on triangle culling, but that’s a waste
PGraphicsOpenGL.java (1508) Should instead override textPlacedImpl() because createGlyphVector
PGraphicsOpenGL.java (2207) this expects a fourth arg that will be set to 1
PGraphicsOpenGL.java (2847) not optimized properly, creates multiple temporary buffers
PGraphicsOpenGL.java (2858) is this possible without intbuffer?
PGraphicsOpenGL.java (2870) remove the implementation above and use setImpl instead,
PGraphicsOpenGL.java (2978) – extremely slow and not optimized.

The complete result can be found as another github gist.

Quick note: you have to be careful about what you echo in the shell. In an early version, I forgot to surround the text ($data) with quotes. This caused problems when the text contained asterisks, since the shell expanded each star into a list of all the files in the current directory (aka file globbing). Here the problem was relatively harmless; had the line contained something like rm * instead, it would have been devastating. So make sure you surround your output text in quotes!

$ echo *
ApplicationTODO.html BlogPost.mkdown Find text.mkdown PGraphicsOpenGL.java TabTodo.java Test.html TodoTest.java appTable.html extract.sh tab tab.txt table body.html table.awk table.html table1.html test.java
$ echo "*"
*

Conclusion

I have introduced the find command and how it can be used to locate files or directories on disk with certain properties (name, last modified date, etc.). I then showed how grep can be used to search the contents of a file or stream for matching regular expressions. Next I showed how to combine find with arbitrary Unix commands, including grep, using the -exec option. Finally I tied all these concepts together with a simple script that searches all of the Java files in a directory for lines containing TODO, and creates an HTML table summarizing the location and text of each of these tasks.

Categories: Uncategorized, unix Tags: , , , , ,

How to make git use TextMate as the default commit editor

July 21, 2010 2 comments
git config --global core.editor "mate -w"

Now when you do a git commit without specifying a commit message, TextMate will pop up and allow you to enter a commit message. When you save the file and close the window, the commit will go through as normal. (If you have another text editor you prefer, just change "mate -w" to the appropriate command for that editor.)

For those curious what the -w argument is about, it tells mate to wait: the command does not return until the file has been saved and the window closed, which is how git knows the commit message is ready. Read this for more information about how to associate TextMate with various other shell scripts and programs.
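You can confirm the setting took by reading it back with git config. The temporary HOME below is purely for illustration, so the demo does not touch your real ~/.gitconfig:

```shell
# use an isolated HOME so this demo does not modify your real config
export HOME="$(mktemp -d)"
git config --global core.editor "mate -w"
git config --global core.editor
# prints: mate -w
```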

Categories: textmate, Uncategorized, unix Tags: , ,