Archive

Archive for September, 2011

Python Gotcha: Word boundaries in regular expressions

September 22, 2011 2 comments

TL;DR

Be careful trying to match word boundaries in Python using regular expressions.  You have to be sure to either escape the match sequence or use raw strings.

Word boundaries

Word boundaries are a great way of performing regular expression searches for whole words while avoiding partial matches.  For instance, a search for the regular expression “the” would match both the word “the” and the start of the word “thesaurus”.

>>> import re
>>> re.match("the", "the")
# matches
>>> re.match("the", "thesaurus")
# matches 
In some cases, you might want to match just the word “the” by itself, but not when it’s embedded within another word.

The way to match a word boundary is with ‘\b’, as described in the Python documentation.  I wasted a few minutes wrestling with trying to get this to work.

>>> re.match("\bthe\b", "the")
# no match

It turns out that \b is also used as the backspace control sequence.  Thus in order for the regular expression engine to interpret the word boundary correctly, you need to escape the sequence:

>>> re.match("\\bthe\\b", "the")
# match

You can also use raw string literals and avoid the double backslashes:

>>> re.match(r"\bthe\b", "the")
# match

In case you haven’t seen the raw string prefix before, here is the relevant documentation:

String literals may optionally be prefixed with a letter ‘r’ or ‘R’; such strings are called raw strings and use different rules for interpreting backslash escape sequences.

Conclusion

Make sure you are familiar with the escape sequences for strings in Python, especially if you are dealing with regular expressions whose special characters might conflict.  The Java documentation for regular expressions makes this warning a bit more explicit than Python’s:

The string literal “\b”, for example, matches a single backspace character when interpreted as a regular expression, while “\\b” matches a word boundary.

Hopefully this blog post will help others running into this issue.

Advertisements