Python Gotcha: Word boundaries in regular expressions
Be careful trying to match word boundaries in Python using regular expressions. You have to be sure to either escape the match sequence or use raw strings.
Word boundaries are a great way of performing regular expression searches for whole words while avoiding partial matches. For instance, a search for the regular expression “the” would match both the word “the” and the start of the word “thesaurus”.
>>> import re >>> re.match("the", "the") # matches >>> re.match("the", "thesaurus") # matches
The way to match a word boundary is with ‘\b’, as described in the Python documentation. I wasted a few minutes wrestling with trying to get this to work.
>>> re.match("\bthe\b", "the") # no match
It turns out that \b is also used as the backspace control sequence. Thus in order for the regular expression engine to interpret the word boundary correctly, you need to escape the sequence:
>>> re.match("\\bthe\\b", "the") # match
You can also use raw string literals and avoid the double backslashes:
>>> re.match(r"\bthe\b", "the") # match
In case you haven’t seen the raw string prefix before, here is the relevant documentation:
String literals may optionally be prefixed with a letter ‘r’ or ‘R’; such strings are called raw strings and use different rules for interpreting backslash escape sequences.
Make sure you are familiar with the escape sequences for strings in Python, especially if you are dealing with regular expressions whose special characters might conflict. The Java documentation for regular expressions makes this warning a bit more explicit than Python’s:
The string literal “\b”, for example, matches a single backspace character when interpreted as a regular expression, while “\\b” matches a word boundary.
Hopefully this blog post will help others running into this issue.