Python Gotcha: Word boundaries in regular expressions
TL;DR
Be careful when matching word boundaries in Python regular expressions. You have to either escape the backslash in \b or use a raw string.
Word boundaries
Word boundaries are a great way of performing regular expression searches for whole words while avoiding partial matches. For instance, a search for the regular expression “the” would match both the word “the” and the start of the word “thesaurus”.
>>> import re
>>> re.match("the", "the") # matches
>>> re.match("the", "thesaurus") # matches
The way to match a word boundary is with ‘\b’, as described in the Python documentation. I wasted a few minutes wrestling with this before I got it to work.
>>> re.match("\bthe\b", "the") # no match
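It turns out that in an ordinary Python string literal, \b is the escape sequence for the backspace character. The regular expression engine therefore never sees a backslash followed by “b”; it receives a backspace and looks for a literal backspace in the input. For the engine to interpret the sequence as a word boundary, you need to escape the backslash: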
>>> re.match("\\bthe\\b", "the") # match
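To see why the unescaped version fails, it can help to look at what each string literal actually contains. The following is a minimal sketch; the variable names plain and escaped are made up for illustration.

```python
import re

plain = "\bthe\b"      # ordinary literal: each \b becomes a backspace (U+0008)
escaped = "\\bthe\\b"  # escaped backslashes: the regex engine receives \bthe\b

print(len(plain))              # 5 characters: backspace, t, h, e, backspace
print(len(escaped))            # 7 characters: \, b, t, h, e, \, b
print(plain == "\x08the\x08")  # True

# The unescaped pattern is still a valid regex; it just looks for literal backspaces.
print(re.match(plain, "the"))                      # None
print(re.match(plain, "\x08the\x08") is not None)  # True
print(re.match(escaped, "the") is not None)        # True
```

The unescaped pattern raises no error, which is what makes this mistake easy to miss.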
You can also use raw string literals and avoid the double backslashes:
>>> re.match(r"\bthe\b", "the") # match
In case you haven’t seen the raw string prefix before, here is the relevant documentation:
String literals may optionally be prefixed with a letter ‘r’ or ‘R’; such strings are called raw strings and use different rules for interpreting backslash escape sequences.
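As a rough illustration of the raw string rule in practice, the sketch below checks that the raw and escaped spellings produce the same pattern, then uses it on a made-up sample sentence. The names word and text are hypothetical.

```python
import re

# A raw literal keeps the backslash, so both spellings are the same string.
print(r"\bthe\b" == "\\bthe\\b")   # True

word = re.compile(r"\bthe\b")
text = "the thesaurus bathes the other"  # made-up sample sentence

# Only the two standalone words match the bounded pattern.
print(word.findall(text))               # ['the', 'the']

# Without boundaries, "the" also matches inside thesaurus, bathes, and other.
print(len(re.findall("the", text)))     # 5
```

re.findall is used here instead of re.match because re.match only tries to match at the start of the string.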
Conclusion
Make sure you are familiar with the escape sequences for strings in Python, especially if you are dealing with regular expressions whose special characters might conflict. The Java documentation for regular expressions makes this warning a bit more explicit than Python’s:
The string literal “\b”, for example, matches a single backspace character when interpreted as a regular expression, while “\\b” matches a word boundary.
Hopefully this blog post will help others running into this issue.
Well, this is kinda old, but I thought I’d comment anyways: when writing RE patterns in Python, the recommendation is to always use raw strings: r'\bthe\b' works fine.
Thanks for the info – wish I had known that when I wrote this post
Thank you! This was really helpful.