Python Gotcha: Word boundaries in regular expressions

September 22, 2011 i82much

TL;DR

Be careful when matching word boundaries with regular expressions in Python. You must either escape the backslash in \b or write the pattern as a raw string.

Word boundaries

Word boundaries are a great way of performing regular expression searches for whole words while avoiding partial matches. For instance, a search for the regular expression “the” would match both the word “the” and the start of the word “thesaurus”.

>>> import re
>>> re.match("the", "the")
# matches
>>> re.match("the", "thesaurus")
# matches

In some cases, you might want to match just the word “the” by itself, but not when it’s embedded within another word.

The way to match a word boundary is with ‘\b’, as described in the Python documentation. I wasted a few minutes trying to get this to work.

>>> re.match("\bthe\b", "the")
# no match

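It turns out that in an ordinary Python string literal, \b is the escape sequence for the backspace character (ASCII 0x08). The pattern above therefore reaches the regular expression engine as backspace, “the”, backspace, and never as a word boundary. For the engine to see the word boundary, you need to escape the backslash: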

>>> re.match("\\bthe\\b", "the")
# match
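
To see why the first attempt fails, it can help to look at the pattern strings before they ever reach the re module. The following is a minimal sketch; the variable names are only for illustration and are not from the original examples.

```python
import re

plain = "\bthe\b"      # "\b" in a normal string literal is one backspace character (0x08)
escaped = "\\bthe\\b"  # "\\b" is a backslash followed by the letter b

print(repr(plain))     # '\x08the\x08'
print(repr(escaped))   # '\\bthe\\b'
print(len(plain), len(escaped))  # 5 7

# Only the escaped pattern reaches the regex engine as a word boundary.
plain_result = re.match(plain, "the")
escaped_result = re.match(escaped, "the")
print(plain_result)                # None
print(escaped_result is not None)  # True
```

Printing the repr of a pattern is a quick way to check what the regular expression engine will actually receive.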

You can also use raw string literals and avoid the double backslashes:

>>> re.match(r"\bthe\b", "the")
# match
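
As a rough illustration of what the word boundary buys you in practice, here is a small sketch that searches a whole sentence. The sample sentence and the variable names are made up for this example.

```python
import re

text = "the thesaurus lists other words for the word bathe"
whole_word = re.compile(r"\bthe\b")

# Without boundaries, "the" is also found inside "thesaurus", "other" and "bathe".
loose_hits = re.findall("the", text)
# With boundaries, only the standalone word is found.
strict_hits = [m.start() for m in whole_word.finditer(text)]

print(len(loose_hits))  # 5
print(strict_hits)      # [0, 36]
```

Note that this sketch uses findall and finditer rather than re.match, because re.match only looks at the start of the string.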

In case you haven’t seen the raw string prefix before, here is the relevant documentation:

String literals may optionally be prefixed with a letter ‘r’ or ‘R’; such strings are called raw strings and use different rules for interpreting backslash escape sequences.
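
A short sketch of what those different rules mean: in a raw string the backslash stays in the string as an ordinary character instead of starting an escape sequence. The variable names here are only illustrative.

```python
newline = "\n"       # one character: a line feed
raw_newline = r"\n"  # two characters: a backslash and the letter n

print(len(newline), len(raw_newline))  # 1 2
print(raw_newline == "\\n")            # True
print(r"\bthe\b" == "\\bthe\\b")       # True
```

The last line shows that the raw string and the double-backslash version are the same pattern; the raw form is just easier to read.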

Conclusion

Make sure you are familiar with the escape sequences for strings in Python, especially if you are dealing with regular expressions, whose special sequences can collide with them. The Java documentation for regular expressions makes this warning a bit more explicit than Python’s:

The string literal “\b”, for example, matches a single backspace character when interpreted as a regular expression, while “\\b” matches a word boundary.
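
Python behaves the same way as the Java documentation describes. The sketch below, with a made-up test string, shows the unescaped pattern matching real backspace characters while the raw pattern matches a word boundary.

```python
import re

backspace_text = "\x08the\x08"  # "the" surrounded by real backspace characters

# The unescaped pattern contains literal backspaces, so it matches them literally.
literal_hit = re.match("\bthe\b", backspace_text)
# The raw pattern asks for word boundaries; the backspaces count as non-word characters.
boundary_hit = re.search(r"\bthe\b", backspace_text)

print(literal_hit.group() == backspace_text)  # True
print(boundary_hit.group())                   # the
```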

Hopefully this blog post will help others running into this issue.

Categories: Java, programming, Python Tags: gotcha, java, python, regexp, regular expression, workaround


CL

March 26, 2013 at 10:12 am


Well, this is kinda old, but I thought I’d comment anyways: when writing RE patterns in Python, the recommendation is to always use raw strings: r'\bthe\b' works fine.
- Nicholas Dunn
  
  March 26, 2013 at 10:54 am
  
  
  Thanks for the info – wish I had known that when I wrote this post
Green

May 12, 2020 at 12:19 am


Thank you! This was really helpful.
