
Python Gotcha: Word boundaries in regular expressions


TL;DR

Be careful when matching word boundaries with regular expressions in Python. In an ordinary string literal, “\b” is a backspace character, so you have to either escape the backslash (“\\b”) or use a raw string (r“\b”).

Word boundaries

Word boundaries are a great way of performing regular expression searches for whole words while avoiding partial matches.  For instance, a search for the regular expression “the” would match both the word “the” and the start of the word “thesaurus”.

>>> import re
>>> re.match("the", "the")
# matches
>>> re.match("the", "thesaurus")
# matches

In some cases, you might want to match just the word “the” by itself, but not when it’s embedded within another word.

The way to match a word boundary is with ‘\b’, as described in the Python documentation.  I wasted a few minutes trying to get this to work.

>>> re.match("\bthe\b", "the")
# no match

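It turns out that in a Python string literal, \b is also the escape sequence for the backspace character.  Python converts it before the regular expression engine ever sees the pattern, so the pattern above asks for a backspace, the letters “the”, and another backspace.  For the engine to receive a word boundary, you need to escape the backslash: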

>>> re.match("\\bthe\\b", "the")
# match
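To see why the unescaped version fails, it can help to look at the characters that actually reach the regular expression engine. The following is a small sketch; the variable names `plain` and `escaped` are only for illustration.

```python
import re

plain = "\bthe\b"      # each \b becomes a single backspace character (0x08)
escaped = "\\bthe\\b"  # each \\ becomes one backslash, so the engine sees \b

print(len(plain), len(escaped))  # 5 7
print(repr(plain[0]))            # '\x08'

# The unescaped pattern only matches text that really contains backspaces.
print(re.match(plain, "the"))            # None
print(re.match(plain, "\x08the\x08"))    # a match object

# The escaped pattern matches the word itself.
print(re.match(escaped, "the"))          # a match object
```

Printing `repr()` of a pattern is a quick way to check what the engine will receive when a regular expression behaves unexpectedly.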

You can also use raw string literals and avoid the double backslashes:

>>> re.match(r"\bthe\b", "the")
# match
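As a slightly larger illustration, here is a sketch that compares a plain pattern with a word-boundary pattern on a longer string. The sample text and the helper `count_whole_word` are made up for this example and are not part of the `re` module.

```python
import re

text = "the thesaurus and the other theme"

# Without boundaries, every occurrence of the letters "the" is found,
# including the ones inside "thesaurus", "other" and "theme".
partial = re.findall(r"the", text)

# With boundaries, only the standalone word is found.
whole = re.findall(r"\bthe\b", text)

print(len(partial))  # 5
print(whole)         # ['the', 'the']


def count_whole_word(word, text):
    """Hypothetical helper: count standalone occurrences of `word`.

    Assumes `word` starts and ends with a word character; re.escape
    keeps any special characters in it from being read as regex syntax.
    """
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text))


print(count_whole_word("the", text))  # 2
```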

In case you haven’t seen the raw string prefix before, here is the relevant documentation:

String literals may optionally be prefixed with a letter ‘r’ or ‘R’; such strings are called raw strings and use different rules for interpreting backslash escape sequences.
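A short sketch of what those different rules mean in practice: in a raw string the backslash stays in the string as an ordinary character, so the raw and the double-backslash spellings produce the same pattern. The variable names here are only illustrative.

```python
import re

raw = r"\bthe\b"
escaped = "\\bthe\\b"

# Both spellings produce exactly the same seven characters.
print(raw == escaped)  # True

# The same rule applies to other escapes: a raw "\n" is a backslash
# followed by the letter n, not a newline character.
print(len(r"\n"), len("\n"))  # 2 1

# The compiled pattern keeps the text it was given.
print(re.compile(raw).pattern == escaped)  # True
```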

Conclusion

Make sure you are familiar with the escape sequences for strings in Python, especially if you are dealing with regular expressions whose special characters might conflict.  The Java documentation for regular expressions makes this warning a bit more explicit than Python’s:

The string literal “\b”, for example, matches a single backspace character when interpreted as a regular expression, while “\\b” matches a word boundary.

Hopefully this blog post will help others running into this issue.

  1. CL
    March 26, 2013 at 10:12 am

    Well, this is kinda old, but I thought I’d comment anyways: when writing RE patterns in Python, the recommendation is to always use raw strings: r’\bthe\b’ works fine.

    • Nicholas Dunn
      March 26, 2013 at 10:54 am

      Thanks for the info – wish I had known that when I wrote this post

  2. Green
    May 12, 2020 at 12:19 am

    Thank you! This was really helpful.
