Archive

Archive for the ‘Python’ Category

Python Gotcha: Word boundaries in regular expressions

September 22, 2011 2 comments

TL;DR

Be careful trying to match word boundaries in Python using regular expressions.  You have to be sure to either escape the match sequence or use raw strings.

Word boundaries

Word boundaries are a great way of performing regular expression searches for whole words while avoiding partial matches.  For instance, a search for the regular expression “the” would match both the word “the” and the start of the word “thesaurus”.

>>> import re
>>> re.match("the", "the")
# matches
>>> re.match("the", "thesaurus")
# matches 
In some cases, you might want to match just the word “the” by itself, but not when it’s embedded within another word.

The way to match a word boundary is with ‘\b’, as described in the Python documentation.  I wasted a few minutes wrestling with trying to get this to work.

>>> re.match("\bthe\b", "the")
# no match

It turns out that \b is also used as the backspace control sequence.  Thus in order for the regular expression engine to interpret the word boundary correctly, you need to escape the sequence:

>>> re.match("\\bthe\\b", "the")
# match

You can also use raw string literals and avoid the double backslashes:

>>> re.match(r"\bthe\b", "the")
# match

In case you haven’t seen the raw string prefix before, here is the relevant documentation:

String literals may optionally be prefixed with a letter ‘r’ or ‘R’; such strings are called raw strings and use different rules for interpreting backslash escape sequences.

Conclusion

Make sure you are familiar with the escape sequences for strings in Python, especially if you are dealing with regular expressions whose special characters might conflict.  The Java documentation for regular expressions makes this warning a bit more explicit than Python’s:

The string literal “\b”, for example, matches a single backspace character when interpreted as a regular expression, while “\\b” matches a word boundary.

Hopefully this blog post will help others running into this issue.

Car Talk Puzzler #5: The Perfect Square Dance

August 18, 2011 2 comments

PUZZLER: The Perfect Square Dance!

Sally invited 17 guests to a dance party. She assigned each guest a number from 2 to 18, keeping 1 for herself. The sum of each couple’s numbers was a perfect square. What was the number of Sally’s partner?

Puzzler

The fifth in my ongoing series of solving Car Talk Puzzlers with my programming language of choice.  I’m using Python again, just like last time.

There are a few pieces to this.  The first is, how do we generate a list of all possible pairs that match the perfect square constraint?  With list comprehensions and a helper function this is easy.

def perfect_square(n):
  return math.sqrt(n) == int(math.sqrt(n))

# range creates a list that's exclusive of last number.  Thus to go from 1 to n,
# use range(1, n+1)
guests = range(1, num_guests + 1)
# By enforcing the x# a whole bunch of equivalent pairs (e.g. (3,6) and (6,3)).
pairs = [(x,y) for x in guests for y in guests if x<y and perfect_square(x+y)]
>>> guests
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
>>> pairs
[(1, 3), (1, 8), (1, 15), (2, 7), (2, 14), (3, 6), (3, 13), (4, 5), (4, 12), (5, 11), (6, 10), (7, 9), (7, 18), (8, 17), (9, 16), (10, 15), (11, 14), (12, 13)]
OK great.  At this point I have 18 pairs; 9 of these pairs together form the correct setup.  Now, I could try to be very clever and use logic to deduce the correct pairs.  But I’m lazy and it’s a small problem, so I’m going to use brute force.  I’m just going to take all combinations of 9 pairs and test to see which one uses each guest exactly once.  Not only am I so lazy as to do it this way, I’m not even going to write code to calculate the combinations.  Instead, I’ll use the excellent itertools package.
>>> itertools.combinations([1,2,3,4,5], 2)

# This is a sort of generator object which lazily returns the values as needed.
# To force them to all be evaluated at once, to see how this works,
# wrap the call in a list function.
>>>list(itertools.combinations([1,2,3,4,5],2))
[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

This performs exactly as how we would expect.  Note that I want combinations rather than permutations because the order does not matter.

As I mentioned earlier, I assume that a brute force solution is going to be fast enough.  Let me check my math here.  How many 9-element combinations taken from an 18-element set are there?
N choose K formula

N choose K formula - number of k element combinations from set of n elements

Number of combinations

Yeah, I think we can handle checking 48620 combinations.

OK, but how do we know whether a given combination (9 pairs of numbers) match our criteria?  The easiest way is to check if the number of unique elements is the same as the number of guests; this indicates that each guest is used only once.  First we flatten our nested list of tuples into a single list, and then use a set to check uniqueness:

def flatten(nested):
  """ Flatten one level of nesting.  Returns a generator object
  For instance:
  list(flatten([(1,3),(5,6)])) --> [1,3,5,6] """
  return itertools.chain.from_iterable(nested)

def all_guests_present_once(combination):
  """ Returns whether each guest is present once
  Combination is a list of tuples, e.g. [(1,5),(7,8)]
  """
  flattened = list(flatten(combination))
  return len(set(flattened)) == len(flattened)

>>> all_guests_present_once([(1,3),(4,5)])
True
>>> all_guests_present_once([(1,3),(3,6)])
False
As you can see, I’m shortcutting a little bit and not checking for the perfect square aspect, since we already did that when constructing our set of pairs.

OK we’re ready to throw it all together.


def dance_arrangement(num_guests):
  """
  Returns a valid pairing for all guests if possible, else an empty set
  """
  # Clearly you need an even number of guests to have everyone paired
  if num_guests % 2 == 1:
    return []
  else:
    # range creates a list that's exclusive of last number.  Thus to go from 1 to n,
    # use range(1, n+1)
    guests = range(1, num_guests + 1)
    # By enforcing the x    # a whole bunch of equivalent pairs (e.g. (3,6) and (6,3)).
    pairs = [(x,y) for x in guests for y in guests if x    # brute force search
    all_arrangements = itertools.combinations(pairs, num_guests / 2)
    return filter(all_guests_present_once, all_arrangements)

Running the program with num_guests = 18, we get

[((1, 15), (2, 14), (3, 13), (4, 12), (5, 11), (6, 10), (7, 18), (8, 17), (9, 16))]
Tada!  Thus Sally’s partner was number 15.
I stumbled onto this page which also sought to solve this problem using programming; it ends with the question
“I wonder if 18 is unique as the total number of dancers which has a solution.  It should be easy to modify the program to check”
It is very easy to modify, and I checked.  I ran the program on all number of guests less than 20, and these are the results:
8 [((1, 8), (2, 7), (3, 6), (4, 5))]
14 [((1, 8), (2, 14), (3, 13), (4, 12), (5, 11), (6, 10), (7, 9))]
16 [((1, 8), (2, 7), (3, 6), (4, 5), (9, 16), (10, 15), (11, 14), (12, 13))]
18 [((1, 15), (2, 14), (3, 13), (4, 12), (5, 11), (6, 10), (7, 18), (8, 17), (9, 16))]

As you can see, 8, 14, and 16 guests can also be paired up in this way.  Something to keep in mind the next time you are going to have a party.

Full sourcecode can be found on Github.

__slots__ in Python: Save some space and prevent member variable additions

April 15, 2011 Leave a comment

Today I’m going to be writing about a feature of Python I’d never read before, namely __slots__. In a nutshell, using __slots__ allows you to decrease the memory needed by your classes, as well as prevent unintended assignment to new member variables.

By default, each class has a dictionary which it uses to map from attribute names to the member variable itself. Dictionaries are extremely well designed in Python, yet by their very nature they are somewhat wasteful of space. Why is this? Hash tables strive to minimize collisions by ensuring that the load factor (number of elements/size of internal array) does not get too high. In general hash tables use O(n) space, but with a constant factor nearer to 2 than 1 (again, in order to minimize collisions). For classes with very small numbers of member variables, the overhead might be even greater.

class DictExample:
  def __init__(self):
    self.int_var = 5
    self.list_var = [0,1,2,3,4]
    self.nested_dict = {'a':{'b':2}}

# Note that this extends from 'object'; the __slots__ only has an effect
# on these types of 'new' classes
class SlotsExample(object):
  __slots__ = ('int_var','list_var','nested_dict')

  def __init__(self):
    self.int_var = 5
    self.list_var = [0,1,2,3,4]
    self.nested_dict = {'a':{'b':2}}

# jump to the repl
>>> a = DictExample()
# Here is the dictionary I was talking about.
>>> a.__dict__
{'int_var': 5, 'list_var': [0, 1, 2, 3, 4], 'nested_dict': {'a': {'b': 2}}}
>>> a.x = 5
# We were able to assign a new member variable
>>> a.__dict__
{'x': 5, 'int_var': 5, 'list_var': [0, 1, 2, 3, 4], 'nested_dict': {'a': {'b': 2}}}



>>> b = SlotsExample()
# There is no longer a __dict__ object
>>> b.__dict__
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'SlotsExample' object has no attribute '__dict__'
>>> b.__slots__
('int_var', 'list_var', 'nested_dict')
>>> getattr(b, 'int_var')
5
>>> getattr(a, 'int_var')
5
>>> a.x = 5
# We cannot assign a new member variable; we have declared that there will only
# be member variables whose names appear in the __slots__ iterable
>>> b.x = 5
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'SlotsExample' object has no attribute 'x'

Note that for the __slots__ declaration to have any effect, you must inherit from object (i.e. be a ‘new style class’). Furthermore, if you extend a class with __slots__ defined, you must also declare __slots__ in that child class, or else it will have a dict allocated, obviating the space savings. See this StackOverflow question for more.

This feature was useful to me when using Python to implement a packed binary message format. The specification spells out in exquisite detail how each and every byte over the wire must be sent. By using the __slots__ mechanism, I was able to ensure that the client could not accidentally modify the message classes and add new member variables, which would not be serialized anyways.

Datetimes in Python – gotchas and workarounds

April 9, 2011 Leave a comment

I’ve written previously about working with dates in Java; as I mentioned there it’s very easy to get dates/times incorrect. I feel like I have a fairly good handle on how things work in Java, but today I was faced with learning how to deal with dates/times in Python. It wasn’t an altogether pleasant experience, but I’m going to show what I learned, so hopefully this is of use to you.

The task I was trying to accomplish was to convert between Unix timestamps (seconds/milliseconds since the Epoch) and more user friendly data objects. Creating a datetime is easy:

>>> from datetime import *
>>> datetime.now()
datetime.datetime(2011, 4, 5, 19, 36, 18, 894325)
>>> datetime(2004, 1, 24)
datetime.datetime(2004, 1, 24, 0, 0)

One gotcha to note is that both the month and day fields are 1 based (1 <= month <= 12, 1 <= day <= number of days in the given month and year), whereas in Java, the month field is 0 indexed.

By default, these datetime objects will use a naïve time zone understanding that ignores offsets from UTC/day light savings time. To fix this, you need to implement your own subclass of tzinfo. It’s kind of unfortunate that they don’t make this easier. Here is an example implementation representing the UTC timezone from the previously linked page:

from datetime import tzinfo, timedelta, datetime

ZERO = timedelta(0)
HOUR = timedelta(hours=1)

class UTC(tzinfo):

    def utcoffset(self, dt):
        return ZERO

    def tzname(self, dt):
        return "UTC"

    def dst(self, dt):
        return ZERO

OK that’s not the end of the world. Now assuming we’ve created out datetime object correctly, how do we retrieve its corresponding Unix style timestamp? Let’s look at the available methods.

>>> [x for x in dir(datetime) if not x.startswith("__")]
["astimezone", "combine", "ctime", "date", "day", "dst", "fromordinal", "fromtimestamp", "hour", "isocalendar", "isoformat", "isoweekday", "max", "microsecond", 
"min", "minute", "month", "now", "replace", "resolution", "second", "strftime", "strptime", "time", "timetuple", "timetz", "today", "toordinal", "tzinfo", "tznam
e", "utcfromtimestamp", "utcnow", "utcoffset", "utctimetuple", "weekday", "year"]

Well, there’s a bunch of methods that convert from a timestamp to a datetime object. But going back the other direction is a little harder. After digging, I found a way to do so:

>>> from datetime import datetime
>>> from time import mktime
>>> dt = datetime(2008, 5, 1, 13, 35, 41, 567777)
>>> seconds = mktime(dt.timetuple())
>>> seconds += (dt.microsecond / 1000000.0)
>>> seconds
1209663341.5677769
>>> 
>>> dt2 = datetime.fromtimestamp(seconds)
>>> dt == dt2
True

Well, there you have it. To convert from a unix timestamp to a datetime object, use datetime.fromtimestamp. To convert the other direction, use time.mktime(datetime_instance.timetuple()). I wish that the library authors had seen fit to maintain symmetry (i.e. datetime should implement a totimestamp method), but fortunately there is an easy workaround. The last thing to note if you’re used to Java is that the timestamps in Python measure seconds from the epoch, as opposed to Java which deals in milliseconds from the epoch.

Car Talk Puzzler #4: Flipping Ages

October 26, 2010 2 comments

RAY: This was sent in many weeks ago by Wendy Gladstone, and as usual I tweaked it a little bit.

She writes: “Recently I had a visit with my mom and we realized that the two digits that make up my age when reversed resulted in her age. For example, if she’s 73, I’m 37. We wondered how often this has happened over the years but we got sidetracked with other topics and we never came up with an answer.

“When I got home I figured out that the digits of our ages have been reversible six times so far. I also figured out that if we’re lucky it would happen again in a few years, and if we’re really lucky it would happen one more time after that. In other words, it would have happened 8 times over all. So the question is, how old am I now?”

Question
Answer

Here’s the fourth in my Car Talk Puzzler series; today I’m going to be using Python because it’s my current favorite language, and because it’s well suited to filtering, mapping, etc.  I won’t put too much commentary here.

#  Find all the ages such that the second age is the reverse of the first  age.  Don't worry that there are a lot of impossibilities; we'll fix it  through filtering
# Note that [::-1] is the slice operator that says iterate backwards through the string; this effectively reverses the list.
 matching_ages = map(lambda x:(x, int(str(x)[::-1])), range(0,100))
 matching_ages
 # OUT: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7),  (8, 8), (9, 9), (10, 1), (11, 11), (12, 21), (13, 31), (14, 41), (15,  51), (16, 61), (17, 71), (18, 81), (19, 91), (20, 2), (21, 12), (22,  22), (23, 32), (24, 42), (25, 52), (26, 62), (27, 72), (28, 82), (29,  92), (30, 3), (31, 13), (32, 23), (33, 33), (34, 43), (35, 53), (36,  63), (37, 73), (38, 83), (39, 93), (40, 4), (41, 14), (42, 24), (43,  34), (44, 44), (45, 54), (46, 64), (47, 74), (48, 84), (49, 94), (50,  5), (51, 15), (52, 25), (53, 35), (54, 45), (55, 55), (56, 65), (57,  75), (58, 85), (59, 95), (60, 6), (61, 16), (62, 26), (63, 36), (64,  46), (65, 56), (66, 66), (67, 76), (68, 86), (69, 96), (70, 7), (71,  17), (72, 27), (73, 37), (74, 47), (75, 57), (76, 67), (77, 77), (78,  87), (79, 97), (80, 8), (81, 18), (82, 28), (83, 38), (84, 48), (85,  58), (86, 68), (87, 78), (88, 88), (89, 98), (90, 9), (91, 19), (92,  29), (93, 39), (94, 49), (95, 59), (96, 69), (97, 79), (98, 89), (99,  99)]

# Here we filter by only allowing matches in which the  mother's age is greater than that of the child.  Note the use of a  lambda expression, basically an anonymous function.
filtered1 = filter(lambda (mother,child):mother > child, matching_ages)
 filtered1
 # OUT: [(10, 1), (20, 2), (21, 12), (30, 3), (31, 13), (32, 23), (40,  4), (41, 14), (42, 24), (43, 34), (50, 5), (51, 15), (52, 25), (53, 35),  (54, 45), (60, 6), (61, 16), (62, 26), (63, 36), (64, 46), (65, 56),  (70, 7), (71, 17), (72, 27), (73, 37), (74, 47), (75, 57), (76, 67),  (80, 8), (81, 18), (82, 28), (83, 38), (84, 48), (85, 58), (86, 68),  (87, 78), (90, 9), (91, 19), (92, 29), (93, 39), (94, 49), (95, 59),  (96, 69), (97, 79), (98, 89)]

# Assume that the mother was at least 15 when she had the kid, and no more than 60
filtered2 = filter(lambda(mother, child):mother-child >= 15 and mother-child < 60, filtered1)
 filtered2
 # OUT: [(20, 2), (30, 3), (31, 13), (40, 4), (41, 14), (42, 24), (50,  5), (51, 15), (52, 25), (53, 35), (60, 6), (61, 16), (62, 26), (63, 36),  (64, 46), (71, 17), (72, 27), (73, 37), (74, 47), (75, 57), (82, 28),  (83, 38), (84, 48), (85, 58), (86, 68), (93, 39), (94, 49), (95, 59),  (96, 69), (97, 79)]
 len(filtered2)
 # OUT: 30

# Create a new list comprised of the differences in age between mother and child
 age_diff = map(lambda(mother,child):mother-child, filtered2)
 age_diff
 # OUT: [18, 27, 18, 36, 27, 18, 45, 36, 27, 18, 54, 45, 36, 27, 18, 54, 45, 36, 27, 18, 54, 45, 36, 27, 18]
 sorted(age_diff)
 # OUT: [18, 18, 18, 18, 18, 18, 18, 27, 27, 27, 27, 27, 27, 36, 36, 36, 36, 36, 45, 45, 45, 45, 54, 54, 54]

# The puzzler states that it's will happen a total of 8 times; that matches the age difference of 18 years

 filter(lambda(mother,child):mother-child == 18, filtered3)
# OUT: [(20, 2), (31, 13), (42, 24), (53, 35), (64, 46), (75, 57), (86, 68), (97, 79)]

Thus the mother is currently 75 years old and the daughter is 57.  Tada

 

bpython – an excellent interpreter for python

October 12, 2010 1 comment

If you use Python, you know that its interactive shell is a great way to test out ideas and iterate quickly.  Unfortunately, the basic interactive shell is very barebones – there is no syntax highlighting, autocompletion, or any of the features we come to expect from working in IDEs.  Fortunately if you’re on a Unix system, there is a great program called bpython which adds all of those missing features.

 

As you are typing, suggestions appear at the bottom. Press tab to take the suggestion

 

If you have easy_install, it’s the simplest thing in the world to install:

sudo easy_install bpython

I can’t recommend this product enough.  It’s free, so what’re you waiting for?

Categories: Python, unix Tags: , , , ,

Python Gotcha #1: Default arguments and mutable data structures

August 23, 2010 2 comments

I ran into an issue in Python the other day that I thought others might find instructive. The problem involves default arguments and mutable data structures.

Named arguments

In Python, you can enumerate the arguments to a method explicitly, rather than simply by putting the arguments in the order expected by the method. For instance

>>> def printit(a,b,c):
... print a,b,c
...
# Calling the method with the arguments in the expected order; 1 is assigned to a, 2 to b, 3 to c
>>> printit(1,2,3)
1 2 3
# Explicitly assigning the values to the different argument variables. Now you can call the method however you want.
>>> printint(c='c',b='b',a='a')
a b c

Default Arguments

Another feature of Python not present in Java is that of default arguments

>>> def defaulted(a,b='b',c='c'):
... print a,b,c
...
>>> defaulted(1,2,3)
1 2 3
>>> defaulted(1,2)
1 2 c
>>> defaulted(1)
1 b c
# This will fail because I do not explicitly define what a is
>>> defaulted(b=6)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: defaulted() takes at least 1 non-keyword argument (0 given)

These default arguments are very succinct and a nice feature. In Java, the best way to emulate this feature is through method overloading. From the linked article:

public void printFavorites(Color favoriteColor, String favoritePhrase, int favoriteNumber) {
     System.out.println("Favorite color: " + favoriteColor);
     System.out.println("Favorite phrase: " + favoritePhrase);
     System.out.println("Favorite number: " + favoriteNumber);
}

public void printFavorites(Color favoriteColor, String favoritePhrase) {
    printFavorites(favoriteColor, favoritePhrase, 0);
}

public void printFavorites(Color favoriteColor) {
    printFavorites(favoriteColor, "The best phrase evar", 0);
}

This increases the amount of boilerplate code, and is not quite as flexible as the python version – in Python you could define a default argument for each variable and then the user could choose which if any to override.

>>> def printFavorites(faveColor="Blue", favePhrase="This is awesome",faveNumber=5):
... print "Favorite color: ", faveColor
... print "Favorite phrase: ", favePhrase
... print "Favorite number: ", faveNumber
>>> printFavorites()
Favorite color: Blue
Favorite phrase: This is awesome
Favorite number: 5
>>> printFavorites(favePhrase="Just the phrase")
Favorite color: Blue
Favorite phrase: Just the phrase
Favorite number: 5
>>> printFavorites(faveColor="orange")
Favorite color: orange
Favorite phrase: This is awesome
Favorite number: 5

Regardless, it’s usually not too much of a hurdle in Java. But given that it’s such a nice feature in Python, I wanted to make use of it.

Real world example: Graph traversal

Let’s examine how to represent directed (potentially cyclic) graphs in Python, and how to do a depth first search of all paths between two
First let’s create a graph, and then show the code to find all the paths between two nodes.

Graph representation

In Python we can represent a graph as a nested dictionary:

g = {'A': ['B', 'C'],
'B': ['C', 'D'],
'C': ['A']}

Better visualized as

(Side note: this graph was created using the DOT language, something I plan to post more about in the future.)

The nested dictionary structure defined above maps nodes to their children (nodes in this case are simply represented as single letters, but they could be arbitrarily complex).

Search

I mentioned that the graphs might be cyclic; if we do not detect a cycle, the code would similarly loop until running out of memory. Thus we need to keep track of the path we’ve already explored to avoid cycling.

Here is code to find all the paths between two nodes (slightly modified from http://www.python.org/doc/essays/graphs.html; I will highlight exactly what I changed later in the article).

def find_all_paths(graph, start, end, path=[]):
  path.append([start])
  if start == end:
    return [path]
  if not graph.has_key(start):
    return []
  paths = []
  for node in graph[start]:
    if node not in path:
      # recursive call
      newpaths = find_all_paths(graph, node, end, path)
      for newpath in newpaths:
        paths.append(newpath)
  return paths

Let’s test:

>>> graph.find_all_paths(g, 'A','B')
[['A', 'B']]
>>> graph.find_all_paths(g, 'A','C')
[['A', 'B', 'C'], ['A', 'C']]

Both of these are correct; the path between A and B is just A and B but between A and C we have two potential paths – either the two hops from A to B to C, or directly from A to C. The code I presented contains a bug. What happens when I run the queries again?

>>> graph.find_all_paths(g, 'A','B')
[]
>>> graph.find_all_paths(g, 'A','C')
[]

Huh? What’s going on. It worked fine the first time, but subsequent method calls return the wrong result!

Solution

The problem, I discovered with a little bit of searching, is spelled out in the Python documents.

Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes. For example, the following function accumulates the arguments passed to it on subsequent calls:

def f(a, L=[]):
  L.append(a)
  return L

print f(1)
print f(2)
print f(3)

This will print

[1]
[1, 2]
[1, 2, 3]

If you don’t want the default to be shared between subsequent calls, you can write the function like this instead:

def f(a, L=None):
  if L is None:
    L = []
  L.append(a)
  return L

I do not like the workaround presented in the docs, since it obfuscates what the argument is supposed to be. (compare visitedNodes=[] vs visitedNodes=None). There is another workaround for this, and it’s the approach taken originally on the site where the graph traversal code came from:

def find_all_paths(graph, start, end, path=[]):
path = path + [start]
# my buggy version had path.append(start)
...

Fortunately the Python docs are excellent and I was able to find the solution without too much trouble. Python programmers need to be aware that their default arguments are only evaluated once, especially when working with potentially mutable data structures like dictionaries and lists. Programmers must take care to realize that the following two calls are NOT equivalent, especially with respect to objects contained in default arguments.

exploredNodes.append("x")
exploredNodes = exploredNodes + ["x"]
Categories: Java, Python Tags: , , , , ,