Archive

Posts Tagged ‘python’

Music hack roundup

March 24, 2014 Leave a comment

Disclaimer: the opinions expressed are my own and do not represent that of my employer, Google

I love music. And programming. I really like pieces of technology that either create new or modify existing pieces of music. Here I detail some of my favorite projects I’ve found in the past few years.

Songsmith

Songsmith is a project from Microsoft Research. From the project description page:

Songsmith generates musical accompaniment to match a singer’s voice. Just choose a musical style, sing into your PC’s microphone, and Songsmith will create backing music for you. Then share your songs with your friends and family, post your songs online, or create your own music videos.

There was a commercial that detailed how the project worked, but what I found really great was what the community did with it. They fed the vocals of famous songs through the software to see what sort of music came out. The results are, ahem, mixed.

First, one that I think sounds pretty interesting – a swing version of Katy Perry’s I Kissed a Girl

Then going into the realm of hilarious:

We Will Rock You – Queen Vs Songsmith

Mortorhead’s Ace of Spades

Nirvana’s In Bloom

Enter Sandman

Roxanne

The Beatle’s A Day in the Life

The Swinger

According to musicmachinery.com’s writeup,

The Swinger is a bit of python code that takes any song and makes it swing.

If you’re not into music theory (or old music) you might not know what constitutes swing. The following video (only need to watch the first 30 seconds or so) is a great example of the difference of straight vs swing styles:

As you can hear, it sounds very different. The first half of each beat is stretched out, and the second half is shrunk down. It sounds even more different when you start listening to familiar songs converted from straight to swing, or vice versa. While most of the links have died, Daft Punk’s Around The World still plays, as does Jefferson Airplane’s White Rabbit.

The source code is available at https://github.com/echonest/remix/blob/master/examples/swinger/swinger.py.

Autocanonizer

From musicmachinery.com’s writeup of The Autocanonizer:

It takes any song and tries to make a canon out of it. A canon is a song that can be played against a copy of itself.

Wikipedia has a bit more information on what exactly a Canon is:

In music, a canon is a contrapuntal compositional technique that employs a melody with one or more imitations of the melody played after a given duration (e.g., quarter rest, one measure, etc.). The initial melody is called the leader (or dux), while the imitative melody, which is played in a different voice, is called the follower (or comes). The follower must imitate the leader, either as an exact replication of its rhythms and intervals or some transformation thereof (see “Types of canon”, below). Repeating canons in which all voices are musically identical are called rounds – “Row, Row, Row Your Boat” and “Frère Jacques” being widely known examples. An example of a classical strict canon is the Minuet of Haydn’s String Quartet in D Minor, Op. 76, No. 2 (White 1976, 66).

With that in mind, here are some example.

My favorite is Adele’s Someone Like You. This one sounds close to a round.

Screenshot from the Autocanonizer

Screenshot from the Autocanonizer

  • Over The Rainbow – starts rough but 30 seconds in it sounds good
  • The Fox – I like it. Lots of self harmonizing. The doubled up chorus actually works. It gets out of sync with itself at some points
  • Take Five – demonstrates that the technique works with odd meter too. Not perfectly lined up at some points

See all the examples available at http://static.echonest.com/autocanonizer/loader.html
Source code available at: https://github.com/plamere/autocanonizer

Hat tip to Greg Linden whose Google+ post alerted me to this, and reminded me of these other projects I’d seen before.

MajorVsMinor

MajorVsMinor is a slight departure from the others I’ve listed because there is a human in the loop – it’s not all algorithmic. From Oleg Berg’s description from olegberg.com

Hello! I am Oleg Berg, a musician from Donetsk, Ukraine. I digitally re-edit famous compositions altering harmonic scale, and I call this experimental music project «Major versus Minor». It may sound surprising and unusual, but it is always interesting. Listen to the music videos below. And please donate to keep the project going
[...]
I by no means intend to enhance the famous music hits as I rework them; they are perfect already. I simply imagine what would it sound like, had the author written it in another mood. And it appears, I succeed in my imaginations.

Again, if you’re not a music nerd you might not know what the difference between major key and minor is. In general picture minor = sad, major = happy. You’ll instantly hear the difference in these versions.

First, a must if you’re an Arrested Development fan.

“Final Countdown in Major key”

My favorite comment from MYxxLxxCHIBI1:

I was literally coming down here to say that myself. GOB finally got accepted to the Alliance of Magicians

Maybe my favorite one -
“Be Worry, Don’t Happy”: Minor Key

I like this one too.
Jingle Bells

“Hey Jude” in minor key

See the whole channel at https://www.youtube.com/user/MajorVsMinor

Since this isn’t a software project per se, there is no link to the source code. According to Asshat8182′s comment on Smells Like Teen Spirit in Major key (with a name like that, he must be a reliable source of information), the way it’s accomplished is

The ‘somehow’ is called Celemony Melodyne. Specifically the DNA function

According to Wikipedia:

Celemony Software GmbH is a German musical software company that specializes in digital audio pitch correction software. It produces Melodyne, an industry standard audio pitch modification tool similar to Auto-Tune

Conclusion

I hope you’ve found this short roundup of music hacks interesting. There are some very creative people out there. If you find what they’re doing interesting, please let them know and/or donate so they’ll keep making great stuff.

Python – “dict comprehension”

February 4, 2013 1 comment

I learned a new way to initialize dictionaries that I had never seen before, and I thought I’d share it with you. I previously blogged about three ways of creating dictionaries in Python. I showed that you can use an iterable of (key, value) tuples to initialize a dict:

professions_dict = dict(zip(names, professions))

Another way along these same lines is to perform the iteration within a list comprehension. Say that we have an iterable of Person objects, each of which has a name and profession. The professions_dict could be created as follows:

professions_dict = dict([(x.name, x.profession) for x in people])

What I didn’t realize is you can skip the tuples and call to dict, and use the comprehension within the dictionary literal itself:

professions_dict = {x.name: x.profession for x in people}

In my opinion this is much cleaner. The syntax and official documentation can be found in 5.2.5. Displays for sets and dictionaries.

Solving Scramble With Friends – a tale of three data structures

May 30, 2012 11 comments

Solving Scramble with Friends – A tale of three data structures

This post aims to illustrate how to solve Scramble With Friends/Boggle and the techniques and data structures necessary. In particular, we will see how to prune an enormous search space, how to use recursion to simplify the task, and the relative performance of three different data structures: an unsorted list, a sorted list with binary search, and a trie.

Problem statement

Given an N x N board, (4×4 in case of Boggle and Scramble with Friends), find all of the paths through the board which create a valid word. These paths must use each tile at most once and be contiguous; tile N must be adjacent to tile N – 1 in any of 8 directions (North, Northeast, East, Southeast, South, Southwest, West, Northwest).

Scramble Screenshot

In this example, ANT is a valid solution in the bottom left corner, but NURSE is not because the Ns are not located next to the U.

Naivest approach

First, imagine that we have a method that, given a path through the board returns all valid words that starts with that path, called DoSolve. For instance, DoSolve(['Qu']) would return all the valid words (and their locations) that start with Qu. DoSolve(['N', 'E']) would return all the valid paths that start with N followed by an E. With such a method, it is trivial to solve the problem. In pseudocode:

Solve(board):
    solutions = empty list
    for each location on board:
        append DoSolve(location) to solutions
    return solutions

The tricky part is, how do we make DoSolve work? Remember that it must

  1. return only valid words
  2. return only words that are formed from contiguous tiles
  3. must not repeat any tiles.

The easiest way to solve this problem is through recursion. As you probably remember, recursion is simply when a method calls itself. In order to avoid an infinite loop (and eventual stack overflow), there must be some sort of stopping criteria. As it was taught to me, you have to be sure that the problem becomes smaller at each step, until it becomes so small as to be trivially solved. Here’s that basic algorithm:

DoSolve(board, tiles_used)
    solutions = empty list

    current word = empty string
    for letter in tiles_used:
        append letter to current word

    # 1 - ensure that the word is valid
    if the current word is a valid word:
        append current word to solutions

    most recent tile = last tile in tiles_used

    # 2 - ensure that the next tile we use is adjacent to the last one
    for each tile adjacent to most recent tile:

        # 3 - ensure that we do not repeat any tiles
        if tile is on the board and tile hasn't been used previously:
            new_path = append tile to copy of tiles_used list
            solutions_starting_from_tile = DoSolve(board, new_path)
            append solutions_starting_from_tile to solutions

    return solutions

This will work, but it suffers from an incredible flaw. Do you see it?

The problem is that this method will waste an inordinate amount of time exhaustively searching the board for solutions even when the current path is completely useless. It will continue to search for words starting with QuR, QuRE, etc., etc., even though no words in the English language start with these letters. The algorithm is still correct, but it can be optimized very simply.

Pruning the search space

Those with some algorithms background might recognize the code above as a modified form of a depth first search. As described previously, a depth first search will exhaustively explore every possible path through the board. We can vastly improve the efficiency of this algorithm by quitting the search early when we know it won’t be fruitful. If we know all of the valid words, we can quit when we know that no word starts with the current string. Thus in the previous example, if the current word built up so far were “QuR”, the algorithm could determine that no words start with QuR and thus fail fast and early. This optimization will not affect the correctness of the algorithm because none of the potential paths we eliminate could possibly lead to a valid word; by constraint #1 this means that none of those paths would have ended up in the list of solutions in the first place.

How much does this save? On random 3×3 boards, The fastest implementation I have is sped up by a factor of 75. Solving 4×4 boards without pruning is infeasible.

Basic Python implementation

Assume as a black box we have two methods IsWord(string) and HasPrefix(string). The pseudocode above can be expressed with Python (forgive the slightly modified parameter list; I found it easier to write it this way):

def DoSolve(self, board, previous_locations, location, word_prefix):
"""Returns iterable of FoundWord objects.

Args:
  previous_locations: a list of already visited locations
  location: the current Location from where to start searching
  word_prefix: the current word built up so far, as a string
"""
solutions = []

new_word = word_prefix + board[location.row][location.col]
previous_locations.append(location)

if PREFIX_PRUNE and not self.HasPrefix(new_word):
  if DEBUG:
    print 'No words found starting with "%s"' %(new_word)
  return solutions

# This is a valid, complete words.
if self.IsWord(new_word):
  new_solution = FoundWord(new_word, previous_locations)
  if DEBUG:
    print 'Found new solution: %s' %(str(new_solution))
  solutions.append(new_solution)

# Recursively search all adjacent tiles
for new_loc in location.Adjacent():
  if board.IsValidLocation(new_loc) and new_loc not in previous_locations:
    # make a copy of the previous locations list so our current list
    # is not affected by this recursive call.
    defensive_copy = list(previous_locations)
    solutions.extend(self.DoSolve(board, defensive_copy, new_loc, new_word))
  else:
    if DEBUG:
      print 'Ignoring %s as it is invalid or already used.' %(str(new_loc))

return solutions

Data structures

The data structure and algorithms used to implement the IsWord and HasPrefix methods are incredibly important.

I will examine three implementations and discuss the performance characteristics of each:

  • Unsorted list
  • Sorted list with binary search
  • Trie

Unsorted list

An unsorted list is a terrible data structure to use for this task. Why? Because it leads to running time proportional to the number of elements in the word list.

class UnsortedListBoardSolver(BoardSolver):
  def __init__(self, valid_words):
    self.valid_words = valid_words

  def HasPrefix(self, prefix):
    for word in self.valid_words:
      if word.startswith(prefix):
        return True
    return False

  def IsWord(self, word):
    return word in self.valid_words

While this is easy to understand and reason about, it is extremely slow, especially for a large dictionary (I used approximately 200k words for testing).

Took 207.697 seconds to solve 1 boards; avg 207.697 with <__main__.UnsortedListBoardSolver object at 0x135f170>

(This time is with the pruning turned on).

Sorted list

Since we know the valid words ahead of time, we can take advantage of this fact and sort the list. With a sorted list, we can perform a binary search and cut our running time from O(n) to O(log n).

Writing a binary search from scratch is very error prone; in his classic work Programming Pearls, Jon Bently claims that fewer than 10% of programmers can implement it correctly. (See blog post for more).

Fortunately, there’s no reason whatsoever to write our own binary search algorithm. Python’s standard library already has an implementation in its bisect module. Following the example given in the module documentation, we get the following implementation:

class SortedListBoardSolver(BoardSolver):
  def __init__(self, valid_words):
    self.valid_words = sorted(valid_words)

  # http://docs.python.org/library/bisect.html#searching-sorted-lists
  def index(self, a, x):
    'Locate the leftmost value exactly equal to x'
    i = bisect.bisect_left(a, x)
    if i != len(a) and a[i] == x:
      return i
    raise ValueError

  def find_ge(self, a, x):
    'Find leftmost item greater than or equal to x'
    i = bisect.bisect_left(a, x)
    if i != len(a):
      return a[i]
    raise ValueError  

  def HasPrefix(self, prefix):
    try:
      word = self.find_ge(self.valid_words, prefix)
      return word.startswith(prefix)
    except ValueError:
      return False

  def IsWord(self, word):
    try:
      self.index(self.valid_words, word)
    except ValueError:
      return False
    return True

Running the same gauntlet of tests, we see that this data structure performs far better than the naive, unsorted list approach.

Took 0.094 seconds to solve 1 boards; avg 0.094 with <__main__.SortedListBoardSolver object at 0x135f1d0>
Took 0.361 seconds to solve 10 boards; avg 0.036 with <__main__.SortedListBoardSolver object at 0x135f1d0>
Took 2.622 seconds to solve 100 boards; avg 0.026 with <__main__.SortedListBoardSolver object at 0x135f1d0>
Took 25.065 seconds to solve 1000 boards; avg 0.025 with <__main__.SortedListBoardSolver object at 0x135f1d0>

Trie

The final data structure I want to illustrate is that of the trie. The Wikipedia article has a lot of great information about it.

Trie image

From Wikipedia:

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

Snip:

Looking up a key of length m takes worst case O(m) time. A BST performs O(log(n)) comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(m log n) time. Moreover, in the worst case log(n) will approach m. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines

Again, I don’t want to reinvent what’s already been done before, so I will be using Jeethu Rao’s implementation of a trie in Python rather than rolling my own.

Here is a demonstration of its API via the interactive prompt:

>>> import trie
>>> t = trie.Trie()
>>> t.add('hell', 1)
>>> t.add('hello', 2)
>>> t.find_full_match('h')
>>> t.find_full_match('hell')
1
>>> t.find_full_match('hello')
2
>>> t.find_prefix_matches('hello')
[2]
>>> t.find_prefix_matches('hell')
[2, 1]

Unfortunately, there’s a bug in his code:

>>> t = trie.Trie()
>>> t.add('hello', 0)
>>> 'fail' in t
False
>>> 'hello' in t
True
# Should be false; 'h' starts a string but
# it is not contained in data structure
>>> 'h' in t
True

His __contains__ method is as follows:

def __contains__( self, s ) :
  if self.find_full_match(s,_SENTINEL) is _SENTINEL :
      return False
  return True

The find_full_match is where the problem lies.

def find_full_match( self, key, fallback=None ) :
  "'
  Returns the value associated with the key if found else, returns fallback
  "'
  r = self._find_prefix_match( key )
  if not r[1] and r[0]:
    return r[0][0]
  return fallback

_find_prefix_match returns a tuple of node that the search terminated on, remainder of the string left to be matched. For instance,

>>> t._find_prefix_match('f')
[[None, {'h': [None, {'e': [None, {'l': [None, {'l': [None, {'o': [0, {}]}]}]}]}]}], 'f']
>>> t.root
[None, {'h': [None, {'e': [None, {'l': [None, {'l': [None, {'o': [0, {}]}]}]}]}]}]

This makes sense, ‘f’ doesn’t start any words in the trie containing just the word ‘hello’, so the root is returned with the ‘f’ string that doesn’t match. find_full_match correctly handles this case, since r[1] = ‘f’, not r[1] = False, and the fallback is returned. That fallback is used by contains to signify that the given string is not in the trie.

The problem is when the string in question starts a valid string but is itself not contained in the trie. As we saw previously, ‘h’ is considered to be in the trie.

>>> r = t._find_prefix_match('h')
>>> r
[[None, {'e': [None, {'l': [None, {'l': [None, {'o': [0, {}]}]}]}]}], "]
>>> r[0]
[None, {'e': [None, {'l': [None, {'l': [None, {'o': [0, {}]}]}]}]}]
>>> r[1]
"
>>> bool(not r[1] and r[0])
True
>>> r[0][0]
# None

The issue is that his code does not check that there is a value stored in the given node. Since no such value has been stored, the code returns None, which is not equal to the SENTINEL_ value that his __contains__ method expects. We can either change findfullmatch to handle this case correctly, or change the __contains__ method to handle the None result as a failure. Let’s modify the find_full_match method to obey it’s implied contract (and be easier to understand):

def find_full_match( self, key, fallback=None ) :
  "'
  Returns the value associated with the key if found else, returns fallback
  "'
  curr_node, remainder = self._find_prefix_match(key)
  stored_value = curr_node[0]
  has_stored_value = stored_value is not None
  if not remainder and has_stored_value:
    return stored_value
  return fallback

Let’s make sure this works:

>>> reload(trie)
<module 'trie' from 'trie.py'>
>>> t = trie.Trie()
>>> t.add('hello', 2)
>>> 'f' in t
False
>>> 'hell' in t
False
>>> 'hello' in t
True

OK, with that minor patch, here’s a first pass implementation of the solution using the trie:

class TrieBoardSolver(BoardSolver):
  def __init__(self, valid_words):
    self.trie = trie.Trie()
    for index, word in enumerate(valid_words):
      # 0 evaluates to False which screws up trie lookups; ensure value is 'truthy'.
      self.trie.add(word, index+1)

  def HasPrefix(self, prefix):
    return len(self.trie.find_prefix_matches(prefix)) > 0

  def IsWord(self, word):
    return word in self.trie

Unfortunately, this is slow. How slow?

Took 2.626 seconds to solve 1 boards; avg 2.626 with <__main__.TrieBoardSolver object at 0x135f070>
Took 22.681 seconds to solve 10 boards; avg 2.268 with <__main__.TrieBoardSolver object at 0x135f070>

While this isn’t as bad as the unsorted list, it’s still orders of magnitudes slower than the binary search implementation.

Why is this slow? Well, it’s doing a whole lot of unnecessary work. For instance, if we want to determine if ‘h’ is a valid prefix, this implementation will first construct the list of all words that start with h, only to have all that work thrown away when we see that the list is not empty.

A much more efficient approach is to cheat a little and use the previously discussed method _find_prefix_match which returns the node in the tree that the search stopped at and how much of the search string was unmatched.

By using this method directly, we can avoid creating the lists of words which we then throw away. We modify the HasPrefix method to the following:

def HasPrefix(self, prefix):
  curr_node, remainder = self.trie._find_prefix_match(prefix)
  return not remainder

With this optmization, the trie performance becomes competitive with the binary search:

Took 0.027 seconds to solve 1 boards; avg 0.027 with <__main__.TrieBoardSolver object at 0x135f230>
Took 0.019 seconds to solve 1 boards; avg 0.019 with <__main__.SortedListBoardSolver object at 0x135f330>
Took 0.199 seconds to solve 10 boards; avg 0.020 with <__main__.TrieBoardSolver object at 0x135f230>
Took 0.198 seconds to solve 10 boards; avg 0.020 with <__main__.SortedListBoardSolver object at 0x135f330>
Took 2.531 seconds to solve 100 boards; avg 0.025 with <__main__.TrieBoardSolver object at 0x135f230>
Took 2.453 seconds to solve 100 boards; avg 0.025 with <__main__.SortedListBoardSolver object at 0x135f330>

It is still slower, but not nearly as bad as before.

Solving the board in the screenshot

With all this machinery in place, we can run the original board in the screenshot through the algorithms:

words = ReadDictionary(sys.argv[1])
print 'Read %d words' %(len(words))

trie_solver = TrieBoardSolver(words)
sorted_list_solver = SortedListBoardSolver(words)
b = Board(4, 4)
b.board = [
    ['l', 'qu', 'r', 'e'],
    ['s', 'l', 'u', 's'],
    ['a', 't', 'i', 'c'],
    ['n', 'r', 'e', 'n']
]
print 'Solving with binary search'
sorted_solutions = sorted_list_solver.Solve(b)
print 'Solving with trie'
trie_solutions = trie_solver.Solve(b)

# Results should be exactly the same
assert sorted_solutions == trie_solutions

for solution in sorted(sorted_solutions):
    print solution
words = sorted(set([s.word for s in sorted_solutions]))
print words

The results appear after the conclusion.

Conclusion

I have presented both pseudocode and Python implementations of an algorithm for solving the classic Boggle/Scramble With Friends game. In the process, we saw such concepts as recursion, depth first search, optimizing code through pruning unfruitful search branches, and the importance of using the right data structure. This post does not aim to be exhaustive; I hope I have piqued your interest for learning more about tries and other lesser known data structures.

For results, the patched trie implementation, and the driver program, see below. To run the driver program, pass it a path to a list of valid words, one per line. e.g.

python2.6 solver.py /usr/share/dict/words

For mobile users, the link can be found at https://gist.github.com/2833942

Hello {planet_name}: Creating strings with dynamic content in Python

May 17, 2012 1 comment

2012-05-16

The ability to create strings with dynamic content is very important in creating applications. Python makes this easy, but it’s not always clear what the correct approach is. Let us go through four ways of creating strings:

  • Implicit concatenation
  • Explicit concatenation
  • String formatting via % operator
  • String formatting via format method

Implicit concatenation

If whitespace alone separates adjacent string literals, they will be concatenated together.

>>> 'hello' 'world'
'helloworld'

This can be useful if you have a long string that you want to break up

>>> a_long_string = ('hello this is a very very very very'
...                  'long string')
>>> print a_long_string
hello this is a very very very verylong string

Implicit concatenation does not work for inserting variables into strings:

>>> name = 'Nick'
>>> 'hello' name
  File "<stdin>", line 1
    'hello' name

In that case, we need to be more explicit.

Explicit concatenation

You can also use the + operator to concatenate strings.

>>> 'hello' + 'world'
'helloworld'

Unlike in Java, types are not automatically coerced; it is an error to attempt to concatenate non-strings to a string:

>>> num_examples = 4
>>> 'I have ' + num_examples + ' examples to discuss.'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
TypeError: cannot concatenate 'str' and 'int' objects

You must explicitly convert non-string types into strings before using this operator.

>>> 'I have ' + str(num_examples) + ' examples to discuss.'
'I have 4 examples to discuss.'

String formatting via % operator

Creating longer strings by concatenating together different substrings works, but it’s not the most elegant solution. It is a common mistake to forget to put a space before or after whatever is inserted in the middle of two other strings, wasting time on the programmer’s part. It is also somewhat hard to scan while reading the source code; the + signs can distract from what the string is trying to say.

If you have used C before, you are probably familiar with printf and sprintf, which allow you to embed special characters within your strings which are subsituted with later arguments. For instance,

printf("hello %s\n", "world");

would produce “hello world”. Python has this feature built in to strings with the % operator. You can read more in depth about the [String Formatting Operations][] and its syntax, but at the very least you should memorize the flags %d (for integer types), %f (for floating point), and %s (for strings). For instance,

>>> 'I have %d things to talk about' %num_topics
'I have 4 things to talk about'

If you have more than one substitution to make, you must surround the substituted values in parentheses:

>>> 'I have %d things to talk about.  Number 1: %s' %(num_topics, 'Implicit concatenation')
'I have 4 things to talk about.  Number 1: Implicit concatenation'

This is a very powerful technique, especially when you learn the formatting flags that this operator supports. For instance,

>>> 'It is %.2f degrees out' %(98.63483)  # display 2 decimal places
'It is 98.63 degrees out'

This technique does not work as well when there are many substitutions to make:

query = """SELECT %s
FROM %s
WHERE %s
GORUP BY %s
ORDER BY %s""" %(columns, table, where_clause, order_by_clause, group_by_clause)

Did you catch the mistake? (Hint – check the order of args in parentheses) It’s easy to make a mistake like this in production. If you have more than 2 or 3 substitutions, I recommend the following technique – string format.

String formatting via format function

Strings have a format function which can also play the part of variable substiution. The benefit of this approach is that each substitution is named, and it is impossible to provide the arguments in the wrong order.

query = """SELECT {columns}
FROM {table}
WHERE {where_clause}
GORUP BY {group_by_clause}
ORDER BY {order_by_clause}""".format(columns = 'sum(num_users) AS actives',
                                    table = '1_day_actives',
                                    where_clause = 'country = "US"',
                                    order_by_clause = 'actives',
                                    # Note the wrong order - it doesn't matter, just as for
                                    # providing keyword arguments
                                    group_by_clause = 'actives')

If you prefer, you can use a dictionary to power the variable replacement.

data_dict = {
    'columns': 'sum(num_users) AS actives',
    'table': '1_day_actives',
    'where_clause': 'country = "US"',
    'order_by_clause': 'actives',
    'group_by_clause': 'actives'
}
query = """SELECT {columns}
FROM {table}
WHERE {where_clause}
GORUP BY {group_by_clause}
ORDER BY {order_by_clause}""".format(**data_dict)

Note the format method is only available in Python 2.6 and newer. As the documentation states,

This method of string formatting is the new standard in Python 3, and should be preferred to the % formatting described in String Formatting Operations in new code.

Hopefully I have shown that this is easier to read and less error prone than the alternatives.

Conclusion

It’s easy to get stuck in a rut and do things exactly the same way, even when alternatives exist. I only just recently learned about the string.format function, and find it preferable to all of the alternatives I have laid out. I hope you also find it useful.

Categories: programming, Python Tags: , ,

Three ways of creating dictionaries in Python

March 30, 2012 5 comments

Dictionaries are the fundamental data structure in Python, and a key tool in any Python programmer’s arsenal. They allow O(1) lookup speed, and have been heavily optimized for memory overhead and lookup speed efficiency.

Today I”m going to show you three ways of constructing a Python dictionary, as well as some additional tips and tricks.

Dictionary literals

Perhaps the most commonly used way of constructing a python dictionary is with curly bracket syntax:

d = {"age":25}

As dictionaries are mutable, you need not know all the entries in advance:

# Empty dict
d = {}
# Fill in the entries one by one
d["age"] = 25

From a list of tuples

You can also construct a dictionary from a list (or any iterable) of key, value pairs. For instance:

d = dict([("age", 25)])

This is perhaps most useful in the context of a list comprehension:

class Person(object):
    def __init__(self, name, profession):
        self.name = name
        self.profession = profession

people = [Person("Nick", "Programmer"), Person("Alice","Engineer")]
professions = dict([ (p.name, p.profession) for p in people ])
>>> print professions
{"Nick": "Programmer", "Alice": "Engineer"}

This is equivalent, though a bit shorter, to the following:

people = [Person("Nick", "Programmer"), Person("Alice","Engineer")]
professions = {}
for p in people:
    professions[p.name] = p.profession

This form of creating a dictionary is good for when you have a dynamic rather than static list of elements.

From two parallel lists

This method of constructing a dictionary is intimately related to the prior example. Say you have two lists of elements, perhaps pulled from a database table:

# Static lists for purpose of illustration
names = ["Nick", "Alice", "Kitty"]
professions = ["Programmer", "Engineer", "Art Therapist"]

If you wished to create a dictionary from name to profession, you could do the following:

professions_dict = {}
for i in range(len(names)):
    professions_dict[names[i]] = professions[i]

This is not ideal, however, as it involves an explicit iterator, and is starting to look like Java. The more Pythonic way to handle this case would be to use the zip method, which combines two iterables:

print zip(range(5), ["a","b","c","d","e"])
[(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]

names_and_professions = zip(names, professions)
print names_and_professions
[("Nick", "Programmer"), ("Alice", "Engineer"), ("Kitty", "Art Therapist")]

for name, profession in names_and_professions:
    professions_dict[name] = profession

As you can see, this is extremely similar to the previous section. You can dispense the iteration, and instead use the dict method:

professions_dict = dict(names_and_professions)
# You can dispence the extra variable and create an anonymous
# zipped list:
professions_dict = dict(zip(names, professions))

Further reading

zip

dict

Categories: Python Tags: , , ,

Python Gotcha: Word boundaries in regular expressions

September 22, 2011 2 comments

TL;DR

Be careful trying to match word boundaries in Python using regular expressions.  You have to be sure to either escape the match sequence or use raw strings.

Word boundaries

Word boundaries are a great way of performing regular expression searches for whole words while avoiding partial matches.  For instance, a search for the regular expression “the” would match both the word “the” and the start of the word “thesaurus”.

>>> import re
>>> re.match("the", "the")
# matches
>>> re.match("the", "thesaurus")
# matches 
In some cases, you might want to match just the word “the” by itself, but not when it’s embedded within another word.

The way to match a word boundary is with ‘\b’, as described in the Python documentation.  I wasted a few minutes wrestling with trying to get this to work.

>>> re.match("\bthe\b", "the")
# no match

It turns out that \b is also used as the backspace control sequence.  Thus in order for the regular expression engine to interpret the word boundary correctly, you need to escape the sequence:

>>> re.match("\\bthe\\b", "the")
# match

You can also use raw string literals and avoid the double backslashes:

>>> re.match(r"\bthe\b", "the")
# match

In case you haven’t seen the raw string prefix before, here is the relevant documentation:

String literals may optionally be prefixed with a letter ‘r’ or ‘R’; such strings are called raw strings and use different rules for interpreting backslash escape sequences.

Conclusion

Make sure you are familiar with the escape sequences for strings in Python, especially if you are dealing with regular expressions whose special characters might conflict.  The Java documentation for regular expressions makes this warning a bit more explicit than Python’s:

The string literal “\b”, for example, matches a single backspace character when interpreted as a regular expression, while “\\b” matches a word boundary.

Hopefully this blog post will help others running into this issue.

Car Talk Puzzler #5: The Perfect Square Dance

August 18, 2011 2 comments

PUZZLER: The Perfect Square Dance!

Sally invited 17 guests to a dance party. She assigned each guest a number from 2 to 18, keeping 1 for herself. The sum of each couple’s numbers was a perfect square. What was the number of Sally’s partner?

Puzzler

The fifth in my ongoing series of solving Car Talk Puzzlers with my programming language of choice.  I’m using Python again, just like last time.

There are a few pieces to this.  The first is, how do we generate a list of all possible pairs that match the perfect square constraint?  With list comprehensions and a helper function this is easy.

def perfect_square(n):
  return math.sqrt(n) == int(math.sqrt(n))

# range creates a list that's exclusive of last number.  Thus to go from 1 to n,
# use range(1, n+1)
guests = range(1, num_guests + 1)
# By enforcing the x# a whole bunch of equivalent pairs (e.g. (3,6) and (6,3)).
pairs = [(x,y) for x in guests for y in guests if x<y and perfect_square(x+y)]
>>> guests
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
>>> pairs
[(1, 3), (1, 8), (1, 15), (2, 7), (2, 14), (3, 6), (3, 13), (4, 5), (4, 12), (5, 11), (6, 10), (7, 9), (7, 18), (8, 17), (9, 16), (10, 15), (11, 14), (12, 13)]
OK great.  At this point I have 18 pairs; 9 of these pairs together form the correct setup.  Now, I could try to be very clever and use logic to deduce the correct pairs.  But I’m lazy and it’s a small problem, so I’m going to use brute force.  I’m just going to take all combinations of 9 pairs and test to see which one uses each guest exactly once.  Not only am I so lazy as to do it this way, I’m not even going to write code to calculate the combinations.  Instead, I’ll use the excellent itertools package.
>>> itertools.combinations([1,2,3,4,5], 2)

# This is a sort of generator object which lazily returns the values as needed.
# To force them to all be evaluated at once, to see how this works,
# wrap the call in a list function.
>>>list(itertools.combinations([1,2,3,4,5],2))
[(1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]

This performs exactly as how we would expect.  Note that I want combinations rather than permutations because the order does not matter.

As I mentioned earlier, I assume that a brute force solution is going to be fast enough.  Let me check my math here.  How many 9-element combinations taken from an 18-element set are there?
N choose K formula

N choose K formula - number of k element combinations from set of n elements

Number of combinations

Yeah, I think we can handle checking 48620 combinations.

OK, but how do we know whether a given combination (9 pairs of numbers) match our criteria?  The easiest way is to check if the number of unique elements is the same as the number of guests; this indicates that each guest is used only once.  First we flatten our nested list of tuples into a single list, and then use a set to check uniqueness:

def flatten(nested):
  """ Flatten one level of nesting.  Returns a generator object
  For instance:
  list(flatten([(1,3),(5,6)])) --> [1,3,5,6] """
  return itertools.chain.from_iterable(nested)

def all_guests_present_once(combination):
  """ Returns whether each guest is present once
  Combination is a list of tuples, e.g. [(1,5),(7,8)]
  """
  flattened = list(flatten(combination))
  return len(set(flattened)) == len(flattened)

>>> all_guests_present_once([(1,3),(4,5)])
True
>>> all_guests_present_once([(1,3),(3,6)])
False
As you can see, I’m shortcutting a little bit and not checking for the perfect square aspect, since we already did that when constructing our set of pairs.

OK we’re ready to throw it all together.


def dance_arrangement(num_guests):
  """
  Returns a valid pairing for all guests if possible, else an empty set
  """
  # Clearly you need an even number of guests to have everyone paired
  if num_guests % 2 == 1:
    return []
  else:
    # range creates a list that's exclusive of last number.  Thus to go from 1 to n,
    # use range(1, n+1)
    guests = range(1, num_guests + 1)
    # By enforcing the x    # a whole bunch of equivalent pairs (e.g. (3,6) and (6,3)).
    pairs = [(x,y) for x in guests for y in guests if x    # brute force search
    all_arrangements = itertools.combinations(pairs, num_guests / 2)
    return filter(all_guests_present_once, all_arrangements)

Running the program with num_guests = 18, we get

[((1, 15), (2, 14), (3, 13), (4, 12), (5, 11), (6, 10), (7, 18), (8, 17), (9, 16))]
Tada!  Thus Sally’s partner was number 15.
I stumbled onto this page which also sought to solve this problem using programming; it ends with the question
“I wonder if 18 is unique as the total number of dancers which has a solution.  It should be easy to modify the program to check”
It is very easy to modify, and I checked.  I ran the program on all number of guests less than 20, and these are the results:
8 [((1, 8), (2, 7), (3, 6), (4, 5))]
14 [((1, 8), (2, 14), (3, 13), (4, 12), (5, 11), (6, 10), (7, 9))]
16 [((1, 8), (2, 7), (3, 6), (4, 5), (9, 16), (10, 15), (11, 14), (12, 13))]
18 [((1, 15), (2, 14), (3, 13), (4, 12), (5, 11), (6, 10), (7, 18), (8, 17), (9, 16))]

As you can see, 8, 14, and 16 guests can also be paired up in this way.  Something to keep in mind the next time you are going to have a party.

Full sourcecode can be found on Github.

__slots__ in Python: Save some space and prevent member variable additions

April 15, 2011 Leave a comment

Today I’m going to be writing about a feature of Python I’d never read before, namely __slots__. In a nutshell, using __slots__ allows you to decrease the memory needed by your classes, as well as prevent unintended assignment to new member variables.

By default, each class has a dictionary which it uses to map from attribute names to the member variable itself. Dictionaries are extremely well designed in Python, yet by their very nature they are somewhat wasteful of space. Why is this? Hash tables strive to minimize collisions by ensuring that the load factor (number of elements/size of internal array) does not get too high. In general hash tables use O(n) space, but with a constant factor nearer to 2 than 1 (again, in order to minimize collisions). For classes with very small numbers of member variables, the overhead might be even greater.

class DictExample:
  def __init__(self):
    self.int_var = 5
    self.list_var = [0,1,2,3,4]
    self.nested_dict = {'a':{'b':2}}

# Note that this extends from 'object'; the __slots__ only has an effect
# on these types of 'new' classes
class SlotsExample(object):
  __slots__ = ('int_var','list_var','nested_dict')

  def __init__(self):
    self.int_var = 5
    self.list_var = [0,1,2,3,4]
    self.nested_dict = {'a':{'b':2}}

# jump to the repl
>>> a = DictExample()
# Here is the dictionary I was talking about.
>>> a.__dict__
{'int_var': 5, 'list_var': [0, 1, 2, 3, 4], 'nested_dict': {'a': {'b': 2}}}
>>> a.x = 5
# We were able to assign a new member variable
>>> a.__dict__
{'x': 5, 'int_var': 5, 'list_var': [0, 1, 2, 3, 4], 'nested_dict': {'a': {'b': 2}}}



>>> b = SlotsExample()
# There is no longer a __dict__ object
>>> b.__dict__
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'SlotsExample' object has no attribute '__dict__'
>>> b.__slots__
('int_var', 'list_var', 'nested_dict')
>>> getattr(b, 'int_var')
5
>>> getattr(a, 'int_var')
5
>>> a.x = 5
# We cannot assign a new member variable; we have declared that there will only
# be member variables whose names appear in the __slots__ iterable
>>> b.x = 5
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'SlotsExample' object has no attribute 'x'

Note that for the __slots__ declaration to have any effect, you must inherit from object (i.e. be a ‘new style class’). Furthermore, if you extend a class with __slots__ defined, you must also declare __slots__ in that child class, or else it will have a dict allocated, obviating the space savings. See this StackOverflow question for more.

This feature was useful to me when using Python to implement a packed binary message format. The specification spells out in exquisite detail how each and every byte over the wire must be sent. By using the __slots__ mechanism, I was able to ensure that the client could not accidentally modify the message classes and add new member variables, which would not be serialized anyways.

Datetimes in Python – gotchas and workarounds

April 9, 2011 Leave a comment

I’ve written previously about working with dates in Java; as I mentioned there it’s very easy to get dates/times incorrect. I feel like I have a fairly good handle on how things work in Java, but today I was faced with learning how to deal with dates/times in Python. It wasn’t an altogether pleasant experience, but I’m going to show what I learned, so hopefully this is of use to you.

The task I was trying to accomplish was to convert between Unix timestamps (seconds/milliseconds since the Epoch) and more user friendly data objects. Creating a datetime is easy:

>>> from datetime import *
>>> datetime.now()
datetime.datetime(2011, 4, 5, 19, 36, 18, 894325)
>>> datetime(2004, 1, 24)
datetime.datetime(2004, 1, 24, 0, 0)

One gotcha to note is that both the month and day fields are 1 based (1 <= month <= 12, 1 <= day <= number of days in the given month and year), whereas in Java, the month field is 0 indexed.

By default, these datetime objects will use a naïve time zone understanding that ignores offsets from UTC/day light savings time. To fix this, you need to implement your own subclass of tzinfo. It’s kind of unfortunate that they don’t make this easier. Here is an example implementation representing the UTC timezone from the previously linked page:

from datetime import tzinfo, timedelta, datetime

ZERO = timedelta(0)
HOUR = timedelta(hours=1)

class UTC(tzinfo):

    def utcoffset(self, dt):
        return ZERO

    def tzname(self, dt):
        return "UTC"

    def dst(self, dt):
        return ZERO

OK that’s not the end of the world. Now assuming we’ve created out datetime object correctly, how do we retrieve its corresponding Unix style timestamp? Let’s look at the available methods.

>>> [x for x in dir(datetime) if not x.startswith("__")]
["astimezone", "combine", "ctime", "date", "day", "dst", "fromordinal", "fromtimestamp", "hour", "isocalendar", "isoformat", "isoweekday", "max", "microsecond", 
"min", "minute", "month", "now", "replace", "resolution", "second", "strftime", "strptime", "time", "timetuple", "timetz", "today", "toordinal", "tzinfo", "tznam
e", "utcfromtimestamp", "utcnow", "utcoffset", "utctimetuple", "weekday", "year"]

Well, there’s a bunch of methods that convert from a timestamp to a datetime object. But going back the other direction is a little harder. After digging, I found a way to do so:

>>> from datetime import datetime
>>> from time import mktime
>>> dt = datetime(2008, 5, 1, 13, 35, 41, 567777)
>>> seconds = mktime(dt.timetuple())
>>> seconds += (dt.microsecond / 1000000.0)
>>> seconds
1209663341.5677769
>>> 
>>> dt2 = datetime.fromtimestamp(seconds)
>>> dt == dt2
True

Well, there you have it. To convert from a unix timestamp to a datetime object, use datetime.fromtimestamp. To convert the other direction, use time.mktime(datetime_instance.timetuple()). I wish that the library authors had seen fit to maintain symmetry (i.e. datetime should implement a totimestamp method), but fortunately there is an easy workaround. The last thing to note if you’re used to Java is that the timestamps in Python measure seconds from the epoch, as opposed to Java which deals in milliseconds from the epoch.

Java annoyances

January 28, 2011 Leave a comment
Having had Java as the programming language of the vast majority of my undergraduate courses, as well as the language I program in every day, I am most comfortable and fluent in it.  When I return to Java after using different languages such as AWKPython, or Ruby, I’m always left with a bitter taste in my mouth.  There are some things Java just makes way too hard, verbose, and painful to accomplish.  It’s for that reason that I’m learning Scala, what could be (simplistically) described as a cleaned up, more succint version of Java.

Asymmetry in standard libraries

Symmetry is an important feature of a library; it basically means that methods come in pairs.  For instance, you’d expect that a class with a read method has a write method, or one with a set method has a get method.  (That’s not always the case, certainly, but API writers often strive for symmetry.  See Practical API Design: Confessions of a Java Framework Architect) As another example, there’s both a method to convert from an array to a list (Arrays.asList) and there is a method to go the other direction (List.toArray()).  Unfortunately, not all of the Java library APIs adhere to this convention.  The one that bothers me the most is in the String library.  There is a String split method that breaks a String up around a given regular expression, but no corresponding method to reconstitute a String from a collection of other String objects, with a specified separator between them.  This leads to code like the following to comma separate a collection of strings:
String[] strings = ...;
StringBuilder b = new StringBuilder();
for (int i = 0; i < strings.length; i++) {
 b.append(strings[i]);
 if (i != strings.length -1) {
 b.append(",");
 }
}
System.out.println(b.toString());
This whole mess could be replaced with one line of Python code
print(",".join(strings))
or in Scala:
println(strings.mkString(","))
It’s pretty sad that you have to either write that ugly mess, or turn to something like Apache StringUtils.

Different treatment of primitives and objects

It is a lot harder to deal with variable length collections of primitive types than it should be.  This is because you cannot create collections out of things that are not objects.  You can create them out of the boxed primitive type wrappers, but then you have to iterate through and convert back into the primitive types.

In other words, you can create a List<Double> but you cannot create a List<double>.  This leads to code like the following:
// Need a double[] but don't know how long it's going to be
List<Double> doubles = new LinkedList<Double>();
for (...) {
 doubles.append(theComputedValue);
}
// Option 1: Use for loop and iterate over Double list, converting to primitive values through
// auto unboxing
// Bad: leads to O(n^2) running time with LinkedList
double[] doubleArray = new double[doubles.size()];
for (int i = 0; i < doubleArray.length; i++) {
 doubleArray[i] = doubles.get(i);
}
// Option 2: Use enhanced for loop syntax (described below), along with
// additional index variable.
// Better performance but extraneous index value hanging around
int index = 0;
// Automatic unboxing
for (double d : doubles) {
 doubleArray[index++] = d;
}
...
b[index] // Oops
As I blogged about previously, there is a library called Apache Commons Primitives that can be used for variable sized lists for primitive types, but it is a shame one has to turn to third party libraries for such a common task.

Patchwork iteration support

Java 5 introduced the “Enhanced for loop” syntax which allows you to replace
Collection<String> strings = new ArrayList<String>();
Iterator<String> it = strings.iterator();
while (it.hasNext()) {
 String theString = it.next();
}
with the much simpler
for (String s : strings) {
 // deal with the String
}
Here’s the rub: this syntax is supported for arrays and Iterable objects.  But guess what?  Iterators are not Iterable.  Why is this a problem?  Well, you might want to return read-only iterators to your data.  If you do this, the client code cannot use the enhanced for loop syntax, and is stuck with the earlier hasNext() code.  If you want to use the enhanced for loop syntax to work for Iterators, you need to introduce a wrapper around the Iterator which implements the Iterable interface.  From the previously linked blog post:
class IterableIterator<T> implements  Iterable<T> {
    private Iterator<T> iter;
    public IterableIterator(Iterator<T> iter) {
        this.iter = iter;
    }
    // Fulfill the Iterable interface
    public Iterator<T> iterator() {
        return iter;
    }
}
I hope this strikes you as inelegant as well.
Furthermore, arrays are not iterable either, despite the fact that you can use the enhanced for loop syntax with them.
What this all boils down to is that there’s no great way to accept an Iterable collection of objects.  If you accept an Iterable<E>, you close yourself off to arrays and iterators.  You’d have to convert the arrays to a suitable collection type by using the Arrays.asList method.  It would be great if we could treat arrays, collections, etc., agnostically when all we want to do is iterate over their elements.

Lack of type inference for constructors with generics

Yes, we all know we should program to an interface rather than to a specific implementation; doing so will allow our code to be much more flexible and easily changed later.  Furthermore, we also know we should use generics for our collections rather than raw collections of objects; this allows us to catch typing errors before they occur.  So in other words

// BAD: Raw hashmap and programming to the implementation!
HashMap b = new HashMap();
// Good
Map<String, Integer> wordCounts = new HashMap<String, Integer>();
In fact, this lack of type inference is one reason why Joshua Bloch suggests that static factory methods can be better that constructors – it is possible to have a static factory method that can infer the correct types and instantiate the object, without making you explicitly repeat the type parameters.  For instance, Google Guava provides many static methods to instantiate maps:
Map<String, Integer> wordCounts = Maps.newHashMap();
Fortunately, the problem of having to repeat type parameters twice for constructors is being fixed in JDK 7 with something called the Diamond Operator.  It will allow you to replace
Map<String, Integer> wordCounts = new HashMap<String, Integer>();
with
Map<String, Integer> wordCounts = new HashMap<>();
This improvement to the language can’t come fast enough.

Conclusion

I use Java on a daily basis for about 90% of the work I need to do.  I’m comfortable with it, I understand its syntax, it’s fast, it’s powerful.  After being exposed to languages like python and scala, certain issues in Java stand out in stark contrast, and thus I’ve enumerated a few of the reasons that Java annoys me on a daily basis.  Fortunately excellent libraries exist to correct many of the annoyances, but it’s painful to have to use them to do such basic things as joining a list of Strings with a given separator character, or creating a variable sized list of primitive types. Fortunately Java continues to evolve, and at least some of my irritations will be fixed in JDK 7.
Post in the comments if you have better workarounds than those that I’ve suggested, you have other languages that make these tasks easy and would like to highlight them, or any other reason you can think of.