Archive

Archive for April, 2011

New lines in XML attributes

April 26, 2011 Leave a comment

If you have an attribute in xml that spans multiple lines, e.g.

 q2="2
B"

you might expect the newline literal to be encoded in the resulting string when the attribute is parsed. Instead, the above example will be parsed as “2 B”, at least with Java’s SAX parser implementation. In order to have the new line literal included, you should insert the entity & #10; instead (this entity keeps getting eaten by wordpress, so ignore the space) This StackOverflow answer by Tomalak gives some more insight:

Bottom line is, the value string is saved verbatim. You get out what you put in, no need to interfere.
However… some implementations are not compliant. For example, they will encode & characters in attribute values, but forget about newline characters or tabs. This puts you in a losing position since you can’t simply replace newlines with
beforehand.

Upon parsing such a document, literal newlines in attributes are normalized into a single space (again, in accordance to the spec) – and thus they are lost.
Saving (and retaining!) newlines in attributes is impossible in these implementations.

Advertisements
Categories: Java Tags: , ,

__slots__ in Python: Save some space and prevent member variable additions

April 15, 2011 Leave a comment

Today I’m going to be writing about a feature of Python I’d never read before, namely __slots__. In a nutshell, using __slots__ allows you to decrease the memory needed by your classes, as well as prevent unintended assignment to new member variables.

By default, each class has a dictionary which it uses to map from attribute names to the member variable itself. Dictionaries are extremely well designed in Python, yet by their very nature they are somewhat wasteful of space. Why is this? Hash tables strive to minimize collisions by ensuring that the load factor (number of elements/size of internal array) does not get too high. In general hash tables use O(n) space, but with a constant factor nearer to 2 than 1 (again, in order to minimize collisions). For classes with very small numbers of member variables, the overhead might be even greater.

class DictExample:
  def __init__(self):
    self.int_var = 5
    self.list_var = [0,1,2,3,4]
    self.nested_dict = {'a':{'b':2}}

# Note that this extends from 'object'; the __slots__ only has an effect
# on these types of 'new' classes
class SlotsExample(object):
  __slots__ = ('int_var','list_var','nested_dict')

  def __init__(self):
    self.int_var = 5
    self.list_var = [0,1,2,3,4]
    self.nested_dict = {'a':{'b':2}}

# jump to the repl
>>> a = DictExample()
# Here is the dictionary I was talking about.
>>> a.__dict__
{'int_var': 5, 'list_var': [0, 1, 2, 3, 4], 'nested_dict': {'a': {'b': 2}}}
>>> a.x = 5
# We were able to assign a new member variable
>>> a.__dict__
{'x': 5, 'int_var': 5, 'list_var': [0, 1, 2, 3, 4], 'nested_dict': {'a': {'b': 2}}}



>>> b = SlotsExample()
# There is no longer a __dict__ object
>>> b.__dict__
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'SlotsExample' object has no attribute '__dict__'
>>> b.__slots__
('int_var', 'list_var', 'nested_dict')
>>> getattr(b, 'int_var')
5
>>> getattr(a, 'int_var')
5
>>> a.x = 5
# We cannot assign a new member variable; we have declared that there will only
# be member variables whose names appear in the __slots__ iterable
>>> b.x = 5
Traceback (most recent call last):
  File "<input>", line 1, in <module>
AttributeError: 'SlotsExample' object has no attribute 'x'

Note that for the __slots__ declaration to have any effect, you must inherit from object (i.e. be a ‘new style class’). Furthermore, if you extend a class with __slots__ defined, you must also declare __slots__ in that child class, or else it will have a dict allocated, obviating the space savings. See this StackOverflow question for more.

This feature was useful to me when using Python to implement a packed binary message format. The specification spells out in exquisite detail how each and every byte over the wire must be sent. By using the __slots__ mechanism, I was able to ensure that the client could not accidentally modify the message classes and add new member variables, which would not be serialized anyways.

TextMate Grammar Editing Tip – “Edit in TextMate”

April 12, 2011 Leave a comment

I wrote previously about creating language grammars in TextMate and I’ve been doing a bit more of this lately. One thing that makes this process a lot less painful is following the advice from the official Textmate book and installing the “Edit in TextMate” bundle.  Do this by going to Bundles->TextMate->Install “Edit in TextMate”, and follow the instructions.  After rebooting TextMate, you can press ⌃⌘E while within the Edit Grammar file to open a live copy of the document in a syntax highlighted textmate window.  Every time you hit Save, the changes are pushed back to the unstyled document pane.  This drastically speeds up development, as you no longer have to copy and paste text between the windows, but instead can hit save any time you want to try your changes out.

Live updating, syntax highlighted language grammar

Live updating, syntax highlighted language grammar

Datetimes in Python – gotchas and workarounds

April 9, 2011 Leave a comment

I’ve written previously about working with dates in Java; as I mentioned there it’s very easy to get dates/times incorrect. I feel like I have a fairly good handle on how things work in Java, but today I was faced with learning how to deal with dates/times in Python. It wasn’t an altogether pleasant experience, but I’m going to show what I learned, so hopefully this is of use to you.

The task I was trying to accomplish was to convert between Unix timestamps (seconds/milliseconds since the Epoch) and more user friendly data objects. Creating a datetime is easy:

>>> from datetime import *
>>> datetime.now()
datetime.datetime(2011, 4, 5, 19, 36, 18, 894325)
>>> datetime(2004, 1, 24)
datetime.datetime(2004, 1, 24, 0, 0)

One gotcha to note is that both the month and day fields are 1 based (1 <= month <= 12, 1 <= day <= number of days in the given month and year), whereas in Java, the month field is 0 indexed.

By default, these datetime objects will use a naïve time zone understanding that ignores offsets from UTC/day light savings time. To fix this, you need to implement your own subclass of tzinfo. It’s kind of unfortunate that they don’t make this easier. Here is an example implementation representing the UTC timezone from the previously linked page:

from datetime import tzinfo, timedelta, datetime

ZERO = timedelta(0)
HOUR = timedelta(hours=1)

class UTC(tzinfo):

    def utcoffset(self, dt):
        return ZERO

    def tzname(self, dt):
        return "UTC"

    def dst(self, dt):
        return ZERO

OK that’s not the end of the world. Now assuming we’ve created out datetime object correctly, how do we retrieve its corresponding Unix style timestamp? Let’s look at the available methods.

>>> [x for x in dir(datetime) if not x.startswith("__")]
["astimezone", "combine", "ctime", "date", "day", "dst", "fromordinal", "fromtimestamp", "hour", "isocalendar", "isoformat", "isoweekday", "max", "microsecond", 
"min", "minute", "month", "now", "replace", "resolution", "second", "strftime", "strptime", "time", "timetuple", "timetz", "today", "toordinal", "tzinfo", "tznam
e", "utcfromtimestamp", "utcnow", "utcoffset", "utctimetuple", "weekday", "year"]

Well, there’s a bunch of methods that convert from a timestamp to a datetime object. But going back the other direction is a little harder. After digging, I found a way to do so:

>>> from datetime import datetime
>>> from time import mktime
>>> dt = datetime(2008, 5, 1, 13, 35, 41, 567777)
>>> seconds = mktime(dt.timetuple())
>>> seconds += (dt.microsecond / 1000000.0)
>>> seconds
1209663341.5677769
>>> 
>>> dt2 = datetime.fromtimestamp(seconds)
>>> dt == dt2
True

Well, there you have it. To convert from a unix timestamp to a datetime object, use datetime.fromtimestamp. To convert the other direction, use time.mktime(datetime_instance.timetuple()). I wish that the library authors had seen fit to maintain symmetry (i.e. datetime should implement a totimestamp method), but fortunately there is an easy workaround. The last thing to note if you’re used to Java is that the timestamps in Python measure seconds from the epoch, as opposed to Java which deals in milliseconds from the epoch.

WordPress Stats April Fool’s

April 1, 2011 3 comments

WordPress Stats gag

While not as flashy as some other April Fool’s day pranks, WordPress definitely got me for a second.

Categories: Uncategorized Tags: ,

Gmail Motion

April 1, 2011 Leave a comment

To send, lick a stamp and place it on your knee

Very amusing video

Categories: Uncategorized Tags: ,