Perils of late binding in Python
Thursday, April 30 2009
So I haven’t been doing a lot of Python recently, and I got tripped up by something that in retrospect should have been obvious.
You can write code that references an undefined thingy, and Python won’t complain until you actually run the code and try to access the thingy.
Eg:
>>> def f():
...     z()
...
>>> f()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in f
NameError: global name 'z' is not defined
Which kind of sucks when you forgot to write a test for that code, and you get a runtime exception.
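The point is that the name z is looked up when the body of f runs, not when f is defined. A minimal sketch of both halves of that behaviour, written for Python 3 rather than the Python 2 session above (the names `failed` and `result` are only there for illustration):

```python
def f():
    # z is looked up in the global namespace only when this line executes
    return z()

# Defining f raised nothing; calling it before z exists does.
try:
    f()
    failed = False
except NameError:
    failed = True

# Bind z afterwards: the very same function object now works.
def z():
    return "late-bound"

result = f()
```

A static checker such as pyflakes reports undefined names without running the code, which helps on the days when the test never got written.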
Tags: python
Tuesday, March 03 2009
Had a burst of hacking over the weekend, and one of the outcomes was the realisation that I have a few practices that could be usefully put into a template for new scripts.
So: here is my current starting point for any new script.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from optparse import OptionParser

def _test():
    import doctest
    doctest.testmod()

def _profile_main(filename):
    import cProfile, pstats
    prof = cProfile.Profile()
    ctx = """_main(filename)"""
    prof = prof.runctx(ctx, globals(), locals())
    stats = pstats.Stats(prof)
    stats.sort_stats("time")
    stats.print_stats(10)

def _blurt(s):
    # silent by default; replaced by really_blurt when --verbose or --test is given
    pass

def _main(filename):
    pass

if __name__ == "__main__":
    usage = "usage: %prog [options]"
    parser = OptionParser(usage=usage)
    parser.add_option('--profile', '-P',
                      help = "Print out profiling stats",
                      action = 'store_true')
    parser.add_option('--test', '-t',
                      help ='Run doctests',
                      action = 'store_true')
    parser.add_option('--verbose', '-v',
                      help ='print debugging output',
                      action = 'store_true')
    (options, args) = parser.parse_args()
    # assign non-flag arguments here
    # filename = args[0]
    def really_blurt(s):
        print s
    if options.verbose:
        _blurt = really_blurt
    if options.profile:
        _profile_main(filename)
        exit()
    if options.test:
        _blurt = really_blurt
        _test()
        exit()
    # filename must be assigned above before this will run
    _main(filename)
Tags: python
Kiwibank’s KeepSafe feature, and ETAOIN SHRDLU
Friday, January 30 2009
Kiwibank have added a new step to their login process, called KeepSafe.
In this step, the user has already chosen a small set of questions and supplied the answers, like “Where were you born?” or “What’s your pet’s name?” When they log in, they are prompted with the questions and asked to select letters at random positions in the answer (e.g. the 1st and 5th letters).
The aim is to defeat keyloggers. The user uses their mouse to select letters from a display of the alphabet, and they never type the whole answer, so an attacker who logged mouse clicks would have to capture multiple logins.
My guess is that password-stealing malware is common enough now that it poses a significant risk to banks.
Unfortunately for users, this system is quite inconvenient. It involves an unaccustomed degree of mental and physical dexterity to select the correct letters. It is also inaccessible to people with text-only browsers, or who have JavaScript turned off (ironically, the very people least likely to be vulnerable to malware).
A friend suggested that their Keepsafe answer would be “Keepsafe is bloody annoying”. This inspired me. I realise now that the savvier user will set all their Keepsafe answers to AAAAAAAAAAAA.
I also wonder whether it wouldn’t be reasonably easy to guess Keepsafe answers. If I were a wily hacker, I’d use my dictionary to compile stats of the most common letters in English words, by word length and position in the word. Let’s see.
#!/usr/bin/python
import string

f = file('/usr/share/dict/words')
counts = [{'all':0},{'all':0},{'all':0},{'all':0},{'all':0},{'all':0}]
# snag all 6 letter words (7 characters including the newline)
for line in [l.lower().strip() for l in f.readlines() if len(l) == 7]:
    for i in range(6):
        # count the letters in position i
        letter = line[i]
        counts[i][letter] = counts[i].get(letter, 0) + 1
        # keep a total so we can compute a percentage easily
        counts[i]['all'] = counts[i]['all'] + 1
for pos in range(6):
    print "Position %d" % (pos + 1)
    tops = {}
    for letter in string.lowercase:
        # integer division under Python 2, hence the whole-number percentages
        tops[letter] = counts[pos].get(letter,0)*100/counts[pos]['all']
    # take the top nine most frequent letters
    for pair in sorted(tops.iteritems(), key=lambda(k,v):(v,k), reverse=True)[0:9]:
        print "%s %02.2f%%" % (pair[0], pair[1]),
    # end the row so each position's letters sit on their own line
    print
Results:
Position 1
s 11.00% c 7.00% b 7.00% p 6.00% m 6.00% t 5.00% r 5.00% d 5.00% a 5.00%
Position 2
a 18.00% o 15.00% e 13.00% i 10.00% u 9.00% r 7.00% l 5.00% n 3.00% h 3.00%
Position 3
r 10.00% a 9.00% n 8.00% l 7.00% s 6.00% o 6.00% i 6.00% t 5.00% e 5.00%
Position 4
i 10.00% e 10.00% t 8.00% a 7.00% n 6.00% l 6.00% o 5.00% s 4.00% r 4.00%
Position 5
e 27.00% n 7.00% l 6.00% a 5.00% t 4.00% r 4.00% o 4.00% i 4.00% u 2.00%
Position 6
s 36.00% d 11.00% e 9.00% r 8.00% y 6.00% n 5.00% t 4.00% g 3.00% a 3.00%
The distribution of letters is quite skewed, and you get three goes with Keepsafe, so a patient intruder could probably guess a substantial minority of answers.
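To put rough numbers on that, here is a small sketch (Python 3) that works only from the percentages printed above. It assumes an attacker gets three guesses, that the answer is a six-letter dictionary word, and that letters in different positions are independent. None of this is verified against how Keepsafe really behaves, and the function names are made up for the example.

```python
from itertools import product

# Top three letter frequencies (percent) per position, copied from the results above.
TOP3 = {
    1: [11, 7, 7],
    2: [18, 15, 13],
    3: [10, 9, 8],
    4: [10, 10, 8],
    5: [27, 7, 6],
    6: [36, 11, 9],
}

def single_position_chance(pos):
    """Percent chance that one of three guesses hits the letter at pos."""
    return sum(TOP3[pos])

def pair_chance(pos_a, pos_b, goes=3):
    """Chance (as a fraction) of guessing both letters within `goes` tries,
    treating the two positions as independent (an assumption)."""
    joint = sorted(
        (a / 100.0) * (b / 100.0)
        for a, b in product(TOP3[pos_a], TOP3[pos_b])
    )
    # The best strategy is to try the most likely combinations first.
    return sum(joint[-goes:])
```

On these figures `single_position_chance(6)` is 56, while `pair_chance(1, 5)` comes out at about 0.0675, so the odds depend heavily on which positions are asked for.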
I’m not sure what the end of this arms race will be.
Tags: security ~ kiwibank ~ python
Painless html parsing with lxml
Wednesday, January 14 2009
I am working on a Ruminator 2.0. I intend to parse full stories, not just the summaries that appear in RSS.
So I’ve been investigating my options for HTML parsing. There are quite a few options for Python, with varying degrees of speed, flexibility, and tolerance for broken markup.
After a rapturous writeup from Ian Bicking, I thought I’d try lxml, which is a Pythonic wrapper around GNOME’s libxml2 and libxslt libraries. I’m sold. You can even use CSS selectors, just like jQuery! (I like not having too much loaded into my head at once).
Suppose you want to scrape a news story (for statistical analysis, not copyright infringement) from the NZ Herald:
>>> from lxml.html import parse
>>> doc = parse('http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=10551829&ref=rss&pnum=0').getroot()
>>> paras = doc.cssselect('div.article-holder p')
>>> for p in paras:
...     print p.text_content()
Easy peasy.
Tags: python ~ lxml ~ the ruminator
Perl vs Python, minor things i
Thursday, April 10 2008
Python has the rather nifty enumerate for those times when you want to iterate over a sequence with an index:
>>> l = ['spam', 'eggs', 'spam']
>>> for (i, j) in enumerate(l):
...     print i, j
...
0 spam
1 eggs
2 spam
I’m enjoying reawakening those dormant brain cells where the Perl lives, but I miss these little niceties.
Tags: python ~ perl
Microbes reduce brain efficiency
Wednesday, March 19 2008
That function below would have been better as:
def sortunique(l):
    new = []
    for i in l:
        if i not in new:
            new.append(i)
    return new
I dunno what I was thinking.
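For long lists the `i not in new` test scans the whole result each time. A possible variation (Python 3, and only valid when the items are hashable, which the version above does not require) keeps a set alongside the list. The name `unique_in_order` is made up for this sketch.

```python
def unique_in_order(items):
    """Drop repeats while keeping the first occurrence of each item in order."""
    seen = set()      # fast membership test; items must be hashable
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

words = ['spam', 'eggs', 'spam', 'ham', 'eggs']
deduped = unique_in_order(words)
```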
Tags: python ~ code
Tuesday, March 18 2008
Something I need to do when building up collocations and phrases is remove doubles from a list, but preserve the sequence.
def sortunique(l):
    if l == []:
        return l
    new = []
    last = l[0]
    new.append(last)
    for i in l[1:]:
        if i not in new:
            new.append(i)
            last = i
    return new
Sunday, March 16 2008
I have been ignoring my burgeoning cold today and working on the Ruminator, teaching it to identify phrases in text.
This has been unexpectedly easy, because I googled up a paper that lays out a technique which has proved very effective. I feel I should write the authors a thank you note.
Their technique is to identify pairs of words with a high mutual information statistic, and then to do a second pass through the corpus to try and find words to the left and right of the pair that might also be part of the phrase. They suggest only testing pairs where at least one word is capitalised.
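The post does not say exactly which statistic the paper uses, so the following is only a sketch of the first pass under one common reading: pointwise mutual information (PMI) over adjacent word pairs, restricted to pairs where at least one word is capitalised. It is written for Python 3, and the function name, the `min_count` cut-off and the toy text are all invented for illustration.

```python
import math
from collections import Counter

def pair_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)
    scores = {}
    for (left, right), count in pair_counts.items():
        if count < min_count:
            continue
        # Only test pairs where at least one word is capitalised.
        if not (left[:1].isupper() or right[:1].isupper()):
            continue
        p_pair = count / n_pairs
        p_left = word_counts[left] / n_words
        p_right = word_counts[right] / n_words
        # PMI: how much more often the pair occurs than chance would predict.
        scores[(left, right)] = math.log2(p_pair / (p_left * p_right))
    return scores

tokens = ("the Prime Minister said the Prime Minister will meet "
          "the press and the press said the Minister was late").split()
scores = pair_scores(tokens)
best_pair = max(scores, key=scores.get)
```

In this toy text ("Prime", "Minister") scores higher than ("the", "Prime"), because "the" turns up next to plenty of other words.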
Bugger me, but it works well.
Here’s a little chunk of output from my New Zealand news corpus:
The initial pair
['Coast', 'District']
It appears 48 times
48
These are the words that appear to the left in the corpus.
{'Union': 1, 'issued': 1, 'Workers': 1, 'executive': 1, 'soon': 1, 'Under': 1, 'chair': 1, 'announcement': 1, 'death': 1, 'workers': 1, 'winner': 1, 'troubled': 2, 'two-month-long': 1, 'over': 2, 'Both': 1, 'government': 1, 'assau': 1, 'birth': 2, 'not': 1, '50': 1, 'Clinic': 1, 'crisis-stricken': 2, 'says': 1, 'picket': 1, 'Organisation': 1, 'Disability': 2, 'Gisborne': 1, 'year': 2, 'laboratory': 1, 'embattled': 2, 'for': 4, 'has': 2, '11m': 2, 'state': 1, 'patient': 1, 'siege': 1, 'met': 1, 'address': 1, 'by': 2, 'on': 1, 'about': 2, 'her': 2, 'of': 3, 'products': 1, 'action': 1, 'footsteps': 1, 'raised': 1, 'industrial': 1, 'Cup': 1, 'into': 1, 'alleged': 1, 'suspended': 1, 'crisis': 1, 'impressed': 1, 'given': 1, 'from': 1, 'Monday': 1, 'hospital': 1, 'criticised': 1, 'next': 2, 'Hospital': 1, 'Wellington': 1, 'doctors': 2, 'line': 1, 'with': 3, 'Anaesthetists': 1, 'hat': 1, 'and': 2, 'do': 1, 'in': 1, 'at': 7, 'Capital': 41, 'Commissioner': 2, 'end': 1, 'Regional': 1, 'Lab': 1, 'concerns': 1, 'take': 1, 'Zealand': 1, 'Medical': 1, 'rain': 1, 'Melbourne': 1, 'The': 9, 'the': 20, 'a': 3, 'disbelief': 1, 'Wellingtons': 5, 'Another': 1, '2008': 1, 'gardens': 1}
These words appear on the right.
{'and': 2, 'says': 3, 'over': 1, 'expects': 1, 'defended': 1, 'manager': 1, 'Health': 48, 'Board': 43, 'have': 1, 'in': 3, 'moves': 1, 'staff': 1, 'spokeswoman': 1, 'for': 1, 'remains': 1, 'admission': 1, 'amidst': 1, 'Cabinet': 1, 'to': 4, 'Opposition': 1, 'new': 1, 'has': 4, 'is': 2, 'A': 2, 'Neville': 1, 'Boards': 3, 'after': 1, 'but': 1, 'CCDHB': 1, 'hopes': 1, 'The': 2, 'about': 1, 'scheme': 1, 'taking': 1, 'compliance': 1, 'will': 1, 'chief': 1, 'maternity': 1, 'could': 1}
This is a phrase.
['Capital', 'Coast', 'District', 'Health', 'Board']
We started with “Coast District”, and looked at the frequency of words to the left and right, and presto, we get Capital Coast District Health Board.
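The second pass can be sketched roughly as follows (Python 3). This is a guess at the shape of it, not the Ruminator's actual code: it grows the phrase one word at a time, whereas the counts above look as if they were taken over a wider window, and the 0.8 threshold and function names are made up. A neighbour is accepted when it sits next to the phrase in at least that share of occurrences ('Capital' is to the left of "Coast District" 41 times out of 48, for instance).

```python
from collections import Counter

def find_occurrences(tokens, phrase):
    """Start indexes of every place the phrase occurs in the token list."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]

def extend_phrase(tokens, pair, threshold=0.8):
    """Grow a seed pair leftwards and rightwards while one neighbour dominates."""
    phrase = list(pair)
    grew = True
    while grew:
        grew = False
        starts = find_occurrences(tokens, phrase)
        if not starts:
            break
        n = len(phrase)
        left = Counter(tokens[i - 1] for i in starts if i > 0)
        right = Counter(tokens[i + n] for i in starts if i + n < len(tokens))
        for counts, side in ((left, "left"), (right, "right")):
            if not counts:
                continue
            word, count = counts.most_common(1)[0]
            if count / len(starts) >= threshold:
                phrase = [word] + phrase if side == "left" else phrase + [word]
                grew = True
                break  # recount neighbours for the longer phrase
    return phrase

tokens = ("the Capital Coast District Health Board said that Capital Coast "
          "District Health Board staff and Capital Coast District Health "
          "Board has met").split()
phrase = extend_phrase(tokens, ['Coast', 'District'])
```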
Here’s another one:
['Australian', 'Prime']
36
{'help': 1, 'charities': 1, 'cruise': 1, 'hell': 1, 'its': 2, 'before': 1, '24': 1, 'informal': 1, 'ships': 1, 'to': 3, 'board': 1, 'Helen': 2, 'has': 1, 'upping': 1, 'Prime': 2, 'they': 2, 'not': 1, 'one': 1, 'Protests': 1, 'calling': 1, 'continue': 1, 'A': 1, 'Howard': 2, 'doing': 1, 'national': 1, 'Somalia': 1, 'Sydney': 2, 'year': 1, 'John': 1, 'said': 1, 'Environmentalists': 1, 'Darfur': 1, 'new': 1, 'announced': 1, 'be': 1, 'missing': 1, 'aboriginal': 1, 'takeover': 1, 'MPs': 1, 'on': 5, 'climate': 1, 'Clark': 1, 'of': 4, 'region': 1, 'times': 1, 'abuse': 3, 'airline': 1, 'tough': 1, 'angrily': 1, 'three': 1, 'poll': 1, 'Harawira': 1, 'given': 1, 'from': 1, 'would': 1, '&': 1, 'Australias': 1, 'two': 1, 'attack': 1, 'way': 1, 'forward': 1, 'meeting': 2, 'gives': 1, 'a': 2, 'apologise': 1, 'labelled': 1, 'child': 1, 'he': 2, 'HIV-positive': 1, 'Saturdays': 1, 'this': 3, 'polls': 2, 'reacted': 1, 'will': 1, 'country': 1, 'urging': 1, 'are': 3, 'have': 3, 'Northern': 3, 'voters': 1, 'moved': 1, 'Expectations': 1, 'an': 1, 'as': 1, 'want': 1, 'in': 8, 'end': 1, 'ex-partner': 1, 'Minister': 2, 'outbreak': 1, 'you': 1, 'Zealand': 1, 'towards': 1, 'after': 1, 'plane': 1, 'mouth': 1, 'building': 1, 'later': 2, '2005': 1, 'the': 7}
{'a': 1, 'Maori': 2, 'says': 1, 'Howard': 27, 'warned': 1, 'that': 1, 'visit': 1, 'Ministers': 1, 'brief': 1, 'to': 1, 'racist': 1, 'Minister': 35, 'Howards': 1, 'put': 1, 'Rudd': 1, 'John': 28, 'The': 2, 'Kevin': 1, 'he': 1}
['Australian', 'Prime', 'Minister', 'John', 'Howard']
I’m stoked. It just needs a little tuning, and I’ll have a collection of phrases I can use to make the Ruminator’s output a lot more meaningful.
Tags: python ~ the ruminator ~ natural language processing
What I had cooking was pretty tasty
Monday, January 21 2008
Mission accomplished. It wasn’t that hard in the end. To my shame, I didn’t end up using an HTML parser, and brute-forced with regex instead. Why? Because the pages I snooped my data from weren’t perfectly well-formed, so I would have had to sanitise them first. In theory it ought to be easier to say find me the node that contains “EBIT” and give me the contents of its sibling but in practice it’s even easier to say:
pat = r'EBIT</b></td><td align="right">\(?(-?[,\d]+\.\d+)'
m = re.search(pat, earningspage)
I suppose that if it were important, I would have pumped it through BeautifulSoup and then used minidom or similar. It bugs me that the “right way” is so often the hard way when it comes to dealing with other people’s content. I suspect there is a deep principle at work there.
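For anyone who wants to try the pattern, here is a self-contained run of it (Python 3). The HTML fragment is invented in the shape the regex expects; it is not copied from any real earnings page.

```python
import re

# Hypothetical fragment in the shape the pattern expects; not real page data.
earningspage = '<tr><td><b>EBIT</b></td><td align="right">1,153.50</td></tr>'

pat = r'EBIT</b></td><td align="right">\(?(-?[,\d]+\.\d+)'
m = re.search(pat, earningspage)

# Strip the thousands separators before converting to a number.
ebit = float(m.group(1).replace(',', '')) if m else None
```

The captured group keeps its commas, so they have to come out before `float` will accept the string.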
One thing I was pleased with was that I put lots of verbose debugging into the output, none of which contained commas:
Fetching WTF
EBIT: 32.92
TCA: 106.28
TCL: 82.24
PP&E: 4.68
NWC: 24.04
Return: 1.14623955432
Total Debt: 0
Market Cap: 1013.65
EV: 1013.65
Yield: 0.0324766931387
Sector: Retailing
WTF,1.14623955432,0.0324766931387,Retailing
Fetching ZFX
EBIT: 1153.50
TCA: 1946.10
TCL: 633.60
PP&E: 1398.20
NWC: 1312.5
Return: 0.425535839451
Total Debt: 135.70
Market Cap: 4635.40
EV: 4771.1
Yield: 0.241768145711
Sector: Materials
ZFX,0.425535839451,0.241768145711,Materials
so I had something easy to inspect by eye, but I could easily create a CSV for manipulating in a spreadsheet just by grepping for commas. When I first started using Unix I wasn’t too sure what people meant when they said it was an environment geared to the needs of programmers, but now I understand. On the one hand, one’s programs are often not as complete or helpful as they might be, because you know that you’ll just use some utilities to do the last bits later; on the other hand, you save a lot of time.
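The derived lines in that output can be reproduced from the raw ones. The formulas in this sketch (Python 3) are inferred from the printed numbers rather than taken from the script itself, and the function and variable names are invented, so treat it as a consistency check on the WTF figures and nothing more.

```python
def derived_figures(ebit, tca, tcl, ppe, total_debt, market_cap):
    """Recompute the derived lines of the debugging output from the raw ones."""
    nwc = tca - tcl                  # net working capital
    ret = ebit / (nwc + ppe)         # return on capital
    ev = market_cap + total_debt     # enterprise value
    yld = ebit / ev                  # earnings yield
    return nwc, ret, ev, yld

nwc, ret, ev, yld = derived_figures(
    ebit=32.92, tca=106.28, tcl=82.24, ppe=4.68,
    total_debt=0.0, market_cap=1013.65)

# The one comma-bearing line per company, in the same field order as above.
csv_line = "%s,%s,%s,%s" % ("WTF", ret, yld, "Retailing")
```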
Tags: python ~ regex ~ html parsing
Tuesday, December 04 2007
At work we are migrating an old site to a new CMS.
Unfortunately the content is a mess. Owing to people pasting text in from Word and various other accidents, one fragment of HTML can be a mixture of UTF-8 and Latin-1 and cp1252 and goodness knows what else. When you’ve been a good boy and coded all your templates to declare “I am UTF-8, honest guv” it’s a bit trying. Especially when the client complains.
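As an illustration of the kind of repair this calls for, and not necessarily what was used on this migration, here is a sketch (Python 3) that decodes bytes as UTF-8 and falls back to cp1252 only for the bytes that are not valid UTF-8. The handler name `cp1252_fallback` and the function `decode_mixed` are made up for the example.

```python
import codecs

def _cp1252_fallback(err):
    """Decode just the offending bytes as cp1252, then resume UTF-8 decoding."""
    bad = err.object[err.start:err.end]
    # errors="replace" covers the few byte values cp1252 leaves undefined
    return bad.decode("cp1252", errors="replace"), err.end

codecs.register_error("cp1252_fallback", _cp1252_fallback)

def decode_mixed(raw):
    """Decode bytes that are mostly UTF-8 with stray cp1252 bytes mixed in."""
    return raw.decode("utf-8", errors="cp1252_fallback")

# A valid UTF-8 e-acute followed by cp1252 "smart quotes" pasted from Word.
text = decode_mixed(b"caf\xc3\xa9 \x93quoted\x94")
```

This cannot rescue text that has already been decoded wrongly and re-encoded, which needs different treatment.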
The markup is pretty broken too. It’s littered with weird markup from Word and generally non-compliant.
So far I’m having good results from a pipeline of various tricks.
The only downside is that over thousands of items, this is pretty slow. But it’s the price you pay to be beautiful, I guess.
Tags: python ~ unicode ~ markup ~ programming ~ html tidy ~ beautiful soup
Wednesday, November 21 2007
The other day I got Joel Greenblatt’s The Little Book That Beats the Market out from the public library. (What money-minded person would buy books that are in the library?)
It’s a light, cute read, but it offers suggestions that accord very well with what I understand of value investing. At the end you are presented with a “Magic Formula” for ranking stocks that is supposed to be likely to do better than the market, and again I should think it would, since it basically selects companies that get the best return on their capital that are currently priced cheapest in the market.
I thought it would be an interesting exercise to write a stock screener for New Zealand and Australia that uses this formula. This would be of some value, because no one seems to be offering such a service here. So I’m going to use Python to do it.
The formula requires you to know various figures about each company. Some are published by Yahoo and other websites, and the others can be derived from published data.
So what I need to do is:
This is clearly going to involve some sort of HTML parsing, so I think my first major technical decision is going to be what to use.
Tags: money ~ investing ~ python ~ magic formula
A tiny WSGI framework in an hour or two
Sunday, October 14 2007
I’m not sure whether the first story here is meant to encourage or dissuade, but I am writing my own WSGI framework to support Burble. Colubrid is holding me back, and its replacement Werkzeug is overkill for what I want. (To be fair, Colubrid has been a great help to me in getting started.) I’m really getting into the educational aspect of doing things from scratch where I can.
It turns out that putting together a very lightweight WSGI framework is very easy indeed, especially having made a small compromise by using a few pre-built things from Ian Bicking’s Paste. (Yeah, I contradict myself. I am large, I contain multitudes.)
It’s so easy that I’m almost done, so I present a tiny, noddy framework for your reading pleasure. It implements a regex-based URL dispatcher a la Web.py.
#!/usr/bin/python
from paste.request import parse_formvars
from paste.response import HeaderDict
import re

def attrsfromdict(d):
    """From Python cookbook s6.18 p 280"""
    self = d.pop('self')
    for n,v in d.iteritems():
        setattr(self, n, v)

def simplerepr(obj):
    d = obj.__dict__
    members = ', '.join([n + '=' + v.__repr__() for n,v in d.iteritems()])
    return '%s(%s)' % (obj.__class__.__name__, members)

class NoMatchingControllerException(Exception):
    pass

class Request(object):
    def __init__(self, environ):
        self.environ = environ
        self.fields = parse_formvars(environ)

class Response(object):
    def __init__(self,
                 status_code='200',
                 response_phrase="OK", body="",
                 headers=HeaderDict({'content-type': 'text/html'})
                 ):
        attrsfromdict(locals())

    def __str__(self):
        return simplerepr(self)

    def __repr__(self):
        return self.__str__()

    def status(self):
        return ' '.join([self.status_code, self.response_phrase])

class Dispatcher(object):
    """
    The Dispatcher maintains an internal list of regexes and controllers.
    The Dispatcher accepts strings, and tries to match in turn against
    the regexes. As soon as a match is found, the corresponding controller is
    invoked.
    """
    def __init__(self, regex_app_tuples):
        self.dispatch_list = []
        for k, v in regex_app_tuples:
            p = re.compile(k)
            self.dispatch_list.append((p, v))

    def dispatch(self, request):
        """
        Expects to call a Controller's instance method GET or POST
        with the request and the groups obtained from the regex
        as arguments.
        """
        path_info = request.environ.get('PATH_INFO', '')
        method = request.environ['REQUEST_METHOD']
        for pat, app in self.dispatch_list:
            mo = pat.match(path_info)
            if mo != None:
                args = [request]
                args.extend([i for i in mo.groups()])
                if method == 'GET':
                    return app.GET(*args)
                elif method =='POST':
                    return app.POST(*args)
        raise NoMatchingControllerException, "No match for %s" % path_info

class WhiskyApp(object):
    def __init__(self, dispatcher):
        self.dispatcher = dispatcher

    def __call__(self, environ, start_response):
        request = Request(environ)
        response = self.dispatcher.dispatch(request)
        start_response(response.status(), response.headers.items())
        return [response.body]

class NoddyController(object):
    def GET(self, request, id):
        r = Response(body="Noddy got %s" % id)
        return r

class BigEarsController(object):
    def __init__(self):
        print "in init"

    def GET(self, request, arg1, arg2):
        r = Response(body="Big Ears got %s and %s" % (arg1, arg2))
        return r

if __name__ == '__main__':
    from paste import httpserver
    dispatch_list = [
        (r'/noddy/(\d+)/?$', NoddyController()),
        (r'/bigears/(.*?)/(\d+)/?$', BigEarsController())
    ]
    dispatcher = Dispatcher(dispatch_list)
    app = WhiskyApp(dispatcher)
    httpserver.serve(app, host='127.0.0.1', port='8080')
That’s pretty much all I need, to be honest. I’m happy using Beaker for sessions, and I’ll probably pull in cookie stuff from Paste. I bodged up an Etag cache manager for Burble, which I want to integrate. I want to write a nice base class for controllers. And that’s it. Whee!
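For readers without Paste to hand, the regex-dispatch idea can be sketched with nothing but the standard library. This is a cut-down Python 3 variation, not the framework above: controllers return plain strings instead of Response objects, an unmatched path gets a 404 instead of an exception, and the class layout is simplified.

```python
import re

class Dispatcher:
    """A WSGI app that routes on PATH_INFO using a list of (regex, controller)."""

    def __init__(self, routes):
        self.routes = [(re.compile(pattern), controller)
                       for pattern, controller in routes]

    def __call__(self, environ, start_response):
        path = environ.get('PATH_INFO', '')
        method = environ.get('REQUEST_METHOD', 'GET')
        for pattern, controller in self.routes:
            match = pattern.match(path)
            # Only GET and POST are dispatched, as in the framework above.
            if match and method in ('GET', 'POST') and hasattr(controller, method):
                body = getattr(controller, method)(environ, *match.groups())
                start_response('200 OK',
                               [('Content-Type', 'text/html; charset=utf-8')])
                return [body.encode('utf-8')]
        start_response('404 Not Found',
                       [('Content-Type', 'text/plain; charset=utf-8')])
        return [b'No match for ' + path.encode('utf-8')]

class NoddyController:
    def GET(self, environ, id):
        return "Noddy got %s" % id

app = Dispatcher([(r'/noddy/(\d+)/?$', NoddyController())])
```

To serve it, something like `wsgiref.simple_server.make_server('127.0.0.1', 8080, app).serve_forever()` should do.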
Tags: python ~ burble ~ wsgi ~ paste ~ programming ~ web development
Making life easy on yourself in Python with quick and dirty __repr__
Wednesday, October 10 2007
When I write Python, sadly my code often has bugs. One way or another I always end up dumping out variables to see what’s in them. If those variables refer to objects, Python’s default representation is not very helpful:
>>> class C:
...     def __init__(self, arg):
...         self.member = arg
...
>>> obj = C('foo')
>>> print obj
<__main__.C instance at 0xb7d363ec>
It would be nice to have something that lets you know what’s inside that object.
Python lets you help yourself. If you define __str__ and __repr__ methods on your classes, then they will have nice string representations when you want to print them, or when a debugger inspects them for you. This can be a bit laborious though, especially if you want to meet the requirement (see the docs) that __repr__ should return “a valid Python expression that could be used to recreate an object with the same value.” And as a good Python citizen, of course you want to do that.
I have a solution that won’t always be the right thing, but saves a lot of typing for many simple classes. I find that often I write classes that are really glorified dictionaries with a few helper methods. In my day job, where we mostly write Java, we work where possible with POJOs or beans. This seems like a Pythonic way to emulate and improve on that idiom.
def simplerepr(obj):
    d = obj.__dict__
    members = ', '.join([n + '=' + v.__repr__() for n,v in d.iteritems()])
    return '%s(%s)' % (obj.__class__.__name__, members)

def attrsfromdict(d):
    """From Python cookbook s6.18 p 280"""
    self = d.pop('self')
    for n,v in d.iteritems():
        setattr(self, n, v)

class Foo(object):
    def __init__(self, arg1="wstfgl", arg2="sneeb!"):
        attrsfromdict(locals())

    def __str__(self):
        return simplerepr(self)

    def __repr__(self):
        return self.__str__()
>>> f = Foo('quux')
>>> f
Foo(arg1='quux', arg2='sneeb!')
>>> g = Foo(arg1='quux', arg2='sneeb!')
>>> g
Foo(arg1='quux', arg2='sneeb!')
This was inspired by and builds on a recipe in the Python Cookbook.
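The "valid Python expression" requirement can be checked directly by feeding the repr back through `eval`. Here is a minimal Python 3 sketch of the same idea. The `__eq__` method and the explicit attribute assignments are additions for the sake of the check, not part of the recipe above.

```python
def simple_repr(obj):
    """Build 'ClassName(attr=value, ...)' from the instance's attributes."""
    members = ', '.join('%s=%r' % (name, value)
                        for name, value in vars(obj).items())
    return '%s(%s)' % (type(obj).__name__, members)

class Foo:
    def __init__(self, arg1="wstfgl", arg2="sneeb!"):
        self.arg1 = arg1
        self.arg2 = arg2

    def __repr__(self):
        # str() falls back to __repr__ when __str__ is not defined
        return simple_repr(self)

    def __eq__(self, other):
        return isinstance(other, Foo) and vars(self) == vars(other)

f = Foo('quux')
clone = eval(repr(f))   # round-trips because attribute names match the parameters
```

The round trip only works while the attribute names match the constructor's parameter names, which is exactly the glorified-dictionary case described above.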
Tags: python ~ programming ~ good practice
Wednesday, October 03 2007
I’m committed now to using Burble for two sites. I’m about halfway through refactoring to make this possible, and I’ve discovered that my choice of framework is making this a tedious chore.
Colubrid is by the authors’ own admission based on a misapprehension about how best to use WSGI, and they recommend that you use Werkzeug now instead. Unfortunately, I didn’t realise this when I started using it.
The fundamental problem is that various important thingies are set up and torn down with each request, and various things done after instantiation, such that your init methods break. So I am forced to cut and paste where inheritance would be better, or monkeypatch.
So my very next thing to do is get off Colubrid. It would be top of the list now, but I really am desperate to get vital.org.nz off Blosxom and on to Burble. I just like my own blogging tool so much more.
Tags: burble ~ python ~ colubrid ~ wsgi
Friday, September 28 2007
When importing all my old content, I hit a snag. A lot of vital.org.nz has pretty broken markup in it. Burble’s templating system is strict XML under the hood, so any post or comment that contains broken markup causes burble to barf.
I discovered that there is a lovely Python wrapper for HTML Tidy. And there’s even an Ubuntu package. Problem solved.
>>> import tidy
>>> html = 'some <b>horrible<i> soup</b> which is nasty & yukky'
>>> options = {'show-body-only':'y', 'output-xhtml':'y', 'enclose-block-text':'y', 'enclose-text':'y'}
>>> body = str(tidy.parseString(html, **options))
>>> body
'<p>some <b>horrible <i>soup</i> which is nasty & yukky</b></p>\n\n'
Wednesday, September 26 2007
More progress: Burble now has an Atom feed. This proved surprisingly easy to implement. It’s just another template to pump a list of entries into.
Tags: burble ~ python ~ syndication ~ atom ~ templates
Sunday, September 23 2007
I’ve learned about Apache bugs. I’ve learned about Etags. I’ve learned about Python.
This post won’t be sticking around for long — I’ve a lot more work to do on this first — but this will be taking over as my publishing wotsit soon.
Tags: burble ~ python ~ testing ~ i did it my way
Rendered at 2012-02-05 18:05:43