What I had cooking was pretty tasty
Monday, January 21 2008
Mission accomplished. It wasn’t that hard in the end. To my shame, I didn’t end up using an HTML parser, and brute-forced it with regex instead. Why? Because the pages I snooped my data from weren’t perfectly well-formed, so I would have had to sanitise them first. In theory it ought to be easier to say “find me the node that contains ‘EBIT’ and give me the contents of its sibling”, but in practice it’s even easier to say:
import re
# earningspage holds the HTML of the earnings page as a string
pat = r'EBIT</b></td><td align="right">\(?(-?[,\d]+\.\d+)'
m = re.search(pat, earningspage)
I suppose that if it were important, I would have pumped it through BeautifulSoup and then used minidom or similar. It bugs me that the “right way” is so often the hard way when it comes to dealing with other people’s content. I suspect there is a deep principle at work there.
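For anyone who wants to try the regex route, here is a minimal sketch of the whole step. The pattern is the one from the post; the HTML fragment and the `extract_ebit` helper are made up for illustration and are not taken from the real pages.

```python
import re

# Pattern from the post, with straight quotes. The optional "\(?" skips an
# opening parenthesis; the group captures an optional minus sign, digits,
# commas and a decimal part.
pat = r'EBIT</b></td><td align="right">\(?(-?[,\d]+\.\d+)'

# Hypothetical fragment standing in for the real earnings page.
earningspage = '<tr><td><b>EBIT</b></td><td align="right">1,153.50</td></tr>'


def extract_ebit(page):
    """Return EBIT as a float, or None if the pattern does not match."""
    m = re.search(pat, page)
    if m is None:
        return None
    # Strip thousands separators before converting to a number.
    return float(m.group(1).replace(",", ""))


print(extract_ebit(earningspage))
```

One caveat with this pattern: a value written in accounting style, such as `(32.92)`, is captured as `32.92`, so if the parentheses mean "negative" the sign has to be handled separately.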
One thing I was pleased with was that I put lots of verbose debugging into the output, none of which contained commas:
Fetching WTF
EBIT: 32.92
TCA: 106.28
TCL: 82.24
PP&E: 4.68
NWC: 24.04
Return: 1.14623955432
Total Debt: 0
Market Cap: 1013.65
EV: 1013.65
Yield: 0.0324766931387
Sector: Retailing
WTF,1.14623955432,0.0324766931387,Retailing
Fetching ZFX
EBIT: 1153.50
TCA: 1946.10
TCL: 633.60
PP&E: 1398.20
NWC: 1312.5
Return: 0.425535839451
Total Debt: 135.70
Market Cap: 4635.40
EV: 4771.1
Yield: 0.241768145711
Sector: Materials
ZFX,0.425535839451,0.241768145711,Materials
so I had something easy to inspect by eye, but I could easily create a CSV for manipulating in a spreadsheet just by grepping for commas. When I first started using Unix I wasn’t too sure what people meant when they said it was an environment geared to the needs of programmers, but now I understand. On the one hand, one’s programs are often not as complete or helpful as they might be, because you know that you’ll just use some utilities to do the last bits later; on the other hand, you save a lot of time.
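The comma trick works because only the summary lines contain a comma. In the post the filtering is done with grep; the sketch below does the same thing in Python, on a shortened copy of the log above. The `csv_rows` helper is a hypothetical name, not part of the original script.

```python
import csv

# A shortened copy of the debugging output shown above.
log = """Fetching WTF
EBIT: 32.92
Sector: Retailing
WTF,1.14623955432,0.0324766931387,Retailing
Fetching ZFX
EBIT: 1153.50
Sector: Materials
ZFX,0.425535839451,0.241768145711,Materials
"""


def csv_rows(text):
    """Keep only the lines that contain a comma and parse them as CSV."""
    lines = [line for line in text.splitlines() if "," in line]
    return list(csv.reader(lines))


rows = csv_rows(log)
for ticker, ret, yld, sector in rows:
    print(ticker, float(ret), float(yld), sector)
```

This only holds as long as no debugging line ever prints a comma, for example a figure with a thousands separator, which is why it matters that the verbose lines are comma-free.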
Tags: python ~ regex ~ html parsing