
Next on the menu

Wednesday, January 30 2008

I’m trying to make The Ruminator a bit smarter.

Right now, it simply chomps text up into words by splitting it on whitespace and lower-casing it. This means that some things that really ought to be treated as one thing aren’t. I’ve hard-coded “New Zealand” but that approach is pretty stupid.
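To make the problem concrete, here is a minimal sketch of that kind of tokenizer. The function name `tokenize` and the sample headline are made up for illustration; this is not the Ruminator's actual code.

```python
def tokenize(text):
    """Lower-case the text and split it on runs of whitespace."""
    return text.lower().split()


# A made-up headline: "New Zealand" comes out as two unrelated tokens,
# and the trailing comma stays glued to "zealand".
tokens = tokenize("Floods hit New Zealand, say Wellington officials")
print(tokens)
```

Nothing in the output records that "new" and "zealand," belong together, which is the gap a hard-coded special case papers over.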

So I’ve been looking into ways to do this better.

The thing to do seems to be to identify so-called “collocations”, which are sequences of words that are significant. “North Island”, “Wellington City Council”, “aggravated robbery” are examples of collocations that the Ruminator might see. The trick is in deciding on significance just through statistical analysis.

There is a bunch of computer science that deals with this problem already, and I’ve found some helpful references. The guts of the best solution seem to be calculating the mutual information statistic. Which is to say, take the probability of words x and y appearing in your corpus in sequence, and divide that by the probability of x occurring times the probability of y occurring. Or:

  P(“xylophonic yurt”) / (P(“xylophonic”) × P(“yurt”))
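A rough sketch of that ratio in Python 3, estimating each probability from simple counts. The name `bigram_scores` and the toy sentence are invented for illustration. Mutual information is usually quoted as the logarithm of this ratio, so the usage lines take `log2` of the score.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {(x, y): P(x y) / (P(x) * P(y))} for every adjacent pair."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(n_words - 1, 1)  # avoid dividing by zero on tiny input
    scores = {}
    for (x, y), count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = p_xy / (p_x * p_y)
    return scores


# Toy corpus, far too small to give meaningful statistics.
tokens = "the xylophonic yurt stood by the old yurt near the river".split()
scores = bigram_scores(tokens)
ratio = scores[("xylophonic", "yurt")]
print(ratio, math.log2(ratio))
```

A ratio well above 1 means the pair turns up together more often than the individual word frequencies would predict by chance.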

Having done that and identified some collocations, we could repeat the exercise with the words appearing before and after, and see whether “white xylophonic yurt” and “xylophonic yurt zapper” are collocations too.

There’s a bunch of tweaking to do after that. What is the threshold for considering something significant? What about sequences that score high, but based on a very few appearances in your corpus?
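One possible way to handle both questions is a minimum-count filter combined with a score cutoff. This is only a sketch: the function name, the default values of `min_count` and `threshold`, and the numbers in the example are all placeholders that would need tuning against a real corpus.

```python
def likely_collocations(scores, pair_counts, min_count=5, threshold=10.0):
    """Keep pairs seen at least min_count times whose score reaches threshold."""
    kept = [
        (pair, score)
        for pair, score in scores.items()
        if pair_counts.get(pair, 0) >= min_count and score >= threshold
    ]
    # Strongest candidates first.
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up numbers: a rare pair with a huge score, a common pair with a
# low score, and one pair that passes both tests.
example_scores = {("xylophonic", "yurt"): 900.0, ("of", "the"): 1.2, ("north", "island"): 85.0}
example_counts = {("xylophonic", "yurt"): 1, ("of", "the"): 5000, ("north", "island"): 40}
print(likely_collocations(example_scores, example_counts))
```

The count filter is there because a pair seen once or twice can score very high purely by accident.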

And of course I need better tools for identifying “words” in the first place. Yay NLTK. I hope to use this to “stem” words so that minor variations in word form don’t result in stories ending up in different places.

I have a big corpus of news items to play with. I’ve already discovered that reading in 100 MB of text in one go isn’t so smart… anyway, the results are interesting, but it’s going to take a while to fine-tune.
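One common way around the memory problem is to stream the file a line at a time and update the counts as you go. A minimal Python 3 sketch, assuming the corpus is a plain UTF-8 text file; the function names are hypothetical.

```python
from collections import Counter


def iter_tokens(path, encoding="utf-8"):
    """Yield lower-cased tokens line by line instead of loading the whole file."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            yield from line.lower().split()


def count_words(path):
    """Build word counts while holding only one line of text at a time."""
    return Counter(iter_tokens(path))
```

The counts themselves still live in memory, but they grow with the vocabulary rather than with the size of the corpus. Counting pairs this way would also need the last token of each line carried over to the next.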

Once I’ve done that, I’m going to see whether Bayesian techniques have anything to offer in sorting, tagging and labelling news items. I foresee pain there: someone has to train the sorter, and that could take a while.

Still, it’s enjoyable. Sometimes I regret not having had a full computer science education, and pursuing problems like this makes me feel as though I am somehow making up for it. And it’s just interesting.


Tags: the ruminator ~ natural language processing

What I had cooking was pretty tasty

Monday, January 21 2008

Mission accomplished. It wasn’t that hard in the end. To my shame, I didn’t end up using an HTML parser, and brute-forced it with regex instead. Why? Because the pages I snooped my data from weren’t perfectly well-formed, so I would have had to sanitise them first. In theory it ought to be easier to say “find me the node that contains ‘EBIT’ and give me the contents of its sibling”, but in practice it’s even easier to say:

        import re
        pat = r'EBIT</b></td><td align="right">\(?(-?[,\d]+\.\d+)'
        m = re.search(pat, earningspage)

I suppose that if it were important, I would have pumped it through BeautifulSoup and then used minidom or similar. It bugs me that the “right way” is so often the hard way when it comes to dealing with other people’s content. I suspect there is a deep principle at work there.
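For what it’s worth, here is how that regex might be wrapped up and turned into a number. This is a sketch only: the HTML fragments are invented, the helper name `extract_ebit` is hypothetical, and treating a leading parenthesis as an accounting-style negative is an assumption about the source pages.

```python
import re

# The pattern from the post, written with plain ASCII quotes.
PATTERN = r'EBIT</b></td><td align="right">\(?(-?[,\d]+\.\d+)'


def extract_ebit(page):
    """Return EBIT as a float, or None if the pattern is not found."""
    m = re.search(PATTERN, page)
    if m is None:
        return None
    # Strip thousands separators before converting.
    value = float(m.group(1).replace(",", ""))
    # Assumption: "(12.30)" means -12.30. The optional "(" is matched but
    # not captured, so look for it in the whole match.
    if "(" in m.group(0):
        value = -abs(value)
    return value


# Invented fragments in the shape the pattern expects.
print(extract_ebit('<td><b>EBIT</b></td><td align="right">1,153.50</td>'))
print(extract_ebit('<td><b>EBIT</b></td><td align="right">(12.30)</td>'))
```

Returning `None` on a failed match keeps one oddly formatted page from crashing the whole run.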

One thing I was pleased with was that I put lots of verbose debugging into the output, none of which had commas in it

    Fetching WTF
    EBIT: 32.92
    TCA: 106.28
    TCL: 82.24
    PP&E: 4.68
    NWC: 24.04
    Return: 1.14623955432
    Total Debt: 0
    Market Cap: 1013.65
    EV: 1013.65
    Yield: 0.0324766931387
    Sector: Retailing
    WTF,1.14623955432,0.0324766931387,Retailing
    Fetching ZFX
    EBIT: 1153.50
    TCA: 1946.10
    TCL: 633.60
    PP&E: 1398.20
    NWC: 1312.5
    Return: 0.425535839451
    Total Debt: 135.70
    Market Cap: 4635.40
    EV: 4771.1
    Yield: 0.241768145711
    Sector: Materials
    ZFX,0.425535839451,0.241768145711,Materials

so I had something easy to inspect by eye, but I could easily create a CSV for manipulating in a spreadsheet just by grepping for commas. When I first started using Unix I wasn’t too sure what people meant when they said it was an environment geared to the needs of programmers, but now I understand. On the one hand, one’s programs are often not as complete or helpful as they might be, because you know that you’ll just use some utilities to do the last bits later; on the other hand, you save a lot of time.
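The same trick is easy to mimic in Python. The sketch below also recomputes the derived figures. The formulas are not stated in the post; they are inferred from the listing, where NWC equals TCA minus TCL, Return equals EBIT over NWC plus PP&E, EV equals Market Cap plus Total Debt, and Yield equals EBIT over EV. The function names are made up.

```python
def metrics(ebit, tca, tcl, ppe, total_debt, market_cap):
    """Recompute (Return, Yield); the formulas are inferred from the listing."""
    nwc = tca - tcl                 # net working capital
    ret = ebit / (nwc + ppe)        # "Return"
    ev = market_cap + total_debt    # "EV"
    return ret, ebit / ev           # "Yield" is EBIT over EV


def csv_lines(log_text):
    """Keep only the lines containing a comma, like grepping the log for ','."""
    return [line.strip() for line in log_text.splitlines() if "," in line]


# Figures for WTF, copied from the listing above.
ret, yld = metrics(32.92, 106.28, 82.24, 4.68, 0, 1013.65)
log = "Fetching WTF\nEBIT: 32.92\nWTF,1.14623955432,0.0324766931387,Retailing\n"
print(ret, yld, csv_lines(log))
```

This only works because the debugging lines never contain a comma themselves, which is exactly the property the output format was designed around.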


Tags: python ~ regex ~ html parsing

