
Archive for the ‘Python’ Category

I set out to write a simple Bayesian text classifier for news articles. The problem is that I only have the test data. In other words, I have the articles that I want to classify or attach tags to, but I don’t have any training data for building a proper model.

My first task was to find a proper taxonomy of article categories/tags. It took me about an hour to realize that it's just not an easy task. I searched for an easy way to download the Yahoo! Directory, with no luck. I had better luck with the Open Directory Project: I finally found something promising on this page, which was actually pretty tricky to find! I grabbed the categories.txt file, which looks like this:

Adult
Adult/Arts
Adult/Arts/Animation
Adult/Arts/Animation/Anime
Adult/Arts/Animation/Anime/Fan_Works
Adult/Arts/Animation/Anime/Fan_Works/Fan_Art
….

Then I formatted it a little so that I keep only the first two levels (category/subcategory):

Arts
Arts/Animation
Arts/Animation/Celebrities
….

This resulted in a huge reduction in the number of categories; I ended up with 275 topics.
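For reference, the trimming step boils down to something like this sketch (it assumes the categories.txt file from above; the file name and the printing are just for illustration):

# keep only the first two levels of each ODP category path
with open('categories.txt') as f:
    cats = set()
    for line in f:
        parts = line.strip().split('/')
        cats.add('/'.join(parts[:2]))  # e.g. Arts/Animation/Anime -> Arts/Animation

for cat in sorted(cats):
    print(cat)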

The next step was to get some text related to each category. I used Wikipedia for this. For each category, I crawled the page at the URL prefix "en.wikipedia.org/wiki/" followed by the category name, using Python Goose to extract the article body. Since most of the category names were simple notions or concepts, there was almost always a corresponding Wikipedia article; it let me down on only a few, which I just ignored.
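The per-category fetch looked roughly like this (a sketch, not the exact script; it relies on python-goose's Goose().extract API and uses Animation as a sample category):

from goose import Goose  # python-goose

# fetch a category's Wikipedia page and extract the article body with Goose
g = Goose()
article = g.extract(url='http://en.wikipedia.org/wiki/Animation')
print(article.cleaned_text[:200])  # first bit of the extracted body text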

A little trick I used in the script was to try splitting a category name into several topics whenever I saw one such as Religion_and_Spirituality. I would also derive a singular form of the category name, since Wikipedia only has an article for "Award" and not "Awards."
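That trick amounts to something like the following sketch (candidate_titles is a hypothetical helper name, and the singularization is deliberately crude):

# split "X_and_Y" category names into topics, and add crude singular forms
def candidate_titles(category):
    names = category.split('_and_')  # Religion_and_Spirituality -> Religion, Spirituality
    for name in list(names):
        if name.endswith('s'):
            names.append(name[:-1])  # Awards -> Award
    return names

print(candidate_titles('Religion_and_Spirituality'))  # ['Religion', 'Spirituality']
print(candidate_titles('Awards'))                     # ['Awards', 'Award']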

You can find my Python script up to this point, which crawls the pages and generates one file per article, here.

I have several items on my TODO list for this little project:

  • Remove stop words. This is very simple (see the sketch after this list). Although the classifier should work with more or less the same accuracy even with stop words, removing them makes training faster with fewer features.
  • Crawl more languages. In fact, the reason I’m collecting training data is to tag French news articles, so this one is kind of a requirement.
  • Lemmatize the category names as well as the features extracted for each category. I believe this should improve classification significantly.
  • Finally, I would like to try a classifier other than Naive Bayes, which doesn't enjoy the best reputation in the classification arena.
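For the stop-word item, NLTK already ships a word list, so the removal could be as simple as this sketch (the sample tokens are made up):

from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# drop English stop words from a token list
stop = set(stopwords.words('english'))
tokens = ['the', 'classifier', 'should', 'work', 'with', 'fewer', 'features']
print([t for t in tokens if t not in stop])  # ['classifier', 'work', 'fewer', 'features']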
Sounds pretty optimistic; I'll try to keep my appetite up for this.

Read Full Post »

Yesterday I wrote this handy little Python script to compute TF-IDF scores for a collection of documents; check it out here.

This little function does most of the work other than the TF-IDF calculation itself:

import re
from nltk.stem.wordnet import WordNetLemmatizer

# function to tokenize text, and put words back to their roots
def tokenize(text):
    # tokenize: keep anchor tags, other HTML tags, and word-like tokens
    # (this drops punctuation)
    tokens = re.findall(r"<a.*?/a>|<[^\>]*>|[\w'@#]+", text.lower())

    # lemmatize words. try both noun and verb lemmatizations
    lmtzr = WordNetLemmatizer()
    for i in range(len(tokens)):
        res = lmtzr.lemmatize(tokens[i])
        if res == tokens[i]:
            # the noun lemmatization changed nothing, so try the verb one
            tokens[i] = lmtzr.lemmatize(tokens[i], 'v')
        else:
            tokens[i] = res
    return tokens

It uses the powerful NLTK library along with WordNet to put words back into their root forms. Since I don't know a word's part of speech in the sentence (verb, noun, …), I just lemmatize it once as a noun and once as a verb and take whichever changes the word.
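For completeness, the TF-IDF step itself is small. Here's a minimal sketch (not the script's exact code) that takes a list of token lists, as produced by tokenize above:

import math

# compute a TF-IDF score dict for each document in a collection
def tf_idf(docs):
    # document frequency: in how many documents does each term appear?
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    scores = []
    for tokens in docs:
        # raw term frequency within this document
        tf = {}
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        scores.append({term: (count / float(len(tokens)))
                             * math.log(len(docs) / float(df[term]))
                       for term, count in tf.items()})
    return scores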

Read Full Post »

The usual painful task every researcher has to go through at some point: writing a simple customized web crawler!

Today I sat down to write my own. I can't really call it a crawler, though, since it doesn't do any "crawling": all I needed was a piece of code that would parse the web pages I throw at it and give me the HTML elements I'm looking for, by name, class, id, or what have you. It seems not enough people have tried this on a Mac, though; community support was horribly weak!

I had intended to write my parser in Java, but after some reading on Google, I decided to try Python, which is probably much better at handling such clichéd, high-level tasks. Macs have it installed by default. Sweet!

Now comes the pain: which package should I use? I started with Python's native HTML parsing support, which has you write a subclass with three methods, handle_starttag, handle_endtag, and handle_data. This doesn't give you a very intuitive way to structure your logic when you want to search an HTML tree for elements and then execute some action. You can find it at the link below, but a disclaimer: DON'T use it if you want to handle complicated real-life HTML pages. In my experience it crashed while parsing JavaScript code.

Now, with a clear conscience, here it is: http://docs.python.org/library/htmlparser.html
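To be fair, here's what using it looks like: a minimal sketch (Python 2, matching the linked docs; in Python 3 the module is html.parser) that prints every link in a page:

from HTMLParser import HTMLParser  # html.parser in Python 3

class LinkPrinter(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # called on every opening tag; attrs is a list of (name, value) pairs
        if tag == 'a':
            print(dict(attrs).get('href'))

    def handle_endtag(self, tag):
        pass  # called on every closing tag

    def handle_data(self, data):
        pass  # called on the text between tags

LinkPrinter().feed('<p>A <a href="http://example.com">link</a></p>')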

The next step was to try BeautifulSoup. BS is very popular on the web (although I don't think it enjoys similar interest from its developers), but again, for some reason it wouldn't parse my hard-headed HTML page. My guess is that it's just a wrapper around the native Python parsing code.

My last resort was the famous lxml parser. I had read a lot about this one and was all excited to try it. However, no matter how hard I tried, it wouldn't install on my Mac. I'm on Mountain Lion, so there are a number of suspects here: it could be the manually installed GCC compiler (Apple stopped shipping GCC as of Mountain Lion; you have to install it yourself if you need it), it could be the version of lxml, or it could just be that the developers behind it are whole-hearted Mac-haters, who proudly declare on the software's web page that:

Apple doesn’t help here, as MacOS-X is so badly maintained by them that the pre-installed system libraries of libxml2 and libxslt tend to be horribly outdated, and updating them is everything but easy

Anyway, after wasting another hour, I found this great step-by-step tutorial by Leo for installing it. However, the latest versions of libxml2, libxslt, and lxml still did NOT compile. I had to try a couple more, et voilà… I finally found this combination of versions, which compiled fine:
  • libxml2-2.8.0
  • libxslt-1.1.27
  • lxml-2.3.5
Hope this helps some other poor lxml-using soul.
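With that combination installed, the element lookups I was after look roughly like this (a sketch; the URL, id, and class names are placeholders):

import lxml.html

# parse a page and pull elements out by id, class, or tag name
doc = lxml.html.parse('http://example.com').getroot()
header = doc.get_element_by_id('header')   # by id
links = doc.cssselect('a.external')        # by tag and class (CSS selector)
paras = doc.findall('.//p')                # by tag name (ElementTree path)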

Read Full Post »