
Archive for the ‘categorization’ Category

I set out to write a simple Bayesian text classifier for news articles. The problem is that I only have the test data. In other words, I have the articles that I want to classify or attach tags to, but I don’t have any training data for building a proper model.

My first task was to find a proper taxonomy of article categories/tags. It took me about an hour to realize that this is not an easy task. I searched for an easy way to download the Yahoo! directory, with no luck. I had a better chance with the Open Directory Project. I finally found something promising on this page, which was actually pretty tricky to find! I got the categories.txt file, which looks like this:

Adult
Adult/Arts
Adult/Arts/Animation
Adult/Arts/Animation/Anime
Adult/Arts/Animation/Anime/Fan_Works
Adult/Arts/Animation/Anime/Fan_Works/Fan_Art
….

Then I formatted it a little so that I only get the first two levels (category/subcategory):

Arts
Arts/Animation
Arts/Animation/Celebrities
….

This resulted in a huge reduction in the number of categories (I ended up with 275 topics).
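The truncation step can be sketched roughly as follows. This is not the original script, only a minimal illustration: the function name `top_levels` and the `depth` parameter are made up here, and the sample paths are the ones shown in the categories.txt excerpt.

```python
def top_levels(paths, depth=2):
    """Truncate each category path to its first `depth` components,
    keeping the first occurrence of each truncated path."""
    seen = set()
    result = []
    for path in paths:
        path = path.strip()
        if not path:
            continue  # skip blank lines
        truncated = "/".join(path.split("/")[:depth])
        if truncated not in seen:
            seen.add(truncated)
            result.append(truncated)
    return result


sample = [
    "Adult",
    "Adult/Arts",
    "Adult/Arts/Animation",
    "Adult/Arts/Animation/Anime",
    "Adult/Arts/Animation/Anime/Fan_Works",
    "Adult/Arts/Animation/Anime/Fan_Works/Fan_Art",
]
print(top_levels(sample))  # ['Adult', 'Adult/Arts']
```

Deduplicating after truncation is what shrinks the list so much: every deep path collapses onto one of a small number of short prefixes.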

The next step was to get some text related to each category. I used Wikipedia for this. For each category, I tried crawling the page with the URL prefix “en.wikipedia.org/wiki/”, using Python Goose to extract the article body. Since most of the category names were simple notions or concepts, there was almost always a corresponding Wikipedia article. Only a few categories had none, and I simply ignored those.
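A rough sketch of this crawling step is below; it is not the actual script. It assumes the `goose3` package (a Python 3 fork of python-goose), an `https://` scheme in front of the prefix, and that the last path component is used as the article title. The helper names `wiki_url` and `fetch_body` are hypothetical.

```python
from urllib.parse import quote

WIKI_PREFIX = "https://en.wikipedia.org/wiki/"


def wiki_url(category_path):
    """Build a Wikipedia URL from the last component of a category path."""
    title = category_path.strip().split("/")[-1]
    return WIKI_PREFIX + quote(title)


def fetch_body(url):
    """Return the extracted article text, or None if extraction fails."""
    # Imported lazily so wiki_url() can be used without Goose installed.
    from goose3 import Goose  # assumption: Python 3 fork of python-goose

    try:
        article = Goose().extract(url=url)
    except Exception:
        # Missing page, network error, and so on: treat as "no article".
        return None
    return article.cleaned_text or None


print(wiki_url("Arts/Animation"))  # https://en.wikipedia.org/wiki/Animation
```

Returning `None` on failure makes it easy to skip the few categories that have no matching article.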

A little trick I used in the script was to split category names such as Religion_and_Spirituality into several topics. I would also try the singular form of the category, since Wikipedia would only have an article for “Award” and not “Awards.”
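These two tricks might look something like the sketch below. The singularization rule here is a deliberately crude assumption (strip a trailing "s", turn "ies" into "y"), not what the original script does, and it will mangle words such as "News". Both function names are made up for illustration.

```python
def naive_singular(word):
    """Very rough singular form; a real lemmatizer would do better."""
    if len(word) > 3 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def candidate_titles(category):
    """Split a category name on '_and_' and add singular variants,
    preserving order and dropping duplicates."""
    titles = []
    for part in category.split("_and_"):
        for title in (part, naive_singular(part)):
            if title and title not in titles:
                titles.append(title)
    return titles


print(candidate_titles("Religion_and_Spirituality"))  # ['Religion', 'Spirituality']
print(candidate_titles("Awards"))                     # ['Awards', 'Award']
```

Each candidate can then be tried in turn against Wikipedia, keeping whichever ones resolve to an article.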

The Python script that does everything up to this point, crawling the pages and generating one file per article, is available here.

I have several items on my TODO list for this little project:

  • Remove stop words. This is very simple. Although the classifier should work with more or less the same accuracy when stop words are left in, removing them can make training faster because there are fewer features.
  • Crawl more languages. In fact, the reason I’m collecting training data is to tag French news articles, so this one is kind of a requirement.
  • Lemmatize the category names as well as the features extracted for each category. I believe this should improve classification significantly.
  • Finally, I would like to try a classifier other than Naive Bayes, which does not have the best reputation in the classification arena.
Sounds pretty optimistic; I’ll try to keep my appetite up for this.
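For the first item on the list, a minimal sketch of stop-word removal during feature extraction could look like this. The stop-word set is a tiny hand-picked English sample for illustration only (a French list would be needed for the French articles), and the `features` function is hypothetical.

```python
import re
from collections import Counter

# Illustrative sample only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is",
              "it", "for", "on", "that", "with", "as"}


def features(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens, skipping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)


print(features("The history of the award"))
# Counter({'history': 1, 'award': 1})
```

Dropping these tokens before counting keeps the per-category feature tables smaller, which is where the training speed-up would come from.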
