
Archive for October, 2012

Yesterday I wrote this handy little Python script to compute TF-IDF scores for a collection of documents; check it out here.

This little function does most of the work other than the TF-IDF calculation itself:

import re
from nltk.stem.wordnet import WordNetLemmatizer

# function to tokenize text, and put words back to their roots
def tokenize(text):
    # split into tokens (HTML anchors/tags, words, @mentions, #hashtags), dropping punctuation
    tokens = re.findall(r"<a.*?/a>|<[^\>]*>|[\w'@#]+", text.lower())

    # lemmatize words: try both noun and verb lemmatizations
    lmtzr = WordNetLemmatizer()
    for i in range(len(tokens)):
        res = lmtzr.lemmatize(tokens[i])
        if res == tokens[i]:
            tokens[i] = lmtzr.lemmatize(tokens[i], 'v')
        else:
            tokens[i] = res
    return tokens

It uses the powerful NLTK library along with WordNet to put each word back to its root. Since I don’t know a word’s position in the sentence (verb, noun, …), I just try it once as a noun and once as a verb and take whichever lemmatization changes the word.
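In case it helps, here is a minimal sketch of how the TF-IDF scores themselves could be computed on top of tokenize(); this is illustrative only, not necessarily the exact code from the script linked above.

import math
from collections import Counter

def tf_idf(documents):
    """Return one dict per document, mapping token -> TF-IDF weight.
    Assumes the tokenize() function above and non-empty documents."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # document frequency: in how many documents does each token appear?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    scores = []
    for tokens in token_lists:
        tf = Counter(tokens)
        scores.append({term: (count / float(len(tokens))) *
                             math.log(n_docs / float(df[term]))
                       for term, count in tf.items()})
    return scores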

Read Full Post »

So you’re working on data mining, information retrieval, unstructured data management, or some similar topic, and you know it’s time to tamper with some data. It could be that you have a hypothesis to test, an equation for building a TF-IDF index, or a model for classification or similarity. In any case, you need some real-life data sets to play with. In my case, it was news articles. I first needed to crawl some news websites, generating URLs for several hundreds or thousands of (in my case, random) pages, which is itself a non-trivial task; then I needed to extract the bulk of each news article from its web page.

While you can use any number of already-implemented tools for parsing the HTML of a web page, in every possible language, it’s easy enough to write your own in a high-level language such as Python. I gave it a try with lxml and it worked great! Nevertheless, it’s still relatively difficult to extract the body of an article from an arbitrary news source, since you don’t know the structure of the HTML elements that hold the interesting pieces of the article.
The process is called “Content Extraction” or “Scraping.” If you google the term you’ll find several thousand related results, including open-source projects, articles, and publications on the topic. A content extraction tool essentially extracts text from an HTML page. I needed a simple tool that returns only the body of an article; essentially, only the “meaningful” text.
Enter decruft.
I like minimalist software. Most of the time I just need a tiny piece of code that does exactly what I need, not a bit more, not a bit less. decruft is a minimalist tool for extracting the body of an article. It’s a Python port of Arc90‘s amazing Readability project. In fact, its core is used in very popular software today (Apple Safari, Amazon Kindle, and some iPad readers).
It took very little Python to wrap decruft and achieve exactly what I wanted in very little time. All the wrapper did was read a list of URLs from a file, feed them one by one to decruft‘s main method, get the result back as HTML, and proceed with further processing of that as an lxml ElementTree.
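For illustration, the wrapper looked roughly like the sketch below. It assumes decruft exposes a readability-style Document(html).summary() interface (check the version you install); the file names are just examples.

# rough sketch of the wrapper described above
import urllib2
from lxml import etree
from decruft import Document  # assumed API: Document(html).summary()

def extract_articles(url_file):
    """Read URLs (one per line), run each page through decruft, and yield the
    extracted article body as an lxml element tree."""
    with open(url_file) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            html = urllib2.urlopen(url).read()
            article_html = Document(html).summary()  # cleaned-up article HTML
            yield url, etree.fromstring(article_html, etree.HTMLParser())

for url, tree in extract_articles('urls.txt'):
    print url, etree.tostring(tree, pretty_print=True)[:200]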

Read Full Post »

The usual painful task every researcher has to go through at some point in time: writing a simple customized web crawler!

Today I sat down to write my own. Can’t call it a crawler, though, since it really doesn’t do any “crawling”; all I needed was a piece of code that would parse the web pages I throw at it and give me the HTML elements I’m looking for, by name, class, id, or what have you. It seems not enough people have tried this on a Mac, though; community support was horribly weak!

I had intended to write my parser in Java, but after a bit of reading around on Google, I decided I’d try Python, which is probably much better at handling such clichéd, high-level tasks. Mac has it installed by default, sweet!

Now comes the pain: which package should I use? I started with Python’s native support for HTML parsing, which has you write a subclass with three methods, handle_starttag, handle_endtag, and handle_data. That doesn’t give you a very intuitive flow for your logic when you want to search an HTML tree for elements and then execute some action. You can find it at the link below, but a disclaimer: DON’T use it if you want to handle complicated real-life HTML pages. In my experience it crashed while parsing JavaScript code.

With a clear conscience, then, here it is: http://docs.python.org/library/htmlparser.html
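For reference, this is roughly what the native approach looks like; a minimal sketch that subclasses Python 2’s HTMLParser and collects the text of every <p> element (“page.html” is a hypothetical saved page).

from HTMLParser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text inside every <p> element of a page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data

parser = ParagraphExtractor()
parser.feed(open('page.html').read())
print parser.paragraphs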

The next step was to try BeautifulSoup. BS enjoys very nice popularity on the web (although I don’t think it enjoys similar interest from its developers), but again, for some reason, it wouldn’t parse my hard-headed HTML page. My guess was that it was just a wrapper around the native Python parsing code.

My last resort was the famous lxml parser. I had read a lot about this one and was all excited to try it. However, no matter how hard I tried, it wouldn’t install on my Mac. I’m on Mountain Lion, so there are a number of suspects here: it could be the manually-installed GCC compiler (Apple stopped shipping GCC as of Mountain Lion; you have to install it yourself if you need it), it could be the version of lxml, or it could just be that the developers behind it are whole-hearted Mac-haters, who proudly declare on the software’s web page that:

Apple doesn’t help here, as MacOS-X is so badly maintained by them that the pre-installed system libraries of libxml2 and libxslt tend to be horribly outdated, and updating them is everything but easy

Anyway, after wasting another hour, I found this great step-by-step tutorial by Leo for installing it. However, the latest versions of libxml2, libxslt, and lxml still did NOT compile. I had to try a couple more, et voilà… I finally found a combination of versions that compiled fine:
  • libxml2-2.8.0
  • libxslt-1.1.27
  • lxml-2.3.5
Hope this helps some other poor lxml-using soul.
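Once it finally built, it was worth the trouble: grabbing elements by tag, id, or class is a one-liner with XPath. A quick sketch, with hypothetical file and element names:

from lxml import html

# parse a saved page (or pass a URL straight to html.parse)
tree = html.parse('article.html')

# grab elements by tag, id, or class via XPath (names here are made up)
title = tree.xpath('//h1[@id="title"]/text()')
paragraphs = tree.xpath('//div[@class="article-body"]//p')

print title
print [p.text_content() for p in paragraphs]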

Read Full Post »

A popular categorization of recommendation systems is memory-based vs. model-based algorithms. I’ve blogged a brief summary of both techniques before.

To the best of my knowledge, Google still uses a mixed model consisting of both techniques, plus a simple content-based algorithm that monitors explicitly-declared user interests and observes their change over time. The two latest papers from Google on this are:

The first paper deserves a special discussion. From the looks of it, it introduces the main part of the Google News recommendation engine. It describes a hybrid system that uses a memory-based algorithm (item covisitation) along with two model-based algorithms, MinHash and the infamous classic PLSI; no news there. Greg wrote an excellent review of the paper here.
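As a quick aside, the MinHash part is easier to grasp with a toy example. The sketch below is not Google’s implementation, just the general idea of min-wise hashing a user’s click history so that users with overlapping histories tend to collide on parts of their signatures:

import random

NUM_HASHES = 10
random.seed(42)
# each "hash function" is simulated by seeding Python's hash with a random value
HASH_SEEDS = [random.randint(1, 2 ** 31) for _ in range(NUM_HASHES)]

def minhash_signature(clicked_items):
    """Short signature for a set of clicked item ids; users with many items in
    common are likely to agree on many signature components."""
    return tuple(min(hash((seed, item)) for item in clicked_items)
                 for seed in HASH_SEEDS)

# toy click histories (hypothetical item ids)
alice = {'item1', 'item2', 'item3', 'item4'}
bob = {'item2', 'item3', 'item4', 'item5'}
print minhash_signature(alice)
print minhash_signature(bob)  # many components will match Alice's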
The second paper, though more recent, has less detail; it introduces a simple content-based technique, summed up as follows:

In this paper, we describe a content-based method to recommend news articles according to their topic categories, which are assigned by text classifiers. Based on a user’s news reading history, the recommender predicts the topic categories of interest to her each time she visits Google News.

The user’s news reading history is simply the collection of news items she has clicked on in the past. The authors argue that the tricky part is that the user’s clicking trend changes over time, and they categorize a user’s interests into two types: long-term interests, which are based on intrinsic features of the user, like her age, profession, character, and so on; and short-term interests, which they argue are media influences on the general public due to some popular news event or featured story.

The paper proposes a Bayesian model to capture the user’s long-term interests. The more interesting part is actually using the model to observe the news trend of the public, which is then used to influence recommendations presented to the user.
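To make that a bit more concrete, here is a toy sketch (my own simplification, not the model from the paper) of blending a user’s long-term category distribution with the current public trend:

def category_interest(user_clicks, public_clicks, weight=0.7):
    """Toy estimate of a user's interest per topic category: a mixture of the
    user's long-term click distribution and the current public trend.
    Illustration only; the paper derives the combination with a Bayesian model."""
    def normalize(counts):
        total = float(sum(counts.values())) or 1.0
        return {c: n / total for c, n in counts.items()}

    p_user = normalize(user_clicks)      # long-term interests
    p_trend = normalize(public_clicks)   # short-term, trend-driven interests
    categories = set(p_user) | set(p_trend)
    return {c: weight * p_user.get(c, 0.0) + (1 - weight) * p_trend.get(c, 0.0)
            for c in categories}

# hypothetical click counts per topic category
user_clicks = {'technology': 40, 'science': 25, 'sports': 5}
public_clicks = {'politics': 500, 'sports': 300, 'technology': 200}
print category_interest(user_clicks, public_clicks)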

Overall, it seems the Googlers have managed to build themselves a pretty good recommendation engine by combining almost ALL of the state-of-the-art techniques, getting the best out of each to produce the most accurate recommendations they can.

Read Full Post »

Mac KeyRemapper

I busted my MacBook Pro’s spacebar key last night. It felt a little sticky (I guess I spilled something on it), so while trying to clean it I took it out, and it seems I broke the tiny little hinges underneath it. I contacted a local Apple reseller, but of course their estimate was around 200 CHF to fix a spacebar!

Until I can buy a new spacebar for myself on ebay.ch, I’ve remapped my right CMD key to space; I rarely use that key anyway. Check out KeyRemap4MacBook, it did the trick with minimal fuss.

Read Full Post »

Anywhere you try to read up on recommendation systems, you’ll catch a mention of this categorization: memory-based versus model-based recommendation systems. I’ve seen some terrible explanations of this categorization, so I’ll try to put it as simply as I can.

Memory-based techniques use the data you have (likes, votes, clicks, etc.) to establish correlations (similarities?) between either users (Collaborative Filtering) or items (Content-Based Recommendation) in order to recommend an item i to a user u who has never seen it before. In the case of collaborative filtering, we get the recommendations from items seen by the users who are closest to u, hence the term collaborative. In contrast, content-based recommendation tries to compare items using their characteristics (movie genre, actors, a book’s publisher or author, etc.) to recommend similar new items.

In a nutshell, memory-based techniques rely heavily on simple similarity measures (cosine similarity, Pearson correlation, Jaccard coefficient, etc.) to match similar people or items together. If we have a huge matrix with users on one dimension and items on the other, with the cells containing votes or likes, then memory-based techniques use similarity measures on two vectors (rows or columns) of that matrix to produce a number representing how similar they are.
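As a tiny (hypothetical) example, here is what user-user similarity on such a matrix boils down to:

import numpy as np

# toy user-item matrix: rows are users, columns are items, cells are ratings (0 = unseen)
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 5, 4],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / denom if denom else 0.0

# similarity between user 0 and the other users
for u in (1, 2):
    print 'sim(user0, user%d) = %.3f' % (u, cosine_similarity(ratings[0], ratings[u]))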

Model-based techniques, on the other hand, try to fill in the rest of this matrix. They tackle the task of “guessing” how much a user will like an item they have not encountered before. To do so they use machine learning algorithms that train on the vector of items for a specific user, building a model that can predict the user’s rating for a new item that has just been added to the system.

Since I’ll be working on news recommendations, the latter family sounds much more interesting. News items emerge very quickly (and disappear just as quickly), so it makes sense for the system to develop a smart way of detecting that a new piece of news will interest a user even before other users have seen or rated it.

Popular model-based techniques are Bayesian networks, Singular Value Decomposition, and Probabilistic Latent Semantic Analysis (or Probabilistic Latent Semantic Indexing). For some reason, none of the model-based techniques enjoys a particularly happy-sounding name.
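To give a flavour of the model-based side, here is a tiny sketch that uses a plain truncated SVD of the same kind of matrix to guess the missing cells; real systems use more careful factorizations, this is just the idea:

import numpy as np

# same toy user-item matrix; zeros mark unseen items we want to predict
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 5, 4],
], dtype=float)

# low-rank factorization: keep only the k strongest "taste" factors
k = 2
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
approx = U[:, :k].dot(np.diag(s[:k])).dot(Vt[:k, :])

# the reconstructed matrix now holds a guess for every cell, including the zeros
print np.round(approx, 2)
print 'predicted rating of user 1 for item 2: %.2f' % approx[1, 2]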

Read Full Post »

My first post in this blog. Also my first few days in Lausanne for my PhD.

I left my comfort zone back in Egypt and Saudi Arabia, where I knew the places, the people, and the language, and where I had gotten used to working with my research advisor, friends, and colleagues, but I guess it was time.

I’ll be working on news recommendations, at least for the first semester project. I’m now exploring this new (old?) world of collaborative filtering, content-based recommendations, and hybrid models. It’s all overwhelming at the beginning, but once you read enough papers, blogs, and wikis, you realize it’s the same few concepts repeating all over the place. For the first week or so, it feels incredibly boring to read through all those papers, each of which adds so little to the already-exhausted market of ideas. Every paper starts with the same introduction about the Internet being such a massive collection of resources, and about users not knowing what they are looking for most of the time; they’re just looking for “what’s up.”

To be honest with you, I hate the look of the news.google.com website. I think its design is a complete disaster, and I dread the way news snippets stretch out just to show you similar stories from other news agencies. Nevertheless, I use it regularly! The reason is that, with as much data and as many intellectual resources as Google has, I’d trust the folks over there to do one heck of a job spotting news I’d be interested in, more than any other service on the Internet. That’s why I thought the best thing to do now is to look up the latest Google publications on the subject.

A couple of days back, I stopped at a recent paper by three Googlers titled “Google News Personalization: Scalable Online Collaborative Filtering.” I’ll dedicate my next few posts to this area.

Read Full Post »