
Thelastpsychiatrist strikes again, this time on how the system sends misleading signals to women (at least that’s what I gathered from his mind-boggling, metaphor-infested writing!). He argues that women are picking the wrong fights with the system, striving for the trappings of power such as promotions, management titles, and workplace prestige, pursued not for any real reason but just for their own sake. Instead, they should fight for real chances to change the unbalanced workforce, which, in his opinion, is all about how much money you make, how long you work, and other material benefits.

One quote I especially liked:

“It’s so great that Americans will still vote for a white guy even if he’s a little black.”

Hilarious GIFs for moments in a developer’s life! Using moments from House is just awesome; the guy has such facial expressions!

part 2: http://martinvalasek.com/blog/pictures-from-a-developers-life-part-2

part 3: http://martinvalasek.com/blog/pictures-from-a-developers-life-part-3

Pictures from a developer’s life


Righteous Trayf

Dear Mr. President,

My younger brother was an early believer in you.  He worked for your Senate campaign.  At the age of 25, he ran the GOTV campaign in North Carolina, delivering an improbable victory for you in a Southern state that helped give you your first term.  This year, slightly less bright-eyed but nonetheless a believer, he was working on your campaign again when he died suddenly, a brilliant, energetic 29 year old, dead in his tracks.  You know this.  You called my parents.  Your campaign, to my greatest appreciation and respect, brought grief counselors for his coworkers, dedicated a corner of the office and much of your fundraising efforts to him, and bussed his coworkers to join the hundreds of others at his funeral.

You may not know that after his sudden passing, many of his friends quit their jobs, moved, changed their lives to continue working on…


This is perhaps one of my favorite graph algorithms: max-flow min-cut. It’s probably one of the most elegant algorithms, yet practical and useful in so many areas. In this post, I’m gonna talk a little about graph stability.

A flow graph G is called k-stable iff changing the capacities of any k edges in G doesn’t change the max-flow value. A very nice wording! It tells you that in a specific road network, for example, tweaking only k edges just won’t do: no matter what you do to the capacities of those k roads, traffic is still going to be a nightmare, and you’re not moving an inch towards a better solution!

Take the case when k=1. This means that you cannot get a better flow through the network by changing one edge. Any edge.

A straightforward solution is to run a max-flow procedure, using any of the well-documented algorithms, to obtain a residual network in which the source is disconnected from the sink. You can do this in O(V²E) with a moderately complicated implementation. Afterwards, you look for an edge such that increasing its capacity (let’s call this edge promotion!) increases the max-flow. If you find one, the graph is NOT 1-stable; in other words, it is 1-unstable. To prove that it is in fact 1-stable, you have to exhaust all the options, trying every edge and making sure none of them can increase the total max-flow if promoted on its own. Checking each edge with a BFS over the residual network means quadratic time to exhaust all the possibilities.

A better way is to construct two sets of vertices: those reachable from the source in the residual network, call this set Vs, and those that can reach the sink, call it Vt. Now it’s easy to check in linear time: just scan every edge and see whether it starts at a vertex in Vs and ends at a vertex in Vt. If we find such an edge, the graph is not 1-stable.
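
For concreteness, here’s a minimal sketch of that linear check in Python. I’m using networkx and its edmonds_karp residual network purely for illustration; the library, the capacity attribute, and the toy graph at the end are my choices, not part of the original discussion.

import networkx as nx
from networkx.algorithms.flow import edmonds_karp

def is_1_stable(G, s, t):
    # run any max-flow algorithm; the returned residual network R
    # stores R[u][v]['capacity'] and R[u][v]['flow']
    R = edmonds_karp(G, s, t)
    res = lambda u, v: R[u][v]['capacity'] - R[u][v]['flow']

    # Vs: vertices still reachable from the source in the residual network
    Vs, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in R.successors(u):
            if v not in Vs and res(u, v) > 0:
                Vs.add(v)
                stack.append(v)

    # Vt: vertices that can still reach the sink in the residual network
    Vt, stack = {t}, [t]
    while stack:
        v = stack.pop()
        for u in R.predecessors(v):
            if u not in Vt and res(u, v) > 0:
                Vt.add(u)
                stack.append(u)

    # promoting an edge helps iff it jumps straight from Vs into Vt
    return not any(u in Vs and v in Vt for u, v in G.edges())

# toy usage: the s->a bottleneck means promoting it raises the flow
G = nx.DiGraph()
G.add_edge('s', 'a', capacity=1)
G.add_edge('a', 't', capacity=2)
print(is_1_stable(G, 's', 't'))   # False

The two reachability passes and the final edge scan are each linear, which is the whole point.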

If we want to check for 2-stability, it gets a little more complicated, since now we need to find a path containing two (consecutive?) edges that increase the max-flow if promoted together. Surprisingly, we can still do this in linear time. What we do now is split the vertices into three sets: those reachable from the source in the residual network (Vs), those that can reach the sink (Vt), and the rest (Vm), which lie in the middle: they’re not reachable from the source and can’t reach the sink. The problem reduces to finding a path from the source to the sink that passes through two edges that used to be saturated. In fact, it’s easier to just promote all the edges that cross between these three sets and then look for a path, which is still a linear process.
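
And here’s a sketch of that 2-stability check, again assuming a networkx DiGraph whose edges carry a capacity attribute; the helper names and the promote-then-search encoding are mine, following the description above.

from networkx.algorithms.flow import edmonds_karp

def is_2_stable(G, s, t):
    R = edmonds_karp(G, s, t)
    res = lambda u, v: R[u][v]['capacity'] - R[u][v]['flow']

    # generic reachability over a 'neighbours' generator
    def grow(seed, neighbours):
        seen, stack = {seed}, [seed]
        while stack:
            x = stack.pop()
            for y in neighbours(x):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    Vs = grow(s, lambda u: (v for v in R.successors(u) if res(u, v) > 0))
    Vt = grow(t, lambda v: (u for u in R.predecessors(v) if res(u, v) > 0))
    side = lambda x: 's' if x in Vs else ('t' if x in Vt else 'm')

    # promote every edge of G that crosses between two different sets,
    # then look for an s-t path over residual plus promoted edges
    def step(u):
        return (v for v in R.successors(u)
                if res(u, v) > 0 or (G.has_edge(u, v) and side(u) != side(v)))

    return t not in grow(s, step)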

Can we still do this with k>2?

I set out to write a simple Bayesian text classifier for news articles. The problem is that I only have the test data. In other words, I have the articles that I want to classify or attach tags to, but I don’t have any training data for building a proper model.

My first task was to find a proper taxonomy of article categories/tags. It took me about an hour to realize that it’s just not an easy task. I searched for an easy way to download the Yahoo! directory, with no luck. I had better luck with the Open Directory Project. I finally found something promising on this page, which was actually pretty tricky to find! I got the categories.txt file; it looks like this:

Adult
Adult/Arts
Adult/Arts/Animation
Adult/Arts/Animation/Anime
Adult/Arts/Animation/Anime/Fan_Works
Adult/Arts/Animation/Anime/Fan_Works/Fan_Art
….

Then I formatted it a little so that I only get the first two levels (category/subcategory):

Arts
Arts/Animation
Arts/Animation/Celebrities
….

This resulted in a huge reduction in the number of categories (I ended up with 275 topics).
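
For reference, the trimming step is only a few lines of Python. The output filename and the choice to skip the Adult subtree are my own assumptions here, not necessarily what my script does:

# collapse the ODP dump to category/subcategory pairs
seen = set()
with open('categories.txt') as f:
    for line in f:
        parts = line.strip().split('/')
        if not parts[0] or parts[0] == 'Adult':   # assumption: drop the Adult subtree
            continue
        seen.add('/'.join(parts[:2]))             # keep at most two levels

with open('topics.txt', 'w') as out:
    out.write('\n'.join(sorted(seen)))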

The next step was to get some text related to each category. I used Wikipedia for this. For each category, I crawled the page under the URL prefix “en.wikipedia.org/wiki/” using Python Goose to extract the article body. Since most of the category names were simple notions or concepts, there was almost always a corresponding Wikipedia article; it let me down on only a few, which I just ignored.

A little trick I used in the script was to try splitting a category name into several topics whenever I saw one such as Religion_and_Spirituality. I would also try the singular form of the category, since Wikipedia has an article for “Award” but not “Awards.”
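
A rough sketch of that crawling step is below. I’m using goose3’s Goose here, and the split-on-“_and_” plus drop-the-trailing-s singularisation are simplistic stand-ins for whatever the linked script actually does:

from goose3 import Goose

WIKI_PREFIX = 'https://en.wikipedia.org/wiki/'

def candidate_titles(category):
    # 'Religion_and_Spirituality' -> ['Religion', 'Spirituality', ...],
    # plus naive singular forms such as 'Awards' -> 'Award'
    titles = []
    for name in category.split('_and_'):
        titles.append(name)
        if name.endswith('s'):
            titles.append(name[:-1])
    return titles

g = Goose()

def fetch_training_text(category):
    for title in candidate_titles(category):
        try:
            article = g.extract(url=WIKI_PREFIX + title)
            if article.cleaned_text:
                return article.cleaned_text
        except Exception:
            continue
    return None   # a few categories just have no article; I ignore those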

For everything up to this point, you can find my Python script that crawls the pages and generates one file per article here.

I have several items on my TODO list for this little project:

  • Remove stop words. This is very simple (a few lines with NLTK, sketched right after this list). Although the classifier should work with roughly the same accuracy even with stop words, removing them makes training faster by cutting down the number of features.
  • Crawl more languages. In fact, the reason I’m collecting training data is to tag French news articles, so this one is kind of a requirement.
  • Lemmatize the category names as well as the features extracted for each category. I believe this should improve classification significantly.
  • Finally, I would like to try a classifier other than Naive Bayes, which doesn’t enjoy the best reputation in the classification arena.
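
Here’s the stop-word removal I have in mind for the first item, assuming NLTK’s English stop-word list (and that nltk.download('stopwords') has been run):

from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words('english'))

def remove_stop_words(tokens):
    # drop common function words before building features
    return [t for t in tokens if t not in STOP_WORDS]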
Sounds pretty optimistic; I’ll try to keep my appetite up for this.

Yesterday I wrote this handy little Python script to compute the TF-IDF scores for a collection of documents; check it out here.

This little function does most of the work other than the TF-IDF calculation itself:

import re
from nltk.stem.wordnet import WordNetLemmatizer

# function to tokenize text and put words back to their roots
def tokenize(text):

    # remove punctuation, keeping anchor tags, other tags, and word-like tokens
    tokens = re.findall(r"<a.*?/a>|<[^\>]*>|[\w'@#]+",
                        text.lower())

    # lemmatize words: try both noun and verb lemmatizations
    lmtzr = WordNetLemmatizer()
    for i in range(len(tokens)):
        res = lmtzr.lemmatize(tokens[i])                  # noun lemmatization first
        if res == tokens[i]:
            tokens[i] = lmtzr.lemmatize(tokens[i], 'v')   # fall back to verb
        else:
            tokens[i] = res
    return tokens

It uses the powerful NLTK library along with WordNet to figure out a word’s root form. Since I don’t know a word’s part of speech (verb, noun, …), I just lemmatize it once as a noun and once as a verb and take whichever changes the word.
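
For completeness, here’s a sketch of the TF-IDF half itself, built on top of tokenize(); the raw term counts and the log-IDF are my choices and not necessarily the exact formulas in the linked script:

import math
from collections import Counter

def tf_idf(documents):
    # documents: list of raw strings; returns one {term: score} dict per document
    token_lists = [tokenize(doc) for doc in documents]

    # document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))

    n = len(documents)
    scores = []
    for tokens in token_lists:
        tf = Counter(tokens)
        scores.append({term: count * math.log(n / df[term])
                       for term, count in tf.items()})
    return scores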

So you’re working on data mining, information retrieval, unstructured data management, or a similar topic, and you know it’s time to tamper with some data. It could be that you have some hypothesis to test, an equation for building a TF-IDF index, or a model for classification or similarity. In any case, you need some real-life data sets to play with. In my case, it was news articles. First I need to crawl some news websites, generating URLs for several hundred or thousand (random, in my case) pages, which is itself a non-trivial task; then I need to extract the bulk of each news article from its web page.

While you can use a number of already-implemented tools to parse the HTML of a web page in every possible language, it’s probably easy to write your own in a high-level language such as Python. I gave it a try with lxml and it worked great! Nevertheless, it’s still relatively difficult to extract the body of an article from an arbitrary news source, since you don’t know the structure of the HTML elements that hold the interesting pieces of the article.
The process is called “Content Extraction/Scraping.” If you google the term you’ll find several thousand related results, including open-source projects, articles, and publications on the topic. A content extraction tool will essentially extract text from an HTML page. I needed a simple tool that returns only the body of an article; essentially, only the “meaningful” text.
Enter decruft.
I like minimalist software. Most of the time I just need a tiny piece of code that does exactly what I need, not a bit more, not a bit less. decruft is a minimalist tool for extracting the body of an article. It’s a Python port of Arc90’s amazing Readability project. In fact, its core is used in very popular software today (Apple Safari, Amazon Kindle, and some iPad readers).
I coded very little Python to wrap decruft, achieving exactly what I wanted in very little time. All the wrapper did was read a list of URLs from a file, supply them one by one to decruft’s main method, get the result back as HTML, and do the further processing on it as an lxml ElementTree.
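
A sketch of that wrapper is below. I’m assuming decruft exposes the readability-style Document(html).summary() entry point; if your copy names it differently, only that one line changes. The URL file format (one URL per line) is also my assumption.

from urllib.request import urlopen

import lxml.html
from decruft import Document   # assumption: readability-style API

def articles_from(url_file):
    with open(url_file) as f:
        for url in (line.strip() for line in f if line.strip()):
            html = urlopen(url).read().decode('utf-8', 'ignore')
            article_html = Document(html).summary()        # decruft does the heavy lifting
            yield url, lxml.html.fromstring(article_html)  # ready for further lxml processing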

The usual painful task every researcher has to go through at some point: writing a simple customized web crawler!

Today I sat down to write my own. I can’t call it a crawler though, since it really doesn’t do any “crawling”; all I needed was a piece of code that parses the web pages I throw at it and gives me the HTML elements I’m looking for, by name, class, id, or what have you. It seems not enough people have tried this on a Mac though; community support was horribly weak!

I had intended to write my parser in Java, but after a bit of reading on Google, I decided I’d try Python, which is probably much better at handling such clichéd, high-level tasks. The Mac has it installed by default. Sweet!

Now comes the pain: which package should I use? I started with Python’s native HTML parsing support, which has you write a subclass with three methods, handle_starttag, handle_endtag, and handle_data. That doesn’t give you a very intuitive way to structure your logic when you want to search an HTML tree for elements and then act on them. You can find it easily at the link below, but a disclaimer: DON’T use it if you want to handle complicated real-life HTML pages. In my experience it crashed while parsing JavaScript code.

Now, with a clear conscience, here it is: http://docs.python.org/library/htmlparser.html

The next step was to try BeautifulSoup. BS is quite popular on the web (although I don’t think it enjoys similar interest from its developers), but again, for some reason it wouldn’t parse my hard-headed HTML page. My guess was that it was just a wrapper around the native Python parsing code.

My last resort was the famous lxml parser. I had read a lot about this one and was all excited to try it. However, no matter how hard I tried, it just wouldn’t install on my Mac. I’m on Mountain Lion, so there are a number of suspects here: it could be the manually-installed GCC compiler (Apple stopped shipping GCC as of Mountain Lion; you have to install it yourself if you need it), it could be the version of lxml, or it could just be that the developers behind it are whole-hearted Mac-haters, who proudly declare on the software’s web page that:

Apple doesn’t help here, as MacOS-X is so badly maintained by them that the pre-installed system libraries of libxml2 and libxslt tend to be horribly outdated, and updating them is everything but easy

Anyway, after wasting another hour, I found a great step-by-step tutorial by Leo for installing it. However, the latest versions of libxml2, libxslt, and lxml still did NOT compile. I had to try a couple more and, et voilà, I finally found a combination of versions that compiled fine:
  • libxml2-2.8.0
  • libxslt-1.1.27
  • lxml-2.3.5
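And once it finally built, the kind of element hunting I was after takes just a few lines with lxml.html (the URL, id, and class names here are placeholders):

import lxml.html

doc = lxml.html.parse('http://example.com/some-article').getroot()

title = doc.findtext('.//title')                         # by tag name
body = doc.get_element_by_id('content', None)            # by id
paras = doc.xpath("//div[@class='article-body']//p")     # by class, via XPath
text = ' '.join(p.text_content() for p in paras)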
Hope this helps any other poor lxml-using soul.

A popular categorization of recommendation systems is memory-based vs. model-based algorithms. I’ve blogged a brief summary of both techniques before.

To the best of my knowledge, Google still uses a mixed model consisting of both techniques, plus a simple content-based algorithm that monitors explicitly-declared user interests, and observes their change over time. The latest two papers from Google on this are:

The first paper in particular deserves special discussion. From the looks of it, it introduces the main part of the Google News recommendation engine. It describes a hybrid system that uses a memory-based algorithm (item covisitation) along with two model-based algorithms, MinHash and the classic PLSI, no news there. Greg wrote an excellent review of the paper here.
The second paper, though more recent, has less detail and introduces a simple content-based technique, summed up as follows:

In this paper, we describe an content-based method to recommend news articles according to their topic categories, which is assigned by text classifiers. Based on a user’s news reading history the recommender predicts the topic categories of interest to her each time she visit Google News.

The user’s news reading history is simply the collection of news items she has clicked on in the past. The tricky part, the authors argue, is that a user’s clicking behavior changes over time, so they split a user’s interests into two types: long-term interests, driven by intrinsic features of the user such as her age, profession, and character; and short-term interests, which they attribute to the media’s influence on the general public around some popular news event or featured story.

The paper proposes a Bayesian model to capture the user’s long-term interests. The more interesting part is using the same model to observe the public’s current news trend, which then influences the recommendations presented to the user.
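
As a toy illustration of that idea (emphatically not the paper’s actual model), you can think of it as blending the user’s long-term click distribution over categories with the current public trend:

def rank_categories(user_clicks, trend_clicks, weight=0.5):
    # user_clicks / trend_clicks: {category: click count};
    # returns categories ranked by a simple mixture of the two distributions
    def normalize(counts):
        total = float(sum(counts.values())) or 1.0
        return {c: n / total for c, n in counts.items()}

    u, t = normalize(user_clicks), normalize(trend_clicks)
    score = {c: weight * u.get(c, 0.0) + (1 - weight) * t.get(c, 0.0)
             for c in set(u) | set(t)}
    return sorted(score, key=score.get, reverse=True)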

Overall, it seems the Googlers have built themselves a pretty good recommendation engine by combining almost ALL of the state-of-the-art techniques, taking the best of each to produce the most accurate recommendations they can.

Mac KeyRemapper

I busted my MacBook Pro’s spacebar last night. It felt a little sticky (I guess I spilled something on it), so while trying to clean it I took it out, and it seems I broke the tiny little hinges on it. I contacted a local Apple reseller, but of course their estimate was around 200 CHF to fix a spacebar!

Until I can buy a new spacebar on ebay.ch, I’ve remapped my right CMD key to space; I rarely use that key anyway. Check out KeyRemap4MacBook: it did the trick with minimal fuss.