Tech | Yasser Ebrahim

Archive for the ‘Tech’ Category

On Hadoop, and Mac: Quickstart

Posted in hadoop, Tech on April 12, 2013

Last month I had a project for my advanced databases course: implementing a simple iterative UV matrix decomposition algorithm using Hadoop. This post is just about the first step, getting Hadoop set up and working, not the algorithm itself. I’m gonna talk about that in a future post.

For those of you noobs like me who never tried Hadoop on anything beyond Apache’s boring copy & paste WordCount, it can be a real pain in the butt to run something more complicated, simply because it’s one of the worst-documented, most badly written pieces of software that implement a really nice idea!

One good idea is to compile a Hadoop version from scratch, rather than use the pre-compiled libraries. This way you should be able to debug and run your Hadoop applications easily. For that, take a look at this video tutorial.

In this post however, I list very simple steps to quickly get Hadoop running on a Mac. So here we go:

1. Download Hadoop

You can get the latest stable version from here: http://www.apache.org/dyn/closer.cgi/hadoop/common/ or you can go to the archive to find older versions. Sometimes you need a specific version for compatibility, and there are some tutorials that recommend a Hadoop version earlier than 0.21.0 so that it works with the infamous Hadoop Eclipse plugin. I tried the plugin, and it’s definitely not worth it. You’re better off either compiling Hadoop yourself, or just following these steps if you don’t need Eclipse’s step debugging.

Let’s assume you downloaded the version and decompressed it into your ~/Downloads directory.

2. Setting Environment Variables

You need two things set. Either export them temporarily in your shell, or put the exports in your ~/.bashrc or ~/.bash_profile:

JAVA_HOME: needs to point to your Java installation directory, the one that contains bin/java. To find it, first check which Java binary is on your path by running which java, then run readlink on the path it prints. Sometimes readlink just returns another symlink, which you need to follow again. Keep going until you reach the real .../bin/java file; JAVA_HOME is the directory above that bin. On a Mac, running /usr/libexec/java_home should also print this directory directly.

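HADOOP_HOME: needs to point to the root directory of your downloaded Hadoop version, the one that contains bin (the commands in step 4 are run relative to it). In our case, set it to /Users/user_name/Downloads/hadoop-*, where hadoop-* is your Hadoop version. Note that home directories on a Mac live under /Users, not /home.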
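If chasing symlinks by hand gets tedious, the JAVA_HOME lookup above can be sketched in a few lines of Python. This is only a rough sketch under some assumptions: the function name java_home_from_binary is mine, and it assumes the java on your path is a symlink chain that ends in a real .../bin/java file. On some macOS versions /usr/bin/java may be a launcher stub rather than a symlink, in which case the printed directory will not be a usable JAVA_HOME.

```python
import os
import shutil


def java_home_from_binary(java_path):
    """Follow every symlink from a java binary and return the directory above its bin/."""
    real_binary = os.path.realpath(java_path)  # resolves the whole symlink chain at once
    bin_dir = os.path.dirname(real_binary)     # .../bin
    return os.path.dirname(bin_dir)            # the directory that contains bin/


if __name__ == "__main__":
    java_on_path = shutil.which("java")  # same lookup that `which java` does
    if java_on_path is None:
        print("no java binary found on PATH")
    else:
        # Prints a line you can paste into your shell profile.
        print("export JAVA_HOME=" + java_home_from_binary(java_on_path))
```

os.path.realpath does in one call what repeated readlink runs do by hand, so there is no loop to write.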

3. Configuration and Local SSH

Follow Apache’s pseudo-distributed operation steps for configuration and passphraseless SSH; they’re standard.

4. Run it!

First, you need to tell Hadoop to set up its file system by running a format command:

> bin/hadoop namenode -format

Then start the daemons:

> bin/start-all.sh

You should see a few lines appear indicating the starting of the namenode, the datanode, the secondarynamenode, the jobtracker, and the tasktracker.
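To double-check that all five daemons really came up, one option is to look at the output of jps, the JDK tool that lists running Java processes. Here is a small sketch that parses that output. The function name missing_daemons and the sample text are made up for illustration, and I’m assuming the daemons show up in jps under the class names listed in EXPECTED_DAEMONS.

```python
# Class names the five daemons are assumed to show up as in `jps` output.
EXPECTED_DAEMONS = {"NameNode", "DataNode", "SecondaryNameNode", "JobTracker", "TaskTracker"}


def missing_daemons(jps_output):
    """Return the expected Hadoop daemons that do not appear in `jps` output, sorted."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each jps line looks like "<pid> <main class name>"
            running.add(parts[1])
    return sorted(EXPECTED_DAEMONS - running)


# Made-up sample output, only to show the shape of the input.
sample = """4821 NameNode
4903 DataNode
4987 SecondaryNameNode
5120 Jps
"""
print(missing_daemons(sample))  # → ['JobTracker', 'TaskTracker']
```

On a real machine you would feed it the text from subprocess.run(["jps"], capture_output=True, text=True).stdout, assuming jps is on your path. An empty list means everything is running.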

5. Monitor Your Jobs

You can use Hadoop’s graphical administration interface for the jobtracker. By default it’s at http://localhost:50030/jobtracker.jsp, where you can see all your running, failed, and retired jobs.


decruft – Content Extraction on News Web Pages

Posted in content extraction, decruft, HTML Parser, Tech on October 24, 2012

So you’re working on data mining, information retrieval, unstructured data management, or some similar topic, and now it’s time to tamper with some data. It could be that you have some hypothesis to test, an equation for building a TF-IDF index, or a model for classification or similarity. In any case, you need some real-life data sets to play with. In my case, it was news articles. I first needed to crawl some news websites, generating URLs for several hundreds or thousands of (random in my case) pages, which is itself a non-trivial task. Then I needed to extract the bulk of each news article from its web page.

While you can use a number of already-implemented tools to parse the HTML of a web page in every possible language, it’s probably easy to write your own in a high-level language such as Python. I gave it a try with lxml and it worked great! Nevertheless, it’s still relatively difficult to extract the body of an article from an arbitrary news source, since you don’t know the structure of the HTML elements that hold the interesting pieces of the article.

The process is called “Content Extraction/Scraping.” If you google the term you’ll find several thousands of related results, including open-source projects, articles, and publications on the topic. A content extraction tool will essentially extract text from an HTML page. I needed a simple tool that can return only the body of an article, essentially, extract only “meaningful” text.

Enter decruft.

I like minimalist software. Most of the time I just need a tiny piece of code that does exactly what I need, not a bit more, not a bit less. decruft is a minimalist tool for extracting the body of an article. It’s a Python port of Arc90’s amazing Readability project. In fact, Readability’s core is used in very popular software today (Apple Safari, Amazon Kindle, and some iPad readers).

I coded very little Python to wrap decruft, achieving exactly what I wanted in very little time. All the wrapper did was read a list of URLs from a file, supply them one by one to decruft’s main method, get each result back in HTML format, and continue processing it as an lxml ElementTree.
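The shape of such a wrapper is roughly the following. This is a sketch, not the code I actually ran: read_urls and extract_all are names made up for this example, and extract_article is a hypothetical stand-in for whatever function fetches a page and hands it to decruft. decruft’s own API is deliberately not shown here, and the later lxml step is left out.

```python
def read_urls(path):
    """Return the non-empty lines of a text file, one URL per line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def extract_all(urls, extract_article):
    """Apply extract_article(url) to each URL; collect results and failures separately."""
    articles, failures = {}, {}
    for url in urls:
        try:
            articles[url] = extract_article(url)  # expected to return the article HTML
        except Exception as error:  # one bad page should not stop the whole batch
            failures[url] = str(error)
    return articles, failures
```

Passing the extractor in as a function keeps the loop independent of the extraction library, and collecting failures in a separate dict makes it easy to retry only the pages that broke.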


Installing lxml on Mac

Posted in Crawler, HTML Parser, lxml, Python, Tech on October 22, 2012

The usual painful task every researcher has to go through at some point in time: writing a simple customized web crawler!

Today I sat down to write my own. Can’t call it a crawler though, since it really doesn’t do any “crawling.” All I needed was a piece of code that will parse the web pages I throw at it and give me the HTML elements I’m looking for, by name, class, id, or what have you. It seems not enough people have tried this on a Mac though; community support was horribly weak!

I had intended to write my parser in Java, but after a bit of reading on Google I decided I’d try Python, which is probably much better at handling such clichéd, high-level tasks. Mac has it installed by default, sweet!

Now comes the pain: which package should I use? I started with the native Python support for HTML parsing, which has you write a subclass with three methods: handle_starttag, handle_endtag, and handle_data. That doesn’t give you a very intuitive way to structure your logic when you wanna search an HTML tree for elements and then execute some action. It’s simple and easy to pick up from the link below, but a disclaimer: DON’T use it if you wanna handle complicated real-life HTML pages. In my experience it crashed while parsing JavaScript code.

Now, with a clear conscience, here it is: http://docs.python.org/library/htmlparser.html
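To give an idea of what that three-method style looks like, here is a minimal sketch that collects the text of every element with a given class. It’s written for Python 3, where the module lives at html.parser (the page linked above documents the older Python 2 HTMLParser module). The class name ClassTextCollector and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside every element carrying a given class attribute."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []    # text pieces of the element currently being read
        self.results = []   # finished text, one entry per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1                  # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1                   # entering a matching element

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1
            if self.depth == 0:              # leaving the matching element
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


page = '<div><p class="body lead">Hello <b>world</b></p><p>skip</p><p class="body">Again</p></div>'
collector = ClassTextCollector("body")
collector.feed(page)
print(collector.results)  # → ['Hello world', 'Again']
```

Notice how much bookkeeping (the depth counter, the chunk buffer) a simple “give me the text of these elements” query needs. The counter also assumes every start tag gets a matching end tag, so an unclosed tag like a bare br inside a match would throw it off, which is a small taste of why this approach struggles on real-life pages.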

The next step was to try Beautiful Soup. BS is very popular on the web (although I don’t think it enjoys similar interest from its developers), but again for some reason it wouldn’t parse my hard-headed HTML page. My guess was that it was just a wrapper around the native Python parsing code.

My last resort was the famous lxml parser. I had read a lot about this one, and was all excited to try it. However, no matter how hard I tried, it still wouldn’t install on my Mac. I’m on Mountain Lion, so there are a number of suspects here. It could be the manually installed GCC compiler (Apple stopped embracing GCC as of Mountain Lion, so you need to install it manually if needed), it could be the version of lxml, or it could just be that the developers behind it are whole-hearted Mac-haters, who proudly declare on the software’s web page that:

Apple doesn’t help here, as MacOS-X is so badly maintained by them that the pre-installed system libraries of libxml2 and libxslt tend to be horribly outdated, and updating them is everything but easy

Anyway, after wasting another hour, I found this great step-by-step tutorial by Leo for installing it. However, the latest versions of libxml2, libxslt, and lxml still did NOT compile. I had to try a couple more, et voilà.. finally I found this combination of versions, which compiled fine:

libxml2-2.8.0
libxslt-1.1.27
lxml-2.3.5

Hope this helps any other poor lxml-using soul.


Mac KeyRemapper

Posted in key map, mac, remap, Tech on October 15, 2012

I busted my MacBook Pro’s spacebar key last night. It felt a little sticky (I guess I spilled something on it), so I took it out to clean it, and it seems I broke the tiny little hinges on it. I contacted a local Apple reseller, but of course their estimate was around 200 CHF for fixing a spacebar!

Until I can buy myself a new spacebar on ebay.ch, I’ve remapped my right CMD key to space; I rarely use that key anyway. Check out KeyRemap4MacBook, which did the trick with minimal fuss.

