Archive for the ‘Crawler’ Category

The usual painful task every researcher has to go through at some point: writing a simple, customized web crawler!

Today I sat down to write my own. I can’t really call it a crawler though, since it doesn’t do any “crawling.” All I needed was a piece of code that parses the web pages I throw at it and gives me the HTML elements I’m looking for, by name, class, id, or what have you. It seems not enough people have tried this on a Mac though; community support was horribly weak!

I had intended to write my parser in Java, but after a bit of reading on Google I decided to try Python, which is probably much better at handling such clichéd, high-level tasks. The Mac has it installed by default, sweet!

Now comes the pain: which package should I use? I started with Python’s native HTML parsing support, which has you write a subclass with three methods: handle_starttag, handle_endtag, and handle_data. That doesn’t give you a very intuitive flow for your logic when you wanna search an HTML tree for elements and then execute some action. It’s simple and easy to pick up from the link below, but a disclaimer: DON’T use it if you wanna handle complicated real-life HTML pages. In my experience it crashed while parsing JavaScript code.

Now, with a clear conscience, here it is: http://docs.python.org/library/htmlparser.html
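To give a feel for what that three-method style looks like, here is a minimal sketch of a finder that pulls out elements by name, class, or id. A few assumptions to flag: it uses Python 3’s html.parser module (the successor of the module linked above), the class name ElementFinder and its parameters are made up for this illustration, and it only handles one matched element at a time (a matched void element such as img would never “close”). It is not the code I actually ended up using.

```python
from html.parser import HTMLParser


class ElementFinder(HTMLParser):
    """Hypothetical helper: collect the text of elements matching tag/class/id."""

    def __init__(self, tag=None, css_class=None, element_id=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.element_id = element_id
        self.matches = []       # text of each matched element, in document order
        self._open_tag = None   # name of the matched element we are currently inside
        self._depth = 0         # nesting depth of same-named tags inside the match
        self._buffer = []

    def _is_match(self, tag, attrs):
        attrs = dict(attrs)
        if self.tag is not None and tag != self.tag:
            return False
        classes = (attrs.get("class") or "").split()
        if self.css_class is not None and self.css_class not in classes:
            return False
        if self.element_id is not None and attrs.get("id") != self.element_id:
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if self._open_tag is None:
            if self._is_match(tag, attrs):
                self._open_tag, self._depth, self._buffer = tag, 1, []
        elif tag == self._open_tag:
            self._depth += 1    # nested element with the same name

    def handle_endtag(self, tag):
        if self._open_tag is not None and tag == self._open_tag:
            self._depth -= 1
            if self._depth == 0:
                self.matches.append("".join(self._buffer).strip())
                self._open_tag = None

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)


html = ('<div id="main"><p class="note">first <b>bold</b></p>'
        '<p>skip</p><p class="note big">second</p></div>')

finder = ElementFinder(tag="p", css_class="note")
finder.feed(html)
finder.close()
print(finder.matches)  # ['first bold', 'second']
```

Notice how much state you have to carry around by hand (the open tag, the depth, the text buffer) just to answer “give me the text of these elements.” That bookkeeping is exactly the unintuitive flow I was complaining about.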

The next step was to try BeautifulSoup. BS is very popular on the web (although I don’t think it enjoys similar interest from its developers), but again, for some reason, it wouldn’t parse my hard-headed HTML page. My guess was that it was just a wrapper around the native Python parsing code.

My last resort was the famous lxml parser. I had read a lot about this one and was all excited to try it. However, no matter how hard I tried, it wouldn’t install on my Mac. I’m on Mountain Lion, so there are a number of suspects here. It could be the manually installed GCC compiler (Apple stopped embracing GCC as of Mountain Lion, so you have to install it yourself if you need it), it could be the version of lxml, or it could just be that the developers behind it are whole-hearted Mac-haters, who proudly declare on the software’s web page that:

Apple doesn’t help here, as MacOS-X is so badly maintained by them that the pre-installed system libraries of libxml2 and libxslt tend to be horribly outdated, and updating them is everything but easy

Anyway, after wasting another hour, I found this great step-by-step tutorial by Leo for installing it. However, the latest versions of libxml2, libxslt, and lxml still did NOT compile. I had to try a couple more, et voilà, I finally found this combination of versions, which compiled fine:
  • libxml2-2.8.0
  • libxslt-1.1.27
  • lxml-2.3.5
Hope this helps any other poor lxml-using soul.
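If you want to double-check which versions your build actually picked up, something like the following sketch should do. The helper name lxml_versions is made up for this example; the version constants are the ones lxml.etree exposes, and the function returns None when lxml isn’t importable at all.

```python
def lxml_versions():
    """Hypothetical helper: report the lxml, libxml2, and libxslt versions in use."""
    try:
        from lxml import etree  # third-party; may not be installed
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,        # version of lxml itself
        "libxml2": etree.LIBXML_VERSION,   # libxml2 library loaded at runtime
        "libxslt": etree.LIBXSLT_VERSION,  # libxslt library loaded at runtime
    }


print(lxml_versions())
```

Each value is a tuple of integers, so you can compare them against the combination listed above.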
