
Check out Nutch, though I'm not sure it's exactly what you want. It's in Java, not Python, but it works with Hadoop quite nicely.


I concur with the Nutch vote. More specifically, take a look at the crawler code in the src trunk that was written for use with Hadoop; that is probably a good place to start. Also worth a look is Heritrix, the crawler for archive.org (http://sourceforge.net/projects/archive-crawler). Sadly, this too is written in Java.

The only Python one I am aware of for which code is available is: http://sourceforge.net/projects/ruya/

Edit: You might also want to take a look at http://wiki.apache.org/hadoop/AmazonEC2

Edit2: Polybot is another Python-based crawler, but no code is available. However, the paper has some interesting ideas:

Design and Implementation of a High-Performance Distributed Web Crawler. V. Shkapenyuk and T. Suel. IEEE International Conference on Data Engineering, February 2002. http://cis.poly.edu/westlab/polybot/


Good response. We've created a basic crawler in Python, but are looking for something more powerful too. Heritrix above looks good.


Thanks. I'd still like to use Python, to be honest (any Python suggestions?), but I'll give this a go. I'm going to do some more research and might post my findings back here if anyone would be interested in critiquing them. I'm creating this startup from scratch, so if anyone is interested in the crawler side of things, I'd be happy to chat about either collaborating or sharing ideas.


If you're looking at building your own crawler in Python from scratch, here's a benchmark of SGML parsers:

http://72.14.205.104/search?q=cache:LYoRD1GTP2UJ:www.oluyede...

We've been playing with sgmlop (http://effbot.org/zone/sgmlop-index.htm) for parsing and urllib2 (http://docs.python.org/lib/module-urllib2.html) for fetching.
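For anyone who wants to see the fetch-then-parse shape in code, here is a rough sketch. It is not the sgmlop/urllib2 setup mentioned above: it assumes Python 3 and swaps in the standard library's html.parser for sgmlop and urllib.request for urllib2. The names LinkExtractor, extract_links, fetch and crawl, the User-Agent string, and the max_pages limit are all made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_links(html, base_url):
    """Return the links found in an HTML string, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page and return it as text (hypothetical User-Agent)."""
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl; returns the URLs fetched, in visit order."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(html, url):
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling crawl("http://example.com/", max_pages=5) would fetch up to five pages starting from that URL. The sketch leaves out robots.txt handling, per-host delays and anything distributed, so treat it as a starting point and not something to point at the open web as-is.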


I'm going to vote for Nutch too. Have heard good things.

Also, if someone on this list wants to work on a cool web spidering project (probably using Nutch), send me a message. I'm looking for someone.


I'm interested...



