Check out Nutch; I'm not sure if it's exactly what you want, though. It's in Java not ...

sarosh · on March 30, 2008

I concur with the Nutch vote; more specifically, take a look at the crawler code in the src trunk written for use with Hadoop. That is probably a good place to start. Also worth a look is Heritrix, the crawler for archive.org (http://sourceforge.net/projects/archive-crawler). Sadly, this too is written in Java.

The only Python one I am aware of for which code is available is: http://sourceforge.net/projects/ruya/

Edit: You might also want to take a look at http://wiki.apache.org/hadoop/AmazonEC2

Edit2: Polybot is another Python-based crawler, but no code is available. However, the paper has some interesting ideas:

Design and Implementation of a High-Performance Distributed Web Crawler. V. Shkapenyuk and T. Suel. IEEE International Conference on Data Engineering, February 2002. http://cis.poly.edu/westlab/polybot/

inovica · on March 30, 2008

Good response. We've created a basic crawler in Python, but are looking for something more powerful too. Heritrix above looks good.

groovyone · on March 30, 2008

Thanks. I'd still like to use Python, to be honest (any Python suggestions?), but I'll give this a go. I'm going to do some more research and might post back my findings if anyone would be interested in critiquing them. I'm creating this startup from scratch, so if anyone is interested in the crawler side of things, I'd be happy to chat about either collaboration or sharing ideas.

konsl · on March 30, 2008

If you're looking at building your own crawler in Python from scratch, here's a benchmark of SGML parsers:

http://72.14.205.104/search?q=cache:LYoRD1GTP2UJ:www.oluyede...

We've been playing with sgmlop (http://effbot.org/zone/sgmlop-index.htm) for parsing and urllib2 (http://docs.python.org/lib/module-urllib2.html) for fetching.
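
For anyone who wants to see the fetch-then-parse split in code, here is a minimal sketch. It is not the sgmlop/urllib2 setup described above: it assumes Python 3 and swaps in the standard library's html.parser and urllib.request as stand-ins. The names LinkExtractor, extract_links, and fetch, and the User-Agent string, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import Request, urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop
                # the #fragment so the same page is not queued twice.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Parse an HTML string and return the links found, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url, timeout=10):
    """Download a page and return its body as text.

    The User-Agent value is a placeholder (an assumption, not a convention).
    """
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A crawl step would then be something like extract_links(fetch(url), url), with the results pushed onto a queue. Politeness (robots.txt, rate limiting) and duplicate detection are left out of this sketch.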

dshah · on March 30, 2008

I'm going to vote for Nutch too. Have heard good things.

Also, if someone on this list wants to work on a cool web spidering project (probably using Nutch), send me a message. I'm looking for someone.

surya · on March 31, 2008

I'm interested...