I do a fair amount of web scraping with Ruby. Can anyone who has dabbled in both...

Jake232 · on March 10, 2014

Hey, article author here.

I've done extensive scraping in both Python and Ruby, as I wrote most of the scraping / crawling code at http://serpiq.com, so I can chip in.

Overall, I prefer Python. That is pretty much solely down to the requests library though; it makes everything so simple and quick. I haven't covered it in the article yet, but you can extend the Response() class easily, so you can, for example, add methods like images(), links(nofollow=True), etc. I just think the requests library is much more polished than anything available in Ruby.
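To give a rough idea of what such helper methods could look like, here is a minimal sketch. It does not subclass the real requests Response class; instead it wraps any response-like object that exposes the body as `.text`, and uses only the standard library. The names `Page` and `_TagCollector`, and the reading of `nofollow=True` as "include links marked rel=nofollow", are assumptions made for illustration.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class _TagCollector(HTMLParser):
    """Collects link targets and image sources while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.links = []   # list of (href, is_nofollow) tuples
        self.images = []  # list of src values

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            rel = (attrs.get("rel") or "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel))
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


class Page:
    """Hypothetical wrapper around a response-like object with a .text body."""

    def __init__(self, response):
        self.response = response
        collector = _TagCollector()
        collector.feed(response.text)
        self._links = collector.links
        self._images = collector.images

    def images(self):
        return list(self._images)

    def links(self, nofollow=True):
        # Assumption: nofollow=True keeps rel="nofollow" links, False drops them.
        return [href for href, is_nf in self._links if nofollow or not is_nf]


# Stand-in for a real HTTP response so the sketch runs without network access.
fake_response = SimpleNamespace(
    text='<a href="/a">A</a><a rel="nofollow" href="/b">B</a><img src="/x.png"/>'
)
page = Page(fake_response)
```

With a real HTTP client, the object returned by a GET call would take the place of `fake_response`, as long as it exposes the decoded body as `.text`.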

grequests (Python) means I can make things concurrent in a matter of minutes. However, in Ruby the only capable library supporting concurrent HTTP requests that I liked was Typhoeus. It just wasn't to the same standard though, and I ran into certain issues when using proxies etc.

As far as the HTML parsing goes, I don't really have any preference. Nokogiri and lxml are both equally capable.

I think they're both perfectly capable languages though, so stick with what you prefer. I've been experimenting with Go lately.

flexd · on March 10, 2014

Scrapy, mentioned here, is really easy to use. Alternatively, as the article mentions, a combination of requests, lxml and something like PhantomJS (if there is JavaScript involved) makes scraping sites in Python a nice experience.

I briefly made something with Scrapy to scrape my university's websites and notify me when we get new exam results [1], and that was okay. I might be slightly abusing Scrapy, but it was an okay experience.

Previously I have used 'scrubyt' for Ruby to scrape things, but their homepage seems to lead to a skin-related website now. What tools/libraries do you use for scraping stuff with Ruby these days? I remember Mechanize and Nokogiri were good/decent, but it's been more than a few years since I last used Ruby.

[1] https://github.com/flexd/studweb (description in Norwegian but it's not important)

ZenoArrow · on March 10, 2014

I've used both Python and Ruby for web scraping. Whilst Python is my language of choice for most things, I enjoyed the web scraping experience more with Ruby (in particular, Nokogiri). Maybe it's just bad luck on my part, but I tend to find Unicode issues when scraping with Python 2.x, whereas Ruby has had decent Unicode support for a while. I've not used Python 3.x. YMMV.

pudquick · on March 11, 2014

RE: "whereas Ruby has had decent Unicode support for a while"

Really? I'd love some examples of where Ruby shines when it comes to Unicode handling when dealing with web content.

I know a lot of work was done in Ruby 1.9+ to bring decent Unicode encoding support to the language, but I still see a good number of complaints/articles about issues with it.

The encoding issues I've run into with Python 2 have generally been cases where whatever framework I was using to ingest the content took a website's declared encoding at face value: either it wasn't defined at all or it was defined incorrectly.

In the wild wild world of web, unless you're doing intelligent data inspection, you're just going to run into that sort of thing.

In Python, that's why projects like this exist: https://github.com/LuminosoInsight/python-ftfy

They let you correct Unicode content that was decoded with the wrong encoding.
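To illustrate the kind of damage involved, here is a minimal sketch using only the standard library. It is not ftfy's algorithm, which detects and handles many more cases; `fix_mojibake` is a hypothetical helper that only reverses one specific mix-up (UTF-8 bytes that were decoded as Latin-1).

```python
def fix_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo one round of 'decoded with the wrong codec' damage, if possible."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up; leave it alone.
        return text


original = "caf\u00e9"  # "café"
# Simulate a scraper that trusted a wrong charset declaration:
garbled = original.encode("utf-8").decode("latin-1")
repaired = fix_mojibake(garbled)
```

Here `garbled` comes out as "cafÃ©" and `repaired` is "café" again, while text that was never damaged this way is returned unchanged.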

rspeer · on March 11, 2014

Author of ftfy here - thanks for the shoutout.

By the way, here's another problem with taking the "encoding" parameter at face value: you're opening yourself up to DoS or data corruption bugs in the case where someone tells you to use a dangerous encoding.

There have been multiple bugs found in Python's UTF-7 decoder recently, and generally they were found by people who were scraping the Web with Python. These bugs, such as [1], could cause you to write strings that corrupt your data or crash the Python interpreter. And until the latest version -- and this is possibly still the case in all versions of Python 2 -- someone could give you a gzip bomb that decompresses to petabytes of data, and tell you it's in the "gzip" encoding [2].
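One general defence against decompression bombs, whatever route the compressed data arrives by, is to cap how much output you are willing to produce. A minimal sketch with the standard zlib module follows; the function name, the exception class and the 10 MB default are hypothetical choices for illustration, and this is not the fix applied in the linked bug reports.

```python
import gzip
import zlib


class DecompressionTooLarge(Exception):
    """Raised when decompressed output would exceed the configured limit."""


def bounded_gunzip(data, limit=10 * 1024 * 1024):
    """Decompress gzip data, refusing to produce more than `limit` bytes."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    # Ask for at most limit + 1 bytes so an overflow can be detected cheaply.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit or decompressor.unconsumed_tail:
        raise DecompressionTooLarge("output exceeds %d bytes" % limit)
    return out


# A small payload passes; a highly compressible one trips the limit.
small = gzip.compress(b"hello")
bomb_like = gzip.compress(b"\0" * 1000000)
```

Calling `bounded_gunzip(small)` returns the original bytes, whereas `bounded_gunzip(bomb_like, limit=1000)` raises instead of allocating the full megabyte.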

I'm sure there are more bugs like this out there, and that Ruby has similar lurking bugs as well, given how recently they changed their Unicode system.

Basically, you shouldn't let someone else's Web page tell you what code to run, unless it's code you're planning to run. I recommend making a short list of encodings you trust, including ASCII, UTF-8, UTF-16, ISO-8859-x, Windows-125x, and MacRoman, and maybe a few others if you're working with CJK text, and just rejecting all others.

(The x's can be filled in with digits. Don't accept UTF-7, because it's clearly horrible. And I don't have any particular reason to be suspicious of UTF-32, but I've never seen anyone seriously use it.)
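A minimal sketch of such an allowlist, assuming Python 3 and the standard codecs module. The names `TRUSTED_CODECS` and `decode_if_trusted` are hypothetical, and the exact list is only an example to adjust to your own data. Declared labels are normalised through `codecs.lookup` so that aliases such as "UTF8" or "latin-1" match their canonical codec.

```python
import codecs

# Labels to trust; ISO-8859-12 does not exist, so it is skipped.
_TRUSTED_LABELS = (
    ["ascii", "utf-8", "utf-16", "mac-roman"]
    + ["iso8859-%d" % n for n in range(1, 17) if n != 12]
    + ["cp125%d" % n for n in range(0, 9)]  # the Windows-125x family
)

# Canonical codec names, so any alias of a trusted codec is accepted.
TRUSTED_CODECS = {codecs.lookup(label).name for label in _TRUSTED_LABELS}


def decode_if_trusted(raw, declared):
    """Decode `raw` bytes only if `declared` names an allowlisted codec."""
    try:
        name = codecs.lookup(declared).name
    except LookupError:
        raise ValueError("unknown encoding: %r" % (declared,))
    if name not in TRUSTED_CODECS:
        raise ValueError("untrusted encoding: %r" % (declared,))
    return raw.decode(name)


text = decode_if_trusted(b"caf\xc3\xa9", "UTF8")
```

A page that declares "utf-7", or a non-text codec such as "zlib", is rejected with a ValueError before any decoding code runs.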

[1] http://bugs.python.org/issue19279

[2] http://bugs.python.org/issue20404

hatchoo · on March 11, 2014

Thanks for sharing ftfy. I have had several issues that this little library appears to address perfectly.