
http://mojolicio.us is way better for this kind of stuff. Here's the synopsis example redone using Mojo:

    $ perl -Mojo -e'g("reddit.com")->dom("a.title")->each(sub { warn shift->text })'


The one-liner is cool, but I guarantee that node.js's non-blocking IO will outperform Perl any day of the week. Try scraping thousands of pages at once using Perl.


Mojolicious runs on a non-blocking async event loop as well =)


The problem with anything that represents a page as a tree (a DOM) is that you have to construct the whole tree before you can start doing anything with it. The API largely precludes streaming. Callbacks would be possible, but some CSS selectors (structural ones like :last-child, for example) need complete knowledge of the page before they can be resolved.

So while GET-ting pages to scrape can benefit from async IO, you're effectively "blocked" while scraping pieces out of the page itself.
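
To make the streaming point concrete, here is a rough sketch of the callback approach. It is in Python rather than Perl so that it needs nothing beyond the standard library's html.parser, and it is not how Mojolicious works internally. The names TitleLinkStream and scrape_in_chunks and the sample markup are made up for illustration. A simple selector like a.title can be matched as chunks arrive, with no tree built:

```python
from html.parser import HTMLParser


class TitleLinkStream(HTMLParser):
    """Collect the text of <a class="title"> elements without building a tree.

    Hypothetical helper for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.found = []   # text of each completed matching link
        self._buf = None  # text pieces while inside a matching <a>, else None

    def handle_starttag(self, tag, attrs):
        # A class attribute with no value comes through as None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buf is not None:
            self.found.append("".join(self._buf))
            self._buf = None


def scrape_in_chunks(chunks):
    """Feed chunks one at a time, recording what has been found after each."""
    parser = TitleLinkStream()
    seen_after_each_chunk = []
    for chunk in chunks:
        parser.feed(chunk)  # incomplete tags are buffered until the next chunk
        seen_after_each_chunk.append(list(parser.found))
    parser.close()
    return parser.found, seen_after_each_chunk


if __name__ == "__main__":
    # Made-up page, split at arbitrary points as a socket might deliver it.
    chunks = [
        '<ul><li><a class="title" href="/1">First</a></li>',
        '<li><a class="title" href="/2">Sec',
        'ond</a></li></ul>',
    ]
    found, seen = scrape_in_chunks(chunks)
    print(seen[0])  # ['First'] is available before the rest of the page arrives
    print(found)    # ['First', 'Second']
```

The first link is reported before the page has finished arriving. A selector like li:last-child a.title does not fit this model as neatly: when a given li opens, the parser cannot yet know whether another sibling follows, so the match has to wait until the parent's closing tag has been seen.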



