
Hmm, the matching issue might have been something like wanting to do "[i.match(tag='foo') for i in body.match(tag='bar')]" and getting a list-of-lists back, but this was a long time ago :-)
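To illustrate the kind of list-of-lists result described above, here is a minimal sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in (not the library from the original discussion, whose `match` API is not reproduced here), and the sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
doc = ET.fromstring(
    "<body>"
    "<bar><foo>a</foo><foo>b</foo></bar>"
    "<bar><foo>c</foo></bar>"
    "</body>"
)

# One findall() per <bar> yields one inner list per <bar>: a list of lists.
nested = [bar.findall("foo") for bar in doc.findall("bar")]

# A second `for` clause in the comprehension flattens the result.
flat = [foo for bar in doc.findall("bar") for foo in bar.findall("foo")]

# Compare the text content of both shapes.
nested_text = [[foo.text for foo in group] for group in nested]
flat_text = [foo.text for foo in flat]
print(nested_text)  # [['a', 'b'], ['c']]
print(flat_text)    # ['a', 'b', 'c']
```

The nested comprehension is one way to get a flat list; `itertools.chain.from_iterable(nested)` would do the same on an already nested result.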

Incidentally, I've since gone off PyQuery as it doesn't always keep up with jQuery. I now prefer lxml or BS4.

BTW, I love ScrapingHub. I bashed out a few spiders with Portia, but ultimately I'll probably start scripting instead. Do you know if Portia actually generates script code? For fast scraping it might be easier to get 60% of the way with Portia, then write the rest of the script manually.

One last thing - looking at this page

> http://stackoverflow.com/questions/6261714/inferring-templat...

there is mention of a "wrapper induction library", but I can't find any more mention of it. Does the class/functionality still exist?



The wrapper induction library has been separated from Scrapy: https://github.com/scrapy/scrapely. It is used in Portia under the hood. Portia can be seen as a tool for annotating scrapely templates and defining crawling and post-processing rules.

I'm not a Portia developer/user myself, but I think it is possible to get script code from Portia; it exports a Scrapy spider to a folder. But I don't really know what I'm talking about, so it is better to ask at https://groups.google.com/forum/#!forum/portia-scraper or on Stack Overflow (use the tag 'Portia').


Thanks for your help :-)



