Never got a reply, but I'll reproduce here; I wonder how much of this is still true?
"
The docs were pretty good, but it was sometimes unclear how to proceed; there was a lot of structure to understand in order to get started.
When I used it, I wanted to scrape a site until a certain condition was met: the last page scraped returning no items. I wanted all results initially returned from a page to be dropped if they were older than a certain date; thus I wanted Scrapy to keep scraping until no new items were found. I also wanted the latest date among the items returned, so I could use it the next time I scraped.
I created a 'DropElderMiddleware' to do this. I couldn't see any other way of making calculations based on the items returned from a particular page.
I could never figure out what the difference between input and output processors was, or when I should use one or the other.
The MapCompose function flattens objects by default, so I had to be careful when returning lists whose structure I wanted to retain.
The way the HTML match object worked was sometimes confusing: if I wanted to match multiple items, then match items within each of those, I wanted a list of lists (matches grouped by the element they were found in). I can't remember the details of why I found this hard, but I can try to come up with an example if you like.
In the end I figured I was having to learn the structure of Scrapy for everything I wanted it to do, yet there were many Scrapy features I didn't need; e.g. I didn't want command-line control (I would actually have preferred not to use that interface, but never discovered how to write a Python script that runs the spider directly).
Now I prefer to use mechanize + PyQuery; PyQuery is at least as good at processing web pages as Scrapy's selectors, and if I need something more for opening a page, e.g. a complicated login, I can use mechanize. I find this a more modular approach, and I think I better understand what's going on in my scripts.
"
> The docs were pretty good, but it was sometimes unclear how to proceed; there was a lot of structure to understand in order to get started.
Yeah, the docs used to be a problem; they improved a lot in the 1.0 and 1.1 releases, though.
> When I used it, I wanted to scrape a site until a certain condition was met: the last page scraped returning no items. I wanted all results initially returned from a page to be dropped if they were older than a certain date; thus I wanted Scrapy to keep scraping until no new items were found. I also wanted the latest date among the items returned, so I could use it the next time I scraped.
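That crawl-until-stale pattern is doable with a spider middleware. Here's a rough pure-Python sketch of the filtering logic (the class and method names are made up for illustration, not Scrapy API; in Scrapy this would typically sit in a spider middleware's process_spider_output() hook):

```python
from datetime import date

class DropElder:
    """Sketch of a 'DropElderMiddleware': drop items at or before a
    cutoff date and remember the newest date seen, so the next crawl
    can start from it. (Illustrative names, not real Scrapy API.)"""

    def __init__(self, cutoff):
        self.cutoff = cutoff   # items on/before this date are stale
        self.latest = cutoff   # newest date seen so far

    def filter_page(self, items):
        """Keep only fresh items; an empty result is the signal to
        stop following pagination links."""
        fresh = [it for it in items if it["date"] > self.cutoff]
        for it in fresh:
            self.latest = max(self.latest, it["date"])
        return fresh
```

An empty result from filter_page() tells the spider to stop paginating, and self.latest seeds the cutoff for the next run.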
> I could never figure out what the difference between input and output processors was, or when I should use one or the other.
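The rule of thumb: the input processor runs on each extracted value as it's added to the loader, while the output processor runs once, on the whole collected list, when the item is finally built. A toy model of that flow (not the real ItemLoader API):

```python
class MiniLoader:
    """Toy model of ItemLoader's processor flow: the input processor
    runs on *each* value on the way in; the output processor runs
    *once* on the collected list on the way out."""

    def __init__(self, input_processor, output_processor):
        self.inp = input_processor
        self.out = output_processor
        self.values = []

    def add_value(self, value):
        self.values.append(self.inp(value))  # per value, at add time

    def load(self):
        return self.out(self.values)         # once, at build time

loader = MiniLoader(str.strip, ", ".join)
loader.add_value("  scrapy ")
loader.add_value("parsel  ")
print(loader.load())  # scrapy, parsel
```

So: per-value cleanup (stripping, parsing) belongs in the input processor; decisions about the collection as a whole (joining, taking the first value) belong in the output processor.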
> The MapCompose function flattens objects by default, so I had to be careful when returning lists whose structure I wanted to retain.
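The flattening is easy to see with a simplified reimplementation (a sketch of the behaviour, not the real code): any list a function returns is spliced into the stream of values, so nesting is lost.

```python
def map_compose(*functions):
    """Simplified sketch of MapCompose semantics: each function is
    applied to every value, and any list a function returns is
    spliced into the value stream rather than kept nested."""
    def process(values):
        for fn in functions:
            flattened = []
            for value in values:
                result = fn(value)
                if isinstance(result, list):
                    flattened.extend(result)   # nesting is lost here
                elif result is not None:
                    flattened.append(result)
            values = flattened
        return values
    return process

split_words = map_compose(str.split)
print(split_words(["foo bar", "baz"]))  # ['foo', 'bar', 'baz']
```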
> The way the HTML match object worked was sometimes confusing: if I wanted to match multiple items, then match items within each of those, I wanted a list of lists (matches grouped by the element they were found in). I can't remember the details of why I found this hard, but I can try to come up with an example if you like.
I'm not sure what problems you had. Scrapy's selector library (https://github.com/scrapy/parsel) is quite similar to PyQuery (especially when CSS selectors are used), and nothing prevents you from using PyQuery with Scrapy. In the future we may add PyQuery (and BeautifulSoup?) support to parsel and provide PyQuery selectors as response.pq (like response.css and response.xpath); +1 to doing that.
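On the grouping point: nesting the queries is what preserves which parent each match came from. To illustrate with just the stdlib (parsel works the same way when you call .css() or .xpath() on each selector from the outer query):

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body><bar><foo>1</foo><foo>2</foo></bar>"
    "<bar><foo>3</foo></bar></body>"
)

# Nested queries keep the grouping: one inner list per <bar>.
grouped = [[foo.text for foo in bar.findall("foo")]
           for bar in body.findall("bar")]
print(grouped)  # [['1', '2'], ['3']]

# A single flat query loses which <bar> each <foo> came from.
flat = [foo.text for foo in body.findall(".//foo")]
print(flat)  # ['1', '2', '3']
```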
> In the end I figured I was having to learn the structure of Scrapy for everything I wanted it to do, yet there were many Scrapy features I didn't need; e.g. I didn't want command-line control (I would actually have preferred not to use that interface, but never discovered how to write a Python script that runs the spider directly).
Yeah, the library interface used to be a problem. It was improved in the 1.0 release (there is an official API for integrating Scrapy with Twisted apps and running spiders from user scripts), but there is still more to do. See http://doc.scrapy.org/en/1.0/topics/practices.html#run-scrap....
It probably won't be as easy to integrate with regular Python scripts as mechanize, because Scrapy is async. On the other hand, Scrapy is easier to integrate with async servers like Twisted or Tornado.
> Now I prefer to use mechanize + PyQuery; PyQuery is at least as good at processing web pages as Scrapy's selectors, and if I need something more for opening a page, e.g. a complicated login, I can use mechanize. I find this a more modular approach, and I think I better understand what's going on in my scripts.
You may want to check the new 'Scrapy at a glance' page (http://doc.scrapy.org/en/latest/intro/overview.html). The main advantage of Scrapy over mechanize is that it handles parallel downloads and has a wide range of built-in extensions you won't have to implement yourself.
Hmm, the matching issue might have been something like wanting to do "[i.match(tag='foo') for i in body.match(tag='bar')]" and getting a list-of-lists back, but this was a long time ago :-)
Incidentally, I've since gone off PyQuery, as it doesn't always keep up with jQuery. I now prefer lxml or BS4.
BTW, I love ScrapingHub. I bashed out a few spiders with Portia, but ultimately I'll probably start scripting instead. Do you know if Portia actually generates script code? It might be easier for fast scraping to get 60% of the way with Portia, then manually write the rest of the script.
The wrapper-induction library is separate from Scrapy: https://github.com/scrapy/scrapely. It is used in Portia under the hood. Portia can be seen as a tool to annotate scrapely templates and define crawling and post-processing rules.
I'm not a Portia developer/user myself, but I think it is possible to get script code from Portia; it exports a Scrapy spider to a folder. But I don't really know what I'm talking about; it's better to ask at https://groups.google.com/forum/#!forum/portia-scraper or on Stack Overflow (use the 'portia' tag).