Python - Building Scrapy spiders into my own program (I don't want to call Scrapy from the command line)
In a similar vein to this question: Stack Overflow: running-multiple-spiders-in-scrapy
I am wondering, can I run an entire Scrapy project from within my Python program? Let's say I wanted to build an entire program that required scraping several different sites, and I build entire Scrapy projects for each site.
Instead of running them from the command line one at a time, I want to run these spiders and acquire the information from them.
I can use MongoDB in Python fine, and I can already build Scrapy projects that contain spiders, but now I need to merge it all into one application.
I want to run the application once, and have the ability to control multiple spiders from my own program.
Why do this? This application may also connect to other sites using an API and needs to compare results from the API site to the scraped site in real time. I don't want to ever have to call Scrapy from the command line; it is all meant to be self contained.
(I have been asking lots of questions about scraping recently, because I am trying to find the right solution to build with.)
thanks :)
Yep, of course you can ;)
The idea (inspired by this blog post) is to create a worker and use it in your own Python script:
from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
import multiprocessing

class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        # collect every item the spider passes through the pipeline
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        # run the crawl in this worker process, then hand the items back
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.result_queue.put(self.items)
Example of use:
result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()
# result_queue.get() blocks until the spider has finished and put its items
for item in result_queue.get():
    yield item
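Because CrawlerWorker is just a multiprocessing.Process, one natural way to cover the "several different sites" case is to start one worker per spider and drain each result queue from the controlling program. A rough sketch under that assumption (SiteASpider and SiteBSpider are hypothetical spider classes, and how well the old CrawlerProcess tolerates being set up several times in one parent process depends on your Scrapy version):

from multiprocessing import Queue

def scrape_all(spiders):
    # start one CrawlerWorker per spider, then wait for all of them
    workers = []
    for spider in spiders:
        queue = Queue()
        worker = CrawlerWorker(spider, queue)
        worker.start()
        workers.append((worker, queue))

    all_items = []
    for worker, queue in workers:
        all_items.extend(queue.get())  # blocks until that worker puts its items
        worker.join()
    return all_items

# e.g. all_items = scrape_all([SiteASpider(), SiteBSpider()])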
Another way is to execute the scrapy crawl command with system().
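If you do go that route, the subprocess module is a little more controllable than a bare system() call, and the crawl command's -o option lets the child process dump its items to a file that your program reads back afterwards. A minimal sketch, with placeholder spider name, project path and output file:

import json
import os
import subprocess

output = "items.json"                    # placeholder output path
if os.path.exists(output):
    os.remove(output)                    # -o appends, so start from a clean file

subprocess.check_call(
    ["scrapy", "crawl", "myspider", "-o", output],  # 'myspider' is a placeholder spider name
    cwd="/path/to/scrapy/project",                  # directory containing scrapy.cfg
)

with open(output) as f:
    items = json.load(f)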