python - Building Scrapy spiders into my own program (I don't want to call Scrapy from the command line)


In a similar vein to this question: stackoverflow: running-multiple-spiders-in-scrapy

I am wondering, can I run an entire Scrapy project from within another Python program? Let's say I wanted to build an entire program that required scraping several different sites, and I build entire Scrapy projects for each site.

Instead of running them one at a time from the command line, I want to run these spiders and acquire the information from them.

I can use MongoDB in Python fine, and I can already build Scrapy projects that contain spiders, but now I need to merge it all into one application.

I want to run the application once, and have the ability to control multiple spiders from my own program.

Why do this? This application may also connect to other sites using an API and needs to compare results from the API site to the scraped site in real time. I don't want to ever have to call Scrapy from the command line; it is all meant to be self contained.

(I have been asking lots of questions about scraping recently, because I am trying to find the right solution to build with.)

Thanks :)

Yep, of course you can ;)

The idea (inspired by this blog post) is to create a worker and use it in your own Python script:

from scrapy import project, signals
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from scrapy.xlib.pydispatch import dispatcher
from multiprocessing.queues import Queue
import multiprocessing


class CrawlerWorker(multiprocessing.Process):

    def __init__(self, spider, result_queue):
        multiprocessing.Process.__init__(self)
        self.result_queue = result_queue

        # set up a crawler inside this worker process
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        # collect every item the spider yields
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        # hand the collected items back to the parent process
        self.result_queue.put(self.items)

Example of use:

result_queue = Queue()
crawler = CrawlerWorker(MySpider(myArgs), result_queue)
crawler.start()
for item in result_queue.get():
    yield item
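Note that the worker above targets an older Scrapy release (scrapy.conf, scrapy.project and scrapy.xlib.pydispatch no longer exist in current versions). On a recent Scrapy install you can get the same effect with CrawlerProcess and the signals API directly. A minimal sketch, assuming a spider class MySpider defined somewhere in your project (the import path is a placeholder):

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# placeholder import: point this at one of your project's spider classes
from myproject.spiders.mysite import MySpider

items = []

def collect_item(item, response, spider):
    # called once for every item the spider yields
    items.append(item)

process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler(MySpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks here until the crawl is finished

print("scraped %d items" % len(items))

Keep in mind that the Twisted reactor can only be started once per process, so if you need to run crawls repeatedly you would still wrap this in its own process, much like the CrawlerWorker above.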

Another way is to execute the scrapy crawl command with system().
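For completeness, a sketch of that approach using subprocess instead of os.system, assuming the scrapy executable is on PATH and the script is run from the project directory (the spider name and output file are placeholders):

import subprocess

# run one spider as a child process and have it dump its items to a file
subprocess.run(
    ["scrapy", "crawl", "myspider", "-o", "items.json"],
    check=True,
)

You would then read items.json back in and compare it against your API results; the trade-off is that you lose the in-process access to items that the worker approach gives you.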

