Summary: several ways to run Scrapy spiders from a script

Method 1: CrawlerProcess()

An earlier post introduced running a spider with cmdline.execute(). There is another way to run spiders from a script: CrawlerProcess().

Let's start with an example:

import scrapy
import random
from scrapy.crawler import CrawlerProcess


def serialize_text(text):
    # strip the curly quotes and export a random sample of words from each quote
    word_list = text.replace(u'“', '').replace(u'”', '').split()
    return random.sample(word_list, 2)


class QuotesItem(scrapy.Item):
    text = scrapy.Field(serializer=serialize_text)
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        # build the start URL from the category passed in as a spider argument
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


process = CrawlerProcess()
process.crawl(QuotesSpider, category='love')
process.start()
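
When the script finishes, the scraped items are in quotes.jsonlines, the FEED_URI set in custom_settings. Here is a minimal sketch (not from the original article) of reading the feed back; it assumes Scrapy's default feed format, JSON lines, where each line is one serialized item:

import json

with open('quotes.jsonlines', encoding='utf-8') as feed:
    for line in feed:
        item = json.loads(line)
        print(item['author'], item['text'])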

The spider script runs directly, without needing a Scrapy project. Now let's take a closer look at the CrawlerProcess class:

class scrapy.crawler.CrawlerProcess(settings=None)

It is used to start multiple crawlers simultaneously within a single process and accepts settings as its initialization argument. Now let's look at one of its methods:

crawl(crawler_or_spidercls, *args, **kwargs)
  • crawler_or_spidercls: a Crawler instance or a Spider subclass (a short sketch follows this list)
  • args (list) – positional arguments passed to the spider's __init__
  • kwargs (dict) – keyword arguments passed to the spider's __init__
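
A minimal sketch (not from the original article, reusing the QuotesSpider defined above) of the two forms the first argument can take. Keyword arguments are forwarded to the spider's __init__, which is how category='love' ends up in self.start_urls:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()

# pass the Spider subclass directly; category='love' goes to QuotesSpider.__init__
process.crawl(QuotesSpider, category='love')

# or build a Crawler instance first and pass that instead
crawler = process.create_crawler(QuotesSpider)
process.crawl(crawler, category='humor')

process.start()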

I won't go through the rest of the API here; the details are in Common Practices in the Scrapy documentation.

Here is a snippet that starts multiple spiders in the same process:

from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()
process.crawl(QuotesSpider, category='humor')
process.crawl(QuotesSpider, category='love')
process.start()

Pretty simple, isn't it?
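
Neither example passes any settings when constructing CrawlerProcess, but that is where crawl-wide options go. A minimal sketch (not from the original article); note that each spider's custom_settings still override these for that spider:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'LOG_LEVEL': 'INFO',       # applies to every crawl started below
    'CONCURRENT_REQUESTS': 8,
})
process.crawl(QuotesSpider, category='humor')
process.crawl(QuotesSpider, category='love')
process.start()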

Method 2: cmdline.execute()

The spider is the same as in Method 1, this time saved as a standalone file, Quotes_Spider.py:

# -*- coding: utf-8 -*-
# filename: Quotes_Spider.py
import scrapy
import random


def serialize_text(text):
    word_list = text.replace(u'“', '').replace(u'”', '').split()
    return random.sample(word_list, 5)


class QuotesItem(scrapy.Item):
    text = scrapy.Field(serializer=serialize_text)
    author = scrapy.Field()


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ['toscrape.com']
    custom_settings = {
        'FEED_EXPORT_ENCODING': 'utf-8',
        'FEED_URI': 'quotes.jsonlines',
    }

    def __init__(self, category=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/tag/%s/' % category, ]

    def parse(self, response):
        quote_block = response.css('div.quote')
        for quote in quote_block:
            text = quote.css('span.text::text').extract_first()
            author = quote.xpath('span/small/text()').extract_first()
            item = QuotesItem()
            item['text'] = text
            item['author'] = author
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

The launcher script; the -a option passes category=life to the spider's __init__:

from scrapy import cmdline
cmdline.execute("scrapy runspider Quotes_Spider.py -a category=life".split())

Both .py files live in the same directory.
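
The argument vector can also be passed as an explicit list instead of splitting a string (a minimal equivalent, not from the original article). Keep in mind that cmdline.execute() finishes by calling sys.exit(), so any code placed after it will not run:

from scrapy import cmdline

cmdline.execute(['scrapy', 'runspider', 'Quotes_Spider.py', '-a', 'category=life'])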

References

Adapted from: Run Scrapy - 从脚本运行爬虫及多爬虫运行 - 知乎; thanks to the author.
