Summary: An introduction to two core Scrapy objects, Request and Response.

Request

A Request object represents a single HTTP request. Requests are generated by a Spider and handled by the Downloader, which produces a Response.

scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags])

Parameters:

  • url (string)

    The target URL of the request.

  • callback (callable)

    The function that will process the response to this request. If not specified, parse() is used as the callback. If an error occurs during processing, errback is called instead.

  • meta (dict)

    A dict of arbitrary data that is passed along to the callback (available there as response.meta).

  • priority (int)

    Sets the priority of the request: the higher the value, the earlier the scheduler executes it. Defaults to 0; negative values are allowed to indicate relatively low priority.

  • dont_filter (boolean)

    Defaults to False. When set to True, the request is not filtered by the duplicate filter; use with care, as this can send the crawler into an infinite loop.

  • errback (callable)

    A function called when an error occurs while processing the request. It receives a Twisted Failure instance as its first argument and can be used to track connection timeouts, DNS errors, and so on.

  • copy()

    Returns a new Request that is a copy of this one.

  • replace()

    Returns a Request with the same attributes, except for those given new values (see the sketch below).
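
A minimal sketch illustrating the constructor parameters and the copy()/replace() helpers (the URL, header, and meta values are hypothetical):

    import scrapy

    req = scrapy.Request(
        url="http://www.example.com/page.html",
        headers={'User-Agent': 'my-bot/0.1'},   # hypothetical UA string
        meta={'page_kind': 'listing'},          # arbitrary user data, hypothetical key
        priority=10,                            # scheduled before priority-0 requests
        dont_filter=True,                       # bypass the duplicate filter
    )

    req_copy = req.copy()                       # identical copy
    req_post = req.replace(method='POST')       # same attributes, new method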

Examples:

  1. Using meta to pass data between callbacks

    def parse_page1(self, response):
        item = MyItem()
        item['main_url'] = response.url
        request = scrapy.Request("http://www.example.com/some_page.html",
                                 callback=self.parse_page2)
        request.meta['item'] = item
        yield request
    
    def parse_page2(self, response):
        item = response.meta['item']
        item['other_url'] = response.url
        yield item
  2. Error handling

    
    import scrapy
    
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError
    from twisted.internet.error import TimeoutError, TCPTimedOutError
    
    class ErrbackSpider(scrapy.Spider):
        name = "errback_example"
        start_urls = [
            "http://www.httpbin.org/",              # HTTP 200 expected
            "http://www.httpbin.org/status/404",    # Not found error
            "http://www.httpbin.org/status/500",    # server issue
            "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
            "http://www.httphttpbinbin.org/",       # DNS error expected
        ]
    
        def start_requests(self):
            for u in self.start_urls:
                yield scrapy.Request(u, callback=self.parse_httpbin,
                                        errback=self.errback_httpbin,
                                        dont_filter=True)
    
        def parse_httpbin(self, response):
            self.logger.info('Got successful response from {}'.format(response.url))
            # do something useful here...
        # If errback yields or returns values, it must return an iterable, e.g. ["1", "2", "3", "4"]
        def errback_httpbin(self, failure):
            # log all failures
            self.logger.error(repr(failure))
    
            # in case you want to do something special for some errors,
            # you may need the failure's type:
    
            if failure.check(HttpError):
                # these exceptions come from HttpError spider middleware
                # you can get the non-200 response
                response = failure.value.response
                self.logger.error('HttpError on %s', response.url)
    
            elif failure.check(DNSLookupError):
                # this is the original request
                request = failure.request
                self.logger.error('DNSLookupError on %s', request.url)
    
            elif failure.check(TimeoutError, TCPTimedOutError):
                request = failure.request
                self.logger.error('TimeoutError on %s', request.url)
    
  3. Special meta keys
    The Request.meta attribute can carry arbitrary data, but some keys have special meaning to Scrapy and its built-in extensions. They are listed below; a combined usage sketch follows the list:

    • dont_redirect
      If this value is set to True, the request is ignored by the redirect middleware, i.e. redirects are not followed.
    • redirect_urls

      # the URLs the request passed through when redirected
      request.meta["redirect_urls"]
    • redirect_reasons

      # the reason for each redirect
      request.meta["redirect_reasons"]
      # For example: [301, 302, 307, 'meta refresh']
    • dont_retry
      If set to True, the request will not be retried by the Retry middleware.
    • handle_httpstatus_list
      A list of response status codes that are allowed through to the callback for this request.
    • handle_httpstatus_all
      Set this to True to allow every response status code for this request.
    • dont_merge_cookies
      Set this to True when you do not want the response's cookies merged with the existing cookie session (handled by the Cookies middleware). Example:

      request_with_cookies = Request(url="http://www.example.com",
                                     cookies={'currency': 'USD', 'country': 'UY'},
                                     meta={'dont_merge_cookies': True})
    • cookiejar
      Supports keeping multiple cookie sessions per spider. The cookiejar key is not sticky, though: it is not carried over automatically, so you need to pass it along with each subsequent request. Example:

      for i, url in enumerate(urls):
          yield scrapy.Request(url, meta={'cookiejar': i},
              callback=self.parse_page)
      
      def parse_page(self, response):
          # do some processing
          return scrapy.Request("http://www.example.com/otherpage",
              meta={'cookiejar': response.meta['cookiejar']},
              callback=self.parse_other_page)
    • dont_cache
      Set to True to avoid caching the response.
    • bindaddress
      The outgoing IP address to use for performing the request.
    • dont_obey_robotstxt
      If present in Request.meta, robots.txt rules are ignored for this request even when ROBOTSTXT_OBEY is True in the settings.
    • download_timeout
      The amount of time (in seconds) the downloader will wait before timing out.
    • download_maxsize
      Maximum response body size in bytes.
    • download_latency

      The amount of time spent to fetch the response, since the request has been started, i.e. HTTP message sent over the network. This meta key only becomes available when the response has been downloaded. While most other meta keys are used to control Scrapy behavior, this one is supposed to be read-only.

    • download_fail_on_dataloss
      Defaults to True: if the response body size does not match the Content-Length header, ResponseFailed([_DataLoss]) is raised. When set to False, the broken response is processed anyway and 'dataloss' is added to the response's flags:

      'dataloss' in response.flags   # evaluates to True
    • proxy
      Sets the proxy server to use for this request; it takes precedence over the http_proxy / https_proxy environment variables.

      http://some_proxy_server:port
      http://username:password@some_proxy_server:port
    • ftp_user
      The username to use for FTP connections (overrides the FTP_USER setting).
    • ftp_password
      The password to use for FTP connections (overrides the FTP_PASSWORD setting).
    • referrer_policy
      The referrer policy to apply when populating the request's Referer header.
    • max_retry_times
      Sets the maximum number of retries for this request; it takes precedence over the RETRY_TIMES setting.
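
    As promised above, a combined sketch showing several of these keys on one request (the URL, proxy address, and callback are hypothetical, and the snippet assumes it runs inside a spider method):

      yield scrapy.Request(
          "http://www.example.com/page.html",
          callback=self.parse_page,                  # hypothetical callback
          meta={
              'proxy': 'http://user:password@some_proxy_server:8080',
              'download_timeout': 30,                # seconds
              'max_retry_times': 2,                  # overrides RETRY_TIMES
              'handle_httpstatus_list': [404, 500],  # let these codes reach the callback
          },
      )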

FormRequest: an extension of Request

  1. Simulating a form submission (POST)

    return [FormRequest(url="http://www.example.com/post/action",
                        formdata={'name': 'John Doe', 'age': '27'},
                        callback=self.after_post)]
  2. Using the class method from_response to simulate a user login. It returns a FormRequest object whose form fields are pre-populated from the <form> tag in the HTML page of the given response (a sketch of the form-selection options follows the example).
    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'example.com'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login
            )

        def after_login(self, response):
            # check that the login succeeded before going on
            # (response.body is bytes, so compare against a bytes literal)
            if b"authentication failed" in response.body:
                self.logger.error("Login failed")
                return
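
    When the page contains more than one <form>, from_response can select a specific one via its formname, formid, formnumber, formxpath, or formcss arguments. A minimal sketch, assuming a hypothetical login form with id "login":

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formxpath='//form[@id="login"]',   # select the form by XPath
            # alternatives: formname=..., formid=..., formnumber=..., formcss=...
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )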

Response
