Scrapy Performance

Reference: https://blog.csdn.net/s150503/article/details/72571680

CONCURRENT_REQUESTS and DOWNLOAD_DELAY

To see how CONCURRENT_REQUESTS and DOWNLOAD_DELAY relate to each other in Scrapy, first set up a small project to test them.

The Douban Movie Top 250 list is used as the example:
douban_spider.py
# -*- coding: utf-8 -*-
import re
import time

import scrapy
from lxml import etree

# Garbled response when logging in to Douban with Scrapy:
# https://www.jianshu.com/p/9974fc338242


class ExampleSpider(scrapy.Spider):
    name = 'douban'
    # Should match the host used in start_urls.
    allowed_domains = ['movie.douban.com']
    # start_urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(0, 250, 25)]
    # Generate 10000 URLs so there are plenty of requests to time.
    start_urls = ['https://movie.douban.com/top250?start={}&filter='.format(i) for i in range(10000)]
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,'
                      '*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9",
            "Connection": "keep-alive",
            "Host": "movie.douban.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                          ' (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
        },
        # The two settings under test.
        'CONCURRENT_REQUESTS': 10,
        'DOWNLOAD_DELAY': 0.01,
        # 0 disables the per-IP limit; the per-domain limit is set high
        # so that it does not interfere with the experiment.
        'CONCURRENT_REQUESTS_PER_IP': 0,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 10000,
        'FEED_EXPORT_ENCODING': 'utf-8'
    }

    def parse(self, response):
        current_url = response.url
        print(current_url)
        # Blocking pause (it stalls Scrapy's event loop); presumably here
        # to make batches of requests easy to tell apart in the output.
        time.sleep(3)
        # Early return: the experiment only looks at request timing,
        # so the parsing code below is never reached.
        return

        offset = re.findall(r'start=(\d+)', current_url)[0]
        page_num = int(offset) // 25
        html = etree.HTML(text=response.text)
        # Locate the li tags first: data is a list of 25 li elements,
        # one per movie on the page.
        data = html.xpath('//ol[@class="grid_view"]/li')
        for index, d in enumerate(data):
            data_title = d.xpath('div/div[2]/div[@class="hd"]/a/span[1]/text()')
            data_info = d.xpath('div/div[2]/div[@class="bd"]/p[1]/text()')
            data_quote = d.xpath('div/div[2]/div[@class="bd"]/p[2]/span/text()')
            data_score = d.xpath('div/div[2]/div[@class="bd"]/div/span[@class="rating_num"]/text()')
            data_num = d.xpath('div/div[2]/div[@class="bd"]/div/span[4]/text()')
            data_pic_url = d.xpath('div/div[1]/a/img/@src')
            print(f"No: {page_num * 25 + index + 1} {data_title}")


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl douban'.split())
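Verification 1:
'CONCURRENT_REQUESTS': 10,
'DOWNLOAD_DELAY': 0.01,
With CONCURRENT_REQUESTS set to 10, up to 10 requests can in theory be in flight at once. With DOWNLOAD_DELAY set to 0.01, the delay on its own would allow 1 / 0.01 = 100 requests per second. The smaller of the two values is 10, so 10 requests are issued concurrently.
Observed result: about 10 requests are sent within almost the same second.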
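
One way to check an observation like this is to record a timestamp for each request and count how many fall into each second. The following is a minimal sketch under that assumption; the function name requests_per_second and the sample timestamps are made up for illustration and are not measurements from the experiment.

```python
from collections import Counter


def requests_per_second(timestamps):
    """Count how many timestamps (in seconds, as floats) fall into each whole second."""
    return Counter(int(t) for t in timestamps)


# Hypothetical timestamps, e.g. time.time() values minus the start time.
sample = [0.01, 0.02, 0.05, 0.40, 0.90, 1.10, 1.20]
for second, count in sorted(requests_per_second(sample).items()):
    print(f"second {second}: {count} request(s)")
```

In the spider, such timestamps could be collected by calling time.time() next to the existing print(current_url) line.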
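Verification 2:
'CONCURRENT_REQUESTS': 10,
'DOWNLOAD_DELAY': 0.5,
With CONCURRENT_REQUESTS set to 10, up to 10 requests can in theory be in flight at once. With DOWNLOAD_DELAY set to 0.5, the delay on its own allows only 1 / 0.5 = 2 requests per second. The smaller of the two values is 2, so only 2 requests are issued concurrently.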
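
The rule of thumb used in the two verifications can be written as a small helper. This is only a sketch of the reasoning in this article, not Scrapy's actual scheduling logic; the function name estimated_parallel_requests is hypothetical.

```python
def estimated_parallel_requests(concurrent_requests, download_delay):
    """Apply the article's rule: min(CONCURRENT_REQUESTS, 1 / DOWNLOAD_DELAY).

    A delay of 0 means the delay imposes no limit, so only the
    concurrency setting applies.
    """
    if download_delay <= 0:
        return concurrent_requests
    return min(concurrent_requests, 1 / download_delay)


# The two configurations tested above.
print(estimated_parallel_requests(10, 0.01))
print(estimated_parallel_requests(10, 0.5))
```

The estimate ignores response latency, so it is best read as a rough upper bound on how many requests start per second.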
Summary:
DOWNLOAD_DELAY limits the effect of CONCURRENT_REQUESTS: when the delay is large enough, the configured concurrency never shows up.

Thoughts:
1. With CONCURRENT_REQUESTS set and no DOWNLOAD_DELAY, the server receives a large number of requests at the same time.
'CONCURRENT_REQUESTS': 10,
# 'DOWNLOAD_DELAY': 0.5,
With DOWNLOAD_DELAY commented out, its default value of 0 is used.

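2. With both CONCURRENT_REQUESTS and DOWNLOAD_DELAY in effect, the server does not receive a large number of requests at the same time.
# 'CONCURRENT_REQUESTS': 0,
'DOWNLOAD_DELAY': 0.5,
With CONCURRENT_REQUESTS commented out, its default value of 16 is used.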
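
The difference between the two cases can be illustrated with a toy simulation of when each request would start. This is a simplified model and not Scrapy's downloader code: it assumes every response takes the same fixed latency, it ignores Scrapy's RANDOMIZE_DOWNLOAD_DELAY setting, and the function name simulate_start_times is hypothetical.

```python
import heapq


def simulate_start_times(total, concurrent_requests, download_delay, latency):
    """Return the start time of each request under a simplified model.

    A request starts no earlier than download_delay after the previous
    start, and only while fewer than concurrent_requests are in flight.
    """
    in_flight = []      # min-heap of finish times
    starts = []
    next_allowed = 0.0  # earliest start permitted by the delay
    for _ in range(total):
        t = next_allowed
        if len(in_flight) >= concurrent_requests:
            # All slots are taken: wait for the earliest one to finish.
            t = max(t, heapq.heappop(in_flight))
        starts.append(t)
        heapq.heappush(in_flight, t + latency)
        next_allowed = t + download_delay
    return starts


# Case 1: concurrency 10, no delay; assumed latency of 1 second.
burst = simulate_start_times(20, 10, 0.0, 1.0)
# Case 2: default concurrency 16, delay 0.5; same assumed latency.
paced = simulate_start_times(20, 16, 0.5, 1.0)
print(burst[:10])
print(paced[:4])
```

In this model the first case starts requests in bursts of 10, while the second spaces them 0.5 seconds apart regardless of the concurrency limit.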