Setting Request Headers: Detailed Configuration of Scrapy Framework Components
About Scrapy
Scrapy is a crawler framework implemented in pure Python; simplicity, ease of use, and high extensibility are its main strengths. This post does not dwell on Scrapy basics; instead it targets that extensibility and walks through how to configure each major component. It is not truly exhaustive either, but it should cover most people's needs : ). For more, read the official documentation carefully. The original post opened with a diagram of Scrapy's data flow for review and reference (not reproduced here); keeping that flow in mind helps with everything below. Now to the main topic; the concrete examples use a Douban spider. Creation commands:
scrapy startproject <Project_name>
scrapy genspider <spider_name> <domains>

To generate the convenient CrawlSpider skeleton for site-wide crawling instead, use:

scrapy genspider -t crawl <spider_name> <domains>
spider.py
First, the most central component, spider.py. Without further ado, here is the code; see the comments.
import scrapy  # basic statements are self-explanatory if you know Python, so comments stay brief
import json
# Import the item for persistence; you can also import it by its full package path
from ..items import DoubanItem


class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # Request headers for this one spider
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
        }
    }

    # Often there is no need to override this method; do so if you want
    # customized start URLs or per-request headers
    def start_requests(self):
        page = 18
        base_url = 'https://xxxx'  # elided in the original; it should contain a {} slot for the paging offset
        for i in range(page):
            url = base_url.format(i * 20)
            req = scrapy.Request(url=url, callback=self.parse)
            # Add a header to one request; later requests are handled the same way
            # req.headers['User-Agent'] = ''
            yield req

    # Nothing special to explain: routine page parsing that hands URLs on
    # (clear once you look at the data flow)
    def parse(self, response):
        json_str = response.body.decode('utf-8')
        res_dict = json.loads(json_str)
        for i in res_dict['subjects']:
            url = i['url']
            yield scrapy.Request(url=url, callback=self.parse_detailed_page)

    # Scrapy's response can be parsed with XPath directly; this is basic stuff
    def parse_detailed_page(self, response):
        title = response.xpath('//h1/span[1]/text()').extract_first()
        year = response.xpath('//h1/span[2]/text()').extract()[0]
        image = response.xpath('//img[@rel="v:image"]/@src').extract_first()

        item = DoubanItem()
        item['title'] = title
        item['year'] = year
        item['image'] = image
        # Downloading the images needs the ImagesPipeline, with matching
        # changes in settings and pipelines
        item['image_urls'] = [image]
        yield item
For site-wide crawling, the top of the spider class differs slightly; a fuller CrawlSpider skeleton is sketched after the list below.

rules = (Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'), callback='parse_item', follow=False),)

The key is the follow setting: whether the crawl reaches the intended depth and pages is yours to control. One more note: request headers can be set in three places, and the choice determines their scope:
In settings.py: the widest scope, affecting every spider in the project.
As a class attribute of the spider (custom_settings): affects all requests made by that spider.
On an individual request: affects only that request.
The three scopes thus run from the whole project, to a single spider, to a single request. When more than one is set, the headers on the individual request take the highest priority!
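To make the CrawlSpider variant concrete, here is a minimal skeleton; only the rules line comes from the example above, while the spider name, start URL, and parse_item body are illustrative placeholders.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DigimonSpider(CrawlSpider):
    name = 'digimon'  # placeholder name
    allowed_domains = ['digimons.net']
    start_urls = ['http://digimons.net/digimon/']  # placeholder entry page

    # follow=False: pages matched by the extractor are parsed,
    # but links found on them are not followed any further
    rules = (
        Rule(LinkExtractor(allow=r'http://digimons.net/digimon/.*/index.html'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # placeholder parsing logic
        yield {'title': response.xpath('//title/text()').extract_first()}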
items.py
import scrapy


class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    image = scrapy.Field()
    # The image-downloading ImagesPipeline needs its own item field
    image_urls = scrapy.Field()

    # I persist to MySQL; not expanded on here
    # CREATE TABLE douban(
    #     id int not null auto_increment primary key,
    #     title text, `year` int, image text) ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
    def get_insert_sql_and_data(self):
        # `year` is a reserved word, so it needs backticks
        insert_sql = 'INSERT INTO douban(title,`year`,image) ' \
                     'VALUES(%s,%s,%s)'
        data = (self['title'], self['year'], self['image'])
        return (insert_sql, data)
middlewares.py
Middleware is where things get flexible. Plenty of people never need it, yet it matters a lot in practice when configuring proxies. Ordinary needs don't call for touching the SpiderMiddleware; the changes below target the DownloaderMiddleware.
# Signals -- this concept matters a lot for custom Scrapy extensions
from scrapy import signals
# Locally written proxy helper (code below); it can sit on top of your own
# IP pool, or simply wrap a paid proxy service (as in my case)
from proxyhelper import Proxyhelper
# Multiple threads touch the same helper object, so we need a lock:
# instantiate it once, then acquire/release around each access
from twisted.internet.defer import DeferredLock


class DoubanSpiderMiddleware(object):  # spider middleware left unconfigured
    pass


class DoubanDownloaderMiddleware(object):
    def __init__(self):
        # Instantiate the proxy helper and the lock
        self.helper = Proxyhelper()
        self.lock = DeferredLock()

    @classmethod
    def from_crawler(cls, crawler):  # unchanged boilerplate
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Triggered when the request's data flow reaches the downloader middleware;
        # the lowercase 'proxy' meta key is the one HttpProxyMiddleware reads
        self.lock.acquire()
        request.meta['proxy'] = self.helper.get_proxy()
        self.lock.release()
        return None

    def process_response(self, request, response, spider):
        # Inspect the response; if it is not acceptable, switch proxy and re-request
        if response.status != 200:
            self.lock.acquire()
            self.helper.update_proxy(request.meta['proxy'])
            self.lock.release()
            return request
        return response

    def process_exception(self, request, exception, spider):
        self.lock.acquire()
        self.helper.update_proxy(request.meta['proxy'])
        self.lock.release()
        return request

    def spider_opened(self, spider):  # unchanged boilerplate
        spider.logger.info('Spider opened: %s' % spider.name)

Attached is the code for the proxyhelper:

import requests
class Proxyhelper(object):
    def __init__(self):
        self.proxy = self._get_proxy_from_xxx()

    def get_proxy(self):
        return self.proxy

    def update_proxy(self, proxy):
        # Only replace the proxy if the caller still holds the current one,
        # so concurrent callers don't trigger several updates in a row
        if proxy == self.proxy:
            print('Updating a proxy')
            self.proxy = self._get_proxy_from_xxx()

    def _get_proxy_from_xxx(self):
        url = ''  # fill in your provider's URL; ideally it returns one IP per call
        response = requests.get(url)
        return 'http://' + response.text.strip()
pipelines.py
# Local MySQL persistence helper; write your own as needed
from mysqlhelper import Mysqlhelper
# Import ImagesPipeline so we can subclass it and customize some behaviour
from scrapy.pipelines.images import ImagesPipeline
import hashlib
from scrapy.utils.python import to_bytes
from scrapy.http import Request


class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_lst = []
        for x in item.get(self.images_urls_field, []):
            req = Request(x)
            req.meta['movie_name'] = item['title']  # carry the title along
            request_lst.append(req)
        return request_lst

    # Overridden to rename the downloaded image
    def file_path(self, request, response=None, info=None):
        # The default naming scheme uses this SHA1 of the URL;
        # we name the file after the movie instead
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return 'full/%s.jpg' % (request.meta['movie_name'])
# Nothing special here; part of the work already lives in items.py,
# keeping pipelines and items functionally separate
class DoubanPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()

    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
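Mysqlhelper is a local helper whose code the post doesn't show. For reference, here is a minimal sketch of what it might look like using pymysql; the connection parameters are placeholders to fill in yourself.

import pymysql


class Mysqlhelper(object):
    def __init__(self):
        # Placeholder credentials -- substitute your own
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='', db='spider',
                                    charset='utf8mb4')

    def execute_sql(self, sql, data):
        # Execute one parameterized statement and commit immediately
        with self.conn.cursor() as cursor:
            cursor.execute(sql, data)
        self.conn.commit()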
settings.py
An extremely important component; the explanations are in the code comments.
# Bot name
BOT_NAME = 'Douban'

SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'

# Client user agent
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Download delay
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# per-domain and per-IP concurrency; these override the setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Telnet console (enabled by default) -- lets you monitor a running crawl
#TELNETCONSOLE_ENABLED = False
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# Usage: cmd -> telnet 127.0.0.1 6023 -> est()

# Override the default request headers:
# Default headers, effective for every spider in the project
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# }

# Spider middlewares
# SPIDER_MIDDLEWARES = {
#    # 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None
#    'Douban.middlewares.DoubanSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Douban.middlewares.DoubanDownloaderMiddleware': 560,
   # 560 because the built-in middlewares fall into many sub-groups, and their
   # numeric priorities decide the order in which the request and response
   # flows hit each middleware -- see the official docs for details
}

# Time limit for fetching a single URL
DOWNLOAD_TIMEOUT = 10

# Depth limit
# DEPTH_LIMIT = 1

# Custom extensions
EXTENSIONS = {
   'Douban.extends.MyExtension': 500,
}

# Item pipelines
ITEM_PIPELINES = {
   # 'scrapy.pipelines.images.ImagesPipeline': 1,  # register the image downloader if needed
   'Douban.pipelines.DoubanImagesPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default):
# automatic, algorithmic rate limiting
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default); rarely used
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Storage directory for the ImagesPipeline; enable as needed
IMAGES_STORE = 'download'
extends.py
Custom extensions. Configuring this component calls for some understanding of signals: knowing which signals Scrapy fires at which moments of a run, which in the end comes back to a complete grasp of the data flow. In the code I use a class of my own; in essence it uses the Miao Reminder push service (喵提醒) to send a notification at particular moments (no, Miao Reminder isn't paying me). You could just as well strengthen the extension with logging or other features, wiring each behaviour to the signal fired at the moment you care about.
The file has to be created by hand; the original post showed its location in the project directory with a screenshot (omitted here).
from scrapy import signals
from message import Message


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)
        # Hook our callbacks onto the spider_opened / spider_closed signals
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        # Push a notification when the spider finishes
        message = Message('spider run finished')
        message.push()
        print('spider closed')
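The Message class is my own wrapper and its code isn't shown here. A rough sketch of such a helper, assuming a push service triggered by a plain HTTP request (the URL and parameter below are hypothetical, not the real Miao Reminder API):

import requests


class Message(object):
    def __init__(self, text):
        self.text = text

    def push(self):
        # Hypothetical endpoint and parameter -- replace with your push
        # service's actual trigger URL and fields
        requests.post('https://example.com/push', data={'text': self.text})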
running.py
Finally, a brief word on running.py: it is really just a way to run the scrapy command line from inside Python.
from scrapy.cmdline import execute

execute('scrapy crawl douban'.split())
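If you prefer not to go through the command-line entry point at all, Scrapy's standard CrawlerProcess API achieves the same thing in-process:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py, then run the douban spider in this process
process = CrawlerProcess(get_project_settings())
process.crawl('douban')
process.start()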
The above covers Scrapy component configuration sufficient for basic needs; refer back to it whenever something is still unfamiliar. Scrapy crawler case studies will follow in later posts.