
Web Crawling with Scrapy

Published: 2023/12/31


Q2Day81

Performance

When writing a crawler, most of the cost lies in I/O requests. In single-process, single-threaded mode, every URL request blocks until its response arrives, so the run as a whole is slow.

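import requests


def fetch_async(url):
    # Despite the name, this call blocks until the response arrives.
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    fetch_async(url)

Ways to run the requests:

1. Synchronous execution (the code above)
2. Multithreaded execution, with or without a callback function
3. Multiprocess execution, with or without a callback function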
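As a rough illustration of multithreaded execution with a callback, the sketch below uses the standard library's concurrent.futures. To stay runnable offline, `fake_fetch` is a hypothetical stand-in that sleeps instead of calling requests.get; the third URL and the 0.2-second delay are assumptions made for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url): sleeps to mimic network wait."""
    time.sleep(0.2)
    return "response from %s" % url


results = []


def on_done(future):
    """Callback run when a task finishes; collects the task's result."""
    results.append(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com', 'http://www.example.com']

with ThreadPoolExecutor(max_workers=3) as pool:
    for url in url_list:
        future = pool.submit(fake_fetch, url)
        future.add_done_callback(on_done)
# Leaving the with-block waits for all submitted tasks to finish.

print(sorted(results))
```

Swapping ThreadPoolExecutor for ProcessPoolExecutor gives the multiprocess variant, provided the submitted function is defined at module level so that it can be pickled.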

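All of the approaches above improve request throughput. The drawback of multithreading and multiprocessing is that threads and processes sit idle while they are blocked on I/O, so asynchronous I/O is the preferred choice: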

1. asyncio (two examples)
2. asyncio + aiohttp
3. asyncio + requests
4. gevent + requests
5. grequests
6. Twisted
7. Tornado

A Twisted example that sends a POST request:

from twisted.internet import reactor
# Note: getPage is deprecated in newer Twisted releases.
from twisted.web.client import getPage
import urllib.parse


def one_done(arg):
    # Receives the response body on success or a Failure on error.
    print(arg)
    reactor.stop()


post_data = urllib.parse.urlencode({'check_data': 'adf'})
post_data = bytes(post_data, encoding='utf8')
headers = {b'Content-Type': b'application/x-www-form-urlencoded'}
response = getPage(bytes('http://dig.chouti.com/login', encoding='utf8'),
                   method=bytes('POST', encoding='utf8'),
                   postdata=post_data,
                   cookies={},
                   headers=headers)
response.addBoth(one_done)

reactor.run()
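For comparison, here is a minimal asyncio sketch of the same idea. It is not the original post's asyncio example: `fake_fetch` is a hypothetical coroutine that awaits asyncio.sleep in place of a real HTTP call, and the delays are assumptions, so the script runs without network access.

```python
import asyncio


async def fake_fetch(url, delay):
    """Hypothetical stand-in for an HTTP request: awaits instead of blocking."""
    await asyncio.sleep(delay)
    return url, "done after %.1fs" % delay


async def main():
    tasks = [
        fake_fetch('http://www.github.com', 0.2),
        fake_fetch('http://www.bing.com', 0.1),
    ]
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
for url, status in results:
    print(url, status)
```

Both waits overlap on a single thread, which is the property that makes asynchronous I/O attractive for crawlers.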

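The modules above, from Python's standard library and from third-party packages, provide asynchronous I/O requests; they are easy to use and greatly improve efficiency. Underneath, an asynchronous I/O request comes down to a non-blocking socket plus I/O multiplexing: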

import select
import socket
import time


class AsyncTimeoutException(TimeoutError):
    """Raised when a request times out."""

    def __init__(self, msg):
        self.msg = msg
        super(AsyncTimeoutException, self).__init__(msg)


class HttpContext(object):
    """Holds the basic data of one request and its response."""

    def __init__(self, sock, host, port, method, url, data, callback, timeout=5):
        """
        sock: client socket used for the request
        host: host name to request
        port: port to request
        method: request method
        url: request URL
        data: data sent in the request body
        callback: function called when the request completes
        timeout: request timeout in seconds
        """
        self.sock = sock
        self.callback = callback
        self.host = host
        self.port = port
        self.method = method
        self.url = url
        self.data = data

        self.timeout = timeout

        self.__start_time = time.time()
        self.__buffer = []

    def is_timeout(self):
        """Return True if this request has timed out."""
        current_time = time.time()
        return (self.__start_time + self.timeout) < current_time

    def fileno(self):
        """File descriptor of the request socket, used by select."""
        return self.sock.fileno()

    def write(self, data):
        """Append a chunk of response data to the buffer."""
        self.__buffer.append(data)

    def finish(self, exc=None):
        """The response is complete (or failed); run the request's callback."""
        if not exc:
            response = b''.join(self.__buffer)
            self.callback(self, response, exc)
        else:
            self.callback(self, None, exc)

    def send_request_data(self):
        content = """%s %s HTTP/1.0\r\nHost: %s\r\n\r\n%s""" % (
            self.method.upper(), self.url, self.host, self.data,)

        return content.encode(encoding='utf8')


class AsyncRequest(object):
    def __init__(self):
        self.fds = []
        self.connections = []

    def add_request(self, host, port, method, url, data, callback, timeout):
        """Create a request."""
        client = socket.socket()
        client.setblocking(False)
        try:
            client.connect((host, port))
        except BlockingIOError as e:
            # Expected for a non-blocking socket: the connection is in progress.
            pass
        req = HttpContext(client, host, port, method, url, data, callback, timeout)
        self.connections.append(req)
        self.fds.append(req)

    def check_conn_timeout(self):
        """Abort every request whose connection attempt has timed out."""
        timeout_list = []
        for context in self.connections:
            if context.is_timeout():
                timeout_list.append(context)
        for context in timeout_list:
            context.finish(AsyncTimeoutException('request timed out'))
            self.fds.remove(context)
            self.connections.remove(context)

    def running(self):
        """Event loop: wait for sockets to become ready and act on them."""
        while True:
            # Stop once every request has finished; this also avoids calling
            # select with empty lists.
            if not self.fds:
                return
            readable, writable, _ = select.select(self.fds, self.connections, self.fds, 0.05)

            for context in readable:
                sock = context.sock
                while True:
                    try:
                        data = sock.recv(8096)
                        if not data:
                            # The server closed the connection: response complete.
                            self.fds.remove(context)
                            context.finish()
                            break
                        else:
                            context.write(data)
                    except BlockingIOError as e:
                        # No more data available right now.
                        break
                    except TimeoutError as e:
                        self.fds.remove(context)
                        if context in self.connections:
                            self.connections.remove(context)
                        context.finish(e)
                        break

            for context in writable:
                # Connected to the remote server; send the request data.
                if context in self.fds:
                    data = context.send_request_data()
                    context.sock.sendall(data)
                    self.connections.remove(context)

            self.check_conn_timeout()


if __name__ == '__main__':
    def callback_func(context, response, ex):
        """
        :param context: HttpContext object holding the request information
        :param response: content of the response
        :param ex: the exception object if an error occurred, otherwise None
        :return:
        """
        print(context, response, ex)

    obj = AsyncRequest()
    url_list = [
        {'host': 'www.google.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.baidu.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
        {'host': 'www.bing.com', 'port': 80, 'method': 'GET', 'url': '/', 'data': '', 'timeout': 5,
         'callback': callback_func},
    ]
    for item in url_list:
        print(item)
        obj.add_request(**item)

    obj.running()
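The same pattern, a non-blocking socket registered with an I/O multiplexer, can be shown in a few lines with the standard library's selectors module. This is a sketch rather than a replacement for the code above: it uses socket.socketpair() as a local stand-in for a remote server so that it runs without network access, and the handler name `on_readable` is made up for the example.

```python
import selectors
import socket

# A connected pair of local sockets stands in for client and remote server.
client, server = socket.socketpair()
client.setblocking(False)
server.setblocking(False)

sel = selectors.DefaultSelector()
received = []


def on_readable(sock):
    """Read whatever is available without blocking."""
    received.append(sock.recv(1024))


# Register the client socket for read events, with its handler attached as data.
sel.register(client, selectors.EVENT_READ, on_readable)

# Nothing has been sent yet, so a poll with zero timeout reports no ready sockets.
ready_before = sel.select(timeout=0)

server.sendall(b'HTTP/1.0 200 OK\r\n\r\n')

# Now the selector reports the client as readable; dispatch to its handler.
for key, events in sel.select(timeout=1):
    key.data(key.fileobj)

sel.unregister(client)
sel.close()
client.close()
server.close()

print(ready_before, received)
```

The program never blocks inside recv: it asks the selector which sockets are ready and only then reads, which is what lets one thread serve many connections.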

Scrapy

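Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is versatile and can be used for data mining, monitoring, and automated testing.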

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its architecture consists mainly of the following components:

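Engine (Scrapy Engine)
Handles the data flow of the whole system and triggers events (the core of the framework).
Scheduler
Accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL to crawl next and removes duplicate URLs.
Downloader
Downloads page content and returns it to the spiders. (The Scrapy downloader is built on Twisted's efficient asynchronous model.)
Spiders
Spiders do the main work: they extract the information you need, the so-called items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline
Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and stripping unwanted information. After a spider parses a page, the items are sent to the item pipeline and processed by several stages in a fixed order.
Downloader Middlewares
A layer between the Scrapy engine and the downloader that processes the requests and responses passing between them.
Spider Middlewares
A layer between the Scrapy engine and the spiders that processes the spiders' response input and request output.
Scheduler Middlewares
A layer between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.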

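Scrapy's run flow is roughly as follows:

1. The engine takes a URL from the scheduler for the next fetch.
2. The engine wraps the URL in a Request and passes it to the downloader.
3. The downloader fetches the resource and wraps it in a Response.
4. A spider parses the Response.
5. If an item is parsed out, it is handed to the item pipeline for further processing.
6. If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling.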
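The flow above can be mimicked with a toy loop in plain Python. This is a deliberately simplified sketch and not how Scrapy is implemented: it is synchronous, and `FAKE_WEB`, `Scheduler`, `download`, `parse`, and `run` are all hypothetical names invented for the illustration.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    '/': ('home', ['/a', '/b']),
    '/a': ('page a', ['/b', '/']),
    '/b': ('page b', []),
}


class Scheduler:
    """Queue of pending URLs that also drops duplicates."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def download(url):
    """Downloader: returns a response-like dict for the URL."""
    title, links = FAKE_WEB[url]
    return {'url': url, 'title': title, 'links': links}


def parse(response):
    """Spider: yields one item per page, then the links to follow."""
    yield {'item': response['title']}
    for link in response['links']:
        yield {'link': link}


def run(start_url):
    scheduler = Scheduler()
    pipeline = []  # the "item pipeline" here simply collects items
    scheduler.enqueue(start_url)
    while True:
        url = scheduler.next_url()       # engine takes the next URL
        if url is None:
            break
        response = download(url)         # request goes out, response comes back
        for result in parse(response):   # spider parses the response
            if 'item' in result:
                pipeline.append(result['item'])    # items go to the pipeline
            else:
                scheduler.enqueue(result['link'])  # links go back to the scheduler
    return pipeline


items = run('/')
print(items)
```

Each page is visited once even though '/a' links back to '/', because the scheduler filters URLs it has already seen.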

I. Installation

Linux
      pip3 install scrapy

Windows
      a. pip3 install wheel
      b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
      c. Change to the download directory and run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
      d. pip3 install scrapy
      e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/

II. Basic Usage

1. Basic commands

1. scrapy startproject <project name>
   - Creates a project directory in the current directory (similar to Django)

2. scrapy genspider [-t template] <name> <domain>
   - Creates a spider
   For example:
      scrapy genspider -t basic oldboy oldboy.com
      scrapy genspider -t xmlfeed autohome autohome.com.cn
   PS:
      List all templates: scrapy genspider -l
      Show a template: scrapy genspider -d <template name>

3. scrapy list
   - Lists the spiders in the project

4. scrapy crawl <spider name>
   - Runs a single spider

2. Project structure and spider overview

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py
           spider1.py
           spider2.py
           spider3.py

File descriptions:

scrapy.cfg    The project's main configuration. (The crawler-specific settings live in settings.py.)
items.py      Data storage templates used to structure data, similar to Django's Model
pipelines.py  Data processing behavior, typically persisting the structured data
settings.py   Configuration file, e.g. recursion depth, concurrency, download delay
spiders/      Spider directory, where you create files and write crawling rules

Note: spider files are usually named after the website's domain.
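The recursion depth, concurrency, and download delay mentioned for settings.py correspond to entries like the ones below. The setting names are Scrapy's own; the values are illustrative assumptions, not recommendations.

```python
# settings.py (fragment); the values are illustrative assumptions

# Maximum crawl depth followed from the start URLs (0 means no limit).
DEPTH_LIMIT = 2

# Maximum number of requests Scrapy performs concurrently.
CONCURRENT_REQUESTS = 16

# Seconds to wait between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.5
```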


3. A first try

import scrapy
# HtmlXPathSelector is deprecated in later Scrapy releases in favor of Selector.
from scrapy.selector import HtmlXPathSelector
from scrapy.http.request import Request


class DigSpider(scrapy.Spider):
    # Name of the spider; the crawl command starts the spider by this name
    name = "dig"

    # Allowed domains
    allowed_domains = ["chouti.com"]

    # Start URLs
    start_urls = [
        'http://dig.chouti.com/',
    ]

    # Maps the MD5 of each URL already requested to the URL itself
    has_request_set = {}

    def parse(self, response):
        print(response.url)

        hxs = HtmlXPathSelector(response)
        page_list = hxs.select(r'//div[@id="dig_lcpage"]//a[re:test(@href, "/all/hot/recent/\d+")]/@href').extract()
        for page in page_list:
            page_url = 'http://dig.chouti.com%s' % page
            key = self.md5(page_url)
            if key in self.has_request_set:
                pass
            else:
                self.has_request_set[key] = page_url
                obj = Request(url=page_url, method='GET', callback=self.parse)
                yield obj

    @staticmethod
    def md5(val):
        import hashlib
        ha = hashlib.md5()
        ha.update(bytes(val, encoding='utf-8'))
        key = ha.hexdigest()
        return key

To run this spider, change to the project directory in a terminal and run:

    scrapy crawl dig --nolog

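The key points in the code above are:

Request is a class that wraps a user request; yielding a Request from a callback tells Scrapy to go on and fetch that URL.
HtmlXPathSelector parses the HTML into a structure and provides selector functions.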

4. Selectors

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# HtmlXPathSelector is not imported here: it is deprecated in later Scrapy
# releases and is only mentioned in a commented-out line below.
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

# Uncomment any pair of lines below to try the corresponding XPath expression.
# hxs = HtmlXPathSelector(response)  # older Scrapy releases only
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)

Example: logging in to Chouti automatically and upvoting

Note: set DEPTH_LIMIT = 1 in settings.py to limit the depth of the "recursion".

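5. Structuring data with items

The examples above do only simple processing, so they handle everything directly in the parse method. To process more of the scraped data, use Scrapy's items to give the data a structure and then hand it to pipelines for uniform processing.

Files involved: spiders/xiahuar.py, items, pipelines, settings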

A pipeline can do more than this when you write a custom pipeline.
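As a hedged sketch of what a custom pipeline might look like (this is not the original post's code), the class below writes each item as one JSON line. It is a plain class that relies only on the hook names Scrapy calls on item pipelines (from_crawler, open_spider, process_item, close_spider); the class name and the `ITEMS_PATH` setting are hypothetical. The lines after the class drive the hooks by hand so the example runs without Scrapy.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes each item as one JSON line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # ITEMS_PATH is a hypothetical custom setting, not a built-in one.
        return cls(crawler.settings.get('ITEMS_PATH', 'items.jl'))

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        # Returning the item passes it on to the next pipeline, if any.
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Offline demonstration: call the hooks in the order Scrapy would.
demo_path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'first item', 'href': 'link.html'}, spider=None)
pipeline.process_item({'title': 'second item', 'href': 'llink2.html'}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding='utf-8') as f:
    lines = [json.loads(line) for line in f]
print(lines)
```

In a real project the class would be enabled through the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {'project_name.pipelines.JsonLinesPipeline': 300}, where the module path is an assumption about the project layout.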

6. Middleware

This covers spider middleware and downloader middleware.
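To give a feel for the shape of a downloader middleware, here is a minimal sketch that sets a default User-Agent header. It is an illustration, not the original post's code: the class name is made up, and `StubRequest` is a hypothetical stand-in for scrapy.Request so that the example runs without Scrapy installed. The hook names process_request and process_response are the ones Scrapy calls on downloader middlewares.

```python
class FixedUserAgentMiddleware:
    """Downloader middleware sketch: sets a default User-Agent on outgoing requests."""

    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT is a built-in Scrapy setting; the fallback value is an assumption.
        return cls(crawler.settings.get('USER_AGENT', 'demo-bot/0.1'))

    def process_request(self, request, spider):
        request.headers.setdefault('User-Agent', self.user_agent)
        # Returning None lets Scrapy continue processing this request.
        return None

    def process_response(self, request, response, spider):
        # Pass the response through unchanged.
        return response


class StubRequest:
    """Hypothetical stand-in for scrapy.Request, for an offline demonstration."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


mw = FixedUserAgentMiddleware('demo-bot/0.1')
req = StubRequest('http://dig.chouti.com/')
result = mw.process_request(req, spider=None)
print(req.headers, result)
```

A middleware like this would be switched on through the DOWNLOADER_MIDDLEWARES setting, for example {'project_name.middlewares.FixedUserAgentMiddleware': 543}; the module path and the priority are assumptions.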

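7. Custom commands

Create a directory of any name, such as commands, at the same level as spiders.
In it, create a crawlall.py file (the file name becomes the name of the custom command).
Add COMMANDS_MODULE = '<project name>.<directory name>' to settings.py.
From the project directory, run: scrapy crawlall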

8. Custom extensions

A custom extension uses signals to register specific operations at designated points in the crawl.

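9. Avoiding duplicate requests

By default Scrapy deduplicates requests with scrapy.dupefilter.RFPDupeFilter (the module is named scrapy.dupefilters in Scrapy 1.0 and later). The related settings are:

DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory that stores the record of visited requests, e.g. /root/"  # the final path is /root/requests.seen

URL deduplication can also be customized.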
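A custom deduplication class might look like the sketch below, which treats two requests as duplicates when their URLs match. This is an assumption-laden illustration and not the original post's code: the method names mirror the dupefilter interface as commonly documented (from_settings, request_seen, open, close, log), but the exact hooks vary between Scrapy releases, so check them against the version you use. `SeenUrlFilter` and `StubRequest` are hypothetical names, and the stub lets the example run without Scrapy.

```python
class SeenUrlFilter:
    """Dupefilter sketch: a request is a duplicate if its URL was seen before."""

    def __init__(self):
        self.visited = set()

    @classmethod
    def from_settings(cls, settings):
        # No settings are needed for this simple filter.
        return cls()

    def request_seen(self, request):
        """Return True if the request should be dropped as a duplicate."""
        if request.url in self.visited:
            return True
        self.visited.add(request.url)
        return False

    def open(self):
        # Called when the crawl starts; nothing to set up here.
        pass

    def close(self, reason):
        # Called when the crawl ends; nothing to clean up here.
        pass

    def log(self, request, spider):
        # Called for filtered requests; this sketch stays silent.
        pass


class StubRequest:
    """Hypothetical stand-in for scrapy.Request, for an offline demonstration."""

    def __init__(self, url):
        self.url = url


dupefilter = SeenUrlFilter.from_settings(settings=None)
first = dupefilter.request_seen(StubRequest('http://dig.chouti.com/all/hot/recent/2'))
second = dupefilter.request_seen(StubRequest('http://dig.chouti.com/all/hot/recent/2'))
other = dupefilter.request_seen(StubRequest('http://dig.chouti.com/all/hot/recent/3'))
print(first, second, other)
```

Such a class would be selected with DUPEFILTER_CLASS, for example 'project_name.duplication.SeenUrlFilter', where the module path is an assumption. Unlike the default filter, this sketch compares raw URLs only and ignores the request method and body.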

10. Other

settings

11. TinyScrapy

Twisted example 1, Twisted example 2, Twisted example 3, a simulated Scrapy framework, and a reference version.

More documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

Reposted from: https://www.cnblogs.com/xc1234/p/8645901.html
