Incremental and Distributed Crawlers

Published: 2025/3/21

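Contents

1. Installing Redis
2. Whole-site crawling with CrawlSpider
3. Distributed crawling
4. Incremental crawling
5. Improving Scrapy crawl efficiency
6. Virtual environments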

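1. Installing Redis

1. Extract the installation package into a folder, for example D:\redis. All of the Redis files will appear in that folder.
2. Add that folder to the system PATH environment variable.
3. Type cmd in the address bar of the extracted folder to open a command prompt there, then run
redis-server ./redis.windows.conf
and press Enter. If the server prints its startup banner and starts listening (on port 6379 by default), Redis is installed correctly.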
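Besides reading the startup banner, the server can be checked from Python. The following is a minimal sketch that uses only the standard library and assumes a local server on the default port 6379 with no password; the function name redis_is_up is made up for this example.

```python
import socket


def redis_is_up(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if a Redis server answers PING with PONG on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Redis accepts plain "inline" commands terminated by CRLF.
            sock.sendall(b"PING\r\n")
            reply = sock.recv(64)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
    # A server without a password replies "+PONG"; anything else
    # (for example an authentication error) is reported as False here.
    return reply.startswith(b"+PONG")


if __name__ == "__main__":
    print("Redis reachable:", redis_is_up())
```

If the check prints False while the server window is open, the port or the bind address in the configuration file is the first thing to look at.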

2. Whole-site crawling with CrawlSpider

Creating the project:
scrapy startproject projectname
scrapy genspider -t crawl spidername www.baidu.com

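Whole-site crawling with CrawlSpider:

CrawlSpider is a spider class, a subclass of scrapy.Spider, that adds rule-based link following on top of the basic Spider.
How CrawlSpider works:
- Link extractor (LinkExtractor): extracts links from a response according to the specified rules
- Rule parser (Rule): parses the responses behind the extracted links according to the specified rules (which callback to use, and whether to keep following links)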
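The allow argument of a link extractor is a regular expression tested against each absolute URL. The sketch below uses only the re module to show how such a pattern selects URLs, assuming the pattern may match anywhere in the URL (the behaviour of re.search). The URL paths are made up for illustration, and allowed is a hypothetical helper, not part of Scrapy.

```python
import re

# Hypothetical URLs, made up for illustration only.
urls = [
    "http://xiaohua.zol.com.cn/lengxiaohua/2.html",
    "http://xiaohua.zol.com.cn/lengxiaohua/2xhtml",
    "http://xiaohua.zol.com.cn/detail1/12345.html",
]

loose = re.compile(r"/lengxiaohua/\d+.html")    # "." matches any character
strict = re.compile(r"/lengxiaohua/\d+\.html")  # "\." matches a literal dot


def allowed(pattern, candidates):
    """Keep the URLs in which the pattern is found anywhere (re.search)."""
    return [u for u in candidates if pattern.search(u)]


loose_hits = allowed(loose, urls)
strict_hits = allowed(strict, urls)
print(loose_hits)   # the first two URLs
print(strict_hits)  # only the first URL
```

An unescaped dot is an easy mistake in these patterns: it rarely causes missed pages, but it can let unintended URLs through.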

Example: use CrawlSpider to crawl a joke site in depth, scraping each joke's title and content and storing them in MongoDB.

# items.py
import scrapy


class JokeItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()


# spider
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import JokeItem


class ZSpider(CrawlSpider):
    name = 'z'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://xiaohua.zol.com.cn/lengxiaohua/']
    # Pages under /lengxiaohua/ are followed for more links
    link = LinkExtractor(allow=r'/lengxiaohua/\d+\.html')
    # Any other URL ending in <digits>.html is treated as a detail page
    link_detail = LinkExtractor(allow=r'.*?\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )

    def parse_item(self, response):
        pass

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="article-title"]/text()').extract_first()
        content = response.xpath('//div[@class="article-text"]//text()').extract()
        content = ''.join(content)
        if title and content:
            item = JokeItem()
            item["title"] = title
            item["content"] = content
            print(dict(item))
            yield item


# pipelines.py
import pymongo


class JokePipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        self.db["joke"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not close) when the spider finishes
        self.client.close()

Movie Heaven (電影天堂, ygdy8.net): crawl the whole site in depth for movie titles and download links:

# items.py: fields to store
import scrapy


class BossItem(scrapy.Item):
    title = scrapy.Field()
    downloadlink = scrapy.Field()


# spider
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import BossItem  # must match the item class defined above


class BSpider(CrawlSpider):
    name = 'mv'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.ygdy8.net/html/gndy/oumei/index.html']
    link = LinkExtractor(allow=r'list.*?html')
    link_detail = LinkExtractor(allow=r'.*?/\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail', follow=True),
    )

    def parse_item(self, response):
        pass

    def parse_detail(self, response):
        title = response.xpath('//h1//text()').extract_first()
        downloadlink = response.xpath('//tbody/tr/td/a/text()').extract_first()
        # Keep only pages that actually carry an ftp download link
        if title and downloadlink and 'ftp' in downloadlink:
            item = BossItem()
            item['title'] = title
            item['downloadlink'] = downloadlink
            yield item


# pipelines.py
import pymongo


class MvPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # insert_one replaces Collection.insert, which PyMongo 4 removed
        self.db["mv"].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider (not close) when the spider finishes
        self.client.close()

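3. Distributed crawling

The concept:
Several machines form a cluster, every machine runs the same crawler program, and together they crawl one data set.

Why plain Scrapy cannot run distributed:

- The scheduler in plain Scrapy cannot be shared
- The pipeline in plain Scrapy cannot be shared

How to make Scrapy distributed:

Give the plain Scrapy framework a shared pipeline and a shared scheduler, which the scrapy-redis component provides:
pip install scrapy_redis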
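Why a shared scheduler matters can be shown without Redis at all. In the sketch below, two threads stand in for two crawler machines, and an in-memory queue and set stand in for the shared request queue and the shared fingerprint set. All names and URLs are made up for illustration; this is a model of the idea, not how scrapy-redis is implemented.

```python
import queue
import threading

# In-memory stand-ins (assumptions for this sketch): in a real deployment
# the request queue and the fingerprint set would live in Redis.
shared_queue = queue.Queue()
seen = set()
seen_lock = threading.Lock()
processed = []  # (worker name, url) pairs: the shared "pipeline" output


def schedule(url):
    """Enqueue url only if it has never been scheduled before."""
    with seen_lock:
        if url in seen:
            return False
        seen.add(url)
    shared_queue.put(url)
    return True


def worker(name):
    """Pull URLs from the shared queue until it stays empty."""
    while True:
        try:
            url = shared_queue.get(timeout=0.2)
        except queue.Empty:
            return
        processed.append((name, url))
        shared_queue.task_done()


# Hypothetical URLs; one duplicate is scheduled on purpose.
for u in ["http://example.com/1", "http://example.com/2",
          "http://example.com/1", "http://example.com/3"]:
    schedule(u)

threads = [threading.Thread(target=worker, args=(f"node-{i}",)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

crawled = sorted(url for _, url in processed)
print(crawled)
```

Each URL is handled exactly once no matter which worker picks it up. With separate, unshared queues and sets, every machine would crawl the whole site on its own.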

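- 1. Create the project: scrapy startproject projectname
- 2. Create the spider file: scrapy genspider -t crawl spidername www.baidu.com
- 3. Edit the spider file:
    - 3.1 Import: from scrapy_redis.spiders import RedisCrawlSpider
    - 3.2 Change the spider's parent class to RedisCrawlSpider
    - 3.3 Comment out allowed_domains and start_urls, and add a new attribute redis_key = 'xxx'. This is the name of the shared scheduler queue; start URLs are later pushed to it with: lpush xxx url
    - 3.4 Parse the data, wrap the parsed data in an item, and yield it to the pipeline
- 4. Edit the settings file:
    - 4.1 Set the pipeline:
      ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}
    - 4.2 Set the scheduler:
      # Dedup filter class: stores request fingerprints in a Redis set, so request deduplication is persistent
      DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
      # Use the scheduler that ships with scrapy-redis
      SCHEDULER = "scrapy_redis.scheduler.Scheduler"
      # Whether the scheduler persists its state, that is, whether the request queue and the fingerprint set in Redis are kept when the spider finishes. True keeps the data; False clears it
      SCHEDULER_PERSIST = True
    - 4.3 Set the Redis server:
      REDIS_HOST = 'IP address of the Redis server'
      REDIS_PORT = 6379
- 5. Edit the Redis configuration file and start Redis with it:
    - Comment out the bind line: #bind 127.0.0.1
    - Turn off protected mode: protected-mode no
    - Start the Redis server with the configuration file (redis-server ./redis.windows.conf), then start the client (redis-cli)
- 6. Start the spider: scrapy runspider xxx.py (run this from inside the spiders folder)
- 7. From the Redis client, push a start URL into the scheduler queue: lpush xxx www.xxx.com (xxx is the value of redis_key; use the full URL including http:// or https://, since Scrapy rejects URLs without a scheme)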

Example: crawling complaint posts from the Sunshine Hotline inquiry platform (陽光熱線問政平臺)
URL: http://wz.sun0769.com/index.php/question/questionType?type=4

# items.py
import scrapy


class FbsproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()


# spider
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_redis.spiders import RedisCrawlSpider
from fbspro.items import FbsproItem


class TestSpider(RedisCrawlSpider):
    name = 'test'
    # allowed_domains = ['ww.baidu.com']
    # start_urls = ['http://ww.baidu.com/']
    # Name of the shared Redis queue that supplies the start URLs
    redis_key = 'urlscheduler'
    link = LinkExtractor(allow=r'.*?&page=\d+')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        a_lst = response.xpath('//a[@class="news14"]')
        for a in a_lst:
            title = a.xpath('./text()').extract_first()
            # print(title)
            item = FbsproItem()
            item['title'] = title
            yield item


# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 3
ITEM_PIPELINES = {
    # 'fbspro.pipelines.FbsproPipeline': 300,
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Dedup filter class: stores request fingerprints in a Redis set, so request deduplication is persistent
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler that ships with scrapy-redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Whether the request queue and the fingerprint set in Redis are kept when the spider finishes. True keeps the data; False clears it
SCHEDULER_PERSIST = True

# Redis connection
REDIS_HOST = '192.168.12.198'
REDIS_PORT = 6379

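4. Incremental crawling

Concept: detect when a site's data has been updated and crawl only the updated content.

Core idea: deduplication, based on either
- the URL
- a data fingerprint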
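A data fingerprint is a short hash of the scraped content: if the hash has been seen before, the content is not new. The sketch below illustrates the check with hashlib and an in-memory set. FingerprintStore is a hypothetical stand-in for a Redis set, written so that its sadd returns 1 for a new member and 0 for an existing one, as the Redis SADD command does for a single member.

```python
import hashlib


class FingerprintStore:
    """In-memory stand-in for a Redis set (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def sadd(self, value):
        """Mimic SADD for one member: 1 if newly added, 0 if present."""
        if value in self._seen:
            return 0
        self._seen.add(value)
        return 1


def fingerprint(text):
    """Return the MD5 hex digest of the text, used as its fingerprint."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def is_new(store, text):
    """True if this text has not been stored before."""
    return store.sadd(fingerprint(text)) == 1


store = FingerprintStore()
first_run = [is_new(store, t) for t in ["joke A", "joke B"]]
second_run = [is_new(store, t) for t in ["joke A", "joke C"]]
print(first_run)   # [True, True]
print(second_run)  # [False, True]
```

URL-based deduplication works the same way, with the URL itself as the set member. Fingerprints are the fallback when new content appears under a URL that has already been visited.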

Incremental crawler (URL-based): scraping movie titles and movie types

URL: https://www.4567tv.co/list/index1.html

# items.py
import scrapy


class MvproItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()


# spider
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis
from mvpro.items import MvproItem


class MoveSpider(CrawlSpider):
    conn = Redis('127.0.0.1', 6379)
    name = 'move'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.4567tv.co/list/index1.html']
    link = LinkExtractor(allow=r'/list/index1-\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        li_list = response.xpath('//div[contains(@class, "index-area")]/ul/li')
        for li in li_list:
            mv_link = 'https://www.4567tv.co' + li.xpath('./a/@href').extract_first()
            # sadd returns 1 if the link is new to the set, 0 if it is already stored
            ex = self.conn.sadd('mv_link', mv_link)
            if ex:
                print('New data found, crawling it')
                yield scrapy.Request(url=mv_link, callback=self.parse_detail)
            else:
                print('No new data to crawl')

    def parse_detail(self, response):
        title = response.xpath('//dt[@class="name"]/text()').extract_first()
        pro = response.xpath('//div[@class="ee"]/text()').extract_first()
        item = MvproItem()
        item['title'] = title
        item['position'] = pro
        yield item

Task: an incremental crawler based on data fingerprints, scraping text posts from Qiushibaike (糗百)

# spider
import hashlib

import scrapy
from redis import Redis
from qiubai.items import QiubaiItem


class QbSpider(scrapy.Spider):
    conn = Redis('127.0.0.1', 6379)
    name = 'qb'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            content = div.xpath('./a[1]/div[@class="content"]/span[1]/text()').extract_first()
            if content is None:
                # Skip blocks without text; None cannot be encoded
                continue
            # Data fingerprint: MD5 digest of the post text
            fp = hashlib.md5(content.encode('utf-8')).hexdigest()
            # sadd returns 1 if the fingerprint is new, 0 if it is already stored
            ex = self.conn.sadd('fp', fp)
            if ex:
                print('Updated data found, crawling it')
                item = QiubaiItem()
                item['content'] = content
                yield item
            else:
                print('No data update')

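5. Improving Scrapy crawl efficiency

1. Increase concurrency: by default Scrapy allows 16 concurrent requests (these are concurrent requests, not threads), and the limit can be raised where appropriate. In the settings file, set CONCURRENT_REQUESTS = 100 to raise the limit to 100.
2. Lower the log level (add this setting yourself): a running spider prints a large amount of log output. To reduce CPU usage, restrict logging to INFO or ERROR. In the settings file: LOG_LEVEL = 'ERROR'
3. Disable cookies (uncomment the existing line): if cookies are not really needed, disabling them reduces CPU usage and speeds up the crawl. In the settings file: COOKIES_ENABLED = False
4. Disable retries (add this setting yourself): re-requesting failed HTTP requests slows the crawl down, so retries can be turned off. In the settings file: RETRY_ENABLED = False
5. Reduce the download timeout (add this setting yourself): when a link is very slow, a shorter timeout lets the stuck request be abandoned quickly, which improves throughput. In the settings file: DOWNLOAD_TIMEOUT = 10 sets the timeout to 10 seconds.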
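Collected in one place, the five settings look like this in settings.py. The values are the article's examples rather than recommendations; in particular, 100 concurrent requests can overload a small site, and disabling retries means pages that fail once are never fetched.

```python
# settings.py fragment: the efficiency settings discussed above
CONCURRENT_REQUESTS = 100  # default is 16
LOG_LEVEL = "ERROR"        # print errors only
COOKIES_ENABLED = False    # skip cookie handling
RETRY_ENABLED = False      # do not retry failed requests
DOWNLOAD_TIMEOUT = 10      # seconds before a slow download is abandoned
```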

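6. Virtual environments

Installation:

pip install virtualenvwrapper-win

# Common commands:
mkvirtualenv envname    # create a virtual environment and switch to it automatically
workon envname          # switch to a virtual environment
pip list                # list the packages installed in the current environment
rmvirtualenv envname    # delete a virtual environment
deactivate              # leave the current virtual environment
lsvirtualenv            # list all created virtual environments
mkvirtualenv --python=C:\...\python.exe envname    # create a virtual environment with a specific Python interpreter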
