Getting Started with the Scrapy Framework

Published: 2024/4/17

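1. Introduction

Scrapy is an asynchronous crawling framework built on Twisted and implemented in pure Python. Its architecture is clear, its modules are loosely coupled, and it is highly extensible and flexible, which has made it one of the most widely used crawler frameworks in Python.

[Figure: Scrapy architecture diagram]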

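It consists of the following parts:

Engine: handles the data flow of the whole system and triggers events; it is the core of the framework.
Item: defines the data structure of the scraped results; scraped data is assigned to an Item object.
Scheduler: accepts requests sent by the Engine, adds them to a queue, and hands them back when the Engine asks for the next request.
Downloader: downloads page content and returns it to the Spiders by way of the Engine.
Spiders: define the crawling logic and the page-parsing rules; their main job is to parse responses and produce extracted results and new requests.
Item Pipeline: processes the items that Spiders extract from pages; its main job is to clean, validate, and store the data.
Downloader Middlewares: a hook framework between the Engine and the Downloader that processes the requests and responses passing between them.
Spider Middlewares: a hook framework between the Engine and the Spiders that processes the responses going into the Spiders and the results and new requests coming out.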
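The way these components hand work to each other can be sketched with a toy loop in plain Python. This is only an illustration of the data flow described above, not how Scrapy is implemented: the functions toy_engine, toy_downloader, and toy_spider, the in-memory PAGES dictionary, and the "next=" marker are all hypothetical, and the real framework is asynchronous.

```python
from collections import deque


def toy_downloader(url, pages):
    """Stand-in for the Downloader: look the URL up in an in-memory dict."""
    return pages.get(url, "")


def toy_spider(url, body):
    """Stand-in for a Spider: yield items and follow-up URLs from a response."""
    for token in body.split():
        if token.startswith("next="):
            yield ("request", token[len("next="):])
        else:
            yield ("item", {"url": url, "word": token})


def toy_engine(start_urls, pages):
    """Stand-in for the Engine: move requests and results between components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # never schedule the same URL twice
    collected = []                 # stand-in for the Item Pipeline
    while scheduler:
        url = scheduler.popleft()
        body = toy_downloader(url, pages)
        for kind, value in toy_spider(url, body):
            if kind == "request" and value not in seen:
                seen.add(value)
                scheduler.append(value)
            elif kind == "item":
                collected.append(value)
    return collected


# Two hypothetical pages that link to each other
PAGES = {"/1": "alpha next=/2", "/2": "beta next=/1"}
items = toy_engine(["/1"], PAGES)
```

The loop ends once the queue is empty; the seen set is what stops the two pages from requesting each other forever.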

Project structure

A Scrapy project is created from the command line and the code is then written in an IDE. The project files are laid out as follows:

scrapy.cfg            # Scrapy project configuration file
project/
    __init__.py
    items.py          # defines the Item data structures
    pipelines.py      # defines the Item Pipeline implementations
    settings.py       # defines the project's global settings
    middlewares.py    # defines the Spider and Downloader middleware implementations
    spiders/          # contains the individual spider implementations
        __init__.py
        spider1.py
        spider2.py
        ...

2. A Scrapy Starter Demo

Goals:

Create a Scrapy project.
Create a Spider to crawl a site and process the data.
Export the scraped content from the command line.
Save the scraped content to a MongoDB database.

Create a Scrapy project:

scrapy startproject tutorial

[Figure: folder structure of the generated project]

Create a Spider

A custom Spider class must inherit from scrapy.Spider. Use the command line to generate a Quotes spider:

cd tutorial  # enter the tutorial directory just created, i.e. the project root
scrapy genspider quotes quotes.toscrape.com  # run genspider; the first argument is the spider name, the second is the site's domain

A quotes.py file now appears under spiders:

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name that identifies this spider within the project
    name = 'quotes'
    # Domains the spider may crawl; links outside these domains are filtered out
    allowed_domains = ['quotes.toscrape.com']
    # URLs the spider requests when it starts
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Called with the response once a start request has been downloaded;
        # parses it, extracts data, or yields further requests to process
        pass
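The effect of allowed_domains can be approximated with a few lines of standard-library Python. The helper is_allowed below is hypothetical and only a sketch of the idea (a host passes if it equals an allowed domain or is a subdomain of one); it is not Scrapy's own filtering code.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["quotes.toscrape.com"]
same_site = is_allowed("http://quotes.toscrape.com/page/2/", allowed)
other_site = is_allowed("http://example.com/", allowed)
```

Here same_site is True and other_site is False, so a link to example.com found on a page would be dropped rather than crawled.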

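Create an Item

An Item is a container (data structure) for scraped data. It is used much like a dictionary, but adds a protection mechanism that catches misspelled field names. A custom Item must inherit from scrapy.Item and declare its fields with scrapy.Field. Modify items.py as follows: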

import scrapy


class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
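The protection against misspelled field names can be imitated in plain Python. The classes StrictRecord and QuoteRecord below are hypothetical stand-ins written for illustration; they are not Scrapy classes and only mimic the idea that assigning to an undeclared key fails instead of silently creating a new field.

```python
class StrictRecord(dict):
    """Hypothetical dict subclass that only accepts declared field names."""

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


class QuoteRecord(StrictRecord):
    fields = ("text", "author", "tags")


record = QuoteRecord()
record["text"] = "An example quote"  # declared field: accepted

try:
    record["txet"] = "typo"          # misspelled field: rejected
    rejected = False
except KeyError:
    rejected = True
```

With an ordinary dict the misspelled key would be stored without complaint and the mistake would only surface later, for example as a missing column in the exported data.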

Parse the Response

First open the spider's initial URL, http://quotes.toscrape.com/, and inspect the page structure. Each page contains several blocks with the class quote, and each block contains a text, an author, and tags.

So modify the parse method of the spider as follows:

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Use a CSS selector to pick the elements with class "quote"
        quotes = response.css('.quote')
        for quote in quotes:
            # Text of the first .text element inside the quote
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            # Text of every tag (a list of strings)
            tags = quote.css('.tags .tag::text').extract()

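Use the Item

QuotesSpider is rewritten to instantiate a QuoteItem for each quote, fill in its fields, and yield it. The complete listing appears in the next section, together with the pagination logic.

Follow-up Requests

A follow-up request here means a request for the next page of data. To build it, inspect the pager at the bottom of the page, which contains a link to the next page.

quotes.py (QuotesSpider):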

# -*- coding: utf-8 -*-
import scrapy
from tutorial.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Use a CSS selector to pick the elements with class "quote"
        quotes = response.css('.quote')
        for quote in quotes:
            # Instantiate a QuoteItem for this quote
            item = QuoteItem()
            # Text of the first .text element inside the quote
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            # Text of every tag (a list of strings)
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        # Relative URL of the next page (None on the last page)
        next_page = response.css('.pager .next a::attr("href")').extract_first()
        if next_page is not None:
            # Absolute URL of the next page
            url = response.urljoin(next_page)
            # Build a new request; its response is handled by parse again,
            # and so on until there is no next page
            yield scrapy.Request(url=url, callback=self.parse)
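How a relative link becomes an absolute URL can be tried out with the standard library's urljoin, which resolves a link against a base URL in the same spirit as response.urljoin resolves it against the response's URL. The href values below are assumed examples of what a pager link might contain, not values taken from the site.

```python
from urllib.parse import urljoin

# Base URL: the page the spider is currently parsing (assumed example)
base = "http://quotes.toscrape.com/page/1/"

# A root-relative link replaces the whole path of the base URL
next_url = urljoin(base, "/page/2/")

# A link that is already absolute is returned unchanged
external_url = urljoin(base, "http://example.com/about")
```

After this runs, next_url is "http://quotes.toscrape.com/page/2/" and external_url is "http://example.com/about".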

Run the Spider

scrapy crawl quotes

The console output shows the current version numbers, the enabled Middlewares and Pipelines, and the scraped results for each page.

Save to a file

scrapy crawl quotes -o quotes.json: saves the scraped results above as a JSON file.

scrapy crawl quotes -o quotes.jsonlines: writes one line of JSON per Item.
scrapy crawl quotes -o quotes.csv: writes CSV output.
scrapy crawl quotes -o quotes.xml: writes XML output.
scrapy crawl quotes -o quotes.pickle: writes pickle output.
scrapy crawl quotes -o quotes.marshal: writes marshal output.
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv: writes the output to a remote FTP location.
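The difference between the JSON and JSON Lines formats listed above can be seen with the standard json module. This sketch does not call Scrapy's exporters; the two sample items are made up for illustration.

```python
import json

# Two made-up items shaped like the QuoteItem fields
items = [
    {"text": "First quote", "author": "A", "tags": ["x"]},
    {"text": "Second quote", "author": "B", "tags": []},
]

# JSON: the whole result is a single array
as_json = json.dumps(items)

# JSON Lines: one independent JSON object per line
as_jsonlines = "\n".join(json.dumps(item) for item in items)

# JSON Lines can be read back one line at a time
parsed = [json.loads(line) for line in as_jsonlines.splitlines()]
```

Because every line of a JSON Lines file is complete on its own, a large export can be processed as a stream, whereas a JSON array has to be parsed as a whole.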

Use an Item Pipeline to save to a database

For more complex operations, such as saving the results to a MongoDB database or keeping only certain useful Items, we can define our own Item Pipeline. Modify pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exceptions import DropItem
import pymongo


class TextPipeline:
    def __init__(self):
        self.limit = 50

    # process_item must be implemented; an enabled Item Pipeline
    # calls it automatically for every item
    def process_item(self, item, spider):
        '''
        Raise DropItem if the text field is empty. Otherwise, if the text is
        longer than the limit, cut it to the limit and append an ellipsis;
        then return the item.
        '''
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    # @classmethod marks this as a class method. It is a form of dependency
    # injection: through crawler we can read every global setting (settings.py)
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    # Insert the item into a collection named after the item class
    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
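The truncation rule used by TextPipeline can be checked on its own, without running a crawl. The function truncate_text below is a hypothetical helper that repeats the same slicing logic, with 50 as the default limit as in the pipeline.

```python
def truncate_text(text, limit=50):
    """Cut text to at most `limit` characters and append an ellipsis if it was cut."""
    if len(text) > limit:
        # rstrip() removes trailing whitespace left at the cut point
        return text[:limit].rstrip() + "..."
    return text


long_result = truncate_text("a" * 60)
short_result = truncate_text("short text")
spaced_result = truncate_text("ab   cdef", limit=4)
```

Note that the ellipsis is added after the cut, so a truncated string can be up to limit + 3 characters long; spaced_result also shows the trailing spaces being stripped before the ellipsis is appended.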

Add the following to settings.py:

# ITEM_PIPELINES is a dict: each key is the path of a pipeline class and each
# value is its priority, a number; the lower the number, the earlier it runs
ITEM_PIPELINES = {
    'tutorial.pipelines.TextPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial'
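The ordering rule for ITEM_PIPELINES can be illustrated by sorting the dictionary by its values. The helper run_order is hypothetical and only demonstrates the rule stated in the comment above; it is not the code Scrapy uses to load pipelines.

```python
# Same entries as in settings.py, deliberately written out of order
ITEM_PIPELINES = {
    "tutorial.pipelines.MongoPipeline": 400,
    "tutorial.pipelines.TextPipeline": 300,
}


def run_order(pipelines):
    """Return the pipeline paths sorted by ascending priority value."""
    return sorted(pipelines, key=pipelines.get)


order = run_order(ITEM_PIPELINES)
```

TextPipeline comes first in order because 300 is lower than 400, which means items are truncated or dropped before MongoPipeline stores them.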

Run the crawl again

scrapy crawl quotes

3. Reference

Cui Qingcai (崔慶才). 《Python3 網絡爬蟲開發實戰》 (Python 3 Web Crawler Development in Practice).

Reposted from: https://www.cnblogs.com/yunche/p/10357232.html
