Basic Usage of Scrapy

Published: 2025/6/17

Installing Scrapy

Installation steps differ by operating system; see the official documentation, Install Scrapy.

Creating a project

Enter on the command line:

scrapy startproject tutorial

Enter the project directory and create a spider:

cd tutorial
scrapy genspider quotes domain.com

Then edit the generated spider so that it reads:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The second-to-last URL segment is the page number
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

To run the spider, enter this command in the project's top-level directory:

scrapy crawl quotes
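
To make the outcome of the crawl concrete, the filename logic used in parse can be tried on its own. The sketch below is plain Python with no Scrapy dependency; page_filename is a hypothetical helper name introduced here for illustration and is not part of the spider.

```python
def page_filename(url):
    """Build the output filename the same way the spider's parse() does."""
    # Splitting 'http://quotes.toscrape.com/page/1/' on '/' ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
filenames = [page_filename(u) for u in urls]
print(filenames)
```

With the two URLs from the spider, the crawl should therefore leave quotes-1.html and quotes-2.html in the directory where the command was run.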

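In the QuotesSpider class, name sets the spider's name, start_requests issues the requests, and parse handles the responses they return. start_requests can be replaced with a start_urls list, in which case Scrapy issues the requests for us and passes the responses to parse by default. Other attributes can also be set; see the documentation.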

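Using selectors

Scrapy ships with CSS and XPath selectors, though other parsing libraries such as BeautifulSoup can be used instead. The scrapy shell gives a quick way to try the built-in selectors. Enter on the command line: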

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Example code

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

# Get the title
# .selector can be omitted
# extract() returns a list
response.selector.xpath('//title/text()').extract_first()
response.selector.css('title::text').extract_first()

# Get the href attribute of the <a> tags
response.xpath('//a/@href').extract()
response.css('a::attr(href)').extract()

# Chain XPath and CSS to get the src attribute of the <img> tags
response.xpath('//div[@id="images"]').css('img::attr(src)').extract()

# Get the href attribute of <a> tags whose href contains "image"
response.xpath('//a[contains(@href, "image")]/@href').extract()
response.css('a[href*=image]::attr(href)').extract()

# Use regular expressions (raw strings avoid invalid-escape warnings)
response.css('a::text').re(r'Name:(.*)')
response.css('a::text').re_first(r'Name:(.*)')

# Pass default to set the value returned when nothing matches
response.css('aa').extract_first(default='')

Using Item Pipelines

Items returned by the parse callback can be post-processed by an Item Pipeline, mainly for data cleaning and formatting.

from scrapy.exceptions import DropItem


# Filter out duplicate items
class DuplicatePipeline:
    def __init__(self):
        # ids of the items seen so far
        self.items = set()

    def process_item(self, item, spider):
        if item['id'] in self.items:
            raise DropItem('Duplicate item found: %s' % item['id'])
        else:
            self.items.add(item['id'])
            return item

Custom pipelines must be registered in the project's settings:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.DuplicatePipeline': 200,
}

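The lower the number, the higher the priority: items pass through the pipelines in order from lower to higher values, so here DuplicatePipeline (200) runs before TutorialPipeline (300).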
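
The rule that lower numbers run first can be illustrated with plain Python. This sketch only sorts the settings dictionary by value to show the resulting order; it is not how Scrapy itself builds the pipeline chain, and run_order is a name introduced here for illustration.

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
    'tutorial.pipelines.DuplicatePipeline': 200,
}

# Lower values run first, so sort the class paths by their assigned number.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```

Placing the duplicate filter first means later pipelines never see items that were dropped.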