Scrapy: Link Extractors

Published: 2024/7/23

Link Extractors documentation (Chinese): https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/link-extractors.html
Link Extractors documentation (English): http://doc.scrapy.org/en/latest/topics/link-extractors.html

Crawling rental listings across a whole site with Scrapy's LinkExtractor: https://www.jianshu.com/p/57c1e34c03cb
Advanced Scrapy usage, automatic pagination: https://my.oschina.net/u/2351685/blog/612940?fromerr=QnjXr0Pi

Python crawlers with the Scrapy framework (CrawlSpider): https://www.cnblogs.com/sui776265233/p/9724147.html

A detailed analysis of CrawlSpider usage: https://blog.csdn.net/godot06/article/details/81672900

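Link extractors

A link extractor takes a page (a scrapy.http.Response object) and extracts from it the links that will eventually be followed. Put simply, it pulls URLs out of the response returned by the server so that later steps can work with them.

In Scrapy you can use the built-in extractor directly with from scrapy.linkextractors import LinkExtractor, or you can create your own link extractor by implementing a simple interface.

Every link extractor has a single public method, extract_links, which receives a Response object and returns a list of scrapy.link.Link objects. A link extractor is instantiated once, but its extract_links method can be called many times, on different responses, to extract the links to follow from each page.

Link extractors are used by the CrawlSpider class through a set of rules, but you can also use them in any other spider, even one that does not subclass CrawlSpider, because their purpose is simple: extracting links.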
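To make the "instantiate once, call extract_links many times" contract concrete, here is a minimal standard-library sketch. MiniLinkExtractor and its extract_links(base_url, html) signature are hypothetical stand-ins invented for illustration; Scrapy's real extractor takes a Response object and is built on lxml.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collects raw attribute values from the tags we care about."""

    def __init__(self, tags, attrs):
        super().__init__()
        self.tags, self.attrs = set(tags), set(attrs)
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.values += [v for k, v in attrs if k in self.attrs and v]


class MiniLinkExtractor:
    """Hypothetical stand-in for a link extractor; not Scrapy's implementation."""

    def __init__(self, tags=("a", "area"), attrs=("href",)):
        self.tags, self.attrs = tags, attrs

    def extract_links(self, base_url, html):
        collector = _HrefCollector(self.tags, self.attrs)
        collector.feed(html)
        # Strip whitespace and resolve relative URLs against the page URL.
        absolute = (urljoin(base_url, value.strip()) for value in collector.values)
        # De-duplicate while keeping the original order.
        return list(dict.fromkeys(absolute))


extractor = MiniLinkExtractor()  # created once, reused for every page
page_one = '<a href="/subject/1/">A</a> <a href="/subject/1/">again</a>'
page_two = '<a href=" /subject/2/ ">B</a>'
links_one = extractor.extract_links("https://example.com/top", page_one)
links_two = extractor.extract_links("https://example.com/top?start=25", page_two)
print(links_one, links_two)
```

The same extractor instance serves both pages, and each call returns only the links of the page it was given.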

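Built-in link extractor reference

The link extractor classes bundled with Scrapy live in the scrapy.linkextractors module. The default link extractor is LinkExtractor, which is the same class as LxmlLinkExtractor:

from scrapy.linkextractors import LinkExtractor

Earlier Scrapy versions shipped other link extractor classes, but they are now deprecated.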

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)

LxmlLinkExtractor is the recommended link extractor and comes with handy filtering options. It is implemented on top of lxml's robust HTMLParser.

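Parameters:

allow (a regular expression or list of regular expressions): the (absolute) URLs must match one of these expressions to be extracted. If not given (or empty), every link matches.
deny (a regular expression or list of regular expressions): (absolute) URLs that match one of these expressions are excluded (not extracted). It takes precedence over allow. If not given (or empty), no link is excluded. It accepts the same kind of value as allow and is often combined with it to narrow the result from both sides.
allow_domains (str or list of str): a domain or list of domains; only links in these domains are extracted. It plays a role similar to the allowed_domains attribute of a spider, which restricts which domains are crawled.
deny_domains (str or list of str): a domain or list of domains; links in these domains are not extracted. It is the opposite of allow_domains.
deny_extensions (list): a single value or list of strings with the file extensions to ignore when extracting links. If not given (the default is None), it falls back to the IGNORED_EXTENSIONS list defined in the scrapy.linkextractors package. (See also: http://www.xuebuyuan.com/296698.html)
restrict_xpaths (str or list): an XPath (or list of XPaths) that defines the regions of the response from which links are extracted. If given, only the content selected by those XPaths is scanned for links. This pins down where links are taken from, makes extraction more precise, and works together with allow to filter links.
restrict_css (str or list): a CSS selector (or list of selectors) that defines the regions of the response from which links are extracted. It behaves like restrict_xpaths.
tags (str or list): a tag or list of tags to consider when extracting links. Defaults to ('a', 'area'), so links are taken from a and area tags.
attrs (list): an attribute or list of attributes to consider when looking for links (only for the tags named in the tags parameter). Defaults to ('href',), so the href attribute of those tags, i.e. the URL, is extracted.
canonicalize (boolean): whether to canonicalize each extracted URL (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you use LinkExtractor to follow links, it is more robust to keep the default canonicalize=False.
unique (boolean): whether duplicate filtering is applied to the extracted links. Defaults to True, since collecting the same address repeatedly serves no purpose.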
process_value (callable): a function that receives each value extracted from the scanned tags and attributes. It can modify the value and return a new one, or return None to ignore the link entirely. If not given, process_value defaults to lambda x: x. In other words, it lets you post-process each extracted address on the spot, which is quite powerful. For example, to extract a link written in JavaScript from this code:
<a href="javascript:goToPage('../other/page.html'); return false">Link text</a>
you can use the following process_value function:
import re

def process_value(value):
    # Pull the real target out of the javascript: pseudo-URL.
    m = re.search(r"javascript:goToPage\('(.*?)'", value)
    if m:
        return m.group(1)
strip (boolean): whether to strip whitespace from extracted attributes. Defaults to True. According to the HTML5 standard, leading and trailing whitespace must be stripped from href attributes of <a>, <area> and many other elements, the src attribute of <img> and <iframe> elements, etc., so LinkExtractor strips space characters by default. Set strip=False to turn it off (e.g. if you are extracting URLs from elements or attributes which allow leading/trailing whitespace).
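The interplay of allow, deny, allow_domains and deny_domains described above can be modelled in a few lines of plain Python. This is only a simplified sketch under stated assumptions: keep_url is a hypothetical helper, patterns are tested with re.search against the absolute URL, and a domain entry is assumed to cover its subdomains. It is not Scrapy's own code.

```python
import re
from urllib.parse import urlparse


def keep_url(url, allow=(), deny=(), allow_domains=(), deny_domains=()):
    """Simplified model of link filtering (hypothetical helper, not Scrapy's code)."""
    host = urlparse(url).hostname or ""

    def in_domains(domains):
        # Assumption: a domain entry also covers its subdomains.
        return any(host == d or host.endswith("." + d) for d in domains)

    if allow_domains and not in_domains(allow_domains):
        return False
    if deny_domains and in_domains(deny_domains):
        return False
    # An empty allow tuple means every URL is allowed.
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # deny takes precedence: a URL matching deny is dropped even if allow matched.
    if deny and any(re.search(p, url) for p in deny):
        return False
    return True


book = "https://book.douban.com/subject/1007305/"
listing = "https://book.douban.com/top250?start=25"
print(keep_url(book, allow=(r"subject/\d+/$",)))
print(keep_url(listing, allow=(r"subject/\d+/$",)))
print(keep_url(book, allow=(r"subject/",), deny=(r"1007305",)))
```

The last call shows the precedence rule: the URL matches allow, but because it also matches deny it is rejected.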

Spider

Using LinkExtractor in a spider

Example code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class DouBan(scrapy.Spider):
    name = 'douban_spider'
    allowed_domains = ['book.douban.com']
    start_urls = ['https://book.douban.com/top250']

    # Custom settings override the values in settings.py, i.e. they take priority.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }

    def parse(self, response):
        link_extractor = LinkExtractor(allow=(r'subject/\d+/$',))
        links = link_extractor.extract_links(response)
        print(links)
        for link in links:
            # print(link.url, link.text)
            print(link.url)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl douban_spider'.split(' '))

CrawlSpider

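CrawlSpider official documentation (English): http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider
CrawlSpider official documentation (Chinese): https://scrapy-chs.readthedocs.io/zh_CN/latest/topics/spiders.html#crawlspider

Besides the attributes inherited from Spider, CrawlSpider provides a new attribute, rules, which offers a convenient mechanism for following links. It is well suited to taking links from crawled pages and continuing to crawl them.

rules is a collection of one or more Rule objects. Each Rule defines a particular behaviour for crawling the site. If several rules match the same link, the first one, in the order they are defined in this attribute, is used.

So the order of the rules can determine which data you get: put the most specific rules first and the more general ones later.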
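The effect of rule order can be sketched with plain regular expressions. In the model below, assign_rules is a hypothetical helper, the rule names and the first URL are made up for illustration, and a link already claimed by an earlier rule is skipped by later ones, which is an assumption about how "the first one is used" plays out.

```python
import re


def assign_rules(urls, rules):
    """Map each URL to the name of the first rule whose pattern matches it."""
    seen = set()
    assigned = {}
    for name, pattern in rules:  # rules are tried in definition order
        for url in urls:
            if url not in seen and re.search(pattern, url):
                seen.add(url)
                assigned[url] = name
    return assigned


detail_url = "http://www.521609.com/xiaohua/123_4.html"  # made-up detail page
listing_url = "http://www.521609.com/xiaohua/"
urls = [detail_url, listing_url]

specific_first = [("detail", r"xiaohua/\d+_?\d+"), ("listing", r"xiaohua/")]
general_first = list(reversed(specific_first))

print(assign_rules(urls, specific_first))
print(assign_rules(urls, general_first))
```

With the general pattern first, the detail page is claimed by the listing rule and never reaches the detail rule, which is why specific rules should come first.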

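CrawlSpider also provides an overridable method: parse_start_url(response)
It is called when the responses to the start_urls requests come back. It parses the initial response and must return an Item object, a Request object, or an iterable containing either of them.

Note: when writing rules for a CrawlSpider, do not use parse as a callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the CrawlSpider will no longer work.

Rule and link extractors are mostly used for crawling entire sites.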

Rule

Crawling rules (English): http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
Crawling rules (Chinese): https://scrapy-chs.readthedocs.io/zh_CN/latest/topics/spiders.html#crawling-rules

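A Rule defines how links are extracted and what happens to them:

class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

Parameters:

link_extractor: a Link Extractor object. It defines how links are extracted from each crawled page (i.e. each response).

callback: a callable or a string (in which case the spider method with that name is used). It is called for each link obtained from link_extractor. The callback receives a response as its first argument and returns a list containing Item and/or Request objects (or subclasses of them).

cb_kwargs: a dict of keyword arguments to pass to the callback.

follow: a boolean that specifies whether links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False. When follow is True, the crawler takes the URLs that match the rule from each response and crawls them; if those responses contain further matching URLs, it crawls them too, and so on until no matching URLs remain. When follow is False, the crawler only takes matching URLs from the responses to start_urls and requests them.

process_links: a callable or a string (in which case the spider method with that name is used). It is called with each list of links obtained from link_extractor. It is mainly used for filtering.

process_request: a callable or a string (in which case the spider method with that name is used). It is called for every request extracted by this rule and must return a request or None. It is used to filter requests.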
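The default for follow depends on whether a callback is given, which is easy to misread. The helper below, effective_follow, is a hypothetical function written only to spell out the rule stated above; it is not part of Scrapy.

```python
def effective_follow(callback=None, follow=None):
    """Return the follow value a rule ends up with, per the rule described above."""
    if follow is not None:
        return follow          # an explicit value always wins
    return callback is None    # no callback: follow links; callback given: do not


print(effective_follow())                                      # no callback, no follow
print(effective_follow(callback="parse_items"))                # callback only
print(effective_follow(callback="parse_items", follow=True))   # explicit follow
```

So a rule that has a callback and should also keep crawling deeper needs follow=True spelled out.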

Douban Books Top 250 example spider

Here is a small example based on Douban Books Top 250: https://book.douban.com/top250

You can use scrapy genspider -t crawl douban douban.com to quickly generate CrawlSpider template code.

from scrapy.spiders import CrawlSpider, Rule is equivalent to from scrapy.spiders.crawl import CrawlSpider, Rule

If you look at scrapy/spiders/__init__.py, you will see that the names are ultimately imported from scrapy.spiders.crawl.

Create the project and open it in PyCharm; you will see the generated template spider.

Rewrite the template spider:

# -*- coding: utf-8 -*-
from scrapy.spiders.crawl import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor


class DouBan(CrawlSpider):
    name = 'douban_spider'
    allowed_domains = ['book.douban.com']
    start_urls = ['https://book.douban.com/top250']

    # Custom settings override the values in settings.py, i.e. they take priority.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
    }

    rules = (
        Rule(LinkExtractor(allow=(r'subject/\d+/$',)), callback='parse_items'),
    )

    def parse_items(self, response):
        print('{0} step into : parse_items(self, response)'.format(self.name))
        print('current url : {0}'.format(response.url))


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl douban_spider'.split(' '))

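Running the spider:

1. You can run the example.py file directly.
2. You can also run this command from the project directory: scrapy crawl douban_spider

The spider crawls many URLs. Where do they come from? They are extracted by the link extractor. The rule defines a LinkExtractor, which receives the argument allow=(r'subject/\d+/$',), a regular expression.

How the spider runs

1. Scrapy requests start_urls and gets a response.
2. The allow pattern of the LinkExtractor is matched against the links in the response, which yields URLs.
3. Each of these URLs is requested, and the response returned by the server is handed to the method named by callback.

This is what happens when follow is not set (it defaults to False here because a callback is given); the program stops after crawling 25 URLs.
To set follow to True, add a follow=True argument to the Rule:

rules = (
    Rule(LinkExtractor(allow=(r'subject/\d+/$',)), callback='parse_items', follow=True),
)

Run the spider again and, unless anti-crawling measures stop it, it keeps running for a long time. The crawled pages themselves contain many URLs that match the regular expression in the Rule, so those URLs are crawled as well, over and over.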
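The difference between follow=False and follow=True can be simulated without any network access. The sketch below runs over a tiny made-up site graph; SITE, crawl and the URLs are all hypothetical, and the loop is a simplified model of the flow described above, not CrawlSpider's code.

```python
import re
from collections import deque

# Hypothetical site: page URL -> links found on that page.
SITE = {
    "/top250": ["/subject/1/", "/subject/2/", "/about"],
    "/subject/1/": ["/subject/3/", "/subject/2/"],
    "/subject/2/": [],
    "/subject/3/": ["/subject/1/"],
}


def crawl(start_url, allow, follow):
    """Return the URLs whose responses reach the callback, in request order."""
    seen = {start_url}
    queue = deque([(start_url, True)])  # (url, extract links from its response?)
    handled = []
    while queue:
        url, extract = queue.popleft()
        if url != start_url:
            handled.append(url)  # this response goes to the callback
        if not extract:
            continue
        for link in SITE.get(url, []):
            if link not in seen and re.search(allow, link):
                seen.add(link)
                # With follow=True, links are extracted from this page too.
                queue.append((link, follow))
    return handled


pattern = r"subject/\d+/$"
shallow = crawl("/top250", pattern, follow=False)
deep = crawl("/top250", pattern, follow=True)
print(shallow)
print(deep)
```

With follow=False only the links on the start page are requested; with follow=True the page /subject/3/, which is only linked from /subject/1/, is reached as well.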

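Configuring logging and saving the log to a file

The console shows a lot of log output. To save the log to a file, configure it in settings.py.

Scrapy provides 5 logging levels:

CRITICAL: critical errors
ERROR: regular errors
WARNING: warning messages
INFO: informational messages
DEBUG: debugging messages

Logging can be configured with the following settings in settings.py:

LOG_ENABLED default: True. Whether logging is enabled.
LOG_ENCODING default: utf-8. The encoding used for logging.
LOG_FILE default: None. The name of the file, created in the current directory, that receives the logging output.
LOG_LEVEL default: DEBUG. The minimum level to log.
LOG_STDOUT default: False. If True, all standard output (and error) of the process is redirected to the log. For example, the output of print('hello') will appear in the Scrapy log.

Configure logging in settings.py:

LOG_FILE = 'douban_spider.log'
LOG_LEVEL = 'DEBUG'

The log is then written to the douban_spider.log file.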
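Since LOG_LEVEL is a minimum level, raising it hides everything below it. The sketch below shows the same thresholding with Python's standard logging module only; emitted_levels is a hypothetical helper and no Scrapy setting is read here.

```python
import io
import logging


def emitted_levels(min_level):
    """Log one message per level and return the level names that got through."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.{min_level}")
    logger.propagate = False   # keep the demo output out of the root logger
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    logger.addHandler(handler)
    logger.setLevel(min_level)  # plays the role of the minimum level
    for level in (logging.DEBUG, logging.INFO, logging.WARNING,
                  logging.ERROR, logging.CRITICAL):
        logger.log(level, "message")
    return stream.getvalue().split()


print(emitted_levels("DEBUG"))
print(emitted_levels("WARNING"))
```

With the level set to WARNING, the DEBUG and INFO messages are dropped, which is often what you want once a spider is stable.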

Example: job postings from every page of the Tencent recruitment site

http://www.cnblogs.com/xinyangsdut/p/7628770.html

Item code:

# -*- coding: utf-8 -*-
import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # Job title
    name = scrapy.Field()
    # Link to the detail page
    detailLink = scrapy.Field()
    # Job category
    positionInfo = scrapy.Field()
    # Number of openings
    peopleNumber = scrapy.Field()
    # Work location
    workLocation = scrapy.Field()
    # Publication date
    publishTime = scrapy.Field()

Spider code:

# -*- coding: utf-8 -*-
import scrapy
from douban.items import TencentItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TencentSpider(CrawlSpider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://hr.tencent.com/position.php?&start=0#a']

    rules = (
        Rule(LinkExtractor(allow=r'position\.php\?&start=\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        # return i
        for each in response.xpath('//*[@class="even"]'):
            name = each.xpath('./td[1]/a/text()').extract()[0]
            detailLink = each.xpath('./td[1]/a/@href').extract()[0]
            positionInfo = each.xpath('./td[2]/text()').extract()[0]
            peopleNumber = each.xpath('./td[3]/text()').extract()[0]
            workLocation = each.xpath('./td[4]/text()').extract()[0]
            publishTime = each.xpath('./td[5]/text()').extract()[0]

            item = TencentItem()
            item['name'] = name
            item['detailLink'] = detailLink
            item['positionInfo'] = positionInfo
            item['peopleNumber'] = peopleNumber
            item['workLocation'] = workLocation
            item['publishTime'] = publishTime
            yield item


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl tencent'.split(' '))

Pipeline code:

# -*- coding: utf-8 -*-
import json


class TencentPipeline(object):
    def __init__(self):
        # The items contain Chinese text and are dumped with ensure_ascii=False,
        # so the file is opened with an explicit UTF-8 encoding.
        self.filename = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.filename.write(text)
        return item

    def close_spider(self, spider):
        self.filename.close()

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
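The pipeline lifecycle (open, process each item, close) can be exercised without running a crawl. The sketch below is a hypothetical variant, JsonLinesPipeline, that writes one JSON object per line (JSON Lines) rather than the comma-terminated lines used above, so each line can be parsed on its own. The sample items are invented and the spider argument is passed as None purely for the demo.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline variant: one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.file.close()


def run_demo():
    path = os.path.join(tempfile.mkdtemp(), "tencent.jsonl")
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    # Invented sample items, for illustration only.
    pipeline.process_item({"name": "Engineer", "workLocation": "深圳"}, spider=None)
    pipeline.process_item({"name": "Designer", "workLocation": "北京"}, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


rows = run_demo()
print(rows)
```

Opening the file in open_spider instead of __init__ is a design choice in this sketch: the file is only created when a crawl actually starts.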

Run the spider.

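Crawling images from the Xiaohua site (校花網) with CrawlSpider

Xiaohua site: http://www.521609.com/

Item Pipeline and Spider: writing an item pipeline while scraping the Xiaohua site with Scrapy:
https://cloud.tencent.com/developer/article/1098246

The article linked above uses a plain Spider to crawl the site's images. Here we switch to CrawlSpider, which also deduplicates requests and can crawl several pages, or even every page.

In that article the images are downloaded in a pipeline by issuing a Request for each image URL. Here the download is done with Scrapy's built-in ImagesPipeline (https://docs.scrapy.org/en/latest/topics/media-pipeline.html?highlight=pipeline) instead.

More about ImagesPipeline

Common ways to override Scrapy's ImagesPipeline: https://www.jianshu.com/p/cd05763d49e8
Downloading images with Scrapy's built-in ImagesPipeline and sorting them into categories: https://www.cnblogs.com/Kctrina/p/9523553.html
Using FilesPipeline and ImagesPipeline: https://www.jianshu.com/p/a412c0277f8a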

item.py:

import scrapy


class MMItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    img_name = scrapy.Field()
    image_paths = scrapy.Field()

pipelines.py:

# -*- coding: utf-8 -*-
import os

from . import settings
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class MMPipeline(ImagesPipeline):

    # def get_media_requests(self, item, info):
    #     # Called before the download requests are sent;
    #     # this method is in fact what issues them.
    #     for image_url in item['image_urls']:
    #         yield scrapy.Request(image_url)

    def get_media_requests(self, item, info):
        # Called before the download requests are sent;
        # this method is in fact what issues them.
        request_objs = super(MMPipeline, self).get_media_requests(item, info)
        for request_obj in request_objs:
            # Dynamically attach the item to the request instance
            # so that file_path can read it later.
            request_obj.item = item
        return request_objs

    def item_completed(self, results, item, info):
        """
        Overrides the parent method.

        Called once all images of an item have been processed, whether the
        downloads succeeded or failed. results is a list of 2-tuples, one per
        request issued by get_media_requests:
            (success, image_info_or_failure)
        success tells whether the image was downloaded; on success,
        image_info_or_failure is a dict.
        """
        # super(MMPipeline, self).item_completed(results, item, info)
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

    def file_path(self, request, response=None, info=None):
        # Called when an image is about to be stored, to get its storage path.
        # e.g. 'http://www.521609.com/uploads/allimg/110918/12310035925-5.jpg'
        current_url = request.url
        # e.g. ['http://www.521609.com/uploads/', '/110918/12310035925-5.jpg']
        temp = current_url.split('allimg')
        # e.g. '大慶體校張可(5)'
        img_name_prefix = request.item.get('img_name')
        img_name_suffix = temp[1].split('/')[-1]
        category = temp[1].split('/')[-2]
        image_name = img_name_prefix + img_name_suffix
        image_path = os.path.join(category, image_name)
        return image_path
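The two pieces of pure logic in this pipeline, building the storage path and keeping only the successful downloads, can be checked in isolation. build_image_path and successful_paths below are hypothetical helpers that mirror the logic of file_path and item_completed; the URL and image name are the ones quoted in the code comments.

```python
import os


def build_image_path(url, img_name):
    """Mirror of the file_path logic: <category>/<img_name><original file name>."""
    tail = url.split('allimg')[1]      # '/110918/12310035925-5.jpg'
    parts = tail.split('/')            # ['', '110918', '12310035925-5.jpg']
    category, file_name = parts[-2], parts[-1]
    return os.path.join(category, img_name + file_name)


def successful_paths(results):
    """Mirror of the item_completed filter: keep paths of successful downloads."""
    return [info['path'] for ok, info in results if ok]


url = 'http://www.521609.com/uploads/allimg/110918/12310035925-5.jpg'
path = build_image_path(url, '大慶體校張可(5)')
print(path)

# Made-up results list: one success and one failure.
results = [(True, {'path': path}), (False, Exception('download failed'))]
print(successful_paths(results))
```

Note that the approach relies on the URL containing the segment allimg; a URL without it would raise an IndexError in both the sketch and the pipeline.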

mmspider.py:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from douban.items import MMItem


class MMSpider(CrawlSpider):
    name = 'xiaohua'
    allowed_domains = ['www.521609.com']
    start_urls = ['http://www.521609.com']
    # start_urls = ['http://www.521609.com/xiaoyuanmeinv/10464_5.html']

    # Custom settings override the values in settings.py, i.e. they take priority.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        # 'LOG_FILE': 'MMSpider.log',
        # 'LOG_LEVEL': 'DEBUG'
        'LOG_ENABLED': False,  # turn off Scrapy's debug output
        'LOG_LEVEL': 'INFO'
    }

    rules = (
        # Match all xiaohua/ and meinv/ URLs
        Rule(LinkExtractor(allow=(r'xiaohua/\d+_?\d+', r'meinv/\d+_?\d+')), callback='parse_img', follow=True),
        Rule(LinkExtractor(allow=(r'xiaohua/', r'meinv/')), follow=True)
    )

    def parse_img(self, response):
        xiaohua_name = response.xpath('//div[@class="title"]/h2/text()').extract_first()
        link_extractor = LinkExtractor(
            allow='uploads/allimg',  # URLs to match
            deny=r'(lp\.jpg)$',  # URLs to exclude: thumbnails ending in lp.jpg
            tags=('img',),  # tags to extract links from
            attrs=('src',),  # attributes to extract links from
            restrict_xpaths='//div[@class="picbox"]',  # only search the region selected by this XPath
            deny_extensions=''  # disable extension filtering; by default URLs ending in .jpg are dropped
        )
        all_links = link_extractor.extract_links(response)
        for link in all_links:
            print('\t{0} : {1}'.format(xiaohua_name, link.url))
            # Create a fresh item per image so that items already yielded are not overwritten.
            img_item = MMItem()
            # The URL must be wrapped in a list: the ImagesPipeline source
            # shows that image_urls has to be a list of URLs.
            img_item['image_urls'] = [link.url]
            img_item['img_name'] = xiaohua_name
            yield img_item


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl xiaohua'.split(' '))

settings.py:

ROBOTSTXT_OBEY = False  # do not obey robots.txt
ITEM_PIPELINES = {
    'xiaohua.pipelines.MMPipeline': 5,
}
IMAGES_STORE = 'img_dir'  # directory where downloaded images are stored

Example code 2:

# -*- coding: utf-8 -*-
import os
import requests
from pathlib import Path
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MMSpider(CrawlSpider):
    name = 'mm_spider'
    allowed_domains = ['www.521609.com']
    start_urls = ['http://www.521609.com']
    # start_urls = ['http://www.521609.com/xiaoyuanmeinv/10464_5.html']

    # Custom settings override the values in settings.py, i.e. they take priority.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        # 'LOG_FILE': 'MMSpider.log',
        # 'LOG_LEVEL': 'DEBUG'
        'LOG_ENABLED': False,  # turn off Scrapy's debug output
        'LOG_LEVEL': 'INFO'
    }

    rules = (
        # Match all URLs that contain xiaohua/ or meinv/
        Rule(LinkExtractor(allow=(r'xiaohua/\d+_?\d+', r'meinv/\d+_?\d+')), callback='parse_img', follow=True),
        Rule(LinkExtractor(allow=(r'xiaohua/', r'meinv/')), follow=True)
    )

    # def parse(self, response):
    #     print(response.url)
    #     super(MMSpider, self).parse(response)

    def parse_img(self, response):
        current_url = response.url
        print(f'current_url:{current_url}')
        mm_name = response.xpath('//div[@class="title"]/h2/text()').extract_first()
        link_extractor = LinkExtractor(
            allow='uploads/allimg',  # URLs to match
            deny=r'(lp\.jpg)$',  # URLs to exclude: thumbnails ending in lp.jpg
            tags=('img',),  # tags to extract links from
            attrs=('src',),  # attributes to extract links from
            restrict_xpaths='//div[@class="picbox"]',  # only search the region selected by this XPath
            deny_extensions=''  # disable extension filtering; by default URLs ending in .jpg are dropped
        )
        all_links = link_extractor.extract_links(response)

        # A custom item class works here, but so does a plain Python dict,
        # because an item behaves like a dict.
        # img_item = MMItem()
        img_item = dict()
        image_urls_list = list()
        for link in all_links:
            print(f'\t{mm_name}:{link.url}')
            image_urls_list.append(link.url)
        img_item['img_name'] = mm_name
        img_item['image_urls'] = image_urls_list
        # yield img_item
        print(f'\t{img_item}')

        # Download each image directly with requests instead of a pipeline.
        for img_url in image_urls_list:
            file_name = img_url.split('/')[-1].split('.')[0]
            dir_path = f'./{mm_name}'
            if not Path(dir_path).is_dir():
                os.mkdir(dir_path)
            file_path = f'./{mm_name}/{file_name}.jpg'
            r = requests.get(url=img_url)
            if 200 == r.status_code:
                with open(file_path, 'wb') as f:
                    f.write(r.content)
            else:
                print(f'status_code:{r.status_code}')


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute('scrapy crawl mm_spider'.split())
