Scrapy in Detail and Its Main Application Scenarios
Contents
- 1. Multi-page crawling with Scrapy
- 2. Crawling detail pages with Scrapy
- 3. Sending POST requests with Scrapy
- 4. Scrapy middleware
- 5. Implementing a User-Agent pool in downloader middleware
1. Multi-page crawling with Scrapy
Spider code: building on the original spider, construct the URLs of the remaining pages and issue new requests with scrapy.Request; the callback for these requests is still parse, and no other files need to change. The fragment below shows only the added pieces; a complete sketch follows it.

```python
# class attributes: current page counter and the page-URL template
page = 1
base_url = 'http://www.xiaohuar.com/list-1-%s.html'

# inside parse(): queue the next page until page 4, reusing parse as the callback
if self.page < 4:
    page_url = self.base_url % self.page
    self.page += 1
    yield scrapy.Request(url=page_url, callback=self.parse)
```
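For context, here is a minimal complete spider sketch built around that fragment; the spider name, the starting URL, and the omitted item-extraction step are illustrative assumptions rather than code from the original post.

```python
# Sketch: a complete spider using the pagination pattern above (names are illustrative).
import scrapy


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    start_urls = ['http://www.xiaohuar.com/list-1-0.html']

    page = 1
    base_url = 'http://www.xiaohuar.com/list-1-%s.html'

    def parse(self, response):
        # ... extract items from the current page here ...

        # queue pages 1-3, reusing parse as the callback
        if self.page < 4:
            page_url = self.base_url % self.page
            self.page += 1
            yield scrapy.Request(url=page_url, callback=self.parse)
```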
2. Crawling detail pages with Scrapy

Requirement: crawl each joke's title and detail-page link from the list page, then follow the link and crawl the joke's content from the detail page.
Item code: define the fields to be persisted.

```python
# items.py
import scrapy


class JokeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
```

Spider code:

```python
# -*- coding: utf-8 -*-
import scrapy
from ..items import JokeItem


class XhSpider(scrapy.Spider):
    name = 'xh'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.jokeji.cn/list.htm']

    def parse(self, response):
        li_list = response.xpath('//div[@class="list_title"]/ul/li')
        for li in li_list:
            title = li.xpath('./b/a/text()').extract_first()
            link = 'http://www.jokeji.cn' + li.xpath('./b/a/@href').extract_first()
            # pass the title along to the detail-page callback via meta
            yield scrapy.Request(url=link, callback=self.detail_parse, meta={"title": title})

    def detail_parse(self, response):
        joke_list = response.xpath('//span[@id="text110"]//text()').extract()
        title = response.meta["title"]
        content = ''.join(joke_list)
        item = JokeItem()
        item["title"] = title
        item["content"] = content
        yield item
```

Pipeline code: the actual persistence, here into MongoDB.

```python
# pipelines.py
import pymongo


class JokePipeline(object):
    conn = pymongo.MongoClient('localhost', 27017)
    db = conn.haha
    table = db.hahatable

    def process_item(self, item, spider):
        # insert_one replaces the old insert(), which newer pymongo versions removed
        self.table.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.conn.close()
```

Settings code: set a User-Agent (UA spoofing), the robots.txt policy, and enable the item pipeline.
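The post only names the settings to touch; a minimal settings.py sketch might look like the following. The User-Agent string, the pipeline priority 300, and the module path jokepro.pipelines.JokePipeline are assumptions, not from the original post.

```python
# settings.py (sketch): UA spoofing, robots policy, item pipeline registration
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36')

ROBOTSTXT_OBEY = False  # do not honour robots.txt for this demo crawl

ITEM_PIPELINES = {
    'jokepro.pipelines.JokePipeline': 300,  # module path and priority are illustrative
}
```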
3. Sending POST requests with Scrapy

```python
import scrapy
import json


class FySpider(scrapy.Spider):
    name = 'fy'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    def start_requests(self):
        # override start_requests so the first request is a POST instead of the default GET
        data = {'kw': 'boy'}
        yield scrapy.FormRequest(url=self.start_urls[0], callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)
        print(json.loads(response.text))
```
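FormRequest sends form-encoded data. For endpoints that expect a JSON body, Scrapy also ships JsonRequest; here is a minimal sketch with a hypothetical endpoint URL and payload.

```python
# Sketch: POSTing a JSON body instead of form data (the URL is hypothetical).
import scrapy
from scrapy.http import JsonRequest


class JsonPostSpider(scrapy.Spider):
    name = 'json_post'

    def start_requests(self):
        # JsonRequest serialises `data` to JSON, sets the Content-Type header,
        # and defaults the method to POST when data is given
        yield JsonRequest(
            url='https://example.com/api/translate',
            data={'kw': 'boy'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info(response.text)
```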
Middleware categories:
- Downloader middleware: DownloaderMiddleware
- Spider middleware: SpiderMiddleware

What middleware is for:
- Downloader middleware: intercept requests and responses, and modify them
- Spider middleware: intercept requests, responses, and pipeline items; modify requests and responses; process items

Main methods of a downloader middleware (a bare skeleton is sketched below):
- process_request
- process_response
- process_exception
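A bare skeleton of a downloader middleware, showing the three hook signatures; the class name is a placeholder.

```python
# middlewares.py (sketch): the three downloader-middleware hooks.
class DemoDownloaderMiddleware:

    def process_request(self, request, spider):
        # called for every outgoing request; return None to continue normally,
        # or a Response/Request object to short-circuit the download
        return None

    def process_response(self, request, response, spider):
        # called for every response on its way back to the spider
        return response

    def process_exception(self, request, exception, spider):
        # called when the download or process_request raises an exception
        return None
```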
Downloader middleware intercepting requests: a proxy IP example

Spider code:

```python
import scrapy


class DlproxySpider(scrapy.Spider):
    name = 'dlproxy'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=ip']

    def parse(self, response):
        # dump the page so we can check which IP the site saw
        with open('baiduproxy.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
```

Downloader middleware code:

```python
def process_request(self, request, spider):
    # route the request through a proxy server
    request.meta['proxy'] = 'http://111.231.90.122:8888'
    return None
```
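For the middleware to take effect it must be enabled in settings.py; a sketch, assuming the module path and class name that Scrapy's default project template would generate.

```python
# settings.py (sketch): enable the downloader middleware.
# The module path and class name are assumptions; adjust them to your project.
DOWNLOADER_MIDDLEWARES = {
    'proxypro.middlewares.ProxyproDownloaderMiddleware': 543,
}
```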
5. Implementing a User-Agent pool in downloader middleware

Spider code:

```python
import scrapy


class DlproxySpider(scrapy.Spider):
    name = 'dlproxy'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.baidu.com/',
                  'https://www.baidu.com/',
                  'https://www.baidu.com/',
                  'https://www.baidu.com/',
                  'https://www.baidu.com/']

    def parse(self, response):
        pass
```

Middleware code:

```python
from scrapy import signals
from fake_useragent import UserAgent
import random

# pre-build a pool of 100 random Chrome User-Agent strings
ua = UserAgent()
ua_list = []
for i in range(100):
    ua_chrome = ua.chrome
    ua_list.append(ua_chrome)


class UAPoolDownloaderMiddleware(object):  # the original left the class name blank; this one is a placeholder
    ua_pool = ua_list

    def process_request(self, request, spider):
        # request.meta['proxy'] = 'http://111.231.90.122:8888'
        # pick a random User-Agent from the pool for every request
        request.headers['User-Agent'] = random.choice(self.ua_pool)
        return None

    def process_response(self, request, response, spider):
        # print the User-Agent that was actually sent, to verify the pool works
        print(request.headers["User-Agent"])
        return response
```
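If you would rather avoid the fake_useragent dependency, the same pool idea works with a hand-maintained list; a sketch with ordinary example User-Agent strings, not taken from the original post.

```python
import random


class StaticUAPoolMiddleware:
    # a small hand-maintained pool; extend it with whatever UA strings you like
    ua_pool = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/119.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
    ]

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        request.headers['User-Agent'] = random.choice(self.ua_pool)
        return None
```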