scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36" "http://www.xicidaili.com/nn/"
Note: besides the command-line change above, the spider project's settings file, settings.py, also needs to be changed.
Setting the User-Agent: find the USER_AGENT line in the settings file and remove the comment marker. It can be set to:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
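Given the 503 responses reported at the end of this post, it may also be worth throttling the crawl in the same settings.py; a minimal sketch using standard Scrapy settings (the values are only suggestions):

DOWNLOAD_DELAY = 2            # seconds to wait between requests to the same domain
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to server responsiveness
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # make sure 503s are retried rather than dropped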
Structure analysis: XPath
ip = response.xpath('//table[@id="ip_list"]/tr/td[2]/text()').extract()
port = response.xpath('//table[@id="ip_list"]/tr/td[3]/text()').extract()
address = response.xpath('//table[@id="ip_list"]/tr/td[4]/a/text()').extract()
annoy = response.xpath('//table[@id="ip_list"]/tr/td[5]/text()').extract()
type = response.xpath('//table[@id="ip_list"]/tr/td[6]/text()').extract()
speed = response.xpath('//table[@id="ip_list"]/tr/td[7]/div/@title').re(r'\d{0,2}\.\d*')
time = response.xpath('//table[@id="ip_list"]/tr/td[8]/div/@title').re(r'\d{0,2}\.\d*')
live = response.xpath('//table[@id="ip_list"]/tr/td[9]/text()').extract()
check = response.xpath('//table[@id="ip_list"]/tr/td[10]/text()').extract()
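The .re() calls on the speed and time columns combine extraction with a regular expression: the title attribute of the div in those cells holds a value such as "0.123秒" (an assumption about the page markup), and the pattern keeps only the number. A standalone illustration using Scrapy's Selector, with made-up markup:

from scrapy.selector import Selector

# hypothetical snippet mimicking one speed cell of the table
sel = Selector(text='<td><div class="bar" title="0.123秒"></div></td>')
print(sel.xpath('//div/@title').re(r'\d{0,2}\.\d*'))  # ['0.123']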
CSS
ip = item.css("td:nth-child(2)::text").extract()
port = item.css("td:nth-child(3)::text").extract()
address = item.css("td:nth-child(4) a::text").extract()
annoy = item.css("td:nth-child(5)::text").extract()
type = item.css("td:nth-child(6)::text").extract()
speed = item.css("td:nth-child(7) div::attr(title)").extract()
time = item.css("td:nth-child(8) div::attr(title)").extract()
live = item.css("td:nth-child(9)::text").extract()
check = item.css("td:nth-child(10)::text").extract()
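Note that item in the lines above is a selector for a single table row, not the full response; a minimal sketch of how such row selectors are obtained (runnable in the scrapy shell opened earlier):

for item in response.css('table#ip_list tr')[1:]:  # [1:] skips the header row
    ip = item.css("td:nth-child(2)::text").extract()
    print(ip)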
Edit items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CollectipsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ip = scrapy.Field()
    port = scrapy.Field()
    address = scrapy.Field()
    annoy = scrapy.Field()
    type = scrapy.Field()
    speed = scrapy.Field()
    time = scrapy.Field()
    live = scrapy.Field()
    check = scrapy.Field()
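Scrapy items behave like dictionaries, which is how the spider below fills them field by field; a quick illustrative sketch (the values are made up):

from collectips.items import CollectipsItem

collecte = CollectipsItem()
collecte['ip'] = '127.0.0.1'   # illustrative value only
collecte['port'] = '8080'
print(dict(collecte))          # prints the collected fields as a plain dict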
Modify xici.py
# -*- coding: utf-8 -*-
import scrapy

from collectips.items import CollectipsItem


class XiciSpider(scrapy.Spider):
    name = 'xici'
    allowed_domains = ['xicidaili.com']
    start_urls = ['http://www.xicidaili.com']

    # build one request per listing page
    def start_requests(self):
        reqs = []
        for i in range(1, 2600):
            req = scrapy.Request("http://www.xicidaili.com/nn/%s" % i)
            reqs.append(req)
        return reqs

    def parse(self, response):
        item = []
        # skip the header row, then collect one item per table row
        for info in response.xpath('//table[@id="ip_list"]/tr')[1:]:
            collecte = CollectipsItem()
            collecte['ip'] = info.xpath('td[2]/text()').extract_first()  # td[2] holds the IP; td[3] holds the port
            collecte['port'] = info.xpath('td[3]/text()').extract_first()
            collecte['address'] = info.xpath('td[4]/a/text()').extract_first()
            collecte['annoy'] = info.xpath('td[5]/text()').extract_first()
            collecte['type'] = info.xpath('td[6]/text()').extract_first()
            collecte['speed'] = info.xpath('td[7]/div/@title').re(r'\d{0,2}\.\d*')
            collecte['time'] = info.xpath('td[8]/div/@title').re(r'\d{0,2}\.\d*')
            collecte['live'] = info.xpath('td[9]/text()').extract_first()
            collecte['check'] = info.xpath('td[10]/text()').extract_first()
            item.append(collecte)
        return item
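As written, start_requests and parse build up Python lists and return them all at once; the more idiomatic Scrapy pattern is to yield requests and items one at a time, which keeps memory use flat across the 2,599 listing pages. A sketch of the same two methods as generators (behavior is otherwise unchanged):

    def start_requests(self):
        for i in range(1, 2600):
            yield scrapy.Request("http://www.xicidaili.com/nn/%s" % i)

    def parse(self, response):
        for info in response.xpath('//table[@id="ip_list"]/tr')[1:]:
            collecte = CollectipsItem()
            collecte['ip'] = info.xpath('td[2]/text()').extract_first()
            # ... fill the remaining fields exactly as above ...
            yield collecte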
Run the spider
scrapy crawl xici -o xici.json
The resulting log was full of 503 Service Unavailable responses, and the site cannot be reached from a browser either, so crawling through proxies looks all the more necessary. Once the site recovers, the IPs will be collected at a slower pace.
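Once a pool of working proxies has been collected, Scrapy's built-in HttpProxyMiddleware can route a request through one of them via the request meta; a minimal sketch (the proxy address below is a placeholder, not a real collected IP):

yield scrapy.Request(
    "http://www.xicidaili.com/nn/1",
    meta={'proxy': 'http://1.2.3.4:8080'},  # placeholder: substitute a collected ip:port
)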