scrapy shell -s USER_AGENT="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36" "http://www.xicidaili.com/nn/"
Note: besides the command-line change above, the spider project's settings file, settings.py, also needs to be changed.
Setting the User-Agent: find the USER_AGENT line in the settings file and remove the comment marker. It can be set to:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
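Given the 503 responses reported at the end of this post, it may also be worth throttling the crawl in the same settings.py; a minimal sketch using standard Scrapy settings (the values are only suggestions):

DOWNLOAD_DELAY = 2            # seconds to wait between requests to the same domain
AUTOTHROTTLE_ENABLED = True   # let Scrapy adapt the delay to server responsiveness
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # make sure 503s are retried rather than dropped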
Structure analysis: XPath
ip = response.xpath('//table[@id="ip_list"]/tr/td[2]/text()').extract()
port = response.xpath('//table[@id="ip_list"]/tr/td[3]/text()').extract()
address = response.xpath('//table[@id="ip_list"]/tr/td[4]/a/text()').extract()
annoy = response.xpath('//table[@id="ip_list"]/tr/td[5]/text()').extract()
type = response.xpath('//table[@id="ip_list"]/tr/td[6]/text()').extract()
speed = response.xpath('//table[@id="ip_list"]/tr/td[7]/div/@title').re(r'\d{0,2}\.\d*')
time = response.xpath('//table[@id="ip_list"]/tr/td[8]/div/@title').re(r'\d{0,2}\.\d*')
live = response.xpath('//table[@id="ip_list"]/tr/td[9]/text()').extract()
check = response.xpath('//table[@id="ip_list"]/tr/td[10]/text()').extract()
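The .re() calls on the speed and time columns combine extraction with a regular expression: the title attribute of the div in those cells holds a value such as "0.123秒" (an assumption about the page markup), and the pattern keeps only the number. A standalone illustration using Scrapy's Selector, with made-up markup:

from scrapy.selector import Selector

# hypothetical snippet mimicking one speed cell of the table
sel = Selector(text='<td><div class="bar" title="0.123秒"></div></td>')
print(sel.xpath('//div/@title').re(r'\d{0,2}\.\d*'))  # ['0.123']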
CSS
ip = item.css("td:nth-child(2)::text").extract()
port = item.css("td:nth-child(3)::text").extract()
address = item.css("td:nth-child(4) a::text").extract()
annoy = item.css("td:nth-child(5)::text").extract()
type = item.css("td:nth-child(6)::text").extract()
speed = item.css("td:nth-child(7) div::attr(title)").extract()
time = item.css("td:nth-child(8) div::attr(title)").extract()
live = item.css("td:nth-child(9)::text").extract()
check = item.css("td:nth-child(10)::text").extract()
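Note that item in the lines above is a selector for a single table row, not the full response; a minimal sketch of how such row selectors are obtained (runnable in the scrapy shell opened earlier):

for item in response.css('table#ip_list tr')[1:]:  # [1:] skips the header row
    ip = item.css("td:nth-child(2)::text").extract()
    print(ip)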
Edit items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CollectipsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    ip = scrapy.Field()
    port = scrapy.Field()
    address = scrapy.Field()
    annoy = scrapy.Field()
    type = scrapy.Field()
    speed = scrapy.Field()
    time = scrapy.Field()
    live = scrapy.Field()
    check = scrapy.Field()
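Scrapy items behave like dictionaries, which is how the spider below fills them field by field; a quick illustrative sketch (the values are made up):

from collectips.items import CollectipsItem

collecte = CollectipsItem()
collecte['ip'] = '127.0.0.1'   # illustrative value only
collecte['port'] = '8080'
print(dict(collecte))          # prints the collected fields as a plain dict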
Modify xici.py
# -*- coding: utf-8 -*-
import scrapy

from collectips.items import CollectipsItem


class XiciSpider(scrapy.Spider):
    name = 'xici'
    allowed_domains = ['xicidaili.com']
    start_urls = ['http://www.xicidaili.com']

    # build one request per listing page
    def start_requests(self):
        reqs = []
        for i in range(1, 2600):
            req = scrapy.Request("http://www.xicidaili.com/nn/%s" % i)
            reqs.append(req)
        return reqs

    def parse(self, response):
        item = []
        # skip the header row, then collect one item per table row
        for info in response.xpath('//table[@id="ip_list"]/tr')[1:]:
            collecte = CollectipsItem()
            collecte['ip'] = info.xpath('td[2]/text()').extract_first()  # td[2] holds the IP; td[3] holds the port
            collecte['port'] = info.xpath('td[3]/text()').extract_first()
            collecte['address'] = info.xpath('td[4]/a/text()').extract_first()
            collecte['annoy'] = info.xpath('td[5]/text()').extract_first()
            collecte['type'] = info.xpath('td[6]/text()').extract_first()
            collecte['speed'] = info.xpath('td[7]/div/@title').re(r'\d{0,2}\.\d*')
            collecte['time'] = info.xpath('td[8]/div/@title').re(r'\d{0,2}\.\d*')
            collecte['live'] = info.xpath('td[9]/text()').extract_first()
            collecte['check'] = info.xpath('td[10]/text()').extract_first()
            item.append(collecte)
        return item
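As written, start_requests and parse build up Python lists and return them all at once; the more idiomatic Scrapy pattern is to yield requests and items one at a time, which keeps memory use flat across the 2,599 listing pages. A sketch of the same two methods as generators (behavior is otherwise unchanged):

    def start_requests(self):
        for i in range(1, 2600):
            yield scrapy.Request("http://www.xicidaili.com/nn/%s" % i)

    def parse(self, response):
        for info in response.xpath('//table[@id="ip_list"]/tr')[1:]:
            collecte = CollectipsItem()
            collecte['ip'] = info.xpath('td[2]/text()').extract_first()
            # ... fill the remaining fields exactly as above ...
            yield collecte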
Run the spider
scrapy crawl xici -o xici.json
The resulting log was full of 503 Service Unavailable responses, and the site cannot be reached from a browser either, so crawling through proxies looks all the more necessary. Once the site recovers, the IPs will be collected at a slower pace.
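Once a pool of working proxies has been collected, Scrapy's built-in HttpProxyMiddleware can route a request through one of them via the request meta; a minimal sketch (the proxy address below is a placeholder, not a real collected IP):

yield scrapy.Request(
    "http://www.xicidaili.com/nn/1",
    meta={'proxy': 'http://1.2.3.4:8080'},  # placeholder: substitute a collected ip:port
)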