Preventing IP bans when scraping large amounts of data with Python
Source: http://blog.51cto.com/7200087/2070320
Setting proxy IPs for a Scrapy-based crawler: https://www.jianshu.com/p/074c36a7948c
Scrapy: retrying through a proxy on specific response status codes: http://www.cnblogs.com/my8100/p/scrapy_middleware_autoproxy.html
Python scraping tricks: setting a proxy IP: https://blog.csdn.net/lammonpeter/article/details/52917264
A Chinese high-anonymity proxy IP site: http://www.xicidaili.com/nn/
Configuring proxies in Scrapy: https://www.jianshu.com/p/b21f94b8591c
Python web scraping: using an IP proxy pool with Scrapy: https://www.jianshu.com/p/da94a2a24de8
Setting up a proxy pool in Scrapy: https://blog.csdn.net/weixin_40475396/article/details/78241238
Getting around anti-scraping measures with Scrapy: https://www.imooc.com/article/35588
Scrapy study notes (7), a customizable and dynamically configurable crawler: http://jinbitou.net/2016/12/05/2244.html
Python scrapy.http.HtmlResponse(): https://www.programcreek.com/python/example/71413/scrapy.http.HtmlResponse
Scrapy middleware for random request headers and an IP proxy pool: https://www.jianshu.com/p/ca1afe40bba3
There are two ways to use proxies in Scrapy.
Method 1: use a downloader middleware
This takes two steps.
1. In settings.py:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.ProxyMiddleware': 100,  # change project_name to your project
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
Note: the numbers 100 and 110 set the order in which the middlewares are called; the lower the number, the earlier the middleware runs.
From the official documentation:
The integer values you assign to classes in this setting determine the order in which they run: items go through from lower valued to higher valued classes. It’s customary to define these numbers in the 0-1000 range.
2. middlewares.py implements a ProxyMiddleware that keeps switching the proxy. The proxy addresses can come from several places:
- fetched from an online API with a plain GET request (this needs an APIKEY; a free account gets you one, though the key in the original post is the author's own and is not guaranteed to keep working);
- scraped from a free proxy-list site on the fly;
- read from a local file.
A minimal sketch of such a middleware is given below.
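The following sketch is not the original author's middleware; it assumes the proxies sit in a local text file named proxies.txt (a hypothetical name) with one http://ip:port entry per line, whereas the original post fetched them from an online API:

import random


class ProxyMiddleware(object):

    def __init__(self):
        # Load the proxy pool once; in practice you might refresh it
        # periodically or pull fresh entries from an online API instead.
        with open('proxies.txt') as f:
            self.proxies = [line.strip() for line in f if line.strip()]

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request;
        # the built-in HttpProxyMiddleware (priority 110 above) then uses it.
        request.meta['proxy'] = random.choice(self.proxies)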
Method 2: set the Request's meta parameter directly
import random

import scrapy

# proxy pool prepared in advance
proxy_pool = ['http://proxy_ip1:port', 'http://proxy_ip2:port', ..., 'http://proxy_ipn:port']


class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/articals/',
    ]

    def start_requests(self):
        for url in self.start_urls:
            proxy_addr = random.choice(proxy_pool)  # pick one at random
            # attach the proxy through the meta parameter
            yield scrapy.Request(url, callback=self.parse, meta={'proxy': proxy_addr})

    def parse(self, response):
        # doing parse
        pass

Further reading
1. Reading the official documentation for the Request class, we see that besides proxy you can also set method, headers, cookies, encoding, and so on:
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
2. The official documentation's list of special keys recognized in Request.meta:
- dont_redirect
- dont_retry
- handle_httpstatus_list
- handle_httpstatus_all
- dont_merge_cookies (see cookies parameter of Request constructor)
- cookiejar
- dont_cache
- redirect_urls
- bindaddress
- dont_obey_robotstxt
- download_timeout
- download_maxsize
- proxy
For example, to randomly set both the request headers and the proxy:
# my_spider.py
import random

import scrapy

# proxy pool collected in advance
proxy_pool = [
    'http://proxy_ip1:port',
    'http://proxy_ip2:port',
    ...,
    'http://proxy_ipn:port',
]

# headers collected in advance
headers_pool = [
    {'User-Agent': 'Mozzila 1.0'},
    {'User-Agent': 'Mozzila 2.0'},
    {'User-Agent': 'Mozzila 3.0'},
    {'User-Agent': 'Mozzila 4.0'},
    {'User-Agent': 'Chrome 1.0'},
    {'User-Agent': 'Chrome 2.0'},
    {'User-Agent': 'Chrome 3.0'},
    {'User-Agent': 'Chrome 4.0'},
    {'User-Agent': 'IE 1.0'},
    {'User-Agent': 'IE 2.0'},
    {'User-Agent': 'IE 3.0'},
    {'User-Agent': 'IE 4.0'},
]


class MySpider(scrapy.Spider):
    name = "my_spider"
    allowed_domains = ["example.com"]
    start_urls = [
        'http://www.example.com/articals/',
    ]

    def start_requests(self):
        for url in self.start_urls:
            headers = random.choice(headers_pool)   # pick a random headers dict
            proxy_addr = random.choice(proxy_pool)  # pick a random proxy
            yield scrapy.Request(url, callback=self.parse, headers=headers, meta={'proxy': proxy_addr})

    def parse(self, response):
        # doing parse
        pass
A fairly complete example: handling crawler bans with an IP proxy pool
Hooking an IP proxy pool into Scrapy (the code part): https://blog.csdn.net/xudailong_blog/article/details/80153387
import logging
import random

import redis
from twisted.internet import defer
from twisted.internet.error import (
    TimeoutError, ConnectionRefusedError, ConnectError,
    ConnectionLost, TCPTimedOutError, ConnectionDone,
)

logger = logging.getLogger(__name__)


class HttpProxymiddleware(object):
    # exceptions that should trigger a proxy change
    EXCEPTIONS_TO_CHANGE = (defer.TimeoutError, TimeoutError, ConnectionRefusedError,
                            ConnectError, ConnectionLost, TCPTimedOutError, ConnectionDone)

    def __init__(self):
        # connect to redis; decode_responses=True returns values as str
        self.rds = redis.from_url('redis://:your_password@localhost:6379/0',
                                  decode_responses=True)

    def process_request(self, request, spider):
        # grab all keys and pick one key/value pair at random
        keys = self.rds.hkeys("xila_hash")
        key = random.choice(keys)
        # the stored value is a dict serialized as a string; eval() turns it back
        # into a dict (ast.literal_eval would be a safer choice here)
        proxy = eval(self.rds.hget("xila_hash", key))
        logger.warning("-----------------" + str(proxy) + " being tried------------------------")
        # store the proxy ip and the key in meta
        request.meta["proxy"] = proxy["ip"]
        request.meta["accountText"] = key

    def process_response(self, request, response, spider):
        http_status = response.status
        # on 200, increment the proxy's counter, write it back, and pass the response on
        if http_status == 200:
            key = request.meta["accountText"]
            proxy = eval(self.rds.hget("xila_hash", key))
            proxy["times"] = proxy["times"] + 1
            # write back as a string so eval() can read it again (redis cannot store a dict directly)
            self.rds.hset("xila_hash", key, str(proxy))
            return response
        # a 403 may be caused by an unusable user-agent rather than the proxy, so just retry the request
        elif http_status == 403:
            logging.warning("#########################403, retrying############################")
            return request.replace(dont_filter=True)
        # anything else: treat the ip as unusable; delete it if times < 10, otherwise keep it for now
        else:
            ip = request.meta["proxy"]
            key = request.meta["accountText"]
            proxy = eval(self.rds.hget("xila_hash", key))
            if proxy["times"] < 10:
                self.rds.hdel("xila_hash", key)
                logging.warning("#################" + ip + " unusable, deleted########################")
            return request.replace(dont_filter=True)

    def process_exception(self, request, exception, spider):
        # timeouts and similar exceptions mean the ip is unusable, so delete it
        if isinstance(exception, self.EXCEPTIONS_TO_CHANGE) \
                and request.meta.get('proxy', False):
            key = request.meta["accountText"]
            print("+++++++++++++++++++++++++{} unusable, will be deleted++++++++++++++++++++++++".format(key))
            proxy = eval(self.rds.hget("xila_hash", key))
            if proxy["times"] < 10:
                self.rds.hdel("xila_hash", key)
            logger.debug("Proxy {} connection error {}.".format(request.meta['proxy'], exception))
            return request.replace(dont_filter=True)
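To take effect, the middleware has to be registered in the project's settings.py. A sketch only, assuming the class lives in your project's middlewares module (project_name and the priority 543 are placeholders to adjust):

DOWNLOADER_MIDDLEWARES = {
    'project_name.middlewares.HttpProxymiddleware': 543,
}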
A Scrapy extension middleware: retry through a proxy on specific response status codes
0. References
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.redirect
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpproxy
1. What it does
In real crawls, overly frequent requests often get temporarily redirected to a login page (302) or even flatly refused (403), so for those responses we retry the request once through a proxy:
(1) Following the built-in redirect.py module: when conditions such as dont_redirect or handle_httpstatus_list are met, pass the response through unchanged.
(2) Otherwise, if the response status is 302 or 403, re-issue the request through a proxy.
(3) If the status is still 302 or 403 after using the proxy, drop the request.
2. The code
Save it as /site-packages/my_middlewares.py:
from w3lib.url import safe_url_string
from six.moves.urllib.parse import urljoin

from scrapy.exceptions import IgnoreRequest


class MyAutoProxyDownloaderMiddleware(object):

    def __init__(self, settings):
        self.proxy_status = settings.get('PROXY_STATUS', [302, 403])
        # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html?highlight=proxy#module-scrapy.downloadermiddlewares.httpproxy
        self.proxy_config = settings.get('PROXY_CONFIG', 'http://username:password@some_proxy_server:port')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(settings=crawler.settings)

    # See /site-packages/scrapy/downloadermiddlewares/redirect.py
    def process_response(self, request, response, spider):
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        if response.status in self.proxy_status:
            if 'Location' in response.headers:
                location = safe_url_string(response.headers['location'])
                redirected_url = urljoin(request.url, location)
            else:
                redirected_url = ''

            # AutoProxy for the first time
            if not request.meta.get('auto_proxy'):
                request.meta.update({'auto_proxy': True, 'proxy': self.proxy_config})
                new_request = request.replace(meta=request.meta, dont_filter=True)
                new_request.priority = request.priority + 2

                spider.log('Will AutoProxy for <{} {}> {}'.format(
                    response.status, request.url, redirected_url))
                return new_request

            # IgnoreRequest for the second time
            else:
                spider.logger.warn('Ignoring response <{} {}>: HTTP status code still in {} after AutoProxy'.format(
                    response.status, request.url, self.proxy_status))
                raise IgnoreRequest

        return response
3. How to enable it
(1) Add the following to the project's settings.py. Note that the middleware's priority must sit between the default RedirectMiddleware (600) and HttpProxyMiddleware (750).
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'my_middlewares.MyAutoProxyDownloaderMiddleware': 601,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}
PROXY_STATUS = [302, 403]
PROXY_CONFIG = 'http://username:password@some_proxy_server:port'
4. Sample run
2018-07-18 18:42:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://httpbin.org/> (referer: None)
2018-07-18 18:42:38 [test] DEBUG: Will AutoProxy for <302 http://httpbin.org/status/302> http://httpbin.org/redirect/1
2018-07-18 18:42:43 [test] DEBUG: Will AutoProxy for <403 https://httpbin.org/status/403>
2018-07-18 18:42:51 [test] WARNING: Ignoring response <302 http://httpbin.org/status/302>: HTTP status code still in [302, 403] after AutoProxy
2018-07-18 18:42:52 [test] WARNING: Ignoring response <403 https://httpbin.org/status/403>: HTTP status code still in [302, 403] after AutoProxy
Proxy server (squid) log:
squid [18/Jul/2018:18:42:53 +0800] "GET http://httpbin.org/status/302 HTTP/1.1" 302 310 "-" "Mozilla/5.0" TCP_MISS:HIER_DIRECT
squid [18/Jul/2018:18:42:54 +0800] "CONNECT httpbin.org:443 HTTP/1.1" 200 3560 "-" "-" TCP_TUNNEL:HIER_DIRECT
A Python scraping trick: setting a proxy IP
When learning Python web scraping, you will often find that the target site uses anti-scraping measures. High-intensity, high-speed crawling puts a heavy load on the site's servers, so hitting the same pages over and over from a single IP is very likely to get that IP banned. This section covers one scraping trick: setting a proxy IP.
(1) Environment setup
- Install the requests library
- Install the bs4 library
- Install the lxml library
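All three are typically installed with pip, for example: pip install requests beautifulsoup4 lxml (the beautifulsoup4 package is what the bs4 import comes from).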
(2) The code
The snippet (a get_ip_list() function plus a get_random_ip() helper) is reproduced in full in the 豬八戒 section further down. The IP addresses are taken from a Chinese high-anonymity proxy site, http://www.xicidaili.com/nn/, and scraping just its first page is enough for general use.
(3) Using the proxy IP
Running that snippet yields a random proxies dict, which you pass straight into requests.get():
web_data = requests.get(url, headers=headers, proxies=proxies)
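Free proxies die quickly, so it can help to verify one before relying on it. A small check of this kind (not part of the original post), using the proxies dict from above against httpbin.org, which echoes back the origin IP it sees:

import requests

try:
    # If the proxy is alive, httpbin reports the proxy's IP rather than yours.
    r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
    print(r.json())
except requests.exceptions.RequestException as e:
    print('proxy looks dead:', e)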
Scraping data from 豬八戒網 (zbj.com)
Following the usual routine, over the last couple of days I scraped some data from 豬八戒 at http://task.zbj.com/t-ppsj/p1s5.html. Probably because I pulled a little too much data, my IP got banned, and unblocking it required passing a manual verification, which obviously stopped me from scraping any more.
Below is the code I wrote for 豬八戒 that got my IP banned:
# coding=utf-8
import requests
from lxml import etree


def getUrl():
    for i in range(33):
        url = 'http://task.zbj.com/t-ppsj/p{}s5.html'.format(i+1)
        spiderPage(url)


def spiderPage(url):
    if url is None:
        return None
    htmlText = requests.get(url).text
    selector = etree.HTML(htmlText)
    tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')
    try:
        for td in tds:
            price = td.xpath('./td/p/em/text()')
            href = td.xpath('./td/p/a/@href')
            title = td.xpath('./td/p/a/text()')
            subTitle = td.xpath('./td/p/text()')
            deadline = td.xpath('./td/span/text()')
            price = price[0] if len(price) > 0 else ''  # Python conditional expression: value_if_true if condition else value_if_false
            title = title[0] if len(title) > 0 else ''
            href = href[0] if len(href) > 0 else ''
            subTitle = subTitle[0] if len(subTitle) > 0 else ''
            deadline = deadline[0] if len(deadline) > 0 else ''
            print price, title, href, subTitle, deadline
            print '---------------------------------------------------------------------------------------'
            spiderDetail(href)
    except:
        print 'error'


def spiderDetail(url):
    if url is None:
        return None
    try:
        htmlText = requests.get(url).text
        selector = etree.HTML(htmlText)
        aboutHref = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/a/@href')
        price = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/text()')
        title = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/h2/text()')
        contentDetail = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/div[1]/text()')
        publishDate = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/p/text()')
        aboutHref = aboutHref[0] if len(aboutHref) > 0 else ''
        price = price[0] if len(price) > 0 else ''
        title = title[0] if len(title) > 0 else ''
        contentDetail = contentDetail[0] if len(contentDetail) > 0 else ''
        publishDate = publishDate[0] if len(publishDate) > 0 else ''
        print aboutHref, price, title, contentDetail, publishDate
    except:
        print 'error'


if __name__ == '__main__':
    getUrl()

After the code finished running I noticed that the last few pages had not been scraped, and I could no longer reach the 豬八戒 site at all; only after waiting a while could I visit it again. Pretty awkward. I had to find a way to keep my IP from being banned.
How do you keep a site from banning your IP while you scrape? I looked up a few tricks.
1. Change the request headers
The earlier crawler sent no headers at all; here I add a User-Agent so the requests look like they come from a browser:
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
headers = {'User-Agent': user_agent}
htmlText = requests.get(url, headers=headers, proxies=proxies).text

2. Use proxy IPs
Once your own IP has been banned by a site, the only way to keep scraping is through proxy IPs, so try to crawl through a proxy whenever you can; if one proxy gets banned there are always more.
Here I borrowed a snippet from this blog post to generate proxy addresses: http://blog.csdn.net/lammonpeter/article/details/52917264
It builds a list of proxy IPs; feel free to use the code as-is.
# coding=utf-8
# IP addresses taken from a Chinese high-anonymity proxy site: http://www.xicidaili.com/nn/
# Scraping just the first page of IPs is enough for general use

from bs4 import BeautifulSoup
import requests
import random


def get_ip_list(url, headers):
    web_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(web_data.text, 'lxml')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list


def get_random_ip(ip_list):
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies


if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    ip_list = get_ip_list(url, headers=headers)
    proxies = get_random_ip(ip_list)
    print(proxies)

With the code above I generated a batch of proxy addresses (some of them may be dead, but as long as my own IP doesn't get banned I'm happy), and now I can attach one of these proxies to my requests.
Adding a proxy IP to our requests:
proxies = {'http': 'http://124.72.109.183:8118', 'http': 'http://49.85.1.79:31666'}
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
headers = {'User-Agent': user_agent}
htmlText = requests.get(url, headers=headers, timeout=3, proxies=proxies).text

That is all I know so far.
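Note that a Python dict literal cannot hold two entries under the same 'http' key; the second silently overwrites the first, so only one of those proxies is ever used. If the intent is one proxy per scheme, the usual requests form would be:

proxies = {
    'http': 'http://124.72.109.183:8118',
    'https': 'http://49.85.1.79:31666',
}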
最后完整代碼如下:
# coding=utf-8
import requests
import time
from lxml import etree


def getUrl():
    for i in range(33):
        url = 'http://task.zbj.com/t-ppsj/p{}s5.html'.format(i+1)
        spiderPage(url)


def spiderPage(url):
    if url is None:
        return None
    try:
        proxies = {
            'http': 'http://221.202.248.52:80',
        }
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4295.400'
        headers = {'User-Agent': user_agent}
        htmlText = requests.get(url, headers=headers, proxies=proxies).text
        selector = etree.HTML(htmlText)
        tds = selector.xpath('//*[@class="tab-switch tab-progress"]/table/tr')
        for td in tds:
            price = td.xpath('./td/p/em/text()')
            href = td.xpath('./td/p/a/@href')
            title = td.xpath('./td/p/a/text()')
            subTitle = td.xpath('./td/p/text()')
            deadline = td.xpath('./td/span/text()')
            price = price[0] if len(price) > 0 else ''  # Python conditional expression
            title = title[0] if len(title) > 0 else ''
            href = href[0] if len(href) > 0 else ''
            subTitle = subTitle[0] if len(subTitle) > 0 else ''
            deadline = deadline[0] if len(deadline) > 0 else ''
            print price, title, href, subTitle, deadline
            print '---------------------------------------------------------------------------------------'
            spiderDetail(href)
    except Exception, e:
        print 'error', e.message


def spiderDetail(url):
    if url is None:
        return None
    try:
        htmlText = requests.get(url).text
        selector = etree.HTML(htmlText)
        aboutHref = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/a/@href')
        price = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/div/p[1]/text()')
        title = selector.xpath('//*[@id="utopia_widget_10"]/div[1]/div/div/h2/text()')
        contentDetail = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/div[1]/text()')
        publishDate = selector.xpath('//*[@id="utopia_widget_10"]/div[2]/div/div[1]/p/text()')
        aboutHref = aboutHref[0] if len(aboutHref) > 0 else ''
        price = price[0] if len(price) > 0 else ''
        title = title[0] if len(title) > 0 else ''
        contentDetail = contentDetail[0] if len(contentDetail) > 0 else ''
        publishDate = publishDate[0] if len(publishDate) > 0 else ''
        print aboutHref, price, title, contentDetail, publishDate
    except:
        print 'error'


if __name__ == '__main__':
    getUrl()

In the end the program ran perfectly and my IP was never banned again. Of course these are not the only ways to avoid an IP ban; there is more to explore!
最后
當然數據我是已經抓取過來了,但是我的數據都沒有完美呈現出來,我應該寫入execl文件,或者數據庫中啊,這樣才能方便采用.所以接下來我準備了使用
Python操作execl,mysql,mongoDB
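As a first step in that direction, a minimal sketch (not from the original post) of dumping the scraped rows into a CSV file, which Excel opens directly; the file name and the example row are made up, and the column names match the spider above:

import csv

# rows collected by the spider: (price, title, href, subTitle, deadline) tuples
rows = [
    ('500', 'example task', 'http://task.zbj.com/example', 'example subtitle', '2018-07-18'),
]

with open('zbj_tasks.csv', 'wb') as f:  # on Python 3 use open('zbj_tasks.csv', 'w', newline='')
    writer = csv.writer(f)
    writer.writerow(['price', 'title', 'href', 'subTitle', 'deadline'])
    writer.writerows(rows)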