Scraping the Xici IP proxy pool with Python
1. How to set an IP proxy in requests
The most direct approach is to pass a proxies argument to get:
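    proxies = {
        'http': 'http://183.148.153.147:9999/',
        'https': 'http://183.148.153.147:9999/',
    }
    requests.get(url=url, headers=headers, proxies=proxies)

When a site bans your IP, you need a large supply of IPs to rotate through, which leads to the next topic: scraping the free proxy IPs that Xici publishes.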
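The rotation idea can be sketched roughly as follows: try a list of candidate proxies in turn and return the first successful response. This is only an illustration. The names `build_proxies` and `fetch_with_rotation`, and the injectable `fetch` parameter, are hypothetical and not part of requests; `fetch` defaults to `requests.get`, and only its `proxies` and `timeout` keyword arguments are used.

```python
def build_proxies(proxy_url):
    """Map both URL schemes to the same proxy, the dict shape requests expects."""
    return {'http': proxy_url, 'https': proxy_url}


def fetch_with_rotation(url, proxy_urls, fetch=None, timeout=5):
    """Try each proxy in order; return the first 2xx response, or None if all fail."""
    if fetch is None:
        import requests  # imported lazily so that a custom fetch can be injected
        fetch = requests.get
    for proxy_url in proxy_urls:
        try:
            response = fetch(url, proxies=build_proxies(proxy_url), timeout=timeout)
        except Exception:
            # Dead, slow, or banned proxy: move on to the next candidate
            continue
        if 200 <= response.status_code < 300:
            return response
    return None
```

A call would look like `fetch_with_rotation(url, ['http://183.148.153.147:9999/'])`, with the list filled from whatever proxy pool you maintain.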
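2. Scrape Xici's free IP proxies and store them in MySQL
The detailed field-by-field analysis will have to wait; Xici has banned my IP... (lol)
Straight to the code.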
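For orientation only, here is a minimal sketch of the storage half of this step; it is not the author's crawler. The column names `ip`, `port`, `proxy_type`, and `speed` come from the queries in step 3. Everything else is an assumption: the column types, the helper name `save_proxies`, the sample rows (taken from the documentation-only 192.0.2.0/24 range), and the use of the standard-library `sqlite3` as a stand-in for MySQL so that the example runs without a server.

```python
import sqlite3

# Assumed schema: only the column names are taken from the article's queries.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS ip_pond (
    ip TEXT PRIMARY KEY,
    port TEXT NOT NULL,
    proxy_type TEXT NOT NULL,
    speed REAL
)
"""


def save_proxies(conn, rows):
    """Store (ip, port, proxy_type, speed) tuples, replacing rows with the same ip.

    Returns the number of rows in ip_pond afterwards.
    """
    conn.execute(CREATE_SQL)
    conn.executemany(
        "INSERT OR REPLACE INTO ip_pond (ip, port, proxy_type, speed) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM ip_pond").fetchone()[0]


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    # Placeholder rows, not real proxies
    sample = [
        ('192.0.2.10', '8080', 'http', 0.4),
        ('192.0.2.11', '3128', 'https', 1.2),
    ]
    print(save_proxies(conn, sample))
```

With a MySQL driver such as pymysql the placeholders would be `%s` rather than `?`, and `REPLACE INTO` takes the place of SQLite's `INSERT OR REPLACE INTO`.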
3. Define a GetIp class for pulling IPs out of MySQL
    import requests

    # `cursor` and `conn` are the MySQL cursor and connection opened in step 2.

    class GetIp(object):
        # Delete an unusable IP
        def delete_ip(self, ip):
            # Parameterized query, so the value is escaped by the driver
            delete_sql = """DELETE FROM ip_pond WHERE ip=%s"""
            cursor.execute(delete_sql, (ip,))
            conn.commit()
            return True

        # Check whether an IP is usable
        def judge_ip(self, ip, port, proxy_type):
            # Validate against Baidu. requests chooses a proxy by the scheme of
            # the target URL, so the test URL uses the same scheme as the proxy key.
            proxy_url = '{0}://{1}:{2}'.format(proxy_type, ip, port)
            try:
                # Handle http and https separately; the timeout keeps a dead
                # proxy from hanging the check
                if proxy_type == 'http':
                    proxy_dict = {'http': proxy_url}
                    response = requests.get('http://www.baidu.com', proxies=proxy_dict, timeout=10)
                else:
                    proxy_dict = {'https': proxy_url}
                    response = requests.get('https://www.baidu.com', proxies=proxy_dict, verify=False, timeout=10)
            except Exception as e:
                print('invalid ip and port')
                self.delete_ip(ip)
                return False
            else:
                code = response.status_code
                if 200 <= code < 300:
                    print('effective ip')
                    return True
                else:
                    print('invalid ip and port')
                    self.delete_ip(ip)
                    return False

        # Run the query until it yields a working proxy; return None once the
        # table is empty. A loop is used rather than recursion so that a long
        # run of dead proxies cannot hit the recursion limit.
        def _pick_ip(self, select_sql):
            while True:
                cursor.execute(select_sql)
                ip_info = cursor.fetchone()
                if ip_info is None:
                    return None
                ip, port, proxy_type = ip_info[0], ip_info[1], ip_info[2]
                if self.judge_ip(ip, port, proxy_type):
                    return '{0}://{1}:{2}'.format(proxy_type, ip, port)

        # Pick one at random from the database
        def get_random_ip(self):
            random_sql = """SELECT ip,port,proxy_type,speed FROM ip_pond ORDER BY RAND() LIMIT 1"""
            return self._pick_ip(random_sql)

        # Pick the fastest one from the database (same as above, only the SQL differs)
        def get_optimum_ip(self):
            optimum_sql = """SELECT ip,port,proxy_type,speed FROM ip_pond ORDER BY speed LIMIT 1"""
            return self._pick_ip(optimum_sql)

        # A thin wrapper around the lookup, for convenience
        def get_proxies(self):
            ip = self.get_random_ip()
            print(ip)
            if ip is None:
                # No working proxy left in the pool
                return None
            proxy_type = ip.split(':')[0]
            proxies = {proxy_type: ip}
            return proxies

4. The proper way to use it
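    if __name__ == '__main__':
        # Requests are somewhat slow when the IP picked is an https one.
        # First check that ip_pond holds any rows (with drivers such as
        # pymysql, execute returns the row count)
        sql = """SELECT * FROM ip_pond"""
        check_table = cursor.execute(sql)
        if check_table:
            # URL used for testing; anything will do
            url = 'https://www.baidu.com'
            headers = {"User-Agent": "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"}
            # The thin wrapper from the previous step returns proxies directly
            proxies = GetIp().get_proxies()
            res = requests.get(url=url, headers=headers, proxies=proxies)
        else:
            # The pool is empty: repopulate it (update_ip_pond is not shown in this article)
            update_ip_pond()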