Scraping Xici (西刺) proxies with four Python threads

Published: 2023/12/10 python 24 豆豆

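This article, collected by 生活随笔, presents a Python script that scrapes the Xici proxy site with four threads, checks each proxy it finds, and stores the reachable ones in MySQL. It is shared here for reference.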

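import random
import telnetlib  # checks that a proxy port accepts connections; removed from the standard library in Python 3.13
import threading

import pymysql.cursors
import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package to be installed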

BASEURL = 'http://www.xicidaili.com/'  # Xici home page

# Listing pages for the four proxy categories on Xici
urls = [BASEURL + 'nn/', BASEURL + 'nt/', BASEURL + 'wn/', BASEURL + 'wt/']

# Request headers; a User-Agent is required
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}

# proxies = {'https': 'http://123.57.85.224:80', 'http': 'http://123.57.85.224:80'}

# Open a database connection and return it together with a cursor
def get_cc():
    # Connect to the MySQL database
    connection = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='root', db='iptables',
                                 charset='utf8', cursorclass=pymysql.cursors.DictCursor)
    # Create a cursor from the connection
    cursor = connection.cursor()
    return connection, cursor

# Save an ip_port to the database
def save_ip_port(ip_port):
    connection, cursor = get_cc()
    try:
        # Parameterized query, so ip_port is never spliced into the SQL text
        cursor.execute('insert into iptable(ip_port) values (%s)', (ip_port,))
    except Exception as exc:
        print('Failed to save ' + ip_port + ': ' + str(exc))
    else:
        connection.commit()
    finally:
        connection.close()

# Pick a random ip_port from the database
def get_ip_port():
    connection, cursor = get_cc()
    try:
        cursor.execute('select id,ip_port from iptable')
        # fetchall() returns every row as a dict; fetchone() would return a single row
        id_list = cursor.fetchall()
    finally:
        connection.close()
    if not id_list:
        raise RuntimeError('iptable is empty; no proxy is available')
    row = random.choice(id_list)
    # Return the id and the ip_port, e.g. 192.168.1.2:8080
    return row['id'], row['ip_port']

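# Delete a banned ip_port by its row id
def del_ip_port(id_num):
    connection, cursor = get_cc()
    try:
        cursor.execute('delete from iptable where id = %s', (id_num,))
    except Exception as exc:
        # The original message referenced ip_port, which is undefined in this function
        print('Failed to delete id ' + str(id_num) + ': ' + str(exc))
    else:
        connection.commit()
    finally:
        connection.close()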

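# Build the proxies dict that requests expects
def get_proxies(ip_port):  # ip_port = '192.168.2.45:8088'
    # The scheme in a proxy URL describes how to reach the proxy itself, so a
    # plain HTTP proxy uses http:// for both entries, as in the commented-out
    # proxies example near the top of the script
    proxy_url = 'http://' + ip_port
    return {'https': proxy_url, 'http': proxy_url}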

# Get the highest page number for a category URL
def get_max_pagenum(url):  # url is a category link: /nn, /nt, ...
    response = requests.get(url, headers=headers, timeout=5.0)
    soup = BeautifulSoup(response.content, 'lxml')
    # The second-to-last link in the pagination bar holds the last page number
    max_pagenum = soup.find('div', attrs={'class': 'pagination'}).find_all('a')[-2].string
    return int(max_pagenum)

# Check whether a proxy is usable, e.g. ip_port = '192.168.2.45:8088'
# Every ip_port is checked as soon as it is scraped: saved if reachable, discarded otherwise
def verifyProxyList(ip_port):
    host, port = ip_port.split(':')
    try:
        # A successful TCP connection only shows that the port is open, not that
        # the proxy forwards requests. A stricter check would send a real request
        # through it, e.g. requests.get('http://www.baidu.com', headers=headers,
        # proxies={'http': 'http://' + ip_port}, timeout=5.0)
        telnetlib.Telnet(host, port=int(port), timeout=5).close()
    except Exception:
        print('---Failure:' + ip_port)
    else:
        # Could be stored in Redis or a similar store instead
        save_ip_port(ip_port)

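# url is a listing page: /nn/1, /nn/2, ...
# retries is an added safeguard that bounds the re-requests; it is not in the original script
def main(url, proxies, retries=3):
    try:
        response = requests.get(url, headers=headers, proxies=proxies, timeout=5.0)
        if response.status_code != requests.codes.ok:  # e.g. 503 means the IP has been banned
            # Delete the old proxy. It is matched on ip_port because only the
            # proxies dict, not the row id, is available here
            old_ip_port = proxies['http'][7:]  # strip the leading 'http://'
            connection, cursor = get_cc()
            try:
                cursor.execute('delete from iptable where ip_port = %s', (old_ip_port,))
                connection.commit()
            finally:
                connection.close()
            # Switch to another proxy and request the page again
            if retries > 0:
                id_num, ip_port = get_ip_port()
                proxies = get_proxies(ip_port)
                print(str(proxies))
                main(url, proxies, retries - 1)
            return
        soup = BeautifulSoup(response.content, 'lxml')
        results = soup.find_all('tr')  # every table row
        for result in results[1:]:  # skip the first tr: it holds th cells, not td
            tdlist = result.find_all('td')
            ip_port = tdlist[1].string + ':' + tdlist[2].string
            verifyProxyList(ip_port)
    except Exception as exc:
        print('Request failed: ' + str(exc))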

class myThread(threading.Thread):
    def __init__(self, threadID, name, url):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.url = url

    def run(self):
        print('Running thread: ' + self.name)
        # Requires at least one proxy to be stored in iptable already
        id_num, ip_port = get_ip_port()
        proxies = get_proxies(ip_port)
        max_pagenum = get_max_pagenum(self.url)
        # print(max_pagenum)
        # Pages are numbered 1..max_pagenum inclusive; self.url already ends with '/'
        for i in range(1, max_pagenum + 1):
            url = self.url + str(i)
            main(url, proxies)

# Scrape the Xici proxy pool with 4 threads, one per category
if __name__ == '__main__':
    threads = [myThread(i + 1, 'Thread-' + str(i + 1), url) for i, url in enumerate(urls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
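
The port check in verifyProxyList relies on telnetlib, which was removed from the standard library in Python 3.13. Below is a minimal sketch of the same reachability test built on socket.create_connection. The helper name is_port_open is hypothetical and not part of the original script. Like the Telnet version, it only shows that the port accepts TCP connections, not that the proxy forwards traffic.

```python
import socket


def is_port_open(ip_port, timeout=5.0):
    """Return True if a TCP connection to 'host:port' succeeds.

    Hypothetical helper, not part of the original script.
    """
    try:
        host, port = ip_port.split(':')
        # create_connection raises OSError (timeouts included) on failure
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # ValueError covers a missing colon or a non-numeric port
        return False
```

Under these assumptions, verifyProxyList could call is_port_open(ip_port) and save the proxy only when it returns True.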

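The four hand-written myThread objects can also be expressed with concurrent.futures.ThreadPoolExecutor from the standard library. The sketch below is an alternative under stated assumptions, not the original author's approach: crawl_category and crawl_all are hypothetical names, and crawl_category is only a stand-in for the per-category work done in myThread.run, returning a label so that the example runs without network or database access.

```python
from concurrent.futures import ThreadPoolExecutor

BASEURL = 'http://www.xicidaili.com/'
urls = [BASEURL + 'nn/', BASEURL + 'nt/', BASEURL + 'wn/', BASEURL + 'wt/']


def crawl_category(url):
    """Hypothetical placeholder for the work done in myThread.run."""
    return 'done: ' + url


def crawl_all(category_urls, worker=crawl_category, max_workers=4):
    # One worker thread per category; map() yields results in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, category_urls))


if __name__ == '__main__':
    for line in crawl_all(urls):
        print(line)
```

Leaving the with block waits for every worker to finish, which takes the place of the explicit start() and join() calls.
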