pool python passing arguments_Python crawler: multithreading and a simulated thread pool (urllib, requests, UserAgent, timeouts, etc.)...

Published: 2024/1/23 python 24 豆豆

Following on from the earlier MonkeyLei post "Python: scraping page content (urllib, requests, UserAgent, Json, etc.)", this time I practice multithreading and simulate a thread pool.

Here is what I want to do:

1. Create a thread pool with an initial size of 16 (if no thread is available, the pool is topped up to 16 again; thread numbers currently repeat after a top-up)

2. Load the URL list into a Queue

3. Loop once per URL in the list (the loop does not read the URL itself, it only starts a thread to handle one), starting a thread on each pass; that thread takes a URL from the queue and crawls it

4 (attention). The main thread then waits for the child threads to finish (with a check on whether each thread is still alive; if it has already finished, there is no need to join it)

5 (attention). The network requests use a timeout; GitHub is slow in this test and I can't be bothered to wait
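
A note on step 4: `Thread.join()` returns immediately when the thread has already finished, so the liveness check is optional rather than required. A minimal sketch of that behaviour, where `work` is a hypothetical stand-in for the crawl and not part of the project:

```python
import threading
import time


def work(delay):
    # Hypothetical stand-in for a crawl: just sleep for a moment.
    time.sleep(delay)


threads = [threading.Thread(target=work, args=(0.01 * i,)) for i in range(4)]
for t in threads:
    t.start()

time.sleep(0.1)  # by now the threads have most likely finished already

for t in threads:
    # Safe whether or not the thread is still alive; no is_alive() check needed.
    t.join()

all_done = not any(t.is_alive() for t in threads)
print('all threads finished:', all_done)
```

Keeping the check does no harm; it just saves a call that would return at once anyway.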

So, on to the code.

thread_pool.py

    #!/usr/bin/python3
    # -*- coding: UTF-8 -*-
    # File: thread_pool.py
    from queue import Empty, Queue
    from threading import Thread
    import time
    from urllib import request

    thread_pool_len = 16
    threads_pool = []
    running_thread = []
    url_list = [
        'http://www.baidu.com',
        'https://github.com/FanChael/DocPro',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'https://github.com/FanChael/DocPro',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'https://github.com/FanChael',
        'https://github.com/FanChael',
    ]

    # Length of the URL list
    url_len = len(url_list)
    # Create the queue and fill it
    queue = Queue(url_len)
    for url in url_list:
        queue.put(url)

    # Pose as a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    }


    # Custom thread
    class my_thread(Thread):
        def __init__(self):
            Thread.__init__(self)

        def run(self):
            # get_nowait() avoids the race between checking empty() and calling get()
            try:
                url = queue.get_nowait()
            except Empty:
                return
            print(self.name, 'running')
            data = ''
            try:
                req = request.Request(url, None, headers)
                with request.urlopen(req, timeout=5) as uf:
                    while True:
                        data_temp = uf.read(1024)
                        if not data_temp:
                            break
                        data += data_temp.decode('utf-8', 'ignore')
                # print('Thread', self.name, 'fetched data =', data)
            except Exception as err:
                print(self.name, str(err))


    # Initialize the thread pool (tops the pool up to `count` threads)
    def init_thread(count):
        thread_count = len(threads_pool)
        for i in range(thread_count, count):
            thread = my_thread()
            thread.name = 'Thread #' + str(i)
            threads_pool.append(thread)


    # Get an available thread. Optimization idea: scanning the list every time is
    # inefficient; each thread could be wrapped in an object with a status flag that
    # is flipped when it finishes. That still needs a loop, though; after taking a
    # certain number, or when nearing the end, scan again from the start.
    def get_available():
        for c_thread in threads_pool:
            # is_alive() replaces isAlive(), which was removed in Python 3.9
            if not c_thread.is_alive():
                threads_pool.remove(c_thread)
                return c_thread
        # Grow the pool
        init_thread(thread_pool_len)
        return get_available()


    if __name__ == '__main__':
        # Initialize the thread pool
        init_thread(thread_pool_len)
        # Start time
        start_time = time.time()
        # Start threads that take URLs from the queue and send the requests
        for i in range(url_len):
            a_thread = get_available()
            if a_thread:
                running_thread.append(a_thread)
                a_thread.start()
        # The main thread waits for all child threads to finish
        for t in running_thread:
            if t.is_alive():
                t.join()
        # End time
        end_time = time.time()
        print(len(running_thread), 'threads,', 'elapsed:', end_time - start_time, 'seconds')
        print('Idle threads:', len(threads_pool))
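
The listing above starts one thread per URL. A common alternative, sketched here under assumptions, is a fixed number of worker threads that each keep pulling from the queue until it is empty. `fake_fetch` and `crawl_all` are hypothetical names, the `.example` URLs are placeholders, and `fake_fetch` replaces the real network call so the sketch runs offline:

```python
import threading
from queue import Empty, Queue


def fake_fetch(url):
    # Hypothetical stand-in for request.urlopen(...); returns a fake body.
    return 'body of ' + url


def crawl_all(urls, worker_count=4, fetch=fake_fetch):
    url_queue = Queue()
    for url in urls:
        url_queue.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = url_queue.get_nowait()
            except Empty:
                return  # nothing left to do
            try:
                body = fetch(url)
            except Exception as err:
                body = 'error: ' + str(err)
            with lock:
                results[url] = body

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


pages = crawl_all(['http://a.example', 'http://b.example', 'http://c.example'])
print(len(pages), 'pages fetched')
```

With this shape the number of threads no longer depends on the number of URLs, and no thread ever needs to be handed out or replaced.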

Result:

    D:\PycharmProjects\python_study\venv3.x\Scripts\python.exe D:/PycharmProjects/python_study/protest/thread_pool.py
    Thread #0 running
    Thread #1 running
    Thread #2 running
    Thread #3 running
    Thread #4 running
    Thread #5 running
    Thread #6 running
    Thread #7 running
    Thread #8 running
    Thread #9 running
    Thread #0 running
    Thread #1 running
    Thread #2 running
    Thread #1 <urlopen error timed out>
    Thread #2 The read operation timed out
    13 threads, elapsed: 20.04409170150757 seconds
    Idle threads: 7

    Process finished with exit code 0

This output appears to come from a run with a pool size of 10 rather than 16: threads 0 to 9 are handed out, the pool is refilled, threads 0 to 2 are handed out again, and 7 remain idle. With the size of 16 set in the listing, no refill would happen and 3 threads would remain.
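
The two timeout messages in the output look different because they are different exception types. As far as I can tell from how `urllib` behaves, a timeout while connecting or sending the request is wrapped in `URLError`, while a timeout while reading the response is raised directly as `socket.timeout`. A small sketch that tells them apart; `describe_error` is a hypothetical helper, and the two exceptions are constructed by hand so that no network is needed:

```python
import socket
from urllib.error import URLError


def describe_error(err):
    # A timeout while connecting is wrapped in URLError; the original
    # exception is available as err.reason.
    if isinstance(err, URLError) and isinstance(err.reason, socket.timeout):
        return 'connect timeout'
    # A timeout while reading the response is raised directly.
    if isinstance(err, socket.timeout):
        return 'read timeout'
    return 'other error'


connect_err = URLError(socket.timeout('timed out'))
read_err = socket.timeout('The read operation timed out')
print(str(connect_err), '->', describe_error(connect_err))
print(str(read_err), '->', describe_error(read_err))
```

A crawler could use a distinction like this to retry reads but give up on hosts that never accept a connection.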

Practice project: https://gitee.com/heyclock/doc/tree/master/Python/python_study

Addendum: I will also look at the mainstream thread-pool crawler approaches (for the official thread pool, see "Python thread pool ThreadPoolExecutor: usage and practice"), study them, and then add to this post.

threadpoolexecutor_practice.py

    #!/usr/bin/python3
    # -*- coding: UTF-8 -*-
    # File: threadpoolexecutor_practice.py
    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED, ALL_COMPLETED, as_completed
    from urllib import request

    # Pose as a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.70 Safari/537.36',
    }

    url_list = [
        'http://www.baidu.com',
        'https://github.com/FanChael/DocPro',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'https://github.com/FanChael/DocPro',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'http://www.baidu.com',
        'https://github.com/FanChael',
        'https://github.com/FanChael',
    ]


    def spider(url_path):
        data_html = ''
        try:
            req = request.Request(url_path, None, headers)
            # If the fetched content is wrong, selenium or similar is needed
            # to get the JS-rendered content
            with request.urlopen(req, timeout=5) as uf:
                while True:
                    data_temp = uf.read(1024)
                    if not data_temp:
                        break
                    data_html += data_temp.decode('utf-8', 'ignore')
            # The fetched data can be saved locally or to a database;
            # in short, post-process it from here
            print(url_path, " done")
        except Exception as err:
            print(str(err))
        return data_html


    # Create a thread pool with at most 16 worker threads
    executor = ThreadPoolExecutor(max_workers=16)

    if __name__ == '__main__':
        tasks = []
        # Submit the spider for each URL and collect the futures
        for url in url_list:
            # Run the function, passing in the argument
            task = executor.submit(spider, url)
            tasks.append(task)
        # Wait option 1
        # wait(tasks, return_when=ALL_COMPLETED)
        # Wait option 2
        for future in as_completed(tasks):
            # If spider returned nothing, the result here would be None
            data = future.result()
            print(f"main:{data[0:10]}")
        # Wait option 3: replaces submit and also waits!
        # for data in executor.map(spider, url_list):
        #     print(data)
        print('All done')
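
With `as_completed`, results arrive in completion order, so it is no longer obvious which URL a result belongs to. One way to keep track, sketched here under assumptions, is a dict that maps each future back to its URL. `fake_spider` is a hypothetical stand-in for `spider` that fails for one URL on purpose, and the `.example` URLs are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_spider(url_path):
    # Hypothetical stand-in for spider(); fails for one URL on purpose.
    if 'bad' in url_path:
        raise ValueError('simulated failure for ' + url_path)
    return '<html>' + url_path + '</html>'


urls = ['http://a.example', 'http://bad.example', 'http://c.example']
outcome = {}

# The with-block shuts the pool down once all submitted tasks are done.
with ThreadPoolExecutor(max_workers=4) as pool:
    future_to_url = {pool.submit(fake_spider, u): u for u in urls}
    for future in as_completed(future_to_url):
        url = future_to_url[future]
        try:
            outcome[url] = future.result()
        except Exception as err:
            # An exception raised in the worker is re-raised by result().
            outcome[url] = 'failed: ' + str(err)

for url in urls:
    print(url, '->', outcome[url])
```

Note that a dict keyed by URL only works when the URLs are unique; the `url_list` above contains repeats, so a real run would need a different key or a list of pairs.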

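The official thread pool is simpler to use, since thread management is already handled for you. If you click through and read the source, you can see roughly the same ideas: it grows the pool in a similar way, wraps the call, and puts every task into a queue. For example, the following piece of source: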
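
To illustrate that design, here is a simplified sketch; it is not the actual `concurrent.futures` source. `TinyPool` and everything in it are hypothetical names, and a plain dict with an `Event` stands in for a real future:

```python
import threading
from queue import Queue


class TinyPool:
    """Simplified illustration only; not the concurrent.futures source."""

    def __init__(self, max_workers):
        self.max_workers = max_workers
        self.work_queue = Queue()
        self.threads = []
        self.lock = threading.Lock()

    def submit(self, fn, *args):
        result_box = {'done': threading.Event()}
        # Every task goes into the queue first.
        self.work_queue.put((fn, args, result_box))
        # Grow the pool lazily: add a worker only while below the limit.
        with self.lock:
            if len(self.threads) < self.max_workers:
                t = threading.Thread(target=self._worker, daemon=True)
                self.threads.append(t)
                t.start()
        return result_box

    def _worker(self):
        while True:
            item = self.work_queue.get()
            if item is None:
                return  # shutdown signal
            fn, args, result_box = item
            try:
                result_box['value'] = fn(*args)
            except Exception as err:
                result_box['error'] = err
            result_box['done'].set()

    def shutdown(self):
        # One None per worker, queued after all pending tasks.
        for _ in self.threads:
            self.work_queue.put(None)
        for t in self.threads:
            t.join()


pool = TinyPool(max_workers=2)
boxes = [pool.submit(pow, 2, n) for n in range(5)]
for box in boxes:
    box['done'].wait()
values = [box['value'] for box in boxes]
pool.shutdown()
print(values)
```

The difference from the hand-written pool earlier is that workers here are reused: each one loops over the queue instead of handling a single URL and exiting.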

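For thread pool practice with better encapsulation (get your own first version working, then package it as a standalone module that takes its parameters from outside), see for example https://blog.csdn.net/Key_book/article/details/80258022

OK, that's it for now... Next up: database connections and regex matching. After that I should be about ready to look at the company project... The rest I'll dig into later...

Appendix: https://blog.csdn.net/Key_book/article/details/80258022 - Python crawler: urllib, browser spoofing, timeout settings, exception handling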
