High-performance crawling
How can we run multiple tasks at the same time, and do so efficiently?
Serial execution
Running the requests one after another is the slowest and least desirable approach: each request blocks until the previous one has finished.

import requests

urls = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]

for url in urls:
    response = requests.get(url)
    print(response)
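To see why the serial version is the slowest, note that its total time is the sum of every request's latency. A minimal sketch of that cost, with time.sleep standing in for network latency (the delay value and the fetch helper are illustrative, not from the original):

```python
import time

def fetch(url, delay=0.1):
    # time.sleep stands in for network latency (hypothetical stand-in,
    # not an actual HTTP request)
    time.sleep(delay)
    return url

urls = ['http://www.baidu.com/', 'https://www.cnblogs.com/', 'https://cn.bing.com/']

start = time.time()
results = [fetch(u) for u in urls]   # each call blocks until the previous finishes
elapsed = time.time() - start
# Serial cost is the sum of the individual delays: roughly 0.3s for three URLs.
print(f'{elapsed:.2f}s for {len(results)} URLs')
```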
Multithreading
Spawning one thread per request works, but thread utilization is poor: threads are comparatively heavy, and they spend most of their time blocked on I/O.

import requests
import threading

urls = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]

def task(url):
    response = requests.get(url)
    print(response)

for url in urls:
    t = threading.Thread(target=task, args=(url,))
    t.start()
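One common way to bound the number of threads, instead of spawning one per URL, is a thread pool. A sketch using the standard library's concurrent.futures, with time.sleep standing in for requests.get so the example is self-contained (that substitution is an assumption, not part of the original):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(url, delay=0.1):
    # time.sleep stands in for requests.get here (assumption for a
    # self-contained demo without network access)
    time.sleep(delay)
    return url

urls = ['http://www.baidu.com/', 'https://www.cnblogs.com/', 'https://cn.bing.com/']

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    # map dispatches the tasks to worker threads and preserves input order
    results = list(pool.map(task, urls))
elapsed = time.time() - start
print(f'{elapsed:.2f}s')  # the sleeps overlap, so this is close to one delay, not three
```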
Coroutines + I/O switching
gevent uses greenlet internally (which implements coroutines) and switches to another task whenever the current one blocks on I/O. Coroutines are far cheaper than threads.

from gevent import monkey; monkey.patch_all()  # must run before requests is imported
import gevent
import requests

def func(url):
    response = requests.get(url)
    print(response)

urls = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]

spawn_list = []
for url in urls:
    spawn_list.append(gevent.spawn(func, url))  # create a coroutine per URL

gevent.joinall(spawn_list)  # wait for all of them to finish
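The same spawn/join pattern exists in the standard library's asyncio; where gevent monkey-patches blocking I/O, asyncio requires explicitly asynchronous code. A sketch using asyncio.sleep as a stand-in for an async HTTP request (a real crawler would use an async client such as aiohttp rather than the blocking requests; that substitution is an assumption):

```python
import asyncio

async def func(url):
    # asyncio.sleep stands in for an async HTTP request; calling the blocking
    # requests.get here would stall the whole event loop
    await asyncio.sleep(0.1)
    return url

async def main(urls):
    tasks = [asyncio.ensure_future(func(u)) for u in urls]  # like gevent.spawn
    return await asyncio.gather(*tasks)                     # like gevent.joinall

urls = ['http://www.baidu.com/', 'https://www.cnblogs.com/', 'https://cn.bing.com/']
results = asyncio.run(main(urls))  # gather preserves input order
print(results)
```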
Event loop
Twisted is an asynchronous, non-blocking framework built around an event loop (the reactor). Note that getPage has been deprecated in recent Twisted releases in favour of twisted.web.client.Agent, but it still illustrates the model well.

from twisted.web.client import getPage
from twisted.internet import defer, reactor

def stop_loop(arg):
    reactor.stop()

def get_response(contents):
    print(contents)

deferred_list = []
url_list = [
    'http://www.baidu.com/',
    'https://www.cnblogs.com/',
    'https://www.cnblogs.com/news/',
    'https://cn.bing.com/',
    'https://stackoverflow.com/',
]

for url in url_list:
    deferred = getPage(bytes(url, encoding='utf8'))  # schedules the download; nothing is fetched yet
    deferred.addCallback(get_response)               # callback invoked with the page body
    deferred_list.append(deferred)                   # collect every task in one list

dlist = defer.DeferredList(deferred_list)  # fires once every task has a result
dlist.addBoth(stop_loop)                   # stop the reactor when all tasks are done
reactor.run()
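At the heart of this model is the Deferred: a placeholder for a result that has not arrived yet, carrying a chain of callbacks that fire in order once it does. A toy illustration of the idea (a deliberately simplified stand-in, not Twisted's actual implementation):

```python
class MiniDeferred:
    """Toy stand-in for Twisted's Deferred: callbacks registered with
    addCallback fire in order once a result arrives."""
    def __init__(self):
        self.callbacks = []
        self.result = None
        self.fired = False

    def addCallback(self, fn):
        if self.fired:
            self.result = fn(self.result)   # result already here: run immediately
        else:
            self.callbacks.append(fn)       # otherwise queue for later
        return self

    def callback(self, result):
        # The "I/O finished" event: run the queued chain, each callback
        # receiving the previous one's return value.
        self.result = result
        self.fired = True
        for fn in self.callbacks:
            self.result = fn(self.result)
        self.callbacks = []

d = MiniDeferred()
d.addCallback(lambda body: body.upper())
d.addCallback(lambda body: len(body))
d.callback('hello')   # the "page downloaded" event
print(d.result)       # 'hello' -> 'HELLO' -> 5
```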
Reposted from: https://www.cnblogs.com/shijieli/p/10360799.html