Using timeout limits in a web crawler
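Setting a timeout
There are many ways to do this; the examples below use urllib.
The page requested in the examples is the official Python website.
Without a timeout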
>>> import urllib.request
>>> response = urllib.request.urlopen('http://www.python.org')
>>>
With a timeout
Case 1: timeout = 0.1
>>> response = urllib.request.urlopen('http://www.python.org', timeout=0.1)
Traceback (most recent call last):
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 934, in _send_output
    self.send(msg)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 877, in send
    self.connect()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 849, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 712, in create_connection
    raise err
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\socket.py", line 703, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 466, in open
    response = self._open(req, data)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 484, in _open
    '_open', req)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error timed out>
Case 2: timeout = 0.5
>>> response = urllib.request.urlopen('http://www.python.org', timeout=0.5)
Traceback (most recent call last):
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 934, in _send_output
    self.send(msg)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 877, in send
    self.connect()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\http\client.py", line 1261, in connect
    server_hostname=server_hostname)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 385, in wrap_socket
    _context=self)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 760, in __init__
    self.do_handshake()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 996, in do_handshake
    self._sslobj.do_handshake()
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\ssl.py", line 641, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:703: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 472, in open
    response = meth(req, response)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 504, in error
    result = self._call_chain(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 466, in open
    response = self._open(req, data)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 484, in _open
    '_open', req)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 444, in _call_chain
    result = func(*args)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1297, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\lijy2\AppData\Local\Programs\Python\Python35-32\lib\urllib\request.py", line 1256, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error _ssl.c:703: The handshake operation timed out>
Case 3: timeout = 1
>>> response = urllib.request.urlopen('http://www.python.org', timeout=1)
>>>
Analysis
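Once a timeout is set, a request that exceeds it raises an error and the task ends there. In exchange, no single request can block indefinitely. Note that urllib applies the timeout to each blocking operation, such as the connection attempt, rather than to the request as a whole. The two tracebacks show this: in Case 1 the TCP connection itself timed out, while in Case 2 the connection succeeded, the server answered with a 302 redirect to HTTPS, and it was the TLS handshake that timed out.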
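In a script, the error can be caught instead of being left to end the task. The following is a minimal sketch of that idea, not something from the original post: the helper name fetch and its opener parameter are assumptions made for illustration, and opener exists only so the function can be exercised without network access.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if the request timed out.

    `fetch` and `opener` are hypothetical names for this sketch; `opener`
    defaults to the real urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        # A timeout while connecting is wrapped in URLError; the original
        # exception is kept in err.reason.
        if isinstance(err.reason, socket.timeout):
            return None
        raise  # any other network error is left for the caller
    except socket.timeout:
        # A timeout while reading the body is raised directly.
        return None
```

Returning None on timeout is only one possible convention; re-raising a custom exception would work just as well if the caller needs to tell "timed out" apart from "empty body".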
Application
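Suppose we build a concurrent crawler (for example with coroutines or threads). If no timeout is set and one worker is stuck waiting for a response, that worker stays occupied and the rest of the crawl ends up waiting on it. (Other threads or coroutines can run in the meantime, but switching between them also takes time.) If the timeout is set to a reasonably small value, a stalled thread (or coroutine) gives up after only a short wait. A request that fails is recorded first, and the failed items are processed afterwards.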
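The idea in the paragraph above can be sketched with a thread pool. This is an illustration under assumptions, not the author's implementation: the names download and crawl, the fetch parameter, and the default of 8 workers are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import urllib.request


def download(url, timeout):
    """Fetch one URL; a timeout or network error propagates to the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, timeout, fetch=download, workers=8):
    """Fetch all urls concurrently and return (results, failed).

    results maps url -> body. failed lists the urls whose request raised
    OSError, so they can be handled later. `fetch` is a hypothetical hook
    that defaults to `download`.
    """
    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, url, timeout): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except OSError:
                # URLError and socket.timeout are both OSError subclasses.
                failed.append(url)
    return results, failed
```

Because every request carries the same timeout, no worker thread can be held by a single slow server for much longer than that, and the failed list is what a later retry pass would consume.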
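The timeouts can also be tiered: a request that fails once is moved to a queue with a longer timeout, so the crawl jobs are scheduled in the style of a multi-level feedback queue (MLFQ).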
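A simplified sketch of the tiered scheme follows. It is sequential and leaves out the rest of a real MLFQ scheduler (priority boosting, time slices, concurrency). The function name crawl_tiered, the fetch callback, and the timeout values 0.5, 2.0 and 8.0 seconds are assumptions chosen for illustration.

```python
from collections import deque


def crawl_tiered(urls, fetch, timeouts=(0.5, 2.0, 8.0)):
    """Try each url with progressively longer timeouts.

    Queue i uses timeouts[i]. A url that fails at one level is demoted to
    the next, slower level; a url that fails at the last level ends up in
    `failed`. `fetch(url, timeout)` is a hypothetical callback that returns
    the body or raises OSError (which covers URLError and socket.timeout).
    """
    queues = [deque() for _ in timeouts]
    queues[0].extend(urls)
    results, failed = {}, []
    for level, timeout in enumerate(timeouts):
        while queues[level]:
            url = queues[level].popleft()
            try:
                results[url] = fetch(url, timeout)
            except OSError:
                if level + 1 < len(timeouts):
                    queues[level + 1].append(url)  # retry later, more patiently
                else:
                    failed.append(url)
    return results, failed
```

Fast servers are served at the first level without delay, and only the urls that have already proven slow pay for the longer waits.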
This approach works well when network quality is unstable, since in many cases there is no need to wait that long for a response.