Asynchronous Crawler in Python (Scraping a 12 MB Novel in 30 Seconds!)
Note: the code below currently only works with the novel site http://www.biquge.com.tw/; any novel hosted on that site can be downloaded directly.
The scraper we wrote previously could download a 15 MB novel, but it was too slow and easily got blocked. This version builds on that earlier code and improves it.
For learning purposes only! Not for commercial use; it will be taken down if it infringes.
The target novel runs to about 5.452 million characters, so scraping it is still a sizable job (with the earlier approach, I would not believe it could finish in under a few hours…).
url = http://www.biquge.com.tw/3_3711/
This time the crawler uses IP proxies, custom request headers, the lxml parser, and multiple coroutines (concurrency), among other techniques.
Where the earlier code needed several hours for a comparable download, this version takes only about 30 seconds.
Hahaha, it really is that much faster!
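Before the full script, here is a minimal, standalone sketch (not part of the scraper itself) of the gevent pattern it relies on: monkey.patch_all() makes the standard blocking socket calls cooperative, so many requests can be in flight at once via gevent.spawn / gevent.joinall. The fetch helper and the repeated example URL are illustration only.

from gevent import monkey
monkey.patch_all()        # gevent docs recommend patching as early as possible

import gevent
import requests

def fetch(url):
    # Looks like a blocking call, but yields to other greenlets while waiting on I/O.
    return requests.get(url, timeout=10).status_code

urls = ['http://www.biquge.com.tw/3_3711/'] * 5      # illustrative: fetch the index page 5 times
jobs = [gevent.spawn(fetch, u) for u in urls]        # schedule 5 concurrent downloads
gevent.joinall(jobs)                                 # wait for all of them to finish
print([job.value for job in jobs])                   # HTTP status codes

The full script below follows the same spawn/joinall pattern, just in batches of 200 chapters.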
import requests
import os
import gevent
from gevent import monkey
import random
import re
from lxml import etree

# Make blocking network I/O cooperative so greenlets can run concurrently.
monkey.patch_all(select=False)

from urllib import parse
import time


def setDir():
    # Create a working directory for the downloaded chapters.
    if 'Noval' not in os.listdir('./'):
        os.mkdir('./Noval')


def getNoval(url, id):
    # Download one chapter page and save it as ./Noval/<id>.txt
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cookie': '__cfduid=d820fcba1e8cf74caa407d320e0af6b5d1518500755; UM_distinctid=1618db2bfbb140-060057ff473277-4323461-e1000-1618db2bfbc1e4; ctrl_time=1; CNZZDATA1272873873=2070014299-1518497311-https%253A%252F%252Fwww.baidu.com%252F%7C1518507528; yjs_id=69163e1182ffa7d00c30fa85105b2432; jieqiVisitTime=jieqiArticlesearchTime%3D1518509603'}
    IPs = [{'HTTPS': 'https://115.237.16.200:8118'},
           {'HTTPS': 'https://42.49.119.10:8118'},
           {'HTTPS': 'http://60.174.74.40:8118'}]
    IP = random.choice(IPs)                           # pick a random proxy for each request
    res = requests.get(url, headers=headers, proxies=IP)
    res.encoding = 'GB18030'                          # the site serves GB-encoded pages
    html = res.text.replace('&nbsp;', ' ')            # strip &nbsp; entities from the raw HTML
    page = etree.HTML(html)
    content = page.xpath('//div[@id="content"]')      # chapter body
    ps = page.xpath('//div[@class="bookname"]/h1')    # chapter title
    if len(ps) != 0:
        s = ps[0].text + '\n'
        s = s + content[0].xpath("string(.)")
        with open('./Noval/%d.txt' % id, 'w', encoding='gb18030', errors='ignore') as f:
            f.write(s)


def getContentFile(url):
    # Fetch the table-of-contents page; return (list of chapter URLs, book name).
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cookie': '__cfduid=d820fcba1e8cf74caa407d320e0af6b5d1518500755; UM_distinctid=1618db2bfbb140-060057ff473277-4323461-e1000-1618db2bfbc1e4; ctrl_time=1; CNZZDATA1272873873=2070014299-1518497311-https%253A%252F%252Fwww.baidu.com%252F%7C1518507528; yjs_id=69163e1182ffa7d00c30fa85105b2432; jieqiVisitTime=jieqiArticlesearchTime%3D1518509603'}
    IPs = [{'HTTPS': 'https://115.237.16.200:8118'},
           {'HTTPS': 'https://42.49.119.10:8118'},
           {'HTTPS': 'http://60.174.74.40:8118'}]
    IP = random.choice(IPs)
    res = requests.get(url, headers=headers, proxies=IP)
    res.encoding = 'GB18030'
    page = etree.HTML(res.text)
    bookname = page.xpath('//div[@id="info"]/h1')[0].xpath('string(.)')
    dl = page.xpath('//div[@id="list"]/dl/dd/a')      # one <a> per chapter
    splitHTTP = parse.urlsplit(url)
    url = splitHTTP.scheme + '://' + splitHTTP.netloc # chapter hrefs are relative to the site root
    return list(map(lambda x: url + x.get('href'), dl)), bookname


def BuildGevent(baseurl):
    # Download chapters in batches of `steps` greenlets, then merge each batch
    # into a single book file in chapter order.
    content, bookname = getContentFile(baseurl)
    steps = 200
    beginIndex, length = steps, len(content)
    count = 0
    name = "%s.txt" % bookname
    while (count - 1) * steps < length:
        WaitigList = [gevent.spawn(getNoval, content[i + count * steps], i + count * steps)
                      for i in range(steps)
                      if i + count * steps < length]
        gevent.joinall(WaitigList)                    # wait for the whole batch to finish
        # Collect the numbered chapter files written by this batch, sorted by chapter index.
        NovalFile = list(filter(lambda x: x[:x.index('.')].isdigit(), os.listdir('./Noval')))
        NovalFile.sort(key=lambda x: int(re.match(r'\d+', x).group()))
        String = ''
        for dirFile in NovalFile:
            with open('./Noval/' + dirFile, 'r', encoding='gb18030', errors='ignore') as f:
                String = String + '\n' + f.read()
            os.remove('./Noval/%s' % dirFile)         # chapter file has been merged, delete it
        # The first batch creates the book file; later batches append to it.
        if count == 0:
            with open('./Noval/' + name, 'w', encoding='gb18030', errors='ignore') as ff:
                ff.write(String)
        else:
            with open('./Noval/' + name, 'a', encoding='gb18030', errors='ignore') as ff:
                ff.write(String)
        count += 1


if __name__ == '__main__':
    starttime = time.time()
    setDir()
    url = 'http://www.biquge.com.tw/3_3711/'
    BuildGevent(url)
    endtime = time.time()
    print("Total use time: %.6f" % (endtime - starttime))