Scraping novels from Qidian (起点中文网) with a Python crawler
Hello everyone! This article walks through building a Python crawler for Qidian (起点中文网), a novel site run by China Literature (阅文集团). The idea comes from a project of mine, a desktop assistant called 启帆助手.

What follows is part of that project's source code.
Preparation
Libraries used:
- urllib.request
- lxml.etree
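`urllib.request` ships with Python; `lxml` is third-party (`pip install lxml`). The XPath pattern the crawler relies on can be tried offline first. The HTML snippet below is a made-up stand-in for Qidian's chapter-list markup, just to show the pattern:

```python
from lxml import etree

# Hypothetical stand-in for the chapter-list markup scraped later
snippet = """
<ul class="cf">
  <li><a href="//read.qidian.com/chapter/aaa">Chapter 1</a></li>
  <li><a href="//read.qidian.com/chapter/bbb">Chapter 2</a></li>
</ul>
"""
tree = etree.HTML(snippet)
titles = tree.xpath('//ul[@class="cf"]/li/a/text()')  # text inside each <a>
hrefs = tree.xpath('//ul[@class="cf"]/li/a/@href')    # href attribute of each <a>
print(titles)   # ['Chapter 1', 'Chapter 2']
```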
Code walkthrough
1. Import the two libraries listed above.
2. Set the request headers and the novel's URL (the URL here points to a book written by the author):
```python
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"
```

3. Fetch the book page, parse it, and extract each chapter's title and link:
```python
req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links
# print(Lit_tit_list)
# print(Lit_href_list)
```

4. Fetch each chapter and save it to a .txt file:
```python
for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src                      # the href is protocol-relative
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    # note: mode 'a' appends, so re-running the script duplicates content
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
```

Full code
```python
from urllib import request
from lxml import etree

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"

req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links

for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
```

Results
Here are the scraped .txt files:
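One caveat about those filenames: each file is named directly after the chapter title, and a title containing a character such as `?` or `:` is illegal in a Windows filename, so `open()` would fail. A small sanitizing helper could be added before building `file_name`; `safe_filename` below is a hypothetical addition, not part of the original script:

```python
import re

def safe_filename(title):
    # Replace characters Windows forbids in filenames (hypothetical helper)
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('Chapter 3: What Now?'))   # Chapter 3_ What Now_
```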
Alright, that's it for this article. Bye!