Scraping novels from Qidian (起点中文网) with a Python crawler
Hello everyone! This article walks through building a Python crawler for Qidian (起点中文网), a novel site run by China Literature (阅文集团). The idea comes from a project of mine, a desktop assistant called 启帆助手.

What follows is part of that project's source code.
Preparation
Libraries used:
- urllib.request
- lxml.etree
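`urllib.request` ships with Python; `lxml` is third-party (`pip install lxml`). The XPath pattern the crawler relies on can be tried offline first. The HTML snippet below is a made-up stand-in for Qidian's chapter-list markup, just to show the pattern:

```python
from lxml import etree

# Hypothetical stand-in for the chapter-list markup scraped later
snippet = """
<ul class="cf">
  <li><a href="//read.qidian.com/chapter/aaa">Chapter 1</a></li>
  <li><a href="//read.qidian.com/chapter/bbb">Chapter 2</a></li>
</ul>
"""
tree = etree.HTML(snippet)
titles = tree.xpath('//ul[@class="cf"]/li/a/text()')  # text inside each <a>
hrefs = tree.xpath('//ul[@class="cf"]/li/a/@href')    # href attribute of each <a>
print(titles)   # ['Chapter 1', 'Chapter 2']
```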
Code walkthrough
1. Import the two libraries listed above.
2. Set the request headers and the novel's URL (the URL here points to a book written by the author):
```python
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"
```

3. Fetch the book page, parse it, and extract each chapter's title and link:
```python
req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links
# print(Lit_tit_list)
# print(Lit_href_list)
```

4. Fetch each chapter and save it to a .txt file:
```python
for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src                      # the href is protocol-relative
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    # note: mode 'a' appends, so re-running the script duplicates content
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
```

Full code
```python
from urllib import request
from lxml import etree

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'}
url = "https://book.qidian.com/info/1020546097"

req = request.Request(url, headers=header)
html = request.urlopen(req).read().decode('utf-8')
html = etree.HTML(html)
Lit_tit_list = html.xpath('//ul[@class="cf"]/li/a/text()')   # chapter titles
Lit_href_list = html.xpath('//ul[@class="cf"]/li/a/@href')   # chapter links

for tit, src in zip(Lit_tit_list, Lit_href_list):
    url = "http:" + src
    req = request.Request(url, headers=header)
    html = request.urlopen(req).read().decode('utf-8')
    html = etree.HTML(html)
    text_list = html.xpath('//div[@class="read-content j_readContent"]/p/text()')
    text = "\n".join(text_list)
    file_name = tit + ".txt"
    print("Fetching chapter: " + file_name)
    with open(file_name, 'a', encoding="utf-8") as f:
        f.write("\t" + tit + '\n' + text)
```

Results
Here are the scraped .txt files:
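One caveat about those filenames: each file is named directly after the chapter title, and a title containing a character such as `?` or `:` is illegal in a Windows filename, so `open()` would fail. A small sanitizing helper could be added before building `file_name`; `safe_filename` below is a hypothetical addition, not part of the original script:

```python
import re

def safe_filename(title):
    # Replace characters Windows forbids in filenames (hypothetical helper)
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

print(safe_filename('Chapter 3: What Now?'))   # Chapter 3_ What Now_
```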
Alright, that's it for this article. Bye!