Scraping every article in the historical classics category of 诗词名句网 (shicimingju.com) with XPath and multiprocessing. update~

Published: 2024/9/30 · 豆豆


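Last time I wrote a program to scrape this site, but it had some rough edges and it crawled slowly. Today I cleaned it up and turned on multiprocessing, and now it runs like it is strapped to a rocket.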
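The speed-up comes from handing a list of URLs to a process pool so that several books are crawled at once. As a minimal sketch of that pattern, assuming a stand-in worker (`count_chars` and `run_parallel` are hypothetical names, and nothing here touches the network):

```python
from multiprocessing import Pool


def count_chars(text):
    """Stand-in worker: any top-level function that takes one argument."""
    return len(text)


def run_parallel(items, processes=2):
    """Apply count_chars to every item using a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # map blocks until every task is done and keeps the input order
        return pool.map(count_chars, items)


if __name__ == "__main__":
    # The guard matters on Windows, where workers re-import this module
    print(run_parallel(["ab", "cde", ""]))
```

The worker has to be defined at module level so it can be pickled and sent to the child processes. In the scraper, the worker would be the function that crawls one book.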

# Required libraries
import os

import requests
from lxml import etree
from multiprocessing import Pool

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

BASE_URL = 'http://www.shicimingju.com'

# Folder where the text files are saved
pathname = r'E:\爬虫\诗词名句网'


# Collect every book's index URL, then crawl the books in parallel
def get_book(url, pool):
    try:
        # headers must be passed by keyword: the second positional
        # argument of requests.get is params, not headers
        response = requests.get(url, headers=headers)
        etrees = etree.HTML(response.text)
        url_infos = etrees.xpath('//div[@class="bookmark-list"]/ul/li')
        urls = []
        for i in url_infos:
            url_info = i.xpath('./h2/a/@href')
            book_name = i.xpath('./h2/a/text()')[0]
            print('Starting download: ' + book_name)
            urls.append(BASE_URL + url_info[0])
        # One task per book, spread across the worker processes
        pool.map(get_index, urls)
    except Exception as e:
        print('get_book failed:', e)


# Walk one book's table of contents and fetch each chapter
def get_index(url):
    try:
        response = requests.get(url, headers=headers)
        etrees = etree.HTML(response.text)
        url_infos = etrees.xpath('//div[@class="book-mulu"]/ul/li')
        for i in url_infos:
            url_info = i.xpath('./a/@href')
            get_content(BASE_URL + url_info[0])
    except Exception as e:
        print(e)


# Fetch one chapter and append it to the book's .txt file
def get_content(url):
    try:
        response = requests.get(url, headers=headers)
        etrees = etree.HTML(response.text)
        title = etrees.xpath('//div[@class="www-main-container www-shadow-card "]/h1/text()')[0]
        content = etrees.xpath('//div[@class="chapter_content"]/p/text()')
        if not content:
            # Some chapters keep their text directly in the div, without <p> tags
            content = etrees.xpath('//div[@class="chapter_content"]/text()')
        content = ''.join(content)
        book_name = etrees.xpath('//div[@class="nav-top"]/a[3]/text()')[0]
        with open(os.path.join(pathname, book_name + '.txt'), 'a+', encoding='utf-8') as f:
            f.write(title + '\n\n' + content + '\n\n\n')
        print(title + ' .. download complete')
    except Exception as e:
        print('get_content failed:', e)


# Program entry point
if __name__ == '__main__':
    url = 'http://www.shicimingju.com/book/'
    os.makedirs(pathname, exist_ok=True)
    # Create the process pool and start crawling
    with Pool() as pool:
        get_book(url, pool)
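To see what the XPath expressions in `get_book` pick out, here is a small offline sketch that runs the same expressions against an inline HTML snippet. The markup, book names, and hrefs in `SAMPLE_HTML` are made up for illustration and only mimic the structure the script expects. `extract_books` is a hypothetical helper, and `urljoin` is swapped in for the script's string concatenation.

```python
from urllib.parse import urljoin

from lxml import etree

BASE_URL = "http://www.shicimingju.com"

# Hypothetical markup shaped like the list the script's XPath targets
SAMPLE_HTML = """
<html><body>
<div class="bookmark-list">
  <ul>
    <li><h2><a href="/book/example-a.html">Example Book A</a></h2></li>
    <li><h2><a href="/book/example-b.html">Example Book B</a></h2></li>
  </ul>
</div>
</body></html>
"""


def extract_books(html):
    """Return (book_name, absolute_url) pairs found in the book list."""
    tree = etree.HTML(html)
    books = []
    for li in tree.xpath('//div[@class="bookmark-list"]/ul/li'):
        hrefs = li.xpath('./h2/a/@href')   # list of attribute values
        names = li.xpath('./h2/a/text()')  # list of text nodes
        if hrefs and names:                # skip items missing a link or a name
            books.append((names[0].strip(), urljoin(BASE_URL, hrefs[0])))
    return books


if __name__ == "__main__":
    for name, link in extract_books(SAMPLE_HTML):
        print(name, link)
```

Checking that both lists are non-empty before indexing avoids the `IndexError` that a bare `[0]` raises when the page layout changes.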

The console prints each chapter title as it finishes downloading. In the output folder you can see several files being written at the same time.
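Several files can grow at once without clobbering each other because each book is crawled by a single worker and every chapter is appended to that book's own file. A minimal sketch of the append step follows, using a temporary folder instead of the script's save path (`append_chapter` is a hypothetical helper that mirrors the write format in `get_content`):

```python
import os
import tempfile


def append_chapter(folder, book_name, title, content):
    """Append one chapter to <folder>/<book_name>.txt and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, book_name + ".txt")
    # Mode "a" adds to the end, so earlier chapters are kept
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n\n" + content + "\n\n\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        append_chapter(tmp, "demo_book", "Chapter 1", "first text")
        path = append_chapter(tmp, "demo_book", "Chapter 2", "second text")
        with open(path, encoding="utf-8") as f:
            print(f.read())
```

If two workers ever wrote to the same book file, their chapters could interleave, so this layout relies on the one-book-per-task split.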
