Python Crawler: A Book-Scraping Example
Scraping text from web pages, illustrated by downloading an entire book.
I. Steps
① First, get the h1 on the book's table-of-contents page (the novel's title), use it as the folder name, and create the folder.
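A slightly safer variant of this folder-creation step is `os.makedirs` with `exist_ok=True`, which does not raise if the folder already exists when the script is re-run (a minimal sketch; the folder name here is made up):

```python
import os
import tempfile

# Hypothetical folder name standing in for the scraped book title.
folder = os.path.join(tempfile.gettempdir(), "demo_book")
os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
print(os.path.isdir(folder))
```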
```python
# Set up the storage folder
FName = text1.findAll('h1')[1].text
if not os.path.exists(FName):
    os.mkdir(FName)
```

② From the table-of-contents page of the book to be downloaded, scrape each chapter's link.
```python
# Collect the chapter links listed on the contents page
t = '<a style="" href="(.*?)">'
AllUrl = re.findall(t, response.text)
```

③ Fetch the text under each chapter. The chapter title is used as the name of the stored .txt file, and the corresponding text is written into it.
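The chapter-link regex from step ② can be exercised offline on a hand-written snippet of contents-page markup (the URLs below are made up for illustration; real markup varies by site):

```python
import re

# Hypothetical fragment of a contents page.
sample = ('<a style="" href="/book/1001.html">Chapter 1</a>'
          '<a style="" href="/book/1002.html">Chapter 2</a>')
t = '<a style="" href="(.*?)">'
links = re.findall(t, sample)
print(links)  # ['/book/1001.html', '/book/1002.html']
```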
Note: some chapter titles contain characters that are not valid in file names, so:
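One way to handle this is to strip the forbidden characters wherever they occur in the name, not just at the end (a sketch; the function name is made up):

```python
import re

def sanitize(name):
    # Remove every character Windows forbids in file names, wherever it appears.
    return re.sub(r'[\\/:*?"<>|~]', '', name).strip()

print(sanitize('Chapter 1: The Beginning?'))  # 'Chapter 1 The Beginning'
```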
```python
# Check the file name and strip a trailing character that is not allowed in file names
for NoName in ["?", "/", "~", "*", "<", ">", ":", "|"]:
    if fileName[-1] == NoName:
        fileName = fileName[0:len(fileName) - 1]
```

II. Complete Code
```python
import requests
import re
from bs4 import BeautifulSoup
import os

dicF = input("Enter the URL of the book's table-of-contents page:" + "\n")
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0'
}
url = dicF
response = requests.get(url, headers=headers)
text1 = BeautifulSoup(response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'), 'lxml')

# Set up the storage folder
FName = text1.findAll('h1')[1].text
if not os.path.exists(FName):
    os.mkdir(FName)

# Collect the chapter links listed on the contents page
t = '<a style="" href="(.*?)">'
AllUrl = re.findall(t, response.text)

print('......starting to save the book......')
for oneUrl in AllUrl:
    ListString = ""
    url = dicF.split('/')[0] + "//" + dicF.split('/')[2] + oneUrl.split('.')[0] + ".html"
    response = requests.get(url, headers=headers)
    text = BeautifulSoup(response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'), 'lxml')
    getNum = 0
    for NextOne in text.findAll('a'):
        if NextOne.text == "下一頁":  # a "next page" link means the chapter spans two pages
            getNum += 1
    # Chapter split across two pages
    if getNum != 0:
        for num in [1, 2]:
            url = dicF.split('/')[0] + "//" + dicF.split('/')[2] + oneUrl.split('.')[0] + "_" + str(num) + ".html"
            response = requests.get(url, headers=headers)
            text = BeautifulSoup(response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'), 'lxml')
            # Chapter title, used as the file name
            fileName = text.findAll('h1')[1].text
            div = text.find('div', id='content')
            for item in div.text.split():
                ListString += item + "\n"
    # Single-page chapter
    else:
        url = dicF.split('/')[0] + "//" + dicF.split('/')[2] + oneUrl.split('.')[0] + ".html"
        response = requests.get(url, headers=headers)
        text = BeautifulSoup(response.content.decode('utf-8', 'ignore').encode('gbk', 'ignore'), 'lxml')
        # Chapter title, used as the file name
        fileName = text.findAll('h1')[1].text
        div = text.find('div', id='content')
        for item in div.text.split():
            ListString += item + "\n"
        getNum = 0
    # Strip a trailing character that is not allowed in file names
    for NoName in ["?", "/", "~", "*", "<", ">", ":", "|"]:
        if fileName[-1] == NoName:
            fileName = fileName[0:len(fileName) - 1]
    # Write the chapter to its own text file
    with open(FName + '/' + fileName + '.txt', 'a', encoding='utf-8') as f:
        print(fileName)
        f.writelines(ListString)
print("......book saved......")
```
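The script rebuilds absolute chapter URLs by splitting `dicF` on `/`; the standard-library `urllib.parse.urljoin` does the same job more robustly (a sketch with a made-up contents-page URL standing in for the user-supplied `dicF`):

```python
from urllib.parse import urljoin

# Hypothetical contents-page URL.
base = 'https://www.example.com/book/index.html'
# A root-relative chapter link, as extracted by the regex above.
chapter = urljoin(base, '/book/1001.html')
print(chapter)  # https://www.example.com/book/1001.html
```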