Python Web Scraping Exercise (1): Scraping Novels from 新笔趣阁 (Search + Scrape)
First, a look at the final result (GIF). Implementation steps:

1. Explore the site http://www.xbiquge.la/ to figure out how its search works.
2. Write the search feature (get the URL of each book's table of contents).
3. Write the output feature (save the text chapter by chapter).
4. Polish the code (fix bugs, create a folder per book).
PS: required modules:

```python
import requests
import bs4
import os
import sys
import time
import random
```
Part 1: How the site's search works, and how to mimic it in Python
I assumed this site, like most, performed searches through the URL, but that turns out not to be the case: the URL does not change with the search query. That leaves another possibility: the page is updated via a POST request. Opening the Network tab in the browser's developer tools confirms it. Now let's simulate that request.
```python
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Cookie": "_abcde_qweasd=0; Hm_lvt_169609146ffe5972484b0957bd1b46d6=1583122664; bdshare_firstime=1583122664212; Hm_lpvt_169609146ffe5972484b0957bd1b46d6=1583145548",
    "Host": "www.xbiquge.la"
}
x = str(input("Enter a book title or author name: "))
data = {'searchkey': x}
url = 'http://www.xbiquge.la/modules/article/waps.php'
r = requests.post(url, data=data, headers=headers)
soup = bs4.BeautifulSoup(r.text.encode('utf-8'), "html.parser")  # this encoding turns out to be wrong — see below
```
But if I print(soup) at this point, all the Chinese comes out as mojibake! Clearly the encoding is wrong. We can check what encoding requests guessed via the r.encoding attribute; once the encoding is fixed, extraction works normally and the text matches what the browser displays — exactly what we searched for.
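A minimal sketch of the fix (assuming, as the final code below does, that the response declares no charset, so requests falls back to ISO-8859-1; re-encoding r.text with that same codec recovers the original bytes and lets BeautifulSoup sniff the real charset itself):

```python
print(r.encoding)  # likely prints 'ISO-8859-1' — requests' fallback guess, not the page's real charset
# Undo the wrong decode by encoding back to raw bytes, then let BeautifulSoup detect the charset:
soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
```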
Part 2: Finding what we want in that pile of markup (book title, author, TOC URL)

Inspecting the elements makes them easy to locate: the link and the book title live in the <a> tag inside a td with class="even", and the author sits in a td with class="even" as well.
What! The two kinds of cells share the same class! What now? Never mind — let's just print every td with class="even" and take a look.
```python
book_author = soup.find_all("td", class_="even")
for each in book_author:
    print(each)
```
You can see the results alternate between two kinds of cells. So we can alternate between odd and even iterations to handle each kind separately. (If we don't split them, the call the first kind needs, each.a.get("href"), raises an error on the second kind; a try block could probably handle that too — untested here, but see the sketch after the code below.) We also create three lists to store the three values.
```python
books = []
authors = []
directory = []
tem = 1
for each in book_author:
    if tem == 1:
        books.append(each.text)
        tem -= 1
        directory.append(each.a.get("href"))
    else:
        authors.append(each.text)
        tem += 1
```
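As an aside, the try/except variant mentioned above (not tested in the original) might look like this — cells whose <a> tag yields an href are treated as title cells, the rest as author cells:

```python
# Hypothetical alternative to the odd/even counter, not part of the original code:
for each in book_author:
    try:
        directory.append(each.a.get("href"))  # raises AttributeError on author cells (no usable <a>)
        books.append(each.text)
    except AttributeError:
        authors.append(each.text)
```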
Success! The three lists line up perfectly. Now, how do we let the user pick a number and have Python hand back the matching TOC link? Like this:
```python
print('Search results:')
for num, book, author in zip(range(1, len(books) + 1), books, authors):
    print((str(num) + ": ").ljust(4) + (book + "\t").ljust(25) + ("\tAuthor: " + author).ljust(20))
search = dict(zip(books, directory))
```
Neat, right? search is a dict mapping book titles to TOC URLs, so return search[books[i - 1]] is all it takes to hand the next function the TOC URL of the chosen book.
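For illustration, the selection step then reduces to an index lookup (this mirrors what get_title_url does in the full code below):

```python
i = int(input("Enter the number of the book to download: "))
titel_url = search[books[i - 1]]  # TOC URL passed on to the next step
```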
Part 3: Get the chapter URLs, fetch the text, write it to files
Once we have the TOC URL, we can fetch each chapter's URL the same way (so I won't repeat the explanation).
```python
def get_text_url(titel_url):
    url = titel_url
    global headers
    r = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    titles = soup.find_all("dd")
    texts = []
    names = []
    texts_names = []
    for each in titles:
        texts.append("http://www.xbiquge.la" + each.a["href"])
        names.append(each.a.text)
    texts_names.append(texts)
    texts_names.append(names)
    return texts_names
```
Note that the return value is a list containing two lists! texts_names[0] holds the chapter URLs and texts_names[1] holds the chapter names, which keeps things convenient for the next function, the one that writes the content. Next up: writing the files!
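So, for example, a caller unpacks it like this:

```python
texts_names = get_text_url(titel_url)
print(texts_names[0][0])  # URL of the first chapter
print(texts_names[1][0])  # title of the first chapter
```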
```python
# Body of the download loop; n is the chapter index (the full code wraps this in `for n in range(max):`).
url = texts_url[0][n]
name = texts_url[1][n]
req = requests.get(url=url, headers=headers)
time.sleep(random.uniform(0, 0.5))
req.encoding = 'UTF-8'
html = req.text
soup = bs4.BeautifulSoup(html, features="html.parser")
texts = soup.find_all("div", id="content")
while (len(texts) == 0):
    # An empty result usually means the server answered with a 503 page — just retry.
    req = requests.get(url=url, headers=headers)
    time.sleep(random.uniform(0, 0.5))
    req.encoding = 'UTF-8'
    html = req.text
    soup = bs4.BeautifulSoup(html, features="html.parser")
    texts = soup.find_all("div", id="content")
else:
    # The while-loop's else clause runs once the loop exits normally.
    content = texts[0].text.replace('\xa0' * 8, '\n\n')
    # Strip the site's promo blurb (kept verbatim in Chinese so it matches the scraped page):
    content = content.replace("親,點擊進去,給個好評唄,分數越高更新越快,據說給新筆趣閣打滿分的最后都找到了漂亮的老婆哦!手機站全新改版升級地址:http://m.xbiquge.la,數據和書簽與電腦站同步,無廣告清新閱讀!", "\n")
with open(name + '.txt', "w", encoding='utf-8') as f:
    f.write(content)
sys.stdout.write("\rDownloaded {} chapters, {} to go".format(count, max - count))
count += 1
```
n is the chapter index, so a plain for loop writes every chapter to its own file. The way 503 responses are handled here (retry until the content div appears) is brutal, but it's the most effective thing I found!
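A gentler variant, not in the original, would cap the number of retries and back off between attempts so a persistent failure can't loop forever:

```python
# Hypothetical bounded retry with backoff (the original retries indefinitely):
for attempt in range(5):
    req = requests.get(url=url, headers=headers)
    req.encoding = 'UTF-8'
    soup = bs4.BeautifulSoup(req.text, features="html.parser")
    texts = soup.find_all("div", id="content")
    if texts:
        break
    time.sleep(1 + attempt)  # wait a little longer after each failure
else:
    raise RuntimeError("giving up on " + url)
```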
Part 4: Tidy up the code and fix bugs
Package the ideas above into three or four functions, then test: fix the bugs you can, and just try/except away the ones you can't. (Ha!) If you want each book saved in its own folder, look into the os module — it's simple, so I won't go over it here. Finally, the complete code!
```python
import requests
import bs4
import os
import sys
import time
import random

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36",
    "Cookie": "_abcde_qweasd=0; Hm_lvt_169609146ffe5972484b0957bd1b46d6=1583122664; bdshare_firstime=1583122664212; Hm_lpvt_169609146ffe5972484b0957bd1b46d6=1583145548",
    "Host": "www.xbiquge.la"
}
b_n = ""


def get_title_url():
    """Search the site and return the TOC URL of the chosen book."""
    x = str(input("Enter a book title or author name: "))
    data = {'searchkey': x}
    url = 'http://www.xbiquge.la/modules/article/waps.php'
    global headers, b_n
    r = requests.post(url, data=data, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    book_author = soup.find_all("td", class_="even")
    books = []
    authors = []
    directory = []
    tem = 1
    for each in book_author:
        if tem == 1:
            books.append(each.text)
            tem -= 1
            directory.append(each.a.get("href"))
        else:
            authors.append(each.text)
            tem += 1
    print('Search results:')
    for num, book, author in zip(range(1, len(books) + 1), books, authors):
        print((str(num) + ": ").ljust(4) + (book + "\t").ljust(25) + ("\tAuthor: " + author).ljust(20))
    search = dict(zip(books, directory))
    if books == []:
        print("No books found, please search again!")
        return get_title_url()  # return, so the outer call doesn't fall through on an empty list
    try:
        i = int(input("Enter the number of the book to download (enter '0' to search again): "))
    except:
        print("Invalid input, try again:")
        i = int(input("Enter the number of the book to download (enter '0' to search again): "))
    if i == 0:
        return get_title_url()  # start a fresh search
    if i > len(books) or i < 0:
        print("Invalid input, try again:")
        i = int(input("Enter the number of the book to download (enter '0' to search again): "))
    b_n = books[i - 1]
    try:
        os.mkdir(books[i - 1])  # one folder per book
        os.chdir(b_n)
    except:
        os.chdir(b_n)  # folder already exists
    return search[books[i - 1]]


def get_text_url(titel_url):
    """Collect every chapter URL and chapter name from the TOC page."""
    url = titel_url
    global headers
    r = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(r.text.encode('ISO-8859-1'), "html.parser")
    titles = soup.find_all("dd")
    texts = []
    names = []
    texts_names = []
    for each in titles:
        texts.append("http://www.xbiquge.la" + each.a["href"])
        names.append(each.a.text)
    texts_names.append(texts)
    texts_names.append(names)
    return texts_names


def readnovel(texts_url):
    """Download every chapter and write each one to its own .txt file."""
    global headers, b_n
    count = 1
    max = len(texts_url[1])
    print("Estimated time: {} minutes".format((max // 60) + 1))
    tishi = input(str(b_n) + " has {} chapters. Enter 'y' to confirm the download, any other key to cancel: ".format(max))
    if tishi == "y" or tishi == "Y":
        for n in range(max):
            url = texts_url[0][n]
            name = texts_url[1][n]
            req = requests.get(url=url, headers=headers)
            time.sleep(random.uniform(0, 0.5))
            req.encoding = 'UTF-8'
            html = req.text
            soup = bs4.BeautifulSoup(html, features="html.parser")
            texts = soup.find_all("div", id="content")
            while (len(texts) == 0):
                # Empty result usually means a 503 page — retry until the content div shows up.
                req = requests.get(url=url, headers=headers)
                time.sleep(random.uniform(0, 0.5))
                req.encoding = 'UTF-8'
                html = req.text
                soup = bs4.BeautifulSoup(html, features="html.parser")
                texts = soup.find_all("div", id="content")
            else:
                content = texts[0].text.replace('\xa0' * 8, '\n\n')
                # Strip the site's promo blurb (kept verbatim so it matches the scraped page):
                content = content.replace("親,點擊進去,給個好評唄,分數越高更新越快,據說給新筆趣閣打滿分的最后都找到了漂亮的老婆哦!手機站全新改版升級地址:http://m.xbiquge.la,數據和書簽與電腦站同步,無廣告清新閱讀!", "\n")
            with open(name + '.txt', "w", encoding='utf-8') as f:
                f.write(content)
            sys.stdout.write("\rDownloaded {} chapters, {} to go".format(count, max - count))
            count += 1
        print("\nAll chapters downloaded.")
    else:
        print("Cancelled!")
        os.chdir('..')
        os.rmdir(b_n)
        main()


def main():
    titel_url = get_title_url()
    texts_url = get_text_url(titel_url)
    readnovel(texts_url)
    input("Press any key to exit")


if __name__ == '__main__':
    print("All novels come from '新筆趣閣' ---> http://www.xbiquge.la\nSo if a book can't be found there, there's nothing I can do..........@曉軒\nTo make sure downloads complete, a short random delay is added after every chapter!")
    main()
```