Python Crawler Beginner Tutorial 14-100: Multithreaded Crawling of All IT eBooks

Published: 2024/4/15


Multithreaded Crawling of All IT eBooks - Preface

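Every crawler enthusiast has at least a bit of a collecting habit. You find good pictures, good books, anything that can be stored on a computer, and you want to crawl it all in bulk. Then you leave it there. Yes, it just sits there... and is slowly forgotten.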

Multithreaded Crawling of All IT eBooks - Crawler Analysis

Open http://www.allitebooks.com/ and you will find a clean, simple page that looks easy to crawl at first glance.

Click through to a book and the download link is displayed just as plainly. That is a small thrill: sites this clean and ad-free are rare these days.

Multithreaded Crawling of All IT eBooks - Writing the Code

This time I am using a new module, requests-html. Its author previously developed requests, which you should already know well. Thread control is handled with queue.

Install the requests-html module:

pip install requests-html

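As for how to use this module, just search for its name in a search engine; there are plenty of articles about it. For someone who has followed the series as far as this post, it will be easy.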

Let's write the core part first.

    from requests_html import HTMLSession
    from queue import Queue
    import requests
    import random
    import threading

    CARWL_EXIT = False
    DOWN_EXIT = False

    #####
    # other code
    ####

    if __name__ == '__main__':
        page_queue = Queue(5)
        for i in range(1, 6):
            page_queue.put(i)  # store the page numbers in page_queue

        # crawl results
        data_queue = Queue()

        # keeps track of the crawl threads
        thread_crawl = []
        # start 5 threads at a time
        craw_list = ["crawl thread 1", "crawl thread 2", "crawl thread 3",
                     "crawl thread 4", "crawl thread 5"]
        for thread_name in craw_list:
            c_thread = ThreadCrawl(thread_name, page_queue, data_queue)
            c_thread.start()
            thread_crawl.append(c_thread)

        # busy-wait until every page number has been taken
        while not page_queue.empty():
            pass

        # once page_queue is empty, the crawl threads leave their loop
        CARWL_EXIT = True
        for thread in thread_crawl:
            thread.join()
        print("crawl threads finished")

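That is the thread setup for crawling the book detail pages. I started 5 threads and crawled only 5 pages. If you need more, just change these lines, keeping the queue size and the range in step with each other: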

    page_queue = Queue(5)
    for i in range(1, 6):
        page_queue.put(i)  # store the page numbers in page_queue
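To make the page range easier to change, the queue setup could be wrapped in a small helper. This is only a sketch: `build_page_queue` and its parameters are hypothetical names that do not appear in the original code. It sizes the queue from the range, because `put()` on a full bounded queue blocks, and at this point no crawl thread has been started to drain it.

```python
from queue import Queue


def build_page_queue(first_page, last_page):
    """Return a queue holding every page number from first_page to last_page inclusive."""
    if last_page < first_page:
        raise ValueError("last_page must not be smaller than first_page")
    # Size the queue to the number of pages so that put() never blocks here.
    page_queue = Queue(last_page - first_page + 1)
    for page in range(first_page, last_page + 1):
        page_queue.put(page)
    return page_queue


if __name__ == '__main__':
    page_queue = build_page_queue(1, 5)  # the same five pages as above
    print(page_queue.qsize())
```

With this helper, crawling more pages means changing only the two arguments.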

Next, let's finish the ThreadCrawl class.

    session = HTMLSession()

    # User-Agent list. Later I will host it on a server so it can be fetched remotely.
    # The full list has many entries; look for them in the source code.
    USER_AGENTS = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20"
    ]


    # Thread class that collects the book download links
    class ThreadCrawl(threading.Thread):
        # constructor
        def __init__(self, thread_name, page_queue, data_queue):
            super(ThreadCrawl, self).__init__()
            self.thread_name = thread_name
            self.page_queue = page_queue
            self.data_queue = data_queue
            self.page_url = "http://www.allitebooks.com/page/{}"  # URL template

        def run(self):
            print(self.thread_name + " started*********")
            while not CARWL_EXIT:
                try:
                    page = self.page_queue.get(block=False)
                    page_url = self.page_url.format(page)  # build the list-page URL
                    self.get_list(page_url)  # parse the links on that page
                except Exception as e:
                    # an empty queue raises queue.Empty, which also ends this thread
                    print(e)
                    break

        # get all book links on the current list page
        def get_list(self, url):
            try:
                response = session.get(url)
            except Exception as e:
                print(e)
                raise e
            all_link = response.html.find('.entry-title>a')  # all book detail links on the page
            for link in all_link:
                self.get_book_url(link.attrs['href'])  # follow each book link

        # get the download link of one book
        def get_book_url(self, url):
            try:
                response = session.get(url)
            except Exception as e:
                print(e)
                raise e
            download_url = response.html.find('.download-links a', first=True)
            if download_url is not None:  # continue only if a download link exists
                link = download_url.attrs['href']
                self.data_queue.put(link)  # store the download URL in data_queue for the download step
                print("found {}".format(link))

The most important thing the code above does is store each book's download link in data_queue. That queue is the basic input for the separate download threads.
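The hand-off through data_queue is a producer/consumer pattern. The code in this post stops its workers with a global flag and a busy-wait loop in the main thread. One common alternative, sketched below, is to push a sentinel value onto the queue for each worker so that the workers can block on `get()` instead of polling. This is not the author's method. `STOP`, `fake_fetch`, `crawl_worker` and `crawl` are hypothetical names, and `fake_fetch` returns made-up example.com links in place of the real requests-html calls so the sketch runs offline.

```python
import threading
from queue import Queue

STOP = object()  # sentinel telling a worker to exit (hypothetical name)


def fake_fetch(page):
    """Stand-in for the real page request; returns made-up links (assumption)."""
    return ["http://example.com/book-{}.pdf".format(page)]


def crawl_worker(page_queue, data_queue):
    while True:
        page = page_queue.get()  # blocks until an item is available
        if page is STOP:
            break
        for link in fake_fetch(page):
            data_queue.put(link)


def crawl(pages, worker_count=5):
    page_queue, data_queue = Queue(), Queue()
    workers = [
        threading.Thread(target=crawl_worker, args=(page_queue, data_queue))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for page in pages:
        page_queue.put(page)
    for _ in workers:
        page_queue.put(STOP)  # one sentinel per worker
    for worker in workers:
        worker.join()
    # All workers have finished, so draining the queue here is safe.
    links = []
    while not data_queue.empty():
        links.append(data_queue.get())
    return sorted(links)


if __name__ == '__main__':
    print(crawl(range(1, 6)))
```

Because each worker blocks on `get()` until work or a sentinel arrives, the main thread no longer needs to spin on `page_queue.empty()`.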

Now let's write the class and methods for downloading the books.

I started 4 threads here; the approach is very similar to the one above.

    class ThreadDown(threading.Thread):
        def __init__(self, thread_name, data_queue):
            super(ThreadDown, self).__init__()
            self.thread_name = thread_name
            self.data_queue = data_queue

        def run(self):
            print(self.thread_name + ' started************')
            while not DOWN_EXIT:
                try:
                    book_link = self.data_queue.get(block=False)
                    self.download(book_link)
                except Exception as e:
                    # queue empty or download failed: keep polling until DOWN_EXIT is set
                    pass

        def download(self, url):
            # pick a random browser User-Agent
            headers = {"User-Agent": random.choice(USER_AGENTS)}
            # take the file name from the end of the URL
            filename = url.split('/')[-1]
            # only handle URLs that contain .pdf or .epub
            if '.pdf' in url or '.epub' in url:
                # the path is hard-coded; create a "book" folder in the working directory first
                file = 'book/' + filename
                with open(file, 'wb') as f:  # write the file in binary mode
                    print("downloading {}".format(filename))
                    response = requests.get(url, stream=True, headers=headers)
                    # read the file size from the response headers
                    total_length = response.headers.get("content-length")
                    # if no size is reported, write the whole body at once
                    if total_length is None:
                        f.write(response.content)
                    else:
                        for data in response.iter_content(chunk_size=4096):
                            f.write(data)
                # the with block closes the file, so no explicit f.close() is needed
                print("{} downloaded".format(filename))


    if __name__ == '__main__':
        # the other code is shown above
        thread_image = []
        image_list = ['download thread 1', 'download thread 2',
                      'download thread 3', 'download thread 4']
        for thread_name in image_list:
            d_thread = ThreadDown(thread_name, data_queue)
            d_thread.start()
            thread_image.append(d_thread)

        while not data_queue.empty():
            pass

        DOWN_EXIT = True
        for thread in thread_image:
            thread.join()
        print("download threads finished")
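The download step takes the file name from the end of the URL and expects the book folder to exist already. A slightly more defensive variant is sketched below, using only the standard library. `target_path` and `BOOK_EXTENSIONS` are hypothetical names, and the example.com URLs are placeholders. Unlike the original check, which looks for `.pdf` or `.epub` anywhere in the URL, this sketch tests only the end of the file name and ignores any query string.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit

BOOK_EXTENSIONS = (".pdf", ".epub")  # hypothetical constant


def target_path(url, folder="book"):
    """Return the local path for a book URL, or None if it is not a PDF/EPUB."""
    # Take the file name from the URL path only, ignoring any query string.
    filename = unquote(posixpath.basename(urlsplit(url).path))
    if not filename.lower().endswith(BOOK_EXTENSIONS):
        return None
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    return os.path.join(folder, filename)


if __name__ == '__main__':
    print(target_path("http://example.com/files/Learning%20Python.pdf?x=1"))
    print(target_path("http://example.com/files/readme.txt"))
```

Creating the folder on demand removes the manual setup step, at the cost of a small side effect inside the helper.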

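Once you have put all of the code above together, you should be able to crawl the books quite quickly. The books are all in English, of course; whether you can read them once they are downloaded... I can't say.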

For the source code download link, see the previous post in the series.

Reposted from: https://www.cnblogs.com/happymeng/p/10188468.html
