[Python Crawler] Dynamically Fetching CSDN Download Resource Information and Comments with Selenium + PhantomJS

Published: 2024/5/28


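The previous articles covered the basics and installation of Selenium and PhantomJS; this one is an application. It uses Selenium to drive PhantomJS and fetch information about CSDN download resources. The key part is fetching each resource's comments: they are loaded dynamically by JavaScript, so PhantomJS is used to load the page the way a browser would.
I hope this introductory article is useful to you. If it contains mistakes or shortcomings, please bear with me~
Earlier articles in this series:
    [Python Crawler] Installing PhantomJS and CasperJS on Windows, with an Introduction (Part 1)
    [Python Crawler] Installing PIP + PhantomJS + Selenium on Windows
    [Python Crawler] Driving Firefox and Chrome Automatically with Selenium to Search and Take Screenshots
    [Python Crawler] Logging In to 163 Mail Automatically with Selenium, and an Introduction to Locating Elements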

Source code

# coding=utf-8
from selenium import webdriver
import selenium.webdriver.support.ui as ui
import time
import re

# Start two headless PhantomJS browsers: one for the list pages, one for the detail pages
PHANTOMJS_PATH = r"G:\phantomjs-1.9.1-windows\phantomjs.exe"
driver = webdriver.PhantomJS(executable_path=PHANTOMJS_PATH)
driver_detail = webdriver.PhantomJS(executable_path=PHANTOMJS_PATH)
wait = ui.WebDriverWait(driver, 10)   # wait up to 10 seconds for elements to load
driver.get("http://download.csdn.net/user/eastmount/uploads")
SUMRESOURCES = 0   # global counter of resources found (globals are best avoided)

# Get the number of list pages from <div class="page_nav">共46個共8頁..</div>
def getPage():
    wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='page_nav']"))
    texts = driver.find_element_by_xpath("//div[@class='page_nav']").text
    print(texts)
    # Find the numbers in the pager text: the first is the resource count, the second the page count
    m = re.findall(r'[0-9]+', texts)
    print('Pages: ' + str(m[1]))
    return int(m[1])

# Get the URL and title of every resource on one list page
def getURL_Title(num):
    global SUMRESOURCES
    url = 'http://download.csdn.net/user/eastmount/uploads/' + str(num)
    print('Download list URL: ' + url)
    # Wait until the pager at the bottom of the page has loaded, then read URLs and titles.
    # Page source:
    #   <div class='list-container mb-bg'>
    #     <dl>
    #       <dt>
    #         <div class="icon"><img src="xxx"></img></div>
    #         <h3><a href="/detail/eastmount/8757243">MFC顯示BMP圖片</a></h3>
    #       </dt>
    #     </dl>
    #   </div>
    # get_attribute('href') returns the URL already completed to an absolute link.
    # Unicode strings avoid encoding errors: s.encode('utf8') converts unicode to UTF-8,
    # and decode converts UTF-8 back to unicode.
    driver.get(url)
    wait.until(lambda driver: driver.find_element_by_xpath("//div[@class='page_nav']"))
    list_container = driver.find_elements_by_xpath("//div[@class='list-container mb-bg']/dl/dt/h3/a")
    for title in list_container:
        print('Num' + str(SUMRESOURCES + 1))
        print(u'Title: ' + title.text)
        print(u'Link: ' + title.get_attribute('href'))
        SUMRESOURCES = SUMRESOURCES + 1
        # Fetch the details and comments of this resource
        getDetails(str(title.get_attribute('href')))
    print(' ')   # blank line after each list page

# Fetch the detail page. The driver defined above is still in use on the list page,
# so a second driver, driver_detail, loads the detail page. Otherwise Selenium reports
# Message: Error Message => 'Element does not exist in cache'
def getDetails(url):
    # Info box
    driver_detail.get(url)
    details = driver_detail.find_element_by_xpath("//div[@class='info']").text
    print(details)
    # Comments, structured as <dl><dt></dt><dd></dd></dl>
    comments = driver_detail.find_elements_by_xpath("//dl[@class='recom_list']/dd")
    for com in comments:
        print(u'Comment: ' + com.text)
    print(' ')   # blank line after each resource

# Main function
def main():
    start = time.time()
    pageNum = getPage()
    i = 1
    # Loop over the list pages to get titles and URLs
    while i <= pageNum:
        getURL_Title(i)
        i = i + 1
    print('SumResources: ' + str(SUMRESOURCES))
    print('Load Over')
    end = time.time()
    print("Time: %f s" % (end - start))
    # Shut down both PhantomJS processes
    driver.quit()
    driver_detail.quit()

main()

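Implementation steps

1. First get the total number of list pages, using the getPage() function.
2. Each list page holds a list of resources. The driver's find_elements_by_xpath() locates the title links, and get_attribute('href') returns each URL, automatically completed to an absolute link.
3. Using the URL from step 2, open each resource's detail page and get its info box and comments.
4. Because the page is loaded in the headless PhantomJS browser, the JavaScript-generated content is already in the DOM, so it is enough to read the div with class=info and the dl with class=recom_list.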

Results

The program's output is shown in the screenshot below:

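Program analysis

First, get the total page count shown in the figure below, which is 8 pages here. This is done with the following code:
    texts = driver.find_element_by_xpath("//div[@class='page_nav']").text
A while (i <= 8) loop then visits the resources on each page in turn. The URL of each list page has the form:
    http://download.csdn.net/user/eastmount/uploads/8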
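The page-count step can be sketched without a browser. The following is a minimal illustration, assuming the pager text looks like the sample in the source code comment ("共46個共8頁", that is, 46 resources on 8 pages). The helper names parse_page_count and page_urls are hypothetical and are not part of the original program.

```python
import re

BASE_URL = "http://download.csdn.net/user/eastmount/uploads"

def parse_page_count(nav_text):
    """Return the page count from pager text such as '共46個共8頁'."""
    # The first number is the resource count, the second is the page count.
    numbers = re.findall(r"[0-9]+", nav_text)
    if len(numbers) < 2:
        raise ValueError("unexpected pager text: %r" % nav_text)
    return int(numbers[1])

def page_urls(page_count):
    """Build the URL of every list page, numbered from 1 to page_count."""
    return ["%s/%d" % (BASE_URL, page) for page in range(1, page_count + 1)]

pages = parse_page_count("共46個共8頁")   # sample text, assumed format
urls = page_urls(pages)
print(pages)
print(urls[-1])
```

Checking that at least two numbers were found makes a changed pager layout fail with a clear error instead of an IndexError.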

Next, get the title and URL of every resource on the page, with the following code:
    list_container = driver.find_elements_by_xpath("//div[@class='list-container mb-bg']/dl/dt/h3/a")
    for title in list_container:
        print('Num' + str(SUMRESOURCES + 1))
        print(u'Title: ' + title.text)
        print(u'Link: ' + title.get_attribute('href'))
The corresponding page source is shown below. find_elements_by_xpath() returns multiple elements: the div has class='list-container mb-bg', and the path to each link is <div><dl><dt><h3><a>. The URL is also completed automatically. For example:
    <a href='/detail/eastmount/6917799'> has "http://download.csdn.net" prepended to its href.
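The automatic URL completion follows ordinary relative-URL resolution, which can be reproduced with the Python 3 standard library. This is only a sketch of the equivalent computation, not what Selenium does internally; the function name absolutize is hypothetical.

```python
from urllib.parse import urljoin

def absolutize(page_url, href):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)

page_url = "http://download.csdn.net/user/eastmount/uploads/8"
href = "/detail/eastmount/6917799"
print(absolutize(page_url, href))
# prints http://download.csdn.net/detail/eastmount/6917799
```

Because the href starts with "/", it replaces the whole path of the list page and keeps only the scheme and host.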

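Finally, open each resource's detail page and get its info box (InfoBox) and comments. Because PhantomJS behaves like a real browser and executes the page's JavaScript, the dynamically loaded comments can be read directly.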
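Content that arrives by JavaScript may not be present the instant the page loads, which is why the program waits with WebDriverWait before reading the list page. The idea behind such an explicit wait can be sketched in plain Python 3 as a polling loop. This is a simplified stand-in, not Selenium's implementation: wait_until and FakePage are hypothetical names, and the timeout and polling interval are assumed values.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

class FakePage:
    """Hypothetical page whose comments appear only after a delay, like JS-loaded content."""
    def __init__(self, ready_after):
        self._ready_at = time.monotonic() + ready_after

    def comments(self):
        if time.monotonic() >= self._ready_at:
            return ["comment 1", "comment 2"]
        return []   # nothing loaded yet

page = FakePage(ready_after=0.2)
comments = wait_until(page.comments, timeout=2.0, poll=0.05)
print(comments)
```

An empty list is falsy, so the loop keeps polling until the fake page returns its comments or the deadline passes.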

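By contrast, BeautifulSoup only sees the HTML source shown below, which contains no comment content because the comments are loaded dynamically by JavaScript. This is where PhantomJS has the advantage. Inspecting the element in Chrome or Firefox does show the comment div, which is exactly what simulating a browser is good for.
For comparison, see the earlier article: [Python Learning] A Simple Crawl of CSDN Download Resource Information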

<div class="section-list panel panel-default">
  <div class="panel-heading">
    <h3 class="panel-title">資源評論</h3>
  </div>
  <script language='JavaScript' defer type='text/javascript' src='/js/comment.js'></script>
  <div class="recommand download_comment panel-body" sourceid="8772951"></div>
</div>
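That the static source carries no comments can be checked with a small parser. The sketch below uses html.parser from the Python 3 standard library in place of BeautifulSoup, which is a substitution made here to keep the example dependency-free. The class name CommentDivParser is hypothetical.

```python
from html.parser import HTMLParser

STATIC_HTML = """
<div class="section-list panel panel-default">
  <div class="panel-heading">
    <h3 class="panel-title">資源評論</h3>
  </div>
  <script language='JavaScript' defer type='text/javascript' src='/js/comment.js'></script>
  <div class="recommand download_comment panel-body" sourceid="8772951"></div>
</div>
"""

class CommentDivParser(HTMLParser):
    """Collect the text inside the div whose class list contains 'download_comment'."""
    def __init__(self):
        super().__init__()
        self.found = False   # True once the comment div has been seen
        self.depth = 0       # div nesting depth inside the comment div
        self.text = []       # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1
        elif "download_comment" in (dict(attrs).get("class") or "").split():
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text.append(data.strip())

parser = CommentDivParser()
parser.feed(STATIC_HTML)
print(parser.found, parser.text)
```

The comment div is found, but its text list stays empty: the comments only exist after /js/comment.js has run in a browser.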

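Summary

This article described how to get CSDN download resource information with Selenium and PhantomJS. Driving Chrome or Firefox from the driver always adds extra overhead, so the headless PhantomJS browser is used instead. A few open questions remain:
    1. How can the black PhantomJS console window be kept from popping up?
    2. The program runs inefficiently and responds slowly. How can it be sped up?
Things I plan to try next, if I get the chance:
    1. Downloading the InfoBox of tourist attractions from Baidu Baike (knowledge graph mining for my graduation project);
    2. Crawling the dynamically loaded images on Sohu Pictures to build an intelligent image crawler;
    3. Driving Chrome or Firefox through the driver to send messages when automatic login is required.
Finally, I hope this article helps. If it contains mistakes or shortcomings, please bear with me~
(By: Eastmount, 2015-8-24, 2:30 a.m. http://blog.csdn.net/eastmount/)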
