Scraping Dianping review data with python + requests + beautifulsoup
A note up front: this post was written in February 2018, and Dianping has since changed its page logic, so please look for a more recent write-up. Thanks for your support.
First, a bit of small talk. It's been about four months since my last post. Besides a new project going live at work and learning some new tech, I've been busy with something else: I'm in a relationship now, the headed-for-marriage kind, which promotes me to the very top of the programmer pecking order. Yes, come at me!
Alright, this is a technical post. The company recently needed the review data for a few convenience-store chains on Dianping. Since it was a one-off job that didn't need to be wrapped up as a service, I reached for what I'd learned before: Python's requests + BeautifulSoup to fetch pages and parse out the information.
The chains: Wuhan's 7tt, today今天, and so on.
First, take a look at these two URLs:
https://www.dianping.com/search/keyword/16/0_7tt
https://www.dianping.com/search/keyword/16/0_today今天
These are the search listing pages for the two chains; both follow a fixed format with the store name appended at the end.
The first step is to get each shop's id and build the shop's detail URL from it, e.g. http://www.dianping.com/shop/22711693.
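The post doesn't show this step (the ids are hard-coded in the final script), but a minimal sketch, assuming the 2018-era markup where each search result links to /shop/&lt;id&gt;, could look like this:

```python
# Hypothetical sketch: harvest shop ids from a search listing page.
# The URL and markup reflect the 2018-era site and may no longer work.
import re
import requests
from bs4 import BeautifulSoup

search_url = 'https://www.dianping.com/search/keyword/16/0_7tt'
resp = requests.get(search_url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'lxml')

shop_ids = set()
for a in soup.find_all('a', href=True):
    m = re.search(r'/shop/(\d+)', a['href'])
    if m:
        shop_ids.add(m.group(1))
print(shop_ids)
```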
Clicking "more reviews" (更多点评) at the bottom of the detail page opens the full review listing,
so the review page URL ends up as http://www.dianping.com/shop/22711693/review_all.
Clicking a page number at the bottom changes the URL by appending the page as /p2, /p3, and so on:
http://www.dianping.com/shop/22711693/review_all/p2
So we can read the page numbers at the bottom and iterate over every review page.
How do we get the page count, then?
Open the developer tools: F12 on Windows, Alt+Command+J on Mac.
You can see there are 9 elements with class=PageLink in total, so add 1 when looping. The code is as follows:
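This is the relevant excerpt from the full script at the end (with `len(...)` as the more idiomatic spelling of the original `.__len__()`):

```python
soup = BeautifulSoup(r.text, 'lxml')   # r: response for the review_all page
# 9 PageLink elements, plus 1, give the total number of pages
lenth = len(soup.find_all(class_='PageLink')) + 1
```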
The lenth we get here is the total number of review pages for the shop.
Next: how do we pull each review's username, star rating, and review text out of a page?
As the screenshot showed, the reviews sit in a series of li elements, so first select the li elements, then extract the fields from each one.
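Selecting them is a single CSS selector (taken from the full script; `page_html` here stands for the fetched review-page HTML):

```python
soupIn = BeautifulSoup(page_html, 'lxml')
coment = soupIn.select('.reviews-items li')   # every li under class="reviews-items"
```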
Then iterate over the li elements:
```python
for one in coment:
    try:
        # skip the stray li elements whose first class is "item"
        if one['class'][0] == 'item':
            continue
    except KeyError:
        pass
    name = one.select_one('.main-review .dper-info .name')
    name = name.get_text().strip()
    star = one.select_one('.main-review .review-rank span')
    star = star['class'][1][7:8]          # e.g. 'sml-str40' -> '4'
    pl = one.select_one('.main-review .review-words')
    pl['class'] = {'review-words'}        # drop the "Hide" class (see below)
    words = pl.get_text().strip()
    returnList.append([title, name, star, words])
```

The selector returns every li under class="reviews-items"; breakpoint debugging showed that the ones with class="item" simply need to be excluded, hence the check at the top of the loop.
The username name is easy to get. The star rating star, however, is encoded in the span's class: class="sml-str40" means 4 stars, so you need to read the class attribute and slice the digit out.
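As a worked example of that slice:

```python
cls = 'sml-str40'    # second class on the rating <span>
print(cls[7:8])      # -> '4', i.e. a 4-star review
```

Note that the single-character slice would truncate half-star classes such as sml-str45, if the page uses them.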
The review text, the most important part, sits behind the "expand review" (展开评论) button, which toggles a class="Hide". So the Hide class on the review div has to be removed first, which is done by simply overwriting the attribute: pl['class'] = {'review-words'}.
That's basically it: collect the rows into a list, then write them to a file or a database.
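For the file route, a minimal sketch using the standard csv module (the file name and column headers are my own, not from the post; written for Python 2 like the rest of the code):

```python
# -*- coding: utf-8 -*-
import csv

# returnList rows are [title, name, star, words], built as above
with open('reviews.csv', 'wb') as f:    # 'wb' for the Python 2 csv module
    writer = csv.writer(f)
    writer.writerow(['title', 'name', 'star', 'words'])
    for row in returnList:
        writer.writerow([col.encode('utf-8') for col in row])
```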
Requests only succeed with proper headers and cookies. The cookies identify the visiting user; some of their fields need to be parsed out, they carry timestamps, and they expire after a while. The Referer header tells the site which page you came from; without it, after a few visits the site will block you from continuing, on suspicion of being a crawler.
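In code this boils down to passing both dicts on every request (the values here are placeholders; real cookies come from your browser's developer tools and expire over time):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 ...',                     # placeholder UA string
    'Referer': 'http://www.dianping.com/shop/22711693',  # the page you "came from"
}
cookies = {'_lxsdk_cuid': '...', '_hc.v': '...'}         # copied from the browser

r = requests.get('http://www.dianping.com/shop/22711693/review_all',
                 headers=headers, cookies=cookies)
```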
Also, too many requests from one IP will get that IP banned, which is where proxies come in. With Python this is trivial: just pass a proxies argument in the request, r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies). As for proxy IPs, http://www.data5u.com/ lists 20 free ones at the bottom of the page, generally enough for a small crawler. Using proxies raises the question of whether a given proxy is actually reachable, so add the following code to configure retry behavior:
```python
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
```

And that's what the finished crawler looks like.
That's about it. The full code follows.
Feel free to follow me on Weibo @住街对面的查理. My life is pretty interesting; why not come have a look?
```python
#coding=utf-8
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import json
import requests

# shop ids collected from the search listing pages
list = [
    22711693, 24759450, 69761921, 69761921, 22743334, 66125712, 22743270,
    57496584, 75153221, 57641884, 66061653, 70669333, 57279088, 24740739,
    66126129, 75100027, 92667587, 92452007, 72345827, 90004047, 90485109,
    90546031, 83527455, 91070982, 83527745, 94273474, 80246564, 83497073,
    69027373, 96191554, 96683472, 90500524, 92454863, 92272204, 70443082,
    96076068, 91656438, 75633029, 96571687, 97659144, 69253863, 98279207,
    90435377, 70669359, 96403354, 83618952, 81265224, 77365611, 74592526,
    90479676, 56540304, 37924067, 27496773, 56540319, 32571869, 43611843,
    58612870, 22743340, 67293664, 67292945, 57641749, 75157068, 58934198,
    75156610, 59081304, 75156647, 75156702, 67293838,
]
returnList = []
proxies = {
    # "https": "http://14.215.177.73:80",
    "http": "http://202.108.2.42:80",
}
headers = {
    'Host': 'www.dianping.com',
    'Referer': 'http://www.dianping.com/shop/22711693',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/535.19',
    'Accept-Encoding': 'gzip',
}
cookies = {
    '_lxsdk_cuid': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    'lxsdk': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    '_hc.v': 'ec20d90c-0104-0677-bf24-391bdf00e2d4.1517308569',
    's_ViewType': '10',
    'cy': '16',
    'cye': 'wuhan',
    '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
    '_lxsdk_s': '1614abc132e-f84-b9c-2bc%7C%7C34',
}
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False

for i in list:
    url = "https://www.dianping.com/shop/%s/review_all" % i
    r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)
    soup = BeautifulSoup(r.text, 'lxml')
    # 9 PageLink elements, plus 1, give the total page count
    lenth = len(soup.find_all(class_='PageLink')) + 1
    for j in xrange(1, lenth + 1):   # review pages are /p1 .. /pN
        urlIn = "http://www.dianping.com/shop/%s/review_all/p%s" % (i, j)
        re = requests.get(urlIn, headers=headers, cookies=cookies, proxies=proxies)
        soupIn = BeautifulSoup(re.text, 'lxml')
        title = soupIn.title.string[0:15]
        coment = soupIn.select('.reviews-items li')
        for one in coment:
            try:
                # skip the stray li elements whose first class is "item"
                if one['class'][0] == 'item':
                    continue
            except KeyError:
                pass
            name = one.select_one('.main-review .dper-info .name')
            name = name.get_text().strip()
            star = one.select_one('.main-review .review-rank span')
            star = star['class'][1][7:8]      # e.g. 'sml-str40' -> '4'
            pl = one.select_one('.main-review .review-words')
            pl['class'] = {'review-words'}    # drop the "Hide" class
            words = pl.get_text().strip()
            returnList.append([title, name, star, words])

file = open("/Users/huojian/Desktop/store_shop.sql", "w")
for one in returnList:
    file.write("\n")
    file.write(unicode(one[0]))
    file.write("\n")
    file.write(unicode(one[1]))
    file.write("\n")
    file.write(unicode(one[2]))
    file.write("\n")
    file.write(unicode(one[3]))
    file.write("\n")
```