Scraping Dianping review data with python + requests + beautifulsoup
A note up front: this post was written in February 2018, and Dianping has since changed its page logic, so please look for a more recent write-up. Thanks for your support.
First, a bit of small talk. It's been about four months since my last post. Besides a new project going live at work and learning some new tech, I've been busy with something else: I'm in a relationship now, the headed-for-marriage kind, which promotes me to the very top of the programmer pecking order. Yes, come at me!
Alright, this is a technical post. The company recently needed the review data for a few convenience-store chains on Dianping. Since it was a one-off job that didn't need to be wrapped up as a service, I reached for what I'd learned before: Python's requests + BeautifulSoup to fetch pages and parse out the information.
The chains: Wuhan's 7tt, today今天, and so on.
First, take a look at these two URLs:
https://www.dianping.com/search/keyword/16/0_7tt
https://www.dianping.com/search/keyword/16/0_today今天
These are the search listing pages for the two chains; both follow a fixed format with the store name appended at the end.
The first step is to get each shop's id and build the shop's detail URL from it, e.g. http://www.dianping.com/shop/22711693.
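The post doesn't show this step (the ids are hard-coded in the final script), but a minimal sketch, assuming the 2018-era markup where each search result links to /shop/&lt;id&gt;, could look like this:

```python
# Hypothetical sketch: harvest shop ids from a search listing page.
# The URL and markup reflect the 2018-era site and may no longer work.
import re
import requests
from bs4 import BeautifulSoup

search_url = 'https://www.dianping.com/search/keyword/16/0_7tt'
resp = requests.get(search_url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'lxml')

shop_ids = set()
for a in soup.find_all('a', href=True):
    m = re.search(r'/shop/(\d+)', a['href'])
    if m:
        shop_ids.add(m.group(1))
print(shop_ids)
```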
Clicking "more reviews" (更多点评) at the bottom of the detail page opens the full review listing,
so the review page URL ends up as http://www.dianping.com/shop/22711693/review_all.
Clicking a page number at the bottom changes the URL by appending the page as /p2, /p3, and so on:
http://www.dianping.com/shop/22711693/review_all/p2
So we can read the page numbers at the bottom and iterate over every review page.
How do we get the page count, then?
Open the developer tools: F12 on Windows, Alt+Command+J on Mac.
You can see there are 9 elements with class=PageLink in total, so add 1 when looping. The code is as follows:
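This is the relevant excerpt from the full script at the end (with `len(...)` as the more idiomatic spelling of the original `.__len__()`):

```python
soup = BeautifulSoup(r.text, 'lxml')   # r: response for the review_all page
# 9 PageLink elements, plus 1, give the total number of pages
lenth = len(soup.find_all(class_='PageLink')) + 1
```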
The lenth we get here is the total number of review pages for the shop.
Next: how do we pull each review's username, star rating, and review text out of a page?
As the screenshot showed, the reviews sit in a series of li elements, so first select the li elements, then extract the fields from each one.
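Selecting them is a single CSS selector (taken from the full script; `page_html` here stands for the fetched review-page HTML):

```python
soupIn = BeautifulSoup(page_html, 'lxml')
coment = soupIn.select('.reviews-items li')   # every li under class="reviews-items"
```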
Then iterate over the li elements:
```python
for one in coment:
    try:
        # skip the stray li elements whose first class is "item"
        if one['class'][0] == 'item':
            continue
    except KeyError:
        pass
    name = one.select_one('.main-review .dper-info .name')
    name = name.get_text().strip()
    star = one.select_one('.main-review .review-rank span')
    star = star['class'][1][7:8]          # e.g. 'sml-str40' -> '4'
    pl = one.select_one('.main-review .review-words')
    pl['class'] = {'review-words'}        # drop the "Hide" class (see below)
    words = pl.get_text().strip()
    returnList.append([title, name, star, words])
```

The selector returns every li under class="reviews-items"; breakpoint debugging showed that the ones with class="item" simply need to be excluded, hence the check at the top of the loop.
The username name is easy to get. The star rating star, however, is encoded in the span's class: class="sml-str40" means 4 stars, so you need to read the class attribute and slice the digit out.
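As a worked example of that slice:

```python
cls = 'sml-str40'    # second class on the rating <span>
print(cls[7:8])      # -> '4', i.e. a 4-star review
```

Note that the single-character slice would truncate half-star classes such as sml-str45, if the page uses them.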
The review text, the most important part, sits behind the "expand review" (展开评论) button, which toggles a class="Hide". So the Hide class on the review div has to be removed first, which is done by simply overwriting the attribute: pl['class'] = {'review-words'}.
That's basically it: collect the rows into a list, then write them to a file or a database.
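For the file route, a minimal sketch using the standard csv module (the file name and column headers are my own, not from the post; written for Python 2 like the rest of the code):

```python
# -*- coding: utf-8 -*-
import csv

# returnList rows are [title, name, star, words], built as above
with open('reviews.csv', 'wb') as f:    # 'wb' for the Python 2 csv module
    writer = csv.writer(f)
    writer.writerow(['title', 'name', 'star', 'words'])
    for row in returnList:
        writer.writerow([col.encode('utf-8') for col in row])
```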
Requests only succeed with proper headers and cookies. The cookies identify the visiting user; some of their fields need to be parsed out, they carry timestamps, and they expire after a while. The Referer header tells the site which page you came from; without it, after a few visits the site will block you from continuing, on suspicion of being a crawler.
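In code this boils down to passing both dicts on every request (the values here are placeholders; real cookies come from your browser's developer tools and expire over time):

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 ...',                     # placeholder UA string
    'Referer': 'http://www.dianping.com/shop/22711693',  # the page you "came from"
}
cookies = {'_lxsdk_cuid': '...', '_hc.v': '...'}         # copied from the browser

r = requests.get('http://www.dianping.com/shop/22711693/review_all',
                 headers=headers, cookies=cookies)
```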
Also, too many requests from one IP will get that IP banned, which is where proxies come in. With Python this is trivial: just pass a proxies argument in the request, r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies). As for proxy IPs, http://www.data5u.com/ lists 20 free ones at the bottom of the page, generally enough for a small crawler. Using proxies raises the question of whether a given proxy is actually reachable, so add the following code to configure retry behavior:
```python
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
```

And that's what the finished crawler looks like.
That's about it. The full code follows.
Feel free to follow me on Weibo @住街对面的查理. My life is pretty interesting; why not come have a look?
```python
#coding=utf-8
from bs4 import BeautifulSoup
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import json
import requests

# shop ids collected from the search listing pages
list = [
    22711693, 24759450, 69761921, 69761921, 22743334, 66125712, 22743270,
    57496584, 75153221, 57641884, 66061653, 70669333, 57279088, 24740739,
    66126129, 75100027, 92667587, 92452007, 72345827, 90004047, 90485109,
    90546031, 83527455, 91070982, 83527745, 94273474, 80246564, 83497073,
    69027373, 96191554, 96683472, 90500524, 92454863, 92272204, 70443082,
    96076068, 91656438, 75633029, 96571687, 97659144, 69253863, 98279207,
    90435377, 70669359, 96403354, 83618952, 81265224, 77365611, 74592526,
    90479676, 56540304, 37924067, 27496773, 56540319, 32571869, 43611843,
    58612870, 22743340, 67293664, 67292945, 57641749, 75157068, 58934198,
    75156610, 59081304, 75156647, 75156702, 67293838,
]
returnList = []
proxies = {
    # "https": "http://14.215.177.73:80",
    "http": "http://202.108.2.42:80",
}
headers = {
    'Host': 'www.dianping.com',
    'Referer': 'http://www.dianping.com/shop/22711693',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/535.19',
    'Accept-Encoding': 'gzip',
}
cookies = {
    '_lxsdk_cuid': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    'lxsdk': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
    '_hc.v': 'ec20d90c-0104-0677-bf24-391bdf00e2d4.1517308569',
    's_ViewType': '10',
    'cy': '16',
    'cye': 'wuhan',
    '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
    '_lxsdk_s': '1614abc132e-f84-b9c-2bc%7C%7C34',
}
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False

for i in list:
    url = "https://www.dianping.com/shop/%s/review_all" % i
    r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)
    soup = BeautifulSoup(r.text, 'lxml')
    # 9 PageLink elements, plus 1, give the total page count
    lenth = len(soup.find_all(class_='PageLink')) + 1
    for j in xrange(1, lenth + 1):   # review pages are /p1 .. /pN
        urlIn = "http://www.dianping.com/shop/%s/review_all/p%s" % (i, j)
        re = requests.get(urlIn, headers=headers, cookies=cookies, proxies=proxies)
        soupIn = BeautifulSoup(re.text, 'lxml')
        title = soupIn.title.string[0:15]
        coment = soupIn.select('.reviews-items li')
        for one in coment:
            try:
                # skip the stray li elements whose first class is "item"
                if one['class'][0] == 'item':
                    continue
            except KeyError:
                pass
            name = one.select_one('.main-review .dper-info .name')
            name = name.get_text().strip()
            star = one.select_one('.main-review .review-rank span')
            star = star['class'][1][7:8]      # e.g. 'sml-str40' -> '4'
            pl = one.select_one('.main-review .review-words')
            pl['class'] = {'review-words'}    # drop the "Hide" class
            words = pl.get_text().strip()
            returnList.append([title, name, star, words])

file = open("/Users/huojian/Desktop/store_shop.sql", "w")
for one in returnList:
    file.write("\n")
    file.write(unicode(one[0]))
    file.write("\n")
    file.write(unicode(one[1]))
    file.write("\n")
    file.write(unicode(one[2]))
    file.write("\n")
    file.write(unicode(one[3]))
    file.write("\n")
```