Python Crawler: Maoyan APP Movie Data (Part 18)

Published: 2023/12/14

Original work takes effort; before reposting, please credit the author's link: Blessy_Zhu https://blog.csdn.net/weixin_42555080
Environment used for this code:
Platform: Windows
Python version: Python3.x
IDE: PyCharm

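0 Preface

I have not written a crawler for a while, and to keep from getting rusty I decided to practice on movie data from the Maoyan app. You surely remember Avengers: Endgame, released this year: full of memories and full of impact. We will use that film as the example for the data analysis that follows.

With a snap of the fingers, half of all life in the universe turned to dust. The nearly hopeless Avengers, helped by Captain Marvel (Brie Larson), track Thanos (Josh Brolin) to his retreat, only to learn that all six Infinity Stones have been destroyed, and their hope is crushed. Five years pass. Ant-Man (Paul Rudd), lost in the Quantum Realm, unexpectedly returns to the real world, and his return rekindles hope among the surviving Avengers. Tony (Robert Downey Jr.), reconciled with Captain America (Chris Evans), finds a way to travel through time. The scattered superheroes assemble once more and travel to different eras to collect the Infinity Stones. Along the way, the Thanos of a parallel universe discovers their plan. In a final battle destined for the history books, the heroes press forward one after another for the beliefs they hold.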

1. Overview of Maoyan Data

1.1 PC Data vs. APP Data

In the post "Python Crawler: Scraping Douban Movie Review Data (Part 14)" I scraped Douban movie reviews. That was comparatively easy, because the pages were static: changing the review URL and parsing the response was enough. The Maoyan data we want today is not that simple.
The Maoyan PC page shows only the 10 hottest comments, which is clearly not enough to support the analysis that follows.

Maoyan PC page: https://maoyan.com/films/248172

The mobile (APP) web version, i.e.

Maoyan mobile page: https://m.maoyan.com/movie/248172/comments?_v_yes

has all the data we need, so to collect enough data for analysis we will crawl the mobile APP side. Inspecting the page's network requests shows that the comments are loaded asynchronously through Ajax. If you need a refresher, see the post "AJAX Data Crawling: Basics and Principles"; the details are not repeated here.

1.2 Analyzing the Maoyan Data

First, locate the data endpoint: { http://m.maoyan.com/mmdb/comments/movie/248172.json?v=yes&offset=1&startTime=2019-07-13%2022:24:21 }

The data the endpoint returns can be pasted into an online JSON editor such as http://www.bejson.com/jsoneditoronline/ to reformat the dictionary data so that it is easier to read.
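The same tidy view can also be produced locally with the standard library instead of an online editor. This is a minimal sketch; the `raw` string is a shortened, made-up response used only for illustration, and `pretty` is a hypothetical helper name.

```python
import json

# Hypothetical, heavily shortened response body; a real response has many more fields.
raw = '{"cmts": [{"id": 1, "nickName": "demo", "score": 5}], "total": 1}'


def pretty(text):
    """Parse a JSON string and return it re-indented, keeping non-ASCII text readable."""
    return json.dumps(json.loads(text), ensure_ascii=False, indent=2)


print(pretty(raw))
```

`ensure_ascii=False` matters here because nicknames, cities and comment text are Chinese; without it they would be printed as `\uXXXX` escapes.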

In the endpoint URL, "248172" is the ID of the movie and identifies it uniquely. How do you find it?
Open the Maoyan home page, https://maoyan.com/ , and find the movie to crawl, Avengers: Endgame. Its page URL (https://maoyan.com/films/248172) ends with the movie's unique ID.
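If you have the page URL at hand, the ID can be pulled out programmatically. A small sketch, assuming the ID is always the number that follows `/films/` (PC) or `/movie/` (mobile) in the path, as in the two URLs shown in this article; `movie_id_from_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse


def movie_id_from_url(url):
    """Return the numeric ID that follows /films/ or /movie/ in the URL path, or None."""
    path = urlparse(url).path
    match = re.search(r'/(?:films|movie)/(\d+)', path)
    return int(match.group(1)) if match else None


print(movie_id_from_url('https://maoyan.com/films/248172'))
print(movie_id_from_url('https://m.maoyan.com/movie/248172/comments?_v_yes'))
```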
　

Analyzing the comment data gives the following:

The response is JSON.
248172 is the movie's own ID; offset is the offset; startTime is the time from which comments are fetched. Data is read backwards from that time, so the newest comments come first.
cmts holds the comments, 15 per request. offset is the starting index of each request, and the 15 comments after it are returned. The request URLs below show the pattern:

https://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=0&limit=15&ts=0&type=3
https://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=15&limit=15&ts=1563091719066&type=3
https://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=30&limit=15&ts=1563091719066&type=3
https://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=45&limit=15&ts=1563091719066&type=3
https://m.maoyan.com/review/v2/comments.json?movieId=248172&userId=-1&offset=60&limit=15&ts=1563091719066&type=3

hcmts holds the top 10 hot comments.
total is the total number of comments.
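The parameters described above can be assembled as in the following sketch. It assumes the `mmdb/comments/movie/<id>.json` endpoint with the `_v_=yes` query form used in the crawler code of this article; the helper names `build_url`, `page_offsets` and `next_start_time` are hypothetical, and nothing here contacts the network.

```python
from datetime import datetime, timedelta
from urllib.parse import quote

BASE = 'http://m.maoyan.com/mmdb/comments/movie/{movie_id}.json'
TIME_FORMAT = '%Y-%m-%d %H:%M:%S'
PAGE_SIZE = 15  # the article observes 15 comments per request


def build_url(movie_id, offset, start_time):
    """Build a comment URL; the space in start_time is encoded as %20."""
    return (BASE.format(movie_id=movie_id)
            + '?_v_=yes&offset=' + str(offset)
            + '&startTime=' + quote(start_time, safe=':'))


def page_offsets(pages):
    """Offsets of the first `pages` pages: 0, 15, 30, ..."""
    return [PAGE_SIZE * i for i in range(pages)]


def next_start_time(last_comment_time):
    """Step one second back from the oldest comment seen, to avoid fetching it again."""
    moment = datetime.strptime(last_comment_time, TIME_FORMAT) - timedelta(seconds=1)
    return moment.strftime(TIME_FORMAT)


print(build_url(248172, 0, '2019-07-13 22:24:21'))
print(page_offsets(3))
print(next_start_time('2019-07-13 22:23:20'))
```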

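2. Code Implementation

Here I use Avengers: Infinity War as the example for the code demonstration.

Import the required modules (all of them are in the standard library, so nothing needs to be installed):

    import json
    import time
    from datetime import datetime, timedelta
    from urllib import request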

2.1 Fetching data with get_data() and parsing it with parse_data()

    # Fetch the raw response body
    def get_data(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'
        }
        req = request.Request(url, headers=headers)
        response = request.urlopen(req)
        if response.getcode() == 200:
            return response.read()
        return None

    # Parse the response
    def parse_data(html):
        data = json.loads(html)['cmts']  # decode the JSON body and take the comment list
        comments = []
        for item in data:
            comment = {
                'id': item['id'],
                'nickName': item['nickName'],
                'cityName': item['cityName'] if 'cityName' in item else '',  # cityName may be missing
                'content': item['content'].replace('\n', ' ', 10),  # flatten up to 10 line breaks in the comment
                'score': item['score'],
                'startTime': item['startTime']
            }
            comments.append(comment)
        return comments

2.2 Storing data with save_to_txt()

    # Store the data in a text file
    def save_to_txt():
        # Start from the current time and fetch backwards
        start_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        end_time = '2018-05-11 00:00:00'
        # while start_time > end_time:
        # This is a demo, so only the first 10 pages are crawled
        for i in range(10):
            url = ('http://m.maoyan.com/mmdb/comments/movie/248170.json?_v_=yes&offset='
                   + str(15 * i) + '&startTime=' + start_time.replace(' ', '%20'))
            try:
                html = get_data(url)
            except Exception:
                time.sleep(3)  # wait, then retry once
                html = get_data(url)
            time.sleep(1)  # pause between requests
            if html is None:
                break
            comments = parse_data(html)
            if not comments:  # no more comments to fetch
                break
            print(comments)
            start_time = comments[-1]['startTime']  # time of the last comment on this page
            # Convert to datetime and subtract 1 second to avoid fetching duplicates
            start_time = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S') + timedelta(seconds=-1)
            start_time = datetime.strftime(start_time, '%Y-%m-%d %H:%M:%S')  # back to str
            with open('comments.txt', 'a', encoding='utf-8') as f:
                for item in comments:
                    f.write(str(item['id']) + ',' + item['nickName'] + ',' + item['cityName'] + ','
                            + item['content'] + ',' + str(item['score']) + ',' + item['startTime'] + '\n')
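One caveat with joining fields by commas: a comment whose text itself contains a comma shifts every later column when the line is split again. The sketch below is an alternative under my own assumptions, not the author's method: it uses the standard `csv` module, which quotes such fields. `write_comments` and `read_comments` are hypothetical names, and the sample comment is made up.

```python
import csv
import io

FIELDS = ['id', 'nickName', 'cityName', 'content', 'score', 'startTime']


def write_comments(comments, fileobj):
    """Write comment dicts as CSV rows; fields containing commas are quoted automatically."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS, extrasaction='ignore')
    for comment in comments:
        writer.writerow(comment)


def read_comments(fileobj):
    """Read the rows back as dicts keyed by FIELDS (all values come back as strings)."""
    return list(csv.DictReader(fileobj, fieldnames=FIELDS))


# Made-up sample whose content contains a comma
sample = [{'id': 1, 'nickName': 'WDYGY', 'cityName': '珠海',
           'content': 'Great, really great', 'score': 5,
           'startTime': '2019-07-13 22:23:20'}]

buffer = io.StringIO(newline='')
write_comments(sample, buffer)
buffer.seek(0)
rows = read_comments(buffer)
print(rows[0]['content'], rows[0]['score'])

# The naive comma join produces 7 columns for this comment instead of 6
naive = ','.join(str(sample[0][name]) for name in FIELDS)
print(len(naive.split(',')))
```

With a real file, open it with `newline=''` and `encoding='utf-8'`, as the `csv` documentation recommends. The later sections of this article read `comments.txt` with `row.split(',')`, so they would need `read_comments` instead if the storage format were changed.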

[Figure: crawl result]

2.3 Visualizing Fan Locations

This step uses pyecharts, a library that generates Echarts charts, which makes it easy to build visualizations from data in Python. Echarts is a JavaScript data-visualization library open-sourced by Baidu. The imports below follow the pyecharts 0.5.x API; from version 1.0 on, the chart classes moved to pyecharts.charts and this code will not run unchanged.

    import json
    # Counter counts how often each value occurs
    from collections import Counter

    # Style defines the look of a chart
    from pyecharts import Style
    # Geo draws charts on geographic coordinates
    from pyecharts import Geo
    # Bar draws bar charts
    from pyecharts import Bar

    # Data visualization
    def funsLoctions():
        # Collect every city that appears in the comments
        cities = []
        with open('comments.txt', mode='r', encoding='utf-8') as f:
            rows = f.readlines()
            for row in rows:
                city = row.split(',')[2]
                if city != '':  # skip empty city names
                    cities.append(city)

        # Reconcile the city names with the place names in the coordinate file
        # handle(cities)

        # Count how often each city occurs
        data = Counter(cities).most_common()  # list of (city, count) tuples
        print(data)

        # Define the style
        style = Style(title_color='#fff', title_pos='center',
                      width=1200, height=600, background_color='#404a59')

        # Build the geographic chart from the city data
        geo = Geo('《一出好戏》粉丝位置分布', '数据来源：猫眼电影数据', **style.init_style)
        attr, value = geo.cast(data)
        geo.add('', attr, value, visual_range=[0, 3500],
                visual_text_color='#fff', symbol_size=15,
                is_visualmap=True, is_piecewise=True, visual_split_number=10)
        geo.render('粉丝位置分布-地理坐标图.html')

        # Build the bar chart from the city data
        data_top20 = Counter(cities).most_common(20)  # the 20 most frequent cities
        bar = Bar("《一出好戏》粉丝来源排行TOP20", "数据来源：猫眼电影数据",
                  title_pos='center', width=1200, height=600)
        attr, value = bar.cast(data_top20)
        bar.add("", attr, value, is_visualmap=True, visual_range=[0, 3500],
                visual_text_color='#fff', is_more_utils=True, is_label_show=True)
        bar.render("粉丝来源排行-柱状图.html")

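At this point my code raised two errors:

1 The pyecharts_snapshot library was not found
Solution:
Download pyecharts_snapshot from its official site and install it.

Alternatively, add pyecharts_snapshot in PyCharm under File > Settings > Project: XX > Project Interpreter.
2 Error: ValueError: No coordinate is specified for xxx (a place name)
Cause: the place name is missing from the pyecharts coordinate file. In practice the names just do not match: the data has '达州', for example, while the coordinate file has '达州市'.
Path of the coordinate file: <project>/venv/lib/python3.6/site-packages/pyecharts/datasets/city_coordinates.json
Solution: edit the coordinate file. Duplicate the existing entry in place, then change the place name in the copy.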

    {
        "达州市": [107.5, 31.22],
        "达州": [107.5, 31.22],
    }

Because so many place names need changing, doing this by hand is tedious, so we can define a function that handles place names missing from the coordinate file:

    # Handle place names that cannot be found in the coordinate file
    def handle(cities):
        # print(len(cities), len(set(cities)))

        # Load every place name from the coordinate file
        path = '/absolute/path/to/project/venv/lib/python3.6/site-packages/pyecharts/datasets/city_coordinates.json'
        with open(path, mode='r', encoding='utf-8') as f:
            data = json.loads(f.read())  # parse the JSON text into a dict

        # Check and fix each city in turn
        data_new = data.copy()  # copy all place-name data
        for city in set(cities):  # set() removes duplicates
            # Drop empty place names
            if city == '':
                while city in cities:
                    cities.remove(city)
                continue
            for k in data.keys():
                if k == city:
                    break
                if k.startswith(city):  # abbreviated names, e.g. 达州市 shortened to 达州
                    # print(k, city)
                    data_new[city] = data[k]
                    break
                if k.startswith(city[0:-1]) and len(city) >= 3:  # administrative renames, e.g. county to district or city
                    data_new[city] = data[k]
                    break
            else:
                # No match at all: drop the place name
                while city in cities:
                    cities.remove(city)

        # Overwrite the coordinate file
        with open(path, mode='w', encoding='utf-8') as f:
            f.write(json.dumps(data_new, ensure_ascii=False))  # serialize the dict back to JSON text
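The matching rules used by handle() can be tried out in memory, without touching the file that ships with pyecharts. This is only a sketch of the same three rules (exact match, abbreviated name, changed last character); `resolve_city` is a hypothetical helper, and the coordinate dict is a tiny stand-in whose second entry uses placeholder coordinates.

```python
def resolve_city(city, coordinates):
    """Return the key in `coordinates` that matches `city`, or None if nothing matches."""
    if not city:
        return None
    if city in coordinates:  # exact match
        return city
    for key in coordinates:  # abbreviated name, e.g. 达州 for 达州市
        if key.startswith(city):
            return key
    if len(city) >= 3:  # last character changed, e.g. county renamed to district
        stem = city[:-1]
        for key in coordinates:
            if key.startswith(stem):
                return key
    return None


# Stand-in for city_coordinates.json; the 双流区 coordinates are placeholders
coordinates = {'达州市': [107.5, 31.22], '双流区': [0.0, 0.0]}

for name in ['达州', '达州市', '双流县', '火星', '']:
    print(repr(name), '->', resolve_city(name, coordinates))
```

Mapping each scraped city to a known key before counting would let the chart code run against an unmodified coordinate file; whether that is preferable to rewriting the file is a design choice.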

[Figure: fan location charts (geographic chart and TOP20 bar chart)]

2.4 Visualizing Star Ratings

    # coding=utf-8
    # Pie draws pie charts
    from pyecharts import Pie

    # Collect every score in the comments
    rates = []
    with open('comments.txt', mode='r', encoding='utf-8') as f:
        rows = f.readlines()
        for row in rows:
            rates.append(row.split(',')[4])

    # Define the star levels (five stars down to one star) and count the scores in each
    attr = ["五星", "四星", "三星", "二星", "一星"]
    value = [
        rates.count('5') + rates.count('4.5'),
        rates.count('4') + rates.count('3.5'),
        rates.count('3') + rates.count('2.5'),
        rates.count('2') + rates.count('1.5'),
        rates.count('1') + rates.count('0.5')
    ]
    pie = Pie('评分星级比例', title_pos='center', width=900)
    pie.add("7-17", attr, value, center=[75, 50], is_random=True,
            radius=[30, 75], rosetype='area',
            is_legend_show=False, is_label_show=True)
    pie.render('评分.html')
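The grouping above puts each half-point score in the star level above it (4.5 with five stars, 3.5 with four, and so on), which is the same as rounding the score up. The sketch below does that numerically, so a score stored as '5.0' is counted too; `star_counts` is a hypothetical helper and the sample scores are made up.

```python
import math
from collections import Counter


def star_counts(rates):
    """Count scores per star level and return the counts ordered five stars to one star."""
    counter = Counter()
    for rate in rates:
        try:
            stars = math.ceil(float(rate))  # 4.5 -> 5, 3.5 -> 4, 0.5 -> 1
        except (ValueError, OverflowError):
            continue  # skip anything that is not a usable number
        if 1 <= stars <= 5:
            counter[stars] += 1
    return [counter[level] for level in (5, 4, 3, 2, 1)]


# Made-up scores, in the string form they have after row.split(',')
sample_rates = ['5', '4.5', '4', '3.5', '1', '0.5', 'abc', '0']
print(star_counts(sample_rates))
```

The returned list lines up with the `attr` labels, so it could be passed to `pie.add` in place of `value`.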

Result: because the sample is small, there turned out to be no two-star ratings at all.

2.5 Visualizing Comments as a Word Cloud

    # coding=utf-8
    # jieba performs Chinese word segmentation
    import jieba
    # matplotlib draws 2D figures
    import matplotlib.pyplot as plt
    # wordcloud builds the word-cloud image
    from wordcloud import WordCloud

    # Collect every comment
    comments = []
    with open('comments.txt', mode='r', encoding='utf-8') as f:
        rows = f.readlines()
        for row in rows:
            comment = row.split(',')[3]
            if comment != '':
                comments.append(comment)

    # Segment the text
    comment_after_split = jieba.cut(' '.join(comments), cut_all=False)  # cut_all=False: accurate mode, not full mode
    words = " ".join(comment_after_split)  # join the tokens with spaces

    # Word-cloud settings: canvas width and height, background color, font, largest font size
    wc = WordCloud(width=1024, height=768, background_color='white',
                   font_path='STKAITI.TTF',
                   max_font_size=400, random_state=50)
    # Feed the segmented text to the word cloud
    wc.generate_from_text(words)
    plt.imshow(wc)
    plt.axis('off')  # hide the axes
    plt.show()
    # Save the result locally
    wc.to_file('wc.jpg')

[Figure: comment word cloud]

3. Summary

The endpoint actually returns quite a few data dimensions. Below is the data for a single user's comment. You can study the fields yourself and download only what you need:

    {
        "approve": 0,
        "approved": false,
        "assistAwardInfo": {
            "avatar": "",
            "celebrityId": 0,
            "celebrityName": "",
            "rank": 0,
            "title": ""
        },
        "authInfo": "",
        "avatarurl": "https://img.meituan.net/avatar/77ec05f3cd886ec3eb9cde0ddbf3634c151592.jpg",
        "cityName": "珠海",
        "content": "简直太赞了，对于漫威迷来说，这部真的很走心了",
        "filmView": false,
        "gender": 2,
        "id": 1071346062,
        "isMajor": false,
        "juryLevel": 0,
        "majorType": 0,
        "movieId": 248172,
        "nick": "WDYGY",
        "nickName": "WDYGY",
        "oppose": 0,
        "pro": false,
        "reply": 0,
        "score": 5,
        "spoiler": 0,
        "startTime": "2019-07-13 22:23:20",
        "supportComment": true,
        "supportLike": true,
        "sureViewed": 1,
        "tagList": {
            "fixed": [
                {"id": 1, "name": "好评"},
                {"id": 4, "name": "购票"}
            ]
        },
        "time": "2019-07-13 22:23",
        "userId": 245966367,
        "userLevel": 2,
        "videoDuration": 0,
        "vipInfo": "",
        "vipType": 0
    }
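Picking out only the fields you need can be done with a small helper like the one sketched here. `pick_fields` is a hypothetical name, the `sample` dict is a shortened copy of the record above, and the list of wanted fields is just an example.

```python
def pick_fields(comment, fields, default=''):
    """Return a new dict holding only `fields`; missing keys get `default`."""
    return {name: comment.get(name, default) for name in fields}


# Shortened copy of the record shown above
sample = {
    "cityName": "珠海",
    "gender": 2,
    "id": 1071346062,
    "score": 5,
    "startTime": "2019-07-13 22:23:20",
    "userLevel": 2,
}

wanted = ['id', 'cityName', 'gender', 'userLevel', 'score', 'avatarurl']
row = pick_fields(sample, wanted)
print(row)
```

Using `dict.get` with a default covers optional fields the same way parse_data() handles a missing cityName.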

That is all for this article. Corrections and criticism are welcome, and so are comments and discussion.
