
TMDB Movie Data Analysis Report

Published: 2023/12/31


Preface
1. Posing the Questions
2. Understanding the Data
3. Data Cleaning
4. Data Visualization
5. Producing the Data Analysis Report
Code

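Preface

The basic workflow of data analysis:

Pose the questions

Understand the data

Clean the data

Build a model

Visualize the data

Write the report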

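1. Posing the Questions

The main task of this report is to use historical movie data to analyze which kinds of movies are more profitable, where movie trends are heading, and what advice can be offered for film production. This breaks down into the following questions:

How movie genres have changed over time;

The profitability of different genres;

The popularity of different genres;

The ratings of different genres;

Original films versus adaptations;

The factors that influence box-office revenue;

A comparison of two industry giants, Universal Pictures and Paramount Pictures.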

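2. Understanding the Data

The raw datasets, tmdb_5000_movies and tmdb_5000_credits, were downloaded from Kaggle. The former holds basic movie information in 20 variables; the latter holds cast and crew information in 4 variables.
After importing and inspecting the data, and with the questions above in mind, the following 9 variables were selected as the focus of the analysis: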

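No.	Variable	Description
1	budget	Movie budget (USD)
2	genres	Movie genre
3	keywords	Movie keywords
4	popularity	Popularity
5	production_companies	Production companies
6	release_year	Release year
7	revenue	Box-office revenue (USD)
8	vote_average	Average rating
9	vote_count	Number of ratings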

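3. Data Cleaning

For this dataset, data cleaning consists of three steps: (1) data preprocessing, (2) feature extraction, (3) feature selection.

Data preprocessing:
(1) Inspecting the dataset shows missing values in 'runtime' (the code below fills two rows), one missing record in 'release_date', and 30 missing records in 'homepage'. Only 'release_date' and 'runtime' are filled. The approach: locate the affected movie by index, look up the correct value online, and fill it in. (See the code below.)
(2) Convert the 'release_date' column to a datetime format and extract the year from it.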
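
The two preprocessing steps can be sketched on a toy table. The titles, dates and the fill value here are made up for illustration and are not taken from the dataset; only the column names follow the text above.

```python
import pandas as pd

# Toy stand-in for the movies table; titles and dates are invented.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'release_date': ['2009-12-10', None, '1995-07-01'],
})

# Step 1: locate the rows with a missing release date.
missing = toy[toy['release_date'].isnull()]

# Fill the gap with a looked-up date (a hypothetical value here).
toy['release_date'] = toy['release_date'].fillna('2014-06-01')

# Step 2: parse the strings as dates and keep only the year.
toy['release_year'] = pd.to_datetime(toy['release_date'], format='%Y-%m-%d').dt.year
print(toy)
```

Looking at `missing` first makes it clear which movie needs a manual lookup before any value is written back.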

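Feature extraction:
(1) In the credits dataset, the cast and crew columns are JSON strings; actors and directors need to be extracted from them. In the movies dataset, genres, keywords and production_companies are also JSON strings and need to be converted to plain strings.
(2) Procedure: json.loads first parses each JSON string into a list of dictionaries of the form [{}, {}, {}]. Each dictionary is then visited, the value stored under the key 'name' is taken, and the values are joined with "," to form a "multiple-choice" structure. When a specific question is analyzed, the multiple-choice field is encoded as dummy variables: every distinct option becomes a new variable, set to 1 if the observation contains that option and 0 otherwise.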
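
A minimal sketch of this procedure on three toy rows shaped like the 'genres' column. It uses pandas' `Series.str.get_dummies` for the encoding step, which is an alternative to the `str.contains` loop in the article's code, not the article's own method; the sample rows are invented.

```python
import json
import pandas as pd

# Toy rows in the same shape as the 'genres' column (a JSON list of dicts).
raw = pd.Series([
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 18, "name": "Drama"}]',
    '[]',
])

def get_names(json_text):
    """Return the 'name' values of a JSON list of dicts, comma-separated."""
    return ','.join(item['name'] for item in json.loads(json_text))

# "Multiple-choice" strings such as 'Action,Adventure'.
genres = raw.apply(get_names)

# One indicator column per distinct genre: 1 if the movie has it, else 0.
dummies = genres.str.get_dummies(sep=',')
print(dummies)
```

Splitting on the separator matches whole option names, so an option that happens to be a substring of another is not counted twice.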

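Feature selection:
Before each question is analyzed, the variables best suited to it are selected. In practice this means building a new DataFrame that holds only the variables under analysis, rather than scribbling over the original DataFrame.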

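4. Data Visualization

This analysis is limited to basic descriptive and correlation analysis of the dataset. What would be the model-building step is folded into feature selection and the construction of new DataFrames; the case does not involve machine learning, so no model is built.
The charts used are line charts, bar charts, histograms, pie charts, scatter plots and word clouds. (See the code below.)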

5. Producing the Data Analysis Report

(Image: the data analysis report.)

Code:

Import the packages and read the datasets:

import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

sns.set_style('darkgrid')
# SimHei lets matplotlib render Chinese characters in plot labels
plt.rcParams['font.sans-serif'] = ['SimHei']

# Read the datasets: movie information and cast/crew information
movies = pd.read_csv('tmdb_5000_movies.csv', encoding='utf_8')
credits = pd.read_csv('tmdb_5000_credits.csv', encoding='utf_8')

Parse the JSON columns, merge the two tables into one, and drop the columns that are not needed:

# credits: parse the JSON columns
json_cols = ['cast', 'crew']
for col in json_cols:
    credits[col] = credits[col].apply(json.loads)

# Join the 'name' values of a list of dicts into a comma-separated string
def get_names(x):
    return ','.join([i['name'] for i in x])

# Extract the actors
credits['cast'] = credits['cast'].apply(get_names)
credits.head()

# Extract the director: the first crew member whose job is 'Director'
def get_directors(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

credits['crew'] = credits['crew'].apply(get_directors)
# Rename the 'crew' column to 'director'
credits.rename(columns={'crew': 'director'}, inplace=True)

# movies: parse the JSON columns and convert them to strings
json_cols = ['genres', 'keywords', 'spoken_languages', 'production_companies', 'production_countries']
for col in json_cols:
    movies[col] = movies[col].apply(json.loads).apply(get_names)

# Merge the data
# Both tables have a title column; check that the two agree row by row
(movies['title'] == credits['title']).describe()
# Drop the duplicated column
del movies['title']
# Inner-join the two tables on the movie id
df = credits.merge(right=movies, how='inner', left_on='movie_id', right_on='id')
# Drop the columns the analysis does not need
df = df.drop(columns=['overview', 'original_title', 'id', 'homepage', 'spoken_languages', 'tagline'])
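
Before relying on an inner join, it can help to check whether every id finds a partner. The sketch below uses two toy tables (invented rows, with column names mirroring the article's) and the `validate` and `indicator` parameters of `DataFrame.merge`; it is an optional check, not part of the original workflow.

```python
import pandas as pd

# Toy stand-ins for the credits and movies tables; rows are invented.
credits_toy = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['A', 'B', 'C']})
movies_toy = pd.DataFrame({'id': [1, 2, 4], 'title': ['A', 'B', 'D'], 'budget': [10, 20, 40]})

merged = credits_toy.merge(
    movies_toy.drop(columns='title'),  # avoid a duplicated title column
    how='outer',                       # keep unmatched rows so they can be inspected
    left_on='movie_id',
    right_on='id',
    validate='one_to_one',             # raises MergeError if an id repeats on either side
    indicator=True,                    # adds a '_merge' column: both / left_only / right_only
)

# Rows that an inner join would silently drop.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```

If `unmatched` is empty, an inner join on the same keys loses no rows.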

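Fill the missing values and extract the year:

# Fill missing values
# First locate the record with a missing release date
df[df.release_date.isnull()]
# Then fill it with the release date found online
df['release_date'] = df['release_date'].fillna('2014-06-01')
# Runtime is handled the same way; fill only the 'runtime' cell of each affected row
df.loc[2656, 'runtime'] = 94
df.loc[4140, 'runtime'] = 81
# Convert the date format and keep only the year
df['release_year'] = pd.to_datetime(df.release_date, format='%Y-%m-%d').dt.year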

Number of movies per genre, and how the genres change over time:

# Collect the distinct genres
genre = set()
for names in df['genres'].str.split(','):
    genre = genre.union(names)
# Drop the empty string produced by movies without a genre, then convert to a list
genre.discard('')
genre = list(genre)

# One-hot encode the genres
for genr in genre:
    df[genr] = df['genres'].str.contains(genr, regex=False).astype(int)
df_gy = df.loc[:, genre]
df_gy.index = df['release_year']

# Total number of movies per genre
df_gysum = df_gy.sum().sort_values(ascending=True)
df_gysum.plot.barh(figsize=(10, 6))
plt.xlabel('Count', fontsize=15)
plt.ylabel('Genre', fontsize=15)
plt.title('Total number of movies per genre', fontsize=20)
plt.grid(False)
plt.show()

# Genre counts per year, from 1960 onward
df_gys = df_gy.groupby('release_year').sum()
df_sub_gys = df_gys[['Drama', 'Comedy', 'Thriller', 'Romance', 'Adventure', 'Crime', 'Science Fiction', 'Horror']].loc[1960:, :]
df_sub_gys.plot(figsize=(10, 6))
plt.xticks(range(1960, 2018, 10))
plt.xlabel('Year', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.title('Movie genres over time', fontsize=20)
plt.show()

Popularity of different genres:

# Data frame indexed by genre, holding the mean popularity of each genre
df_gen_popu = pd.DataFrame(index=genre)
df_gen_popu['mean_popularity'] = [df.loc[df[genr] == 1, 'popularity'].mean() for genr in genre]
df_gen_popu.sort_values(by='mean_popularity', ascending=True).plot.barh(figsize=(10, 6))
plt.xlabel('Popularity', fontsize=15)
plt.ylabel('Genre', fontsize=15)
plt.title('Popularity of different genres', fontsize=20)
plt.grid(False)
plt.show()

# Keyword analysis
# Join the keyword strings of all movies into one long string
lis = ','.join(df['keywords'])
# str.replace returns a new string, so the result must be assigned back
lis = lis.replace("'s", '')
# Set the stop words
stopwords = set(STOPWORDS)
stopwords.add('film')
wordcloud = WordCloud(background_color='white', stopwords=stopwords, max_words=3000, scale=1).generate(lis)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
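
A word cloud shows relative frequency only visually. As a complementary sketch, the keyword counts themselves can be tabulated with `collections.Counter`; the three keyword strings below are invented for illustration.

```python
from collections import Counter
import pandas as pd

# Toy comma-separated keyword strings, shaped like df['keywords'].
keywords = pd.Series(['space,alien,based on novel', 'alien,sequel', ''])

# Count whole keywords, skipping the empty string left by movies without keywords.
counts = Counter(kw for row in keywords for kw in row.split(',') if kw)

# The most frequent keyword and its count.
top = counts.most_common(1)
print(top)
```

Counting whole keywords keeps multi-word phrases such as "based on novel" intact, whereas a word cloud built from raw text splits them into single words. WordCloud also offers `generate_from_frequencies`, which accepts a mapping like `counts`.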

Profitability of different genres:

# Add a profit column
df['profit'] = df['revenue'] - df['budget']
# Profit frame: genre indicators plus profit
profit_df = pd.concat([df.loc[:, genre], df['profit']], axis=1)
# Series indexed by genre, holding the total profit of each genre
profit_sum_by_genre = pd.Series(index=genre, dtype=float)
for genr in genre:
    profit_sum_by_genre.loc[genr] = profit_df.loc[profit_df[genr] == 1, 'profit'].sum()
# Series indexed by genre, holding the total budget of each genre
budget_df = pd.concat([df.loc[:, genre], df['budget']], axis=1)
budget_by_genre = pd.Series(index=genre, dtype=float)
for genr in genre:
    budget_by_genre.loc[genr] = budget_df.loc[budget_df[genr] == 1, 'budget'].sum()
# Combine the two series side by side
profit_rate = pd.concat([profit_sum_by_genre, budget_by_genre], axis=1)
profit_rate.columns = ['profit', 'budget']
# Add the profit-rate column (in percent)
profit_rate['profit_rate'] = (profit_rate['profit'] / profit_rate['budget']) * 100
profit_rate.sort_values(by=['profit', 'profit_rate'], ascending=False, inplace=True)
# xl holds the genre labels; n is the number of genres
xl = profit_rate.index
n = len(xl)

# Plot total profit (bars) and profit rate (line) per genre
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(range(n), profit_rate['profit'], label='profit', alpha=0.7)
ax1.set_xticks(range(n))
ax1.set_xticklabels(xl, rotation=60, fontsize=12)
ax1.tick_params(axis='y', labelsize=12)
ax1.set_xlabel('Genre', fontsize=15)
ax1.set_ylabel('Profit', fontsize=15)
ax1.set_title('Profitability of different genres', fontsize=20)
ax1.set_ylim(0, 1.2e11)
# Show the secondary y-axis labels as percentages
import matplotlib.ticker as mtick
ax2 = ax1.twinx()
ax2.plot(range(n), profit_rate['profit_rate'], 'ro-', lw=2, label='profit_rate')
ax2.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2f%%'))
ax2.tick_params(axis='y', labelsize=15)
ax2.set_ylabel('Profit rate', fontsize=15)
ax2.grid(False)
plt.show()
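
The profit-rate arithmetic is easier to verify on a handful of rows. This sketch uses `explode` instead of the indicator-column loop above, which is an alternative formulation rather than the article's method; all figures are invented.

```python
import pandas as pd

# Toy movies with comma-separated genres; budgets and revenues are invented.
toy = pd.DataFrame({
    'genres': ['Action,Adventure', 'Drama', 'Action', ''],
    'budget': [100, 50, 200, 10],
    'revenue': [300, 40, 500, 5],
})
toy['profit'] = toy['revenue'] - toy['budget']

# One row per (movie, genre) pair; a movie with several genres counts in each.
pairs = toy.assign(genre=toy['genres'].str.split(',')).explode('genre')
pairs = pairs[pairs['genre'] != '']  # drop movies without a genre

# Total profit and budget per genre, then profit rate in percent.
by_genre = pairs.groupby('genre')[['profit', 'budget']].sum()
by_genre['profit_rate'] = by_genre['profit'] / by_genre['budget'] * 100
print(by_genre)
```

Here Action earns 500 on a budget of 300, a profit rate of about 167%, while Adventure reaches 200% on a single movie. Total profit and profit rate can therefore rank genres differently, which is why the chart above plots both.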

Average profitability of different genres:

# Series indexed by genre, holding the mean profit of each genre
profit_mean_by_genre = pd.Series(index=genre, dtype=float)
for genr in genre:
    profit_mean_by_genre.loc[genr] = df.loc[df[genr] == 1, 'profit'].mean()
# Series indexed by genre, holding the mean budget of each genre
budget_mean_by_genre = pd.Series(index=genre, dtype=float)
for genr in genre:
    budget_mean_by_genre.loc[genr] = df.loc[df[genr] == 1, 'budget'].mean()
# Combine the two series side by side
profit_rate_mean = pd.concat([profit_mean_by_genre, budget_mean_by_genre], axis=1)
profit_rate_mean.columns = ['mean_profit', 'mean_budget']
# Add the mean profit-rate column (in percent)
profit_rate_mean['mean_profit_rate'] = (profit_rate_mean['mean_profit'] / profit_rate_mean['mean_budget']) * 100
profit_rate_mean.sort_values(by=['mean_profit', 'mean_profit_rate'], ascending=False, inplace=True)
# xl holds the genre labels; n is the number of genres
xl = profit_rate_mean.index
n = len(xl)

# Plot mean profit (bars) and mean profit rate (line) per genre
fig = plt.figure(figsize=(10, 6))
ax3 = fig.add_subplot(1, 1, 1)
ax3.bar(range(n), profit_rate_mean['mean_profit'], label='mean_profit', alpha=0.7)
ax3.set_xticks(range(n))
ax3.set_xticklabels(xl, rotation=60, fontsize=12)
ax3.tick_params(axis='y', labelsize=12)
ax3.set_xlabel('Genre', fontsize=15)
ax3.set_ylabel('Average profit', fontsize=15)
ax3.set_title('Average profitability of different genres', fontsize=20)
# Show the secondary y-axis labels as percentages
import matplotlib.ticker as mtick
ax4 = ax3.twinx()
ax4.plot(range(n), profit_rate_mean['mean_profit_rate'], 'ro-', lw=2, label='mean_profit_rate')
ax4.yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2f%%'))
ax4.tick_params(axis='y', labelsize=15)
ax4.set_ylabel('Average profit rate', fontsize=15)
ax4.grid(False)
plt.show()

Budgets of different genres:

# Plot the mean budget per genre
profit_rate_mean.sort_values(by=['mean_budget'], ascending=False, inplace=True)
xl = profit_rate_mean.index
n = len(xl)
fig = plt.figure(figsize=(10, 6))
ax5 = fig.add_subplot(1, 1, 1)
ax5.bar(range(n), profit_rate_mean['mean_budget'], label='mean_budget', alpha=0.7)
ax5.set_xticks(range(n))
ax5.set_xticklabels(xl, rotation=60, fontsize=12)
ax5.tick_params(axis='y', labelsize=12)
ax5.set_xlabel('Genre', fontsize=15)
ax5.set_ylabel('Average budget', fontsize=15)
ax5.set_title('Average budget of different genres', fontsize=20)
plt.show()

Average rating of different genres:

# Frame with the genre indicators plus the average rating
vote_avg_df = pd.concat([df.loc[:, genre], df['vote_average']], axis=1)
voteavg_mean_list = []
for genr in genre:
    voteavg_mean_list.append(vote_avg_df.loc[vote_avg_df[genr] == 1, 'vote_average'].mean())
# Data frame of the mean rating per genre
voteavg_mean_by_genre = pd.DataFrame(index=genre)
voteavg_mean_by_genre['voteavg_mean'] = voteavg_mean_list

# Correlation between popularity and average rating
df['popularity'].corr(df['vote_average'])

# Plot the mean rating per genre
voteavg_mean_by_genre.sort_values(by='voteavg_mean', ascending=False, inplace=True)
fig = plt.figure(figsize=(10, 6))
ax = fig.add_subplot(1, 1, 1)
voteavg_mean_by_genre.plot.bar(ax=ax)
plt.title('Average rating of different genres', fontsize=20)
plt.xlabel('Genre', fontsize=15)
plt.ylabel('Average rating', fontsize=15)
plt.xticks(rotation=45)
# ylim takes only a lower and an upper bound
plt.ylim(5, 7)
plt.show()

# Plot the rating distribution of all movies
# histplot replaces the deprecated distplot (requires seaborn 0.11 or later)
fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df['vote_average'], bins=10, kde=True, stat='density', ax=ax)
plt.title('Distribution of average ratings', fontsize=20)
plt.xlabel('Average rating', fontsize=15)
plt.ylabel('Density', fontsize=15)
plt.xticks(np.arange(11))
plt.grid(True)
plt.show()

Original films versus adaptations:

# Flag adaptations: movies whose keywords contain 'based on'
original_novel = pd.DataFrame()
original_novel['keywords'] = df['keywords'].str.contains('based on', regex=False).astype(int)
original_novel['profit'] = df['profit']
novel_count = original_novel['keywords'].sum()
original_count = original_novel['keywords'].count() - novel_count
# Mean profit per group (row 0: original, row 1: adapted)
original_novel = original_novel.groupby('keywords', as_index=False).mean()

# Data frame comparing original films and adaptations
org_vs_novel = pd.DataFrame()
org_vs_novel['count'] = [original_count, novel_count]
org_vs_novel['profit'] = original_novel['profit']
org_vs_novel.index = ['original works', 'based_on_novel']

# Plot the share of titles (pie chart) and the average profit per film (bar chart)
fig = plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.pie(org_vs_novel['count'], labels=org_vs_novel.index, autopct='%.2f%%', startangle=90, pctdistance=0.6)
plt.title('Original vs. adapted films: share of titles', fontsize=15)
plt.subplot(1, 2, 2)
org_vs_novel['profit'].plot.bar()
plt.xticks(rotation=0)
plt.ylabel('Average profit', fontsize=12)
plt.title('Original vs. adapted films: average profit', fontsize=15)
plt.grid(False)
plt.show()

Factors that influence box-office revenue:

# Use a correlation matrix to see which factors relate to revenue
df[['budget', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']].corr()
# The matrix shows that budget, popularity and vote count correlate most strongly with revenue
revenue_corr = df[['popularity', 'vote_count', 'budget', 'revenue']]

# Scatter plots with regression lines of revenue against popularity (blue), vote count (green) and budget (red)
fig = plt.figure(figsize=(12, 5))
ax1 = plt.subplot(1, 3, 1)
sns.regplot(x='popularity', y='revenue', data=revenue_corr, x_jitter=0.1, ax=ax1)
ax1.text(400, 3e9, 'r=0.64', fontsize=15)
plt.title('Revenue vs. popularity', fontsize=15)
plt.xlabel('Popularity', fontsize=12)
plt.ylabel('Box-office revenue', fontsize=12)

ax2 = plt.subplot(1, 3, 2)
sns.regplot(x='vote_count', y='revenue', data=revenue_corr, x_jitter=0.1, color='g', marker='+', ax=ax2)
ax2.text(5800, 2.2e9, 'r=0.78', fontsize=15)
plt.title('Revenue vs. vote count', fontsize=15)
plt.xlabel('Vote count', fontsize=12)
plt.ylabel('Box-office revenue', fontsize=12)

ax3 = plt.subplot(1, 3, 3)
sns.regplot(x='budget', y='revenue', data=revenue_corr, x_jitter=0.1, color='r', marker='^', ax=ax3)
ax3.text(1.6e8, 2.2e9, 'r=0.73', fontsize=15)
plt.title('Revenue vs. budget', fontsize=15)
plt.xlabel('Budget', fontsize=12)
plt.ylabel('Box-office revenue', fontsize=12)
plt.tight_layout()
plt.show()
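
The r values in the plot annotations above are typed in by hand. One way to keep such labels in step with the data is to compute the Pearson coefficient and format it, as in this sketch; the data is synthetic, generated only for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic budget/revenue pairs; not taken from the TMDB dataset.
rng = np.random.default_rng(0)
budget = rng.uniform(1e6, 2e8, size=200)
revenue = 2.5 * budget + rng.normal(0, 5e7, size=200)
toy = pd.DataFrame({'budget': budget, 'revenue': revenue})

# Series.corr computes the Pearson correlation coefficient by default.
r = toy['budget'].corr(toy['revenue'])

# Text that could be passed to ax.text(...) instead of a hard-coded string.
label = f'r={r:.2f}'
print(label)
```

A correlation coefficient measures linear association only; a high r between budget and revenue does not by itself show that a larger budget causes higher revenue.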

Comparison of the two industry giants:

# Flag the movies produced by each company
company_list = ['Universal Pictures', 'Paramount Pictures']
df_company = pd.DataFrame()
for company in company_list:
    df_company[company] = df['production_companies'].str.contains(company, regex=False).astype(int)
df_company = pd.concat([df['release_year'], df_company, df.loc[:, genre], df['revenue'], df['profit']], axis=1)

# Comparison frame: per company, the number of movies in each genre and the total revenue
Uni_vs_Para = pd.DataFrame({
    company: df_company.loc[df_company[company] == 1, genre + ['revenue']].sum()
    for company in company_list
}).T

# Plot the total box-office revenue of the two companies
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(1, 1, 1)
Uni_vs_Para['revenue'].plot(ax=ax, kind='bar')
plt.title('Universal vs. Paramount: total box-office revenue', fontsize=15)
plt.xticks(rotation=0)
plt.ylabel('Box-office revenue', fontsize=12)
plt.grid(False)
plt.show()

# Profit of each company's movies, summed per release year
df_company_profit = pd.DataFrame({
    company: df_company[company] * df_company['profit'] for company in company_list
})
yearly_profit = df_company_profit.groupby(df_company['release_year']).sum()

# Line chart of each company's total profit over time
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(1, 1, 1)
yearly_profit.plot(ax=ax1)
plt.title('Universal vs. Paramount: total profit per year', fontsize=15)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Profit', fontsize=12)
plt.legend(fontsize=12)
plt.show()

# Genre mix per company: keep the 8 largest genres and pool the rest as 'others'
# (only the genre columns are used, so revenue does not leak into the pie chart)
def top_genres(company, n=8):
    counts = Uni_vs_Para.loc[company, genre].astype(float).sort_values(ascending=False)
    top = counts.iloc[:n].copy()
    top['others'] = counts.iloc[n:].sum()
    return top.sort_values(ascending=True)

Universal = top_genres('Universal Pictures')
Paramount = top_genres('Paramount Pictures')

# Plot the genre mix of the two companies
fig = plt.figure(figsize=(13, 6))
plt.subplot(1, 2, 1)
plt.pie(Universal, labels=Universal.index, autopct='%.2f%%', startangle=90, pctdistance=0.75)
plt.title('Universal Pictures', fontsize=20)
plt.subplot(1, 2, 2)
plt.pie(Paramount, labels=Paramount.index, autopct='%.2f%%', startangle=90, pctdistance=0.75)
plt.title('Paramount Pictures', fontsize=20)
plt.show()
