TMDB Movie Dataset: A Data Analysis Exercise

Published: 2023/12/31 | Programming Q&A | 33 豆豆


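Loading the TMDB dataset

The TMDb movie database: the dataset is described as covering nearly 11,000 films released between 1960 and 2016, with basic information such as genre, budget, box-office revenue, cast and crew, runtime, and ratings. (The two tmdb_5000 CSV files loaded below hold 4,803 rows, and their release years go back to 1916, so that description appears to refer to a different edition of the data.)
This article is a small self-study project that works through the raw data step by step: formatting, cleaning, and analysis.
Every step of the exercise is shown, down to the details.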

Reference article: https://blog.csdn.net/moyue1002/article/details/80332186
Python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2

import pandas as pd

credits = pd.read_csv('./tmdb_5000_credits.csv')
movies = pd.read_csv('./tmdb_5000_movies.csv')

A first look at each DataFrame

# Information about the movies table
movies.head(1)
print(movies.info())  # info() prints its report and returns None, hence the trailing None

Out[3]:
      budget                                             genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB
None

Information about the credits table:

print(credits.info())
credits.head(1)

Out[4]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
movie_id    4803 non-null int64
title       4803 non-null object
cast        4803 non-null object
crew        4803 non-null object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
None
   movie_id  ...                                               crew
0     19995  ...  [{"credit_id": "52fe48009251416c750aca23", "de...

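The cast column of the credits table looks odd and holds a lot of data.

A closer look:

# Look at index 0 of the cast column in credits: it is one very long string
print('cast format:', type(credits['cast'][0]))  # the type is str, which cannot be processed directly

Out[5]:
cast format: <class 'str'>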

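Parsing the JSON-formatted columns

The table shows that the cast column actually holds JSON-formatted data, so it should be handled with the json package.

The JSON has the form [{}, {}]: a list of objects.

json.loads() converts a JSON string into Python objects.

json.load() works on files: it reads JSON from an open file object.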
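To make the difference between the two functions concrete, here is a minimal sketch. The `raw` string is made up for illustration (it only mimics the shape of the cast column), and `io.StringIO` stands in for an open file.

```python
import io
import json

# A made-up string in the same [{...}, {...}] shape as the cast column.
raw = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'

# json.loads parses a str that is already in memory.
from_string = json.loads(raw)

# json.load reads from a file-like object; StringIO stands in for an open file.
from_file = json.load(io.StringIO(raw))

print(type(from_string), from_string[0]["name"])
```

Both calls produce the same list of dicts; only the input source differs.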

import json
type(json.loads(credits['cast'][0]))

Out[6]:
list

As shown above, json.loads() turns the JSON string into a list, and that list in turn holds several dicts.

Next, process the columns in bulk:

import json

json_col = ['cast', 'crew']
for i in json_col:
    credits[i] = credits[i].apply(json.loads)

credits['cast'][0][:3]

Out[7]:
[{'cast_id': 242, 'character': 'Jake Sully', 'credit_id': '5602a8a7c3a3685532001c9a', 'gender': 2, 'id': 65731, 'name': 'Sam Worthington', 'order': 0},
 {'cast_id': 3, 'character': 'Neytiri', 'credit_id': '52fe48009251416c750ac9cb', 'gender': 1, 'id': 8691, 'name': 'Zoe Saldana', 'order': 1},
 {'cast_id': 25, 'character': 'Dr. Grace Augustine', 'credit_id': '52fe48009251416c750aca39', 'gender': 1, 'id': 10205, 'name': 'Sigourney Weaver', 'order': 2}]

print('cast type is now:', type(credits['cast'][0]))  # the value is now a list and can be looped over

Out[8]:
cast type is now: <class 'list'>

Extracting the names

credits['cast'][0][:3]  # cast of the first row of credits, which is a list

Out[9]:
[{'cast_id': 242, 'character': 'Jake Sully', 'credit_id': '5602a8a7c3a3685532001c9a', 'gender': 2, 'id': 65731, 'name': 'Sam Worthington', 'order': 0},
 {'cast_id': 3, 'character': 'Neytiri', 'credit_id': '52fe48009251416c750ac9cb', 'gender': 1, 'id': 8691, 'name': 'Zoe Saldana', 'order': 1},
 {'cast_id': 25, 'character': 'Dr. Grace Augustine', 'credit_id': '52fe48009251416c750aca39', 'gender': 1, 'id': 10205, 'name': 'Sigourney Weaver', 'order': 2}]

credits['cast'][0][0]['name']  # the name in the first dict of the first row

Out[10]:
'Sam Worthington'

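Commonly used dict methods

dict.get(): returns the value for the given key, or a default value if the key is not in the dictionary
dict.items(): returns an iterable view of (key, value) tuples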
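A short sketch of both methods on a dict shaped like one cast entry. The `member` dict is a trimmed copy of the first cast record shown earlier; the "unknown" default is an arbitrary choice for illustration.

```python
member = {"cast_id": 242, "character": "Jake Sully", "name": "Sam Worthington"}

# get() returns the value for a key that exists ...
name = member.get("name")
# ... and the supplied default (None if omitted) when the key is missing.
gender = member.get("gender", "unknown")

# items() yields (key, value) pairs; in Python 3 it is a view, so wrap it
# in list() when an actual list is needed.
pairs = list(member.items())

print(name, gender, pairs[0])
```

Using get() instead of member["gender"] avoids a KeyError when a record lacks a field.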

# A quick test:
i = credits['cast'][0][0]
for x in i.items():
    print(x)

Out[11]:
('cast_id', 242)
('character', 'Jake Sully')
('credit_id', '5602a8a7c3a3685532001c9a')
('gender', 2)
('id', 65731)
('name', 'Sam Worthington')
('order', 0)

Create a get_names() function that reduces cast to a comma-separated string of names

def get_names(x):
    # x is a list of dicts; join every 'name' value with commas
    return ','.join(i['name'] for i in x)

credits['cast'] = credits['cast'].apply(get_names)
credits['cast'][:3]

Out[12]:
0    Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
1    Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
2    Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
Name: cast, dtype: object

Extracting the director from crew

credits['crew'][0][0]

Out[13]:
{'credit_id': '52fe48009251416c750aca23', 'department': 'Editing', 'gender': 0, 'id': 1721, 'job': 'Editor', 'name': 'Stephen E. Rivkin'}

# Loop over the crew list, find the entry whose job is 'Director', and return its name
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

credits['crew'] = credits['crew'].apply(director)
print(credits[['crew']][:3])
credits.rename(columns={'crew': 'director'}, inplace=True)  # rename the column
credits[['director']][:3]

Out[14]:
             crew
0   James Cameron
1  Gore Verbinski
2      Sam Mendes
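Note that a function like director() returns None implicitly when no crew entry has the job 'Director', which is consistent with the director column later showing fewer non-null values than rows. A minimal sketch that makes this fallback explicit follows; `first_with_job` is a hypothetical helper name, not part of the original code, and the two-entry `crew` list is a trimmed sample.

```python
def first_with_job(crew, job="Director"):
    """Return the name of the first crew member with the given job, else None."""
    # next() takes a default, so an empty or director-less crew yields None.
    return next((m["name"] for m in crew if m.get("job") == job), None)


crew = [
    {"job": "Editor", "name": "Stephen E. Rivkin"},
    {"job": "Director", "name": "James Cameron"},
]

print(first_with_job(crew))
print(first_with_job(crew, job="Producer"))
```

Like the original loop, this keeps only the first match, so a film with co-directors is reduced to one name.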

Parsing the JSON columns of the movies table

>>> movies.head(1)
      budget                                             genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800

The genres, keywords, spoken_languages, production_countries, and production_companies columns all need JSON parsing.

# Same approach as for the credits table
json_col = ['genres', 'keywords', 'spoken_languages', 'production_countries', 'production_companies']
for i in json_col:
    movies[i] = movies[i].apply(json.loads)
    movies[i] = movies[i].apply(get_names)

>>> movies.head(1)
      budget                                    genres                     homepage     id  ...                      tagline   title  vote_average  vote_count
0  237000000  Action,Adventure,Fantasy,Science Fiction  http://www.avatarmovie.com/  19995  ...  Enter the World of Pandora.  Avatar           7.2       11800

Starting the analysis

credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
movie_id    4803 non-null int64
title       4803 non-null object
cast        4803 non-null object
director    4773 non-null object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

Both credits and movies have an id column and a title column. Do they hold the same values?

A quick check:

(credits['movie_id'] == movies['id']).describe()

count     4803
unique       1
top       True
freq      4803
dtype: object

(credits['title'] == movies['title']).describe()

count     4803
unique       1
top       True
freq      4803
Name: title, dtype: object

Both pairs of columns match row for row, so the two tables can be merged:

df = credits.merge(right=movies, how='inner', left_on='movie_id', right_on='id')

>>> df.head()
   movie_id                                   title_x  ...  vote_average  vote_count
0     19995                                    Avatar  ...           7.2       11800
1       285  Pirates of the Caribbean: At World's End  ...           6.9        4500
2    206647                                   Spectre  ...           6.3        4466
3     49026                     The Dark Knight Rises  ...           7.6        9106
4     49529                               John Carter  ...           6.1        2124
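As an optional variation on the merge, pandas can verify the one-row-per-film assumption itself and give the duplicated title columns more descriptive names. The sketch below uses two tiny stand-in frames (`credits_demo`, `movies_demo`) built from values shown in the outputs above; the suffix names are arbitrary choices.

```python
import pandas as pd

# Tiny stand-ins for the real tables, using values from the outputs above.
credits_demo = pd.DataFrame({
    "movie_id": [19995, 285],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
})
movies_demo = pd.DataFrame({
    "id": [19995, 285],
    "title": ["Avatar", "Pirates of the Caribbean: At World's End"],
    "vote_average": [7.2, 6.9],
})

merged = credits_demo.merge(
    movies_demo,
    how="inner",
    left_on="movie_id",
    right_on="id",
    suffixes=("_credits", "_movies"),  # instead of the default _x / _y
    validate="one_to_one",             # raises MergeError if a key repeats
)

print(merged.columns.tolist())
```

With validate="one_to_one", a duplicated id on either side raises an error instead of silently multiplying rows.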

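df has 24 columns, including:

movie_id: TMDB movie identifier
title_x & title_y: film title; the merge produced two identical columns, so one can be deleted
cast: list of actors
director: director
budget: budget
genres: film genres
homepage: film URL
id: same as movie_id
original_language: language of the film
overview: plot summary
popularity: number of page views on the database site
production_companies: production companies
production_countries: production countries
release_date: release date
spoken_languages: languages spoken in the film
status: release status
tagline: film tagline
vote_average: average rating
vote_count: number of ratings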

df.info()
# df[['movie_id','id']]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 24 columns):
movie_id                4803 non-null int64
title_x                 4803 non-null object
cast                    4803 non-null object
director                4773 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title_y                 4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(5), object(16)
memory usage: 938.1+ KB

Handling missing values

# Drop the duplicate columns created by the merge
del df['title_y']
del df['id']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 22 columns):
movie_id                4803 non-null int64
title_x                 4803 non-null object
cast                    4803 non-null object
director                4773 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(15)
memory usage: 863.0+ KB

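The output above also shows that, among the columns used in this analysis, director, release_date, and runtime have missing values.

The missing director values cannot be filled in, so only release_date and runtime are handled.
The homepage, original_title, overview, spoken_languages, and tagline columns are not needed for this analysis either and could be dropped.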

# Fill the single missing release date with a fixed placeholder date
df['release_date'] = df['release_date'].fillna('2014-06-01')
# Fill the two missing runtimes with the mean runtime
df['runtime'] = df['runtime'].fillna(df['runtime'].mean())

>>> df[['release_date','runtime']].isnull().describe()
       release_date runtime
count          4803    4803
unique            1       1
top           False   False
freq           4803    4803

>>> df.head(3)
   movie_id                                   title_x  ...  vote_average  vote_count
0     19995                                    Avatar  ...           7.2       11800
1       285  Pirates of the Caribbean: At World's End  ...           6.9        4500
2    206647                                   Spectre  ...           6.3        4466
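The article mentions that several columns are not needed but never shows the step of dropping them. A minimal sketch of that step, assuming the column names from the info() output; the one-row `demo` frame is a stand-in so the example runs on its own.

```python
import pandas as pd

# One-row stand-in for df, with values taken from the outputs above.
demo = pd.DataFrame({
    "title_x": ["Avatar"],
    "homepage": ["http://www.avatarmovie.com/"],
    "tagline": ["Enter the World of Pandora."],
    "vote_average": [7.2],
})

unused = ["homepage", "original_title", "overview", "spoken_languages", "tagline"]

# errors="ignore" skips names that are absent from this small demo frame;
# on the full df every listed column exists, so the argument could be omitted.
trimmed = demo.drop(columns=unused, errors="ignore")

print(trimmed.columns.tolist())
```

drop() returns a new frame, so the original stays available until the result is assigned back.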

Data analysis and visualization

Handling dates

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 22 columns):
movie_id                4803 non-null int64
title_x                 4803 non-null object
cast                    4803 non-null object
director                4773 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4803 non-null object
revenue                 4803 non-null int64
runtime                 4803 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(15)
memory usage: 863.0+ KB

# release_date is stored as object (string), so convert it to datetime first
df['release_year'] = pd.to_datetime(df.release_date, format='%Y-%m-%d').dt.year
df['release_month'] = pd.to_datetime(df.release_date, format='%Y-%m-%d').dt.month

>>> df.head(3)
   movie_id                                   title_x                                               cast  ...  vote_count  release_year  release_month
0     19995                                    Avatar  Sam Worthington,Zoe Saldana,Sigourney Weaver,S...  ...       11800          2009             12
1       285  Pirates of the Caribbean: At World's End  Johnny Depp,Orlando Bloom,Keira Knightley,Stel...  ...        4500          2007              5
2    206647                                   Spectre  Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...  ...        4466          2015             10
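An alternative to filling missing dates with a placeholder is to parse once with errors="coerce", which turns missing or malformed values into NaT so they stay visibly missing. This is a sketch of that option, not what the article does; the `dates` values are made up for illustration.

```python
import pandas as pd

# Made-up sample: two valid dates, one missing value, one malformed string.
dates = pd.Series(["2009-12-10", "2007-05-19", None, "not a date"])

# Parse once and reuse the result; unparseable entries become NaT.
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

years = parsed.dt.year    # NaN where the date is missing
months = parsed.dt.month

print(parsed.isna().sum(), "dates could not be parsed")
```

Rows with NaT can then be dropped or handled explicitly, rather than being counted under an arbitrary year.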

Genre analysis

df['genres'][1].split(',')  # split() breaks the string into a list

['Adventure', 'Fantasy', 'Action']

set() creates a set, whose key property is that its elements cannot repeat.

That makes it a convenient way to collect every genre that appears in the data.

genre = set()
for i in df['genres'].str.split(','):
    genre = set().union(i, genre)  # union() merges i and genre into one set
genre

{'', 'Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music', 'Mystery', 'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western'}

# Convert the set to a list and remove the useless ''
genre = list(genre)
genre.remove('')
genre

['War', 'History', 'Science Fiction', 'Foreign', 'Western', 'Action', 'Comedy', 'Family', 'Documentary', 'Animation', 'Romance', 'Drama', 'Mystery', 'Music', 'Fantasy', 'Horror', 'TV Movie', 'Adventure', 'Thriller', 'Crime']
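The same collection step can be written with set.update(), which adds to the existing set in place and can skip the empty string up front. A minimal sketch on three sample genre strings (the first two appear in the outputs above; the empty one represents a film with no genres):

```python
genre_strings = [
    "Action,Adventure,Fantasy,Science Fiction",
    "Adventure,Fantasy,Action",
    "",  # a film with no genres listed
]

genres = set()
for s in genre_strings:
    # update() mutates the set in place; the filter drops the empty string
    # that "".split(",") would otherwise contribute.
    genres.update(g for g in s.split(",") if g)

result = sorted(genres)  # sorted() gives a stable, readable order
print(result)
```

Sorting the result also avoids the arbitrary ordering that comes from converting a set straight to a list.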

Genres and their counts

for i in genre:
    df[i] = 0  # create a column named after the genre
    # Set the flag to 1 where the genres string contains i.
    # .loc avoids chained assignment, which may not write back to df.
    df.loc[df.genres.str.contains(i), i] = 1

>>> df.head(8)
   movie_id                                   title_x                                               cast           director  ...  Fantasy  Romance  Horror  Foreign
0     19995                                    Avatar  Sam Worthington,Zoe Saldana,Sigourney Weaver,S...      James Cameron  ...        1        0       0        0
1       285  Pirates of the Caribbean: At World's End  Johnny Depp,Orlando Bloom,Keira Knightley,Stel...     Gore Verbinski  ...        1        0       0        0
2    206647                                   Spectre  Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...         Sam Mendes  ...        0        0       0        0
3     49026                     The Dark Knight Rises  Christian Bale,Michael Caine,Gary Oldman,Anne ...  Christopher Nolan  ...        0        0       0        0
4     49529                               John Carter  Taylor Kitsch,Lynn Collins,Samantha Morton,Wil...     Andrew Stanton  ...        0        0       0        0
5       559                              Spider-Man 3  Tobey Maguire,Kirsten Dunst,James Franco,Thoma...          Sam Raimi  ...        1        0       0        0
6     38757                                   Tangled  Zachary Levi,Mandy Moore,Donna Murphy,Ron Perl...       Byron Howard  ...        0        0       0        0
7     99861                   Avengers: Age of Ultron  Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo...        Joss Whedon  ...        0        0       0        0

# An alternative way to write the same thing:
# for i in genre:
#     df[i] = df['genres'].str.contains(i).apply(lambda x: 1 if x else 0)
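pandas also offers Series.str.get_dummies(), which builds all the 0/1 genre columns in one call by splitting on a separator. Unlike str.contains(), it matches whole tokens rather than substrings. This is a sketch of that alternative on three sample strings, not the article's method.

```python
import pandas as pd

genres = pd.Series([
    "Action,Adventure,Fantasy,Science Fiction",
    "Adventure,Fantasy,Action",
    "",  # a film with no genres listed
])

# One indicator column per distinct token; empty strings produce no column.
flags = genres.str.get_dummies(sep=",")

print(flags)
```

On the full data, the result could be attached with pd.concat([df, flags], axis=1), assuming the row index is unchanged.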

Building a DataFrame of genre flags and release year

df_gy = df[genre + ['release_year']]

>>> df_gy.head(10)
   War  Thriller  Animation  Action  Adventure  Music  Science Fiction  Documentary  ...  Family  Drama  Mystery  Fantasy  Romance  Horror  Foreign  release_year
0    0         0          0       1          1      0                1            0  ...       0      0        0        1        0       0        0          2009
1    0         0          0       1          1      0                0            0  ...       0      0        0        1        0       0        0          2007
2    0         0          0       1          1      0                0            0  ...       0      0        0        0        0       0        0          2015
3    0         1          0       1          0      0                0            0  ...       0      1        0        0        0       0        0          2012
4    0         0          0       1          1      0                1            0  ...       0      0        0        0        0       0        0          2012
5    0         0          0       1          1      0                0            0  ...       0      0        0        1        0       0        0          2007
6    0         0          1       0          0      0                0            0  ...       1      0        0        0        0       0        0          2010
7    0         0          0       1          1      0                1            0  ...       0      0        0        0        0       0        0          2015
8    0         0          0       0          1      0                0            0  ...       1      0        0        1        0       0        0          2009
9    0         0          0       1          1      0                0            0  ...       0      0        0        1        0       0        0          2016

Visualizing the yearly trend in film counts

import matplotlib.pyplot as plt

x = df_gy['release_year'].value_counts().sort_index()
plt.plot(x)  # line chart of the total number of films per year
plt.xlabel('Time (year)')
plt.ylabel('Counts')
plt.show()

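Plotting film counts per genre over time

# Sum the 0/1 genre flags within each year.
# GroupBy.sum() is called without axis, which it does not accept in current pandas.
x = df_gy.groupby('release_year').sum()
plt.figure(figsize=(12, 6))
plt.xticks(range(1915, 2018, 5))
plt.plot(x)
plt.legend(x.columns.values, fontsize=9)
plt.xlabel('Time (year)')
plt.ylabel('Counts')
plt.show()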
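To see what the groupby step produces, here is a minimal sketch on a made-up four-film table with two genre flags. Because each flag is 0 or 1, summing within a year counts the films of that genre.

```python
import pandas as pd

# Made-up flags for four films; a film can belong to several genres.
demo = pd.DataFrame({
    "Action": [1, 1, 0, 1],
    "Drama": [0, 1, 1, 0],
    "release_year": [2007, 2012, 2012, 2015],
})

# One row per year, one column per genre, each cell a film count.
per_year = demo.groupby("release_year").sum()

print(per_year)
```

Since a film can carry several genres, the row totals can exceed the number of films released that year.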

Bar chart of total film counts per genre

y = x.sum().sort_values()
plt.figure(figsize=(12, 6))
plt.xlabel('Counts', fontsize=15)
plt.ylabel('Category', fontsize=15)
plt.barh(y.index, y)
plt.show()

Pie chart

bl = y / y.sum()  # share of each genre
plt.figure(figsize=(6, 6))
# Slices with a share of at least 6% are pulled out slightly further
plt.pie(bl, labels=bl.index, autopct='%1.1f%%', explode=(bl >= 0.06) / 20 + 0.02)
plt.title('Pie of Category')
plt.show()

Which factors are related to box-office revenue?

df_revenue = df.groupby('release_year')['revenue'].sum()  # total revenue per year
df_revenue[:5]

release_year
1916     8394751
1925    22000000
1927      650422
1929     4358000
1930     8000000
Name: revenue, dtype: int64

Year and revenue

df_revenue.plot(figsize=(12, 6))
plt.xticks(range(1915, 2018, 6))
plt.title('Total revenue in each year', fontsize=15)
plt.xlabel('Year', fontsize=15)
plt.ylabel('Total revenue', fontsize=15)
plt.show()

Budget and revenue

plt.scatter(x=df.budget, y=df.revenue)
# Axis labels match the data: budget on x, revenue on y
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.show()
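A scatter plot can be complemented with a single number: the Pearson correlation coefficient. The sketch below uses made-up figures, and it assumes that a budget or revenue of 0 means "not recorded" rather than a true zero; that assumption should be checked against the real data before filtering this way.

```python
import pandas as pd

# Made-up budget/revenue pairs (arbitrary units); the last row has zeros.
demo = pd.DataFrame({
    "budget": [10, 50, 100, 200, 0],
    "revenue": [5, 120, 260, 700, 0],
})

# Assumption: zeros are placeholders for missing values, so exclude them.
known = demo[(demo["budget"] > 0) & (demo["revenue"] > 0)]

# Series.corr() computes the Pearson correlation by default.
r = known["budget"].corr(known["revenue"])

print(round(r, 3))
```

A value near 1 indicates a strong linear relationship, but it says nothing about causation.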

Rating and revenue

plt.scatter(x=df.vote_average, y=df.revenue)
plt.xlabel('Vote')
plt.ylabel('Revenue')
plt.show()

Runtime and revenue

plt.scatter(df.runtime, df.revenue)
plt.xlabel('Run time')
plt.ylabel('Revenue')
plt.show()

Rating and popularity

plt.scatter(df.vote_average, df.popularity)
plt.xlabel('Vote')
plt.ylabel('Popularity')
plt.show()

Runtime and popularity

plt.scatter(df.runtime, df.popularity)
plt.xlabel('Runtime(minutes)', fontsize=15)
plt.ylabel('Popularity', fontsize=15)
plt.show()
# Audiences appear to prefer films with a runtime between 60 and 160 minutes
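One way to check an impression like the 60 to 160 minute range is to bin runtime and summarize popularity per bin. The sketch below uses made-up numbers; the bin edges and labels are illustrative choices, not values from the article.

```python
import pandas as pd

# Made-up runtimes (minutes) and popularity scores.
demo = pd.DataFrame({
    "runtime": [45, 90, 100, 120, 150, 200],
    "popularity": [1.0, 20.0, 30.0, 40.0, 25.0, 5.0],
})

bins = [0, 60, 160, 400]            # intervals are right-inclusive: (0, 60], (60, 160], ...
labels = ["<=60", "60-160", ">160"]
demo["runtime_band"] = pd.cut(demo["runtime"], bins=bins, labels=labels)

# Film count and median popularity within each runtime band.
summary = demo.groupby("runtime_band", observed=False)["popularity"].agg(["count", "median"])

print(summary)
```

The count column matters here: most films fall in the middle band, so a dense cloud of points there does not by itself show that audiences prefer that length.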

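Choosing a suitable chart for the relationship between two columns, as in the examples above, is the first stage of data analysis.

That is all for today's practice.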
