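Python Mini-Exercise: Preprocessing the TMDB Movie Dataset
Load the TMDB dataset and preprocess the data
The TMDb movie database: the dataset contains basic information on nearly 11,000 movies released between 1960 and 2016, mainly genre, budget, box-office revenue, cast and crew, runtime, and ratings. It is used here to practice data analysis.
Reference article: https://blog.csdn.net/moyue1002/article/details/80332186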
python 3.7
pandas 0.23
numpy 1.18
matplotlib 2.2
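The loading step itself is not shown in this post. Below is a minimal sketch, assuming the two tables are stored as CSV files. The helper name load_tables and the sample rows are hypothetical, and in-memory text stands in for the files so the example runs on its own.

```python
import io

import pandas as pd


def load_tables(movies_src, credits_src):
    """Read the movies and credits tables from CSV paths or file-like objects."""
    movies = pd.read_csv(movies_src)
    credits = pd.read_csv(credits_src)
    return movies, credits


# Tiny stand-in data (hypothetical rows shaped like the real tables).
movies_csv = io.StringIO(
    'id,title,genres\n'
    '19995,Avatar,"[{""id"": 28, ""name"": ""Action""}]"\n'
)
credits_csv = io.StringIO(
    'movie_id,title,cast,crew\n'
    '19995,Avatar,"[{""name"": ""Sam Worthington""}]",'
    '"[{""job"": ""Director"", ""name"": ""James Cameron""}]"\n'
)

movies, credits = load_tables(movies_csv, credits_csv)
print(movies.shape, credits.shape)
```

With real data, file paths can be passed instead of the StringIO objects, for example load_tables('tmdb_5000_movies.csv', 'tmdb_5000_credits.csv'), where both file names are assumptions.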
View general information about each DataFrame
# Info for the movies table
movies.head(1)
Out[3]:
      budget                                             genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800
This is the info for the credits table:
print(credits.info())
credits.head(1)
Out[4]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
movie_id    4803 non-null int64
title       4803 non-null object
cast        4803 non-null object
crew        4803 non-null object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
None
   movie_id                        ...                                                                       crew
0     19995                        ...                          [{"credit_id": "52fe48009251416c750aca23", "de...
The cast column of the credits table looks odd and holds a lot of data.
Take a closer look at it.
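JSON data handling
The table shows that the cast column is actually JSON-formatted data, so it should be processed with the json package.
The JSON has the form [{}, {}].
Use json.loads() to convert a JSON-formatted string into a Python object.
json.load() works on files: it reads JSON from a file object.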
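The difference between the two functions can be sketched as follows. The sample string is a shortened, hypothetical cast-style value, and io.StringIO stands in for an open file.

```python
import io
import json

# A shortened cast-style value (hypothetical sample, same shape as the real column).
raw = '[{"name": "Sam Worthington", "order": 0}, {"name": "Zoe Saldana", "order": 1}]'

# json.loads() parses a JSON string.
parsed = json.loads(raw)
print(type(parsed).__name__, type(parsed[0]).__name__)
print(parsed[0]['name'])

# json.load() parses from a file-like object; io.StringIO stands in for an open file.
with io.StringIO(raw) as fh:
    from_file = json.load(fh)

print(from_file == parsed)
```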
Applied to a value from the cast column, json.loads() turns the JSON string into a list, and that list in turn holds multiple dicts.
Next, process the columns in bulk.
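The bulk step is not shown in this post. Here is a sketch of one way to do it, assuming cast and crew both hold JSON strings. The two-row DataFrame is a hypothetical stand-in for the real credits table.

```python
import json

import pandas as pd

# Hypothetical two-row stand-in for the credits table.
credits = pd.DataFrame({
    'movie_id': [1, 2],
    'cast': ['[{"name": "A"}, {"name": "B"}]', '[]'],
    'crew': ['[{"job": "Director", "name": "C"}]', '[{"job": "Editor", "name": "D"}]'],
})

# Parse every JSON string in both columns into Python lists of dicts.
for col in ['cast', 'crew']:
    credits[col] = credits[col].apply(json.loads)

print(credits['cast'][0])
print(type(credits['crew'][1]))
```

Series.apply calls json.loads once per row, so every cell ends up holding a list that can be indexed directly.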
Extract the names
credits['cast'][0][:3]  # cast in the first row of credits is a list
Out[9]:
[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1},
 {'cast_id': 25,
  'character': 'Dr. Grace Augustine',
  'credit_id': '52fe48009251416c750aca39',
  'gender': 1,
  'id': 10205,
  'name': 'Sigourney Weaver',
  'order': 2}]
credits['cast'][0][0]['name']  # get the name in the first dict of the first row
Out[10]:
'Sam Worthington'
Common dict methods
dict.get() returns the value for the given key, or the default value if the key is not in the dict.
dict.items() returns a view of the dict's (key, value) tuple pairs that can be iterated over.
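A short illustration of both methods, using a hypothetical, shortened cast entry:

```python
member = {'name': 'Sam Worthington', 'character': 'Jake Sully', 'order': 0}

# get() returns the value when the key exists...
print(member.get('name'))
# ...and the default (None unless one is given) when it does not.
print(member.get('job'))
print(member.get('job', 'unknown'))

# items() yields (key, value) pairs; wrap it in list() to materialise them.
pairs = list(member.items())
print(pairs[0])
```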
Create a get_names() function that reduces cast to a comma-separated string of names
def get_names(x):
    # Join the 'name' field of every dict in the list
    return ','.join(i['name'] for i in x)

credits['cast'] = credits['cast'].apply(get_names)
credits['cast'][:3]
Out[12]:
0    Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
1    Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
2    Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
Name: cast, dtype: object
Extract the director from crew
credits['crew'][0][0]
Out[13]:
{'credit_id': '52fe48009251416c750aca23',
 'department': 'Editing',
 'gender': 0,
 'id': 1721,
 'job': 'Editor',
 'name': 'Stephen E. Rivkin'}
# Loop over the crew entries, find the one whose job is 'Director', and return that name
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    # Falls through and returns None when no entry has job == 'Director'

credits['crew'] = credits['crew'].apply(director)
print(credits[['crew']][:3])
credits.rename(columns={'crew': 'director'}, inplace=True)  # rename the column
credits[['director']][:3]
Out[14]:
             crew
0   James Cameron
1  Gore Verbinski
2      Sam Mendes
JSON parsing for the movies table
>>> movies.head(1)
Out[15]:
      budget                                             genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800
This shows that genres, keywords, spoken_languages, production_countries, and production_companies need JSON parsing.
# Same method as for the credits table
import json  # needed for json.loads

json_col = ['genres', 'keywords', 'spoken_languages', 'production_countries', 'production_companies']
for i in json_col:
    movies[i] = movies[i].apply(json.loads)  # JSON string -> list of dicts
    movies[i] = movies[i].apply(get_names)   # list of dicts -> comma-separated names
>>> movies.head(1)
Out[16]:
      budget                                    genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  Action,Adventure,Fantasy,Science Fiction  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800
With that, the data preprocessing is complete.