Data Analysis - Kaggle TMDB Box Office Prediction
- Environment Setup
- Dataset
- Main Content
- Data Preprocessing
- Exploratory Data Analysis
- Modeling
 
Environment Setup
Environment used:
- Windows 10
- Python 3.7.2
- Jupyter Notebook (all code was tested here)
Dataset
https://www.kaggle.com/c/tmdb-box-office-prediction/data
Main Content
Before getting started, import the third-party libraries:
import pandas as pd
pd.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
from wordcloud import WordCloud
plt.style.use('ggplot')
import ast
from collections import Counter
import numpy as np
from sklearn.preprocessing import LabelEncoder
# text mining
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LinearRegression
# modeling
from sklearn.model_selection import train_test_split
import lightgbm as lgb

Load the data:
train = pd.read_csv('dataset/train.csv')
test = pd.read_csv('dataset/test.csv')

Take a quick look at the data:
train.head()

(Output: the first five rows. The 23 columns are id, belongs_to_collection, budget, genres, homepage, imdb_id, original_language, original_title, overview, popularity, poster_path, production_companies, production_countries, release_date, runtime, spoken_languages, status, tagline, title, Keywords, cast, crew, and revenue. The columns belongs_to_collection, genres, production_companies, production_countries, spoken_languages, Keywords, cast, and crew hold stringified JSON-like data.)
Dataset size (it is quite small):

print(train.shape)
print(test.shape)

(3000, 23)
(4398, 22)
Data Preprocessing
From the preview above, several columns hold JSON-like data that must be converted into a workable format. These fields are actually stringified Python dicts (single-quoted), so json.loads would choke on them; ast.literal_eval parses them directly into lists of dicts, and that is the approach used here:
dict_columns = ['belongs_to_collection', 'genres', 'production_companies',
                'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']

def json_to_dict(df):
    for column in dict_columns:
        df[column] = df[column].apply(lambda x: {} if pd.isna(x) else ast.literal_eval(x))
    return df

train = json_to_dict(train)
test = json_to_dict(test)

Next, turn these irregular fields into features. Two kinds are built: label extraction and encoding (key cast members, genres, collections, production companies, and so on) and label counts (number of genres, cast size, keyword counts, and so on). Note that because the dataset is small, only the TOP labels should be one-hot encoded, to avoid overfitting:
# collections
train['collection_name'] = train['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
train['has_collection'] = train['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
test['collection_name'] = test['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
test['has_collection'] = test['belongs_to_collection'].apply(lambda x: len(x) if x != {} else 0)
train = train.drop(['belongs_to_collection'], axis=1)
test = test.drop(['belongs_to_collection'], axis=1)

# genres
list_of_genres = list(train['genres'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_genres'] = train['genres'].apply(lambda x: len(x) if x != {} else 0)
train['all_genres'] = train['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_genres = [m[0] for m in Counter([i for j in list_of_genres for i in j]).most_common(15)]
for g in top_genres:
    train['genre_' + g] = train['all_genres'].apply(lambda x: 1 if g in x else 0)
test['num_genres'] = test['genres'].apply(lambda x: len(x) if x != {} else 0)
test['all_genres'] = test['genres'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_genres:
    test['genre_' + g] = test['all_genres'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['genres'], axis=1)
test = test.drop(['genres'], axis=1)

# production companies
list_of_companies = list(train['production_companies'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_companies'] = train['production_companies'].apply(lambda x: len(x) if x != {} else 0)
train['all_production_companies'] = train['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_companies = [m[0] for m in Counter([i for j in list_of_companies for i in j]).most_common(30)]
for g in top_companies:
    train['production_company_' + g] = train['all_production_companies'].apply(lambda x: 1 if g in x else 0)
test['num_companies'] = test['production_companies'].apply(lambda x: len(x) if x != {} else 0)
test['all_production_companies'] = test['production_companies'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_companies:
    test['production_company_' + g] = test['all_production_companies'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['production_companies', 'all_production_companies'], axis=1)
test = test.drop(['production_companies', 'all_production_companies'], axis=1)

# production countries
list_of_countries = list(train['production_countries'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_countries'] = train['production_countries'].apply(lambda x: len(x) if x != {} else 0)
train['all_countries'] = train['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_countries = [m[0] for m in Counter([i for j in list_of_countries for i in j]).most_common(25)]
for g in top_countries:
    train['production_country_' + g] = train['all_countries'].apply(lambda x: 1 if g in x else 0)
test['num_countries'] = test['production_countries'].apply(lambda x: len(x) if x != {} else 0)
test['all_countries'] = test['production_countries'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_countries:
    test['production_country_' + g] = test['all_countries'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['production_countries', 'all_countries'], axis=1)
test = test.drop(['production_countries', 'all_countries'], axis=1)

# spoken languages
list_of_languages = list(train['spoken_languages'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_languages'] = train['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
train['all_languages'] = train['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_languages = [m[0] for m in Counter([i for j in list_of_languages for i in j]).most_common(30)]
for g in top_languages:
    train['language_' + g] = train['all_languages'].apply(lambda x: 1 if g in x else 0)
test['num_languages'] = test['spoken_languages'].apply(lambda x: len(x) if x != {} else 0)
test['all_languages'] = test['spoken_languages'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_languages:
    test['language_' + g] = test['all_languages'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['spoken_languages', 'all_languages'], axis=1)
test = test.drop(['spoken_languages', 'all_languages'], axis=1)

# keywords
list_of_keywords = list(train['Keywords'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
train['num_Keywords'] = train['Keywords'].apply(lambda x: len(x) if x != {} else 0)
train['all_Keywords'] = train['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
top_keywords = [m[0] for m in Counter([i for j in list_of_keywords for i in j]).most_common(30)]
for g in top_keywords:
    train['keyword_' + g] = train['all_Keywords'].apply(lambda x: 1 if g in x else 0)
test['num_Keywords'] = test['Keywords'].apply(lambda x: len(x) if x != {} else 0)
test['all_Keywords'] = test['Keywords'].apply(lambda x: ' '.join(sorted([i['name'] for i in x])) if x != {} else '')
for g in top_keywords:
    test['keyword_' + g] = test['all_Keywords'].apply(lambda x: 1 if g in x else 0)
train = train.drop(['Keywords', 'all_Keywords'], axis=1)
test = test.drop(['Keywords', 'all_Keywords'], axis=1)

# cast
list_of_cast_names = list(train['cast'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
list_of_cast_genders = list(train['cast'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
list_of_cast_characters = list(train['cast'].apply(lambda x: [i['character'] for i in x] if x != {} else []).values)
train['num_cast'] = train['cast'].apply(lambda x: len(x) if x != {} else 0)
top_cast_names = [m[0] for m in Counter([i for j in list_of_cast_names for i in j]).most_common(15)]
for g in top_cast_names:
    train['cast_name_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
train['genders_0_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
train['genders_1_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
train['genders_2_cast'] = train['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
top_cast_characters = [m[0] for m in Counter([i for j in list_of_cast_characters for i in j]).most_common(15)]
for g in top_cast_characters:
    train['cast_character_' + g] = train['cast'].apply(lambda x: 1 if g in str(x) else 0)
test['num_cast'] = test['cast'].apply(lambda x: len(x) if x != {} else 0)
for g in top_cast_names:
    test['cast_name_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
test['genders_0_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
test['genders_1_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
test['genders_2_cast'] = test['cast'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
for g in top_cast_characters:
    test['cast_character_' + g] = test['cast'].apply(lambda x: 1 if g in str(x) else 0)
train = train.drop(['cast'], axis=1)
test = test.drop(['cast'], axis=1)

# crew
list_of_crew_names = list(train['crew'].apply(lambda x: [i['name'] for i in x] if x != {} else []).values)
list_of_crew_jobs = list(train['crew'].apply(lambda x: [i['job'] for i in x] if x != {} else []).values)
list_of_crew_genders = list(train['crew'].apply(lambda x: [i['gender'] for i in x] if x != {} else []).values)
list_of_crew_departments = list(train['crew'].apply(lambda x: [i['department'] for i in x] if x != {} else []).values)
train['num_crew'] = train['crew'].apply(lambda x: len(x) if x != {} else 0)
top_crew_names = [m[0] for m in Counter([i for j in list_of_crew_names for i in j]).most_common(15)]
for g in top_crew_names:
    train['crew_name_' + g] = train['crew'].apply(lambda x: 1 if g in str(x) else 0)
train['genders_0_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
train['genders_1_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
train['genders_2_crew'] = train['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
top_crew_jobs = [m[0] for m in Counter([i for j in list_of_crew_jobs for i in j]).most_common(15)]
for j in top_crew_jobs:
    train['jobs_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
top_crew_departments = [m[0] for m in Counter([i for j in list_of_crew_departments for i in j]).most_common(15)]
for j in top_crew_departments:
    train['departments_' + j] = train['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j]))
test['num_crew'] = test['crew'].apply(lambda x: len(x) if x != {} else 0)
for g in top_crew_names:
    test['crew_name_' + g] = test['crew'].apply(lambda x: 1 if g in str(x) else 0)
test['genders_0_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
test['genders_1_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
test['genders_2_crew'] = test['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
for j in top_crew_jobs:
    test['jobs_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['job'] == j]))
for j in top_crew_departments:
    test['departments_' + j] = test['crew'].apply(lambda x: sum([1 for i in x if i['department'] == j]))
train = train.drop(['crew'], axis=1)
test = test.drop(['crew'], axis=1)

Preview the data after processing:
train.head()

(Output omitted: the first five rows again, with the JSON columns replaced by the engineered features — collection_name, has_collection, num_genres, all_genres, the genre_*/production_company_*/production_country_*/language_*/keyword_* flags, and the cast/crew count columns.)
This achieves the expected result, but release_date is still a string. Convert it to a standard datetime:
def fix_date(x):
    """Expand two-digit years, e.g. fix_date('2/20/15') -> '2/20/2015'."""
    year = x.split('/')[2]
    if int(year) <= 19:
        return x[:-2] + '20' + year
    else:
        return x[:-2] + '19' + year

# the test set has missing release dates; fill them before fixing
test.loc[test['release_date'].isnull() == True, 'release_date'] = '01/01/98'
train['release_date'] = train['release_date'].apply(lambda x: fix_date(x))
test['release_date'] = test['release_date'].apply(lambda x: fix_date(x))
train['release_date'] = pd.to_datetime(train['release_date'])
test['release_date'] = pd.to_datetime(test['release_date'])

Exploratory Data Analysis
First, look at the distribution of budget. Most values are small and the data are heavily skewed, so apply a log transform (log1p) to spread out the smaller values:
# the original drew both histograms onto one set of axes, overwriting the title;
# side-by-side subplots show the intended before/after comparison
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(train['budget'])
axes[0].set_title('budget distribution')
axes[1].hist(np.log1p(train['budget']))
axes[1].set_title('log1p budget distribution')
Revenue obviously gets the same treatment:
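The later code (the TF-IDF regression and the has_collection boxplot) reads from a log_revenue column whose creation the post never shows. A minimal sketch of the presumably missing step:

# assumed step, not shown in the original: log-transform the target
train['log_revenue'] = np.log1p(train['revenue'])
plt.hist(train['log_revenue'])
plt.title('log1p revenue distribution')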
Next, homepage: convert it into a boolean flag, since having an official homepage is itself a sign of a well-resourced production:
train['has_homepage'] = 0
train.loc[train['homepage'].isnull() == False, 'has_homepage'] = 1
test['has_homepage'] = 0
test.loc[test['homepage'].isnull() == False, 'has_homepage'] = 1

The revenue split by has_homepage — films with a homepage earn more:
sns.catplot(x='has_homepage', y='revenue', data=train)
plt.title('Revenue for film with and without homepage')
Revenue by original language:
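The plotting code for this figure did not survive. A minimal sketch that would produce a revenue-per-language boxplot (restricting to the ten most common languages is my assumption, not the author's):

# hypothetical reconstruction of the missing language plot
top_langs = train['original_language'].value_counts().head(10).index
sns.boxplot(x='original_language', y='revenue',
            data=train[train['original_language'].isin(top_langs)])
plt.title('Revenue per language')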
English-language films include many hits but also many flops; other languages have good films too, and overall the differences are modest.
The overview column calls for text mining. Here we keep it simple and model log_revenue with the usual TF-IDF features plus linear regression:
vectorizer = TfidfVectorizer(sublinear_tf=True, analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 2), min_df=5)
overview_text = vectorizer.fit_transform(train['overview'].fillna(''))
linreg = LinearRegression()
linreg.fit(overview_text, train['log_revenue'])

Use eli5 to visualize how individual words push log_revenue up or down:
import eli5

print('Target value:', train['log_revenue'][5])
eli5.show_prediction(linreg, doc=train['overview'][5], vec=vectorizer)
The raw release date is too coarse to use directly; derive weekday, month, quarter, and year features from it:
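The derivation code is missing from the post; the column names used below (release_date_year, release_date_weekday) suggest a sketch along these lines:

# assumed reconstruction: expand release_date into calendar features
for df in (train, test):
    df['release_date_year'] = df['release_date'].dt.year
    df['release_date_month'] = df['release_date'].dt.month
    df['release_date_quarter'] = df['release_date'].dt.quarter
    df['release_date_weekday'] = df['release_date'].dt.weekday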
First, the number of films released per year, drawn with plotly, an interactive visualization library:
d1 = train['release_date_year'].value_counts().sort_index()
d2 = test['release_date_year'].value_counts().sort_index()
py.init_notebook_mode(connected=True)
data = [go.Scatter(x=d1.index, y=d1.values, name='train'),
        go.Scatter(x=d2.index, y=d2.values, name='test')]
layout = go.Layout(dict(title='Number of films per year',
                        xaxis=dict(title='year'),
                        yaxis=dict(title='Count')),
                   legend=dict(orientation='v'))
py.iplot(dict(data=data, layout=layout))
Total releases versus total revenue over time:

Total releases versus mean revenue over time (mean revenue appears to flatten out after 2000):
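The aggregation behind these two trend charts is not included either; assuming they group the train set's revenue by release year, a plausible sketch:

# hypothetical reconstruction of the yearly revenue trends
d_total = train.groupby('release_date_year')['revenue'].sum()
d_mean = train.groupby('release_date_year')['revenue'].mean()
py.iplot(dict(data=[go.Scatter(x=d_total.index, y=d_total.values, name='total revenue')],
              layout=go.Layout(title='Total revenue per year')))
py.iplot(dict(data=[go.Scatter(x=d_mean.index, y=d_mean.values, name='mean revenue')],
              layout=go.Layout(title='Mean revenue per year')))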
Does the release weekday affect revenue?
sns.catplot(x='release_date_weekday', y='revenue', data=train)
plt.title('Revenue on different days of week of release')
Now the same data as a boxplot:
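The boxplot call itself is not shown; presumably it mirrors the catplot above:

# assumed: boxplot variant of the weekday-vs-revenue chart
sns.boxplot(x='release_date_weekday', y='revenue', data=train)
plt.title('Revenue on different days of week of release')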
Observation: many high-grossing films open Monday through Wednesday, while Saturday releases gross noticeably less.
Analyze the tagline text and surface its most frequent words with a word cloud:
plt.figure(figsize=(12, 12))
text_tagline = ' '.join(train['tagline'].fillna(''))
wordcloud_tagline = WordCloud(max_font_size=None, background_color='white',
                              width=1200, height=1000).generate_from_text(text_tagline)
plt.imshow(wordcloud_tagline)
plt.title('Top words in tagline')
plt.axis("off")
plt.show()

Does belonging to a collection (has_collection) affect revenue:
sns.boxplot(x='has_collection', y='log_revenue', data=train)
Films that belong to a collection have a higher average revenue.
Analyze the relationship between the number of genres and revenue:
train['num_genres'].value_counts()
sns.catplot(x='num_genres', y='revenue', data=train)
Films with around three genres tend to earn the most; piling on more genres does not help.
Finally, the relationship between production company and revenue, with one distribution plot per company:
f, axes = plt.subplots(6, 5, figsize=(24, 32))
plt.suptitle('Violin of revenue vs production company')
for i, e in enumerate([i for i in train.columns if 'production_company_' in i]):
    sns.violinplot(x=e, y='revenue', data=train, ax=axes[i // 5][i % 5])

Modeling
First drop some irrelevant features:
train = train.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status', 'log_revenue'], axis=1)
test = test.drop(['homepage', 'imdb_id', 'poster_path', 'release_date', 'status'], axis=1)

Then drop any feature with only a single unique value:
for col in train.columns:
    if train[col].nunique() == 1:
        print(col)
        # the original called train.drop([col], axis=1) twice without assigning
        # the result, which drops nothing; assign back and drop from test too
        train = train.drop([col], axis=1)
        test = test.drop([col], axis=1)

Encode the categorical labels:
for col in ['original_language', 'collection_name', 'all_genres']:
    le = LabelEncoder()
    # cast to str on both sides: collection_name mixes strings with the 0 placeholder,
    # and fitting on mixed types while transforming strings would fail
    le.fit(list(train[col].fillna('').astype(str)) + list(test[col].fillna('').astype(str)))
    train[col] = le.transform(train[col].fillna('').astype(str))
    test[col] = le.transform(test[col].fillna('').astype(str))

Turn the free-text columns into length features:
train_texts = train[['title', 'tagline', 'overview', 'original_title']]
test_texts = test[['title', 'tagline', 'overview', 'original_title']]
for col in ['title', 'tagline', 'overview', 'original_title']:
    train['len_' + col] = train[col].fillna('').apply(lambda x: len(str(x)))
    train['words_' + col] = train[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    test['len_' + col] = test[col].fillna('').apply(lambda x: len(str(x)))
    test['words_' + col] = test[col].fillna('').apply(lambda x: len(str(x).split(' ')))
    train = train.drop(col, axis=1)
    test = test.drop(col, axis=1)

Build the training matrix and the test matrix:
X = train.drop(['id', 'revenue'], axis=1)
y = np.log1p(train['revenue'])
X_test = test.drop(['id'], axis=1)

Train the model:
# hold out 10% for validation; the original reused the name X_test here,
# silently clobbering the Kaggle test matrix built above
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.1)

# rmse: root mean square error, (sum(d^2)/N)^0.5
params = {'num_leaves': 30,
          'min_data_in_leaf': 20,
          'objective': 'regression',
          'max_depth': 5,
          'learning_rate': 0.01,
          'boosting': 'gbdt',
          'feature_fraction': 0.9,
          'bagging_freq': 1,
          'bagging_fraction': 0.9,
          'bagging_seed': 11,
          'metric': 'rmse',
          'lambda_l1': 0.2,
          'verbosity': -1}

# the original passed nthread=4 plus a misspelled jobs=-1; n_jobs=-1 is the valid spelling
model1 = lgb.LGBMRegressor(**params, n_estimators=20000, n_jobs=-1)
# note: lightgbm>=4 moved these fit arguments into
# callbacks=[lgb.early_stopping(200), lgb.log_evaluation(1000)]
model1.fit(X_train, y_train,
           eval_set=[(X_train, y_train), (X_valid, y_valid)],
           eval_metric='rmse', verbose=1000, early_stopping_rounds=200)

Training until validation scores don't improve for 200 rounds.
[1000] training's rmse: 1.42756  valid_1's rmse: 2.07259
Early stopping, best iteration is:
[1118] training's rmse: 1.38621  valid_1's rmse: 2.06726
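The post stops at training, but a Kaggle submission needs one more step: predictions come out on the log1p scale and must be inverted with np.expm1. A sketch of that assumed final step (the output filename is hypothetical):

# assumed final step, not shown in the original: predict and invert the log1p transform
pred = np.expm1(model1.predict(X_test, num_iteration=model1.best_iteration_))
submission = pd.DataFrame({'id': test['id'], 'revenue': pred})
submission.to_csv('submission.csv', index=False)  # hypothetical output path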
After training, the feature importances:
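The importance chart itself is missing; LightGBM's built-in helper can reproduce something close to it:

# plausible reconstruction of the missing feature-importance chart
lgb.plot_importance(model1, max_num_features=30, figsize=(10, 12))
plt.show()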
 
Summary
To recap: the JSON-style columns were flattened into count and top-label features, the heavily skewed budget and revenue values were log-transformed, and a LightGBM regressor with early stopping served as the baseline model.