Kaggle: TMDB Box Office Prediction
- EDA
- Feature engineering
- Model training
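I have been looking for practice projects on Kaggle lately, and the TMDB Box Office Prediction competition turned out to be a good one to practice on. These are my notes.
The goal is to train a model on the train set, feed it the test set to predict the target value revenue (box office), and upload the predictions to get a score and a leaderboard rank.
The competition data can be downloaded directly from Kaggle. The additional data used in this post comes from two datasets:
https://www.kaggle.com/kamalchhirang/tmdb-competition-additional-features
and
https://www.kaggle.com/kamalchhirang/tmdb-box-office-prediction-more-training-data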
EDA
After loading and tidying the data, train.info() gives an overview:

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 3000 entries, 0 to 2999
    Data columns (total 53 columns):
    id                       3000 non-null int64
    belongs_to_collection    604 non-null object
    budget                   3000 non-null int64
    genres                   3000 non-null object
    homepage                 946 non-null object
    imdb_id                  3000 non-null object
    original_language        3000 non-null object
    original_title           3000 non-null object
    overview                 2992 non-null object
    popularity               3000 non-null float64
    poster_path              2999 non-null object
    production_companies     2844 non-null object
    production_countries     2945 non-null object
    release_date             3000 non-null object
    runtime                  2998 non-null float64
    spoken_languages         2980 non-null object
    status                   3000 non-null object
    tagline                  2403 non-null object
    title                    3000 non-null object
    Keywords                 2724 non-null object
    cast                     2987 non-null object
    crew                     2984 non-null object
    revenue                  3000 non-null int64
    logRevenue               3000 non-null float64
    release_month            3000 non-null int32
    release_day              3000 non-null int32
    release_year             3000 non-null int32
    release_dayofweek        3000 non-null int64
    release_quarter          3000 non-null int64
    Action                   3000 non-null int64
    Adventure                3000 non-null int64
    Animation                3000 non-null int64
    Comedy                   3000 non-null int64
    Crime                    3000 non-null int64
    Documentary              3000 non-null int64
    Drama                    3000 non-null int64
    Family                   3000 non-null int64
    Fantasy                  3000 non-null int64
    Foreign                  3000 non-null int64
    History                  3000 non-null int64
    Horror                   3000 non-null int64
    Music                    3000 non-null int64
    Mystery                  3000 non-null int64
    Romance                  3000 non-null int64
    Science Fiction          3000 non-null int64
    TV Movie                 3000 non-null int64
    Thriller                 3000 non-null int64
    War                      3000 non-null int64
    Western                  3000 non-null int64
    popularity2              2882 non-null float64
    rating                   3000 non-null float64
    totalVotes               3000 non-null float64
    meanRevenueByRating      8 non-null float64
    dtypes: float64(7), int32(3), int64(25), object(18)
    memory usage: 1.2+ MB

The main features:
- budget: production budget
- revenue: box office revenue
- rating: audience rating
- totalVotes: number of audience ratings
- popularity: popularity score (how it is computed is not known yet)
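Revenue and budget show a fairly clear positive correlation. That matches common sense, although it is not a given: plenty of big-budget films flop these days.
Besides budget, revenue is also related to the number of audience votes. This makes sense too: whether the ratings are high or low, a large number of votes means the film is being talked about, so its revenue is unlikely to be very low.
In recent years both investment and revenue have grown substantially, which shows how hot the film market is. It is also a reminder that the feature engineering needs to pay attention to release year.
Revenue by film language: en is English. As the world's common language it leads both in overall revenue volume and in blockbuster hits. zh is exactly what you think, Chinese; Chinese-language films are starting to come within sight of English-language ones. Language tells us something about revenue too.
The revenue distribution is clearly right-skewed. np.log1p converts it to log form, which brings it much closer to a normal distribution; just remember to convert the final predictions back with np.expm1.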
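The log transform and its inverse can be checked in a few lines. This is a minimal sketch with made-up revenue figures (not values from the competition data); it only shows that np.expm1 undoes np.log1p and that a revenue of 0 is handled.

```python
import numpy as np

# Made-up revenue figures spanning several orders of magnitude.
revenue = np.array([0.0, 1.0, 12_000.0, 3_500_000.0, 1_500_000_000.0])

# log1p(x) = log(1 + x): defined at x = 0, unlike a plain log.
log_revenue = np.log1p(revenue)

# expm1(y) = exp(y) - 1 is the inverse; apply it to the model's predictions.
restored = np.expm1(log_revenue)

print(np.round(log_revenue, 3))
print(np.allclose(restored, revenue))
```

A side effect worth knowing: an RMSE computed on log1p(revenue) is, by definition, the RMSLE of the raw revenue.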
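A heatmap shows the Pearson (linear) correlation coefficients between the main features and revenue, and among the features themselves. The features most correlated with revenue are budget, totalVotes and popularity.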
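The numbers behind such a heatmap are just a Pearson correlation matrix. A minimal sketch, assuming four hypothetical feature vectors (the values are invented for illustration and are not from the dataset):

```python
import numpy as np

# Hypothetical values for five films; not taken from the competition data.
budget = np.array([1.0, 5.0, 20.0, 60.0, 150.0])
total_votes = np.array([30.0, 200.0, 900.0, 2500.0, 8000.0])
popularity = np.array([2.0, 4.0, 9.0, 7.0, 25.0])
revenue = np.array([0.5, 12.0, 45.0, 180.0, 600.0])

names = ["budget", "totalVotes", "popularity", "revenue"]

# np.corrcoef treats each row as one variable and returns the Pearson matrix.
corr = np.corrcoef(np.vstack([budget, total_votes, popularity, revenue]))

# Correlation of every feature with revenue (the last row of the matrix).
for name, r in zip(names, corr[-1]):
    print(f"{name:>11s}  {r:+.3f}")
```

With the real DataFrame, train[cols].corr() returns the same kind of matrix, and seaborn's heatmap function can draw it.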
Feature engineering
The feature engineering is too fiddly to walk through step by step, so here is the complete code.
    import numpy as np
    import pandas as pd
    import warnings
    from tqdm import tqdm
    from sklearn.preprocessing import LabelEncoder

    warnings.filterwarnings("ignore")


    def prepare(df):
        global json_cols
        global train_dict

        # Rating features; weightedRating shrinks ratings with few votes toward 6.367.
        df['rating'] = df['rating'].fillna(1.5)
        df['totalVotes'] = df['totalVotes'].fillna(6)
        df['weightedRating'] = (df['rating'] * df['totalVotes'] + 6.367 * 300) / (df['totalVotes'] + 300)

        # Date features; two-digit years are expanded to four digits.
        df[['release_month', 'release_day', 'release_year']] = df['release_date'].str.split('/', expand=True).replace(np.nan, 0).astype(int)
        df.loc[(df['release_year'] <= 19) & (df['release_year'] < 100), "release_year"] += 2000
        df.loc[(df['release_year'] > 19) & (df['release_year'] < 100), "release_year"] += 1900

        releaseDate = pd.to_datetime(df['release_date'])
        df['release_dayofweek'] = releaseDate.dt.dayofweek
        df['release_quarter'] = releaseDate.dt.quarter

        # Budget features.
        df['originalBudget'] = df['budget']
        df['inflationBudget'] = df['budget'] + df['budget'] * 1.8 / 100 * (2018 - df['release_year'])  # Inflation simple formula
        df['budget'] = np.log1p(df['budget'])

        # Crew gender counts.
        df['genders_0_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
        df['genders_1_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
        df['genders_2_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))

        df['_collection_name'] = df['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else '').fillna('')
        le = LabelEncoder()
        df['_collection_name'] = le.fit_transform(df['_collection_name'])
        df['_num_Keywords'] = df['Keywords'].apply(lambda x: len(x) if x != {} else 0)
        df['_num_cast'] = df['cast'].apply(lambda x: len(x) if x != {} else 0)

        # Ratio features.
        df['_popularity_mean_year'] = df['popularity'] / df.groupby("release_year")["popularity"].transform('mean')
        df['_budget_runtime_ratio'] = df['budget'] / df['runtime']
        df['_budget_popularity_ratio'] = df['budget'] / df['popularity']
        df['_budget_year_ratio'] = df['budget'] / (df['release_year'] * df['release_year'])
        df['_releaseYear_popularity_ratio'] = df['release_year'] / df['popularity']
        df['_releaseYear_popularity_ratio2'] = df['popularity'] / df['release_year']
        df['_popularity_totalVotes_ratio'] = df['totalVotes'] / df['popularity']
        df['_rating_popularity_ratio'] = df['rating'] / df['popularity']
        df['_rating_totalVotes_ratio'] = df['totalVotes'] / df['rating']
        df['_totalVotes_releaseYear_ratio'] = df['totalVotes'] / df['release_year']
        df['_budget_rating_ratio'] = df['budget'] / df['rating']
        df['_runtime_rating_ratio'] = df['runtime'] / df['rating']
        df['_budget_totalVotes_ratio'] = df['budget'] / df['totalVotes']

        # Indicator features.
        df['has_homepage'] = 1
        df.loc[pd.isnull(df['homepage']), "has_homepage"] = 0
        # belongs_to_collection has already been parsed by get_dictionary, so a
        # missing value is an empty container rather than NaN; test for emptiness
        # (the original pd.isnull check never matched).
        df['isbelongs_to_collectionNA'] = 0
        df.loc[df['belongs_to_collection'].apply(len) == 0, "isbelongs_to_collectionNA"] = 1
        # A missing tagline is NaN, not 0 (the original `== 0` check never matched).
        df['isTaglineNA'] = 0
        df.loc[pd.isnull(df['tagline']), "isTaglineNA"] = 1
        df['isOriginalLanguageEng'] = 0
        df.loc[df['original_language'] == "en", "isOriginalLanguageEng"] = 1
        df['isTitleDifferent'] = 1
        df.loc[df['original_title'] == df['title'], "isTitleDifferent"] = 0
        df['isMovieReleased'] = 1
        df.loc[df['status'] != "Released", "isMovieReleased"] = 0

        # get collection id
        df['collection_id'] = df['belongs_to_collection'].apply(lambda x: np.nan if len(x) == 0 else x[0]['id'])

        # Length and count features.
        df['original_title_letter_count'] = df['original_title'].str.len()
        df['original_title_word_count'] = df['original_title'].str.split().str.len()
        df['title_word_count'] = df['title'].str.split().str.len()
        df['overview_word_count'] = df['overview'].str.split().str.len()
        df['tagline_word_count'] = df['tagline'].str.split().str.len()
        df['production_countries_count'] = df['production_countries'].apply(lambda x: len(x))
        df['production_companies_count'] = df['production_companies'].apply(lambda x: len(x))
        df['crew_count'] = df['crew'].apply(lambda x: len(x) if x != {} else 0)

        # df['meanruntimeByYear'] = df.groupby("release_year")["runtime"].aggregate('mean')
        # df['meanPopularityByYear'] = df.groupby("release_year")["popularity"].aggregate('mean')
        # df['meanBudgetByYear'] = df.groupby("release_year")["budget"].aggregate('mean')
        # df['meantotalVotesByYear'] = df.groupby("release_year")["totalVotes"].aggregate('mean')
        # df['meanTotalVotesByRating'] = df.groupby("rating")["totalVotes"].aggregate('mean')
        # df['medianBudgetByYear'] = df.groupby("release_year")["budget"].aggregate('median')

        # Multi-hot encode the list-valued columns; names that were filtered out
        # of train_dict are pooled into a single '<col>_etc' column.
        for col in ['genres', 'production_countries', 'spoken_languages', 'production_companies']:
            df[col] = (df[col]
                       .map(lambda x: sorted(list(set([n if n in train_dict[col] else col + '_etc' for n in [d['name'] for d in x]]))))
                       .map(lambda x: ','.join(map(str, x))))
            temp = df[col].str.get_dummies(sep=',')
            df = pd.concat([df, temp], axis=1, sort=False)
        df.drop(['genres_etc'], axis=1, inplace=True)

        df = df.drop(['id', 'revenue', 'belongs_to_collection', 'genres', 'homepage', 'imdb_id', 'overview', 'runtime',
                      'poster_path', 'production_companies', 'production_countries', 'release_date', 'spoken_languages',
                      'status', 'title', 'Keywords', 'cast', 'crew', 'original_language', 'original_title', 'tagline',
                      'collection_id'], axis=1)
        df.fillna(value=0.0, inplace=True)
        return df


    def get_dictionary(s):
        # Parse the stringified list of dicts; anything unparsable (e.g. NaN) becomes {}.
        try:
            d = eval(s)
        except Exception:
            d = {}
        return d


    def get_json_dict(df):
        # For every JSON-like column, count how often each 'name' occurs.
        global json_cols
        result = dict()
        for e_col in json_cols:
            d = dict()
            rows = df[e_col].values
            for row in rows:
                if row is None:
                    continue
                for i in row:
                    if i['name'] not in d:
                        d[i['name']] = 0
                    d[i['name']] += 1
            result[e_col] = d
        return result


    if __name__ == '__main__':
        train = pd.read_csv('./train.csv')
        # Manual corrections of known bad budget / revenue values.
        train.loc[train['id'] == 16, 'revenue'] = 192864  # Skinning
        train.loc[train['id'] == 90, 'budget'] = 30000000  # Sommersby
        train.loc[train['id'] == 118, 'budget'] = 60000000  # Wild Hogs
        train.loc[train['id'] == 149, 'budget'] = 18000000  # Beethoven
        train.loc[train['id'] == 313, 'revenue'] = 12000000  # The Cookout
        train.loc[train['id'] == 451, 'revenue'] = 12000000  # Chasing Liberty
        train.loc[train['id'] == 464, 'budget'] = 20000000  # Parenthood
        train.loc[train['id'] == 470, 'budget'] = 13000000  # The Karate Kid, Part II
        train.loc[train['id'] == 513, 'budget'] = 930000  # From Prada to Nada
        train.loc[train['id'] == 797, 'budget'] = 8000000  # Welcome to Dongmakgol
        train.loc[train['id'] == 819, 'budget'] = 90000000  # Alvin and the Chipmunks: The Road Chip
        train.loc[train['id'] == 850, 'budget'] = 90000000  # Modern Times
        train.loc[train['id'] == 1007, 'budget'] = 2  # Zyzzyx Road
        train.loc[train['id'] == 1112, 'budget'] = 7500000  # An Officer and a Gentleman
        train.loc[train['id'] == 1131, 'budget'] = 4300000  # Smokey and the Bandit
        train.loc[train['id'] == 1359, 'budget'] = 10000000  # Stir Crazy
        train.loc[train['id'] == 1542, 'budget'] = 1  # All at Once
        train.loc[train['id'] == 1570, 'budget'] = 15800000  # Crocodile Dundee II
        train.loc[train['id'] == 1571, 'budget'] = 4000000  # Lady and the Tramp
        train.loc[train['id'] == 1714, 'budget'] = 46000000  # The Recruit
        train.loc[train['id'] == 1721, 'budget'] = 17500000  # Cocoon
        train.loc[train['id'] == 1865, 'revenue'] = 25000000  # Scooby-Doo 2: Monsters Unleashed
        train.loc[train['id'] == 1885, 'budget'] = 12  # In the Cut
        train.loc[train['id'] == 2091, 'budget'] = 10  # Deadfall
        train.loc[train['id'] == 2268, 'budget'] = 17500000  # Madea Goes to Jail budget
        train.loc[train['id'] == 2491, 'budget'] = 6  # Never Talk to Strangers
        train.loc[train['id'] == 2602, 'budget'] = 31000000  # Mr. Holland's Opus
        train.loc[train['id'] == 2612, 'budget'] = 15000000  # Field of Dreams
        train.loc[train['id'] == 2696, 'budget'] = 10000000  # Nurse 3-D
        train.loc[train['id'] == 2801, 'budget'] = 10000000  # Fracture
        train.loc[train['id'] == 335, 'budget'] = 2
        train.loc[train['id'] == 348, 'budget'] = 12
        train.loc[train['id'] == 470, 'budget'] = 13000000
        train.loc[train['id'] == 513, 'budget'] = 1100000
        train.loc[train['id'] == 640, 'budget'] = 6
        train.loc[train['id'] == 696, 'budget'] = 1
        train.loc[train['id'] == 797, 'budget'] = 8000000
        train.loc[train['id'] == 850, 'budget'] = 1500000
        train.loc[train['id'] == 1199, 'budget'] = 5
        train.loc[train['id'] == 1282, 'budget'] = 9  # Death at a Funeral
        train.loc[train['id'] == 1347, 'budget'] = 1
        train.loc[train['id'] == 1755, 'budget'] = 2
        train.loc[train['id'] == 1801, 'budget'] = 5
        train.loc[train['id'] == 1918, 'budget'] = 592
        train.loc[train['id'] == 2033, 'budget'] = 4
        train.loc[train['id'] == 2118, 'budget'] = 344
        train.loc[train['id'] == 2252, 'budget'] = 130
        train.loc[train['id'] == 2256, 'budget'] = 1
        train.loc[train['id'] == 2696, 'budget'] = 10000000

        test = pd.read_csv('./test.csv')
        # Clean Data
        test.loc[test['id'] == 6733, 'budget'] = 5000000
        test.loc[test['id'] == 3889, 'budget'] = 15000000
        test.loc[test['id'] == 6683, 'budget'] = 50000000
        test.loc[test['id'] == 5704, 'budget'] = 4300000
        test.loc[test['id'] == 6109, 'budget'] = 281756
        test.loc[test['id'] == 7242, 'budget'] = 10000000
        test.loc[test['id'] == 7021, 'budget'] = 17540562  # Two Is a Family
        test.loc[test['id'] == 5591, 'budget'] = 4000000  # The Orphanage
        test.loc[test['id'] == 4282, 'budget'] = 20000000  # Big Top Pee-wee
        test.loc[test['id'] == 3033, 'budget'] = 250
        test.loc[test['id'] == 3051, 'budget'] = 50
        test.loc[test['id'] == 3084, 'budget'] = 337
        test.loc[test['id'] == 3224, 'budget'] = 4
        test.loc[test['id'] == 3594, 'budget'] = 25
        test.loc[test['id'] == 3619, 'budget'] = 500
        test.loc[test['id'] == 3831, 'budget'] = 3
        test.loc[test['id'] == 3935, 'budget'] = 500
        test.loc[test['id'] == 4049, 'budget'] = 995946
        test.loc[test['id'] == 4424, 'budget'] = 3
        test.loc[test['id'] == 4460, 'budget'] = 8
        test.loc[test['id'] == 4555, 'budget'] = 1200000
        test.loc[test['id'] == 4624, 'budget'] = 30
        test.loc[test['id'] == 4645, 'budget'] = 500
        test.loc[test['id'] == 4709, 'budget'] = 450
        test.loc[test['id'] == 4839, 'budget'] = 7
        test.loc[test['id'] == 3125, 'budget'] = 25
        test.loc[test['id'] == 3142, 'budget'] = 1
        test.loc[test['id'] == 3201, 'budget'] = 450
        test.loc[test['id'] == 3222, 'budget'] = 6
        test.loc[test['id'] == 3545, 'budget'] = 38
        test.loc[test['id'] == 3670, 'budget'] = 18
        test.loc[test['id'] == 3792, 'budget'] = 19
        test.loc[test['id'] == 3881, 'budget'] = 7
        test.loc[test['id'] == 3969, 'budget'] = 400
        test.loc[test['id'] == 4196, 'budget'] = 6
        test.loc[test['id'] == 4221, 'budget'] = 11
        test.loc[test['id'] == 4222, 'budget'] = 500
        test.loc[test['id'] == 4285, 'budget'] = 11
        test.loc[test['id'] == 4319, 'budget'] = 1
        test.loc[test['id'] == 4639, 'budget'] = 10
        test.loc[test['id'] == 4719, 'budget'] = 45
        test.loc[test['id'] == 4822, 'budget'] = 22
        test.loc[test['id'] == 4829, 'budget'] = 20
        test.loc[test['id'] == 4969, 'budget'] = 20
        test.loc[test['id'] == 5021, 'budget'] = 40
        test.loc[test['id'] == 5035, 'budget'] = 1
        test.loc[test['id'] == 5063, 'budget'] = 14
        test.loc[test['id'] == 5119, 'budget'] = 2
        test.loc[test['id'] == 5214, 'budget'] = 30
        test.loc[test['id'] == 5221, 'budget'] = 50
        test.loc[test['id'] == 4903, 'budget'] = 15
        test.loc[test['id'] == 4983, 'budget'] = 3
        test.loc[test['id'] == 5102, 'budget'] = 28
        test.loc[test['id'] == 5217, 'budget'] = 75
        test.loc[test['id'] == 5224, 'budget'] = 3
        test.loc[test['id'] == 5469, 'budget'] = 20
        test.loc[test['id'] == 5840, 'budget'] = 1
        test.loc[test['id'] == 5960, 'budget'] = 30
        test.loc[test['id'] == 6506, 'budget'] = 11
        test.loc[test['id'] == 6553, 'budget'] = 280
        test.loc[test['id'] == 6561, 'budget'] = 7
        test.loc[test['id'] == 6582, 'budget'] = 218
        test.loc[test['id'] == 6638, 'budget'] = 5
        test.loc[test['id'] == 6749, 'budget'] = 8
        test.loc[test['id'] == 6759, 'budget'] = 50
        test.loc[test['id'] == 6856, 'budget'] = 10
        test.loc[test['id'] == 6858, 'budget'] = 100
        test.loc[test['id'] == 6876, 'budget'] = 250
        test.loc[test['id'] == 6972, 'budget'] = 1
        test.loc[test['id'] == 7079, 'budget'] = 8000000
        test.loc[test['id'] == 7150, 'budget'] = 118
        test.loc[test['id'] == 6506, 'budget'] = 118
        test.loc[test['id'] == 7225, 'budget'] = 6
        test.loc[test['id'] == 7231, 'budget'] = 85
        test.loc[test['id'] == 5222, 'budget'] = 5
        test.loc[test['id'] == 5322, 'budget'] = 90
        test.loc[test['id'] == 5350, 'budget'] = 70
        test.loc[test['id'] == 5378, 'budget'] = 10
        test.loc[test['id'] == 5545, 'budget'] = 80
        test.loc[test['id'] == 5810, 'budget'] = 8
        test.loc[test['id'] == 5926, 'budget'] = 300
        test.loc[test['id'] == 5927, 'budget'] = 4
        test.loc[test['id'] == 5986, 'budget'] = 1
        test.loc[test['id'] == 6053, 'budget'] = 20
        test.loc[test['id'] == 6104, 'budget'] = 1
        test.loc[test['id'] == 6130, 'budget'] = 30
        test.loc[test['id'] == 6301, 'budget'] = 150
        test.loc[test['id'] == 6276, 'budget'] = 100
        test.loc[test['id'] == 6473, 'budget'] = 100
        test.loc[test['id'] == 6842, 'budget'] = 30

        # features from https://www.kaggle.com/kamalchhirang/eda-simple-feature-engineering-external-data
        train = pd.merge(train, pd.read_csv('./TrainAdditionalFeatures.csv'), how='left', on=['imdb_id'])
        test = pd.merge(test, pd.read_csv('./TestAdditionalFeatures.csv'), how='left', on=['imdb_id'])

        additionalTrainData = pd.read_csv('./additionalTrainData.csv')
        additionalTrainData['release_date'] = additionalTrainData['release_date'].astype('str').str.replace('-', '/')
        train = pd.concat([train, additionalTrainData])
        print(train.columns)
        print(train.shape)

        # The target is trained in log space.
        train['revenue'] = np.log1p(train['revenue'])
        y = train['revenue']

        json_cols = ['genres', 'production_companies', 'production_countries', 'spoken_languages', 'Keywords', 'cast',
                     'crew']
        for col in tqdm(json_cols + ['belongs_to_collection']):
            train[col] = train[col].apply(lambda x: get_dictionary(x))
            test[col] = test[col].apply(lambda x: get_dictionary(x))

        train_dict = get_json_dict(train)
        test_dict = get_json_dict(test)

        # remove category with bias and low frequency
        for col in json_cols:
            remove = []
            train_id = set(list(train_dict[col].keys()))
            test_id = set(list(test_dict[col].keys()))
            remove += list(train_id - test_id) + list(test_id - train_id)
            for i in train_id.union(test_id) - set(remove):
                if train_dict[col][i] < 10 or i == '':
                    remove += [i]
            for i in remove:
                if i in train_dict[col]:
                    del train_dict[col][i]
                if i in test_dict[col]:
                    del test_dict[col][i]

        all_data = prepare(pd.concat([train, test]).reset_index(drop=True))
        train = all_data.loc[:train.shape[0] - 1, :]
        test = all_data.loc[train.shape[0]:, :]
        print(train.columns)
        print(train.shape)

        train.to_csv('./X_train.csv', index=False)
        test.to_csv('./X_test.csv', index=False)
        y.to_csv('./y_train.csv', header=True, index=False)

Finally, the data to be trained on is saved as X_train, X_test and y_train.
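Three of the numeric steps inside prepare are easy to miss in the full listing: expanding two-digit years, the vote-weighted rating, and the simple inflation adjustment. Below is a small sketch of just those formulas as plain functions. The function and parameter names are my own for illustration; the constants (19, 6.367, 300, 1.8 % and 2018) are the ones that appear in the script, and release dates are assumed to end in a two-digit year.

```python
def expand_year(year, pivot=19):
    """Map a two-digit year to four digits: 0..pivot -> 20xx, otherwise 19xx."""
    if year >= 100:  # already a full year, leave it alone
        return year
    return year + (2000 if year <= pivot else 1900)


def weighted_rating(rating, votes, prior_mean=6.367, prior_votes=300):
    """Pull a rating toward prior_mean; the fewer the votes, the stronger the pull."""
    return (rating * votes + prior_mean * prior_votes) / (votes + prior_votes)


def inflation_budget(budget, year, rate_percent=1.8, base_year=2018):
    """Simple (non-compounding) inflation adjustment up to base_year."""
    return budget + budget * rate_percent / 100 * (base_year - year)


print(expand_year(15), expand_year(87))        # 2015 1987
print(round(weighted_rating(9.0, 10), 3))      # 6.452: ten votes barely move the prior
print(round(weighted_rating(9.0, 10_000), 3))  # 8.923: many votes dominate it
print(round(inflation_budget(1_000_000, 2008)))
```

The weighted rating is the reason a film rated 9.0 by ten people does not look better to the model than one rated 8.5 by fifty thousand.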
    X_train_p = pd.read_csv('./X_train.csv')
    X_train_p.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 5001 entries, 0 to 5000
    Columns: 197 entries, budget to production_companies_etc
    dtypes: float64(25), int64(172)
    memory usage: 7.5 MB

After preprocessing, the training data has 5001 rows and 197 features, all of them numeric.
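Most of the int64 columns come from turning list-valued fields such as genres into 0/1 indicator columns, after dropping names that are rare or that appear in only one of train and test. The idea can be sketched in pure Python on toy data. The rows below are invented, and the frequency threshold is lowered from the script's 10 to 2 so that the toy example has something to keep.

```python
from collections import Counter

# Hypothetical parsed 'genres' values, one list of names per film.
train_rows = [["Drama", "Comedy"], ["Drama"], ["Drama", "Western"], ["Comedy"]]
test_rows = [["Drama", "Noir"], ["Comedy", "Drama"]]

MIN_COUNT = 2  # the script uses 10; lowered here to suit the toy data

train_counts = Counter(name for row in train_rows for name in row)
test_counts = Counter(name for row in test_rows for name in row)

# Keep names seen in both sets and frequent enough in train.
kept = sorted(
    name
    for name in train_counts.keys() & test_counts.keys()
    if train_counts[name] >= MIN_COUNT and name != ""
)


def encode(row, col="genres"):
    """Return {column: 0/1}; names outside `kept` share one '<col>_etc' flag."""
    labels = {name if name in kept else col + "_etc" for name in row}
    return {c: int(c in labels) for c in kept + [col + "_etc"]}


encoded = [encode(row) for row in train_rows + test_rows]
for row, flags in zip(train_rows + test_rows, encoded):
    print(row, "->", flags)
```

Pooling the leftovers into a single "_etc" column keeps the feature matrix identical in shape for train and test. The script then drops genres_etc but keeps the "_etc" columns of the other three fields.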
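Model training
The data is trained with three variants of GBDT (gradient boosted decision trees): xgboost, lightGBM and catboost. The three Blue-Eyes White Dragons are then fused to summon the Blue-Eyes Ultimate Dragon, that is, the three models are blended, and the blended model predicts the box office.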
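Blending here simply means a weighted average of the three models' predictions; the training code weights xgboost at 0.2 and lightGBM and catboost at 0.4 each. A minimal sketch with made-up log-revenue predictions (the numbers are chosen for illustration and are not output from the real models):

```python
import numpy as np


def rmse(pred, target):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


# Made-up log-revenue targets and three models' predictions for four films.
y_true = np.array([15.0, 17.0, 12.0, 18.0])
preds = {
    "xgb": np.array([14.0, 17.5, 13.0, 18.5]),
    "lgb": np.array([15.5, 16.0, 12.5, 17.0]),
    "cat": np.array([15.5, 17.5, 11.0, 18.5]),
}
weights = {"xgb": 0.2, "lgb": 0.4, "cat": 0.4}  # weights sum to 1

# Weighted average of the three prediction vectors.
blend = sum(weights[name] * preds[name] for name in preds)

for name in preds:
    print(name, round(rmse(preds[name], y_true), 4))
print("blend", round(rmse(blend, y_true), 4))
```

In this toy case the models err in different directions, so the errors partly cancel in the average. In general a convex blend is never worse than its worst member, and it helps most when the models' mistakes are weakly correlated.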
Training code:
    # Note: this listing uses the xgboost / lightgbm APIs as they were when the
    # post was written. Later releases renamed or removed some of these arguments
    # (for example, the 'reg:linear' objective became 'reg:squarederror').
    import numpy as np
    import pandas as pd
    import warnings
    from datetime import datetime
    from sklearn.model_selection import KFold
    import xgboost as xgb
    import lightgbm as lgb
    from catboost import CatBoostRegressor

    warnings.filterwarnings("ignore")


    def xgb_model(X_train, y_train, X_val, y_val, X_test, verbose):
        params = {'objective': 'reg:linear',
                  'eta': 0.01,
                  'max_depth': 6,
                  'subsample': 0.6,
                  'colsample_bytree': 0.7,
                  'eval_metric': 'rmse',
                  'seed': random_seed,
                  'silent': True,
                  }
        record = dict()
        model = xgb.train(params,
                          xgb.DMatrix(X_train, y_train),
                          100000,
                          [(xgb.DMatrix(X_train, y_train), 'train'),
                           (xgb.DMatrix(X_val, y_val), 'valid')],
                          verbose_eval=verbose,
                          early_stopping_rounds=500,
                          callbacks=[xgb.callback.record_evaluation(record)])
        # Report the best validation RMSE and predict with the best iteration.
        best_idx = np.argmin(np.array(record['valid']['rmse']))
        val_pred = model.predict(xgb.DMatrix(X_val), ntree_limit=model.best_ntree_limit)
        test_pred = model.predict(xgb.DMatrix(X_test), ntree_limit=model.best_ntree_limit)
        return {'val': val_pred, 'test': test_pred, 'error': record['valid']['rmse'][best_idx],
                'importance': [i for k, i in model.get_score().items()]}


    def lgb_model(X_train, y_train, X_val, y_val, X_test, verbose):
        params = {'objective': 'regression',
                  'num_leaves': 30,
                  'min_data_in_leaf': 20,
                  'max_depth': 9,
                  'learning_rate': 0.004,
                  # 'min_child_samples':100,
                  'feature_fraction': 0.9,
                  "bagging_freq": 1,
                  "bagging_fraction": 0.9,
                  'lambda_l1': 0.2,
                  "bagging_seed": random_seed,
                  "metric": 'rmse',
                  # 'subsample':.8,
                  # 'colsample_bytree':.9,
                  "random_state": random_seed,
                  "verbosity": -1}
        record = dict()
        model = lgb.train(params,
                          lgb.Dataset(X_train, y_train),
                          num_boost_round=100000,
                          valid_sets=[lgb.Dataset(X_val, y_val)],
                          verbose_eval=verbose,
                          early_stopping_rounds=500,
                          callbacks=[lgb.record_evaluation(record)])
        best_idx = np.argmin(np.array(record['valid_0']['rmse']))
        val_pred = model.predict(X_val, num_iteration=model.best_iteration)
        test_pred = model.predict(X_test, num_iteration=model.best_iteration)
        return {'val': val_pred, 'test': test_pred, 'error': record['valid_0']['rmse'][best_idx],
                'importance': model.feature_importance('gain')}


    def cat_model(X_train, y_train, X_val, y_val, X_test, verbose):
        model = CatBoostRegressor(iterations=100000,
                                  learning_rate=0.004,
                                  depth=5,
                                  eval_metric='RMSE',
                                  colsample_bylevel=0.8,
                                  random_seed=random_seed,
                                  bagging_temperature=0.2,
                                  metric_period=None,
                                  early_stopping_rounds=200)
        model.fit(X_train, y_train,
                  eval_set=(X_val, y_val),
                  use_best_model=True,
                  verbose=False)
        val_pred = model.predict(X_val)
        test_pred = model.predict(X_test)
        return {'val': val_pred, 'test': test_pred,
                'error': model.get_best_score()['validation_0']['RMSE'],
                'importance': model.get_feature_importance()}


    if __name__ == '__main__':
        X_train_p = pd.read_csv('./X_train.csv')
        X_test = pd.read_csv('./X_test.csv')
        y_train_p = pd.read_csv('./y_train.csv')

        random_seed = 2019
        k = 10
        fold = list(KFold(k, shuffle=True, random_state=random_seed).split(X_train_p))
        np.random.seed(random_seed)

        result_dict = dict()
        val_pred = np.zeros(X_train_p.shape[0])
        test_pred = np.zeros(X_test.shape[0])
        final_err = 0
        verbose = False

        for i, (train, val) in enumerate(fold):
            print(i + 1, "fold. RMSE")

            X_train = X_train_p.loc[train, :]
            y_train = y_train_p.loc[train, :].values.ravel()
            X_val = X_train_p.loc[val, :]
            y_val = y_train_p.loc[val, :].values.ravel()

            fold_val_pred = []
            fold_test_pred = []
            fold_err = []

            # xgboost (blend weight 0.2)
            start = datetime.now()
            result = xgb_model(X_train, y_train, X_val, y_val, X_test, verbose)
            fold_val_pred.append(result['val'] * 0.2)
            fold_test_pred.append(result['test'] * 0.2)
            fold_err.append(result['error'])
            print("xgb model.", "{0:.5f}".format(result['error']),
                  '(' + str(int((datetime.now() - start).seconds)) + 's)')

            # lightgbm (blend weight 0.4)
            start = datetime.now()
            result = lgb_model(X_train, y_train, X_val, y_val, X_test, verbose)
            fold_val_pred.append(result['val'] * 0.4)
            fold_test_pred.append(result['test'] * 0.4)
            fold_err.append(result['error'])
            print("lgb model.", "{0:.5f}".format(result['error']),
                  '(' + str(int((datetime.now() - start).seconds)) + 's)')

            # catboost model (blend weight 0.4)
            start = datetime.now()
            result = cat_model(X_train, y_train, X_val, y_val, X_test, verbose)
            fold_val_pred.append(result['val'] * 0.4)
            fold_test_pred.append(result['test'] * 0.4)
            fold_err.append(result['error'])
            print("cat model.", "{0:.5f}".format(result['error']),
                  '(' + str(int((datetime.now() - start).seconds)) + 's)')

            # mix result of multiple models
            val_pred[val] += np.sum(np.array(fold_val_pred), axis=0)
            print(fold_test_pred)
            test_pred += np.sum(np.array(fold_test_pred), axis=0) / k
            final_err += (sum(fold_err) / len(fold_err)) / k

            print("---------------------------")
            print("avg err.", "{0:.5f}".format(sum(fold_err) / len(fold_err)))
            print("blend err.",
                  "{0:.5f}".format(np.sqrt(np.mean((np.sum(np.array(fold_val_pred), axis=0) - y_val) ** 2))))
            print('')

        print("final avg err.", final_err)
        print("final blend err.", np.sqrt(np.mean((val_pred - y_train_p.values.ravel()) ** 2)))

        # Predictions are in log space; convert back to revenue before submitting.
        sub = pd.read_csv('./sample_submission.csv')
        df_sub = pd.DataFrame()
        df_sub['id'] = sub['id']
        df_sub['revenue'] = np.expm1(test_pred)
        print(df_sub['revenue'])
        df_sub.to_csv('./submission.csv', index=False)
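The bookkeeping in the K-fold loop (every training row gets exactly one out-of-fold prediction, and the test prediction is the average over the k fold models) can be seen more easily without the boosting libraries. The sketch below uses synthetic data and ordinary least squares as a stand-in model. That substitution is mine, purely to keep the example self-contained; it is not what the script trains.

```python
import numpy as np

rng = np.random.default_rng(2019)

# Synthetic regression data standing in for X_train / y_train / X_test.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 3))


def fit_predict(X_tr, y_tr, *X_new):
    """Stand-in model: ordinary least squares instead of boosted trees."""
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return [X_n @ coef for X_n in X_new]


k = 5
folds = np.array_split(rng.permutation(len(X)), k)  # shuffled validation indices

oof_pred = np.zeros(len(X))        # each row is predicted exactly once
test_pred = np.zeros(len(X_test))  # average of the k fold models

for val_idx in folds:
    train_idx = np.setdiff1d(np.arange(len(X)), val_idx)
    val_p, test_p = fit_predict(X[train_idx], y[train_idx], X[val_idx], X_test)
    oof_pred[val_idx] = val_p
    test_pred += test_p / k

oof_rmse = float(np.sqrt(np.mean((oof_pred - y) ** 2)))
print("out-of-fold RMSE:", round(oof_rmse, 4))
```

The out-of-fold RMSE corresponds to the "final blend err." line in the script: it is computed only from predictions on rows the model did not see during training, so it is a fairer estimate of leaderboard performance than the training error.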
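After submitting, you get a prediction score and a rank.
The competition does not have many participants yet, only about 400 teams, so my current rank is fairly high, which makes for a good start. Next I can try other, more varied competitions!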