[Python Learning Series 24] Predicting Vipshop User Purchase Behavior with Logistic Regression in scikit-learn
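1. Background: http://www.datafountain.cn/#/competitions/260/intro
This is the Vipshop user purchase behavior prediction competition on DataFountain. The author implemented it with logistic regression and scored 0.48, which is fairly weak; the code is given here for reference.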
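2. The features extracted for the competition are as follows:

| Feature category | Feature name | Description | Training notes |
| Basic features | u_id | Unique user identifier | |
| Basic features | spu_id | Unique product (SPU) identifier | |
| Basic features | brand_id | Identifier of the brand the product belongs to | |
| Basic features | cat_id | Identifier of the category the product belongs to | |
| User features | u_buy_num | Number of purchases | |
| User features | u_click_num | Number of clicks | |
| User features | u_buy_date | Number of days with a purchase | |
| User features | u_click_date | Number of days with a click | |
| User features | u_num_ratio | Purchase-to-click count ratio: purchases / clicks | |
| User features | u_date_ratio | Purchase-to-click day ratio: purchase days / click days | |
| User features | u_buy_freq | Purchase frequency: purchases / 90 days | |
| User features | u_click_freq | Click frequency: clicks / 90 days | |
| Product features | spu_buy_num | Number of purchases | |
| Product features | spu_click_num | Number of clicks | |
| Product features | spu_buy_date | Number of days with a purchase | |
| Product features | spu_click_date | Number of days with a click | |
| Product features | spu_num_ratio | Purchase-to-click count ratio: purchases / clicks | |
| Product features | spu_date_ratio | Purchase-to-click day ratio: purchase days / click days | |
| Product features | spu_buy_freq | Purchase frequency: purchases / 90 days | |
| Product features | spu_click_freq | Click frequency: clicks / 90 days | |
| Brand features | brand_buy_num | Number of purchases | |
| Brand features | brand_click_num | Number of clicks | |
| Brand features | brand_buy_date | Number of days with a purchase | |
| Brand features | brand_click_date | Number of days with a click | |
| Brand features | brand_num_ratio | Purchase-to-click count ratio: purchases / clicks | |
| Brand features | brand_date_ratio | Purchase-to-click day ratio: purchase days / click days | |
| Brand features | brand_buy_freq | Purchase frequency: purchases / 90 days | |
| Brand features | brand_click_freq | Click frequency: clicks / 90 days | |
| Category features | cat_buy_num | Number of purchases | |
| Category features | cat_click_num | Number of clicks | |
| Category features | cat_buy_date | Number of days with a purchase | |
| Category features | cat_click_date | Number of days with a click | |
| Category features | cat_num_ratio | Purchase-to-click count ratio: purchases / clicks | |
| Category features | cat_date_ratio | Purchase-to-click day ratio: purchase days / click days | |
| Category features | cat_buy_freq | Purchase frequency: purchases / 90 days | |
| Category features | cat_click_freq | Click frequency: clicks / 90 days | |
| Label | action_type | Whether the user will buy this product on the given day (0 = no, 1 = yes) | |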
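As a rough sketch of how the user-level features in the table could be derived, the snippet below aggregates a small, made-up action log with pandas. The log schema (`u_id`, `date`, and `action_type` with 1 marking a purchase and 0 a click) and the helper name `user_features` are assumptions for illustration; the post shows neither the competition's raw data layout nor the author's feature-extraction code. Only the formulas and the 90-day window come from the table.

```python
import pandas as pd

WINDOW_DAYS = 90  # observation window used by the *_freq features in the table


def user_features(log: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a per-event log into the user-level features from the table.

    Assumed (hypothetical) schema: u_id, date, action_type (1 = buy, 0 = click).
    """
    buys = log[log["action_type"] == 1]
    clicks = log[log["action_type"] == 0]

    # Series with different indexes are aligned on the union of user ids;
    # users with no purchases (or no clicks) get 0 instead of NaN.
    feats = pd.DataFrame({
        "u_buy_num": buys.groupby("u_id").size(),
        "u_click_num": clicks.groupby("u_id").size(),
        "u_buy_date": buys.groupby("u_id")["date"].nunique(),
        "u_click_date": clicks.groupby("u_id")["date"].nunique(),
    }).fillna(0).astype(int)

    # Ratios: mask zero denominators so users without clicks get 0.0, not inf.
    click_num = feats["u_click_num"].where(feats["u_click_num"] > 0)
    click_date = feats["u_click_date"].where(feats["u_click_date"] > 0)
    feats["u_num_ratio"] = (feats["u_buy_num"] / click_num).fillna(0.0)
    feats["u_date_ratio"] = (feats["u_buy_date"] / click_date).fillna(0.0)

    # Frequencies: counts divided by the length of the observation window.
    feats["u_buy_freq"] = feats["u_buy_num"] / WINDOW_DAYS
    feats["u_click_freq"] = feats["u_click_num"] / WINDOW_DAYS
    return feats


# Tiny made-up log: user 1 clicks three times on two days and buys once,
# user 2 only clicks.
log = pd.DataFrame({
    "u_id": [1, 1, 1, 1, 2, 2],
    "date": ["2017-06-01", "2017-06-01", "2017-06-02",
             "2017-06-03", "2017-06-01", "2017-06-05"],
    "action_type": [0, 0, 1, 0, 0, 0],
})
feats = user_features(log)
print(feats)
```

The same aggregation, grouped by `spu_id`, `brand_id` or `cat_id` instead of `u_id`, would give the product, brand and category features.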
3. The logistic regression reference code is as follows:
# -*- coding: utf-8 -*-
# Ported to Python 3: print(), pd.concat instead of DataFrame.append,
# time.perf_counter instead of time.clock.
import time

import pandas as pd
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression

# Every entity (user, product, brand, category) carries the same nine statistics.
# last_date is loaded from the files but is not described in the feature table.
STAT_SUFFIXES = ["buy_num", "click_num", "buy_date", "click_date",
                 "num_ratio", "date_ratio", "buy_freq", "click_freq", "last_date"]
FLOAT_SUFFIXES = ("num_ratio", "date_ratio", "buy_freq", "click_freq")
PREFIXES = ["u", "spu", "brand", "cat"]
MODEL_FEATURES = ["u_num_ratio", "spu_num_ratio", "brand_num_ratio", "cat_num_ratio"]


def stat_columns(prefix):
    return ["%s_%s" % (prefix, suffix) for suffix in STAT_SUFFIXES]


def cast_columns(df):
    # Ratio and frequency columns are floats; ids, counts, dates and the label are ints.
    for col in df.columns:
        df[col] = df[col].astype("float" if col.endswith(FLOAT_SUFFIXES) else "int")
    return df


def main():
    # Step 1: load the training set and the test set.
    # Labelled data: per entity, the id followed by its statistics, then the label.
    train_names = []
    for prefix in PREFIXES:
        train_names += [prefix + "_id"] + stat_columns(prefix)
    train_names.append("action_type")
    label_ds = cast_columns(pd.read_csv(r"train_features_0714.txt", sep="\t",
                                        encoding="utf8", names=train_names))
    print("Training set: %d rows, %d columns" % label_ds.shape)

    # Unlabelled data: row id and the four entity ids first, then all statistics.
    # (The original named the second column "uid" but then cast "u_id"; fixed here.)
    # The original carried two unexplained notes: "3.91 million" next to u_buy_num
    # and "2.41 million" next to spu_num_ratio.
    test_names = ["id"] + [prefix + "_id" for prefix in PREFIXES]
    for prefix in PREFIXES:
        test_names += stat_columns(prefix)
    unlabel_ds = cast_columns(pd.read_csv(r"test_features_0714.txt", sep="\t",
                                          encoding="utf8", names=test_names))
    print("Test set: %d rows, %d columns" % unlabel_ds.shape)

    # Model training: keep every positive sample, downsample the negatives to 1%.
    ds_0 = label_ds[label_ds["action_type"] == 0]  # samples labelled 0
    ds_0_train = ds_0.sample(frac=0.01)            # draw 1% of them for training
    ds_1 = label_ds[label_ds["action_type"] == 1]  # samples labelled 1
    ds_train = pd.concat([ds_1, ds_0_train])
    # preprocessing.scale standardizes to zero mean and unit variance; note that
    # each matrix below is scaled with its own statistics.
    label_X_scale = preprocessing.scale(ds_train[MODEL_FEATURES])
    label_y = ds_train["action_type"]              # class labels
    model = LogisticRegression()                   # or ensemble.GradientBoostingClassifier()
    model.fit(label_X_scale, label_y)

    # Step 5: model validation and selection.
    # The 20% validation sample is drawn from the training rows themselves,
    # so this F1 is an optimistic estimate.
    test_df = ds_train.sample(frac=0.2)
    test_X_scale = preprocessing.scale(test_df[MODEL_FEATURES])
    test_y = test_df["action_type"]
    predicted = model.predict(test_X_scale)
    f1_score = metrics.f1_score(test_y, predicted)  # model evaluation
    print(f1_score)

    # Step 6: model prediction.
    unlabel_X_scale = preprocessing.scale(unlabel_ds[MODEL_FEATURES])
    # Return probabilities; positives are selected later with a probability threshold.
    unlabel_y = model.predict_proba(unlabel_X_scale)[:, 1]
    out_y = pd.DataFrame(unlabel_y, columns=["prob"])
    # Compare numerically before formatting (the original compared strings).
    out_1 = out_y[out_y["prob"] > 0.5]
    print(out_1.shape)                              # how many exceed 0.5
    out_y["prob"] = out_y["prob"].apply(lambda x: "{0:.3f}".format(x))
    print(out_y["prob"].value_counts())             # distribution of values
    out_y.to_csv("fangjs/outvip.txt", index=False, header=False)  # write predictions


# Run
if __name__ == "__main__":
    start = time.perf_counter()
    main()
    end = time.perf_counter()
    print("finish all in %s" % str(end - start))
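The listing above standardizes the training, validation and test matrices separately with `preprocessing.scale`, and it validates on rows drawn from the training sample itself. One possible alternative, sketched below on synthetic data, fits a `StandardScaler` once inside a `Pipeline` and scores F1 on a held-out split. The synthetic features, the way the labels are generated and the 0.5 threshold are illustrative assumptions; this is not the author's configuration and it says nothing about the score achievable on the competition data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the four *_num_ratio features (assumption: not real data).
n = 2000
X = rng.random((n, 4))
# Make a purchase more likely when the ratios are high, so there is signal to learn.
logits = 4.0 * X.sum(axis=1) - 9.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Hold out 20% of the rows; the model never sees them during fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The scaler's mean and standard deviation are learned from the training split
# only and then reused unchanged for validation and prediction.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class
pred = (proba > 0.5).astype(int)          # the threshold is a tunable assumption
score = f1_score(y_val, pred)
print("validation F1: %.3f" % score)
```

With this layout the same fitted `model` can be applied to the unlabelled test features via `model.predict_proba`, so all three data sets share one scaling.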