WSDM-iQIYI: User Retention Prediction Challenge (online score 0.865)

Published: 2024/1/8


Competition Overview

http://challenge.ai.iqiyi.com/detail?raceId=61600f6cef1b65639cd5eaa6

https://www.datafountain.cn/competitions/551

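Running Instructions (Very Important)

data.zip is the original competition dataset; wsdm_model_data.zip contains the features extracted from it.

The code has three parts: EDA, feature extraction, and model training. Running them separately is recommended.

The code needs more than 20 GB of memory and a GPU. The model-training part is built with PaddlePaddle.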

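Problem Description

iQIYI is a leading high-quality video entertainment streaming platform in China and worldwide, with more than 500 million users enjoying its services each month. Under the brand slogan "悦享品质" (roughly, "enjoy quality"), iQIYI has built a library of professional licensed video content covering films, TV series, variety shows, and animation, together with a large volume of user-generated content such as "随刻", to offer users a rich, professional viewing experience.

The iQIYI mobile app uses deep learning and other recent AI techniques to personalize the product experience so that users get entertainment tailored to them. The key metric for user satisfaction is the "N-day retention score". For example, if a user's "7-day retention score" on October 1 equals 3, the user visits the iQIYI app on 3 of the following 7 days (October 2 to 8). Predicting the retention score is hard: users differ widely in preferences and activity level, and other factors, such as the leisure time users have available and the popularity trends of hot content, are strongly periodic.

The competition uses desensitized and sampled data from the iQIYI app to predict each user's 7-day retention score. Participating teams design their own algorithms for data analysis and prediction.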
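The definition of the 7-day retention score can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the organizers' labeling code: the function name `retention_score` and the sample launch history are hypothetical, and dates are the desensitized integers used in the launch logs.

```python
from typing import Iterable


def retention_score(launch_dates: Iterable[int], end_date: int, horizon: int = 7) -> int:
    """Count the distinct days in [end_date + 1, end_date + horizon] with at least one launch."""
    active_days = set(launch_dates)  # several launches on one day count once
    window = range(end_date + 1, end_date + horizon + 1)
    return sum(1 for day in window if day in active_days)


# Hypothetical launch history for one user.
launches = [131, 132, 132, 141, 164, 179]
score = retention_score(launches, end_date=130)
print(score)  # days 131 and 132 fall inside the window 131..137
```

Duplicate launches on the same day are collapsed by the set, so the score is always an integer between 0 and 7.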

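Data Description

The competition provides a rich dataset that includes video data, user portrait data, app launch logs, and user playback and interaction logs. For each test-set user, the task is to predict the "7-day retention score" for a given day. The score ranges from 0 to 7, and predictions are rounded to 2 decimal places.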

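Evaluation Metric

This is a numeric prediction problem. The evaluation function is: $100\times\left(1-\frac{1}{n}\sum_{i=1}^{n}\left|\frac{F_i-A_i}{7}\right|\right)$.

$n$ is the number of users in the test set, $F_i$ is the participant's predicted 7-day retention score for user $i$, and $A_i$ is that user's true 7-day retention score.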
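For local validation, the metric can be computed as in the sketch below. It assumes the score is 100 × (1 − mean of |F − A| / 7), so a perfect prediction scores 100; the function name `retention_metric` is hypothetical.

```python
from typing import Sequence


def retention_metric(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Return 100 * (1 - mean(|F_i - A_i| / 7)); assumed form of the competition metric."""
    if len(pred) != len(actual) or len(pred) == 0:
        raise ValueError("pred and actual must be non-empty and of equal length")
    mean_scaled_error = sum(abs(f - a) / 7 for f, a in zip(pred, actual)) / len(pred)
    return 100 * (1 - mean_scaled_error)


print(retention_metric([3.5], [0.0]))
```

Because the metric is a scaled mean absolute error, an L1-style loss may match it more closely than the MSE loss used later in the training code, though the source does not compare the two.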

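Submission Rules

Submissions must be UTF-8 encoded csv files whose format and row order match the test set; see "sample-a" in the dataset download section. All predictions are rounded to 2 decimal places. A file that does not follow the submission format is treated as invalid and still uses up one submission attempt.

The competition has two phases, A and B. Both phases share the same training set but use different test sets.

Phase A closes on 2022.01.17. Its test set contains 15001 users to predict and is used for the Phase A competition and leaderboard. Each user comes with a user id and an end_date. Participants predict that user's 7-day retention score over the following 7 days, [end_date+1 ~ end_date+7].
Phase B runs from 2022.01.17 to 2022.01.20, when the system provides a new, larger test set of 35000 users. Phase B has its own leaderboard; all other details match Phase A.

Final results are based on Phase B scores. Participants must also submit supporting materials to show that their results are legitimate and valid.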
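The formatting rules above can be sketched as follows. This is an illustration only: the helper `format_submission` is hypothetical, and the column order (user id, then prediction, no header row) is assumed from the baseline code later in this post rather than from the sample-a file itself.

```python
import csv
import io


def format_submission(rows):
    """Render (user_id, prediction) pairs as csv text with 2 decimal places and no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for user_id, prediction in rows:
        writer.writerow([user_id, f"{float(prediction):.2f}"])
    return buffer.getvalue()


# Hypothetical predictions, kept in the same order as the test set.
rows = [(10007813, 3.14159), (10052988, 0.0), (10279068, 6.999)]
text = format_submission(rows)
payload = text.encode("utf-8")  # the rules ask for a UTF-8 encoded file
print(text)
```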

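Special Notes

The iQIYI AI competition platform is the official site and the main venue of the challenge. To take part in the main competition, participants must register on the official site and must submit their predictions there.
Each team may have at most 5 members.
The DataFountain platform serves as the practice venue for the 2022 WSDM User Retention Prediction Challenge. During Phase A it gives participants 2 extra scoring submissions per day to help them do well in the main venue.
During Phase A, predictions can be submitted on both DataFountain and the official site. During Phase B, predictions must be submitted on the official site. The final ranking is the one published by the official site.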

Dataset Explanation

1. User portrait data

Field name	Description

user_id
device_type	iOS, Android
device_rom	rom of the device
device_ram	ram of the device
sex
age
education
occupation_status
territory_code

2. App launch logs

Field name	Description

user_id
date	Desensitization, started from 0
launch_type	spontaneous or launched by other apps & deep-links

3. Video related data

Field name	Description

item_id	id of the video
father_id	album id, if the video is an episode of an album collection
cast	a list of actors/actresses
duration	video length
tag_list	a list of tags

4. User playback data

Field name	Description

user_id
item_id
playtime	video playback time
date	timestamp of the behavior

5. User interaction data

Field name	Description

user_id
item_id
interact_type	interaction types such as posting comments, etc.
date	timestamp of the behavior

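Timeline

2021.10.15: Competition launch. The problem is released, the dataset is opened, and team registration begins.
2021.11.15: The public leaderboard opens and participants can submit predictions.
2021.12.20: Registration closes.
2022.01.17: Phase A submissions close; the Phase B test set and leaderboard open.
2022.01.20: Phase B submissions close.
2022.01.21: Deadline for the Phase B top-5 teams to submit their explanation documents (submission method to be announced).
2022.01.25: Final results announced.
2022.02.17: Top-3 team presentations and award ceremony.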

Prizes

Champion: one team ($2000)
Runner-up: one team ($800)
Third place: one team ($500)

!unzip data.zip

Archive: data.zip
  inflating: app_launch_logs.csv
  inflating: sample-a.csv
  inflating: test-a.csv
  inflating: user_interaction_data.csv
  inflating: user_playback_data.csv
  inflating: user_portrait_data.csv
  inflating: video_related_data.csv

import pandas as pd
import numpy as np
from itertools import groupby

%pylab inline
import seaborn as sns

PATH = './'

Populating the interactive namespace from numpy and matplotlib

# Load the datasets
user_interaction = pd.read_csv(PATH + 'user_interaction_data.csv')
user_portrait = pd.read_csv(PATH + 'user_portrait_data.csv')
user_playback = pd.read_csv(PATH + 'user_playback_data.csv')
app_launch = pd.read_csv(PATH + 'app_launch_logs.csv')
video_related = pd.read_csv(PATH + 'video_related_data.csv')

Basic Field Analysis

user_portrait

Field name	Description

user_id
device_type	iOS, Android
device_rom	rom of the device
device_ram	ram of the device
sex
age
education
occupation_status
territory_code

user_portrait.head(2)

	user_id	device_type	device_ram	device_rom	sex	age	education	occupation_status	territory_code
0	10209854	2.0	5731	109581	1.0	2.0	0.0	1.0	865101.0
1	10230057	2.0	1877	20888	1.0	4.0	0.0	1.0	864102.0

print(user_portrait.shape)
for col in user_portrait.columns:
    print(f'{col} \t {user_portrait.dtypes[col]} {user_portrait[col].nunique()}')

(596906, 9)
user_id 	 int64 596905
device_type 	 float64 4
device_ram 	 object 2049
device_rom 	 object 6217
sex 	 float64 2
age 	 float64 5
education 	 float64 3
occupation_status 	 float64 2
territory_code 	 float64 373

user_portrait['user_id'].value_counts()

10268855    2
10280241    1
10444097    1
10442048    1
10267967    1
           ..
10037872    1
10080879    1
10082926    1
10076781    1
10485760    1
Name: user_id, Length: 596905, dtype: int64

user_portrait[user_portrait['user_id'] == 10268855]

	user_id	device_type	device_ram	device_rom	sex	age	education	occupation_status	territory_code
596800	10268855	2.0	NaN	NaN	1.0	3.0	NaN	NaN	NaN
596801	10268855	2.0	NaN	NaN	1.0	3.0	NaN	NaN	NaN

One user record is duplicated, so it is dropped.

user_portrait = user_portrait.drop_duplicates()

device_type

device_type is categorical. Judging by mobile OS market share, 2 is probably Android, 1 iOS, 3 Windows Phone, and 4 unknown or other.

user_portrait['device_type'].value_counts()

2.0    480055
1.0     85322
3.0     28909
4.0      2280
Name: device_type, dtype: int64

ram and rom

On a phone, ROM stores persistent data such as system software, apps, audio, video, and documents. Because media files take up a lot of space, ROM is much larger than RAM.

RAM, also called working memory, holds running programs and is much faster than ROM. More RAM generally lets a phone run more smoothly, for example in large games. The sample rows above (device_ram 5731, device_rom 109581) suggest that both fields are recorded in MB, although the data description does not state the unit.

# Extract device info: keep the first ';'-separated entry
user_portrait['device_ram'] = user_portrait['device_ram'].apply(lambda x: str(x).split(';')[0])
user_portrait['device_rom'] = user_portrait['device_rom'].apply(lambda x: str(x).split(';')[0])

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

sns.distplot(user_portrait['device_ram'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f97602fc650>

[Figure: distribution of device_ram (output_16_1.png)]

sns.distplot(user_portrait['device_rom'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f97602597d0>
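After the split above, device_ram and device_rom are still strings, and missing values become the literal text 'nan'. A numeric conversion could look like the sketch below. The helper `parse_device_value` is hypothetical, and the raw "value;value" layout is inferred only from the split(';') call in the original code.

```python
import math


def parse_device_value(raw) -> float:
    """Return the first ';'-separated entry as a float, or NaN if it cannot be parsed."""
    if raw is None:
        return math.nan
    first = str(raw).split(";")[0].strip()
    try:
        return float(first)  # float("nan") also lands here and yields NaN
    except ValueError:
        return math.nan


print(parse_device_value("5731;5731"))
print(parse_device_value(float("nan")))
```

In pandas the same function can be passed to `Series.apply` in place of the lambda, which leaves a float column that plots and normalizes cleanly.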

sex

user_portrait['sex'].value_counts()

1.0    308846
2.0    281612
Name: sex, dtype: int64

age

sns.distplot(user_portrait['age'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f9760069950>

education

sns.distplot(user_portrait['education'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f0cc1cd5610>

occupation_status

sns.distplot(user_portrait['occupation_status'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f0c793e0d90>

territory_code

Code of the region where the user usually resides.

sns.distplot(user_portrait['territory_code'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f0c791543d0>

app_launch

Field name	Description

user_id
date	Desensitization, started from 0
launch_type	spontaneous or launched by other apps & deep-links

app_launch.head(2)

	user_id	launch_type	date
0	10157996	0	129
1	10139583	0	129

print(app_launch.shape)
for col in app_launch.columns:
    print(f'{col} \t {app_launch.dtypes[col]} {app_launch[col].nunique()}')

(8493878, 3)
user_id 	 int64 600000
launch_type 	 int64 2
date 	 int64 123

app_launch['launch_type'].value_counts()

0    8110781
1     383097
Name: launch_type, dtype: int64

app_launch.groupby('user_id')['launch_type'].mean()

user_id
10000000    0.000000
10000001    0.000000
10000002    0.000000
10000003    0.500000
10000004    0.000000
              ...
10599995    0.166667
10599996    0.000000
10599997    0.000000
10599998    0.000000
10599999    0.000000
Name: launch_type, Length: 600000, dtype: float64

app_launch = app_launch.sort_values(by=['user_id', 'date'])
app_launch.head()

user_id	launch_type	date
10000000	0	131
10000000	0	132
10000000	0	141
10000000	0	164
10000000	0	179

video_related

Field name	Description

item_id	id of the video
father_id	album id, if the video is an episode of an album collection
cast	a list of actors/actresses
duration	video length
tag_list	a list of tags

video_related.head(2)

	item_id	duration	father_id	tag_list	cast
0	24403453.0	6.0	NaN	50365080;50338575;50313222;50165986	NaN
1	22838795.0	7.0	NaN	50001708;50323515;50125414	NaN

sns.distplot(video_related['duration'])

<matplotlib.axes._subplots.AxesSubplot at 0x7f0c4f09ab50>

user_playback

user_playback.head()

	user_id	item_id	playtime	date
0	10057286	20628283.0	2208.612	145
1	10522615	23930557.0	31.054	145
2	10494028	20173699.0	115.952	145
3	10181987	21350426.0	1.585	145
4	10439175	22946929.0	51.726	145

user_interaction

Field name	Description

user_id
item_id
interact_type	interaction types such as posting comments, etc.
date	timestamp of the behavior

user_interaction.head(2)

	user_id	item_id	interact_type	date
0	10243056	22635954	1	213
1	10203565	24723827	3	213

Exploratory Data Analysis

app_launch
- Behavior over the past day, three days, one week, one month, and three months

def count_launch_by_day(day1, day2):
    # Share of users active on day1 who are also active on day2
    u1 = set(app_launch[app_launch['date'].isin(day1)]['user_id'].unique())
    u2 = set(app_launch[app_launch['date'].isin(day2)]['user_id'].unique())
    print(len(u1 & u2) / len(u1))

count_launch_by_day([131], [132])

0.49971594390170543

app_launch['date'].min(), app_launch['date'].max()

(100, 222)

app_launch[app_launch['user_id'] == 10052988]

user_id	launch_type	date
10052988	0	147
10052988	0	149
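The look-back windows listed above (one day, three days, one week, one month, three months) can be turned into per-user activity counts. The sketch below is one possible reading, not the author's feature code: the function name `window_active_days`, the feature names, and the choice to include end_date itself in each window are all assumptions.

```python
def window_active_days(launch_dates, end_date, windows=(1, 3, 7, 30, 90)):
    """Count distinct active days in each look-back window ending at end_date (inclusive)."""
    active = set(launch_dates)
    features = {}
    for width in windows:
        start = end_date - width + 1
        features[f"active_days_{width}d"] = sum(
            1 for day in range(start, end_date + 1) if day in active
        )
    return features


# Launch dates taken from the app_launch.head() output for user 10000000.
feats = window_active_days([131, 132, 141, 164, 179], end_date=179)
print(feats)
```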

test_a = pd.read_csv('test-a.csv')
test_a

	user_id	end_date
0	10007813	205
1	10052988	210
2	10279068	200
3	10546696	216
4	10406659	183
...	...	...
14996	10355586	205
14997	10589773	210
14998	10181954	218
14999	10544736	164
15000	10354569	187

15001 rows × 2 columns

Feature Engineering

# del user_interaction, user_portrait, user_playback, app_launch, video_related
!mkdir wsdm_model_data
!python3 baseline_feature_engineering.py

mkdir: cannot create directory ‘wsdm_model_data’: File exists
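The script baseline_feature_engineering.py is not shown in this post. As a rough sketch of the kind of feature it produces, the code below builds a binary launch sequence for one user. The length of 32 is inferred from the model's reshape((-1, 32, 1)); the function name `build_launch_seq` and the choice of a window that ends at end_date are assumptions.

```python
def build_launch_seq(launch_dates, end_date, seq_len=32):
    """Return a 0/1 list over the seq_len days ending at end_date (1 = app launched that day)."""
    active = set(launch_dates)
    start = end_date - seq_len + 1
    return [1 if day in active else 0 for day in range(start, end_date + 1)]


# Hypothetical user who launched the app on days 131, 132 and 141.
seq = build_launch_seq([131, 132, 141], end_date=141)
print(len(seq), sum(seq), seq[-1])
```

A playtime sequence could follow the same layout with a normalized daily playtime in place of the 0/1 flag, which would match the fractional values visible in the playtime_seq column of the training table.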

Model Building + Training

!unzip data.zip

Archive: data.zip
replace app_launch_logs.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C

import pandas as pd
import numpy as np
import json
import math

data_dir = "./wsdm_model_data/"

# Load the training data and parse the JSON-encoded list columns
data = pd.read_csv(data_dir + "train_data.txt", sep="\t")
data["launch_seq"] = data.launch_seq.apply(lambda x: json.loads(x))
data["playtime_seq"] = data.playtime_seq.apply(lambda x: json.loads(x))
data["duration_prefer"] = data.duration_prefer.apply(lambda x: json.loads(x))
data["interact_prefer"] = data.interact_prefer.apply(lambda x: json.loads(x))

# shuffle data
data = data.sample(frac=1).reset_index(drop=True)

data.columns

Index(['user_id', 'end_date', 'label', 'launch_seq', 'playtime_seq',
       'duration_prefer', 'father_id_score', 'cast_id_score', 'tag_score',
       'device_type', 'device_ram', 'device_rom', 'sex', 'age', 'education',
       'occupation_status', 'territory_score', 'interact_prefer'],
      dtype='object')

data

user_id	end_date	label	launch_seq	playtime_seq	duration_prefer	father_id_score	cast_id_score	tag_score	device_type	device_ram	device_rom	sex	age	education	occupation_status	territory_score	interact_prefer

10309777	165	6	[0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, ...	[0, 0, 0, 0, 0, 0, 0.9414, 0, 0, 0.9998, 0.943...	[0.0, 0.0, 0.0, 0.0, 0.08, 0.0, 0.04, 0.0, 0.0...	1.209317	1.353447	0.178947	0.194954	-0.740852	1.043355	-0.955892	-0.319111	-0.544818	0.746096	0.167180	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...
10117035	123	0	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]	0.000000	0.000000	0.000000	0.194954	-1.195884	-1.173106	-0.955892	-0.319111	-0.544818	-1.340308	0.000000	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10413843	149	0	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]	0.000000	0.000000	0.000000	-2.041925	-0.637283	-0.701308	-0.955892	-0.319111	0.755516	0.746096	-1.106625	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10209341	165	0	[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0.0475, 0, 0, 0, 0, 0, 0...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]	0.000000	0.000000	0.000000	0.194954	0.150032	-0.117076	-0.955892	-0.319111	-0.544818	0.746096	0.940850	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10430657	162	0	[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.0492...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]	0.000000	0.000000	0.000000	0.194954	1.012626	-0.145958	1.046141	0.000000	-0.544818	0.000000	-0.743187	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
10070331	122	1	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]	0.000000	0.000000	0.000000	0.194954	0.191747	1.228884	-0.955892	-0.319111	-0.544818	0.746096	-0.480041	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10056030	115	2	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, ...	-0.299726	0.000000	0.388082	0.194954	-1.195884	-0.834187	1.046141	0.828011	-0.544818	-1.340308	-1.524485	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10235314	137	0	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 0.5, 0.0, ...	-0.866054	0.000000	-0.084836	0.194954	1.020778	1.262729	-0.955892	-0.319111	-0.544818	0.746096	0.838748	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10014483	195	1	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...	0.288450	0.760564	0.511767	0.194954	-0.796952	-0.111235	-0.955892	1.975134	-0.544818	0.746096	-1.638692	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
10446094	157	0	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]	0.000000	0.000000	0.000000	0.194954	0.000000	-0.857147	-0.955892	-0.319111	-0.544818	-1.340308	-0.891480	[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

600001 rows × 18 columns

import paddle
from paddle.io import DataLoader, Dataset

# Dataset wrapper for the model inputs
class CoggleDataset(Dataset):
    def __init__(self, df):
        super(CoggleDataset, self).__init__()
        self.df = df
        # Scalar feature columns; sorted so the column order is the same in every run
        self.feat_col = sorted(set(self.df.columns) - set(['user_id', 'end_date', 'label', 'launch_seq', 'playtime_seq', 'duration_prefer', 'interact_prefer']))
        self.df_feat = self.df[self.feat_col]

    # Fields that take part in training
    def __getitem__(self, index):
        launch_seq = self.df['launch_seq'].iloc[index]
        playtime_seq = self.df['playtime_seq'].iloc[index]
        duration_prefer = self.df['duration_prefer'].iloc[index]
        interact_prefer = self.df['interact_prefer'].iloc[index]
        feat = self.df_feat.iloc[index].values.astype(np.float32)

        launch_seq = paddle.to_tensor(launch_seq).astype(paddle.float32)
        playtime_seq = paddle.to_tensor(playtime_seq).astype(paddle.float32)
        duration_prefer = paddle.to_tensor(duration_prefer).astype(paddle.float32)
        interact_prefer = paddle.to_tensor(interact_prefer).astype(paddle.float32)
        feat = paddle.to_tensor(feat).astype(paddle.float32)
        label = paddle.to_tensor(self.df['label'].iloc[index]).astype(paddle.float32)
        return launch_seq, playtime_seq, duration_prefer, interact_prefer, feat, label

    def __len__(self):
        return len(self.df)

import paddle

# Model definition: GRU + fully connected layers
class CoggleModel(paddle.nn.Layer):
    def __init__(self):
        super(CoggleModel, self).__init__()
        # Sequence modeling
        self.launch_seq_gru = paddle.nn.GRU(1, 32)
        self.playtime_seq_gru = paddle.nn.GRU(1, 32)
        # Fully connected layers
        self.fc1 = paddle.nn.Linear(102, 64)
        self.fc2 = paddle.nn.Linear(64, 1)

    def forward(self, launch_seq, playtime_seq, duration_prefer, interact_prefer, feat):
        launch_seq = launch_seq.reshape((-1, 32, 1))
        playtime_seq = playtime_seq.reshape((-1, 32, 1))
        # Keep the first hidden channel at each of the 32 time steps
        launch_seq_feat = self.launch_seq_gru(launch_seq)[0][:, :, 0]
        playtime_seq_feat = self.playtime_seq_gru(playtime_seq)[0][:, :, 0]
        all_feat = paddle.concat([launch_seq_feat, playtime_seq_feat, duration_prefer, interact_prefer, feat], 1)
        all_feat_fc1 = self.fc1(all_feat)
        all_feat_fc2 = self.fc2(all_feat_fc1)
        return all_feat_fc2
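The input width of 102 for fc1 is not explained in the code. The arithmetic below shows one way to account for it, using the vector lengths visible in the training table (16 entries in duration_prefer, 11 in interact_prefer) and the 11 scalar columns left after the exclusions in CoggleDataset. The dictionary keys are illustrative labels, not names from the source.

```python
SEQ_LEN = 32  # time steps per sequence, from reshape((-1, 32, 1))

# The slice [:, :, 0] on the GRU output keeps one value per time step,
# so each sequence branch contributes SEQ_LEN features (not the hidden size).
feature_widths = {
    "launch_seq_feat": SEQ_LEN,
    "playtime_seq_feat": SEQ_LEN,
    "duration_prefer": 16,
    "interact_prefer": 11,
    "feat": 18 - 7,  # 18 columns minus the 7 excluded in CoggleDataset
}

fc1_in_features = sum(feature_widths.values())
print(fc1_in_features)
```

If the sequence length or any preference vector changes in feature engineering, the first argument of paddle.nn.Linear(102, 64) has to change with it.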

Model Training

from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

# Training function
def train(model, train_loader, optimizer, criterion):
    model.train()
    train_loss = []
    for launch_seq, playtime_seq, duration_prefer, interact_prefer, feat, label in tqdm(train_loader):
        pred = model(launch_seq, playtime_seq, duration_prefer, interact_prefer, feat)
        loss = criterion(pred, label)
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()
        train_loss.append(loss.item())
    return np.mean(train_loss)

# Validation function: evaluation only, so there is no backward pass or optimizer step
# (the optimizer argument is unused and kept only so existing calls still work)
def validate(model, val_loader, optimizer, criterion):
    model.eval()
    val_loss = []
    with paddle.no_grad():
        for launch_seq, playtime_seq, duration_prefer, interact_prefer, feat, label in tqdm(val_loader):
            pred = model(launch_seq, playtime_seq, duration_prefer, interact_prefer, feat)
            loss = criterion(pred, label)
            val_loss.append(loss.item())
    return np.mean(val_loss)

# Prediction function
def predict(model, test_loader):
    model.eval()
    test_pred = []
    for launch_seq, playtime_seq, duration_prefer, interact_prefer, feat, label in tqdm(test_loader):
        pred = model(launch_seq, playtime_seq, duration_prefer, interact_prefer, feat)
        test_pred.append(pred.numpy())
    return test_pred

from sklearn.model_selection import StratifiedKFold

# Multi-fold training
skf = StratifiedKFold(n_splits=7)
fold = 0
for tr_idx, val_idx in skf.split(data, data['label']):
    train_dataset = CoggleDataset(data.iloc[tr_idx])
    val_dataset = CoggleDataset(data.iloc[val_idx])

    # Define the model, loss function, and optimizer
    model = CoggleModel()
    optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=0.001)
    criterion = paddle.nn.MSELoss()

    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)

    # Train for each epoch
    for epoch in range(3):
        train_loss = train(model, train_loader, optimizer, criterion)
        val_loss = validate(model, val_loader, optimizer, criterion)
        print(fold, epoch, train_loss, val_loss)

    paddle.save(model.state_dict(), f"model_{fold}.pdparams")
    fold += 1

W1128 20:18:14.128268 128 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1128 20:18:14.132313 128 device_context.cc:465] device: 0, cuDNN Version: 7.6.
  1%| | 131/16072 [00:05<09:05, 29.24it/s]

Model Prediction

test_data = pd.read_csv(data_dir + "test_data.txt", sep="\t")
test_data["launch_seq"] = test_data.launch_seq.apply(lambda x: json.loads(x))
test_data["playtime_seq"] = test_data.playtime_seq.apply(lambda x: json.loads(x))
test_data["duration_prefer"] = test_data.duration_prefer.apply(lambda x: json.loads(x))
test_data["interact_prefer"] = test_data.interact_prefer.apply(lambda x: json.loads(x))
test_data['label'] = 0

test_dataset = CoggleDataset(test_data)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=4)
test_pred_fold = np.zeros(test_data.shape[0])

# Multi-fold prediction: sum the predictions of the 7 fold models
for idx in range(7):
    model = CoggleModel()
    layer_state_dict = paddle.load(f"model_{idx}.pdparams")
    model.set_state_dict(layer_state_dict)
    model.eval()
    test_pred = predict(model, test_loader)
    test_pred = np.vstack(test_pred)
    test_pred_fold += test_pred[:, 0]

# Average once, after all folds have been added
test_pred_fold /= 7

100%|██████████| 235/235 [00:02<00:00, 98.58it/s]
100%|██████████| 235/235 [00:02<00:00, 79.41it/s]
100%|██████████| 235/235 [00:02<00:00, 78.44it/s]
100%|██████████| 235/235 [00:02<00:00, 78.63it/s]
100%|██████████| 235/235 [00:03<00:00, 77.96it/s]
100%|██████████| 235/235 [00:02<00:00, 78.47it/s]
100%|██████████| 235/235 [00:03<00:00, 77.44it/s]

# Use the fold average, not just the last fold's predictions
test_data["prediction"] = test_pred_fold
test_data = test_data[["user_id", "prediction"]]
# can clip outputs to [0, 7] or use other tricks
test_data.to_csv("./baseline_submission.csv", index=False, header=False, float_format="%.2f")
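The fold averaging and the clipping mentioned in the code comment can be sketched without any framework. This is an illustration with made-up numbers; `average_folds` and `clip_predictions` are hypothetical helper names, and the [0, 7] bounds come from the stated range of the retention score.

```python
def average_folds(fold_preds):
    """Average per-row predictions across folds; fold_preds is a list of equal-length lists."""
    n_folds = len(fold_preds)
    if n_folds == 0 or any(len(p) != len(fold_preds[0]) for p in fold_preds):
        raise ValueError("need at least one fold and equal-length prediction lists")
    return [sum(row) / n_folds for row in zip(*fold_preds)]


def clip_predictions(values, low=0.0, high=7.0):
    """Clamp each prediction into the valid retention-score range."""
    return [min(high, max(low, v)) for v in values]


# Two hypothetical folds predicting three users.
folds = [[1.0, 8.0, -1.0], [3.0, 7.0, 0.0]]
averaged = average_folds(folds)
final = clip_predictions(averaged)
print(averaged, final)
```

Clipping can only reduce the absolute error when the true score lies in [0, 7], which is why it is a safe post-processing step for this metric.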

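Summary

This project builds a time-series model on the existing competition data to predict user retention.

Compared with the original keras code, this project separates feature engineering from model building, which makes iteration easier.

The model includes sequence modeling, for which LSTM, GRU, or similar layers can be used.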

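Ideas for Improvement

An attention mechanism could be added to the user behavior features.

Video behavior and user behavior features could be crossed, following DeepFM.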
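As one possible reading of the attention idea, the sketch below pools a sequence of per-day hidden vectors with dot-product attention instead of keeping a fixed slice of the GRU output. It is not the author's implementation: the function name `attention_pool` is hypothetical, and in a real model the query vector would be a learned parameter.

```python
import math


def attention_pool(hidden_states, query):
    """Softmax-weighted sum of hidden_states (T vectors of size d), scored against query."""
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in hidden_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    pooled = [
        sum(w * h[j] for w, h in zip(weights, hidden_states))
        for j in range(len(query))
    ]
    return pooled, weights


# Two time steps with 2-dimensional hidden vectors; a zero query gives uniform weights.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], query=[0.0, 0.0])
print(pooled, weights)
```

With a learned query, days whose hidden state aligns with it receive more weight, so the model can emphasize recent or unusually active days.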
