Recommendation System Algorithms Summary (3): FM, DNN, and DeepFM
Source: https://blog.csdn.net/qq_23269761/article/details/81366939. If anything here is inappropriate, please feel free to get in touch. Thanks!
0. A Blog I Highly Recommend
The past and present of FM:
https://tracholar.github.io/machine-learning/2017/03/10/factorization-machine.html#%E7%BB%BC%E8%BF%B0
1. How FM Relates to DNNs and Embeddings
Let's first review FM. *(figures omitted)*
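For reference, since the original figures are unavailable, this is the standard second-order FM model (Rendle, 2010):

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j$$

where each feature $x_i$ has a $k$-dimensional latent vector $\mathbf{v}_i$, and $\langle\cdot,\cdot\rangle$ denotes the inner product.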
After the FM model has been fit, every feature x_i has a corresponding latent vector v_i. So what exactly is this v_i?
Think of Google's word2vec. word2vec is one kind of word embedding method. Word embedding means that, given a document, i.e. a sequence of words such as "A B A C B F G", we want a vector (usually low-dimensional) for each distinct word in the document. For the sequence "A B A C B F G", we might end up with A mapped to [0.1 0.6 -0.5] and B mapped to [-0.2 0.9 0.7].
So the conclusion is:
FM is a tool for both feature interaction and dimensionality reduction: it takes the sparse features produced by one-hot encoding, crosses them pairwise, and reduces their dimensionality at the same time. Down to how many dimensions? Exactly k, the number of latent factors in FM.
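To make the "embedding" view concrete, here is a minimal sketch (toy sizes, random data, all names made up) showing that multiplying a one-hot vector by the latent-vector matrix is nothing but a row lookup, so the k-dimensional v_i stands in for the n-dimensional one-hot code:

```python
import numpy as np

n, k = 1000, 8                    # n one-hot features, k latent factors
V = np.random.randn(n, k)         # FM latent vectors, one row per feature

one_hot = np.zeros(n)
one_hot[42] = 1.0                 # this sample has feature 42 active

dense = one_hot @ V               # (n,) @ (n, k) -> (k,)
assert np.allclose(dense, V[42])  # identical to a simple row lookup
```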
2. FNN
FNN uses FM as pre-training to obtain the embeddings, then trains a DNN on top. *(figure omitted)*
A model like this captures high-order feature interactions, but at the final sigmoid output it ignores the low-order features themselves.
3. DeepFM
Given the above, many recent deep-learning CTR models consider both the wide and the deep side (i.e., low-order and high-order features) at the same time to further improve generalization. DeepFM is one of them.
Reference blog: https://blog.csdn.net/zynash2/article/details/79348540
*(figure omitted: DeepFM architecture)*
The model as a whole has two parts: FM and DNN. Briefly, the flow is as follows. Following FNN's idea, FM supplies the embedding, and the wide and deep parts then share the embedded result. The DNN input is exactly the same as in FNN (except that there is no pre-training here: the embedding layer is simply treated as one layer of the NN), while on the wide side a particular wiring lets the model reproduce FM exactly (the paper does not derive this in detail; the derivation is given later in this post). Finally, the DNN and FM outputs are combined and passed through the output activation.
What deserves particular emphasis is the FM part of the model: how exactly is the network wired to compute the second-order features?
**Key point:** to the DNN, the embedding layer is extracting features; to FM, it simply *is* its second-order features! FM and the DNN merely share the embedding layer.
4. DeepFM Code Walkthrough
Code:
https://github.com/ChenglongChen/tensorflow-DeepFM
Data download:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
4.0 Project Layout
- data: the training and test data
- output/fig: output results and training curves
- config: parameter settings for data loading and feature engineering
- DataReader: feature engineering; builds the feature set actually used for training
- main: the program entry point
- metrics: defines the Gini coefficient (gini_norm) used as the evaluation metric
- DeepFM: the model definition
4.1 Overall Flow
Recommended reading: an EDA of this dataset that gives a good overview of the data:
https://blog.csdn.net/qq_37195507/article/details/78553581
- 1. `_load_data()`
```python
import numpy as np
import pandas as pd

import config  # project-level settings: file paths, column lists, etc.


def _load_data():
    dfTrain = pd.read_csv(config.TRAIN_FILE)
    dfTest = pd.read_csv(config.TEST_FILE)

    def preprocess(df):
        cols = [c for c in df.columns if c not in ["id", "target"]]
        # count of features carrying the missing-value marker (-1) in each row
        df["missing_feat"] = np.sum((df[cols] == -1).values, axis=1)
        # interaction feature: product of two existing features
        df["ps_car_13_x_ps_reg_03"] = df["ps_car_13"] * df["ps_reg_03"]
        return df

    dfTrain = preprocess(dfTrain)
    dfTest = preprocess(dfTest)

    cols = [c for c in dfTrain.columns if c not in ["id", "target"]]
    cols = [c for c in cols if c not in config.IGNORE_COLS]

    X_train = dfTrain[cols].values
    y_train = dfTrain["target"].values
    X_test = dfTest[cols].values
    ids_test = dfTest["id"].values
    cat_features_indices = [i for i, c in enumerate(cols) if c in config.CATEGORICAL_COLS]

    return dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices
```
First the raw files TRAIN_FILE and TEST_FILE are read.
preprocess(df) adds two features: missing_feat (the number of missing values in the row) and ps_car_13_x_ps_reg_03 (the product of two existing features).
Returns:
- dfTrain, dfTest: DataFrames that still contain every feature
- X_train, X_test: ndarrays with the IGNORE_COLS dropped (note that X_test is never actually used later)
- y_train: the labels
- ids_test: the test-set ids, as an ndarray
- cat_features_indices: the indices of the categorical features
- X_train and y_train are then split with stratified K-fold cross-validation
- The DeepFM hyperparameters are set up (a sketch of both steps is given below)
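For reference, these two steps look roughly like the following in main.py. This is a sketch: the config names NUM_SPLITS / RANDOM_SEED and the specific hyperparameter values are assumptions based on the repo's example, not guaranteed to match it exactly.

```python
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

import config
from metrics import gini_norm

# stratified K-fold: keeps the positive/negative label ratio balanced in each fold
folds = list(StratifiedKFold(n_splits=config.NUM_SPLITS, shuffle=True,
                             random_state=config.RANDOM_SEED).split(X_train, y_train))

# DeepFM hyperparameters; feature_size and field_size are filled in later
# by _run_base_model_dfm once the FeatureDictionary has been built
dfm_params = {
    "use_fm": True, "use_deep": True,      # use both parts -> DeepFM
    "embedding_size": 8,                   # k, the latent-factor dimension
    "dropout_fm": [1.0, 1.0],
    "deep_layers": [32, 32],
    "dropout_deep": [0.5, 0.5, 0.5],
    "deep_layers_activation": tf.nn.relu,
    "epoch": 30, "batch_size": 1024,
    "learning_rate": 0.001, "optimizer_type": "adam",
    "batch_norm": 1, "batch_norm_decay": 0.995,
    "l2_reg": 0.01, "verbose": True,
    "eval_metric": gini_norm,
    "random_seed": config.RANDOM_SEED,
}
```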
- 2. `_run_base_model_dfm()`
```python
import numpy as np

import config
from DataReader import FeatureDictionary, DataParser
from DeepFM import DeepFM
from metrics import gini_norm

# _make_submission and _plot_fig are helper functions defined elsewhere in main.py


def _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params):
    # build the feature-index dictionary and parse both sets into (Xi, Xv) form
    fd = FeatureDictionary(dfTrain=dfTrain, dfTest=dfTest,
                           numeric_cols=config.NUMERIC_COLS,
                           ignore_cols=config.IGNORE_COLS)
    data_parser = DataParser(feat_dict=fd)
    Xi_train, Xv_train, y_train = data_parser.parse(df=dfTrain, has_label=True)
    Xi_test, Xv_test, ids_test = data_parser.parse(df=dfTest)

    dfm_params["feature_size"] = fd.feat_dim     # n: total number of feature indices
    dfm_params["field_size"] = len(Xi_train[0])  # F: number of fields per sample

    y_train_meta = np.zeros((dfTrain.shape[0], 1), dtype=float)  # out-of-fold predictions
    y_test_meta = np.zeros((dfTest.shape[0], 1), dtype=float)    # averaged test predictions
    _get = lambda x, l: [x[i] for i in l]
    gini_results_cv = np.zeros(len(folds), dtype=float)
    gini_results_epoch_train = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
    gini_results_epoch_valid = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)

    for i, (train_idx, valid_idx) in enumerate(folds):
        Xi_train_, Xv_train_, y_train_ = _get(Xi_train, train_idx), _get(Xv_train, train_idx), _get(y_train, train_idx)
        Xi_valid_, Xv_valid_, y_valid_ = _get(Xi_train, valid_idx), _get(Xv_train, valid_idx), _get(y_train, valid_idx)

        dfm = DeepFM(**dfm_params)
        dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)

        y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
        y_test_meta[:, 0] += dfm.predict(Xi_test, Xv_test)

        gini_results_cv[i] = gini_norm(y_valid_, y_train_meta[valid_idx])
        gini_results_epoch_train[i] = dfm.train_result
        gini_results_epoch_valid[i] = dfm.valid_result

    y_test_meta /= float(len(folds))  # average the per-fold test predictions

    # save result
    if dfm_params["use_fm"] and dfm_params["use_deep"]:
        clf_str = "DeepFM"
    elif dfm_params["use_fm"]:
        clf_str = "FM"
    elif dfm_params["use_deep"]:
        clf_str = "DNN"
    print("%s: %.5f (%.5f)" % (clf_str, gini_results_cv.mean(), gini_results_cv.std()))
    filename = "%s_Mean%.5f_Std%.5f.csv" % (clf_str, gini_results_cv.mean(), gini_results_cv.std())
    _make_submission(ids_test, y_test_meta, filename)

    _plot_fig(gini_results_epoch_train, gini_results_epoch_valid, clf_str)

    return y_train_meta, y_test_meta
```
The data first goes through FeatureDictionary in DataReader. This object has a self.feat_dict attribute that looks like the following:
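Since the original screenshot is not available, here is an illustrative sketch of its shape (the column names are from the dataset, but the indices shown are made up): numeric columns map to a single global feature index, while categorical columns map each distinct value to its own index.

```python
# Illustrative only; the real indices depend on the data:
feat_dict = {
    "ps_reg_03": 0,                         # numeric column -> one index
    "ps_car_01_cat": {10: 1, 11: 2, 7: 3},  # categorical column -> one index per value
}
```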
Next, DataParser in DataReader:
```python
import pandas as pd


class DataParser(object):
    def __init__(self, feat_dict):
        self.feat_dict = feat_dict  # a FeatureDictionary instance

    def parse(self, infile=None, df=None, has_label=False):
        assert not ((infile is None) and (df is None)), "infile or df at least one is set"
        assert not ((infile is not None) and (df is not None)), "only one can be set"
        if infile is None:
            dfi = df.copy()
        else:
            dfi = pd.read_csv(infile)
        if has_label:
            y = dfi["target"].values.tolist()
            dfi.drop(["id", "target"], axis=1, inplace=True)
        else:
            ids = dfi["id"].values.tolist()
            dfi.drop(["id"], axis=1, inplace=True)

        # dfi for feature index
        # dfv for feature value which can be either binary (1/0) or float (e.g., 10.24)
        dfv = dfi.copy()
        for col in dfi.columns:
            if col in self.feat_dict.ignore_cols:
                dfi.drop(col, axis=1, inplace=True)
                dfv.drop(col, axis=1, inplace=True)
                continue
            if col in self.feat_dict.numeric_cols:
                # numeric column: one global index; dfv keeps the raw value
                dfi[col] = self.feat_dict.feat_dict[col]
            else:
                # categorical column: map each value to its own index; value becomes 1
                dfi[col] = dfi[col].map(self.feat_dict.feat_dict[col])
                dfv[col] = 1.

        # dfi.to_csv('dfi.csv')
        # dfv.to_csv('dfv.csv')

        # list of lists of feature indices, one inner list per sample
        Xi = dfi.values.tolist()
        # list of lists of feature values, one inner list per sample
        Xv = dfv.values.tolist()
        if has_label:
            return Xi, Xv, y
        else:
            return Xi, Xv, ids
```
Xi and Xv here are both 2-D lists. You can dump dfi and dfv to CSV files to see what they look like; the layout seems odd at first sight (presumably because this is what the model needs later):
- dfi: each value is a feature index, i.e. the values stored in the feat_dict attribute above
- dfv: numeric features keep their original value; categorical features get the value 1
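Continuing the illustrative feat_dict sketched above, a single row with ps_reg_03 = 0.61 and ps_car_01_cat = 11 would parse into something like this (all numbers made up):

```python
# dfi keeps feature *indices*, dfv keeps feature *values*:
Xi = [[0, 2]]       # index 0 for ps_reg_03, index 2 for ps_car_01_cat == 11
Xv = [[0.61, 1.0]]  # numeric keeps its value; categorical becomes 1.0
```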
4.2 Model Architecture
```python
# (a method of the DeepFM class; TF1-style graph-construction code)
def _init_graph(self):
    self.graph = tf.Graph()
    with self.graph.as_default():
        tf.set_random_seed(self.random_seed)

        self.feat_index = tf.placeholder(tf.int32, shape=[None, None],
                                         name="feat_index")  # None * F
        self.feat_value = tf.placeholder(tf.float32, shape=[None, None],
                                         name="feat_value")  # None * F
        self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label")  # None * 1
        self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm")
        self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep")
        self.train_phase = tf.placeholder(tf.bool, name="train_phase")

        self.weights = self._initialize_weights()

        # model
        self.embeddings = tf.nn.embedding_lookup(self.weights["feature_embeddings"],
                                                 self.feat_index)  # None * F * K
        # print(self.weights["feature_embeddings"])  shape=[259, 8]: the n*k latent vectors
        # print(self.embeddings)  shape=[?, 39, 8]: F*K, one latent vector looked up per field
        # (this is not FFM's per-field vectors; looking up per field simply selects the
        #  non-zero entries and saves computation)
        feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
        # print(feat_value)  shape=[?, 39, 1]: the 39 feature values of one sample
        self.embeddings = tf.multiply(self.embeddings, feat_value)
        # multiply broadcasts: when one dimension differs, the smaller one is expanded
        # print(self.embeddings)  shape=[?, 39, 8]
        # After this multiply the tensor holds v_i * x_i, which makes the later computation
        # <v_i, v_j> x_i x_j = <v_i x_i, v_j x_j> convenient: the FM second-order term is
        # simplified into the (sum-square - square-sum) form below, and the v_i x_i layout
        # is exactly what that form needs.

        # ---------- first order term ----------
        self.y_first_order = tf.nn.embedding_lookup(self.weights["feature_bias"], self.feat_index)  # None * F * 1
        self.y_first_order = tf.reduce_sum(tf.multiply(self.y_first_order, feat_value), 2)  # None * F
        self.y_first_order = tf.nn.dropout(self.y_first_order, self.dropout_keep_fm[0])  # None * F

        # ---------- second order term ----------
        # sum_square part
        self.summed_features_emb = tf.reduce_sum(self.embeddings, 1)  # None * K
        self.summed_features_emb_square = tf.square(self.summed_features_emb)  # None * K
        # square_sum part
        self.squared_features_emb = tf.square(self.embeddings)
        self.squared_sum_features_emb = tf.reduce_sum(self.squared_features_emb, 1)  # None * K
        # second order
        self.y_second_order = 0.5 * tf.subtract(self.summed_features_emb_square,
                                                self.squared_sum_features_emb)  # None * K
        self.y_second_order = tf.nn.dropout(self.y_second_order, self.dropout_keep_fm[1])  # None * K

        # ---------- Deep component ----------
        self.y_deep = tf.reshape(self.embeddings, shape=[-1, self.field_size * self.embedding_size])  # None * (F*K)
        self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0])
        for i in range(0, len(self.deep_layers)):
            self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" % i]),
                                 self.weights["bias_%d" % i])  # None * layer[i]
            if self.batch_norm:
                self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase,
                                                    scope_bn="bn_%d" % i)  # None * layer[i]
            self.y_deep = self.deep_layers_activation(self.y_deep)
            self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1 + i])  # dropout at each Deep layer

        # ---------- DeepFM ----------
        if self.use_fm and self.use_deep:
            concat_input = tf.concat([self.y_first_order, self.y_second_order, self.y_deep], axis=1)
        elif self.use_fm:
            concat_input = tf.concat([self.y_first_order, self.y_second_order], axis=1)
        elif self.use_deep:
            concat_input = self.y_deep
        self.out = tf.add(tf.matmul(concat_input, self.weights["concat_projection"]),
                          self.weights["concat_bias"])
```
At first glance it is unclear why this code makes FM look so complicated, but the complexity has a purpose: it avoids ever materializing the huge matrix produced by one-hot encoding.
In essence, the Deep part and FM share the latent-vector matrix [feature_size * k] in the embedding layer.
So the crux of this implementation is the embedding layer. It works through the two much smaller matrices Xi and Xv [n * field]; note that "field" here is not the F of FFM but the number of features before one-hot encoding.
From the inner-product identity we get

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle \mathbf{v}_i,\mathbf{v}_j\rangle\, x_i x_j \;=\; \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2}-\sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right]$$

which is exactly the sum_square part minus the square_sum part computed (per embedding dimension f) in the code above.
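A quick way to convince yourself of this identity is a toy numpy check (a sketch; F and K follow the shapes printed in the code above, and the data is random):

```python
import numpy as np

F, K = 39, 8                   # fields and embedding size, as in the model
rng = np.random.default_rng(0)
emb = rng.normal(size=(F, K))  # stands in for v_i * x_i of one sample ("embeddings")

# brute-force pairwise term: sum over i<j of <v_i x_i, v_j x_j>
pairwise = sum(emb[i] @ emb[j] for i in range(F) for j in range(i + 1, F))

sum_square = np.square(emb.sum(axis=0))  # (sum_i v_i x_i)^2, per dimension f
square_sum = np.square(emb).sum(axis=0)  # sum_i (v_i x_i)^2, per dimension f
trick = 0.5 * (sum_square - square_sum).sum()

assert np.isclose(pairwise, trick)  # the two computations agree
```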