A Tutorial on the Surprise Library for Recommender Systems
A walkthrough of the official documentation of the Surprise recommendation library.
- Common installation problems
- What Surprise provides
- Examples
  - Using a built-in dataset with cross-validation
  - A single train/test split instead of cross-validation
  - Using your own dataset, without a test set
  - Specifying the training and test sets yourself
- Built-in algorithms and their parameters
  - NormalPredictor
  - Baseline (ALS and SGD)
  - KNNBasic
  - KNNWithMeans
  - KNNWithZScore
  - KNNBaseline
  - SVD
  - SVDpp
  - NMF
  - SlopeOne
  - CoClustering
- Computing Precision, Recall, MAP and NDCG
Common installation problems
`pip install surprise` frequently fails with:
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
Workarounds:
1. The blunt fix: download and install the Visual Studio build tools that the error message points to.
2. Dodge the compiler: find a prebuilt wheel (.whl) for your Python version at https://www.lfd.uci.edu/~gohlke/pythonlibs/ and install it with pip install xx.whl; the wheel files for surprise are listed at https://pypi.org/project/surprise/#files. You may not always be able to dodge it, though.
3. Python 2.7 users can download VCForPython27.msi from https://www.microsoft.com/en-us/download/details.aspx?id=44266, which adds support for compiling packages written in C.
What Surprise provides
Surprise is well suited to beginners who want to get familiar with recommendation algorithms. Its built-in features include:
Examples
This section gives examples of how to use Surprise; readers can adapt the code to implement whatever they need.
Using a built-in dataset with cross-validation
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the built-in ml-100k dataset
data = Dataset.load_builtin('ml-100k')
# Use the SVD algorithm
algo = SVD()
# 5-fold cross-validation: cv sets the number of folds, measures the
# evaluation metrics, and verbose=True prints detailed results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

A single train/test split instead of cross-validation
Written in a style similar to what is common in sklearn.

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')
# As in sklearn, split off 25% of the data as the test set (75% train)
trainset, testset = train_test_split(data, test_size=.25)
algo = SVD()
# Unlike the previous example, fit and test are called explicitly
algo.fit(trainset)
predictions = algo.test(testset)
# Evaluate with RMSE
accuracy.rmse(predictions)

Using your own dataset, without a test set
from surprise import SVD
from surprise import Dataset
from surprise import Reader

# Describe the file format: each line has three columns (user, item,
# rating) separated by spaces; change sep if your file uses commas etc.
reader = Reader(line_format='user item rating', sep=' ')
# The data file to read, test.txt in this example
data = Dataset.load_from_file('test.txt', reader=reader)
# Use the whole dataset as the training set
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)
# There is no test set here, so there is nothing to score with accuracy;
# use algo.predict(uid, iid) to estimate individual user/item ratings.

Specifying the training and test sets yourself
import os

from surprise import SVD
from surprise import accuracy
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import PredefinedKFold

# The data files live under ~/data/
files_dir = os.path.expanduser('~/data/')
# Training sets: u1.base, u2.base
train_file = files_dir + 'u%d.base'
# Test sets: u1.test, u2.test
test_file = files_dir + 'u%d.test'
# Build the list of (train, test) file pairs; range(1, 3) covers u1 and u2,
# giving [(train set 1, test set 1), (train set 2, test set 2)]
folds_files = [(train_file % i, test_file % i) for i in range(1, 3)]

reader = Reader(line_format='user item rating', sep='\t')
data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()
algo = SVD()

# There are two (train, test) pairs, so the loop prints two sets of results
for trainset, testset in pkf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)
    accuracy.mae(predictions, verbose=True)

Built-in algorithms and their parameters
The NormalPredictor algorithm

This is a random-prediction algorithm: it assumes the ratings follow a normal distribution $N(\hat{\mu}, \hat{\sigma}^2)$ and predicts by drawing random numbers from it, with the mean and variance estimated from the training set:

$$\hat{\mu} = \frac{1}{|R_{train}|} \sum_{r_{ui} \in R_{train}} r_{ui}$$

$$\hat{\sigma} = \sqrt{\sum_{r_{ui} \in R_{train}} \frac{(r_{ui} - \hat{\mu})^2}{|R_{train}|}}$$
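As a minimal NumPy sketch (independent of Surprise's own implementation), the estimation and sampling steps look like this:

```python
import numpy as np

def fit_normal_predictor(ratings):
    """Estimate the mean and (biased) standard deviation of the training ratings."""
    ratings = np.asarray(ratings, dtype=float)
    mu = ratings.mean()
    sigma = np.sqrt(((ratings - mu) ** 2).mean())
    return mu, sigma

def predict(mu, sigma, rng=None):
    """Predict a rating by sampling from N(mu, sigma^2)."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(mu, sigma)

mu, sigma = fit_normal_predictor([1, 2, 3, 4, 5])
# mu == 3.0, sigma == sqrt(2)
```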
Example code
from surprise import NormalPredictor

algo = NormalPredictor()

The Baseline algorithm
Koren's baseline algorithm, which does not model the user–item interaction:

$$\hat{r}_{ui} = \mu + b_u + b_i$$

For a user $u$ that does not appear in the training set, $b_u = 0$ (and likewise $b_i$ for unseen items).

Parameter setting: whether training uses alternating least squares (ALS) or stochastic gradient descent (SGD).
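To make the formula concrete, here is a toy sketch that fits the biases with one naive averaging pass over a hand-made dataset — a deliberate simplification, not the ALS or SGD solvers that Surprise actually runs:

```python
import numpy as np

# Toy ratings as (user, item, rating) triples
ratings = [(0, 0, 4.0), (0, 1, 5.0), (1, 0, 3.0), (1, 1, 4.0)]

mu = np.mean([r for _, _, r in ratings])  # global mean
# User bias: mean deviation of the user's ratings from mu
b_u = {u: np.mean([r - mu for uu, _, r in ratings if uu == u]) for u in {0, 1}}
# Item bias: mean remaining deviation after removing mu and b_u
b_i = {i: np.mean([r - mu - b_u[u] for u, ii, r in ratings if ii == i]) for i in {0, 1}}

def predict(u, i):
    # Unknown users/items contribute a zero bias, as described above
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)
```

For example, `predict(2, 0)` for an unseen user falls back to `mu + b_i[0]`.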
ALS
Example code
from surprise import BaselineOnly

bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5}
algo = BaselineOnly(bsl_options=bsl_options)

SGD
Example code
from surprise import BaselineOnly

bsl_options = {'method': 'sgd',
               'learning_rate': .00005}
algo = BaselineOnly(bsl_options=bsl_options)

The KNNBasic algorithm
The most basic KNN algorithm, which comes in user-based and item-based variants.

The user-based formula:

$$\hat{r}_{ui} = \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$$

The item-based formula:

$$\hat{r}_{ui} = \frac{\sum_{j \in N_u^k(i)} \text{sim}(i, j) \cdot r_{uj}}{\sum_{j \in N_u^k(i)} \text{sim}(i, j)}$$

Parameters:
- k: the number of neighbors, default 40
- min_k: the minimum number of neighbors; if fewer suitable neighbors are available, the global mean is used for the prediction instead; default 1
- name in sim_options: the similarity measure, default MSD; can also be set to cosine or pearson_baseline
- user_based in sim_options: default True, i.e. user-based KNN; set it to False to use item-based KNN
- min_support in sim_options: the minimum number of common items (or users) two users (or items) must share for their similarity to be nonzero; default 1
- shrinkage in sim_options: when the similarity is pearson_baseline, this parameter controls the amount of shrinkage; default 100
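The weighted average in the user-based formula can be sketched in a few lines of NumPy (a simplification of what Surprise computes, with the neighbor similarities and ratings assumed to be given):

```python
import numpy as np

def knn_predict(sims, neighbor_ratings, min_k=1, global_mean=3.0):
    """Weighted average of neighbors' ratings, as in the user-based formula.

    sims: similarities sim(u, v) of the nearest neighbors that rated item i
    neighbor_ratings: those neighbors' ratings r_vi
    """
    sims = np.asarray(sims, dtype=float)
    neighbor_ratings = np.asarray(neighbor_ratings, dtype=float)
    if len(sims) < min_k or sims.sum() == 0:
        # Not enough neighbors: fall back to the global mean
        return global_mean
    return float(np.dot(sims, neighbor_ratings) / sims.sum())

# Two neighbors with similarities 0.5 and 1.0 who rated the item 4 and 5:
r = knn_predict([0.5, 1.0], [4.0, 5.0])
# (0.5*4 + 1.0*5) / 1.5 = 4.666...
```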
Example code
from surprise import KNNBasic

sim_options = {'name': 'cosine',
               'user_based': False}  # compute similarities between items
algo = KNNBasic(k=10, sim_options=sim_options)

sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0}  # no shrinkage
algo = KNNBasic(k=10, sim_options=sim_options)

The KNNWithMeans algorithm
Builds on KNNBasic by taking the user (or item) mean rating into account:

$$\hat{r}_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v)}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$$

or

$$\hat{r}_{ui} = \mu_i + \frac{\sum_{j \in N_u^k(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j)}{\sum_{j \in N_u^k(i)} \text{sim}(i, j)}$$

The parameters are the same as for KNNBasic.
Example code
from surprise import KNNWithMeans

sim_options = {'name': 'cosine',
               'user_based': False}  # compute similarities between items
algo = KNNWithMeans(k=10, sim_options=sim_options)

The KNNWithZScore algorithm
Applies the Z-score idea, additionally normalizing by the rating standard deviation:

$$\hat{r}_{ui} = \mu_u + \sigma_u \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - \mu_v) / \sigma_v}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$$

or

$$\hat{r}_{ui} = \mu_i + \sigma_i \frac{\sum_{j \in N_u^k(i)} \text{sim}(i, j) \cdot (r_{uj} - \mu_j) / \sigma_j}{\sum_{j \in N_u^k(i)} \text{sim}(i, j)}$$

The parameters are the same as for KNNBasic.
Example code
from surprise import KNNWithZScore

sim_options = {'name': 'cosine',
               'user_based': False}  # compute similarities between items
algo = KNNWithZScore(k=10, sim_options=sim_options)

The KNNBaseline algorithm
Differs from KNNWithMeans in that it centers on the baseline estimate $b_{ui}$ rather than the mean:

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{v \in N_i^k(u)} \text{sim}(u, v) \cdot (r_{vi} - b_{vi})}{\sum_{v \in N_i^k(u)} \text{sim}(u, v)}$$

or

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in N_u^k(i)} \text{sim}(i, j) \cdot (r_{uj} - b_{uj})}{\sum_{j \in N_u^k(i)} \text{sim}(i, j)}$$

The parameters are the same as for KNNBasic.
Example code
from surprise import KNNBaseline

sim_options = {'name': 'cosine',
               'user_based': False}  # compute similarities between items
algo = KNNBaseline(k=10, sim_options=sim_options)

The SVD algorithm
The classic SVD algorithm:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$$

The loss function is

$$\sum_{r_{ui} \in R_{train}} \left( r_{ui} - \hat{r}_{ui} \right)^2 + \lambda \left( b_i^2 + b_u^2 + \|q_i\|^2 + \|p_u\|^2 \right)$$

minimized by SGD with the update rules (where $e_{ui} = r_{ui} - \hat{r}_{ui}$):

$$b_u \leftarrow b_u + \gamma (e_{ui} - \lambda b_u)$$
$$b_i \leftarrow b_i + \gamma (e_{ui} - \lambda b_i)$$
$$p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda p_u)$$
$$q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda q_i)$$

Parameters:
- n_factors: the number of latent factors, default 100
- n_epochs: the number of SGD iterations, default 20
- biased: default True, i.e. use the baselines $b_u$ and $b_i$; if False, plain unbiased matrix factorization (i.e. PMF) is used
- init_mean: the initial values of the vectors p and q are drawn from a normal distribution whose mean is set by this parameter, default 0
- init_std_dev: the standard deviation of that normal distribution, default 0.1
- lr_all: sets all learning rates at once, default 0.005
- reg_all: sets all regularization coefficients at once, default 0.02
- lr_bu: learning rate for $b_u$, overrides lr_all; unset by default
- lr_bi: learning rate for $b_i$, overrides lr_all; unset by default
- lr_pu: learning rate for $p_u$, overrides lr_all; unset by default
- lr_qi: learning rate for $q_i$, overrides lr_all; unset by default
- reg_bu: regularization coefficient for $b_u$, overrides reg_all; unset by default
- reg_bi: regularization coefficient for $b_i$, overrides reg_all; unset by default
- reg_pu: regularization coefficient for $p_u$, overrides reg_all; unset by default
- reg_qi: regularization coefficient for $q_i$, overrides reg_all; unset by default
- random_state: random seed, unset by default; set it to an integer to get identical results across runs (given the same training and test sets)
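The four update rules above can be sketched as a single SGD epoch over toy data (a simplified stand-in for Surprise's Cython implementation; the dataset and hyperparameters are made up for illustration):

```python
import numpy as np

def sgd_epoch(ratings, mu, bu, bi, P, Q, lr=0.005, reg=0.02):
    """One SGD pass over (u, i, r) triples, applying the four update rules."""
    for u, i, r in ratings:
        pred = mu + bu[u] + bi[i] + Q[i] @ P[u]
        e = r - pred
        bu[u] += lr * (e - reg * bu[u])
        bi[i] += lr * (e - reg * bi[i])
        # Copy p_u so the q_i update uses the pre-update value
        pu_old = P[u].copy()
        P[u] += lr * (e * Q[i] - reg * P[u])
        Q[i] += lr * (e * pu_old - reg * Q[i])

rng = np.random.default_rng(0)
n_users, n_items, k = 3, 4, 2
ratings = [(0, 0, 4.0), (0, 1, 5.0), (1, 0, 3.0), (2, 2, 2.0)]
mu = np.mean([r for _, _, r in ratings])
bu, bi = np.zeros(n_users), np.zeros(n_items)
P = rng.normal(0, 0.1, (n_users, k))
Q = rng.normal(0, 0.1, (n_items, k))

def sse():
    """Sum of squared training errors."""
    return sum((r - (mu + bu[u] + bi[i] + Q[i] @ P[u])) ** 2
               for u, i, r in ratings)

before = sse()
for _ in range(20):
    sgd_epoch(ratings, mu, bu, bi, P, Q)
after = sse()
# The training error should decrease after 20 epochs
```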
Example code
from surprise import SVD

algo = SVD(n_factors=5, n_epochs=20, lr_all=0.007, reg_all=0.002,
           verbose=False, init_mean=0.1, init_std_dev=0)

The SVDpp algorithm
Also due to Koren, the SVD++ algorithm extends SVD with implicit feedback:

$$\hat{r}_{ui} = \mu + b_u + b_i + q_i^T \left( p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j \right)$$

Compared with SVD, there are two additional parameters:
- lr_yj: learning rate for $y_j$, overrides lr_all; unset by default
- reg_yj: regularization coefficient for $y_j$, overrides reg_all; unset by default
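The extra implicit-feedback term is just an average of the $y_j$ factors of the items the user has rated, scaled by $|I_u|^{-1/2}$. A small sketch (names and data invented for illustration):

```python
import numpy as np

def implicit_profile(p_u, Y, rated_items):
    """The effective user vector in SVD++: p_u + |I_u|^(-1/2) * sum of y_j."""
    I_u = list(rated_items)
    if not I_u:
        return p_u
    return p_u + len(I_u) ** -0.5 * Y[I_u].sum(axis=0)

# y_j factors for four items, two latent dimensions
Y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
p_u = np.array([0.5, 0.5])
v = implicit_profile(p_u, Y, [0, 3])  # the user rated items 0 and 3
# |I_u| = 2, sum of y_j = [3, 0], so v = [0.5, 0.5] + [3, 0] / sqrt(2)
```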
Example code
from surprise import SVDpp

algo = SVDpp(n_factors=5, n_epochs=20, lr_all=0.007, reg_all=0.002,
             verbose=False, init_mean=0.1, init_std_dev=0)

The NMF algorithm
Non-negative matrix factorization: both the p and q matrices are constrained to be non-negative.

$$\hat{r}_{ui} = q_i^T p_u$$

Compared with SVD, there are two additional parameters:
- init_low: lower bound for the random initial factor values, default 0
- init_high: upper bound for the random initial factor values, default 1
Example code
from surprise import NMF

# Note: unlike SVD, NMF does not accept lr_all/reg_all/init_mean/init_std_dev;
# its initial factor values are drawn uniformly from [init_low, init_high]
algo = NMF(n_factors=5, n_epochs=20, init_low=0, init_high=1, verbose=False)

The SlopeOne algorithm
$$\hat{r}_{ui} = \mu_u + \frac{1}{|R_i(u)|} \sum_{j \in R_i(u)} \text{dev}(i, j)$$

$$\text{dev}(i, j) = \frac{1}{|U_{ij}|} \sum_{u \in U_{ij}} (r_{ui} - r_{uj})$$

where $R_i(u)$ is the set of items rated by $u$ that have at least one user in common with $i$, and $U_{ij}$ is the set of users who rated both $i$ and $j$.
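A self-contained sketch of these two formulas on a toy dataset (Surprise's implementation is vectorized; this just spells out the math):

```python
# Toy ratings: user -> {item: rating}
ratings = {
    'alice': {'A': 4.0, 'B': 3.0},
    'bob':   {'A': 5.0, 'B': 3.0, 'C': 4.0},
}

def dev_defined(i, j):
    """True if at least one user rated both i and j."""
    return any(i in ratings[u] and j in ratings[u] for u in ratings)

def dev(i, j):
    """Average deviation dev(i, j) over users who rated both items."""
    common = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    return sum(ratings[u][i] - ratings[u][j] for u in common) / len(common)

def predict(user, item):
    """mu_u plus the mean deviation of `item` from the user's rated items."""
    mu_u = sum(ratings[user].values()) / len(ratings[user])
    rated = [j for j in ratings[user] if j != item and dev_defined(item, j)]
    if not rated:
        return mu_u
    return mu_u + sum(dev(item, j) for j in rated) / len(rated)

# Predict alice's rating for C: only bob rated C together with A and B.
# dev(C, A) = 4 - 5 = -1; dev(C, B) = 4 - 3 = 1; mu_alice = 3.5
p = predict('alice', 'C')
# p = 3.5 + (-1 + 1) / 2 = 3.5
```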
Example code
from surprise import SlopeOne

algo = SlopeOne()

The CoClustering algorithm
$$\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i})$$

where $\overline{C_{ui}}$ is the average rating of the co-cluster containing $u$ and $i$, and $\overline{C_u}$ and $\overline{C_i}$ are the average ratings of $u$'s user cluster and $i$'s item cluster.
Computing Precision, Recall, MAP and NDCG
#!/usr/bin/python
# -*- coding: utf-8 -*-
import math

import numpy as np
import pandas as pd
from surprise import KNNBasic
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import KFold

num_item = 80
reader = Reader(line_format='user item rating', sep=',')
data = Dataset.load_from_file('rating2.txt', reader=reader)
kf = KFold(n_splits=5)
sim_options = {'name': 'cosine', 'user_based': False}
algo = KNNBasic(sim_options=sim_options, verbose=False)

precision = 0.0
recall = 0.0
map_score = 0.0  # MAP; named map_score to avoid shadowing the built-in map
ndcg = 0.0
topk = 3

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    # fenmu: the number of distinct users in this test fold (the denominator)
    fenmu = pd.DataFrame(np.array(testset)[:, 0]).drop_duplicates().shape[0]
    real = [[] for _ in range(fenmu)]  # ground-truth test items per user
    sor = [[] for _ in range(fenmu)]   # top-k recommended items per user
    score = 0.0
    dcg = 0.0
    dic = {}  # maps user id -> row index in real/sor
    m = 0
    for i in range(len(testset)):
        if int(testset[i][0]) not in dic:
            # First time we see this user: rank all items for them
            dic[int(testset[i][0])] = m
            m += 1
            ls = []
            real[m - 1].append(int(testset[i][1]))
            for j in range(num_item):
                uid = str(testset[i][0])
                iid = str(j)
                pred = algo.predict(uid, iid)
                ls.append([pred[3], j])  # pred[3] is the estimate (pred.est)
            ls = sorted(ls, key=lambda x: x[0], reverse=True)
            for s in range(topk):
                sor[m - 1].append(int(ls[s][1]))
        else:
            real[dic[int(testset[i][0])]].append(int(testset[i][1]))
    for i in range(fenmu):
        idcg = 0.0
        ap_score = 0.0
        ap = 0.0
        cg = 0.0
        for y in range(topk):
            if sor[i][y] in real[i]:
                ap_score += 1
                ap += ap_score / (y + 1)
                cg += 1 / math.log(y + 2, 2)
        score += ap / min(len(real[i]), topk)
        for z in range(int(ap_score)):
            idcg += 1 / math.log(z + 2, 2)
        if idcg > 0:
            dcg += cg / idcg
        recall += ap_score / (len(real[i]) * fenmu)
        precision += ap_score / (topk * fenmu)
    # Each metric accumulates the per-fold average over the five folds
    map_score += float(score) / fenmu
    ndcg += float(dcg) / fenmu

print('precision ' + str(precision))
print('recall ' + str(recall))
print('map ' + str(map_score))
print('ndcg ' + str(ndcg))