
Python surprise library: a translation of the Surprise documentation

Published: 2023/12/8 · python · 33 · 豆豆


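The formatting here has only been lightly cleaned up; the OneNote notes link is the reference version.

Because OneNote no longer supports sharing a single page, leave your email address if you need the notes and I will send a PDF copy until this is sorted out.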

Installing the Surprise recommender library

pip install scikit-surprise

Basic usage

• Automatic cross-validation

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin('ml-100k')

# We'll use the famous SVD algorithm.
algo = SVD()

# Run 5-fold cross-validation and print results.
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

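load_builtin() downloads the movielens-100k dataset when it is not already present (after asking for confirmation) and stores it in the ~/.surprise_data folder.

• Using a custom dataset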

import os

from surprise import BaselineOnly
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Path to the dataset file.
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# We can now use this dataset as we please, e.g. calling cross_validate.
cross_validate(BaselineOnly(), data, verbose=True)

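Cross-validation

○ cross_validate(algo, data, measures=[...], cv=...) takes the algorithm, the dataset, the list of accuracy measures to compute, and the number of folds.

○ For finer control over each fold, combine a cross-validation iterator such as KFold with the fit() and test() methods of the algorithm. LeaveOneOut and ShuffleSplit can be used in the same way.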

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the movielens-100k dataset.
data = Dataset.load_builtin('ml-100k')

# Define a cross-validation iterator.
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):

    # Train and test the algorithm.
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Compute and print the Root Mean Squared Error.
    accuracy.rmse(predictions, verbose=True)
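To make the idea behind a K-fold iterator concrete, here is a minimal pure-Python sketch of how the ratings can be partitioned so that every rating is tested exactly once. The function name kfold_indices and its seed parameter are made up for illustration; this is not Surprise's own implementation.

```python
import random


def kfold_indices(n_ratings, n_splits=3, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    start = 0
    for fold in range(n_splits):
        # The first (n_ratings % n_splits) folds get one extra rating.
        size = n_ratings // n_splits + (1 if fold < n_ratings % n_splits else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, test_idx


folds = list(kfold_indices(10, n_splits=3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

With 10 ratings and 3 folds, the test folds have sizes 4, 3 and 3, and each training set holds the remaining ratings.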

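Tuning algorithm parameters with GridSearchCV

To compare the effect of different parameter values on an algorithm, use the GridSearchCV class. It evaluates every combination of the values given and reports the best one for each accuracy measure.

For example, to try different values for the parameters of SVD: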

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

# Use movielens-100K
data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

# We can now use the algorithm that yields the best rmse:
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
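A grid search evaluates the Cartesian product of the candidate values. The following sketch, which uses only the standard library and is not Surprise's code, lists the combinations that the param_grid above expands to.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# One dict per combination, in the order produced by itertools.product.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

print(len(combinations))
print(combinations[0])
```

Two values for each of three parameters give 2 × 2 × 2 = 8 combinations, each of which is cross-validated with cv folds, so the cost of the search grows quickly with the size of the grid.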

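Using prediction algorithms

○ Baseline estimates configuration

§ Parameters when using Alternating Least Squares (ALS):

1) reg_i: the regularization parameter for items. Default is 10

2) reg_u: the regularization parameter for users. Default is 15

3) n_epochs: the number of iterations of the ALS procedure. Default is 10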

from surprise import BaselineOnly

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
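To show what reg_u, reg_i and n_epochs control, here is a small pure-Python sketch of the regularized alternating updates for the baseline estimate b_ui = mu + b_u + b_i. It follows the usual ALS update rule for baselines and is only an illustration, not the library's optimized code; the function name als_baselines and the toy ratings are assumptions.

```python
from collections import defaultdict


def als_baselines(ratings, n_epochs=10, reg_u=15, reg_i=10):
    """ratings: list of (user, item, rating). Returns (mu, bu, bi)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = defaultdict(float)
    bi = defaultdict(float)
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    for _ in range(n_epochs):
        # Hold the user biases fixed and solve for the item biases.
        for i, rs in by_item.items():
            bi[i] = sum(r - mu - bu[u] for u, r in rs) / (reg_i + len(rs))
        # Hold the item biases fixed and solve for the user biases.
        for u, rs in by_user.items():
            bu[u] = sum(r - mu - bi[i] for i, r in rs) / (reg_u + len(rs))
    return mu, dict(bu), dict(bi)


ratings = [('alice', 'm1', 5), ('alice', 'm2', 4),
           ('bob', 'm1', 2), ('bob', 'm2', 1)]
mu, bu, bi = als_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5)
print(mu, bu['alice'] > bu['bob'], bi['m1'] > bi['m2'])
```

Larger regularization values shrink the biases towards zero, which matters most for users and items with few ratings.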

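§ Parameters when using Stochastic Gradient Descent (SGD):

1) reg: the regularization parameter of the cost function being optimized. Default is 0.02

2) learning_rate: the learning rate of SGD. Default is 0.005

3) n_epochs: the number of iterations of the SGD procedure. Default is 20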

print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }
algo = BaselineOnly(bsl_options=bsl_options)

§ Passing baseline options when creating a KNN algorithm:

from surprise import KNNBasic

bsl_options = {'method': 'als',
               'n_epochs': 20,
               }
sim_options = {'name': 'pearson_baseline'}
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

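○ Similarity measure configuration

§ name: the name of the similarity to use. Default is MSD

§ user_based: whether similarities are computed between users (True) or between items (False). Default is True

§ min_support: the minimum number of common items (when user_based is True) or common users (when user_based is False); when two users or items share fewer than min_support, their similarity is 0

§ shrinkage: the shrinkage parameter, only relevant for the pearson_baseline similarity. Default is 100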

i. Cosine similarity between items:

sim_options = {'name': 'cosine',
               'user_based': False  # compute similarities between items
               }
algo = KNNBasic(sim_options=sim_options)

ii. Pearson baseline similarity without shrinkage:

sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo = KNNBasic(sim_options=sim_options)
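As an illustration of how the default MSD similarity and the min_support option interact, here is a minimal sketch for two users, assuming the common definition sim = 1 / (msd + 1) over the items both users rated. The function name and the toy ratings are made up, and the library computes the whole similarity matrix at once rather than one pair at a time.

```python
def msd_similarity(ratings_a, ratings_b, min_support=1):
    """ratings_a / ratings_b: dicts mapping item id -> rating for two users."""
    common = set(ratings_a) & set(ratings_b)
    # Too few common items: the similarity is defined as 0.
    if not common or len(common) < min_support:
        return 0.0
    msd = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


alice = {'m1': 5, 'm2': 3, 'm3': 4}
bob = {'m1': 4, 'm2': 3, 'm4': 2}
print(msd_similarity(alice, bob))
print(msd_similarity(alice, bob, min_support=3))
```

Alice and Bob share two items, so with min_support=3 the pair is treated as having no similarity at all.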

• FAQ

○ How to get the top-N recommendations for each user

from collections import defaultdict

from surprise import SVD
from surprise import Dataset


def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user.
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

○ How to compute precision@k and recall@k

from collections import defaultdict

from surprise import Dataset
from surprise import SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    '''Return precision and recall at k metrics for each user.'''

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 1

        # Recall@K: Proportion of relevant items that are recommended
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 1

    return precisions, recalls


data = Dataset.load_builtin('ml-100k')
kf = KFold(n_splits=5)
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))
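The counting inside precision_recall_at_k is easier to follow on a single user. The worked example below repeats the same steps on five made-up (estimated rating, true rating) pairs with k=3 and threshold=3.5; the numbers are invented for illustration.

```python
# (estimated rating, true rating) pairs for one hypothetical user
user_ratings = [(4.8, 5.0), (4.1, 3.0), (3.9, 4.0), (3.2, 4.5), (2.0, 1.0)]
k, threshold = 3, 3.5

# Rank by estimated rating and keep the k best.
user_ratings.sort(key=lambda pair: pair[0], reverse=True)
top_k = user_ratings[:k]

# Relevant items overall, recommended items in the top k, and their overlap.
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
n_rec_k = sum(est >= threshold for est, _ in top_k)
n_rel_and_rec_k = sum(est >= threshold and true_r >= threshold
                      for est, true_r in top_k)

precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 1
recall = n_rel_and_rec_k / n_rel if n_rel else 1
print(precision, recall)
```

Three items are relevant overall, three are recommended in the top 3, and two are both, so precision@3 and recall@3 are each 2/3 for this user.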

○ How to get the k nearest neighbors of a user (or item)

import io  # needed because of weird encoding of u.item file

from surprise import KNNBaseline
from surprise import Dataset
from surprise import get_dataset_dir


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + '/ml-100k/ml-100k/u.item'
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items.
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name.
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story.
toy_story_raw_id = name_to_rid['Toy Story (1995)']
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (algo.trainset.to_raw_iid(inner_id)
                       for inner_id in toy_story_neighbors)
toy_story_neighbors = (rid_to_name[rid]
                       for rid in toy_story_neighbors)

print()
print('The 10 nearest neighbors of Toy Story are:')
for movie in toy_story_neighbors:
    print(movie)

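○ What are raw ids and inner ids?

i. Users and items each have a raw id and an inner id. Raw ids are the ids defined in the ratings file or the pandas DataFrame (raw ids read from a file are strings). The key point is that predict() and the other user-facing methods take raw ids.

ii. When a trainset is created, each raw id is mapped to an inner id, a unique integer that is easier for Surprise to work with. Convert between the two with the trainset methods to_inner_uid(), to_inner_iid(), to_raw_uid() and to_raw_iid().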
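The raw id to inner id mapping can be pictured as a pair of dictionaries. The sketch below assigns inner ids in order of first appearance; the helper name build_id_maps is hypothetical, and the exact numbering Surprise produces depends on how the trainset is built.

```python
def build_id_maps(raw_ids):
    """Assign inner ids 0, 1, 2, ... in order of first appearance."""
    raw2inner = {}
    for raw_id in raw_ids:
        if raw_id not in raw2inner:
            raw2inner[raw_id] = len(raw2inner)
    inner2raw = {inner: raw for raw, inner in raw2inner.items()}
    return raw2inner, inner2raw


# Raw user ids as they might appear, one per rating, in a ratings file.
raw_user_ids = ['196', '186', '22', '196', '244']
raw2inner, inner2raw = build_id_maps(raw_user_ids)
print(raw2inner['22'], inner2raw[0])
```

Note that the raw ids are strings while the inner ids are consecutive integers, which is why passing the integer 196 where the raw id '196' is expected does not find the user.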

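○ Where are the built-in datasets downloaded to, and how can the location be changed?

i. By default, datasets are downloaded to the "~/.surprise_data" folder

ii. To change the location, set the "SURPRISE_DATA_FOLDER" environment variable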
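The lookup can be sketched as follows: use the environment variable when it is set, otherwise fall back to the default folder in the home directory. The helper resolve_data_dir and the example path are assumptions made for illustration; in Surprise itself, get_dataset_dir() returns the folder in use.

```python
import os


def resolve_data_dir(environ=None):
    """Return the dataset folder, honouring SURPRISE_DATA_FOLDER when set."""
    environ = os.environ if environ is None else environ
    default_dir = os.path.join(os.path.expanduser('~'), '.surprise_data')
    return environ.get('SURPRISE_DATA_FOLDER', default_dir)


print(resolve_data_dir({}))
print(resolve_data_dir({'SURPRISE_DATA_FOLDER': '/tmp/my_datasets'}))
```

In practice the variable should be set before the dataset is loaded, for example in the shell that launches the script.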

• API reference

○ prediction_algorithms package

random_pred.NormalPredictor Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.

baseline_only.BaselineOnly Algorithm predicting the baseline estimate for given user and item.

knns.KNNBasic A basic collaborative filtering algorithm.

knns.KNNWithMeans A basic collaborative filtering algorithm, taking into account the mean ratings of each user.

knns.KNNWithZScore A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

knns.KNNBaseline A basic collaborative filtering algorithm taking into account a baseline rating.

matrix_factorization.SVD The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.

matrix_factorization.SVDpp The SVD++ algorithm, an extension of SVD taking into account implicit ratings.

matrix_factorization.NMF A collaborative filtering algorithm based on Non-negative Matrix Factorization.

slope_one.SlopeOne A simple yet accurate collaborative filtering algorithm.

co_clustering.CoClustering A collaborative filtering algorithm based on co-clustering.

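○ The algorithm base class

§ class surprise.prediction_algorithms.algo_base.AlgoBase(**kwargs)

§ If the algorithm needs to compute baseline estimates, the baseline options (passed as bsl_options in the examples above) configure how they are computed

§ Methods:

1) compute_baselines(): compute the user and item baselines. Only useful for algorithms that use the pearson_baseline similarity or for BaselineOnly. Returns a tuple holding the user baselines and the item baselines

2) compute_similarities(): build the similarity matrix. How it is computed depends on the sim_options passed when the algorithm was created. Returns the similarity matrix

3) default_prediction(): the value used when the prediction is impossible, i.e. when an exception was raised during estimation. By default it is the mean of all ratings in the trainset; subclasses can override it. Returns a float

4) fit(trainset): train the algorithm on the given trainset. Every derived class calls this method as the first step of its own training; it initializes some internal structures and sets the self.trainset attribute. Returns self

5) get_neighbors(iid, k): return the k nearest neighbors of the inner id iid, which refers to a user or an item depending on the user_based field of sim_options. Returns a list of k inner ids

6) predict(uid, iid, r_ui=None, clip=True, verbose=False): compute the rating prediction for a given user and item. The method converts the raw ids to inner ids and then calls the estimate method defined in each derived class. If the prediction is impossible, the estimate is set according to default_prediction()

clip decides whether the estimate is clipped into the rating scale. For example, if the estimate is 5.5 and the rating scale is [1, 5], it is set to 5; an estimate below 1 is set to 1. Default is True

verbose decides whether the details of each prediction are printed. Default is False

Returns a Prediction object containing:

a) the raw user id

b) the raw item id

c) the true rating

d) the estimated rating

e) additional details about the prediction that may be useful for later analysis

7) test(testset, verbose=False): test the algorithm on the given testset, i.e. estimate all the ratings in it. Returns a list of Prediction objects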
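The effect of clip=True can be shown in a couple of lines. This is a sketch of the clipping rule described above, with a made-up function name, not the body of predict() itself.

```python
def clip_estimate(est, rating_scale=(1, 5)):
    """Force an estimate into the closed interval given by rating_scale."""
    lower, upper = rating_scale
    return min(upper, max(lower, est))


print(clip_estimate(5.5), clip_estimate(0.3), clip_estimate(3.7))
```

Clipping only changes estimates that fall outside the rating scale; values already inside it are returned unchanged.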

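○ The predictions module

§ The surprise.prediction_algorithms.predictions module defines the Prediction named tuple and the PredictionImpossible exception

§ Prediction

□ A named tuple for storing the result of a prediction

□ It is wrapped in a class only for documentation and printing purposes

□ Parameters:

uid: the raw user id

iid: the raw item id

r_ui: the true rating, a float

est: the estimated rating, a float

details: additional details about the prediction

§ surprise.prediction_algorithms.predictions.PredictionImpossible

□ The exception raised when a prediction is impossible

□ When it is raised, the estimate for the current prediction is set to the default value (the global mean)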

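○ The model_selection package

§ Cross-validation iterators

□ The module contains several cross-validation iterators:

KFold: a basic cross-validation iterator

RepeatedKFold: a repeated KFold cross-validation iterator

ShuffleSplit: a basic cross-validation iterator with random trainsets and testsets

LeaveOneOut: a cross-validation iterator where each user has exactly one rating in the testset

PredefinedKFold: a cross-validation iterator for datasets loaded with the load_from_folds method

□ The module also contains a function that splits a dataset into a trainset and a testset

train_test_split(data, test_size=0.2, train_size=None, random_state=None, shuffle=True)

data: the dataset to split

test_size: if a float, the proportion of ratings to include in the testset; if an int, the absolute number of ratings in the testset; if None, the complement of the trainset size. Default is 0.2

train_size: if a float, the proportion of ratings to include in the trainset; if an int, the absolute number of ratings in the trainset; if None, the complement of the testset size. Default is None

random_state: an int used as the seed of the random number generator. Pass the same seed to get the same trainset and testset across several calls

shuffle: a boolean, whether to shuffle the ratings of the dataset before splitting. Default is True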
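The meaning of test_size, random_state and shuffle can be sketched on a plain list of ratings. The function below is an illustration only: the real train_test_split takes a Dataset and returns a Trainset object and a testset, and its rounding of fractional sizes may differ. The name split_ratings and the toy data are assumptions.

```python
import random


def split_ratings(ratings, test_size=0.2, random_state=None, shuffle=True):
    """Return (train, test) lists of ratings."""
    ratings = list(ratings)
    if shuffle:
        random.Random(random_state).shuffle(ratings)
    # A float is a proportion of the ratings, an int is an absolute count.
    if isinstance(test_size, float):
        n_test = int(round(test_size * len(ratings)))
    else:
        n_test = test_size
    return ratings[n_test:], ratings[:n_test]


ratings = [('u%d' % (n % 5), 'i%d' % n, 1 + n % 5) for n in range(50)]
train, test = split_ratings(ratings, test_size=0.2, random_state=42)
print(len(train), len(test))
```

Calling the function again with the same random_state gives the same split, which is what makes experiments repeatable.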

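§ Cross-validation

surprise.model_selection.validation.cross_validate(algo, data, measures=[u'rmse', u'mae'], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', verbose=False)

• algo: the algorithm to evaluate

• data: the dataset

• measures: a list of strings naming the accuracy measures to compute

• cv: a cross-validation iterator, an int, or None. An iterator is used as given; an int is the number of folds of a KFold iterator; None uses the default KFold with 5 folds

• return_train_measures: whether to also compute the performance measures on the trainsets. Default is False

• n_jobs: an int, the maximum number of folds evaluated in parallel. If -1, all CPUs are used; if 1, nothing runs in parallel (useful for debugging); for values below -1, (number of CPUs + 1 + n_jobs) CPUs are used. Default is 1

• pre_dispatch: an int or a string that controls the number of jobs dispatched during parallel execution. Reducing it helps avoid memory blow-up when more jobs are dispatched than the CPUs can process. It can be:

None: all jobs are created and spawned immediately

an int: the exact number of total jobs that are spawned

a string: an expression as a function of n_jobs, for example "2*n_jobs"

Default is 2*n_jobs

The return value is a dict with these keys:

• test_*, where * is an accuracy measure, for example "test_rmse": the scores on the testset of each split

• train_*, where * is an accuracy measure, for example "train_rmse": the scores on the trainset of each split. Only available when return_train_measures is True

• fit_time: an array with the time, in seconds, spent fitting the algorithm on the trainset of each split

• test_time: an array with the time, in seconds, spent testing the algorithm on the testset of each split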

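§ Parameter search

□ class surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', joblib_verbose=0)

• The parameters mirror those of cross_validate above

• refit: a bool or a string. If True, the algorithm is refit on the whole dataset using the parameter combination that gave the best average performance for the first measure in measures; pass the name of a measure as a string to refit on that measure instead. Default is False

• joblib_verbose: controls the verbosity of joblib; the higher the integer, the more messages

□ Attributes and methods:

a) best_estimator: a dict keyed by accuracy measure, holding the algorithm that gave the best results for that measure, averaged over all splits

b) best_score: a dict of floats keyed by accuracy measure, holding the best average score achieved for that measure

c) best_params: a dict keyed by accuracy measure, holding the parameter combination that gave the best average score for that measure

d) best_index: a dict of ints keyed by accuracy measure, holding the index into cv_results of the combination with the best average accuracy for that measure

e) cv_results: a dict of arrays with the accuracy measures over all splits, and the train and test times, for every parameter combination

f) fit(data): evaluate every parameter combination on the splits given by the cv parameter

g) predict(): call predict() on the estimator with the best parameters found. Only available when refit is not False; see predict() above for the arguments

h) test(): call test() on the estimator with the best parameters found. Only available when refit is not False; see test() above for the arguments

□ class surprise.model_selection.search.RandomizedSearchCV(algo_class, param_distributions, n_iter=10, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2 * n_jobs', random_state=None, joblib_verbose=0)

Samples n_iter parameter combinations at random instead of enumerating all of them exhaustively as GridSearchCV does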
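The difference from a grid search is that only n_iter randomly drawn settings are evaluated. The sketch below draws each parameter independently from a list of candidate values; it is not Surprise's implementation, the function name is hypothetical, and it assumes every entry of param_distributions is a plain list.

```python
import random


def sample_param_settings(param_distributions, n_iter=10, random_state=None):
    """Draw n_iter parameter settings, one random choice per parameter."""
    rng = random.Random(random_state)
    return [{name: rng.choice(values)
             for name, values in param_distributions.items()}
            for _ in range(n_iter)]


param_distributions = {'n_epochs': [5, 10, 20],
                       'lr_all': [0.002, 0.005, 0.01],
                       'reg_all': [0.02, 0.1, 0.4, 0.6]}
settings = sample_param_settings(param_distributions, n_iter=4, random_state=0)
for setting in settings:
    print(setting)
```

Here the full grid would have 3 × 3 × 4 = 36 combinations, while the random search evaluates only 4 of them, trading coverage for a fixed budget.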

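○ The similarities module

§ The similarities module contains tools to compute similarities between users or between items:

1) cosine

2) msd

3) pearson

4) pearson_baseline

○ The accuracy module

§ The surprise.accuracy module provides tools to compute accuracy metrics on a set of predictions:

1) rmse (Root Mean Squared Error)

2) mae (Mean Absolute Error)

3) fcp (Fraction of Concordant Pairs)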
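For reference, RMSE and MAE reduce to a few lines of arithmetic over (true rating, estimated rating) pairs. This sketch uses plain tuples and made-up numbers rather than Surprise's Prediction objects.

```python
from math import sqrt


def rmse(pairs):
    """Root Mean Squared Error over (true rating, estimated rating) pairs."""
    return sqrt(sum((true_r - est) ** 2 for true_r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean Absolute Error over (true rating, estimated rating) pairs."""
    return sum(abs(true_r - est) for true_r, est in pairs) / len(pairs)


pairs = [(4.0, 3.0), (2.0, 2.0), (5.0, 3.0), (1.0, 1.0)]
print(rmse(pairs), mae(pairs))
```

The squared errors are 1, 0, 4 and 0, so the RMSE is the square root of 1.25 while the MAE is 0.75; RMSE penalizes the single large error more heavily.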

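○ The dataset module

§ The dataset module defines the Dataset class and other subclasses used to manage datasets

§ class surprise.dataset.Dataset(reader)

§ Methods:

1) load_builtin(name=u'ml-100k'): load a built-in dataset. Returns a Dataset object

2) load_from_df(df, reader): load a dataset from a pandas DataFrame. df must have three columns, in this order: raw user ids, raw item ids, ratings. reader describes the data; here only its rating_scale matters

3) load_from_file(file_path, reader): load a dataset from a file, given its path and a Reader

4) load_from_folds(folds_files, reader): handles the special case where the trainsets and testsets are already defined in files, as in the movielens-100k dataset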

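○ The Trainset class

§ class surprise.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)

§ Attributes:

1) ur: the user ratings, a dict whose keys are user inner ids and whose values are lists of (item_inner_id, rating) tuples

2) ir: the item ratings, a dict whose keys are item inner ids and whose values are lists of (user_inner_id, rating) tuples

3) n_users: the number of users

4) n_items: the number of items

5) n_ratings: the total number of ratings

6) rating_scale: a tuple with the lowest and highest possible ratings

7) global_mean: the mean of all ratings

§ Methods:

1) all_items(): a generator over all items. Yields the inner id of each item

2) all_ratings(): a generator over all ratings. Yields (uid, iid, rating) tuples

3) all_users(): a generator over all users. Yields the inner id of each user

4) build_anti_testset(fill=None): return a list of ratings that can be used as a testset in the test() method, made of every (user, item) pair that is not in the trainset. fill is the value used for the unknown ratings; if None, global_mean is used

5) knows_item(iid): indicates whether the item is part of the trainset

6) knows_user(uid): indicates whether the user is part of the trainset

7) to_inner_iid(riid): convert an item raw id to an inner id

8) to_inner_uid(ruid): convert a user raw id to an inner id

9) to_raw_iid(iiid): convert an item inner id to a raw id

10) to_raw_uid(iuid): convert a user inner id to a raw id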
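The idea behind build_anti_testset() can be sketched on a list of (user, item, rating) triples: keep every pair of a known user and a known item that has no rating, and fill in a placeholder rating. This is an illustration with raw ids and a made-up standalone function, not the Trainset method itself.

```python
def build_anti_testset(ratings, fill=None):
    """Return the unrated (user, item, fill) triples for known users and items."""
    known = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    if fill is None:
        # Fall back to the global mean of the known ratings.
        fill = sum(r for _, _, r in ratings) / len(ratings)
    return [(u, i, fill) for u in users for i in items if (u, i) not in known]


ratings = [('alice', 'm1', 5), ('alice', 'm2', 3), ('bob', 'm1', 4)]
print(build_anti_testset(ratings))
```

The anti-testset has n_users × n_items − n_ratings entries, so on a real dataset it is much larger than the trainset and predicting over all of it can take a while.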

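○ The Reader class

§ class surprise.reader.Reader(name=None, line_format=u'user item rating', sep=None, rating_scale=(1, 5), skip_lines=0)

The Reader class parses a file containing ratings. Such a file must hold one rating per line, and every line must follow the structure: user ; item ; rating ; [timestamp]. The fields may appear in any order, but the order and the separator must be specified

§ Parameters:

1) name: if specified, a Reader for one of the built-in datasets is returned and the other parameters are ignored. Accepted values are "ml-100k", "ml-1m" and "jester". Default is None

2) line_format: a string listing the field names, separated by spaces. Default is "user item rating"

3) sep: a char, the separator between fields

4) rating_scale: a tuple, the rating scale. Default is (1, 5)

5) skip_lines: an int, the number of lines to skip at the beginning of the file. Default is 0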
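How line_format and sep work together can be illustrated with a small parser. The function parse_line is hypothetical and much simpler than the Reader class; it only shows how the field names decide which column is the user, the item and the rating.

```python
def parse_line(line, line_format='user item rating', sep=None):
    """Split one line and return (user, item, rating, timestamp or None)."""
    fields = line_format.split()
    values = line.strip().split(sep)
    record = dict(zip(fields, values))
    return (record['user'], record['item'], float(record['rating']),
            record.get('timestamp'))


print(parse_line('196\t242\t3\t881250949\n',
                 line_format='user item rating timestamp', sep='\t'))
print(parse_line('242;196;4', line_format='item user rating', sep=';'))
```

In the second call the item comes first in the file, and naming the fields in that order is enough to read it correctly.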

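○ The dump module

§ surprise.dump.dump(file_name, predictions=None, algo=None, verbose=0)

□ A basic wrapper around pickle that serializes a list of predictions and/or an algorithm

□ Parameters:

a) file_name: a str, where to write the dump

b) predictions: a list of Prediction objects, the predictions to dump

c) algo: the algorithm to dump

d) verbose: the level of verbosity, 0 or 1

§ surprise.dump.load(file_name)

□ Reads a dump file

□ Returns a tuple (predictions, algo), where either element may be None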
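Since the module is described as a thin wrapper around pickle, the round trip can be sketched with the standard library alone. The two functions below mimic the documented signatures but are written for illustration; the dictionary layout and the temporary file name are assumptions.

```python
import os
import pickle
import tempfile


def dump(file_name, predictions=None, algo=None):
    """Pickle the predictions and the algorithm into a single file."""
    with open(file_name, 'wb') as f:
        pickle.dump({'predictions': predictions, 'algo': algo}, f,
                    protocol=pickle.HIGHEST_PROTOCOL)


def load(file_name):
    """Return the (predictions, algo) tuple stored by dump()."""
    with open(file_name, 'rb') as f:
        obj = pickle.load(f)
    return obj['predictions'], obj['algo']


path = os.path.join(tempfile.mkdtemp(), 'dump_file')
dump(path, predictions=[('196', '242', 3.0, 3.4, {})])
predictions, algo = load(path)
print(predictions, algo)
```

Only the predictions were dumped here, so the algorithm comes back as None. As with any pickle file, a dump should only be loaded when it comes from a trusted source.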

Source: segmentfault. Author: Wildcard.
