Recommender System: Singular Value Decomposition (SVD) and Truncated SVD
The most common approach to recommendation systems is Collaborative Filtering (CF), which relies on past user-item interaction data. Two popular flavors of CF are latent factor models, which extract features from the user and item matrices, and neighborhood models, which find similarities between products or users.
The neighborhood model is an item-oriented approach that discovers user preferences from the ratings a user gives to similar items. Latent factor models such as Singular Value Decomposition (SVD), on the other hand, extract features and correlations from the user-item matrix. For example, when the items are movies in different categories, SVD would generate factors along dimensions such as action vs. comedy, Hollywood vs. Bollywood, or Marvel vs. Disney. Here, we will focus on the latent factor model, namely the Singular Value Decomposition (SVD) approach.
In this article, you will learn about singular value decomposition and truncated SVD for recommender systems:
(1) Introduction to singular value decomposition
(2) Introduction to truncated SVD
(3) Hands-on experience of python code on matrix factorization
Introduction to singular value decomposition
When it comes to dimensionality reduction, Singular Value Decomposition (SVD) is a popular linear algebra technique for matrix factorization in machine learning. It shrinks the space from N dimensions to K dimensions (where K < N), reducing the number of features. SVD builds a matrix with users as rows and items as columns, whose elements are the users' ratings. Singular value decomposition factors this matrix into three other matrices, extracting the latent factors from the high-level (user-item-rating) matrix.
Matrix U: singular matrix of (users × latent factors)
Matrix S: diagonal matrix (shows the strength of each latent factor)
Matrix V: singular matrix of (items × latent factors)
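As a concrete sketch, NumPy's `np.linalg.svd` produces exactly these three factors from a small toy rating matrix (the values below are made up for illustration):

```python
import numpy as np

# Toy user-item rating matrix (4 users x 3 movies); 0 marks a missing rating.
A = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 1.0],
    [1.0, 1.0, 5.0],
    [0.0, 2.0, 4.0],
])

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("U shape:", U.shape)    # users x latent factors
print("s:", s)                # singular values, strongest factor first
print("Vt shape:", Vt.shape)  # latent factors x items

# Keep only the top-2 factors to get a rank-2 approximation of A.
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print("rank-2 reconstruction error:", np.linalg.norm(A - A_approx))
```

Truncating to the strongest factors, as in the last step, is the idea the truncated SVD section below builds on.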
From the matrix factorization, the latent factors capture the characteristics of the items. Finally, the utility matrix A of shape m×n is produced; its final output reduces the dimension through the extraction of latent factors. Matrix A expresses the relationships between users and items by mapping both into an r-dimensional latent space. Each item is represented by a vector X_i and each user by a vector Y_u. The rating given by a user on an item is predicted as R_ui = X^T_i * Y_u, and the loss is minimized as the squared error between this predicted rating and the actual rating.
Regularization is used to avoid overfitting; it generalizes over the dataset by adding a penalty term.
Here, we also add bias terms to reduce the error between the actual values and the values predicted by the model.
(u, i): user-item pair
μ: the average rating of all items
b_i: the average rating of item i minus μ
b_u: the average rating given by user u minus μ
The equation below adds the bias terms and the regularization term:
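The original equation figure is not reproduced in this extract; the standard form of this regularized, bias-aware objective, consistent with the terms defined above, is:

```latex
\min_{b_*,\, x_*,\, y_*} \sum_{(u,i)}
  \left( r_{ui} - \mu - b_u - b_i - x_i^{\top} y_u \right)^2
  + \lambda \left( b_u^2 + b_i^2 + \lVert x_i \rVert^2 + \lVert y_u \rVert^2 \right)
```

The first term is the squared prediction error; the λ-weighted term is the penalty that keeps the biases and factor vectors small.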
Introduction to truncated SVD
When it comes to matrix factorization techniques, truncated Singular Value Decomposition (SVD) is a popular method for producing features; it factors a matrix M into the three matrices U, Σ, and V. Another popular method is Principal Component Analysis (PCA). Truncated SVD is similar to PCA, except that SVD is computed from the data matrix directly while the factorization in PCA is generated from the covariance matrix. Unlike a regular SVD, truncated SVD produces a factorization where the number of columns is cut down to a specified number of components. For example, given an n × n matrix, truncated SVD generates matrices with the specified number of columns, whereas SVD outputs n columns.
The advantages of truncated SVD over PCA
Truncated SVD can deal with a sparse matrix directly when generating the feature matrices, whereas PCA has to operate on the entire (dense) matrix to compute the covariance matrix.
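A minimal sketch with scikit-learn's `TruncatedSVD`, which accepts a SciPy sparse matrix as-is (the matrix here is random, purely for illustration):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# A sparse 100-users x 50-items matrix with ~5% non-zero ratings.
rng = np.random.RandomState(42)
X = sparse_random(100, 50, density=0.05, random_state=rng)

# Truncated SVD keeps only the requested number of components and
# works on the sparse matrix directly (no densifying, no centering).
svd = TruncatedSVD(n_components=10, random_state=42)
features = svd.fit_transform(X)
print(features.shape)         # users x 10 latent components
print(svd.components_.shape)  # 10 components x items
```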
Hands-on experience of python code
Data Description:
The metadata includes 45,000 movies listed in the Full MovieLens Dataset, all released on or before July 2017. Cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages are included. Ratings are on a 1-5 scale and were obtained from the official GroupLens website. The dataset is taken from Kaggle.
Recommending movies using SVD
Singular value decomposition (SVD) is a collaborative filtering method for movie recommendation. The aim of the code implementation is to provide users with movie recommendations derived from the latent features of the item-user matrices. The code shows how to use the SVD latent factor model for matrix factorization.
Data Preprocessing
Randomly sample the rating dataset and generate the movie features with genres. Then, label-encode all the movies and users with their respective unique ids.
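A sketch of this preprocessing step, assuming a ratings frame with `userId`/`movieId`/`rating` columns (the real column names in the Kaggle files may differ):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ratings frame standing in for the MovieLens sample.
ratings = pd.DataFrame({
    "userId":  [10, 10, 42, 42, 7],
    "movieId": [862, 8844, 862, 31357, 8844],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Random sample of rows (all 5 here; the article samples 3000).
sample = ratings.sample(n=5, random_state=1).copy()

# Label-encode raw ids into dense 0..n-1 indices for the factor matrices.
user_enc, movie_enc = LabelEncoder(), LabelEncoder()
sample["user"] = user_enc.fit_transform(sample["userId"])
sample["movie"] = movie_enc.fit_transform(sample["movieId"])

print("num of users:", sample["user"].nunique())
print("num of movies:", sample["movie"].nunique())
```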
num of users: 1105
num of movies: 3000
Model Performance
Through each epoch of training, the RMSE decreases, and the final output reaches an RMSE of 0.57. The batch size controls how much input data is fed into the model on each run. Batch size, learning rate, and the regularization term are all tunable to optimize model performance.
RMSE 2.1727233
RMSE 2.101482
RMSE 2.0310202
RMSE 1.9610059
RMSE 1.8911659
RMSE 1.8213558
RMSE 1.7515925
RMSE 1.681992
RMSE 1.612707
RMSE 1.543902
RMSE 1.4757496
RMSE 1.408429
RMSE 1.3421307
RMSE 1.277059
RMSE 1.2134355
RMSE 1.1514966
RMSE 1.0914934
RMSE 1.0336862
RMSE 0.9783424
RMSE 0.9257237
RMSE 0.87606686
RMSE 0.82956517
RMSE 0.7863303
RMSE 0.7463626
RMSE 0.7095342
RMSE 0.67563176
RMSE 0.6445249
RMSE 0.6163493
RMSE 0.5914116
RMSE 0.5701855
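The original training code is not included in this extract; a minimal NumPy sketch of the kind of SGD matrix-factorization loop that produces a log like the one above (toy data and hypothetical hyperparameters) looks like:

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy (user, movie, rating) triples; real code would use the encoded sample.
n_users, n_movies, n_factors = 20, 30, 5
users = rng.randint(0, n_users, 500)
movies = rng.randint(0, n_movies, 500)
ratings = rng.uniform(1, 5, 500)

# Latent factor matrices, initialized small.
X = rng.normal(0, 0.1, (n_movies, n_factors))  # item factors
Y = rng.normal(0, 0.1, (n_users, n_factors))   # user factors
lr, reg = 0.01, 0.1                            # hypothetical hyperparameters

for epoch in range(30):
    for u, i, r in zip(users, movies, ratings):
        err = r - X[i] @ Y[u]
        xi = X[i].copy()
        # Gradient step with L2 regularization on both factor vectors.
        X[i] += lr * (err * Y[u] - reg * X[i])
        Y[u] += lr * (err * xi - reg * Y[u])
    preds = np.sum(X[movies] * Y[users], axis=1)
    rmse = np.sqrt(np.mean((ratings - preds) ** 2))
    print("RMSE", rmse)
```

Each epoch makes one SGD pass over the ratings and then reports the training RMSE, which shrinks as the factors converge.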
Recommending movies using Truncated SVD
The first 10 components of the user × movie matrix are generated through truncated SVD. The latent features in the reconstructed matrix show a correlation with the user ratings, which is used for rating prediction.
Since the genres column holds lists in dictionary format, the column is preprocessed to extract the genre names, separated by the | character.
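A sketch of that extraction, assuming the genres column is a stringified list of dicts as in the Kaggle `movies_metadata` file:

```python
import ast
import pandas as pd

# One row mimicking the movies_metadata genres column (a stringified
# list of dicts, as in the Kaggle file).
movies = pd.DataFrame({
    "title": ["Toy Story"],
    "genres": ["[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"],
})

def extract_genres(raw):
    # Parse the stringified list and join the genre names with '|'.
    return "|".join(d["name"] for d in ast.literal_eval(raw))

movies["genres"] = movies["genres"].apply(extract_genres)
print(movies["genres"].iloc[0])  # Animation|Comedy
```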
Perform Truncated SVD on user and movie matrix
Take a random sample of 3000 user ratings from the dataset and create a pivot table with UserId as the index and MovieId as the columns, filled with the rating values. This yields a 2921×1739 user-by-movie matrix.
Take a random sample of 3000 movies from the dataset and create a pivot table with MovieId as the index and UserId as the columns, filled with the rating values. This yields a 3000×1105 movie-by-user matrix.
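Both pivot tables can be sketched with pandas `pivot_table` (toy ids here; the real frame is the 3000-row sample):

```python
import pandas as pd

# Small stand-in for the sampled ratings (real ids and shapes differ).
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 1.0],
})

# User matrix: rows = users, columns = movies, values = ratings.
user_matrix = ratings.pivot_table(index="userId", columns="movieId",
                                  values="rating", fill_value=0)
# Movie matrix: the transposed layout, rows = movies, columns = users.
movie_matrix = ratings.pivot_table(index="movieId", columns="userId",
                                   values="rating", fill_value=0)
print(user_matrix.shape)
print(movie_matrix.shape)
```

Missing user-movie pairs are filled with 0 so that the matrices can be fed to truncated SVD.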
From both the user and movie matrices, 80% of the data is used for training and the remaining 20% for testing. For the training data, the reconstructed matrix is produced from 10 components of truncated SVD. The resulting feature matrices have shapes movie_features.shape = (2400, 10) and user_features.shape = (2336, 10).
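A sketch of the split and reduction, with a random stand-in for the movie matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
movie_matrix = rng.uniform(0, 5, (100, 40))  # stand-in movie x user matrix

# 80/20 split over the rows, as described above.
train, test = train_test_split(movie_matrix, test_size=0.2, random_state=0)

# Reduce the training rows to 10 latent components.
svd = TruncatedSVD(n_components=10, random_state=0)
movie_features = svd.fit_transform(train)
print(movie_features.shape)  # (rows in train, 10)
```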
TSNE Visualization
TSNE transforms a high-dimensional space of data into a low-dimensional one and visualizes it. Perplexity is one of its tunable parameters; it balances local and global structure in the data and roughly corresponds to the number of close neighbors each point has.
Using a perplexity of 5 and 2 components on the movie features, the resulting plot shows clusters of movies. Correlated movies are clustered together by the latent features produced by the TSNE method.
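A sketch of the projection with scikit-learn's `TSNE` (random features stand in for the SVD output; the scatter plot itself is omitted):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
movie_features = rng.normal(size=(50, 10))  # stand-in for the SVD features

# Project the 10-d latent features down to 2-d; perplexity roughly sets
# the number of effective neighbors per point (the article uses 5).
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(movie_features)
print(coords.shape)  # one (x, y) pair per movie
```

The `coords` array is what would be passed to a scatter plot to show the movie clusters.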
TSNE plot of correlated movies

Prepare the train and target data
The label of the target data is the average users' rating, rounded to 1 decimal place. There are a total of 501 movies and 1108 users' ratings. The shapes of the train and target data are data.shape = (3000, 1105) and targets.shape = (3000,).
Training a Gradient Boosted Regressor on latent features
Train a GradientBoostingRegressor model with a learning rate of 0.1 and 200 estimators. The loss function is mean squared error.
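A minimal sketch with scikit-learn's `GradientBoostingRegressor` using those hyperparameters (random stand-in features and targets, not the article's data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 10))           # stand-in latent features
y = np.round(rng.uniform(1, 5, 300), 1)  # average ratings as targets

# Hyperparameters from the article: learning rate 0.1, 200 estimators,
# squared-error loss (the scikit-learn default).
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                random_state=0)
gbr.fit(X, y)
mse = mean_squared_error(y, gbr.predict(X))
print("Final MSE:", mse)
```

Setting `verbose=1` on the regressor prints a per-iteration training-loss table like the one below.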
The final prediction is the average rating of each movie over all the ratings produced by the users. The final MSE is around 0.51, which is quite good for an average-rating model.
Iter Train Loss Remaining Time
1 0.3735 5.43s
2 0.3710 5.12s
3 0.3689 4.89s
4 0.3672 4.76s
5 0.3656 4.67s
6 0.3641 4.64s
7 0.3628 4.59s
8 0.3614 4.54s
9 0.3601 4.52s
10 0.3589 4.51s
20 0.3480 4.14s
30 0.3391 3.83s
40 0.3316 3.59s
50 0.3245 3.35s
60 0.3174 3.14s
70 0.3118 2.91s
80 0.3063 2.68s
90 0.3013 2.45s
100 0.2968 2.22s
200 0.2620 0.00s
Final MSE: 0.5118555681581297
In Conclusion:
Translated from: https://towardsdatascience.com/recommender-system-singular-value-decomposition-svd-truncated-svd-97096338f361