Dimensionality Reduction Using t-Distributed Stochastic Neighbor Embedding (t-SNE) on the MNIST Dataset
It is easy for us to visualize two- or three-dimensional data, but once it goes beyond three dimensions, it becomes much harder to see what high-dimensional data looks like.
Today we often need to analyze and find patterns in datasets with thousands or even millions of dimensions, which makes visualization quite a challenge. However, one tool that can definitely help us better understand the data is dimensionality reduction.
In this post, I will discuss t-SNE, a popular non-linear dimensionality reduction technique, and how to implement it in Python using sklearn. The dataset I have chosen here is the popular MNIST dataset.
Table of Curiosities

What is t-SNE and how does it work?

How is t-SNE different from PCA?

How can we improve upon t-SNE?

What are the limitations?

What can we do next?
Overview
T-Distributed Stochastic Neighbor Embedding, or t-SNE, is a machine learning algorithm that is often used to embed high-dimensional data in a low-dimensional space [1].
In simple terms, the approach of t-SNE can be broken down into two steps. The first step is to represent the high-dimensional data by constructing a probability distribution P, where the probability of similar points being picked is high, whereas the probability of dissimilar points being picked is low. The second step is to create a low-dimensional space with another probability distribution Q that preserves the properties of P as closely as possible.
In step 1, we compute the similarity between two data points using a conditional probability p. For example, the conditional probability of j given i is the probability that x_i would pick x_j as its neighbor, assuming neighbors are picked in proportion to their probability density under a Gaussian distribution centered at x_i [1]. In step 2, we let y_i and y_j be the low-dimensional counterparts of x_i and x_j, respectively. Then we consider q to be a similar conditional probability for y_j being picked by y_i, and we employ a Student's t-distribution in the low-dimensional map. The locations of the low-dimensional data points are determined by minimizing the Kullback–Leibler divergence between the probability distributions P and Q.
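For concreteness, the key quantities from [1] can be written out: the high-dimensional affinities, their symmetrized form (n is the number of points; each σ_i is set indirectly via the perplexity parameter), the heavy-tailed low-dimensional affinities, and the cost being minimized:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

C = \mathrm{KL}(P \parallel Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}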
For more technical details of t-SNE, check out this paper.
I have chosen the MNIST dataset from Kaggle (link) as the example here because it is a simple computer vision dataset, with 28x28 pixel images of handwritten digits (0–9). We can think of each instance as a data point embedded in a 784-dimensional space.
To see the full Python code, check out my Kaggle kernel.
Without further ado, let’s get to the details!
Exploration
Note that in the original Kaggle competition, the goal is to use the labeled training images to build an ML model that can accurately predict the labels of the test set. For our purposes here we will only use the training set.
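The training data comes as a CSV file; a minimal loading sketch with pandas (assuming train.csv from the Kaggle page has been downloaded to the working directory):

import pandas as pd

# each row holds a 'label' column plus 784 flattened pixel values
train = pd.read_csv("train.csv")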
As usual, we check its shape first:
train.shape
--------------------------------------------------------------------
(42000, 785)
There are 42K training instances. The 785 columns are the 784 pixel values, as well as the ‘label’ column.
We can check the label distribution as well:
label = train["label"]
label.value_counts()
--------------------------------------------------------------------
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64
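The classes are fairly balanced. To eyeball the distribution, a quick bar chart also works; a minimal sketch with seaborn (these plotting imports are reused by the scatter plots later on):

import seaborn as sns
import matplotlib.pyplot as plt

# bar chart of how many training images exist per digit class
sns.countplot(x=label)
plt.title("Digit class distribution")
plt.show()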
Principal Component Analysis (PCA)
Before we implement t-SNE, let’s try PCA, a popular linear method for dimensionality reduction.
After we standardize the data, we can transform it using PCA (specifying 'n_components' to be 2):
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# standardize the 784 pixel features; the 'label' column is dropped first
# so that it is not treated as a feature
train = StandardScaler().fit_transform(train.drop("label", axis=1))

pca = PCA(n_components=2)
pca_res = pca.fit_transform(train)
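As a quick side check (not in the original post), sklearn exposes how much of the total variance the two retained components explain:

# fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())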
Let’s make a scatter plot to visualize the result:
sns.scatterplot(x=pca_res[:, 0], y=pca_res[:, 1], hue=label, palette=sns.hls_palette(10), legend='full');

[Figure: 2D scatter plot of the MNIST data after applying PCA]

As shown in the scatter plot, PCA with two components does not provide sufficiently meaningful insights and patterns about the different labels. We know one drawback of PCA is that a linear projection cannot capture non-linear dependencies. Let's try t-SNE now.
T-SNE with sklearn
We will implement t-SNE using sklearn.manifold (documentation):
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0)
tsne_res = tsne.fit_transform(train)

sns.scatterplot(x=tsne_res[:, 0], y=tsne_res[:, 1], hue=label, palette=sns.hls_palette(10), legend='full');

[Figure: 2D scatter plot of the MNIST data after applying t-SNE]
Now we can see that the different clusters are more separable compared with the result from PCA. Here are a few observations on this plot:
An Approach that Combines Both
It is generally recommended to use PCA or TruncatedSVD to reduce the number of dimensions to a reasonable amount (e.g. 50) before applying t-SNE [2].
Doing so can reduce the level of noise as well as speed up the computations.
Let's try PCA (50 components) first and then apply t-SNE; the scatter plot follows the sketch below.
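A minimal sketch of the combined pipeline (the variable names pca_res_50 and tsne_pca_res are mine, not from the original kernel):

# step 1: compress the 784 features down to 50 with PCA
pca_50 = PCA(n_components=50)
pca_res_50 = pca_50.fit_transform(train)

# step 2: run t-SNE on the 50-dimensional representation
tsne = TSNE(n_components=2, random_state=0)
tsne_pca_res = tsne.fit_transform(pca_res_50)

sns.scatterplot(x=tsne_pca_res[:, 0], y=tsne_pca_res[:, 1], hue=label, palette=sns.hls_palette(10), legend='full');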
[Figure: 2D scatter plot of the MNIST data after applying PCA (50 components) and then t-SNE]

Compared with the previous scatter plot, we can now separate out the 10 clusters better. Here are a few observations:
In addition, the runtime of this approach decreased by over 60%.
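The 60% figure comes from the original run; a simple sketch to reproduce the comparison on your own machine:

import time

# compare t-SNE runtime on the raw pixels vs. the 50-d PCA output
for name, data in [("raw 784-d", train), ("PCA 50-d", pca_res_50)]:
    start = time.time()
    TSNE(n_components=2, random_state=0).fit_transform(data)
    print(f"t-SNE on {name}: {time.time() - start:.1f} s")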
For more interactive 3D scatter plots, check out this post.
Limitations
Here are a few limitations of t-SNE:
Next Steps
Here are a few things that we can try as next steps:
Try some of the other non-linear techniques, such as Uniform Manifold Approximation and Projection (UMAP), which is a generalization of t-SNE based on Riemannian geometry.
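A minimal sketch, assuming the umap-learn package is installed (pip install umap-learn); feeding it the 50-component PCA output mirrors the pipeline above:

import umap

# UMAP with its default neighborhood settings; n_components=2 for plotting
reducer = umap.UMAP(n_components=2, random_state=0)
umap_res = reducer.fit_transform(pca_res_50)

sns.scatterplot(x=umap_res[:, 0], y=umap_res[:, 1], hue=label, palette=sns.hls_palette(10), legend='full');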
Summary
Let’s quickly recap.
We implemented t-SNE using sklearn on the MNIST dataset. We compared the visualized output with that from using PCA, and lastly, we tried a mixed approach which applies PCA first and then t-SNE.
I hope you enjoyed this blog post and please share any thoughts that you may have :)
Check out my other post on the Chi-square test for independence:
Translated from: https://towardsdatascience.com/dimensionality-reduction-using-t-distributed-stochastic-neighbor-embedding-t-sne-on-the-mnist-9d36a3dd4521