How to Ace the K-Means Algorithm Interview Questions
Data Science Interviews
KMeans is one of the most common and important clustering algorithms for a data scientist to know. It is, however, often the case that experienced data scientists do not have a good grasp of this algorithm. This makes KMeans an excellent interview topic for gauging a candidate's understanding of one of the most foundational machine learning algorithms.
There are a lot of questions that can be touched on when discussing the topic:
Description of the Algorithm
Describing the inner workings of the K-Means algorithm is typically the first step in an interview question centered on clustering. It shows the interviewer whether you have grasped how the algorithm works.
It might sound fine just to call KMeans().fit() and let the library handle all the algorithmic work. Still, if you need to debug some behavior, or to understand whether KMeans would be fit for purpose, it starts with having a sound understanding of how the algorithm works.
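As a reference point, the library call itself really is a one-liner. A minimal sketch using scikit-learn (the dataset and parameter choices below are purely illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative toy dataset: 500 points in 2D grouped around 3 centers
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Fit KMeans with k=3; n_init controls how many times the algorithm
# is restarted with different centroid seeds (the best run is kept)
model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(model.cluster_centers_)  # the learned centroids
print(model.labels_[:10])      # cluster assignments of the first 10 points
```

The rest of this article covers what happens behind that one call.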
High-Level Description
There are different aspects of K-Means worth mentioning when describing the algorithm. The first is that it is an unsupervised learning algorithm, aiming to group "records" based on their distances to a fixed number (i.e., k) of "centroids," with the centroids defined as the means of the k clusters.
Inner Workings
Besides the high-level description provided above, it is also essential to be able to walk an interviewer through the inner workings of the algorithm: from initialization, through the actual processing, to the stop conditions.
Initialization: It is important to discuss how the initialization method determines the initial clusters' means. At a minimum, a candidate would be expected to mention the problem of initialization: how it can lead to different clusters being created, its impact on the time it takes to obtain the clusters, etc. One of the key initialization methods to mention is the "Forgy" method.
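As an illustration, Forgy initialization simply picks k distinct observations at random as the initial centroids. A minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def forgy_init(X: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Forgy initialization: pick k distinct observations as the initial centroids."""
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices].copy()
```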
Processing: I would expect a discussion of how the algorithm traverses the points and iteratively assigns them to the nearest cluster. Great candidates would be able to go beyond that description, into a discussion of how KMeans minimizes the within-cluster variance, and to discuss Lloyd's algorithm.
Stop condition: The stop conditions for the algorithm need to be mentioned. The typical stop conditions are usually based on the following:
- (stability) The centroids of the new clusters do not change
- (convergence) Points stay in the same cluster
- (cap) The maximum number of iterations has been reached
Stop conditions are quite important to the algorithm, and I would expect a candidate to at least mention the stability or convergence conditions and the cap condition. Another key point to highlight when going through these stop conditions is the importance of having a cap implemented (see Big O complexity below). A sketch tying together the Forgy initialization, Lloyd's assignment and update steps, and these stop conditions follows.
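Putting the pieces together, here is a minimal, illustrative sketch of Lloyd's algorithm, reusing the forgy_init helper above, with a stability-based stop condition and an iteration cap (not a production implementation):

```python
import numpy as np

def kmeans_lloyd(X, k, max_iter=300, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    centroids = forgy_init(X, k, rng)         # Forgy initialization (sketched above)
    for _ in range(max_iter):                 # (cap) maximum number of iterations
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # (stability) stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```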
Big O Complexity
It is important for candidates to understand the complexity of the algorithm, from both a training and a prediction standpoint, and how the different variables impact its performance. This is why questions about the complexity of KMeans are often asked when deep-diving into the algorithm:
Training Big O
From a training perspective, the complexity (when using Lloyd's algorithm) is:
BigO(KMeansTraining) = K * I * N * M
Where:
- K: the number of clusters
- I: the number of iterations
- N: the sample size
- M: the number of variables
As can be seen, capping the number of iterations can have a significant impact on training time.
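For a rough, made-up illustration: with K = 10 clusters, I = 300 iterations, N = 1,000,000 records, and M = 100 variables, training costs on the order of 10 * 300 * 1,000,000 * 100 = 3 * 10^11 distance computations, so halving the iteration cap halves that bound.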
Prediction Big O
K-means predictions have a different complexity:
BigO(KMeansPrediction) = K * N * M
KMeans prediction only needs to compute, for each record, the distance to each cluster centroid (whose complexity depends on the number of variables), and to assign the record to the nearest one.
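A minimal NumPy sketch of that prediction step (illustrative, assuming centroids from an already-trained model):

```python
import numpy as np

def kmeans_predict(X: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each record to its nearest centroid: O(K * N * M)."""
    # (N, K) matrix of distances from every record to every centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)
```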
Scaling KMeans
During an interview, you might be asked if there are any ways to make KMeans perform faster on larger datasets. This should be a trigger to discuss mini-batch KMeans.
Mini-batch KMeans is an alternative to the traditional KMeans that provides better performance when training on larger datasets. It leverages mini-batches of data, taken at random, to update the clusters' means with a decreasing learning rate. For each batch, the points are first all assigned to a cluster, and the means are then re-calculated; the clusters' centers are recalculated using gradient descent. The algorithm converges faster than the typical KMeans, but produces a slightly different cluster output.
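scikit-learn ships an implementation of this; a minimal, illustrative usage sketch (batch size picked arbitrarily):

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Illustrative larger dataset
X, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)

# batch_size controls how many randomly drawn points feed each centroid update
model = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=42).fit(X)
print(model.cluster_centers_.shape)  # (5, n_features)
```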
Applying K-Means
Use Cases
There are multiple use cases for leveraging the K-Means algorithm, from offering recommendations or some level of personalization on a website, to deep-diving into potential cluster definitions for customer analysis and targeting.
Understanding what is expected from applying K-Means also dictates how you should apply it. Do you need to find the optimal number of clusters k, or is it an arbitrary number given by the marketing department? Do you need interpretable variables, or is this something better left for the algorithm to decide?
It is important to understand how particular K-Means use cases can impact its implementation. Implementation-specific questions usually come up as follow-ups, such as:
Let's say the marketing department asked you to provide them with user segments for an upcoming marketing campaign. What features would you look to feed into your model, and what transformations would you apply, to provide them with these segments?
This type of follow-up question is very open-ended and can require further clarification, but it usually provides insight into whether or not the candidate understands how the results of the segmentation might be used.
Finding the Optimal K
Understanding how to determine the number of clusters k to use for KMeans often comes up as a follow-up question on the application of the algorithm.
There are different techniques to identify the optimal number of clusters to use with KMeans. Three commonly used methods are the Elbow method, the Silhouette method, and the Gap statistic.
The Elbow method: is all about finding the point of inflection (the "elbow") on a graph of the percentage of variance explained versus the number of clusters k.
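A minimal, illustrative sketch of the Elbow method, using scikit-learn's inertia_ (the within-cluster sum of squares) as the quantity plotted against k:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("Within-cluster sum of squares (inertia)")
plt.show()  # look for the 'elbow' where the curve flattens out
```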
Silhouette method: The silhouette method involves calculating, for each point, a similarity/dissimilarity score between its assigned cluster and the next best (i.e., nearest) cluster.
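scikit-learn exposes this as silhouette_score; a minimal sketch that picks the k with the highest average score (illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

# The silhouette is undefined for k=1, so start the search at k=2
scores = {
    k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X))
    for k in range(2, 11)
}
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```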
Gap statistic: The goal of the gap statistic is to compare the cluster assignments on the actual dataset against those on randomly generated reference datasets. This comparison is done through the calculation of the intra-cluster variation, using the log of the sum of the pairwise distances between the clusters' points. A large gap statistic indicates that the clusters obtained on the observed data are very different from those obtained on the randomly generated reference data.
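A simplified, illustrative sketch of the gap statistic, with two assumptions worth flagging: it uses KMeans inertia as the dispersion measure, and it draws uniform reference data over the observed feature ranges (one common, but not the only, formulation):

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X: np.ndarray, k: int, n_refs: int = 10, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    # Log dispersion of the clustering on the observed data
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                    .fit(X).inertia_)
    # Log dispersions on uniform reference datasets spanning the same ranges
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wks = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)
    ]
    # Gap(k): reference dispersion minus observed dispersion (larger is better)
    return float(np.mean(ref_log_wks) - log_wk)
```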
Input Variables
When applying KMeans, it is crucial to understand what kind of data can be fed to the algorithm.
For each user on our video streaming platform, you have been provided with their historical content views as well as their demographic data. How do you determine what to train the model on?
It is generally an excellent way to broach the two subtopics of variable normalization and the number of variables.
Normalization of variables
In order to work correctly, KMeans typically needs some form of normalization applied to the dataset, as K-Means is sensitive to both the means and the variances of the input variables.
For numerical data, performing normalization using a StandardScaler is recommended, but depending on the specific case, other techniques might be more suitable.
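A minimal, illustrative sketch of scaling numerical features before clustering, using a pipeline so the scaler and the model travel together:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Scale each feature to zero mean / unit variance, then cluster
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
```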
For pure categorical data, one-hot encoding would likely be preferred, but it is worth being careful about the number of variables it ends up producing, both from an efficiency (Big O) standpoint and for managing KMeans' performance (see below: Number of variables).
For mixed data types, the features might need to be pre-processed beforehand. Techniques such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can be used to transform the input data into a dataset that KMeans can leverage appropriately.
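One way to wire this up, sketched with made-up column names (and assuming scikit-learn >= 1.2 for OneHotEncoder's sparse_output flag):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type dataset
df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "monthly_spend": [120.0, 40.5, 300.2, 80.0],
    "plan": ["basic", "premium", "basic", "family"],
})

# Scale the numerical columns, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), ["plan"]),
])

# Reduce the encoded features with PCA before clustering
pipeline = make_pipeline(
    preprocess,
    PCA(n_components=2),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(df)
```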
Number of variables
The number of variables going into K-Means has an impact both on the time/complexity it takes to train and apply the algorithm, and on how the algorithm behaves.
This is due to the curse of dimensionality:
"So as the dimensionality increases, more and more examples become nearest neighbors of xt, until the choice of nearest neighbor (and therefore of class) is effectively random." (https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

A large number of dimensions has a direct impact on distance-based computations, a key component of KMeans:
"The distances between a data point and its nearest and farthest neighbours can become equidistant in high dimensions, potentially compromising the accuracy of some distance-based analysis tools." (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238676/)

Dimensionality reduction methods such as PCA, or feature selection techniques, are things to bring up when reaching this topic.
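A small, illustrative experiment showing the distance-concentration effect the quote above describes (random uniform data; exact numbers will vary):

```python
import numpy as np

rng = np.random.default_rng(42)

for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(X - query, axis=1)
    # As dim grows, the nearest and farthest points become nearly equidistant
    print(dim, round(dists.min() / dists.max(), 3))
```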
Comparison with Other Algorithms
Besides understanding the inner workings of the KMeans algorithm, it is also important to know how it compares to other clustering algorithms.
There is a wide range of other algorithms out there: hierarchical clustering, mean shift clustering, Gaussian mixture models (GMM), DBSCAN, Affinity Propagation (AP), K-Medoids/PAM, …
What other clustering methods do you know?
How does Algorithm X compare to K-Means?
Going through the list of algorithms, it is essential to at least know the different types of clustering methods: centroid/medoid-based (e.g., KMeans, K-Medoids), hierarchical, density-based (e.g., MeanShift, DBSCAN), distribution-based (e.g., GMM), and affinity propagation (AP)…
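A small, illustrative comparison: on non-convex clusters (two interleaved half-moons), a density-based method such as DBSCAN can recover the shapes where KMeans cannot (the eps value below is a hand-picked assumption for this dataset):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# KMeans cuts the moons with a straight boundary; DBSCAN follows the density
print("KMeans clusters found:", set(kmeans_labels))
print("DBSCAN clusters found:", set(dbscan_labels))
```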
When doing these types of comparisons, it is important to list at least some K-Means alternatives, and to showcase some high-level knowledge of what each algorithm does and how it compares to K-Means.
You might be asked at this point to deep-dive into one of the algorithms you previously mentioned, so be prepared to explain how some of these other algorithms work, to list their strengths and weaknesses compared to K-Means, and to describe how their inner workings differ from those of K-Means.
Advantages / Disadvantages of Using K-Means
Going through any algorithm, it is important to know its advantages and disadvantages, so it is not surprising that this is often asked during interviews.
Some of the key advantages of KMeans are:
While some of its disadvantages are:
- Needs pre-processing for mixed data, as it can't take advantage of alternative distance functions such as Gower's distance
More from me on Hacking Analytics:
- SQL interview Questions For Aspiring Data Scientist — The Histogram
- Python Screening Interview questions for DataScientists
- ON Applying K-means Personalization to a website
- ON Coding K-Means in Vanilla Python
- How to Learn Data science from scratch
Translated from: https://medium.com/analytics-and-data/how-to-ace-the-k-means-algorithm-interview-questions-afe346f8fc09