Grouping NBA Players
In basketball, we typically talk about 5 positions: point guard, shooting guard, small forward, power forward, and center. Based on this, one might expect NBA players to fall into 5 distinct groups: point guards perform similarly to other point guards, shooting guards perform similarly to other shooting guards, and so on. Is this the case? Do NBA players fall neatly into position groups?
To answer this question, I will look at how NBA players “group” together. For example, there might be a group of players who collect lots of rebounds, shoot poorly from behind the 3-point line, and block lots of shots. I might call these players forwards. If we allow player performance to create the groups, what will these groups look like?
To group players, I will use k-means clustering (https://en.wikipedia.org/wiki/K-means_clustering).
When choosing a clustering algorithm, it's important to think about how the algorithm defines clusters. k-means minimizes the distance between data points (players, in my case) and the centers of K different clusters. Because distance is measured between the cluster center and a given point, k-means assumes clusters are spherical. When thinking about clusters of NBA players, do I think these clusters will be spherical? If not, then I might want to try a different clustering algorithm.
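In symbols: given K clusters with centers μ_k, k-means seeks the assignment that minimizes the within-cluster sum of squared Euclidean distances:

```latex
\min_{C_1, \ldots, C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```

Because the penalty is squared straight-line distance to a single center, compact, roughly spherical clusters minimize it best.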
For now, I will assume generally spherical clusters and use k-means. At the end of this post, I will comment on whether this assumption seems valid.
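The setup cell did not survive extraction. A plausible reconstruction, covering the libraries used throughout the rest of this post (I am assuming the requests library for the API calls), is:

```python
import requests                  # for hitting the NBA.com stats API
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
# %matplotlib inline             # if running in a Jupyter/IPython notebook
```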
We need data. Collecting the data will require a couple of steps. First, I will create a matrix of all players who ever played in the NBA (via the NBA.com API).
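The player-list cell was also lost. Here is a sketch of how one might pull it from the NBA.com stats API; the endpoint and parameter names reflect the public commonallplayers endpoint as I know it and should be treated as assumptions:

```python
def get_all_players(season='2015-16'):
    url = 'http://stats.nba.com/stats/commonallplayers'
    params = {'IsOnlyCurrentSeason': '0',  # 0 = every player ever, not just this season
              'LeagueID': '00',            # 00 = NBA
              'Season': season}
    headers = {'User-Agent': 'Mozilla/5.0'}  # the API rejects requests without a browser-like agent
    result = requests.get(url, params=params, headers=headers).json()['resultSets'][0]
    return pd.DataFrame(result['rowSet'], columns=result['headers'])

players_df = get_all_players()  # used later to map PLAYER_IDs back to names
```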
In the 1979-1980 season, the NBA introduced the 3-point line. The 3-point line has dramatically changed basketball, so players performed differently before it existed. While this change in play was not instantaneous, it does not make sense to include players from before the 3-point line.
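The filtering cell is gone too; something along these lines, assuming the player table's FROM_YEAR column holds each player's first season as a string:

```python
players_df['FROM_YEAR'] = players_df['FROM_YEAR'].astype(int)
players_df = players_df[players_df['FROM_YEAR'] >= 1979]  # keep only the 3-point era
```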
I have a list of all the players after 1979, but I want data about all these players. When grouping the players, I am not interested in how much a player played. Instead, I want to know HOW a player played. To remove variability associated with playing time, I will gather data that is standardized to 36 minutes of play. For example, if a player averages 4 points and 12 minutes a game, this player averages 12 points per 36 minutes.
Below, I have written a function that will collect every player's performance per 36 minutes. The function collects data one player at a time, so it's VERY slow. If you want the data, it can be found on my github (https://github.com/dvatterott/nba_project).
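The collection function itself was garbled beyond recovery. Below is a sketch of the approach described above, using the playercareerstats endpoint with PerMode='Per36'; again, the endpoint, parameter, and result-set names are my assumptions:

```python
def get_per36_stats(player_ids):
    url = 'http://stats.nba.com/stats/playercareerstats'
    headers = {'User-Agent': 'Mozilla/5.0'}
    rows, cols = [], None
    for pid in player_ids:  # one request per player, hence the slowness mentioned above
        params = {'PlayerID': pid, 'PerMode': 'Per36', 'LeagueID': '00'}
        result = requests.get(url, params=params, headers=headers).json()['resultSets']
        career = [r for r in result if r['name'] == 'CareerTotalsRegularSeason'][0]
        if career['rowSet']:  # some IDs return no regular-season rows
            cols = career['headers']
            rows.append(career['rowSet'][0])
    return pd.DataFrame(rows, columns=cols)

df = get_per36_stats(players_df['PERSON_ID'])
print(df.columns)  # inspect what we collected; discussed next
```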
Great! Now we have data scaled to 36 minutes of play (per36 data) for every player between 1979 and 2016. Above, I printed out the columns. I don't want all of this data. For instance, I do not care about how many minutes a player played. Also, some of the data is redundant. For instance, if I know a player's field goal attempts (FGA) and field goal percentage (FG_PCT), I can calculate the number of made field goals (FGM). I removed the data columns that seem redundant, because I do not want redundant data exercising too much influence on the grouping process.
Below, I create new data columns for 2-point field goal attempts and 2-point field goal percentage. I also remove all players who played fewer than 50 games, because these players have not had the opportunity to establish consistent performance.
```python
df = df[df['GP']>50] #only keep players with more than 50 games
df = df.fillna(0) #some players have "None" in some cells. Turn these into 0s
df['FG2M'] = df['FGM']-df['FG3M'] #number of 2pt field goals made
df['FG2A'] = df['FGA']-df['FG3A'] #2 point field goal attempts
df['FG2_PCT'] = df['FG2M']/df['FG2A'] #2 point field goal percentage
saveIDs = df['PLAYER_ID'] #save player IDs for later
df = df.drop(['PLAYER_ID','LEAGUE_ID','TEAM_ID','GP','GS','MIN','FGM','FGA','FG_PCT','FG3M','FTM','REB','PTS','FG2M'], axis=1) #pull out unnecessary columns
```
It's always important to visualize the data, so let's get an idea of what we're working with!
The plot below is called a scatter matrix. This type of plot will appear again, so let's go through it carefully. Each subplot has the feature (stat) labeled on its row, which serves as its y-axis. The column feature serves as the x-axis. For example, the subplot in the second column of the first row plots 3-point field goal attempts against 3-point field goal percentage. As you can see, players that have higher 3-point percentages tend to take more 3-pointers… makes sense.
On the diagonals, I plot the kernel density estimate of each feature's distribution. More players fall into areas where the line is higher on the y-axis. For instance, no players shoot better than ~45% from behind the 3-point line.
One interesting thing about scatter matrices is that the plots below the diagonal are a reflection of the plots above the diagonal. For example, the data in the second column of the first row and the first column of the second row are the same. The only difference is that the axes have switched.
```python
axs = pd.tools.plotting.scatter_matrix(df, alpha=0.2, figsize=(12, 12), diagonal='kde') #the diagonal will show kernel density
[ax.set_yticks([]) for ax in axs[:,0]] #turn off the ticks that take up way too much space in such a crammed figure
[ax.set_xticks([]) for ax in axs[-1,:]];
```
There are a couple of things to note in the graph above. First, there's a TON of information there. Second, it looks like there are some strong correlations. For example, look at the subplot depicting offensive rebounds against defensive rebounds.
While I tried to throw out redundant data, I clearly did not throw out all of it. For example, players who are good 3-point shooters are probably also good free throw shooters. These players are simply good shooters, and being a good shooter contributes to multiple data columns above.
When I group the data, I do not want a single ability such as shooting to contribute too much. I want to group players according to all of their abilities equally. Below, I use PCA to separate the variance associated with the different “components” (e.g., shooting ability) of basketball performance.
For an explanation of PCA I recommend this link – https://georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/.
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

X = df.as_matrix() #take data out of the dataframe (df.values in newer pandas)
X = scale(X) #standardize the data before giving it to the PCA
#I standardize the data because some features such as PF or steals have lower magnitudes than other features such as FG2A.
#I want both to contribute equally to the PCA, so I make sure they're on the same scale.
pca = PCA() #create PCA object
pca.fit(X) #pull out principal components
var_expl = pca.explained_variance_ratio_ #amount of variance explained by each component
tot_var_expl = np.array([sum(var_expl[0:i+1]) for i,x in enumerate(var_expl)]) #cumulative variance

plt.figure(figsize=(12,4))
plt.subplot(1,2,1) #cumulative proportion of variance plot
plt.plot(range(1,len(tot_var_expl)+1), tot_var_expl*100,'o-')
plt.axis([0, len(tot_var_expl)+1, 0, 100])
plt.xlabel('Number of PCA Components Included')
plt.ylabel('Percentage of variance explained (%)')

plt.subplot(1,2,2) #scree plot
plt.plot(range(1,len(var_expl)+1), var_expl*100,'o-')
plt.axis([0, len(var_expl)+1, 0, 100])
plt.xlabel('PCA Component');
```
On the left, I plot the amount of variance explained as each additional PCA component is included. Using all the components explains all of the variability, but notice how little the last few components contribute. It doesn't make sense to include a component that only explains 1% of the variability… but how many components should we include?!
I chose to include the first 5 components because no component after the 5th explained more than 5% of the variance. This part of the analysis is admittedly arbitrary, but 5% is a relatively conservative cut-off.
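Applying that 5% rule programmatically (a convenience one-liner, not in the original analysis):

```python
n_components = int(np.sum(var_expl > 0.05))  # count components that each explain more than 5% of the variance
print(n_components)  # should print 5 here, per the cut-off described above
```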
Below is the fun part of the data. We get to look at which features contribute to the different principal components.
- Assists and 3-point shooting contribute to the first component. I will call this the Outside Skills component.
- Free throw attempts, assists, turnovers and 2-point field goals contribute to the second component. I will call this the Rim Scoring component.
- Free throw percentage and 2-point field goal percentage contribute to the third component. I will call this the Pure Points component.
- 2-point field goal percentage and steals contribute to the fourth component. I will call this the Defensive Big Man component.
- 3-point shooting and free throws contribute to the fifth component. I will call this the Dead Eye component.
One thing to keep in mind here is that each component explains less variance than the last. So while 3-point shooting contributes to both the 1st and 5th components, more of the 3-point shooting variability is probably explained by the 1st component.
It would be great if we had a PCA component that was only shooting and another that was only rebounding, since we typically conceive of these as different skills. Yet there are multiple aspects to each skill. For example, a 3-point shooter not only has to be a dead-eye shooter, but also has to find ways to get open. Additionally, being good at “getting open” might be something akin to basketball IQ, which would also contribute to assists and steals!
```python
factor_names = ['Outside Skills','Rim Scoring','Pure Points','Defensive Big Man','Dead Eye'] #my component names
loadings_df = pd.DataFrame(pca.components_, columns=df.columns)
#loadings_df[0:5] #all the factor loadings appear below
```
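To surface which stats load most heavily on each component, a quick helper (not in the original post) can sort each component's loadings by magnitude:

```python
for name, (_, row) in zip(factor_names, loadings_df.iterrows()):  # zip stops after the 5 named components
    top3 = row.abs().sort_values(ascending=False).index[:3]  # the three largest-magnitude loadings
    print('%s: %s' % (name, ', '.join(top3)))
```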
Cool, we have our 5 PCA components. Now let's transform the data into the 5-component PCA space (from our 13-feature space – e.g., FG3A, FG3_PCT, etc.). To do this, we give each player a score on each of the 5 PCA components.
Next, I want to see how players cluster together based on their scores on these components. First, let's investigate how using more or fewer clusters (i.e., groups) explains different amounts of variance.
```python
from scipy.spatial.distance import cdist, pdist, euclidean
from sklearn.cluster import KMeans
from sklearn import metrics
#http://stackoverflow.com/questions/6645895/calculating-the-percentage-of-variance-measure-for-k-means
#The above link was helpful when writing this code.

reduced_data = PCA(n_components=5, whiten=True).fit_transform(X) #transform data into the 5-component PCA space
#kmeans assumes clusters have equal variance, and whitening helps keep this assumption

k_range = range(2,31) #look at the variance explained by 2 through 30 clusters
k_means_var = [KMeans(n_clusters=k).fit(reduced_data) for k in k_range] #fit kmeans with 2 to 30 clusters

#get labels and calculate silhouette score
labels = [i.labels_ for i in k_means_var]
sil_score = [metrics.silhouette_score(reduced_data,i,metric='euclidean') for i in labels]

centroids = [i.cluster_centers_ for i in k_means_var] #get the center of each cluster
k_euclid = [cdist(reduced_data,cent,'euclidean') for cent in centroids] #distance between each item and each cluster center
dist = [np.min(ke,axis=1) for ke in k_euclid] #distance between each item and its cluster center
wcss = [sum(d**2) for d in dist] #within-cluster sum of squares
tss = sum(pdist(reduced_data)**2/reduced_data.shape[0]) #total sum of squares
bss = tss-wcss #between-cluster sum of squares

plt.clf()
plt.figure(figsize=(12,4))
plt.subplot(1,2,1) #proportion of variance explained
plt.plot(k_range, bss/tss*100,'o-')
plt.axis([0, np.max(k_range), 0, 100])
plt.xlabel('Number of Clusters')
plt.ylabel('Percentage of variance explained (%)');

plt.subplot(1,2,2) #silhouette scores
plt.plot(k_range, np.transpose(sil_score)*100,'o-')
plt.axis([0, np.max(k_range), 0, 40])
plt.xlabel('Number of Clusters');
plt.ylabel('Average Silhouette Score*100');
```
As you can see in the left-hand plot, adding more clusters explains more of the variance, but there are diminishing returns. Each additional cluster explains a little less variance than the last (much like each PCA component explained less variance than the previous component).
The particularly interesting point is where the second derivative is greatest, that is, where the rate of change changes the most (the elbow). The elbow occurs at the 6th cluster.
Perhaps not coincidentally, 6 clusters also has the highest silhouette score (right-hand plot). The silhouette score starts from the average distance between a player and all other players in that player's cluster, and compares it with the average distance between the player and all players in the next nearest cluster. Silhouette scores range between -1 and 1 (where -1 means the player is in the wrong cluster, 0 means the clusters completely overlap, and 1 means the clusters are extremely well separated).
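In symbols, with a(i) the mean distance from player i to the other players in its own cluster and b(i) the mean distance to the players in the nearest other cluster, player i's silhouette is:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```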
Six clusters has the highest silhouette score, at 0.19. A score of 0.19 is not great, and suggests a different clustering algorithm might do better. More on this later.
Because 6 clusters is the elbow and has the highest silhouette score, I will use 6 clusters in my grouping analysis. Okay, now that I have decided on 6 clusters, let's see which players fall into which clusters!
```python
final_fit = KMeans(n_clusters=6).fit(reduced_data) #fit 6 clusters
df['kmeans_label'] = final_fit.labels_ #label each data point with its cluster
df['PLAYER_ID'] = saveIDs #of course we want to know what players are in what cluster
player_names = [pd.DataFrame(players_df[players_df['PERSON_ID']==x]['DISPLAY_LAST_COMMA_FIRST']).to_string(header=False,index=False) for x in df['PLAYER_ID']] #playerID numbers mean nothing to me, so get the names too
df['Name'] = player_names

#also create a dataframe recording where the clusters occur in the 5-component PCA space
cluster_locs = pd.DataFrame(final_fit.cluster_centers_, columns=['component %s' % str(s) for s in range(np.size(final_fit.cluster_centers_,1))])
cluster_locs.columns = factor_names
```
Awesome. Now let's see how all the clusters look. These clusters were created in 5-dimensional space, which is not easy to visualize. Below, I plot another scatter matrix. The scatter matrix lets us visualize the clusters in the different 2D combinations of the 5D space.
```python
from scipy.stats import gaussian_kde

plt.clf()
centroids = final_fit.cluster_centers_ #find centroids so we can plot them
colors = ['r','g','y','b','c','m'] #cluster colors
Clusters = ['Cluster 0','Cluster 1','Cluster 2','Cluster 3','Cluster 4','Cluster 5'] #cluster...names

numdata, numvars = reduced_data.shape #players by PCA components
fig, axes = plt.subplots(nrows=numvars, ncols=numvars, figsize=(10,10)) #a scatter matrix with 5x5 cells
fig.subplots_adjust(hspace=0.05, wspace=0.05)

recs = []
for col in colors: #make some patches for the legend
    recs.append(mpl.patches.Rectangle((0,0),1,1,fc=col))
fig.legend(recs,Clusters,8,ncol=6) #create legend with the patches above

for i,ax in enumerate(axes.flat):
    #hide all ticks and labels; tick labels are too much with this many subplots
    plt.setp(ax.get_yticklabels(), visible=False)
    plt.setp(ax.get_xticklabels(), visible=False)
    ax.grid(False) #again, too much
    if i%5==0: ax.set_ylabel(factor_names[i//5]) #label outer y axes
    if i>19: ax.set_xlabel(factor_names[i-20]) #label outer x axes

for i, j in zip(*np.triu_indices_from(axes, k=1)):
    for x, y in [(i,j), (j,i)]: #plot individual data points and cluster centers
        axes[y,x].plot(reduced_data[:, x], reduced_data[:, y], 'k.', markersize=2)
        axes[y,x].scatter(centroids[:,x], centroids[:,y], marker='x', s=169, linewidth=3, color=colors, zorder=10)

#kernel density estimate for each PCA factor on the diagonals
for i, label in enumerate(factor_names):
    density = gaussian_kde(reduced_data[:,i])
    density.covariance_factor = lambda: .25
    density._compute_covariance()
    x = np.linspace(min(reduced_data[:,i]), max(reduced_data[:,i]))
    axes[i,i].plot(x, density(x))
```
In the plot above, I mark the center of each cluster with an X. For example, Cluster 0 and Cluster 5 are both high in outside skills. Cluster 5 is also high in rim scoring, but low in pure points.
Below, I look at the players in each cluster. The first thing I do is identify the player closest to the cluster's center. I call this player the prototype; it is the player that most exemplifies the cluster.
I then show a picture of this player because… well, I wanted to see who these players were. I print out this player's stats and the cluster's centroid location. Finally, I print out the first ten players in this cluster. These are the first ten players alphabetically, not the ten players closest to the cluster center.
```python
from IPython.display import display
from IPython.display import Image

name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[0]) for x in reduced_data])] #find cluster prototype
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID'])) #get the player's ID number
#player = Image(url="http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url='http://www.pybloggers.com/wp-content/uploads/2016/02/4.bp_.blogspot.com_RaOrchOImw8S3mNk3exLeIAAAAAAAAdZkHs-81mnXO_Es400Lloyd-Daniels-ea756945cc0b89b3cf169e62fa86250980926bfc.jpg', width=100)
display(player)
#display(df[df['Name']==name]) #prototype's stats
display(cluster_locs[0:1]) #cluster centroid location
df[df['kmeans_label']==0]['Name'][:10] #first ten players in the cluster (alphabetically)
```
| Outside Skills | Rim Scoring | Pure Points | Defensive Big Man | Dead Eye |
|---|---|---|---|---|
| 0.830457 | -0.930833 | 0.28203 | -0.054093 | 0.43606 |
First, let me mention that the cluster number is a purely categorical variable, not an ordinal one. If you run this analysis, you will likely create clusters with similar players, but in a different order. For example, your cluster 1 might be my cluster 0.
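If you want the labels to come out the same on every run, one option (which I did not use here) is to fix k-means' random seed:

```python
final_fit = KMeans(n_clusters=6, random_state=42).fit(reduced_data)  # fixed seed: same initialization and labels every run
```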
Cluster 0 has the most players (25%; about 490 of the 1,965 players in this cluster analysis) and is red in the scatter matrix above.
Cluster 0 players are second highest in outside shooting (in the table above, you can see their average score on the outside skills component is 0.83). These players are lowest in rim scoring (-0.93), so they do not draw many fouls; they are basically snipers from the outside.
The prototype is Lloyd Daniels, who takes a fair number of 3s. I wouldn't call 31% a dominant 3-point percentage, but it's certainly not bad. Notably, Lloyd Daniels doesn't seem to do much besides shoot threes, as 55% of his shots come from the great beyond.
Cluster 0's notable players include Andrea Bargnani, JJ Barea, Danilo Gallinari, and Brandon Jennings. Some forwards. Some guards. Mostly good shooters.
On to Cluster 1… I probably should have made a function from this code, but I enjoyed picking the players' pictures too much.
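In that spirit, here is the helper I could have written. It reproduces the repeated cells (minus the hand-picked photos); treat it as a reconstruction rather than the original code:

```python
def show_cluster(k, n_names=10):
    """Print cluster k's prototype and centroid, and return its first n players alphabetically."""
    dists = [euclidean(x, final_fit.cluster_centers_[k]) for x in reduced_data]
    print('Prototype: %s' % player_names[np.argmin(dists)])  # the player nearest the centroid
    display(cluster_locs[k:k+1])  # the centroid's scores on the 5 components
    return df[df['kmeans_label'] == k]['Name'][:n_names]

show_cluster(1)
```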
| Outside Skills | Rim Scoring | Pure Points | Defensive Big Man | Dead Eye |
|---|---|---|---|---|
| -0.340177 | 1.008111 | 1.051622 | -0.150204 | 0.599516 |
Cluster 1 is green in the scatter matrix and includes about 14% of players.
Cluster 1 is highest on the rim scoring, pure points, and Dead Eye components. These players get the ball in the hoop.
Christian Laettner is the prototype. He’s a solid scoring forward.
Gilbert Arenas stands out in the first ten names. I was tempted to think of this cluster as big men, but it really seems to be players who shoot, score, and draw fouls.
Cluster 1's notable players include James Harden, Kevin Garnett, Kevin Durant, Tim Duncan, Kobe, LeBron, Kevin Martin, Shaq, Anthony Randolph??, Kevin Love, Derrick Rose, and Michael Jordan.
```python
name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[2]) for x in reduced_data])]
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID']))
#player = Image(url="http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url='http://www.pybloggers.com/wp-content/uploads/2016/02/imageocd.comimagesnba10doug-west-wallpaper-height-weight-position-college-high-schooldoug-west-cf9867b0b04ef0bb3d71c7696b2acfac313c1995.jpg', width=100)
display(player)
#display(df[df['Name']==name])
display(cluster_locs[2:3])
df[df['kmeans_label']==2]['Name'][:10]
```
| Outside Skills | Rim Scoring | Pure Points | Defensive Big Man | Dead Eye |
|---|---|---|---|---|
| 0.013618 | 0.101054 | 0.445377 | -0.347974 | -1.257634 |
Cluster 2 is yellow in the scatter matrix and includes about 17% of players.
Lots of big men who are not outside shooters and don't draw many fouls. These players are strong 2-point shooters and free throw shooters. I think of these players as mid-range shooters. Many of the more recent Cluster 2 players are forwards, since mid-range guards do not have much of a place in the current NBA.
Cluster 2's prototype is Doug West. Doug West shoots well from the free throw line and on 2-point attempts, but not from the 3-point line. He does not draw many fouls or collect many rebounds.
Cluster 2's notable players include LaMarcus Aldridge, Tayshaun Prince, Thaddeus Young, and Shaun Livingston.
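The original cell for Cluster 3 was lost in extraction; the helper sketched above stands in for it:

```python
show_cluster(3)
```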
| Outside Skills | Rim Scoring | Pure Points | Defensive Big Man | Dead Eye |
|---|---|---|---|---|
| -1.28655 | -0.467105 | -0.133546 | 0.905368 | 0.000679 |
Cluster 3 is blue in the scatter matrix and includes about 16% of players.
Cluster 3 players do not have outside skills such as assists and 3-point shooting (they're last in outside skills). They do not draw many fouls or shoot well from the free throw line. These players do not shoot often, but they have a decent shooting percentage. This is likely because they only shoot when wide open next to the hoop.
Cluster 3 players are highest on the defensive big man component. They block lots of shots and collect lots of rebounds.
The Cluster 3 prototype is Kelvin Cato. Cato is not an outside shooter, and he only averages 7.5 shots per 36 minutes, but he makes these shots at a decent clip. Cato averages about 10 rebounds per 36.
Notable Cluster 3 players include Andrew Bogut, Tyson Chandler, Andre Drummond, Kawhi Leonard??, Dikembe Mutombo, and Hassan Whiteside.
```python
name = player_names[np.argmin([euclidean(x,final_fit.cluster_centers_[4]) for x in reduced_data])]
PlayerID = str(int(df[df['Name']==name]['PLAYER_ID']))
#player = Image(url="http://stats.nba.com/media/players/230x185/"+PlayerID+".png")
player = Image(url='http://www.pybloggers.com/wp-content/uploads/2016/02/www.thenolookpass.comwp-contentuploads201201IMG-724x1024-09e16ef35f0e017bc2181a66651e8ea0dfa9fb4b.jpg', width=100) #a photo just for fun
display(player)
#display(df[df['Name']==name])
display(cluster_locs[4:5])
df[df['kmeans_label']==4]['Name'][:10]
```
| Outside Skills | Rim Scoring | Pure Points | Defensive Big Man | Dead Eye |
|---|---|---|---|---|
| -0.668445 | 0.035927 | -0.917479 | -1.243347 | 0.244897 |
Cluster 4 is cyan in the scatter matrix above and includes the fewest players (about 13%).
Cluster 4 players are not high on outside skills. They are average in rim scoring. They do not score many points, and they don't fill up the defensive side of the stat sheet. These players don't seem like all-stars.
Looking at Doug Edwards' stats – certainly not a 3-point shooter. I guess a good description of Cluster 4 players might be… NBA-caliber bench warmers.
Cluster 4's notable players include Yi Jianlian and Anthony Bennett… yeesh.
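The Cluster 5 cell was lost as well; again, the helper stands in:

```python
show_cluster(5)
```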
| Outside Skills | Rim Scoring | Pure Points | Defensive Big Man | Dead Eye |
|---|---|---|---|---|
| 0.890984 | 0.846109 | -0.926444 | 0.735306 | -0.092395 |
Cluster 5 is magenta in the scatter matrix and includes 16% of players.
Cluster 5 players are highest in outside skills and second highest in rim scoring, yet these players are dead last in pure points. It seems they score around the rim but do not draw many fouls. They are also second highest on the defensive big man component.
Gerald Henderson Sr. is the prototype. Henderson is a good 3-point and free throw shooter but does not draw many fouls. He has lots of assists and steals.
Of interest mostly because it generates an error in my code, Gerald Henderson Jr. is in Cluster 2 – the mid-range shooters.
Notable Cluster 5 players include Muggsy Bogues, MCW, Jeff Hornacek, Magic Johnson, Jason Kidd, Steve Nash, Rajon Rondo, and John Stockton. Lots of guards.
In the cell below, I plot the percentage of players in each cluster.
```python
plt.clf()
plt.hist(df['kmeans_label'], normed=True, bins=[0,1,2,3,4,5,6], rwidth=0.5)
plt.xticks([0.5,1.5,2.5,3.5,4.5,5.5],['Group 0','Group 1','Group 2','Group 3','Group 4','Group 5'])
plt.ylabel('Percentage of Players in Each Cluster');
```
I began this post by asking whether player position is the most natural way to group NBA players. The clustering analysis here suggests not.
Here's my take on the clusters: Cluster 0 is pure shooters, Cluster 1 is talented scorers, Cluster 2 is mid-range shooters, Cluster 3 is defensive big men, Cluster 4 is bench warmers, and Cluster 5 is distributors. We might call these “positions” shooters, scorers, rim protectors, and distributors.
It's possible that our notion of position comes more from defensive performance than offensive performance. On defense, a player must have a particular size and agility to guard a particular opposing player. Because of this, a team will want a range of sizes and agility levels: strong men to defend the rim and quick men to defend agile ball handlers. Box scores are notoriously bad at describing defensive performance, which could account for the lack of traditional positions in my clusters.
I did not include player height and weight in this analysis. I imagine height and weight might have produced clusters that resemble the traditional positions. I chose not to include them because these are player attributes, not player performance.
After looking through all the groups, one thing that stands out to me is the lack of specialization. For example, we did not find a single cluster of incredible 3-point shooters. Cluster 1 includes many great shooters, but it's not populated exclusively by great shooters. It would be interesting to see whether adding additional clusters to the analysis could find more specific groups, such as big men who can shoot from the outside (e.g., Dirk) or high-volume scorers (e.g., Kobe).
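A quick, exploratory way to probe that idea (hypothetical code, not from the original analysis):

```python
more_fit = KMeans(n_clusters=10, random_state=42).fit(reduced_data)  # try a finer-grained grouping
df['kmeans_label_10'] = more_fit.labels_
for k in range(10):  # eyeball each group for specialists, e.g., stretch bigs or high-volume scorers
    print(k, list(df[df['kmeans_label_10'] == k]['Name'][:5]))
```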
Originally published at: https://www.pybloggers.com/2016/02/grouping-nba-players/