[Data Analysis] Data Analysis Expert Competition 3: Car Product Cluster Analysis

Published: 2023/12/31


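Competition overview
Competition background
Competition data
1. Inspecting the data
Inspecting the categorical variables
Inspecting the numeric variables
2. Data processing
Handling categorical features
LabelEncoder
one-hot
Feature normalization
PCA dimensionality reduction
3. Clustering with K-means
Choosing k with the elbow method
Visualizing the clustering result
Choosing k with the silhouette coefficient
4. Analyzing the clustering result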

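Competition overview

This teaching competition is the third event in a data analysis series launched by data scientist Dr. Chen: car product cluster analysis.

Competition background

The task is set against the background of competitor analysis: cluster the car data to group the models. For a given model, cluster analysis can then identify its competing models. The task encourages learners to use the model data to build vehicle profiles, so that product positioning and competitor analysis can be supported by data.

Competition data

Data source: car_price.csv, covering 205 car models with 26 fields.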

1. Inspecting the data

import pandas as pd
import matplotlib.pyplot as plt

car_price = pd.read_csv("./car_price.csv")
car_price.head()
car_price.info()
# car_price.duplicated().sum()

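The features fall into 3 groups:

Group 1: ID-like attributes

1 car_ID: car ID
3 CarName: car name

Group 2: categorical variables (10)

2 symboling: insurance risk rating
4 fueltype: fuel type
5 aspiration: engine aspiration type
6 doornumber: number of doors
7 carbody: body style
8 drivewheel: drive wheels
9 enginelocation: engine location
15 enginetype: engine type
16 cylindernumber: number of cylinders
18 fuelsystem: fuel system

Group 3: continuous numeric variables (14)

10 wheelbase: wheelbase
11 carlength: car length
12 carwidth: car width
13 carheight: car height
14 curbweight: curb weight (weight of the car without occupants or cargo)
17 enginesize: engine size
19 boreratio: bore ratio (cylinder cross-section relative to stroke)
20 stroke: engine stroke
21 compressionratio: compression ratio
22 horsepower: horsepower
23 peakrpm: engine speed at peak power
24 citympg: city mileage (miles per gallon)
25 highwaympg: highway mileage (miles per gallon)
26 price (dependent variable): price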

Inspecting the categorical variables

# Column names of the categorical variables
cate_columns = ['symboling', 'fueltype', 'aspiration', 'doornumber', 'carbody',
                'drivewheel', 'enginelocation', 'enginetype', 'fuelsystem', 'cylindernumber']

# Print the distinct values of each categorical variable
for i in cate_columns:
    print(i)
    print(set(car_price[i]))

Output:

symboling
{0, 1, 2, 3, -2, -1}
fueltype
{'gas', 'diesel'}
aspiration
{'std', 'turbo'}
doornumber
{'two', 'four'}
carbody
{'convertible', 'hatchback', 'wagon', 'sedan', 'hardtop'}
drivewheel
{'4wd', 'fwd', 'rwd'}
enginelocation
{'rear', 'front'}
enginetype
{'ohcv', 'ohcf', 'dohc', 'ohc', 'l', 'rotor', 'dohcv'}
fuelsystem
{'idi', 'mfi', '4bbl', '2bbl', 'mpfi', 'spfi', '1bbl', 'spdi'}
cylindernumber
{'eight', 'six', 'five', 'two', 'four', 'three', 'twelve'}

Inspecting the numeric variables

# Drop the ID-like columns 'car_ID' and 'CarName'
car_df = car_price.drop(['car_ID', 'CarName'], axis=1)

# Descriptive statistics of the numeric columns, to check for anomalies
car_df.describe()

# Column names of the continuous numeric variables
num_cols = car_df.columns.drop(cate_columns)
print(num_cols)

# Box plot of each continuous numeric variable, to check for outliers
import seaborn as sns

fig = plt.figure(figsize=(12, 8))
i = 1
for col in num_cols:
    ax = fig.add_subplot(3, 5, i)
    sns.boxplot(data=car_df[col], ax=ax)
    i = i + 1
    plt.title(col)
plt.subplots_adjust(wspace=0.4, hspace=0.3)
plt.show()
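The box plots flag outliers by eye. As a rough numeric companion, the sketch below counts values outside the 1.5 x IQR whiskers, which is the default whisker rule of matplotlib and seaborn box plots. The helper name `iqr_outlier_mask` and the toy prices are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy data (hypothetical values): one price sits far above the rest.
toy = pd.Series([7000, 7500, 8000, 8200, 9000, 9500, 45000], name="price")
mask = iqr_outlier_mask(toy)
outliers = toy[mask].tolist()
print(outliers)
```

On the real data the same mask could be applied column by column, for example `car_df[num_cols].apply(iqr_outlier_mask).sum()`, to get an outlier count per feature.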

# Correlation coefficients between the numeric features
# (numeric_only=True skips the string columns; pandas 2.x raises an error without it)
df_corr = car_df.corr(numeric_only=True)
df_corr['price'].sort_values(ascending=False)

Output:

price               1.000000
enginesize          0.874145
curbweight          0.835305
horsepower          0.808139
carwidth            0.759325
carlength           0.682920
wheelbase           0.577816
boreratio           0.553173
carheight           0.119336
stroke              0.079443
compressionratio    0.067984
symboling          -0.079978
peakrpm            -0.085267
citympg            -0.685751
highwaympg         -0.697599
Name: price, dtype: float64

# Heat map of the correlation matrix
f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(df_corr, square=True, vmax=0.8)

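2. Data processing

cylindernumber

# Map the spelled-out cylinder counts to numbers.
# Note: 'two' also occurs in the data (see the value list above) but is not in this
# mapping, so those rows keep the string 'two' and the column stays object dtype.
car_price['cylindernumber'] = car_price.cylindernumber.replace(
    {'three': 3, 'four': 4, 'five': 5, 'six': 6, 'eight': 8, 'twelve': 12})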

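CarName

# Distinct car names
print(car_price['CarName'].drop_duplicates())

# The first word of the car name is the brand
carBrand = car_price['CarName'].str.split(expand=True)[0]
print(set(carBrand))

# Assumption: the original does not show a brand-cleaning step. The pivot output in
# section 4 lists 'vokswagen rabbit' under brand 'volkswagen', so at least this
# misspelling must have been normalized.
carBrand = carBrand.replace({'vokswagen': 'volkswagen'})

# Store the brand as a column; the group-bys in section 4 rely on 'carBrand'
car_price['carBrand'] = carBrand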

Building the new feature carSize from carlength

# From the descriptive statistics above, car length ranges from 141.1 to 208.1 inches;
# split it into 6 size classes
bins = [min(car_df.carlength) - 0.01, 145.67, 169.29, 181.10, 192.91, 200.79,
        max(car_df.carlength) + 0.01]
label = ['A00', 'A0', 'A', 'B', 'C', 'D']
carSize = pd.cut(car_df.carlength, bins, labels=label)
print(carSize)

# Add the size class to both data sets
car_price['carSize'] = carSize
car_df['carSize'] = carSize

# Drop carlength now that carSize replaces it
features = car_df.drop(['carlength'], axis=1)
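To make the binning concrete, here is a small sketch that applies the same inner bin edges and labels to a handful of car lengths that appear in this article. The outer edges are written out as fixed numbers (the reported minimum 141.1 and maximum 208.1, padded by 0.01), which is an assumption made so the snippet runs without the CSV file.

```python
import pandas as pd

# Inner edges and labels copied from the code above; outer edges assume
# the reported min/max car lengths (141.1 and 208.1) padded by 0.01.
bins = [141.09, 145.67, 169.29, 181.10, 192.91, 200.79, 208.11]
labels = ['A00', 'A0', 'A', 'B', 'C', 'D']

# A few car lengths (in inches) that appear in this article
lengths = pd.Series([141.1, 166.3, 171.7, 187.5, 198.9, 202.6])

# pd.cut uses right-closed intervals by default: (left, right]
sizes = pd.cut(lengths, bins, labels=labels)
print(sizes.tolist())   # ['A00', 'A0', 'A', 'B', 'C', 'D']
print(sizes.cat.ordered)  # True: pd.cut returns an ordered categorical
```

Because the result is an ordered categorical, the intended order A00 < A0 < A < B < C < D is already stored in the column itself.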

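Handling categorical features

For categorical features, values that carry an order are mapped to numbers; values with no order (different values simply mean different categories) are one-hot encoded.

LabelEncoder

# Convert categorical variables whose values have a natural order into numbers
features1 = features.copy()

# Encode carSize with LabelEncoder
# (LabelEncoder numbers the classes in sorted string order: 'A', 'A0', 'A00', 'B', 'C', 'D')
from sklearn.preprocessing import LabelEncoder
carSize1 = LabelEncoder().fit_transform(features1['carSize'])
features1['carSize'] = carSize1
carSize1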
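One caveat worth checking when a label is meant to be ordinal: `LabelEncoder` assigns codes in sorted string order, which for these size labels is not the size order. The sketch below contrasts it with the codes of an ordered categorical. It is an illustrative alternative under the assumption that A00 < A0 < A < B < C < D is the intended order, not the method used in the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

order = ['A00', 'A0', 'A', 'B', 'C', 'D']  # smallest to largest (assumed order)
sizes = pd.Series(pd.Categorical(['A00', 'A0', 'A', 'B', 'C', 'D'],
                                 categories=order, ordered=True))

# LabelEncoder sorts the class names as strings: 'A' < 'A0' < 'A00' < 'B' < ...
le_codes = LabelEncoder().fit_transform(sizes.astype(str)).tolist()

# The categorical codes follow the declared order instead.
ordinal_codes = sizes.cat.codes.tolist()

print(le_codes)       # [2, 1, 0, 3, 4, 5]
print(ordinal_codes)  # [0, 1, 2, 3, 4, 5]
```

With `LabelEncoder`, the smallest class A00 gets a larger code than A, so distances between the three smallest classes are reversed. Whether that matters for the final clusters would need to be checked on the real data.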

one-hot

# Categorical features whose values have no order are one-hot encoded
cate = features1.select_dtypes(include='object').columns
print(cate)

features1 = features1.join(pd.get_dummies(features1[cate])).drop(cate, axis=1)
features1.head()

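Feature normalization

The raw features must each be normalized separately. For example, feature A may range over [-1000, 1000] while feature B ranges over [-1, 1].
With logistic regression, w1*x1 + w2*x2, the values of x1 are so large that x2 has almost no influence.
So the features must be normalized, each one on its own.

Normalizing continuous features:

1. Standardization (mean 0, variance 1)

2. Min-max scaling (to [0, 1])

3. x = (2x - max - min)/(max - min), a linear rescaling to [-1, 1]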
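The three scalings listed above can be sketched with NumPy as follows. The toy array stands in for "feature A" from the example and is a made-up illustration.

```python
import numpy as np

x = np.array([-1000.0, -500.0, 0.0, 250.0, 1000.0])  # toy "feature A"

# 1. Standardization: mean 0, variance 1
z = (x - x.mean()) / x.std()

# 2. Min-max scaling to [0, 1]
mm = (x - x.min()) / (x.max() - x.min())

# 3. Linear rescaling to [-1, 1]
sym = (2 * x - x.max() - x.min()) / (x.max() - x.min())

print(z.round(3))
print(mm)
print(sym)
```

Method 3 is method 2 stretched and shifted: `sym` equals `2 * mm - 1` for every element.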

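Discrete (categorical) features:

After one-hot encoding, each dimension of the encoded feature can be treated as a continuous feature, so it can be normalized in the same way as the continuous features, for example to [-1, 1] or to mean 0 and variance 1.

Since the categorical features have already been label-encoded or one-hot encoded and can now be treated as continuous, all features are normalized together.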

# Normalize the features
from sklearn import preprocessing

features1 = preprocessing.MinMaxScaler().fit_transform(features1)
features1 = pd.DataFrame(features1)
features1.head()

PCA dimensionality reduction

# PCA keeping 99.99% of the variance
import numpy as np
from sklearn.decomposition import PCA

# A float in (0, 1) is the share of variance to keep; use 0.9 to keep 90%
pca = PCA(n_components=0.9999)
features2 = pca.fit_transform(features1)

# Explained-variance ratio of each principal component (how much information it carries)
ratio = pca.explained_variance_ratio_
print('Explained-variance ratio of each component:', ratio)

# Number of components after reduction
print('Number of components after reduction:', len(ratio))

# Cumulative explained-variance ratio (cumsum returns the running total of the array)
cum_ratio = np.cumsum(ratio)
print('Cumulative explained-variance ratio:', cum_ratio)

Output:

Explained-variance ratio of each component:
[2.34835648e-01 1.89291914e-01 1.11193502e-01 6.41024136e-02
 5.90453139e-02 4.54763783e-02 4.21689429e-02 3.65477617e-02
 2.97528000e-02 2.24095237e-02 1.98458305e-02 1.95803021e-02
 1.70780800e-02 1.47611074e-02 1.32208566e-02 1.19093756e-02
 9.01434709e-03 8.74908243e-03 7.28321292e-03 6.65001057e-03
 5.68867886e-03 4.89870846e-03 4.50894857e-03 3.81422315e-03
 3.45197486e-03 2.23759951e-03 2.14676779e-03 1.84529725e-03
 1.56025958e-03 1.22067828e-03 1.12126257e-03 1.03278716e-03
 8.30359553e-04 6.87972243e-04 5.63679041e-04 4.64609849e-04
 3.33065301e-04 2.76366954e-04 1.67241531e-04 1.07861538e-04
 7.49681455e-05]
Number of components after reduction: 41
Cumulative explained-variance ratio:
[0.23483565 0.42412756 0.53532106 0.59942348 0.65846879 0.70394517
 0.74611411 0.78266187 0.81241467 0.8348242  0.85467003 0.87425033
 0.89132841 0.90608952 0.91931037 0.93121975 0.9402341  0.94898318
 0.95626639 0.9629164  0.96860508 0.97350379 0.97801274 0.98182696
 0.98527894 0.98751654 0.9896633  0.9915086  0.99306886 0.99428954
 0.9954108  0.99644359 0.99727395 0.99796192 0.9985256  0.99899021
 0.99932327 0.99959964 0.99976688 0.99987474 0.99994971]

# Bar chart of each component's variance ratio, with the cumulative ratio as a line
plt.figure(figsize=(8, 6))
X = range(1, len(ratio) + 1)
Y = ratio
plt.bar(X, Y, edgecolor='black')
plt.plot(X, Y, 'r.-')
plt.plot(X, cum_ratio, 'b.-')
plt.ylabel('explained_variance_ratio')
plt.xlabel('PCA')
plt.show()

# Keep 8 principal components
pca = PCA(n_components=8)
features3 = pca.fit_transform(features1)

# Total explained-variance ratio of the kept components (how much information they carry)
print(sum(pca.explained_variance_ratio_))  # 0.7826618733273734
features3
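The trade-off behind the choice of 8 components can be read off the cumulative ratios. The sketch below hard-codes the first 16 ratios printed above (rounded to 6 decimals) and looks up how many leading components are needed to reach a given share of variance. The helper `n_components_for` is a hypothetical name introduced here for illustration.

```python
import numpy as np

# First 16 explained-variance ratios from the output above, rounded to 6 decimals
ratio = np.array([0.234836, 0.189292, 0.111194, 0.064102, 0.059045, 0.045476,
                  0.042169, 0.036548, 0.029753, 0.022410, 0.019846, 0.019580,
                  0.017078, 0.014761, 0.013221, 0.011909])
cum = np.cumsum(ratio)

def n_components_for(threshold: float) -> int:
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    if threshold > cum[-1]:
        raise ValueError("threshold exceeds the ratios listed in this sketch")
    return int(np.searchsorted(cum, threshold) + 1)

print(n_components_for(0.78))  # 8 components, as kept in the article
print(n_components_for(0.80))  # 9
print(n_components_for(0.90))  # 14
```

So 8 components keep roughly 78% of the variance, and reaching 90% would take 14. Passing a float such as `PCA(n_components=0.9)` lets scikit-learn make this selection directly.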

3. Clustering with K-means

Choosing k with the elbow method

# Elbow method: within-cluster sum of squared errors (SSE) for each k
# Cluster for each k, record the corresponding SSE, then plot SSE against k
from sklearn.cluster import KMeans

sse = []
for i in range(1, 15):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    km.fit(features3)
    sse.append(km.inertia_)

plt.plot(range(1, 15), sse, marker='*')
plt.xlabel('n_clusters')
plt.ylabel('distortions')
plt.title("The Elbow Method")
plt.show()
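The quantity plotted on the y-axis, `km.inertia_`, is the sum of squared distances from each sample to its assigned centroid. A minimal sketch that recomputes it by hand follows; it uses synthetic data from `make_blobs` as a stand-in for `features3`, which is an assumption made so the snippet runs on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for features3 (hypothetical data)
X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

km = KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=0).fit(X)

# SSE by hand: squared distance from every point to the centroid of its cluster
assigned_centers = km.cluster_centers_[km.labels_]
sse_manual = float(((X - assigned_centers) ** 2).sum())

print(sse_manual, km.inertia_)
```

SSE always falls as k grows, which is why the elbow method looks for the point where the decrease flattens rather than for a minimum.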

Clustering with k = 5

# K-means cluster analysis
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300, random_state=0)
kmeans.fit(features3)
lab = kmeans.predict(features3)
print(lab)

Visualizing the clustering result

# 2D scatter plot of the clustering result, labelled with car_ID
import numpy as np

plt.figure(figsize=(8, 8))
plt.scatter(features3[:, 0], features3[:, 1], c=lab)
for ii in np.arange(205):
    plt.text(features3[ii, 0], features3[ii, 1], s=car_price.car_ID[ii])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means PCA')
plt.show()

# 3D scatter plot of the clustering result
from mpl_toolkits.mplot3d import Axes3D

plt.figure(figsize=(8, 8))
ax = plt.subplot(111, projection='3d')
ax.scatter(features3[:, 0], features3[:, 1], features3[:, 2], c=lab)
# Rotate the view so the clusters are easier to tell apart
ax.view_init(30, 45)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()

Choosing k with the silhouette coefficient

# Silhouette plot and 3D scatter plot for each k from 2 to 8
import numpy as np
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score
from mpl_toolkits.mplot3d import Axes3D

for n_clusters in range(2, 9):
    fig = plt.figure(figsize=(12, 6))
    ax1 = fig.add_subplot(121)
    ax2 = fig.add_subplot(122, projection='3d')
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(features3) + (n_clusters + 1) * 10])

    km = KMeans(n_clusters=n_clusters, init='k-means++', n_init=10, max_iter=300, random_state=0)
    y_km = km.fit_predict(features3)
    silhouette_avg = silhouette_score(features3, y_km)
    print('n_cluster=', n_clusters, 'The average silhouette_score is :', silhouette_avg)

    # Left panel: sorted silhouette values of each cluster
    silhouette_vals = silhouette_samples(features3, y_km, metric='euclidean')
    y_ax_lower = 10
    for i in range(n_clusters):
        c_silhouette_vals = silhouette_vals[y_km == i]
        c_silhouette_vals.sort()
        cluster_i = c_silhouette_vals.shape[0]
        y_ax_upper = y_ax_lower + cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(range(y_ax_lower, y_ax_upper), 0, c_silhouette_vals,
                          edgecolor='none', color=color)
        ax1.text(-0.05, y_ax_lower + 0.5 * cluster_i, str(i))
        y_ax_lower = y_ax_upper + 10

    ax1.set_title('The silhouette plot for the various clusters')
    ax1.set_xlabel('The silhouette coefficient values')
    ax1.set_ylabel('Cluster label')
    # Dashed red line: average silhouette score over all samples
    ax1.axvline(x=silhouette_avg, color='red', linestyle='--')
    ax1.set_yticks([])
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1.0])

    # Right panel: samples and cluster centers on the first three components
    colors = cm.nipy_spectral(y_km.astype(float) / n_clusters)
    ax2.scatter(features3[:, 0], features3[:, 1], features3[:, 2],
                marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')
    centers = km.cluster_centers_
    ax2.scatter(centers[:, 0], centers[:, 1], centers[:, 2],
                marker='o', c='white', alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], c[2], marker='$%d$' % i, alpha=1, s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    ax2.view_init(30, 45)

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters), fontsize=14, fontweight='bold')
plt.show()
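For reference, the silhouette coefficient of a sample is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below computes it by hand on four made-up points; `silhouette_by_hand` is a name introduced here for illustration, and `silhouette_samples` is imported only so the two results can be compared.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Four toy points in two obvious clusters (hypothetical data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Compute s(i) = (b - a) / max(a, b) for every sample i."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                       # exclude the sample itself
        if not same.any():                    # singleton cluster: score 0 by convention
            continue
        a = dist[i, same].mean()              # mean distance within own cluster
        b = min(dist[i, labels == c].mean()   # mean distance to the nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

scores = silhouette_by_hand(X, labels)
print(scores)
```

Values near 1 mean a sample sits well inside its cluster, values near 0 mean it lies close to a boundary, and negative values suggest it may belong to a neighbouring cluster. That is the reading applied to the plots in the discussion that follows.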

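Reading the silhouette plots together with the 3D scatter plots: when k is too small, distinct clusters get merged; when k is too large, some clusters get split into several.

k = 2: each cluster is large and a large share of samples have coefficients close to 0, which means many samples lie near a cluster boundary. Distinct clusters have been merged, and the model is poor.

k = 3: most samples in cluster '0' have silhouette coefficients below the average silhouette score (the dashed line), and a few are below 0 and tend toward -1, which suggests those samples may have been assigned to the wrong cluster.

k = 4: most samples in cluster '0' score below the average and close to 0, so they lie near a boundary; this cluster would probably be better split into two.

k = 7 or 8: some clusters are split into several whose centers are very close together, which gives a very poor model.

k = 5 or 6: most samples extend beyond the dashed line and the clusters look good. By score, k = 6 ranks above k = 5. At k = 5 cluster '3' is large; at k = 6 the clusters are more evenly sized.

In summary, both k = 5 and k = 6 give acceptable clusterings; k = 6 is chosen because its clusters are more balanced.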

# Re-run the clustering with k = 6
kmeans = KMeans(n_clusters=6, init='k-means++', n_init=10, max_iter=300, random_state=0)
y_pred = kmeans.fit_predict(features3)
print(y_pred)

# Add the cluster label to a copy of the original data
car_df_km = car_price.copy()
car_df_km['km_result'] = y_pred

Output:

[4 4 4 1 5 3 5 5 5 0 4 5 4 5 5 5 4 5 3 3 1 3 3 0 1 1 1 0 1 0 3 3 3 3 3 1 1
 3 3 1 1 1 3 1 3 1 3 5 5 4 3 3 3 1 1 4 4 4 4 3 1 3 1 2 1 5 2 2 2 2 2 5 4 5
 4 0 3 3 3 0 0 3 0 0 0 1 1 1 1 3 2 3 1 1 3 3 1 1 3 1 1 5 5 5 4 4 4 5 2 5 2
 5 2 5 2 5 2 5 3 0 1 1 1 1 0 4 4 4 4 4 1 3 3 1 3 1 0 5 3 3 3 1 1 1 1 5 1 1
 1 5 3 3 1 1 1 1 1 1 2 2 1 1 1 3 3 4 4 4 4 4 4 4 4 1 2 1 1 1 4 4 5 5 2 3 2
 1 1 2 1 3 3 5 2 1 5 5 5 5 5 5 5 5 5 2 5]

4. Analyzing the clustering result

# Number of car models in each cluster
car_df_km.groupby('km_result')['car_ID'].count()

Output:

km_result
0    13
1    59
2    20
3    43
4    31
5    39
Name: car_ID, dtype: int64

import pandas as pd
# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
pd.set_option('display.max_rows', None)

# Number of models of each brand within each cluster
car_df_km.groupby(by=['km_result', 'carBrand'])['car_ID'].count()

# Number of models in each cluster for each brand
car_df_km.groupby(by=['carBrand', 'km_result'])['car_ID'].count()

# Find the cluster of the car whose name contains 'vokswagen'
df = car_df_km.loc[:, ['car_ID', 'CarName', 'carBrand', 'km_result']]
print(df.loc[df['CarName'].str.contains("vokswagen")])
# The 'vokswagen' car is 'vokswagen rabbit', car_ID 183, assigned to cluster 2.

# Competitors of 'vokswagen rabbit': all models in cluster 2
df.loc[df['km_result'] == 2]

# Competitors of the volkswagen brand in every cluster where it appears
li = [1, 2, 3, 5]  # the volkswagen brand appears in clusters 1, 2, 3 and 5
df_volk = df[df['km_result'].isin(li)].sort_values(by=['km_result', 'carBrand'])
df_volk

Extracting the competitors of 'vokswagen rabbit' from the full data

df0 = car_df_km.loc[car_df_km['km_result'] == 2]
df0.head()
df0_1 = df0.drop(['car_ID', 'CarName', 'km_result'], axis=1)

# Distribution of every feature for the models in cluster 2
fig = plt.figure(figsize=(20, 20))
i = 1
for c in df0_1.columns:
    ax = fig.add_subplot(7, 4, i)
    if pd.api.types.is_numeric_dtype(df0_1[c]):  # numeric variable
        sns.histplot(df0_1[c], ax=ax)            # histogram
    else:
        counts = df0_1[c].value_counts()
        # keyword arguments: recent seaborn versions no longer accept x and y positionally
        sns.barplot(x=counts.index, y=counts.values, ax=ax)  # bar chart
    i = i + 1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2)
plt.show()

Categorical variables that take only one value in this cluster:
fueltype: {'diesel'}; enginelocation: {'front'}; fuelsystem: {'idi'}

These shared traits can be left out of the competitor analysis.

# Pivot the data by size class, body style, brand and car name
# (values= restricts the default mean aggregation to numeric columns;
#  pandas 2.x raises a TypeError when string columns are aggregated with mean)
num_values = df0.select_dtypes(include='number').columns.tolist()
df2 = df0.pivot_table(index=['carSize', 'carbody', 'carBrand', 'CarName'], values=num_values)
df2

Output:

carSize	carbody	carBrand	CarName	boreratio	car_ID	carheight	carlength	carwidth	citympg	compressionratio	curbweight	enginesize	highwaympg	horsepower	km_result	peakrpm	price	stroke	symboling	wheelbase
A0	hatchback	toyota	toyota corolla	3.27	160	52.8	166.3	64.4	38	22.5	2275	110	47	56	2	4500	7788.0	3.35	0	95.7
A0	sedan	nissan	nissan gt-r	2.99	91	54.5	165.3	63.8	45	21.9	2017	103	50	55	2	4800	7099.0	3.47	1	94.5
A0	sedan	toyota	toyota corona	3.27	159	53.0	166.3	64.4	34	22.5	2275	110	36	56	2	4500	7898.0	3.35	0	95.7
A	sedan	mazda	mazda glc deluxe	3.39	64	55.5	177.8	66.5	36	22.7	2443	122	42	64	2	4650	10795.0	3.39	0	98.8
A	sedan	mazda	mazda rx-7 gs	3.43	67	54.4	175.0	66.1	31	22.0	2700	134	39	72	2	4200	18344.0	3.64	0	104.9
A	sedan	toyota	toyota celica gt	3.27	175	54.9	175.6	66.5	30	22.5	2480	110	33	73	2	4500	10698.0	3.35	-1	102.4
A	sedan	volkswagen	vokswagen rabbit	3.01	183	55.7	171.7	65.5	37	23.0	2261	97	46	52	2	4800	7775.0	3.40	2	97.3
A	sedan	volkswagen	volkswagen model 111	3.01	185	55.7	171.7	65.5	37	23.0	2264	97	46	52	2	4800	7995.0	3.40	2	97.3
A	sedan	volkswagen	volkswagen rabbit custom	3.01	193	55.1	180.2	66.9	33	23.0	2579	97	38	68	2	4500	13845.0	3.40	0	100.4
A	sedan	volkswagen	volkswagen super beetle	3.01	188	55.7	171.7	65.5	37	23.0	2319	97	42	68	2	4500	9495.0	3.40	2	97.3
B	hardtop	buick	buick century	3.58	70	54.9	187.5	70.3	22	21.5	3495	183	25	123	2	4350	28176.0	3.64	0	106.7
B	sedan	buick	buick electra 225 custom	3.58	68	56.5	190.9	70.3	22	21.5	3515	183	25	123	2	4350	25552.0	3.64	-1	110.0
B	sedan	peugeot	peugeot 304	3.70	109	56.7	186.7	68.4	28	21.0	3197	152	33	95	2	4150	13200.0	3.52	0	107.9
B	sedan	peugeot	peugeot 504	3.70	117	56.7	186.7	68.4	28	21.0	3252	152	33	95	2	4150	17950.0	3.52	0	107.9
B	sedan	peugeot	peugeot 604sl	3.70	113	56.7	186.7	68.4	28	21.0	3252	152	33	95	2	4150	16900.0	3.52	0	107.9
B	sedan	volvo	volvo 246	3.01	204	55.5	188.8	68.9	26	23.0	3217	145	27	106	2	4800	22470.0	3.40	-1	109.1
B	wagon	buick	buick century luxus (sw)	3.58	69	58.7	190.9	70.3	22	21.5	3750	183	25	123	2	4350	28248.0	3.64	-1	110.0
C	wagon	peugeot	peugeot 504	3.70	111	58.7	198.9	68.4	25	21.0	3430	152	25	95	2	4150	13860.0	3.52	0	114.2
C	wagon	peugeot	peugeot 505s turbo diesel	3.70	115	58.7	198.9	68.4	25	21.0	3485	152	25	95	2	4150	17075.0	3.52	0	114.2
D	sedan	buick	buick skyhawk	3.58	71	56.3	202.6	71.7	22	21.5	3770	183	25	123	2	4350	31600.0	3.64	-1	115.6

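The size classes present in cluster 2 are: A0 (small), A (compact), B (mid-size), C (mid-to-large) and D (luxury).
The car with car_ID 183, 'vokswagen rabbit', is an A-class compact; its most direct competitors are the other cars among the 7 A-class cars in this cluster.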
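The lookup described above (same cluster, same size class, excluding the car itself) can be wrapped in a small helper. This is a sketch under stated assumptions: `find_competitors` is a hypothetical name, and the toy frame holds only a few rows and columns taken from the cluster-2 listing in this article.

```python
import pandas as pd

def find_competitors(df: pd.DataFrame, car_id: int,
                     keys=('km_result', 'carSize')) -> pd.DataFrame:
    """Rows that match the given car on every column in `keys`, excluding the car itself."""
    target = df.loc[df['car_ID'] == car_id].iloc[0]
    mask = df['car_ID'] != car_id
    for k in keys:
        mask &= (df[k] == target[k])
    return df.loc[mask]

# Toy frame with a few rows from the cluster-2 listing
toy = pd.DataFrame({
    'car_ID':    [183, 185, 64, 160, 70],
    'CarName':   ['vokswagen rabbit', 'volkswagen model 111', 'mazda glc deluxe',
                  'toyota corolla', 'buick century'],
    'km_result': [2, 2, 2, 2, 2],
    'carSize':   ['A', 'A', 'A', 'A0', 'B'],
})

rivals = find_competitors(toy, 183)['car_ID'].tolist()
print(rivals)  # [185, 64]
```

Passing `keys=('km_result',)` widens the search to the whole cluster, which corresponds to the broader competitor list used earlier in this section.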

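# A-class cars in cluster 2
df0_A = df0.loc[df0['carSize'] == 'A']
df0_A

# Categorical variables of the A-class cars in cluster 2
cate_col = df0_A.select_dtypes(include='object').columns
df3 = df0_A[cate_col]
df3

# Pivot the features of the A-class cars in cluster 2
# (values= restricts the default mean aggregation to numeric columns)
num_values_A = df0_A.select_dtypes(include='number').columns.tolist()
df4 = df0_A.pivot_table(index=['carBrand', 'CarName', 'doornumber', 'aspiration', 'drivewheel'],
                        values=num_values_A)
df4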

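The 7 A-class cars, including 'vokswagen rabbit', all have 4 cylinders. Their stroke ranges from 3.35 to 3.64, peak-power engine speed from 4200 to 4800, compression ratio from 22.0 to 23.0, car width from 65.5 to 66.9, car height from 54.4 to 55.7, and bore ratio from 3.01 to 3.43. On all of these features the cars are fairly similar.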
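As a quick check, the ranges of these features can be recomputed from the seven A-class rows of the cluster-2 pivot output. The values below are copied from that output; only the frame name `a_class` is introduced here.

```python
import pandas as pd

# The seven A-class cars of cluster 2, values copied from the pivot output
a_class = pd.DataFrame({
    'car_ID':           [64, 67, 175, 183, 185, 193, 188],
    'boreratio':        [3.39, 3.43, 3.27, 3.01, 3.01, 3.01, 3.01],
    'stroke':           [3.39, 3.64, 3.35, 3.40, 3.40, 3.40, 3.40],
    'compressionratio': [22.7, 22.0, 22.5, 23.0, 23.0, 23.0, 23.0],
    'peakrpm':          [4650, 4200, 4500, 4800, 4800, 4500, 4500],
    'carwidth':         [66.5, 66.1, 66.5, 65.5, 65.5, 66.9, 65.5],
    'carheight':        [55.5, 54.4, 54.9, 55.7, 55.7, 55.1, 55.7],
}).set_index('car_ID')

# One row per feature, with its minimum and maximum across the seven cars
ranges = a_class.agg(['min', 'max']).T
print(ranges)
```

On the full data the same call would be `df0_A[cols].agg(['min', 'max']).T` for whichever list of columns `cols` is of interest.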

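Car buyers typically look at: size class (carSize), brand (carBrand), power (horsepower), safety (symboling), fuel economy (citympg, highwaympg), interior space (wheelbase), body (carbody, curbweight), and so on.

The following key features are used to examine how 'vokswagen rabbit' differs from its competitors:

Basic information: 'carBrand', 'doornumber', 'curbweight'

Fuel economy: 'highwaympg', 'citympg'

Safety: 'symboling'

Chassis and drivetrain: 'drivewheel'

Power: 'aspiration', 'enginesize', 'horsepower'

Interior space: 'wheelbase'

Price: 'price'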

# Fuel economy ('citympg', 'highwaympg')
lab = df0_A['CarName']

fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(range(len(lab)), df0_A['highwaympg'], tick_label=lab, color='red')
ax.barh(range(len(lab)), df0_A['citympg'], tick_label=lab, color='blue')

# Annotate the horizontal bars with their values
for i, (highway, city) in enumerate(zip(df0_A['highwaympg'], df0_A['citympg'])):
    ax.text(highway, i, highway, ha='right')
    ax.text(city, i, city, ha='right')

plt.legend(('highwaympg', 'citympg'), loc='upper right')
plt.title('miles per gallon')
plt.show()

# The other 6 features
colors = ['yellow', 'blue', 'green', 'red', 'gray', 'tan', 'darkviolet']
col2 = ['symboling', 'wheelbase', 'enginesize', 'horsepower', 'curbweight', 'price']
data = df0_A[col2]

fig = plt.figure(figsize=(10, 8))
i = 1
for c in data.columns:
    ax = fig.add_subplot(3, 2, i)
    plt.barh(range(len(lab)), data[c], tick_label=lab, color=colors)
    # Annotate each bar with its value
    for y, x in enumerate(data[c].values):
        plt.text(x, y, "%s" % x)
    i = i + 1
    plt.xlabel('')
    plt.title(c)
plt.subplots_adjust(top=1.2, wspace=0.7)
plt.show()

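From the bar charts above, 'vokswagen rabbit' compares with its competitors as follows:

Safety: its insurance risk rating is 2, which is riskier than the mazda and toyota models;

Interior space: its wheelbase is the smallest in the group;

Power: its engine size and horsepower are both the smallest in the group;

Weight: its curb weight is the smallest;

Price: its price is the lowest.

In summary, compared with the other A-class competitors in cluster 2, 'vokswagen rabbit' has:

Weaknesses: lower safety rating, less interior space, lower horsepower

Strengths: light body, low fuel consumption, low price (very good value among similar configurations)

Design: two-door sedan (three-box body)

Product positioning: "an economical, practical A-class compact for city commuting"

Recommendation: sales and promotion could emphasize (1) outstanding value among similarly equipped models; (2) low fuel consumption, which makes city commuting cheap; (3) a compact body that is easy to park; (4) a distinctive two-door design.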

