【应用多元统计分析】-王学民Python主成分分析例题,特征值处理和可视化(2)
title: “應(yīng)用多元統(tǒng)計(jì)分析”
subtitle: “書上題目”
author: | OLSRR
由于字?jǐn)?shù)限制,本文省去部分?jǐn)?shù)據(jù)預(yù)覽。
7.6
下表中給出的是美國 50 個州每 100000 個人中七種犯罪的比率數(shù)據(jù), 這七種犯罪是: 殺人罪 ( x 1 ) \left(x_{1}\right) (x1?) 、 強(qiáng)奸罪 ( x 2 ) \left(x_{2}\right) (x2?) 、搶劫罪 ( x 3 ) \left(x_{3}\right) (x3?) 、傷害罪 ( x 4 ) \left(x_{4}\right) (x4?) 、夜盜罪 ( x 5 ) \left(x_{5}\right) (x5?) 、盜窮罪 ( x 6 ) \left(x_{6}\right) (x6?) 和汽車犯罪 ( x 7 ) \left(x_{7}\right) (x7?) 。試對表中犯罪數(shù)據(jù)進(jìn)行主成分分析。
答案
數(shù)據(jù)庫準(zhǔn)備:
import pandas as pd#不一定都會用到 import numpy as np import seaborn as sns import matplotlib.pyplot as plt import matplotlib.pyplot as mp,seaborn from pylab import mpl from sklearn import preprocessing from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity from factor_analyzer.factor_analyzer import calculate_kmo數(shù)據(jù)準(zhǔn)備:
df = pd.read_csv("/Users/che/Desktop/應(yīng)用多元分析/《應(yīng)用多元統(tǒng)計(jì)分析》(第五版,王學(xué)民 編著)配書資料/《應(yīng)用多元統(tǒng)計(jì)分析》(第五版)文本數(shù)據(jù)(以逗號為間隔)/exec7.6.csv", encoding='gbk',index_col=0).reset_index(drop=True) print(df)#顯示數(shù)據(jù) # x1 x2 x3 x4 x5 x6 x7 #0 14.2 25.2 96.8 278.3 1135.5 1881.9 280.7 #1 10.8 51.6 96.8 284.0 1331.7 3369.8 753.3 …… #49 5.4 21.9 39.7 173.9 811.6 2772.2 282.0Bartlett’s球狀檢驗(yàn):
chi_square_value, p_value = calculate_bartlett_sphericity(df) print(chi_square_value, p_value) #225.98725841755999 2.5997375924269063e-36KMO檢驗(yàn):
kmo_all, kmo_model = calculate_kmo(df) #[0.66737389 0.83457371 0.84650891 0.86291368 0.78956368 0.66450087 # 0.72385252]檢查了變量間的相關(guān)性和偏相關(guān)性,由此可以看出,數(shù)據(jù)均大于0.5,變量間的相關(guān)性較強(qiáng),同時偏相關(guān)性較弱,由此進(jìn)行因子分析的效果較好。
標(biāo)準(zhǔn)化:
df = preprocessing.scale(df) print(df) [[ 1.76493364e+00 -5.01338300e-02 -3.12049014e-01 6.75093885e-01-3.65336599e-01 -1.09848840e+00 -5.05748984e-01]……[-5.33973411e-01 -3.59949633e-01 -9.64914274e-01 -3.76843452e-01-1.12191907e+00 1.40426079e-01 -4.98958725e-01]]計(jì)算相關(guān)系數(shù):
e76_1=df.iloc[0:50,0:8] df76 = pd.DataFrame(e76_1) df76_corr = df76.corr() print(df76_corr) # x1 x2 x3 x4 x5 x6 x7 #x1 1.000000 0.601220 0.483708 0.648550 0.385817 0.101920 0.068814 …… #x7 0.068814 0.348902 0.590680 0.275843 0.557953 0.444180 1.000000可視化結(jié)果:
seaborn.heatmap(df76_corr, center=0, annot=True, cmap='YlGnBu') mp.show()該相關(guān)矩陣表明,變量之間存在一定的相關(guān)性,即彼此之間信息有不少是重復(fù)的,從而有一定的降維空間。
特征值相關(guān):
eig76_value,eig76_vector=np.linalg.eig(df76_corr) df=pd.DataFrame({"eig76_value":eig76_value}) df=df.sort_values(by=["eig76_value"], ascending=False) # 獲取累積貢獻(xiàn)度 df["eig76_cum"] = (df["eig76_value"]/df["eig76_value"].sum()).cumsum() eig76=df.merge(pd.DataFrame(eig76_vector).T, left_index=True, right_index=True) print(eig76) # eig76_value eig76_cum 0 ... 4 5 6 #0 4.114960 0.587851 0.300279 ... 0.440157 0.357360 0.295177 #1 1.238722 0.764812 0.629174 ... -0.203341 -0.402319 -0.502421 #2 0.725817 0.868500 0.178245 ... -0.209895 -0.539231 0.568384 #4 0.316432 0.913704 -0.232114 ... -0.057555 -0.234890 0.419238 #5 0.257974 0.950558 0.538123 ... 0.101033 0.030099 0.369753 #6 0.222039 0.982278 -0.259117 ... -0.535987 -0.039406 0.057298 #3 0.124056 1.000000 -0.267593 ... 0.648117 -0.601690 -0.147046可視化:
plt.scatter(range(1, df76.shape[1] + 1), eig76_value) plt.plot(range(1, df76.shape[1] + 1), eig76_value) plt.title("Scree Plot") plt.xlabel("Factors") plt.ylabel("Eigenvalue") plt.grid() # 顯示網(wǎng)格 plt.show() # 顯示圖形數(shù)據(jù)處理:
df76_std = (df76 - df76.mean())/df76.std() loading = eig76.iloc[:2,2:].T loading["vars"]=df76_std.columns print(loading) # 0 1 vars #0 0.300279 0.629174 x1 …… #6 0.295177 -0.502421 x7 score = pd.DataFrame(np.dot(df76_std,loading.iloc[:,0:2])) print(score) 0 1 0 -0.049880 2.096102 1 2.421515 -0.166523 …… 49 -1.424635 -0.062683可以認(rèn)為,第一主成分是對所有犯罪率的度量,第二主成分是用于度量暴力犯罪在犯罪性質(zhì)上占的比重,第三主成分很難給出明顯的解釋,因此只取前兩個主成分。
mpl.rcParams['font.sans-serif'] = ['FangSong'] # 解決保存圖像是負(fù)號'-'顯示為方塊的問題 mpl.rcParams['axes.unicode_minus'] = False plt.plot(loading[0],loading[1], "o") xmin ,xmax = loading[0].min(), loading[0].max() ymin, ymax = loading[1].min(), loading[1].max() dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2 plt.xlim(xmin - dx, xmax + dx) plt.ylim(ymin - dy, ymax + dy) plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(loading[0], loading[1], loading["vars"]): plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13) plt.grid(True) plt.show()#用絕對值做比較 plt.plot(score[0],score[1], "o") xmin ,xmax = score[0].min(), score[0].max() ymin, ymax = score[1].min(), score[1].max() dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2 plt.xlim(xmin - dx, xmax + dx) plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(score[0], score[1], score.index):plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13) plt.grid(True) plt.show()7.7
下表(完整數(shù)據(jù)可從作者網(wǎng)頁上下載) 是紐約股票交易所的五只股票(阿萊德化學(xué)、杜邦、聯(lián)合碳化物、??松偷率抗?從 1975 年 1 月到 1976 年 12 月期間的周回報(bào)率。周回報(bào)率定義為
周回報(bào)率 = 本周五收盤價(jià) ? 上周五收盤價(jià) 上周五收盤價(jià) \text{周回報(bào)率}=\frac{\text{本周五收盤價(jià)}-\text{上周五收盤價(jià)}}{\text{上周五收盤價(jià)}} 周回報(bào)率=上周五收盤價(jià)本周五收盤價(jià)?上周五收盤價(jià)?
有拆股和支付股息時對收盤價(jià)進(jìn)行調(diào)整, 試作主成分分析。
注:阿萊德化學(xué)、杜邦和聯(lián)合碳化物屬于化工類股票, 埃克森和德士古屬于石油類股票。
答案
數(shù)據(jù)準(zhǔn)備:
#不一定每個庫都用到 import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import matplotlib.pyplot as mp,seaborn from factor_analyzer import calculate_bartlett_sphericity, calculate_kmo from pylab import mpl from sklearn import preprocessing讀取數(shù)據(jù):
df1 = pd.read_csv("/Users/zz/Desktop/應(yīng)用多元分析/《應(yīng)用多元統(tǒng)計(jì)分析》(第五版,王學(xué)民 編著)配書資料/《應(yīng)用多元統(tǒng)計(jì)分析》(第五版)文本數(shù)據(jù)(以逗號為間隔)/exec7.7.csv", encoding='gbk',index_col=0) print(df1) # x1 x2 x3 x4 x5#0.000000 0.000000 0.000000 0.039473 0.000000 #0.027027 -0.044855 -0.003030 -0.014466 0.043478 #0.122807 0.060773 0.088146 0.086238 0.078124 #0.057031 0.029948 0.066808 0.013513 0.019512 #0.063670 -0.003793 -0.039788 -0.018644 -0.024154 #... ... ... ... ... #0.000000 -0.020080 -0.006579 0.029925 -0.004807 #0.021429 0.049180 0.006622 -0.002421 0.028985 #0.045454 0.046375 0.074561 0.014563 0.018779 #0.050167 0.036380 0.004082 -0.011961 0.009216 #0.019108 -0.033303 0.008362 0.033898 0.004566 # #[100 rows x 4 columns]Bartlett’s球狀檢驗(yàn):
chi_square_value, p_value = calculate_bartlett_sphericity(df1) print(chi_square_value, p_value) #108.0670240489058 5.175016294414447e-21標(biāo)準(zhǔn)化:
df1 = preprocessing.scale(df1) print(df1) [[-0.13526816 -0.13836418 -0.14405774 1.17734079 -0.13531234]……[ 0.34041039 -1.09296855 0.0689901 0.97952993 0.03128677]]相關(guān)系數(shù):
df77 = pd.DataFrame(df1) # 計(jì)算相關(guān)系數(shù) df77_corr = df77.corr() print(df77_corr) # x1 x2 x3 x4 x5 # x1 1.000000 0.576924 0.508656 0.386721 0.462178 …… # x5 0.462178 0.321953 0.425627 0.523529 1.000000可視化:
seaborn.heatmap(df77_corr, center=0, annot=True, cmap='YlGnBu') mp.show()該相關(guān)矩陣表明,變量之間存在一定的相關(guān)性,即彼此之間信息有不少是重復(fù)的,從而有一定的降維空間。
特征值處理:
eig77_value,eig77_vector=np.linalg.eig(df77_corr) eig77=pd.DataFrame({"eig77_value":eig77_value}) eig77=eig77.sort_values(by=["eig77_value"], ascending=False) # 獲取累積貢獻(xiàn)度 eig77["eig77_cum"] = (eig77["eig77_value"]/eig77["eig77_value"].sum()).cumsum() eig77=eig77.merge(pd.DataFrame(eig77_vector).T, left_index=True, right_index=True) print(eig77) # eig77_value eig77_cum 0 1 2 3 4 #0 2.856487 0.571297 0.463541 0.457076 0.469980 0.421677 0.421329 #1 0.809118 0.733121 0.240850 0.509100 0.260577 -0.525265 -0.582242 #3 0.540044 0.841130 0.613357 -0.177900 -0.337036 -0.539018 0.433603 #4 0.451347 0.931399 -0.381373 -0.211307 0.664098 -0.472804 0.381227 #2 0.343004 1.000000 0.453288 -0.674981 0.395725 0.179448 -0.387467畫圖:
plt.scatter(range(1, df77.shape[1] + 1), eig77_value) plt.plot(range(1, df77.shape[1] + 1), eig77_value) plt.title("Scree Plot") plt.xlabel("Factors") plt.ylabel("Eigenvalue") plt.grid() plt.show()得分情況:
df77_std = (df77 - df77.mean())/df77.std() loading = eig77.iloc[:2,2:].T loading["vars"]=df77_std.columns print(loading) score = pd.DataFrame(np.dot(df77_std,loading.iloc[:,0:2])) print(score) # 0 1 vars #0 0.463541 0.240850 0 #1 0.457076 0.509100 1 #2 0.469980 0.260577 2 #3 0.421677 -0.525265 3 #4 0.421329 -0.582242 4 0 1 0 0.244565 -0.676780 1 -0.203899 -1.105631 .. ... ... 99 0.116289 -0.984235可以認(rèn)為,第一主成分是對所有犯罪率的度量,第二主成分是用于度量暴力犯罪在犯罪性質(zhì)上占的比重,第三主成分很難給出明顯的解釋,因此只取前兩個主成分。
可視化:
mpl.rcParams['font.sans-serif'] = ['FangSong'] mpl.rcParams['axes.unicode_minus'] = False plt.plot(loading[0],loading[1], "o") xmin ,xmax = loading[0].min(), loading[0].max() ymin, ymax = loading[1].min(), loading[1].max() dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2 plt.xlim(xmin - dx, xmax + dx) plt.ylim(ymin - dy, ymax + dy) plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(loading[0], loading[1], loading["vars"]):plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13) plt.grid(True) plt.show() plt.plot(score[0],score[1], "o") xmin ,xmax = score[0].min(), score[0].max() ymin, ymax = score[1].min(), score[1].max()dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2plt.xlim(xmin - dx, xmax + dx) plt.ylim(ymin - dy, ymax + dy)plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(score[0], score[1], score.index):plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13)plt.grid(True) plt.show()總結(jié)
以上是生活随笔為你收集整理的【应用多元统计分析】-王学民Python主成分分析例题,特征值处理和可视化(2)的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: python手机编程调试_在Linux下
- 下一篇: 【Unity3D日常开发】Unity3D