當(dāng)前位置：首頁 > 编程语言 > python >内容正文

python

【应用多元统计分析】-王学民Python主成分分析例题，特征值处理和可视化（2）

發(fā)布時間：2024/3/13 python 31 豆豆

生活随笔收集整理的這篇文章主要介紹了【应用多元统计分析】-王学民Python主成分分析例题，特征值处理和可视化（2）小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

title: “應(yīng)用多元統(tǒng)計(jì)分析”
subtitle: “書上題目”
author: | OLSRR

由于字?jǐn)?shù)限制，本文省去部分?jǐn)?shù)據(jù)預(yù)覽。

7.6

下表中給出的是美國 50 個州每 100000 個人中七種犯罪的比率數(shù)據(jù), 這七種犯罪是: 殺人罪 $\left(x_{1}\right)$ 、強(qiáng)奸罪 $\left(x_{2}\right)$ 、搶劫罪 $\left(x_{3}\right)$ 、傷害罪 $\left(x_{4}\right)$ 、夜盜罪 $\left(x_{5}\right)$ 、盜窮罪 $\left(x_{6}\right)$ 和汽車犯罪 $\left(x_{7}\right)$ 。試對表中犯罪數(shù)據(jù)進(jìn)行主成分分析。

答案

數(shù)據(jù)庫準(zhǔn)備：

import pandas as pd#不一定都會用到 import numpy as np import seaborn as sns import matplotlib.pyplot as plt import matplotlib.pyplot as mp,seaborn from pylab import mpl from sklearn import preprocessing from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity from factor_analyzer.factor_analyzer import calculate_kmo

數(shù)據(jù)準(zhǔn)備：

df = pd.read_csv("/Users/che/Desktop/應(yīng)用多元分析/《應(yīng)用多元統(tǒng)計(jì)分析》（第五版，王學(xué)民編著）配書資料/《應(yīng)用多元統(tǒng)計(jì)分析》（第五版）文本數(shù)據(jù)（以逗號為間隔）/exec7.6.csv", encoding='gbk',index_col=0).reset_index(drop=True) print(df)#顯示數(shù)據(jù) # x1 x2 x3 x4 x5 x6 x7 #0 14.2 25.2 96.8 278.3 1135.5 1881.9 280.7 #1 10.8 51.6 96.8 284.0 1331.7 3369.8 753.3 …… #49 5.4 21.9 39.7 173.9 811.6 2772.2 282.0

Bartlett’s球狀檢驗(yàn):

chi_square_value, p_value = calculate_bartlett_sphericity(df) print(chi_square_value, p_value) #225.98725841755999 2.5997375924269063e-36

KMO檢驗(yàn)：

kmo_all, kmo_model = calculate_kmo(df) #[0.66737389 0.83457371 0.84650891 0.86291368 0.78956368 0.66450087 # 0.72385252]

檢查了變量間的相關(guān)性和偏相關(guān)性，由此可以看出，數(shù)據(jù)均大于0.5，變量間的相關(guān)性較強(qiáng)，同時偏相關(guān)性較弱，由此進(jìn)行因子分析的效果較好。

標(biāo)準(zhǔn)化：

df = preprocessing.scale(df) print(df) [[ 1.76493364e+00 -5.01338300e-02 -3.12049014e-01 6.75093885e-01-3.65336599e-01 -1.09848840e+00 -5.05748984e-01]……[-5.33973411e-01 -3.59949633e-01 -9.64914274e-01 -3.76843452e-01-1.12191907e+00 1.40426079e-01 -4.98958725e-01]]

計(jì)算相關(guān)系數(shù):

e76_1=df.iloc[0:50,0:8] df76 = pd.DataFrame(e76_1) df76_corr = df76.corr() print(df76_corr) # x1 x2 x3 x4 x5 x6 x7 #x1 1.000000 0.601220 0.483708 0.648550 0.385817 0.101920 0.068814 …… #x7 0.068814 0.348902 0.590680 0.275843 0.557953 0.444180 1.000000

可視化結(jié)果：

seaborn.heatmap(df76_corr, center=0, annot=True, cmap='YlGnBu') mp.show()

該相關(guān)矩陣表明，變量之間存在一定的相關(guān)性，即彼此之間信息有不少是重復(fù)的，從而有一定的降維空間。

特征值相關(guān)：

eig76_value,eig76_vector=np.linalg.eig(df76_corr) df=pd.DataFrame({"eig76_value":eig76_value}) df=df.sort_values(by=["eig76_value"], ascending=False) # 獲取累積貢獻(xiàn)度 df["eig76_cum"] = (df["eig76_value"]/df["eig76_value"].sum()).cumsum() eig76=df.merge(pd.DataFrame(eig76_vector).T, left_index=True, right_index=True) print(eig76) # eig76_value eig76_cum 0 ... 4 5 6 #0 4.114960 0.587851 0.300279 ... 0.440157 0.357360 0.295177 #1 1.238722 0.764812 0.629174 ... -0.203341 -0.402319 -0.502421 #2 0.725817 0.868500 0.178245 ... -0.209895 -0.539231 0.568384 #4 0.316432 0.913704 -0.232114 ... -0.057555 -0.234890 0.419238 #5 0.257974 0.950558 0.538123 ... 0.101033 0.030099 0.369753 #6 0.222039 0.982278 -0.259117 ... -0.535987 -0.039406 0.057298 #3 0.124056 1.000000 -0.267593 ... 0.648117 -0.601690 -0.147046

可視化：

plt.scatter(range(1, df76.shape[1] + 1), eig76_value) plt.plot(range(1, df76.shape[1] + 1), eig76_value) plt.title("Scree Plot") plt.xlabel("Factors") plt.ylabel("Eigenvalue") plt.grid() # 顯示網(wǎng)格 plt.show() # 顯示圖形

數(shù)據(jù)處理：

df76_std = (df76 - df76.mean())/df76.std() loading = eig76.iloc[:2,2:].T loading["vars"]=df76_std.columns print(loading) # 0 1 vars #0 0.300279 0.629174 x1 …… #6 0.295177 -0.502421 x7 score = pd.DataFrame(np.dot(df76_std,loading.iloc[:,0:2])) print(score) 0 1 0 -0.049880 2.096102 1 2.421515 -0.166523 …… 49 -1.424635 -0.062683

可以認(rèn)為，第一主成分是對所有犯罪率的度量，第二主成分是用于度量暴力犯罪在犯罪性質(zhì)上占的比重，第三主成分很難給出明顯的解釋，因此只取前兩個主成分。

mpl.rcParams['font.sans-serif'] = ['FangSong'] # 解決保存圖像是負(fù)號'-'顯示為方塊的問題 mpl.rcParams['axes.unicode_minus'] = False plt.plot(loading[0],loading[1], "o") xmin ,xmax = loading[0].min(), loading[0].max() ymin, ymax = loading[1].min(), loading[1].max() dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2 plt.xlim(xmin - dx, xmax + dx) plt.ylim(ymin - dy, ymax + dy) plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(loading[0], loading[1], loading["vars"]): plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13) plt.grid(True) plt.show()#用絕對值做比較

plt.plot(score[0],score[1], "o") xmin ,xmax = score[0].min(), score[0].max() ymin, ymax = score[1].min(), score[1].max() dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2 plt.xlim(xmin - dx, xmax + dx) plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(score[0], score[1], score.index):plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13) plt.grid(True) plt.show()

7.7

下表(完整數(shù)據(jù)可從作者網(wǎng)頁上下載) 是紐約股票交易所的五只股票(阿萊德化學(xué)、杜邦、聯(lián)合碳化物、?？松偷率抗?從 1975 年 1 月到 1976 年 12 月期間的周回報(bào)率。周回報(bào)率定義為
$\text{周回報(bào)率}=\frac{\text{本周五收盤價(jià)}-\text{上周五收盤價(jià)}}{\text{上周五收盤價(jià)}}$
有拆股和支付股息時對收盤價(jià)進(jìn)行調(diào)整, 試作主成分分析。
注:阿萊德化學(xué)、杜邦和聯(lián)合碳化物屬于化工類股票, 埃克森和德士古屬于石油類股票。

答案

數(shù)據(jù)準(zhǔn)備：

#不一定每個庫都用到 import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import matplotlib.pyplot as mp,seaborn from factor_analyzer import calculate_bartlett_sphericity, calculate_kmo from pylab import mpl from sklearn import preprocessing

讀取數(shù)據(jù)：

df1 = pd.read_csv("/Users/zz/Desktop/應(yīng)用多元分析/《應(yīng)用多元統(tǒng)計(jì)分析》（第五版，王學(xué)民編著）配書資料/《應(yīng)用多元統(tǒng)計(jì)分析》（第五版）文本數(shù)據(jù)（以逗號為間隔）/exec7.7.csv", encoding='gbk',index_col=0) print(df1) # x1 x2 x3 x4 x5#0.000000 0.000000 0.000000 0.039473 0.000000 #0.027027 -0.044855 -0.003030 -0.014466 0.043478 #0.122807 0.060773 0.088146 0.086238 0.078124 #0.057031 0.029948 0.066808 0.013513 0.019512 #0.063670 -0.003793 -0.039788 -0.018644 -0.024154 #... ... ... ... ... #0.000000 -0.020080 -0.006579 0.029925 -0.004807 #0.021429 0.049180 0.006622 -0.002421 0.028985 #0.045454 0.046375 0.074561 0.014563 0.018779 #0.050167 0.036380 0.004082 -0.011961 0.009216 #0.019108 -0.033303 0.008362 0.033898 0.004566 # #[100 rows x 4 columns]

Bartlett’s球狀檢驗(yàn):

chi_square_value, p_value = calculate_bartlett_sphericity(df1) print(chi_square_value, p_value) #108.0670240489058 5.175016294414447e-21

標(biāo)準(zhǔn)化：

df1 = preprocessing.scale(df1) print(df1) [[-0.13526816 -0.13836418 -0.14405774 1.17734079 -0.13531234]……[ 0.34041039 -1.09296855 0.0689901 0.97952993 0.03128677]]

相關(guān)系數(shù)：

df77 = pd.DataFrame(df1) # 計(jì)算相關(guān)系數(shù) df77_corr = df77.corr() print(df77_corr) # x1 x2 x3 x4 x5 # x1 1.000000 0.576924 0.508656 0.386721 0.462178 …… # x5 0.462178 0.321953 0.425627 0.523529 1.000000

可視化：

seaborn.heatmap(df77_corr, center=0, annot=True, cmap='YlGnBu') mp.show()

該相關(guān)矩陣表明，變量之間存在一定的相關(guān)性，即彼此之間信息有不少是重復(fù)的，從而有一定的降維空間。

特征值處理：

eig77_value,eig77_vector=np.linalg.eig(df77_corr) eig77=pd.DataFrame({"eig77_value":eig77_value}) eig77=eig77.sort_values(by=["eig77_value"], ascending=False) # 獲取累積貢獻(xiàn)度 eig77["eig77_cum"] = (eig77["eig77_value"]/eig77["eig77_value"].sum()).cumsum() eig77=eig77.merge(pd.DataFrame(eig77_vector).T, left_index=True, right_index=True) print(eig77) # eig77_value eig77_cum 0 1 2 3 4 #0 2.856487 0.571297 0.463541 0.457076 0.469980 0.421677 0.421329 #1 0.809118 0.733121 0.240850 0.509100 0.260577 -0.525265 -0.582242 #3 0.540044 0.841130 0.613357 -0.177900 -0.337036 -0.539018 0.433603 #4 0.451347 0.931399 -0.381373 -0.211307 0.664098 -0.472804 0.381227 #2 0.343004 1.000000 0.453288 -0.674981 0.395725 0.179448 -0.387467

畫圖：

plt.scatter(range(1, df77.shape[1] + 1), eig77_value) plt.plot(range(1, df77.shape[1] + 1), eig77_value) plt.title("Scree Plot") plt.xlabel("Factors") plt.ylabel("Eigenvalue") plt.grid() plt.show()

得分情況：

df77_std = (df77 - df77.mean())/df77.std() loading = eig77.iloc[:2,2:].T loading["vars"]=df77_std.columns print(loading) score = pd.DataFrame(np.dot(df77_std,loading.iloc[:,0:2])) print(score) # 0 1 vars #0 0.463541 0.240850 0 #1 0.457076 0.509100 1 #2 0.469980 0.260577 2 #3 0.421677 -0.525265 3 #4 0.421329 -0.582242 4 0 1 0 0.244565 -0.676780 1 -0.203899 -1.105631 .. ... ... 99 0.116289 -0.984235

可視化：

mpl.rcParams['font.sans-serif'] = ['FangSong'] mpl.rcParams['axes.unicode_minus'] = False plt.plot(loading[0],loading[1], "o") xmin ,xmax = loading[0].min(), loading[0].max() ymin, ymax = loading[1].min(), loading[1].max() dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2 plt.xlim(xmin - dx, xmax + dx) plt.ylim(ymin - dy, ymax + dy) plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(loading[0], loading[1], loading["vars"]):plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13) plt.grid(True) plt.show()

plt.plot(score[0],score[1], "o") xmin ,xmax = score[0].min(), score[0].max() ymin, ymax = score[1].min(), score[1].max()dx = (xmax - xmin) * 0.2 dy = (ymax - ymin) * 0.2plt.xlim(xmin - dx, xmax + dx) plt.ylim(ymin - dy, ymax + dy)plt.xlabel('first') plt.ylabel('second') for x, y,z in zip(score[0], score[1], score.index):plt.text(x, y+0.1, z, ha='center', va='bottom', fontsize=13)plt.grid(True) plt.show()

總結(jié)

以上是生活随笔為你收集整理的【应用多元统计分析】-王学民Python主成分分析例题，特征值处理和可视化（2）的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇： python手机编程调试_在Linux下
下一篇：【Unity3D日常开发】Unity3D

3atv精品不卡视频,97人人超碰国产精品最新,中文字幕av一区二区三区人妻少妇,久久久精品波多野结衣,日韩一区二区三区精品

python

【应用多元统计分析】-王学民Python主成分分析例题，特征值处理和可视化（2）

7.6

答案

7.7

答案

總結(jié)