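Udacity Baseball Data Analysis Project
Height and weight characteristics of baseball players
I obtained a dataset of physical measurements for baseball players born between 1820 and 1995. I am interested in how players' height and weight differ by birthplace, how they change over time, and how they relate to players' lifespan. The analysis below addresses the following questions: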
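1. How are players distributed by birth region?
2. How do players' height and weight change with birth year?
3. How is players' lifespan related to their height and weight?

Here, height and weight are the dependent variables; birth year and birthplace are the independent variables.

Import the libraries:

    # -*- coding: utf-8 -*-
    from __future__ import division  # must come before the other imports
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

Load the data:

    def read_csv(filename):
        # Load a CSV file into a DataFrame
        return pd.read_csv(filename)

    player_df = read_csv('Master.csv')
    # stars_df = read_csv('AllstarFull.csv')

Let's first look at the structure of the imported data: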
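    player_df.head()

| playerID | birthYear | birthMonth | birthDay | birthCountry | birthState | birthCity | deathYear | deathMonth | deathDay | ... | nameLast | nameGiven | weight | height | bats | throws | debut | finalGame | retroID | bbrefID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aardsda01 | 1981.0 | 12.0 | 27.0 | USA | CO | Denver | NaN | NaN | NaN | ... | Aardsma | David Allan | 220.0 | 75.0 | R | R | 2004/4/6 | 2015/8/23 | aardd001 | aardsda01 |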
| aaronha01 | 1934.0 | 2.0 | 5.0 | USA | AL | Mobile | NaN | NaN | NaN | ... | Aaron | Henry Louis | 180.0 | 72.0 | R | R | 1954/4/13 | 1976/10/3 | aaroh101 | aaronha01 |
| aaronto01 | 1939.0 | 8.0 | 5.0 | USA | AL | Mobile | 1984.0 | 8.0 | 16.0 | ... | Aaron | Tommie Lee | 190.0 | 75.0 | R | R | 1962/4/10 | 1971/9/26 | aarot101 | aaronto01 |
| aasedo01 | 1954.0 | 9.0 | 8.0 | USA | CA | Orange | NaN | NaN | NaN | ... | Aase | Donald William | 190.0 | 75.0 | R | R | 1977/7/26 | 1990/10/3 | aased001 | aasedo01 |
| abadan01 | 1972.0 | 8.0 | 25.0 | USA | FL | Palm Beach | NaN | NaN | NaN | ... | Abad | Fausto Andres | 184.0 | 73.0 | L | L | 2001/9/10 | 2006/4/13 | abada001 | abadan01 |
5 rows × 24 columns
The column headers have the following meanings:

1. playerID: A unique code assigned to each player. The playerID links the data in this file with records in the other files.
2. birthYear: Year player was born
3. birthMonth: Month player was born
4. birthDay: Day player was born
5. birthCountry: Country where player was born
6. birthState: State where player was born
7. birthCity: City where player was born
8. deathYear: Year player died
9. deathMonth: Month player died
10. deathDay: Day player died
11. deathCountry: Country where player died
12. deathState: State where player died
13. deathCity: City where player died
14. nameFirst: Player's first name
15. nameLast: Player's last name
16. nameGiven: Player's given name (typically first and middle)
17. weight: Player's weight in pounds
18. height: Player's height in inches
19. bats: Player's batting hand (left, right, or both)
20. throws: Player's throwing hand (left or right)
21. debut: Date that player made first major league appearance

The file has many fields, but this analysis only needs the player ID, birth year, death year, birth country, state and city, weight, and height, so we extract those columns:
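    # .copy() gives an independent DataFrame that can be modified safely later
    data1_df = player_df[['playerID', 'birthYear', 'deathYear', 'birthCountry', 'birthState', 'birthCity', 'weight', 'height']].copy()

Let's look at the structure of the new data: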
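    data1_df.head()

| playerID | birthYear | deathYear | birthCountry | birthState | birthCity | weight | height |
| --- | --- | --- | --- | --- | --- | --- | --- |
| aardsda01 | 1981.0 | NaN | USA | CO | Denver | 220.0 | 75.0 |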
| aaronha01 | 1934.0 | NaN | USA | AL | Mobile | 180.0 | 72.0 |
| aaronto01 | 1939.0 | 1984.0 | USA | AL | Mobile | 190.0 | 75.0 |
| aasedo01 | 1954.0 | NaN | USA | CA | Orange | 190.0 | 75.0 |
| abadan01 | 1972.0 | NaN | USA | FL | Palm Beach | 184.0 | 73.0 |
Next, let's look at the summary statistics:

    data1_df.describe()

| | birthYear | deathYear | weight | height |
| --- | --- | --- | --- | --- |
| count | 18703.000000 | 9336.000000 | 17975.000000 | 18041.000000 |
| mean | 1930.664118 | 1963.850364 | 185.980862 | 72.255640 |
| std | 41.229079 | 31.506369 | 21.226988 | 2.598983 |
| min | 1820.000000 | 1872.000000 | 65.000000 | 43.000000 |
| 25% | 1894.000000 | 1942.000000 | 170.000000 | 71.000000 |
| 50% | 1936.000000 | 1966.000000 | 185.000000 | 72.000000 |
| 75% | 1968.000000 | 1989.000000 | 200.000000 | 74.000000 |
| max | 1995.000000 | 2016.000000 | 320.000000 | 83.000000 |
The summary shows that the players' mean height is 72.26 inches, with values ranging from 43 to 83 inches. Weight ranges from 65 to 320 pounds, with a mean of 185.98 pounds.
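For readers more used to metric units, the summary figures can be converted with the standard factors (1 in = 2.54 cm, 1 lb = 0.45359237 kg). This is a minimal sketch; the helper names `inches_to_cm` and `pounds_to_kg` are illustrative and not part of the original notebook.

```python
# Standard conversion factors (exact by definition).
CM_PER_INCH = 2.54
KG_PER_POUND = 0.45359237


def inches_to_cm(inches):
    """Convert a length in inches to centimetres."""
    return inches * CM_PER_INCH


def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_POUND


# Mean height and weight taken from the describe() summary.
mean_height_cm = inches_to_cm(72.255640)
mean_weight_kg = pounds_to_kg(185.980862)

print(round(mean_height_cm, 1))  # 183.5
print(round(mean_weight_kg, 1))  # 84.4
```

So the average player in this dataset is roughly 183.5 cm tall and weighs about 84.4 kg.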
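Let's check whether any data is missing:

    data1_df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 18846 entries, 0 to 18845
    Data columns (total 8 columns):
    playerID        18846 non-null object
    birthYear       18703 non-null float64
    deathYear       9336 non-null float64
    birthCountry    18773 non-null object
    birthState      18220 non-null object
    birthCity       18647 non-null object
    weight          17975 non-null float64
    height          18041 non-null float64
    dtypes: float64(4), object(4)
    memory usage: 1.2+ MB

The weight, height, birth year, and death year columns are incomplete. Missing height and weight values will be filled with the preceding value, and rows with a missing birth year will be dropped.

    # Forward-fill missing values in one column
    def enfull_ave(letter):
        # ffill() returns a new Series, so assign it back to the column
        data1_df[letter] = data1_df[letter].ffill()

    # Fill in weight
    enfull_ave('weight')
    # Fill in height
    enfull_ave('height')
    # Drop rows with a missing birth year
    # (how='all' would only drop rows in which every column is missing)
    data1_df = data1_df.dropna(subset=['birthYear'])

Now let's analyze how players are distributed by country and city.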
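A note on the two cleaning steps above: forward fill returns a new object rather than changing the column in place, and `dropna(how='all')` removes only rows in which every column is missing. The sketch below shows both effects on a toy frame; the `toy` DataFrame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01', 'd01'],
    'birthYear': [1950.0, np.nan, 1960.0, 1970.0],
    'weight': [180.0, np.nan, 200.0, np.nan],
})

# ffill() returns a new Series, so assign it back to keep the result.
toy['weight'] = toy['weight'].ffill()

# how='all' only drops rows where every column is NaN: nothing is removed here.
unchanged = toy.dropna(how='all')

# subset= restricts the check to birthYear, which removes player b01.
cleaned = toy.dropna(subset=['birthYear'])

print(toy['weight'].tolist())        # [180.0, 180.0, 200.0, 200.0]
print(len(unchanged), len(cleaned))  # 4 3
```

Keep in mind that forward fill copies the value from the previous row, which here is a different player's measurement. Filling with the column mean or median is a common alternative when row order carries no meaning.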
    # Helper functions used throughout the analysis

    # Group players by `name` and count the players in each group
    def player_count(data, name):
        return data.groupby(name)['playerID'].count()

    # Share of all players that falls into each group
    def player_count_rate(data, name):
        b = player_count(data, name)
        a = data['playerID'].count()
        return b / a

    # Draw a pie chart
    def print_pie(group_data, title):
        group_data.plot.pie(title=title, figsize=(12, 12), autopct='%3.1f%%', startangle=90, legend=True)

    # Draw a bar chart with a percentage label above each bar
    def print_bar(data, title):
        bar = data.plot.bar(title=title, width=10)
        for p in bar.patches:
            bar.annotate('%3.1f%%' % (p.get_height() * 100), (p.get_x() * 1.005, p.get_height() * 1.005))

    # Plot one column against the index as red circle markers
    def print_plot(data, name1, title):
        x = data.index
        y = data[name1]
        plt.figure(figsize=(12, 6))  # create the figure
        plt.plot(x, y, 'ro', linewidth=1)  # x values, y values, red circles
        plt.xlabel("year")
        plt.ylabel(name1)
        plt.title(title)  # figure title
        plt.savefig("line.jpg")  # save before show(), otherwise the saved file may be blank
        plt.show()  # display the figure

Next, let's look at the share of players born in each country:
    player_count_rate(data1_df, 'birthCountry').sort_values(ascending=False)

    birthCountry
    USA               0.875730
    D.R.              0.034119
    Venezuela         0.018094
    P.R.              0.013425
    CAN               0.012947
    Cuba              0.010506
    Mexico            0.006261
    Japan             0.003290
    Panama            0.002918
    Ireland           0.002653
    United Kingdom    0.002600
    Germany           0.002441
    Australia         0.001486
    South Korea       0.000902
    Colombia          0.000902
    Nicaragua         0.000743
    Curacao           0.000743
    V.I.              0.000637
    Netherlands       0.000637
    Taiwan            0.000584
    Russia            0.000424
    France            0.000424
    Italy             0.000371
    Bahamas           0.000318
    Aruba             0.000265
    Poland            0.000265
    Austria           0.000212
    Sweden            0.000212
    Spain             0.000212
    Czech Republic    0.000212
    Jamaica           0.000212
    Brazil            0.000159
    Norway            0.000159
    Saudi Arabia      0.000106
    At Sea            0.000053
    American Samoa    0.000053
    Belgium           0.000053
    Belize            0.000053
    China             0.000053
    Viet Nam          0.000053
    Denmark           0.000053
    Finland           0.000053
    Greece            0.000053
    Guam              0.000053
    Honduras          0.000053
    Indonesia         0.000053
    Lithuania         0.000053
    Philippines       0.000053
    Singapore         0.000053
    Slovakia          0.000053
    Switzerland       0.000053
    Afghanistan       0.000053
    Name: playerID, dtype: float64

The players come from more than 50 countries and regions. The vast majority were born in the USA (87.6%). The next largest shares belong to D.R., Venezuela, P.R., CAN, and Cuba, each above 1%. Next, let's look at how US-born players are distributed by state.
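    # Select the US-born players
    data_usa = data1_df[data1_df['birthCountry'] == 'USA']
    # Draw the pie chart
    print_pie(player_count_rate(data_usa, 'birthState'), 'The player rate about States')

CA has the most players, with 13% of US-born players, followed by PA with 8.5%. The top five states are CA, PA, NY, IL, and OH; together they account for more than 44% of US-born players.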
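The country and state shares above can be cross-checked with built-in pandas methods. The sketch below compares the `player_count_rate` style of calculation with `value_counts(normalize=True)` and sums the largest shares with `nlargest`. The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'playerID': ['p%02d' % i for i in range(10)],
    'birthState': ['CA', 'CA', 'CA', 'CA', 'PA', 'PA', 'PA', 'NY', 'NY', None],
})

# Share of all players, as player_count_rate computes it: the denominator
# also counts the player whose state is missing.
share_all = toy.groupby('birthState')['playerID'].count() / toy['playerID'].count()

# Share among players with a known state: value_counts skips NaN by default.
share_known = toy['birthState'].value_counts(normalize=True)

# Combined share of the two largest groups.
top2_share = share_all.nlargest(2).sum()

print(round(share_all['CA'], 3))    # 0.4
print(round(share_known['CA'], 3))  # 0.444
print(round(top2_share, 2))         # 0.7
```

The two approaches differ only in the denominator, which matters when many birthplace values are missing.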
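Now let's look at players' height and weight by birthplace:

    data2 = data1_df[['birthCountry', 'birthState', 'height', 'weight']]
    # Mean height and weight per country, sorted by mean height
    data3 = data2.groupby('birthCountry').mean(numeric_only=True).sort_values(by='height', ascending=False)
    print('%d countries are at or above the overall average' % (data3['height'][data3['height'] >= data1_df['height'].mean()].count()))
    data3

    26 countries are at or above the overall average

Mean height (inches) and weight (pounds) by birth country, tallest first; the birthCountry index column is not shown:

| height | weight |
| --- | --- |
| 78.000000 | 220.000000 |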
| 77.000000 | 205.000000 |
| 75.250000 | 201.250000 |
| 75.000000 | 215.000000 |
| 74.333333 | 205.000000 |
| 74.000000 | 205.000000 |
| 74.000000 | 185.000000 |
| 74.000000 | 210.000000 |
| 73.500000 | 200.500000 |
| 73.454545 | 183.333333 |
| 73.411765 | 198.294118 |
| 73.357143 | 207.857143 |
| 73.250000 | 189.666667 |
| 73.000000 | 170.000000 |
| 73.000000 | 185.000000 |
| 73.000000 | 180.000000 |
| 73.000000 | 165.000000 |
| 73.000000 | 188.000000 |
| 73.000000 | 200.000000 |
| 72.890909 | 186.018182 |
| 72.819596 | 192.916019 |
| 72.727273 | 194.454545 |
| 72.666667 | 185.000000 |
| 72.571429 | 189.785714 |
| 72.375000 | 182.871795 |
| 72.257213 | 185.427646 |
| 72.225806 | 197.222874 |
| 72.209677 | 192.354839 |
| 72.127119 | 189.118644 |
| 72.000000 | 200.000000 |
| 72.000000 | 185.000000 |
| 72.000000 | 210.000000 |
| 72.000000 | 180.833333 |
| 72.000000 | 196.000000 |
| 71.979167 | 185.212500 |
| 71.881423 | 185.818182 |
| 71.833333 | 184.666667 |
| 71.750000 | 190.250000 |
| 71.682051 | 185.451282 |
| 71.647059 | 199.125000 |
| 71.600000 | 179.800000 |
| 71.333333 | 186.250000 |
| 71.142857 | 180.428571 |
| 71.000000 | 184.000000 |
| 71.000000 | 170.000000 |
| 71.000000 | 200.000000 |
| 70.377778 | 174.500000 |
| 70.000000 | 180.000000 |
| 69.857143 | 167.428571 |
| 69.552632 | 170.131579 |
| 69.000000 | 165.000000 |
| 67.000000 | 158.000000 |
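The country with the tallest players on average is Indonesia at 78 inches, followed by Belgium at 77 inches. No country has a mean height below 67 inches, and 26 countries are at or above the overall average. Next, let's look at weight.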
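    c = data2.groupby('birthCountry').mean(numeric_only=True).sort_values(by='weight', ascending=False)
    # Count the countries at or above the overall average
    print('%d countries are at or above the overall average' % (data3['weight'][data3['weight'] >= data1_df['weight'].mean()].count()))
    c

    27 countries are at or above the overall average

The same country means, sorted by weight, heaviest first; the birthCountry index column is not shown:

| height | weight |
| --- | --- |
| 78.000000 | 220.000000 |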
| 75.000000 | 215.000000 |
| 72.000000 | 210.000000 |
| 74.000000 | 210.000000 |
| 73.357143 | 207.857143 |
| 74.000000 | 205.000000 |
| 77.000000 | 205.000000 |
| 74.333333 | 205.000000 |
| 75.250000 | 201.250000 |
| 73.500000 | 200.500000 |
| 72.000000 | 200.000000 |
| 71.000000 | 200.000000 |
| 73.000000 | 200.000000 |
| 71.647059 | 199.125000 |
| 73.411765 | 198.294118 |
| 72.225806 | 197.222874 |
| 72.000000 | 196.000000 |
| 72.727273 | 194.454545 |
| 72.819596 | 192.916019 |
| 72.209677 | 192.354839 |
| 71.750000 | 190.250000 |
| 72.571429 | 189.785714 |
| 73.250000 | 189.666667 |
| 72.127119 | 189.118644 |
| 73.000000 | 188.000000 |
| 71.333333 | 186.250000 |
| 72.890909 | 186.018182 |
| 71.881423 | 185.818182 |
| 71.682051 | 185.451282 |
| 72.257213 | 185.427646 |
| 71.979167 | 185.212500 |
| 73.000000 | 185.000000 |
| 72.000000 | 185.000000 |
| 74.000000 | 185.000000 |
| 72.666667 | 185.000000 |
| 71.833333 | 184.666667 |
| 71.000000 | 184.000000 |
| 73.454545 | 183.333333 |
| 72.375000 | 182.871795 |
| 72.000000 | 180.833333 |
| 71.142857 | 180.428571 |
| 73.000000 | 180.000000 |
| 70.000000 | 180.000000 |
| 71.600000 | 179.800000 |
| 70.377778 | 174.500000 |
| 69.552632 | 170.131579 |
| 71.000000 | 170.000000 |
| 73.000000 | 170.000000 |
| 69.857143 | 167.428571 |
| 69.000000 | 165.000000 |
| 73.000000 | 165.000000 |
| 67.000000 | 158.000000 |
Indonesia also has the highest mean weight at 220 pounds, followed by Afghanistan at 215 pounds. 27 countries are at or above the overall average weight.
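These rankings should be read with care: in the country share table, Indonesia, Belgium, and Afghanistan each account for 0.000053 of the 18846 players, which corresponds to a single player, so their "mean" is one person's measurement. One way to make this visible is to report the group size next to the mean. The sketch below uses made-up values, and the `min_players` threshold is a hypothetical choice.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'birthCountry': ['USA', 'USA', 'USA', 'CAN', 'CAN', 'Indonesia'],
    'height': [72.0, 73.0, 71.0, 72.0, 74.0, 75.0],
})

# Mean height and number of players with a known height per country.
stats = toy.groupby('birthCountry')['height'].agg(['mean', 'count'])

# Hypothetical threshold: ignore countries with fewer than 2 players.
min_players = 2
reliable = stats[stats['count'] >= min_players]

# Countries whose mean is at or above the overall player-level mean.
overall_mean = toy['height'].mean()
above = reliable[reliable['mean'] >= overall_mean]

print(int(stats.loc['Indonesia', 'count']))  # 1
print(above.index.tolist())                  # ['CAN']
```

With the filter in place, a single tall or heavy player no longer puts a country at the top of the ranking.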
Next, let's look at how mean height and mean weight change with birth year.
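    # Mean values per birth year
    b = data1_df.groupby('birthYear').mean(numeric_only=True)
    d = b.dropna()
    # Plot mean weight against birth year
    print_plot(d, 'weight', 'The weight change about birthyears')

[Figure: mean weight by birth year]

    # Plot mean height against birth year
    print_plot(d, 'height', 'The height change about birthYear')

[Figure: mean height by birth year]

The plots show that players' mean height and weight are positively correlated with birth year. How strong is the correlation? Let's check.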
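The strength of a linear relationship is commonly measured with Pearson's correlation coefficient, which is also the default method of the pandas correlation functions. As a minimal sketch of what that number means, the block below computes it by hand and compares it with pandas. The yearly values and the `pearson_r` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy yearly means; the values are made up for illustration.
years = np.array([1900.0, 1920.0, 1940.0, 1960.0, 1980.0])
height = np.array([70.0, 70.5, 71.5, 72.0, 73.0])


def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())


r_manual = pearson_r(years, height)
r_pandas = pd.Series(height).corr(pd.Series(years))

print(round(r_manual, 3))  # 0.993
```

A value close to 1 means the points lie near an upward-sloping straight line. Note that correlating yearly means, as done in this analysis, smooths out the variation between individual players, so it typically gives a higher coefficient than a player-level correlation would.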
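    # Build the data
    e = pd.DataFrame(d, columns=['birthyear', 'weight', 'height'])
    e['birthyear'] = e.index
    # Compute the correlation coefficients
    e.corrwith(e['birthyear'])

    birthyear    1.000000
    weight       0.929546
    height       0.947681
    dtype: float64

The correlation coefficient between birth year and mean height is 0.948, and between birth year and mean weight it is 0.930. Players' mean height and weight are therefore strongly correlated with birth year. Without further data, however, the cause of this trend cannot be determined.

Next, let's look at how players' lifespan relates to their height and weight.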
    # Drop players with no recorded death year (still alive) and select the columns
    data_age = data1_df.dropna(subset=['deathYear'])
    data_age = data_age[['playerID', 'birthYear', 'deathYear', 'weight', 'height']]
    # Compute each player's lifespan
    data_age = pd.DataFrame(data_age, columns=['playerID', 'birthYear', 'deathYear', 'Age', 'weight', 'height'])
    data_age['Age'] = data_age['deathYear'] - data_age['birthYear']

Remove any remaining missing values:
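    # Drop rows with missing values
    data_age = data_age.dropna()
    # Mean values for each lifespan
    f = data_age.groupby('Age').mean(numeric_only=True)
    f

Mean values per lifespan, shortest first. The Age index column is not shown, but in every row it equals deathYear minus birthYear (20 in the first row, 107 in the last):

| birthYear | deathYear | weight | height |
| --- | --- | --- | --- |
| 1907.500000 | 1927.500000 | 176.500000 | 70.500000 |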
| 1867.000000 | 1888.000000 | 181.500000 | 72.500000 |
| 1925.800000 | 1947.800000 | 179.000000 | 71.400000 |
| 1915.000000 | 1938.000000 | 169.600000 | 72.000000 |
| 1916.200000 | 1940.200000 | 177.400000 | 71.300000 |
| 1898.307692 | 1923.307692 | 176.153846 | 72.461538 |
| 1903.400000 | 1929.400000 | 177.533333 | 71.733333 |
| 1887.769231 | 1914.769231 | 172.884615 | 70.884615 |
| 1894.500000 | 1922.500000 | 178.500000 | 71.500000 |
| 1907.432432 | 1936.432432 | 176.297297 | 71.486486 |
| 1888.709677 | 1918.709677 | 172.774194 | 71.064516 |
| 1881.666667 | 1912.666667 | 169.259259 | 70.777778 |
| 1889.393939 | 1921.393939 | 173.333333 | 70.727273 |
| 1894.258065 | 1927.258065 | 167.290323 | 70.516129 |
| 1898.900000 | 1932.900000 | 177.040000 | 71.820000 |
| 1899.135135 | 1934.135135 | 183.405405 | 71.756757 |
| 1891.051282 | 1927.051282 | 176.717949 | 70.128205 |
| 1886.538462 | 1923.538462 | 171.461538 | 70.333333 |
| 1892.083333 | 1930.083333 | 178.250000 | 71.354167 |
| 1897.589744 | 1936.589744 | 179.435897 | 71.641026 |
| 1892.311111 | 1932.311111 | 178.555556 | 71.133333 |
| 1893.500000 | 1934.500000 | 177.704545 | 70.727273 |
| 1893.225000 | 1935.225000 | 179.225000 | 71.275000 |
| 1891.204082 | 1934.204082 | 175.673469 | 70.816327 |
| 1885.344262 | 1929.344262 | 173.016393 | 70.377049 |
| 1898.121212 | 1943.121212 | 178.848485 | 71.136364 |
| 1893.938776 | 1939.938776 | 179.040816 | 71.061224 |
| 1893.441558 | 1940.441558 | 175.012987 | 70.805195 |
| 1894.000000 | 1942.000000 | 174.164557 | 70.949367 |
| 1894.213115 | 1943.213115 | 175.590164 | 70.868852 |
| ... | ... | ... | ... |
| 1900.285024 | 1975.285024 | 174.782609 | 71.164251 |
| 1897.894977 | 1973.894977 | 175.808219 | 71.118721 |
| 1897.607143 | 1974.607143 | 173.991071 | 71.004464 |
| 1897.606635 | 1975.606635 | 176.327014 | 71.033175 |
| 1898.990991 | 1977.990991 | 175.644144 | 71.157658 |
| 1899.351240 | 1979.351240 | 177.000000 | 71.190083 |
| 1899.879630 | 1980.879630 | 176.351852 | 70.925926 |
| 1900.754464 | 1982.754464 | 176.075893 | 71.281250 |
| 1901.454128 | 1984.454128 | 175.665138 | 71.243119 |
| 1898.257895 | 1982.257895 | 175.415789 | 70.915789 |
| 1900.005263 | 1985.005263 | 172.215789 | 70.968421 |
| 1903.913978 | 1989.913978 | 175.811828 | 71.209677 |
| 1897.798611 | 1984.798611 | 175.402778 | 71.090278 |
| 1904.540541 | 1992.540541 | 177.425676 | 71.533784 |
| 1900.299213 | 1989.299213 | 174.866142 | 71.228346 |
| 1901.486726 | 1991.486726 | 173.495575 | 70.858407 |
| 1899.068182 | 1990.068182 | 173.750000 | 70.681818 |
| 1901.673684 | 1993.673684 | 175.831579 | 71.157895 |
| 1901.513158 | 1994.513158 | 173.828947 | 71.000000 |
| 1898.088889 | 1992.088889 | 173.533333 | 71.311111 |
| 1899.461538 | 1994.461538 | 172.576923 | 70.826923 |
| 1902.222222 | 1998.222222 | 176.500000 | 71.111111 |
| 1893.647059 | 1990.647059 | 171.823529 | 70.352941 |
| 1900.882353 | 1998.882353 | 174.705882 | 70.705882 |
| 1897.222222 | 1996.222222 | 163.444444 | 69.666667 |
| 1899.700000 | 1999.700000 | 168.600000 | 70.100000 |
| 1900.400000 | 2001.400000 | 167.000000 | 70.400000 |
| 1900.000000 | 2002.000000 | 165.000000 | 71.000000 |
| 1911.000000 | 2014.000000 | 158.000000 | 65.000000 |
| 1891.000000 | 1998.000000 | 162.000000 | 69.000000 |
85 rows × 4 columns
    # Extract the age
    age_df = pd.DataFrame(f, columns=['age', 'weight', 'height'])
    age_df['age'] = f.index
    # Draw the plots
    print_plot(age_df, 'weight', 'weight-age')
    print_plot(age_df, 'height', 'height-age')

[Figure: mean weight by lifespan]
[Figure: mean height by lifespan]

    # Compute the correlation coefficients
    age_df.corr()

| | age | weight | height |
| --- | --- | --- | --- |
| age | 1.000000 | -0.430298 | -0.371683 |
| weight | -0.430298 | 1.000000 | 0.724237 |
| height | -0.371683 | 0.724237 | 1.000000 |
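Lifespan is weakly and negatively correlated with both weight (-0.43) and height (-0.37). These correlations are much weaker than the correlation of height and weight with birth year. Still, they suggest that height and weight may influence lifespan to some extent.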
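One caveat is that birth year may sit behind both variables: players born earlier tend to be shorter and lighter, and only players who have already died are included. A reasonable sanity check, sketched here under made-up data, is to compute the player-level correlations directly and to look at lifespan against birth year alongside height. The `toy` DataFrame is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01', 'd01', 'e01'],
    'birthYear': [1900.0, 1910.0, 1920.0, 1930.0, 1980.0],
    'deathYear': [1980.0, 1975.0, 1990.0, 1995.0, np.nan],
    'height': [70.0, 72.0, 71.0, 73.0, 75.0],
})

# Keep only players with a recorded death year, then compute lifespan.
deceased = toy.dropna(subset=['deathYear']).copy()
deceased['Age'] = deceased['deathYear'] - deceased['birthYear']

# Player-level correlations with lifespan, without averaging per age first.
corr_height = deceased['Age'].corr(deceased['height'])
corr_birth = deceased['Age'].corr(deceased['birthYear'])

print(deceased['Age'].tolist())  # [80.0, 65.0, 70.0, 65.0]
print(corr_height < 0, corr_birth < 0)  # True True
```

If lifespan is also correlated with birth year, as in this toy example, the height and weight correlations cannot be read as a direct effect on lifespan without controlling for birth year.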