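Pandas in Action: The DataFrame Object
This article covers the following topics:
1. DataFrame overview
2. Similarities between Series and DataFrame
3. Sorting a DataFrame by values
4. Sorting a DataFrame by index
5. Setting a new index
6. Selecting columns from a DataFrame
7. Selecting rows from a DataFrame
8. Renaming columns or rows
9. Resetting the index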
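The DataFrame is the other main data structure in pandas. It is two-dimensional, made up of rows and columns, so extracting a given value from the dataset requires two reference points: a row and a column.
1. DataFrame overview
A DataFrame can be described as a grid or data table, similar to the one found in a spreadsheet application such as Excel.
1.1 Creating a DataFrame from a dictionary
As usual, let's start by importing pandas. We will also use the NumPy library to generate some random data: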
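In [1]: import pandas as pd
        import numpy as np

Before importing our first dataset, let's practice instantiating a DataFrame from some of Python's built-in objects. One example is a dictionary: its keys are used as the column names, and the corresponding values are used as the values of each column.
The next example uses three lists of equal length to store cities, countries, and populations. Other iterables, such as tuples or Series, can be used in place of the lists. The first parameter of the DataFrame class, data, specifies the data source.

In [2]: city_data = {
            "City": ["New York City", "Paris", "Barcelona", "Rome"],
            "Country": ["United States", "France", "Spain", "Italy"],
            "Population": [8600000, 2141000, 5515000, 2873000]
        }
        cities = pd.DataFrame(city_data)
        cities

Out [2]:
            City        Country  Population
0  New York City  United States     8600000
1          Paris         France     2141000
2      Barcelona          Spain     5515000
3           Rome          Italy     2873000

We have created our first DataFrame, and it looks a lot like a Series. As a reminder, when we do not give the constructor an explicit index, pandas generates a sequential index starting from 0 by default.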
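As a small sketch of the two points above (other iterables can stand in for lists, and pandas supplies a default index), the following builds a DataFrame from a tuple, a Series, and a list. The sample values and the variable name sample are made up for illustration and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical sample data: each column comes from a different kind of iterable.
sample_data = {
    "City": ("Tokyo", "Lima"),                 # tuple
    "Country": pd.Series(["Japan", "Peru"]),   # Series
    "Continent": ["Asia", "South America"],    # list
}
sample = pd.DataFrame(sample_data)

# No index was passed, so pandas generates the labels 0 and 1.
print(sample.index.tolist())
# The dictionary keys become the column names, in insertion order.
print(sample.columns.tolist())
```

All three iterables must describe the same number of rows; here each one holds two values.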
If we want to flip the data so that the column headers become the index labels, we can call the transpose method or use the T attribute:

In [3]: cities.transpose()  # equivalent to the line below
        cities.T

Out [3]:
                        0        1          2        3
City        New York City    Paris  Barcelona     Rome
Country     United States   France      Spain    Italy
Population        8600000  2141000    5515000  2873000

The from_dict method offers a more convenient alternative for converting a dictionary into a DataFrame. Its orient parameter controls how the labels are created: to store the dictionary keys as the DataFrame's columns, use the default value "columns"; to store the keys as rows, use the value "index":

In [4]: pd.DataFrame.from_dict(data = city_data, orient = "index")

Out [4]:
                        0        1          2        3
City        New York City    Paris  Barcelona     Rome
Country     United States   France      Spain    Italy
Population        8600000  2141000    5515000  2873000

1.2 Creating a DataFrame from a NumPy array
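The DataFrame constructor also accepts a NumPy ndarray. Suppose we want to create a 3x5 DataFrame of integers between 1 and 100 (inclusive). We can use the randint function from NumPy's random module:

In [5]: data = np.random.randint(1, 101, [3, 5])
        data

Out [5]: array([[25, 22, 80, 43, 42],
                [40, 89,  7, 21, 25],
                [89, 71, 32, 28, 39]])

Next, we pass the ndarray to the DataFrame constructor. As with rows, pandas assigns each column a numeric index when no custom column headers are provided:

In [6]: pd.DataFrame(data = data)

Out [6]:
    0   1   2   3   4
0  25  22  80  43  42
1  40  89   7  21  25
2  89  71  32  28  39

We can pass an iterable sequence, such as a list, tuple, or ndarray, to the index parameter to use as the row labels. Note that the length of the iterable must equal the number of rows in the dataset. This is a 3x5 table, so we must provide 3 labels for the index.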
In [7]: index = ["Morning", "Afternoon", "Evening"]
        temperatures = pd.DataFrame(data = data, index = index)
        temperatures

Out [7]:
            0   1   2   3   4
Morning    25  22  80  43  42
Afternoon  40  89   7  21  25
Evening    89  71  32  28  39

The columns parameter lets us set the DataFrame's column names (also known as the vertical labels). Because we have 5 columns in total, the iterable must have a length of 5. The next example sets the column names by storing the headers in a tuple:

In [8]: index = ["Morning", "Afternoon", "Evening"]
        columns = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
        temperatures = pd.DataFrame(data = data,
                                    index = index,
                                    columns = columns)
        temperatures

Out [8]:
           Monday  Tuesday  Wednesday  Thursday  Friday
Morning        25       22         80        43      42
Afternoon      40       89          7        21      25
Evening        89       71         32        28      39

Both the row index and the column index may contain duplicate values:

In [9]: index = ["Morning", "Afternoon", "Morning"]
        columns = ("Monday", "Tuesday", "Wednesday", "Tuesday", "Friday")
        temperatures = pd.DataFrame(data = data,
                                    index = index,
                                    columns = columns)
        temperatures

Out [9]:
           Monday  Tuesday  Wednesday  Tuesday  Friday
Morning        25       22         80       43      42
Afternoon      40       89          7       21      25
Morning        89       71         32       28      39

2. Similarities between Series and DataFrame
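Many of the Series attributes and methods introduced earlier also work on a DataFrame.
2.1 Importing a CSV dataset
The nba.csv file is a dataset of NBA players for the 2019-2020 season. It includes each player's name, team, position, birthday, and salary: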
In [10]: pd.read_csv("nba.csv")

Out [10]:
               Name                 Team Position  Birthday    Salary
0      Shake Milton   Philadelphia 76ers       SG   9/26/96   1445697
1    Christian Wood      Detroit Pistons       PF   9/27/95   1645357
2     PJ Washington    Charlotte Hornets       PF   8/23/98   3831840
3      Derrick Rose      Detroit Pistons       PG   10/4/88   7317074
4     Marial Shayok   Philadelphia 76ers        G   7/26/95     79568
…                 …                    …        …         …         …
445   Austin Rivers      Houston Rockets       PG    8/1/92   2174310
446     Harry Giles     Sacramento Kings       PF   4/22/98   2578800
447     Robin Lopez      Milwaukee Bucks        C    4/1/88   4767000
448   Collin Sexton  Cleveland Cavaliers       PG    1/4/99   4764960
449     Ricky Rubio         Phoenix Suns       PG  10/21/90  16200000

450 rows × 5 columns

Note that the values in the Birthday column are imported as strings by default rather than as datetime objects. As introduced earlier, the parse_dates parameter coerces the values to datetimes, which are displayed consistently in YYYY-MM-DD format:

In [11]: pd.read_csv("nba.csv", parse_dates = ["Birthday"])

Out [11]:
               Name                 Team Position    Birthday    Salary
0      Shake Milton   Philadelphia 76ers       SG  1996-09-26   1445697
1    Christian Wood      Detroit Pistons       PF  1995-09-27   1645357
2     PJ Washington    Charlotte Hornets       PF  1998-08-23   3831840
3      Derrick Rose      Detroit Pistons       PG  1988-10-04   7317074
4     Marial Shayok   Philadelphia 76ers        G  1995-07-26     79568
…                 …                    …        …           …         …
445   Austin Rivers      Houston Rockets       PG  1992-08-01   2174310
446     Harry Giles     Sacramento Kings       PF  1998-04-22   2578800
447     Robin Lopez      Milwaukee Bucks        C  1988-04-01   4767000
448   Collin Sexton  Cleveland Cavaliers       PG  1999-01-04   4764960
449     Ricky Rubio         Phoenix Suns       PG  1990-10-21  16200000

450 rows × 5 columns

Now we can assign the DataFrame to a variable, nba:

In [12]: nba = pd.read_csv("nba.csv", parse_dates = ["Birthday"])

2.2 Shared and exclusive attributes
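The values in a Series must be of a single, homogeneous data type, while the columns of a DataFrame can hold heterogeneous data. Let's use the dtypes attribute to see the data type of each column. It returns a Series:

In [13]: nba.dtypes

Out [13]: Name                object
          Team                object
          Position            object
          Birthday    datetime64[ns]
          Salary               int64
          dtype: object

We can then call the value_counts method to count the number of columns of each data type:

In [14]: nba.dtypes.value_counts()

Out [14]: object            3
          datetime64[ns]    1
          int64             1
          dtype: int64

A DataFrame basically consists of two objects: an index made up of the row labels, and a data container that holds the values of each row. pandas ships with several index objects, each optimized to store a specific type of value (numeric, string, datetime, and so on). The index attribute returns the DataFrame's Index object. Let's look at the type of index the nba dataset uses:

In [15]: nba.index

Out [15]: RangeIndex(start=0, stop=450, step=1)

RangeIndex is an optimized index for storing sequential integer values. Much like Python's range function, RangeIndex takes three parameters: start, stop, and step (the interval between consecutive values). The output above tells us that the index starts at 0 and counts up in increments of 1 (0, 1, 2, …, 449), stopping before 450.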
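The parallel with Python's range can be checked on a tiny DataFrame. This is a minimal sketch on made-up data, not the nba dataset; the name small is an assumption for illustration.

```python
import pandas as pd

# A hypothetical four-row DataFrame created without an explicit index.
small = pd.DataFrame({"value": [10, 20, 30, 40]})
idx = small.index

# The default index is a RangeIndex exposing start, stop and step.
print(type(idx).__name__)
print(idx.start, idx.stop, idx.step)

# Its labels match the equivalent built-in range; stop is exclusive.
same_as_range = list(idx) == list(range(idx.start, idx.stop, idx.step))
print(same_as_range)
```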
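A DataFrame also has an exclusive columns attribute, which returns an Index object containing the column headers:

In [16]: nba.columns

Out [16]: Index(['Name', 'Team', 'Position', 'Birthday', 'Salary'], dtype='object')

The axes attribute returns both the row index and the column index:

In [17]: nba.axes

Out [17]: [RangeIndex(start=0, stop=450, step=1),
           Index(['Name', 'Team', 'Position', 'Birthday', 'Salary'], dtype='object')]

The shape attribute returns a tuple with the DataFrame's dimensions, here 450 rows by 5 columns:

In [18]: nba.shape

Out [18]: (450, 5)

The ndim attribute returns the number of dimensions:

In [19]: nba.ndim

Out [19]: 2

The size attribute returns the total number of values in the dataset, including missing values. It equals the number of rows multiplied by the number of columns:

In [20]: nba.size

Out [20]: 2250

In [21]: len(nba.index) * len(nba.columns)

Out [21]: 2250

To exclude missing values, use the count method, which returns a Series with the number of non-null values in each DataFrame column. Calling sum on the result then gives the total number of non-null values in the DataFrame. Because this dataset does not contain any missing values, the size attribute and the count method produce the same result.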
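The difference only becomes visible when values are actually missing. The sketch below uses a small made-up table (the scores DataFrame is hypothetical) with two NaN cells to show size and count diverging.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
scores = pd.DataFrame({
    "math": [90.0, np.nan, 75.0],
    "art": [np.nan, 60.0, 88.0],
})

total_cells = scores.size          # rows * columns, missing values included
per_column = scores.count()        # non-null values per column
non_null = per_column.sum()        # non-null values in the whole DataFrame

print(total_cells, non_null)
```

Here size counts all six cells, while count().sum() counts only the four cells that hold a value.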
In [22]: nba.count()

Out [22]: Name        450
          Team        450
          Position    450
          Birthday    450
          Salary      450
          dtype: int64

In [23]: nba.count().sum()

Out [23]: 2250

2.3 Shared methods
The head and tail methods return rows from the beginning or the end of the dataset:

In [24]: nba.head(2)

Out [24]:
             Name                Team Position    Birthday   Salary
0    Shake Milton  Philadelphia 76ers       SG  1996-09-26  1445697
1  Christian Wood     Detroit Pistons       PF  1995-09-27  1645357

In [25]: nba.tail(n = 3)

Out [25]:
              Name                 Team Position    Birthday    Salary
447    Robin Lopez      Milwaukee Bucks        C  1988-04-01   4767000
448  Collin Sexton  Cleveland Cavaliers       PG  1999-01-04   4764960
449    Ricky Rubio         Phoenix Suns       PG  1990-10-21  16200000

The sample method returns random rows from the DataFrame:
In [26]: nba.sample(3)

Out [26]:
                  Name                   Team Position    Birthday   Salary
348  Patrick Patterson   Los Angeles Clippers       PF  1989-03-14  3068660
14          Alec Burks  Golden State Warriors       SG  1991-07-20  2320044
228   Ignas Brazdeikis        New York Knicks       SF  1999-01-08   898310

The nunique method returns a Series with the number of unique values in each column:

In [27]: nba.nunique()

Out [27]: Name        450
          Team         30
          Position      9
          Birthday    430
          Salary      269
          dtype: int64

The max and min methods introduced earlier also work on a DataFrame. They return a Series with the largest or smallest value of each column. For a datetime column, the maximum is the latest date in chronological order.

In [28]: nba.max()

Out [28]: Team         Washington Wizards
          Position                     SG
          Birthday    2000-12-23 00:00:00
          Salary                 40231758
          dtype: object

In [29]: nba.min()

Out [29]: Team              Atlanta Hawks
          Position                      C
          Birthday    1977-01-26 00:00:00
          Salary                    79568
          dtype: object

The nlargest method returns the first n rows ordered by the largest values in the specified column. Because a DataFrame can contain several sortable columns, we must use the columns parameter to specify which column to sort by. The argument can be a single column name or a list of column names. The next example returns the 4 highest-paid players in the NBA:

In [30]: nba.nlargest(n = 4, columns = "Salary")

Out [30]:
                  Name                   Team Position    Birthday    Salary
205      Stephen Curry  Golden State Warriors       PG  1988-03-14  40231758
38          Chris Paul  Oklahoma City Thunder       PG  1985-05-06  38506482
219  Russell Westbrook        Houston Rockets       PG  1988-11-12  38506482
251          John Wall     Washington Wizards       PG  1990-09-06  38199000

The nsmallest method returns the first n rows ordered by the smallest values in the specified column. The next example returns the 3 oldest players in the NBA:

In [31]: nba.nsmallest(3, columns = ["Birthday"])

Out [31]:
              Name             Team Position    Birthday   Salary
98    Vince Carter    Atlanta Hawks       PF  1977-01-26  2564753
196  Udonis Haslem       Miami Heat        C  1980-06-09  2564753
262    Kyle Korver  Milwaukee Bucks       PF  1981-03-17  6004753

To compute the sum of all NBA salaries, we can call the sum method directly, but we must set the numeric_only parameter to True so that only the numeric columns are summed.
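To see what numeric_only does on a table that mixes text and numbers, here is a minimal sketch on made-up data. The players DataFrame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical mixed-type table: one string column, two integer columns.
players = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Salary": [100, 250, 50],
    "Bonus": [10, 0, 5],
})

# Only the numeric columns take part in the aggregation.
totals = players.sum(numeric_only=True)
averages = players.mean(numeric_only=True)

print(totals)
print(averages)
```

The string column Name is left out of both results, so each returned Series has exactly two entries.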
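In [32]: nba.sum(numeric_only = True)

Out [32]: Salary    3444112694
          dtype: int64

The mean method computes the average salary. As with sum, we pass numeric_only = True so that only the numeric columns are included; recent pandas versions raise an error otherwise:

In [33]: nba.mean(numeric_only = True)

Out [33]: Salary    7.653584e+06
          dtype: float64

The median method computes the median salary, and the std method computes the standard deviation:

In [34]: nba.median(numeric_only = True)

Out [34]: Salary    3303074.5
          dtype: float64

In [35]: nba.std(numeric_only = True)

Out [35]: Salary    9.288810e+06
          dtype: float64

3. Sorting a DataFrame by values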
We can use the sort_values method to sort a DataFrame by one or more columns. By default, the method returns a new DataFrame.
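The "returns a new DataFrame" behaviour can be verified with a short sketch. The three-row table below is made up for illustration.

```python
import pandas as pd

# Hypothetical unsorted data.
df = pd.DataFrame({"Name": ["Cy", "Al", "Bo"], "Salary": [3, 1, 2]})

# sort_values returns a sorted copy; df itself keeps its original order.
by_name = df.sort_values(by="Name")

print(by_name["Name"].tolist())
print(df["Name"].tolist())
# Each row keeps its original index label after sorting.
print(by_name.index.tolist())
```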
3.1 Sorting by a single column
First, let's sort the players by name. The by parameter specifies the column to sort by:

In [36]: nba.sort_values("Name")  # equivalent to the line below
         nba.sort_values(by = "Name")

Out [36]:
                  Name                   Team Position    Birthday    Salary
52        Aaron Gordon          Orlando Magic       PF  1995-09-16  19863636
101      Aaron Holiday         Indiana Pacers       PG  1996-09-30   2239200
437        Abdel Nader  Oklahoma City Thunder       SF  1993-09-25   1618520
81         Adam Mokoka          Chicago Bulls        G  1998-07-18     79568
399  Admiral Schofield     Washington Wizards       SF  1997-03-30   1000000
…                    …                      …        …           …         …
159        Zach LaVine          Chicago Bulls       PG  1995-03-10  19500000
302       Zach Norvell     Los Angeles Lakers       SG  1997-12-09     79568
312       Zhaire Smith     Philadelphia 76ers       SG  1999-06-04   3058800
137    Zion Williamson   New Orleans Pelicans        F  2000-07-06   9757440
248     Zylan Cheatham   New Orleans Pelicans       SF  1995-11-17     79568

450 rows × 5 columns

The ascending parameter specifies whether to sort in ascending or descending order. We can use it to find the five youngest players in the NBA: sort the Birthday column in descending order and call head, which returns the first 5 rows by default:

In [37]: nba.sort_values("Birthday", ascending = False).head()

Out [37]:
                    Name                  Team Position    Birthday   Salary
136      Sekou Doumbouya       Detroit Pistons       SF  2000-12-23  3285120
432  Talen Horton-Tucker    Los Angeles Lakers       GF  2000-11-25   898310
137      Zion Williamson  New Orleans Pelicans        F  2000-07-06  9757440
313           RJ Barrett       New York Knicks       SG  2000-06-14  7839960
392         Jalen Lecque          Phoenix Suns        G  2000-06-13   898310

3.2 Sorting by multiple columns
The by parameter of sort_values also accepts multiple columns. By default, every column is sorted in ascending order; that is, ascending defaults to True. The next example sorts the teams alphabetically and then sorts the players within each team by name:

In [38]: nba.sort_values(by = ["Team", "Name"])

Out [38]:
                Name                Team Position    Birthday    Salary
359         Alex Len       Atlanta Hawks        C  1993-06-16   4160000
167     Allen Crabbe       Atlanta Hawks       SG  1992-04-09  18500000
276  Brandon Goodwin       Atlanta Hawks       PG  1995-10-02     79568
438   Bruno Fernando       Atlanta Hawks        C  1998-08-15   1400000
194      Cam Reddish       Atlanta Hawks       SF  1999-09-01   4245720
…                  …                   …        …           …         …
418     Jordan McRae  Washington Wizards       PG  1991-03-28   1645357
273  Justin Robinson  Washington Wizards       PG  1997-10-12    898310
428    Moritz Wagner  Washington Wizards        C  1997-04-26   2063520
21     Rui Hachimura  Washington Wizards       PF  1998-02-08   4469160
36     Thomas Bryant  Washington Wizards        C  1997-07-31   8000000

450 rows × 5 columns

Passing a single Boolean to the ascending parameter applies the same sort order to every column:

In [39]: nba.sort_values(["Team", "Name"], ascending = False)

Out [39]:
                Name                Team Position    Birthday    Salary
36     Thomas Bryant  Washington Wizards        C  1997-07-31   8000000
21     Rui Hachimura  Washington Wizards       PF  1998-02-08   4469160
428    Moritz Wagner  Washington Wizards        C  1997-04-26   2063520
273  Justin Robinson  Washington Wizards       PG  1997-10-12    898310
418     Jordan McRae  Washington Wizards       PG  1991-03-28   1645357
…                  …                   …        …           …         …
194      Cam Reddish       Atlanta Hawks       SF  1999-09-01   4245720
438   Bruno Fernando       Atlanta Hawks        C  1998-08-15   1400000
276  Brandon Goodwin       Atlanta Hawks       PG  1995-10-02     79568
167     Allen Crabbe       Atlanta Hawks       SG  1992-04-09  18500000
359         Alex Len       Atlanta Hawks        C  1993-06-16   4160000

450 rows × 5 columns

Sometimes we want a different sort order for each column, for example sorting the teams in ascending order and then the salaries within each team in descending order. For this purpose, the ascending parameter also accepts a list. Each Boolean in it is paired with the corresponding value of the by parameter, so the by and ascending lists must have the same length.
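The pairing between by and ascending can be tried on a small made-up roster. This is a sketch under assumed data; the roster DataFrame is hypothetical, and the last part relies on pandas raising a ValueError when the two lists differ in length.

```python
import pandas as pd

# Hypothetical roster with two teams.
roster = pd.DataFrame({
    "Team": ["Suns", "Bulls", "Suns", "Bulls"],
    "Salary": [5, 7, 9, 3],
})

# Team ascending (A to Z), then Salary descending within each team.
ordered = roster.sort_values(by=["Team", "Salary"], ascending=[True, False])
print(ordered)

# A list of a different length than `by` is rejected.
try:
    roster.sort_values(by=["Team", "Salary"], ascending=[True])
    length_mismatch_rejected = False
except ValueError:
    length_mismatch_rejected = True
print(length_mismatch_rejected)
```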
In [40]: nba.sort_values(by = ["Team", "Salary"],
                         ascending = [True, False])

Out [40]:
                  Name                Team Position    Birthday    Salary
111   Chandler Parsons       Atlanta Hawks       SF  1988-10-25  25102512
28         Evan Turner       Atlanta Hawks       PG  1988-10-27  18606556
167       Allen Crabbe       Atlanta Hawks       SG  1992-04-09  18500000
213    De'Andre Hunter       Atlanta Hawks       SF  1997-12-02   7068360
339      Jabari Parker       Atlanta Hawks       PF  1995-03-15   6500000
…                    …                   …        …           …         …
80         Isaac Bonga  Washington Wizards       PG  1999-11-08   1416852
399  Admiral Schofield  Washington Wizards       SF  1997-03-30   1000000
273    Justin Robinson  Washington Wizards       PG  1997-10-12    898310
283   Garrison Mathews  Washington Wizards       SG  1996-10-24     79568
353      Chris Chiozza  Washington Wizards       PG  1995-11-21     79568

450 rows × 5 columns

As with a Series, the inplace parameter modifies the original DataFrame instead of returning a new one. The call produces no output in Jupyter Notebook:

In [41]: nba.sort_values(by = ["Team", "Salary"],
                         ascending = [True, False],
                         inplace = True)

4. Sorting a DataFrame by index
Using the inplace parameter changed the original DataFrame, but we have a way to restore its original order.
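Restoring the order works because each row keeps its index label when the values are sorted. The following sketch shows the round trip on made-up data; the DataFrame and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data with the default 0, 1, 2 index.
df = pd.DataFrame({"Team": ["Suns", "Bulls", "Heat"], "Salary": [5, 7, 9]})

# An in-place value sort reorders the rows; the labels travel with them.
df.sort_values(by="Team", inplace=True)
shuffled = df.index.tolist()

# Sorting by index puts the rows back in their original order.
restored = df.sort_index()

print(shuffled)
print(restored)
```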
4.1 Sorting by row index
The sort_index method sorts a DataFrame by its index values:

In [42]: nba.sort_index().head()  # equivalent to the line below
         nba.sort_index(ascending = True).head()

Out [42]:
             Name                Team Position    Birthday   Salary
0    Shake Milton  Philadelphia 76ers       SG  1996-09-26  1445697
1  Christian Wood     Detroit Pistons       PF  1995-09-27  1645357
2   PJ Washington   Charlotte Hornets       PF  1998-08-23  3831840
3    Derrick Rose     Detroit Pistons       PG  1988-10-04  7317074
4   Marial Shayok  Philadelphia 76ers        G  1995-07-26    79568

Use the inplace parameter to make the change permanent:

In [43]: nba.sort_index(inplace = True)

4.2 Sorting by column index
To sort by the column labels, use the axis parameter and set it to 1 or "columns":

In [44]: nba.sort_index(axis = 1).head()  # these three lines are equivalent
         nba.sort_index(axis = "columns").head()
         nba.sort_index(axis = "columns", ascending = True).head()

Out [44]:
     Birthday            Name Position   Salary                Team
0  1996-09-26    Shake Milton       SG  1445697  Philadelphia 76ers
1  1995-09-27  Christian Wood       PF  1645357     Detroit Pistons
2  1998-08-23   PJ Washington       PF  3831840   Charlotte Hornets
3  1988-10-04    Derrick Rose       PG  7317074     Detroit Pistons
4  1995-07-26   Marial Shayok        G    79568  Philadelphia 76ers

5. Setting a new index
To set a new index, use the set_index method. It returns a new DataFrame with the given column as its index:

In [45]: nba.set_index(keys = "Name")  # equivalent to the line below
         nba.set_index("Name")

Out [45]:
                               Team Position    Birthday    Salary
Name
Shake Milton     Philadelphia 76ers       SG  1996-09-26   1445697
Christian Wood      Detroit Pistons       PF  1995-09-27   1645357
PJ Washington     Charlotte Hornets       PF  1998-08-23   3831840
Derrick Rose        Detroit Pistons       PG  1988-10-04   7317074
Marial Shayok    Philadelphia 76ers        G  1995-07-26     79568
…                                 …        …           …         …
Austin Rivers       Houston Rockets       PG  1992-08-01   2174310
Harry Giles        Sacramento Kings       PF  1998-04-22   2578800
Robin Lopez         Milwaukee Bucks        C  1988-04-01   4767000
Collin Sexton   Cleveland Cavaliers       PG  1999-01-04   4764960
Ricky Rubio            Phoenix Suns       PG  1990-10-21  16200000

450 rows × 4 columns

Use the inplace parameter to make the change permanent:

In [46]: nba.set_index(keys = "Name", inplace = True)

If we already know which column should serve as the index when importing the dataset, we can also use the index_col parameter of the read_csv function:

In [47]: nba = pd.read_csv("nba.csv",
                           parse_dates = ["Birthday"],
                           index_col = "Name")

6. Selecting columns from a DataFrame
A DataFrame is a collection of Series objects that share the same index. We can easily read one or more of these columns from the DataFrame.
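The "collection of Series with a shared index" view can be checked directly. The following is a minimal sketch on made-up data; the names and values are hypothetical.

```python
import pandas as pd

# Hypothetical two-column DataFrame with string row labels.
df = pd.DataFrame(
    {"Team": ["Suns", "Bulls"], "Salary": [5, 7]},
    index=["Ann", "Bob"],
)

team = df["Team"]
salary = df["Salary"]

# Each column is a Series named after the column...
print(type(team).__name__, team.name)
# ...and every column carries the same index as the DataFrame.
print(team.index.equals(df.index), salary.index.equals(df.index))
```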
6.1 Selecting a single column from a DataFrame
Each column is available as an attribute of the DataFrame and is returned as a Series. For example, we can use nba.Salary to read the Salary column:

In [48]: nba.Salary

Out [48]: Name
          Shake Milton       1445697
          Christian Wood     1645357
          PJ Washington      3831840
          Derrick Rose       7317074
          Marial Shayok        79568
                              ...
          Austin Rivers      2174310
          Harry Giles        2578800
          Robin Lopez        4767000
          Collin Sexton      4764960
          Ricky Rubio       16200000
          Name: Salary, Length: 450, dtype: int64

If you prefer to always work with a two-dimensional data structure, you can use the to_frame method to convert the Series to a DataFrame:

In [49]: nba.Salary.to_frame()

Out [49]:
                  Salary
Name
Shake Milton     1445697
Christian Wood   1645357
PJ Washington    3831840
Derrick Rose     7317074
Marial Shayok      79568
…                      …
Austin Rivers    2174310
Harry Giles      2578800
Robin Lopez      4767000
Collin Sexton    4764960
Ricky Rubio     16200000

450 rows × 1 columns

A column can also be read by passing its name in square brackets. The advantage of this approach is that it supports column names that contain spaces.
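A short sketch of why bracket access is the more general choice. The data is made up; the second DataFrame illustrates a related limitation of attribute access that the article does not mention, namely a column whose name collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical columns: one name contains spaces, one does not.
df = pd.DataFrame({"Date of Birth": ["1990-01-01"], "Pay": [100]})

dob = df["Date of Birth"]   # brackets work for any column name
pay = df.Pay                # attribute access needs a valid Python identifier

# A column named like a DataFrame method is shadowed by that method.
clash = pd.DataFrame({"count": [1, 2]})
attribute_is_method = callable(clash.count)   # the method, not the column
column_values = clash["count"].tolist()       # brackets still reach the column

print(dob.iloc[0], pay.iloc[0], attribute_is_method, column_values)
```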
In [50]: nba["Position"]

Out [50]: Name
          Shake Milton      SG
          Christian Wood    PF
          PJ Washington     PF
          Derrick Rose      PG
          Marial Shayok      G
                            ..
          Austin Rivers     PG
          Harry Giles       PF
          Robin Lopez        C
          Collin Sexton     PG
          Ricky Rubio       PG
          Name: Position, Length: 450, dtype: object

6.2 Selecting multiple columns from a DataFrame
To read multiple columns, pass a list of column names inside the square brackets. The result is a new DataFrame whose columns appear in the same order as in the list, which makes this an effective way to rearrange the columns of a DataFrame:

In [51]: nba[["Salary", "Birthday"]]

Out [51]:
                  Salary    Birthday
Name
Shake Milton     1445697  1996-09-26
Christian Wood   1645357  1995-09-27
PJ Washington    3831840  1998-08-23
Derrick Rose     7317074  1988-10-04
Marial Shayok      79568  1995-07-26
…                      …           …
Austin Rivers    2174310  1992-08-01
Harry Giles      2578800  1998-04-22
Robin Lopez      4767000  1988-04-01
Collin Sexton    4764960  1999-01-04
Ricky Rubio     16200000  1990-10-21

450 rows × 2 columns

To select columns based on their data types, use the select_dtypes method. Its include and exclude parameters each accept a single data type or a list of data types. As a reminder, the dtypes attribute shows the data types present in the dataset.

In [52]: # select only the string columns
         nba.select_dtypes(include = "object")

Out [52]:
                               Team Position
Name
Shake Milton     Philadelphia 76ers       SG
Christian Wood      Detroit Pistons       PF
PJ Washington     Charlotte Hornets       PF
Derrick Rose        Detroit Pistons       PG
Marial Shayok    Philadelphia 76ers        G
…                                 …        …
Austin Rivers       Houston Rockets       PG
Harry Giles        Sacramento Kings       PF
Robin Lopez         Milwaukee Bucks        C
Collin Sexton   Cleveland Cavaliers       PG
Ricky Rubio            Phoenix Suns       PG

450 rows × 2 columns

In [53]: # exclude the string and integer columns
         nba.select_dtypes(exclude = ["object", "int"])

Out [53]:
                  Birthday
Name
Shake Milton    1996-09-26
Christian Wood  1995-09-27
PJ Washington   1998-08-23
Derrick Rose    1988-10-04
Marial Shayok   1995-07-26
…                        …
Austin Rivers   1992-08-01
Harry Giles     1998-04-22
Robin Lopez     1988-04-01
Collin Sexton   1999-01-04
Ricky Rubio     1990-10-21

450 rows × 1 columns

7. Selecting rows from a DataFrame
Rows in a DataFrame can be read by index label or by index position.
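The two access styles can be compared side by side on a tiny made-up table. This is a sketch; the row labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"Team": ["Suns", "Bulls", "Heat"], "Salary": [5, 7, 9]},
    index=["Ann", "Bob", "Cy"],
)

by_label = df.loc["Bob"]       # look up by index label
by_position = df.iloc[1]       # look up by index position (0-based)

# Label slices include both endpoints; position slices exclude the end.
label_slice = df.loc["Ann":"Bob"]
position_slice = df.iloc[0:1]

print(by_label.equals(by_position), len(label_slice), len(position_slice))
```

The same row comes back either way, but the two slices differ in length because only the label-based slice includes its right endpoint.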
7.1 Selecting rows by index label
The loc attribute returns the row with the given index label as a Series. Note that labels are case-sensitive. The next example returns the row whose index label is "LeBron James":

In [54]: nba.loc["LeBron James"]

Out [54]: Team         Los Angeles Lakers
          Position                     PF
          Birthday    1984-12-30 00:00:00
          Salary                 37436858
          Name: LeBron James, dtype: object

We can also pass a list to read multiple rows. The result is a DataFrame whose rows follow the order of the list:

In [55]: nba.loc[["Kawhi Leonard", "Paul George"]]

Out [55]:
                               Team Position    Birthday    Salary
Name
Kawhi Leonard  Los Angeles Clippers       SF  1991-06-29  32742000
Paul George    Los Angeles Clippers       SF  1990-05-02  33005556

pandas also supports Python's list slicing syntax. For example, we can sort the index labels to put the players' names in alphabetical order and then select all players between Otto Porter and Patrick Beverley. Note that the players at both endpoints are included:

In [56]: nba.sort_index().loc["Otto Porter":"Patrick Beverley"]

Out [56]:
                                  Team Position    Birthday    Salary
Name
Otto Porter              Chicago Bulls       SF  1993-06-03  27250576
PJ Dozier               Denver Nuggets       PG  1996-10-25     79568
PJ Washington        Charlotte Hornets       PF  1998-08-23   3831840
Pascal Siakam          Toronto Raptors       PF  1994-04-02   2351838
Pat Connaughton        Milwaukee Bucks       SG  1993-01-06   1723050
Patrick Beverley  Los Angeles Clippers       PG  1988-07-12  12345680

We can also read from a given row through the last row of the DataFrame. The syntax is the same as slicing a Python list: add a colon after the starting index label:

In [57]: nba.sort_index().loc["Zach Collins":]

Out [57]:
                                   Team Position    Birthday    Salary
Name
Zach Collins     Portland Trail Blazers        C  1997-11-19   4240200
Zach LaVine               Chicago Bulls       PG  1995-03-10  19500000
Zach Norvell         Los Angeles Lakers       SG  1997-12-09     79568
Zhaire Smith         Philadelphia 76ers       SG  1999-06-04   3058800
Zion Williamson    New Orleans Pelicans        F  2000-07-06   9757440
Zylan Cheatham     New Orleans Pelicans       SF  1995-11-17     79568

Similarly, we can read from the beginning of the DataFrame up to a given row. The next example returns the rows from the beginning through "Al Horford":

In [58]: nba.sort_index().loc[:"Al Horford"]

Out [58]:
                                    Team Position    Birthday    Salary
Name
Aaron Gordon               Orlando Magic       PF  1995-09-16  19863636
Aaron Holiday             Indiana Pacers       PG  1996-09-30   2239200
Abdel Nader        Oklahoma City Thunder       SF  1993-09-25   1618520
Adam Mokoka                Chicago Bulls        G  1998-07-18     79568
Admiral Schofield     Washington Wizards       SF  1997-03-30   1000000
Al Horford            Philadelphia 76ers        C  1986-06-03  28000000

If the specified index label does not exist in the DataFrame, a KeyError exception is raised:

In [59]: nba.loc["Bugs Bunny"]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

KeyError: 'Bugs Bunny'

7.2 Selecting rows by index position
The iloc (index location) attribute returns one or more rows at the given index positions. The argument can be a single integer or a list of integers:

In [60]: nba.iloc[300]

Out [60]: Team             Denver Nuggets
          Position                     PF
          Birthday    1999-04-03 00:00:00
          Salary                  1416852
          Name: Jarred Vanderbilt, dtype: object

In [61]: nba.iloc[[100, 200, 300, 400]]

Out [61]:
                                Team Position    Birthday   Salary
Name
Brian Bowen           Indiana Pacers       SG  1998-10-02    79568
Marco Belinelli    San Antonio Spurs       SF  1986-03-25  5846154
Jarred Vanderbilt     Denver Nuggets       PF  1999-04-03  1416852
Louis King           Detroit Pistons        F  1999-04-06    79568

List slicing syntax applies here as well. Note that the row at the endpoint given after the colon is excluded. The next example returns the rows at index positions 400, 401, 402, and 403:

In [62]: nba.iloc[400:404]

Out [62]:
                                    Team Position    Birthday    Salary
Name
Louis King               Detroit Pistons        F  1999-04-06     79568
Kostas Antetokounmpo  Los Angeles Lakers       PF  1997-11-20     79568
Rodions Kurucs             Brooklyn Nets       PF  1998-02-05   1699236
Spencer Dinwiddie          Brooklyn Nets       PG  1993-04-06  10605600

In [63]: nba.iloc[:2]  # read the first two rows

Out [63]:
                              Team Position    Birthday   Salary
Name
Shake Milton    Philadelphia 76ers       SG  1996-09-26  1445697
Christian Wood     Detroit Pistons       PF  1995-09-27  1645357

In [64]: nba.iloc[447:]  # read from index position 447 to the end

Out [64]:
                              Team Position    Birthday    Salary
Name
Robin Lopez        Milwaukee Bucks        C  1988-04-01   4767000
Collin Sexton  Cleveland Cavaliers       PG  1999-01-04   4764960
Ricky Rubio           Phoenix Suns       PG  1990-10-21  16200000

Index positions can also be negative, which counts from the end. The next example selects from the 10th-to-last row up to, but not including, the 6th-to-last row:

In [65]: nba.iloc[-10:-6]

Out [65]:
                                    Team Position    Birthday   Salary
Name
Jared Dudley          Los Angeles Lakers       PF  1985-07-10  2564753
Max Strus                  Chicago Bulls       SG  1996-03-28    79568
Kevon Looney       Golden State Warriors        C  1996-02-06  4464286
Willy Hernangomez      Charlotte Hornets        C  1994-05-27  1557250

We can also specify a step for the index positions. In the next example, we select every second row from the first ten rows, so the result contains the rows at index positions 0, 2, 4, 6, and 8:

In [66]: nba.iloc[0:10:2]

Out [66]:
                             Team Position    Birthday    Salary
Name
Shake Milton   Philadelphia 76ers       SG  1996-09-26   1445697
PJ Washington   Charlotte Hornets       PF  1998-08-23   3831840
Marial Shayok  Philadelphia 76ers        G  1995-07-26     79568
Kendrick Nunn          Miami Heat       SG  1995-08-03   1416852
Brook Lopez       Milwaukee Bucks        C  1988-04-01  12093024

7.3 Selecting values from specific columns of a row
Both loc and iloc accept a second argument that specifies the columns to read. In the next example, we read the value of the Team column in the row with the index label "Giannis Antetokounmpo":

In [67]: nba.loc["Giannis Antetokounmpo", "Team"]

Out [67]: 'Milwaukee Bucks'

Both arguments accept lists. In the next example, the second argument is a list that selects the values of the Position and Birthday columns:

In [68]: nba.loc["James Harden", ["Position", "Birthday"]]

Out [68]: Position                     PG
          Birthday    1989-08-26 00:00:00
          Name: James Harden, dtype: object

The next example uses lists for both the first and second arguments:

In [69]: nba.loc[["Russell Westbrook", "Anthony Davis"],
                 ["Team", "Salary"]]

Out [69]:
                                 Team    Salary
Name
Russell Westbrook     Houston Rockets  38506482
Anthony Davis      Los Angeles Lakers  27093019

List slicing syntax can also be used to read multiple columns. Note that both endpoints are included here:

In [70]: nba.loc["Joel Embiid", "Position":"Salary"]

Out [70]: Position                      C
          Birthday    1994-03-16 00:00:00
          Salary                 27504630
          Name: Joel Embiid, dtype: object

The column names must be given in the order in which they appear in the DataFrame. The next example returns an empty Series because the Salary column comes after the Position column:

In [71]: nba.loc["Joel Embiid", "Salary":"Position"]

Out [71]: Series([], Name: Joel Embiid, dtype: object)

Every DataFrame column is assigned an index position. In our current DataFrame, Team has index position 0, Position has index position 1, and so on.
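One way to look up these positions programmatically is the get_loc method of the columns Index. The sketch below uses an empty DataFrame that merely reuses the article's four column names.

```python
import pandas as pd

# An empty DataFrame carrying the same column labels as the article's nba table.
df = pd.DataFrame(columns=["Team", "Position", "Birthday", "Salary"])

# Map each column name to its integer position.
positions = {name: df.columns.get_loc(name) for name in df.columns}
print(positions)
```

For unique column labels, get_loc returns a plain integer, which can then be passed to iloc as the column position.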
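In [72]: nba.columns

Out [72]: Index(['Team', 'Position', 'Birthday', 'Salary'], dtype='object')

A column's index position can also be passed as the second argument to iloc:

In [73]: nba.iloc[57, 3]

Out [73]: 796806

List slicing syntax also works here. The next example returns, for the rows at index positions 100 up to (but not including) 104, the columns from the beginning up to (but not including) index position 3:

In [74]: nba.iloc[100:104, :3]

Out [74]:
                             Team Position    Birthday
Name
Brian Bowen        Indiana Pacers       SG  1998-10-02
Aaron Holiday      Indiana Pacers       PG  1996-09-30
Troy Daniels   Los Angeles Lakers       SG  1991-07-15
Buddy Hield      Sacramento Kings       SG  1992-12-17

The iloc and loc attributes are very versatile: their arguments can be a single value, a list, a slice, and more. The downside of this flexibility is extra overhead, because pandas must check the type of the input passed to iloc or loc.
When we want to read a single value from a DataFrame, we can use two alternative attributes, at and iat. The at attribute takes a row label and a column label, while the iat attribute takes a row index position and a column index position:

In [75]: nba.at["Austin Rivers", "Birthday"]

Out [75]: Timestamp('1992-08-01 00:00:00')

In [76]: nba.iat[263, 1]

Out [76]: 'PF'

8. Renaming columns or rows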
We can rename some or all of the columns in a DataFrame by overwriting the columns attribute with a list of new names:

In [77]: nba.columns

Out [77]: Index(['Team', 'Position', 'Birthday', 'Salary'], dtype='object')

In [78]: nba.columns = ["Team", "Position", "Date of Birth", "Pay"]
         nba.head(1)

Out [78]:
                            Team Position Date of Birth      Pay
Name
Shake Milton  Philadelphia 76ers       SG    1996-09-26  1445697

We can also use the rename method. Its columns parameter takes a dictionary in which each key is an existing column name and each value is the new name. The next example renames the "Date of Birth" column back to "Birthday":

In [79]: nba.rename(columns = { "Date of Birth": "Birthday" })

Out [79]:
                               Team Position    Birthday       Pay
Name
Shake Milton     Philadelphia 76ers       SG  1996-09-26   1445697
Christian Wood      Detroit Pistons       PF  1995-09-27   1645357
PJ Washington     Charlotte Hornets       PF  1998-08-23   3831840
Derrick Rose        Detroit Pistons       PG  1988-10-04   7317074
Marial Shayok    Philadelphia 76ers        G  1995-07-26     79568
…                                 …        …           …         …
Austin Rivers       Houston Rockets       PG  1992-08-01   2174310
Harry Giles        Sacramento Kings       PF  1998-04-22   2578800
Robin Lopez         Milwaukee Bucks        C  1988-04-01   4767000
Collin Sexton   Cleveland Cavaliers       PG  1999-01-04   4764960
Ricky Rubio            Phoenix Suns       PG  1990-10-21  16200000

450 rows × 4 columns

As always, use the inplace parameter to make the change permanent:

In [80]: nba.rename(columns = { "Date of Birth": "Birthday" },
                    inplace = True)

The rename method can also rename index labels. The next example renames "Giannis Antetokounmpo" to his nickname, "Greek Freak":

In [81]: nba.loc["Giannis Antetokounmpo"]

Out [81]: Team            Milwaukee Bucks
          Position                     PF
          Birthday    1994-12-06 00:00:00
          Pay                    25842697
          Name: Giannis Antetokounmpo, dtype: object

In [82]: nba.rename(index = { "Giannis Antetokounmpo": "Greek Freak" },
                    inplace = True)

In [83]: nba.loc["Greek Freak"]

Out [83]: Team            Milwaukee Bucks
          Position                     PF
          Birthday    1994-12-06 00:00:00
          Pay                    25842697
          Name: Greek Freak, dtype: object

9. Resetting the index
If we want to use another column as the DataFrame's index, we can call the set_index method, but doing so discards the current index, Name:

In [84]: nba.set_index("Team").head()

Out [84]:
                   Position    Birthday      Pay
Team
Philadelphia 76ers       SG  1996-09-26  1445697
Detroit Pistons          PF  1995-09-27  1645357
Charlotte Hornets        PF  1998-08-23  3831840
Detroit Pistons          PG  1988-10-04  7317074
Philadelphia 76ers        G  1995-07-26    79568

To preserve the original Name index, we first use the reset_index method to move the existing index back into the DataFrame as a regular column and generate a new sequential index:

In [85]: nba.reset_index().head()

Out [85]:
             Name                Team Position    Birthday      Pay
0    Shake Milton  Philadelphia 76ers       SG  1996-09-26  1445697
1  Christian Wood     Detroit Pistons       PF  1995-09-27  1645357
2   PJ Washington   Charlotte Hornets       PF  1998-08-23  3831840
3    Derrick Rose     Detroit Pistons       PG  1988-10-04  7317074
4   Marial Shayok  Philadelphia 76ers        G  1995-07-26    79568

Now we can safely use the set_index method:

In [86]: nba.reset_index().set_index("Team").head()

Out [86]:
                              Name Position    Birthday      Pay
Team
Philadelphia 76ers    Shake Milton       SG  1996-09-26  1445697
Detroit Pistons     Christian Wood       PF  1995-09-27  1645357
Charlotte Hornets    PJ Washington       PF  1998-08-23  3831840
Detroit Pistons       Derrick Rose       PG  1988-10-04  7317074
Philadelphia 76ers   Marial Shayok        G  1995-07-26    79568

The reset_index method also supports the inplace parameter, but be careful: when it is set to True, the method does not return a new DataFrame, so set_index cannot be chained onto it directly. The two methods must be called separately, one after the other:

In [87]: nba.reset_index(inplace = True)
         nba.set_index("Name", inplace = True)
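The reason chaining fails with inplace = True is that the call returns None. The sketch below contrasts the two styles on a small made-up DataFrame; the names and values are hypothetical, and Team is used as the new index purely for illustration.

```python
import pandas as pd

# Hypothetical data indexed by a column called Name.
df = pd.DataFrame(
    {"Team": ["Suns", "Bulls"], "Pay": [5, 7]},
    index=pd.Index(["Ann", "Bob"], name="Name"),
)

# Style 1: chaining works because each call returns a new DataFrame.
chained = df.reset_index().set_index("Team")

# Style 2: with inplace=True the call returns None, so nothing can be chained.
result = df.reset_index(inplace=True)
df.set_index("Team", inplace=True)

print(result)
print(df.equals(chained))
```

Both styles end with the same table: Team as the index and Name preserved as a regular column.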