The Python Data Analysis Trio: Pandas (Part 6): GroupBy Split, Apply, and Combine

Published: 2023/12/10

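Articles in the Pandas series:

The Python Data Analysis Trio: Pandas (Part 1): Introducing Pandas and Its Series and DataFrame Objects
The Python Data Analysis Trio: Pandas (Part 2): Index Objects and Indexing Operations
The Python Data Analysis Trio: Pandas (Part 3): Arithmetic and Handling Missing Values
The Python Data Analysis Trio: Pandas (Part 4): Function Application, Mapping, Sorting, and Hierarchical Indexing
The Python Data Analysis Trio: Pandas (Part 5): Statistical Computation and Descriptive Statistics
The Python Data Analysis Trio: Pandas (Part 6): GroupBy Split, Apply, and Combine
The Python Data Analysis Trio: Pandas (Part 7): Merging Datasets
The Python Data Analysis Trio: Pandas (Part 8): Reshaping Data, Handling Duplicates, and Replacing Values
The Python Data Analysis Trio: Pandas (Part 9): Time Series
The Python Data Analysis Trio: Pandas (Part 10): Reading and Writing Data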

The NumPy and Matplotlib series are also complete:

NumPy series: https://itrhx.blog.csdn.net/category_9780393.html
Matplotlib series: https://itrhx.blog.csdn.net/category_9780418.html

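Recommended learning resources and sites (the author contributed to part of the documentation translation):

NumPy Chinese documentation site: https://www.numpy.org.cn/
Pandas Chinese documentation site: https://www.pypandas.cn/
Matplotlib Chinese documentation site: https://www.matplotlib.org.cn/
NumPy, Matplotlib, and Pandas cheat sheets: https://github.com/TRHX/Python-quick-reference-table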

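Contents

- 【01x00】The GroupBy Mechanism
- 【02x00】GroupBy Objects
- 【03x00】GroupBy Split: Splitting Data
  - 【03x01】Group Operations
  - 【03x02】Grouping Columns by Type
  - 【03x03】Custom Grouping
    - 【03x03x01】Grouping with a Dict
    - 【03x03x02】Grouping with a Function
    - 【03x03x03】Grouping by Index Level
  - 【03x04】Iterating over Groups
  - 【03x05】Converting GroupBy Objects
- 【04x00】GroupBy Apply: Applying Functions
  - 【04x01】Aggregation Functions
  - 【04x02】Custom Functions
  - 【04x03】Applying Different Functions to Different Columns
  - 【04x04】GroupBy.apply()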

Originally published on CSDN by TRHX. Blog home: https://itrhx.blog.csdn.net/ Article link: https://itrhx.blog.csdn.net/article/details/106804881 Reproduction without authorization is prohibited.

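【01x00】The GroupBy Mechanism

Grouping a dataset and applying a function to each group, whether an aggregation or a transformation, is often a central step in data analysis. Once a dataset has been loaded, merged, and prepared, the next task is usually to compute group statistics or build pivot tables. Pandas provides a flexible and efficient GroupBy facility. Although the name "group by" is borrowed from the SQL command, the idea is perhaps better described in the terms used by Hadley Wickham, the author of many popular R packages: split, apply, and combine.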

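The stages of a group operation: Split -> Apply -> Combine

Split: divide the data into groups according to some criterion;
Apply: apply a function to each group independently;
Combine: merge the results computed for each group into a single result.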
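
The three stages can be written out by hand to see what groupby() does in a single call. This is only a sketch on hypothetical toy data (the column names key and value are not from the article), not a description of how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, not taken from the article.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-frame per distinct key.
pieces = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: compute a statistic on each piece independently.
sums = {key: piece["value"].sum() for key, piece in pieces.items()}

# Combine: assemble the per-group results into one object.
manual = pd.Series(sums, name="value").sort_index()

# groupby() performs the same three stages in one call.
grouped = df.groupby("key")["value"].sum()

print(manual)
print(grouped)
```

Both Series hold the same per-key totals; the groupby() version is shorter and avoids the explicit Python loops.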

Official user guide: https://pandas.pydata.org/docs/user_guide/groupby.html

【02x00】GroupBy Objects

The two common ways to obtain a GroupBy object are Series.groupby and DataFrame.groupby. Their basic signatures are:

Series.groupby(self, by=None, axis=0, level=None, as_index: bool = True, sort: bool = True, group_keys: bool = True, squeeze: bool = False, observed: bool = False) → 'groupby_generic.SeriesGroupBy'
DataFrame.groupby(self, by=None, axis=0, level=None, as_index: bool = True, sort: bool = True, group_keys: bool = True, squeeze: bool = False, observed: bool = False) → 'groupby_generic.DataFrameGroupBy'

Official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.Series.groupby.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

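The commonly used parameters are:

Parameter	Description

by	A mapping, function, label, or list of labels that determines the groups. If by is a function, it is called on each value of the object's index. If a dict or Series is passed, its values determine the groups (a Series is first aligned; see the .align() method). If an ndarray is passed, its values are used as-is. A label or list of labels groups by the corresponding columns of the object itself. Note that a tuple is interpreted as a (single) key
axis	Axis to split along, default 0; 0 or 'index', 1 or 'columns'. Only a DataFrame accepts 1 or 'columns'. (This parameter is deprecated as of pandas 2.1)
level	If the axis is a MultiIndex (hierarchical), group by a particular level or levels; default None
as_index	bool, default True. For aggregated output, return an object with the group labels as the index. Only relevant for DataFrame input. as_index=False is effectively "SQL-style" grouped output
sort	bool, default True. Sort the group keys. Turning this off can improve performance. Note: this does not affect the order of observations within each group; groupby preserves the order of rows within each group
group_keys	bool, default True. When calling apply, whether to add the group keys to the index to identify the pieces
squeeze	bool, default False. Reduce the dimensionality of the return type if possible, otherwise return a consistent type. (Deprecated in pandas 1.1 and removed in pandas 2.0)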
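
As a small illustration of two of these parameters, the sketch below compares the default output with as_index=False and sort=False. The data is a hypothetical toy frame, chosen so that the keys first appear in the order b, a.

```python
import pandas as pd

# Hypothetical toy data: key "b" appears before key "a".
df = pd.DataFrame({"key": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]})

# Default: group labels become the index, sorted by key.
default = df.groupby("key").sum()

# as_index=False keeps the key as an ordinary column ("SQL-style" output).
flat = df.groupby("key", as_index=False).sum()

# sort=False keeps the groups in order of first appearance.
unsorted = df.groupby("key", sort=False).sum()

print(list(default.index))   # ['a', 'b']
print(list(flat.columns))    # ['key', 'value']
print(list(unsorted.index))  # ['b', 'a']
```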

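Calling groupby() only sets up the grouping. The resulting GroupBy object has not performed any actual computation; it merely holds intermediate data about the groups.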
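
A minimal sketch of this, on hypothetical toy data: the object returned by groupby() is a DataFrameGroupBy, and its grouping metadata (ngroups, groups) can be inspected before any aggregation is requested.

```python
import pandas as pd

# Hypothetical toy data, not taken from the article.
df = pd.DataFrame({"key1": ["a", "b", "a", "b"], "data1": [1.0, 2.0, 3.0, 4.0]})

grouped = df.groupby("key1")

# Nothing has been aggregated yet; this is only a GroupBy object.
print(type(grouped).__name__)         # DataFrameGroupBy

# The grouping metadata is already available.
print(grouped.ngroups)                # 2
print(sorted(grouped.groups.keys()))  # ['a', 'b']
print(list(grouped.groups["a"]))      # [0, 2]
```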

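【03x00】GroupBy Split: Splitting Data

【03x01】Group Operations

The groupby() method returns a GroupBy object that has not yet computed anything; it only contains intermediate data about the group key obj['key1']. In other words, the object already has all the information needed to run an operation on each group. For example, calling the GroupBy object's mean() method computes the group means, and size() returns the number of elements in each group: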

>>> import pandas as pd
>>> import numpy as np
>>> data = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
...         'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...         'data1': np.random.randn(8),
...         'data2': np.random.randn(8)}
>>>
>>> obj = pd.DataFrame(data)
>>> obj
  key1   key2     data1     data2
0    a    one -0.544099 -0.614079
1    b    one  2.193712  0.101005
2    a    two -0.004683  0.882770
3    b  three  0.312858  1.732105
4    a    two  0.011089  0.089587
5    b    two  0.292165  1.327638
6    a    one -1.433291 -0.238971
7    a  three -0.004724 -2.117326
>>>
>>> grouped1 = obj.groupby('key1')
>>> grouped2 = obj['data1'].groupby(obj['key1'])
>>>
>>> grouped1.mean(numeric_only=True)  # numeric_only=True skips the string column key2 (required on pandas 2.0+)
         data1     data2
key1
a    -0.395142 -0.399604
b     0.932912  1.053583
>>>
>>> grouped2.mean()
key1
a   -0.395142
b    0.932912
Name: data1, dtype: float64
>>>
>>> grouped1.size()
key1
a    5
b    3
dtype: int64
>>>
>>> grouped2.size()
key1
a    5
b    3
Name: data1, dtype: int64

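【03x02】Grouping Columns by Type

The axis parameter of groupby() defaults to 0 (rows). Setting axis=1 groups along the columns instead, which makes it possible, for example, to group the columns by their dtype. Note that axis=1 is deprecated as of pandas 2.1, so the transcript below reflects older versions: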

>>> import pandas as pd
>>> import numpy as np
>>> data = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
...         'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...         'data1': np.random.randn(8),
...         'data2': np.random.randn(8)}
>>> obj = pd.DataFrame(data)
>>> obj
  key1   key2     data1     data2
0    a    one -0.607009  1.948301
1    b    one  0.150818 -0.025095
2    a    two -2.086024  0.358164
3    b  three  0.446061  1.708797
4    a    two  0.745457 -0.980948
5    b    two  0.981877  2.159327
6    a    one  0.804480 -0.499661
7    a  three  0.112884  0.004367
>>>
>>> obj.dtypes
key1      object
key2      object
data1    float64
data2    float64
dtype: object
>>>
>>> obj.groupby(obj.dtypes, axis=1).size()
float64    2
object     2
dtype: int64
>>>
>>> obj.groupby(obj.dtypes, axis=1).sum()
    float64  object
0  1.341291    aone
1  0.125723    bone
2 -1.727860    atwo
3  2.154858  bthree
4 -0.235491    atwo
5  3.141203    btwo
6  0.304819    aone
7  0.117251  athree

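【03x03】Custom Grouping

groupby() accepts a list of several arrays at once, as well as a custom set of group keys. It can also group by a dict, by a function, or by index level.

Passing a list of several arrays: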

>>> import pandas as pd
>>> import numpy as np
>>> data = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
...         'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...         'data1': np.random.randn(8),
...         'data2': np.random.randn(8)}
>>> obj = pd.DataFrame(data)
>>> obj
  key1   key2     data1     data2
0    a    one -0.841652  0.688055
1    b    one  0.510042 -0.561171
2    a    two -0.418862 -0.145983
3    b  three -1.104698  0.563158
4    a    two  0.329527 -0.893108
5    b    two  0.753653 -0.342520
6    a    one -0.882527 -1.121329
7    a  three  1.726794  0.160244
>>>
>>> means = obj['data1'].groupby([obj['key1'], obj['key2']]).mean()
>>> means
key1  key2
a     one     -0.862090
      three    1.726794
      two     -0.044667
b     one      0.510042
      three   -1.104698
      two      0.753653
Name: data1, dtype: float64
>>>
>>> means.unstack()
key2       one     three       two
key1
a    -0.862090  1.726794 -0.044667
b     0.510042 -1.104698  0.753653

Custom group keys (arrays that are not columns of the DataFrame):

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
...                     'key2' : ['one', 'two', 'one', 'two', 'one'],
...                     'data1' : np.random.randn(5),
...                     'data2' : np.random.randn(5)})
>>> obj
  key1 key2     data1     data2
0    a  one -0.024003  0.350480
1    a  two -0.767534 -0.100426
2    b  one -0.594983 -1.945580
3    b  two -0.374482  0.817592
4    a  one  0.755452 -0.137759
>>>
>>> states = np.array(['Wuhan', 'Beijing', 'Beijing', 'Wuhan', 'Wuhan'])
>>> years = np.array([2005, 2005, 2006, 2005, 2006])
>>>
>>> obj['data1'].groupby([states, years]).mean()
Beijing  2005   -0.767534
         2006   -0.594983
Wuhan    2005   -0.199242
         2006    0.755452
Name: data1, dtype: float64

【03x03x01】Grouping with a Dict

Grouping with a dict that maps column labels to group names:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randint(1, 10, (5,5)),
...                    columns=['a', 'b', 'c', 'd', 'e'],
...                    index=['A', 'B', 'C', 'D', 'E'])
>>> obj
   a  b  c  d  e
A  1  4  7  1  9
B  8  2  4  7  8
C  9  8  2  5  1
D  2  4  2  8  3
E  7  5  7  2  3
>>>
>>> obj_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C++', 'e':'Java'}
>>> obj.groupby(obj_dict, axis=1).size()
C++       1
Java      2
Python    2
dtype: int64
>>>
>>> obj.groupby(obj_dict, axis=1).count()
   C++  Java  Python
A    1     2       2
B    1     2       2
C    1     2       2
D    1     2       2
E    1     2       2
>>>
>>> obj.groupby(obj_dict, axis=1).sum()
   C++  Java  Python
A    1    16       5
B    7    12      10
C    5     3      17
D    8     5       6
E    2    10      12
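
Because axis=1 is deprecated in recent pandas releases (2.1 and later, to the best of my knowledge), one possible workaround is to transpose the frame, group its rows by the dict, and transpose back. The sketch below hard-codes the values of the frame printed above so that the result is reproducible; it is an alternative formulation, not the article's original code.

```python
import pandas as pd

# Same values as the frame printed above, hard-coded for reproducibility.
obj = pd.DataFrame(
    [[1, 4, 7, 1, 9],
     [8, 2, 4, 7, 8],
     [9, 8, 2, 5, 1],
     [2, 4, 2, 8, 3],
     [7, 5, 7, 2, 3]],
    columns=["a", "b", "c", "d", "e"],
    index=["A", "B", "C", "D", "E"],
)
obj_dict = {"a": "Python", "b": "Python", "c": "Java", "d": "C++", "e": "Java"}

# Transpose so the columns become rows, group those rows by the dict,
# sum within each group, then transpose back to the original orientation.
result = obj.T.groupby(obj_dict).sum().T
print(result)
```

The printed sums should match the output of the axis=1 version above.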

【03x03x02】Grouping with a Function

Grouping with a function, which is called on each index label:

>>> import pandas as pd
>>> import numpy as np
>>> obj = pd.DataFrame(np.random.randint(1, 10, (5,5)),
...                    columns=['a', 'b', 'c', 'd', 'e'],
...                    index=['AA', 'BBB', 'CC', 'D', 'EE'])
>>> obj
     a  b  c  d  e
AA   3  9  5  8  2
BBB  1  4  2  2  6
CC   9  2  4  7  6
D    2  5  5  7  1
EE   8  8  8  2  2
>>>
>>> def group_key(idx):
...     """idx is a row label (or a column label when grouping along the columns)."""
...     return len(idx)
...
>>> obj.groupby(group_key).size()  # equivalent to obj.groupby(len).size()
1    1
2    3
3    1
dtype: int64

【03x03x03】Grouping by Index Level

Grouping by different levels of a hierarchical index:

>>> import pandas as pd
>>> import numpy as np
>>> columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
...                                      ['A', 'A', 'B', 'C', 'B']],
...                                     names=['language', 'index'])
>>> obj = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
>>> obj
language Python Java Python Java Python
index         A    A      B    C      B
0             7    1      9    8      5
1             4    5      4    5      6
2             4    3      1    9      5
3             6    6      3    8      1
4             7    9      2    8      2
>>>
>>> obj.groupby(level='language', axis=1).sum()
language  Java  Python
0            9      21
1           10      14
2           12      10
3           14      10
4           17      11
>>>
>>> obj.groupby(level='index', axis=1).sum()
index   A   B  C
0       8  14  8
1       9  10  5
2       7   6  9
3      12   4  8
4      16   4  8
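
The level argument works the same way when the MultiIndex is on the rows, which avoids the axis=1 argument altogether. A minimal sketch on hypothetical data (the level names language and section and the score column are made up for illustration):

```python
import pandas as pd

# Hypothetical row MultiIndex with two named levels.
index = pd.MultiIndex.from_arrays(
    [["Python", "Java", "Python", "Java"], ["A", "A", "B", "B"]],
    names=["language", "section"],
)
scores = pd.DataFrame({"score": [1, 2, 3, 4]}, index=index)

# Group by one level of the row index at a time.
by_language = scores.groupby(level="language").sum()
by_section = scores.groupby(level="section").sum()

print(by_language)
print(by_section)
```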

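【03x04】Iterating over Groups

GroupBy objects support iteration. With a single grouping key, iteration yields 2-tuples consisting of the group name and the corresponding chunk of data: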

>>> import pandas as pd
>>> import numpy as np
>>> data = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
...         'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...         'data1': np.random.randn(8),
...         'data2': np.random.randn(8)}
>>> obj = pd.DataFrame(data)
>>> obj
  key1   key2     data1     data2
0    a    one -1.088762  0.668504
1    b    one  0.275500  0.787844
2    a    two -0.108417 -0.491296
3    b  three  0.019524 -0.363390
4    a    two  0.453612  0.796999
5    b    two  1.982858  1.501877
6    a    one  1.101132 -1.928362
7    a  three  0.524775 -1.205842
>>>
>>> for group_name, group_data in obj.groupby('key1'):
...     print(group_name)
...     print(group_data)
...
a
  key1   key2     data1     data2
0    a    one -1.088762  0.668504
2    a    two -0.108417 -0.491296
4    a    two  0.453612  0.796999
6    a    one  1.101132 -1.928362
7    a  three  0.524775 -1.205842
b
  key1   key2     data1     data2
1    b    one  0.275500  0.787844
3    b  three  0.019524 -0.363390
5    b    two  1.982858  1.501877

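With multiple grouping keys, the first element of each tuple is itself a tuple of key values, and the second element is the chunk of data: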

>>> import pandas as pd
>>> import numpy as np
>>> data = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
...         'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...         'data1': np.random.randn(8),
...         'data2': np.random.randn(8)}
>>> obj = pd.DataFrame(data)
>>> obj
  key1   key2     data1     data2
0    a    one -1.088762  0.668504
1    b    one  0.275500  0.787844
2    a    two -0.108417 -0.491296
3    b  three  0.019524 -0.363390
4    a    two  0.453612  0.796999
5    b    two  1.982858  1.501877
6    a    one  1.101132 -1.928362
7    a  three  0.524775 -1.205842
>>>
>>> for group_name, group_data in obj.groupby(['key1', 'key2']):
...     print(group_name)
...     print(group_data)
...
('a', 'one')
  key1 key2     data1     data2
0    a  one -1.088762  0.668504
6    a  one  1.101132 -1.928362
('a', 'three')
  key1   key2     data1     data2
7    a  three  0.524775 -1.205842
('a', 'two')
  key1 key2     data1     data2
2    a  two -0.108417 -0.491296
4    a  two  0.453612  0.796999
('b', 'one')
  key1 key2   data1     data2
1    b  one  0.2755  0.787844
('b', 'three')
  key1   key2     data1    data2
3    b  three  0.019524 -0.36339
('b', 'two')
  key1 key2     data1     data2
5    b  two  1.982858  1.501877
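
Since each iteration step yields a (name, chunk) pair, the loop can also collect a per-group result instead of printing. A brief sketch on hypothetical data with fixed values, showing the scalar group name for a single key and the tuple group name for two keys:

```python
import pandas as pd

# Hypothetical toy data with fixed values.
df = pd.DataFrame({
    "key1": ["a", "b", "a", "b"],
    "key2": ["one", "one", "two", "one"],
    "data1": [1, 2, 3, 4],
})

# Single key: each item is (group name, sub-frame).
rows_per_group = {name: len(group) for name, group in df.groupby("key1")}

# Two keys: the group name is a tuple of key values.
totals = {
    name: int(group["data1"].sum())
    for name, group in df.groupby(["key1", "key2"])
}

print(rows_per_group)  # {'a': 2, 'b': 2}
print(totals)          # {('a', 'one'): 1, ('a', 'two'): 3, ('b', 'one'): 6}
```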

【03x05】Converting GroupBy Objects

A GroupBy object can be converted to a list or a dict:

>>> import pandas as pd
>>> import numpy as np
>>> data = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
...         'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...         'data1': np.random.randn(8),
...         'data2': np.random.randn(8)}
>>> obj = pd.DataFrame(data)
>>> obj
  key1   key2     data1     data2
0    a    one -0.607009  1.948301
1    b    one  0.150818 -0.025095
2    a    two -2.086024  0.358164
3    b  three  0.446061  1.708797
4    a    two  0.745457 -0.980948
5    b    two  0.981877  2.159327
6    a    one  0.804480 -0.499661
7    a  three  0.112884  0.004367
>>>
>>> grouped = obj.groupby('key1')
>>> list(grouped)
[('a',   key1   key2     data1     data2
0    a    one -0.607009  1.948301
2    a    two -2.086024  0.358164
4    a    two  0.745457 -0.980948
6    a    one  0.804480 -0.499661
7    a  three  0.112884  0.004367), ('b',   key1   key2     data1     data2
1    b    one  0.150818 -0.025095
3    b  three  0.446061  1.708797
5    b    two  0.981877  2.159327)]
>>>
>>> dict(list(grouped))
{'a':   key1   key2     data1     data2
0    a    one -0.607009  1.948301
2    a    two -2.086024  0.358164
4    a    two  0.745457 -0.980948
6    a    one  0.804480 -0.499661
7    a  three  0.112884  0.004367, 'b':   key1   key2     data1     data2
1    b    one  0.150818 -0.025095
3    b  three  0.446061  1.708797
5    b    two  0.981877  2.159327}

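【04x00】GroupBy Apply: Applying Functions

Aggregation refers to any data transformation that produces a scalar value from an array. It is commonly used to compute statistics on grouped data.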

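【04x01】Aggregation Functions

The earlier examples already used some built-in aggregation functions, such as mean, count, min, and sum. Common aggregation operations are listed in the table below.

Official documentation: https://pandas.pydata.org/docs/reference/groupby.html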

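Method	Description

count	Number of non-NA values
describe	Summary statistics for a Series or for each DataFrame column
min	Minimum value
max	Maximum value
argmin	Integer position at which the minimum value is found
argmax	Integer position at which the maximum value is found
idxmin	Index label at which the minimum value is found
idxmax	Index label at which the maximum value is found
quantile	Sample quantile (from 0 to 1)
sum	Sum of values
mean	Mean of values
median	Arithmetic median (50% quantile) of values
mad	Mean absolute deviation from the mean (removed in pandas 2.0)
var	Sample variance of values
std	Sample standard deviation of values

Examples: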

>>> import pandas as pd
>>> import numpy as np
>>> obj = {'key1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
...        'key2' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
...        'data1': np.random.randint(1,10, 8),
...        'data2': np.random.randint(1,10, 8)}
>>> obj = pd.DataFrame(obj)
>>> obj
  key1   key2  data1  data2
0    a    one      9      7
1    b    one      5      9
2    a    two      2      4
3    b  three      3      4
4    a    two      5      1
5    b    two      5      9
6    a    one      1      8
7    a  three      2      4
>>>
>>> obj.groupby('key1').sum(numeric_only=True)  # numeric_only=True leaves out the string column key2
      data1  data2
key1
a        19     24
b        13     22
>>>
>>> obj.groupby('key1').max()
     key2  data1  data2
key1
a     two      9      8
b     two      5      9
>>>
>>> obj.groupby('key1').min()
     key2  data1  data2
key1
a     one      1      1
b     one      3      4
>>>
>>> obj.groupby('key1').mean(numeric_only=True)  # required on pandas 2.0+, where mean() no longer drops key2
         data1     data2
key1
a     3.800000  4.800000
b     4.333333  7.333333
>>>
>>> obj.groupby('key1').size()
key1
a    5
b    3
dtype: int64
>>>
>>> obj.groupby('key1').count()
      key2  data1  data2
key1
a        5      5      5
b        3      3      3
>>>
>>> obj.groupby('key1').describe()
     data1                           ... data2
     count      mean       std  min  25%  ...   min  25%  50%  75%  max
key1                                 ...
a      5.0  3.800000  3.271085  1.0  2.0  ...   1.0  4.0  4.0  7.0  8.0
b      3.0  4.333333  1.154701  3.0  4.0  ...   4.0  6.5  9.0  9.0  9.0

[2 rows x 16 columns]

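【04x02】Custom Functions

If the built-in functions do not meet your needs, you can define your own aggregation function and pass it to GroupBy.agg(func) or GroupBy.aggregate(func). The argument passed to func is the data belonging to each group.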
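
A minimal sketch of a custom aggregation, assuming a hypothetical helper named peak_to_peak and toy data with fixed values. For a DataFrame GroupBy, the function receives each column of each group as a Series and should return a scalar.

```python
import pandas as pd

# Hypothetical toy data with fixed values.
df = pd.DataFrame({
    "key1": ["a", "b", "a", "b", "a"],
    "data1": [9, 5, 2, 3, 5],
    "data2": [7, 9, 4, 4, 1],
})

def peak_to_peak(values):
    """Return max minus min for one column of one group (a Series)."""
    return values.max() - values.min()

result = df.groupby("key1").agg(peak_to_peak)

# aggregate() is an alias of agg() and gives the same result.
same = df.groupby("key1").aggregate(peak_to_peak)

print(result)
```

For group a this gives 9 - 2 = 7 for data1 and 7 - 1 = 6 for data2; for group b it gives 2 and 5.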

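【04x03】Applying Different Functions to Different Columns

Passing a dict that maps column names to functions applies a different aggregation to each column.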
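
A short sketch of the dict form, on hypothetical toy data with fixed values. Each key is a column name; each value is either a single aggregation or a list of aggregations, in which case the result gets hierarchical columns.

```python
import pandas as pd

# Hypothetical toy data with fixed values.
df = pd.DataFrame({
    "key1": ["a", "b", "a", "b", "a"],
    "data1": [9, 5, 2, 3, 5],
    "data2": [7, 9, 4, 4, 1],
})

# One function per column.
result = df.groupby("key1").agg({"data1": "sum", "data2": "mean"})

# A list of functions for one column produces MultiIndex columns.
multi = df.groupby("key1").agg({"data1": ["min", "max"], "data2": "sum"})

print(result)
print(multi)
```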

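【04x04】GroupBy.apply()

apply() splits the object being processed into pieces, calls the passed function on each piece, and then tries to combine the results into a single object.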

>>> import pandas as pd
>>> obj = pd.DataFrame({'A':['bob','sos','bob','sos','bob','sos','bob','bob'],
...                     'B':['one','one','two','three','two','two','one','three'],
...                     'C':[3,1,4,1,5,9,2,6],
...                     'D':[1,2,3,4,5,6,7,8]})
>>> obj
     A      B  C  D
0  bob    one  3  1
1  sos    one  1  2
2  bob    two  4  3
3  sos  three  1  4
4  bob    two  5  5
5  sos    two  9  6
6  bob    one  2  7
7  bob  three  6  8
>>>
>>> grouped = obj.groupby('A')
>>> for name, group in grouped:
...     print(name)
...     print(group)
...
bob
     A      B  C  D
0  bob    one  3  1
2  bob    two  4  3
4  bob    two  5  5
6  bob    one  2  7
7  bob  three  6  8
sos
     A      B  C  D
1  sos    one  1  2
3  sos  three  1  4
5  sos    two  9  6
>>>
>>> grouped.apply(lambda x:x.describe())  # call describe() on each of the bob and sos groups
                  C         D
A
bob count  5.000000  5.000000
    mean   4.000000  4.800000
    std    1.581139  2.863564
    min    2.000000  1.000000
    25%    3.000000  3.000000
    50%    4.000000  5.000000
    75%    5.000000  7.000000
    max    6.000000  8.000000
sos count  3.000000  3.000000
    mean   3.666667  4.000000
    std    4.618802  2.000000
    min    1.000000  2.000000
    25%    1.000000  3.000000
    50%    1.000000  4.000000
    75%    5.000000  5.000000
    max    9.000000  6.000000
>>>
>>> grouped.apply(lambda x:x.min())  # call min() on each of the bob and sos groups
       A    B  C  D
A
bob  bob  one  2  1
sos  sos  one  1  2
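
apply() is not limited to functions that return a scalar or a fixed-shape summary; the function may return a sub-frame, and extra keyword arguments are forwarded to it. The sketch below assumes a hypothetical helper named top and reuses the A, C, and D columns of the frame above. Selecting the data columns before calling apply() keeps the grouping column out of the function's input, which I believe sidesteps version-dependent behavior in recent pandas releases.

```python
import pandas as pd

obj = pd.DataFrame({
    "A": ["bob", "sos", "bob", "sos", "bob", "sos", "bob", "bob"],
    "C": [3, 1, 4, 1, 5, 9, 2, 6],
    "D": [1, 2, 3, 4, 5, 6, 7, 8],
})

def top(group, n=2, column="C"):
    """Hypothetical helper: the n rows of one group with the largest `column`."""
    return group.sort_values(column, ascending=False).head(n)

# Keyword arguments after the function are passed through to it.
result = obj.groupby("A")[["C", "D"]].apply(top, n=1)
print(result)
```

The result is indexed by the group key and the original row label, so each row can be traced back to the source frame.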
