pandasStudyNoteBook
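pandas introductory training
Introduction to pandas
- Official website: http://pandas.pydata.org/
- pandas = panel data + data analysis
- pandas is a data analysis package for Python. It was originally developed as a tool for financial data analysis, which is why it has strong support for time series analysis.
Basic features
- Data structures with automatic or explicit data alignment along an axis
- Integrated time series functionality
- The same data structures handle both time series and non-time-series data
- Arithmetic operations and reductions (such as summing along an axis) can be carried out according to the metadata (axis labels)
- Flexible handling of missing data
- Merges and other relational operations found in common databases (SQL-based ones, for example)
Data structures
Data structures: Series
- A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (its index).
- The string representation of a Series shows the index on the left and the values on the right.
Code:
- Creating a Series
- From a list
- From a dict
- Reading and writing a Series
- Series arithmetic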
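The list above names the Series operations without showing them. The following is a minimal sketch of creating a Series from a list and from a dict, reading and writing elements, and doing arithmetic. All variable names and sample values are illustrative assumptions, not taken from the original notebook.

```python
import pandas as pd

# From a list: pandas assigns a default integer index 0..n-1.
s_list = pd.Series([4, 7, -5, 3])

# From a list with explicit labels.
s_labeled = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# From a dict: the keys become the index.
s_dict = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Utah': 5000})

# Read by label, then write by label.
value_b = s_labeled['b']
s_labeled['d'] = 6

# Arithmetic and boolean filtering are element-wise and keep the index.
doubled = s_labeled * 2
positive = s_labeled[s_labeled > 0]

print(s_list)
print(s_dict)
print(doubled)
print(positive)
```

Printing any of these objects shows the index on the left and the values on the right, as described above.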
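Data structures: DataFrame
- A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on).
- A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.
- Data that can be passed to the DataFrame constructor
Code:
- Creation
- Reading and writing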
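As a rough illustration of the creation and read/write steps listed above, the sketch below builds a DataFrame from a dict of lists and from a NumPy array, then reads and writes columns, rows, and cells. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# From a dict of equal-length lists: each key becomes a column.
data = {
    'state': ['Ohio', 'Ohio', 'Nevada'],
    'year': [2000, 2001, 2001],
    'pop': [1.5, 1.7, 2.4],
}
frame = pd.DataFrame(data, index=['one', 'two', 'three'])

# From a 2-D NumPy array with explicit row and column labels.
frame_np = pd.DataFrame(np.arange(6).reshape(2, 3),
                        index=['a', 'b'], columns=['x', 'y', 'z'])

# Read a column (returned as a Series) and a row by label.
years = frame['year']
row_two = frame.loc['two']

# Write: assign a scalar to a new column, then update a single cell.
frame['debt'] = 0.0
frame.loc['three', 'debt'] = 1.5

print(frame)
print(frame_np)
```

Each column read this way shares the frame's row index, which matches the "dict of Series" view of a DataFrame.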
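Data structures: Index objects
- pandas Index objects hold the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when building a Series or DataFrame is converted to an Index.
- Index objects are immutable, so users cannot modify them. Immutability matters because it lets an Index be shared safely among several data structures.
- The main Index classes in pandas
- Index methods and properties I
- Index methods and properties II
Code: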
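No code was preserved for this section, so here is a small sketch of the two points made above: an Index rejects item assignment, and one Index can back several objects. The labels and variable names are assumptions chosen for the example.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# Item assignment on an Index raises TypeError.
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# Because it cannot change, one Index can be reused by several objects.
shared = pd.Index(['x', 'y', 'z'])
s1 = pd.Series([1.0, 2.0, 3.0], index=shared)
s2 = pd.Series([10, 20, 30], index=shared)

# A few set-like methods and properties.
has_b = 'b' in index
combined = index.union(shared)
unique = index.is_unique

print(mutated, has_b, unique)
print(combined)
print(s1.index.equals(s2.index))
```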
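Basic functionality
Basic functionality: reindexing
- reindex creates a new object that conforms to a new index. Calling reindex on a Series rearranges the data according to the new index; if an index value is not currently present, a missing value is introduced.
- For ordered data such as time series, you may want to interpolate or fill values when reindexing. The method option does this.
- Parameters of the reindex function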
Code:
import numpy as np
from pandas import DataFrame, Series

print('Reindex with a new order')
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
print(obj)
obj2 = obj.reindex(['a', 'b', 'd', 'c', 'e'])  # missing labels are filled with NaN by default
print(obj2)
print(obj.reindex(['a', 'b', 'd', 'c', 'e'], fill_value=0))  # fill value for labels that do not exist
print()

print('Reindex with a fill method')
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3)
print(obj3.reindex(range(6), method='ffill'))  # forward fill: carry the previous value forward
print()

print('Reindex a DataFrame')
frame = DataFrame(np.arange(9).reshape(3, 3), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
print(frame)
frame2 = frame.reindex(['a', 'b', 'c', 'd'])  # the rows are reindexed by default
print(frame2)
print()

print('Reindex the columns')
states = ['Texas', 'Utah', 'California']
print(frame.reindex(columns=states))  # set the column order
print(frame)

print('Reindex a DataFrame with a fill method')
print(frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill'))
# reindex rows and columns together; labels missing from the frame become NaN
print(frame.reindex(index=['a', 'b', 'd', 'c'], columns=states))

Output:
Reindex with a new order
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
d    4.5
c    3.6
e    NaN
dtype: float64
a   -5.3
b    7.2
d    4.5
c    3.6
e    0.0
dtype: float64

Reindex with a fill method
0      blue
2    purple
4    yellow
dtype: object
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Reindex a DataFrame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

Reindex the columns
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
Reindex a DataFrame with a fill method
   Ohio  Texas  California
a     0      1           2
b     0      1           2
c     3      4           5
d     6      7           8
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
d    7.0   NaN         8.0
c    4.0   NaN         5.0
Basic functionality: dropping entries from an axis
- Dropping one or more entries from an axis is easy as long as you have an index array or list. Because this can involve some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.
Code:
import numpy as np
from pandas import Series, DataFrame

# print('Series: drop elements by index label')
# obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
# new_obj = obj.drop('c')  # drop a single entry by its label
# print(new_obj)
# obj = obj.drop(['d', 'c'])
# print(obj)
# print()

print('DataFrame: drop entries by index label or by column.')
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
print(data)
print(data.drop(['Colorado', 'Ohio']))
print(data.drop('two', axis=1))  # axis=1 drops from the columns
print(data.drop(['two', 'four'], axis=1))

Output:
DataFrame: drop entries by index label or by column.
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
Basic functionality: indexing, selection, and filtering
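- Series indexing (obj[…]) works much like NumPy array indexing, except that the index values of a Series need not be integers.
- Slicing with labels differs from ordinary Python slicing: the endpoint is inclusive, so the interval is fully closed.
- Indexing into a DataFrame retrieves one or more columns.
- For label-based indexing on DataFrame rows, pandas introduced the special indexing field ix. It was later deprecated and then removed in pandas 1.0 in favor of loc (labels) and iloc (positions), which the code below uses.
- Indexing options for DataFrame
Code:
- List indexing
- Slice indexing
- Row/column indexing
- Conditional indexing
import numpy as np
from pandas import Series, DataFrame

print('Series indexing; integer positions work through iloc.')
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj['b'])
print(obj.iloc[3])
print(obj.iloc[[1, 3]])  # index with a list (not a tuple) to select positions 1 and 3
print(obj[obj < 2])  # the elements of obj that are less than 2
print()
print('Series slicing')
print(obj['b':'d'])  # closed interval [b, d]
obj['b':'c'] = 5
print(obj)
print()
print('DataFrame indexing')
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
print(data)
print(data['two'])  # plain [] indexing selects a column by default
print(data[['three', 'one']])  # index with a list of columns
print(data[:2])
print(data.loc['Colorado', ['two', 'three']])  # row label plus column labels
print(data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]])  # rows by label, columns by position
print(data.iloc[2])  # the row at position 2 (counting from 0)
print(data.loc[:'Utah', 'two'])  # from the start through Utah, column 'two'
print()
print('Selection by condition')
print(data[data.three > 5])
print(data < 5)  # a DataFrame of True/False values
data[data < 5] = 0
print(data)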
Basic functionality: arithmetic and data alignment
- Arithmetic between objects with different indexes
- Automatic data alignment introduces NA values at the index labels that do not overlap; missing values propagate through arithmetic.
- For a DataFrame, alignment happens on both the rows and the columns.
- The fill_value parameter
- Operations between DataFrame and Series
Code:
import numpy as np
from pandas import Series, DataFrame

print('Addition')
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
print(s1)
print(s2)
# matching labels are added; labels present on one side only become NaN;
# the result index is the union of the two indexes
print(s1 + s2)
print()

print('DataFrame addition: both the index and the columns must match.')
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print(df2)
print(df1 + df2)  # alignment applies to rows and columns; unmatched cells become NaN
print()

print('Filling values')
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
print(df1)
print(df2)
print('df1 + df2')
print(df1 + df2)
print(df1.add(df2, fill_value=0))  # add() with fill_value gives a different result from +
print(df1.reindex(columns=df2.columns, fill_value=0))  # take df2's columns; fill the new ones with 0
print()

print('Operations between DataFrame and Series')
arr = np.arange(12.).reshape((3, 4))
print(arr)
print(arr[0])
print(arr - arr[0])
frame = DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]
print(frame)
print(series)
print(frame - series)  # the Series is matched on the columns and broadcast down the rows
series2 = Series(range(3), index=list('bef'))
print(frame + series2)
series3 = frame['d']
print(frame.sub(series3, axis=0))  # match on the row index and subtract from every column

Output:
Addition
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

DataFrame addition: both the index and the columns must match.
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

Filling values
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1 + df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0

Operations between DataFrame and Series
[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]
[0. 1. 2. 3.]
[[0. 0. 0. 0.]
 [4. 4. 4. 4.]
 [8. 8. 8. 8.]]
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
b    0
d    1
e    2
Name: Utah, dtype: int64
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1
Basic functionality: function application and mapping
- NumPy ufuncs (element-wise array methods)
- The DataFrame apply method
- The DataFrame applymap method (so named because Series already has a map method for element-wise application)
- Every element-wise NumPy function also works on a pandas DataFrame
Code:
import numpy as np
from pandas import Series, DataFrame

print('Functions')
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print(np.abs(frame))  # absolute value of every element
print()

print('lambda and apply')
f = lambda x: x.max() - x.min()
print(frame.apply(f))  # by default the function is applied to each column
print(frame.apply(f, axis=1))  # axis=1 applies it to each row

def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])

print(frame.apply(f))
print()

print('applymap and map')
_format = lambda x: '%.2f' % x
print(frame.applymap(_format))  # pandas 2.1+ names this method DataFrame.map
print(frame['e'].map(_format))

Output:
Functions
               b         d         e
Utah   -0.188935  0.298682  1.692648
Ohio   -0.666434 -0.102262 -0.172966
Texas  -1.103831 -1.324074 -1.024516
Oregon  1.354406 -0.564374 -0.967438
               b         d         e
Utah    0.188935  0.298682  1.692648
Ohio    0.666434  0.102262  0.172966
Texas   1.103831  1.324074  1.024516
Oregon  1.354406  0.564374  0.967438

lambda and apply
b    2.458237
d    1.622756
e    2.717164
dtype: float64
Utah      1.881583
Ohio      0.564172
Texas     0.299558
Oregon    2.321844
dtype: float64
            b         d         e
min -1.103831 -1.324074 -1.024516
max  1.354406  0.298682  1.692648

applymap and map
            b      d      e
Utah    -0.19   0.30   1.69
Ohio    -0.67  -0.10  -0.17
Texas   -1.10  -1.32  -1.02
Oregon   1.35  -0.56  -0.97
Utah       1.69
Ohio      -0.17
Texas     -1.02
Oregon    -0.97
Name: e, dtype: object
Basic functionality: sorting and ranking
- Sort by the row or column index
- For a DataFrame, sort by the index on either axis
- Ascending or descending order can be specified
- Sort by value
- For a DataFrame, you can specify which columns to sort by value
- The rank function
Code:
import numpy as np
from pandas import Series, DataFrame

print('Sort by index; for a DataFrame you can choose the axis.')
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj.sort_index())  # sort by index label
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=list('dabc'))
print(frame.sort_index())  # the row index is sorted by default
print(frame.sort_index(axis=1))  # sort the column index
print(frame.sort_index(axis=1, ascending=False))  # descending
print()

print('Sort by value')
obj = Series([4, 7, -3, 2])
print(obj.sort_values())  # order() is obsolete
print()

print('Sort a DataFrame by column')
frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})
print(frame)
print(frame.sort_values(by='b'))  # sort_index(by=...) is obsolete
print(frame.sort_values(by=['a', 'b']))
print()

print('rank: average position among ties (starting from 1)')
obj = Series([7, -5, 7, 4, 2, 0, 4])
# sorted positions: -5(1), 0(2), 2(3), 4(4), 4(5), 7(6), 7(7)
print(obj.rank())
print(obj.rank(method='first'))  # break ties by order of appearance instead of averaging
print(obj.rank(ascending=False, method='max'))  # descending, ties take the highest rank, so -5 gets rank 7
frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4.3, 7, -3, 2], 'c': [-2, 5, 8, -2.5]})
print(frame)
print(frame.rank(axis=1))

Output:
Sort by index; for a DataFrame you can choose the axis.
a    1
b    2
c    3
d    0
dtype: int64
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

Sort by value
2   -3
3    2
0    4
1    7
dtype: int64

Sort a DataFrame by column
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

rank: average position among ties (starting from 1)
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
Basic functionality: indexes with duplicate values
- Indexing a label that occurs more than once returns a Series; indexing a label that occurs only once returns a scalar.
Code:
import numpy as np
from pandas import Series, DataFrame

print('Duplicate index labels')
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
print(obj)
print(obj.index.is_unique)  # False when the index contains duplicate labels
print(obj['a'].iloc[0], obj.a.iloc[1])
df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
print(df)
print(df.loc['b'].iloc[0])
print(df.loc['b'].iloc[1])

Output:
Duplicate index labels
a    0
a    1
b    2
b    3
c    4
dtype: int64
False
0 1
          0         1         2
a  1.166285  0.600093  1.043009
a  0.791440  0.764078  1.136826
b -1.624025 -0.384034  1.255976
b  0.164236 -0.181083  0.131282
0   -1.624025
1   -0.384034
2    1.255976
Name: b, dtype: float64
0    0.164236
1   -0.181083
2    0.131282
Name: b, dtype: float64
Summarizing and computing descriptive statistics
- Common method options
- Common descriptive and summary statistics functions I
- Common descriptive and summary statistics functions II
- Differences between numeric and non-numeric data
- NA values are excluded automatically unless the skipna option says otherwise
Code:
import numpy as np
from pandas import Series, DataFrame

print('Sum')
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
print(df)
print(df.sum())  # one total per column (the default)
print(df.sum(axis=1))  # axis=1 gives one total per row
print()

print('Mean')
print(df.mean(axis=1, skipna=False))  # per-row mean without skipping NaN
print(df.mean(axis=1))  # NaN is skipped by default
print()

print('Other')
print(df.idxmax())  # per column: the row label of the maximum
print(df.idxmax(axis=1))  # per row: the column label of the maximum (row c is all NaN)
print(df.cumsum())  # cumulative sum down each column
print(df.describe())  # summary statistics for each column
obj = Series(['a', 'a', 'b', 'c'] * 4)
print(obj)
print(obj.describe())

Output:
Sum
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
one    9.25
two   -5.80
dtype: float64
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

Mean
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Other
one    b
two    d
dtype: object
a    one
b    one
c    NaN
d    one
dtype: object
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
count     16
unique     3
top        a
freq       8
dtype: object
Summarizing and computing descriptive statistics: correlation and covariance
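- Correlation coefficient: a statistical measure of how closely two variables are related. (Source: Baidu Baike)
- Covariance: intuitively, covariance is the expected value of the joint deviation of two variables from their means. If the two variables tend to move in the same direction, so that when one is above its expected value the other is above its own expected value too, the covariance is positive. If they tend to move in opposite directions, so that when one is above its expected value the other is below its own, the covariance is negative.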
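To make the sign rules above concrete without relying on a network data source, the sketch below uses synthetic random data. The column names `up`, `down`, and `noise` and the seed are assumptions made for this example; they are not part of the original notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

returns = pd.DataFrame({
    'x': x,
    'up': 2.0 * x + 1.0,            # moves with x: correlation +1
    'down': -0.5 * x,               # moves against x: correlation -1
    'noise': rng.normal(size=200),  # drawn independently of x
})

corr_up = returns['x'].corr(returns['up'])
corr_down = returns['x'].corr(returns['down'])
cov_up = returns['x'].cov(returns['up'])
cov_down = returns['x'].cov(returns['down'])

print(returns.corr())                  # pairwise correlation matrix
print(returns.cov())                   # pairwise covariance matrix
print(returns.corrwith(returns['x']))  # each column against x
```

Covariance carries the units and scale of the data, while the correlation coefficient is normalized to the range [-1, 1], which is why `up` and `down` have correlations of exactly +1 and -1 (up to rounding) but covariances of different magnitudes.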
Code:
import pandas_datareader.data as web  # requires the separate pandas-datareader package
from pandas import DataFrame

print('Correlation and covariance')
# Covariance: https://zh.wikipedia.org/wiki/%E5%8D%8F%E6%96%B9%E5%B7%AE
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    # the start date comes first, then the end date
    all_data[ticker] = web.get_data_yahoo(ticker, '7/15/2015', '4/1/2016')
price = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
volume = DataFrame({tic: data['Volume'] for tic, data in all_data.items()})
returns = price.pct_change()
print(returns.tail())
print(returns.MSFT.corr(returns.IBM))
print(returns.corr())  # correlation; a column's correlation with itself is always 1
print(returns.cov())  # covariance
print(returns.corrwith(returns.IBM))
print(returns.corrwith(volume))

Note: importing pandas.io.data raises ImportError, because that module has moved to the separate pandas-datareader package (https://github.com/pydata/pandas-datareader); the import ``from pandas.io import data, wb`` becomes ``from pandas_datareader import data, wb``. No output was recorded for this example.
Summarizing and computing descriptive statistics: unique values and membership
- Common methods
Code:
import numpy as np
from pandas import Series, DataFrame

print('Unique values')
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(obj)
print(obj.unique())  # the distinct values
print(obj.value_counts())  # how many times each value occurs
print()

print('Membership test')
mask = obj.isin(['b', 'c'])
print(mask)
print(obj[mask])  # only the elements equal to b or c
data = DataFrame({'Qu1': [1, 3, 4, 3, 4], 'Qu2': [2, 3, 1, 2, 3], 'Qu3': [1, 5, 2, 4, 4]})
print(data)
print(data.apply(Series.value_counts).fillna(0))
print(data.apply(Series.value_counts, axis=1).fillna(0))

Output:
Unique values
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
['c' 'a' 'd' 'b']
c    3
a    3
b    2
d    1
dtype: int64

Membership test
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
0    c
5    b
6    b
7    c
8    c
dtype: object
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
     1    2    3    4    5
0  2.0  1.0  0.0  0.0  0.0
1  0.0  0.0  2.0  0.0  1.0
2  1.0  1.0  0.0  1.0  0.0
3  0.0  1.0  1.0  1.0  0.0
4  0.0  0.0  1.0  2.0  0.0
Handling missing data
- NA handling methods
- NaN (Not a Number) represents missing data in both floating-point and non-floating-point arrays
- None is also treated as NA
Code:
import numpy as np
from pandas import Series

print('Values treated as null')
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)
print(string_data.isnull())  # True where the value is missing
string_data[0] = None
print(string_data.isnull())

Output:
Values treated as null
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
0    False
1    False
2     True
3    False
dtype: bool
0     True
1    False
2     True
3    False
dtype: bool
Handling missing data: filtering out missing data
- dropna
- Boolean indexing
- By default, DataFrame drops every row that contains a missing value
- The how parameter controls the behavior, the axis parameter selects the axis, and the thresh parameter sets how many non-NA values a row or column needs in order to be kept
Code:
import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrame

# print('Dropping NA')
# data = Series([1, NA, 3.5, NA, 7, None])
# print(data.dropna())  # remove the NA values from the Series
# print(data[data.notnull()])
# print()

print('DataFrame: dropping NA')
data = DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
print(data)
print(data.dropna())  # by default a row is dropped if it contains any NA
print(data.dropna(how='all'))  # how='all' drops a row only when every value is NA
data[4] = NA  # add a new column
print(data.dropna(axis=1, how='all'))  # rows are the default; axis=1 works on columns
data = DataFrame(np.random.randn(7, 3))
data.loc[:4, 1] = NA
data.loc[:2, 2] = NA
print(data)
print(data.dropna(thresh=2))  # keep rows with at least 2 non-NA values

Output:
DataFrame: dropping NA
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
          0         1         2
0 -0.181398       NaN       NaN
1 -1.153083       NaN       NaN
2 -0.072996       NaN       NaN
3  0.783739       NaN  0.324288
4 -1.277365       NaN -1.683068
5  2.305280  0.082071  0.175902
6 -0.167521 -0.043577 -0.959134
          0         1         2
3  0.783739       NaN  0.324288
4 -1.277365       NaN -1.683068
5  2.305280  0.082071  0.175902
6 -0.167521 -0.043577 -0.959134
Handling missing data: filling in missing data
- fillna
- The inplace parameter controls whether a new object is returned or the existing one is modified in place
Code:
import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrame

print('Fill with 0')
df = DataFrame(np.random.randn(7, 3))
print(df)
df.loc[:4, 1] = NA
df.loc[:2, 2] = NA
print(df)
print(df.fillna(0))
df.fillna(0, inplace=False)  # returns a new object; df itself is unchanged
df.fillna(0, inplace=True)  # modifies df in place
print(df)
print()

print('Different fill values for different columns')
# column 3 does not exist, and df has no NaN left after the in-place fill, so nothing changes
print(df.fillna({1: 0.5, 3: -1}))
print()

print('Different fill methods')
df = DataFrame(np.random.randn(6, 3))
df.loc[2:, 1] = NA
df.loc[4:, 2] = NA
print(df)
print(df.ffill())  # forward fill
print(df.ffill(limit=2))  # forward fill at most 2 consecutive NaN values
print()

print('Fill with a statistic')
data = Series([1., NA, 3.5, NA, 7])
print(data.fillna(data.mean()))

Output:
Fill with 0
          0         1         2
0 -0.747530  0.733795  0.207921
1  0.329993 -0.092622 -0.274532
2 -0.498705  1.097721 -0.248666
3 -1.072368  1.281738  1.143063
4 -0.838184 -1.229197 -1.588577
5  0.386622 -1.056740  0.120941
6 -0.104685  0.062590 -0.682652
          0        1         2
0 -0.747530      NaN       NaN
1  0.329993      NaN       NaN
2 -0.498705      NaN       NaN
3 -1.072368      NaN  1.143063
4 -0.838184      NaN -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652
          0        1         2
0 -0.747530  0.00000  0.000000
1  0.329993  0.00000  0.000000
2 -0.498705  0.00000  0.000000
3 -1.072368  0.00000  1.143063
4 -0.838184  0.00000 -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652
          0        1         2
0 -0.747530  0.00000  0.000000
1  0.329993  0.00000  0.000000
2 -0.498705  0.00000  0.000000
3 -1.072368  0.00000  1.143063
4 -0.838184  0.00000 -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652

Different fill values for different columns
          0        1         2
0 -0.747530  0.00000  0.000000
1  0.329993  0.00000  0.000000
2 -0.498705  0.00000  0.000000
3 -1.072368  0.00000  1.143063
4 -0.838184  0.00000 -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652

Different fill methods
          0         1         2
0  0.037005 -0.554357 -0.968951
1  0.600986 -0.564576 -0.718096
2  1.268549       NaN  1.006229
3  0.813411       NaN  0.451489
4  0.097840       NaN       NaN
5 -1.944482       NaN       NaN
          0         1         2
0  0.037005 -0.554357 -0.968951
1  0.600986 -0.564576 -0.718096
2  1.268549 -0.564576  1.006229
3  0.813411 -0.564576  0.451489
4  0.097840 -0.564576  0.451489
5 -1.944482 -0.564576  0.451489
          0         1         2
0  0.037005 -0.554357 -0.968951
1  0.600986 -0.564576 -0.718096
2  1.268549 -0.564576  1.006229
3  0.813411 -0.564576  0.451489
4  0.097840       NaN  0.451489
5 -1.944482       NaN  0.451489

Fill with a statistic
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
Hierarchical indexing
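- Hierarchical indexing lets you have multiple (two or more) index levels on a single axis. Put abstractly, it lets you work with higher-dimensional data in a lower-dimensional form.
- Reshape a DataFrame with stack and unstack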
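The stack and unstack round trip mentioned above can be sketched with a small two-level Series. The labels and values below are illustrative assumptions rather than data from the original notebook.

```python
import numpy as np
import pandas as pd

# A Series with a two-level index: outer level a/b, inner level 1/2/3.
data = pd.Series(
    np.arange(6.0),
    index=[['a', 'a', 'a', 'b', 'b', 'b'], [1, 2, 3, 1, 2, 3]],
)

# Partial indexing on the outer level returns the inner level.
inner = data['b']

# unstack moves the inner index level into the columns...
wide = data.unstack()
# ...and stack moves the columns back into an inner index level.
long_again = wide.stack()

print(data)
print(wide)
print(long_again)
```

Here the two-dimensional table `wide` and the one-dimensional `data` hold the same six values, which is the sense in which a hierarchical index represents higher-dimensional data in a lower-dimensional form.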
Code:
import numpy as np
from pandas import Series, DataFrame, MultiIndex

# print('Series hierarchical indexing')
# data = Series(np.random.randn(10),
#               index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
#                      [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
# print(data)
# print(data.index)
# print(data.b)
# print(data['b':'c'])
# print(data[:2])
# print(data.unstack())
# print(data.unstack().stack())
# print()

print('DataFrame hierarchical indexing')
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
print(frame)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
print(frame)
print(frame.loc[('a', 1), :])
print(frame.loc[('a', 2), :]['Colorado'])
print(frame.loc[('a', 2), :]['Ohio']['Red'])
print()

print('Build a hierarchical index directly with MultiIndex')
print(MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
                             names=['state', 'color']))

Output:
DataFrame hierarchical indexing
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
state     color
Ohio      Green    0
          Red      1
Colorado  Green    2
Name: (a, 1), dtype: int64
color
Green    5
Name: (a, 2), dtype: int64
4

Build a hierarchical index directly with MultiIndex
MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])
Hierarchical indexing: reordering levels
- Swapping index levels
- Re-sorting the index
Code:
import numpy as np
from pandas import Series, DataFrame

print('Swapping index levels')
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame_swapped = frame.swaplevel('key1', 'key2')
print(frame_swapped)
print(frame_swapped.swaplevel(0, 1))
print()

print('Sorting by index level')
print(frame.sort_index(level='key2'))  # sort_index(level=...) replaces the deprecated sortlevel
print(frame.swaplevel(0, 1).sort_index(level=0))

Output:
Swapping index levels
           Ohio     Colorado
          Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

Sorting by index level
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
           Ohio     Colorado
          Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
Hierarchical indexing: summary statistics by level
- Specify the index level and the axis
Code:
import numpy as np
from pandas import DataFrame

print('Compute statistics for a given key')
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
print(frame)
print(frame.groupby(level='key2').sum())  # sum(level=...) is gone in pandas 2.0; group by the level instead

Output:
Compute statistics for a given key
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
      Ohio     Colorado
     Green Red    Green
key2
1        6   8       10
2       12  14       16
Hierarchical indexing: using a DataFrame's columns
- Turn the specified columns into the index
- Remove or keep the columns that became the index
- Restore them with reset_index
Code:
import numpy as np
from pandas import DataFrame

print('Build a hierarchical index from columns')
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
print(frame)
print(frame.set_index(['c', 'd']))  # turn columns c and d into the index
print(frame.set_index(['c', 'd'], drop=False))  # the columns are kept as well
frame2 = frame.set_index(['c', 'd'])
print(frame2.reset_index())

Output:
Build a hierarchical index from columns
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
Other topics
Other topics: integer indexing
- Where the ambiguity comes from
- Reliable position-based indexing that does not depend on the index type
Code:
import numpy as np
import sys
from pandas import Series, DataFrame

print('Integer indexing')
ser = Series(np.arange(3.))
print(ser)
try:
    print(ser[-1])  # ambiguous: -1 could be a label or a position
except KeyError:
    print(sys.exc_info()[0])
ser2 = Series(np.arange(3.), index=['a', 'b', 'c'])
print(ser2.iloc[-1])  # with a non-integer index, -1 can only mean a position
ser3 = Series(range(3), index=[-5, 1, 3])
print(ser3.iloc[2])  # avoids the ambiguity of a bare [2]
print()

print('Integer indexing on a DataFrame')
frame = DataFrame(np.arange(6).reshape((3, 2)), index=[2, 0, 1])
print(frame)
print(frame.iloc[0])
print(frame.iloc[:, 1])

Output:
Integer indexing
0    0.0
1    1.0
2    2.0
dtype: float64
<class 'KeyError'>
2.0
2

Integer indexing on a DataFrame
   0  1
2  0  1
0  2  3
1  4  5
0    0
1    1
Name: 2, dtype: int64
2    1
0    3
1    5
Name: 1, dtype: int64
Other topics: Panel data
- Create a Panel object from a three-dimensional ndarray
- Select the data you need with ix[…]
- Access order: item -> major -> minor
- Present panel data with stack
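Panel was removed from pandas in version 0.25, so the steps above cannot be run on current releases. As a rough stand-in, and only as an assumption about what a reader might want instead, the sketch below stores three-dimensional data in a DataFrame with a two-level row index. The tickers `AAA` and `BBB` and all numbers are hypothetical placeholders, not real market data.

```python
import numpy as np
import pandas as pd

# A 3-D array: 2 items (hypothetical tickers) x 3 dates x 2 fields.
values = np.arange(12.0).reshape(2, 3, 2)
items = ['AAA', 'BBB']
dates = pd.date_range('2016-01-05', periods=3)
fields = ['Open', 'Close']

# Flatten the first two axes into an (item, date) MultiIndex; fields become columns.
index = pd.MultiIndex.from_product([items, dates], names=['item', 'date'])
stacked = pd.DataFrame(values.reshape(-1, 2), index=index, columns=fields)

# Select one item (all dates), then one field laid out as dates x items.
aaa = stacked.loc['AAA']
close_by_item = stacked['Close'].unstack('item')

# Select every item from a given date onward.
start = pd.Timestamp('2016-01-06')
recent = stacked.loc[(slice(None), slice(start, None)), :]

print(stacked)
print(close_by_item)
print(recent)
```

The item -> major -> minor access order of a Panel maps here onto outer index level -> inner index level -> column.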
Code:
import pandas_datareader.data as web  # requires the separate pandas-datareader package
from pandas import Panel  # Panel exists only in pandas releases before 0.25

pdata = Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2016', '1/15/2016'))
                   for stk in ['AAPL', 'GOOG', 'BIDU', 'MSFT']))
print(pdata)
pdata = pdata.swapaxes('items', 'minor')
print(pdata)
print()

print('Access order: Item -> Major -> Minor')
print(pdata['Adj Close'])
print(pdata[:, '1/5/2016', :])
print(pdata['Adj Close', '1/6/2016', :])
print()

print('Converting between Panel and DataFrame')
stacked = pdata.loc[:, '1/7/2016':, :].to_frame()
print(stacked)
print(stacked.to_panel())

Note: importing pandas.io.data raises ImportError, because that module has moved to the separate pandas-datareader package (https://github.com/pydata/pandas-datareader); the import ``from pandas.io import data, wb`` becomes ``from pandas_datareader import data, wb``. No output was recorded for this example.