Creating a pandas DataFrame: pandas data processing
1. A few handy DataFrame features
Check the data types with df.dtypes
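    import pandas as pd
    from pandas import DataFrame

    df = DataFrame({'name': ['zs', 'ls', 'ww', 'zl'],
                    'score': [19, 29, 39, 20],
                    'date': ['2019-10-10', '2019-10-11', '2019-10-12', '2019-10-13']})
    df.dtypes

Output:

    name     object
    score     int64
    date     object
    dtype: object

Convert the 'date' column to a datetime type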
    df.date = pd.to_datetime(df.date)
    df.dtypes

Output:

    name             object
    score             int64
    date     datetime64[ns]
    dtype: object

Set 'date' as the row index
    df.set_index('date', inplace=True)
    df

Output:

               name  score
    date
    2019-10-10   zs     19
    2019-10-11   ls     29
    2019-10-12   ww     39
    2019-10-13   zl     20

Visualization
    import matplotlib.pyplot as plt
    # Jupyter magic: render plots inside the notebook
    %matplotlib inline
    df['score'].plot()

2. pandas data processing
(1) Removing duplicate elements
Signature: df.drop_duplicates(subset=None, keep='first', inplace=False)

    df = DataFrame({'color': ['white', 'white', 'red', 'red', 'white'],
                    'value': [2, 1, 3, 3, 2]})
    # Show the frame, the duplicate flags, and the de-duplicated frame
    display(df, df.duplicated(), df.drop_duplicates())

Output:

       color  value
    0  white      2
    1  white      1
    2    red      3
    3    red      3
    4  white      2

    0    False
    1    False
    2    False
    3     True
    4     True
    dtype: bool

       color  value
    0  white      2
    1  white      1
    2    red      3

(2) Mapping
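What mapping means: create a table of mapping relations that binds each element in values to a specific label or string.
This requires a dictionary:
    map = {'label1':'value1','label2':'value2', ... }
There are three operations:
a. replace(): replace elements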
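    df = DataFrame({'item': ['ball', 'mug', 'pen'],
                    'color': ['white', 'rosso', 'verde'],
                    'price': [5.56, 4.20, 1.30]})
    newcolors = {'rosso': 'red', 'verde': 'green', 'pen': 'pi'}
    # replace() looks for the dictionary keys in every column, so 'pen' in 'item' is replaced too
    df.replace(newcolors)

Output:

       item  color  price
    0  ball  white   5.56
    1   mug    red   4.20
    2    pi  green   1.30

b. Most important: map(): create a new column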
Important: the value that map() returns for each element is a single concrete value, not an iterable.
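The note above can be illustrated with a function instead of a dictionary. This is a minimal sketch: the prices and the helper `price_band` are hypothetical, made up for illustration. It only shows that the callable passed to `map()` receives one element at a time and returns one scalar for it.

```python
import pandas as pd

# Made-up prices, for illustration only.
prices = pd.Series([5.56, 4.20, 1.30], name='price')

def price_band(value):
    """Hypothetical helper: return one label for one scalar value."""
    return 'high' if value >= 4 else 'low'

# map() calls price_band once per element and collects the scalar results.
bands = prices.map(price_band)
print(bands.tolist())
```

The result has the same length and index as the input Series, which is why it can be assigned straight back as a new column.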
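Use map() to build a new column from an existing one; it is suited to processing a single column.

    df3 = DataFrame({'color': ['red', 'green', 'blue'],
                     'project': ['math', 'english', 'chemistry']})
    price = {'red': 5.56, 'green': 3.14, 'chemistry': 2.79}
    # Keys are matched against the 'color' values; 'blue' has no key, so it becomes NaN
    df3['price'] = df3['color'].map(price)
    display(df3)

Output:

       color    project  price
    0    red       math   5.56
    1  green    english   3.14
    2   blue  chemistry    NaN

c. rename(): replace index labels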
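    import numpy as np

    df4 = DataFrame({'color': ['white', 'gray', 'purple', 'blue', 'green'],
                     'value': np.random.randint(10, size=5)})
    new_index = {0: 'first', 1: 'two', 2: 'three', 3: 'four', 4: 'five'}
    # rename() returns a new frame with the row labels replaced
    display(df4, df4.rename(new_index))

Output:

        color  value
    0   white      2
    1    gray      0
    2  purple      9
    3    blue      2
    4   green      0

            color  value
    first   white      2
    two      gray      0
    three  purple      9
    four     blue      2
    five    green      0

(3) Detecting and filtering outliers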
Use describe() to view the descriptive statistics of each column
    np.random.seed(0)
    df = DataFrame(np.random.randint(10, size=10))
    display(df.head(10), df.describe())

Output:

       0
    0  5
    1  0
    2  3
    3  3
    4  7
    5  9
    6  3
    7  5
    8  2
    9  4

                   0
    count  10.000000
    mean    4.100000
    std     2.558211
    min     0.000000
    25%     3.000000
    50%     3.500000
    75%     5.000000
    max     9.000000

std() returns the standard deviation of each column of a DataFrame
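    df2 = DataFrame(np.random.randint(10, 100, size=(8, 8)))
    df2.std()

Output:

    0    26.717771
    1    20.311854
    2    20.260800
    3    26.046319
    4    28.660264
    5    36.402512
    6    18.015866
    7    22.646349
    dtype: float64

Filter the elements of a DataFrame based on each column's standard deviation. With the help of any(), the filtering condition is applied to every column: a row is selected if the condition holds in at least one of its columns.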
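A minimal sketch of a standard-deviation filter on fixed, made-up data, so the result is reproducible. Note one assumption: this version measures each element's distance from its column mean before comparing with three standard deviations, which is a common variation and not the exact expression used in this article.

```python
import numpy as np
import pandas as pd

# Made-up data: 19 ordinary values and one obvious outlier.
values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 10, 9, 11, 10, 10, 100]
df = pd.DataFrame({'x': values})

# Distance of each element from its column mean, compared with 3 standard deviations.
deviation = np.abs(df - df.mean())
is_outlier = deviation > 3 * df.std()

# any(axis=1) flags a row if at least one of its columns is an outlier.
outliers = df[is_outlier.any(axis=1)]
cleaned = df[~is_outlier.any(axis=1)]
print(outliers)
```

Negating the row mask with `~` keeps the remaining rows, which is the usual next step once the outliers have been inspected.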
    display(df.std(),
            np.abs(df) > (3 * df.std()),
            df[(np.abs(df) > df.std() * 3).any(axis=1)])

Output:

    0    2.558211
    dtype: float64

           0
    0  False
    1  False
    2  False
    3  False
    4  False
    5   True
    6  False
    7  False
    8  False
    9  False

       0
    5  9

About DataFrame.any:

    DataFrame.any(self, axis=0, bool_only=None, skipna=True, level=None, **kwargs)
    Return whether any element is True, potentially over an axis.

(4) Sorting
Use take() to reorder the rows; with the help of np.random.permutation() the rows can be put in a random order
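    df5 = DataFrame(np.arange(25).reshape(5, 5))
    # A random permutation of the row positions 0..4
    new_order = np.random.permutation(5)
    display(df5, new_order, df5.take(new_order))

Output:

        0   1   2   3   4
    0   0   1   2   3   4
    1   5   6   7   8   9
    2  10  11  12  13  14
    3  15  16  17  18  19
    4  20  21  22  23  24

    array([4, 2, 3, 1, 0])

        0   1   2   3   4
    4  20  21  22  23  24
    2  10  11  12  13  14
    3  15  16  17  18  19
    1   5   6   7   8   9
    0   0   1   2   3   4

Random sampling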
When the DataFrame is large enough, np.random.randint() can be used directly, together with take(), to draw a random sample
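One property of the randint() approach described above is that positions are drawn independently, so the same row can be picked more than once. The sketch below contrasts it with `DataFrame.sample`, which draws distinct rows by default. The seed values are assumptions added only to make the run repeatable.

```python
import numpy as np
import pandas as pd

df5 = pd.DataFrame(np.arange(25).reshape(5, 5))

# randint draws each position independently, so repeats are possible.
np.random.seed(0)  # assumption: fixed seed for repeatability
positions = np.random.randint(0, len(df5), size=3)
with_replacement = df5.take(positions)

# DataFrame.sample uses replace=False by default, so rows are distinct.
without_replacement = df5.sample(n=3, random_state=0)

print(with_replacement)
print(without_replacement)
```

On a large DataFrame repeats are rare, which is presumably why the randint() shortcut is suggested only when the data is large enough.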
    sample = np.random.randint(0, len(df5), size=3)
    # Take the sampled row positions from df5
    df5.take(sample)

Output:

        0   1   2   3   4
    0   0   1   2   3   4
    2  10  11  12  13  14
    4  20  21  22  23  24

(5) Data aggregation
    df = DataFrame({'color': ['white', 'red', 'green', 'red'],
                    'item': ['ball', 'mug', 'pen', 'pencil'],
                    'price1': np.random.rand(4),
                    'price2': np.random.rand(4)})
    # Group the 'price1' column by 'color'
    g = df.groupby('color')['price1']
    display(df, g, g.groups, type(g))
    display(g.sum(), g.mean(), g.max())

Output:

       color    item    price1    price2
    0  white    ball  0.652790  0.414369
    1    red     mug  0.635059  0.474698
    2  green     pen  0.995300  0.623510
    3    red  pencil  0.581850  0.338008

    {'green': Int64Index([2], dtype='int64'), 'red': Int64Index([1, 3], dtype='int64'), 'white': Int64Index([0], dtype='int64')}

    pandas.core.groupby.SeriesGroupBy

    color
    green    0.995300
    red      1.216909
    white    0.652790
    Name: price1, dtype: float64

    color
    green    0.995300
    red      0.608455
    white    0.652790
    Name: price1, dtype: float64

    color
    green    0.995300
    red      0.635059
    white    0.652790
    Name: price1, dtype: float64

(6) Advanced data aggregation
pd.merge() can be used to add the results of an aggregation to every row of df
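A minimal sketch of this idea on fixed, made-up numbers, small enough to check by hand. The column names here are illustrative assumptions; the pattern is to aggregate per group, then merge the per-group result back using the group key on the left and the index on the right.

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    'color': ['white', 'red', 'white', 'red'],
    'price': [1, 2, 3, 4],
})

# One row per color, holding the group total.
sums = df.groupby('color')[['price']].sum().add_prefix('total_')

# Match each row's 'color' value against the index of `sums`.
merged = pd.merge(df, sums, left_on='color', right_index=True)

# Sort by the original row labels so the row order is easy to compare with df.
print(merged.sort_index())
```

Every 'white' row carries the white total (1 + 3) and every 'red' row carries the red total (2 + 4).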
    d1 = {'item': ['luobo', 'baicai', 'lajiao', 'donggua', 'luobo', 'baicai', 'lajiao', 'donggua'],
          'color': ['white', 'white', 'red', 'green', 'white', 'white', 'red', 'green'],
          'weight': np.random.randint(10, size=8),
          'price': np.random.randint(10, size=8)}
    df = DataFrame(d1)
    # Select the numeric columns with a list so that 'item' is not summed
    sums = df.groupby('color')[['price', 'weight']].sum().add_prefix('total_')
    items = df.groupby('item')[['price', 'weight']].sum()
    # Average price per unit of weight for each item
    means = (items['price'] / items['weight']).to_frame('means_price')
    # Attach the per-color totals, then the per-item means, to every row
    df2 = pd.merge(df, sums, left_on='color', right_index=True)
    df3 = pd.merge(df2, means, left_on='item', right_index=True)
    display(df2, df3)

Output (the source shows df, sums and the merged df2):

       color     item  price  weight
    0  white    luobo      9       2
    1  white   baicai      5       9
    2    red   lajiao      5       8
    3  green  donggua      1       1
    4  white    luobo      7       4
    5  white   baicai      8       0
    6    red   lajiao      6       8
    7  green  donggua      4       3

           total_price  total_weight
    color
    green            5             4
    red             11            16
    white           29            15

       color     item  price  weight  total_price  total_weight
    0  white    luobo      9       2           29            15
    1  white   baicai      5       9           29            15
    4  white    luobo      7       4           29            15
    5  white   baicai      8       0           29            15
    2    red   lajiao      5       8           11            16
    6    red   lajiao      6       8           11            16
    3  green  donggua      1       1            5             4
    7  green  donggua      4       3            5             4

transform and apply can be used to achieve the same result
transform
    d1 = {'item': ['luobo', 'baicai', 'lajiao', 'donggua', 'luobo', 'baicai', 'lajiao', 'donggua'],
          'color': ['white', 'white', 'red', 'green', 'white', 'white', 'red', 'green'],
          'weight': np.random.randint(10, size=8),
          'price': np.random.randint(10, size=8)}
    df = DataFrame(d1)
    # sum() returns one row per group
    sum1 = df.groupby('color')[['price', 'weight']].sum().add_prefix('total_')
    # transform() returns one row per original row, filled with that row's group total
    sums2 = df.groupby('color')[['price', 'weight']].transform(lambda x: x.sum()).add_prefix('total_')
    sums3 = df.groupby('color')[['price', 'weight']].transform('sum').add_prefix('total_')
    display(df, sum1, sums2, sums3)

Output:

       color     item  price  weight
    0  white    luobo      7       7
    1  white   baicai      7       7
    2    red   lajiao      2       7
    3  green  donggua      6       6
    4  white    luobo      1       2
    5  white   baicai      3       6
    6    red   lajiao      7       0
    7  green  donggua      0       2

           total_price  total_weight
    color
    green            6             8
    red              9             7
    white           18            22

       total_price  total_weight
    0           18            22
    1           18            22
    2            9             7
    3            6             8
    4           18            22
    5           18            22
    6            9             7
    7            6             8

       total_price  total_weight
    0           18            22
    1           18            22
    2            9             7
    3            6             8
    4           18            22
    5           18            22
    6            9             7
    7            6             8

apply
    def sum_price(x):
        # Column sums of one group
        return x.sum()

    # apply() calls the function once per group and returns one row per group
    sums3 = df.groupby('color')[['price', 'weight']].apply(lambda x: x.sum()).add_prefix('total_')
    sums4 = df.groupby('color')[['price', 'weight']].apply(sum_price).add_prefix('total_')
    display(df, sums3, sums4)

Output:

       color     item  price  weight
    0  white    luobo      4       4
    1  white   baicai      0       3
    2    red   lajiao      0       4
    3  green  donggua      7       5
    4  white    luobo      3       1
    5  white   baicai      3       3
    6    red   lajiao      0       6
    7  green  donggua      0       7

           total_price  total_weight
    color
    green            7            12
    red              0            10
    white           10            11

           total_price  total_weight
    color
    green            7            12
    red              0            10
    white           10            11