Data Analysis Study Notes 03 - pandas
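Introduction
- pandas is a third-party Python package (not part of the standard library). Much like Excel, it provides tools for analyzing tabular data. It offers two core data types: Series and DataFrame.
- What is a Series?
  - A Series is a data type provided by pandas; you can think of it as a single row or column in Excel (a one-dimensional labeled array).
  - A Series object essentially consists of two arrays: index (the labels) and values (the data).
- What is a DataFrame?
  - A DataFrame is a data type provided by pandas; you can think of it as an Excel sheet (two-dimensional, a container of Series).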
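The "container of Series" idea can be sketched as follows. This is a minimal illustration with made-up data; the names `ages` and `tels` are hypothetical and not from the original notes.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
ages = pd.Series([18, 20, 22], index=["a", "b", "c"], name="age")
tels = pd.Series([10086, 10010, 10000], index=["a", "b", "c"], name="tel")

# A DataFrame built from several Series that share the same index.
df = pd.DataFrame({"age": ages, "tel": tels})
print(df)

# Selecting a single column gives back a Series.
col = df["age"]
print(type(col))
print(df.shape)  # (rows, columns)
```

Each column of the DataFrame is a Series, and all columns share one row index.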
 
 
Creating a Series
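    import pandas as pd

    # Build a Series from a list; the default index is 0..n-1
    p1 = pd.Series([11, 22, 33, 44, 55])
    print(p1)
    print(type(p1))

    import pandas as pd

    # The index argument sets the labels explicitly
    p1 = pd.Series([11, 12, 13, 14, 15], index=list("abcde"))
    print(p1)

    import pandas as pd

    # Build a Series from a dict; the keys become the index
    p1 = {"name": "gemoumou", "age": "18", "tel": "10086"}
    p2 = pd.Series(p1)
    print(p2)

    import pandas as pd

    p1 = pd.Series([11, 22, 33, 44, 55])
    print(p1)
    print(type(p1))
    # astype converts the values to another dtype
    p2 = p1.astype(float)
    print(p2)

Slicing and indexing a Series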
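    import pandas as pd

    p1 = {"name": "gemoumou", "age": "18", "tel": "10086"}
    p2 = pd.Series(p1)
    print(p2)
    # name    gemoumou
    # age           18
    # tel        10086
    # dtype: object

    # Access by label
    print(p2["name"])
    print(p2["age"])
    # gemoumou
    # 18

    # Access by position; iloc is used here because plain p2[1] on a
    # label-indexed Series is deprecated in recent pandas versions
    print(p2.iloc[1])
    print(p2.iloc[2])
    # 18
    # 10086

    # Several positions at once
    print(p2.iloc[[0, 1]])
    # name    gemoumou
    # age           18

    # Several labels at once
    print(p2[["name", "tel"]])
    # name    gemoumou
    # tel        10086

    import pandas as pd

    p1 = pd.Series([11, 22, 33, 44, 55, 66, 77, 88, 99, 100])
    print(p1)
    print(p1[p1 > 50])  # keep only the values greater than 50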
 The index and the values of a Series are available as p1.index and p1.values.
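A minimal sketch of reading the two arrays behind a Series, using a small made-up Series:

```python
import pandas as pd

p1 = pd.Series([11, 12, 13], index=list("abc"))

idx = p1.index    # the labels
vals = p1.values  # the data, as a NumPy array

print(idx)
print(type(vals))

# Both can be iterated or converted to plain Python lists.
labels = list(idx)
print(labels)
print(vals.tolist())
```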
Reading external data with pandas
    import pandas as pd

    df = pd.read_csv("數據.csv")  # read the contents of a CSV file
    print(df)

The pandas DataFrame
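- A DataFrame has both a row index and a column index.
- The row index labels the rows. It is called index and corresponds to axis 0 (axis=0).
- The column index labels the columns. It is called columns and corresponds to axis 1 (axis=1).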
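The effect of the axis argument can be sketched with a small made-up DataFrame. With `sum`, `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row).

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

col_sums = df.sum(axis=0)  # walk down the rows: one total per column
row_sums = df.sum(axis=1)  # walk across the columns: one total per row

print(col_sums)
print(row_sums)
```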
Basic DataFrame attributes
    # -*- coding: utf-8 -*-
    import pandas as pd

    # A list of dicts: each dict becomes one row
    p1 = [
        {"name": "zhangsan", "age": 18, "tel": 10086},
        {"name": "lisi", "age": 20, "tel": 10010},
        {"name": "wangmazi", "age": 22, "tel": 100000},
    ]
    p2 = pd.DataFrame(p1)
    print(p2)
    print(p2.index)    # row index
    print(p2.columns)  # column index
    print(p2.values)   # underlying values as a NumPy array
    print(p2.shape)    # (rows, columns)
    print(p2.dtypes)   # dtype of each column
    print(p2.ndim)     # number of dimensions

    # -*- coding: utf-8 -*-
    import pandas as pd

    p1 = [
        {"name": "zhangsan", "age": 18, "tel": 10086},
        {"name": "lisi", "age": 20, "tel": 10010},
        {"name": "wangmazi", "age": 22, "tel": 100000},
        {"name": "xiaoming", "age": 22, "tel": 100000},
        {"name": "xiaohong", "age": 22, "tel": 100000},
    ]
    p2 = pd.DataFrame(p1)
    print(p2)
    print("-" * 20 + "first rows" + "-" * 20)
    print(p2.head(2))
    print("-" * 20 + "last rows" + "-" * 20)
    print(p2.tail(2))
    print("-" * 20 + "overview of p2" + "-" * 20)
    p2.info()  # info() prints directly and returns None
    print("-" * 20 + "quick statistics for numeric columns (int, float)" + "-" * 20)
    print(p2.describe())
 
 
Slicing and indexing
    # -*- coding: utf-8 -*-
    import pandas as pd

    p1 = pd.read_csv("數據.csv")  # read the contents of the CSV file

    # Sorting a DataFrame
    # ascending=True/False selects ascending or descending order
    p1 = p1.sort_values(by="NUM", ascending=False)

    # Things to note when selecting rows or columns:
    # a slice inside the square brackets selects rows
    # a string inside the square brackets selects a column
    print("-" * 20 + "first five rows" + "-" * 20)
    print(p1[:5])
    print("-" * 20 + "last five rows" + "-" * 20)
    print(p1[-5:])  # p1[5:] would instead return every row after the first five
    print("-" * 20 + "the NAME column" + "-" * 20)
    print(p1["NAME"])
    print("-" * 20 + "NUM column of the first five rows" + "-" * 20)
    print(p1[:5]["NUM"])
 p1.loc selects data by label
 p1.iloc selects data by integer position
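A minimal sketch of the difference between loc and iloc, assuming a small made-up DataFrame (the row labels `r1`..`r3` are hypothetical). Note that label slices with loc include the end label, while position slices with iloc exclude the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {"NAME": ["tom", "jerry", "spike"], "NUM": [3, 15, 8]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc["r2", "NUM"]   # row label "r2", column label "NUM"
by_position = df.iloc[1, 1]      # second row, second column

label_slice = df.loc["r1":"r2", ["NAME"]]  # end label is included
pos_slice = df.iloc[0:2, [0]]              # end position is excluded

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```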
Boolean indexing in pandas
    # -*- coding: utf-8 -*-
    import pandas as pd

    p1 = pd.read_csv("數據.csv")  # read the contents of the CSV file
    # ascending=True/False selects ascending or descending order
    p1 = p1.sort_values(by="NUM", ascending=False)
    print(p1)

    print("-" * 20 + "rows where NUM is greater than 14" + "-" * 20)
    print(p1[p1["NUM"] > 14])

    print("-" * 20 + "rows where NUM is greater than 10 and less than 22" + "-" * 20)
    # & means "and", | means "or"; each condition must be wrapped in parentheses
    print(p1[(p1["NUM"] > 10) & (p1["NUM"] < 22)])

    print("-" * 20 + "rows where NAME is longer than 5 and shorter than 7 characters" + "-" * 20)
    print(p1[(p1["NAME"].str.len() > 5) & (p1["NAME"].str.len() < 7)])
 
 
Handling missing data
 
 Dropping NaN
 Example
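A minimal sketch of detecting, dropping, and filling NaN values, assuming a small made-up DataFrame. Filling with the column mean is one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each column.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

mask = df.isna()  # True wherever a value is missing
print(mask)

# how="any": drop a row if any value in it is NaN
dropped_any = df.dropna(axis=0, how="any")
# how="all": drop a row only if every value in it is NaN
dropped_all = df.dropna(axis=0, how="all")
print(dropped_any)
print(dropped_all)

# Fill each NaN with the mean of its own column
filled = df.fillna(df.mean())
print(filled)
```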
 
 
Merging data
    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
    print(df1)  # 2 rows, 4 columns
    print("-" * 50)
    df2 = pd.DataFrame(np.zeros((3, 3)), index=["A", "B", "C"], columns=list("xyz"))
    print(df2)  # 3 rows, 3 columns
    print("-" * 50)
    # join aligns on the row index and keeps the rows of the calling frame
    print(df1.join(df2))
    print("-" * 50)
    print(df2.join(df1))

    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
    print(df1)  # 2 rows, 4 columns
    df2 = pd.DataFrame(np.zeros((3, 3)), columns=list("asd"))
    print(df2)
    print("-" * 50)
    print(df1.merge(df2, on="a"))  # on names the column to merge on
    df2.loc[1, "a"] = 1  # set column a of row 1 to 1
    print(df2)
    print(df1.merge(df2, on="a"))

    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
    print(df1)
    print("-" * 50)
    df2 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list("sad"))
    print(df2)
    print("-" * 50)
    print(df1.merge(df2, on="a"))  # on names the column to merge on
    print("-" * 50)
    df1.loc["A", "a"] = 100
    print(df1)
    print("-" * 50)
    print(df1.merge(df2, on="a"))  # default is an inner join (intersection)
    print("-" * 20 + "outer join (union)" + "-" * 20)
    print(df1.merge(df2, on="a", how="outer"))
    print("-" * 20 + "left join" + "-" * 20)
    print(df1.merge(df2, on="a", how="left"))
    print("-" * 20 + "right join" + "-" * 20)
    print(df1.merge(df2, on="a", how="right"))

Grouping and aggregation
    # -*- coding: utf-8 -*-
    import pandas as pd

    file_path = "starbucks_store_worldwide.csv"
    df = pd.read_csv(file_path)
    # print(df.head(1))
    # print(df.info())  # inspect which columns the file contains

    grouped = df.groupby(by="Country")
    # print(grouped)  # a DataFrameGroupBy object

    # The groups can be iterated over:
    # for i, j in grouped:
    #     print(i)         # the group key
    #     print("-" * 50)
    #     print(j)         # the sub-DataFrame for that key

    # Apply an aggregation
    # print(grouped.count())
    country_count = grouped["Brand"].count()
    print(country_count["US"])
    print(country_count["CN"])
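For readers without the CSV file, the same group-then-aggregate pattern can be sketched on a small made-up DataFrame. The rows below are hypothetical; only the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical rows; not taken from starbucks_store_worldwide.csv.
df = pd.DataFrame({
    "Country": ["US", "US", "CN", "CN", "CN"],
    "State/Province": ["CA", "NY", "11", "31", "31"],
    "Brand": ["Starbucks"] * 5,
})

# Group by one column, then count the non-null Brand values per group.
country_count = df.groupby("Country")["Brand"].count()
print(country_count)

# Group by two columns; the result is a Series with a two-level index.
by_both = df.groupby(["Country", "State/Province"])["Brand"].count()
print(by_both)
print(by_both.loc[("CN", "31")])
```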
 
 
 
Indexes and composite (multi-level) indexes
    # -*- coding: utf-8 -*-
    import pandas as pd

    file_path = "starbucks_store_worldwide.csv"
    df = pd.read_csv(file_path)

    # Count the number of stores in each province of China
    china_data = df[df["Country"] == "CN"]
    grouped = china_data.groupby(by="State/Province").count()["Brand"]
    print(grouped)

    # Group by several keys; the result is a Series
    grouped = df["Brand"].groupby(by=[df["Country"], df["State/Province"]]).count()
    print(grouped)

    # Group by several keys; the result is a DataFrame
    grouped1 = df[["Brand"]].groupby(by=[df["Country"], df["State/Province"]]).count()
    grouped2 = df.groupby(by=[df["Country"], df["State/Province"]])[["Brand"]].count()
    grouped3 = df.groupby(by=[df["Country"], df["State/Province"]]).count()[["Brand"]]
    print(grouped1, type(grouped1))
    print("-" * 50)
    print(grouped2, type(grouped2))
    print("-" * 50)
    print(grouped3, type(grouped3))

    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
    print(df1)
    print("-" * 50)
    print(df1.index)
    print("-" * 50)
    df1.index = ["c", "d"]  # replace the row labels
    print(df1)
    print(df1.index)

    # -*- coding: utf-8 -*-
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame(np.ones((2, 4)), index=["A", "B"], columns=list("abcd"))
    print(df1)
    df1.loc["A", "a"] = 100
    print("-" * 50)
    print(df1.reindex(["A", "C"]))  # rows that do not exist are filled with NaN
    print("-" * 50)
    print(df1.set_index("a"))  # use one column as the index
    print(df1.set_index("a").index)
    print(df1.set_index(["a", "b"]))  # use several columns as the index
    print(df1.set_index(["a", "b"]).index)
    print("-" * 50)
    print(df1.set_index("a", drop=False))  # keep column a as a regular column too
    print("-" * 50)
    print(df1["d"].unique())
    print(df1["a"].unique())
    print("-" * 50)
    print(len(df1.set_index("b").index))  # length of the index
    print("-" * 50)

    # -*- coding: utf-8 -*-
    import pandas as pd

    a = pd.DataFrame({
        "a": range(7),
        "b": range(7, 0, -1),
        "c": ["one", "one", "one", "two", "two", "two", "two"],
        "d": list("hjklmno"),
    })
    print(a)
    print("-" * 50)
    b = a.set_index(["c", "d"])  # two-level index: c is the outer level, d the inner
    print(b)
    print("-" * 50)
    c = b["a"]
    print(c)
    print("-" * 50)
    print(c["one"]["j"])  # outer label first, then inner label
    print("-" * 50)
    d = a.set_index(["d", "c"])["a"]
    print(d)
    print("-" * 50)
    print(d.swaplevel())  # swap the inner and outer levels
    print("-" * 50)
    print(d.swaplevel()["one"])
 
pandas time series
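A minimal sketch of the basic time series tools, assuming made-up daily readings: `pd.date_range` builds a DatetimeIndex, `pd.to_datetime` parses a timestamp, and `resample` downsamples to a coarser frequency.

```python
import pandas as pd

# 14 hypothetical daily readings starting on 2015-01-01.
idx = pd.date_range(start="2015-01-01", periods=14, freq="D")
s = pd.Series(range(14), index=idx)

# Parse a string into a Timestamp and use it as a label.
ts = pd.to_datetime("2015-01-08")
print(s[ts])

# Date strings can be used for slicing; both ends are included.
first_three = s["2015-01-01":"2015-01-03"]
print(first_three)

# Downsample to 7-day bins and average each bin.
weekly = s.resample("7D").mean()
print(weekly)
```

Downsampling like this reduces the number of points, which helps when plotting long series.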
 
Example
    # -*- coding: utf-8 -*-
    import pandas as pd

    # True lets wide DataFrames wrap across lines when printed; False disables wrapping
    pd.set_option('expand_frame_repr', False)
    file_path = "BeijingPM20100101_20151231.csv"
    df = pd.read_csv(file_path)

    # Combine the separate year/month/day/hour columns into a pandas time type with PeriodIndex
    # Note: pandas 2.2+ deprecates these keyword arguments in favor of pd.PeriodIndex.from_fields
    period = pd.PeriodIndex(year=df["year"], month=df["month"], day=df["day"], hour=df["hour"], freq="H")
    # print(period)
    df["datetime"] = period
    print(df.head(10))

    # -*- coding: utf-8 -*-
    import pandas as pd
    from matplotlib import pyplot as plt

    # True lets wide DataFrames wrap across lines when printed; False disables wrapping
    pd.set_option('expand_frame_repr', False)
    file_path = "BeijingPM20100101_20151231.csv"
    df = pd.read_csv(file_path)

    # Combine the separate year/month/day/hour columns into a pandas time type with PeriodIndex
    period = pd.PeriodIndex(year=df["year"], month=df["month"], day=df["day"], hour=df["hour"], freq="H")
    # print(period)
    df["datetime"] = period
    print(df.head(10))

    # Use datetime as the index
    df.set_index("datetime", inplace=True)

    # Handle missing data by dropping the NaN values
    data = df["PM_US Post"].dropna()

    # Plot
    _x = data.index
    _y = data.values
    plt.figure(figsize=(20, 8), dpi=80)
    plt.plot(range(len(_x)), _y)
    plt.xticks(range(0, len(_x), 20), list(_x)[::20])
    plt.show()

    # -*- coding: utf-8 -*-
    import pandas as pd
    from matplotlib import pyplot as plt

    # True lets wide DataFrames wrap across lines when printed; False disables wrapping
    pd.set_option('expand_frame_repr', False)
    file_path = "BeijingPM20100101_20151231.csv"
    df = pd.read_csv(file_path)

    # Combine the separate year/month/day/hour columns into a pandas time type with PeriodIndex
    period = pd.PeriodIndex(year=df["year"], month=df["month"], day=df["day"], hour=df["hour"], freq="H")
    # print(period)
    df["datetime"] = period
    # print(df.head(10))

    # Use datetime as the index
    df.set_index("datetime", inplace=True)

    # There are too many points to plot, so downsample to weekly or monthly means.
    # The column is selected first so that only numeric data is averaged.
    # data = df["PM_US Post"].resample("M").mean().dropna()
    data = df["PM_US Post"].resample("7D").mean().dropna()

    # Plot
    _x = data.index
    _x = [i.strftime("%Y%m%d") for i in _x]
    _y = data.values
    plt.figure(figsize=(20, 8), dpi=80)
    plt.plot(range(len(_x)), _y)
    plt.xticks(range(0, len(_x), 10), list(_x)[::10], rotation=45)
    plt.show()
