pd.concat数据拼接
從本文你將學(xué)到如何利用pandas模塊下的concat函數(shù)進(jìn)行數(shù)據(jù)拼接。
pd.concat
- 背景
- 完整代碼
- 結(jié)果預(yù)覽
- 代碼解讀
- 參考文獻(xiàn)
背景
現(xiàn)在我有136price202006temp, 136price202007temp, 136price202008temp, 136price202009temp, 136price2020010temp, 136price2020011temp六份數(shù)據(jù),我想要將其上下拼接起來(lái)形成一張總表,各表名和表頭如下表所示
| 136price202006temp | newdiskid, plate, newdiskname, lon, lat, ras202005, p0, label |
| 136price202007temp | newdiskid, plate, newdiskname, lon, lat, price202006, p0, label |
| 136price202008temp | newdiskid, plate, newdiskname, lon, lat, price202007, p0, label |
| 136price202009temp | newdiskid, plate, newdiskname, lon, lat, price202008, p0, label |
| 136price2020010temp | newdiskid, plate, newdiskname, lon, lat, price202009, p0, label |
| 136price2020011temp | newdiskid, plate, newdiskname, lon, lat, price2020010, p0, label |
從表中可以看到每個(gè)表一共有8個(gè)字段,其中7個(gè)相同,1個(gè)不同,具體要求如下
- 關(guān)鍵是把p0拼接,表示各個(gè)月份算出來(lái)的價(jià)格,希望把下面7個(gè)字段對(duì)齊;
| df | newdiskid, plate, newdiskname, lon, lat, p0,label |
- 每個(gè)表增加表示時(shí)間的字段pricedate
- 增加表示計(jì)算人的字段computman;
- 將所有l(wèi)abel為1或者2的改成0,label為0的改成1;
完整代碼
# -*- coding: utf-8 -*- """ Project_name:join Description: Created on Tue Nov 10 14:44:44 2020 @author: 帥帥de三叔 """import pandas as pd #導(dǎo)入數(shù)據(jù)分析模塊 import pymysql #導(dǎo)入連接MySQL模塊data202006 = pd.read_excel("136price202006temp.xlsx") #讀取數(shù)據(jù) data202006["pricedate"]= ["2020-06-01"]*len(data202006) #增加一個(gè)pricedate的字段data202007 = pd.read_excel("136price202007temp.xlsx") data202007["pricedate"]= ["2020-07-01"]*len(data202007)data202008 = pd.read_excel("136price202008temp.xlsx") data202008["pricedate"]= ["2020-08-01"]*len(data202008)data202009 = pd.read_excel("136price202009temp.xlsx") data202009["pricedate"]= ["2020-09-01"]*len(data202009)data2020010 = pd.read_excel("136price2020010temp.xlsx") data2020010["pricedate"]= ["2020-10-01"]*len(data2020010)data2020011 = pd.read_excel("136price2020011temp.xlsx") data2020011["pricedate"]= ["2020-11-01"]*len(data2020011)df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0, ignore_index=True)[["newdiskid", "newdiskname", "pricedate","p0", "label"]] df["computman"] = ["zbw"]*len(df) #新增computman字段 df["remark"] = [""]*len(df) #新增remark字段 df["label"] = [2 if i == 1 or i == 2 else 0 for i in df["label"]] #推導(dǎo)式語(yǔ)句,如果為1或2則統(tǒng)一標(biāo)2,其余標(biāo)0 df["label"] = [0 if i == 2 else 1 for i in df["label"]] #接上一句,如果為2賦值為0,反之為1 df.columns = ["newdiskid", "newdiskname", "pricedate", "price", "casetype", "computman", "remark"] #重新給表頭 df = df[["newdiskid", "newdiskname","pricedate", "price", "computman", "casetype", "remark"]] #調(diào)整一下表頭順序 df.to_excel("136各樓盤各月份基價(jià).xlsx") #保存本地文件夾下結(jié)果預(yù)覽
代碼解讀
主要用到panda.concat()函數(shù),其基本調(diào)用方式如下
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)參數(shù)解釋
- objs
需要連接的對(duì)象,如series,dataframe或者是panel構(gòu)成的序列l(wèi)ist ,常用是兩個(gè)數(shù)據(jù)框 [df1, df2],我這里給了6個(gè)數(shù)據(jù)框;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011]) #拼接6個(gè)數(shù)據(jù)框- axis
表示拼接的方向,axis = 0, 表示在按行拼接,即上下方向; axis = 1, 表示按列方向拼接,即左右方向;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011], axis = 0) #按行拼接- join
表示連接方式,outer, 表示并集,即全部需要; inner,表示交集,只取重合的部分;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0)#取交集注: 取交集后ras202005,price202006,price202007, price202008, price202009, price2020010, price2020011這幾個(gè)字段就會(huì)被干掉;
- join_axes
傳入需要保留的index
- ignore_index
忽略需要連接的數(shù)據(jù)框本身的index,當(dāng)原本的index沒有特別意義的時(shí)候可以使用,尤其這些數(shù)據(jù)框的index不完全一致的時(shí)候特別有用;
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0, ignore_index=True) #選擇忽略- key
keys參數(shù)可以用來(lái)標(biāo)注合并后的表數(shù)據(jù)來(lái)源,比如我這里可以給6個(gè)key值
df = pd.concat([data202006, data202007, data202008, data202009, data2020010, data2020011],join = "inner", axis = 0, ignore_index=True, keys=['06', ',07', '08','09', ',010', '011']) #標(biāo)注數(shù)據(jù)來(lái)源- levels
序列列表,用于構(gòu)建MultiIndex的特定級(jí)別(唯一值),否則,它們將從index推斷。
- copy
bool值,表示要不要復(fù)制數(shù)據(jù),默認(rèn)為True,如果為False,則不要復(fù)制數(shù)據(jù)。
后面2個(gè)參數(shù)用的比較少
參考文獻(xiàn)
1,https://jingyan.baidu.com/article/91f5db1b79205a1c7f05e3ae.html
2,https://www.jianshu.com/p/421f040dfe2f
總結(jié)
以上是生活随笔為你收集整理的pd.concat数据拼接的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: IDS uEye XCP USB3相机驱
- 下一篇: 【瞎扯】树上差分的基本思路