Introduction to Machine Learning: pandas
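# food_info is a DataFrame loaded from a CSV file with pandas' read_csv; the load statement itself is not shown here.
print(food_info)
Result: none. pandas' read_csv reads the contents of a file into a DataFrame.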
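As a minimal sketch of what read_csv does, the snippet below parses a small made-up CSV held in memory instead of a file on disk. The name demo_info and the three sample rows are assumptions for illustration only; they are not the food_info dataset.

```python
import io
import pandas as pd

# Made-up sample standing in for a CSV file; read_csv accepts
# file-like objects as well as file paths.
csv_text = """NDB_No,Shrt_Desc,Water_(g),Energ_Kcal
1,SAMPLE A,15.87,717
2,SAMPLE B,15.87,717
3,SAMPLE C,0.24,876
"""

demo_info = pd.read_csv(io.StringIO(csv_text))

print(type(demo_info))             # the parsed result is a DataFrame
print(demo_info.shape)             # (number of rows, number of columns)
print(demo_info.columns.tolist())  # column labels taken from the header line
print(demo_info.head(2))           # first two rows
```

The header line of the CSV becomes the column labels, and the rows receive the default integer index starting at 0.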
first_rows = food_info.head()
#print (first_rows)
#print(food_info.head(3))
print (food_info.columns)
#print (food_info.shape)
Result:
Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)',
       'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)',
       'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)',
       'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)',
       'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)',
       'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)',
       'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg',
       'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)',
       'Cholestrl_(mg)'],
      dtype='object')
head() returns the first 5 rows by default and head(3) returns the first 3; columns holds the column labels, and shape gives the (rows, columns) dimensions.
# pandas uses zero-indexing
# Series object representing the row at index 1 (the second row).
print (food_info.loc[1])
# Series object representing the seventh row.
#food_info.loc[6]
# Will throw an error: "KeyError: 'the label [8620] is not in the [index]'"
#food_info.loc[8620]
# The object dtype is equivalent to a string in Python
Result:
NDB_No                                1002
Shrt_Desc         BUTTER WHIPPED WITH SALT
Water_(g)                            15.87
Energ_Kcal                             717
Protein_(g)                           0.85
Lipid_Tot_(g)                        81.11
Ash_(g)                               2.11
Carbohydrt_(g)                        0.06
Fiber_TD_(g)                             0
The output pairs each column label with the value from the selected row.
# Returns a DataFrame containing the rows at indexes 3, 4, 5, and 6.
food_info.loc[3:6]
# Returns a DataFrame containing the rows at indexes 2, 5, and 10. Either of the following approaches will work.
# Method 1
#two_five_ten = [2,5,10]
#food_info.loc[two_five_ten]
# Method 2
#food_info.loc[[2,5,10]]
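The same principle as above applies: loc also accepts a slice of labels or a list of labels, and then returns a DataFrame.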
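The three forms of loc used above can be compared side by side on a tiny frame. This is only a sketch: the DataFrame demo and its values are made up for illustration.

```python
import pandas as pd

# Made-up data with the default integer index 0..4.
demo = pd.DataFrame({"name": ["a", "b", "c", "d", "e"],
                     "value": [10, 20, 30, 40, 50]})

row = demo.loc[1]              # single label -> Series
block = demo.loc[1:3]          # label slice -> DataFrame; the end label 3 is included
picked = demo.loc[[0, 2, 4]]   # list of labels -> DataFrame

print(type(row).__name__, len(block), len(picked))

# A label that is not in the index raises KeyError.
try:
    demo.loc[99]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True
print(missing_label_raised)
```

Note that a loc slice is inclusive at both ends, unlike ordinary Python list slicing.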
col_names = food_info.columns.tolist()
gram_columns = []
for c in col_names:
    if c.endswith("(g)"):
        gram_columns.append(c)
gram_df = food_info[gram_columns]
print(gram_df.head(3))
Result:
   Water_(g)  Protein_(g)  Lipid_Tot_(g)  Ash_(g)  Carbohydrt_(g)  \
0      15.87         0.85          81.11     2.11            0.06
1      15.87         0.85          81.11     2.11            0.06
2       0.24         0.28          99.48     0.00            0.00

   Fiber_TD_(g)  Sugar_Tot_(g)  FA_Sat_(g)  FA_Mono_(g)  FA_Poly_(g)
0             0           0.06      51.368       21.021        3.043
1             0           0.06      50.489       23.426        3.012
2             0           0.00      61.924       28.732        3.694
(The trailing backslash means the remaining columns are displayed in a second block.)
#print(food_info["Iron_(mg)"])
div_1000 = food_info["Iron_(mg)"] / 1000
print (div_1000)
# Adds 100 to each value in the column and returns a Series object.
add_100 = food_info["Iron_(mg)"] + 100
# Subtracts 100 from each value in the column and returns a Series object.
#sub_100 = food_info["Iron_(mg)"] - 100
# Multiplies each value in the column by 2 and returns a Series object.
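#mult_2 = food_info["Iron_(mg)"] * 2
Result: each operation returns a new Series with the arithmetic applied to every value in the column.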
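A small sketch of the scalar arithmetic above, assuming a made-up Series named iron_mg in place of the real "Iron_(mg)" column. It also shows that the original Series is left unchanged and that a missing value stays missing.

```python
import pandas as pd

# Made-up values standing in for a numeric column; None becomes NaN.
iron_mg = pd.Series([0.02, 0.16, None, 0.5], name="Iron_(mg)")

iron_g = iron_mg / 1000     # divides every element, returns a new Series
plus_100 = iron_mg + 100    # adds 100 to every element

print(iron_g)
print(plus_100)
print(iron_mg.tolist())     # the original Series is not modified
```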
water_energy = food_info["Water_(g)"] * food_info["Energ_Kcal"]
iron_grams = food_info["Iron_(mg)"] / 1000
food_info["Iron_(g)"] = iron_grams
print(water_energy)
Result:
0    11378.79
1    11378.79
2      210.24
3    14970.73
4    15251.81
5    16172.28
6    15540.00
7    14769.28
8    15062.60
9    14570.55
As above, the arithmetic is applied row by row and the result is a Series.
# By default, pandas will sort the data by the column we specify in ascending order and return a new DataFrame.
# Sorts the DataFrame in-place, rather than returning a new DataFrame.
#print(food_info["Sodium_(mg)"])
food_info.sort_values("Sodium_(mg)", inplace=True)
print (food_info["Sodium_(mg)"])
#Sorts by descending order, rather than ascending.
food_info.sort_values("Sodium_(mg)", inplace=True, ascending=False)
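#print(food_info["Sodium_(mg)"])
Result: the Sodium_(mg) column printed in ascending order; the second call then re-sorts the DataFrame in descending order.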
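The sorting behaviour described above can be sketched on a made-up frame (the names demo, ascending and descending are illustrative only). Without inplace=True, sort_values returns a new DataFrame, the original index labels travel with their rows, and missing values are placed last by default.

```python
import pandas as pd

# Made-up data; one sodium value is missing on purpose.
demo = pd.DataFrame({"food": ["a", "b", "c", "d"],
                     "Sodium_(mg)": [643.0, None, 2.0, 827.0]})

ascending = demo.sort_values("Sodium_(mg)")                    # new DataFrame
descending = demo.sort_values("Sodium_(mg)", ascending=False)  # largest first

print(ascending["food"].tolist())   # smallest sodium first, NaN row last
print(descending["food"].tolist())  # largest sodium first, NaN row still last
print(ascending.index.tolist())     # original labels are kept with their rows
print(demo["food"].tolist())        # demo itself is unchanged
```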
Putting it together: reinforcing pandas with the Titanic case study
import pandas as pd
import numpy as np
titanic_survial = pd.read_csv("C:/Users/LENOVO/Desktop/titanic_train.csv")
titanic_survial.head()
This displays the first few rows of the file that was read.
#The Pandas library uses NaN, which stands for "not a number", to indicate a missing value.
#we can use the pandas.isnull() function which takes a pandas series and returns a series of True and False values
age = titanic_survial["Age"]
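#print(age.loc[0:10])  # rows labeled 0 through 10 of the Age column (a loc slice includes its end label, so 11 rows)
age_is_null = pd.isnull(age)
#print(age_is_null)  # True where the value is missing, False where it is present
age_null_true = age[age_is_null]
print(age_null_true)  # the entries whose Age is missing
age_null_count = len(age_null_true)
print(age_null_count)  # the number of missing values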
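The same missing-value check can be sketched on a short made-up Series (the values below are illustrative, not taken from the Titanic file). Because True counts as 1, summing the boolean mask gives the number of missing entries directly.

```python
import pandas as pd

# Made-up ages; None is stored as NaN.
age = pd.Series([22.0, None, 38.0, None, 35.0], name="Age")

age_is_null = pd.isnull(age)   # boolean Series: True where the value is missing
missing = age[age_is_null]     # boolean mask keeps only the True positions

print(missing.index.tolist())  # labels of the missing entries
print(len(missing))            # count via the filtered Series
print(int(age_is_null.sum()))  # same count via the mask itself
```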
#The result of this is that mean_age would be nan. This is because any calculations we do with a null value also result in a null value
mean_age = sum(titanic_survial["Age"]) / len(titanic_survial["Age"])
print (mean_age)
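Result:
nan
This tries to compute the mean age, but the missing values are included in the sum, so the result is missing as well.
good_ages = titanic_survial["Age"][age_is_null == False]  # keep only the ages that are not missing
print(good_ages)
correct_mean_age = sum(good_ages) / len(good_ages)  # mean over the valid ages only
print(correct_mean_age)
Result:
29.6991176471
{1: 84.15468749999992, 2: 20.66218315217391, 3: 13.675550101832997}
The second line is a dictionary keyed by Pclass (1, 2, 3), apparently a per-class mean computed by code that is not shown.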
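To make the NaN behaviour concrete, here is a sketch on a made-up Series. It compares the naive mean, the filtered mean used above, and Series.mean(), which skips missing values by default and is offered here as an alternative to the manual filtering.

```python
import pandas as pd

# Made-up ages with one missing value.
age = pd.Series([22.0, None, 38.0, 30.0], name="Age")

naive_mean = sum(age) / len(age)               # NaN: the missing value poisons the sum

good_ages = age[age.notnull()]                 # drop the missing entries first
manual_mean = sum(good_ages) / len(good_ages)  # mean over the 3 valid ages

builtin_mean = age.mean()                      # skips NaN by default (skipna=True)

print(naive_mean, manual_mean, builtin_mean)
```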
#index tells the method which column to group by
#values is the column that we want to apply the calculation to
#aggfunc specifies the calculation we want to perform
passenger_survival = titanic_survial.pivot_table(index="Pclass", values="Survived", aggfunc=np.mean)
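print(passenger_survival)
Result: the mean of Survived, i.e. the survival rate, for each Pclass.
port_stats = titanic_survial.pivot_table(index="Embarked", values=["Fare", "Survived"], aggfunc=np.sum)
print(port_stats)
Result: the total Fare and the total Survived count for each port of embarkation.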
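A sketch of the two pivot tables above on made-up data. The frame demo and its values are assumptions; the aggregation is named by string ("mean", "sum") rather than by np.mean or np.sum, which is an alternative spelling and avoids the warning that recent pandas releases may emit for NumPy callables. A groupby is shown as an equivalent way to get the per-class mean.

```python
import pandas as pd

# Made-up passengers; not the real Titanic data.
demo = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

# index: the column to group by; values: the column(s) to aggregate;
# aggfunc: the calculation applied within each group.
survival_rate = demo.pivot_table(index="Pclass", values="Survived", aggfunc="mean")
totals = demo.pivot_table(index="Pclass", values=["Fare", "Survived"], aggfunc="sum")

# The same per-class mean written as a groupby.
by_group = demo.groupby("Pclass")["Survived"].mean()

print(survival_rate)
print(totals)
print(by_group)
```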
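row_index_83_age = titanic_survial.loc[83, "Age"]
row_index_83_pclass = titanic_survial.loc[83, "Pclass"]
print(row_index_83_age)
print(row_index_83_pclass)
Result:
28.0
1
loc[row_label, column_label] pinpoints a single attribute of a specific row.
new_titanic_survival = titanic_survial.sort_values("Age", ascending=False)
#print(new_titanic_survival[0:10])
titanic_reindexed = new_titanic_survival.reset_index(drop=True)  # rebuild the index as 0, 1, 2, ... and discard the old labels
print(titanic_reindexed.iloc[0:10])
loc and iloc are different: loc selects rows by index label, while iloc selects rows by integer position. They pick the same rows only when the labels are the default 0, 1, 2, ... in order, as they are after reset_index; even then, a loc slice includes its end label while an iloc slice excludes its end position.
Result: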
   PassengerId  Survived  Pclass                                  Name   Sex  \
0          631         1       1  Barkworth, Mr. Algernon Henry Wilson  male
1          852         0       3                   Svensson, Mr. Johan  male
2          494         0       1               Artagaveytia, Mr. Ramon  male
3           97         0       1             Goldschmidt, Mr. George B  male
4          117         0       3                  Connors, Mr. Patrick  male
5          673         0       2           Mitchell, Mr. Henry Michael  male
6          746         0       1          Crosby, Capt. Edward Gifford  male
7           34         0       2                 Wheadon, Mr. Edward H  male
8           55         0       1        Ostby, Mr. Engelhart Cornelius  male
9          281         0       3                      Duane, Mr. Frank  male

    Age  SibSp  Parch      Ticket     Fare  Cabin  Embarked
0  80.0      0      0       27042  30.0000    A23         S
1  74.0      0      0      347060   7.7750    NaN         S
2  71.0      0      0    PC 17609  49.5042    NaN         C
3  71.0      0      0    PC 17754  34.6542     A5         C
4  70.5      0      0      370369   7.7500    NaN         Q
5  70.0      0      0  C.A. 24580  10.5000    NaN         S
6  70.0      1      1   WE/P 5735  71.0000    B22         S
7  66.0      0      0  C.A. 24579  10.5000    NaN         S
8  65.0      0      1      113509  61.9792    B30         C
9  65.0      0      0      336439   7.7500    NaN         Q
def hundredth_row(column):
    # Extract the hundredth item
    hundredth_item = column.iloc[99]
    return hundredth_item
# Return the hundredth item from each column
hundredth_row = titanic_survial.apply(hundredth_row)
print(hundredth_row)
Result:
PassengerId                  100
Survived                       0
Pclass                         2
Name           Kantor, Mr. Sinai
Sex                         male
Age                           34
SibSp                          1
Parch                          0
Ticket                    244367
Fare                          26
Cabin                        NaN
Embarked                       S
A custom function picks out the 100th row; it has to be passed to apply, which calls it once for each column.
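The last two steps, re-indexing after a sort and applying a custom function per column, can be sketched on a four-row made-up frame. The names demo, by_age, reindexed and second_item are hypothetical; second_item plays the role of hundredth_row on data that is too small to have a 100th row.

```python
import pandas as pd

# Made-up data with the default index 0..3.
demo = pd.DataFrame({"Name": ["w", "x", "y", "z"],
                     "Age": [30.0, 80.0, 25.0, 71.0]})

by_age = demo.sort_values("Age", ascending=False)  # labels travel with rows
reindexed = by_age.reset_index(drop=True)          # labels become 0, 1, 2, 3 again

print(by_age.index.tolist())      # old labels in sorted order
print(by_age.iloc[0]["Name"])     # position 0: the oldest row
print(by_age.loc[0]["Name"])      # label 0: the row that was first before sorting
print(reindexed.loc[0]["Name"])   # after reset_index, label 0 is the oldest row

# loc slices include the end label; iloc slices exclude the end position.
print(len(reindexed.loc[0:2]), len(reindexed.iloc[0:2]))

def second_item(column):
    # apply passes each column to the function as a Series
    return column.iloc[1]

second_row = reindexed.apply(second_item)  # one value per column
print(second_row)
```

By default apply works column by column (axis=0), which is why the function receives a whole column and the result is indexed by the column names.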