Recommender Systems: Data Processing with Pandas

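1. Pandas

Pandas is a package that gives us access to high-performance, easy-to-use tools and data structures for data analysis in Python.
Pure-Python loops are slow compared with compiled code. Pandas works around this by implementing its performance-critical routines in C (largely through Cython and NumPy). It also provides Series and DataFrame, two powerful and user-friendly data structures modeled on the data frame of the R statistical language.
Pandas also makes it easy to import data from external files into the Python environment. It has readers for formats such as JSON, CSV, HDF5, SQL, and XLSX; NPY arrays are loaded with NumPy and can then be wrapped in a DataFrame.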
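As a minimal sketch of the import step, the snippet below reads the same small table from CSV text and from JSON text. The in-memory strings and the column names (`title`, `minutes`) are assumptions made up for illustration; with real files you would pass a path instead of an `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "title,minutes\nJumanji,104\nHeat,170\n"
json_text = '[{"title": "Jumanji", "minutes": 104}, {"title": "Heat", "minutes": 170}]'

# Each reader returns a DataFrame, whatever the source format.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Both calls yield a two-row DataFrame with the same columns, so the rest of an analysis does not need to care which format the data came from.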

2. Using Pandas for data processing

Dataset: https://www.kaggle.com/rounakbanik/the-movies-dataset/downloads/movies_metadata.csv/7

    import pandas as pd

    # Read the CSV file into a pandas DataFrame
    df = pd.read_csv('C:/Users/Administrator/Desktop/RecoSys/data/movies_metadata.csv')
    df.head()   # show the first five rows
    type(df)    # type of 'df' -> pandas.core.frame.DataFrame

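Pandas - the DataFrame data structure

A DataFrame is a data structure in the pandas library for Python. It is a two-dimensional table, much like an Excel sheet. It also resembles a MATLAB matrix, except that a MATLAB matrix holds only numeric values (MATLAB cell arrays can hold mixed types), whereas DataFrame cells can hold numbers, strings, and other types, as in Excel. A DataFrame also carries column names (columns) and row labels (index), so data can be accessed by position, as in MATLAB, or by row and column label.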
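To make the two access styles concrete, here is a small sketch with a toy table. The rows and values are illustrative assumptions, not data read from the movies file.

```python
import pandas as pd

# Toy table mixing numeric and string columns; values are illustrative.
movies = pd.DataFrame(
    {
        "budget": [65_000_000.0, 30_000_000.0],
        "status": ["Released", "Released"],
    },
    index=["Jumanji", "Toy Story"],  # row labels (the index)
)

by_position = movies.iloc[0, 0]             # first row, first column
by_label = movies.loc["Jumanji", "budget"]  # row label, column name

print(by_position, by_label)
```

`iloc` takes integer positions and `loc` takes labels; here both expressions address the same cell.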

Output the shape of the DataFrame (rows, columns):

    df.shape
    >>> (45466, 24)

Output the columns of the DataFrame:

    df.columns
    >>> Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
               'imdb_id', 'original_language', 'original_title', 'overview',
               'popularity', 'poster_path', 'production_companies',
               'production_countries', 'release_date', 'revenue', 'runtime',
               'spoken_languages', 'status', 'tagline', 'title', 'video',
               'vote_average', 'vote_count'],
              dtype='object')

Output a single row of the DataFrame:

    df.iloc[1]
    >>>
    adult                                                                False
    belongs_to_collection                                                  NaN
    budget                                                            65000000
    genres                   [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
    homepage                                                               NaN
    id                                                                    8844
    imdb_id                                                          tt0113497
    original_language                                                       en
    original_title                                                     Jumanji
    overview                 When siblings Judy and Peter discover an encha...
    popularity                                                         17.0155
    poster_path                               /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg
    production_companies     [{'name': 'TriStar Pictures', 'id': 559}, {'na...
    production_countries     [{'iso_3166_1': 'US', 'name': 'United States o...
    release_date                                                    1995-12-15
    revenue                                                        2.62797e+08
    runtime                                                                104
    spoken_languages         [{'iso_639_1': 'en', 'name': 'English'}, {'iso...
    status                                                            Released
    tagline                          Roll the dice and unleash the excitement!
    title                                                              Jumanji
    video                                                                False
    vote_average                                                           6.9
    vote_count                                                            2413
    Name: 1, dtype: object

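Change how rows are indexed by setting the title column as the index, for example:

    # Change the index to the title
    df = df.set_index('title')

    # Access the movie with title 'Jumanji'
    jum = df.loc['Jumanji']
    jum
    >>> the same row as df.iloc[1] (title is now the index label rather than a field)

    # Revert to the default integer index
    df = df.reset_index()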

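Create a DataFrame with fewer features:

    # Create a smaller dataframe with a subset of all features.
    # .copy() makes it independent of df, so later column assignments
    # do not raise SettingWithCopyWarning.
    small_df = df[['title', 'release_date', 'budget', 'revenue', 'runtime', 'genres']].copy()

    # Show only the first five rows; pass a number, e.g. head(10), to show more
    small_df.head()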

Check the data type of each feature (pandas infers a dtype for each column when it reads the file):

    # Get information of the data types of each feature
    small_df.info()
    >>>
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 45466 entries, 0 to 45465
    Data columns (total 6 columns):
    title           45460 non-null object
    release_date    45379 non-null object
    budget          45466 non-null object
    revenue         45460 non-null float64
    runtime         45203 non-null float64
    genres          45466 non-null object
    dtypes: float64(2), object(4)
    memory usage: 2.1+ MB

Convert the data type of a feature manually. The budget column was read as object because a few entries are not numeric, so those entries are turned into NaN:

    # Import the numpy library
    import numpy as np

    # Function to convert to float manually; unparseable values become NaN
    def to_float(x):
        try:
            x = float(x)
        except (TypeError, ValueError):
            x = np.nan
        return x

    # Apply the to_float function to all values in the budget column
    small_df['budget'] = small_df['budget'].apply(to_float)

    # Try converting to float using pandas astype
    small_df['budget'] = small_df['budget'].astype('float')

    # Get the data types for all features
    small_df.info()
    >>>
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 45466 entries, 0 to 45465
    Data columns (total 6 columns):
    title           45460 non-null object
    release_date    45379 non-null object
    budget          45463 non-null float64
    revenue         45460 non-null float64
    runtime         45203 non-null float64
    genres          45466 non-null object
    dtypes: float64(3), object(3)
    memory usage: 2.1+ MB
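The same kind of conversion can also be sketched with `pd.to_numeric` and `errors="coerce"`, which replaces unparseable entries with NaN in one call. This is an alternative to the manual helper, not the article's method, and the sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical raw column: numeric strings mixed with junk and a missing value.
raw = pd.Series(["30000000", "0", "/poster.jpg", None], dtype=object)

# errors="coerce" turns anything that cannot be parsed into NaN.
budget = pd.to_numeric(raw, errors="coerce")

print(budget)
print(budget.isna().sum(), "values could not be converted")
```

Counting the NaN values afterwards is a quick check on how many entries were lost in the conversion, in the same way the non-null count for budget drops after the manual conversion.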

Sort by a feature:

    # Sort movies based on revenue (in descending order)
    small_df = small_df.sort_values('revenue', ascending=False)

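Pandas - the Series data structure

A Series is a one-dimensional labeled array that can hold data of any type. You can think of it as a Python list whose elements carry labels. When we accessed Jumanji with iloc or loc above, the object returned was a Series.

    type(small_df.iloc[1])
    >>> pandas.core.series.Series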
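A short sketch of what "labeled array" means in practice. The labels and values here are illustrative assumptions rather than rows read from the dataset.

```python
import pandas as pd

# Illustrative values and labels, not read from the dataset.
runtimes = pd.Series(
    [81.0, 104.0, 101.0],
    index=["Toy Story", "Jumanji", "Grumpier Old Men"],
    name="runtime",
)

by_label = runtimes["Jumanji"]          # lookup by index label
by_position = runtimes.iloc[0]          # lookup by integer position
hours = runtimes / 60                   # arithmetic applies element-wise
long_films = runtimes[runtimes > 100]   # a boolean mask keeps the labels

print(by_label, by_position)
print(long_films)
```

Unlike a plain list, the Series keeps its labels through arithmetic and filtering, which is what lets a row or column pulled out of a DataFrame stay self-describing.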

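Earlier we also used the .apply() and .astype() methods. Those calls were in fact operating on a Series object, namely the budget column.

Like DataFrame, Series has its own set of very useful methods that make data analysis a breeze.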

Return summary statistics for a feature, such as the extremes, mean, median, and percentiles:

    runtime = small_df['runtime']
    print(runtime.max())
    print(runtime.min())
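The other statistics mentioned above can be sketched in the same way. The runtime values below are made up for illustration; on the real column the calls are identical.

```python
import pandas as pd

# Illustrative runtimes in minutes, with one missing value.
runtime = pd.Series([81.0, 104.0, 101.0, 127.0, None], name="runtime")

# Missing values are skipped by default in these reductions.
print(runtime.mean())         # arithmetic mean
print(runtime.median())       # median
print(runtime.quantile(0.9))  # 90th percentile

# describe() bundles count, mean, std, min, quartiles, and max.
summary = runtime.describe()
print(summary)
```

`describe()` is often the quickest first look at a numeric column, since its count also reveals how many values are missing.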

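Count how often each value of a feature occurs:

    # small_df has no 'year' column yet, so derive one from release_date first
    # (an assumed step; unparseable dates become NaN)
    small_df['year'] = pd.to_datetime(small_df['release_date'], errors='coerce').dt.year
    small_df['year'].value_counts()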
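A small sketch of how `value_counts` behaves, including two of its options. The year values are illustrative assumptions.

```python
import pandas as pd

# Illustrative release years, with one missing value.
years = pd.Series([1995, 1995, 1999, None], name="year")

counts = years.value_counts()                    # missing values are dropped
with_missing = years.value_counts(dropna=False)  # NaN gets its own row
shares = years.value_counts(normalize=True)      # fractions instead of counts

print(counts)
print(with_missing)
print(shares)
```

The result is itself a Series, indexed by the distinct values and sorted by frequency in descending order.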

