Exploratory Data Analysis
When we hear about data science or analytics, the first things that come to mind are modelling, tuning, and so on. But one of the most important steps, and one that comes before all of these, is Exploratory Data Analysis, or EDA.
[Figure: Exploratory data analysis (machine learning process steps)]
Why EDA
One of the major problems data scientists and analysts face today is data quality. Because we rely on multiple sources for data, its quality is often compromised. The quality of the data determines the quality of the models we build on it. As the adage goes: garbage in, garbage out. This holds very true in data science.
We cannot build the Empire State Building or the Burj Khalifa on a shaky foundation!
That is why data gathering and data preparation are commonly said to take 60–80% of a data scientist's time.
When we work with data, EDA is one of the most important steps. It is important to gather as much information and insight from the data as we can before processing it, and EDA is how we do that. EDA also helps us analyse the underlying trends and patterns in the data and formulate our problem statement more precisely.
“Well begun is half done.”
Exploratory Data Analysis helps us understand the data better and hear what the data is saying. This can be done through visual analysis as well as a few other, non-graphical analyses. EDA also helps distinguish between what should be pursued further and what is not worth following up.
Exploratory Data Analysis
Let’s walk through the steps of exploratory data analysis using a bank loan data set.
Import the Libraries:
To perform the initial analysis, we need libraries such as NumPy, Pandas, Seaborn, and Matplotlib. NumPy is an array-processing package for numerical computation. Pandas is used for data manipulation and analysis. Matplotlib is a general-purpose plotting library, and Seaborn is a statistical visualization library built on top of it.
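As a minimal sketch, the four libraries can be imported as follows. The aliases `np`, `pd`, `plt`, and `sns` are the common community conventions, not something the article prescribes.

```python
# Conventional aliases; the article does not specify which ones it uses.
import numpy as np               # array processing and numerical computation
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # general-purpose plotting
import seaborn as sns            # statistical visualization built on Matplotlib
```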
Import Dataset:
The data is stored in CSV format, so we import it with pd.read_csv.
The imported data is stored in the bankloan_df DataFrame.
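A minimal sketch of this step is shown here. The article does not give the file name or the columns of the bank loan data set, so the CSV text and its column names (`age`, `income`, `education`, `loan_amount`, `defaulted`) are hypothetical stand-ins; with a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the bank loan CSV file; the real file name and
# columns are not given in the article.
csv_text = """age,income,education,loan_amount,defaulted
25,32000,Graduate,5000,0
41,58000,Undergraduate,12000,0
37,,Graduate,8000,1
52,91000,Postgraduate,20000,0
"""

# pd.read_csv accepts a file path or any file-like object.
bankloan_df = pd.read_csv(io.StringIO(csv_text))

# Look at the first few rows to confirm the data loaded as expected.
print(bankloan_df.head())
```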
Information of data set:
.info() displays a summary of the DataFrame.
It shows the column names, the number of rows and columns, the data types, and so on, which gives an idea of what kind of data we are dealing with. It is important to understand whether a column represents a categorical or a numerical variable and, if it is categorical, whether it is ordinal or nominal. Each of these data types needs to be treated differently, which I will explain in another post. You can use .astype to change the data type of a column.
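The following sketch shows `.info()` and `.astype` on a small made-up DataFrame. The column names and the ordering of the education levels are assumptions chosen for illustration, not taken from the article's data set.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample; column names are illustrative only.
bankloan_df = pd.DataFrame({
    "age": [25, 41, 37, 52],
    "education": ["Graduate", "Undergraduate", "Graduate", "Postgraduate"],
    "defaulted": [0, 0, 1, 0],
})

# Prints column names, non-null counts, dtypes, and memory usage.
bankloan_df.info()

# Nominal variable: categories with no inherent order.
bankloan_df["defaulted"] = bankloan_df["defaulted"].astype("category")

# Ordinal variable: categories with an explicit order (assumed here).
education_levels = CategoricalDtype(
    categories=["Undergraduate", "Graduate", "Postgraduate"], ordered=True
)
bankloan_df["education"] = bankloan_df["education"].astype(education_levels)
```

With an ordered categorical, comparisons and `min()`/`max()` follow the declared order rather than alphabetical order.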
If you only need the number of rows and columns, use .shape.
To see the data type of each column, use bankloan_df.dtypes.
To count the null values in each column, use bankloan_df.isnull().sum().
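The three checks above can be sketched together as follows, again on a small hypothetical DataFrame whose column names are assumptions rather than the article's actual columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing income value.
bankloan_df = pd.DataFrame({
    "age": [25, 41, 37, 52],
    "income": [32000.0, 58000.0, np.nan, 91000.0],
    "education": ["Graduate", "Undergraduate", "Graduate", "Postgraduate"],
})

n_rows, n_cols = bankloan_df.shape        # (number of rows, number of columns)
column_types = bankloan_df.dtypes         # data type of each column
null_counts = bankloan_df.isnull().sum()  # number of missing values per column

print(n_rows, n_cols)
print(column_types)
print(null_counts)
```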
Descriptive Analysis:
.describe() is used for descriptive analysis. For numerical columns it reports the count, mean, standard deviation, minimum, maximum, and quartiles, from which the interquartile range can be derived. Comparing the mean with the median (the 50% row) gives a first indication of how skewed the data is.
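As an illustration, the sketch below runs `.describe()` on made-up numbers and derives the interquartile range from its output. The use of `.skew()` to quantify skewness is an addition for illustration and is not mentioned in the article.

```python
import pandas as pd

# Hypothetical sample; one very large income makes the column right-skewed.
bankloan_df = pd.DataFrame({
    "age": [25, 41, 37, 52, 29, 46],
    "income": [32000, 58000, 45000, 91000, 38000, 250000],
})

# count, mean, std, min, 25%, 50%, 75%, max for each numerical column.
summary = bankloan_df.describe()
print(summary)

# Interquartile range, derived from the quartiles that describe() reports.
iqr = summary.loc["75%"] - summary.loc["25%"]

# Sample skewness per column; positive values indicate a longer right tail.
skewness = bankloan_df.skew()
print(skewness)
```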
For categorical variables, we use groupby to check how well the different groups are represented, that is, whether any group is over-represented or under-represented relative to the others. If the target variable shows such an imbalance, we need to treat it with techniques such as SMOTE.
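A minimal sketch of this check, assuming a hypothetical `education` feature and a binary `defaulted` target. `value_counts(normalize=True)` is used here as a convenient complement to `groupby`; it is not named in the article.

```python
import pandas as pd

# Hypothetical sample with an imbalanced target column.
bankloan_df = pd.DataFrame({
    "education": ["Graduate", "Undergraduate", "Graduate", "Postgraduate",
                  "Graduate", "Undergraduate", "Graduate", "Graduate"],
    "defaulted": [0, 0, 1, 0, 0, 0, 0, 0],
})

# Number of rows in each category of a feature.
group_sizes = bankloan_df.groupby("education").size()
print(group_sizes)

# Share of each class in the target; a very small share signals imbalance.
target_share = bankloan_df["defaulted"].value_counts(normalize=True)
print(target_share)
```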
Graphical analysis:
Graphs are an important tool for understanding how data is distributed. We use different graphs for univariate, bivariate, and multivariate analysis. Seaborn is a very good library for exploring different kinds of graphs. I will explain a few very common ones here and will write a detailed post about graphs later.
Univariate analysis: analysis in which we consider only one variable. Examples of univariate graphs are the count plot and the box plot.
Countplot: a count plot shows the number of observations in each category using bars.
Boxplot: a box plot (or box-and-whisker plot) shows the distribution of quantitative data. The box shows the quartiles of the data set, while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the interquartile range.
We also use box plots to identify outliers.
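The two univariate plots can be sketched with Seaborn as follows. The data, the column names, and the output file name are hypothetical. The 1.5 × IQR rule at the end is one common convention for flagging outliers and matches the default whisker length of Seaborn's box plot; the article does not say which rule it uses.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample; one income value is far larger than the rest.
bankloan_df = pd.DataFrame({
    "education": ["Graduate", "Undergraduate", "Graduate", "Postgraduate",
                  "Graduate", "Undergraduate", "Graduate", "Graduate"],
    "income": [32000, 58000, 45000, 91000, 38000, 250000, 52000, 47000],
})

fig, (ax_count, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
sns.countplot(data=bankloan_df, x="education", ax=ax_count)  # bar per category
sns.boxplot(data=bankloan_df, x="income", ax=ax_box)         # quartiles and outliers
fig.savefig("univariate_plots.png")  # hypothetical output file name

# Flag outliers numerically with the 1.5 * IQR rule.
q1 = bankloan_df["income"].quantile(0.25)
q3 = bankloan_df["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (bankloan_df["income"] < q1 - 1.5 * iqr) | (bankloan_df["income"] > q3 + 1.5 * iqr)
outliers = bankloan_df[is_outlier]
print(outliers)
```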
Bivariate analysis plots the relationship between two variables, while multivariate analysis uses graphs to represent the relationships among more than two variables.
Pairplot is a bivariate graph that plots every pair of numerical variables in a data set against each other, which makes it useful for analysing the relationships between variables. This is an important step before model building.
Correlation
Correlation is another important step of EDA. While building a model, it is important to understand whether any correlation exists among the independent variables, and between each independent variable and the dependent variable. This also helps in feature selection and elimination.
Values close to +1 or -1 indicate strongly correlated variables, while values close to 0 indicate little linear correlation. The values on the diagonal are the correlation of each variable with itself and are always +1.
Correlation graphs can be produced from the correlation matrix, for example as a heatmap.
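The article's original snippet is not available, so the following is only a sketch of one common approach: compute the correlation matrix with `DataFrame.corr()` and draw it with `sns.heatmap`. The data, column names, colour map, and output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numerical sample.
bankloan_df = pd.DataFrame({
    "age": [25, 41, 37, 52, 29, 46],
    "income": [32000, 58000, 45000, 91000, 38000, 76000],
    "loan_amount": [5000, 12000, 8000, 20000, 6000, 15000],
})

# Pairwise Pearson correlation between all numerical columns.
corr_matrix = bankloan_df.corr()
print(corr_matrix)

# Annotated heatmap with the colour scale fixed to the full range [-1, +1].
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.savefig("correlation_heatmap.png")  # hypothetical output file name
```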
These are the first few steps of exploratory data analysis. Based on the findings of each step, one can take appropriate action to improve data quality, analyse trends, or treat missing values, outliers, and anomalies.
“Information is the oil of the 21st century, and analytics is the combustion engine.” - Peter Sondergaard, Gartner Research
Original article: https://medium.com/@viveksmenon/exploratory-data-analysis-d464f3adb777