Training a Network to Classify Income: Adult Census Income Dataset
We have all heard that data science is the 'sexiest job of the 21st century'. It may therefore be surprising that the concept of neural networks was laid down more than half a century ago, long before the world was awash in data. Even before the term 'machine learning' was coined, Donald Hebb created a model based on brain-cell interaction in his 1949 book The Organization of Behavior, which presents his theories on neuron excitement and communication between neurons.
Hebb wrote, "When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell." Translated to artificial neural networks and artificial neurons, his model can be described as a way of altering the relationships between artificial neurons (also referred to as nodes) based on the changes to individual neurons. Arthur Samuel of IBM, who began his checkers-playing program in 1952, coined the phrase "machine learning" in 1959.
Analyzing the data
The dataset, named Adult Census Income, is available on Kaggle and in the UCI Machine Learning Repository. It was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The prediction task is to determine whether a person makes over $50K a year.
Dataset: https://www.kaggle.com/uciml/adult-census-income
Using the Python language and several visualizations, I have attempted to fit four machine learning models and find the one that best describes the data.
There are three steps to working with data: Data, Discovery, Deployment.
DATA
   age workclass  fnlwgt     education  education.num marital.status
0   90         ?   77053       HS-grad              9        Widowed
1   82   Private  132870       HS-grad              9        Widowed
2   66         ?  186061  Some-college             10        Widowed
3   54   Private  140359       7th-8th              4       Divorced
4   41   Private  264663  Some-college             10      Separated

          occupation   relationship   race     sex  capital.gain
0                  ?  Not-in-family  White  Female             0
1    Exec-managerial  Not-in-family  White  Female             0
2                  ?      Unmarried  Black  Female             0
3  Machine-op-inspct      Unmarried  White  Female             0
4     Prof-specialty      Own-child  White  Female             0

   capital.loss  hours.per.week native.country income
0          4356              40  United-States  <=50K
1          4356              18  United-States  <=50K
2          4356              40  United-States  <=50K
3          3900              40  United-States  <=50K
4          3900              40  United-States  <=50K
DISCOVERY
Data preprocessing
The discovery phase is where we attempt to understand the data; it might require cleaning, transformation, and integration.
The dataset contained null values in both numerical and categorical columns. The categorical values were both nominal and ordinal, and the data had redundant columns as well.
Since missing values were represented by '?', they were replaced with NaN values and the affected rows were removed. The dependent column to be predicted, 'income', was encoded as 0 and 1, converting this into a dichotomous classification problem. There was one redundant column, 'education.num', an ordinal representation of 'education', which was removed.
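These cleaning steps can be sketched as follows (a minimal sketch, assuming the Kaggle CSV column names; the helper name `preprocess` is my own):

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # '?' marks a missing value; turn it into NaN and drop affected rows.
    df = df.replace("?", np.nan).dropna()
    # Encode the target as 0/1, making this a dichotomous classification task.
    df["income"] = (df["income"] == ">50K").astype(int)
    # 'education.num' is an ordinal duplicate of 'education'.
    return df.drop(columns=["education.num"])
```

Applied to the raw frame, e.g. `clean = preprocess(pd.read_csv("adult.csv"))`.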
Now that unnecessary data points and redundant attributes have been removed, we need to select the set of attributes that actually contribute to predicting income.
To check the correlation between a binary variable and continuous variables, the point-biserial correlation was used. After applying the test, 'fnlwgt', which showed a negative correlation, was dropped.
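A sketch of that check, using `scipy.stats.pointbiserialr` (the helper name and column list are illustrative assumptions):

```python
from scipy.stats import pointbiserialr
import pandas as pd

def point_biserial_scores(df: pd.DataFrame, target: str, cols) -> dict:
    """Correlate each continuous column with the 0/1 target column."""
    scores = {}
    for c in cols:
        r, _pvalue = pointbiserialr(df[target], df[c])  # r lies in [-1, 1]
        scores[c] = r
    return scores
```

Columns with a negative r (here, 'fnlwgt') become candidates for dropping.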
For feature selection, all numerical columns except 'fnlwgt' are kept. For categorical variables, the chi-square statistic is used; it measures the association between two categorical variables.
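The chi-square scoring can be sketched with scikit-learn; features are one-hot encoded first because `chi2` expects non-negative inputs (the helper name is an assumption):

```python
import pandas as pd
from sklearn.feature_selection import chi2

def chi2_scores(df: pd.DataFrame, target: str, cat_cols) -> dict:
    # One-hot encode so every feature is a non-negative indicator.
    X = pd.get_dummies(df[cat_cols])
    scores, _pvalues = chi2(X, df[target])
    return dict(zip(X.columns, scores))
```

Higher scores indicate a stronger association with the target.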
First, the categorical variables are encoded and the numerical values are normalized to [0, 1]. This simply puts all the data on the same scale: if the scales of different features are wildly different, this can have a knock-on effect on the model's ability to learn (depending on the method being used). Standardizing feature values implicitly weights all features equally in the representation.
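A sketch of the encoding and scaling step (the helper name and column grouping are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def encode_and_scale(df: pd.DataFrame, num_cols, cat_cols) -> pd.DataFrame:
    # One-hot encode the categoricals, then min-max scale numerics to [0, 1].
    X = pd.get_dummies(df[cat_cols])
    X[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
    return X
```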
After encoding there were 103 attributes, including the numerical variables; after feature selection, 65 attributes remain.
This dataset contains a typical example of class imbalance, as shown in the following charts.
Visualization
The pie chart clearly shows that one class accounts for more than 50% of the dataset. This problem is handled using SMOTE (Synthetic Minority Oversampling Technique).
DEPLOYMENT
As mentioned above, four models are fitted. The train-test split is 80-20 for logistic regression and naive Bayes, and 70-30 for the decision tree and random forest.
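The splits and fits can be sketched as follows (scikit-learn defaults are assumed for hyperparameters; the helper name is my own):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def fit_models(X, y, random_state=42):
    """Fit the four models, each with its own train/test split."""
    specs = [
        ("logistic", LogisticRegression(max_iter=1000), 0.20),
        ("naive_bayes", GaussianNB(), 0.20),
        ("decision_tree", DecisionTreeClassifier(random_state=random_state), 0.30),
        ("random_forest", RandomForestClassifier(random_state=random_state), 0.30),
    ]
    results = {}
    for name, model, test_size in specs:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=random_state)
        model.fit(X_tr, y_tr)
        results[name] = (model, model.score(X_te, y_te))  # test accuracy
    return results
```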
Logistic Regression
[Figure: the sigmoid function, in orange]

The foremost model for predicting a dichotomous variable is logistic regression. The logistic function is a sigmoid: it takes any real input t and outputs a value between zero and one, which can be interpreted as a probability.
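The logistic function itself is one line:

```python
import math

def sigmoid(t: float) -> float:
    # Maps any real t to (0, 1); interpretable as a probability.
    return 1.0 / (1.0 + math.exp(-t))
```

For example, `sigmoid(0)` is exactly 0.5, and large positive or negative inputs saturate toward 1 and 0.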
Naive Bayes
A naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable. Basically, it’s “naive” because it makes assumptions that may or may not turn out to be correct.
Decision Tree
A decision tree is a branched flowchart showing multiple pathways for potential decisions and outcomes. The tree starts with what is called a decision node, which signifies that a decision must be made. From the decision node, a branch is created for each of the alternative choices under consideration.
Random Forest
Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest. The basic principle is that a group of “weak learners” can come together to form a “strong learner”.
An ROC curve is then constructed for each model.
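One way to construct it, with scikit-learn and matplotlib (the helper and argument names are assumptions; `model`, `X_test`, `y_test` come from the fitting step):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(model, X_test, y_test, label):
    # Use the positive-class probability as the ranking score.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _thresholds = roc_curve(y_test, scores)
    auc = roc_auc_score(y_test, scores)
    plt.plot(fpr, tpr, label=f"{label} (AUC={auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    return auc
```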
A comparative study of the above models with respect to accuracy, precision, recall, and ROC score is computed to support the final decision.
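The comparison table can be computed per model roughly as follows (the helper name is an assumption):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

def score_model(model, X_test, y_test) -> dict:
    """Collect the four comparison metrics for one fitted classifier."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    }
```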
From the table above, random forest gives the best accuracy and ROC score.
All the ROC curves are shown below.
Random forest covers the largest area under the curve and is hence the better model. I have not tried neural networks on this problem: with only 30K-plus data points, I felt a network would overfit the data. To improve further, more complex ensemble methods could be used. Also, according to Ockham's Razor, "the simplest explanation is most likely the right one".
A detailed report on the project is available in my Kaggle notebook.
Please let me know if there is any part I could have done better.
Thanks for Reading!!
Source: https://medium.com/analytics-vidhya/training-a-network-to-classify-income-adult-census-income-dataset-79a7472e6eb7