Classifying the Iris Plant Dataset with the OneR Algorithm
Introducing the Dataset
Iris is a plant classification dataset containing 150 samples. Each sample has four features: sepal length, sepal width, petal length, and petal width (the length and width of the sepal and petal, respectively), all measured in cm.
The dataset has three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. Our goal is to predict a plant's species from its features.
This dataset ships with sklearn, so we start by loading it:
```python
# Load our dataset
import numpy as np
from sklearn.datasets import load_iris
# X, y = np.loadtxt("X_classification.txt"), np.loadtxt("y_classification.txt")

dataset = load_iris()
X = dataset.data
y = dataset.target
print(dataset.DESCR)
n_samples, n_features = X.shape
```

Part of the output is shown below:
Iris Plants Database
Notes
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
Data Preprocessing
To apply the OneR algorithm, we first need to do some preprocessing.
The feature values in the dataset are continuous, meaning there are infinitely many possible values. Measured data looks like this: a reading might be 1, 1.2, 1.25, and so on. Another property of continuous values is that two values being close implies high similarity: a plant with a sepal 1.2 cm long is very similar to one with a sepal 1.25 cm long.
Class values, by contrast, are discrete. Although classes are commonly represented with numbers, the similarity of two classes cannot be judged by comparing those numbers. The Iris dataset uses different numbers for different classes: 0, 1, and 2 stand for Iris Setosa, Iris Versicolour, and Iris Virginica, respectively.
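For reference, the numeric labels and the species names they encode can be read straight off the Bunch object that load_iris returns (a small illustrative check, not part of the book's code):

```python
# Species names indexed by the numeric class labels 0, 1, 2
print(dataset.target_names)  # ['setosa' 'versicolor' 'virginica']
print(np.unique(y))          # [0 1 2]
```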
The dataset's features are continuous, but the algorithm we are about to use requires categorical features, so we need to convert the continuous values to categorical ones. This process is called discretization.
The simplest discretization method is to pick a threshold: feature values below the threshold become 0, and values at or above it become 1.
```python
# Compute the mean for each attribute
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')
```

attribute_means is an array of length 4, matching the number of features. Its first entry is the mean of the first feature, and so on. We then use these means as per-feature thresholds to discretize the dataset, converting the continuous features into categorical ones. The result is a (150, 4) array of rows such as [0, 1, 0, 0] and [1, 1, 1, 0].
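To see what the discretization produces, it helps to inspect the thresholds and a few transformed rows (an illustrative check; the printed means are approximate):

```python
# Per-feature mean thresholds and the first few discretized samples
print(attribute_means)  # approx. [5.84 3.05 3.76 1.20]
print(X_d[:3])          # each row is now four 0/1 values
```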
Implementing the OneR Algorithm
The idea behind OneR is simple: classify a sample according to which class individuals with the same feature value most often belong to in the training data. OneR is short for One Rule, meaning we select only the one feature (out of the four) that classifies best and use it as our sole basis for prediction.
The algorithm iterates over every value of every feature. For each feature value, it counts how often that value appears in each class, finds the class in which it appears most often, and records how often it appears in the other classes.
For example, suppose one feature in the dataset can take the values 0 and 1, and the dataset has three classes. Among samples with feature value 0, 20 belong to class A, 60 to class B, and 20 to class C. A sample with this feature value is therefore most likely in class B, but 40 samples with feature value 0 are not in class B. Assigning feature value 0 to class B thus has an error rate of 40%, because 40 such samples belong to classes A and C. Feature value 1 is handled the same way.
After counting every feature value's occurrences in each class, we compute each feature's total error by summing the errors of its individual values. The feature with the lowest total error is chosen as the single classification rule (OneR) used for all subsequent predictions.
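The arithmetic in the example above can be spelled out in a few lines (a minimal sketch using the hypothetical counts, not the real Iris data):

```python
# Hypothetical counts for feature value 0: class A = 20, B = 60, C = 20
class_counts = {'A': 20, 'B': 60, 'C': 20}
most_frequent_class = max(class_counts, key=class_counts.get)  # 'B'
error = sum(count for cls, count in class_counts.items()
            if cls != most_frequent_class)  # 20 + 20 = 40
print(most_frequent_class, error)  # B 40 -> 40/100 = 40% error for this value
```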
Let's implement this in code:
```python
from collections import defaultdict
from operator import itemgetter


def train(X, y_true, feature):
    """Computes the predictors and error for a given feature using the OneR algorithm

    Parameters
    ----------
    X: array [n_samples, n_features]
        The two dimensional array that holds the dataset. Each row is a sample,
        each column is a feature.
    y_true: array [n_samples,]
        The one dimensional array that holds the class values. Corresponds to X,
        such that y_true[i] is the class value for sample X[i].
    feature: int
        An integer corresponding to the index of the feature we wish to test.
        0 <= feature < n_features

    Returns
    -------
    predictors: dictionary mapping feature value -> prediction
        For each sample, if the feature has a given value, make the given prediction.
    error: int
        The number of training samples that this rule incorrectly predicts.
    """
    # Check that feature is a valid index
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all of the unique values that this feature takes
    values = set(X[:, feature])
    # Stores the predictors dictionary that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Compute the total error of using this feature to classify on
    total_error = sum(errors)
    return predictors, total_error


def train_feature_value(X, y_true, feature, value):
    # The four parameters are the dataset, the class array, the index of the
    # chosen feature, and the feature value
    # Create a simple dictionary to count how frequently samples with this
    # feature value fall into each class
    class_counts = defaultdict(int)
    # Iterate through each sample and count the frequency of each class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Now get the best one by sorting (highest first) and choosing the first item
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples that have this feature value
    # but do *not* belong to the most frequent class
    error = sum(class_count for class_value, class_count in class_counts.items()
                if class_value != most_frequent_class)
    return most_frequent_class, error
```
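Before building the full train/test pipeline, one can sanity-check train on a single feature of the discretized data (an illustrative call; feature 0 is an arbitrary choice):

```python
# OneR predictors and total error for feature 0 on the full discretized dataset
predictors, total_error = train(X_d, y, 0)
print(predictors)   # maps each feature value (0 or 1) to a predicted class
print(total_error)  # number of samples this single-feature rule misclassifies
```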
Testing the Algorithm

```python
# Now, we split into a training and test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Set the random state to the same number to get the same results as in the book
random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape))
print("There are {} testing samples".format(y_test.shape))

# Compute all of the predictors
all_predictors = {variable: train(X_train, y_train, variable)
                  for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Now choose the best and save that as "model"
# Sort by error
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))

# Choose the best model
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}


def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted


y_predicted = predict(X_test, model)
# Compute the accuracy by taking the mean of the amounts that y_predicted is equal to y_test
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))
```
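Beyond the single accuracy number, a per-class breakdown can be informative. Here is a short sketch using sklearn.metrics (an addition, not part of the original book code):

```python
from sklearn.metrics import classification_report

# Precision, recall and F1 for each class under the OneR predictions
print(classification_report(y_test, y_predicted))
```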
References

Learning Data Mining with Python (《Python數據挖掘入門與實踐》)