Classifying the Iris Plant Dataset with the OneR Algorithm
Introducing the Dataset
Iris is a plant classification dataset containing 150 samples. Each sample has four features: sepal length, sepal width, petal length, and petal width (the length and width of the sepal and petal, respectively), all measured in cm.
The dataset has three classes: Iris Setosa, Iris Versicolour, and Iris Virginica. Our goal is to predict a plant's species from its features.
This dataset ships with sklearn, so we start by loading it:
```python
# Load our dataset
import numpy as np
from sklearn.datasets import load_iris
# X, y = np.loadtxt("X_classification.txt"), np.loadtxt("y_classification.txt")

dataset = load_iris()
X = dataset.data
y = dataset.target
print(dataset.DESCR)
n_samples, n_features = X.shape
```

Part of the output is shown below:
Iris Plants Database
Notes
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
Data Preprocessing
To apply the OneR algorithm, we first need to do some preprocessing.
The feature values in the dataset are continuous, meaning there are infinitely many possible values. Measured data looks like this: a reading might be 1, 1.2, 1.25, and so on. Another property of continuous values is that two values being close implies high similarity: a plant with a sepal 1.2 cm long is very similar to one with a sepal 1.25 cm long.
Class values, by contrast, are discrete. Although classes are commonly represented with numbers, the similarity of two classes cannot be judged by comparing those numbers. The Iris dataset uses different numbers for different classes: 0, 1, and 2 stand for Iris Setosa, Iris Versicolour, and Iris Virginica, respectively.
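For reference, the numeric labels and the species names they encode can be read straight off the Bunch object that load_iris returns (a small illustrative check, not part of the book's code):

```python
# Species names indexed by the numeric class labels 0, 1, 2
print(dataset.target_names)  # ['setosa' 'versicolor' 'virginica']
print(np.unique(y))          # [0 1 2]
```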
The dataset's features are continuous, but the algorithm we are about to use requires categorical features, so we need to convert the continuous values to categorical ones. This process is called discretization.
The simplest discretization method is to pick a threshold: feature values below the threshold become 0, and values at or above it become 1.
```python
# Compute the mean for each attribute
attribute_means = X.mean(axis=0)
assert attribute_means.shape == (n_features,)
X_d = np.array(X >= attribute_means, dtype='int')
```

attribute_means is an array of length 4, matching the number of features. Its first entry is the mean of the first feature, and so on. We then use these means as per-feature thresholds to discretize the dataset, converting the continuous features into categorical ones. The result is a (150, 4) array of rows such as [0, 1, 0, 0] and [1, 1, 1, 0].
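To see what the discretization produces, it helps to inspect the thresholds and a few transformed rows (an illustrative check; the printed means are approximate):

```python
# Per-feature mean thresholds and the first few discretized samples
print(attribute_means)  # approx. [5.84 3.05 3.76 1.20]
print(X_d[:3])          # each row is now four 0/1 values
```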
Implementing the OneR Algorithm
The idea behind OneR is simple: classify a sample according to which class individuals with the same feature value most often belong to in the training data. OneR is short for One Rule, meaning we select only the one feature (out of the four) that classifies best and use it as our sole basis for prediction.
The algorithm iterates over every value of every feature. For each feature value, it counts how often that value appears in each class, finds the class in which it appears most often, and records how often it appears in the other classes.
For example, suppose one feature in the dataset can take the values 0 and 1, and the dataset has three classes. Among samples with feature value 0, 20 belong to class A, 60 to class B, and 20 to class C. A sample with this feature value is therefore most likely in class B, but 40 samples with feature value 0 are not in class B. Assigning feature value 0 to class B thus has an error rate of 40%, because 40 such samples belong to classes A and C. Feature value 1 is handled the same way.
After counting every feature value's occurrences in each class, we compute each feature's total error by summing the errors of its individual values. The feature with the lowest total error is chosen as the single classification rule (OneR) used for all subsequent predictions.
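The arithmetic in the example above can be spelled out in a few lines (a minimal sketch using the hypothetical counts, not the real Iris data):

```python
# Hypothetical counts for feature value 0: class A = 20, B = 60, C = 20
class_counts = {'A': 20, 'B': 60, 'C': 20}
most_frequent_class = max(class_counts, key=class_counts.get)  # 'B'
error = sum(count for cls, count in class_counts.items()
            if cls != most_frequent_class)  # 20 + 20 = 40
print(most_frequent_class, error)  # B 40 -> 40/100 = 40% error for this value
```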
Let's implement this in code:
```python
from collections import defaultdict
from operator import itemgetter


def train(X, y_true, feature):
    """Computes the predictors and error for a given feature using the OneR algorithm

    Parameters
    ----------
    X: array [n_samples, n_features]
        The two dimensional array that holds the dataset. Each row is a sample,
        each column is a feature.
    y_true: array [n_samples,]
        The one dimensional array that holds the class values. Corresponds to X,
        such that y_true[i] is the class value for sample X[i].
    feature: int
        An integer corresponding to the index of the feature we wish to test.
        0 <= feature < n_features

    Returns
    -------
    predictors: dictionary mapping feature value -> prediction
        For each sample, if the feature has a given value, make the given prediction.
    error: int
        The number of training samples that this rule incorrectly predicts.
    """
    # Check that feature is a valid index
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all of the unique values that this feature takes
    values = set(X[:, feature])
    # Stores the predictors dictionary that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Compute the total error of using this feature to classify on
    total_error = sum(errors)
    return predictors, total_error


def train_feature_value(X, y_true, feature, value):
    # The four parameters are the dataset, the class array, the index of the
    # chosen feature, and the feature value
    # Create a simple dictionary to count how frequently samples with this
    # feature value fall into each class
    class_counts = defaultdict(int)
    # Iterate through each sample and count the frequency of each class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Now get the best one by sorting (highest first) and choosing the first item
    sorted_class_counts = sorted(class_counts.items(), key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples that have this feature value
    # but do *not* belong to the most frequent class
    error = sum(class_count for class_value, class_count in class_counts.items()
                if class_value != most_frequent_class)
    return most_frequent_class, error
```
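Before building the full train/test pipeline, one can sanity-check train on a single feature of the discretized data (an illustrative call; feature 0 is an arbitrary choice):

```python
# OneR predictors and total error for feature 0 on the full discretized dataset
predictors, total_error = train(X_d, y, 0)
print(predictors)   # maps each feature value (0 or 1) to a predicted class
print(total_error)  # number of samples this single-feature rule misclassifies
```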
Testing the Algorithm

```python
# Now, we split into a training and test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Set the random state to the same number to get the same results as in the book
random_state = 14

X_train, X_test, y_train, y_test = train_test_split(X_d, y, random_state=random_state)
print("There are {} training samples".format(y_train.shape))
print("There are {} testing samples".format(y_test.shape))

# Compute all of the predictors
all_predictors = {variable: train(X_train, y_train, variable)
                  for variable in range(X_train.shape[1])}
errors = {variable: error for variable, (mapping, error) in all_predictors.items()}
# Now choose the best and save that as "model"
# Sort by error
best_variable, best_error = sorted(errors.items(), key=itemgetter(1))[0]
print("The best model is based on variable {0} and has error {1:.2f}".format(best_variable, best_error))

# Choose the best model
model = {'variable': best_variable,
         'predictor': all_predictors[best_variable][0]}


def predict(X_test, model):
    variable = model['variable']
    predictor = model['predictor']
    y_predicted = np.array([predictor[int(sample[variable])] for sample in X_test])
    return y_predicted


y_predicted = predict(X_test, model)
# Compute the accuracy by taking the mean of the amounts that y_predicted is equal to y_test
accuracy = np.mean(y_predicted == y_test) * 100
print("The test accuracy is {:.1f}%".format(accuracy))
```
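Beyond the single accuracy number, a per-class breakdown can be informative. Here is a short sketch using sklearn.metrics (an addition, not part of the original book code):

```python
from sklearn.metrics import classification_report

# Precision, recall and F1 for each class under the OneR predictions
print(classification_report(y_test, y_predicted))
```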
References

Learning Data Mining with Python (《Python數據挖掘入門與實踐》)