
[Machine Learning] sklearn datasets: loading, splitting, classification and regression

Published: 2024/7/5


sklearn datasets

1. Splitting a dataset
- 1.1 Loading data
- 1.2 The type returned when loading data
- - For example:
- 1.3 Splitting the dataset
- - For example:
2. sklearn classification datasets
3. sklearn regression datasets

1. Splitting a dataset

A machine learning dataset is usually split into two parts:
Training data: used to train and build the model (classification, regression, clustering)
Test data: used when validating the model, to evaluate whether it is effective
A typical split is 75% training data and 25% test data.

sklearn API for splitting a dataset: sklearn.model_selection.train_test_split
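A minimal sketch of the 75%/25% convention, assuming a made-up toy array of 20 samples (the array, its labels and the seed are illustrative, not from the article). When neither test_size nor train_size is passed, train_test_split holds out 25% of the samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative: 20 samples with 2 features each
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# No test_size given, so the default 25% test split applies
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```

The seed only fixes which rows land in each part; the 15/5 sizes do not depend on it.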

1.1 Loading data

There are two kinds of datasets: small ones bundled with the datasets module that can be loaded directly, and large ones that have to be downloaded.

sklearn.datasets: loads popular datasets
datasets.load_*(): loads a small dataset that ships inside datasets
datasets.fetch_*(data_home=None): fetches a large dataset that must be downloaded from the network; the data_home parameter is the directory the dataset is downloaded to, by default ~/scikit_learn_data/
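As a rough illustration of the load_* family, the sketch below loads the bundled wine dataset, which needs no download. The choice of load_wine and of return_X_y=True is an assumption made for this example, not something the article uses:

```python
from sklearn.datasets import load_wine

# load_* datasets ship with scikit-learn, so no network access is needed.
# return_X_y=True returns the feature matrix and labels directly.
X, y = load_wine(return_X_y=True)

print(X.shape)  # (178, 13)
print(y.shape)  # (178,)
```

A fetch_* function is called the same way, but its first call downloads the files into data_home before loading them.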

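1.2 The type returned when loading data

load_* and fetch_* return a Bunch, a dictionary-like object (sklearn.utils.Bunch in current releases, datasets.base.Bunch in older ones):
data: feature array, a two-dimensional numpy.ndarray of shape [n_samples, n_features]
target: label array, a one-dimensional numpy.ndarray of length n_samples
DESCR: description of the dataset
feature_names: feature names; not every dataset has them (the news data, for example, does not)
target_names: label names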
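The fields listed above can be inspected directly. This is a small sketch using the iris dataset; the variable name iris is arbitrary:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch supports both attribute access and dictionary-style access
print(iris.data is iris["data"])  # True

print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)
print(iris.feature_names)

# Convert NumPy strings to plain str for a clean printout
target_names = [str(name) for name in iris.target_names]
print(target_names)  # ['setosa', 'versicolor', 'virginica']
```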

For example:

sklearn.datasets.load_iris() loads and returns the iris dataset.

Its feature data is a matrix with 150 rows and 4 columns. Loading it looks like this:

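from sklearn.datasets import load_iris

# Load the bundled iris dataset; the result is a Bunch
li = load_iris()
print("Feature values")
print(li.data)
print("Target values")
print(li.target)
# Print the dataset description as well
print(li.DESCR)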

Here li is a Bunch object.
Running the script prints (the feature values are omitted):

Target values
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
  Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
  Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
  (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
  Structure and Classification Rule for Recognition in Partially Exposed
  Environments". IEEE Transactions on Pattern Analysis and Machine
  Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
  on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
  conceptual clustering system finds 3 classes in the data.
- Many, many more ...

Process finished with exit code 0

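The target values are the class labels 0, 1 and 2.

The description also explains the data: this is perhaps the best-known database in the pattern recognition literature. Fisher's paper is a classic in the field and is still cited today (see Duda & Hart, for example). The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant.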
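The class labels and the 50-per-class distribution can be checked with a short sketch like the following (np.unique is just one way to count; variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Distinct class labels and the number of samples in each class
labels, counts = np.unique(iris.target, return_counts=True)
for label, count in zip(labels, counts):
    print(int(label), str(iris.target_names[label]), int(count))
# 0 setosa 50
# 1 versicolor 50
# 2 virginica 50
```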

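1.3 Splitting the dataset

Split the data into a training set and a test set; the feature values must stay aligned with their target values.

sklearn.model_selection.train_test_split(*arrays, **options)
x: the feature values of the dataset
y: the label values of the dataset
test_size: size of the test set, usually a float giving the fraction of samples to hold out
random_state: random seed; different seeds give different random samples, and the same seed gives the same sample
return: training features, test features, training labels, test labels (samples are drawn at random by default)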

For example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

li = load_iris()
# print("Feature values")
# print(li.data)
# print("Target values")
# print(li.target)
# print(li.DESCR)

# Mind the order of the return values: training features, test features,
# training targets, test targets. The order must not be mixed up.
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)
print("Training set features and targets:", x_train, y_train)
print("Test set features and targets:", x_test, y_test)

Output:

D:\softwares\anaconda3\python.exe D:/PycharmProjects/MyTest/Day_0707/__init__.py
Training set features and targets: [[5.4 3.9 1.7 0.4]
 [6.3 2.9 5.6 1.8]
 [6.2 3.4 5.4 2.3]
 [4.6 3.4 1.4 0.3]
 [5.5 2.5 4. 1.3]
 [6.5 3.2 5.1 2. ]
 [5. 3.5 1.3 0.3]
 [5.9 3. 4.2 1.5]
 [5.4 3.7 1.5 0.2]
 [4.7 3.2 1.3 0.2]
 [5.5 2.4 3.8 1.1]
 [5.1 3.8 1.6 0.2]
 [5.8 2.6 4. 1.2]
 [6.3 2.8 5.1 1.5]
 [6.4 2.9 4.3 1.3]
 [5.1 3.8 1.9 0.4]
 [6. 2.2 5. 1.5]
 [6.6 2.9 4.6 1.3]
 [7.1 3. 5.9 2.1]
 [6.2 2.9 4.3 1.3]
 [6.5 3. 5.2 2. ]
 [5.6 2.9 3.6 1.3]
 [4.9 3.1 1.5 0.1]
 [5.1 3.7 1.5 0.4]
 [6.3 3.4 5.6 2.4]
 [4.9 3. 1.4 0.2]
 [5. 3.4 1.5 0.2]
 [5.4 3. 4.5 1.5]
 [6.1 2.8 4. 1.3]
 [5.1 3.5 1.4 0.2]
 [4.8 3.4 1.6 0.2]
 [6.1 3. 4.6 1.4]
 [5.7 2.6 3.5 1. ]
 [5.7 2.5 5. 2. ]
 [6.9 3.1 4.9 1.5]
 [7.7 2.8 6.7 2. ]
 [5.7 2.9 4.2 1.3]
 [5.1 3.8 1.5 0.3]
 [4.6 3.6 1. 0.2]
 [7.7 3. 6.1 2.3]
 [6.4 3.2 5.3 2.3]
 [6.4 2.8 5.6 2.1]
 [5.2 4.1 1.5 0.1]
 [5. 3. 1.6 0.2]
 [6. 2.9 4.5 1.5]
 [6.3 3.3 4.7 1.6]
 [4.9 3.1 1.5 0.2]
 [6.3 3.3 6. 2.5]
 [5.7 4.4 1.5 0.4]
 [4.4 2.9 1.4 0.2]
 [6.7 3. 5.2 2.3]
 [5.2 3.4 1.4 0.2]
 [5.5 3.5 1.3 0.2]
 [4.8 3. 1.4 0.3]
 [6.9 3.1 5.4 2.1]
 [6.3 2.5 5. 1.9]
 [5.8 4. 1.2 0.2]
 [5.1 2.5 3. 1.1]
 [6. 2.2 4. 1. ]
 [5.8 2.7 5.1 1.9]
 [6.7 3.1 4.7 1.5]
 [7.2 3.6 6.1 2.5]
 [6.8 2.8 4.8 1.4]
 [6.1 2.9 4.7 1.4]
 [4.3 3. 1.1 0.1]
 [7. 3.2 4.7 1.4]
 [6.7 3.3 5.7 2.5]
 [5.6 2.7 4.2 1.3]
 [5.2 3.5 1.5 0.2]
 [7.7 2.6 6.9 2.3]
 [6.7 3.3 5.7 2.1]
 [6.7 3.1 5.6 2.4]
 [6.5 2.8 4.6 1.5]
 [5.1 3.3 1.7 0.5]
 [7.2 3. 5.8 1.6]
 [5.8 2.7 4.1 1. ]
 [7.3 2.9 6.3 1.8]
 [5.8 2.8 5.1 2.4]
 [6.4 2.7 5.3 1.9]
 [4.8 3.1 1.6 0.2]
 [7.2 3.2 6. 1.8]
 [5.9 3.2 4.8 1.8]
 [4.5 2.3 1.3 0.3]
 [4.9 2.4 3.3 1. ]
 [5.6 3. 4.5 1.5]
 [5.1 3.5 1.4 0.3]
 [4.9 3.6 1.4 0.1]
 [5. 3.4 1.6 0.4]
 [5. 3.6 1.4 0.2]
 [6. 3.4 4.5 1.6]
 [5.8 2.7 5.1 1.9]
 [4.9 2.5 4.5 1.7]
 [6.3 2.3 4.4 1.3]
 [5.5 2.3 4. 1.3]
 [6.1 3. 4.9 1.8]
 [7.9 3.8 6.4 2. ]
 [5.7 2.8 4.5 1.3]
 [6.7 3.1 4.4 1.4]
 [5.6 2.5 3.9 1.1]
 [6. 3. 4.8 1.8]
 [6.1 2.8 4.7 1.2]
 [6.5 3. 5.8 2.2]
 [5.9 3. 5.1 1.8]
 [4.6 3.2 1.4 0.2]
 [6.4 3.1 5.5 1.8]
 [7.7 3.8 6.7 2.2]
 [7.6 3. 6.6 2.1]
 [5. 3.5 1.6 0.6]
 [6.1 2.6 5.6 1.4]
 [5.3 3.7 1.5 0.2]
 [5. 2. 3.5 1. ]
 [5. 3.3 1.4 0.2]] [0 2 2 0 1 2 0 1 0 0 1 0 1 2 1 0 2 1 2 1 2 1 0 0 2 0 0 1 1 0 0 1 1 2 1 2 1
 0 0 2 2 2 0 0 1 1 0 2 0 0 2 0 0 0 2 2 0 1 1 2 1 2 1 1 0 1 2 1 0 2 2 2 1 0
 2 1 2 2 2 0 2 1 0 1 1 0 0 0 0 1 2 2 1 1 2 2 1 1 1 2 1 2 2 0 2 2 2 0 2 0 1
 0]
Test set features and targets: [[4.6 3.1 1.5 0.2]
 [5.6 3. 4.1 1.3]
 [5.5 4.2 1.4 0.2]
 [7.4 2.8 6.1 1.9]
 [5.6 2.8 4.9 2. ]
 [5. 3.2 1.2 0.2]
 [5.7 3.8 1.7 0.3]
 [4.4 3. 1.3 0.2]
 [6.6 3. 4.4 1.4]
 [6.4 3.2 4.5 1.5]
 [6.3 2.5 4.9 1.5]
 [5.1 3.4 1.5 0.2]
 [6.9 3.1 5.1 2.3]
 [6.8 3.2 5.9 2.3]
 [5.8 2.7 3.9 1.2]
 [4.7 3.2 1.6 0.2]
 [6.9 3.2 5.7 2.3]
 [5.4 3.4 1.5 0.4]
 [4.8 3. 1.4 0.1]
 [5.5 2.4 3.7 1. ]
 [6.5 3. 5.5 1.8]
 [6.2 2.2 4.5 1.5]
 [5. 2.3 3.3 1. ]
 [5.7 2.8 4.1 1.3]
 [6.7 3. 5. 1.7]
 [6.8 3. 5.5 2.1]
 [4.4 3.2 1.3 0.2]
 [6.7 2.5 5.8 1.8]
 [5.7 3. 4.2 1.2]
 [5.5 2.6 4.4 1.2]
 [5.2 2.7 3.9 1.4]
 [6.2 2.8 4.8 1.8]
 [5.4 3.9 1.3 0.4]
 [4.8 3.4 1.9 0.2]
 [6. 2.7 5.1 1.6]
 [6.4 2.8 5.6 2.2]
 [6.3 2.7 4.9 1.8]
 [5.4 3.4 1.7 0.2]] [0 1 0 2 2 0 0 0 1 1 1 0 2 2 1 0 2 0 0 1 2 1 1 1 1 2 0 2 1 1 1 2 0 0 1 2 2
 0]

Process finished with exit code 0

The training set is now smaller, 75% of the original data, and the samples are shuffled by default.
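The effect of random_state described in 1.3 can be checked with a short sketch. The seed value 42 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# Splitting twice with the same seed should give identical results
split_a = train_test_split(iris.data, iris.target, test_size=0.25, random_state=42)
split_b = train_test_split(iris.data, iris.target, test_size=0.25, random_state=42)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True

# 150 samples with test_size=0.25: 112 for training, 38 for testing
x_train, x_test, y_train, y_test = split_a
print(x_train.shape, x_test.shape)  # (112, 4) (38, 4)
```

Without random_state, each run draws a different shuffle, so printed splits will generally differ between runs.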

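2. sklearn classification datasets

A large dataset for classification:

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')
subset: 'train', 'test' or 'all', optional; selects which part of the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both
datasets.clear_data_home(data_home=None): clears the data in the directory

The articles belong to 20 categories, that is, 20 target classes (not 20 features); data_home is the download directory; on first use the files are downloaded and then the data is loaded.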

from sklearn.datasets import fetch_20newsgroups

# Downloads the files on first use, then loads both the training and test articles
news = fetch_20newsgroups(subset='all')
print(news.data)
print(news.target)
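The data_home and clear_data_home parameters can be tried without downloading anything. The sketch below is only an illustration under stated assumptions: the cache directory is a throwaway temporary path and placeholder.txt is a hypothetical stand-in for a downloaded dataset file. get_data_home is the sklearn.datasets helper that resolves (and creates) the cache directory:

```python
import os
import tempfile
from sklearn.datasets import clear_data_home, get_data_home

# Hypothetical cache location inside a throwaway temp directory
cache_dir = os.path.join(tempfile.mkdtemp(), "sk_data")

# get_data_home creates the directory if needed and returns its path
resolved = get_data_home(data_home=cache_dir)

# Stand-in for a downloaded dataset file (illustrative only)
marker = os.path.join(resolved, "placeholder.txt")
with open(marker, "w") as fh:
    fh.write("cached data would live here")
existed_before = os.path.exists(marker)

# Remove whatever is cached under that directory
clear_data_home(data_home=cache_dir)
print(existed_before, os.path.exists(marker))  # True False
```

Passing the same directory as data_home to fetch_20newsgroups would make it download into that location instead of ~/scikit_learn_data/.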

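3. sklearn regression datasets

sklearn.datasets.load_boston() loads and returns the Boston house price dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the example below only runs on older releases.

from sklearn.datasets import load_boston  # requires scikit-learn < 1.2

lb = load_boston()
print("Feature values")
print(lb.data)
print("Target values")
print(lb.target)
print(lb.DESCR)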
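Because load_boston is not available in scikit-learn 1.2 and later, a bundled regression dataset that still loads offline is shown here instead. load_diabetes is a swapped-in substitute chosen for this sketch, not the dataset the article uses:

```python
from sklearn.datasets import load_diabetes

# Bundled regression dataset; the target is a continuous value
db = load_diabetes()

print(db.data.shape)    # (442, 10)
print(db.target.shape)  # (442,)
print(db.feature_names)
```

The returned Bunch has the same data, target and DESCR fields as the classification datasets above.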
