UCI Datasets: Summary and Descriptions

Published: 2023/12/20 | Programming Q&A | 豆豆

1. Abalone: Predict the age of abalone from physical measurements

2. Abscisic Acid Signaling Network: The objective is to determine the set of boolean rules that describe the interactions of the nodes within this plant signaling network. The dataset includes 300 separate boolean pseudodynamic simulations using an asynchronous update scheme.
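
To make the "asynchronous update scheme" mentioned for the Abscisic Acid Signaling Network concrete, here is a minimal sketch. It assumes a made-up three-node network (the names `A`, `B`, `C` and their rules are hypothetical and are not the rules the dataset asks you to recover), and it uses the common convention that one randomly chosen node is updated per step.

```python
import random

# Hypothetical three-node network; these rules are invented for
# illustration and are NOT the abscisic acid rules from the dataset.
RULES = {
    "A": lambda s: s["A"],                 # A acts as a fixed input
    "B": lambda s: s["A"] and not s["C"],  # B is activated by A, inhibited by C
    "C": lambda s: s["B"],                 # C follows B
}

def async_step(state, rng):
    """Update one randomly chosen node in place; all others keep their value."""
    node = rng.choice(sorted(RULES))
    state[node] = bool(RULES[node](state))
    return node

def simulate(initial, steps, seed=0):
    """Return the list of states visited, starting with the initial state."""
    rng = random.Random(seed)
    state = dict(initial)
    trajectory = [dict(state)]
    for _ in range(steps):
        async_step(state, rng)
        trajectory.append(dict(state))
    return trajectory

traj = simulate({"A": True, "B": False, "C": False}, steps=10)
print(len(traj))  # 11 states: the initial one plus one per step
```

Because only one node changes per step, consecutive states differ in at most one position, which is what distinguishes this scheme from a synchronous update of all nodes at once.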

3. Acute Inflammations: The data was created by a medical expert as a data set to test an expert system that performs the presumptive diagnosis of two diseases of the urinary system.

4. Adult: Predict whether income exceeds $50K/yr based on census data. Also known as "Census Income" dataset.

5. Annealing: Steel annealing data

6. Anonymous Microsoft Web Data: Log of anonymous users of www.microsoft.com; predict areas of the web site a user visited based on data on other areas the user visited.

7. Arcene: ARCENE's task is to distinguish cancer versus normal patterns from mass-spectrometric data. This is a two-class classification problem with continuous input variables. This dataset is one of 5 datasets of the NIPS 2003 feature selection challenge.

8. Arrhythmia: Distinguish between the presence and absence of cardiac arrhythmia and classify it into one of 16 groups.

9. Artificial Characters: Dataset artificially generated using a first-order theory that describes the structure of ten capital letters of the English alphabet

10. Audiology (Original): Nominal audiology dataset from Baylor

11. Audiology (Standardized): Standardized version of the original audiology database

12. Australian Sign Language signs: This data consists of samples of Auslan (Australian Sign Language) signs. Examples of 95 signs were collected from five signers with a total of 6650 sign samples.

13. Australian Sign Language signs (High Quality): This data consists of samples of Auslan (Australian Sign Language) signs. 27 examples of each of 95 Auslan signs were captured from a native signer using high-quality position trackers.

14. Auto MPG: Revised from the CMU StatLib library; data concerns city-cycle fuel consumption

15. Automobile: From 1985 Ward's Automotive Yearbook

16. AutoUniv: AutoUniv is an advanced data generator for classification tasks. The aim is to reflect the nuances and heterogeneity of real data. Data can be generated in .csv, ARFF or C4.5 formats.

17. Bach Chorales: Time-series data based on chorales; challenge is to learn generative grammar; data in Lisp

18. Badges: Badges labeled with a "+" or "-" as a function of a person's name

19. Bag of Words: This data set contains five text collections in the form of bags-of-words.
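
For readers unfamiliar with the bag-of-words form used by the Bag of Words entry, the following minimal sketch shows the idea: a document is reduced to word counts and word order is discarded. The tokenizer here (lowercase, runs of ASCII letters) and the function name `bag_of_words` are illustrative assumptions, not the preprocessing used to build the UCI files.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return a word -> count mapping; word order is discarded."""
    # Assumed tokenizer: lowercase, then keep runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = "the cat sat on the mat"
bag = bag_of_words(doc)
print(bag["the"], bag["cat"], bag["dog"])  # prints: 2 1 0
```

A `Counter` returns 0 for words that never occur, which matches the sparse nature of this representation.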

20. Balance Scale: Balance scale weight & distance database

21. Balloons: Data previously used in cognitive psychology experiment; 4 data sets represent different conditions of an experiment

22. Blood Transfusion Service Center: Data taken from the Blood Transfusion Service Center in Hsin-Chu City in Taiwan; this is a classification problem.

23. Breast Cancer: Breast Cancer Data (Restricted Access)

24. Breast Cancer Wisconsin (Diagnostic): Diagnostic Wisconsin Breast Cancer Database

25. Breast Cancer Wisconsin (Original): Original Wisconsin Breast Cancer Database

26. Breast Cancer Wisconsin (Prognostic): Prognostic Wisconsin Breast Cancer Database

27. Breast Tissue: Dataset with electrical impedance measurements of freshly excised tissue samples from the breast.

28. CalIt2 Building People Counts: This data comes from the main door of the CalIt2 building at UCI.

29. Car Evaluation: Derived from simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.

30. Cardiotocography: The dataset consists of measurements of fetal heart rate (FHR) and uterine contraction (UC) features on cardiotocograms classified by expert obstetricians.

31. Census Income: Predict whether income exceeds $50K/yr based on census data. Also known as "Adult" dataset.

32. Census-Income (KDD): This data set contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau.

33. Challenger USA Space Shuttle O-Ring: Task: predict the number of O-rings that experience thermal distress on a flight at 31 degrees F given data on the previous 23 shuttle flights.

34. Character Trajectories: Multiple, labelled samples of pen tip trajectories recorded whilst writing individual characters. All samples are from the same writer, for the purposes of primitive extraction. Only characters with a single pen-down segment were considered.

35. Chess (Domain Theories): 6 different domain theories for generating legal moves of chess

36. Chess (King-Rook vs. King): Chess Endgame Database for White King and Rook against Black King (KRK).

37. Chess (King-Rook vs. King-Knight): Knight Pin Chess End-Game Database Creator

38. Chess (King-Rook vs. King-Pawn): King+Rook versus King+Pawn on a7 (usually abbreviated KRKPA7).

39. Cloud: Little Documentation

40. CMU Face Images: This data consists of 640 black and white face images of people taken with varying pose (straight, left, right, up), expression (neutral, happy, sad, angry), eyes (wearing sunglasses or not), and size.

41. Coil 1999 Competition Data: This data set is from the 1999 Computational Intelligence and Learning (COIL) competition. The data contains measurements of river chemical concentrations and algae densities.

42. Communities and Crime: Communities within the United States. The data combines socio-economic data from the 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR.

43. Communities and Crime Unnormalized: Communities in the US. Data combines socio-economic data from the 1990 Census, law enforcement data from the 1990 Law Enforcement Management and Admin Stats survey, and crime data from the 1995 FBI UCR

44. Computer Hardware: Relative CPU Performance Data, described in terms of its cycle time, memory size, etc.

45. Concrete Compressive Strength: Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

46. Concrete Slump Test: Concrete is a highly complex material. The slump flow of concrete is not only determined by the water content, but is also influenced by other concrete ingredients.

47. Congressional Voting Records: 1984 United States Congressional Voting Records; Classify as Republican or Democrat

48. Connect-4: Contains connect-4 positions

49. Connectionist Bench (Nettalk Corpus): The file "nettalk.data" contains a list of 20,008 English words, along with a phonetic transcription for each word. The task is to train a network to produce the proper phonemes.

50. Connectionist Bench (Sonar, Mines vs. Rocks): The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock.

51. Connectionist Bench (Vowel Recognition - Deterding Data): Speaker-independent recognition of the eleven steady-state vowels of British English using a specified training set of LPC-derived log area ratios.

52. Contraceptive Method Choice: Dataset is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey.

53. Corel Image Features: This dataset contains image features extracted from a Corel image collection. Four sets of features are available based on the color histogram, color histogram layout, color moments, and co-occurrence

54. Covertype: Forest CoverType dataset

55. Credit Approval: This data concerns credit card applications; good mix of attributes

56. Cylinder Bands: Used in decision tree induction for mitigating process delays known as "cylinder bands" in rotogravure printing

57. Demospongiae: Marine sponges of the Demospongiae class classification domain.

58. Dermatology: The aim of this dataset is to determine the type of Erythemato-Squamous Disease.

59. Dexter: DEXTER is a text classification problem in a bag-of-word representation. This is a two-class classification problem with sparse continuous input variables. This dataset is one of five datasets of the NIPS 2003 feature selection challenge.

60. DGP2 - The Second Data Generation Program: Generates application domains based on specific parameters, number of features, and proportion of positive to negative examples

61. Diabetes: This diabetes dataset is from AIM '94

62. Document Understanding: Five concepts, expressed as predicates, to be learned

63. Dodgers Loop Sensor: Loop sensor data was collected for the Glendale on-ramp for the 101 North freeway in Los Angeles

64. Dorothea: DOROTHEA is a drug discovery dataset. Chemical compounds represented by structural molecular features must be classified as active (binding to thrombin) or inactive. This is one of 5 datasets of the NIPS 2003 feature selection challenge.

65. E. Coli Genes: Data giving characteristics of each ORF (potential gene) in the E. coli genome. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided.

66. EBL Domain Theories: Assorted small-scale domain theories

67. Echocardiogram: Data for classifying if patients will survive for at least one year after a heart attack

68. Ecoli: This data contains protein localization sites

69. Economic Sanctions: Domain Theory on Economic Sanctions; Undocumented

70. EEG Database: This data arises from a large study to examine EEG correlates of genetic predisposition to alcoholism. It contains measurements from 64 electrodes placed on the scalp sampled at 256 Hz

71. El Nino: The data set contains oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific.

72. Entree Chicago Recommendation Data: This data contains a record of user interactions with the Entree Chicago restaurant recommendation system.

73. Flags: From Collins Gem Guide to Flags, 1986

74. Forest Fires: This is a difficult regression task, where the aim is to predict the burned area of forest fires, in the northeast region of Portugal, by using meteorological and other data (see details at: http://www.dsi.uminho.pt/~pcortez/forestfires).

75. Function Finding: Cases collected mostly from investigations in physical science; intention is to evaluate function-finding algorithms

76. Gisette: GISETTE is a handwritten digit recognition problem. The problem is to separate the highly confusible digits '4' and '9'. This dataset is one of five datasets of the NIPS 2003 feature selection challenge.

77. Glass Identification: From USA Forensic Science Service; 6 types of glass; defined in terms of their oxide content (e.g. Na, Fe, K)

78. Haberman's Survival: Dataset contains cases from a study conducted on the survival of patients who had undergone surgery for breast cancer

79. Hayes-Roth: Topic: human subjects study

80. Heart Disease: 4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach

81. Hepatitis: From G.Gong: CMU; Mostly Boolean or numeric-valued attribute types; Includes cost data (donated by Peter Turney)

82. Hill-Valley: Each record represents 100 points on a two-dimensional graph. When plotted in order (from 1 through 100) as the Y co-ordinate, the points will create either a Hill (a "bump" in the terrain) or a Valley (a "dip" in the terrain).
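
The Hill-Valley record layout can be illustrated with a small sketch. It assumes a noise-free record built from a flat baseline plus one Gaussian bump or dip; the helper names `make_record` and `classify`, and all parameter values, are hypothetical. The decision rule is a deliberately naive baseline, not a method associated with the dataset.

```python
import math

def make_record(kind, n=100, center=50, width=5.0, height=10.0):
    """Build n Y-values: a flat baseline with one Gaussian bump or dip."""
    sign = 1.0 if kind == "hill" else -1.0
    return [sign * height * math.exp(-((i - center) ** 2) / (2 * width ** 2))
            for i in range(n)]

def classify(record):
    """Naive rule: compare how far the max and the min stray from the median."""
    ordered = sorted(record)
    median = ordered[len(ordered) // 2]
    above = max(record) - median
    below = median - min(record)
    return "hill" if above >= below else "valley"

print(classify(make_record("hill")), classify(make_record("valley")))
# prints: hill valley
```

On noisy or sloped records this rule would likely fail, which is part of what makes the real task less trivial than the synthetic one shown here.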

83. Horse Colic: Well documented attributes; 368 instances with 28 attributes (continuous, discrete, and nominal); 30% missing values

84. Housing: Taken from StatLib library

85. ICU: Data set prepared for the use of participants for the 1994 AAAI Spring Symposium on Artificial Intelligence in Medicine.

86. Image Segmentation: Image data described by high-level numeric-valued attributes, 7 classes

87. Insurance Company Benchmark (COIL 2000): This data set used in the CoIL 2000 Challenge contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data

88. Internet Advertisements: This dataset represents a set of possible advertisements on Internet pages.

89. Internet Usage Data: This data contains general demographic information on internet users in 1997.

90. Ionosphere: Classification of radar returns from the ionosphere

91. IPUMS Census Database: This data set contains unweighted PUMS census data from the Los Angeles and Long Beach areas for the years 1970, 1980, and 1990.

92. Iris: Famous database; from Fisher, 1936

93. ISOLET: Goal: Predict which letter-name was spoken; a simple classification task.

94. Japanese Credit Screening: Includes domain theory (generated by talking to Japanese domain experts); data in Lisp

95. Japanese Vowels: This dataset records 640 time series of 12 LPC cepstrum coefficients taken from nine male speakers.

96. KDD Cup 1998 Data: This is the data set used for The Second International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-98

97. KDD Cup 1999 Data: This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99

98. Kinship: Relational dataset

99. Labor Relations: From Collective Bargaining Review

100. LED Display Domain: From Classification and Regression Trees book; We provide here 2 C programs for generating sample databases

101. Lenses: Database for fitting contact lenses

102. Letter Recognition: Database of character image features; try to identify the letter

103. Libras Movement: The data set contains 15 classes of 24 instances each. Each class refers to a hand movement type in LIBRAS (Portuguese name 'Língua BRAsileira de Sinais', the official Brazilian sign language).

104. Liver Disorders: BUPA Medical Research Ltd. database donated by Richard S. Forsyth

105. Localization Data for Person Activity: Data contains recordings of five people performing different activities. Each person wore four sensors (tags) while performing the same scenario five times.

106. Logic Theorist: All code for Logic Theorist

107. Low Resolution Spectrometer: From IRAS data; NASA Ames Research Center

108. Lung Cancer: Lung cancer data; no attribute definitions

109. Lymphography: This lymphography domain was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. (Restricted access)

110. M. Tuberculosis Genes: Data giving characteristics of each ORF (potential gene) in the M. tuberculosis bacterium. Sequence, homology (similarity to other genes) and structural information, and function (if known) are provided

111. Madelon: MADELON is an artificial dataset, which was part of the NIPS 2003 feature selection challenge. This is a two-class classification problem with continuous input variables. The difficulty is that the problem is multivariate and highly non-linear.

112. MAGIC Gamma Telescope: Data are MC generated to simulate registration of high energy gamma particles in an atmospheric Cherenkov telescope

113. Mammographic Mass: Discrimination of benign and malignant mammographic masses based on BI-RADS attributes and the patient's age.

114. Mechanical Analysis: Fault diagnosis problem of electromechanical devices; also PUMPS DATA SET is newer version with domain theory and results

115. Meta-data: Meta-Data was used in order to give advice about which classification method is appropriate for a particular dataset (taken from results of Statlog project).

116. MiniBooNE particle identification: This dataset is taken from the MiniBooNE experiment and is used to distinguish electron neutrinos (signal) from muon neutrinos (background).

117. Mobile Robots: Learning concepts from sensor data of a mobile robot; set of data sets

118. Molecular Biology (Promoter Gene Sequences): E. Coli promoter gene sequences (DNA) with partial domain theory

119. Molecular Biology (Protein Secondary Structure): From CMU connectionist bench repository; Classifies secondary structure of certain globular proteins

120. Molecular Biology (Splice-junction Gene Sequences): Primate splice-junction gene sequences (DNA) with associated imperfect domain theory

121. MONK's Problems: A set of three artificial domains over the same attribute space; Used to test a wide range of induction algorithms

122. Moral Reasoner: Horn-clause model that qualitatively simulates moral reasoning; Theory includes negated literals

123. Movie: This data set contains a list of over 10000 films including many older, odd, and cult films. There is information on actors, casts, directors, producers, studios, etc.

124. MSNBC.com Anonymous Web Data: This data describes the page visits of users who visited msnbc.com on September 28, 1999. Visits are recorded at the level of URL category (see description) and are recorded in time order.

125. Multiple Features: This dataset consists of features of handwritten numerals ('0'-'9') extracted from a collection of Dutch utility maps

126. Mushroom: From Audubon Society Field Guide; mushrooms described in terms of physical characteristics; classification: poisonous or edible

127. Musk (Version 1): The goal is to learn to predict whether new molecules will be musks or non-musks

128. Musk (Version 2): The goal is to learn to predict whether new molecules will be musks or non-musks

129. NSF Research Award Abstracts 1990-2003: This data set consists of (a) 129,000 abstracts describing NSF awards for basic research, (b) bag-of-word data files extracted from the abstracts, (c) a list of words used for indexing the bag-of-word

130. Nursery: Nursery Database was derived from a hierarchical decision model originally developed to rank applications for nursery schools.

131. Online Handwritten Assamese Characters Dataset: This is a dataset of 8235 online handwritten Assamese characters. The "online" process involves capturing of data as text is written on a digitizing tablet with an electronic pen.

132. Opinosis Opinion / Review: This dataset contains sentences extracted from user reviews on a given topic. Example topics are "performance of Toyota Camry" and "sound quality of ipod nano".

133. OpinRank Review Dataset: This data set contains user reviews of cars and hotels collected from Tripadvisor (~259,000 reviews) and Edmunds (~42,230 reviews).

134. Optical Recognition of Handwritten Digits: Two versions of this database available; see folder

135. Othello Domain Theory: Used in research to generate features for an inductive learning system

136. Ozone Level Detection: Two ground ozone level data sets are included in this collection. One is the eight hour peak set (eighthr.data), the other is the one hour peak set (onehr.data). Those data were collected from 1998 to 2004 in the Houston, Galveston and Brazoria area.

137. p53 Mutants: The goal is to model mutant p53 transcriptional activity (active vs inactive) based on data extracted from biophysical simulations.

138. Page Blocks Classification: The problem consists of classifying all the blocks of the page layout of a document that has been detected by a segmentation process.

139. Parkinsons: Oxford Parkinson's Disease Detection Dataset

140. Parkinsons Telemonitoring: Oxford Parkinson's Disease Telemonitoring Dataset

141. PEMS-SF: 15 months' worth of daily data (440 daily records) that describes the occupancy rate, between 0 and 1, of different car lanes of the San Francisco bay area freeways across time.

142. Pen-Based Recognition of Handwritten Digits: Digit database of 250 samples from 44 writers

143. Pima Indians Diabetes: From National Institute of Diabetes and Digestive and Kidney Diseases; Includes cost data (donated by Peter Turney)

144. Pioneer-1 Mobile Robot Data: This dataset contains time series sensor readings of the Pioneer-1 mobile robot. The data is broken into "experiences" in which the robot takes action for some period of time and experiences a control

145. Pittsburgh Bridges: Bridges database that has original and numeric-discretized datasets

146. Plants: Data has been extracted from the USDA plants database. It contains all plants (species and genera) in the database and the states of USA and Canada where they occur.

147. Poker Hand: Purpose is to predict poker hands

148. Post-Operative Patient: Dataset of patient features

149. Primary Tumor: From Ljubljana Oncology Institute

150. Prodigy: Assorted domains like blocksworld, eightpuzzle, and schedworld.

151. Protein Data: Undocumented

152. Pseudo Periodic Synthetic Time Series: This data set is designed for testing indexing schemes in time series databases. The data appears highly periodic, but never exactly repeats itself.
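
One way to get a feel for a series that "appears highly periodic, but never exactly repeats itself" is to add two sinusoids whose frequency ratio is irrational. The sketch below is an illustrative construction under that assumption; the function `pseudo_periodic` and its coefficients are hypothetical and are not the generator used to produce the UCI data.

```python
import math

def pseudo_periodic(t):
    """Sum of two sinusoids whose frequency ratio is irrational (sqrt(2))."""
    return math.sin(t) + 0.5 * math.sin(math.sqrt(2.0) * t)

period = 2 * math.pi  # period of the dominant component only
# Sample 50 points per dominant period, over four dominant periods.
series = [pseudo_periodic(k * period / 50) for k in range(200)]

# One dominant period later the series does not return to the same value,
# because the second component is at a different phase.
print(abs(series[60] - series[10]))
```

Indexing schemes that rely on exact repetition cannot shortcut a series like this, which is presumably why such data is useful as a test case.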

153. PubChem Bioassay Data: These highly imbalanced bioassay datasets are from the differing types of screening that can be performed using HTS technology. 21 datasets were created from 12 bioassays.

154. Quadruped Mammals: The file animals.c is a data generator of structured instances representing quadruped animals

155. Qualitative Structure Activity Relationships: Two sets of datasets are given: pyrimidines and triazines

156. Record Linkage Comparison Patterns: Element-wise comparison of records with personal data from a record linkage setting. The task is to decide from a comparison pattern whether the underlying records belong to one person.

157. Relative location of CT slices on axial axis: The dataset consists of 384 features extracted from CT images. The class variable is numeric and denotes the relative location of the CT slice on the axial axis of the human body.

158. Reuters Transcribed Subset: This dataset is created by reading out 200 files from the 10 largest Reuters classes and using an Automatic Speech Recognition system to create corresponding transcriptions.

159. Reuters-21578 Text Categorization Collection: This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.

160. Robot Execution Failures: This dataset contains force and torque measurements on a robot after failure detection. Each failure is characterized by 15 force/torque samples collected at regular time intervals

161. SECOM: Data from a semi-conductor manufacturing process

162. Semeion Handwritten Digit: 1593 handwritten digits from around 80 persons were scanned, stretched in a rectangular box 16x16 in a gray scale of 256 values.
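
A 16x16 box gives 256 pixel values per digit. The sketch below shows how such a flat record could be turned back into a grid for inspection. It assumes the values are stored row by row, which is an assumption about the file layout rather than something stated in the description; the helper `to_grid` and the sample record are hypothetical.

```python
def to_grid(flat, side=16):
    """Reshape a flat list of side*side pixel values into a list of rows."""
    if len(flat) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(flat)}")
    return [flat[r * side:(r + 1) * side] for r in range(side)]

# Hypothetical record: a vertical bar in column 8, everything else blank.
flat = [1 if i % 16 == 8 else 0 for i in range(256)]
grid = to_grid(flat)

# Show the first three rows as text, '#' for ink and '.' for background.
for row in grid[:3]:
    print("".join("#" if v else "." for v in row))
```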

163. Servo: Data was from a simulation of a servo system

164. Shuttle Landing Control: Tiny database; all nominal values

165. Solar Flare: Each class attribute counts the number of solar flares of a certain class that occur in a 24 hour period

166. Soybean (Large): Michalski's famous soybean disease database

167. Soybean (Small): Michalski's famous soybean disease database

168. Spambase: Classifying Email as Spam or Non-Spam

169. SPECT Heart: Data on cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient classified into two categories: normal and abnormal.

170. SPECTF Heart: Data on cardiac Single Photon Emission Computed Tomography (SPECT) images. Each patient classified into two categories: normal and abnormal.

171. Spoken Arabic Digit: This dataset contains timeseries of mel-frequency cepstrum coefficients (MFCCs) corresponding to spoken Arabic digits. Includes data from 44 male and 44 female native Arabic speakers.

口語阿拉伯語位：該數據集包含MEL頻率倒譜系數（MFCCs）講阿拉伯語數字對應的時間序列。包括44男44女的母語講阿拉伯語的數據。

172. Sponge: Data on sponges; Attributes in Spanish

海綿：海綿上的數據，在西班牙語中的屬性

173. Statlog (Australian Credit Approval): This file concerns credit card applications. This database exists elsewhere in the repository (Credit Screening Database) in a slightly different form

Statlog（澳大利亞授信審批）：這個文件是關于信用卡申請。該數據庫存在于其他地方略有不同形式的資源庫（授信數據庫）

174. Statlog (German Credit Data): This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix

Statlog（德國信用數據）：這個數據集劃分好壞信貸風險的屬性所描述的人。來自于兩種格式（所有數字）。還帶有一個成本矩陣

175. Statlog (Heart): This dataset is a heart disease database similar to one already present in the repository (Heart Disease databases), but in a slightly different form.

176. Statlog (Image Segmentation): This dataset is an image segmentation database similar to one already present in the repository (Image Segmentation database), but in a slightly different form.

177. Statlog (Landsat Satellite): Multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood.
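
One way to picture the Landsat entry: each example is the 3x3 window of pixel values around a central pixel, and the class label belongs to that central pixel. A minimal sketch for a single band follows, assuming a tiny made-up image stored as a list of rows; the function name and the row-major ordering are assumptions, and a multi-spectral image would repeat the same window for every band.

```python
def neighbourhood_features(image, row, col):
    """Return the 9 values of the 3x3 window centred on (row, col), row-major."""
    if not (1 <= row < len(image) - 1 and 1 <= col < len(image[0]) - 1):
        raise ValueError("centre pixel must not lie on the image border")
    return [
        image[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
    ]


# Made-up single-band image, 3 rows by 4 columns.
image = [
    [10, 11, 12, 13],
    [20, 21, 22, 23],
    [30, 31, 32, 33],
]
features = neighbourhood_features(image, 1, 1)
print(features)  # [10, 11, 12, 20, 21, 22, 30, 31, 32]
```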

178. Statlog (Shuttle): The shuttle dataset contains 9 attributes, all of which are numerical. Approximately 80% of the data belongs to class 1.
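
Because roughly 80% of the Shuttle data belongs to class 1, a classifier that always predicts the most frequent class already scores about 80% accuracy, so reported accuracies are best read against that baseline. The sketch below computes this baseline on a made-up label list; the labels and the function name are illustrative assumptions, not taken from the dataset files.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Made-up labels: 8 of 10 examples fall in class 1.
labels = [1] * 8 + [4, 5]
print(majority_baseline_accuracy(labels))  # 0.8
```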

179. Statlog (Vehicle Silhouettes): Features of 3D objects within a 2D image, obtained by applying an ensemble of shape feature extractors to the 2D silhouettes of the objects.

180. Statlog Project: Various databases: Vehicle Silhouettes, Landsat Satellite, Shuttle, Australian Credit Approval, Heart Disease, Image Segmentation, German Credit.

181. Steel Plates Faults: A dataset of steel plates' faults, classified into 7 different types. The goal was to train machine learning models for automatic pattern recognition.

182. Student Loan Relational: Student loan relational domain.

183. Synthetic Control Chart Time Series: This data consists of synthetically generated control charts.

184. Syskill and Webert Web Page Ratings: This database contains the HTML source of web pages plus the ratings of a single user on these web pages. The web pages cover four separate subjects (Bands: recording artists; Goats; Sheep; and BioMedical).

185. Teaching Assistant Evaluation: The data consist of evaluations of teaching performance; scores are "low", "medium", or "high".

186. Thyroid Disease: 10 separate databases from the Garavan Institute.

187. Tic-Tac-Toe Endgame: Binary classification task on possible configurations of the tic-tac-toe game.
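
A common reading of this task is to decide whether a final board is a win for x. The sketch below checks that by testing the eight winning lines; the board encoding (a list of nine cells holding "x", "o", or "b" for blank, in row-major order) and the choice of "win for x" as the positive class are assumptions here, not details given in the entry above.

```python
# The eight winning lines of a 3x3 board, as row-major cell indices.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def x_wins(board):
    """Return True if player x occupies any complete winning line."""
    if len(board) != 9:
        raise ValueError("board must have exactly 9 cells")
    return any(all(board[i] == "x" for i in line) for line in LINES)


# Made-up endgame: x has completed the top row.
board = ["x", "x", "x",
         "o", "o", "b",
         "b", "b", "b"]
print(x_wins(board))  # True
```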

188. Trains: 2 data formats (structured, one-instance-per-line).

189. Twenty Newsgroups: This data set consists of 20000 messages taken from 20 newsgroups.

190. UJI Pen Characters: Data consists of written characters in a UNIPEN-like format.

191. UJI Pen Characters (Version 2): A pen-based database with more than 11k isolated handwritten characters.

192. Undocumented: Various datasets without documentation (feel free to explore!).

193. University: Data in original (LISP-readable) form.

194. UNIX User Data: This file contains 9 sets of sanitized user data drawn from the command histories of 8 UNIX computer users at Purdue over the course of up to 2 years.

195. URL Reputation: Anonymized 120-day subset of the ICML-09 URL data, containing 2.4 million examples and 3.2 million features.

196. US Census Data (1990): The USCensus1990raw data set contains a one percent sample of the Public Use Microdata Samples (PUMS) person records drawn from the full 1990 census sample.

197. Volcanoes on Venus - JARtool experiment: The JARtool project was a pioneering effort to develop an automatic system for cataloging small volcanoes in the large set of Venus images returned by the Magellan spacecraft.

198. Wall-Following Robot Navigation Data: The data were collected as the SCITOS G5 robot navigated through the room, following the wall in a clockwise direction for 4 rounds, using 24 ultrasound sensors arranged circularly around its "waist".

199. Water Treatment Plant: Multiple classes predict the state of the plant.

200. Waveform Database Generator (Version 1): The CART book's waveform domains.

201. Waveform Database Generator (Version 2): The CART book's waveform domains.

202. Wine: Using chemical analysis to determine the origin of wines.

203. Wine Quality: Two datasets are included, related to red and white vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/).

204. YearPredictionMSD: Prediction of the release year of a song from audio features. Songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s.

205. Yeast: Predicting the cellular localization sites of proteins.

206. Zoo: Artificial; 7 classes of animals.

Original work; please credit the source when reposting: https://blog.csdn.net/mago2015
