Microsoft Azure Machine Learning x Udacity — Lesson 4 Notes
Detailed Notes for the Machine Learning Foundation Course by Microsoft Azure & Udacity (2020), Lesson 4 — Supervised & Unsupervised Learning
This lesson covers two of Machine Learning’s fundamental approaches: supervised and unsupervised learning. You will learn about classification, regression, clustering, representation learning, and more.
Supervised Learning: Classification
In a classification problem, the outputs are categorical or discrete.
Some of the most common types of classification problems include:
Classification on tabular data: The data is available in the form of rows and columns, potentially originating from a wide variety of data sources.
Classification on image or sound data: The training data consists of images or sounds whose categories are already known.
Classification on text data: The training data consists of texts whose categories are already known.
Examples of Classification problems are:
- Computer Vision
- Speech Recognition
- Biometric Identification
- Document Classification
- Sentiment Analysis
- Credit Scoring
- Anomaly Detection
Categories of Algorithms:
At a high level, there are three main categories of classification algorithms:
Two-Class (Binary) Classification: used when the prediction has to be made between only two categories, e.g. True/False or Yes/No.
Multi-Class Single-Label Classification: used when there are multiple categories to predict from, but the output belongs to a single category, e.g. red, yellow, green, or blue.
Multi-Class Multi-Label Classification: used when there are multiple categories to predict from and the output can belong to several categories at once, e.g. red, yellow, green, or blue.
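To make the distinction concrete, here is a minimal scikit-learn sketch of the three categories (illustrative toy data; not code from the course):

```python
# Toy contrast of the three categories of classification.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(100, 4)                      # 100 samples, 4 features

# Two-class (binary): exactly one of two labels.
y_binary = np.random.randint(0, 2, size=100)    # 0 = No, 1 = Yes
LogisticRegression().fit(X, y_binary)

# Multi-class, single-label: exactly one of several labels.
y_multi = np.random.choice(["red", "yellow", "green", "blue"], size=100)
LogisticRegression().fit(X, y_multi)

# Multi-class, multi-label: each sample may carry several labels,
# encoded as one binary indicator column per class.
y_multilabel = np.random.randint(0, 2, size=(100, 4))
OneVsRestClassifier(LogisticRegression()).fit(X, y_multilabel)
```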
Two-Class Classification Algorithms:
Multi-Class Classification Algorithms:
Multi-Class Algorithms
Multi-Class Algorithms Hyperparameters:
Multi-Class Logistic Regression: a well-known statistical method used to predict the probability of an outcome, popular in classification tasks. The two key parameters for configuring this algorithm are: 1) Optimization Tolerance, which controls when to stop iterating — if the improvement between iterations is less than the specified threshold, the algorithm stops and returns the current model; and 2) Regularization Weight — regularization is a method of preventing overfitting by penalizing models with extreme coefficient values, and the regularization weight controls how strongly the model is penalized at each iteration.
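As a rough analogue (not the Azure ML module itself), scikit-learn's LogisticRegression exposes the same two knobs: tol plays the role of the optimization tolerance, and C is the inverse of the regularization weight, so a smaller C means a stronger penalty:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # 3-class toy dataset

# tol ~ Optimization Tolerance: stop once the improvement between
#       iterations falls below this threshold.
# C   ~ inverse of the Regularization Weight: smaller C penalizes
#       extreme coefficient values more heavily.
clf = LogisticRegression(tol=1e-4, C=1.0, max_iter=200).fit(X, y)
```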
Multi-Class Neural Network: a typical example includes an input layer, a hidden layer, and an output layer. The relationship between the input and the output is learned by training the Neural Network on the input data. The three key parameters for configuring the Multi-Class Neural Network are: 1) Number of Hidden Nodes, which customizes the number of nodes in the hidden layer; 2) Learning Rate, which controls the size of the correction step taken at each iteration; and 3) Number of Learning Iterations, the maximum number of times the algorithm should process the training cases.
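A comparable sketch with scikit-learn's MLPClassifier, assuming its hyperparameters as stand-ins for the Azure ML ones:

```python
from sklearn.neural_network import MLPClassifier

# hidden_layer_sizes ~ Number of Hidden Nodes (one hidden layer here)
# learning_rate_init ~ Learning Rate: step size of each correction
# max_iter           ~ Number of Learning Iterations (upper bound)
nn = MLPClassifier(hidden_layer_sizes=(100,),
                   learning_rate_init=0.001,
                   max_iter=200)
```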
Multi-Class Decision Forest: an ensemble of Decision Trees. The algorithm works by building multiple Decision Trees and then voting on the most popular output class. The five key parameters for configuring the Multi-Class Decision Forest are: 1) Resampling Method, which controls how the individual Decision Trees are created; 2) Number of Decision Trees, the maximum number of trees that can be created in the ensemble; 3) Maximum Depth of the Decision Trees, a limit on the depth of any tree; 4) Number of Random Splits per Node, the number of splits to consider when building each node of a tree; and 5) Minimum Number of Samples per Leaf Node, the minimum number of cases required to create any terminal (leaf) node in a tree.
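For comparison, a random forest in scikit-learn (a close relative of the Azure ML module, not an exact match) exposes most of the same knobs:

```python
from sklearn.ensemble import RandomForestClassifier

# n_estimators     ~ Number of Decision Trees
# max_depth        ~ Maximum Depth of the Decision Trees
# max_features     ~ loosely related to Random Splits per Node
# min_samples_leaf ~ Minimum Number of Samples per Leaf Node
# bootstrap        ~ Resampling Method (bagging vs. whole-set replicas)
forest = RandomForestClassifier(n_estimators=100,
                                max_depth=10,
                                max_features="sqrt",
                                min_samples_leaf=1,
                                bootstrap=True)
```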
Supervised Learning: Regression
In a regression problem, the output is numerical or continuous.
Introduction to Regression
Common types of regression problems include:
Regression on tabular data: The data is available in the form of rows and columns, potentially originating from a wide variety of data sources.
Regression on image or sound data: Training data consists of images/sounds whose numerical scores are already known. Several steps need to be performed during the preparation phase to transform images/sounds into numerical vectors accepted by the algorithms.
Regression on text data: Training data consists of texts whose numerical scores are already known. Several steps need to be performed during the preparation phase to transform text into numerical vectors accepted by the algorithms.
Examples of Regression Problems:
- Housing prices
- Customer churn
- Customer lifetime value
- Forecasting (time series)
- Anomaly detection
Categories of Algorithms
Common machine learning algorithms for regression problems include:
Linear Regression
- A linear relationship between one or more independent variables and a numeric outcome (the dependent variable)
- Fast training
- Two popular approaches to measuring error and fitting the regression line:
Ordinary Least Squares Method: computes the error as the sum of the squared distances from the actual values to the predicted line, and fits the model by minimizing this squared error. This method assumes a strong linear relationship between the independent and dependent variables.
Gradient Descent: minimizes the amount of error at each step of the model training process.
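A from-scratch sketch contrasting the two approaches on one feature — the closed-form OLS solution versus a plain gradient-descent loop (toy data; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)   # y ≈ 3x + 2 plus noise

# Ordinary Least Squares: closed-form slope/intercept that minimize
# the sum of squared residuals.
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# Gradient Descent: step both parameters against the gradient of the
# mean squared error, shrinking the error a little at each iteration.
w, b, lr = 0.0, 0.0, 0.01
for _ in range(2000):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)
    grad_b = 2 * np.mean(pred - y)
    w -= lr * grad_w
    b -= lr * grad_b

print(slope, intercept)  # both pairs should land near (3, 2)
print(w, b)
```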
Decision Forest Regression
- An ensemble learning method using multiple decision trees
- Each tree outputs a distribution as its prediction
- Aggregation is performed to find the distribution closest to the combined distribution
- Accurate, with fast training times
- Supports some of the same hyperparameters as the Multi-Class Decision Forest algorithm, i.e. Number of Trees, Maximum Depth, etc. (see the sketch below)
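The regression counterpart in scikit-learn, again only an approximate stand-in for the Azure ML module:

```python
from sklearn.ensemble import RandomForestRegressor

# Same knobs as the classification forest, but each tree predicts a
# number and the ensemble averages the trees' predictions.
reg = RandomForestRegressor(n_estimators=100,
                            max_depth=10,
                            min_samples_leaf=1)
```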
Neural Net Regression
- The label column must be a numerical data type
- A fully connected Neural Network: input layer + one hidden layer + output layer
- Accurate, but with long training times
- Supports the same hyperparameters as the Multi-Class Neural Network algorithm, i.e. Number of Hidden Nodes, Learning Rate, Number of Iterations, etc. (a rough scikit-learn analogue follows)
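The matching sketch, with scikit-learn's MLPRegressor standing in for the Azure ML module:

```python
from sklearn.neural_network import MLPRegressor

# One hidden layer; the output layer emits a continuous value,
# so the target (label) column must be numeric.
nn_reg = MLPRegressor(hidden_layer_sizes=(100,),
                      learning_rate_init=0.001,
                      max_iter=500)
```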
Automate the Training of Regressors
Automated Machine Learning enables the automated exploration of the combinations needed to successfully produce a trained model. AutoML intelligently tests multiple combinations of algorithms and hyperparameters in parallel and returns the best one. It enables building Machine Learning models with high-scale efficiency and productivity, all while sustaining model quality. The resulting models can be:
Beyond the primary metric, you can also review a comprehensive set of performance metrics and charts to further assess the model performance.
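For reference, a hedged sketch of configuring such an experiment with the Azure ML Python SDK's AutoMLConfig (v1 SDK); the dataset name "housing" and label column "price" below are hypothetical placeholders:

```python
# Hedged sketch with the Azure ML SDK (v1); names are placeholders.
from azureml.core import Dataset, Workspace
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()                    # reads the local config.json
train_ds = Dataset.get_by_name(ws, "housing")   # hypothetical registered dataset
label = "price"                                 # hypothetical label column

automl_config = AutoMLConfig(
    task="regression",               # or "classification"
    primary_metric="normalized_root_mean_squared_error",
    training_data=train_ds,
    label_column_name=label,
    max_concurrent_iterations=4,     # test algorithm/parameter combos in parallel
    iterations=30,                   # upper bound on combinations to try
)
```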
Unsupervised Learning
In unsupervised learning, algorithms learn from unlabeled data by looking for hidden structures in the data.
Obtaining unlabeled data is comparatively inexpensive and unsupervised learning can be used to uncover very useful information in such data.
Types of Unsupervised Machine Learning
Clustering: organizes entities from the input data into a finite number of subsets or clusters
Feature Learning: transforms sets of inputs into other inputs that are potentially more useful in solving a given problem
Anomaly Detection: identifies two major groups of entities: 1) Normal, 2) Abnormal (anomalies)
Some other types include Dimensionality Reduction, Feature Extraction, Neural Networks, Principal Component Analysis, and Matrix Factorization.
Semi-Supervised Learning
Semi-supervised learning combines the supervised and unsupervised approaches; typically it involves having small amounts of labeled data and large amounts of unlabeled data.
The problem:
- Labeled data is difficult and expensive to acquire
- Unlabeled data, by contrast, is usually inexpensive to acquire
The solution:
Uses a small amount of labeled data and a much larger amount of unlabeled data
Self-Training: train a model on the labeled data and use it to make predictions on the unlabeled data. The output is a fully labeled dataset that can then be used in a supervised learning approach (a minimal sketch follows).
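A minimal self-training round, assuming any scikit-learn-style classifier; the 0.9 confidence threshold is an illustrative choice, not from the course:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9):
    """One self-training round: pseudo-label the confidently predicted points."""
    model = LogisticRegression().fit(X_labeled, y_labeled)
    proba = model.predict_proba(X_unlabeled)
    preds = model.predict(X_unlabeled)
    confident = proba.max(axis=1) >= threshold    # keep only sure predictions
    X_new = np.vstack([X_labeled, X_unlabeled[confident]])
    y_new = np.concatenate([y_labeled, preds[confident]])
    return X_new, y_new   # a larger labeled set for ordinary supervised learning
```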
Multi-View Training: train multiple models on different views of the data, including different feature selections, different parts of the training data, or different model architectures.
Self-Ensemble Training: similar to Multi-View Training, except that a single model is trained on the different views of the data.
Clustering
Clustering is the problem of organizing entities from the input data into a finite number of subsets or clusters; the goal is to maximize both intra-cluster similarity and inter-cluster differences.
Applications of Clustering Algorithms:
- Personalization and target marketing
- Document classification
- Fraud detection
- Medical imaging
- City planning
Clustering Algorithms:
Centroid-Based Clustering: organizes data into clusters based on the distance of members from the centroid of the cluster, e.g. K-Means.
Density-Based Clustering: clusters members that are closely packed together; it can learn clusters of arbitrary shape.
Distribution-based Clustering: The underlying assumption is that the data has an inherent distribution type such as normal distribution. The algorithm clusters based on the probability of a member belonging to a particular distribution.
Hierarchical Clustering: builds a tree of clusters. This is best-suited for hierarchical data such as taxonomies.
K-Means Clustering:
K-means is a centroid-based unsupervised clustering algorithm.
It creates up to a target number (K) of clusters and groups similar members together in each cluster. The objective is to minimize the intra-cluster distances (the squared error of the distances between the members of a cluster and its center).
K-Means Clustering Algorithm:
Steps: 1) initialize the K centroids; 2) assign each member to its nearest centroid; 3) recompute each centroid as the mean of its assigned members; 4) check the convergence criterion and, if it is not met, repeat from step 2.
There are different types of convergence criteria: 1) check how much the centroid locations change as a result of the new cluster memberships — if the total change in centroid location is less than a given tolerance, the algorithm assumes convergence and stops; or 2) stop after a fixed number of iterations. If the convergence criterion is not met, the algorithm iterates again starting from step 2 (see the sketch below).
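A compact from-scratch sketch implementing those steps and the tolerance-based stopping rule (toy code; Euclidean distance assumed):

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-4, seed=0):
    """Toy K-Means: minimize squared distance of members to their centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # step 1: initialize
    for _ in range(max_iter):                             # cap on iterations
        # step 2: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each centroid as the mean of its members
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if (labels == j).any() else centroids[j]
                                  for j in range(k)])
        # step 4: stop when the total centroid movement is below tolerance
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```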
K-Means Module Configurations:
Number of Centroids: the number of clusters you want the algorithm to begin with. The algorithm starts with this many centroids and iterates to find the optimal configuration.
Initialization Approach: how the initial centroids are selected. The options include First N, Random, and the K-Means++ algorithm.
Distance Metric: the default is the Euclidean distance.
Normalize Features: uses the Min-Max normalizer to scale the numeric data points to the range zero to one.
Assign Label Mode: used only if your dataset already has a label column. Optionally, the label values can be used to guide the selection of the clusters; another use of the label column is to fill in missing values.
Number of Iterations: dictates how many times the algorithm should iterate over the training data before finalizing the selection of centroids. A minimal scikit-learn mapping of these configurations follows.
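These configurations map roughly onto scikit-learn's KMeans (there is no direct equivalent of Assign Label Mode, since scikit-learn's estimator ignores labels):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(200, 3)                # toy data
X = MinMaxScaler().fit_transform(X)       # ~ Normalize Features (min-max)

km = KMeans(
    n_clusters=5,          # ~ Number of Centroids
    init="k-means++",      # ~ Initialization Approach ("random" also exists)
    max_iter=300,          # ~ Number of Iterations
    tol=1e-4,              # tolerance on centroid movement
    n_init=10,
)
labels = km.fit_predict(X)                # distance metric: Euclidean
```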
Lesson Summary
This lesson covered two of Machine Learning’s fundamental approaches: supervised and unsupervised learning.
First, we learned about supervised learning. Specifically, we learned:
More about classification and regression, two of the most representative supervised learning tasks
Some of the major algorithms involved in supervised learning, as well as how to evaluate and compare their performance
How to use automated machine learning to automate the training and selection of classifiers and regressors
Next, the lesson focused on unsupervised learning, including:
Its most representative learning task, clustering
- How unsupervised learning can address challenges like lack of labeled data, the curse of dimensionality, overfitting, feature engineering, and outliers
An introduction to representation learning
Source: https://becominghuman.ai/microsoft-azure-machine-learning-x-udacity-lesson-4-notes-ab5444ed9227