The stages of a machine learning project

Published: 2023/12/15

Many businesses and organizations are turning to machine learning for solutions to challenging business goals and problems. Providing machine learning solutions to meet these needs requires that one follows a systematic process from problem to solution. The stages of a machine learning project constitute the machine learning pipeline. The machine learning pipeline is a systematic progression of a machine learning task from data to intelligence.

During our training as ML engineers, we invest a lot of focus in learning algorithms, techniques, and machine learning tools, but we often pay less attention to how to take an industry or business problem from its initial statement to a usable solution.

In this article, I present the machine learning pipeline, which provides a comprehensive approach to solving real-world problems using machine learning. I will start with the observable or explainable problem, as companies and businesses are likely to present it to an engineer, and walk you through the stages a project goes through until it ends as a usable solution available to platform end-users.

You will see, at a high level, the stages involved in building something like the Netflix movie recommendation engine, which runs in the background of the platform and personalizes your experience by showing you the movies you are likely to be interested in.

Solving any business problem follows these fundamental stages, so it is necessary for all practitioners to understand and leverage them. If you sharpen your thinking about machine learning projects in light of this article, I believe you will be more effective and structured when doing ML projects. You will also understand how to relate better to industry stakeholders who may not follow the ML buzz but are genuinely seeking relevant solutions to good problems.

I know the frustration inherent in not following this paradigm from my first internship, as a data analyst for Africa Data Centers, one of Africa’s leading data center colocation service providers. At the time I did not realize that my approach was suboptimal, because I did not know any better. I can only imagine how much time and frustration this understanding would have saved me, and how much it would have improved my performance and output.

The stages of a machine learning project are summarized by the figure below.

The machine learning pipeline, from business problem to solution (image by author)

The business problem or research problem

Start with the business problem.

In many cases, organizations present this as a goal: what they want to achieve. Very often there is a story behind it, and that story is important: this is how we have been doing things, or how the system has been behaving, and this is what we would like to achieve. In my case, it was something like:

“We would like to use the historical data we have on our energy consumption to determine our options for energy efficiency optimization and cost-saving.”

Simply put, the business problem was: what can we do to reduce our energy expenses?

Framing the machine learning problem

From the business problem, you frame the machine learning problem. This is where domain knowledge and expertise come in. This step is not trivial at all, because to get the right solution you must start with the right problem and the right questions.

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question” (John W. Tukey, The Future of Data Analysis)

Many organizations that use machine learning seriously consult domain experts to help them ask the right questions. However, you will not always be able to bring in an expert when faced with a problem. You may be the expert or consultant who was brought in to figure things out. In that case, research is the only way to go: what is the industry doing to solve the same or similar problems?

Your goal is to not waste time answering the wrong question. Translating a business problem into a machine learning problem is so important that it determines the fate of your entire project. Completing this step should start leading you towards the kind of data needed to answer the machine learning question. From your research and understanding, you should already know which relevant features to expect in your dataset.

Data collection and/or integration

Is the data relevant to the problem?
Is the data enough to train a good model?

This step involves putting together existing data or collecting the necessary data. If data already exists, you will have to determine whether it is relevant to the machine learning problem and thus to the business problem. This is especially important if the organization is not a typical machine learning organization and did not decide what data it needed before collecting what it now has. It is not uncommon (I have been there once) to find that an organization has collected data that is not relevant to the problem it wants to solve.

A good rule of thumb is to ask: “What data would a human expert need to solve the problem if this task were left to them?” If a human expert cannot use the available data to deduce correct predictions, it is almost certain that a machine cannot either. Again, an expert can tell you how they would solve the problem and what data they would need to answer the question you are trying to answer with machine learning.

The quality of the model or analysis depends heavily on the quality of the data. Just as one cannot make fine wine from low-quality grapes, one cannot build a good model from poor-quality data.

It might be possible to derive more valuable features from the original data using feature engineering, so think critically about whether relevant features are simply hidden in the dataset. Where they are not, it is better to advise the organization on what data it should collect to serve its goal.

The final consideration is the size (number of examples) of the dataset. While there is no definite answer to how much data is enough, algorithms generally perform better when trained on larger amounts of data.

A common rule of thumb is to have at least 10 times as many examples as there are features in the dataset.

If this is not the case, then more data should be collected. Many options are available for getting more data, including crowdsourcing on platforms like Amazon Mechanical Turk, other external sources, and internal data collection within the organization. For some problems, it might be possible and appropriate to generate more data from existing examples (data augmentation). Whether this is appropriate is best determined by a machine learning engineer.
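
The 10-examples-per-feature guideline mentioned above is easy to check before any modelling starts. The sketch below is only an illustration: the function names and the default factor of 10 are assumptions made for this example, and the guideline itself is a heuristic, not a guarantee that the data is sufficient.

```python
def meets_ten_times_rule(n_examples: int, n_features: int, factor: int = 10) -> bool:
    """Return True if there are at least `factor` examples per feature."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_examples >= factor * n_features


def examples_still_needed(n_examples: int, n_features: int, factor: int = 10) -> int:
    """Number of extra examples needed to satisfy the heuristic (0 if already met)."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return max(0, factor * n_features - n_examples)


# Hypothetical dataset sizes, chosen only for illustration.
print(meets_ten_times_rule(n_examples=500, n_features=20))   # True, since 500 >= 200
print(examples_still_needed(n_examples=150, n_features=20))  # 50
```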

Data preparation/pre-processing

At this stage, you explore the data critically and prepare or transform it so that it is ready for training. Look out for things such as missing data, duplicate examples and features, feature value ranges, the data types of values, and feature units. Use simple tools to quickly examine the data and glean as much general information as possible. After gathering useful information, some of the following actions may be required:

Deal with missing data (NaN, NA, “”, ?, None) and outliers: Standardize all missing-value markers to np.nan. Common options for handling missing data and outliers are dropping the affected examples or applying imputation techniques (mean, median, mode).
Deal with duplicate features and/or examples: Duplicate features cause linear dependence in the dataset, and duplicate examples may give the false impression that there is enough data when the number of unique examples is too small to reasonably train a good model.
Feature scaling, normalization, standardization: You want your features to be in the same or comparable ranges, typically 0 to 1. This helps your model train faster and more stably, especially if you are using optimization algorithms like gradient descent.
Balance the class sizes for categorical data: Ensure that the number of training examples across the different target categories in your dataset is comparable. If the task involves naturally skewed patterns, where one class always dominates the other, balancing may not be an option. This is common in anomaly detection tasks such as rare disease prediction (e.g. cancer) and fraud detection. An appropriate training method and evaluation metric must be chosen for skewed datasets that cannot reasonably be balanced.
Harmonize inconsistent units: Inconsistent units can easily escape notice. Ensure that all values measuring the same physical quantity use the same unit. Just to emphasize the point, NASA lost its $125-million Mars Climate Orbiter spacecraft because of inconsistent units.
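
Three of the actions above (standardizing missing values, removing duplicate examples, and scaling features to the 0 to 1 range) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: `math.nan` stands in for `np.nan`, the set of missing-value markers is assumed from the list above, median imputation is just one of the options named, and all function names are hypothetical. On a real project a library such as pandas would normally do this work.

```python
import math
from statistics import median

# Assumed set of markers that should all be treated as "missing".
MISSING_TOKENS = {"", "NA", "NaN", "?", "None"}


def to_float_or_nan(value):
    """Map any missing-value marker to math.nan; otherwise parse as float."""
    if value is None:
        return math.nan
    if isinstance(value, str):
        if value.strip() in MISSING_TOKENS:
            return math.nan
        return float(value)
    return float(value)


def impute_median(column):
    """Replace NaN entries with the median of the observed values."""
    observed = [x for x in column if not math.isnan(x)]
    fill = median(observed)
    return [fill if math.isnan(x) else x for x in column]


def drop_duplicate_rows(rows):
    """Remove exact duplicate examples, keeping first occurrences in order.

    Run this after imputation: rows that still contain NaN may not compare equal.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # a constant column has no spread to scale
    return [(x - lo) / (hi - lo) for x in column]


raw = ["10", "NA", "30", "?", "20"]           # hypothetical raw column
column = [to_float_or_nan(v) for v in raw]    # missing markers become NaN
imputed = impute_median(column)               # NaN replaced by the median, 20.0
print(min_max_scale(imputed))                 # [0.0, 0.5, 1.0, 0.5, 0.5]
print(drop_duplicate_rows([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]))  # [[1.0, 2.0], [3.0, 4.0]]
```

Min-max scaling is used here because the text mentions the 0 to 1 range; standardization to zero mean and unit variance is a common alternative.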

Data visualization and exploratory analysis

Data visualization is one of the most effective means of exploratory analysis. Using plots like histograms and scatter plots, you can easily spot outliers, trends, clusters, or categories in your dataset. However, visualizations are most useful for low-dimensional data (1D, 2D, 3D), because higher dimensions cannot be plotted directly. For high-dimensional data, you may select some specific features to visualize.

Feature selection and feature engineering

Which features are relevant to make correct predictions?

The goal is to select features so that you have the least correlation between the features themselves but the maximum correlation between each feature and the target.
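
One way to act on this goal is a simple correlation filter. The sketch below is an assumption-laden illustration, not a method prescribed by this article: it uses the Pearson coefficient (which only captures linear relationships), a greedy selection order, and two arbitrary thresholds (`min_target_corr` and `max_pair_corr`); the feature names and values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(features, target, min_target_corr=0.3, max_pair_corr=0.9):
    """Greedy filter over a dict of {name: column}.

    Features are ranked by |correlation with the target|. A feature is kept
    only if it is correlated enough with the target and is not highly
    correlated with any feature that has already been kept.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_target_corr:
            continue
        if all(abs(pearson(features[name], features[k])) <= max_pair_corr for k in kept):
            kept.append(name)
    return kept


# Invented data: temp_f is temp_c in other units, noise is unrelated to the target.
features = {
    "temp_c": [10, 15, 20, 25, 30],
    "temp_f": [50, 59, 68, 77, 86],
    "noise": [4, 1, 5, 1, 4],
}
energy_use = [105, 148, 205, 251, 298]

# Keeps exactly one of the two temperature columns and drops the noise column.
print(select_features(features, energy_use))
```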

Feature engineering involves transforming the original features in the dataset into new, potentially more useful features. As mentioned above, always think about what hidden features might be derived from the original data. Arguably, feature engineering is one of the most critical and time-consuming activities in the ML pipeline.

With all the above steps performed, you now have a sizable dataset with features that are relevant to the ML task, and you can proceed (with some confidence) to train a model.

Model Training

The first step before training is to split your dataset, with randomization, into a train set, a cross-validation (or development) set, and a test set.

Randomization helps to reduce bias in your models and is achieved by shuffling the data before splitting. It matters most when the data is stored in some systematic order (for example, sorted by class or by collection date), because shuffling keeps the model from learning that ordering instead of the underlying pattern. One caveat: for forecasting tasks, where the model must predict later values from earlier ones, a chronological split is usually preferred, since shuffling would leak future information into the train set.

There is no fixed rule for optimal splits, but the main intuition is to have as much training data as possible, a smaller but sufficient amount of data for tuning hyperparameters during training, and enough test data to assess the model’s ability to generalize. Some typical and commonly used splits include:

Some common data split percentages in machine learning

Next, set aside the test set for later testing and proceed with the train set to train your model. It is good practice to quickly try out different candidate algorithms and either pick the one with the best generalization performance on the cross-validation (dev) set for further tuning, or pick a set of algorithms to form an ensemble. Use the dev set for hyperparameter tuning.

Dataset splits and usage (image by author)
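
The shuffle-then-split procedure described in this section can be sketched as follows. The 60/20/20 proportions, the fixed seed, and the function name are assumptions chosen for the example, not values taken from this article. The shuffle assumes the examples are independent; it would not suit a forecasting task that needs a chronological split.

```python
import random


def train_dev_test_split(examples, dev_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of `examples`, then cut it into train, dev, and test lists."""
    if not 0 < dev_fraction + test_fraction < 1:
        raise ValueError("dev_fraction + test_fraction must be between 0 and 1")
    shuffled = list(examples)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducible splits
    n = len(shuffled)
    n_test = round(n * test_fraction)
    n_dev = round(n * dev_fraction)
    test = shuffled[:n_test]                # set aside until final evaluation
    dev = shuffled[n_test:n_test + n_dev]   # used for hyperparameter tuning
    train = shuffled[n_test + n_dev:]       # everything else is for training
    return train, dev, test


train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))  # 60 20 20
```

Fixing the seed keeps the split reproducible, so the test set stays the same across experiments and is never accidentally mixed into training.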

Model Evaluation

Is the model useful (does it meet the minimum required performance)?

Is the model computationally efficient?

Once you have optimized your model’s performance on the dev set as much as possible, you can now assess how well it performs on unseen data that was set aside in your test set. The performance observed on the test data gives you a glimpse of what you can expect to see in the production environment. Use single value evaluation metrics for quantifying performance.

Accuracy: suitable for classification tasks
Precision/recall: suitable for skewed classification tasks
R-squared: suitable for regression tasks
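
For the regression case, R-squared compares the model's squared error with that of a baseline that always predicts the mean of the targets. The sketch below uses the standard definition, 1 - SS_res / SS_tot; the function name and the sample values are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # baseline error
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


y_true = [3, 5, 7, 9]
print(r_squared(y_true, [3, 5, 7, 9]))  # 1.0, a perfect fit
print(r_squared(y_true, [6, 6, 6, 6]))  # 0.0, no better than predicting the mean
```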

It is hard to strike a good balance between precision and recall, so they are often combined into a single-value evaluation metric, the F1 score (their harmonic mean).
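
To see why accuracy alone can mislead on skewed data, and how precision and recall combine into F1, consider the sketch below. The labels are invented (5 positives out of 100 examples), the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Skewed dataset: 5 positive examples followed by 95 negative ones.
y_true = [1] * 5 + [0] * 95

# A model that always predicts the majority class looks accurate but is useless.
always_negative = [0] * 100
print(accuracy(y_true, always_negative))             # 0.95
print(precision_recall_f1(y_true, always_negative))  # (0.0, 0.0, 0.0)

# A model that finds 4 of the 5 positives at the cost of 4 false alarms.
some_model = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91
precision, recall, f1 = precision_recall_f1(y_true, some_model)
print(precision, recall)  # 0.5 0.8
print(round(f1, 3))       # 0.615
```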

If the minimum required performance is obtained, then you have a useful model that is ready for deployment.

“All models are wrong, but some are useful.” (George Box)

Model deployment, integration, and monitoring

“The deployment of machine learning models is the process for making your models available in production environments, where they can provide predictions to other software systems. It is only once models are deployed to production that they start adding value.” (Christopher Samiullah)

Deployment is crucial, and it is probably the ML engineer’s nightmare because it is more of a software engineering discipline. Nevertheless, ML engineers are largely expected to be able to deploy their models and integrate them with existing software systems to serve end-users. I have very little to say about deployment here, but one thing to note is how ML deployments fundamentally differ from explicitly programmed software.

Models in production environments suffer from performance decay over time, so monitoring the performance of your model in production is standard practice. Performance decay is largely inevitable, partly because the data distribution in the production environment drifts away from the distribution that existed in the train set. If you notice a significant difference in the production data distribution, then you need to retrain your model.

Model in production is continuously monitored, retrained, and deployed (image by author)

Over the lifetime of any deployed ML model, the monitor, retrain, and update cycle is a routine process. It helps to log system performance information continuously and to create performance drift alerts for efficient monitoring.
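
A drift alert of the kind described here can start out very simple. The sketch below is a crude mean-shift check on a single feature, not a formal statistical test: it measures how far the production mean has moved from the training mean, in units of the training standard deviation. The function names, the 0.5 threshold, and the sample values are all assumptions made for illustration.

```python
from statistics import mean, stdev


def mean_shift_score(train_values, production_values):
    """Absolute shift of the production mean, in training standard deviations."""
    spread = stdev(train_values)
    if spread == 0:
        raise ValueError("training values are constant; the score is undefined")
    return abs(mean(production_values) - mean(train_values)) / spread


def needs_retraining(train_values, production_values, threshold=0.5):
    """Raise a drift flag when the shift score exceeds the (assumed) threshold."""
    return mean_shift_score(train_values, production_values) > threshold


# Invented values for one feature as seen at training time and in production.
train_feature = [10, 12, 11, 13, 9, 10, 12, 11]
stable_batch = [11, 10, 12, 11]
drifted_batch = [15, 16, 14, 17]

print(needs_retraining(train_feature, stable_batch))   # False
print(needs_retraining(train_feature, drifted_batch))  # True
```

In a logging pipeline, a check like this would run on each new batch of production data, with the flag feeding the alerting system. It only detects shifts in the mean, so changes in spread or shape would need a different check.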

Conclusion

In this article, I summarized the stages of a machine learning project from understanding the problem to a usable solution.

An ML solution is a system with a machine learning engine running in the background.

Summary of activities:

Understand the business problems and needs
Frame the ML problem
Understand the data needs and acquire the data
Clean and preprocess the data
Select relevant features
Perform feature engineering
Train a model
Tune hyperparameters to optimize the performance of the model (accuracy and speed)
Test the model
Deploy the model
Monitor and update the model/system (continuous process)

Resources

Getting started with AWS machine learning course on Coursera

Machine Learning course by Andrew Ng on Coursera

How to deploy machine learning models

6 stages to get success in machine learning projects

Translated from: https://medium.com/swlh/the-stages-of-a-machine-learning-project-cf4bb073a4ad