Detecting Anomalies Using Machine Learning

Published: 2023/11/29

What is Anomaly Detection?

Anomaly detection is a classic, frequently explored problem in machine learning. An anomaly is any unusual sequence or pattern inside a large corpus of data. Left unresolved, anomalies usually cause unexpected and complex errors or inefficiencies. Searching a corpus for them by hand may be easy when the corpus is small, but it becomes unreasonable once the corpus scales to an enormous size. For example, finding a grammatical mistake in a 200-word paragraph is pretty easy, but imagine trying to find every grammatical error in a 5,000-page encyclopedia. Fortunately, machine learning makes this problem much easier to solve (kind of).

First of all, what is machine learning? Machine learning essentially uses statistics to model how a system (or corpus) normally behaves, based on a training set (the background data set). We can then compare a system suspected of behaving abnormally (the target data set) against that model of normal behavior and try to uncover anomalies in the target. The main idea sounds easy and intuitive, but the process involves many complexities, such as finding a background data set that is representative of the whole population or distributing the computation across machines for large data sets. These are difficult obstacles that software engineers have to tackle before they can create a polished machine learning model, but I will not be discussing them here. Instead, I will focus on applying machine learning to find anomalies.
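
To make the background/target idea concrete, here is a minimal sketch that assumes the simplest possible model of normal behavior: the mean and standard deviation of a numeric background set. The function names, the sample data, and the 3-standard-deviation threshold are illustrative assumptions, not something the original article prescribes.

```python
import statistics

def fit_normal_model(background):
    """Summarize normal behavior as (mean, standard deviation) of the background set."""
    return statistics.mean(background), statistics.stdev(background)

def find_anomalies(target, model, threshold=3.0):
    """Return (index, value) pairs whose z-score against the model exceeds the threshold."""
    mean, std = model
    if std == 0:
        raise ValueError("background has no variation; z-scores are undefined")
    return [(i, x) for i, x in enumerate(target) if abs(x - mean) / std > threshold]

# Hypothetical data: the background describes normal readings, the target is under inspection.
background = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.0]
target = [10.0, 9.9, 14.5, 10.1]

model = fit_normal_model(background)
anomalies = find_anomalies(target, model)
print(anomalies)  # [(2, 14.5)]
```

A real system would likely use a richer model than a single mean and standard deviation, but the shape stays the same: fit on the background, then score the target.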

Types of Anomaly Detection Problems

Structured Anomalies in a Known Corpus of Data

There are four main types of anomaly detection problems. The first (and easiest) type is detecting structured anomalies in a known corpus. In these problems, you know what structure the anomalies will have and you know the format of the corpus. As a simplified analogy, suppose the corpus is a sequence of strictly increasing numbers and the task is to detect any number that is smaller than the one before it. Here we know the pattern of normal behavior (strictly increasing numbers) and we are looking for a known anomaly (a decrease between adjacent numbers). This type of problem is relatively easy because we have a clear structure to compare against, so we can measure and know for sure when something is an anomaly. In this case, it is relatively easy to build a high-performing machine learning algorithm with negligible false negatives.
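
The increasing-numbers analogy is small enough to write out directly. The sketch below is only an illustration of that analogy (no learning is involved); the function name and the sample list are made up for the example.

```python
def find_decreases(numbers):
    """Return (index, previous, current) for every element smaller than its predecessor."""
    return [
        (i, numbers[i - 1], numbers[i])
        for i in range(1, len(numbers))
        if numbers[i] < numbers[i - 1]
    ]

# Hypothetical corpus: mostly strictly increasing, with two injected decreases.
corpus = [1, 2, 3, 7, 5, 8, 9, 4, 10]
decreases = find_decreases(corpus)
print(decreases)  # [(4, 7, 5), (7, 9, 4)]
```

Because both the normal pattern and the anomaly are fully specified, every decrease is found and nothing else is flagged.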

Structured Anomalies in an Unknown Corpus of Data

The second type is detecting structured anomalies in an unknown corpus. These problems are harder than the first type because we now also have to work out how to parse and evaluate the corpus in order to uncover the anomalies. They are not that much harder, though: we still know the structure of the anomalies, so once the parsing problem is solved, this type becomes identical to the previous one. However, because the target corpus has an unknown structure, there will most likely be more false negatives than in the first type.
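
As a rough sketch of "solve the parsing problem, then reuse the known-anomaly check", the example below assumes the unknown corpus is free-form text with non-negative integers buried in it. The regular expression, function names, and sample string are assumptions chosen for illustration.

```python
import re

def extract_numbers(raw_text):
    """Pull every run of digits out of free-form text, ignoring the surrounding format."""
    return [int(token) for token in re.findall(r"\d+", raw_text)]

def find_decreases(numbers):
    """Return (index, previous, current) for every element smaller than its predecessor."""
    return [
        (i, numbers[i - 1], numbers[i])
        for i in range(1, len(numbers))
        if numbers[i] < numbers[i - 1]
    ]

# Hypothetical corpus whose layout is not known in advance.
raw = "id=1; id:2 | 3,, [7] id 5 -> 8"
numbers = extract_numbers(raw)
decreases = find_decreases(numbers)
print(numbers)    # [1, 2, 3, 7, 5, 8]
print(decreases)  # [(4, 7, 5)]
```

The parser is where false negatives creep in: a value written as "nine" rather than "9" would be skipped by this regular expression, so a decrease involving it would never be seen by the detector.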

Unstructured Anomalies in a Known Corpus of Data

The third type is detecting unstructured anomalies in a known corpus. Again, this type is more complex than the previous one. Although the corpus has a defined structure on which we can build our parsing algorithm, the anomalies are unstructured, meaning that we have to truly understand the heuristics of the background corpus in order to evaluate the target corpus against them. In this case, we start to see false positives in addition to false negatives, because the program has no proper way to check, without human interaction, whether the anomalies it detects are in fact true positives.
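
One very simple way to illustrate "learning the heuristics of the background corpus" is a vocabulary model: assume the known format is one whitespace-separated message per line, and score each target line by the fraction of its tokens never seen in the background. The scoring rule, the 0.5 threshold, and the log-like sample lines are all assumptions for this sketch, not the article's method.

```python
def build_vocabulary(background_lines):
    """Collect every token that appears in the background corpus."""
    vocabulary = set()
    for line in background_lines:
        vocabulary.update(line.split())
    return vocabulary

def unseen_fraction(line, vocabulary):
    """Fraction of tokens in the line that never appear in the background."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token not in vocabulary) / len(tokens)

def flag_lines(target_lines, vocabulary, threshold=0.5):
    """Return the target lines whose unseen-token fraction exceeds the threshold."""
    return [line for line in target_lines if unseen_fraction(line, vocabulary) > threshold]

# Hypothetical background (normal) and target (under inspection) corpora.
background = ["user login ok", "user logout ok", "user login ok", "disk check ok"]
target = ["user login ok", "root shell spawned remotely", "disk check failed"]

flagged = flag_lines(target, build_vocabulary(background))
print(flagged)  # ['root shell spawned remotely']
```

Note that "disk check failed" scores only 1/3 and is not flagged, even though a human might consider it anomalous. Lowering the threshold would catch it but would also start flagging harmless lines, which is the false positive and false negative tension described above.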

Unstructured Anomalies in an Unknown Corpus of Data

The last type is the toughest anomaly detection problem and is still being researched and improved today: detecting unstructured anomalies in an unknown corpus. In this case, not only do we have to understand the heuristics of the corpus, we also have to create many measures based on those heuristics to evaluate how anomalous each segment of the target corpus is. For each of these measures, we need to set a threshold at which a segment is classified as an anomaly. Each threshold has its own trade-offs, and finding the optimal thresholds requires evaluating performance in a multi-dimensional space in which each dimension represents one threshold. Additionally, after exploring this space, one might realize that the machine learning model did not properly represent the heuristics of the background corpus, and have to start over and think of another way to quantify or identify the patterns of the corpus. This performance feedback loop can make the whole process really complex and frustrating. Although this type of anomaly detection is very difficult, it can potentially yield amazing results.
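
The search over a multi-dimensional threshold space can be sketched as a plain grid search. The example assumes each segment already has one score per measure, that a small hand-labeled validation set is available, that a segment is flagged when any score exceeds its threshold, and that F1 is the performance metric. All of these choices, along with the names and data, are illustrative assumptions.

```python
from itertools import product

def f1_score(predicted, actual):
    """F1 of boolean predictions against boolean labels (0.0 when there are no true positives)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def search_thresholds(scores, labels, grids):
    """Try every threshold combination; return (best F1, thresholds), keeping the first best found."""
    best_f1, best_thresholds = -1.0, None
    for thresholds in product(*grids):
        # A segment is flagged when any of its scores exceeds the matching threshold.
        predicted = [any(s > t for s, t in zip(row, thresholds)) for row in scores]
        current = f1_score(predicted, labels)
        if current > best_f1:
            best_f1, best_thresholds = current, thresholds
    return best_f1, best_thresholds

# Hypothetical validation data: two measures per segment, with hand-assigned labels.
scores = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.3, 0.3), (0.7, 0.6), (0.4, 0.1)]
labels = [False, True, True, False, True, False]
grids = [[0.25, 0.5, 0.75], [0.25, 0.5, 0.75]]

best_f1, best_thresholds = search_thresholds(scores, labels, grids)
print(best_f1, best_thresholds)  # 1.0 (0.5, 0.5)
```

The number of combinations grows multiplicatively with each added measure, which is one reason the feedback loop described above becomes slow and frustrating in practice.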

Conclusion

Understandably, the degree to which we can ignore the structure of the anomalies and the corpus is proportional to the difficulty of creating the algorithm. The more specific we are about the structure of the anomalies and the corpus, the easier the machine learning algorithm is to build. The less structure we assume, the wider the range of problems the algorithm can be applied to. However, accuracy and precision also become issues as the structure of the anomalies and corpus becomes more vague. In an ideal world, if we made a super generic and accurate machine learning algorithm and tuned it perfectly to fit every situation, we would be able to apply it to any problem in the world. In health and medicine, we could detect problematic sub-sequences in genomes and catch illnesses like cancer well before they become an issue. In technology, we could apply the algorithm to a real-time logging system and uncover hackers or malicious activity the instant it occurs. There are so many other fields where anomaly detection can be applied, and if we can one day perfect it, we can solve many issues that are stumping scientists, engineers, and researchers today.

Translated from: https://towardsdatascience.com/detecting-anomalies-using-machine-learning-e3495f79718