Quickly Implementing Anomaly Inspection in SLS

Published: 2024/8/23

1. Survey of related algorithms

1.1 Common open-source algorithms

Yahoo: EGADS
Facebook: Prophet
Baidu: Opprentice
Twitter: Anomaly Detection
Red Hat: hawkular
Ali + Tsinghua: Donut
Tencent: Metis
Numenta: HTM
CMU: SPIRIT
Microsoft: YADING
LinkedIn: an improved version of SAX
Netflix: Argos
NEC: CloudSeer
NEC + Ant: LogLens
MoogSoft: a startup whose work is quite good and worth a look for reference

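1.2 Anomaly detection based on statistical methods

Statistical methods evaluate different metrics of the time series (mean, variance, divergence, kurtosis, and so on) and raise alerts against thresholds set from human experience. Historical data can also be brought in through period-over-period and year-over-year comparisons, again with thresholds tuned by hand.
Statistical indicators such as the change in windowed mean or windowed variance handle the anomalies in panels (1, 2, 5) of the figure well; local extrema detect the spikes in panel (4); a time-series forecasting model captures the trends in panels (3, 6) and flags points that do not follow the pattern.

How do we decide that a point is anomalous?

N-sigma
Boxplot
Grubbs' Test
Extreme Studentized Deviate (ESD) Test

Notes:

N-sigma: in a normal distribution, 99.73% of the data lies within three standard deviations of the mean. If the data follows a known distribution, the probability of observing the current value can be inferred from the distribution curve.

Grubbs' test: commonly used to test for a single outlier in a normally distributed data set.

ESD test: extends Grubbs' Test to the detection of k outliers.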
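As a rough illustration of the first two rules in the list above, the following Python sketch flags outliers with the N-sigma rule and with boxplot fences. The function names, the sample series, and the conventional fence multiplier k = 1.5 are assumptions made for this example; they are not taken from SLS.

```python
import statistics

def n_sigma_outliers(values, n=3.0):
    """Indices of points more than n standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) > n * std]

def boxplot_outliers(values, k=1.5):
    """Indices of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Hypothetical metric: 19 ordinary readings followed by one spike.
series = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9,
          10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 25.0]

print(n_sigma_outliers(series))   # index of the spike
print(boxplot_outliers(series))   # index of the spike
```

Note that a large outlier inflates the mean and standard deviation it is measured against, so the N-sigma rule becomes less sensitive on short windows; the quartile-based fences are more robust in that respect.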

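1.3 Anomaly detection based on unsupervised methods

What is an unsupervised method: whether a method is supervised depends mainly on whether the data being modeled has labels. If the input data is labeled, it is supervised learning; if it is not, it is unsupervised learning.
Why introduce unsupervised methods: in the early stage of building monitoring, user feedback is scarce and valuable. Unsupervised methods make it possible to set up a reliable monitoring strategy quickly without that feedback.
For single-dimensional metrics

Regression methods (Holt-Winters, ARMA) learn a predicted series from the original observed series; the residuals between the two are then analyzed to find anomalies.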
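The residual idea can be sketched in a few lines of Python. This example uses simple exponential smoothing as a deliberately simplified stand-in for Holt-Winters or ARMA, and judges residuals with a median-absolute-deviation scale; the smoothing factor, the threshold, and the sample data are all assumptions for illustration.

```python
import statistics

def one_step_forecasts(values, alpha=0.3):
    """One-step-ahead forecasts from simple exponential smoothing."""
    level = values[0]
    forecasts = []
    for v in values:
        forecasts.append(level)                   # forecast made before seeing v
        level = alpha * v + (1 - alpha) * level   # update after seeing v
    return forecasts

def residual_anomalies(values, alpha=0.3, n=3.0):
    """Indices whose forecast residual is far from the typical residual."""
    forecasts = one_step_forecasts(values, alpha)
    residuals = [v - f for v, f in zip(values, forecasts)]
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    scale = 1.4826 * mad   # MAD rescaled to match a normal standard deviation
    if scale == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r - med) > n * scale]

# Hypothetical series with a small repeating pattern and one injected spike.
values = [10 + (i % 4) * 0.1 for i in range(40)]
values[25] = 14.0

print(residual_anomalies(values))
```

The points right after the spike may also be flagged, because the smoothed level needs a few steps to recover; a production model would typically exclude detected anomalies from the level update.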
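For multi-dimensional metrics
- What "multi-dimensional" means: (time, cpu, iops, flow)
- iForest (IsolationForest) is an ensemble-based anomaly detection method
  - Suited to continuous data, with linear time complexity and high accuracy
  - Definition of an anomaly: outliers that are easy to isolate, that is, points that are sparsely distributed and far from high-density groups.
- A few remarks
  - The more trees, the more stable the result; each tree is independent of the others, so the method can be deployed in large-scale distributed systems
  - The algorithm is not well suited to very high-dimensional data, because noise dimensions and sensitive dimensions cannot be actively filtered out
  - The original iForest algorithm is sensitive only to global outliers and less sensitive to points that are only locally sparse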
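To make the "easy to isolate" definition concrete, here is a compact pure-Python sketch of an isolation forest. It follows the general shape of the published algorithm (random split dimension, random split value, average path length normalized by c(n)), but it is an illustrative toy, not the SLS implementation; the two-dimensional (cpu, iops) data and all parameter values are assumptions. It expects at least two input points.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node, depth=0):
    """Depth at which the point lands, adjusted for unsplit leaf size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    side = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(point, node[side], depth + 1)

def anomaly_scores(points, n_trees=100, sample_size=64, seed=0):
    """Score in (0, 1]; values near 1 mean the point is isolated quickly."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(points, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2.0 ** (-sum(path_length(p, t) for t in trees) / n_trees / norm)
            for p in points]

# Hypothetical (cpu, iops) samples: one dense cluster plus one distant point.
data_rng = random.Random(1)
points = [(data_rng.gauss(50, 2), data_rng.gauss(1000, 30)) for _ in range(60)]
points.append((95.0, 3000.0))

scores = anomaly_scores(points)
print(max(range(len(scores)), key=scores.__getitem__))  # index of top score
```

Because the trees are built independently from random subsamples, the loop that builds them could be distributed across workers, which is the property the remarks above refer to.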

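1.4 Anomaly detection based on deep learning

Paper: "Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications" (WWW 2018)

Problem addressed: seasonal time-series monitoring data that contains some missing points and anomalous points

The model training structure is as follows

At detection time, MCMC imputation is used to handle the known missing points in the observation window. The core idea is to approximate the marginal distribution iteratively with the already trained model (the figure illustrates one iteration of MCMC imputation).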
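The iteration can be sketched as follows: observed points stay fixed, and only the missing positions are overwritten by the model's reconstruction on each pass. In this sketch `reconstruct` is a hypothetical callable standing in for "encode the window, sample z, decode" on a trained VAE, and `neighbor_mean` is a toy replacement used only so the example runs; neither comes from the paper's code.

```python
def mcmc_impute(window, missing, reconstruct, n_iter=10):
    """Iteratively fill missing points while keeping observed points fixed.

    window      -- list of floats (values at missing positions are ignored)
    missing     -- list of bools, True where the point is known to be missing
    reconstruct -- hypothetical model call: full window in, reconstruction out
    """
    # Start from a placeholder at the missing positions (zero is an assumption).
    x = [0.0 if m else v for v, m in zip(window, missing)]
    for _ in range(n_iter):
        x_rec = reconstruct(x)
        # Take the reconstruction only where data is missing.
        x = [r if m else v for v, r, m in zip(x, x_rec, missing)]
    return x

def neighbor_mean(x):
    """Toy stand-in for a trained model: average of the two neighbours."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[min(i + 1, n - 1)]) / 2.0 for i in range(n)]

window = [1.0, 2.0, 0.0, 4.0, 5.0]
missing = [False, False, True, False, False]
filled = mcmc_impute(window, missing, neighbor_mean)
print(filled)
```

With a real VAE the reconstruction is sampled rather than deterministic, so successive iterations form a Markov chain instead of converging after a single step as this toy does.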

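1.5 Anomaly detection with supervised methods

Labeling anomalies is itself a complicated task:
- Users usually define and label anomalies from the system or service point of view. The associated low-level metrics and trace metrics are numerous and cannot be captured from just a few dimensions (a label is closer to a snapshot of the whole system)
- Architectures are designed with service self-healing, so low-level anomalies often do not affect the upper-layer business
- Tracing an anomaly to its source is complex; in many cases a single monitoring metric only reflects the result of the anomaly, not the anomaly itself
- Labeled samples are few and anomaly types are varied, and learning from small samples still has room for improvement
Common supervised machine learning methods
- xgboost, gbdt, lightgbm, and so on
- Some DNN classification networks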

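2. Algorithm capabilities provided in SLS

Time-series analysis
- Prediction: fit a baseline from historical data
- Anomaly detection, change-point detection, inflection-point detection: find anomalous points
- Multi-period detection: discover periodic patterns in data access
- Time-series clustering: find series whose shape differs from the rest
Pattern analysis
- Frequent pattern mining
- Differential pattern mining
Intelligent clustering of massive text
- Supports logs in any format: Log4J, JSON, single-line (syslog)
- Logs are filtered by arbitrary conditions and then Reduced; for a Pattern produced by Reduce, the raw data can be looked up by signature
- Compare Patterns across different time ranges
- Dynamically adjust the Reduce precision
- Results in seconds on hundreds of millions of records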

3. Hands-on analysis for a traffic scenario

3.1 Visualizing multi-dimensional monitoring metrics

The SQL logic is as follows:

    * | select time, buffer_cnt, log_cnt, buffer_rate, failed_cnt, first_play_cnt, fail_rate
        from (
          select
            date_trunc('minute', time) as time,
            sum(buffer_cnt) as buffer_cnt,
            sum(log_cnt) as log_cnt,
            case when is_nan(sum(buffer_cnt)*1.0 / sum(log_cnt)) then 0.0
                 else sum(buffer_cnt)*1.0 / sum(log_cnt) end as buffer_rate,
            sum(failed_cnt) as failed_cnt,
            sum(first_play_cnt) as first_play_cnt,
            case when is_nan(sum(failed_cnt)*1.0 / sum(first_play_cnt)) then 0.0
                 else sum(failed_cnt)*1.0 / sum(first_play_cnt) end as fail_rate
          from log
          group by time
          order by time
        )
        limit 100000
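For readers who want to reason about what this query computes outside SLS, the following Python sketch mirrors its logic: bucket rows by minute, sum the counters, and derive the two rates with 0/0 mapped to 0.0. It assumes `time` is an epoch timestamp in seconds and that the engine follows IEEE 754 floating-point division (so a non-zero count divided by zero is infinity rather than NaN); the sample rows are made up.

```python
import math
from collections import defaultdict

COUNTERS = ("buffer_cnt", "log_cnt", "failed_cnt", "first_play_cnt")

def safe_rate(numerator, denominator):
    """Mirror `case when is_nan(a*1.0/b) then 0.0 else a*1.0/b end`."""
    if denominator == 0:
        # 0/0 is NaN in IEEE 754 and is replaced by 0.0; x/0 is infinity.
        return 0.0 if numerator == 0 else math.copysign(math.inf, numerator)
    return numerator / denominator

def aggregate_per_minute(rows):
    """Group rows into one-minute buckets and compute both rates."""
    buckets = defaultdict(lambda: dict.fromkeys(COUNTERS, 0))
    for row in rows:
        minute = row["time"] - row["time"] % 60
        for key in COUNTERS:
            buckets[minute][key] += row[key]
    result = []
    for minute in sorted(buckets):
        sums = buckets[minute]
        result.append({
            "time": minute, **sums,
            "buffer_rate": safe_rate(sums["buffer_cnt"], sums["log_cnt"]),
            "fail_rate": safe_rate(sums["failed_cnt"], sums["first_play_cnt"]),
        })
    return result

# Hypothetical raw rows.
rows = [
    {"time": 0,  "buffer_cnt": 1, "log_cnt": 10, "failed_cnt": 0, "first_play_cnt": 0},
    {"time": 30, "buffer_cnt": 2, "log_cnt": 20, "failed_cnt": 0, "first_play_cnt": 0},
    {"time": 65, "buffer_cnt": 0, "log_cnt": 5,  "failed_cnt": 1, "first_play_cnt": 4},
]
table = aggregate_per_minute(rows)
for line in table:
    print(line)
```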

3.2 Day-over-day comparison charts for each metric

The SQL logic is as follows:

    * | select
          time,
          log_cnt_cmp[1] as log_cnt_now,
          log_cnt_cmp[2] as log_cnt_old,
          case when is_nan(buffer_rate_cmp[1]) then 0.0 else buffer_rate_cmp[1] end as buf_rate_now,
          case when is_nan(buffer_rate_cmp[2]) then 0.0 else buffer_rate_cmp[2] end as buf_rate_old,
          case when is_nan(fail_rate_cmp[1]) then 0.0 else fail_rate_cmp[1] end as fail_rate_now,
          case when is_nan(fail_rate_cmp[2]) then 0.0 else fail_rate_cmp[2] end as fail_rate_old
        from (
          select
            time,
            ts_compare(log_cnt, 86400) as log_cnt_cmp,
            ts_compare(buffer_rate, 86400) as buffer_rate_cmp,
            ts_compare(fail_rate, 86400) as fail_rate_cmp
          from (
            select
              date_trunc('minute', time - time % 120) as time,
              sum(buffer_cnt) as buffer_cnt,
              sum(log_cnt) as log_cnt,
              sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate,
              sum(failed_cnt) as failed_cnt,
              sum(first_play_cnt) as first_play_cnt,
              sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate
            from log
            group by time
            order by time
          )
          group by time
        )
        where time is not null
        limit 1000000
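The query uses element [1] of each `ts_compare` result as the current value and element [2] as the value 86400 seconds (one day) earlier. The Python sketch below illustrates that pairing on two-minute buckets. It only illustrates the idea and is not how `ts_compare` is implemented; the helper names and sample values are assumptions.

```python
def bucket_start(ts, width=120):
    """Align an epoch timestamp (seconds) to the start of its bucket."""
    return ts - ts % width

def compare_with_offset(series, offset=86400):
    """Pair each bucket's value with the value `offset` seconds earlier.

    `series` maps bucket start time to a metric value. The result maps
    each time to (now, old); `old` is None when no earlier bucket exists.
    """
    return {t: (v, series.get(t - offset)) for t, v in sorted(series.items())}

# Hypothetical log counts, one day apart.
series = {bucket_start(10): 100.0, bucket_start(86410): 130.0}
compared = compare_with_offset(series)
print(compared)
```

Plotting the `now` and `old` columns on the same axis gives the day-over-day chart this section describes.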

3.3 Dynamic visualization of each metric

The SQL logic is as follows:

    * | select
          time,
          case when is_nan(buffer_rate) then 0.0 else buffer_rate end as show_index,
          isp as index
        from (
          select
            date_trunc('minute', time) as time,
            sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate,
            sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate,
            sum(log_cnt) as log_cnt,
            sum(failed_cnt) as failed_cnt,
            sum(first_play_cnt) as first_play_cnt,
            isp
          from log
          group by time, isp
          order by time
        )
        limit 200000

3.4 Monitoring dashboard for the set of anomalies

SQL logic behind the charts for the anomaly monitoring items

    * | select res.name
        from (
          select ts_anomaly_filter(province, res[1], res[2], res[3], res[6], 100, 0) as res
          from (
            select
              t1.province as province,
              array_transpose(ts_predicate_arma(t1.time, t1.show_index, 5, 1, 1)) as res
            from (
              select
                province,
                time,
                case when is_nan(buffer_rate) then 0.0 else buffer_rate end as show_index
              from (
                select
                  province,
                  time,
                  sum(buffer_cnt)*1.0 / sum(log_cnt) as buffer_rate,
                  sum(failed_cnt)*1.0 / sum(first_play_cnt) as fail_rate,
                  sum(log_cnt) as log_cnt,
                  sum(failed_cnt) as failed_cnt,
                  sum(first_play_cnt) as first_play_cnt
                from log
                group by province, time
              )
            ) t1
            inner join (
              select DISTINCT province
              from (
                select province, time, sum(log_cnt) as total
                from log
                group by province, time
              )
              where total > 200
            ) t2 on t1.province = t2.province
            group by t1.province
          )
        )
        limit 100000
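One part of this query that is easy to miss is the inner join against `t2`: it keeps only provinces that have at least one (province, time) group with more than 200 log entries, so low-traffic provinces never reach the ARMA step. The Python sketch below reproduces just that filter; the function names and sample rows are assumptions, and the prediction and `ts_anomaly_filter` stages are intentionally left out.

```python
from collections import defaultdict

def active_provinces(rows, min_total=200):
    """Provinces with at least one time bucket whose log_cnt sum exceeds min_total."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["province"], row["time"])] += row["log_cnt"]
    return {prov for (prov, _), total in totals.items() if total > min_total}

def keep_active(rows, min_total=200):
    """Equivalent of the inner join: drop rows of low-traffic provinces."""
    active = active_provinces(rows, min_total)
    return [row for row in rows if row["province"] in active]

# Hypothetical rows: province A passes the threshold at time 0, B never does.
rows = [
    {"province": "A", "time": 0, "log_cnt": 150},
    {"province": "A", "time": 0, "log_cnt": 100},
    {"province": "A", "time": 1, "log_cnt": 10},
    {"province": "B", "time": 0, "log_cnt": 50},
    {"province": "B", "time": 1, "log_cnt": 60},
]
print(active_provinces(rows))
print(len(keep_active(rows)))
```

Note that once a province qualifies, all of its buckets are kept, including quiet ones, which is what the time-series model needs to see a continuous curve.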

Detailed analysis of the SQL logic above

This article is original content from the Yunqi Community and may not be reproduced without permission.
