Outlier Detection with Simple Machine Learning Methods (with an Example from the 2022 National Service Outsourcing Competition)
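We use problem A03 of the 2022 National Service Outsourcing Competition as the running example to demonstrate outlier detection in code.
The main task is to find items with abnormal sales volume or abnormal price. The data covers four months: more than 17 million item records and more than 600,000 shop records. The problem emphasizes time and space complexity, as well as the detection rate and accuracy of the outlier identification. We use shop-level analysis to support the item-level anomaly detection, which improves credibility and accuracy.
Part of the data: https://pan.baidu.com/s/1KatV_6ozYHjPkNjfVGBPmw (extraction code: ee8i)
Here we only run an exploratory analysis on 10,000 item records. For other outlier detection methods, see this post: https://editor.csdn.net/md/?articleId=124340047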
Ridge regression

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt  # plotting
    from sklearn.linear_model import Ridge  # ridge regression from sklearn.linear_model
    from sklearn.metrics import mean_squared_error

    target_file = r"D:\PythonFiles\Service Outsourcing\Distribution testing\ITEM_STOCK已補202106_10000_drop.tsv"

    FEATURES = ["ITEM_STOCK", "ITEM_PRICE", "ITEM_FAV_NUM", "TOTAL_EVAL_NUM"]
    # The original fitted ITEM_SALES_AMOUNT but computed residuals against
    # ITEM_SALES_VOLUME; one target column is used for both here.
    TARGET = "ITEM_SALES_AMOUNT"


    def draw(file, y_pred):
        # Plot actual and fitted values against ITEM_STOCK (rows are sorted by it).
        plt.plot(file["ITEM_STOCK"], file[TARGET], "b", label="Actual")
        plt.plot(file["ITEM_STOCK"], y_pred, "r", label="Fitted")
        plt.legend()
        plt.show()


    def error_calculation(file, y_pred, percentage):
        # Absolute residual per row; save the top (1 - percentage) share as suspected outliers.
        n = len(file)
        error = np.abs(file[TARGET].to_numpy() - np.ravel(y_pred))
        # assign() attaches the errors by position; join() would align on the
        # index, which no longer matches the row order after sorting.
        ranked = file.assign(error=error).sort_values(by="error", ascending=False)
        ranked.head(int((1 - percentage) * n)).to_csv("2.csv", encoding="utf-8")


    def ridge(file):
        file = file.sort_values(by="ITEM_STOCK", ascending=False)
        x = file[FEATURES]
        y = file[TARGET]
        # Optional polynomial expansion (left disabled, as in the original):
        # from sklearn.preprocessing import PolynomialFeatures
        # x = PolynomialFeatures(degree=6).fit_transform(x)
        estimator = Ridge(alpha=1)
        estimator.fit(x, y)
        y_pred = estimator.predict(x)
        # print("Coefficients:\n", estimator.coef_)
        # print("Intercept:\n", estimator.intercept_)
        print("Mean squared error:\n", mean_squared_error(y, y_pred))
        error_calculation(file, y_pred, 0.95)
        draw(file, y_pred)


    def main():
        f = pd.read_csv(target_file, sep="\t", encoding="utf-8")
        ridge(f.head(10000))


    if __name__ == "__main__":
        main()

The result is shown below:
The fit is clearly poor, because a linear model such as ridge regression does not suit data with this distribution.
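The script above ranks rows by absolute residual and keeps the top 5% as suspected outliers. The following is a minimal sketch of that step on synthetic data. The function name `top_residuals`, the column names `x` and `y`, and the data itself are hypothetical and not taken from the competition set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge


def top_residuals(df, features, target, keep=0.95):
    """Return the rows whose absolute residual falls in the top (1 - keep) share."""
    model = Ridge(alpha=1.0).fit(df[features], df[target])
    residual = np.abs(df[target].to_numpy() - model.predict(df[features]))
    n_flagged = int(round((1 - keep) * len(df)))
    # assign() attaches the array by position, so a shuffled index is harmless.
    return df.assign(error=residual).nlargest(n_flagged, "error")


# Synthetic data: y is linear in x, with three rows pushed far off the line.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
demo["y"] = 3 * demo["x"] + rng.normal(0, 0.1, 100)
demo.loc[[5, 50, 95], "y"] += 40
demo = demo.sort_values("x", ascending=False)  # index is no longer in row order

flagged = top_residuals(demo, ["x"], "y", keep=0.95)
print(flagged)
```

Attaching the residuals by position matters here: once the frame has been sorted, an index-based join would pair each error with the wrong row.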
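Empirical formula fitting
Basic economics tells us that price and sales volume are roughly inversely related, so neither the linear ridge model nor a polynomial fit is a good match. Instead we fit empirical formulas by least squares and keep whichever of three candidate formulas fits best.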
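For reference, the three candidate formulas, as defined by `func1`, `func2`, and `func3` in the code below, can be written as follows. Each is fitted by least squares with `curve_fit`, and the one with the smallest summed relative error (computed by `error_cul`) is kept. The symbols $f_k$ and $E_k$ are notation introduced here for illustration.

```latex
\begin{aligned}
f_1(x) &= a \exp(b x) + c \\
f_2(x) &= a\, x^{b} + c \\
f_3(x) &= a\, x^{b} + c \exp(d x) + e \\
E_k &= \sum_{i=1}^{n} \frac{\lvert y_i - f_k(x_i) \rvert}{y_i}, \qquad k = 1, 2, 3
\end{aligned}
```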

    import numpy as np
    import pandas as pd
    import pylab
    from scipy.optimize import curve_fit


    def func1(x, a, b, c):
        return a * np.exp(b * x) + c


    def func2(x, a, b, c):
        return a * np.power(x, b) + c


    def func3(x, a, b, c, d, e):
        return a * np.power(x, b) + c * np.exp(d * x) + e


    def error_cul(y, y_pred):
        # Relative error per point and its sum (assumes y contains no zeros).
        error_every = np.abs(np.asarray(y) - np.asarray(y_pred)) / np.asarray(y)
        return np.sum(error_every), error_every


    def choose(x, y):
        # Fit each empirical formula and keep the one with the smallest total relative error.
        results = []
        for func in (func1, func2, func3):
            popt, _ = curve_fit(func, x, y)  # popt holds the fitted parameters
            y_pred = func(x, *popt)  # evaluate the formula with the fitted parameters
            error_all, error_every = error_cul(y, y_pred)
            results.append((error_all, error_every, y_pred, popt))
        print([r[0] for r in results])
        best = min(results, key=lambda r: r[0])
        print(best[3])
        return best[1], best[2]


    def draw(x, y, y_pred):
        pylab.plot(x, y, "*", label="original values")
        pylab.plot(x, y_pred, "r", label="fit values")
        pylab.legend(loc=3, borderaxespad=0.0, bbox_to_anchor=(0, 0))
        pylab.show()


    def save(file, error_rate, percentage):
        # Save the top (1 - percentage) share of rows by relative error.
        n = len(error_rate)
        # assign() attaches the errors by position; join() would align on the
        # index, which no longer matches the row order after dropna() and sorting.
        file = file.assign(error=error_rate).sort_values(by="error", ascending=False)
        file.head(int((1 - percentage) * n)).to_csv("2.csv", encoding="utf-8")


    def main():
        df = pd.read_csv("data_202106_head.tsv", sep="\t", encoding="utf-8")
        df = df.dropna(axis=0, how="any", subset=["ITEM_FAV_NUM"])
        df = df.sort_values(by="ITEM_FAV_NUM", ascending=False)
        x = df["ITEM_FAV_NUM"].to_numpy().astype(float)
        y = df["ITEM_SALES_VOLUME"].to_numpy().astype(float)
        error_rate, y_pred = choose(x, y)
        draw(x, y, y_pred)
        save(df, error_rate, 0.95)


    if __name__ == "__main__":
        main()

The result is shown below:
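Zoomed in:
This works somewhat better than ridge regression, but it is still unsatisfactory. The fit is also heavily distorted by extreme outliers at the far end.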
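One common way to reduce the pull of such far-end outliers is to fit with a robust loss instead of plain least squares. This is not what the post does. It is a minimal sketch under assumptions: synthetic data, a power-law model of the same shape as `func2`, and `scipy.optimize.least_squares` with `loss="soft_l1"`. The names `power_model`, `residuals`, and `fit`, and the initial guess, are hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares


def power_model(params, x):
    a, b, c = params
    return a * np.power(x, b) + c


def residuals(params, x, y):
    return power_model(params, x) - y


def fit(x, y, loss):
    # loss="linear" is ordinary least squares; "soft_l1" grows roughly
    # linearly for large residuals, so extreme points pull less.
    start = np.array([50.0, -0.5, 0.0])  # hypothetical initial guess
    return least_squares(residuals, start, args=(x, y), loss=loss, f_scale=1.0).x


# Synthetic data: y = 100 / x + 5 plus small noise, with the last five
# points replaced by far-end outliers.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 50.0, 200)
y_clean = 100.0 * np.power(x, -1.0) + 5.0
y = y_clean + rng.normal(0.0, 0.5, x.size)
y[-5:] = 80.0

plain = fit(x, y, "linear")
robust = fit(x, y, "soft_l1")

# Mean absolute deviation from the noise-free curve on the non-outlier points.
inliers = slice(0, -5)
plain_err = np.mean(np.abs(power_model(plain, x)[inliers] - y_clean[inliers]))
robust_err = np.mean(np.abs(power_model(robust, x)[inliers] - y_clean[inliers]))
print("plain  :", plain, plain_err)
print("robust :", robust, robust_err)
```

Comparing the two error values shows how much the outliers shift each fit. Whether a robust loss helps on the real data depends on how the outliers are distributed, so treat this as a starting point rather than a fix.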
Isolation forest
We first try univariate detection, simply dropping rows that contain NaN.
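As a minimal sketch of what univariate detection with an isolation forest looks like, the following flags individual values in a single synthetic column. The data, the `contamination` value, and the variable names are assumptions for illustration and are not taken from the competition set.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic single column: 500 typical values plus three extreme ones.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(100.0, 10.0, 500), [400.0, 450.0, 500.0]])
column = values.reshape(-1, 1)  # scikit-learn expects a 2-D array

# contamination is the assumed share of outliers; random_state makes the run repeatable.
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = clf.fit_predict(column)        # -1 = outlier, 1 = inlier
scores = clf.decision_function(column)  # lower score = more anomalous

print("flagged values:", np.sort(values[labels == -1]))
```

Here the model scores the observed rows directly. The full script instead scores an evenly spaced grid over the value range, which shows which value ranges are treated as anomalous rather than which rows.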

    import numpy as np
    import pandas as pd
    from matplotlib import pyplot as plt
    from sklearn.ensemble import IsolationForest  # scikit-learn implementation


    def Isolation_Forest(file, target, n_estimators=100):
        for i in target:
            column = file.columns[i]
            # Drop rows where the column under test is NaN
            # (the original always dropped on ITEM_FAV_NUM).
            data = file.dropna(axis=0, how="any", subset=[column])
            clf = IsolationForest(n_estimators=n_estimators)
            # fit() expects a 2-D array, so reshape to a single column.
            clf.fit(data[column].values.reshape(-1, 1))
            # Evenly spaced grid over the column's value range.
            xx = np.linspace(data[column].min(), data[column].max(), len(data)).reshape(-1, 1)
            anomaly_score = clf.decision_function(xx)  # lower (negative) means more anomalous
            outlier = clf.predict(xx)  # -1 for outliers, 1 for inliers
            plt.figure(figsize=(20, 10))
            plt.plot(xx, anomaly_score, color="r", linewidth=1, label="anomaly score")
            # Shade the value ranges that the model predicts as outliers.
            plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
                             alpha=0.4, where=outlier == -1, label="outlier_region")
            plt.legend()
            plt.ylabel("anomaly score")
            plt.xlabel(column)
            print(column, "anomaly score range:", np.min(anomaly_score), "----", np.max(anomaly_score))
            plt.show()


    def main():
        df = pd.read_csv(r"../data_202106_head.tsv", encoding="utf-8", sep="\t")
        # target lists the column positions to test; n_estimators is the number of trees.
        target = [5, 6, 7, 13, 14, 15]
        Isolation_Forest(df, target, n_estimators=101)


    if __name__ == "__main__":
        main()

Note
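This part is mostly an initial exploration made right after receiving the data, and it is meant more for getting familiar with the methods and ideas in this area. The complete pipeline can be found in the author's outlier detection column.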