Removing outliers: the Grubbs criterion (Java implementation)
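This is my first blog post, so I'll start with a simple algorithm. The principle is straightforward, and I'd like to share it.
I have recently been working on a project (on Android) that needs to discard abnormal data: given a previous set of readings, decide whether a newly arrived value is reasonable, and drop it when it deviates too much. I came across the Grubbs criterion. For the details and steps see http://www.docin.com/p-730815444.html, which I think is well written. Its example is a good way to understand the method: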
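Take the data set 8.2, 5.4, 14.0, 7.3, 4.7, 9.0, 6.5, 10.1, 7.7, 6.0. We want to remove the outlier (the maximum or the minimum) from it.
1. Sort the data: 4.7, 5.4, 6.0, 6.5, 7.3, 7.7, 8.2, 9.0, 10.1, 14.0
2. Compute the mean and the standard deviation: the mean is 7.89 and the sample standard deviation (dividing by n - 1) is 2.704.
3. The maximum and the minimum are the suspect values. Their deviations from the mean are 14.0 - 7.89 = 6.11 and 7.89 - 4.7 = 3.19. (These two values decide whether the maximum or the minimum is removed. The linked document says only one of them can be tested, but in practice both ends can be checked.)
4. For the minimum, G1 = (mean - minimum) / standard deviation = 3.19 / 2.704 ≈ 1.18
For the maximum, Gn = (maximum - mean) / standard deviation = 6.11 / 2.704 ≈ 2.26
5. Choose the significance level alpha, usually 0.01 or 0.05. A larger alpha gives a smaller critical value, so values are flagged as outliers more readily; pick it according to your situation. I use 0.05 here. Look up the critical value for the sample size n in the table further down and compare it with G1 and Gn: if G1 (or Gn) is greater than the critical value, remove that value; otherwise keep it.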
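Steps 1 to 4 above can be checked with a short program. This is only a minimal sketch: it uses a plain array rather than an ArrayList, and the names GrubbsWorkedExample and statistics are made up for illustration.

```java
import java.util.Arrays;

class GrubbsWorkedExample {
    /** Returns {mean, sample standard deviation, G1, Gn} for the given values. */
    static double[] statistics(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);                       // step 1: sort ascending
        int n = sorted.length;

        double sum = 0;
        for (double v : sorted) {
            sum += v;
        }
        double mean = sum / n;                     // step 2: mean

        double squares = 0;
        for (double v : sorted) {
            squares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(squares / (n - 1));  // step 2: sample standard deviation

        double g1 = (mean - sorted[0]) / sd;       // step 4: statistic for the minimum
        double gn = (sorted[n - 1] - mean) / sd;   // step 4: statistic for the maximum
        return new double[] { mean, sd, g1, gn };
    }

    public static void main(String[] args) {
        double[] data = { 8.2, 5.4, 14.0, 7.3, 4.7, 9.0, 6.5, 10.1, 7.7, 6.0 };
        double[] s = statistics(data);
        System.out.printf("mean=%.2f sd=%.3f G1=%.3f Gn=%.3f%n", s[0], s[1], s[2], s[3]);
    }
}
```

For the sample data this gives a mean of 7.89, a standard deviation of about 2.704, G1 of about 1.18 and Gn of about 2.26. Step 5, the comparison with the critical value, still depends on the table you choose.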
The Java implementation follows:
import java.util.ArrayList;

public class Grubbs {
    private ArrayList<Double> dataArrayList;
    private int length;
    private final double alpha = 0.05;

    // Takes a data set; calc() removes the largest and/or smallest value if it is an outlier.
    public Grubbs(ArrayList<Double> arrayList) {
        this.dataArrayList = arrayList;
        this.length = arrayList.size();
    }

    public ArrayList<Double> calc() {
        // The Grubbs criterion needs at least 3 values, so return smaller sets unchanged.
        if (dataArrayList.size() < 3) {
            return dataArrayList;
        }
        // Sort the data first; I use a basic bubble sort here.
        dataArrayList = bubbleSort(dataArrayList, length);
        // Compute the mean and the sample standard deviation.
        double average = calcAverage(dataArrayList);
        double standard = calcStandard(dataArrayList, length, average);
        // Compute G1 for the minimum and Gn for the maximum.
        double G1 = (average - dataArrayList.get(0)) / standard;
        double Gn = (dataArrayList.get(length - 1) - average) / standard;
        // Compare with the critical value. Removing the maximum first keeps index 0 valid.
        double critical = calcG(alpha, length);
        if (Gn > critical) {
            dataArrayList.remove(length - 1);
        }
        if (G1 > critical) {
            dataArrayList.remove(0);
        }
        return dataArrayList;
    }

    // Bubble sort, ascending, in place.
    private ArrayList<Double> bubbleSort(ArrayList<Double> arr, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n - i - 1; j++) {
                if (arr.get(j) > arr.get(j + 1)) {
                    double temp = arr.get(j);
                    arr.set(j, arr.get(j + 1));
                    arr.set(j + 1, temp);
                }
            }
        }
        return arr;
    }

    // Mean.
    public double calcAverage(ArrayList<Double> sample) {
        double sum = 0;
        for (int i = 0; i < sample.size(); i++) {
            sum += sample.get(i);
        }
        return sum / sample.size();
    }

    // Sample standard deviation (divides by n - 1).
    private double calcStandard(ArrayList<Double> array, int n, double average) {
        double sum = 0;
        for (int i = 0; i < n; i++) {
            double diff = array.get(i) - average;
            sum += diff * diff;
        }
        return Math.sqrt(sum / (n - 1));
    }

    // Table of critical values for n = 3..60; the alpha argument is not used for the lookup.
    // Note: the original comment says alpha = 0.05, but these numbers appear to match the
    // 0.005 column of commonly published one-sided Grubbs tables (two-sided alpha = 0.01);
    // the usual one-sided 0.05 value for n = 10 is 2.176, not 2.482. Verify before relying on them.
    private double calcG(double alpha, int n) {
        double[] N = {
            1.1546847100299753, 1.4962499999999703, 1.763678479497787,
            1.9728167175443088, 2.1391059896012203, 2.2743651271139984,
            2.386809875078279, 2.4820832497170997, 2.564121252001767,
            2.6357330437346365, 2.698971864039854, 2.755372404941574,
            2.8061052912205966, 2.8520798130619083, 2.894013795424427,
            2.932482154393285, 2.9679513293748547, 3.0008041587489247,
            3.031358153993366, 3.0598791335206963, 3.086591582831163,
            3.1116865231590722, 3.135327688211162, 3.157656337622164,
            3.178795077984819, 3.198850919445483, 3.2179177419513314,
            3.2360783011390764, 3.2534058719727748, 3.26996560491852,
            3.2858156522011304, 3.301008108808857, 3.31558980320037,
            3.329602965279218, 3.3430857935316243, 3.356072938839107,
            3.368595919061223, 3.3806834758032323, 3.3923618826659503,
            3.403655212591846, 3.41458557057518, 3.4251732969213213,
            3.435437145364717, 3.4453944396432576, 3.4550612115453876,
            3.464452322969104, 3.4735815741386, 3.482461799798589,
            3.491104954935569, 3.4995221913492585, 3.507723926208097,
            3.5157199035634887, 3.5235192496631433, 3.5311305227901078,
            3.5385617582575746, 3.5458205091071684, 3.5529138829882037,
            3.5598485756350797
        };
        // The original indexed past the end of the table for n > 60; fail with a clear message instead.
        if (n < 3 || n - 3 >= N.length) {
            throw new IllegalArgumentException("critical values are tabulated for n = 3..60 only");
        }
        return N[n - 3];
    }
}
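The use case described at the start, checking a newly arrived reading against earlier ones, could be sketched roughly as follows. This is a variation for illustration, not the class above: the names NewValueCheck and isOutlier are hypothetical, and the caller supplies the critical value. The 2.176 in main is the one-sided alpha = 0.05 value for n = 10 as commonly published in Grubbs tables; treat it as an assumption and check it against your own reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NewValueCheck {
    /**
     * Hypothetical helper: true if candidate lies more than `critical` sample
     * standard deviations from the mean of history plus candidate.
     */
    static boolean isOutlier(List<Double> history, double candidate, double critical) {
        List<Double> all = new ArrayList<>(history);
        all.add(candidate);
        int n = all.size();
        if (n < 3) {
            return false; // too few values for the criterion
        }

        double sum = 0;
        for (double v : all) {
            sum += v;
        }
        double mean = sum / n;

        double squares = 0;
        for (double v : all) {
            squares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(squares / (n - 1));
        if (sd == 0) {
            return false; // all values identical, nothing to flag
        }
        return Math.abs(candidate - mean) / sd > critical;
    }

    public static void main(String[] args) {
        List<Double> history = Arrays.asList(8.2, 5.4, 7.3, 4.7, 9.0, 6.5, 10.1, 7.7, 6.0);
        double critical = 2.176; // assumed table value: one-sided alpha = 0.05, n = 10
        System.out.println(isOutlier(history, 14.0, critical)); // true
        System.out.println(isOutlier(history, 7.5, critical));  // false
    }
}
```

Because the history list is copied, a rejected reading never contaminates the stored data; the caller decides whether to append an accepted value afterwards.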