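Classifying a test set with a training set in Weka: "Evaluating a classification model with an independent test set" (ScienceNet, Li Xiangdong's blog)
For the past two days I have still been struggling with the accuracy of my classification model. When I classify texts picked at random from the web, the results are consistently unsatisfactory, nowhere near as good as the results I get from cross-validation.
So I decided to evaluate on an independent test set (1402 instances). The training set has 9804 instances and 9302 features; no feature selection was applied. Accuracy is about 78%, and "History" and "Art" are somewhat hard to tell apart. The results are as follows: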
-------------------------------------------------------------------------
weka.filters.unsupervised.attribute.StringToWordVector in:9804
Number of instances: 9804
Number of attributes: 9302
loading test data in:test_segmented......
weka.filters.unsupervised.attribute.StringToWordVector in:1402
weka.filters.unsupervised.attribute.ReplaceMissingValues in:9804
weka.filters.unsupervised.attribute.Normalize in:9804
evaluating.........
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.91      0.008     0.901       0.91     0.905       0.993      C11-Space
0.455     0.001     0.938       0.455    0.612       0.928      C15-Energy
0.464     0         1           0.464    0.634       0.974      C16-Electronics
0.556     0.001     0.938       0.556    0.698       0.989      C17-Communication
0.98      0.031     0.705       0.98     0.82        0.985      C19-Computer
0.588     0.003     0.833       0.588    0.69        0.96       C23-Mine
0.78      0.001     0.979       0.78     0.868       0.996      C29-Transport
0.81      0.035     0.638       0.81     0.714       0.974      C3-Art
0.95      0.006     0.922       0.95     0.936       0.994      C31-Enviornment
0.92      0.009     0.885       0.92     0.902       0.99       C32-Agriculture
0.96      0.034     0.686       0.96     0.8         0.979      C34-Economy
0.692     0.004     0.878       0.692    0.774       0.989      C35-Law
0.472     0         1           0.472    0.641       0.98       C36-Medical
0.526     0.002     0.952       0.526    0.678       0.992      C37-Military
0.91      0.048     0.591       0.91     0.717       0.965      C38-Politics
0.97      0.021     0.782       0.97     0.866       0.989      C39-Sports
0.235     0         1           0.235    0.381       0.852      C4-Literature
0.639     0.004     0.886       0.639    0.743       0.974      C5-Education
0.489     0.002     0.88        0.489    0.629       0.891      C6-Philosophy
0.75      0.026     0.688       0.75     0.718       0.963      C7-History
Correctly Classified Instances        1095               78.1027 %
Incorrectly Classified Instances       307               21.8973 %
Kappa statistic                          0.7661
Mean absolute error                      0.0904
Root mean squared error                  0.2092
Relative absolute error                 97.1367 %
Root relative squared error             94.8845 %
Total Number of Instances             1402
=== Confusion Matrix ===
  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t
 91  0  0  0  9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  a = C11-Space
  0 15  0  0  4  4  0  0  2  1  3  0  0  0  2  2  0  0  0  0 |  b = C15-Energy
  0  0 13  1  9  0  0  0  0  0  2  0  0  0  0  3  0  0  0  0 |  c = C16-Electronics
  1  0  0 15  7  0  0  0  0  0  1  0  0  1  1  1  0  0  0  0 |  d = C17-Communication
  2  0  0  0 98  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 |  e = C19-Computer
  0  0  0  0  7 20  0  0  2  0  2  0  0  0  2  1  0  0  0  0 |  f = C23-Mine
  0  1  0  0  1  0 46  0  0  0  5  2  0  0  3  1  0  0  0  0 |  g = C29-Transport
  0  0  0  0  0  0  0 81  0  0  1  0  0  0  0  0  0  0  0 18 |  h = C3-Art
  0  0  0  0  1  0  0  0 95  4  0  0  0  0  0  0  0  0  0  0 |  i = C31-Enviornment
  0  0  0  0  0  0  0  0  0 92  7  0  0  0  0  0  0  0  0  1 |  j = C32-Agriculture
  0  0  0  0  0  0  0  0  0  1 96  0  0  0  2  0  0  0  0  1 |  k = C34-Economy
  0  0  0  0  0  0  1  0  0  1  5 36  0  1  8  0  0  0  0  0 |  l = C35-Law
  0  0  0  0  0  0  0  2  0  4  8  1 25  0  7  4  0  2  0  0 |  m = C36-Medical
  4  0  0  0  0  0  0  0  1  0  1  1  0 40 24  3  0  1  0  1 |  n = C37-Military
  0  0  0  0  0  0  0  0  0  0  3  0  0  0 91  0  0  0  0  6 |  o = C38-Politics
  0  0  0  0  0  0  0  1  1  0  0  0  0  0  0 97  0  0  0  1 |  p = C39-Sports
  0  0  0  0  1  0  0 13  0  0  1  0  0  0  3  2  8  0  2  4 |  q = C4-Literature
  0  0  0  0  0  0  0  3  1  1  1  1  0  0  6  9  0 39  0  0 |  r = C5-Education
  3  0  0  0  2  0  0  8  1  0  1  0  0  0  4  0  0  2 22  2 |  s = C6-Philosophy
  0  0  0  0  0  0  0 19  0  0  3  0  0  0  1  1  0  0  1 75 |  t = C7-History
-------------------------------------------------------------------------
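To make the Art/History confusion concrete, the per-class figures can be recomputed by hand from the confusion matrix above. The sketch below is only an illustration: the class name `ConfusionMetrics` and its helper methods are hypothetical, and the row and column sums (100 and 127 for C3-Art, 100 and 109 for C7-History) were added up manually from the matrix rather than produced by Weka.

```java
import java.util.Locale;

// Hypothetical helper, not part of Weka: recomputes per-class metrics
// from counts read off the confusion matrix.
class ConfusionMetrics {

    /** Precision = correct predictions / all instances predicted as the class (column sum). */
    static double precision(int truePositives, int predictedAsClass) {
        return (double) truePositives / predictedAsClass;
    }

    /** Recall (TP rate) = correct predictions / all instances truly in the class (row sum). */
    static double recall(int truePositives, int actualInClass) {
        return (double) truePositives / actualInClass;
    }

    /** F-measure = harmonic mean of precision and recall. */
    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // C3-Art: 81 correct; row h sums to 100, column h sums to 127
        double artP = precision(81, 127);
        double artR = recall(81, 100);
        // C7-History: 75 correct; row t sums to 100, column t sums to 109
        double histP = precision(75, 109);
        double histR = recall(75, 100);

        System.out.printf(Locale.ROOT, "C3-Art     P=%.3f R=%.3f F=%.3f%n",
                artP, artR, fMeasure(artP, artR));
        System.out.printf(Locale.ROOT, "C7-History P=%.3f R=%.3f F=%.3f%n",
                histP, histR, fMeasure(histP, histR));
        // Overall accuracy: 1095 correct out of 1402 test instances
        System.out.printf(Locale.ROOT, "Accuracy   %.4f %%%n", 100.0 * 1095 / 1402);
    }
}
```

Read this way, the matrix shows that 18 of the 100 Art instances were labelled History and 19 of the 100 History instances were labelled Art, which is where most of the lost precision for both classes comes from.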
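Main source code:
import java.io.File;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;
import weka.core.stemmers.NullStemmer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// The class name is illustrative; the original post shows only the statements inside main
public class IndependentTestSetEvaluation {
    public static void main(String[] args) throws Exception {
        // Load the training texts (one subdirectory per class)
        String traindatadir = "train_segmented";
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File(traindatadir));
        Instances dataRaw = loader.getDataSet();
        // Build the word-vector filter from the training set, without stemming
        StringToWordVector filter = new StringToWordVector();
        filter.setStemmer(new NullStemmer());
        filter.setInputFormat(dataRaw);
        System.out.println("\n\nfiltering data in:" + traindatadir + "......\n\n");
        Instances dataFiltered = Filter.useFilter(dataRaw, filter);
        System.out.println("Number of instances: " + dataFiltered.numInstances());
        System.out.println("Number of attributes: " + dataFiltered.numAttributes());
        // Load the independent test texts
        String testdatadir = "test_segmented";
        System.out.println("\n\nloading test data in:" + testdatadir + "......\n\n");
        loader.setDirectory(new File(testdatadir));
        Instances testRaw = loader.getDataSet();
        // The filter has just processed the training set, so it filters testRaw using the training set's structure
        Instances testFiltered = Filter.useFilter(testRaw, filter);
        // Train an SMO (support vector machine) classifier on the filtered training data
        SMO classifier = new SMO();
        classifier.buildClassifier(dataFiltered);
        System.out.println("evaluating.........");
        Evaluation eval = new Evaluation(dataFiltered);
        eval.evaluateModel(classifier, testFiltered); // evaluate on the independent test set
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}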
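What I would like to know now is whether the filter that has just processed the training set can be saved, so that next time it can be used to filter and classify a single text.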
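One possible direction, offered as an assumption rather than a tested answer: Weka's `Filter` class implements `java.io.Serializable`, so a fitted `StringToWordVector` can in principle be written to disk with ordinary Java serialization (Weka also ships `weka.core.SerializationHelper` for this). The sketch below uses only the JDK so that it runs without Weka; the class `ObjectStore`, the temp file name, and the stand-in list are all hypothetical.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical helper: saves and reloads any Serializable object.
class ObjectStore {

    /** Writes the object to the given file using Java serialization. */
    static void save(Serializable obj, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(obj);
        }
    }

    /** Reads back whatever object was stored in the file; the caller casts it. */
    static Object load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in object; with Weka on the classpath the fitted filter
        // would be passed here instead (assumption, not tested in this sketch).
        ArrayList<String> standIn = new ArrayList<>(Arrays.asList("C3-Art", "C7-History"));
        Path file = Files.createTempFile("filter", ".ser");
        save(standIn, file);
        System.out.println(load(file));
        Files.delete(file);
    }
}
```

Under the same assumption, the trained `classifier` would need to be saved alongside the filter, since a reloaded filter only reproduces the training-set word vector; the model is still required to predict a class for the new text.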
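To reprint this article, please contact the original author for permission, and state that it comes from Li Xiangdong's ScienceNet blog.
Link: http://blog.sciencenet.cn/blog-713110-574111.html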