Offline Lightweight Big Data Platform Spark: A TF-IDF Example with the MLlib Machine Learning Library
TF-IDF (term frequency–inverse document frequency) is a statistical method for evaluating how important a term is to a single document within a collection or corpus. A term's importance increases proportionally with the number of times it appears in the document, but decreases in proportion to how frequently it appears across the corpus. The main idea: if a word or phrase has a high frequency (TF) in one article but rarely appears in other articles, it is considered to have good discriminative power between categories and is well suited for classification.
In a given document, term frequency (TF) is the number of times a given term appears in that document. This count is usually normalized (the numerator is generally smaller than the denominator, in contrast to IDF) to prevent a bias toward longer documents, since the same term is likely to have a higher raw count in a long document than in a short one, regardless of its importance.
tf(i, j) = n(i, j) / Σ_k n(k, j)

where n(i, j) is the number of occurrences of term i in document j, and the denominator is the total number of occurrences of all terms in document j.
Inverse document frequency (IDF) measures how much general importance a term carries. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of the quotient.
idf(i) = log( |D| / (1 + |{ j : term i appears in document j }|) )

where |D| is the total number of documents and the denominator counts the documents containing term i; 1 is usually added to the denominator to avoid division by zero.
A high term frequency within a particular document, combined with a low document frequency across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms while keeping the important ones.
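To make the two formulas concrete, here is a minimal plain-Java sketch that computes TF, the smoothed IDF, and their product over a toy corpus (the class and method names are made up for this illustration; this is not Spark's implementation):

```java
import java.util.Arrays;
import java.util.List;

public class TfIdfToy {
    // tf(i, j): occurrences of the term in the document over the document length
    static double tf(String term, List<String> doc) {
        double n = 0;
        for (String w : doc) if (w.equals(term)) n++;
        return n / doc.size();
    }

    // idf(i): log(|D| / (1 + number of documents containing the term)),
    // with the +1 smoothing described above to avoid a zero denominator
    static double idf(String term, List<List<String>> corpus) {
        int containing = 0;
        for (List<String> doc : corpus) if (doc.contains(term)) containing++;
        return Math.log((double) corpus.size() / (1 + containing));
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("i", "heard", "about", "spark"),
            Arrays.asList("i", "wish", "java", "could", "use", "spark"),
            Arrays.asList("logistic", "regression", "is", "neat"));
        // "spark" appears in 2 of the 3 documents, so its smoothed IDF is log(3/3) = 0
        System.out.println(tfIdf("spark", corpus.get(0), corpus));
        // "heard" appears in only 1 document, so it keeps a higher weight
        System.out.println(tfIdf("heard", corpus.get(0), corpus));
    }
}
```

As the comments note, the common term "spark" is down-weighted while the rarer "heard" keeps a positive weight, which is exactly the filtering behavior described above.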
The algorithm APIs in spark.mllib are based on RDDs; those in spark.ml are based on DataFrames.
Example code for TF-IDF feature extraction with Spark's ML library:

package sk.mlib;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.IDF;
import org.apache.spark.ml.feature.IDFModel;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class TfIdfDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("TfIdfDemo").getOrCreate();

        List<Row> data = Arrays.asList(
            RowFactory.create(0.0, "I heard about Spark and i like spark"),
            RowFactory.create(1.0, "I wish Java could use case classes for spark"),
            RowFactory.create(2.0, "Logistic regression models of spark are neat and easy to use"));
        StructType schema = new StructType(new StructField[]{
            new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
        });
        Dataset<Row> sentenceData = spark.createDataFrame(data, schema);

        // Split each sentence into words
        Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
        Dataset<Row> wordsData = tokenizer.transform(sentenceData);
        // Optionally print the tokenized result:
        //String[] cols = wordsData.columns();
        //for (String col : cols) {
        //    System.out.println(col);
        //}
        //int rows = (int) wordsData.count();
        //for (Row r : wordsData.select("label", "sentence", "words").takeAsList(rows)) {
        //    System.out.println(r.getDouble(0));
        //    System.out.println(r.getString(1));
        //    System.out.println(r.getAs(2));
        //}

        // Map each word to a hash bucket and count its occurrences; the word's
        // integer hash serves as its feature index.
        int numFeatures = 20; // number of hash buckets (the spark.ml default is 2^18)
        HashingTF hashingTF = new HashingTF()
            .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(numFeatures);
        Dataset<Row> featurizedData = hashingTF.transform(wordsData);

        // Print the raw term-frequency vectors
        int rows = (int) featurizedData.count();
        for (Row r : featurizedData.select("label", "sentence", "words", "rawFeatures").takeAsList(rows)) {
            System.out.println(r.getDouble(0));
            System.out.println(r.getString(1));
            System.out.println(r.getAs(2));
            System.out.println(r.getAs(3));
        }

        // Compute TF-IDF: the log's numerator involves the 3 documents and the
        // denominator the number of documents containing the term.
        // Alternatively, CountVectorizer can also be used to get term frequency vectors.
        IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
        IDFModel idfModel = idf.fit(featurizedData);
        Dataset<Row> rescaledData = idfModel.transform(featurizedData);

        // Print the rescaled TF-IDF vectors
        for (Row r : rescaledData.select("features", "label").takeAsList(3)) {
            Vector features = r.getAs(0);
            Double label = r.getDouble(1);
            System.out.println(features);
            System.out.println(label);
        }

        spark.stop();
    }
}

/* Execution result:
0.0
I heard about Spark and i like spark
WrappedArray(i, heard, about, spark, and, i, like, spark)
(20,[5,9,10,13,17],[2.0,2.0,1.0,1.0,2.0])
1.0
I wish Java could use case classes for spark
WrappedArray(i, wish, java, could, use, case, classes, for, spark)
(20,[2,5,7,9,13,15,16],[1.0,1.0,1.0,3.0,1.0,1.0,1.0])
2.0
Logistic regression models of spark are neat and easy to use
WrappedArray(logistic, regression, models, of, spark, are, neat, and, easy, to, use)
(20,[0,3,4,5,6,8,9,13,15,18],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0])
(20,[5,9,10,13,17],[0.0,0.0,0.6931471805599453,0.0,1.3862943611198906])
0.0
(20,[2,5,7,9,13,15,16],[0.6931471805599453,0.0,0.6931471805599453,0.0,0.0,0.28768207245178085,0.6931471805599453])
1.0
(20,[0,3,4,5,6,8,9,13,15,18],[0.6931471805599453,0.6931471805599453,0.6931471805599453,0.0,0.6931471805599453,0.6931471805599453,0.0,0.0,0.28768207245178085,0.6931471805599453])
2.0
*/
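The bucket-mapping step that HashingTF performs can be sketched in plain Java as well. This is a simplified illustration of the hashing trick only: Spark's HashingTF uses MurmurHash3 rather than String.hashCode, so the actual bucket indices differ, and the class name here is made up:

```java
import java.util.Arrays;
import java.util.List;

public class HashingTfSketch {
    // Map each word to one of numFeatures buckets and count occurrences.
    static int[] hashingTf(List<String> words, int numFeatures) {
        int[] counts = new int[numFeatures];
        for (String w : words) {
            int bucket = Math.floorMod(w.hashCode(), numFeatures); // hash -> bucket index
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("i", "heard", "about", "spark", "and", "i", "like", "spark");
        // With only 20 buckets, distinct words can collide in the same bucket,
        // which is why real jobs usually keep the bucket count much larger.
        System.out.println(Arrays.toString(hashingTf(words, 20)));
    }
}
```

This also makes clear why the raw-feature vectors in the output above are sparse: most of the 20 buckets stay at zero.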