Hadoop_23: Implementing an Inverted Index with MapReduce
1.1. Inverted Index
An inverted index looks up records by attribute value. Each entry in the index consists of an attribute value together with the addresses of all records that have that value. Because the record's position is determined from the attribute value, rather than the attribute value from the record, this structure is called an inverted index.
For example: a word-document matrix, where the attribute value (the word) comes first and serves as the index.
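As an in-memory illustration of this idea (separate from the MapReduce code below, with made-up names), an inverted index can be modeled in Java as a map from each term to the documents containing it, along with the term's count in each document:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// A minimal in-memory inverted index: term -> (document -> occurrence count).
// Illustrative sketch only; class and method names are hypothetical.
public class InvertedIndexSketch {
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    // Record one occurrence of a term in a document
    public void add(String term, String doc) {
        index.computeIfAbsent(term, t -> new HashMap<>())
             .merge(doc, 1, Integer::sum);
    }

    // Lookup goes from the attribute value (the word) to the records (the documents)
    public Map<String, Integer> lookup(String term) {
        return index.getOrDefault(term, Collections.emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add("hello", "a.txt");
        idx.add("hello", "a.txt");
        idx.add("hello", "b.txt");
        idx.add("tom", "a.txt");
        System.out.println(idx.lookup("hello"));
    }
}
```

The MapReduce jobs below compute exactly this mapping, but distributed over many files.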
1.2. Implementing an Inverted Index with MapReduce
Requirement: build a search index over a large collection of text (documents, web pages).
Implementation:
package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Step one of building an inverted index with MapReduce.
 * Sample input files (one document per file):
 *   a.txt:         b.txt:         c.txt:
 *   hello tom      hello jerry    hello jerry
 *   hello jerry    hello jerry    hello tom
 *   hello tom      tom jerry
 */
public class InverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split each line of text on the space character
            String[] words = line.split(" ");
            // Get the file name from this task's input split
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            // Emit a composite key "word--fileName" with a count of 1
            for (String word : words) {
                k.set(word + "--" + fileName);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each word--fileName pair
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverIndexStepOne.class);
        job.setMapperClass(InverIndexStepOneMapper.class);
        job.setReducerClass(InverIndexStepOneReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are taken from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Output file of the run: E:\inverseOut\part-r-00000
hello--a.txt	3
hello--b.txt	2
hello--c.txt	2
jerry--a.txt	1
jerry--b.txt	3
jerry--c.txt	1
tom--a.txt	2
tom--b.txt	1
tom--c.txt	1

A second job then merges these records into the word-document matrix format described above. Implementation:
package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Merges the output of step one so that all of a word's document
 * records are combined into one complete record per word.
 */
public class IndexStepTwo {

    static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like: word--fileName<TAB>count
            String line = value.toString();
            String[] files = line.split("--");
            context.write(new Text(files[0]), new Text(files[1]));
        }
    }

    static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate all fileName-->count entries for this word
            StringBuilder sb = new StringBuilder();
            for (Text text : values) {
                sb.append(text.toString().replace("\t", "-->")).append("\t");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Check for null before dereferencing args (the original checked length first)
        if (args == null || args.length < 2) {
            args = new String[]{"E:/inverseOut/part-r-00000", "D:/inverseOut2"};
        }
        Configuration config = new Configuration();
        Job job = Job.getInstance(config);
        job.setJarByClass(IndexStepTwo.class);
        job.setMapperClass(IndexStepTwoMapper.class);
        job.setReducerClass(IndexStepTwoReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 on success (the original had the exit codes inverted)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run result:
hello	c.txt-->2	b.txt-->2	a.txt-->3
jerry	c.txt-->1	b.txt-->3	a.txt-->1
tom	c.txt-->1	b.txt-->1	a.txt-->2

Summary:
Building an index over a large collection of documents comes down to two steps: tokenizing the text, and counting how many times each token appears in each document. Those per-document counts make up the index file; a search then consults the index file directly and returns each matching document's summary and related information.
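The two steps above can be sketched in plain Java, outside MapReduce, using the article's sample files (class and method names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the two-step process described above:
// tokenize each document, count occurrences per document, and group the
// counts by word. Document contents mirror the sample files a/b/c.txt.
public class TwoStepIndexDemo {

    // Split each document on whitespace and count each word per document,
    // producing the same information as the "word--file  count" records.
    public static Map<String, Map<String, Integer>> buildIndex(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                // Grouping by word (step two) happens via the outer map key
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello tom hello jerry hello tom");
        docs.put("b.txt", "hello jerry hello jerry tom jerry");
        docs.put("c.txt", "hello jerry hello tom");
        // Each word maps to the documents containing it, with counts,
        // matching the article's final merged output
        System.out.println(buildIndex(docs));
    }
}
```

The MapReduce version distributes the same logic: job one computes the per-document counts, job two performs the grouping.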
Reposted from: https://www.cnblogs.com/yaboya/p/9252313.html
總結(jié)
以上是生活随笔為你收集整理的Hadoop_23_MapReduce倒排索引实现的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: V8 —— 你需要知道的垃圾回收机制
- 下一篇: 多云战略未来五大趋势分析,必看!