Hadoop_23: Implementing an Inverted Index with MapReduce
1.1. Inverted Index
An inverted index looks up records by attribute value. Each entry in the index consists of an attribute value together with the addresses of all records that have that value. Because the record's position is determined from the attribute value, rather than the attribute value from the record, this structure is called an inverted index.
For example: a word-document matrix, where the attribute value (the word) comes first and serves as the index.
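As an in-memory illustration of this idea (separate from the MapReduce code below, with made-up names), an inverted index can be modeled in Java as a map from each term to the documents containing it, along with the term's count in each document:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// A minimal in-memory inverted index: term -> (document -> occurrence count).
// Illustrative sketch only; class and method names are hypothetical.
public class InvertedIndexSketch {
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    // Record one occurrence of a term in a document
    public void add(String term, String doc) {
        index.computeIfAbsent(term, t -> new HashMap<>())
             .merge(doc, 1, Integer::sum);
    }

    // Lookup goes from the attribute value (the word) to the records (the documents)
    public Map<String, Integer> lookup(String term) {
        return index.getOrDefault(term, Collections.emptyMap());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.add("hello", "a.txt");
        idx.add("hello", "a.txt");
        idx.add("hello", "b.txt");
        idx.add("tom", "a.txt");
        System.out.println(idx.lookup("hello"));
    }
}
```

The MapReduce jobs below compute exactly this mapping, but distributed over many files.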
1.2. Implementing an Inverted Index with MapReduce
Requirement: build a search index over a large collection of text (documents, web pages).
Implementation:
package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Step one of building an inverted index with MapReduce.
 * Sample input files (one document per file):
 *   a.txt:         b.txt:         c.txt:
 *   hello tom      hello jerry    hello jerry
 *   hello jerry    hello jerry    hello tom
 *   hello tom      tom jerry
 */
public class InverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        Text k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split each line of text on the space character
            String[] words = line.split(" ");
            // Get the file name from this task's input split
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String fileName = inputSplit.getPath().getName();
            // Emit a composite key "word--fileName" with a count of 1
            for (String word : words) {
                k.set(word + "--" + fileName);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each word--fileName pair
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InverIndexStepOne.class);
        job.setMapperClass(InverIndexStepOneMapper.class);
        job.setReducerClass(InverIndexStepOneReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output paths are taken from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Output file of the run: E:\inverseOut\part-r-00000
hello--a.txt	3
hello--b.txt	2
hello--c.txt	2
jerry--a.txt	1
jerry--b.txt	3
jerry--c.txt	1
tom--a.txt	2
tom--b.txt	1
tom--c.txt	1

A second job then merges these records into the word-document matrix format described above. Implementation:
package cn.bigdata.hdfs.mr;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Merges the output of step one so that all of a word's document
 * records are combined into one complete record per word.
 */
public class IndexStepTwo {

    static class IndexStepTwoMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like: word--fileName<TAB>count
            String line = value.toString();
            String[] files = line.split("--");
            context.write(new Text(files[0]), new Text(files[1]));
        }
    }

    static class IndexStepTwoReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate all fileName-->count entries for this word
            StringBuilder sb = new StringBuilder();
            for (Text text : values) {
                sb.append(text.toString().replace("\t", "-->")).append("\t");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        // Check for null before dereferencing args (the original checked length first)
        if (args == null || args.length < 2) {
            args = new String[]{"E:/inverseOut/part-r-00000", "D:/inverseOut2"};
        }
        Configuration config = new Configuration();
        Job job = Job.getInstance(config);
        job.setJarByClass(IndexStepTwo.class);
        job.setMapperClass(IndexStepTwoMapper.class);
        job.setReducerClass(IndexStepTwoReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 on success (the original had the exit codes inverted)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run result:
hello	c.txt-->2	b.txt-->2	a.txt-->3
jerry	c.txt-->1	b.txt-->3	a.txt-->1
tom	c.txt-->1	b.txt-->1	a.txt-->2

Summary:
Building an index over a large collection of documents comes down to two steps: tokenizing the text, and counting how many times each token appears in each document. Those per-document counts make up the index file; a search then consults the index file directly and returns each matching document's summary and related information.
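The two steps above can be sketched in plain Java, outside MapReduce, using the article's sample files (class and method names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the two-step process described above:
// tokenize each document, count occurrences per document, and group the
// counts by word. Document contents mirror the sample files a/b/c.txt.
public class TwoStepIndexDemo {

    // Split each document on whitespace and count each word per document,
    // producing the same information as the "word--file  count" records.
    public static Map<String, Map<String, Integer>> buildIndex(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                // Grouping by word (step two) happens via the outer map key
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello tom hello jerry hello tom");
        docs.put("b.txt", "hello jerry hello jerry tom jerry");
        docs.put("c.txt", "hello jerry hello tom");
        // Each word maps to the documents containing it, with counts,
        // matching the article's final merged output
        System.out.println(buildIndex(docs));
    }
}
```

The MapReduce version distributes the same logic: job one computes the per-document counts, job two performs the grouping.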
Reposted from: https://www.cnblogs.com/yaboya/p/9252313.html
總結(jié)
以上是生活随笔為你收集整理的Hadoop_23_MapReduce倒排索引实现的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: V8 —— 你需要知道的垃圾回收机制
- 下一篇: 多云战略未来五大趋势分析,必看!