Hadoop Basics (11): InputFormat Subclasses in MapReduce

Published: 2023/12/3


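1. TextInputFormat

TextInputFormat extends FileInputFormat<LongWritable, Text> and is the default input format for reading files. Its LineRecordReader reads the content one line at a time.

In LineRecordReader, the key returned by nextKeyValue() is derived from values set in initialize():

  start = split.getStart(), and the key begins at this byte offset. If the split corresponds to a file hello.txt whose content is the two lines "hello you" and "hello me", the start position is 0.

  end = start + split.getLength()

  The text of each line is read by readLine(...), which returns the number of bytes consumed. This corresponds to LineReader.readLine(...) (line 170 of the source).

  The current pair is then retrieved with getCurrentKey() and getCurrentValue().

The Mapper class then loops over the records (pseudocode):

  while (lineRecordReader.nextKeyValue()) {
      key = lineRecordReader.getCurrentKey();
      value = lineRecordReader.getCurrentValue();
      map(key, value, context); // keeps writing key/value pairs out
  }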
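The offset-as-key behaviour described above can be tried without a cluster. The following is a minimal sketch in plain Java, not Hadoop's LineRecordReader: the class LineOffsetSketch and its Record type are hypothetical names introduced here, and the sketch assumes the whole file fits in one split and that lines end with '\n'.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for LineRecordReader: pairs each line with its starting byte offset. */
final class LineOffsetSketch {

    /** One <key, value> record: the byte offset of the line and its text. */
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits UTF-8 bytes on '\n' and records where each line starts. */
    static List<Record> read(byte[] data) {
        List<Record> records = new ArrayList<>();
        int lineStart = 0;
        for (int pos = 0; pos <= data.length; pos++) {
            boolean atEnd = pos == data.length;
            if (atEnd || data[pos] == '\n') {
                // Skip the empty tail that follows a final newline.
                if (!atEnd || pos > lineStart) {
                    String line = new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8);
                    records.add(new Record(lineStart, line));
                }
                lineStart = pos + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "hello you\nhello me\n".getBytes(StandardCharsets.UTF_8);
        for (Record r : read(file)) {
            System.out.println("<" + r.offset + ", " + r.line + ">");
        }
    }
}
```

For the hello.txt content used in this section, the two records carry the keys 0 and 10, because "hello you" plus its newline occupies 10 bytes.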

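2. DBInputFormat

When reading data, DBInputFormat produces key/value pairs of type <LongWritable, an instance of a DBWritable implementation>.

The LongWritable key is still an offset, here the position of the row within the query result.

See org.apache.hadoop.mapreduce.lib.db.DBRecordReader.nextKeyValue() (line 232), where the statement

  key.set(pos + split.getStart());

confirms that the key is still an offset.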

package inputformat;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.net.URI;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Goal: write the id and name columns of the myuser table in the MySQL
 * database test to HDFS through MapReduce (map only, no reduce step).
 * mysql ---> map ---> hdfs
 *
 * To run this example:
 * 1. Put the MySQL JDBC driver in hadoop/mapreduce/lib on every TaskTracker node.
 * 2. Restart the cluster.
 */
public class MyDBInputFormatApp {

    private static final String OUT_PATH = "hdfs://hadoop0:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Configure the database connection before creating the Job.
        // The Job copies conf, so configuring it afterwards fails at run time.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://hadoop0:3306/test", "root", "admin");

        // Delete the output directory if it already exists.
        final FileSystem filesystem = FileSystem.get(new URI(OUT_PATH), conf);
        if (filesystem.exists(new Path(OUT_PATH))) {
            filesystem.delete(new Path(OUT_PATH), true);
        }

        // Create the job (this constructor is deprecated in Hadoop 2.x).
        final Job job = new Job(conf, MyDBInputFormatApp.class.getSimpleName());
        job.setJarByClass(MyDBInputFormatApp.class);

        // Choose the InputFormat implementation.
        job.setInputFormatClass(DBInputFormat.class);
        // Arguments: job, JavaBean class, table name, WHERE conditions,
        // ORDER BY clause, columns to select.
        DBInputFormat.setInput(job, MyUser.class, "myuser", null, null, "id", "name");

        // Set the mapper class and the key/value types it emits.
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // No reducer: the map output is written straight to HDFS.
        job.setNumReduceTasks(0);

        job.setOutputKeyClass(Text.class);           // job output key type
        job.setOutputValueClass(NullWritable.class); // job output value type
        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));

        job.waitForCompletion(true);
    }

    // <k1,v1> is the row position in the table and the JavaBean for that row;
    // <k2,v2> is the result emitted by map.
    public static class MyMapper extends Mapper<LongWritable, MyUser, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, MyUser value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.toString()), NullWritable.get());
        }
    }

    /**
     * Writable is used to transfer data between Hadoop nodes, so the class
     * must be able to serialize itself.
     * DBWritable is used to transfer data to and from the database.
     *
     * @author zm
     */
    public static class MyUser implements Writable, DBWritable {
        int id;
        String name;

        // Methods required by Writable
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            Text.writeString(out, name);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            this.id = in.readInt();
            this.name = Text.readString(in);
        }

        // Methods required by DBWritable
        @Override
        public void write(PreparedStatement statement) throws SQLException {
            statement.setInt(1, id);
            statement.setString(2, name);
        }

        @Override
        public void readFields(ResultSet resultSet) throws SQLException {
            this.id = resultSet.getInt(1);
            this.name = resultSet.getString(2);
        }

        @Override
        public String toString() {
            return id + "\t" + name;
        }
    }
}
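To make the key computation key.set(pos + split.getStart()) concrete, here is a small model in plain Java. It is a sketch under assumptions, not the Hadoop implementation: DbSplitSketch and RowRange are hypothetical names, and the even division of rows with the remainder going to the last range is assumed for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of how row ranges and keys relate in DBInputFormat. */
final class DbSplitSketch {

    /** A half-open row range [start, end), playing the role of a database input split. */
    static final class RowRange {
        final long start;
        final long end;

        RowRange(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Divides rowCount rows into `chunks` ranges; the last range absorbs the remainder. */
    static List<RowRange> split(long rowCount, int chunks) {
        List<RowRange> ranges = new ArrayList<>();
        long chunkSize = rowCount / chunks;
        for (int i = 0; i < chunks; i++) {
            long start = i * chunkSize;
            long end = (i + 1 == chunks) ? rowCount : start + chunkSize;
            ranges.add(new RowRange(start, end));
        }
        return ranges;
    }

    /** Mirrors key.set(pos + split.getStart()): pos counts the rows already read in this range. */
    static List<Long> keys(RowRange range) {
        List<Long> keys = new ArrayList<>();
        for (long pos = 0; pos < range.end - range.start; pos++) {
            keys.add(pos + range.start);
        }
        return keys;
    }

    public static void main(String[] args) {
        for (RowRange r : split(10, 3)) {
            System.out.println("rows [" + r.start + ", " + r.end + ") -> keys " + keys(r));
        }
    }
}
```

Because every range adds its own start to the local position, the keys stay unique across map tasks, which is why the key can be read as a row offset within the whole result.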

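3. NLineInputFormat

With this format, the number of splits is determined not by the number of blocks the file occupies but by the configured number of lines per split.

For example, for a 100-line file with NLineInputFormat set to 2 lines per split, 50 map tasks are created. Each map task still processes its input line by line, so it calls the map function twice.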
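The arithmetic in this example can be checked with a few lines of plain Java. This is only a sketch of the counting rule; NLineSplitSketch and its methods are hypothetical names. In a real job the line count is normally set with NLineInputFormat.setNumLinesPerSplit(job, 2).

```java
/** Hypothetical helper that reproduces the split arithmetic of NLineInputFormat. */
final class NLineSplitSketch {

    /** Number of splits (and therefore map tasks) for a file of totalLines lines. */
    static long splitCount(long totalLines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        // Ceiling division: a trailing partial group still needs its own split.
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    /** Number of map() calls made by the task that handles split number splitIndex (0-based). */
    static long mapCallsInSplit(long totalLines, int linesPerSplit, long splitIndex) {
        long firstLine = splitIndex * linesPerSplit;
        return Math.max(0, Math.min(linesPerSplit, totalLines - firstLine));
    }

    public static void main(String[] args) {
        System.out.println("splits: " + splitCount(100, 2));
        System.out.println("map() calls in the first split: " + mapCallsInSplit(100, 2, 0));
    }
}
```

With an odd line count such as 101, the same rule gives 51 splits, and the last map task calls map() only once.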

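4. KeyValueTextInputFormat

If a line contains the separator, the text before the first separator becomes the key and the text after it becomes the value.

If a line contains no separator, the whole line becomes the key and the value is empty.

The default separator is the tab character (\t).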
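The three rules above amount to a split on the first separator. The sketch below shows that rule in plain Java. KeyValueLineSketch is a hypothetical name and this is not the Hadoop record reader itself. In a real job the separator is usually changed through the configuration property mapreduce.input.keyvaluelinerecordreader.key.value.separator.

```java
/** Hypothetical helper that applies the key/value rule of KeyValueTextInputFormat to one line. */
final class KeyValueLineSketch {

    /** Returns {key, value}; only the first occurrence of the separator splits the line. */
    static String[] split(String line, char separator) {
        int idx = line.indexOf(separator);
        if (idx < 0) {
            // No separator: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, idx), line.substring(idx + 1) };
    }

    public static void main(String[] args) {
        String[] withTab = split("hello\tyou", '\t');
        String[] withoutTab = split("hello", '\t');
        System.out.println("<" + withTab[0] + ", " + withTab[1] + ">");
        System.out.println("<" + withoutTab[0] + ", " + withoutTab[1] + ">");
    }
}
```

Any further separators stay inside the value, so a line such as a, tab, b, tab, c yields the key a and the value b, tab, c.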

GenericWritable

Useful when a job reads from several input sources and the mappers for those sources emit different value types.

package inputformat;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * MyMapper and MyMapper2 emit different v2 types: one a LongWritable, the
 * other a String (wrapped in Text). Both must be unified into a single
 * output type so the job can declare it:
 * job.setMapOutputValueClass(MyGenericWritable.class)
 *
 * Content of file hello (key and value separated by a tab):
 * hello you
 * hello me
 *
 * Content of file hello2:
 * hello,you
 * hello,me
 *
 * @author zm
 *
 * Result:
 * [root@master hadoop]# hadoop fs -text /out/part-r-00000
 * Warning: $HADOOP_HOME is deprecated.
 *
 * hello 4
 * me 2
 * you 2
 */
public class MyGenericWritableApp {

    private static final String OUT_PATH = "hdfs://master:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Delete the output directory if it already exists.
        final FileSystem filesystem = FileSystem.get(new URI(OUT_PATH), conf);
        if (filesystem.exists(new Path(OUT_PATH))) {
            filesystem.delete(new Path(OUT_PATH), true);
        }

        final Job job = new Job(conf, MyGenericWritableApp.class.getSimpleName());
        job.setJarByClass(MyGenericWritableApp.class);

        // For each input path, set the InputFormat and the mapper that handles it.
        MultipleInputs.addInputPath(job, new Path("hdfs://master:9000/hello"),
                KeyValueTextInputFormat.class, MyMapper.class);
        MultipleInputs.addInputPath(job, new Path("hdfs://master:9000/hello2"),
                TextInputFormat.class, MyMapper2.class);

        // Map settings. Do not call job.setMapperClass(...) here:
        // the mapper classes were already set per input path above.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(MyGenericWritable.class);

        // Reduce settings
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Output path
        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));

        job.waitForCompletion(true);
    }

    public static class MyMapper extends Mapper<Text, Text, Text, MyGenericWritable> {
        // Parsing the source file yields 2 key/value pairs, <hello,you> and
        // <hello,me>, so map is called twice.
        // After grouping: <hello,(MyGenericWritable(1), MyGenericWritable(1))>
        //                 <you,(MyGenericWritable(1))> <me,(MyGenericWritable(1))>
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new MyGenericWritable(new LongWritable(1)));
            context.write(value, new MyGenericWritable(new LongWritable(1)));
        }
    }

    public static class MyMapper2 extends Mapper<LongWritable, Text, Text, MyGenericWritable> {
        // Parsing the source file yields 2 key/value pairs, <0,(hello,you)> and
        // <10,(hello,me)>, so map is called twice. The parentheses are added
        // here only to tell the comma in the value apart from the pair's comma.
        // After grouping: <hello,(MyGenericWritable("1"), MyGenericWritable("1"))>
        //                 <you,(MyGenericWritable("1"))> <me,(MyGenericWritable("1"))>
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Why convert the Hadoop type to a Java type?
            final String line = value.toString();
            final String[] splited = line.split(",");
            for (String word : splited) {
                System.out.println("MyMapper2 word is: " + word);
                // Inside the loop, each occurrence of word counts as the constant 1.
                final Text text = new Text("1");
                context.write(new Text(word), new MyGenericWritable(text));
            }
        }
    }

    // Distributing the <k,v> pairs produced by map to reduce is called shuffle.
    public static class MyReducer extends Reducer<Text, MyGenericWritable, Text, LongWritable> {
        // reduce is called once per group, 3 times in total here.
        // How does the number of groups relate to the number of reduce calls?
        // How does the number of reduce calls relate to the number of <k,v> pairs written?
        @Override
        protected void reduce(Text key, Iterable<MyGenericWritable> values, Context context)
                throws IOException, InterruptedException {
            // count is the number of times the word key occurs across the input files.
            long count = 0L;
            for (MyGenericWritable times : values) {
                final Writable writable = times.get();
                if (writable instanceof LongWritable) {
                    count += ((LongWritable) writable).get();
                } else if (writable instanceof Text) {
                    count += Long.parseLong(((Text) writable).toString());
                }
            }
            context.write(key, new LongWritable(count));
        }
    }

    /**
     * @author zm
     */
    public static class MyGenericWritable extends GenericWritable {
        public MyGenericWritable() {
        }

        public MyGenericWritable(Text text) {
            super.set(text);
        }

        public MyGenericWritable(LongWritable longWritable) {
            super.set(longWritable);
        }

        // The array lists the types this wrapper can hold.
        @Override
        @SuppressWarnings("unchecked")
        protected Class<? extends Writable>[] getTypes() {
            return new Class[] { LongWritable.class, Text.class };
        }
    }
}
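The idea behind the wrapper is that each serialized value carries a small type index, so the reader knows which concrete type to rebuild. The sketch below illustrates that idea in plain Java and is not Hadoop's GenericWritable: TaggedValueSketch, TYPE_LONG and TYPE_TEXT are hypothetical names, and Long and String stand in for LongWritable and Text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration of a type-tagged value, the idea behind GenericWritable. */
final class TaggedValueSketch {
    static final byte TYPE_LONG = 0; // plays the role of index 0 in getTypes()
    static final byte TYPE_TEXT = 1; // plays the role of index 1 in getTypes()

    /** Writes a one-byte type index followed by the value itself. */
    static byte[] write(Object value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            if (value instanceof Long) {
                out.writeByte(TYPE_LONG);
                out.writeLong((Long) value);
            } else if (value instanceof String) {
                out.writeByte(TYPE_TEXT);
                out.writeUTF((String) value);
            } else {
                throw new IllegalArgumentException("unsupported value: " + value);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the type index first, then decodes the payload accordingly. */
    static Object read(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            byte type = in.readByte();
            switch (type) {
                case TYPE_LONG: return in.readLong();
                case TYPE_TEXT: return in.readUTF();
                default: throw new IOException("unknown type index: " + type);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sums mixed values the way MyReducer does: longs directly, strings after parsing. */
    static long sum(byte[][] values) {
        long count = 0;
        for (byte[] v : values) {
            Object o = read(v);
            count += (o instanceof Long) ? (Long) o : Long.parseLong((String) o);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[][] helloValues = { write(1L), write(1L), write("1"), write("1") };
        System.out.println("hello " + sum(helloValues));
    }
}
```

The four values in main correspond to the two counts MyMapper emits for hello and the two counts MyMapper2 emits for it, which is why the job result lists hello with a total of 4.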

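5. CombineTextInputFormat

Merges multiple small files under the input directory into one split for MapReduce to process, so that only one map task is created. This suits cases where the files supplied by the user are all very small, for example around 10 KB each. Internally the records are still read line by line, as with TextInputFormat. When the combined size exceeds the split size limit (the 64 MB or 128 MB block size), the input is divided into the corresponding number of splits.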
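The grouping behaviour can be pictured as packing file sizes into bins. The sketch below is a deliberately simplified greedy packing in plain Java, not the algorithm Hadoop uses (which also considers node and rack locality). CombineSplitSketch is a hypothetical name. In a real job the limit is typically set with FileInputFormat.setMaxInputSplitSize(job, bytes), which CombineTextInputFormat inherits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified model of grouping small files into combined splits. */
final class CombineSplitSketch {

    /** Packs file sizes in order; a new split starts when adding a file would exceed the limit. */
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = new long[1000];
        Arrays.fill(sizes, 10L * 1024); // 1000 files of 10 KB each
        System.out.println("splits at 128 MB: " + pack(sizes, 128L * 1024 * 1024).size());
        System.out.println("splits at 4 MB: " + pack(sizes, 4L * 1024 * 1024).size());
    }
}
```

A thousand 10 KB files total less than 10 MB, so they fit into one split under a 128 MB limit; lowering the limit to 4 MB forces the same files into several splits.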

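SequenceFile

SequenceFile also merges small files. The difference from CombineTextInputFormat is that the merge is physical: the files are written as key/value records into a single container file on HDFS, whereas CombineTextInputFormat leaves the files as they are and only groups them into splits when the job runs.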

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.Collection;

import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;

public class SequenceFileMore {

    public static void main(String[] args) throws IOException, URISyntaxException {
        final Configuration conf = new Configuration();
        final FileSystem fs = FileSystem.get(new URI("hdfs://h2single:9000/"), conf);
        Path path = new Path("/sf_logs");

        // Write: store every .log file under /usr/local/logs in /sf_logs,
        // with the file name as key and the file content (bytes) as value.
        final Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, BytesWritable.class);
        // false: do not recurse into subdirectories
        Collection<File> listFiles = FileUtils.listFiles(
                new File("/usr/local/logs"), new String[] { "log" }, false);
        for (File file : listFiles) {
            String fileName = file.getName();
            Text key = new Text(fileName);
            byte[] bytes = FileUtils.readFileToByteArray(file);
            BytesWritable value = new BytesWritable(bytes);
            writer.append(key, value);
        }
        IOUtils.closeStream(writer);

        // Read: restore each record as a file under /usr/local/logs_bak.
        final SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        final Text key = new Text();
        final BytesWritable val = new BytesWritable();
        while (reader.next(key, val)) {
            String fileName = "/usr/local/logs_bak/" + key.toString();
            File file = new File(fileName);
            // getBytes() returns the backing buffer, which may be longer than
            // the record; copy only the first getLength() valid bytes.
            FileUtils.writeByteArrayToFile(file, Arrays.copyOf(val.getBytes(), val.getLength()));
        }
        IOUtils.closeStream(reader);
    }
}

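MultipleInputs

Used when a single job has to process several kinds of input, for example database records and small files at the same time.

Only the driver's main function has to wire the inputs together; the mapper class for each input is written separately. MyGenericWritableApp above shows the pattern with MultipleInputs.addInputPath.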
