Hadoop Big Data: the total-order sorting mechanism in MapReduce

There are several ways to produce globally sorted output from a MapReduce job:
(1) Use a single reduce task. The output is globally ordered, but parallelism is lost and that one node carries the entire load.
(2) Use a hand-written range partitioner with a matching number of reduce tasks. This also yields global order, but it is hard to avoid uneven key distribution (data skew): some reduce tasks end up overloaded while others sit nearly idle.
(3) First run a separate job to measure the key distribution and derive good range boundaries, then sort with the range partitioner from (2). This works, but the extra full pass over the data set is costly.
(4) Use Hadoop's built-in input sampler to sample the data set and derive the range boundaries, then let Hadoop's built-in TotalOrderPartitioner carry out the global sort.
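The range-partitioning idea behind approaches (2) and (3) can be sketched in plain Java. This is only an illustration, not Hadoop's implementation (the real TotalOrderPartitioner additionally builds a trie for binary-comparable keys); the split points below are hypothetical:

```java
import java.util.Arrays;

// Sketch of range partitioning: pre-chosen split points divide the key
// space into ordered ranges, and each range goes to one reduce task.
public class RangePartitionSketch {
    // Two hypothetical split points -> three partitions:
    // keys < "g" -> 0, "g" <= keys < "p" -> 1, keys >= "p" -> 2
    static final String[] SPLIT_POINTS = {"g", "p"};

    static int getPartition(String key) {
        int idx = Arrays.binarySearch(SPLIT_POINTS, key);
        // If the key equals a split point, it belongs to the partition
        // after it; otherwise binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }
}
```

Because every key in partition i compares less than every key in partition i+1, concatenating the sorted reducer outputs in partition order gives a globally sorted result. Approach (4) below automates the choice of the split points.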
The job below implements approach (4). Compared with the original listing, the missing imports and output key/value classes have been added, an unused field has been dropped, and setNumReduceTasks is called before writePartitionFile (the sampler needs the reducer count to know how many split points to produce):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSort {

    // Identity mapper: keys pass through unchanged, so the framework's
    // per-reducer sort plus the TotalOrderPartitioner yield global order.
    static class TotalSortMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    // Identity reducer: writes every value under its (already sorted) key.
    static class TotalSortReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text v : values) {
                context.write(key, v);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(TotalSort.class);
        job.setMapperClass(TotalSortMapper.class);
        job.setReducerClass(TotalSortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // The sampler reads the input as key/value pairs, so the input
        // is a SequenceFile with Text keys and Text values.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Must be set before sampling: the number of reduce tasks
        // determines the number of split points (here 3 tasks -> 2 points).
        job.setNumReduceTasks(3);
        job.setPartitionerClass(TotalOrderPartitioner.class);

        // Sample each record with probability 0.1, keep at most 100
        // samples, read at most 10 input splits, then write the derived
        // split points to the partition file.
        InputSampler.RandomSampler<Text, Text> randomSampler =
                new InputSampler.RandomSampler<>(0.1, 100, 10);
        InputSampler.writePartitionFile(job, randomSampler);

        // Ship the partition file to every task via the distributed cache
        // so each mapper's partitioner can load the split points.
        Configuration conf2 = job.getConfiguration();
        String partitionFile = TotalOrderPartitioner.getPartitionFile(conf2);
        job.addCacheFile(new URI(partitionFile));

        job.waitForCompletion(true);
    }
}
```
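What InputSampler.RandomSampler and writePartitionFile do can be approximated in plain Java. This is a simplified sketch of the idea, not the Hadoop API (the method names and the in-memory record list are illustrative; the real sampler also replaces earlier samples once its buffer is full):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: sample records with probability `freq` up to `numSamples`,
// then sort the sample and take evenly spaced keys as split points.
public class SamplerSketch {

    static List<String> sample(List<String> records, double freq,
                               int numSamples, long seed) {
        Random rnd = new Random(seed);
        List<String> sample = new ArrayList<>();
        for (String r : records) {
            if (sample.size() >= numSamples) break;      // buffer full
            if (rnd.nextDouble() < freq) sample.add(r);  // keep w.p. freq
        }
        return sample;
    }

    // For numPartitions reducers, pick numPartitions - 1 boundaries
    // at evenly spaced ranks of the sorted sample.
    static String[] splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        String[] points = new String[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted.get(i * sorted.size() / numPartitions);
        }
        return points;
    }
}
```

Because the split points are taken from a random sample of the real keys, each reducer receives roughly the same number of records, which is exactly what fixes the data-skew problem of the hand-tuned range partitioner in approach (2).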