3.2-3.3 Common Data Compression in Hive
1. Data Compression

1) Why compress?
Compression makes the data smaller: less data on local disk means less disk IO, and less data on the wire means less network IO.

- Hadoop jobs are typically IO-bound.
- Compression reduces the size of the data transferred across the network.
- Simply enabling compression can therefore improve overall job performance.
- Data to be compressed must support splittability.
2) When to compress?

1. Compress map input:
   - MapReduce jobs read their input from HDFS.
   - Compress if the input data is large; this reduces disk read cost.
   - Compress with splittable algorithms such as Bzip2, or use compression with splittable file structures such as Sequence Files, RC Files, etc.
2. Compress intermediate data:
   - Map output is written to disk (spill) and transferred across the network.
   - Always use compression to reduce both disk write and network transfer load.
   - Beneficial from a performance point of view even if input and output are uncompressed.
   - Use faster codecs such as Snappy or LZO.
3. Compress reducer output:
   - MapReduce output is used both for archiving and for chaining MapReduce jobs.
   - Use compression to reduce disk space for archiving.
   - Compression is also beneficial for chained jobs, especially with limited disk throughput.
   - Use compression methods with a higher compression ratio to save more disk space.
3) Supported codecs in Hadoop

- Zlib → org.apache.hadoop.io.compress.DefaultCodec
- Gzip → org.apache.hadoop.io.compress.GzipCodec
- Bzip2 → org.apache.hadoop.io.compress.BZip2Codec
- Lzo → com.hadoop.compression.lzo.LzoCodec
- Lz4 → org.apache.hadoop.io.compress.Lz4Codec
- Snappy → org.apache.hadoop.io.compress.SnappyCodec
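As a quick illustration (not from the original walkthrough; the `input-gz`/`output-gz` paths are hypothetical), input compressed with one of these codecs is picked up by file extension alone, so a gzipped file can be fed to wordcount with no extra flags:

```bash
# Gzip a copy of a sample file; the .gz extension is what maps it to GzipCodec on read
gzip -c /opt/datas/wc.input > /opt/datas/wc.input.gz
bin/hdfs dfs -mkdir -p /user/root/mapreduce/wordcount/input-gz
bin/hdfs dfs -put /opt/datas/wc.input.gz /user/root/mapreduce/wordcount/input-gz
# The input is decompressed transparently before the map phase
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount \
  /user/root/mapreduce/wordcount/input-gz /user/root/mapreduce/wordcount/output-gz
```

Note that a plain .gz file is not splittable, so it is processed by a single map task; this is exactly why the splittability caveat above matters for large inputs.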
4) Compression in MapReduce

##### Compressed Input Usage
The file format is auto-recognized by extension. The codec must be defined in core-site.xml.

##### Compress Intermediate Data (Map Output)
mapreduce.map.output.compress=true;
mapreduce.map.output.compress.codec=CodecName;

##### Compress Job Output (Reducer Output)
mapreduce.output.fileoutputformat.compress=true;
mapreduce.output.fileoutputformat.compress.codec=CodecName;
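For reference, codec registration lives in the io.compression.codecs list in core-site.xml. A minimal sketch with the stock Hadoop codecs (adjust the list to whatever your build actually ships; the built-in zlib/gzip codecs are typically available even without this entry):

```xml
<!-- core-site.xml: codecs available to jobs; extension-based detection uses this list -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
```

The effective value can be checked with `bin/hdfs getconf -confKey io.compression.codecs`.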
5) Compression in Hive

##### Compressed Input Usage
Can be defined in the table definition:
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"

##### Compress Intermediate Data (Map Output)
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=CodecName;
SET mapred.map.output.compression.type=BLOCK/RECORD;
Use faster codecs such as Snappy, LZO, or LZ4. Useful for chained MapReduce jobs with lots of intermediate data, such as joins.

##### Compress Job Output (Reducer Output)
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=CodecName;
SET mapred.output.compression.type=BLOCK/RECORD;
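Putting the output settings together, here is a minimal sketch of writing a Snappy-compressed table (the `emp_snappy` name is hypothetical; `emp` is the demo table used later in this post):

```sql
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- CTAS: the files backing emp_snappy are written through SnappyCodec
CREATE TABLE emp_snappy AS SELECT * FROM emp;
```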
2. Snappy

1) Overview
In a Hadoop cluster, Snappy is a good compression tool: compared with gzip it has a large advantage in compression and decompression speed and uses relatively less CPU, but its compression ratio is lower than gzip's. Each has its uses.

Snappy is a compression/decompression library written in C++, designed for very high speed at a reasonable compression ratio. It is faster than zlib, but its output is 20% to 100% larger. On a Core i7 in 64-bit mode it compresses at roughly 250-500 MB per second.

Snappy was formerly known as Zippy. Although it is only a compression library, Google uses it in many internal projects, including BigTable, MapReduce, and RPC. Google states that the library and its algorithm are optimized for processing speed, at the cost of output size and of compatibility with other, similar tools. Snappy is specifically tuned for 64-bit x86 processors: on a single Intel Core i7 core it achieves at least 250 MB/s compression and 500 MB/s decompression. Sacrificing some compression ratio buys even higher speed; although the compressed files may be 20% to 100% larger than those of other libraries, at a given compression ratio Snappy's speed is remarkable: "compressing plain text is 1.5-1.7x as fast as other libraries, HTML 2-4x, while for JPEG, PNG, and other already-compressed data there is no significant speedup."
2) Making the Snappy library available to Hadoop

Here we use pre-built native library files:
```
# These are pre-built library files shipped as a tarball; unpack them first
[root@hadoop-senior softwares]# mkdir 2.5.0-native-snappy
[root@hadoop-senior softwares]# tar zxf 2.5.0-native-snappy.tar.gz -C 2.5.0-native-snappy
[root@hadoop-senior softwares]# cd 2.5.0-native-snappy
[root@hadoop-senior 2.5.0-native-snappy]# ls
libhadoop.a       libhadoop.so        libhadooputils.a  libhdfs.so        libsnappy.a   libsnappy.so    libsnappy.so.1.2.0
libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a         libhdfs.so.0.0.0  libsnappy.la  libsnappy.so.1

# Replace the native libraries of the Hadoop installation
[root@hadoop-senior lib]# pwd
/opt/modules/hadoop-2.5.0/lib
[root@hadoop-senior lib]# mv native/ 250-native
[root@hadoop-senior lib]# mkdir native
[root@hadoop-senior lib]# ls
250-native  native  native-bak
[root@hadoop-senior lib]# cp /opt/softwares/2.5.0-native-snappy/* ./native/
[root@hadoop-senior lib]# ls native
libhadoop.a       libhadoop.so        libhadooputils.a  libhdfs.so        libsnappy.a   libsnappy.so    libsnappy.so.1.2.0
libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a         libhdfs.so.0.0.0  libsnappy.la  libsnappy.so.1

# Verify
[root@hadoop-senior hadoop-2.5.0]# bin/hadoop checknative
19/04/25 09:59:51 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
19/04/25 09:59:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so
zlib:   true /lib64/libz.so.1
snappy: true /opt/modules/hadoop-2.5.0/lib/native/libsnappy.so.1    # snappy is now true
lz4:    true revision:99
bzip2:  true /lib64/libbz2.so.1
```
3) MapReduce compression test

```
# Create a test file
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -mkdir -p /user/root/mapreduce/wordcount/input
[root@hadoop-senior hadoop-2.5.0]# touch /opt/datas/wc.input
[root@hadoop-senior hadoop-2.5.0]# vim !$
hadoop hdfs
hadoop hive
hadoop mapreduce
hadoop hue
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -put /opt/datas/wc.input /user/root/mapreduce/wordcount/input
put: `/user/root/mapreduce/wordcount/input/wc.input': File exists
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -ls -R /user/root/mapreduce/wordcount/input
-rw-r--r--   1 root supergroup         12 2019-04-08 15:03 /user/root/mapreduce/wordcount/input/wc.input

# First, run MapReduce without compression
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output

# Then run MapReduce with compression
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output2

# -Dmapreduce.map.output.compress=true        compress the map output (-D passes a property)
# -Dmapreduce.map.output.compress.codec=...   use the Snappy codec
# With such a tiny dataset, the difference is barely measurable
```
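The run above only compresses the intermediate map output, which is invisible in the result files. As a complementary sketch (the `output3` directory is hypothetical), the reducer output itself can be compressed and then checked by extension:

```bash
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output3
# The part files should now carry a .snappy extension
bin/hdfs dfs -ls /user/root/mapreduce/wordcount/output3
```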
3. Configuring Compression in Hive

```
hive (default)> set mapreduce.map.output.compress=true;
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

Test:
Running a select in Hive kicks off a MapReduce job:
```
hive (default)> select count(*) from emp;
```

On the web UI, the configuration page of this specific job shows the settings the job ran with:
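If the web UI is not handy, the same check works from the Hive CLI itself: `set` with just a property name prints its current value. A quick sketch (output shown assumes the settings above):

```
hive (default)> set mapreduce.map.output.compress;
mapreduce.map.output.compress=true
hive (default)> set mapreduce.map.output.compress.codec;
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
```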
Reposted from: https://www.cnblogs.com/weiyiming007/p/10768896.html