Hadoop 2.2.0 + Hive: Working with LZO Compression
環(huán)境:
Centos6.4 64位
Hadoop2.2.0
Sun JDK1.7.0_45
hive-0.12.0
Preparation:
yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
Let's get started!
(1) Install LZO
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
tar -zxvf lzo-2.06.tar.gz
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install
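A quick sanity check (a sketch, assuming the --prefix used above): the shared library and headers should now sit under /usr/local/hadoop/lzo/.
# Verify the LZO install landed where we expect.
ls /usr/local/hadoop/lzo/lib          # expect liblzo2.so, liblzo2.so.2, ...
ls /usr/local/hadoop/lzo/include/lzo  # expect lzo1x.h, lzoconf.h, ...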
(2) Install lzop
wget http://www.lzop.org/download/lzop-1.03.tar.gz
tar -zxvf lzop-1.03.tar.gz
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/
Note: if you skip the export above, configure fails with: configure: error: LZO header files not found. Please check your installation or set the environment variable `CPPFLAGS'.
Next:
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make && make install
(3) Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop
(4) Test lzop
lzop /home/hadoop/data/access_20131219.log
This fails with: lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory
Fix: add the environment variable export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64 (the liblzo2 runtime pulled in by yum lives there). lzop will then produce the compressed file /home/hadoop/data/access_20131219.log.lzo, which confirms the preceding steps all worked.
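Beyond a single compression, a round trip is a quick confidence check (a sketch reusing the sample file above; the /tmp paths are arbitrary):
# Compress, decompress to a new name, and compare with the original.
lzop -o /tmp/access.log.lzo /home/hadoop/data/access_20131219.log
lzop -d -o /tmp/access.log /tmp/access.log.lzo
diff /home/hadoop/data/access_20131219.log /tmp/access.log && echo "lzop round trip OK"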
(5) Install hadoop-lzo
There is one more prerequisite: Maven and either SVN or Git must already be set up (I used SVN); I won't cover that here. If you can't get those working, there is little point in continuing.
I used https://github.com/twitter/hadoop-lzo
Check out the code from https://github.com/twitter/hadoop-lzo/trunk with SVN, then change one part of pom.xml.
From:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.1.0-beta</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
to:
<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.2.0</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>
Then run, in order:
mvn clean package -Dmaven.test.skip=true
tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /home/hadoop/hadoop-2.2.0/lib/native/
cp target/hadoop-lzo-0.4.20-SNAPSHOT.jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/
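hadoop-lzo's native build produces the libgplcompression JNI libraries; a quick check (paths as above) that both the native libs and the jar landed:
# Confirm the JNI libraries and the jar are where Hadoop will look for them.
ls /home/hadoop/hadoop-2.2.0/lib/native/ | grep gplcompression   # expect libgplcompression.so*
ls /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar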
Next, sync /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar and /home/hadoop/hadoop-2.2.0/lib/native/ to every other Hadoop node (one way is sketched below). Make sure the libraries under /home/hadoop/hadoop-2.2.0/lib/native/ are readable and executable by the user that runs Hadoop.
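A simple loop does the job; the hostnames slave1 through slave3 are placeholders for your own node list:
# Push the jar and the native libs to each node (hypothetical hostnames).
for node in slave1 slave2 slave3; do
  rsync -av /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar \
    ${node}:/home/hadoop/hadoop-2.2.0/share/hadoop/common/
  rsync -av /home/hadoop/hadoop-2.2.0/lib/native/ ${node}:/home/hadoop/hadoop-2.2.0/lib/native/
done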
(6) Configure Hadoop
Append the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
Add the following to $HADOOP_HOME/etc/hadoop/core-site.xml:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
Add the following to $HADOOP_HOME/etc/hadoop/mapred-site.xml:
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>
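After restarting Hadoop, a quick smoke test (a sketch; the HDFS path is illustrative) is to upload an .lzo file and read it back with hadoop fs -text, which picks the codec from io.compression.codecs by file extension:
# Upload an lzo file and read it back through the codec factory.
hadoop fs -mkdir -p /tmp/lzotest
hadoop fs -put /home/hadoop/data/access_20131219.log.lzo /tmp/lzotest/
hadoop fs -text /tmp/lzotest/access_20131219.log.lzo | head -3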
(7) Trying out LZO in Hive
A: First, create the LZO-backed nginx log table, logs_app_nginx
CREATE TABLE logs_app_nginx (
  ip STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  rt STRING,
  referer STRING,
  agent STRING,
  forwarded STRING
)
PARTITIONED BY (
  date STRING,
  host STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
B: Load the data
LOAD DATA LOCAL INPATH '/home/hadoop/data/access_20131230_25.log.lzo' INTO TABLE logs_app_nginx PARTITION(date='20131229', host='25');
/home/hadoop/data/access_20131219.log has this (tab-separated) format:
221.207.93.109  -  [23/Dec/2013:23:22:38 +0800]  "GET /ClientGetResourceDetail.action?id=318880&token=Ocm HTTP/1.1"  200  199  0.008  "xxx.com"  "Android4.1.2/LENOVO/Lenovo A706/ch_lenovo/80"  "-"
Simply running lzop /home/hadoop/data/access_20131219.log produces the LZO-compressed file /home/hadoop/data/access_20131219.log.lzo.
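For a whole batch of logs, the same thing in a loop (a sketch over a hypothetical /home/hadoop/data/*.log glob):
# Compress every plain-text log in the data directory; lzop keeps the
# originals unless you pass -U.
for f in /home/hadoop/data/*.log; do
  lzop "$f"
done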
C: Index the LZO files
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/logs_app_nginx
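The index is what makes large .lzo files splittable, so a single file can feed multiple map tasks instead of one. For a small number of files, hadoop-lzo also ships a local indexer, com.hadoop.compression.lzo.LzoIndexer, which writes the .index files without launching a MapReduce job:
# Local (non-MapReduce) indexing alternative for small file counts.
$HADOOP_HOME/bin/hadoop jar /home/hadoop/hadoop-2.2.0/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/logs_app_nginx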
D: Run a MapReduce job through Hive
set hive.exec.reducers.max=10;
set mapred.reduce.tasks=10;
select ip, rt from logs_app_nginx limit 10;
If the Hive console shows output like the following, everything is working:
hive> set hive.exec.reducers.max=10;
hive> set mapred.reduce.tasks=10;
hive> select ip, rt from logs_app_nginx limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1388065803340_0009, Tracking URL = http://lrts216:8088/proxy/application_1388065803340_0009/
Kill Command = /home/hadoop/hadoop-2.2.0/bin/hadoop job -kill job_1388065803340_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2013-12-27 09:13:39,163 Stage-1 map = 0%, reduce = 0%
2013-12-27 09:13:45,343 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
2013-12-27 09:13:46,369 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.22 sec
MapReduce Total cumulative CPU time: 1 seconds 220 msec
Ended Job = job_1388065803340_0009
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.22 sec HDFS Read: 63570 HDFS Write: 315 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 220 msec
OK
221.207.93.109 "XXX.com"
Time taken: 17.498 seconds, Fetched: 10 row(s)
Reposted from: https://www.cnblogs.com/luxiaorui/p/3931024.html