Testing Hive 0.14.0 on Tez turned up quite a few problems:
1. With CDH 5.2.0 + Hive 0.14.0 + Tez 0.5.0, the first error was the following:
java.lang.NoSuchMethodError: org.apache.tez.dag.api.client.Progress.getFailedTaskAttemptCount()I
    at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.printStatusInPlace(TezJobMonitor.java:613)
    at org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor.monitorExecution(TezJobMonitor.java:311)
    at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:167)
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1604)
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1364)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1177)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1004)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:994)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:247)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:199)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:410)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:783)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:677)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
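The stack trace shows that the error is raised after the Tez job has been submitted. In org.apache.hadoop.hive.ql.exec.tez.TezTask,
once the job has been submitted through the submit method, a TezJobMonitor object is instantiated to track how the Tez job is running: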
// submit will send the job to the cluster and start executing
client = submit(jobConf, dag, scratchDir, appJarLr, session,
    additionalLr, inputOutputJars, inputOutputLocalResources);
// finally monitor will print progress until the job is done
TezJobMonitor monitor = new TezJobMonitor();
rc = monitor.monitorExecution(client, ctx.getHiveTxnManager(), conf, dag);
In the TezJobMonitor.monitorExecution method:
boolean isProfileEnabled = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_SUMMARY); // hive.tez.exec.print.summary, default false
boolean inPlaceUpdates = conf.getBoolVar(conf, HiveConf.ConfVars.TEZ_EXEC_INPLACE_PROGRESS); // hive.tez.exec.inplace.progress, default true
boolean wideTerminal = false;
boolean isTerminal = inPlaceUpdates == true ? isUnixTerminal() : false;
// we need at least 80 chars wide terminal to display in-place updates properly
if (isTerminal) {
  if (getTerminalWidth() >= MIN_TERMINAL_WIDTH) {
    wideTerminal = true;
  }
}
boolean inPlaceEligible = false;
if (inPlaceUpdates && isTerminal && wideTerminal && !console.getIsSilent()) {
  inPlaceEligible = true;
}
// A while loop then checks the job state and calls either printStatusInPlace or printStatus (printStatus eventually calls the getReport method)
......
case RUNNING:
  if (!running) {
    perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.TEZ_SUBMIT_TO_RUNNING);
    console.printInfo("Status: Running (" + dagClient.getExecutionContext() + ")\n");
    startTime = System.currentTimeMillis();
    running = true;
  }
  if (inPlaceEligible) {
    printStatusInPlace(progressMap, startTime, false, dagClient);
    // log the progress report to log file as well
    lastReport = logStatus(progressMap, lastReport, console);
  } else {
    lastReport = printStatus(progressMap, lastReport, console);
  }
  break;
For example, in the printStatusInPlace method:
SortedSet<String> keys = new TreeSet<String>(progressMap.keySet());
int idx = 0;
int maxKeys = keys.size();
for (String s : keys) {
  idx++;
  Progress progress = progressMap.get(s);
  final int complete = progress.getSucceededTaskCount();
  final int total = progress.getTotalTaskCount();
  final int running = progress.getRunningTaskCount();
  final int failed = progress.getFailedTaskAttemptCount(); // calls Progress.getFailedTaskAttemptCount to get the number of failed task attempts
  final int pending = progress.getTotalTaskCount() - progress.getSucceededTaskCount() -
      progress.getRunningTaskCount();
  final int killed = progress.getKilledTaskCount();
In Tez 0.5.0 the class org.apache.tez.dag.api.client.Progress has no getFailedTaskAttemptCount method.
The method was only added in Tez 0.5.2, so Hive 0.14.0 needs Tez 0.5.2 or later.
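A quick way to confirm this kind of version mismatch before running a query is to ask the JVM, via reflection, whether the class on the classpath exposes the method. The following is a minimal sketch, not part of Hive or Tez: the class name MethodCheck and the helper hasPublicMethod are made up for illustration, and only the Tez class and method names come from the stack trace above.

```java
import java.lang.reflect.Method;

public class MethodCheck {
    /**
     * Returns true if the named class can be loaded from the current classpath
     * and exposes a public method with the given name (any signature).
     */
    public static boolean hasPublicMethod(String className, String methodName) {
        try {
            Class<?> cls = Class.forName(className);
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException e) {
            // The class is not on the classpath at all.
            return false;
        } catch (LinkageError e) {
            // The class was found but could not be linked (for example a VerifyError).
            return false;
        }
    }

    public static void main(String[] args) {
        // Names taken from the NoSuchMethodError above.
        String cls = "org.apache.tez.dag.api.client.Progress";
        String method = "getFailedTaskAttemptCount";
        System.out.println(cls + "#" + method + " available: " + hasPublicMethod(cls, method));
    }
}
```

Run it with the same Tez jars on the classpath that Hive uses; with a Tez build that lacks the method it should report false.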
2. After upgrading to Hive 0.14.0 + Tez 0.5.2, the following error appeared:
15/01/13 14:09:21 INFO client.TezClient: The url to track the Tez Session: http://xxxx:8042/proxy/application_1416818587155_0049/
Exception in thread "main" java.lang.RuntimeException: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:457)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
    at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:599)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)
    at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454)
    ... 7 more
The failure happens while the session is being initialized; the exception is thrown from the TezSessionState.open method:
....
try {
  session.waitTillReady();
} catch(InterruptedException ie) {
  //ignore
}
Here session is an instance of TezClient. In the TezClient.waitTillReady method:
public synchronized void waitTillReady() throws IOException, TezException, InterruptedException {
  if (!isSession) {
    // nothing to wait for in non-session mode
    return;
  }
  verifySessionStateForSubmission();
  while (true) {
    TezAppMasterStatus status = getAppMasterStatus(); // here getAppMasterStatus returned TezAppMasterStatus.SHUTDOWN
    if (status.equals(TezAppMasterStatus.SHUTDOWN)) {
      throw new SessionNotRunning("TezSession has already shutdown");
    }
    if (status.equals(TezAppMasterStatus.READY)) {
      return;
    }
    Thread.sleep(SLEEP_FOR_READY);
  }
}
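The TezClient here is created in session mode, and getAppMasterStatus returned TezAppMasterStatus.SHUTDOWN, so waitTillReady threw the exception. In other words, the Tez AppMaster did not start properly. The NodeManager log contains the following error: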
2015-01-13 16:27:58,162 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1416818587155_0060_01_000001 and exit code: 1
ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:196)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
So the container that launches the AM is the one failing. The corresponding container log shows:
2015-01-13 17:34:59,731 FATAL [main] app.DAGAppMaster: Error starting DAGAppMaster
java.lang.VerifyError: class org.apache.hadoop.yarn.proto.YarnProtos$ApplicationIdProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
    at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
    at java.lang.Class.getConstructor0(Class.java:2699)
    at java.lang.Class.getConstructor(Class.java:1657)
    at org.apache.hadoop.yarn.factories.impl.pb.RecordFactoryPBImpl.newRecordInstance(RecordFactoryPBImpl.java:62)
    at org.apache.hadoop.yarn.util.Records.newRecord(Records.java:36)
    at org.apache.hadoop.yarn.api.records.ApplicationId.newInstance(ApplicationId.java:49)
    at org.apache.hadoop.yarn.util.ConverterUtils.toApplicationAttemptId(ConverterUtils.java:137)
    at org.apache.hadoop.yarn.util.ConverterUtils.toContainerId(ConverterUtils.java:177)
    at org.apache.tez.dag.app.DAGAppMaster.main(DAGAppMaster.java:1794)
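This looks like a protobuf compatibility problem.
CDH 5.2.0 ships protobuf-java-2.5.0.jar by default, Hive 0.14.0 also uses protobuf-java-2.5.0.jar, and Tez 0.5.2 is compiled against protobuf 2.5.0, so in theory there should be no protobuf compatibility problem. My suspicion was that protobuf 2.4.0a was being loaded when the Tez AM started, which meant looking at the launch command to find the classpath it actually used: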
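When the question is "which jar did this class really come from", the JVM can answer directly through the class's CodeSource. Below is a minimal sketch under stated assumptions: the class name WhereLoaded and the helper locationOf are hypothetical, and com.google.protobuf.GeneratedMessage is used as the default probe only because it is the class searched for later in this post.

```java
import java.security.CodeSource;

public class WhereLoaded {
    /** Marker returned for classes without a code source (for example JDK bootstrap classes). */
    public static final String UNKNOWN = "(bootstrap or unknown)";

    /** Returns the jar or directory a class was loaded from, or UNKNOWN if the JVM does not record one. */
    public static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return UNKNOWN;
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Class name to probe; pass another fully qualified name as the first argument.
        String name = args.length > 0 ? args[0] : "com.google.protobuf.GeneratedMessage";
        System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
    }
}
```

To be meaningful it has to run with exactly the classpath of the failing process, which is why the launch command of the AM container still needs to be captured.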
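To see the shell scripts that launch the AM, I modified the org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor class to add a Thread.sleep, then rebuilt the CDH 5.2.0 package (the build needs Java 7, version range [1.7.0,1.7.1000}], and the native build was skipped: mvn package -DskipTests -Pdist -Dtar -e -X),
and replaced ./share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.5.0-cdh5.2.0.jar with the rebuilt jar for the test: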
The scripts are invoked in this order:
default_container_executor.sh --> default_container_executor_session.sh --> launch_container.sh
And in the launch_container.sh script:
export HADOOP_COMMON_HOME="/home/vipshop/platform/hadoop-2.5.0-cdh5.2.0"  # the relevant variables are set first
export CLASSPATH="$PWD:$PWD/*:$HADOOP_CONF_DIR:" # CLASSPATH is reset here
export HADOOP_TOKEN_FILE_LOCATION="/home/vipshop/hard_disk/7/yarn/local/usercache/hdfs/appcache/application_1416818587155_0075/container_1416818587155_0075_01_000001/container_tokens"
....
ln -sf "/home/vipshop/hard_disk/10/yarn/local/filecache/42/hadoop-yarn-api-2.5.0.jar" "hadoop-yarn-api-2.5.0.jar"  # the relevant jars are symlinked into the local directory
hadoop_shell_errorcode=$?
if [ $hadoop_shell_errorcode -ne 0 ]
then
  exit $hadoop_shell_errorcode
fi
.....
exec /bin/bash -c "$JAVA_HOME/bin/java  -Xmx819m -server -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001 -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster --session 1>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stdout 2>/home/vipshop/hard_disk/9/yarn/logs/application_1416818587155_0075/container_1416818587155_0075_01_000001/stderr "
# the last step runs java org.apache.tez.dag.app.DAGAppMaster, i.e. the main method of org.apache.tez.dag.app.DAGAppMaster, which starts the DAGAppMaster
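So the CLASSPATH is the directory the script runs in (plus every jar in it and the configuration directory). Here, for example: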
CLASSPATH='/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/
application_1416818587155_0079/container_1416818587155_0079_01_000001:
/home/vipshop/hard_disk/11/yarn/local/usercache/hdfs/appcache/
application_1416818587155_0079/container_1416818587155_0079_01_000001/*:
/home/vipshop/conf:'
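Searching the script's working directory for jars that contain protobuf classes turned up a hive-solr jar with protobuf bundled in, and its protobuf version was 2.4.0a: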
for i in `find . -name "*.jar"`; do echo $i `jar -tvf $i|grep GeneratedMessage|wc -l`; done|awk '{if($2>0) print}'
./protobuf-java-2.5.0.jar 31  //2.5.0
./hive-exec-0.14.0-dfffe4217f40bd764977b741ad970a562e07fb9×××f0180620bd13f68a2577b.jar 31 //2.5.0
./hive-solr-0.0.1-SNAPSHOT-jar-with-dependencies.jar  //2.4.0a
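As a result, the classloader picked up protobuf 2.4.0a when the container started, which ultimately made the container launch fail. After rebuilding that jar against protobuf 2.5.0, Hive on Tez ran normally.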
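The jar search used above can also be written in Java, which is convenient on nodes where the jar tool is not on the PATH. This is only a rough equivalent, with JarScan and scan being names invented for this sketch: unlike find it does not recurse into subdirectories, and it only counts matching entries; it does not determine the protobuf version bundled in a jar.

```java
import java.io.File;
import java.io.IOException;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

public class JarScan {
    /**
     * For every *.jar directly under dir, counts the entries whose name contains needle.
     * Only jars with at least one match are returned, keyed by file name.
     */
    public static Map<String, Integer> scan(File dir, String needle) throws IOException {
        Map<String, Integer> hits = new TreeMap<String, Integer>();
        File[] files = dir.listFiles();
        if (files == null) {
            return hits; // dir is not a directory or cannot be read
        }
        for (File f : files) {
            // isFile() follows symlinks, so jars linked into a container directory are included.
            if (!f.isFile() || !f.getName().endsWith(".jar")) {
                continue;
            }
            int count = 0;
            try (JarFile jar = new JarFile(f)) {
                Enumeration<JarEntry> entries = jar.entries();
                while (entries.hasMoreElements()) {
                    if (entries.nextElement().getName().contains(needle)) {
                        count++;
                    }
                }
            }
            if (count > 0) {
                hits.put(f.getName(), count);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Integer> e : scan(dir, "GeneratedMessage").entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Pointed at the container's working directory, any jar other than protobuf-java itself that shows up in the output is a candidate for carrying its own copy of protobuf.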
Reposted from: https://blog.51cto.com/caiguangguang/1604100