[Problem Solved] Submitting a job to a Spark cluster from a local machine fails: Initial job has not accepted any resources
The error message is as follows:
18/04/17 18:18:14 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/04/17 18:18:29 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Putting the same Python file on a cluster machine and submitting it to Spark there worked fine. I then tried running one of Spark's bundled examples from my local machine, and the problem was still there.
Although this is only a WARN, the job never actually executes and stays in the RUNNING state in the Spark web UI. The commands I ran on my local machine and on the cluster were, respectively:
bin\spark-submit --master spark://192.168.3.207:7077 examples\src\main\python\pi.py
./spark-submit --master spark://192.168.3.207:7077 ../examples/src/main/python/pi.py
Both run the same example that ships with Spark.
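For reference, the bundled pi.py does nothing exotic: it estimates Pi by Monte Carlo sampling over two partitions by default, which is why the scheduler log above shows a task set with 2 tasks. A condensed sketch of what the example does (paraphrased from the Spark 2.x sources, not the verbatim file):

from operator import add
from random import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PythonPi").getOrCreate()

partitions = 2                  # default when no argument is passed
n = 100000 * partitions         # total number of sample points

def inside(_):
    # Draw a point in the 2x2 square centered on the origin;
    # count it when it lands inside the unit circle.
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
             .map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
spark.stop()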
The fixes I found online boil down to two. Neither of them worked for me, but I'll record them here first:
1) Increase the driver and executor memory:
bin\spark-submit --driver-memory 2000M --executor-memory 2000M --master spark://192.168.3.207:7077 examples\src\main\python\pi.py
2) Adjust the firewall to let Spark's traffic through, or temporarily turn it off altogether (for example, stopping firewalld/ufw on the nodes); see the sketch below.
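For what it's worth, opening the firewall "for Spark" is complicated by the fact that the driver and block manager bind to random ports by default, so a common approach is to pin those ports first and then allow only them through. A sketch (the port numbers are arbitrary illustrative choices, not values from this incident):

bin\spark-submit --conf spark.driver.port=51800 --conf spark.blockManager.port=51900 --master spark://192.168.3.207:7077 examples\src\main\python\pi.py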
I went on to check the master's and the slaves' logs, and there were no errors there either. So I opened the master's web UI at http://192.168.3.207:8080/ and clicked into the job I had just submitted:
Clicking the stderr of one of the workers showed the following:
18/04/17 18:55:54 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 23412@he-200
18/04/17 18:55:54 INFO SignalUtils: Registered signal handler for TERM
18/04/17 18:55:54 INFO SignalUtils: Registered signal handler for HUP
18/04/17 18:55:54 INFO SignalUtils: Registered signal handler for INT
18/04/17 18:55:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/17 18:55:55 INFO SecurityManager: Changing view acls to: he,shaowei.liu
18/04/17 18:55:55 INFO SecurityManager: Changing modify acls to: he,shaowei.liu
18/04/17 18:55:55 INFO SecurityManager: Changing view acls groups to:
18/04/17 18:55:55 INFO SecurityManager: Changing modify acls groups to:
18/04/17 18:55:55 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(he, shaowei.liu); groups with view permissions: Set(); users  with modify permissions: Set(he, shaowei.liu); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from 192.168.56.1:51378 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
...
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.56.1:51378 in 120 seconds
... 8 more
18/04/17 18:57:55 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
The log reports a timeout connecting to 192.168.56.1:51378. But where does that IP come from? Running ipconfig on my machine answered the question: 192.168.56.1 is the VirtualBox host-only network IP created by Docker on my local machine. When the job was submitted to the cluster, Spark evidently failed to pick up the machine's real IP and advertised this one as the driver address, so the cluster nodes kept timing out trying to accept the job. The fix was simple: disable that network adapter.
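Presumably an alternative to disabling the adapter (untested here) would be to tell Spark explicitly which address the driver should advertise, either through the SPARK_LOCAL_IP environment variable or the spark.driver.host property, set to the machine's real LAN address (the 192.168.0.138 that appears in the successful logs below):

bin\spark-submit --conf spark.driver.host=192.168.0.138 --master spark://192.168.3.207:7077 examples\src\main\python\pi.py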
Trying again, the job finished almost immediately:
bin\spark-submit --master spark://192.168.3.207:7077 examples\src\main\python\pi.py
Looking at the logs in the web UI again, I can see that the cluster node connects back to my machine, fetches my job file pi.py into a temporary directory /tmp/spark-xxx/ on the node, and copies it to $SPARK_HOME/work/ before actually running it. I'll study the exact flow when I have time. Here are the logs:
18/04/17 19:13:11 INFO TransportClientFactory: Successfully created connection to /192.168.0.138:51843 after 3 ms (0 ms spent in bootstraps)
18/04/17 19:13:11 INFO DiskBlockManager: Created local directory at /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/blockmgr-030eb78d-e46b-4feb-b7b7-108f9e61ec85
18/04/17 19:13:11 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/04/17 19:13:12 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.0.138:51843
18/04/17 19:13:12 INFO WorkerWatcher: Connecting to worker spark://Worker@192.168.3.102:34041
18/04/17 19:13:12 INFO TransportClientFactory: Successfully created connection to /192.168.3.102:34041 after 0 ms (0 ms spent in bootstraps)
18/04/17 19:13:12 INFO WorkerWatcher: Successfully connected to spark://Worker@192.168.3.102:34041
18/04/17 19:13:12 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(1, 192.168.3.102, 44683, None)
18/04/17 19:13:12 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(1, 192.168.3.102, 44683, None)
18/04/17 19:13:12 INFO BlockManager: Initialized BlockManager: BlockManagerId(1, 192.168.3.102, 44683, None)
18/04/17 19:13:14 INFO CoarseGrainedExecutorBackend: Got assigned task 0
18/04/17 19:13:14 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/04/17 19:13:14 INFO Executor: Fetching spark://192.168.0.138:51843/files/pi.py with timestamp 1523963609005
18/04/17 19:13:14 INFO TransportClientFactory: Successfully created connection to /192.168.0.138:51843 after 1 ms (0 ms spent in bootstraps)
18/04/17 19:13:14 INFO Utils: Fetching spark://192.168.0.138:51843/files/pi.py to /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/spark-98745f3b-2f70-47b2-8c56-c5b9f6eac496/fetchFileTemp2255624304256249008.tmp
18/04/17 19:13:14 INFO Utils: Copying /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/spark-98745f3b-2f70-47b2-8c56-c5b9f6eac496/-11088979641523963609005_cache to /home/ubutnu/spark_2_2_1/work/app-20180417191311-0005/1/./pi.py
...
18/04/17 19:13:14 INFO TransportClientFactory: Successfully created connection to /192.168.0.138:51866 after 5 ms (0 ms spent in bootstraps)
18/04/17 19:13:14 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1803 bytes result sent to driver
...
18/04/17 19:13:16 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
18/04/17 19:13:16 INFO MemoryStore: MemoryStore cleared
18/04/17 19:13:16 INFO ShutdownHookManager: Shutdown hook called
18/04/17 19:13:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-67d75b11-65e7-4bc7-89b5-c07fb159470f/executor-b8ce41a3-7c6e-49f6-95ef-7ed6cdef8e53/spark-98745f3b-2f70-47b2-8c56-c5b9f6eac496
Summary
If a locally submitted job hangs with "Initial job has not accepted any resources", don't stop at the WARN: check the worker stderr through the master's web UI. In my case the executors were timing out trying to reach the driver at a VirtualBox virtual-network IP, and disabling that adapter solved the problem.