12: NodeManager
Health Checker Service
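The NM runs a health checker service that determines the state of the node; the service can also run a user-defined check script. If the node fails a check, the NM reports this to the RM in its heartbeat, and the RM stops assigning new containers to that node.
Disk Checker
The disk checker examines the disks the NM uses, namely the local-dirs and log-dirs (configured with yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs respectively). The checks cover permissions, free disk space, and whether the disk is read-only. If a disk fails the check, the NM stops using that disk but still reports the node as healthy. However, once a configurable fraction of the disks has failed the check, the node is marked unhealthy. The disk checker parameters are: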
| Configuration Name | Allowed Values | Description |
|---|---|---|
| yarn.nodemanager.disk-health-checker.enable | true, false | Enable or disable the disk health checker service |
| yarn.nodemanager.disk-health-checker.interval-ms | Positive integer | The interval, in milliseconds, at which the disk checker should run; the default value is 2 minutes |
| yarn.nodemanager.disk-health-checker.min-healthy-disks | Float between 0-1 | The minimum fraction of disks that must pass the check for the NodeManager to mark the node as healthy; the default is 0.25 |
| yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage | Float between 0-100 | The maximum percentage of disk space that may be utilized before a disk is marked as unhealthy by the disk checker service. This check is run for every disk used by the NodeManager. The default value is 90, i.e. 90% of the disk can be used. |
| yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb | Integer | The minimum amount of free space that must be available on the disk for the disk checker service to mark the disk as healthy. This check is run for every disk used by the NodeManager. The default value is 0, i.e. the entire disk can be used. |

Source: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/NodeManager.html
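
As a rough illustration, the disk checker parameters above could be written into yarn-site.xml as follows. This is only a sketch: the values simply restate the defaults listed in the table (2 minutes expressed as 120000 ms), and the properties are assumed to sit inside the file's existing `<configuration>` element.

```xml
<!-- Sketch of a yarn-site.xml fragment; place inside <configuration>. -->
<!-- All values restate the defaults from the table above. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.enable</name>
  <value>true</value>
</property>
<!-- Run the disk check every 2 minutes (120000 ms). -->
<property>
  <name>yarn.nodemanager.disk-health-checker.interval-ms</name>
  <value>120000</value>
</property>
<!-- At least 25% of the disks must pass for the node to stay healthy. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <value>0.25</value>
</property>
<!-- A disk above 90% utilization is treated as unhealthy. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
<!-- 0 MB means no minimum free space is required. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb</name>
  <value>0</value>
</property>
```

Tightening the last two values makes the NM retire a nearly full disk sooner, at the cost of leaving more space unused.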
External Health Script
Users can supply an additional script to check the node's health. If the script exits with a non-zero status, times out, or throws an exception, the node is marked unhealthy. Note that if the script cannot be executed because of a permissions problem, the node is also marked unhealthy. The external health script is optional; if none is configured, only the disk checker runs. Parameters:
| Configuration Name | Allowed Values | Description |
|---|---|---|
| yarn.nodemanager.health-checker.interval-ms | Positive integer | The interval, in milliseconds, at which health checker service runs; the default value is 10 minutes. |
| yarn.nodemanager.health-checker.script.timeout-ms | Positive integer | The timeout for the health script that's executed; the default value is 20 minutes. |
| yarn.nodemanager.health-checker.script.path | String | Absolute path to the health check script to be run. |
| yarn.nodemanager.health-checker.script.opts | String | Arguments to be passed to the script when the script is executed. |
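
A minimal sketch of how these four properties might look in yarn-site.xml is shown below. The interval and timeout restate the defaults from the table (10 and 20 minutes, in milliseconds); the script path and its argument are hypothetical placeholders, not values from the Hadoop documentation, and must be replaced with a real executable on each node.

```xml
<!-- Sketch of a yarn-site.xml fragment; place inside <configuration>. -->
<!-- Run the health checker every 10 minutes (600000 ms). -->
<property>
  <name>yarn.nodemanager.health-checker.interval-ms</name>
  <value>600000</value>
</property>
<!-- Treat the script as failed if it runs longer than 20 minutes (1200000 ms). -->
<property>
  <name>yarn.nodemanager.health-checker.script.timeout-ms</name>
  <value>1200000</value>
</property>
<!-- Hypothetical path: must be absolute and executable by the NM user. -->
<property>
  <name>yarn.nodemanager.health-checker.script.path</name>
  <value>/usr/local/bin/nm-health-check.sh</value>
</property>
<!-- Hypothetical argument passed to the script on each run. -->
<property>
  <name>yarn.nodemanager.health-checker.script.opts</name>
  <value>check-mounts</value>
</property>
```

Because a script that cannot be executed also marks the node unhealthy, it is worth checking the file's permissions as the NM user before enabling it.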
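NodeManager Restart
NM restart keeps the containers on a node running normally while the NodeManager itself is restarted. The NM saves the necessary information in a state store on the node; after restarting, it loads that information and resumes normal operation. The steps are as follows: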
Step 1. To enable NM restart functionality, set the following property in conf/yarn-site.xml to true.
| Property | Value |
|---|---|
| yarn.nodemanager.recovery.enabled | true (the default value is false) |
Step 2. Configure a path to the local file-system directory where the NodeManager can save its run state (the state store).
| Property | Description |
|---|---|
| yarn.nodemanager.recovery.dir | The local filesystem directory in which the node manager will store state when recovery is enabled. The default value is $hadoop.tmp.dir/yarn-nm-recovery. |
Step 3. Configure a valid RPC address for the NodeManager. After a restart the NM could otherwise come up on a different port and break existing client connections, so the ephemeral port must be replaced with a fixed one.
| Property | Description |
|---|---|
| yarn.nodemanager.address | Ephemeral ports (port 0, which is the default) cannot be used for the NodeManager's RPC server specified via yarn.nodemanager.address, as this can make the NM use different ports before and after a restart. This will break any previously running clients that were communicating with the NM before the restart. Explicitly setting yarn.nodemanager.address to an address with a specific port number (e.g. 0.0.0.0:45454) is a precondition for enabling NM restart. |
Step 4. Auxiliary services.
NodeManagers in a YARN cluster can be configured to run auxiliary services. For a completely functional NM restart, YARN relies on any auxiliary service configured to also support recovery. This usually includes (1) avoiding usage of ephemeral ports so that previously running clients (in this case, usually containers) are not disrupted after restart and (2) having the auxiliary service itself support recoverability by reloading any previous state when NodeManager restarts and reinitializes the auxiliary service.
A simple example of the above is the auxiliary service 'ShuffleHandler' for MapReduce (MR). ShuffleHandler already meets both requirements, so users and admins don't have to do anything for it to support NM restart: (1) the configuration property mapreduce.shuffle.port controls which port the ShuffleHandler on a NodeManager host binds to, and it defaults to a non-ephemeral port; (2) the ShuffleHandler service already supports recovery of previous state after the NM restarts.
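
Taken together, Steps 1 to 3 might translate into a yarn-site.xml fragment like the sketch below. The address 0.0.0.0:45454 is the example given in Step 3; the recovery directory is a hypothetical path chosen for illustration (the documented default is $hadoop.tmp.dir/yarn-nm-recovery), and the properties are assumed to go inside the existing `<configuration>` element.

```xml
<!-- Sketch of a yarn-site.xml fragment; place inside <configuration>. -->
<!-- Step 1: enable NM restart (the default is false). -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- Step 2: hypothetical local directory for the NM state store. -->
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value>
</property>
<!-- Step 3: fixed RPC port instead of the ephemeral default (port 0). -->
<property>
  <name>yarn.nodemanager.address</name>
  <value>0.0.0.0:45454</value>
</property>
```

A recovery directory outside a temporary location is one possible choice here, since state kept under a temp directory may not survive cleanup between restarts.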
Reposted from: https://www.cnblogs.com/skyrim/p/7455990.html