KingbaseES V8R6 Cluster Operations Series -- Command-Line Deployment of a repmgr-Managed Cluster + switchover Testing
This deployment does not use the securecmd/kbha tools, so no SSH trust from the ordinary OS user to the root user is required.
I. Environment Preparation
1. Create the OS user
Create the OS group and user that will own the database installation; do this on every node.
Log in to each server as root, create the group and user, and set a password, as sketched below.
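A minimal sketch of the commands, assuming the group and user are both named kes86 (the names and password policy are illustrative; adjust them to your environment):

```bash
# Run as root on every node: create the installation group and user, then set a password
groupadd kes86
useradd -g kes86 -m kes86
passwd kes86
```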
2. Configure SSH trust
Configure the /etc/hosts file
```
vim /etc/hosts
192.168.57.30 node1 node1    # primary
192.168.57.10 node2 node2    # standby
```
Note: the /etc/hosts entries are optional. They are added here mainly to tell the replication slots apart; in a one-primary/multi-standby setup they make it easy to see which slot belongs to which host.
Configure SSH trust for the database OS user.
Ideally configure trust for both root and the database OS user (if you only need to access files under the database data directory, root trust is unnecessary; configuring the database user alone is enough).
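A minimal sketch of password-free SSH trust for the database OS user, using the standard OpenSSH ssh-keygen/ssh-copy-id tools; repeat in the opposite direction on node2 so the trust is mutual, and do the same as root if root trust is also wanted:

```bash
# Run as the kes86 user on node1
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # generate a key pair without a passphrase
ssh-copy-id kes86@node2                    # install the public key on the peer node
ssh kes86@node2 hostname                   # verify that login works without a password
```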
3. Operating system parameter settings
System parameter configuration: required on all nodes.
The kernel parameters are set in /etc/sysctl.conf; see the system manual for the detailed configuration.
NUMA settings
In most cases NUMA support can be disabled in the BIOS and numa=off added to the OS boot parameters, after which NUMA no longer needs any attention at the OS level. If NUMA has not been disabled in the OS, the database can still be kept away from remote NUMA memory with:
```
vm.zone_reclaim_mode=0
vm.numa_balancing=0
numactl --interleave=all
```
Avoiding OOM
```
vm.swappiness = 0          # recommended value: 0
vm.overcommit_memory = 2   # allow a limited amount of overcommit
vm.overcommit_ratio = 50   # allowed overcommit: $((($mem - $swap) * 100 / $mem))
```
Dirty-page write-back policy for the file cache
```
vm.dirty_background_ratio = 5    # lowered from 10 to 5
vm.dirty_ratio = 10~15
vm.dirty_background_bytes = 25
```
When the page cache becomes dirty, the dirty pages eventually have to be flushed out of memory; Linux controls when this write-back happens through the following parameters.
vm.dirty_background_ratio / vm.dirty_background_bytes: the ratio (or absolute amount) of dirty pages allowed in memory; once it is reached, background flushing starts. Depending on the speed of the storage, vm.dirty_background_ratio is usually set to 5, or vm.dirty_background_bytes to about 25% of the device's write speed.
vm.dirty_ratio / vm.dirty_bytes: foreground flushing blocks reads and writes, so vm.dirty_ratio is normally set higher than vm.dirty_background_ratio; this cap keeps the system from holding too much unwritten data in memory and reduces the risk of data loss.
vm.dirty_expire_centisecs: the maximum time a dirty page may stay in memory.
vm.dirty_writeback_centisecs: the interval at which the flush daemons (pdflush/flush/kdmflush) wake up.
Reserve enough physical memory for the file read/write cache.
Tune the dirty-page flush frequency according to the characteristics of the workload and the I/O capability of the hardware.
If physical I/O performance is strong, lower the flush frequency and reduce the dirty-page ratio; if it is weak, adjust in the opposite direction.
After editing, run `sysctl -p` as root to make the parameters take effect.
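As a sketch, the memory-related values discussed above can be collected in /etc/sysctl.conf like this; the exact numbers are examples only and should follow your hardware and the system manual:

```bash
# /etc/sysctl.conf (excerpt with the parameters discussed in this section)
vm.swappiness = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 50
vm.zone_reclaim_mode = 0
vm.numa_balancing = 0
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
```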
4. Adjust system resource limits
Resource limits are set in /etc/security/limits.conf and /etc/security/limits.d/20-nproc.conf.
If the system has only /etc/security/limits.conf and no /etc/security/limits.d/20-nproc.conf, modify just the former; if both exist, modify both.
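A sketch of typical entries, assuming the installation user is kingbase; the limits shown are illustrative and should be sized for your workload:

```bash
# /etc/security/limits.conf (and limits.d/20-nproc.conf if it exists)
kingbase  soft  nofile  65536
kingbase  hard  nofile  65536
kingbase  soft  nproc   65536
kingbase  hard  nproc   65536
kingbase  soft  core    unlimited
kingbase  hard  core    unlimited
```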
After setting the limits, switch to the kingbase user and verify them with `ulimit -a`.
# unlimited as the value means no limit
# * applies to all users; alternatively, set values only for root and the kingbase installation user
# nofile is the maximum number of open files, nproc the maximum number of processes, core the upper limit on core-file size
# soft is a warning threshold, hard is the real limit beyond which errors occur; both can be raised further if needed
# PAM applies these limits per session, so log in again as root and kingbase and check with `ulimit -n`
# Note: the nofile hard limit must not exceed /proc/sys/fs/nr_open, otherwise you will not be able to log in again after logging out
5. Change the disk I/O scheduler
Check the disk I/O scheduler
View the current I/O scheduling policy
• Change the I/O scheduling policy to deadline (the deadline algorithm guarantees that no request starves)
• {DEVICE-NAME} = the disk device name
For spinning disks, the deadline scheduler is recommended; it suits single-purpose, I/O-heavy workloads such as databases.
For SSDs, the noop scheduler is recommended.
List the I/O schedulers supported by the system:
Check the I/O scheduler of a specific disk:
```
cat /sys/block/{DEVICE-NAME}/queue/scheduler
```
For an ordinary spinning disk, change the I/O scheduler to deadline (the deadline algorithm guarantees that no request starves).
Example of changing the I/O scheduler to deadline:
On Linux 6 (RHEL/CentOS 6):
```
echo deadline > /sys/block/sda/queue/scheduler
```
It can also be set in grub:
```
kernel /vmlinuz-2.6.18-274.3.1.el5 ro root=LABEL=/ elevator=deadline crashkernel=128M@16M quiet console=tty1 console=ttyS1,115200 panic=30 transparent_hugepage=never
initrd /initrd-2.6.18-274.3.1.el5.img
```
On Linux 7 (RHEL/CentOS 7):
```
grubby --update-kernel=ALL --args="elevator=deadline"
```
Transparent huge pages are not recommended for now, and on heavily loaded large database systems the operating system's transparent huge page feature should be disabled:
```
grubby --update-kernel=ALL --args="transparent_hugepage=never"
```
Disk arrays generally use a write-back cache backed by a battery, commonly called a battery-backed write cache (BBC or BBWC).
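To verify the setting, the current THP state can be read from sysfs (a standard Linux path, not KingbaseES-specific); after rebuilding grub and rebooting, the bracketed value should be never:

```bash
cat /sys/kernel/mm/transparent_hugepage/enabled   # e.g. "always madvise [never]"
```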
6. Common kingbase.conf parameters
wal_log_hints=on
    Must be enabled. When the primary and standby timelines diverge during a switchover, this allows sys_rewind to resynchronize them.
full_page_writes=on
    Must be enabled, for the same reason as wal_log_hints.
archive_mode=on
    Enables archive mode. When archive_mode is enabled, completed WAL segments are sent to archive storage by archive_command. Besides off/disabled there are two modes, on and always; during normal operation there is no difference between them, but with always the WAL archiver is also enabled during archive recovery and in standby mode, and all files restored from the archive or received via streaming replication are archived again. archive_mode and archive_command are separate variables, so archive_command can be changed without changing the archive mode. This parameter can only be set at server start, and it cannot be enabled while wal_level is set to minimal.
wal_level=replica
    minimal records only the basic data operations needed for ACID. replica additionally records the transaction types and data needed to keep the primary and standbys consistent. logical additionally records complete row data (mainly the old values of updates, of which lower levels keep only a marker), which is required for logical decoding; logical decoding serves logical replication, where two independent primaries synchronize table data.
archive_command='test ! -f /home/kingbase/data/sys_wal/archive_status/%f && cp %p /home/kingbase/archive/%f'
    The command used to archive WAL. When archive_mode is enabled but archive_command is an empty string, WAL archiving is temporarily disabled and the database keeps accumulating WAL segment files. Setting archive_command to /bin/true disables archiving, but it also breaks the archive chain, and a broken chain cannot be used for archive recovery; be aware of this. Examples:
    archive_command = 'test ! -f /mnt/archivedir/%f && cp %p /mnt/archivedir/%f'
    archive_command = 'gzip < %p > /mnt/archivedir/%f'
archive_timeout=0
    archive_command runs a local shell command and is invoked only for completed WAL segments, so if the server generates little WAL (or has long idle periods) there can be a long delay between a transaction completing and it being safely stored in the archive. To limit how much data can remain unarchived, archive_timeout can force the server to switch to a new WAL segment periodically: when the value is greater than zero, the server switches to a new segment whenever that many seconds have passed since the last segment switch and there has been any database activity, including a checkpoint (the switch is skipped if there has been no activity). Note that segments closed early by a forced switch still occupy the same space as full segments, so a very short archive_timeout is unwise: it bloats the archive storage. An archive_timeout of about a minute is usually reasonable.
synchronous_commit=on
    Possible values: on, off, local, remote_write, remote_apply. In a streaming-replication environment the performance impact from smallest to largest is: off (async) > on (async) > remote_write (sync) > on|local (sync) > remote_apply (sync).
    remote_apply: COMMIT is returned only after the WAL has been applied (data updated) on the standby, so the change is immediately visible there; since synchronization is fully guaranteed, this suits load-distribution scenarios where the standby must always have the latest data.
    on: COMMIT is returned after the WAL has been written on the standby; the best balance between performance and reliability.
    remote_write: COMMIT is returned after the WAL has been transmitted to the standby.
    local: COMMIT is returned after the WAL has been written on the primary.
    off: COMMIT is returned immediately, without waiting for the primary's WAL write to complete.
synchronous_standby_names='node2'
    If this parameter is not configured, replication defaults to async. The name KingbaseES generates follows the pattern ''kingbase_*&+_++''. On the primary, synchronous_standby_names is set to the standby node names; when the node itself is a standby, the parameter refers to the node itself.
max_wal_size=1GB
    The maximum size WAL may grow to between two checkpoints; this is a soft limit. If the WAL volume exceeds max_wal_size, WAL space is kept close to max_wal_size, because checkpoints are triggered and unneeded segment files are removed until the system is back under the limit. If the WAL volume is below max_wal_size, at least min_wal_size of WAL space is kept. Normally the WAL space is estimated dynamically between min_wal_size and max_wal_size, based on a moving average of the number of WAL files used in previous checkpoint cycles; if actual usage exceeds the estimate, the average is increased immediately.
min_wal_size=100MB
    The amount of WAL kept after a checkpoint for future recycling; it ensures enough WAL space is reserved to absorb spikes in WAL usage. If WAL grows abnormally, or keeps growing past max_wal_size and is not reduced or recycled after a checkpoint, check the following: independent of max_wal_size, the most recent wal_keep_segments + 1 WAL files are always kept; with WAL archiving enabled, old segments cannot be removed or recycled before they are archived; with replication slots enabled, a slow or failed standby that uses a slot also prevents WAL removal; an unfinished checkpoint or an uncommitted long transaction has the same effect.
checkpoint_completion_target=0.5
    For example: 100GB / (0.5*5*60) * 1024 ≈ 670 MB/s, versus 100GB / (0.9*5*60) * 1024 ≈ 380 MB/s.
checkpoint_timeout=10min
    The lower the resulting write rate, the better the user experience and the higher the perceived performance; conversely, small values may cause I/O spikes and apparent stalls.
shared_buffers=1GB
    A good value is about 1/3 of RAM.
archive_cleanup_command
    Provides a command to clean up archived WAL files no longer needed by the standby server. %r stands for the WAL file of the last valid restart point: it is the earliest file that must be kept so that a restore can be restarted, so all files older than %r can be removed safely. (A restart point is the point from which a standby server can restart recovery.) This can be used to truncate the archive down to the minimum needed to restart from the current restore, and the parameter is typically used in a single-standby configuration. If the command is terminated by a signal or the shell reports an error, a fatal error is raised.
About the synchronous_standby_names parameter
This parameter specifies how priority-based and quorum-based multiple synchronous standbys affect transaction commits.
With synchronous_standby_names = "FIRST 2 (*)", any two of the standbys streaming from this primary are chosen as sync, and the rest are potential.
With synchronous_standby_names = "ANY 2 (*)", all standbys streaming from this primary are in the quorum state.
The sync_state column of pg_stat_replication shows the result: under the FIRST syntax, synchronous nodes have the value sync and asynchronous ones async, while nodes that match the synchronous list but are excluded by the count limit are potential. Under the ANY syntax the value is quorum; there is no synchronous/asynchronous distinction as such: once the primary has received feedback from the specified number of standbys for a write transaction, it returns success to the client.
Multiple synchronous standbys: synchronous replication supports one or more synchronous standby servers, and transactions wait until all synchronous standbys have confirmed receipt of their data. The number of synchronous standbys whose replies a transaction must wait for is given by synchronous_standby_names. The parameter also specifies a list of standby names and a method (FIRST or ANY) for choosing synchronous standbys from that list.
The FIRST method specifies priority-based synchronous replication: transaction commits wait until their WAL records have been replicated to the required number of synchronous standbys chosen by priority. Standbys appearing earlier in the list have higher priority and are considered synchronous; standbys further down the list are potential synchronous standbys. If any current synchronous standby disconnects for any reason, it is immediately replaced by the standby with the next highest priority. An example of priority-based multiple synchronous standbys:
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
In this example, if four standby servers s1, s2, s3 and s4 are running, s1 and s2 are chosen as synchronous standbys because they appear first in the list of names. s3 is a potential synchronous standby and takes over when either s1 or s2 fails. s4 is an asynchronous standby because its name is not in the list.
The ANY method specifies quorum-based synchronous replication: transaction commits wait until their WAL records have been replicated to at least the required number of standbys from the list. An example of quorum-based multiple synchronous standbys:
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
In this example, if four standby servers s1, s2, s3 and s4 are running, commits wait for replies from at least any two of them. s4 is an asynchronous standby because its name is not in the list.
The synchronization state of the standby servers can be viewed in the pg_stat_replication view.
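To see how each standby is classified under the current setting, the sync_state column of pg_stat_replication can be queried (the same view is used later in this article to check replication status):

```sql
-- sync_state is sync/potential under a FIRST list, quorum under an ANY list, async otherwise
SELECT application_name, state, sync_priority, sync_state
FROM pg_stat_replication;
```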
7. repmgr.conf parameter descriptions
# Logging
log_level=INFO
log_file='/home/kes86/cluster/log/repmgrd.log'
    The repmgr log file.
log_status_interval=10
    Makes repmgrd emit a status log line at the specified interval (in seconds, default 300) describing its current state, e.g.: [2022-10-30 17:51:15] [INFO] monitoring primary node "node1" (ID: 1) in normal state
failover=automatic/manual
    Failover mode.
promote_command='/home/kes86/kes86/cluster/bin/repmgr standby promote -f /home/kes86/kes86/cluster/etc/repmgr.conf'
    Executed in a failover situation when repmgrd determines that the current node is to become the new primary.
follow_command='/home/kes86/kes86/cluster/bin/repmgr standby follow -f /home/kes86/kes86/cluster/etc/repmgr.conf --upstream-node=%n'
    %n is replaced by repmgrd with the ID of the new primary node. If it is not provided, repmgr standby follow tries to determine the new primary by itself, but if the old primary comes back online after the new primary has been promoted, there is a risk that the node keeps following the old primary.
repmgrd_pid_file='/home/kes86/kes86/cluster/etc/hamgrd.pid'
    The pid file of the running repmgrd.
# High-availability settings
location='location1'
    An arbitrary string defining the node's location, used during failover to check the visibility of the current primary.
priority=100
    Node priority, possibly used when electing a new primary (lsn > priority > node_id); 0 means the node will never be promoted to primary.
monitoring_history=yes
    Whether to write monitoring data to the "monitoring_history" table.
reconnect_interval=10
    Interval, in seconds, between reconnection attempts before failover.
reconnect_attempts=6
    Number of reconnection attempts before failover.
connection_check_type=ping
    ping: repmgr tests the connection with PQping(); connection: attempts to open a new connection to the node; query: executes an SQL statement on the node over the existing connection.
monitor_interval_secs=5
    Interval for writing monitoring data.
use_replication_slots=true
    Whether to use replication slots.
8. Cluster-related views
To manage the replication cluster effectively, repmgr keeps a dedicated schema of database objects that stores information about the repmgr cluster services.
This schema is created automatically by the repmgr extension when the repmgr service is deployed; the extension is installed as the first step of initializing the repmgr cluster (registering the repmgr primary node).
It contains the following objects:
1. Tables
    repmgr.events: records events of interest
    repmgr.nodes: connection and status information for every server in the replication cluster
    repmgr.monitoring_history: historical standby-monitoring data written by repmgrd
2. Views
    repmgr.show_nodes: based on repmgr.nodes, additionally showing the name of each server's upstream node
    repmgr.replication_status: when repmgrd monitoring is enabled, shows the current monitoring status of every standby
The repmgr metadata schema can be stored in an existing database or in its own dedicated database. Note that it must not be stored on a database server that is not part of the replication cluster managed by repmgr. A database user must be available for repmgr to access this database and make the necessary changes. This user does not need to be a superuser, but some operations (such as the initial installation of the repmgr extension) require a superuser connection (which can be specified when needed with the --superuser command-line option).
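For example, once the cluster has been registered, the metadata can be inspected directly from ksql; the column names below follow the upstream repmgr metadata schema, so verify them against your installed version:

```sql
-- nodes known to repmgr and, via the view, each node's upstream
SELECT node_id, node_name, type, active FROM repmgr.nodes;
SELECT node_name, upstream_node_name FROM repmgr.show_nodes;
```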
9. About the KBHA tool:
The kbha tool protects the high availability of a KingbaseES cluster: when a database server goes down because of a power failure or another fault, kbha automatically recovers the database once the fault has been cleared. If automatic failover is not used, the kbha process does not need to be started and no virtual IP needs to be added.
II. Deployment Process
1. Configure the primary node:
Upload the db.zip file to the database OS user's home directory.
```
scp db.zip kes86@192.168.57.10:~
```
1. Unpack the uploaded db.zip and initialize the database
Unpack the uploaded db.zip file:
```
unzip db.zip -d <path>
```
Initialize the database. Only the primary node is initialized; the standby node does not need this step:
```
initdb -Usystem -Eutf-8 -mpg -D /home/kes86/data -A scram-sha-256 -x system --data-checksums
```
```
[kes86@node1 ~]$ initdb -Usystem -Eutf-8 -mpg -D /home/kes86/data -A scram-sha-256 -x system --data-checksums
The files belonging to this database system will be owned by user "kes86".
This user must also own the server process.
The database cluster will be initialized with locale "zh_CN.UTF-8".
initdb: could not find suitable text search configuration for locale "zh_CN.UTF-8"
The default text search configuration will be set to "simple".
The comparision of strings is case-sensitive.
Data page checksums are enabled.
creating directory /home/kes86/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Asia/Shanghai
creating configuration files ... ok
Begin setup encrypt device
initializing the encrypt device ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
create security database ... ok
load security database ... ok
syncing data to disk ... ok

Success. You can now start the database server using:
    sys_ctl -D /home/kes86/data -l logfile start
```
2. Edit the kingbase.conf configuration file
```
shared_preload_libraries='repmgr'
listen_addresses = '*'
port = 54322
full_page_writes = on
wal_log_hints = on
max_wal_senders = 32
wal_keep_segments = 512
max_connections = 100
wal_level = replica
archive_mode = on
archive_command = '/bin/cp -f %p /home/kes86/data/archive/%f'
control_file_copy = '/home/kes86/data/copy_file_bak'
max_replication_slots = 32
hot_standby = on
hot_standby_feedback = on
logging_collector = on
log_destination = 'csvlog'
log_checkpoints = on
log_replication_commands = on
wal_compression = on
synchronous_commit = remote_write
max_prepared_transactions = 100
#shared_buffers = 512MB
fsync = on
#synchronous_standby_names='ANY 1(node2)'
```
Note: the directory used by archive_command must be created in advance (or already exist):
```
mkdir -p /home/kes86/data/archive
```
Leave synchronous_standby_names unconfigured for now (or comment it out); if it is set at this point, every operation has to wait for a response from the standby node.
3. Edit the sys_hba.conf file:
Append the following to the end of sys_hba.conf (for password security you may instead let the replication role connect without password verification only, or use encrypted passwords):
```
# add replication priv
host all         all 0.0.0.0/0 trust
host replication all 0.0.0.0/0 trust
host replication all ::0/0     trust
```
4. Start the database; it is best to use an absolute path when starting it.
Note: if the server was started with a relative path, later database backups may run into problems.
```
[kes86@node1 ~]$ sys_ctl -D /home/kes86/data/ start
waiting for server to start....
2022-10-30 15:38:45.726 CST [7921] LOG:  starting KingbaseES V008R006C006B0021 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
2022-10-30 15:38:45.726 CST [7921] LOG:  listening on IPv4 address "0.0.0.0", port 54322
2022-10-30 15:38:45.726 CST [7921] LOG:  listening on IPv6 address "::", port 54322
2022-10-30 15:38:45.731 CST [7921] LOG:  listening on Unix socket "/tmp/.s.KINGBASE.54322"
2022-10-30 15:38:45.759 CST [7921] LOG:  redirecting log output to logging collector process
2022-10-30 15:38:45.759 CST [7921] HINT:  Future log output will appear in directory "sys_log".
 done
server started
```
5. Create the replication user and grant privileges
```
create user repmgr replication login;
alter user repmgr password 'repmgr';
alter user repmgr superuser createdb createrole;
```
6. Create the repmgr database to store repmgr metadata
```
create database repmgr encoding UTF8;
alter database repmgr owner to repmgr;
```
At this point the primary node's database configuration is complete.
7. Configure the repmgr.conf file; by default it lives in the directory at the same level as the software's bin directory.
Note: when securecmdd is not used, use_scmd='off' must be set explicitly; if it is commented out or left at its default, it is on.
```
[kes86@node1 etc]$ cat repmgr.conf
node_id=1
node_name='node1'
promote_command='/home/kes86/kes86/cluster/bin/repmgr standby promote -f /home/kes86/kes86/cluster/etc/repmgr.conf'
follow_command='/home/kes86/kes86/cluster/bin/repmgr standby follow -f /home/kes86/kes86/cluster/etc/repmgr.conf --upstream-node=%n'
conninfo='host=node1 user=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
log_file='/home/kes86/kes86/cluster/log/hamgr.log'
log_level=info
#kbha_log_file='/home/kes86/kes86/cluster/log/kbha.log'
data_directory='/home/kes86/data'
sys_bindir='/home/kes86/kes86/cluster/bin'
#scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890'
reconnect_attempts=10
reconnect_interval=6
failover='manual'
recovery='standby'
monitoring_history='no'
#trusted_servers='192.168.57.1'
#virtual_ip='192.168.57.32/24'
#net_device='enp0s17'
#net_device_ip='192.168.57.40'
#ipaddr_path='/sbin'
#arping_path='/home/kes86/kes86/cluster/bin'
synchronous='quorum'
#repmgrd_pid_file='/home/kes86/kes86/cluster/etc/hamgrd.pid'
#kbha_pid_file='/home/kes86/kes86/cluster/etc/kbha.pid'
ping_path='/usr/bin'
auto_cluster_recovery_level=1
use_check_disk=off
use_scmd='off'
#running_under_failure_trusted_servers=on
connection_check_type='mix'
location='location1'
priority=100
```
If the securecmdd node-communication tool is not deployed, the following parameter does not need to be configured:
#scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890'
When securecmdd is not deployed, this parameter must be set to off:
use_scmd='off'
A manual switchover does not require kbha to be configured.
kbha tool options
```
[kes86@node1 ~]$ kbha --help
kbha: replication management daemon for Kingbase

kbha starts the repmgrd and do auto-recovery for Kingbase.

Usage:
  kbha [OPTIONS]

General options:
  -f, --config-file=PATH   path to the repmgr configuration file
  -A, --action={rejoin|register|follow|daemon|loadvip|unloadvip|arping|stopdb|startdb|updateinfo|removeinfo|check_ip}
                           what to do for program, default is 'daemon'
  --dev=device name, if -A is check_ip, could input net device name
  --ip=ip address, if -A is check_ip, must be input ip which will be check, support IPV4 and IPV6
  --upstream-node-id=NODE_ID, if -A is rejoin or follow, this node will follow the node of NODE_ID

Database connection options:
  -d, --dbname=DBNAME      database to connect to (default: "kes86")
  -h, --host=HOSTNAME      database server host
  -p, --port=PORT          database server port (default: "54322")
  -U, --username=USERNAME  database user name to connect as (default: "kes86")

Other options:
  -?, --help               show this help, then exit
  -V, --version            output version information, then exit
  -v, --verbose            display additional log output (useful for debugging)
```
Add a virtual IP with the kbha tool
Command to add the VIP
kbha -A loadvip
kbha -A arping
Command to remove the VIP
kbha -A unloadvip
8. Register the primary node with repmgr
```
[kes86@node1 ~]$ repmgr primary register
[INFO] connecting to primary database...
[DEBUG] connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=node1 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[NOTICE] attempting to install extension "repmgr"
[NOTICE] "repmgr" extension successfully installed
[INFO] primary registration complete
[NOTICE] primary node record (ID: 1) registered
```
Check that the IP addresses are as expected
```
[kes86@node1 ~]$ ifconfig
enp0s8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.30  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::a00:27ff:fede:9422  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:de:94:22  txqueuelen 1000  (Ethernet)
        RX packets 24  bytes 2976 (2.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 32  bytes 3616 (3.5 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
enp0s9: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.61  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::a00:27ff:fef0:68ca  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:f0:68:ca  txqueuelen 1000  (Ethernet)
        RX packets 255  bytes 33730 (32.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 29  bytes 4466 (4.3 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
enp0s17: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.57.40  netmask 255.255.255.0  broadcast 192.168.57.255
        inet6 fe80::a00:27ff:fe59:efcf  prefixlen 64  scopeid 0x20<link>
        ether 08:00:27:59:ef:cf  txqueuelen 1000  (Ethernet)
        RX packets 1718  bytes 150972 (147.4 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 997  bytes 138337 (135.0 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 756  bytes 280777 (274.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 756  bytes 280777 (274.1 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:68:17:f3  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```
9. Log in to the primary database and run the following:
```
alter system set synchronous_standby_names='ANY 1(node2)';
```
10. Reload the configuration file
```
select pg_reload_conf();
```
III. Configure the Standby Node:
1. Install the database software on the standby node; the standby's software directory must match the primary's. There are two ways:
Copy the software directory to the standby node:
```
scp -r kes86 kes86@node2:~
```
Or unpack the db.zip archive:
```
unzip db.zip -d /home/kes86/kes86/cluster
```
Configure the database user's environment variables:
```
vi .bash_profile
export KDBHOME=/home/kes86/kes86/cluster
export LANG=zh_CN.UTF-8
export KDBDATA=/home/kes86/data
export PATH=/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin:$KDBHOME/bin
```
Run `source .bash_profile` to make the environment variables take effect.
2. Configure the repmgr.conf file; by default it lives in the directory at the same level as the KingbaseES bin directory.
Note: when securecmdd is not used, use_scmd='off' must be set explicitly; if it is commented out or left at its default, it is on. (The listing below was captured on node1; on the standby, node_id, node_name and conninfo must refer to the standby itself, e.g. node_id=2, node_name='node2', host=node2.)
```
[kes86@node1 etc]$ cat repmgr.conf
node_id=1
node_name='node1'
promote_command='/home/kes86/kes86/cluster/bin/repmgr standby promote -f /home/kes86/kes86/cluster/etc/repmgr.conf'
follow_command='/home/kes86/kes86/cluster/bin/repmgr standby follow -f /home/kes86/kes86/cluster/etc/repmgr.conf --upstream-node=%n'
conninfo='host=node1 user=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3'
log_file='/home/kes86/kes86/cluster/log/hamgr.log'
log_level=info
#kbha_log_file='/home/kes86/kes86/cluster/log/kbha.log'
data_directory='/home/kes86/data'
sys_bindir='/home/kes86/kes86/cluster/bin'
#scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890'
reconnect_attempts=10
reconnect_interval=6
failover='manual'
recovery='standby'
monitoring_history='no'
#trusted_servers='192.168.57.1'
#virtual_ip='192.168.57.32/24'
#net_device='enp0s17'
#net_device_ip='192.168.57.40'
#ipaddr_path='/sbin'
#arping_path='/home/kes86/kes86/cluster/bin'
synchronous='quorum'
#repmgrd_pid_file='/home/kes86/kes86/cluster/etc/hamgrd.pid'
#kbha_pid_file='/home/kes86/kes86/cluster/etc/kbha.pid'
ping_path='/usr/bin'
auto_cluster_recovery_level=1
use_check_disk=off
use_scmd='off'
#running_under_failure_trusted_servers=on
connection_check_type='mix'
location='location1'
priority=100
```
If the securecmdd node-communication tool is not deployed, the following parameter does not need to be configured:
#scmd_options='-q -o ConnectTimeout=10 -o StrictHostKeyChecking=no -o ServerAliveInterval=2 -o ServerAliveCountMax=5 -p 8890'
When the securecmdd plugin is not used, this parameter must be set to off:
use_scmd='off'
3. Test connectivity from the standby node:
Run the following command on the standby node and check for errors:
```
repmgr -h node1 -U repmgr -d repmgr -p 54322 --upstream-node-id 1 standby clone --dry-run
```
```
[kes86@node2 ~]$ repmgr -h node1 -U repmgr -d repmgr -p 54322 --upstream-node-id 1 standby clone --dry-run
[NOTICE] destination directory "/home/kes86/data" provided
[INFO] connecting to source node
[DETAIL] connection string is: host=node1 user=repmgr port=54322 dbname=repmgr
[DETAIL] current installation size is 45 MB
[INFO] "repmgr" extension is installed in database "repmgr"
[DEBUG] 1 node records returned by source node
[DEBUG] connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=node1 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[DEBUG] upstream_node_id determined as 1
[INFO] parameter "max_replication_slots" set to 32
[INFO] parameter "max_wal_senders" set to 32
[NOTICE] checking for available walsenders on the source node (2 required)
[INFO] sufficient walsenders available on the source node
[DETAIL] 2 required, 32 available
[NOTICE] checking replication connections can be made to the source server (2 required)
[INFO] required number of replication connections could be made to the source server
[DETAIL] 2 replication connections required
[INFO] replication slots will be created by user "repmgr"
[NOTICE] standby will attach to upstream node 1
[HINT] consider using the -c/--fast-checkpoint option
[INFO] would execute:
  /home/kes86/kes86/cluster/bin/sys_basebackup -l "repmgr base backup" -D /home/kes86/data -h node1 -p 54322 -U repmgr -X stream -S repmgr_slot_2
[INFO] all prerequisites for "standby clone" are met
```
If the primary node has not yet been registered with repmgr, the command fails with an "unable to retrieve record for upstream node 1" error:
```
[kes86@node2 ~]$ repmgr -h node1 -U repmgr -d repmgr -p 54322 --upstream-node-id 1 standby clone --dry-run
[NOTICE] destination directory "/home/kes86/data" provided
[INFO] connecting to source node
[DETAIL] connection string is: host=node1 user=repmgr port=54322 dbname=repmgr
[DETAIL] current installation size is 45 MB
[INFO] "repmgr" extension is installed in database "repmgr"
[DEBUG] 0 node records returned by source node
[DEBUG] upstream_node_id determined as 1
[INFO] parameter "max_replication_slots" set to 32
[INFO] parameter "max_wal_senders" set to 32
[NOTICE] checking for available walsenders on the source node (2 required)
[INFO] sufficient walsenders available on the source node
[DETAIL] 2 required, 32 available
[NOTICE] checking replication connections can be made to the source server (2 required)
[INFO] required number of replication connections could be made to the source server
[DETAIL] 2 replication connections required
[ERROR] unable to retrieve record for upstream node 1
```
4. If the dry run reports no errors, run standby clone on the standby node (the replication slot is created automatically during the clone):
```
repmgr -h node1 -U repmgr -d repmgr -p 54322 --upstream-node-id 1 standby clone
```
```
[kes86@node2 ~]$ repmgr -h node1 -U repmgr -d repmgr -p 54322 --upstream-node-id 1 standby clone
[NOTICE] destination directory "/home/kes86/data" provided
[INFO] connecting to source node
[DETAIL] connection string is: host=node1 user=repmgr port=54322 dbname=repmgr
[DETAIL] current installation size is 45 MB
[DEBUG] 1 node records returned by source node
[DEBUG] connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=node1 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[DEBUG] upstream_node_id determined as 1
[NOTICE] checking for available walsenders on the source node (2 required)
[NOTICE] checking replication connections can be made to the source server (2 required)
[INFO] creating directory "/home/kes86/data"...
[INFO] creating replication slot as user "repmgr"
[DEBUG] create_replication_slot_sql(): creating slot "repmgr_slot_2" on upstream
[NOTICE] starting backup (using sys_basebackup)...
[HINT] this may take some time; consider using the -c/--fast-checkpoint option
[INFO] executing:
  /home/kes86/kes86/cluster/bin/sys_basebackup -l "repmgr base backup" -D /home/kes86/data -h node1 -p 54322 -U repmgr -X stream -S repmgr_slot_2
[DEBUG] connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=node1 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[NOTICE] standby clone (using sys_basebackup) complete
[NOTICE] you can now start your Kingbase server
[HINT] for example: sys_ctl -D /home/kes86/data start
[HINT] after starting the server, you need to register this standby with "repmgr standby register"
```
5. If standby clone finished without errors, start the standby database
```
[kes86@node2 ~]$ sys_ctl -D /home/kes86/data/ start
waiting for server to start....
2022-10-30 16:41:28.063 CST [1595] LOG:  starting KingbaseES V008R006C006B0021 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
2022-10-30 16:41:28.063 CST [1595] LOG:  listening on IPv4 address "0.0.0.0", port 54322
2022-10-30 16:41:28.063 CST [1595] LOG:  listening on IPv6 address "::", port 54322
2022-10-30 16:41:28.068 CST [1595] LOG:  listening on Unix socket "/tmp/.s.KINGBASE.54322"
2022-10-30 16:41:28.092 CST [1595] LOG:  redirecting log output to logging collector process
2022-10-30 16:41:28.092 CST [1595] HINT:  Future log output will appear in directory "sys_log".
 done
server started
```
6. Register the standby node with repmgr:
```
repmgr standby register
```
```
[kes86@node2 ~]$ repmgr standby register
[INFO] connecting to local node "node2" (ID: 2)
[DEBUG] connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=node2 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[INFO] connecting to primary database
[DEBUG] connecting to: "user=repmgr connect_timeout=10 dbname=repmgr host=node1 port=54322 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3 fallback_application_name=repmgr options=-csearch_path="
[WARNING] --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
[NOTICE] failed to update nodes_info file on primary node.
[INFO] standby registration complete
[NOTICE] standby node "node2" (ID: 2) successfully registered
```
7. Check cluster node status with repmgr
```
[kes86@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        |         | host=node1 user=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes | host=node2 user=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
```
Configure the .encpwd password file:
When ksql logs in, it reads credentials from the .encpwd password file in the database installation user's home directory and then connects to the database.
The password file is a plain text file; you can create it with vi or generate it with the sys_encpwd tool shipped with Kingbase.
How to use the sys_encpwd tool:
The tool takes five parameters, and all five must be supplied for the configuration to succeed.
[kes86@node2 ~]$ sys_encpwd -H * -P 54322 -D * -U repmgr -W repmgr
[kes86@node2 ~]$ sys_encpwd -H * -P 54322 -D * -U system -W system
8. Back on the primary node, log in to the primary database and run the following queries to check the replication status and the replication-slot information:
```
[kes86@node1 ~]$ ksql -Usystem -dtest
ksql (V8.0)
Type "help" for help.

test=# \x
Expanded display is on.
test=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 2170
usesysid         | 16384
usename          | repmgr
application_name | node2
client_addr      | 192.168.57.40
client_hostname  |
client_port      | 57184
backend_start    | 2022-12-20 16:52:08.272839+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/6001940
write_lsn        | 0/6001940
flush_lsn        | 0/6001940
replay_lsn       | 0/6001940
write_lag        | 00:00:00.000869
flush_lag        | 00:00:00.004688
replay_lag       | 00:00:00.00547
sync_priority    | 0
sync_state       | async
reply_time       | 2022-12-20 16:53:11.721118+08

test=# select * from pg_replication_slots ;
-[ RECORD 1 ]-------+--------------
slot_name           | repmgr_slot_2
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 2170
xmin                | 577
catalog_xmin        |
restart_lsn         | 0/6001940
confirmed_flush_lsn |
```
As the output shows, the cluster is currently replicating in async (asynchronous) mode. To switch to synchronous mode, change the following parameter on the primary:
```
alter system set synchronous_standby_names='ANY 1(node2)';
```
and reload the configuration:
```
select pg_reload_conf();
```
```
[kes86@node1 ~]$ ksql -Usystem -dtest -p54322
ksql (V8.0)
Type "help" for help.

test=# alter system set synchronous_standby_names='ANY 1(node2)';
ALTER SYSTEM
test=# select pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)
```
9. Check the synchronization mode after changing synchronous_standby_names
```
test=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 2170
usesysid         | 16384
usename          | repmgr
application_name | node2
client_addr      | 192.168.57.40
client_hostname  |
client_port      | 57184
backend_start    | 2022-12-20 16:52:08.272839+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/6002418
write_lsn        | 0/6002418
flush_lsn        | 0/6002418
replay_lsn       | 0/6002418
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 1
sync_state       | quorum
reply_time       | 2022-12-20 16:55:56.498748+08

test=# select * from pg_replication_slots ;
-[ RECORD 1 ]-------+--------------
slot_name           | repmgr_slot_2
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 2170
xmin                | 577
catalog_xmin        |
restart_lsn         | 0/6002418
confirmed_flush_lsn |
```
10. Start the repmgrd process (on all nodes). If auto failover is not used, this service does not have to be started.
The repmgrd daemon actively monitors the servers in the replication cluster and performs the following tasks:
    monitors and records cluster replication performance;
    performs failover by detecting failure of the primary and promoting the most suitable standby;
    delivers notifications about cluster events to a user-defined script, which can, for example, send alerts by e-mail.
repmgrd behaves differently depending on the local database role:
    on the primary, repmgrd only monitors the local database and handles automatic recovery and switching between synchronous and asynchronous replication;
    on a standby, repmgrd monitors both the local database and the primary and handles automatic failover and replication-slot removal.
Set the location parameter to the same value in every node's repmgr.conf; if it is not set, the default is already the same on all nodes (location='location1'). Starting repmgrd also requires shared_preload_libraries='repmgr' in kingbase.conf.
11. repmgrd start command (run on all nodes):
```
repmgrd -d -v -f kes86/cluster/etc/repmgr.conf
```
```
[kes86@node1 ~]$ repmgrd -d -v -f kes86/cluster/etc/repmgr.conf
[2022-12-20 17:10:57] [NOTICE] using provided configuration file "kes86/cluster/etc/repmgr.conf"
[2022-12-20 17:10:57] [NOTICE] redirecting logging output to "/home/kes86/kes86/cluster/log/hamgr.log"
```
Start the kbha service (on all nodes). If auto failover is not used, it is recommended not to start this service.
kbha -A daemon -f kes86/cluster/etc/repmgr.conf
[kes86@node1 ~]$ kbha -A daemon -f kes86/cluster/etc/repmgr.conf
[2022-12-20 17:12:57] [NOTICE] redirecting logging output to "/home/kes86/kes86/cluster/log/kbha.log"
repmgrd log rotation
To keep the current repmgrd log file (the file specified by the log_file parameter in repmgr.conf) from growing without bound, configure logrotate on your system to rotate it periodically.
/data/sys_log/repmgr/repmgrd.log {
missingok
compress
rotate 52
maxsize 500M
weekly
create 0600 postgres postgres
postrotate
/usr/bin/killall -HUP repmgrd
endscript
}
IV. switchover Test:
1. Check cluster status on the primary node
```
[kes86@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location  | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+-----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | location1 | 100      | 7        |         | host=node1 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | location1 | 100      | 7        | 0 bytes | host=node2 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
```
Check service status on the primary node:
```
[kes86@node1 ~]$ repmgr service status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | * running |          | running | 2730 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 2017 | no      | 0 second(s) ago
```
2. Check cluster status on the standby node
```
[kes86@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location  | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+-----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | location1 | 100      | 7        |         | host=node1 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | location1 | 100      | 7        | 0 bytes | host=node2 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
```
Check service status on the standby node:
```
[kes86@node2 ~]$ repmgr service status
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | * running |          | running | 2730 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 2017 | no      | 0 second(s) ago
```
3. Run the switchover command on the standby node:
repmgr standby switchover --dry-run
If no warning or error messages appear during the dry run, the switchover can proceed normally.
Perform the actual switchover.
4. On the standby node run: repmgr standby switchover
During the switchover the old primary's replication slot is removed.
```
[kes86@node2 ~]$ repmgr standby switchover
[NOTICE] executing switchover on node "node2" (ID: 2)
[INFO] The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=14 "
[NOTICE] attempting to pause repmgrd on 2 nodes
[INFO] pausing repmgrd on node "node1" (ID 1)
[INFO] pausing repmgrd on node "node2" (ID 2)
[NOTICE] local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
[NOTICE] stopping current primary node "node1" (ID: 1)
[NOTICE] issuing CHECKPOINT on node "node1" (ID: 1)
[DETAIL] executing server command "/home/kes86/kes86/cluster/bin/sys_ctl -D '/home/kes86/data' -l /home/kes86/kes86/cluster/bin/logfile -W -m fast stop"
[INFO] checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
[INFO] checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
[INFO] checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
[INFO] checking for primary shutdown; 4 of 60 attempts ("shutdown_check_timeout")
[INFO] checking for primary shutdown; 5 of 60 attempts ("shutdown_check_timeout")
[INFO] checking for primary shutdown; 6 of 60 attempts ("shutdown_check_timeout")
[NOTICE] current primary has been cleanly shut down at location 0/D000028
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node2" (ID: 2) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 0/D0000A0
[INFO] SET synchronous TO "async" on primary host
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[NOTICE] issuing CHECKPOINT
[NOTICE] node "node2" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
[NOTICE] switchover was successful
[DETAIL] node "node2" is now primary and node "node1" is attached as standby
[INFO] unpausing repmgrd on node "node1" (ID 1)
[INFO] unpause node "node1" (ID 1) successfully
[INFO] unpausing repmgrd on node "node2" (ID 2)
[INFO] unpause node "node2" (ID 2) successfully
[NOTICE] STANDBY SWITCHOVER has completed successfully
```
After the switchover succeeds, check the cluster status:
```
[kes86@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location  | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+-----------+----------+----------+---------+-------------------
 1  | node1 | standby |   running | node2    | location1 | 100      | 7        | 0 bytes | host=node1 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | primary | * running |          | location1 | 100      | 8        |         | host=node2 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
```
Now switch the roles back to the original primary and standby with another switchover:
```
[kes86@node1 ~]$ repmgr standby switchover
[NOTICE] executing switchover on node "node1" (ID: 1)
[INFO] The output from primary check cmd "repmgr node check --terse -LERROR --archive-ready --optformat" is: "--status=OK --files=12 "
[NOTICE] attempting to pause repmgrd on 2 nodes
[INFO] pausing repmgrd on node "node1" (ID 1)
[INFO] pausing repmgrd on node "node2" (ID 2)
[NOTICE] local node "node1" (ID: 1) will be promoted to primary; current primary "node2" (ID: 2) will be demoted to standby
[NOTICE] stopping current primary node "node2" (ID: 2)
[NOTICE] issuing CHECKPOINT on node "node2" (ID: 2)
[DETAIL] executing server command "/home/kes86/kes86/cluster/bin/sys_ctl -D '/home/kes86/data' -l /home/kes86/kes86/cluster/bin/logfile -W -m fast stop"
[INFO] checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
[INFO] checking for primary shutdown; 2 of 60 attempts ("shutdown_check_timeout")
[INFO] checking for primary shutdown; 3 of 60 attempts ("shutdown_check_timeout")
[NOTICE] current primary has been cleanly shut down at location 0/E000028
[NOTICE] promoting standby to primary
[DETAIL] promoting server "node1" (ID: 1) using sys_promote()
[NOTICE] waiting for promotion to complete, replay lsn: 0/E0000A0
[INFO] SET synchronous TO "async" on primary host
[NOTICE] STANDBY PROMOTE successful
[DETAIL] server "node1" (ID: 1) was successfully promoted to primary
[NOTICE] issuing CHECKPOINT
[NOTICE] node "node1" (ID: 1) promoted to primary, node "node2" (ID: 2) demoted to standby
[NOTICE] switchover was successful
[DETAIL] node "node1" is now primary and node "node2" is attached as standby
[INFO] unpausing repmgrd on node "node1" (ID 1)
[INFO] unpause node "node1" (ID 1) successfully
[INFO] unpausing repmgrd on node "node2" (ID 2)
[INFO] unpause node "node2" (ID 2) successfully
[NOTICE] STANDBY SWITCHOVER has completed successfully
```
5. After the switchover succeeds, check the cluster status:
```
[kes86@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location  | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+-----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | location1 | 100      | 9        |         | host=node1 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | location1 | 100      | 8        | 0 bytes | host=node2 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
```
6. Log in to the database with ksql
Check that the replication slot status is normal:
```
[kes86@node1 ~]$ ksql -Usystem -dtest
ksql (V8.0)
Type "help" for help.

test=# \x
Expanded display is on.
test=# select * from pg_replication_slots ;
-[ RECORD 1 ]-------+--------------
slot_name           | repmgr_slot_2
plugin              |
slot_type           | physical
datoid              |
database            |
temporary           | f
active              | t
active_pid          | 7121
xmin                | 635
catalog_xmin        |
restart_lsn         | 0/E001BC8
confirmed_flush_lsn |

test=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+------------------------------
pid              | 7121
usesysid         | 16384
usename          | repmgr
application_name | node2
client_addr      | 192.168.57.43
client_hostname  |
client_port      | 45080
backend_start    | 2022-12-21 17:19:52.395706+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/E001BC8
write_lsn        | 0/E001BC8
flush_lsn        | 0/E001BC8
replay_lsn       | 0/E001BC8
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 1
sync_state       | quorum
reply_time       | 2022-12-21 17:24:53.675362+08
```
The switchover test has now been verified successfully.
Recommendation: if auto failover is not used, do not start the kbha service; starting just the repmgrd daemon is enough.
Command to start the repmgrd daemon (if you use sys_monitor.sh to start and stop the Kingbase database, no separate command is needed; the sys_monitor.sh script starts the repmgrd service automatically):
```
repmgrd -d -v -f /home/kes86/kes86/cluster/etc/repmgr.conf
```
Test starting and stopping the database with the sys_monitor.sh script as a non-root user.
-- Stop the database
```
[kes86@node1 ~]$ sys_monitor.sh stop
2022-12-21 17:28:06 Ready to stop all DB ...
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
There is no service "node_export" running currently.
There is no service "postgres_ex" running currently.
2022-12-21 17:28:13 begin to stop repmgrd on "[node1]".
2022-12-21 17:28:18 repmgrd on "[node1]" stop success.
2022-12-21 17:28:18 begin to stop repmgrd on "[node2]".
2022-12-21 17:28:19 repmgrd on "[node2]" stop success.
2022-12-21 17:28:19 begin to stop DB on "[node2]".
waiting for server to shut down.... done
server stopped
2022-12-21 17:28:20 DB on "[node2]" stop success.
2022-12-21 17:28:20 begin to stop DB on "[node1]".
waiting for server to shut down.... done
server stopped
2022-12-21 17:28:34 DB on "[node1]" stop success.
2022-12-21 17:28:34 Done.
```
-- Start the database
```
[kes86@node1 ~]$ sys_monitor.sh start
2022-12-21 17:30:29 Ready to start all DB ...
2022-12-21 17:30:29 begin to start DB on "[node1]".
waiting for server to start.... done
server started
2022-12-21 17:30:57 execute to start DB on "[node1]" success, connect to check it.
2022-12-21 17:30:59 DB on "[node1]" start success.
2022-12-21 17:31:00 Try to ping trusted_servers on host node1 ...
2022-12-21 17:31:00 Try to ping trusted_servers on host node2 ...
2022-12-21 17:31:00 begin to start DB on "[node2]".
waiting for server to start.... done
server started
2022-12-21 17:31:02 execute to start DB on "[node2]" success, connect to check it.
2022-12-21 17:31:03 DB on "[node2]" start success.
 ID | Name  | Role    | Status    | Upstream | Location  | Priority | Timeline | LSN_Lag | Connection string
----+-------+---------+-----------+----------+-----------+----------+----------+---------+-------------------
 1  | node1 | primary | * running |          | location1 | 100      | 9        |         | host=node1 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
 2  | node2 | standby |   running | node1    | location1 | 100      | 9        | 0 bytes | host=node2 user=repmgr password=repmgr dbname=repmgr port=54322 connect_timeout=10 keepalives=1 keepalives_idle=10 keepalives_interval=1 keepalives_count=3
2022-12-21 17:31:04 The primary DB is started.
2022-12-21 17:31:04 begin to start repmgrd on "[node1]".
[2022-12-21 17:31:12] [NOTICE] using provided configuration file "/home/kes86/kes86/cluster/bin/../etc/repmgr.conf"
[2022-12-21 17:31:12] [NOTICE] redirecting logging output to "/home/kes86/kes86/cluster/log/hamgr.log"
2022-12-21 17:31:19 repmgrd on "[node1]" start success.
2022-12-21 17:31:19 begin to start repmgrd on "[node2]".
[2022-12-21 17:31:51] [NOTICE] using provided configuration file "/home/kes86/kes86/cluster/bin/../etc/repmgr.conf"
[2022-12-21 17:31:51] [NOTICE] redirecting logging output to "/home/kes86/kes86/cluster/log/hamgr.log"
2022-12-21 17:31:22 repmgrd on "[node2]" start success.
 ID | Name  | Role    | Status    | Upstream | repmgrd | PID  | Paused? | Upstream last seen
----+-------+---------+-----------+----------+---------+------+---------+--------------------
 1  | node1 | primary | * running |          | running | 8637 | no      | n/a
 2  | node2 | standby |   running | node1    | running | 8708 | no      | 1 second(s) ago
[2022-12-21 17:31:22] [NOTICE] redirecting logging output to "/home/kes86/kes86/cluster/log/kbha.log"
[2022-12-21 17:31:56] [NOTICE] redirecting logging output to "/home/kes86/kes86/cluster/log/kbha.log"
2022-12-21 17:31:26 Done.
```
Summary