《童虎学习笔记》14分钟结合ProxySQL处理超半数MGR节点故障
| 本文章配套視頻 | https://www.ixigua.com/7086085500540289572?id=7088719800846778910 |
| 本專欄全部文章 | https://blog.csdn.net/tonghu_note/category_11755726.html |
| 總目錄 | https://blog.csdn.net/tonghu_note/article/details/124333034 |
來我的dou音 aa10246666, 看配套視頻
一、實戰(zhàn)環(huán)境
| Primary 節(jié)點(node1) | mysql 8.0.28 | 10.211.55.9 |
| Secondary1 節(jié)點(node2) | mysql 8.0.28 | 10.211.55.4 |
| Secondary2 節(jié)點(node3) | mysql 8.0.28 | 10.211.55.6 |
| ProxySQL 節(jié)點(node4) | 2.2.0 | 10.211.55.7 |
二、模擬超半數(shù)節(jié)點故障
我們模擬MGR集群中2個Secondary節(jié)點都因故障停掉了
1、先干掉一個節(jié)點,手動殺掉node2上的mysqld服務(wù)
找到mysqld進程號后,進行kill
root@node2:~# ps aux|grep mysqld
root ? ? 1155011 ?0.0 ?0.0 ? 2064 ?1412 pts/0 ? ?S ? ?10:52 ? 0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --user=mysql
mysql ? ?1155290 ?1.3 28.6 1919508 581212 pts/0 ?Sl ? 10:52 ? 0:01 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=node2.err --pid-file=node2.pid
root ? ? 1155950 ?0.0 ?0.0 ? 5908 ? 648 pts/0 ? ?S+ ? 10:54 ? 0:00 grep mysqld
root@node2:~# kill -9 1155011 1155290
2、查看MGR集群狀態(tài)
可以看到node2的狀態(tài)由ONLINE變?yōu)?span style="color:#fe2c24;">UNREACHABLE后在集群中消失了
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------------+-----------------------+------------------+
| member_host | member_state? ? ?| member_role |
+------------------+------------------------+------------------+
| node1? ? ? ? ? ? | ONLINE? ? ? ? ? ? ? ? | PRIMARY ? ? |
| node2? ? ? ? ? ? | UNREACHABLE ?| SECONDARY ? |
| node3? ? ? ? ? ? | ONLINE? ? ? ? ? ? ? ? | SECONDARY ? |
。。。等一會兒
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------------+-------------------+-------------------+
| member_host | member_state | member_role |
+------------------+---------------------+------------------+
| node1? ? ? ? ? ? | ONLINE? ? ? ? ? ?| PRIMARY ? ? |
| node3? ? ? ? ? ? | ONLINE? ? ? ? ? ?| SECONDARY ? |
3、驗證此時集群可以正常工作
連接ProxySQL節(jié)點的程序端口6033,看集群是否可以正常工作
root@node4:/var/lib/proxysql# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
Admin> show tables;
+--------------+
| Tables_in_d1 |
+--------------+
| t2 ? ? ? ? ? |
+--------------+
1 row in set (0.01 sec)
Admin> insert into t2 select 1;
Query OK, 1 row affected (0.01 sec)
Records: 1 ?Duplicates: 0 ?Warnings: 0
Admin> select * from t2;
+----+
| id |
+----+
| ?1 |
+----+
1 row in set (0.00 sec)
Admin>?
大家可以看到,MGR+ProxySQL是可以正常工作的,說明壞了一個節(jié)點不影響集群的工作
4、再干掉一個節(jié)點,手動殺掉node3上的mysqld服務(wù)
root@node3:~# ps aux|grep mysqld
root ? ? ? 69807 ?0.0 ?0.0 ? 2064 ?1520 ? ? ? ? ?S ? ?09:49 ? 0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --user=mysql
mysql ? ? ?70098 ?0.9 29.9 1949132 607712 ? ? ? ?Sl ? 09:49 ? 0:26 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data --plugin-dir=/usr/local/mysql/lib/plugin --user=mysql --log-error=node3.err --pid-file=node3.pid
root ? ? ? 70524 ?0.0 ?0.0 ? 5908 ? 644 pts/1 ? ?R+ ? 10:35 ? 0:00 grep mysqld
root@node3:~# kill -9 69807 70098
5、再查看MGR集群狀態(tài)
可以看到node3的狀態(tài)由ONLINE變?yōu)?span style="color:#fe2c24;">UNREACHABLE后,一直是這個狀態(tài)沒有其它變化
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 ? ? ? | ONLINE ? ? ? | PRIMARY ? ? |
| node3 ? ? ? | UNREACHABLE ?| SECONDARY ? |
+-------------+--------------+-------------+
2 rows in set (0.00 sec)
。。。等一會兒
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 ? ? ? | ONLINE ? ? ? | PRIMARY ? ? |
| node3 ? ? ? | UNREACHABLE ?| SECONDARY ? |
+-------------+--------------+-------------+
2 rows in set (0.00 sec)
6、驗證此時集群不可以正常工作
連接ProxySQL節(jié)點的程序端口6033,看集群是否可以正常工作
root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
No connection. Trying to reconnect...
Connection id: ? ?28
Current database: *** NONE ***
Database changed
Admin> select database();
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: ? ?29
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: ? ?30
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
Admin>?
可以看到集群已經(jīng)不可用了
連接ProxySQL節(jié)點的管理端口6032,看集群的成員節(jié)點狀態(tài)是否正常
root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname ? ?| status ?|
+--------------+-------------+---------+
| 3 ? ? ? ? ? ?| 10.211.55.6 | SHUNNED |
| 4 ? ? ? ? ? ?| 10.211.55.9 | ONLINE ?|
| 4 ? ? ? ? ? ?| 10.211.55.4 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.00 sec)
Admin>?
可以看到,其中有2個節(jié)點是離線狀態(tài)SHUNNED,唯一一個ONLINE狀態(tài)的節(jié)點已被移到了4號組(故障組),所以整個集群現(xiàn)在是不可用的狀態(tài)
三、故障處理
1、在唯一的ONLINE節(jié)點node1上執(zhí)行命令,使集群認為只有一個正常節(jié)點
set global group_replication_force_members='10.211.55.9:33061';
set global group_replication_force_members='';
2、再查看集群狀態(tài),可以看node3已經(jīng)從集群中剔除了
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 ? ? ? | ONLINE ? ? ? | PRIMARY ? ? |
+-------------+--------------+-------------+
1 row in set (0.00 sec)
mysql>?
? 3、驗證此時集群是否可以正常工作
連接ProxySQL節(jié)點的管理端口6032,看集群1號寫組是正常ONLINE的,但3號讀組是異常SHUNNED的
root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname ? ?| status ?|
+--------------+-------------+---------+
| 1 ? ? ? ? ? ?| 10.211.55.9 | ONLINE ?|
| 4 ? ? ? ? ? ?| 10.211.55.4 | SHUNNED |
| 3 ? ? ? ? ? ?| 10.211.55.6 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.01 sec)
Admin>?
由上面可以看到node1已經(jīng)從4號組移到了1號組,這個移到是ProxySQL自動移的,不是我們手工發(fā)起命令移的
再連接ProxySQL節(jié)點的程序端口6033,看集群是否可以正常工作
root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
No connection. Trying to reconnect...
Connection id: ? ?44
Current database: *** NONE ***
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
Admin> insert into t2 select 2;
Query OK, 1 row affected (0.00 sec)
Records: 1 ?Duplicates: 0 ?Warnings: 0
Admin> select * from t2;
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: ? ?45
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
No connection. Trying to reconnect...
Connection id: ? ?46
Current database: d1
ERROR 2013 (HY000): Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111
Admin>?
大家可以看到,插入語句可以執(zhí)行成功,但其它語句都無法執(zhí)行。
這是因為插入語句走的是1號寫組,而1號寫組里的node1節(jié)點是ONLINE,是可以提供服務(wù)的;而其它語句執(zhí)行不可成功,是因為其它的語句走的都是3組,而3組中沒有正常的節(jié)點;所以說現(xiàn)在的集群是只恢復(fù)了一個寫節(jié)點的集群,不是正常的狀態(tài)。
4、將node2恢復(fù)后加入到集群,并查看集群狀態(tài)
root@node2:~# mysqld_safe --user=mysql &
root@node2:~# mysql -uroot -proot
mysql> start group_replication;
Query OK, 0 rows affected (19.59 sec)
mysql> select member_host, member_state, member_role from performance_schema.replication_group_members;
+-------------+--------------+-------------+
| member_host | member_state | member_role |
+-------------+--------------+-------------+
| node1 ? ? ? | ONLINE ? ? ? | PRIMARY ? ? |
| node2 ? ? ? | ONLINE ? ? ? | SECONDARY ? |
+-------------+--------------+-------------+
2 rows in set (0.01 sec)
mysql>?
5、驗證此時集群是否可以正常工作
連接ProxySQL節(jié)點的管理端口6032,看集群1號寫組是正常ONLINE的,3號讀組中1臺是ONLINE,1臺異常SHUNNED的
root@node4:~# mysql -uadmin -padmin -h 127.0.0.1 -P6032 --prompt='Admin> '
Admin> select hostgroup_id, hostname, status from runtime_mysql_servers;
+--------------+-------------+---------+
| hostgroup_id | hostname ? ?| status ?|
+--------------+-------------+---------+
| 1 ? ? ? ? ? ?| 10.211.55.9 | ONLINE ?|
| 3 ? ? ? ? ? ?| 10.211.55.4 | ONLINE ?|
| 3 ? ? ? ? ? ?| 10.211.55.6 | SHUNNED |
+--------------+-------------+---------+
3 rows in set (0.01 sec)
Admin>?
由上面可以看到node2已經(jīng)移到了3號組,這個移到是ProxySQL自動移的,不是我們手工發(fā)起命令移的
再連接ProxySQL節(jié)點的程序端口6033,看集群是否可以正常工作
root@node4:~# mysql -uapp_user -papp_pwd -h 127.0.0.1 -P6033 --prompt='Admin> '
Admin> use d1
Admin> show tables;
+--------------+
| Tables_in_d1 |
+--------------+
| t2 ? ? ? ? ? |
+--------------+
1 row in set (0.01 sec)
Admin> select * from t2;
+----+
| id |
+----+
| ?1 |
| ?2 |
+----+
2 rows in set (0.00 sec)
Admin> insert into t2 select 3;
Query OK, 1 row affected (0.01 sec)
Records: 1 ?Duplicates: 0 ?Warnings: 0
Admin>?
上圖可以看到,插入和查詢都正常了,說明集群恢復(fù)了,之后再把node3也加進MGR中來,集群就完全恢復(fù)之前的正常狀態(tài)了。
四、總結(jié)
當(dāng)出現(xiàn)類擬的故障時,如下的命令是非常重要的,大家一定要記住
set global group_replication_force_members='10.211.55.9:33061';
set global group_replication_force_members='';
總結(jié)
以上是生活随笔為你收集整理的《童虎学习笔记》14分钟结合ProxySQL处理超半数MGR节点故障的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 如何在Ubuntu 16.04上使用Pr
- 下一篇: Linux工控主板的多通道高速并行采集方