[MXS-2626] Clustrix node goes directly to 'Drained' state instead of 'Draining'. Created: 2019-08-08  Updated: 2019-08-14  Resolved: 2019-08-14

Status: Closed
Project: MariaDB MaxScale
Component/s: xpandmon
Affects Version/s: None
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Rahul Joshi (Inactive) Assignee: Johan Wikman
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

MaxScale server karma197:
OS: CentOS 7
Version: built from 2.4 branch, MariaDB MaxScale 2.4.1 started (Commit: 658aae6b6c77774c34d0f52c03f63edc2d44969e)
Clustrix nodes:
OS: CentOS 7
Version: clustrix-fredonyer-16045


Attachments: File clustrix_karma108_trxreplay.cnf    Text File maxscale.log
Sprint: MXS-SPRINT-88

 Description   

[root@karma108 ~]# clx s
Cluster Name:    cld8d568c4e1f3e0db
Cluster Version: clustrix-fredonyer-16045
Cluster Status:   OK
Cluster Size:    4 nodes - 16 CPUs per Node
Current Node:    karma108 - nid 1
 
nid |  Hostname | Status |  IP Address  | Zone | TPS |      Used     |  Total
----+-----------+--------+--------------+------+-----+---------------+--------
  1 |  karma108 |    OK  |   10.2.13.91 |    1 |   0 |  1.8G (0.24%) |  762.9G
  2 |  karma180 |    OK  |  10.2.15.180 |    2 |   0 |  1.8G (0.24%) |  762.9G
  3 |  karma123 |    OK  |   10.2.15.89 |    3 |   0 |  1.8G (0.24%) |  762.9G
  4 |  karma065 |    OK  |  10.2.14.119 |    0 |   0 |  9.3M (0.00%) |  762.9G
----+-----------+--------+--------------+------+-----+---------------+--------
                                                   0 |  5.5G (0.18%) |    3.0T
MaxScale was run as:
[root@karma197 etc]# maxscale -d -f clustrix_karma108_trxreplay.cnf --user=root
 
[root@karma197 log]# maxctrl list servers
┌─────────────────────┬─────────────┬──────┬─────────────┬─────────────────┬──────┐
│ Server              │ Address     │ Port │ Connections │ State           │ GTID │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ @@Clustrix:node-4   │ 10.2.14.119 │ 3306 │ 0           │ Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ @@Clustrix:node-3   │ 10.2.15.89  │ 3306 │ 0           │ Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ @@Clustrix:node-2   │ 10.2.15.180 │ 3306 │ 0           │ Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ @@Clustrix:node-1   │ 10.2.13.91  │ 3306 │ 0           │ Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼─────────────────┼──────┤
│ Bootstrap1-karma108 │ 10.2.13.91  │ 3306 │ 0           │ Master, Running │      │
└─────────────────────┴─────────────┴──────┴─────────────┴─────────────────┴──────┘

Now, let's softfail node 4.

MySQL [system]> SELECT * FROM system.nodeinfo ORDER BY nodeid;
+--------+----------------------------+----------------------------+----------------------------+-----------------------------+------------+-------------------+-------------+-----------------+------------+-------------+---------+--------------+----------------+-------------------+-------+------------+----------+------+
| nodeid | started                    | ntptime                    | up_since                   | hostname                    | iface_name | iface_mac_addr    | iface_ip    | iface_connected | mysql_port | be_ip       | be_port | control_port | healthmon_port | pnid              | cores | first_pass | be_iface | zone |
+--------+----------------------------+----------------------------+----------------------------+-----------------------------+------------+-------------------+-------------+-----------------+------------+-------------+---------+--------------+----------------+-------------------+-------+------------+----------+------+
|      1 | 2019-08-07 22:32:47.808417 | 2019-08-08 00:08:38.300635 | 2019-08-05 05:48:39.300635 | karma108.colo.sproutsys.com | enp5s0f0   | 00:25:90:20:f5:a2 | 10.2.13.91  |               1 |       3306 | 10.2.13.91  |   24378 |         2048 |           3581 | pfec37993505e4d91 |    16 |          0 | enp5s0f0 |    1 |
|      2 | 2019-08-07 22:32:47.571999 | 2019-08-08 00:08:38.299715 | 2019-08-05 05:48:16.299715 | karma180.colo.sproutsys.com | enp5s0f0   | 00:25:90:20:f4:ce | 10.2.15.180 |               1 |       3306 | 10.2.15.180 |   24378 |         2048 |           3581 | pcb504659bf22ce02 |    16 |          0 | enp5s0f0 |    2 |
|      3 | 2019-08-07 22:32:47.519503 | 2019-08-08 00:08:38.305634 | 2019-08-05 05:47:56.305634 | karma123.colo.sproutsys.com | enp5s0f0   | 00:25:90:78:f7:3c | 10.2.15.89  |               1 |       3306 | 10.2.15.89  |   24378 |         2048 |           3581 | pe268896eb1da201a |    16 |          0 | enp5s0f0 |    3 |
|      4 | 2019-08-07 22:33:01.935622 | 2019-08-08 00:08:38.297207 | 2019-07-29 00:37:02.297207 | karma065.colo.sproutsys.com | enp5s0f0   | 00:25:90:7b:5c:f4 | 10.2.14.119 |               1 |       3306 | 10.2.14.119 |   24378 |         2048 |           3581 | p67082d6315829f66 |    16 |          0 | enp5s0f0 |    0 |
+--------+----------------------------+----------------------------+----------------------------+-----------------------------+------------+-------------------+-------------+-----------------+------------+-------------+---------+--------------+----------------+-------------------+-------+------------+----------+------+
4 rows in set (0.00 sec)
 
MySQL [system]> ALTER CLUSTER SOFTFAIL 4;
Query OK, 0 rows affected (0.04 sec)

Immediately running `maxctrl list servers` shows that the node is marked as `Drained` instead of `Draining`.

[root@karma197 log]# maxctrl list servers;date
┌─────────────────────┬─────────────┬──────┬─────────────┬──────────────────────────┬──────┐
│ Server              │ Address     │ Port │ Connections │ State                    │ GTID │
├─────────────────────┼─────────────┼──────┼─────────────┼──────────────────────────┼──────┤
│ @@Clustrix:node-4   │ 10.2.14.119 │ 3306 │ 0           │ Drained, Master, Running │      │
├─────────────────────┼─────────────┼──────┼─────────────┼──────────────────────────┼──────┤
│ @@Clustrix:node-2   │ 10.2.15.180 │ 3306 │ 0           │ Master, Running          │      │
├─────────────────────┼─────────────┼──────┼─────────────┼──────────────────────────┼──────┤
│ @@Clustrix:node-3   │ 10.2.15.89  │ 3306 │ 0           │ Master, Running          │      │
├─────────────────────┼─────────────┼──────┼─────────────┼──────────────────────────┼──────┤
│ @@Clustrix:node-1   │ 10.2.13.91  │ 3306 │ 0           │ Master, Running          │      │
├─────────────────────┼─────────────┼──────┼─────────────┼──────────────────────────┼──────┤
│ Bootstrap1-karma108 │ 10.2.13.91  │ 3306 │ 0           │ Master, Running          │      │
└─────────────────────┴─────────────┴──────┴─────────────┴──────────────────────────┴──────┘
Thu Aug  8 00:11:50 UTC 2019

However, Clustrix has not yet finished softfailing the node.

MySQL [system]>  SELECT count(*) FROM system.softfailing_containers; select now();
+----------+
| count(*) |
+----------+
|        2 |
+----------+
1 row in set (0.03 sec)
 
+---------------------+
| now()               |
+---------------------+
| 2019-08-08 00:12:01 |
+---------------------+
1 row in set (0.00 sec)
 
MySQL [system]> SELECT * FROM system.softfailed_nodes       WHERE nodeid NOT IN          (SELECT DISTINCT nodeid           FROM system.softfailing_containers);
Empty set (0.04 sec)

This shows that the node is not yet ready to be removed, as the Clustrix rebalancer is still moving data off it.
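The readiness check above (a SOFTFAILed node is removable only once it no longer appears in `system.softfailing_containers`) can be expressed as a small helper. This is an illustrative sketch of the SQL query's logic, not MaxScale or Clustrix code:

```python
def nodes_ready_for_removal(softfailed_nodes, softfailing_containers):
    """Mirror of the SQL check: a SOFTFAILed node is ready to be
    removed only once it has no rows left in
    system.softfailing_containers, i.e. the rebalancer has moved
    all of its data elsewhere."""
    still_draining = {row["nodeid"] for row in softfailing_containers}
    return [nid for nid in softfailed_nodes if nid not in still_draining]

# At 00:12:01, node 4 still owns 2 softfailing containers -> not ready.
print(nodes_ready_for_removal([4], [{"nodeid": 4}, {"nodeid": 4}]))  # []

# At 00:13:14, the rebalancer is done -> node 4 is ready for removal.
print(nodes_ready_for_removal([4], []))  # [4]
```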

The MaxScale log also shows:

2019-08-08 00:11:43   notice : [clustrixmon] Clustrix: Node 4 (10.2.14.119) has been SOFTFAILed. Turning ON 'Being Drained'.
2019-08-08 00:11:50   info   : Accept authentication from 'admin', using password. Request: /v1/servers
2019-08-08 00:11:50   info   : Accept authentication from 'admin', using password. Request: /v1/monitors/Clustrix
2019-08-08 00:12:07   info   : Accept authentication from 'admin', using password. Request: /v1/servers
2019-08-08 00:12:07   info   : Accept authentication from 'admin', using password. Request: /v1/monitors/Clustrix
2019-08-08 00:12:19   info   : Accept authentication from 'admin', using password. Request: /v1/servers
2019-08-08 00:12:19   info   : Accept authentication from 'admin', using password. Request: /v1/monitors/Clustrix
2019-08-08 00:13:12   info   : Accept authentication from 'admin', using password. Request: /v1/servers
2019-08-08 00:13:12   info   : Accept authentication from 'admin', using password. Request: /v1/monitors/Clustrix
2019-08-08 00:14:21   info   : Accept authentication from 'admin', using password. Request: /v1/servers
2019-08-08 00:14:21   info   : Accept authentication from 'admin', using password. Request: /v1/monitors/Clustrix

Only at 2019-08-08 00:13:14.502252 UTC should the node have been marked as `Drained`:

2019-08-08 00:13:14.502252 UTC nid 1 karma108.colo.sproutsys.com clxnode: INFO dbcore/softfail.ct:26 softfail_node_ready_signal(): softfailing nodes are ready to be removed: 4

Config file and logs attached.



 Comments   
Comment by Johan Wikman [ 2019-08-14 ]

This is actually not a bug. From the release notes:

It is now possible to drain a server, which means that existing connections to the server can continue to be used but new connections are no longer created to the server.

Softfailing a node will cause the server to enter the state Draining, which means that existing connections can continue to be used, but that no new connections will be created. Once the number of connections has dropped to 0, the state will be Drained, but that does not in any way indicate whether the softfailing operation has completed.
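The transition described above can be summarized in a tiny sketch. This is an illustrative model only, assuming the connection count is the sole input once a node has been SOFTFAILed:

```python
def drain_state(softfailed: bool, connections: int):
    """Illustrative model of the drain states described above: a
    SOFTFAILed node is Draining while it still has client connections
    and Drained once they reach zero. Neither state indicates whether
    the rebalancer has finished moving data off the node."""
    if not softfailed:
        return None
    return "Drained" if connections == 0 else "Draining"

# Node 4 had 0 connections when it was SOFTFAILed, so it went
# straight to Drained, exactly as reported in this ticket.
print(drain_state(True, 0))   # Drained
print(drain_state(True, 3))   # Draining
```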

I have clarified the documentation.

Generated at Thu Feb 08 04:15:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.