[MDEV-15933] Cannot resume Node SYNCED state when wsrep_desync is done after FTWRL Created: 2018-04-19  Updated: 2018-08-15  Resolved: 2018-08-15

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1, 10.3.6, 10.2
Fix Version/s: 10.1.36, 10.2.18, 10.3.10

Type: Bug Priority: Major
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.4; Galera 2 Nodes


Issue Links:
Relates
relates to MDEV-15067 Node fails to rejoin mdb cluster afte... Closed

 Description   

Cannot resume Node SYNCED state when wsrep_desync is done after FTWRL

1. Issue 'flush tables with read lock;' on a Galera node:
MariaDB [(none)]> flush tables with read lock;
Query OK, 0 rows affected (0.000 sec)

2. Set wsrep_desync – the command hangs and wsrep_desync=1 is not applied;
afterwards the node cannot resume the SYNCED state.

MariaDB [(none)]> set global wsrep_desync=1;
^CCtrl-C – query killed. Continuing normally.
^CCtrl-C – query killed. Continuing normally.
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [(none)]> show processlist ;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 81
Current database: *** NONE ***

------------------------------------------------------------------------------------------------
Id  User         Host       db    Command  Time  State  Info  Progress
------------------------------------------------------------------------------------------------
1 system user   NULL Sleep 3331 WSREP aborter idle NULL 0.000
2 system user   NULL Sleep 3331 NULL NULL 0.000
4 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
5 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
3 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
6 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
7 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
9 system user   NULL Sleep 3331 NULL NULL 0.000
8 system user   NULL Sleep 3331 NULL NULL 0.000
10 system user   NULL Sleep 3331 NULL NULL 0.000
26 system user   NULL Sleep 3331 NULL NULL 0.000
17 system user   NULL Sleep 3331 NULL NULL 0.000
16 system user   NULL Sleep 3331 NULL NULL 0.000
19 system user   NULL Sleep 3331 NULL NULL 0.000
20 system user   NULL Sleep 3331 NULL NULL 0.000
22 system user   NULL Sleep 3331 NULL NULL 0.000
23 system user   NULL Sleep 3331 NULL NULL 0.000
18 system user   NULL Sleep 3331 NULL NULL 0.000
24 system user   NULL Sleep 3331 NULL NULL 0.000
21 system user   NULL Sleep 3331 NULL NULL 0.000
14 system user   NULL Sleep 3331 NULL NULL 0.000
25 system user   NULL Sleep 3331 NULL NULL 0.000
35 system user   NULL Sleep 3331 NULL NULL 0.000
27 system user   NULL Sleep 3331 NULL NULL 0.000
13 system user   NULL Sleep 3331 NULL NULL 0.000
34 system user   NULL Sleep 3331 NULL NULL 0.000
11 system user   NULL Sleep 3331 NULL NULL 0.000
30 system user   NULL Sleep 3331 NULL NULL 0.000
37 system user   NULL Sleep 3331 NULL NULL 0.000
15 system user   NULL Sleep 3331 NULL NULL 0.000
33 system user   NULL Sleep 3331 NULL NULL 0.000
32 system user   NULL Sleep 3331 NULL NULL 0.000
31 system user   NULL Sleep 3331 NULL NULL 0.000
12 system user   NULL Sleep 3331 NULL NULL 0.000
55 system user   NULL Sleep 3331 NULL NULL 0.000
38 system user   NULL Sleep 3331 NULL NULL 0.000
51 system user   NULL Sleep 3331 NULL NULL 0.000
42 system user   NULL Sleep 3331 NULL NULL 0.000
43 system user   NULL Sleep 3331 NULL NULL 0.000
36 system user   NULL Sleep 3331 NULL NULL 0.000
41 system user   NULL Sleep 3331 NULL NULL 0.000
45 system user   NULL Sleep 3331 NULL NULL 0.000
44 system user   NULL Sleep 3331 NULL NULL 0.000
39 system user   NULL Sleep 3331 NULL NULL 0.000
48 system user   NULL Sleep 3331 NULL NULL 0.000
46 system user   NULL Sleep 3331 NULL NULL 0.000
47 system user   NULL Sleep 3331 NULL NULL 0.000
50 system user   NULL Sleep 3331 NULL NULL 0.000
52 system user   NULL Sleep 3331 NULL NULL 0.000
53 system user   NULL Sleep 3331 NULL NULL 0.000
40 system user   NULL Sleep 3331 NULL NULL 0.000
28 system user   NULL Sleep 3331 NULL NULL 0.000
49 system user   NULL Sleep 3331 NULL NULL 0.000
29 system user   NULL Sleep 3331 NULL NULL 0.000
58 system user   NULL Sleep 3331 NULL NULL 0.000
59 system user   NULL Sleep 3331 NULL NULL 0.000
61 system user   NULL Sleep 3331 NULL NULL 0.000
56 system user   NULL Sleep 3331 NULL NULL 0.000
60 system user   NULL Sleep 3331 NULL NULL 0.000
62 system user   NULL Sleep 3331 NULL NULL 0.000
57 system user   NULL Sleep 3331 NULL NULL 0.000
63 system user   NULL Sleep 3331 NULL NULL 0.000
64 system user   NULL Sleep 3331 NULL NULL 0.000
54 system user   NULL Sleep 3331 NULL NULL 0.000
65 system user   NULL Sleep 3331 NULL NULL 0.000
69 system user   NULL Sleep 3331 NULL NULL 0.000
66 system user   NULL Sleep 3331 NULL NULL 0.000
67 system user   NULL Sleep 3331 NULL NULL 0.000
70 system user   NULL Sleep 3331 NULL NULL 0.000
68 system user   NULL Sleep 3331 NULL NULL 0.000
78 root localhost NULL Killed 30 NULL set global wsrep_desync=1 0.000
81 root localhost NULL Query 0 Init show processlist 0.000
------------------------------------------------------------------------------------------------
72 rows in set (0.001 sec)

The node moved into the DESYNCED state and cannot be recovered:
2018-04-19 17:34:17 0 [Note] WSREP: Member 0.0 (t4w6) desyncs itself from group
2018-04-19 17:34:17 0 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 2)
2018-04-19 17:34:17 78 [Note] WSREP: Provider paused at 9f23373a-3e6d-11e8-b625-0646494b946c:2 (31)

MariaDB [(none)]> unlock tables ;
Query OK, 0 rows affected (0.000 sec)

MariaDB [(none)]> show variables like '%wsrep_desync%' ;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync  | OFF   |
+---------------+-------+
1 row in set (0.002 sec)

MariaDB [(none)]> show status like 'wsrep_local_state_comment' ;
+---------------------------+----------------+
| Variable_name             | Value          |
+---------------------------+----------------+
| wsrep_local_state_comment | Donor/Desynced |
+---------------------------+----------------+
1 row in set (0.001 sec)



 Comments   
Comment by Zdravelina Sokolovska (Inactive) [ 2018-04-19 ]

In addition, restarting mariadb.service fails:

# systemctl restart  mariadb.service
Job for mariadb.service failed because a timeout was exceeded. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@t4w6 ~]# mysql -u root -p1
 
 
 
 
^C
[root@t4w6 ~]# mysql -u root -p1
 
 
 
^C
[root@t4w6 ~]# systemctl status mariadb.service
● mariadb.service - MariaDB 10.3.5 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: failed (Result: timeout) since Thu 2018-04-19 18:30:51 EEST; 48s ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 1142 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
  Process: 1389 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=killed, signal=TERM)
  Process: 1387 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 1043
   CGroup: /system.slice/mariadb.service
           └─1043 /usr/sbin/mysqld --wsrep_start_position=9f23373a-3e6d-11e8-b625-0646494b946c:2
 
Apr 19 18:29:21 t4w6.xentio.lan systemd[1]: Starting MariaDB 10.3.5 database server...
Apr 19 18:30:51 t4w6.xentio.lan systemd[1]: mariadb.service start-pre operation timed out. Terminating.
Apr 19 18:30:51 t4w6.xentio.lan systemd[1]: Failed to start MariaDB 10.3.5 database server.
Apr 19 18:30:51 t4w6.xentio.lan systemd[1]: Unit mariadb.service entered failed state.
Apr 19 18:30:51 t4w6.xentio.lan systemd[1]: mariadb.service failed.

2018-04-19 18:26:20 0 [Note] /usr/sbin/mysqld (unknown): Normal shutdown
2018-04-19 18:26:20 0 [Note] WSREP: Stop replication
2018-04-19 18:26:20 0 [Note] WSREP: Closing send monitor...
2018-04-19 18:26:20 0 [Note] WSREP: Closed send monitor.
2018-04-19 18:26:20 0 [Note] WSREP: gcomm: terminating thread
2018-04-19 18:26:20 0 [Note] WSREP: gcomm: joining thread
2018-04-19 18:26:20 0 [Note] WSREP: gcomm: closing backend
2018-04-19 18:26:24 0 [Note] WSREP: (3c3e4c49, 'tcp://0.0.0.0:4567') connection to peer 6a5bf34e with addr tcp://192.168.104.195:4567 timed out, no messages seen in PT3S
2018-04-19 18:26:24 0 [Note] WSREP: (3c3e4c49, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.104.195:4567
2018-04-19 18:26:25 0 [Note] WSREP: (3c3e4c49, 'tcp://0.0.0.0:4567') reconnecting to 6a5bf34e (tcp://192.168.104.195:4567), attempt 0
2018-04-19 18:26:26 0 [Note] WSREP: evs::proto(3c3e4c49, LEAVING, view_id(REG,3c3e4c49,154)) suspecting node: 6a5bf34e
2018-04-19 18:26:26 0 [Note] WSREP: evs::proto(3c3e4c49, LEAVING, view_id(REG,3c3e4c49,154)) suspected node without join message, declaring inactive
2018-04-19 18:26:26 0 [Note] WSREP: view(view_id(NON_PRIM,3c3e4c49,154) memb {
        3c3e4c49,0
} joined {
} left {
} partitioned {
        6a5bf34e,0
})
2018-04-19 18:26:26 0 [Note] WSREP: view((empty))
2018-04-19 18:26:26 0 [Note] WSREP: gcomm: closed
2018-04-19 18:26:26 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-04-19 18:26:26 0 [Note] WSREP: Flow-control interval: [16, 16]
2018-04-19 18:26:26 0 [Note] WSREP: Trying to continue unpaused monitor
2018-04-19 18:26:26 0 [Note] WSREP: Received NON-PRIMARY.
2018-04-19 18:26:26 0 [Note] WSREP: Shifting DONOR/DESYNCED -> OPEN (TO: 2)
2018-04-19 18:26:26 0 [Note] WSREP: Received self-leave message.
2018-04-19 18:26:26 0 [Note] WSREP: Flow-control interval: [0, 0]
2018-04-19 18:26:26 0 [Note] WSREP: Trying to continue unpaused monitor
2018-04-19 18:26:26 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2018-04-19 18:26:26 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 2)
2018-04-19 18:26:26 0 [Note] WSREP: RECV thread exiting 0: Success
2018-04-19 18:26:26 0 [Note] WSREP: recv_thread() joined.
2018-04-19 18:26:26 0 [Note] WSREP: Closing replication queue.
2018-04-19 18:26:26 0 [Note] WSREP: Closing slave action queue.
2018-04-19 18:26:28 0 [Note] WSREP: killing local connection: 78
 

Comment by Mario Karuza (Inactive) [ 2018-08-14 ]

This is fixed by disallowing setting wsrep_desync after the global read lock has been acquired.
A PR has been made against 10.1.

Generated at Thu Feb 08 08:25:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.