[MDEV-22240] galera.galera_gcache_recover MTR failed: Failed to establish quorum Created: 2020-04-14  Updated: 2021-09-29  Resolved: 2021-06-24

Status: Closed
Project: MariaDB Server
Component/s: Galera, Tests
Affects Version/s: 10.2.31, 10.3.22, 10.5.2
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Stepan Patryshev (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

debian-10


Attachments: Zip Archive 200410_galera.galera_gcache_recover.zip     Text File 200410_stdout.log     Zip Archive 200413_galera_gcache_recover_10.2.zip     Text File 200413_stdout_10.2.log    

 Description   

galera.galera_gcache_recover failed on CI: "gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum".

stdio.log:

10.5.2-0 ES 022c86e4b886e4d87cfcbfc5c5c0cdcb7e4eb3fa RelWithDebInfo debian-10

galera.galera_gcache_recover 'innodb'    w3 [ fail ]
        Test ended at 2020-04-10 15:16:26
 
CURRENT_TEST: galera.galera_gcache_recover
 
 
Failed to start mysqld.2
mysqltest failed but provided no output
 
 
 - saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/'
***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover
 
2020-04-10 15:13:34 0 [Warning] WSREP: No re-merged primary component found.
2020-04-10 15:13:34 0 [Warning] WSREP: No bootstrapped primary component found.
2020-04-10 15:13:34 0 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.

Server logs.



 Comments   
Comment by Stepan Patryshev (Inactive) [ 2020-04-14 ]

It failed also on CI, 10.2 ES, rhel-7.

stdio.log:

10.2.31-6 ES 2969d0702d56405d1aec8c16a272ac85fef7bd61 Debug

 
galera.galera_gcache_recover 'innodb'    w3 [ fail ]
        Test ended at 2020-04-13 11:04:16
 
CURRENT_TEST: galera.galera_gcache_recover
mysqltest: In included file "./include/galera_wait_ready.inc": 
included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/include/start_mysqld.inc at line 16:
included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/t/galera_gcache_recover.test at line 44:
At line 28: "Server did not transition to READY state"
 
The result from queries just before the failure was:
CREATE TABLE t1 (f1 INTEGER PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);
SET SESSION wsrep_sync_wait = 0;
connection node_2;
SET SESSION wsrep_sync_wait = 0;
Killing server ...
connection node_1;
INSERT INTO t1 VALUES (2);
Killing server ...
connection node_1;
Performing --wsrep-recover ...
Using --wsrep-start-position when starting mysqld ...
INSERT INTO t1 VALUES (3);
connection node_2;
Performing --wsrep-recover ...
Using --wsrep-start-position when starting mysqld ...
 
 - saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/'
***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover
 
2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No re-merged primary component found.
2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No bootstrapped primary component found.
2020-04-13 11:03:25 139694994077440 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():873: Failed to establish quorum.

Server logs.

Comment by Stepan Patryshev (Inactive) [ 2020-05-08 ]

It failed also on Jenkins, 10.3.22-6 ES 164232ab3faf1f43eb38d1a4c9cdb4393a5563ab, Build Debug, debian-9.

Comment by Alexey [ 2021-03-20 ]

What happens here (following the logs from 2020-04-13 since this is very difficult to reproduce) is an issue with Galera layered architecture.

As expected from the test, node2 restarts and successfully recovers gcache:
2020-04-13 11:03:15 139695533410432 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 1-2

It then connects to node1, but connection is broken (due to external forces) before the upper layer manages to initialize:

2020-04-13 11:03:23 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') turning message relay requesting on, nonlive peers: tcp://127.0.0.1:16022
2020-04-13 11:03:25 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') reconnecting to 486ea196 (tcp://127.0.0.1:16022), attempt 0
...
2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2020-04-13 11:03:25 139694994077440 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2020-04-13 11:03:25 139695136687872 [Note] WSREP: forgetting 486ea196 (tcp://127.0.0.1:16022)
2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1

At this point there are no stateful nodes in node2 component. In normal situation gcomm layer would have returned non-PRIMARY component and would have reconnected. However for this test pc.ignore_sb=true was set up, so gcomm layer thinks everything is fine, while the upper layer didn't have a chance to exchange states with node1.

So it all boils down to an unfortunate timing of events.

The test can be made more reliable by not using pc.ignore_sb=true on node2, since it does not need it.

Submitted a trivial fix for 10.2 in https://github.com/MariaDB/server/pull/1785
Should be easily merged downstream

Generated at Thu Feb 08 09:13:15 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.