[MDEV-22240] galera.galera_gcache_recover MTR failed: Failed to establish quorum - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.2.31, 10.3.22, 10.5.2
Fix Version/s: N/A
Component/s: Galera, Tests
Labels:
None
Environment:
debian-10

Description

galera.galera_gcache_recover failed on CI: "gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum".

stdio.log:

10.5.2-0 ES 022c86e4b886e4d87cfcbfc5c5c0cdcb7e4eb3fa RelWithDebInfo debian-10
galera.galera_gcache_recover 'innodb' w3 [ fail ]
Test ended at 2020-04-10 15:16:26

CURRENT_TEST: galera.galera_gcache_recover


Failed to start mysqld.2
mysqltest failed but provided no output


- saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/'
***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover

2020-04-10 15:13:34 0 [Warning] WSREP: No re-merged primary component found.
2020-04-10 15:13:34 0 [Warning] WSREP: No bootstrapped primary component found.
2020-04-10 15:13:34 0 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.

Server logs.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

200410_galera.galera_gcache_recover.zip
76 kB
2020-04-14 12:27
200410_stdout.log
37 kB
2020-04-14 12:27
200413_galera_gcache_recover_10.2.zip
63 kB
2020-04-14 13:00
200413_stdout_10.2.log
24 kB
2020-04-14 13:00

Activity

Ascending order - Click to sort in descending order

Stepan Patryshev (Inactive) added a comment - 2020-04-14 13:00 - edited

It failed also on CI, 10.2 ES, rhel-7.

stdio.log:

10.2.31-6 ES 2969d0702d56405d1aec8c16a272ac85fef7bd61 Debug

galera.galera_gcache_recover 'innodb' w3 [ fail ]
Test ended at 2020-04-13 11:04:16

CURRENT_TEST: galera.galera_gcache_recover
mysqltest: In included file "./include/galera_wait_ready.inc":
included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/include/start_mysqld.inc at line 16:
included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/t/galera_gcache_recover.test at line 44:
At line 28: "Server did not transition to READY state"

The result from queries just before the failure was:
CREATE TABLE t1 (f1 INTEGER PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);
SET SESSION wsrep_sync_wait = 0;
connection node_2;
SET SESSION wsrep_sync_wait = 0;
Killing server ...
connection node_1;
INSERT INTO t1 VALUES (2);
Killing server ...
connection node_1;
Performing --wsrep-recover ...
Using --wsrep-start-position when starting mysqld ...
INSERT INTO t1 VALUES (3);
connection node_2;
Performing --wsrep-recover ...
Using --wsrep-start-position when starting mysqld ...

- saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/'
***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover

2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No re-merged primary component found.
2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No bootstrapped primary component found.
2020-04-13 11:03:25 139694994077440 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():873: Failed to establish quorum.

Server logs.

Stepan Patryshev (Inactive) added a comment - 2020-04-14 13:00 - edited It failed also on CI, 10.2 ES, rhel-7. stdio.log : 10.2.31-6 ES 2969d0702d56405d1aec8c16a272ac85fef7bd61 Debug galera.galera_gcache_recover 'innodb' w3 [ fail ] Test ended at 2020-04-13 11:04:16 CURRENT_TEST: galera.galera_gcache_recover mysqltest: In included file "./include/galera_wait_ready.inc": included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/include/start_mysqld.inc at line 16: included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/t/galera_gcache_recover.test at line 44: At line 28: "Server did not transition to READY state" The result from queries just before the failure was: CREATE TABLE t1 (f1 INTEGER PRIMARY KEY) ENGINE=InnoDB; INSERT INTO t1 VALUES (1); SET SESSION wsrep_sync_wait = 0; connection node_2; SET SESSION wsrep_sync_wait = 0; Killing server ... connection node_1; INSERT INTO t1 VALUES (2); Killing server ... connection node_1; Performing --wsrep-recover ... Using --wsrep-start-position when starting mysqld ... INSERT INTO t1 VALUES (3); connection node_2; Performing --wsrep-recover ... Using --wsrep-start-position when starting mysqld ... - saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/' ***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover 2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No re-merged primary component found. 2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No bootstrapped primary component found. 2020-04-13 11:03:25 139694994077440 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():873: Failed to establish quorum. Server logs .

Stepan Patryshev (Inactive) added a comment - 2020-05-08 15:46

It failed also on Jenkins, 10.3.22-6 ES 164232ab3faf1f43eb38d1a4c9cdb4393a5563ab, Build Debug, debian-9.

Stepan Patryshev (Inactive) added a comment - 2020-05-08 15:46 It failed also on Jenkins, 10.3.22-6 ES 164232ab3faf1f43eb38d1a4c9cdb4393a5563ab, Build Debug, debian-9.

Alexey added a comment - 2021-03-20 08:53

What happens here (following the logs from 2020-04-13 since this is very difficult to reproduce) is an issue with Galera layered architecture.

As expected from the test, node2 restarts and successfully recovers gcache:
2020-04-13 11:03:15 139695533410432 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 1-2

It then connects to node1, but connection is broken (due to external forces) before the upper layer manages to initialize:

2020-04-13 11:03:23 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') turning message relay requesting on, nonlive peers: tcp://127.0.0.1:16022
2020-04-13 11:03:25 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') reconnecting to 486ea196 (tcp://127.0.0.1:16022), attempt 0
...
2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2020-04-13 11:03:25 139694994077440 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2020-04-13 11:03:25 139695136687872 [Note] WSREP: forgetting 486ea196 (tcp://127.0.0.1:16022)
2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1

At this point there are no stateful nodes in node2 component. In normal situation gcomm layer would have returned non-PRIMARY component and would have reconnected. However for this test pc.ignore_sb=true was set up, so gcomm layer thinks everything is fine, while the upper layer didn't have a chance to exchange states with node1.

So it all boils down to an unfortunate timing of events.

The test can be made more reliable by not using pc.ignore_sb=true on node2, since it does not need it.

Submitted a trivial fix for 10.2 in https://github.com/MariaDB/server/pull/1785
Should be easily merged downstream

Alexey added a comment - 2021-03-20 08:53 What happens here (following the logs from 2020-04-13 since this is very difficult to reproduce) is an issue with Galera layered architecture. As expected from the test, node2 restarts and successfully recovers gcache: 2020-04-13 11:03:15 139695533410432 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 1-2 It then connects to node1, but connection is broken (due to external forces) before the upper layer manages to initialize: 2020-04-13 11:03:23 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') turning message relay requesting on, nonlive peers: tcp://127.0.0.1:16022 2020-04-13 11:03:25 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') reconnecting to 486ea196 (tcp://127.0.0.1:16022), attempt 0 ... 2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2 2020-04-13 11:03:25 139694994077440 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID. 2020-04-13 11:03:25 139695136687872 [Note] WSREP: forgetting 486ea196 (tcp://127.0.0.1:16022) 2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1 At this point there are no stateful nodes in node2 component. In normal situation gcomm layer would have returned non-PRIMARY component and would have reconnected. However for this test pc.ignore_sb=true was set up, so gcomm layer thinks everything is fine, while the upper layer didn't have a chance to exchange states with node1. So it all boils down to an unfortunate timing of events. The test can be made more reliable by not using pc.ignore_sb=true on node2, since it does not need it. Submitted a trivial fix for 10.2 in https://github.com/MariaDB/server/pull/1785 Should be easily merged downstream

MariaDB Server

galera.galera_gcache_recover MTR failed: Failed to establish quorum

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration