[MDEV-22240] galera.galera_gcache_recover MTR failed: Failed to establish quorum Created: 2020-04-14 Updated: 2021-09-29 Resolved: 2021-06-24 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Tests |
| Affects Version/s: | 10.2.31, 10.3.22, 10.5.2 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Stepan Patryshev (Inactive) | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | debian-10 |
| Attachments: |
|
| Description |
|
galera.galera_gcache_recover failed on CI: "gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum".
|
| Comments |
| Comment by Stepan Patryshev (Inactive) [ 2020-04-14 ] |
|
It also failed on CI: 10.2 ES, rhel-7.
|
| Comment by Stepan Patryshev (Inactive) [ 2020-05-08 ] |
|
It also failed on Jenkins: 10.3.22-6 ES 164232ab3faf1f43eb38d1a4c9cdb4393a5563ab, Build Debug, debian-9. |
| Comment by Alexey [ 2021-03-20 ] |
|
What happens here (following the logs from 2020-04-13, since this is very difficult to reproduce) is an issue with Galera's layered architecture. As the test expects, node2 restarts and successfully recovers gcache. It then connects to node1, but the connection is broken (due to external forces) before the upper layer manages to initialize:

2020-04-13 11:03:23 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') turning message relay requesting on, nonlive peers: tcp://127.0.0.1:16022

At this point there are no stateful nodes in node2's component. Normally the gcomm layer would have returned a non-PRIMARY component and reconnected. However, this test sets pc.ignore_sb=true, so the gcomm layer thinks everything is fine even though the upper layer never had a chance to exchange states with node1. So it all boils down to an unfortunate timing of events.

The test can be made more reliable by not using pc.ignore_sb=true on node2, since it does not need it. Submitted a trivial fix for 10.2 in https://github.com/MariaDB/server/pull/1785 |
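
For illustration only, the change described above could look roughly like the following per-test MTR configuration. This is a sketch and not the contents of the linked pull request: the include line, the section names, and the exact wsrep_provider_options strings are assumptions based on the usual layout of Galera suite .cnf files. The only point taken from the comment is that pc.ignore_sb=true is no longer set for node2.

```ini
# Hypothetical sketch of a per-test .cnf; the real file may differ.
!include ../galera_2nodes.cnf

[mysqld.1]
# node1: left unchanged in this sketch
wsrep_provider_options='base_port=@mysqld.1.#galera_port;pc.ignore_sb=true'

[mysqld.2]
# node2: pc.ignore_sb is not set, so a connection lost this early would
# leave gcomm with a non-PRIMARY component and make it reconnect
wsrep_provider_options='base_port=@mysqld.2.#galera_port'
```

With a configuration along these lines, node2 should fall back to the normal non-PRIMARY and reconnect path instead of attempting a state exchange in a component that has no stateful nodes.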