Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22240

galera.galera_gcache_recover MTR failed: Failed to establish quorum

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.2.31, 10.3.22, 10.5.2
    • N/A
    • Galera, Tests
    • None
    • debian-10

    Description

      galera.galera_gcache_recover failed on CI: "gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum".

      stdio.log:

      10.5.2-0 ES 022c86e4b886e4d87cfcbfc5c5c0cdcb7e4eb3fa RelWithDebInfo debian-10

      galera.galera_gcache_recover 'innodb'    w3 [ fail ]
              Test ended at 2020-04-10 15:16:26
       
      CURRENT_TEST: galera.galera_gcache_recover
       
       
      Failed to start mysqld.2
      mysqltest failed but provided no output
       
       
       - saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/'
      ***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover
       
      2020-04-10 15:13:34 0 [Warning] WSREP: No re-merged primary component found.
      2020-04-10 15:13:34 0 [Warning] WSREP: No bootstrapped primary component found.
      2020-04-10 15:13:34 0 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.
      

      Server logs.

      Attachments

        Activity

          It failed also on CI, 10.2 ES, rhel-7.

          stdio.log:

          10.2.31-6 ES 2969d0702d56405d1aec8c16a272ac85fef7bd61 Debug

           
          galera.galera_gcache_recover 'innodb'    w3 [ fail ]
                  Test ended at 2020-04-13 11:04:16
           
          CURRENT_TEST: galera.galera_gcache_recover
          mysqltest: In included file "./include/galera_wait_ready.inc": 
          included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/include/start_mysqld.inc at line 16:
          included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/t/galera_gcache_recover.test at line 44:
          At line 28: "Server did not transition to READY state"
           
          The result from queries just before the failure was:
          CREATE TABLE t1 (f1 INTEGER PRIMARY KEY) ENGINE=InnoDB;
          INSERT INTO t1 VALUES (1);
          SET SESSION wsrep_sync_wait = 0;
          connection node_2;
          SET SESSION wsrep_sync_wait = 0;
          Killing server ...
          connection node_1;
          INSERT INTO t1 VALUES (2);
          Killing server ...
          connection node_1;
          Performing --wsrep-recover ...
          Using --wsrep-start-position when starting mysqld ...
          INSERT INTO t1 VALUES (3);
          connection node_2;
          Performing --wsrep-recover ...
          Using --wsrep-start-position when starting mysqld ...
           
           - saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/'
          ***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover
           
          2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No re-merged primary component found.
          2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No bootstrapped primary component found.
          2020-04-13 11:03:25 139694994077440 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():873: Failed to establish quorum.
          

          Server logs.

          stepan.patryshev Stepan Patryshev (Inactive) added a comment - - edited It failed also on CI, 10.2 ES, rhel-7. stdio.log : 10.2.31-6 ES 2969d0702d56405d1aec8c16a272ac85fef7bd61 Debug   galera.galera_gcache_recover 'innodb' w3 [ fail ] Test ended at 2020-04-13 11:04:16   CURRENT_TEST: galera.galera_gcache_recover mysqltest: In included file "./include/galera_wait_ready.inc": included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/include/start_mysqld.inc at line 16: included from /home/jenkins/workspace/10.2e-MTR-GALERA/BuildType/Debug/label/rhel-7/mysql-test/suite/galera/t/galera_gcache_recover.test at line 44: At line 28: "Server did not transition to READY state"   The result from queries just before the failure was: CREATE TABLE t1 (f1 INTEGER PRIMARY KEY) ENGINE=InnoDB; INSERT INTO t1 VALUES (1); SET SESSION wsrep_sync_wait = 0; connection node_2; SET SESSION wsrep_sync_wait = 0; Killing server ... connection node_1; INSERT INTO t1 VALUES (2); Killing server ... connection node_1; Performing --wsrep-recover ... Using --wsrep-start-position when starting mysqld ... INSERT INTO t1 VALUES (3); connection node_2; Performing --wsrep-recover ... Using --wsrep-start-position when starting mysqld ...   - saving '/var/tmp/mtr/3/log/galera.galera_gcache_recover-innodb/' to '/var/tmp/mtr/log/galera.galera_gcache_recover-innodb/' ***Warnings generated in error logs during shutdown after running tests: galera.galera_gcache_recover   2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No re-merged primary component found. 2020-04-13 11:03:25 139694994077440 [Warning] WSREP: No bootstrapped primary component found. 2020-04-13 11:03:25 139694994077440 [ERROR] WSREP: gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():873: Failed to establish quorum. Server logs .

          It failed also on Jenkins, 10.3.22-6 ES 164232ab3faf1f43eb38d1a4c9cdb4393a5563ab, Build Debug, debian-9.

          stepan.patryshev Stepan Patryshev (Inactive) added a comment - It failed also on Jenkins, 10.3.22-6 ES 164232ab3faf1f43eb38d1a4c9cdb4393a5563ab, Build Debug, debian-9.
          Yurchenko Alexey added a comment -

          What happens here (following the logs from 2020-04-13 since this is very difficult to reproduce) is an issue with Galera layered architecture.

          As expected from the test, node2 restarts and successfully recovers gcache:
          2020-04-13 11:03:15 139695533410432 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 1-2

          It then connects to node1, but connection is broken (due to external forces) before the upper layer manages to initialize:

          2020-04-13 11:03:23 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') turning message relay requesting on, nonlive peers: tcp://127.0.0.1:16022
          2020-04-13 11:03:25 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') reconnecting to 486ea196 (tcp://127.0.0.1:16022), attempt 0
          ...
          2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
          2020-04-13 11:03:25 139694994077440 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
          2020-04-13 11:03:25 139695136687872 [Note] WSREP: forgetting 486ea196 (tcp://127.0.0.1:16022)
          2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1

          At this point there are no stateful nodes in node2 component. In normal situation gcomm layer would have returned non-PRIMARY component and would have reconnected. However for this test pc.ignore_sb=true was set up, so gcomm layer thinks everything is fine, while the upper layer didn't have a chance to exchange states with node1.

          So it all boils down to an unfortunate timing of events.

          The test can be made more reliable by not using pc.ignore_sb=true on node2, since it does not need it.

          Submitted a trivial fix for 10.2 in https://github.com/MariaDB/server/pull/1785
          Should be easily merged downstream

          Yurchenko Alexey added a comment - What happens here (following the logs from 2020-04-13 since this is very difficult to reproduce) is an issue with Galera layered architecture. As expected from the test, node2 restarts and successfully recovers gcache: 2020-04-13 11:03:15 139695533410432 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 1-2 It then connects to node1, but connection is broken (due to external forces) before the upper layer manages to initialize: 2020-04-13 11:03:23 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') turning message relay requesting on, nonlive peers: tcp://127.0.0.1:16022 2020-04-13 11:03:25 139695136687872 [Note] WSREP: (48d0798f, 'tcp://0.0.0.0:16025') reconnecting to 486ea196 (tcp://127.0.0.1:16022), attempt 0 ... 2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2 2020-04-13 11:03:25 139694994077440 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID. 2020-04-13 11:03:25 139695136687872 [Note] WSREP: forgetting 486ea196 (tcp://127.0.0.1:16022) 2020-04-13 11:03:25 139694994077440 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1 At this point there are no stateful nodes in node2 component. In normal situation gcomm layer would have returned non-PRIMARY component and would have reconnected. However for this test pc.ignore_sb=true was set up, so gcomm layer thinks everything is fine, while the upper layer didn't have a chance to exchange states with node1. So it all boils down to an unfortunate timing of events. The test can be made more reliable by not using pc.ignore_sb=true on node2, since it does not need it. Submitted a trivial fix for 10.2 in https://github.com/MariaDB/server/pull/1785 Should be easily merged downstream

          People

            jplindst Jan Lindström (Inactive)
            stepan.patryshev Stepan Patryshev (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.