MariaDB Server / MDEV-26316

Quorum is lost during a full or large resync using wsrep on Galera multi-master

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not a Bug
    • Affects Version/s: 10.5.11, 10.5
    • Fix Version/s: N/A
    • Component/s: Galera, wsrep
    • Labels: None

    Description

      We unexpectedly hit a disaster scenario on our multi-master Galera setup running MariaDB 10.5.

      During recovery we were able to bring the first node up without issues. It always started as Primary with quorum, was safe to bootstrap, and allowed another node to join the cluster:

      root@mysql-0:/var/lib/mysql# cat grastate.dat
      # GALERA saved state
      version: 2.1
      uuid:    676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
      seqno:   -1
      safe_to_bootstrap: 1
       
      MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
      +----------------------+---------+
      | Variable_name        | Value   |
      +----------------------+---------+
      | wsrep_cluster_status | Primary |
      +----------------------+---------+
      1 row in set (0.001 sec)
       
      MariaDB [(none)]> SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
      +---------------------------+--------+
      | Variable_name             | Value  |
      +---------------------------+--------+
      | wsrep_local_state_comment | Synced |
      +---------------------------+--------+
      1 row in set (0.002 sec)
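
A pre-bootstrap check of the saved state shown above can be scripted. The following is a minimal sketch, not part of the original report: the helper names `parse_grastate` and `is_safe_to_bootstrap` are hypothetical, and the sample text is the `grastate.dat` content quoted above. It assumes the file consists only of `key: value` lines and `#` comments.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# GALERA saved state' comment header.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def is_safe_to_bootstrap(state):
    """Return True only when the file explicitly says safe_to_bootstrap: 1."""
    return state.get("safe_to_bootstrap") == "1"


# Sample taken from the grastate.dat output quoted in this report.
SAMPLE = """# GALERA saved state
version: 2.1
uuid:    676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
seqno:   -1
safe_to_bootstrap: 1
"""

if __name__ == "__main__":
    state = parse_grastate(SAMPLE)
    print(state["uuid"], state["seqno"], is_safe_to_bootstrap(state))
```

Values are kept as strings on purpose, so a missing or malformed `safe_to_bootstrap` entry is treated as "not safe" rather than raising an error.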
      

      To prevent any data mismatch and to speed up the restore, we recreated the disk so that node-1 was blank and joined the cluster as a new node. This worked at first: a full sync started, but after roughly 19-25 GB of data had been transferred the whole node crashed, and the logs showed that quorum was lost and that neither node knew which one was the master. The sync stopped and the nodes never rejoined to form any kind of cluster.
      Sometimes the sync ended sooner with the same quorum break, sometimes later. After many retries, made purely in the hope that one would eventually complete, we successfully synced the node.

      The error log in which this was thrown is attached. The relevant excerpt:

      2021-08-06  7:15:45 0 [Warning] WSREP: Quorum: No node with complete state:

          Version      : 6
          Flags        : 0x1
          Protocols    : 2 / 10 / 4
          State        : JOINER
          Desync count : 0
          Prim state   : JOINER
          Prim UUID    : 610fc4ce-f685-11eb-9e4f-725b850de788
          Prim  seqno  : 2
          First seqno  : -1
          Last  seqno  : 245421524
          Commit cut   : 245421512
          Last vote    : -1.0
          Vote policy  : 0
          Prim JOINED  : 1
          State UUID   : 1dda2d24-f686-11eb-8c64-5ab16981001e
          Group UUID   : 676e5a48-fbd8-11ea-a02d-8f1e7fb61efb
          Name         : 'mysql-1'
          Incoming addr: 'AUTO'

      2021-08-06  7:15:45 0 [Warning] WSREP: No re-merged primary component found.
      2021-08-06  7:15:45 0 [Warning] WSREP: No bootstrapped primary component found.
      2021-08-06  7:15:45 0 [ERROR] WSREP: /home/buildbot/buildbot/build/gcs/src/gcs_state_msg.cpp:gcs_state_msg_get_quorum():947: Failed to establish quorum.
      2021-08-06  7:15:45 0 [Note] WSREP: Quorum results:
          version    = 6,
          component  = NON-PRIMARY,
          conf_id    = -1,
          members    = 1/1 (joined/total),
          act_id     = -1,
          last_appl. = 245421512,
          protocols  = -1/-1/-1 (gcs/repl/appl),
          vote policy= 1,
          group UUID = 00000000-0000-0000-0000-000000000000
      2021-08-06  7:15:45 0 [Note] WSREP: Flow-control interval: [16, 16]
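
When a failing sync has to be retried many times, it can help to detect the quorum loss from the error log automatically. The following is a minimal sketch under stated assumptions: the function names `parse_quorum_results` and `lost_quorum` are hypothetical, the `key = value` layout is inferred only from the excerpt above, and the optional export wrapper (an ISO timestamp followed by a quote character) is stripped if present. In the excerpt the result is `component = NON-PRIMARY` with `members = 1/1`, meaning the component contained a single member when quorum was evaluated.

```python
import re

# Optional wrapper added by a log export: ISO timestamp ending in 'Z', then whitespace.
WRAPPER = re.compile(r"^\d{4}-\d{2}-\d{2}T[\d:.]+Z\s*")
# A 'key = value' field line, with an optional trailing comma.
FIELD = re.compile(r"^([A-Za-z_. ]+?)\s*=\s*(.*?),?$")


def parse_quorum_results(log_text):
    """Return the fields listed after the first 'Quorum results:' log line."""
    fields = {}
    in_block = False
    for raw in log_text.splitlines():
        # Remove the export wrapper and stray quote characters, if any.
        line = WRAPPER.sub("", raw.strip()).strip('"').strip()
        if not in_block:
            in_block = "Quorum results:" in line
            continue
        if not line:
            continue
        match = FIELD.match(line)
        if match is None:
            break  # first non-field line ends the block
        fields[match.group(1).strip()] = match.group(2).strip()
    return fields


def lost_quorum(log_text):
    """True when the parsed quorum result reports a NON-PRIMARY component."""
    return parse_quorum_results(log_text).get("component") == "NON-PRIMARY"


# Sample lines copied from the log excerpt in this report.
SAMPLE_LOG = "\n".join([
    "2021-08-06  7:15:45 0 [Note] WSREP: Quorum results:",
    "\tversion    = 6,",
    "\tcomponent  = NON-PRIMARY,",
    "\tconf_id    = -1,",
    "\tmembers    = 1/1 (joined/total),",
    "\tact_id     = -1,",
    "\tlast_appl. = 245421512,",
    "\tprotocols  = -1/-1/-1 (gcs/repl/appl),",
    "\tvote policy= 1,",
    "\tgroup UUID = 00000000-0000-0000-0000-000000000000",
    "2021-08-06  7:15:45 0 [Note] WSREP: Flow-control interval: [16, 16]",
])

if __name__ == "__main__":
    print(parse_quorum_results(SAMPLE_LOG))
    print("quorum lost:", lost_quorum(SAMPLE_LOG))
```

Only the first `Quorum results:` block is parsed; a log covering several retries would need to be split per attempt first.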
      
      

          People

            Assignee: Jan Lindström (jplindst, Inactive)
            Reporter: Jaroslav (jaroslav)