Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9870

SST fails because request is null

    XMLWordPrintable

Details

    Description

      A user has seen a joiner fail an IST, and then fail the followup SST due to a "null SST request".

      For example, here we see a joiner node request an IST, and then it fails with no explanation:

      160404 13:52:52 [Warning] WSREP: Gap in state sequence. Need state transfer.
      160404 13:52:52 [Note] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
      160404 13:52:52 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      160404 13:52:52 [Note] WSREP: REPL Protocols: 7 (3, 2)
      160404 13:52:53 [Note] WSREP: Service thread queue flushed.
      160404 13:52:53 [Note] WSREP: Assign initial position for certification: 22297049, protocol version: 3
      160404 13:52:53 [Note] WSREP: Service thread queue flushed.
      160404 13:52:53 [Note] WSREP: Prepared IST receiver, listening at: tcp://ip1:4568
      160404 13:52:53 [Note] WSREP: Member 0.0 (node01) requested state transfer from '*any*'. Selected 1.0 (node02)(SYNCED) as donor.
      160404 13:52:53 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 22297056)
      160404 13:52:53 [Note] WSREP: Requesting state transfer: success, donor: 1
      160404 13:52:53 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 434955524 bytes
      160404 13:52:53 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 3/10 locked buffers
      160404 13:52:53 [Note] WSREP: Receiving IST: 288 writesets, seqnos 22296761-22297049
      160404 13:52:55 [Warning] WSREP: 1.0 (node02): State transfer to 0.0 (node01) failed: -125 (Operation canceled)
      160404 13:52:55 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():733: Will never receive state. Need to abort.
      

      And here is the request from the donor's perspective:

      160404 13:52:53 [Note] WSREP: Member 0.0 (node01) requested state transfer from '*any*'. Selected 1.0 (node02)(SYNCED) as donor.
      160404 13:52:53 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 22297056)
      160404 13:52:55 [Note] WSREP: IST request: cca366dc-9938-11e5-9d4b-e7c47e7205aa:22296761-22297049|tcp://ip1:4568
      160404 13:52:55 [Note] WSREP: IST first seqno 22296762 not found from cache, falling back to SST
      160404 13:52:55 [Warning] WSREP: SST request is null, SST canceled.
      160404 13:52:55 [Warning] WSREP: 1.0 (node02): State transfer to 0.0 (node01) failed: -125 (Operation canceled)
      160404 13:52:55 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 22297056)
      160404 13:52:55 [Note] WSREP: Member 1.0 (node02) synced with group.
      160404 13:52:55 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 22297056)
      

      We can see in the above output that it failed because the first seqno was not in the donor's gcache, and then the followup SST failed because it was a "null request".

      Why would the followup SST request be null?

      The joiner successfully SSTed the next time that it restarted. For comparison, here is the joiner's log from that attempt:

      160404 15:55:00 [Warning] WSREP: Gap in state sequence. Need state transfer.
      160404 15:55:00 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address 'ip1' --datadir '/db/mysql_data/node01/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '26110' --binlog '/db/mysql_binlog/node01/master-bin' '
      WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20160404 15:55:01.299)
      160404 15:55:01 [Note] WSREP: Prepared SST request: xtrabackup-v2|ip1:4444/xtrabackup_sst//1
      160404 15:55:01 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      160404 15:55:01 [Note] WSREP: REPL Protocols: 7 (3, 2)
      160404 15:55:01 [Note] WSREP: Service thread queue flushed.
      160404 15:55:01 [Note] WSREP: Assign initial position for certification: 22298758, protocol version: 3
      160404 15:55:01 [Note] WSREP: Service thread queue flushed.
      160404 15:55:01 [Note] WSREP: Prepared IST receiver, listening at: tcp://ip1:4568
      160404 15:55:01 [Note] WSREP: Member 1.0 (node01) requested state transfer from '*any*'. Selected 0.0 (node02)(SYNCED) as donor.
      160404 15:55:01 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 22298758)
      160404 15:55:01 [Note] WSREP: Requesting state transfer: success, donor: 0
      

      And here is the donor's log from that attempt:

      160404 15:55:01 [Note] WSREP: Member 1.0 (node01) requested state transfer from '*any*'. Selected 0.0 (node02)(SYNCED) as donor.
      160404 15:55:01 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 22298758)
      160404 15:55:01 [Note] WSREP: IST request: cca366dc-9938-11e5-9d4b-e7c47e7205aa:22296761-22298758|tcp://ip1:4568
      160404 15:55:01 [Note] WSREP: IST first seqno 22296762 not found from cache, falling back to SST
      160404 15:55:01 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      160404 15:55:01 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address 'ip1:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/db/mysql_data/node02/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix ''  --binlog '/db/mysql_binlog/node02/master-bin' --gtid 'cca366dc-9938-11e5-9d4b-e7c47e7205aa:22298758''
      160404 15:55:01 [Note] WSREP: sst_donor_thread signaled with 0
      WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20160404 15:55:01.929)
      160404 15:55:03 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') turning message relay requesting off
      160404 16:08:12 [Note] WSREP: Provider paused at cca366dc-9938-11e5-9d4b-e7c47e7205aa:22298931 (1554650)
      160404 16:08:13 [Note] WSREP: resuming provider at 1554650
      160404 16:08:13 [Note] WSREP: Provider resumed.
      160404 16:08:31 [Note] WSREP: 0.0 (node02): State transfer to 1.0 (node01) complete.
      160404 16:08:31 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 22298934)
      160404 16:08:31 [Note] WSREP: Member 0.0 (node02) synced with group.
      160404 16:08:31 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 22298934)
      

      Attachments

        Activity

          People

            nirbhay_c Nirbhay Choubey (Inactive)
            GeoffMontee Geoff Montee (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.