Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Not a Bug
-
10.0.24-galera
Description
A user has seen a joiner fail an IST, and then fail the followup SST due to a "null SST request".
For example, here we see a joiner node request an IST, and then it fails with no explanation:
160404 13:52:52 [Warning] WSREP: Gap in state sequence. Need state transfer.
|
160404 13:52:52 [Note] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
|
160404 13:52:52 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
|
160404 13:52:52 [Note] WSREP: REPL Protocols: 7 (3, 2)
|
160404 13:52:53 [Note] WSREP: Service thread queue flushed.
|
160404 13:52:53 [Note] WSREP: Assign initial position for certification: 22297049, protocol version: 3
|
160404 13:52:53 [Note] WSREP: Service thread queue flushed.
|
160404 13:52:53 [Note] WSREP: Prepared IST receiver, listening at: tcp://ip1:4568
|
160404 13:52:53 [Note] WSREP: Member 0.0 (node01) requested state transfer from '*any*'. Selected 1.0 (node02)(SYNCED) as donor.
|
160404 13:52:53 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 22297056)
|
160404 13:52:53 [Note] WSREP: Requesting state transfer: success, donor: 1
|
160404 13:52:53 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 434955524 bytes
|
160404 13:52:53 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 3/10 locked buffers
|
160404 13:52:53 [Note] WSREP: Receiving IST: 288 writesets, seqnos 22296761-22297049
|
160404 13:52:55 [Warning] WSREP: 1.0 (node02): State transfer to 0.0 (node01) failed: -125 (Operation canceled)
|
160404 13:52:55 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():733: Will never receive state. Need to abort.
|
And here is the request from the donor's perspective:
160404 13:52:53 [Note] WSREP: Member 0.0 (node01) requested state transfer from '*any*'. Selected 1.0 (node02)(SYNCED) as donor.
|
160404 13:52:53 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 22297056)
|
160404 13:52:55 [Note] WSREP: IST request: cca366dc-9938-11e5-9d4b-e7c47e7205aa:22296761-22297049|tcp://ip1:4568
|
160404 13:52:55 [Note] WSREP: IST first seqno 22296762 not found from cache, falling back to SST
|
160404 13:52:55 [Warning] WSREP: SST request is null, SST canceled.
|
160404 13:52:55 [Warning] WSREP: 1.0 (node02): State transfer to 0.0 (node01) failed: -125 (Operation canceled)
|
160404 13:52:55 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 22297056)
|
160404 13:52:55 [Note] WSREP: Member 1.0 (node02) synced with group.
|
160404 13:52:55 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 22297056)
|
We can see in the above output that it failed because the first seqno was not in the donor's gcache, and then the followup SST failed because it was a "null request".
Why would the followup SST request be null?
The joiner successfully SSTed the next time that it restarted. For comparison, here is the joiner's log from that attempt:
160404 15:55:00 [Warning] WSREP: Gap in state sequence. Need state transfer.
|
160404 15:55:00 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address 'ip1' --datadir '/db/mysql_data/node01/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '26110' --binlog '/db/mysql_binlog/node01/master-bin' '
|
WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20160404 15:55:01.299)
|
160404 15:55:01 [Note] WSREP: Prepared SST request: xtrabackup-v2|ip1:4444/xtrabackup_sst//1
|
160404 15:55:01 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
|
160404 15:55:01 [Note] WSREP: REPL Protocols: 7 (3, 2)
|
160404 15:55:01 [Note] WSREP: Service thread queue flushed.
|
160404 15:55:01 [Note] WSREP: Assign initial position for certification: 22298758, protocol version: 3
|
160404 15:55:01 [Note] WSREP: Service thread queue flushed.
|
160404 15:55:01 [Note] WSREP: Prepared IST receiver, listening at: tcp://ip1:4568
|
160404 15:55:01 [Note] WSREP: Member 1.0 (node01) requested state transfer from '*any*'. Selected 0.0 (node02)(SYNCED) as donor.
|
160404 15:55:01 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 22298758)
|
160404 15:55:01 [Note] WSREP: Requesting state transfer: success, donor: 0
|
And here is the donor's log from that attempt:
160404 15:55:01 [Note] WSREP: Member 1.0 (node01) requested state transfer from '*any*'. Selected 0.0 (node02)(SYNCED) as donor.
|
160404 15:55:01 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 22298758)
|
160404 15:55:01 [Note] WSREP: IST request: cca366dc-9938-11e5-9d4b-e7c47e7205aa:22296761-22298758|tcp://ip1:4568
|
160404 15:55:01 [Note] WSREP: IST first seqno 22296762 not found from cache, falling back to SST
|
160404 15:55:01 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
|
160404 15:55:01 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address 'ip1:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/db/mysql_data/node02/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --binlog '/db/mysql_binlog/node02/master-bin' --gtid 'cca366dc-9938-11e5-9d4b-e7c47e7205aa:22298758''
|
160404 15:55:01 [Note] WSREP: sst_donor_thread signaled with 0
|
WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20160404 15:55:01.929)
|
160404 15:55:03 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') turning message relay requesting off
|
160404 16:08:12 [Note] WSREP: Provider paused at cca366dc-9938-11e5-9d4b-e7c47e7205aa:22298931 (1554650)
|
160404 16:08:13 [Note] WSREP: resuming provider at 1554650
|
160404 16:08:13 [Note] WSREP: Provider resumed.
|
160404 16:08:31 [Note] WSREP: 0.0 (node02): State transfer to 1.0 (node01) complete.
|
160404 16:08:31 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 22298934)
|
160404 16:08:31 [Note] WSREP: Member 0.0 (node02) synced with group.
|
160404 16:08:31 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 22298934)
|