Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21434

WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1178: Will never receive state

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.4.11
    • Fix Version/s: 10.4
    • Component/s: Galera
    • Labels:
      None

      Description

      Okay. Now I'm filing this bug and attempt to settle this issue once and for all. We have a two-node cluster. I reboot node number 1. MariaDB does the GCache-Recovery etc. Node comes up. Great. I now reboot node number 2. This basically never works. Once in a blue moon perhaps. Most of the time this crashes.

      Every once in a a while, for example, a new Linux kernel is released and I reboot the nodes. Time and again, it just doesn't work properly. The first node yes. The second: no.

      Here is the log of a recent attempt:

      Jan 07 18:54:46 yak1 mysqld[1041]: WSREP: Recovered position a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] /usr/sbin/mysqld (mysqld 10.4.11-MariaDB-1:10.4.11+maria~eoan) starting as process 1412 ...
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:54:47 yak1 mysqld[1412]: wsrep loader: [INFO] wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
      Jan 07 18:54:47 yak1 mysqld[1412]: wsrep loader: [INFO] wsrep_load(): Galera 26.4.3(r4535) by Codership Oy <info@codership.com> loaded successfully.
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: CRC-32C: using hardware acceleration.
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: Found saved state: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770, safe_to_bootstrap: 0
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: GCache DEBUG: opened preamble:
      Jan 07 18:54:47 yak1 mysqld[1412]: Version: 2
      Jan 07 18:54:47 yak1 mysqld[1412]: UUID: a6a91495-1b7f-11ea-a0be-7f78e01c49de
      Jan 07 18:54:47 yak1 mysqld[1412]: Seqno: 312018801 - 312608770
      Jan 07 18:54:47 yak1 mysqld[1412]: Offset: 5086016768
      Jan 07 18:54:47 yak1 mysqld[1412]: Synced: 1
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: a6a91495-1b7f-11ea-a0be-7f78e01c49de, offset: 5086016768
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (          0/10737418264 bytes) complete.
      Jan 07 18:54:57 yak1 mysqld[1412]: 2020-01-07 18:54:57 0 [Note] WSREP: GCache::RingBuffer initial scan... 40.6% ( 4364017832/10737418264 bytes) complete.
      Jan 07 18:55:07 yak1 mysqld[1412]: 2020-01-07 18:55:07 0 [Note] WSREP: GCache::RingBuffer initial scan... 81.1% ( 8710618424/10737418264 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (10737418264/10737418264 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 312018801-312608770
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (          0/10737370056 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (10737370056/10737370056 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 1/589971 locked buffers
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 10737370056/10737418240
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 159.69.77.51; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 10G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = f
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Service thread queue flushed.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: ####### Assign initial position for certification: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770, protocol version: -1
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Start replication
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Connecting with bootstrap option: 0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Setting GCS initial position to a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: protonet asio version 0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Using CRC-32C for message checksums.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: backend: asio
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: restore pc from disk failed
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GMCast version 0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: EVS version 1
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: gcomm: connecting to group 'yak_cluster', peer '192.168.0.1:,192.168.0.2:'
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://192.168.0.1:4567
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') connection established to 99c55b98 tcp://192.168.0.2:4567
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: EVS version upgrade 0 -> 1
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: declaring 99c55b98 at tcp://192.168.0.2:4567 stable
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: PC protocol upgrade 0 -> 1
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: Node 99c55b98 state prim
      Jan 07 18:55:14 yak1 mysqld[1412]: 2020-01-07 18:55:14 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting off
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: view(view_id(PRIM,99c55b98,41) memb {
      Jan 07 18:55:15 yak1 mysqld[1412]:         99c55b98,0
      Jan 07 18:55:15 yak1 mysqld[1412]:         d9c30ef3,0
      Jan 07 18:55:15 yak1 mysqld[1412]: } joined {
      Jan 07 18:55:15 yak1 mysqld[1412]: } left {
      Jan 07 18:55:15 yak1 mysqld[1412]: } partitioned {
      Jan 07 18:55:15 yak1 mysqld[1412]: })
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: save pc into disk
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.2:4567
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: gcomm: connected
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: Opened channel 'yak_cluster'
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 1 [Note] WSREP: Starting rollbacker thread 1
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 2 [Note] WSREP: Starting applier thread 2
      Jan 07 18:55:16 yak1 mysqld[1412]: 2020-01-07 18:55:16 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') connection established to 99c55b98 tcp://192.168.0.2:4567
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Warning] WSREP: Action message in non-primary configuration from member 0
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: STATE EXCHANGE: sent state msg: dbdd8160-3176-11ea-98d6-7602d2526dca
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: STATE EXCHANGE: got state msg: dbdd8160-3176-11ea-98d6-7602d2526dca from 0 (yak2)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: STATE EXCHANGE: got state msg: dbdd8160-3176-11ea-98d6-7602d2526dca from 1 (yak1)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Quorum results:
      Jan 07 18:55:17 yak1 mysqld[1412]:         version    = 6,
      Jan 07 18:55:17 yak1 mysqld[1412]:         component  = PRIMARY,
      Jan 07 18:55:17 yak1 mysqld[1412]:         conf_id    = 11,
      Jan 07 18:55:17 yak1 mysqld[1412]:         members    = 1/2 (joined/total),
      Jan 07 18:55:17 yak1 mysqld[1412]:         act_id     = 312631833,
      Jan 07 18:55:17 yak1 mysqld[1412]:         last_appl. = 312608659,
      Jan 07 18:55:17 yak1 mysqld[1412]:         protocols  = 2/10/4 (gcs/repl/appl),
      Jan 07 18:55:17 yak1 mysqld[1412]:         vote policy= 0,
      Jan 07 18:55:17 yak1 mysqld[1412]:         group UUID = a6a91495-1b7f-11ea-a0be-7f78e01c49de
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Flow-control interval: [23, 23]
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 312631834)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: ####### processing CC 312631834, local, ordered
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: REPL Protocols: 10 (5, 3)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: ####### My UUID: d9c30ef3-3176-11ea-95a0-422096815619
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: Server yak1 connected to cluster at position a6a91495-1b7f-11ea-a0be-7f78e01c49de:312631834 with ID d9c30ef3-3176-11ea-95a0-422096815619
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: Server status change disconnected -> connected
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: State transfer required:
      Jan 07 18:55:17 yak1 mysqld[1412]:         Group state: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312631834
      Jan 07 18:55:17 yak1 mysqld[1412]:         Local state: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: Server status change connected -> joiner
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '159.69.77.51' --datadir '/var/lib/mysql/' --parent '1412' --mysqld-args --wsrep_start_position=a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770'
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
      Jan 07 18:55:17 yak1 mysqld[1412]: WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20200107 18:55:17.590)
      Jan 07 18:55:19 yak1 mysqld[1412]: 2020-01-07 18:55:19 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting off
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Prepared SST request: mariabackup|159.69.77.51:4444/xtrabackup_sst//1
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: Service thread queue flushed.
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: ####### IST uuid:a6a91495-1b7f-11ea-a0be-7f78e01c49de f: 312608771, l: 312631834, STRv: 3
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: IST receiver addr using tcp://159.69.77.51:4568
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Prepared IST receiver for 312608771-312631834, listening at: tcp://159.69.77.51:4568
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: Member 1.0 (yak1) requested state transfer from '*any*'. Selected 0.0 (yak2)(SYNCED) as donor.
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 312632054)
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Requesting state transfer: success, donor: 0
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Warning] WSREP: 0.0 (yak2): State transfer to 1.0 (yak1) failed: -42 (No message of desired type)
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1178: Will never receive state. Need to abort.
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: gcomm: terminating thread
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: gcomm: joining thread
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: gcomm: closing backend
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: view(view_id(NON_PRIM,99c55b98,41) memb {
      Jan 07 18:55:22 yak1 mysqld[1412]:         d9c30ef3,0
      Jan 07 18:55:22 yak1 mysqld[1412]: } joined {
      Jan 07 18:55:22 yak1 mysqld[1412]: } left {
      Jan 07 18:55:22 yak1 mysqld[1412]: } partitioned {
      Jan 07 18:55:22 yak1 mysqld[1412]:         99c55b98,0
      Jan 07 18:55:22 yak1 mysqld[1412]: })
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: PC protocol downgrade 1 -> 0
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: view((empty))
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: gcomm: closed
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
      

      If I try to start the node again, I end up with cluster state: INCONSISTENT. The only way to get out of the situation is to wipe /var/lib/mysql on that node and restart from scratch. In this case the SST is done and it finally syncs.

      I'd really like to solve this because it's quite annoying as you can probably imagine. Let me know if you need more info. It's a live server though, so I cannot mess with it at will. For starters, I'd really like to know what the obnoxious "Will never receive state" means:

      WSREP: 0.0 (yak2): State transfer to 1.0 (yak1) failed: -42 (No message of desired type)
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1178: Will never receive state. 
      

      And what is "No message of desired type"? What can I do about it?

        Attachments

          Activity

            People

            Assignee:
            jplindst Jan Lindström
            Reporter:
            rgpublic Ranjan Ghosh
            Votes:
            1 Vote for this issue
            Watchers:
            2 Start watching this issue

              Dates

              Created:
              Updated: