Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21434

WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1178: Will never receive state

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Cannot Reproduce
    • 10.4.11, 10.4.12
    • N/A
    • Galera

    Description

      Okay. Now I'm filing this bug and attempt to settle this issue once and for all. We have a two-node cluster. I reboot node number 1. MariaDB does the GCache-Recovery etc. Node comes up. Great. I now reboot node number 2. This basically never works. Once in a blue moon perhaps. Most of the time this crashes.

      Every once in a a while, for example, a new Linux kernel is released and I reboot the nodes. Time and again, it just doesn't work properly. The first node yes. The second: no.

      Here is the log of a recent attempt:

      Jan 07 18:54:46 yak1 mysqld[1041]: WSREP: Recovered position a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] /usr/sbin/mysqld (mysqld 10.4.11-MariaDB-1:10.4.11+maria~eoan) starting as process 1412 ...
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:54:47 yak1 mysqld[1412]: wsrep loader: [INFO] wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
      Jan 07 18:54:47 yak1 mysqld[1412]: wsrep loader: [INFO] wsrep_load(): Galera 26.4.3(r4535) by Codership Oy <info@codership.com> loaded successfully.
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: CRC-32C: using hardware acceleration.
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: Found saved state: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770, safe_to_bootstrap: 0
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: GCache DEBUG: opened preamble:
      Jan 07 18:54:47 yak1 mysqld[1412]: Version: 2
      Jan 07 18:54:47 yak1 mysqld[1412]: UUID: a6a91495-1b7f-11ea-a0be-7f78e01c49de
      Jan 07 18:54:47 yak1 mysqld[1412]: Seqno: 312018801 - 312608770
      Jan 07 18:54:47 yak1 mysqld[1412]: Offset: 5086016768
      Jan 07 18:54:47 yak1 mysqld[1412]: Synced: 1
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: a6a91495-1b7f-11ea-a0be-7f78e01c49de, offset: 5086016768
      Jan 07 18:54:47 yak1 mysqld[1412]: 2020-01-07 18:54:47 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (          0/10737418264 bytes) complete.
      Jan 07 18:54:57 yak1 mysqld[1412]: 2020-01-07 18:54:57 0 [Note] WSREP: GCache::RingBuffer initial scan... 40.6% ( 4364017832/10737418264 bytes) complete.
      Jan 07 18:55:07 yak1 mysqld[1412]: 2020-01-07 18:55:07 0 [Note] WSREP: GCache::RingBuffer initial scan... 81.1% ( 8710618424/10737418264 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (10737418264/10737418264 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 312018801-312608770
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (          0/10737370056 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (10737370056/10737370056 bytes) complete.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 1/589971 locked buffers
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 10737370056/10737418240
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 159.69.77.51; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 10G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = f
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Service thread queue flushed.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: ####### Assign initial position for certification: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770, protocol version: -1
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Start replication
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Connecting with bootstrap option: 0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Setting GCS initial position to a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: protonet asio version 0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: Using CRC-32C for message checksums.
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: backend: asio
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: restore pc from disk failed
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: GMCast version 0
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: EVS version 1
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: gcomm: connecting to group 'yak_cluster', peer '192.168.0.1:,192.168.0.2:'
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://192.168.0.1:4567
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') connection established to 99c55b98 tcp://192.168.0.2:4567
      Jan 07 18:55:11 yak1 mysqld[1412]: 2020-01-07 18:55:11 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: EVS version upgrade 0 -> 1
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: declaring 99c55b98 at tcp://192.168.0.2:4567 stable
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: PC protocol upgrade 0 -> 1
      Jan 07 18:55:12 yak1 mysqld[1412]: 2020-01-07 18:55:12 0 [Note] WSREP: Node 99c55b98 state prim
      Jan 07 18:55:14 yak1 mysqld[1412]: 2020-01-07 18:55:14 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting off
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: view(view_id(PRIM,99c55b98,41) memb {
      Jan 07 18:55:15 yak1 mysqld[1412]:         99c55b98,0
      Jan 07 18:55:15 yak1 mysqld[1412]:         d9c30ef3,0
      Jan 07 18:55:15 yak1 mysqld[1412]: } joined {
      Jan 07 18:55:15 yak1 mysqld[1412]: } left {
      Jan 07 18:55:15 yak1 mysqld[1412]: } partitioned {
      Jan 07 18:55:15 yak1 mysqld[1412]: })
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: save pc into disk
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.2:4567
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: gcomm: connected
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: Opened channel 'yak_cluster'
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 1 [Note] WSREP: Starting rollbacker thread 1
      Jan 07 18:55:15 yak1 mysqld[1412]: 2020-01-07 18:55:15 2 [Note] WSREP: Starting applier thread 2
      Jan 07 18:55:16 yak1 mysqld[1412]: 2020-01-07 18:55:16 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') connection established to 99c55b98 tcp://192.168.0.2:4567
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Warning] WSREP: Action message in non-primary configuration from member 0
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: STATE EXCHANGE: sent state msg: dbdd8160-3176-11ea-98d6-7602d2526dca
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: STATE EXCHANGE: got state msg: dbdd8160-3176-11ea-98d6-7602d2526dca from 0 (yak2)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: STATE EXCHANGE: got state msg: dbdd8160-3176-11ea-98d6-7602d2526dca from 1 (yak1)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Quorum results:
      Jan 07 18:55:17 yak1 mysqld[1412]:         version    = 6,
      Jan 07 18:55:17 yak1 mysqld[1412]:         component  = PRIMARY,
      Jan 07 18:55:17 yak1 mysqld[1412]:         conf_id    = 11,
      Jan 07 18:55:17 yak1 mysqld[1412]:         members    = 1/2 (joined/total),
      Jan 07 18:55:17 yak1 mysqld[1412]:         act_id     = 312631833,
      Jan 07 18:55:17 yak1 mysqld[1412]:         last_appl. = 312608659,
      Jan 07 18:55:17 yak1 mysqld[1412]:         protocols  = 2/10/4 (gcs/repl/appl),
      Jan 07 18:55:17 yak1 mysqld[1412]:         vote policy= 0,
      Jan 07 18:55:17 yak1 mysqld[1412]:         group UUID = a6a91495-1b7f-11ea-a0be-7f78e01c49de
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Flow-control interval: [23, 23]
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 312631834)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: ####### processing CC 312631834, local, ordered
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: REPL Protocols: 10 (5, 3)
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: ####### My UUID: d9c30ef3-3176-11ea-95a0-422096815619
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: Server yak1 connected to cluster at position a6a91495-1b7f-11ea-a0be-7f78e01c49de:312631834 with ID d9c30ef3-3176-11ea-95a0-422096815619
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: Server status change disconnected -> connected
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: State transfer required:
      Jan 07 18:55:17 yak1 mysqld[1412]:         Group state: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312631834
      Jan 07 18:55:17 yak1 mysqld[1412]:         Local state: a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: Server status change connected -> joiner
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address '159.69.77.51' --datadir '/var/lib/mysql/' --parent '1412' --mysqld-args --wsrep_start_position=a6a91495-1b7f-11ea-a0be-7f78e01c49de:312608770'
      Jan 07 18:55:17 yak1 mysqld[1412]: 2020-01-07 18:55:17 0 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes
      Jan 07 18:55:17 yak1 mysqld[1412]: WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20200107 18:55:17.590)
      Jan 07 18:55:19 yak1 mysqld[1412]: 2020-01-07 18:55:19 0 [Note] WSREP: (d9c30ef3, 'tcp://0.0.0.0:4567') turning message relay requesting off
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Prepared SST request: mariabackup|159.69.77.51:4444/xtrabackup_sst//1
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: Service thread queue flushed.
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: ####### IST uuid:a6a91495-1b7f-11ea-a0be-7f78e01c49de f: 312608771, l: 312631834, STRv: 3
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: IST receiver addr using tcp://159.69.77.51:4568
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Prepared IST receiver for 312608771-312631834, listening at: tcp://159.69.77.51:4568
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: Member 1.0 (yak1) requested state transfer from '*any*'. Selected 0.0 (yak2)(SYNCED) as donor.
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 312632054)
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 2 [Note] WSREP: Requesting state transfer: success, donor: 0
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Warning] WSREP: 0.0 (yak2): State transfer to 1.0 (yak1) failed: -42 (No message of desired type)
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1178: Will never receive state. Need to abort.
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: gcomm: terminating thread
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: gcomm: joining thread
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [Note] WSREP: gcomm: closing backend
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: view(view_id(NON_PRIM,99c55b98,41) memb {
      Jan 07 18:55:22 yak1 mysqld[1412]:         d9c30ef3,0
      Jan 07 18:55:22 yak1 mysqld[1412]: } joined {
      Jan 07 18:55:22 yak1 mysqld[1412]: } left {
      Jan 07 18:55:22 yak1 mysqld[1412]: } partitioned {
      Jan 07 18:55:22 yak1 mysqld[1412]:         99c55b98,0
      Jan 07 18:55:22 yak1 mysqld[1412]: })
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: PC protocol downgrade 1 -> 0
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: view((empty))
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: gcomm: closed
      Jan 07 18:55:22 yak1 mysqld[1412]: 2020-01-07 18:55:22 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
      

      If I try to start the node again, I end up with cluster state: INCONSISTENT. The only way to get out of the situation is to wipe /var/lib/mysql on that node and restart from scratch. In this case the SST is done and it finally syncs.

      I'd really like to solve this because it's quite annoying as you can probably imagine. Let me know if you need more info. It's a live server though, so I cannot mess with it at will. For starters, I'd really like to know what the obnoxious "Will never receive state" means:

      WSREP: 0.0 (yak2): State transfer to 1.0 (yak1) failed: -42 (No message of desired type)
      Jan 07 18:55:21 yak1 mysqld[1412]: 2020-01-07 18:55:21 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1178: Will never receive state. 
      

      And what is "No message of desired type"? What can I do about it?

      Attachments

        Activity

          rgpublic Ranjan Ghosh added a comment -

          A new MariaDB version comes out. I try to update. Again: I just doesn't work. This time even wiping /var/lib/mysql and retrying multiple times didn't help. I needed to switch back to SST method "RSYNC" which worked immediately. I have the feeling that the mariabackup-SST-method is still somewhat flaky and unreliable. Interestingly, I cannot find any documentation on what wsrep status "INCONSISTENT" means. How can this happen? What can be done to avoid ending up in this state when starting up the second node? Bad thing is, when the node you are just trying to start ends up in "INCONSIStTENT", the first node which was happily running before, crashes as well. After 2 days I looked retrospectively at the logs. I tried a lot and restarted multiple times so the logs are huge. I tried to find sth. in there that might be striking. I found sth. (see below). Also very mysterious... "unallowed state transition: joined -> joined" (?). wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state): Assertion `0' failed. (?)

           

          Jan 29 18:16:15 yak2 -innobackupex-backup[26292]: [00] 2020-01-29 18:16:15 Redo log (from LSN 12315939239368 to 12316031160521) was copied.
          Jan 29 18:16:15 yak2 -innobackupex-backup[26292]: [00] 2020-01-29 18:16:15 completed OK!
          Jan 29 18:16:15 yak2 mysqld[1285]: 2020-01-29 18:16:15 0 [Note] WSREP: SST sent: a6a91495-1b7f-11ea-a0be-7f78e01c49de:570788225
          Jan 29 18:16:15 yak2 mysqld[1285]: 2020-01-29 18:16:15 0 [Warning] WSREP: server: yak2 unallowed state transition: joined -> joined
          Jan 29 18:16:15 yak2 mysqld[1285]: mysqld: /home/buildbot/buildbot/build/mariadb-10.4.12/wsrep-lib/src/server_state.cpp:1356: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state): Assertion `0' failed.
          Jan 29 18:16:15 yak2 mysqld[1285]: 200129 18:16:15 [ERROR] mysqld got signal 6 ;
          Jan 29 18:16:15 yak2 mysqld[1285]: This could be because you hit a bug. It is also possible that this binary
          Jan 29 18:16:15 yak2 mysqld[1285]: or one of the libraries it was linked against is corrupt, improperly built,
          Jan 29 18:16:15 yak2 mysqld[1285]: or misconfigured. This error can also be caused by malfunctioning hardware.
          Jan 29 18:16:15 yak2 mysqld[1285]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
          Jan 29 18:16:15 yak2 mysqld[1285]: We will try our best to scrape up some info that will hopefully help
          Jan 29 18:16:15 yak2 mysqld[1285]: diagnose the problem, but since we have already crashed,
          Jan 29 18:16:15 yak2 mysqld[1285]: something is definitely wrong and this may fail.
          Jan 29 18:16:15 yak2 mysqld[1285]: Server version: 10.4.12-MariaDB-1:10.4.12+maria~eoan
          Jan 29 18:16:15 yak2 mysqld[1285]: key_buffer_size=134217728
          Jan 29 18:16:15 yak2 mysqld[1285]: read_buffer_size=131072
          Jan 29 18:16:15 yak2 mysqld[1285]: max_used_connections=108
          Jan 29 18:16:15 yak2 mysqld[1285]: max_threads=65537
          Jan 29 18:16:15 yak2 mysqld[1285]: thread_count=10
          Jan 29 18:16:15 yak2 mysqld[1285]: It is possible that mysqld could use up to
          Jan 29 18:16:15 yak2 mysqld[1285]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142751964 K bytes of memory
          Jan 29 18:16:15 yak2 mysqld[1285]: Hope that's ok; if not, decrease some variables in the equation.
          Jan 29 18:16:15 yak2 mysqld[1285]: Thread pointer: 0x7f0128000d28
          Jan 29 18:16:15 yak2 mysqld[1285]: Attempting backtrace. You can use the following information to find out
          Jan 29 18:16:15 yak2 mysqld[1285]: where mysqld died. If you see no messages after this, something went
          Jan 29 18:16:15 yak2 mysqld[1285]: terribly wrong...
          Jan 29 18:16:15 yak2 mysqld[1285]: stack_bottom = 0x7f0309029728 thread_stack 0x49000
          Jan 29 18:16:15 yak2 -wsrep-sst-donor[26831]: Total time on donor: 0 seconds
          Jan 29 18:16:15 yak2 -wsrep-sst-donor[26833]: Cleaning up temporary directories
          Jan 29 18:16:15 yak2 mysqld[1285]: *** buffer overflow detected ***: /usr/sbin/mysqld terminated
          Jan 29 18:16:16 yak2 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
          {code\

          rgpublic Ranjan Ghosh added a comment - A new MariaDB version comes out. I try to update. Again: I just doesn't work. This time even wiping /var/lib/mysql and retrying multiple times didn't help. I needed to switch back to SST method "RSYNC" which worked immediately. I have the feeling that the mariabackup-SST-method is still somewhat flaky and unreliable. Interestingly, I cannot find any documentation on what wsrep status "INCONSISTENT" means. How can this happen? What can be done to avoid ending up in this state when starting up the second node? Bad thing is, when the node you are just trying to start ends up in "INCONSIStTENT", the first node which was happily running before, crashes as well. After 2 days I looked retrospectively at the logs. I tried a lot and restarted multiple times so the logs are huge. I tried to find sth. in there that might be striking. I found sth. (see below). Also very mysterious... "unallowed state transition: joined -> joined" (? ). wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state): Assertion `0' failed. ( ?)   Jan 29 18:16:15 yak2 -innobackupex-backup [26292] : [00] 2020-01-29 18:16:15 Redo log (from LSN 12315939239368 to 12316031160521) was copied. Jan 29 18:16:15 yak2 -innobackupex-backup [26292] : [00] 2020-01-29 18:16:15 completed OK! Jan 29 18:16:15 yak2 mysqld [1285] : 2020-01-29 18:16:15 0 [Note] WSREP: SST sent: a6a91495-1b7f-11ea-a0be-7f78e01c49de:570788225 Jan 29 18:16:15 yak2 mysqld [1285] : 2020-01-29 18:16:15 0 [Warning] WSREP: server: yak2 unallowed state transition: joined -> joined Jan 29 18:16:15 yak2 mysqld [1285] : mysqld: /home/buildbot/buildbot/build/mariadb-10.4.12/wsrep-lib/src/server_state.cpp:1356: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state): Assertion `0' failed. Jan 29 18:16:15 yak2 mysqld [1285] : 200129 18:16:15 [ERROR] mysqld got signal 6 ; Jan 29 18:16:15 yak2 mysqld [1285] : This could be because you hit a bug. It is also possible that this binary Jan 29 18:16:15 yak2 mysqld [1285] : or one of the libraries it was linked against is corrupt, improperly built, Jan 29 18:16:15 yak2 mysqld [1285] : or misconfigured. This error can also be caused by malfunctioning hardware. Jan 29 18:16:15 yak2 mysqld [1285] : To report this bug, see https://mariadb.com/kb/en/reporting-bugs Jan 29 18:16:15 yak2 mysqld [1285] : We will try our best to scrape up some info that will hopefully help Jan 29 18:16:15 yak2 mysqld [1285] : diagnose the problem, but since we have already crashed, Jan 29 18:16:15 yak2 mysqld [1285] : something is definitely wrong and this may fail. Jan 29 18:16:15 yak2 mysqld [1285] : Server version: 10.4.12-MariaDB-1:10.4.12+maria~eoan Jan 29 18:16:15 yak2 mysqld [1285] : key_buffer_size=134217728 Jan 29 18:16:15 yak2 mysqld [1285] : read_buffer_size=131072 Jan 29 18:16:15 yak2 mysqld [1285] : max_used_connections=108 Jan 29 18:16:15 yak2 mysqld [1285] : max_threads=65537 Jan 29 18:16:15 yak2 mysqld [1285] : thread_count=10 Jan 29 18:16:15 yak2 mysqld [1285] : It is possible that mysqld could use up to Jan 29 18:16:15 yak2 mysqld [1285] : key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142751964 K bytes of memory Jan 29 18:16:15 yak2 mysqld [1285] : Hope that's ok; if not, decrease some variables in the equation. Jan 29 18:16:15 yak2 mysqld [1285] : Thread pointer: 0x7f0128000d28 Jan 29 18:16:15 yak2 mysqld [1285] : Attempting backtrace. You can use the following information to find out Jan 29 18:16:15 yak2 mysqld [1285] : where mysqld died. If you see no messages after this, something went Jan 29 18:16:15 yak2 mysqld [1285] : terribly wrong... Jan 29 18:16:15 yak2 mysqld [1285] : stack_bottom = 0x7f0309029728 thread_stack 0x49000 Jan 29 18:16:15 yak2 -wsrep-sst-donor [26831] : Total time on donor: 0 seconds Jan 29 18:16:15 yak2 -wsrep-sst-donor [26833] : Cleaning up temporary directories Jan 29 18:16:15 yak2 mysqld [1285] : *** buffer overflow detected ***: /usr/sbin/mysqld terminated Jan 29 18:16:16 yak2 systemd [1] : mariadb.service: Main process exited, code=killed, status=6/ABRT {code\

          Can someone share cluster configuration from all nodes and full unedited error log from both donor and joiner, finally all possible mariabackup and sst log files.

          jplindst Jan Lindström (Inactive) added a comment - - edited Can someone share cluster configuration from all nodes and full unedited error log from both donor and joiner, finally all possible mariabackup and sst log files.
          rgpublic Ranjan Ghosh added a comment -

          Everything should be in syslog. I could provide the full syslog of both nodes from that hour (18:00 till 19:00). How can I provide this to you in a secure way as it contains sensitive data?

          rgpublic Ranjan Ghosh added a comment - Everything should be in syslog. I could provide the full syslog of both nodes from that hour (18:00 till 19:00). How can I provide this to you in a secure way as it contains sensitive data?
          rgpublic Ranjan Ghosh added a comment -

          Ah, sorry, I see I can only recover the full log from the second node. It's just too long ago. But perhaps it still contains valuable information... Let me know how I can send it securely.

          rgpublic Ranjan Ghosh added a comment - Ah, sorry, I see I can only recover the full log from the second node. It's just too long ago. But perhaps it still contains valuable information... Let me know how I can send it securely.
          rgpublic Ranjan Ghosh added a comment -

          I cannot prove it, because I don't have the logs anymore, but I think since MDEV-21002 has been fixed, the first bug I mentioned here ("Will never receive state") has gone away. I haven't seen it anymore. But MDEV-22796 still means that mariabackup SST method is extremely unstable for me. But I think this one here could perhaps be closed as DUPLICATE.

          rgpublic Ranjan Ghosh added a comment - I cannot prove it, because I don't have the logs anymore, but I think since MDEV-21002 has been fixed, the first bug I mentioned here ("Will never receive state") has gone away. I haven't seen it anymore. But MDEV-22796 still means that mariabackup SST method is extremely unstable for me. But I think this one here could perhaps be closed as DUPLICATE.
          Yurchenko Alexey added a comment -

          Yes, it would be nice to have error log from the donor to be CERTAIN of what has happened, but judging by error 42 donor could not find required writeset from the cache. It is likely due to a GCache bug that was fixed in commit 9d17c341d98c6c00d51a6372076038eb210483c8, which is part of Galera release 25.3.30/26.4.5

          I'm quite certain that this is the case here.

          Yurchenko Alexey added a comment - Yes, it would be nice to have error log from the donor to be CERTAIN of what has happened, but judging by error 42 donor could not find required writeset from the cache. It is likely due to a GCache bug that was fixed in commit 9d17c341d98c6c00d51a6372076038eb210483c8, which is part of Galera release 25.3.30/26.4.5 I'm quite certain that this is the case here.

          People

            Yurchenko Alexey
            rgpublic Ranjan Ghosh
            Votes:
            1 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.