Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15399

Galera catches exception and terminates, but MariaDB keeps going

    XMLWordPrintable

Details

    Description

      A user encountered the following sequence of events.

      The server was starting up with --wsrep-new-cluster:

      2018-02-22 23:47:41 139925613557888 [Note] WSREP: Read nil XID from storage engines, skipping position init
      2018-02-22 23:47:41 139925613557888 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
      2018-02-22 23:47:41 139925613557888 [Note] WSREP: wsrep_load(): Galera 25.3.22(r3764) by Codership Oy <info@codership.com> loaded successfully.
      2018-02-22 23:47:41 139925613557888 [Note] WSREP: CRC-32C: using hardware acceleration.
      2018-02-22 23:47:41 139925613557888 [Note] WSREP: Found saved state: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:-1, safe_to_bootstrap: 1
      2018-02-22 23:47:41 139925613557888 [Note] WSREP: Recovering GCache ring buffer: version: 1, UUID: 4ee72403-e1e2-11e7-998a-73a130e7e5cb, offset: -1
      2018-02-22 23:47:41 139925613557888 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (         0/2147483688 bytes) complete.
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (2147483688/2147483688 bytes) complete.
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 520442-551341
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (        0/547216918 bytes) complete.
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (547216918/547216918 bytes) complete.
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 4/30904 locked buffers
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 547216918/2147483648
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.33.40.143; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M; evs.join_retrans_period = PT0.5S; evs.keepalive_period = PT3S; evs.max_install_timeouts = 3; evs.send_window = 1024; evs.stats_report_period = PT1M; evs.suspect_timeout = PT30S; evs.user_send_window = 512; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 2G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.ve
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache history reset: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551341 -> 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Assign initial position for certification: 551339, protocol version: -1
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: wsrep_sst_grab()
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Start replication
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: 'wsrep-new-cluster' option used, bootstrapping the cluster
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Setting initial position to 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: protonet asio version 0
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Using CRC-32C for message checksums.
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: backend: asio
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm thread scheduling priority set to other:0 
      2018-02-22 23:47:51 139925613557888 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: restore pc from disk failed
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: GMCast version 0
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: EVS version 0
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm: bootstrapping new group 'galera_cluster_01'
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: start_prim is enabled, turn off pc_recovery
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Node cb295870 state prim
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: view(view_id(PRIM,cb295870,1) memb {
      	cb295870,0
      } joined {
      } left {
      } partitioned {
      })
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: save pc into disk
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.31.29.97:4567
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fc6cd0
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.32.10.9:4567
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fcf4d0
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.33.40.143:4567
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fd7ce0
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm: connected
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Opened channel 'galera_cluster_01'
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: Waiting for SST to complete.
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE_EXCHANGE: sent state UUID: cb299881-182a-11e8-9d9c-f6ef07f79521
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE EXCHANGE: sent state msg: cb299881-182a-11e8-9d9c-f6ef07f79521
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE EXCHANGE: got state msg: cb299881-182a-11e8-9d9c-f6ef07f79521 from 0 (dbc03)
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: Quorum results:
      	version    = 4,
      	component  = PRIMARY,
      	conf_id    = 0,
      	members    = 1/1 (joined/total),
      	act_id     = 551339,
      	last_appl. = -1,
      	protocols  = 0/7/3 (gcs/repl/appl),
      	group UUID = 4ee72403-e1e2-11e7-998a-73a130e7e5cb
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: Flow-control interval: [16, 16]
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: Trying to continue unpaused monitor
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: Restored state OPEN -> JOINED (551339)
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: Member 0.0 (dbc03) synced with group.
      2018-02-22 23:47:51 139922778347264 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 551339)
      2018-02-22 23:47:51 139925002704640 [Note] WSREP: New cluster view: global state: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3
      2018-02-22 23:47:51 139925613557888 [Note] WSREP: SST complete, seqno: 551339
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: For Galera, using innodb_lock_schedule_algorithm=fcfs
      2018-02-22 23:47:51 139925613557888 [Warning] InnoDB: Using innodb_file_format is deprecated and the parameter may be removed in future releases. See http://dev.mysql.com/doc/refman/5.7/en/innodb-file-format.html
      2018-02-22 23:47:51 139925613557888 [Warning] InnoDB: Using innodb_file_format_max is deprecated and the parameter may be removed in future releases. See http://dev.mysql.com/doc/refman/5.7/en/innodb-file-format.html
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Uses event mutexes
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Compressed tables use zlib 1.2.7
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Using Linux native AIO
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Number of pools: 1
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Using SSE2 crc32 instructions
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Initializing buffer pool, total size = 20G, instances = 20, chunk size = 128M
      2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Completed initialization of buffer pool
      2018-02-22 23:47:51 139922786739968 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') connection established to e2e8f67a tcp://172.31.24.27:4567
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
      2018-02-22 23:47:52 139899263854336 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
      2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Highest supported file format is Barracuda.
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: declaring e2e8f67a at tcp://172.31.24.27:4567 stable
      

      And then the Galera provider received an exception and it died:

      2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
      pc::Proto{uuid=cb295870,start_prim=1,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=3,checksum=0,instances=
      cb295870,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,cb295870,1),to_seq=2,weight=1,segment=0
      ,state_msgs=
      cb295870,pcmsg{ type=STATE, seq=0, flags= 0, node_map {	cb295870,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,cb295870,1),to_seq=2,weight=1,segment=0
      }}
      ,current_view=view(view_id(REG,cb295870,34) memb {
      cb295870,0
      e2e8f67a,0
      } joined {
      e2e8f67a,0
      } left {
      } partitioned {
      }),pc_view=view(view_id(PRIM,cb295870,1) memb {
      cb295870,0
      } joined {
      } left {
      } partitioned {
      }),mtu=32636}
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=e2e8f67a,srcvid=view_id(REG,cb295870,34),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=340547,nl=(
      )
      } 64
      2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=e2e8f67a,srcvid=view_id(REG,cb295870,34),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=340549,nl=(
      )
      }
      state after handling message: evs::proto(evs::proto(cb295870, OPERATIONAL, view_id(REG,cb295870,34)), OPERATIONAL) {
      current_view=view(view_id(REG,cb295870,34) memb {
      cb295870,0
      e2e8f67a,0
      } joined {
      } left {
      } partitioned {
      }),
      input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} },
      fifo_seq=17,
      last_sent=0,
      known:
      cb295870 at 
      {o=1,s=0,i=1,fs=-1,}
      e2e8f67a at tcp://172.31.24.27:4567
      {o=1,s=0,i=1,fs=340549,}
      }2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: exception from gcomm, backend must be restarted: cb295870 aborting due to conflicting prims: older overrides (FATAL)
      at gcomm/src/pc_proto.cpp:handle_state():982
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: terminating thread
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: joining thread
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: closing backend
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: Forced PC close
      2018-02-22 23:47:52 139922786739968 [Warning] WSREP: discarding 1 messages from message index
      2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: closed
      2018-02-22 23:47:52 139922778347264 [ERROR] WSREP: gcs/src/gcs_core.cpp:core_handle_comp_msg():718: Malformed component message. Ignoring
      

      But MariaDB and InnoDB kept going:

      2018-02-22 23:47:52 139925613557888 [Note] InnoDB: 128 out of 128 rollback segments are active.
      2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Creating shared tablespace for temporary tables
      2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
      2018-02-22 23:47:52 139925613557888 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
      2018-02-22 23:47:52 139925613557888 [Note] InnoDB: 5.7.20 started; log sequence number 6217279687
      2018-02-22 23:47:52 139899550291712 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
      180222 23:47:52 server_audit: MariaDB Audit Plugin version 1.4.3 STARTED.
      2018-02-22 23:47:52 139925613557888 [Note] Server socket created on IP: '::'.
      2018-02-22 23:47:53 139922720352000 [Note] Event Scheduler: scheduler thread started with id 10
      2018-02-22 23:47:53 139925613557888 [Note] Reading of all Master_info entries succeded
      2018-02-22 23:47:53 139925613557888 [Note] Added new Master_info '' to hash table
      2018-02-22 23:47:53 139925613557888 [Note] /usr/sbin/mysqld: ready for connections.
      Version: '10.2.12-MariaDB-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
      2018-02-22 23:47:53 139925002704640 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2018-02-22 23:47:53 139925002704640 [Note] WSREP: REPL Protocols: 7 (3, 2)
      2018-02-22 23:47:53 139925002704640 [Note] WSREP: Assign initial position for certification: 551339, protocol version: 3
      2018-02-22 23:47:53 139925036578560 [Note] WSREP: Service thread queue flushed.
      2018-02-22 23:47:53 139925002704640 [Note] WSREP: GCache history reset: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:0 -> 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339
      2018-02-22 23:47:53 139925002704640 [Note] WSREP: Synchronized with group, ready for connections
      2018-02-22 23:47:53 139925002704640 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      

      But all transactions failed due to an ambiguous "Software caused connection abort" error:

      2018-02-22 23:48:04 139899236554496 [Warning] WSREP: Send action {(nil), 1082, TORDERED} returned -103 (Software caused connection abort)
      2018-02-22 23:48:04 139899236554496 [Note] WSREP: cluster conflict due to certification failure for threads:
      2018-02-22 23:48:04 139899236554496 [Note] WSREP: Victim thread: 
      THD: 30, mode: local, state: executing, conflict: cert failure, seqno: -1
      SQL: COMMIT
      2018-02-22 23:48:13 139899236251392 [Warning] WSREP: Send action {(nil), 1133, TORDERED} returned -103 (Software caused connection abort)
      2018-02-22 23:48:13 139899236251392 [Note] WSREP: cluster conflict due to certification failure for threads:
      2018-02-22 23:48:13 139899236251392 [Note] WSREP: Victim thread: 
      THD: 31, mode: local, state: executing, conflict: cert failure, seqno: -1
      SQL: COMMIT
      2018-02-22 23:48:13 139899236251392 [Warning] WSREP: Send action {(nil), 624, TORDERED} returned -103 (Software caused connection abort)
      2018-02-22 23:48:13 139899236251392 [Note] WSREP: cluster conflict due to certification failure for threads:
      2018-02-22 23:48:13 139899236251392 [Note] WSREP: Victim thread: 
      THD: 31, mode: local, state: executing, conflict: cert failure, seqno: -1
      SQL: COMMIT
      

      The user tried to shutdown mysqld, but it hung during the shutdown process, so mysqld had to be killed.

      It looks like the problem was caused by Galera receiving unexpected messages over the network. Shouldn't the whole server have been terminated in this case, instead of just the Galera threads?

      Attachments

        Issue Links

          Activity

            People

              seppo Seppo Jaakola
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.