[MDEV-15399] Galera catches exception and terminates, but MariaDB keeps going Created: 2018-02-23  Updated: 2020-08-25  Resolved: 2018-11-07

Status: Closed
Project: MariaDB Server
Component/s: Galera, Server, wsrep
Affects Version/s: 10.2.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Seppo Jaakola
Resolution: Fixed Votes: 0
Labels: galera, hang, wsrep

Issue Links:
Relates
relates to MDEV-14971 Mariadb stops working when second gal... Closed

 Description   

A user encountered the following sequence of events.

The server was starting up with --wsrep-new-cluster:

2018-02-22 23:47:41 139925613557888 [Note] WSREP: Read nil XID from storage engines, skipping position init
2018-02-22 23:47:41 139925613557888 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2018-02-22 23:47:41 139925613557888 [Note] WSREP: wsrep_load(): Galera 25.3.22(r3764) by Codership Oy <info@codership.com> loaded successfully.
2018-02-22 23:47:41 139925613557888 [Note] WSREP: CRC-32C: using hardware acceleration.
2018-02-22 23:47:41 139925613557888 [Note] WSREP: Found saved state: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:-1, safe_to_bootstrap: 1
2018-02-22 23:47:41 139925613557888 [Note] WSREP: Recovering GCache ring buffer: version: 1, UUID: 4ee72403-e1e2-11e7-998a-73a130e7e5cb, offset: -1
2018-02-22 23:47:41 139925613557888 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (         0/2147483688 bytes) complete.
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (2147483688/2147483688 bytes) complete.
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 520442-551341
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (        0/547216918 bytes) complete.
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (547216918/547216918 bytes) complete.
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 4/30904 locked buffers
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 547216918/2147483648
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.33.40.143; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M; evs.join_retrans_period = PT0.5S; evs.keepalive_period = PT3S; evs.max_install_timeouts = 3; evs.send_window = 1024; evs.stats_report_period = PT1M; evs.suspect_timeout = PT30S; evs.user_send_window = 512; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 2G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.ve
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache history reset: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551341 -> 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Assign initial position for certification: 551339, protocol version: -1
2018-02-22 23:47:51 139925613557888 [Note] WSREP: wsrep_sst_grab()
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Start replication
2018-02-22 23:47:51 139925613557888 [Note] WSREP: 'wsrep-new-cluster' option used, bootstrapping the cluster
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Setting initial position to 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339
2018-02-22 23:47:51 139925613557888 [Note] WSREP: protonet asio version 0
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Using CRC-32C for message checksums.
2018-02-22 23:47:51 139925613557888 [Note] WSREP: backend: asio
2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm thread scheduling priority set to other:0 
2018-02-22 23:47:51 139925613557888 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2018-02-22 23:47:51 139925613557888 [Note] WSREP: restore pc from disk failed
2018-02-22 23:47:51 139925613557888 [Note] WSREP: GMCast version 0
2018-02-22 23:47:51 139925613557888 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2018-02-22 23:47:51 139925613557888 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2018-02-22 23:47:51 139925613557888 [Note] WSREP: EVS version 0
2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm: bootstrapping new group 'galera_cluster_01'
2018-02-22 23:47:51 139925613557888 [Note] WSREP: start_prim is enabled, turn off pc_recovery
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Node cb295870 state prim
2018-02-22 23:47:51 139925613557888 [Note] WSREP: view(view_id(PRIM,cb295870,1) memb {
	cb295870,0
} joined {
} left {
} partitioned {
})
2018-02-22 23:47:51 139925613557888 [Note] WSREP: save pc into disk
2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.31.29.97:4567
2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fc6cd0
2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.32.10.9:4567
2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fcf4d0
2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.33.40.143:4567
2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fd7ce0
2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm: connected
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Opened channel 'galera_cluster_01'
2018-02-22 23:47:51 139922778347264 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2018-02-22 23:47:51 139925613557888 [Note] WSREP: Waiting for SST to complete.
2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE_EXCHANGE: sent state UUID: cb299881-182a-11e8-9d9c-f6ef07f79521
2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE EXCHANGE: sent state msg: cb299881-182a-11e8-9d9c-f6ef07f79521
2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE EXCHANGE: got state msg: cb299881-182a-11e8-9d9c-f6ef07f79521 from 0 (dbc03)
2018-02-22 23:47:51 139922778347264 [Note] WSREP: Quorum results:
	version    = 4,
	component  = PRIMARY,
	conf_id    = 0,
	members    = 1/1 (joined/total),
	act_id     = 551339,
	last_appl. = -1,
	protocols  = 0/7/3 (gcs/repl/appl),
	group UUID = 4ee72403-e1e2-11e7-998a-73a130e7e5cb
2018-02-22 23:47:51 139922778347264 [Note] WSREP: Flow-control interval: [16, 16]
2018-02-22 23:47:51 139922778347264 [Note] WSREP: Trying to continue unpaused monitor
2018-02-22 23:47:51 139922778347264 [Note] WSREP: Restored state OPEN -> JOINED (551339)
2018-02-22 23:47:51 139922778347264 [Note] WSREP: Member 0.0 (dbc03) synced with group.
2018-02-22 23:47:51 139922778347264 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 551339)
2018-02-22 23:47:51 139925002704640 [Note] WSREP: New cluster view: global state: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3
2018-02-22 23:47:51 139925613557888 [Note] WSREP: SST complete, seqno: 551339
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: For Galera, using innodb_lock_schedule_algorithm=fcfs
2018-02-22 23:47:51 139925613557888 [Warning] InnoDB: Using innodb_file_format is deprecated and the parameter may be removed in future releases. See http://dev.mysql.com/doc/refman/5.7/en/innodb-file-format.html
2018-02-22 23:47:51 139925613557888 [Warning] InnoDB: Using innodb_file_format_max is deprecated and the parameter may be removed in future releases. See http://dev.mysql.com/doc/refman/5.7/en/innodb-file-format.html
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Uses event mutexes
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Compressed tables use zlib 1.2.7
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Using Linux native AIO
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Number of pools: 1
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Using SSE2 crc32 instructions
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Initializing buffer pool, total size = 20G, instances = 20, chunk size = 128M
2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Completed initialization of buffer pool
2018-02-22 23:47:51 139922786739968 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') connection established to e2e8f67a tcp://172.31.24.27:4567
2018-02-22 23:47:52 139922786739968 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: 
2018-02-22 23:47:52 139899263854336 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Highest supported file format is Barracuda.
2018-02-22 23:47:52 139922786739968 [Note] WSREP: declaring e2e8f67a at tcp://172.31.24.27:4567 stable

And then the Galera provider received an exception and it died:

2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=cb295870,start_prim=1,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=3,checksum=0,instances=
cb295870,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,cb295870,1),to_seq=2,weight=1,segment=0
,state_msgs=
cb295870,pcmsg{ type=STATE, seq=0, flags= 0, node_map {	cb295870,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,cb295870,1),to_seq=2,weight=1,segment=0
}}
,current_view=view(view_id(REG,cb295870,34) memb {
cb295870,0
e2e8f67a,0
} joined {
e2e8f67a,0
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,cb295870,1) memb {
cb295870,0
} joined {
} left {
} partitioned {
}),mtu=32636}
2018-02-22 23:47:52 139922786739968 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=e2e8f67a,srcvid=view_id(REG,cb295870,34),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=340547,nl=(
)
} 64
2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=e2e8f67a,srcvid=view_id(REG,cb295870,34),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=340549,nl=(
)
}
state after handling message: evs::proto(evs::proto(cb295870, OPERATIONAL, view_id(REG,cb295870,34)), OPERATIONAL) {
current_view=view(view_id(REG,cb295870,34) memb {
cb295870,0
e2e8f67a,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} },
fifo_seq=17,
last_sent=0,
known:
cb295870 at 
{o=1,s=0,i=1,fs=-1,}
e2e8f67a at tcp://172.31.24.27:4567
{o=1,s=0,i=1,fs=340549,}
}2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: exception from gcomm, backend must be restarted: cb295870 aborting due to conflicting prims: older overrides (FATAL)
at gcomm/src/pc_proto.cpp:handle_state():982
2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: terminating thread
2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: joining thread
2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: closing backend
2018-02-22 23:47:52 139922786739968 [Note] WSREP: Forced PC close
2018-02-22 23:47:52 139922786739968 [Warning] WSREP: discarding 1 messages from message index
2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: closed
2018-02-22 23:47:52 139922778347264 [ERROR] WSREP: gcs/src/gcs_core.cpp:core_handle_comp_msg():718: Malformed component message. Ignoring

But MariaDB and InnoDB kept going:

2018-02-22 23:47:52 139925613557888 [Note] InnoDB: 128 out of 128 rollback segments are active.
2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-02-22 23:47:52 139925613557888 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2018-02-22 23:47:52 139925613557888 [Note] InnoDB: 5.7.20 started; log sequence number 6217279687
2018-02-22 23:47:52 139899550291712 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool
180222 23:47:52 server_audit: MariaDB Audit Plugin version 1.4.3 STARTED.
2018-02-22 23:47:52 139925613557888 [Note] Server socket created on IP: '::'.
2018-02-22 23:47:53 139922720352000 [Note] Event Scheduler: scheduler thread started with id 10
2018-02-22 23:47:53 139925613557888 [Note] Reading of all Master_info entries succeded
2018-02-22 23:47:53 139925613557888 [Note] Added new Master_info '' to hash table
2018-02-22 23:47:53 139925613557888 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.2.12-MariaDB-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
2018-02-22 23:47:53 139925002704640 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-02-22 23:47:53 139925002704640 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-02-22 23:47:53 139925002704640 [Note] WSREP: Assign initial position for certification: 551339, protocol version: 3
2018-02-22 23:47:53 139925036578560 [Note] WSREP: Service thread queue flushed.
2018-02-22 23:47:53 139925002704640 [Note] WSREP: GCache history reset: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:0 -> 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339
2018-02-22 23:47:53 139925002704640 [Note] WSREP: Synchronized with group, ready for connections
2018-02-22 23:47:53 139925002704640 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

But all transactions failed due to an ambiguous "Software caused connection abort" error:

2018-02-22 23:48:04 139899236554496 [Warning] WSREP: Send action {(nil), 1082, TORDERED} returned -103 (Software caused connection abort)
2018-02-22 23:48:04 139899236554496 [Note] WSREP: cluster conflict due to certification failure for threads:
2018-02-22 23:48:04 139899236554496 [Note] WSREP: Victim thread: 
THD: 30, mode: local, state: executing, conflict: cert failure, seqno: -1
SQL: COMMIT
2018-02-22 23:48:13 139899236251392 [Warning] WSREP: Send action {(nil), 1133, TORDERED} returned -103 (Software caused connection abort)
2018-02-22 23:48:13 139899236251392 [Note] WSREP: cluster conflict due to certification failure for threads:
2018-02-22 23:48:13 139899236251392 [Note] WSREP: Victim thread: 
THD: 31, mode: local, state: executing, conflict: cert failure, seqno: -1
SQL: COMMIT
2018-02-22 23:48:13 139899236251392 [Warning] WSREP: Send action {(nil), 624, TORDERED} returned -103 (Software caused connection abort)
2018-02-22 23:48:13 139899236251392 [Note] WSREP: cluster conflict due to certification failure for threads:
2018-02-22 23:48:13 139899236251392 [Note] WSREP: Victim thread: 
THD: 31, mode: local, state: executing, conflict: cert failure, seqno: -1
SQL: COMMIT

The user tried to shutdown mysqld, but it hung during the shutdown process, so mysqld had to be killed.

It looks like the problem was caused by Galera receiving unexpected messages over the network. Shouldn't the whole server have been terminated in this case, instead of just the Galera threads?



 Comments   
Comment by Geoff Montee (Inactive) [ 2018-02-26 ]

This is related to the following Galera issue:

https://github.com/codership/galera/issues/500

Comment by Julien Fritsch [ 2018-11-07 ]

I consider this one as closed, since https://github.com/codership/galera/issues/500 is closed as well as the customer issue.

Generated at Thu Feb 08 08:21:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.