[MDEV-15399] Galera catches exception and terminates, but MariaDB keeps going - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2.12
Fix Version/s: N/A
Component/s: Galera, Server, wsrep
Labels:
- galera
- hang
- wsrep

Description

A user encountered the following sequence of events.

The server was starting up with --wsrep-new-cluster:

2018-02-22 23:47:41 139925613557888 [Note] WSREP: Read nil XID from storage engines, skipping position init

2018-02-22 23:47:41 139925613557888 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'

2018-02-22 23:47:41 139925613557888 [Note] WSREP: wsrep_load(): Galera 25.3.22(r3764) by Codership Oy <info@codership.com> loaded successfully.

2018-02-22 23:47:41 139925613557888 [Note] WSREP: CRC-32C: using hardware acceleration.

2018-02-22 23:47:41 139925613557888 [Note] WSREP: Found saved state: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:-1, safe_to_bootstrap: 1

2018-02-22 23:47:41 139925613557888 [Note] WSREP: Recovering GCache ring buffer: version: 1, UUID: 4ee72403-e1e2-11e7-998a-73a130e7e5cb, offset: -1

2018-02-22 23:47:41 139925613557888 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (         0/2147483688 bytes) complete.

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (2147483688/2147483688 bytes) complete.

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 520442-551341

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (        0/547216918 bytes) complete.

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (547216918/547216918 bytes) complete.

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 4/30904 locked buffers

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 547216918/2147483648

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 172.33.40.143; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT1M; evs.install_timeout = PT1M; evs.join_retrans_period = PT0.5S; evs.keepalive_period = PT3S; evs.max_install_timeouts = 3; evs.send_window = 1024; evs.stats_report_period = PT1M; evs.suspect_timeout = PT30S; evs.user_send_window = 512; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 2G; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.ve

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache history reset: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551341 -> 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Assign initial position for certification: 551339, protocol version: -1

2018-02-22 23:47:51 139925613557888 [Note] WSREP: wsrep_sst_grab()

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Start replication

2018-02-22 23:47:51 139925613557888 [Note] WSREP: 'wsrep-new-cluster' option used, bootstrapping the cluster

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Setting initial position to 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339

2018-02-22 23:47:51 139925613557888 [Note] WSREP: protonet asio version 0

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Using CRC-32C for message checksums.

2018-02-22 23:47:51 139925613557888 [Note] WSREP: backend: asio

2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm thread scheduling priority set to other:0

2018-02-22 23:47:51 139925613557888 [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)

2018-02-22 23:47:51 139925613557888 [Note] WSREP: restore pc from disk failed

2018-02-22 23:47:51 139925613557888 [Note] WSREP: GMCast version 0

2018-02-22 23:47:51 139925613557888 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567

2018-02-22 23:47:51 139925613557888 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') multicast: , ttl: 1

2018-02-22 23:47:51 139925613557888 [Note] WSREP: EVS version 0

2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm: bootstrapping new group 'galera_cluster_01'

2018-02-22 23:47:51 139925613557888 [Note] WSREP: start_prim is enabled, turn off pc_recovery

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Node cb295870 state prim

2018-02-22 23:47:51 139925613557888 [Note] WSREP: view(view_id(PRIM,cb295870,1) memb {

	cb295870,0

} joined {

} left {

} partitioned {

})

2018-02-22 23:47:51 139925613557888 [Note] WSREP: save pc into disk

2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.31.29.97:4567

2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fc6cd0

2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.32.10.9:4567

2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fcf4d0

2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr without UUID: tcp://172.33.40.143:4567

2018-02-22 23:47:51 139925613557888 [Note] WSREP: discarding pending addr proto entry 0x5568e6fd7ce0

2018-02-22 23:47:51 139925613557888 [Note] WSREP: gcomm: connected

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Opened channel 'galera_cluster_01'

2018-02-22 23:47:51 139922778347264 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1

2018-02-22 23:47:51 139925613557888 [Note] WSREP: Waiting for SST to complete.

2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE_EXCHANGE: sent state UUID: cb299881-182a-11e8-9d9c-f6ef07f79521

2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE EXCHANGE: sent state msg: cb299881-182a-11e8-9d9c-f6ef07f79521

2018-02-22 23:47:51 139922778347264 [Note] WSREP: STATE EXCHANGE: got state msg: cb299881-182a-11e8-9d9c-f6ef07f79521 from 0 (dbc03)

2018-02-22 23:47:51 139922778347264 [Note] WSREP: Quorum results:

	version    = 4,

	component  = PRIMARY,

	conf_id    = 0,

	members    = 1/1 (joined/total),

	act_id     = 551339,

	last_appl. = -1,

	protocols  = 0/7/3 (gcs/repl/appl),

	group UUID = 4ee72403-e1e2-11e7-998a-73a130e7e5cb

2018-02-22 23:47:51 139922778347264 [Note] WSREP: Flow-control interval: [16, 16]

2018-02-22 23:47:51 139922778347264 [Note] WSREP: Trying to continue unpaused monitor

2018-02-22 23:47:51 139922778347264 [Note] WSREP: Restored state OPEN -> JOINED (551339)

2018-02-22 23:47:51 139922778347264 [Note] WSREP: Member 0.0 (dbc03) synced with group.

2018-02-22 23:47:51 139922778347264 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 551339)

2018-02-22 23:47:51 139925002704640 [Note] WSREP: New cluster view: global state: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3

2018-02-22 23:47:51 139925613557888 [Note] WSREP: SST complete, seqno: 551339

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: For Galera, using innodb_lock_schedule_algorithm=fcfs

2018-02-22 23:47:51 139925613557888 [Warning] InnoDB: Using innodb_file_format is deprecated and the parameter may be removed in future releases. See http://dev.mysql.com/doc/refman/5.7/en/innodb-file-format.html

2018-02-22 23:47:51 139925613557888 [Warning] InnoDB: Using innodb_file_format_max is deprecated and the parameter may be removed in future releases. See http://dev.mysql.com/doc/refman/5.7/en/innodb-file-format.html

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Uses event mutexes

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Compressed tables use zlib 1.2.7

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Using Linux native AIO

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Number of pools: 1

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Using SSE2 crc32 instructions

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Initializing buffer pool, total size = 20G, instances = 20, chunk size = 128M

2018-02-22 23:47:51 139925613557888 [Note] InnoDB: Completed initialization of buffer pool

2018-02-22 23:47:51 139922786739968 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') connection established to e2e8f67a tcp://172.31.24.27:4567

2018-02-22 23:47:52 139922786739968 [Note] WSREP: (cb295870, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:

2018-02-22 23:47:52 139899263854336 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().

2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Highest supported file format is Barracuda.

2018-02-22 23:47:52 139922786739968 [Note] WSREP: declaring e2e8f67a at tcp://172.31.24.27:4567 stable

And then the Galera provider received an exception and it died:

2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:

pc::Proto{uuid=cb295870,start_prim=1,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=3,checksum=0,instances=

cb295870,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,cb295870,1),to_seq=2,weight=1,segment=0

,state_msgs=

cb295870,pcmsg{ type=STATE, seq=0, flags= 0, node_map {	cb295870,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,cb295870,1),to_seq=2,weight=1,segment=0

}}

,current_view=view(view_id(REG,cb295870,34) memb {

cb295870,0

e2e8f67a,0

} joined {

e2e8f67a,0

} left {

} partitioned {

}),pc_view=view(view_id(PRIM,cb295870,1) memb {

cb295870,0

} joined {

} left {

} partitioned {

}),mtu=32636}

2018-02-22 23:47:52 139922786739968 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=e2e8f67a,srcvid=view_id(REG,cb295870,34),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=340547,nl=(

} 64

2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=e2e8f67a,srcvid=view_id(REG,cb295870,34),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=340549,nl=(

state after handling message: evs::proto(evs::proto(cb295870, OPERATIONAL, view_id(REG,cb295870,34)), OPERATIONAL) {

current_view=view(view_id(REG,cb295870,34) memb {

cb295870,0

e2e8f67a,0

} joined {

} left {

} partitioned {

}),

input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} },

fifo_seq=17,

last_sent=0,

known:

cb295870 at

{o=1,s=0,i=1,fs=-1,}

e2e8f67a at tcp://172.31.24.27:4567

{o=1,s=0,i=1,fs=340549,}

}2018-02-22 23:47:52 139922786739968 [ERROR] WSREP: exception from gcomm, backend must be restarted: cb295870 aborting due to conflicting prims: older overrides (FATAL)

at gcomm/src/pc_proto.cpp:handle_state():982

2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: terminating thread

2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: joining thread

2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: closing backend

2018-02-22 23:47:52 139922786739968 [Note] WSREP: Forced PC close

2018-02-22 23:47:52 139922786739968 [Warning] WSREP: discarding 1 messages from message index

2018-02-22 23:47:52 139922786739968 [Note] WSREP: gcomm: closed

2018-02-22 23:47:52 139922778347264 [ERROR] WSREP: gcs/src/gcs_core.cpp:core_handle_comp_msg():718: Malformed component message. Ignoring

But MariaDB and InnoDB kept going:

2018-02-22 23:47:52 139925613557888 [Note] InnoDB: 128 out of 128 rollback segments are active.

2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Creating shared tablespace for temporary tables

2018-02-22 23:47:52 139925613557888 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...

2018-02-22 23:47:52 139925613557888 [Note] InnoDB: File './ibtmp1' size is now 12 MB.

2018-02-22 23:47:52 139925613557888 [Note] InnoDB: 5.7.20 started; log sequence number 6217279687

2018-02-22 23:47:52 139899550291712 [Note] InnoDB: Loading buffer pool(s) from /var/lib/mysql/ib_buffer_pool

180222 23:47:52 server_audit: MariaDB Audit Plugin version 1.4.3 STARTED.

2018-02-22 23:47:52 139925613557888 [Note] Server socket created on IP: '::'.

2018-02-22 23:47:53 139922720352000 [Note] Event Scheduler: scheduler thread started with id 10

2018-02-22 23:47:53 139925613557888 [Note] Reading of all Master_info entries succeded

2018-02-22 23:47:53 139925613557888 [Note] Added new Master_info '' to hash table

2018-02-22 23:47:53 139925613557888 [Note] /usr/sbin/mysqld: ready for connections.

Version: '10.2.12-MariaDB-log' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server

2018-02-22 23:47:53 139925002704640 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

2018-02-22 23:47:53 139925002704640 [Note] WSREP: REPL Protocols: 7 (3, 2)

2018-02-22 23:47:53 139925002704640 [Note] WSREP: Assign initial position for certification: 551339, protocol version: 3

2018-02-22 23:47:53 139925036578560 [Note] WSREP: Service thread queue flushed.

2018-02-22 23:47:53 139925002704640 [Note] WSREP: GCache history reset: 4ee72403-e1e2-11e7-998a-73a130e7e5cb:0 -> 4ee72403-e1e2-11e7-998a-73a130e7e5cb:551339

2018-02-22 23:47:53 139925002704640 [Note] WSREP: Synchronized with group, ready for connections

2018-02-22 23:47:53 139925002704640 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

But all transactions failed due to an ambiguous "Software caused connection abort" error:

2018-02-22 23:48:04 139899236554496 [Warning] WSREP: Send action {(nil), 1082, TORDERED} returned -103 (Software caused connection abort)

2018-02-22 23:48:04 139899236554496 [Note] WSREP: cluster conflict due to certification failure for threads:

2018-02-22 23:48:04 139899236554496 [Note] WSREP: Victim thread:

THD: 30, mode: local, state: executing, conflict: cert failure, seqno: -1

SQL: COMMIT

2018-02-22 23:48:13 139899236251392 [Warning] WSREP: Send action {(nil), 1133, TORDERED} returned -103 (Software caused connection abort)

2018-02-22 23:48:13 139899236251392 [Note] WSREP: cluster conflict due to certification failure for threads:

2018-02-22 23:48:13 139899236251392 [Note] WSREP: Victim thread:

THD: 31, mode: local, state: executing, conflict: cert failure, seqno: -1

SQL: COMMIT

2018-02-22 23:48:13 139899236251392 [Warning] WSREP: Send action {(nil), 624, TORDERED} returned -103 (Software caused connection abort)

2018-02-22 23:48:13 139899236251392 [Note] WSREP: cluster conflict due to certification failure for threads:

2018-02-22 23:48:13 139899236251392 [Note] WSREP: Victim thread:

THD: 31, mode: local, state: executing, conflict: cert failure, seqno: -1

SQL: COMMIT

The user tried to shutdown mysqld, but it hung during the shutdown process, so mysqld had to be killed.

It looks like the problem was caused by Galera receiving unexpected messages over the network. Shouldn't the whole server have been terminated in this case, instead of just the Galera threads?

Attachments

Issue Links

relates to

MDEV-14971 Mariadb stops working when second galera nodes joins

Closed

links to

Galera issue #500 - Provider does not get properly closed in case of gcomm exception

Activity

People

Assignee:: Seppo Jaakola

Reporter:: Geoff Montee (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2018-02-23 20:16

Updated:: 2024-07-08 00:22

Resolved:: 2018-11-07 14:53

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.