[MDEV-15863] Galera cluster node refuses to shut down Created: 2018-04-13  Updated: 2018-09-02  Resolved: 2018-09-02

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.14
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Jan Lindström (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Centos6



 Description   

In my lab while running sysbench with 8 threads I try and shut down node receiving
traffic: service mysqld stop refuses to complete, mysqld server is stuck in shutdown mode.

Details below. Worth nothing this node was also async slave, but receiving very little traffic from it (just pt-heartbeat updates).

 
2018-04-13 14:48:41 140573072803584 [Note] /usr/sbin/mysqld (initiated by: unknown): Normal shutdown
2018-04-13 14:48:41 140573072803584 [Note] WSREP: Stop replication
2018-04-13 14:48:41 140573072803584 [Note] WSREP: Closing send monitor...
2018-04-13 14:48:41 140573072803584 [Note] WSREP: Closed send monitor.
2018-04-13 14:48:41 140573072803584 [Note] WSREP: gcomm: terminating thread
2018-04-13 14:48:41 140573072803584 [Note] WSREP: gcomm: joining thread
2018-04-13 14:48:41 140573072803584 [Note] WSREP: gcomm: closing backend
2018-04-13 14:48:41 140573072803584 [Note] WSREP: view(view_id(NON_PRIM,95d49a94,41) memb {
	cd67d220,0
} joined {
} left {
} partitioned {
	95d49a94,0
	da537e8a,0
})
2018-04-13 14:48:41 140573072803584 [Note] WSREP: view((empty))
2018-04-13 14:48:41 140573072803584 [Note] WSREP: gcomm: closed
2018-04-13 14:48:41 140573092583168 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Flow-control interval: [16, 16]
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Trying to continue unpaused monitor
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Received NON-PRIMARY.
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 270133)
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Received self-leave message.
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Flow-control interval: [0, 0]
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Trying to continue unpaused monitor
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2018-04-13 14:48:41 140573092583168 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 270133)
2018-04-13 14:48:41 140573092583168 [Note] WSREP: RECV thread exiting 0: Success
2018-04-13 14:48:41 140573072803584 [Note] WSREP: recv_thread() joined.
2018-04-13 14:48:41 140573072803584 [Note] WSREP: Closing replication queue.
2018-04-13 14:48:41 140573072803584 [Note] WSREP: Closing slave action queue.
2018-04-13 14:48:41 140573081790208 [Note] WSREP: New cluster view: global state: 75749ce0-3c03-11e8-ba88-67dc780e95e7:270133, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
2018-04-13 14:48:41 140573081790208 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-04-13 14:48:41 140573081790208 [Note] WSREP: New cluster view: global state: 75749ce0-3c03-11e8-ba88-67dc780e95e7:270133, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3
2018-04-13 14:48:41 140573081790208 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-04-13 14:48:41 140573081790208 [Note] WSREP: applier thread exiting (code:0)
2018-04-13 14:48:41 140573074417408 [Warning] Aborted connection 18 to db: 'sbtest' user: 'root' host: '192.168.2.90' (Got an error reading communication packets)
2018-04-13 14:48:42 140573412579072 [Note] Slave I/O thread exiting, read up to log 'rbac.000003', position 20269957
2018-04-13 14:48:43 140573072803584 [Note] WSREP: killing local connection: 15
2018-04-13 14:48:43 140573072803584 [Note] WSREP: killing local connection: 16
2018-04-13 14:48:43 140573072803584 [Note] WSREP: killing local connection: 17
2018-04-13 14:48:43 140573072803584 [Note] WSREP: killing local connection: 22
2018-04-13 14:48:43 140573072803584 [Note] WSREP: killing local connection: 21
2018-04-13 14:48:43 140573072803584 [Note] WSREP: killing local connection: 20
2018-04-13 14:48:43 140573072803584 [Note] WSREP: killing local connection: 19

[mysqld]
 
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
log_slave_updates=1
server_id=1001
 
innodb_log_file_size=128M
innodb_log_files_in_group=2
 
[galera]
# Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.2.91,192.168.2.92,192.168.2.93
#

    10 libaio::??(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1915),LinuxAIOHandler::poll(os0file.cc:2060),os_aio_linux_handler(os0file.cc:2114),os_aio_handler(os0file.cc:2114),fil_aio_wait(fil0fil.cc:5348),io_handler_thread(srv0start.cc:342),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:924),srv_worker_thread(srv0srv.cc:924),start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwait(libpthread.so.0),signal_hand(mysqld.cc:3509),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:924),srv_purge_coordinator_suspend(srv0srv.cc:924),srv_purge_coordinator_thread(srv0srv.cc:924),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_resize_thread(buf0buf.cc:3083),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_dump_thread(buf0dump.cc:790),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait,galera::ServiceThd::thd_func,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1152),MYSQL_BIN_LOG::wait_for_update_relay_log(log.cc:8275),next_event(slave.cc:7172),exec_relay_log_event(slave.cc:7172),handle_slave_sql(slave.cc:7172),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),wsrep_rollback_process(mysql_thread.h:1149),start_wsrep_THD(wsrep_mysqld.cc:1981),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),wsrep_close_client_connections(mysql_thread.h:1149),wsrep_stop_replication(wsrep_mysqld.cc:909),kill_server(mysqld.cc:2008),kill_server_thread(mysqld.cc:2043),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),thd_destructor_proxy(mysql_thread.h:1149),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),handle_slave_background(mysql_thread.h:1149),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_monitor_thread(srv0srv.cc:1752),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_error_monitor_thread(srv0srv.cc:1916),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),pc_sleep_if_needed(buf0flu.cc:2693),buf_flush_page_cleaner_coordinator(buf0flu.cc:2693),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),lock_wait_timeout_thread(lock0wait.cc:547),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),ib_wqueue_timedwait(ut0wqueue.cc:163),fts_optimize_thread(fts0opt.cc:3035),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),dict_stats_thread(dict0stats_bg.cc:451),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),timer_handler(mysql_thread.h:1186),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),my_service_thread_sleep(mysql_thread.h:1186),ma_checkpoint_background(ma_checkpoint.c:709),start_thread(libpthread.so.0),clone(libc.so.6)
      1 poll(libc.so.6),handle_connections_sockets(mysqld.cc:6589),mysqld_main(mysqld.cc:6055),__libc_start_main(libc.so.6),_start
      1 nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:230),srv_master_sleep(srv0srv.cc:2405),srv_master_thread(srv0srv.cc:2405),start_thread(libpthread.so.0),clone(libc.so.6)
      1 nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:230),btr_defragment_thread(btr0defragment.cc:772),start_thread(libpthread.so.0),clone(libc.so.6)



 Comments   
Comment by Rick Pizzi [ 2018-04-13 ]

Just happened again...

     10 libaio::??(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1915),LinuxAIOHandler::poll(os0file.cc:2060),os_aio_linux_handler(os0file.cc:2114),os_aio_handler(os0file.cc:2114),fil_aio_wait(fil0fil.cc:5348),io_handler_thread(srv0start.cc:342),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:924),srv_worker_thread(srv0srv.cc:924),start_thread(libpthread.so.0),clone(libc.so.6)
      1 sigwait(libpthread.so.0),signal_hand(mysqld.cc:3509),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),srv_resume_thread(srv0srv.cc:924),srv_purge_coordinator_suspend(srv0srv.cc:924),srv_purge_coordinator_thread(srv0srv.cc:924),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_resize_thread(buf0buf.cc:3083),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait(os0event.cc:166),wait_low(os0event.cc:166),os_event_wait_low(os0event.cc:166),buf_dump_thread(buf0dump.cc:790),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,wait,galera::ServiceThd::thd_func,start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1152),MYSQL_BIN_LOG::wait_for_update_relay_log(log.cc:8275),next_event(slave.cc:7172),exec_relay_log_event(slave.cc:7172),handle_slave_sql(slave.cc:7172),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),wsrep_rollback_process(mysql_thread.h:1149),start_wsrep_THD(wsrep_mysqld.cc:1981),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),wsrep_close_client_connections(mysql_thread.h:1149),wsrep_stop_replication(wsrep_mysqld.cc:909),kill_server(mysqld.cc:2008),kill_server_thread(mysqld.cc:2043),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),thd_destructor_proxy(mysql_thread.h:1149),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),handle_slave_background(mysql_thread.h:1149),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_wait,inline_mysql_cond_wait(mysql_thread.h:1149),inline_mysql_cond_wait(mysql_thread.h:1149),binlog_background_thread(mysql_thread.h:1149),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_monitor_thread(srv0srv.cc:1752),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),srv_error_monitor_thread(srv0srv.cc:1916),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),pc_sleep_if_needed(buf0flu.cc:2693),buf_flush_page_cleaner_coordinator(buf0flu.cc:2693),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),lock_wait_timeout_thread(lock0wait.cc:547),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),ib_wqueue_timedwait(ut0wqueue.cc:163),fts_optimize_thread(fts0opt.cc:3035),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,os_event::timed_wait(os0event.cc:286),os_event::wait_time_low(os0event.cc:413),dict_stats_thread(dict0stats_bg.cc:451),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),timer_handler(mysql_thread.h:1186),start_thread(libpthread.so.0),clone(libc.so.6)
      1 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1186),my_service_thread_sleep(mysql_thread.h:1186),ma_checkpoint_background(ma_checkpoint.c:709),start_thread(libpthread.so.0),clone(libc.so.6)
      1 poll(libc.so.6),handle_connections_sockets(mysqld.cc:6589),mysqld_main(mysqld.cc:6055),__libc_start_main(libc.so.6),_start
      1 nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:230),srv_master_sleep(srv0srv.cc:2405),srv_master_thread(srv0srv.cc:2405),start_thread(libpthread.so.0),clone(libc.so.6)
      1 nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:230),btr_defragment_thread(btr0defragment.cc:772),start_thread(libpthread.so.0),clone(libc.so.6)

Comment by Rick Pizzi [ 2018-04-13 ]

To reproduce, run sysbench for at least a couple minutes then try to shutdown node where sysbench is writing

sysbench --db-driver=mysql  --threads=16 --time=3000  --mysql-host=192.168.2.91  --mysql-user=root  /usr/share/sysbench/oltp_insert.lua run

Comment by Jan Lindström (Inactive) [ 2018-09-02 ]

After few tries with 10.2.18 and sysbench, I could not repeat any problem on shutdown the master node while sysbench was running.

Generated at Thu Feb 08 08:24:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.