Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version: 10.1.11
Description
Shutting down MariaDB with SIGTERM when it has many (20) replication channels (with parallel replication enabled) seems never-ending. But if I stop all slaves before shutdown, it takes only a few seconds. Moreover, mysqld reports errors from the I/O threads.
I'd expect the server to stop all slaves itself.
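For comparison, the workaround amounts to stopping every replication connection before asking systemd to stop the service; a minimal sketch (the mysql client is assumed to pick up socket and credentials from its default configuration):

# Stop all replication connections first, then shut the service down
mysql -e "STOP ALL SLAVES;"
sudo systemctl stop mariadb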
# sudo systemctl status mariadb -n 64
● mariadb.service - MariaDB server
Loaded: loaded (/nix/store/4hm49xrn8md0swvc3sm9n2qhafwllgc1-unit-mariadb.service/mariadb.service)
Active: deactivating (stop-sigterm) since Fri 2016-02-19 21:06:29 SGT; 1h 50min ago
Process: 3189 ExecStartPre=/nix/store/m8hhisn7gag7iycddw8hflila3gi71h3-unit-script/bin/mariadb-pre-start (code=exited, status=0/SUCCESS)
Main PID: 3640 (mysqld)
CGroup: /system.slice/mariadb.service
└─3640 /nix/store/z7x0ndrckrsmqr3163hp9kgs9hwags61-mariadb-10.1.11/bin/mysqld --defaults-file=/nix/store/8lgvar6h0696ngpsladkvmqwbd5k6x5v-mysqld.cnf

Feb 19 20:04:23 grumpycat mysqld[3640]: 2016-02-19 20:04:23 139653215652608 [Note] Master 'oms-my': Slave SQL thread initialized, starting replication in l...n: 3761334
Feb 19 21:06:29 grumpycat systemd[1]: Stopping MariaDB server...
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652892388096 [Note] /nix/store/z7x0ndrckrsmqr3163hp9kgs9hwags61-mariadb-10.1.11/bin/mysqld: ...l shutdown
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652892388096 [Note] Event Scheduler: Purging the queue. 0 events
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139653540129536 [Note] Master 'bob-th': Error reading relay log event: slave SQL thread was killed
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652246162176 [ERROR] Master 'bob-th': Error reading packet from server: Lost connection to M...rrno=2013)
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652246162176 [Note] Master 'bob-th': Slave I/O thread exiting, read up to log 'mysql-bin-cha...on 3256007
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139653539826432 [Note] Master 'zops': Error reading relay log event: slave SQL thread was killed
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652894509824 [ERROR] Master 'zops': Error reading packet from server: Lost connection to MyS...rrno=2013)
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652894509824 [Note] Master 'zops': Slave I/O thread exiting, read up to log 'mysql-bin-chang...sition 422
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139653999532800 [Note] Master 'bob-my': Error reading relay log event: slave SQL thread was killed
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652570942208 [ERROR] Master 'bob-my': Error reading packet from server: Lost connection to M...rrno=2013)
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652570942208 [Note] Master 'bob-my': Slave I/O thread exiting, read up to log 'mysql-bin-cha...on 3947621
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139653538310912 [Note] Master 'bob-hk': Error reading relay log event: slave SQL thread was killed
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652570336000 [ERROR] Master 'bob-hk': Error reading packet from server: Lost connection to M...rrno=2013)
Feb 19 21:06:29 grumpycat mysqld[3640]: 2016-02-19 21:06:29 139652570336000 [Note] Master 'bob-hk': Slave I/O thread exiting, read up to log 'mysql-bin-cha...on 1303108
Feb 19 21:06:30 grumpycat mysqld[3640]: 2016-02-19 21:06:30 139653217471232 [Note] Master 'bob-ph': Error reading relay log event: slave SQL thread was killed
Feb 19 21:06:30 grumpycat mysqld[3640]: 2016-02-19 21:06:30 139652892691200 [ERROR] Master 'bob-ph': Error reading packet from server: Lost connection to M...rrno=2013)
Feb 19 21:06:30 grumpycat mysqld[3640]: 2016-02-19 21:06:30 139652892691200 [Note] Master 'bob-ph': Slave I/O thread exiting, read up to log 'mysql-bin-cha...on 8157090
...
The server has 30 GB of RAM and 4 CPUs. iotop shows ~100 KB/s read and write for the mysqld process. My mysqld configuration:
[mysqld]
basedir = /nix/store/z7x0ndrckrsmqr3163hp9kgs9hwags61-mariadb-10.1.11
init_file = /nix/store/71malbvn6kkn29yiwrsalnpirf2b6v62-init
pid_file = /run/mysqld/mysqld.pid
plugin_load = unix_socket=auth_socket.so
datadir = /mariadb/db
ignore_db_dirs = lost+found
innodb_buffer_pool_instances = 64
innodb_buffer_pool_size = 19327352832
innodb_file_format = barracuda
innodb_file_per_table = ON
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_lock_wait_timeout = 1800
innodb_log_file_size = 314572800
join_buffer_size = 1048576
log_slave_updates = OFF
max_allowed_packet = 268435456
max_connections = 1000
net_read_timeout = 1000
net_write_timeout = 1000
port = 3306
relay_log = /mariadb/relay/cat-bin
server_id = 112233
skip_log_bin
slave_compressed_protocol = ON
slave_domain_parallel_threads = 10
slave_net_timeout = 600
slave_parallel_max_queued = 8388608
slave_parallel_threads = 40
sort_buffer_size = 4194304
ssl_cert = /nix/store/rmlq71y8fnyx1k7sp3qx55fqwg45pl40-cats-cert.pem
ssl_key = /run/keys/cats-key.pem
table_open_cache = 30000
; replicate_*
!include /nix/store/czaiav63ddqrz18sizhfy69jvv04nfcn-mysqld-repl.cnf
Do you have a suggestion for how to reproduce this? Setting up 20-way
multi-source replication is not trivial, and it is not clear what "seems
never-ending" means, exactly.
The error messages look like just the normal messages when a slave is
stopped, or did I miss something?
I can imagine that shutdown with such a setup could take a while. What
happens is that each of the 20 master-slave connections can have 8 MB (your
value of slave_parallel_max_queued) of events queued up. At stop, a
conservative safe point to stop is selected in each of these queues, so
there might be a bunch of events that still need to be executed before stop can
complete. And with slave_parallel_threads=40 and
slave_domain_parallel_threads=10, most of the 20 master-slave connections might
have to wait for the others to release their worker threads.
Though this is assuming every connection is busy and catching up to its master
... and "seems never-ending" sounds like more than what would be expected.
I checked the code, and I did not immediately spot anything that would make
parallel replication stop behave differently for STOP ALL SLAVES and for
SIGTERM. The stop of the parallel replication threads happens the same way
after the SQL thread is stopped. But it is definitely possible that there is a
bug or race in slave stop; it is just not clear from the information given
here what it could be...
Maybe you could do a detailed check of the error log to see if all SQL and
IO threads stopped? I mean, check whether each of your 20 master-slave connections
logs something like "slave SQL thread was killed" and "Slave I/O thread exiting".
It would help to understand whether the problem is related to stopping the IO/SQL
threads, or whether it is related to parallel replication completing its queue of
events...
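Something along these lines should give a quick count (the message strings are the ones visible in the journal output above; journalctl is just one way to get at the log, adjust the unit name and time window to your setup):

# Count how many connections reported their SQL and I/O threads stopping
journalctl -u mariadb --since "21:06" | grep -c "slave SQL thread was killed"
journalctl -u mariadb --since "21:06" | grep -c "Slave I/O thread exiting"
# Both counts should reach 20 if every connection stopped cleanly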