[MDEV-18450] slow master shutdown to ensure slaves have received all its event Created: 2019-02-01 Updated: 2023-05-16 Resolved: 2019-03-12 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Fix Version/s: | 10.4.4 |
| Type: | Task | Priority: | Major |
| Reporter: | Andrei Elkin | Assignee: | Andrei Elkin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Description |
|
When the master server shuts down, its slaves stop receiving events from it at an arbitrary point. It should be possible to defer the critical shutdown operations until all (or some) of the currently connected slaves have received the last event from the master's binlog. Once client connections are closed, so that no more data is generated for replication, the sending (or, even better, the receipt) of the last event should release the final phase of shutdown. |
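The ordering requested above can be illustrated with a small toy model. This is only a sketch, not MariaDB's implementation: the `ToyMaster` class, its `binlog` list, and the per-slave `sent` counters are hypothetical names invented for illustration. It shows client connections being closed first, and the final shutdown phase being held back until every connected slave has been sent the last binlog event.

```python
class ToyMaster:
    """Toy model of a replication master; all names here are hypothetical."""

    def __init__(self, slave_names):
        self.binlog = []                               # events in commit order
        self.sent = {name: 0 for name in slave_names}  # events sent per slave
        self.accepting_clients = True

    def commit(self, event):
        # Clients can only generate events while connections are open.
        if not self.accepting_clients:
            raise RuntimeError("client connections are closed")
        self.binlog.append(event)

    def send_next(self, name):
        # One step of a "dump thread": send the next pending event, if any.
        if self.sent[name] < len(self.binlog):
            self.sent[name] += 1
            return True
        return False

    def shutdown(self, wait_for_slaves):
        # Phase 1: close client connections so the binlog stops growing.
        self.accepting_clients = False
        # Phase 2 (optional): keep the dump threads alive until each slave
        # has been sent the last event.
        if wait_for_slaves:
            for name in self.sent:
                while self.send_next(name):
                    pass
        # Phase 3: final shutdown; report how many events each slave missed.
        return {name: len(self.binlog) - pos for name, pos in self.sent.items()}
```

With `wait_for_slaves=False` the model returns a non-zero count of missed events for any slave that was behind at shutdown time, which is the behavior this task sets out to make avoidable.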
| Comments |
| Comment by Geoff Montee (Inactive) [ 2019-02-15 ] | |||||||||||||
|
I have been working with someone who experienced this exact problem with semi-synchronous replication. They noticed that when they trigger a normal shutdown on the master, some data can still be committed by existing client threads after the semi-synchronous replication master threads have been shut down. | |||||||||||||
| Comment by Andrei Elkin [ 2019-02-15 ] | |||||||||||||
|
Sergei, hello. I've submitted patch 80961a70f22 to bb-10.4-andrei. I hope to hear from you when your busy schedule permits. Thanks! Andrei | |||||||||||||
| Comment by Geoff Montee (Inactive) [ 2019-02-15 ] | |||||||||||||
|
Hi Elkin, I noticed that the patch prevents the server from killing the binlog dump thread early on the master, which sounds great. For semi-synchronous replication, does anything need to be changed to prevent the server from killing the ack receiver thread early on the master as well? Currently, it looks like this might be killed after the binlog dump thread, and before the storage engines shut down:
In the source code, I see it's currently stopped here: https://github.com/MariaDB/server/blob/62c0ac2da66f8e26d5bbf79f3a7dac56cad34f5e/sql/mysqld.cc#L1973 | |||||||||||||
| Comment by Andrei Elkin [ 2019-02-18 ] | |||||||||||||
|
GeoffMontee, thank you for this inspiring question about semisync and its ack collector specifically. 0. Master is committing a trx Notice that according to my observation the Ack collector role is completely ignored. 4. Ack collector thread is killed to not change
Actually, the conclusion did not consider the very fixes which ensure the dump thread is alive Cheers, | |||||||||||||
| Comment by Andrei Elkin [ 2019-03-07 ] | |||||||||||||
|
There is an open question at the end of the MDEV description and implementation.
The patch features an optional shutdown behavior to hold on The solution therefore disallows killing the dump thread until is `mysqladmin shutdown' is extended with a `--wait_for_slaves' option. A question was raised by svoj whether or not it makes sense to introduce a dynamic Andrei | |||||||||||||
| Comment by Sergei Golubchik [ 2019-03-07 ] | |||||||||||||
|
Elkin, I'd say it's fine without a variable, until somebody requests it and explains why he cannot use mysql -e 'shutdown wait_for_slaves'. (Hint: not knowing the root password is not a good reason, as 10.4 uses unix_socket by default.) But please rename "wait_for_slaves" to "wait_for_all_slaves", and maybe drop the underscores; the SQL standard generally avoids them. | |||||||||||||
| Comment by Andrei Elkin [ 2019-03-07 ] | |||||||||||||
|
GeoffMontee, thanks for assessing and the future-proof idea! Personally, I also thought the global variable can wait until it is requested. Hopefully, that's fine with you as well. | |||||||||||||
| Comment by Geoff Montee (Inactive) [ 2019-03-07 ] | |||||||||||||
|
Hi Elkin, I am working with a user who ran into this problem with a master that has semi-synchronous slaves. If there is no global variable to start with, and this semi-synchronous master is shut down externally (i.e. via systemd), would it still be shut down safely by default? Or would that master also have to be manually shut down with something like "shutdown wait_for_all_slaves"? These servers are in a containerized environment. These containers can be shut down automatically when they need to be moved to a different physical host for performance reasons. This is a completely automated process, so there is no DBA involved to execute a command like "shutdown wait_for_all_slaves". This user would much prefer to be able to ensure that the semi-synchronous master is shut down safely by default, without manually executing some SQL command. | |||||||||||||
| Comment by Andrei Elkin [ 2019-03-07 ] | |||||||||||||
|
GeoffMontee, regarding > there is no DBA involved to execute a command like "shutdown wait_for_all_slaves": I take it that the server option has just been requested, to be delivered for such architectures. | |||||||||||||
| Comment by Geoff Montee (Inactive) [ 2019-03-07 ] | |||||||||||||
|
Hi Elkin, The server process isn't killed in the case that I'm talking about. It's shut down gracefully by the system, similar to how it would be if someone rebooted their database server like this:
The server process would be gracefully shut down with systemd. | |||||||||||||
| Comment by Andrei Elkin [ 2019-03-07 ] | |||||||||||||
|
GeoffMontee does this specific container shutdown have any hook where mysqladmin could run? | |||||||||||||
| Comment by Geoff Montee (Inactive) [ 2019-03-07 ] | |||||||||||||
|
Hi Elkin, He said that they can use a Kubernetes PreStop hook: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/ So he is OK with the non-safe shutdown being the default behavior. | |||||||||||||
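For an automated environment like the one described here, the hook could issue the safe shutdown itself. The following is a minimal sketch under several assumptions: that the `mysqladmin` option is spelled `--wait-for-all-slaves` in the released server (this thread discusses `--wait_for_slaves` and a rename, so the spelling should be checked against the server version in use), that `mysqladmin` is on the `PATH`, and that the hook can authenticate, for example via `unix_socket`. The function names `build_shutdown_command` and `safe_shutdown` and the 60-second timeout are hypothetical.

```python
import subprocess


def build_shutdown_command(wait_for_slaves=True, extra_args=()):
    """Build the mysqladmin argument list (option spelling is an assumption)."""
    cmd = ["mysqladmin", *extra_args]
    if wait_for_slaves:
        # Assumed spelling of the option that resulted from this task.
        cmd.append("--wait-for-all-slaves")
    cmd.append("shutdown")
    return cmd


def safe_shutdown(timeout_seconds=60):
    """Run the shutdown; return True only if mysqladmin exited successfully."""
    try:
        result = subprocess.run(
            build_shutdown_command(), timeout=timeout_seconds, check=False
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # mysqladmin is missing, or the shutdown did not finish in time.
        return False
    return result.returncode == 0
```

A hook script would call `safe_shutdown()` and exit non-zero when it returns `False`, so that a failed or timed-out safe shutdown is visible to whatever orchestrates the container.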
| Comment by Andrei Elkin [ 2019-03-12 ] | |||||||||||||
|
Pushed 3568427d11f to 10.4.4. | |||||||||||||
| Comment by Geoff Montee (Inactive) [ 2019-08-01 ] | |||||||||||||
|
I just documented this behavior: https://mariadb.com/kb/en/library/replication-threads/#binary-log-dump-threads-and-the-shutdown-process |