[MDEV-22136] wsrep_restart_slave = 1 does not always work Created: 2020-04-03 Updated: 2021-04-19 Resolved: 2021-03-17 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Replication |
| Affects Version/s: | 10.2.12, 10.2.32, 10.3 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Valerii Kravchuk | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
There is an async replication setup between two Galera clusters of 3 nodes each. In some cases the SQL thread on the slave cluster stops with error 1047. The error log content is the following:
So, normally the SQL thread restarts, only to (maybe) hit a conflict on some other row (note the different id values in the messages above). But in some cases it stops and replication does not continue, like this:
So, async replication had to be monitored and restarted "manually". After that:
The question is: with wsrep_restart_slave = 1, why are there cases when the slave restart does not happen automatically? It looks like a bug. |
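The manual monitoring and restarting described above could be scripted. A minimal sketch of such an external watchdog follows; the host name is a placeholder, the `mysql` client is assumed to be on PATH, and restarting only on error 1047 is an assumption based on the error mentioned in this report:

```shell
#!/bin/sh
# Hypothetical watchdog for the async slave SQL thread on a Galera node.
# Restarts the SQL thread only when it has stopped with error 1047
# (the error cited in this report); anything else is left for a human.

should_restart() {
    # $1 = Slave_SQL_Running, $2 = Last_SQL_Errno
    [ "$1" = "No" ] && [ "$2" = "1047" ]
}

check_and_restart() {
    # "slave-node" is a placeholder host, not from the report.
    status=$(mysql -h slave-node -e "SHOW SLAVE STATUS\G")
    running=$(printf '%s\n' "$status" | awk -F': ' '/Slave_SQL_Running:/ {print $2}')
    errno=$(printf '%s\n' "$status" | awk -F': ' '/Last_SQL_Errno:/ {print $2}')
    if should_restart "$running" "$errno"; then
        mysql -h slave-node -e "STOP SLAVE SQL_THREAD; START SLAVE SQL_THREAD;"
    fi
}
```

Such a script would be run from cron or a systemd timer; it is a workaround, not a substitute for wsrep_restart_slave working as expected.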
| Comments |
| Comment by Jan Lindström (Inactive) [ 2020-11-19 ] |
|
In the case when the slave did not restart automatically, did the node go into Non-Primary state? MariaDB 10.2.36 contains some important fixes, but I am not sure whether they would help here. To analyze further, we would need full error logs and some instructions on how to reproduce. |
| Comment by Seppo Jaakola [ 2021-01-12 ] |
|
The async replication slave restarting feature (configured by the wsrep_restart_slave parameter) was developed to automate restarting the slave thread in the situation where a node operating as replication slave in the cluster drops out of the cluster and later joins back. In old versions, the async slave thread would stop as soon as the node dropped from the cluster, and when the node joined back, the async slave thread remained stopped, although the slave node was by then healthy and capable of applying the async replication stream.

However, I can see that later development has extended the effect of wsrep_restart_slave to also cover cases where async replication event handling fails with an error during applying. Only conflicts with Galera replication are checked for; if applying fails for a "natural" problem not related to Galera, then slave thread restarting does not kick in. Reading the error logs above, it appears that the apply-time error checking does not detect all possible Galera replication conflicts, and the slave thread remains stopped because of this.

All in all, the behavior of the wsrep_restart_slave parameter has deviated from the original requirement specification. It would be possible to develop this deviated behavior further, but there are some risks and "unknowns" involved. Note that the original idea of wsrep_restart_slave is that async replication works successfully, and restarting happens only when the node is joining back to the cluster. In the deviated behavior, async replication has conflicts with Galera replication, and we now need to decide how to resolve these conflicts. This raises questions like:
There may be users who rely on wsrep_restart_slave as originally designed, and this automatic replication conflict resolution already violates their use case; extending it further would violate it even more. To preserve backward compatibility, it would be best to have additional configuration for enabling conflict resolution. For example, wsrep_restart_slave could be a bit field, with the following flags to trigger a restart on:
There are also variables for controlling replication slave operation, such as slave_skip_errors and slave_transaction_retry_errors. It might be possible to extend these to cover conflicts with Galera replication as well. |
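For reference, both variables mentioned above already exist in MariaDB and are set at server startup. A hypothetical my.cnf fragment along those lines might look like the following; the error numbers are illustrative (1213 is ER_LOCK_DEADLOCK), not taken from this report, and neither variable currently covers the Galera conflict cases discussed here:

```ini
# Hypothetical slave-side configuration: treat listed errors as
# retryable instead of stopping the SQL thread immediately.
[mysqld]
slave_transaction_retry_errors = 1213
slave_transaction_retries      = 10
```

Extending slave_transaction_retry_errors to recognize Galera-induced conflicts would be one way to implement the conflict handling Seppo describes, but that is a design option, not existing behavior.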
| Comment by Jan Lindström (Inactive) [ 2021-03-17 ] |
|
I would say this is not-a-bug, as the documented and designed behavior of the wsrep_restart_slave parameter is still in place. Some effort has been made to extend it on a best-effort basis, but it does not work in all error cases; in those cases it is not intended to work. |