[MDEV-31517] Wrong variable name in the configuration leads Galera to think SST/IST failed, at next restart will request a full SST Created: 2023-06-21 Updated: 2023-12-04 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Galera, Galera SST |
| Affects Version/s: | 10.6.14 |
| Fix Version/s: | 10.6 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Claudio Nanni | Assignee: | Julius Goryavsky |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Description |
|
When the configuration contains a wrong variable name, MariaDB Server stops and exits. There is, though, a misleading error message logged by Galera, which can lead one to think that SST failed because of it (even when no SST actually happened).
And this is without a wrong variable name:
You can see the difference after: [Note] InnoDB: Loading buffer pool(s)
My impression is that Galera somehow inherits an error happening in that phase and wrongly attributes it to the SST phase, which is formally still open.
The real problem is that at the next restart the node will request a full SST even if that's not really required.
To reproduce:
|
| Comments |
| Comment by Jan Lindström [ 2023-09-01 ] | |
|
claudio.nanni The state was not saved because the server killed the node ungracefully. In such cases, before node startup you need to run wsrep_recover to determine the actual database state and pass that position to the node on the command line:
Otherwise, how does it know its state, and how would it know that it does not need an SST? While the server is running, its position is subject to constant change, so in grastate.dat it is saved as e33471d9-4887-11ee-8d0b-9a90c947af83:-1. Unless you shut it down gracefully, it does not know what its position is. | |
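To illustrate the point about grastate.dat, here is a minimal Python sketch that reads the saved state and reports whether a usable position was stored. It assumes the usual `key: value` layout of grastate.dat (`uuid`, `seqno`, and so on); the function names `parse_grastate` and `saved_position` are hypothetical helpers, not part of MariaDB or Galera.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict.

    Assumes the usual layout (comment lines start with '#').
    """
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def saved_position(state):
    """Return 'uuid:seqno' if a position was saved, else None.

    A seqno of -1 means the node was not shut down gracefully, so the
    position has to be recovered (wsrep_recover) before startup.
    """
    uuid = state.get("uuid")
    seqno = int(state.get("seqno", "-1"))
    if uuid is None or seqno < 0:
        return None
    return f"{uuid}:{seqno}"
```

If `saved_position` returns None, the position is unknown and, per the comment above, wsrep_recover would have to be run to determine it before the node is started.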
| Comment by Claudio Nanni [ 2023-09-01 ] | |
|
janlindstrom Good to know, nevertheless imho MariaDB should take care of that. | |
| Comment by Jan Lindström [ 2023-09-01 ] | |
|
claudio.nanni That would require the server to do a clean shutdown in the case of an incorrect variable. For some reason it does not do so. I do not know why, and I am not sure it would help, as not all server elements are started yet at that point. | |
| Comment by Claudio Nanni [ 2023-09-04 ] | |
|
janlindstrom I think this should be addressed at either the server or the Galera level, because requiring an SST for a trivial error is very bad and actually not needed. There's no problem with the node: it was stopped, the full data is there, it was restarted, and now a simple typo will trigger an SST that may last hours, if not days (with terabytes of data). IMHO it's unacceptable from a user standpoint: | |
| Comment by Justin Bennett [ 2023-09-04 ] | |
|
So an SST is forced on next restart because I accidentally typed "inndb_buffer_pool_size" into a config file rather than "innodb_buffer_pool_size"? How can you not see that this is a problem that requires fixing? | |
| Comment by Jan Lindström [ 2023-09-04 ] | |
|
Justin Bennett I agree that in your case it is a problem. From a developer's point of view, there is no way to know whether unknown variables can be ignored or whether ignoring them leads to bigger problems for the user later. The provided workaround, using wsrep_recover and providing wsrep_start_position, is the safe way, and with it a full SST should not be forced. Changing the handling, i.e. using the position we maybe "know" in this case, could lead to cluster inconsistency and inconsistency voting, which in turn leads to a full SST. I would rather not go there, as at the moment I am not sure how to determine whether an unknown variable is harmless, or whether the position we think we know is correct. | |
| Comment by Justin Bennett [ 2023-09-04 ] | |
|
Sorry Jan, I can't seem to tag you in a comment. Does the provided workaround still work if the database restarts multiple times because Restart is set to 'on-abort' in the mariadb.service file? 'on-abort' is the default, and it will cause the database to go into an abort/restart loop. Can the configuration validation tool requested in MDEV-31527 be expedited?
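
As a rough illustration of what such a pre-start check could catch, here is a minimal Python sketch that flags option names not found in a known list and suggests the closest match. This is not the tool requested in MDEV-31527: the function `find_unknown_options` and the tiny `KNOWN_VARIABLES` set are hypothetical, and a real check would need the full variable list of the installed server version.

```python
import difflib

# Hypothetical, deliberately tiny list; a real check would need the
# complete set of variables known to the installed server.
KNOWN_VARIABLES = {
    "innodb_buffer_pool_size",
    "wsrep_on",
    "wsrep_provider",
    "wsrep_cluster_address",
}


def find_unknown_options(config_text, known=KNOWN_VARIABLES):
    """Return (name, suggestion) pairs for option names not in `known`.

    `suggestion` is the closest known name, or None if nothing is close.
    """
    problems = []
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, [section] headers and !include directives.
        if not line or line[0] in "#;[!":
            continue
        # Option names may be written with dashes or underscores.
        name = line.split("=", 1)[0].strip().replace("-", "_")
        if name in known:
            continue
        close = difflib.get_close_matches(name, sorted(known), n=1, cutoff=0.8)
        problems.append((name, close[0] if close else None))
    return problems
```

Run against a file containing "inndb_buffer_pool_size", this reports the misspelled name together with "innodb_buffer_pool_size" as the likely intended variable, before the server is ever started.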