[MDEV-31517] Wrong variable name in the configuration leads Galera to think SST/IST failed, at next restart will request a full SST Created: 2023-06-21  Updated: 2023-12-04

Status: Stalled
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.6.14
Fix Version/s: 10.6

Type: Bug Priority: Critical
Reporter: Claudio Nanni Assignee: Julius Goryavsky
Resolution: Unresolved Votes: 0
Labels: None


 Description   

When there's a wrong variable name MariaDB Server stops and exits.

There is thou a misleading error message logged by Galera which can lead to think that SST failed because of that (even when no SST actually happened.

 
2023-06-21 15:04:07 3 [Note] WSREP: Server status change joiner -> initializing
2023-06-21 15:04:07 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-06-21 15:04:07 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
2023-06-21 15:04:07 0 [Note] InnoDB: Number of pools: 1
2023-06-21 15:04:07 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-06-21 15:04:07 0 [Note] InnoDB: Using Linux native AIO
2023-06-21 15:04:07 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2023-06-21 15:04:07 0 [Note] InnoDB: Completed initialization of buffer pool
2023-06-21 15:04:07 0 [Note] InnoDB: 128 rollback segments are active.
2023-06-21 15:04:07 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-06-21 15:04:07 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-06-21 15:04:07 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-06-21 15:04:07 0 [Note] InnoDB: 10.6.14 started; log sequence number 3807484; transaction id 185
2023-06-21 15:04:07 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-06-21 15:04:07 0 [Note] InnoDB: Loading buffer pool(s) from /run/media/claudio/FedoraData/myharem/instances/mariadb-enterprise-10.6.14-9-centos-9-x86_64.30614/data/ib_buffer_pool
2023-06-21 15:04:07 0 [ERROR] /run/media/claudio/FedoraData/myharem/instances/mariadb-enterprise-10.6.14-9-centos-9-x86_64.30614/bin/mariadbd: unknown variable 'table_definitio_cache=4K'
2023-06-21 15:04:07 0 [ERROR] Aborting
2023-06-21 15:04:07 3 [ERROR] WSREP: sst_received failed: State wait was interrupted
2023-06-21 15:04:07 2 [ERROR] WSREP: Application received wrong state:

And this is without a wrong variable name:

2023-06-21 15:01:13 3 [Note] WSREP: Server status change joiner -> initializing
2023-06-21 15:01:13 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-06-21 15:01:13 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
2023-06-21 15:01:13 0 [Note] InnoDB: Number of pools: 1
2023-06-21 15:01:13 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-06-21 15:01:13 0 [Note] InnoDB: Using Linux native AIO
2023-06-21 15:01:13 0 [Note] InnoDB: Initializing buffer pool, total size = 134217728, chunk size = 134217728
2023-06-21 15:01:13 0 [Note] InnoDB: Completed initialization of buffer pool
2023-06-21 15:01:13 0 [Note] InnoDB: 128 rollback segments are active.
2023-06-21 15:01:13 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-06-21 15:01:13 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-06-21 15:01:13 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-06-21 15:01:13 0 [Note] InnoDB: 10.6.14 started; log sequence number 3803750; transaction id 172
2023-06-21 15:01:13 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-06-21 15:01:13 0 [Note] InnoDB: Loading buffer pool(s) from /run/media/claudio/FedoraData/myharem/instances/mariadb-enterprise-10.6.14-9-centos-9-x86_64.30614/data/ib_buffer_pool
2023-06-21 15:01:13 0 [Note] InnoDB: Buffer pool(s) load completed at 230621 15:01:13
2023-06-21 15:01:13 0 [Note] Server socket created on IP: '0.0.0.0'.
2023-06-21 15:01:13 0 [Note] Server socket created on IP: '::'.
2023-06-21 15:01:13 0 [Warning] WSREP: Ignoring server id for non bootstrap node.
2023-06-21 15:01:13 0 [Note] WSREP: wsrep_init_schema_and_SR 0x0
2023-06-21 15:01:13 0 [Note] WSREP: Server initialized
2023-06-21 15:01:13 0 [Note] WSREP: Server status change initializing -> initialized
2023-06-21 15:01:13 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-06-21 15:01:13 3 [Note] WSREP: Recovered position from storage: fc65b47a-0ebd-11ee-a691-eb1618faaa0e:72
2023-06-21 15:01:13 3 [Note] WSREP: Recovered view from SST:

You can see the difference after: [Note] InnoDB: Loading buffer pool(s)

My impression is that Galera inherits somehow an error happening in that phase wrongly attributing it to SST phase which is formally still open.

The real problem is that at next restart the node will request a full SST even if that's not really required.
This happens also with no activity whatsovere in the cluster.

To reproduce:

  • Start Galera cluster
  • Stop a Node
  • Add non_existing_variable=ON in the config
  • Start the node
  • See how it fails and notice the SST failed messages
  • Remove the wrong variable
  • Start the node
  • Observe a full SST happening


 Comments   
Comment by Jan Lindström [ 2023-09-01 ]

claudio.nanni The state was not saved because server killed the node ungracefully. In such cases before node startup you need to run wsrep_recover to determine the actual database state and pass that option to the node on a command line:

mysqld .... --wsrep_start_position=XXXX:1234

Otherwise how does it know its state and how would it know that it does not need an SST. While the server is running its position is subject to constant change. So in grastate.dat it is saved as e33471d9-4887-11ee-8d0b-9a90c947af83:-1 Unless you shut it down gracefully, it does not know what it is.

Comment by Claudio Nanni [ 2023-09-01 ]

janlindstrom Good to know, nevertheless imho MariaDB should take care of that.
Either stop it gracefully or understand what happened and run wsrep_recover at next startup.
In any common scenario a user that puts a wrong parameter, realizes it, fixes it, and restarts the service.
I'm even against just documenting it, it doesn't sound really smart imho.
Do you think this can be fixed in a way that MariaDB takes care of that automatically?

Comment by Jan Lindström [ 2023-09-01 ]

claudio.nanni That would require server to do clean shutdown in case of incorrect variable. For some reason it does not do so and I do not know why and not sure if it would help as all server elements are not yet started.

Comment by Claudio Nanni [ 2023-09-04 ]

janlindstrom I think this should be addressed either at server or galera level because requiring SST for a trivial error is very bad and actually not needed. There's no problem with the node, node was stopped, full data is there, it was restarted and now a simple typo will trigger maybe a hours if not day long SST (with terabytes of data). IMHO it's unacceptable from a user standpoint:
config typo --> node is trashed and needs complete resync

Comment by Justin Bennett [ 2023-09-04 ]

So an SST is forced on next restart because I accidentally typed "inndb_buffer_pool_size" into a config file rather than "innodb_buffer_pool_size"? How can you not see that this is a problem that requires fixing?

Comment by Jan Lindström [ 2023-09-04 ]

Justin Bennett I agree on your case it is a problem. From developer point of view there is no way to know can you ignore unknown variables or does it lead bigger problems later for user. Provided workaround using wsrep_recover and providing wsrep_start_position is safe way and then it should not force full SST. Changing the handling i.e. use position we maybe "know" in this case could lead cluster inconsistency and inconsistency voting leading full SST, I rather not go there as at the moment not sure how to determine is unknown variable harmless or not and is position we think we know correct or not.

Comment by Justin Bennett [ 2023-09-04 ]

Sorry Jan I can't seem to tag you in a comment.

Does the provided workaround work if the database restarts multiple times if Restart is set to 'on-abort' in the mariadb.service file? 'on-abort' is the default, which will cause the database to go into an abort/restart loop.

Can the configuration validation tool requested in MDEV-31527 be expedited?

Generated at Thu Feb 08 10:24:28 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.