[MDEV-24956] ALTER TABLE not replicated with Galera in MariaDB 10.5.9 Created: 2021-02-23  Updated: 2021-09-30  Resolved: 2021-04-05

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.9
Fix Version/s: 10.4.19, 10.5.10

Type: Bug Priority: Major
Reporter: Petr Šťastný Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 2
Labels: galera
Environment:

Debian Buster (LXC containers in Ubuntu 20.10), MariaDB from official repo (10.5.9+maria~buster)


Attachments: File galera_2_cluster_alter.cnf     File galera_2_cluster_alter.test    
Issue Links:
Duplicate
is duplicated by MDEV-25653 Replication between two cluster fails... Closed
Relates
relates to MDEV-25653 Replication between two cluster fails... Closed

 Description   

After upgrading 3-node cluster from 10.5.8 to 10.5.9, I noticed that ALTER TABLE statements are not replicated. Change is visible only on initiator. No log message anywhere.

Later when I try to write a row in that table, other node fails:

2021-02-23 15:17:33 2 [ERROR] Slave SQL: Column 57 of table 'xx.xx' cannot be converted from type 'tinyint' to type 'enum('unknown','ok','warning','slow','response_tim', Internal MariaDB error code: 1677
 
...
 
2021-02-23 15:17:33 2 [ERROR] WSREP: Inconsistency detected: Inconsistent by consensus on 654e4dc9-1f74-11eb-9dab-7e7fcd43b6b6:20437917
         at /home/buildbot/buildbot/build/galera/src/replicator_smm.cpp:process_apply_error():1347

I tried many different ALTER TABLE statements, without success. For example, CREATE TABLE works. All DML commands are also ok.

After downgrading back to 10.5.8, everything works.



 Comments   
Comment by Oli Sennhauser [ 2021-03-18 ]

CANNOT reproduce:

CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(128) DEFAULT NULL,
`ts` timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
`blabla` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=7274387 DEFAULT CHARSET=latin1
alter table test add column pro tinyint unsigned;
alter table test modify pro enum('unknown','ok','warning','slow','response_tim');
Query OK, 2086152 rows affected (1 min 11.856 sec)
Records: 2086152 Duplicates: 0 Warnings: 0

Would be nice to have more details?

Comment by Petr Šťastný [ 2021-03-29 ]

So I decided to upgrade another set of my servers that form a Galera cluster. Unfortunately I ran into the same problem. I spent few hours trying to change configuration to see what can be causing this. It seems that I am the only one who faces this problem?

Let's dig into details.

I use MariaDB 10.5, always 3 node in Galera cluster, running Debian 10. The problem is that ALTER TABLE commands are not propagated to other nodes.

For example, I have the following table:

CREATE TABLE `test` (
  `ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(140) COLLATE utf8mb4_czech_ci DEFAULT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_czech_ci

Let's print the columns:

MariaDB [aaa_test]> SHOW COLUMNS FROM test;
+-------+------------------+------+-----+---------+----------------+
| Field | Type             | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| ID    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| name  | varchar(250)     | NO   |     | NULL    |                |
+-------+------------------+------+-----+---------+----------------+

I issue the following command on one of the nodes:

MariaDB [aaa_test]> ALTER TABLE test CHANGE name name varchar(140) NULL DEFAULT NULL;
Query OK, 0 rows affected (0.032 sec)              
Records: 0  Duplicates: 0  Warnings: 0

On that node I see that column definition has changed. But it didn't on other 2 nodes. There is no error message in log file (I event turned wsrep_debug on).

It doesn't matter on which node I execute the command or in which table I do it, it always turns out the same way.

Tested with the following version from official MariaDB repository: 10.5.9+maria~buster. Whenever I downgrade back to 10.5.8+maria~buster, everything works fine.

A also tried to have different versions on the nodes. The problem arises when the node where ALTER command is initiated has 10.5.9 version. When the initiator is 10.5.8 and others are 10.5.9, there is no problem.

This is my my.cnf (symlink to mariadb.cnf):

[client]
port = 3306
socket = /run/mysqld/mysqld.sock
default_character_set = utf8mb4
 
[mysqld_safe]
socket = /run/mysqld/mysqld.sock
nice = 0
 
[mysqld]
tls_version=TLSv1.2,TLSv1.3
ssl_ca=/etc/mysql/ssl/ca.crt
ssl_cert=/etc/mysql/ssl/db1.crt
ssl_key=/etc/mysql/ssl/db1.key
 
user = mysql
log_error = /var/log/mysql/error.log
pid-file = /run/mysqld/mysqld.pid
socket = /run/mysqld/mysqld.sock
bind-address = 0.0.0.0
port = 3306
basedir = /usr
#datadir = /var/lib/mysql
datadir = /data/db/data
innodb_log_group_home_dir = /data/db/innodblog
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking
skip-name-resolve
 
collation_server = utf8mb4_czech_ci
character_set_server = utf8mb4
 
key_buffer = 16M
table_open_cache = 4096
table_definition_cache = 4096
max_connections = 500
tmp_table_size = 16M
max_heap_table_size = 16M
 
query_cache_type = 0
#query_cache_limit = 2M
#query_cache_size = 16M
 
default_storage_engine=innodb
innodb_buffer_pool_size = 512M
innodb_file_per_table
innodb_log_buffer_size = 16M
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0  # pouze Galera
innodb_log_file_size = 128M
 
# ZFS optimizations
innodb_log_write_ahead_size=16384
innodb_doublewrite=0
innodb_checksum_algorithm=none
innodb_flush_neighbors=0
innodb_use_native_aio=0
innodb_use_atomic_writes=0
 
binlog_format=ROW
binlog_cache_size = 64K
max_binlog_size = 100M
expire_logs_days=10
sync_binlog=1
server_id=1
gtid_domain_id=1
 
wsrep_on=ON
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_provider_options="gcache.size=100M;socket.ssl_cert=/etc/mysql/ssl/db1.crt;socket.ssl_key=/etc/mysql/ssl/db1.key;socket.ssl_ca=/etc/mysql/ssl/ca.crt"
wsrep_cluster_name="dbcluster"
#wsrep_cluster_address="gcomm://"
wsrep_cluster_address="gcomm://10.0.3.17,10.0.3.18"
wsrep_node_name=db1
wsrep_node_address="10.0.3.16"
wsrep_sst_method=mariabackup
wsrep_sst_auth=mysql:
wsrep_slave_threads=2
#wsrep_debug=1
 
# master
#log-bin=/data/db/binlog/mysqld-bin
 
# multi-master
#auto_increment_increment=2
#auto_increment_offset=1
 
# slave
#relay-log=/data/db/binlog/mysqld-relay-bin
#read-only
 
[sst]
encrypt=3
tcert=/etc/mysql/ssl/db1.crt
tkey=/etc/mysql/ssl/db1.key
 
[mysqldump]
quick
quote-names
max_allowed_packet = 16M
 
[isamchk]
key_buffer = 16M

Note that I am using ZFS file system. There are few InnoDB optimizations for that. I tried to disable them (thougth they are causing it), but it did not help.

If this still can't help you to find the bug (something that has changed between 10.5.8 and 10.5.9), I can try to deploy new cluster with the same problem and give you a SSH key.

Comment by Elton M. Labajo [ 2021-03-30 ]

Hi,

We are experiencing the same issue with 10.5.9-6-MariaDB-enterprise-log MariaDB Enterprise Server. The setup is two galera clusters replicating each other via 3rd node (node c on each cluster). *ALTER TABLE * on cluster A works and gets replicated to Cluster B node c (the replication slave connected to Cluster A), however it doesn't propagate to Cluster B node a and node b.

So basically, any *ALTER TABLE * statements are not propagated to the Cluster B node a and node b.

Comment by Jan Lindström (Inactive) [ 2021-03-30 ]

Hi,

I can't repeat the problem:

CREATE TABLE `test` (
`ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(140) COLLATE utf8mb4_czech_ci DEFAULT NULL,
PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_czech_ci;
SHOW COLUMNS FROM test;
Field	Type	Null	Key	Default	Extra
ID	int(10) unsigned	NO	PRI	NULL	auto_increment
name	varchar(140)	YES		NULL	
ALTER TABLE test CHANGE name name varchar(140) NULL DEFAULT NULL;
SHOW WARNINGS;
Level	Code	Message
SHOW COLUMNS FROM test;
Field	Type	Null	Key	Default	Extra
ID	int(10) unsigned	NO	PRI	NULL	auto_increment
name	varchar(140)	YES		NULL	
connection node_2;
SHOW COLUMNS FROM test;
Field	Type	Null	Key	Default	Extra
ID	int(10) unsigned	NO	PRI	NULL	auto_increment
name	varchar(140)	YES		NULL	

One strange thing on your test case is the fact that 'CREATE TABLE' has 'VARCHAR(140)' but in 'SHOW COLUMNS' it is 'VARCHAR(250)' but this does not really change anything.

Comment by Petr Šťastný [ 2021-03-30 ]

So do you want me to set up a cluster with my exact configuration with that problem and give you a SSH access?

Comment by Jan Lindström (Inactive) [ 2021-03-30 ]

I would like repeatable test case, that would be lot more easier.

Comment by Petr Šťastný [ 2021-03-30 ]

Unfortunately I don't think this case is so easy. It seems that the problem depends on maybe on server configuration or maybe other enviroment settings. In my case, on my 2 clusters, any ALTER TABLE command does not work. There is no one special test case that fails and other that is ok. It just don't work at all. I can try to do any change on any column in any table in any database (some testing or real-life) - in all cases, table definition change is not propagated (at the same time, DML is ok). The case I presented here was just a very simple example, not exact situation when it happens and not in other situations.

Do you think there is a change between 10.5.8 and 10.5.9 that could inlfluence this so that I can focus on it and try to change configuration settings?

Comment by Jan Lindström (Inactive) [ 2021-03-30 ]

Unfortunately, I do not know any change that could cause the issue. To further diagnose this please provide full error log from all nodes preferable with --wsrep-debug=1 but only in case when you hit the issue. If you hit the issue, did you use SHOW WARNINGS ?

Comment by Petr Šťastný [ 2021-03-30 ]

I believe that I have reproducible test case for you. After long investigation I realized that I may have made few mistakes during previous investigations. But the bug is still here a I am able to easy reproduce it using the following steps. The bug is reproducible using persistent connections and mysql_change_user() API call, I used PHP for that.

Create table, for example:

CREATE TABLE `test` (
  `ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_czech_ci NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

Now on Node A, run the following PHP code:

<?php
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT);
 
function alter_table($len) {
    $mysqli = new mysqli("p:localhost", "root", "", "test");
    $mysqli->query("ALTER TABLE `test` CHANGE `name` `name` VARCHAR($len) CHARACTER SET utf8mb4 COLLATE utf8mb4_cze
ch_ci NOT NULL;");
    $mysqli->close();
}
 
alter_table(110);
alter_table(120);

On node A, check table structure:

MariaDB [test]> SHOW COLUMNS FROM test;
+-------+------------------+------+-----+---------+----------------+
| Field | Type             | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| ID    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| name  | varchar(120)     | NO   |     | NULL    |                |
+-------+------------------+------+-----+---------+----------------+
2 rows in set (0.001 sec)

Do the same thing on Node B:

MariaDB [test]> show columns from test;
+-------+------------------+------+-----+---------+----------------+
| Field | Type             | Null | Key | Default | Extra          |
+-------+------------------+------+-----+---------+----------------+
| ID    | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| name  | varchar(110)     | NO   |     | NULL    |                |
+-------+------------------+------+-----+---------+----------------+
2 rows in set (0.002 sec)

As you can see, the second ALTER TABLE command has not propagated to other nodes!

And now use the same PHP script, but remove persistent connection ("p:" prefix in host) and everything works!

Let me explain what and why I did. After a lot of research I figured out that I can 100% reproduce the error using persistent connection. I don't know why. When PHP disconnects and connects persistent connection, it uses mysql_change_user() API call, which should clean up session.

I am using PHP because there is no option to call mysql_change_user() API in mysql client on command line. So I used PHP for this simple test case. No warnings are returned.

It looks like user session is in some strange state after mysql_change_user() which results that DDL is not propagated in Galera cluster. But surprisingly DML works good.

And I repeat again - when I downgrade back to 10.5.8, the problem disappears. So it does not look like there is something wrong in PHP in this test case, but in MariaDB server.

Comment by Elton M. Labajo [ 2021-03-31 ]

Hi, to illustrate on reproducing the bug I created a video. please click here I hope this will show much more clearer presentation.

Comment by Mario Karuza (Inactive) [ 2021-03-31 ]

Added galera_3nodes test based on Elton's video, tested on 10.4.19(dbg)

Generated at Thu Feb 08 09:33:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.