[MDEV-4211] Galera: with binlog-checksum=1 any ALTER TABLE statement results in Error_code: 1064 and not replicated on other nodes Created: 2013-02-27  Updated: 2013-03-08  Resolved: 2013-03-04

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.28a-galera
Fix Version/s: 5.5.29-galera

Type: Bug Priority: Major
Reporter: Aleksey Sanin (Inactive) Assignee: Seppo Jaakola
Resolution: Fixed Votes: 0
Labels: galera
Environment:

CentOS 5.x, Ubuntu 12.04



 Description   

We have a setup of 3 servers in a galera cluster: db01, db02, db03.

If we run an ALTER TABLE statement on one of the nodes then the other two nodes get an error in the log and the statement is not replicated. For example, the following queries run on db01:

ALTER TABLE `test` ADD INDEX `started_time` (`started_time`);
ALTER TABLE `test2` ADD COLUMN `ip` varchar(39) DEFAULT NULL;

Resulted in the following errors on db02 (db03 errors look the same):

130227 7:30:09 [ERROR] Slave SQL: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '' at line 1' on query. Default database: 'test'. Query: 'ALTER TABLE `test` ADD INDEX `started_time` (`started_time`)', Error_code: 1064
130227 7:30:09 [Warning] WSREP: RBR event 1 Query apply warning: 1, 9952638
130227 7:30:09 [Warning] WSREP: Ignoring error for TO isolated action: source: 6885794c-7ea3-11e2-0800-7f8704162adb version: 2 local: 0 state: APPLYING flags: 65 conn_id: 2863428 trx_id: -1 seqnos (l: 60187, g: 9952638, s: 9952637, d: 9952637, ts: 1361950209418672000)
130227 7:36:10 [ERROR] Slave SQL: Error 'You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near '' at line 2' on query. Default database: 'test'. Query: 'ALTER TABLE `test2` ADD COLUMN `ip` varchar(39) DEFAULT NULL', Error_code: 1064
130227 7:36:10 [Warning] WSREP: RBR event 1 Query apply warning: 1, 9955444
130227 7:36:10 [Warning] WSREP: Ignoring error for TO isolated action: source: 6885794c-7ea3-11e2-0800-7f8704162adb version: 2 local: 0 state: APPLYING flags: 65 conn_id: 2862856 trx_id: -1 seqnos (l: 63003, g: 9955444, s: 9955443, d: 9955443, ts: 1361950570003527000)

No errors were produced on db01.

I've submitted my.cnf for another bug (https://mariadb.atlassian.net/browse/MDEV-4136); the only change is the following addition:

wsrep_log_conflicts=1

Aleksey



 Comments   
Comment by Aleksey Sanin (Inactive) [ 2013-02-27 ]

Forgot to add: after checking the db02/db03 nodes, we confirmed that the ALTER TABLE commands were not executed there.

Comment by Elena Stepanova [ 2013-02-28 ]

Hi Aleksey,

I can't see a cnf in MDEV-4136, can you?
We had one in MDEV-4179, with all wsrep* options disabled, did you mean this one?

Comment by Aleksey Sanin (Inactive) [ 2013-02-28 ]

Right, this one. Basically, this is the same my.cnf with all wsrep_* options re-enabled.

Comment by Aleksey Sanin (Inactive) [ 2013-02-28 ]

Also, we tried to reproduce this problem in our test environment using the same configs, without success (the only difference is memory size and thread count, because the test machines are smaller than production).

We also found these issues that look related:

http://www.perconaforum.com/index.php?t=msg&goto=9378&
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1112363
https://groups.google.com/forum/?fromgroups=#!topic/percona-discussion/bOEclT7K7ow
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1049105

Comment by Elena Stepanova [ 2013-02-28 ]

Okay, thanks, I see – many questions, no answers... Let's see if we can figure it out.

Does it happen on any table at all?
Does it only happen on ALTER TABLE? What about other DDL – CREATE TABLE, DROP TABLE, ALTER DATABASE, ...?
What happens if you execute a similar statement on db02 or db03, does it work there? Does it get replicated to db01?
When you get such an error in your node log, there is also a GRA*log file created in the datadir of this node. The files are tiny and basically just contain the offending event. Could you please attach one of those, to be certain (the latest one), and quote the corresponding error message from the log?

Thanks.
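To make the GRA*log request above easier to act on, here is a minimal sketch of a helper that picks the most recently modified file matching that pattern. It assumes only what the comment states (files matching GRA*log sit directly in the node's datadir); the function name and the example path in the comment are hypothetical and not part of any MariaDB or Galera tooling.

```python
from pathlib import Path
from typing import Optional


def latest_gra_log(datadir: str) -> Optional[Path]:
    """Return the most recently modified GRA*log file in datadir, or None.

    Assumption: the files live directly in the datadir (no subdirectories),
    as described in the comment above.
    """
    candidates = list(Path(datadir).glob("GRA*log"))
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Hypothetical usage; the datadir path depends on the installation:
# print(latest_gra_log("/var/lib/mysql"))
```

The file found this way can then be attached together with the matching error-log lines.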

Comment by Aleksey Sanin (Inactive) [ 2013-02-28 ]

1) Yes, we tried a few tables and all have the same issue. Of course, all tables were InnoDB

2) We didn't try other DDLs unfortunately

3) Yes, the same issue regardless of the origin of the DDL. Exactly the same log entries.

4) Unfortunately, we had to roll back to the plain Master-Slave setup last night, so I don't have the error log anymore. And as I said, we can't repro it in our test environment.

Sorry, not a lot of data, unfortunately. I was planning to build the latest 5.5.29, test it in the dev environment, and maybe try another production rollout next week if everything goes well. I'll definitely keep an eye on the GRA*log files this time.

Comment by Aleksey Sanin (Inactive) [ 2013-03-01 ]

We found it. The issue is 100% reproducible with

binlog_checksum=1
master_verify_checksum=1
slave_sql_verify_checksum=1

And things work as expected if

#binlog_checksum=1
#master_verify_checksum=1
#slave_sql_verify_checksum=1

Comment by Aleksey Sanin (Inactive) [ 2013-03-01 ]

Clarification: binlog_checksum=1 alone is enough to trigger the issue.

Comment by Elena Stepanova [ 2013-03-01 ]

Thank you, Aleksey.

The problem is still reproducible on current maria-5.5-galera (revno 3386).

Comment by Seppo Jaakola [ 2013-03-04 ]

The problem happens because the chosen checksum algorithm is not communicated to the receiving nodes; currently there is no method for it.

Note that Galera replication already has its own checksums, so these binlog checksums should not be used.
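The failure mode described above can be illustrated with a toy model. This is only a sketch under loud assumptions: the "event" here is just the query bytes with an optional trailing 4-byte CRC32, which is not the real binlog event layout, and write_event/read_event are hypothetical names. It shows how a receiver that is never told about the checksum treats the extra bytes as part of the statement.

```python
import zlib


def write_event(query: str, with_checksum: bool) -> bytes:
    """Serialize a toy query event; optionally append a 4-byte CRC32."""
    body = query.encode("latin-1")
    if with_checksum:
        body += zlib.crc32(body).to_bytes(4, "little")
    return body


def read_event(raw: bytes, expect_checksum: bool) -> str:
    """Parse a toy query event, verifying the CRC32 only if it is expected."""
    if expect_checksum:
        body, crc = raw[:-4], raw[-4:]
        if zlib.crc32(body).to_bytes(4, "little") != crc:
            raise ValueError("checksum mismatch")
        raw = body
    return raw.decode("latin-1")


query = "ALTER TABLE `test` ADD INDEX `started_time` (`started_time`)"
event = write_event(query, with_checksum=True)

# Receiver that knows about the checksum recovers the statement.
assert read_event(event, expect_checksum=True) == query

# Receiver that was never told about it sees 4 stray bytes after the SQL.
garbled = read_event(event, expect_checksum=False)
assert garbled.startswith(query) and len(garbled) == len(query) + 4
```

Trailing bytes after an otherwise valid statement would be consistent with the 1064 syntax errors reported on db02 and db03, although the toy model does not prove that this is exactly what the parser saw.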

Comment by Elena Stepanova [ 2013-03-04 ]

>> these binlog checksums should not be used

Then the server should disable/ignore them automatically (with a proper warning in the error log). We cannot expect every user to know all subtle limitations and fight with the issues caused by them.

Comment by Seppo Jaakola [ 2013-03-04 ]

A temporary fix has been pushed. It sends a format description (FD) event before the query event, and the FD carries the current binlog checksum setting. This fix enables checksums only for DDL statements, and the binlog_checksum option should still not be used.
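The idea of the temporary fix can be sketched with a toy model. Everything here is an assumption made for illustration: the one-byte "format description", the constants, and the function names are hypothetical and do not reflect the real event formats or the actual patch. The point is only that the receiver learns the checksum setting from the preceding event instead of guessing.

```python
import zlib
from typing import List

# Hypothetical algorithm identifiers for this sketch.
CHECKSUM_NONE = 0
CHECKSUM_CRC32 = 1


def encode_stream(query: str, checksum_alg: int) -> List[bytes]:
    """Emit a toy format description followed by a toy query event."""
    fd = bytes([checksum_alg])  # toy FD: one byte naming the algorithm
    body = query.encode("latin-1")
    if checksum_alg == CHECKSUM_CRC32:
        body += zlib.crc32(body).to_bytes(4, "little")
    return [fd, body]


def apply_stream(events: List[bytes]) -> str:
    """Read the FD first, then parse the query event accordingly."""
    fd, raw = events
    if fd[0] == CHECKSUM_CRC32:
        body, crc = raw[:-4], raw[-4:]
        if zlib.crc32(body).to_bytes(4, "little") != crc:
            raise ValueError("checksum mismatch")
        raw = body
    return raw.decode("latin-1")


query = "ALTER TABLE `test2` ADD COLUMN `ip` varchar(39) DEFAULT NULL"
# The statement survives whether or not the sender has checksums enabled.
assert apply_stream(encode_stream(query, CHECKSUM_CRC32)) == query
assert apply_stream(encode_stream(query, CHECKSUM_NONE)) == query
```

Because the setting travels with the statement, sender and receiver no longer need matching configuration for the DDL to apply cleanly.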

Comment by Seppo Jaakola [ 2013-03-04 ]

Fix pushed in: http://bazaar.launchpad.net/~maria-captains/maria/maria-5.5-galera/revision/3389

Comment by Aleksey Sanin (Inactive) [ 2013-03-05 ]

I disagree that binlog checksums should not be used. In a mixed setup, where a Galera cluster also streams data through normal replication (e.g. for delayed replication), these binlog checksums are useful.

Comment by Seppo Jaakola [ 2013-03-08 ]

Aleksey, that's a valid use case, indeed. And binlog checksumming should not hurt with Galera replication in general. It is just that this issue was acknowledged so close to the release deadline that there was no time for a proper fix with full binlog checksum support. The plan is to continue with this development for future releases.

Comment by Aleksey Sanin (Inactive) [ 2013-03-08 ]

Hi Seppo. Thanks for your reply. It is perfectly fine to have a temporary hack for the release. I was just concerned that the issue was marked as "fixed" at the same time, so it was not clear whether there are plans to implement the "right" fix.
