[MDEV-30456] Galera / MariaDB issue with ALTER TABLE statement still propagating to other cluster members in TOI while it failed on origin server Created: 2023-01-24 Updated: 2024-01-30 |
|
| Status: | In Review |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.3, 10.5 |
| Fix Version/s: | 10.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Tom | Assignee: | Julius Goryavsky |
| Resolution: | Unresolved | Votes: | 3 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
Hi, I'm new here so I'm not really sure what are the minimal requirements for a bug report so I'll just work from my own experience for now: We had an incident this morning whereby an ALTER TABLE statement was executed on a server in a cluster of 3 servers. This became an outage for us because the statement resulted in a COPY algorithm being used on a large table (in member 2 and 3 of the cluster) which (at least in my opinion) should have been prevented by the statements that were done on member 1 of the cluster:
Note that the enum order changed by accident which caused the COPY algorithm to be needed which made it fail on member 1 which in turn didn't stop it from propagating the query to the other cluster members somehow (this seems like unwanted behavior to me). I've checked this by using SHOW PROCESSLIST; on all machines and found that it was only running on both members 2 and 3 while it clearly showed me the following error on member 1 to which I was connected while running the queries: ALGORITHM=INSTANT is not supported. Reason: Cannot change column type. Try ALGORITHM=COPY Some version information:
I was wondering if this might be a known issue somehow? Or if this is really a bug? And if so is it a bug in Galera or in MariaDB? |
| Comments |
| Comment by Tom [ 2023-01-24 ] | |||||||||||||||||||||||||
|
I also reported this on github for galera (not sure on which side this issue is caused). | |||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2023-04-02 ] | |||||||||||||||||||||||||
|
The session state is not preserved in receiving node, when applying the replicated ALTER statement. Therefore the default alter algorithm was used, instead of the "INSTANT". | |||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2023-04-17 ] | |||||||||||||||||||||||||
|
The alter algorithm is not enforced in applying, should be possible to fix by simply adding in the replicated query log event the algorithm specification : ALGORITHM [=] {DEFAULT|INPLACE|COPY|NOCOPY|INSTANT | |||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2023-04-19 ] | |||||||||||||||||||||||||
|
a pull request containing proposed fix for this is now in Codership side testing & review | |||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2023-05-10 ] | |||||||||||||||||||||||||
|
PR submitted in github, containing the fix and mtr test | |||||||||||||||||||||||||
| Comment by Julius Goryavsky [ 2023-05-11 ] | |||||||||||||||||||||||||
|
seppo submitted PR causes a regression in another test:
| |||||||||||||||||||||||||
| Comment by Julius Goryavsky [ 2023-12-26 ] | |||||||||||||||||||||||||
|
The proposed fix (https://github.com/MariaDB/server/pull/2629) works with its own tests, but breaks the first or second check block in galera_toi_ddl_nonconflicting.test - in the first check inside fragment, two lines mentioning the table "t1" are returned instead of three. Problems in this test fragment:
As a result we get:
| |||||||||||||||||||||||||
| Comment by Julius Goryavsky [ 2023-12-26 ] | |||||||||||||||||||||||||
|
seppo I returned the PR and this task to you for improvement - to eliminate regression in the test, see the detailed comment above. My additions based on review: https://github.com/MariaDB/server/commit/0612f7c92a7cc67664e24b02c0a35207557d1484 | |||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2024-01-09 ] | |||||||||||||||||||||||||
|
this issue is due to multi-statement used in the test, which causes trouble for TOI replication query rewriting. The test has two statements sent together: ALTER TABLE t1... ; INSERT INTO t1 16 Query ALTER TABLE t1 ADD COLUMN f3 INTEGER; INSERT INTO t1 (f1, f2) VALUES (DEFAULT, 123) I have a fix for this, but same issue may surface with other TOI replication also, or anywhere where THD::query() output is trusted. So, some more review is needed to sort this out. | |||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2024-01-10 ] | |||||||||||||||||||||||||
|
Added to the original PR a commit, which fixes the regression. | |||||||||||||||||||||||||
| Comment by Seppo Jaakola [ 2024-01-10 ] | |||||||||||||||||||||||||
|
PR has one more commit to fix the regression |