[MDEV-33178] Random slave replication error, others continues without error Created: 2024-01-04 Updated: 2024-01-22 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.11.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Cuchac | Assignee: | Roel Van de Paar |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | replication, slave | ||
| Environment: |
official docker images running in kubernetes cluster |
| Attachments: |
| Issue Links: |
| Description |
|
Hello, we have 1 master and 9 slaves. All slaves connect to the same master. All slaves were started from the same snapshot, through exactly the same process. Probably since the upgrade from 10.6 to 10.11, every few days a random slave fails with a replication error.
We use MIXED replication mode, no parallel replication, no wsrep. I believe it is a slave bug, because all the other slaves continue without problems. I am unable to reproduce this on purpose. Usually the error happens on the same database and the same table, but this is probably just the result of another bug in replication that happened earlier. I understand that this bug report does not contain all the required information, but I need your guidance to provide more relevant data. I can provide the binlog or relay log, for example. |
| Comments |
| Comment by Cuchac [ 2024-01-05 ] | |||||||||||||||||
|
Maybe this problem is connected to the "INSERT IGNORE" statements that are used to fill the problematic table. | |||||||||||||||||
| Comment by Roel Van de Paar [ 2024-01-05 ] | |||||||||||||||||
|
cuchac Hi! "I need your guidance to provide more relevant data" - sure, questions below:
- You mention in the original description that "all other slaves continues without problem". Does "all other slaves" mean slaves which have not been upgraded to 10.11 yet?
- Can you also clarify what the "...other bug in replication that happened earlier" was/is?
- Which tool are you using to visualize the killed "slave worker"? What tool or script is killing the connections, and why? Is Slave_worker the user used for replication?
- The original description mentions id_section_2, which is not visible in the screenshot; it would be helpful to see more information on the table definitions.
- Does the issue happen again after instantiating a new (temporary/test) slave?
- How was the original snapshot made, and what format is it in? | |||||||||||||||||
| Comment by Cuchac [ 2024-01-05 ] | |||||||||||||||||
|
Hello,
All slaves use the same version. All slaves were created at the same time from the same dump. We are using a kubernetes cluster and it is one stateful set with maxscale. "All other slaves" means that there are completely identical slaves running from the same dump; one stops with this error while all the others continue without any problem. The next day another slave stops, all the others continue ...
It was just my idea that the error "Duplicate entry '8-7' for key 'id_section_2'" could be only a consequence of some bug that happened earlier. There was no transaction skipping or anything similar. The slave replicas were stable for several years and only started to fail randomly in the last months.
It is PhpMyAdmin, Status / Processes page. Nobody is killing connections. There is no "Slave worker" user; those are the system workers from "slave_parallel_threads". It is not normally turned on (the error also happens with slave_parallel_threads=0), but I just tried slave_parallel_threads=8 and noticed this weird behavior that could be related to the problem.
It happens again after a new backup/restore. I have to do one after every failure, because otherwise I would run out of slaves after several days. I do the backup/restore using mariabackup. The same script for slave restore has been working for several years without any problem. Currently I think the main cause could be the "INSERT IGNORE" statement that we use to update the problematic table. I will try to rewrite the update queries to use "ON DUPLICATE KEY" and see if it helps. Is it possible that "INSERT IGNORE" is more risky for master/slave replication than "ON DUPLICATE KEY"? | |||||||||||||||||
| Comment by Roel Van de Paar [ 2024-01-05 ] | |||||||||||||||||
|
> I will try to rewrite update queries to use "ON DUPLICATE KEY" and see if it helps. That sounds like a good idea to start with. I also tried the table definitions provided, but received an error:
Could you check the same on your end? > Is it possible that "INSERT IGNORE" is more risky for master/slave replication than "ON DUPLICATE KEY"? Yes, INSERT IGNORE can be riskier than ON DUPLICATE KEY UPDATE in a replication setup for a few reasons: | |||||||||||||||||
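As an illustrative sketch of the two write patterns (the table name appears later in this ticket, but the column names here are assumptions, not the real schema):

```sql
-- Hypothetical example; column names are guesses, not the actual schema.
-- INSERT IGNORE silently drops the conflicting row on a duplicate key:
INSERT IGNORE INTO sections_relation (id_section, id_relation, sort)
VALUES (8, 7, 1);

-- ON DUPLICATE KEY UPDATE makes the conflict handling explicit, so the
-- row always ends up in a well-defined state on master and slave alike:
INSERT INTO sections_relation (id_section, id_relation, sort)
VALUES (8, 7, 1)
ON DUPLICATE KEY UPDATE sort = VALUES(sort);
```

Under MIXED binlog format, statements whose effect depends on the current table state can be logged and replayed differently, which is one reason the IGNORE variant is often considered the riskier of the two.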
| Comment by Cuchac [ 2024-01-06 ] | |||||||||||||||||
|
Hello, > I also tried the table definitions provided, but received an error: That's because of foreign keys to other tables. You can ignore the foreign keys. Is it normal, then, that using "INSERT IGNORE" can break replication? Or is it still a bug that "INSERT IGNORE" sometimes breaks master/slave replication? I will let you know when I manage to change the queries. It will take some days to change and test. | |||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-06 ] | |||||||||||||||||
|
cuchac It is normal with optimistic parallel replication to see worker threads in the "killed" state. This happens when there is a conflict between two transactions replicated in parallel; one is killed and retried later. INSERT IGNORE should not break replication in mixed mode, so that would be a bug. What would be really useful is to have the binlog (or relay log) around the point where it fails, together with the exact messages from the error log at the time of failure. Ideally, include events from the binlog sufficiently far back that both the transaction that caused the duplicate row and the failing transaction can be seen. Hopefully this can give some clue as to why the problem occurs. | |||||||||||||||||
| Comment by Cuchac [ 2024-01-07 ] | |||||||||||||||||
|
OK, I will try to get the binlog when the problem occurs again.
We have very large binlog files and I did not find any quick way to "copy" some filtered events to a separate raw binlog file. | |||||||||||||||||
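For reference, a couple of ways to inspect just a window of events without processing the whole file (file names and positions below are placeholders, not the actual offsets from this incident):

```sql
-- Inspect a slice of a binlog from the server itself
-- (file name and positions are placeholders):
SHOW BINLOG EVENTS IN 'mysql-bin.000123' FROM 4 LIMIT 200;

-- The same works for a slave's relay log:
SHOW RELAYLOG EVENTS IN 'relay-bin.000045' FROM 4 LIMIT 200;

-- From the shell, mysqlbinlog can dump only a byte range of one file, e.g.:
--   mysqlbinlog --start-position=<pos1> --stop-position=<pos2> mysql-bin.000123
```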
| Comment by Cuchac [ 2024-01-10 ] | |||||||||||||||||
|
Problem appeared again.
If I read it correctly, just before the problematic insert (GTID 2-100-8777986841) there was a DELETE statement (GTID 2-100-8777986839) that should have deleted the duplicate row. But the row was not deleted and is still present in the table - I verified it using a SELECT on the problematic node. I compared the slave relay log and the master binlog and they are the same.
That is correct, but in the slave's binlog I can see that it deleted
some unrelated row. I also checked the other slaves' binlogs and they correctly deleted the same row as the master. What do you think about this? | |||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-10 ] | |||||||||||||||||
|
Thanks a lot cuchac for the detailed additional information and nice analysis. I agree with your analysis. The slave has the correct delete event in its relay log to delete row (8,7,1). But somehow it instead deletes a completely different row, (440,7,17), according to what is in the binlog. This seems to be the real problem; the duplicate key error two transactions later is just a symptom. The mystery is how the DELETE ends up deleting the completely wrong row?!?
I looked at the table definition. The only special thing I see is that the table does not have a primary key; it has a unique key with a nullable second part. But that doesn't really seem able to explain why the wrong row would be deleted; there are no NULLs involved, it seems. It seems a very fundamental issue if the matching of a replication row event to table data is wrong. Not sure at this point how this could turn up only very occasionally on a single slave, unfortunately.
The only thing I can think of is to analyze a couple of other instances of this, to learn if there is any pattern. Is it always the same table? Same number of rows deleted/inserted? Similar values? Any specific slave server that tends to get this error, or is it random across all slaves? Maybe you already have a bit of data on this from previous failures? I noticed that both the original report and the last analyzed instance get a duplicate on the key "8-7". If it's always that key, then that's an interesting data point - or just a random coincidence? Again, thanks for your efforts in tracking this down. - Kristian. | |||||||||||||||||
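For orientation, the table shape being described (no primary key, a two-column unique key whose second column is nullable) would look roughly like this; the column names and types are guesses, the real definition is in the attached maria_slave_bug_slave_status.txt:

```sql
-- Hypothetical reconstruction of the table shape discussed above.
-- Column names/types are assumptions; the real DDL is in the attachment.
CREATE TABLE sections_relation (
  id_section  INT NOT NULL,
  id_relation INT NULL,       -- nullable second part of the unique key
  sort        INT NOT NULL,
  UNIQUE KEY id_section_2 (id_section, id_relation)
  -- no PRIMARY KEY: InnoDB falls back to a hidden clustered index,
  -- so the slave must locate rows via the unique key or a scan
) ENGINE=InnoDB;
```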
| Comment by Cuchac [ 2024-01-11 ] | |||||||||||||||||
|
Hello, the error happens on a random slave node. I did not record which nodes suffered the issue in the past, but I think all of them have. Do you think there is some way to compare the binary and relay logs? This bug should be discoverable whenever the relay log differs from the binary log. Or can some other valid actions cause differences between the binary and relay log? | |||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-11 ] | |||||||||||||||||
|
Yes, agree - scary bug when a row event seemingly deletes the wrong row. There's some special code for handling row-event replication for tables with no primary key. Let's see what we know:
1. From the mysqldump of the relaylog we know that the delete_row event …
2. The table has no primary key, so the code will select one of the two …
3. Rows_log_event::find_row() is called to find the row to delete. Since …
4a. If the unique index was selected, then the code will check if there are …
4b. If the non-unique index was selected, each row that matches the index …
5. The found row is deleted from the table and logged to the slave's binlog.
I notice that in the case (4a), the first row returned from the lookup in … In the case (4b), it is plausible that the row (440,7,17) could be returned, …
Other remarks: It's curious that it's often the same row that triggers it, though it could … I noticed that you wrote "All slaves started from same snapshot". Is this a …
Unfortunately I can't think of an easy way to compare the relaylog to the binlog. | |||||||||||||||||
| Comment by Cuchac [ 2024-01-12 ] | |||||||||||||||||
|
Hello, the error happened again on a different node, with the same results:
All the slaves are bootstrapped in the same way, but from different mariabackup snapshots. When one slave breaks, I take a mariabackup of a random healthy slave and use that new snapshot, to avoid a long resync time. A few days ago we experienced "General error: 1712 Index .... is corrupted". It had never happened before, and now it has happened for the 3rd time in the last 2 months .... what if the real cause is a corrupted index that causes the wrong row deletion, and that in turn causes the duplicate row error? A very interesting chain of events, but from my current point of view it seems quite possible. | |||||||||||||||||
| Comment by Cuchac [ 2024-01-12 ] | |||||||||||||||||
|
What if, for example, https://jira.mariadb.org/browse/MDEV-30531 caused corruption of the indexes on these tables, and all of these problems are caused by that index corruption? When I use mariabackup to bootstrap a slave, I just copy the corrupted index along with it, and it will cause the same error again .... | |||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-12 ] | |||||||||||||||||
|
Agree. That you started to see index corruption problems, and more or less simultaneously started to see replication problems that could be caused by wrong index lookups, makes it more plausible that the root cause is index corruption. marko, do you know of any bug that would cause InnoDB index corruption like this? Summary: There is an InnoDB table with a two-column unique index (nullable second column, no primary key). The full table structure is in the attachment maria_slave_bug_slave_status.txt. A replication row event applies incorrectly: it finds and deletes the wrong row. The error is consistent with the index lookup in the unique index returning the wrong row. The problem has occurred multiple times affecting a specific row, but does not appear to happen consistently. The server also occasionally reports corrupt index errors. I remember seeing discussions about index corruption bugs in InnoDB, but I'm not sure which of them would be consistent with this observed replication behaviour, if any? | |||||||||||||||||
| Comment by Cuchac [ 2024-01-15 ] | |||||||||||||||||
|
Hello, | |||||||||||||||||
| Comment by Kristian Nielsen [ 2024-01-15 ] | |||||||||||||||||
|
Ok, hope that works out for you so you can get rid of those failures. Here's a comment from Marko (InnoDB developer) about a few recent bugs that could cause wrong index results: "Yes, corrupted indexes may result in wrong results. Apart from | |||||||||||||||||
| Comment by Cuchac [ 2024-01-15 ] | |||||||||||||||||
|
Hello, one slave experienced the same error again. I'm pretty sure that I ran OPTIMIZE TABLE on this slave. This time the row with the duplicate error was different, BUT the actually deleted row was the same!
That is quite interesting, isn't it? Is there any way to check whether the index is corrupted? Or some way to find out if, for example, row (440, 7, 17) is the first entry in the index? It is weird that two different index lookups return the same row. It seems the returned row could be a clue to the problem. | |||||||||||||||||
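One way to have the server validate the table, including its secondary indexes, is CHECK TABLE; a sketch (the EXTENDED variant is slower but more thorough, and the index name below comes from the duplicate-key error message):

```sql
-- Ask InnoDB to validate the table, including secondary indexes:
CHECK TABLE sections_relation EXTENDED;

-- A quick consistency probe: counts via different access paths should
-- agree if the indexes are sound:
SELECT COUNT(*) FROM sections_relation FORCE INDEX (id_section_2);
SELECT COUNT(*) FROM sections_relation IGNORE INDEX (id_section_2);
```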
| Comment by Cuchac [ 2024-01-16 ] | |||||||||||||||||
|
I can send you the raw file of the problematic table. Will that help? There is no sensitive information in it, so it should be no problem. You can then look at the index and see whether it is corrupted, or how row (440, 7, 17) can get selected. | |||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-17 ] | |||||||||||||||||
|
cuchac, OPTIMIZE TABLE will not rebuild an InnoDB table if the parameters innodb_defragment or innodb_optimize_fulltext_only are enabled. There is a hard-to-reproduce bug, MDEV-33189 or MDEV-33208, for which I have a hypothesis. While those cases involve undo log pages, I think that the bug should affect any pages that are located in the buffer pool. If my hypothesis is correct, this corruption would be prevented by setting innodb_adaptive_flushing=OFF in the server configuration. Can you try that? | |||||||||||||||||
| Comment by Cuchac [ 2024-01-17 ] | |||||||||||||||||
|
Hi Marko, both innodb_defragment and innodb_optimize_fulltext_only are OFF, so the rebuild should have happened, and the corrupted-index theory seems less probable now. Maybe the crash I experienced in MDEV-33260 has the same source as this bug - memory corruption. I will try to reproduce MDEV-33260 and get a decent stack trace. Here we are quite blind, it seems. The only idea I have is to issue the same SELECT as the DELETE (SELECT * FROM sections_relation WHERE id_section = '12'), force it to use the suspect index without checking the values another way, run it every second, and wait for the moment it returns the invalid row that gets deleted, (440, 7, 17), instead of the correct output. What do you think? That would find the exact timestamp when the database starts to return an invalid row for `WHERE id_section = '12'`. | |||||||||||||||||
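A sketch of that polling idea (the index name comes from the duplicate-key error; the event-scheduler part is just one possible way to run it every second, and poll_log is a hypothetical logging table):

```sql
-- Force the lookup through the suspect unique index:
SELECT * FROM sections_relation FORCE INDEX (id_section_2)
WHERE id_section = '12';

-- One possible way to run it periodically on the server itself
-- (requires event_scheduler=ON; poll_log is a made-up table with a
--  timestamp column followed by the columns of sections_relation):
CREATE EVENT IF NOT EXISTS poll_sections_relation
ON SCHEDULE EVERY 1 SECOND DO
  INSERT INTO poll_log
  SELECT NOW(), r.*
  FROM sections_relation r FORCE INDEX (id_section_2)
  WHERE r.id_section = '12';
```

Comparing the logged rows over time would pinpoint when the index starts returning the wrong row.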
| Comment by Marko Mäkelä [ 2024-01-17 ] | |||||||||||||||||
|
While it is possible that some code outside InnoDB could cause corruption of InnoDB data structures, I would really like you to try innodb_adaptive_flushing=OFF as well. As far as InnoDB is concerned, I think that it is a rather low-hanging fruit with unpredictable and potentially devastating consequences. | |||||||||||||||||
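The variable is dynamic, so it can be changed without a restart and then persisted in the configuration file:

```sql
-- Disable adaptive flushing at runtime:
SET GLOBAL innodb_adaptive_flushing=OFF;

-- To make it persistent across restarts, add to my.cnf:
--   [mysqld]
--   innodb_adaptive_flushing=OFF
```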
| Comment by Cuchac [ 2024-01-17 ] | |||||||||||||||||
|
OK, I turned off adaptive flushing. Will see. | |||||||||||||||||
| Comment by Cuchac [ 2024-01-22 ] | |||||||||||||||||
|
Hello, 5 days without the problem. So far so good. innodb_adaptive_flushing=OFF seems to have fixed the issue. |