[MDEV-33178] Random slave replication error, others continues without error Created: 2024-01-04  Updated: 2024-01-22

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.11.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Cuchac Assignee: Roel Van de Paar
Resolution: Unresolved Votes: 0
Labels: replication, slave
Environment:

official docker images running in kubernetes cluster


Attachments: Screenshot_20240105_112524.png (PNG), maria_slave_bug_master_binlog.txt, maria_slave_bug_slave_binlog.txt, maria_slave_bug_slave_relaylog.txt, maria_slave_bug_slave_status.txt
Issue Links:
Relates
relates to MDEV-30781 On duplicate key update makes duplica... Closed
relates to MDEV-33275 buf_flush_LRU(): mysql_mutex_assert_o... Closed

 Description   

Hello,

we have 1 master and 9 slaves. All slaves replicate from the same master. All slaves were started from the same snapshot, by exactly the same process. Probably since the upgrade from 10.6 to 10.11, every few days a random slave fails with

Could not execute Write_rows_v1 event on table ??????.??????; Duplicate entry '8-7' for key 'id_section_2', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysqld-bin.001693, end_log_pos 899067800

We use MIXED replication mode, no parallel replication, no wsrep.

I believe it is a slave bug, because all the other slaves continue without problems. I'm unable to reproduce this on purpose.

Usually the error happens on the same database, same table, but probably this is just the result of another bug in replication that happened earlier.

I understand that this bug report does not contain all the required information, but I need your guidance to provide more relevant data. I can provide the binlog or relay log, for example.



 Comments   
Comment by Cuchac [ 2024-01-05 ]

Maybe this problem has some connection with the "INSERT IGNORE" statements that are used to fill the problematic table.


The attached screenshot shows some weird "killed" SQL workers and stalls in replication with parallel replication turned on.

Comment by Roel Van de Paar [ 2024-01-05 ]

cuchac Hi! "I need your guidance to provide more relevant data" - sure, questions below:

You mention in the original description that "all other slaves continues without problem". Does "all other slaves" mean slaves which have not been upgraded to 10.11 yet?

Can you please also clarify what "...other bug in replication that happened earlier" was/is?

Which tool are you using to visualize the killed "slave worker"? What tool or script is killing the connections, and why? Is the Slave_worker the user used for replication?

The original description mentions id_section_2, which is not visible in the screenshot; it would be helpful to see more information on the table definitions.

Does the issue happen again after instantiating a new (temporary/test) slave? How was the original snapshot made/what format is it in?

Comment by Cuchac [ 2024-01-05 ]

Hello,
thanks for your reply. Here are the answers:

You mention in the original description that "all other slaves continues without problem". Does "all other slaves" mean slaves which have not been upgraded to 10.11 yet?

All slaves use the same version. All slaves were created at the same time from the same dump. We are using a kubernetes cluster, and it is one stateful set with maxscale. "All other slaves" means that completely identical slaves from the same dump are running: one stops with this error while all the others continue without any problem. The next day another slave stops, all the others continue ...

Can you please also clarify what "...other bug in replication that happened earlier" was/is?

It was just my idea that the error "Duplicate entry '8-7' for key 'id_section_2'" could be only a consequence of some bug that happened earlier. There was no transaction skipping or similar things. Slave replication had been stable for several years, and it started to fail randomly in the last months.

Which tool are you using to visualize the killed "slave worker"? What tool or script is killing the connections, and why? Is the Slave_worker the user used for replication?

It is PhpMyAdmin, Status / Processes page. Nobody is killing connections. There is no "Slave worker" user; they are the system workers from "slave_parallel_threads". It is not normally turned on (the error also happens when slave_parallel_threads=0), but I just tried slave_parallel_threads=8 and noticed this weird behavior that could be related to the problem.

The original description mentions id_section_2, which is not visible in the screenshot, it would be helpful to see more information on tables definitions.

CREATE TABLE `products_in_sections` (
  `id_product` int(11) NOT NULL DEFAULT 0,
  `id_section` int(11) NOT NULL DEFAULT 0,
  `figure` enum('Y','N') NOT NULL DEFAULT 'Y',
  `generated` tinyint(4) NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci ROW_FORMAT=DYNAMIC;
 
ALTER TABLE `products_in_sections`
  ADD PRIMARY KEY (`id_product`,`id_section`),
  ADD KEY `id_section` (`id_section`);
 
ALTER TABLE `products_in_sections`
  ADD CONSTRAINT `products_in_sections_ibfk_1` FOREIGN KEY (`id_product`) REFERENCES `products` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
  ADD CONSTRAINT `products_in_sections_ibfk_2` FOREIGN KEY (`id_section`) REFERENCES `sections` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;

Does the issue happen again after instantiating a new (temporary/test) slave? How was the orginal snapshot made/what format is it in?

It happens again after a new backup/restore. I have to do one after every failure, because otherwise I would run out of slaves after several days.

I do the backup/restore using mariabackup. The same slave-restore script had been working for several years without any problem.


Currently I think the main cause could be the "INSERT IGNORE" statement that we use to update the problematic table. I will try to rewrite the update queries to use "ON DUPLICATE KEY" and see if it helps. Is it possible that "INSERT IGNORE" is riskier for master/slave replication than "ON DUPLICATE KEY"?

Comment by Roel Van de Paar [ 2024-01-05 ]

> I will try to rewrite update queries to use "ON DUPLICATE KEY" and see if it helps.

That sounds like a good idea to start with. I also tried the table definitions provided, but received an error:

10.11.7 09049fe496eea1c19cd3ce80a788fa4b75d9609e (Optimized)

10.11.7-opt>ALTER TABLE `products_in_sections`
    ->   ADD CONSTRAINT `products_in_sections_ibfk_1` FOREIGN KEY (`id_product`) REFERENCES `products` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
    ->   ADD CONSTRAINT `products_in_sections_ibfk_2` FOREIGN KEY (`id_section`) REFERENCES `sections` (`id`) ON DELETE CASCADE ON UPDATE CASCADE;
ERROR 1005 (HY000): Can't create table `test`.`products_in_sections` (errno: 150 "Foreign key constraint is incorrectly formed")

Could you check the same on your end?

> Is it possible that "INSERT IGNORE" is more risky for master/slave replication than "ON DUPLICATE KEY"?

Yes, INSERT IGNORE can be riskier than ON DUPLICATE KEY UPDATE in a replication setup for a few reasons:
1) Silent failures: INSERT IGNORE silently (warning only) skips rows that cause duplicate key conflicts or other errors, potentially leading to data inconsistencies between the master and slaves.
2) Error Handling: "ON DUPLICATE KEY UPDATE" is more explicit in handling conflicts. It updates the conflicting rows with new values, ruling out any inconsistency between the master and slaves.
3) Idempotency: "ON DUPLICATE KEY UPDATE" is idempotent; applying the same statement multiple times will result in the same state. This is beneficial for replication and recovery processes. I suspect "INSERT IGNORE" might not be idempotent if the "ignored error" is not due to a duplicate key but perhaps some other issue like constraint errors.
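As an illustration of the difference (a sketch only, using the `products_in_sections` table defined above; the application's real queries are not shown in this report):

```sql
-- Silently skips the conflicting row (warning only):
INSERT IGNORE INTO products_in_sections (id_product, id_section, figure)
VALUES (8, 7, 'Y');

-- Explicitly resolves the conflict by updating the existing row;
-- re-applying the statement yields the same final state (idempotent):
INSERT INTO products_in_sections (id_product, id_section, figure)
VALUES (8, 7, 'Y')
ON DUPLICATE KEY UPDATE figure = VALUES(figure);
```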

Comment by Cuchac [ 2024-01-06 ]

Hello,

> I also tried the table definitions provided, but received an error:

That's because of foreign keys to other tables. You can ignore the foreign keys.

Is it normal, then, that using "INSERT IGNORE" can break replication? Or is it still a bug that "INSERT IGNORE" sometimes breaks master/slave replication?

I will let you know when I manage to change the queries. It will take some days to change and test.

Comment by Kristian Nielsen [ 2024-01-06 ]

cuchac It is normal with optimistic parallel replication to see worker threads in "killed" state. This happens when there is a conflict between two transactions replicated in parallel, one is killed and retried later.

INSERT IGNORE should not break replication in mixed mode, that should be a bug.
You mention that you get the error also without using parallel replication, but only on one of many slaves. That suggests that there is some non-determinism happening somewhere in the query execution. Not sure how INSERT IGNORE could cause this though, errors in the INSERT would seem to be deterministic, and the handling (skip inserting the row) is deterministic as well. And if the problem was incorrect row locking on the master, we would expect to see the duplicate key error on all slaves.

What would be really useful is to have the binlog (or relay log) around the point where it fails, together with the exact messages from the error log at the time of failure. Ideally have the events from the binlog sufficiently far back that the transaction that caused the duplicate row, as well as the failing transaction, can be seen. Hopefully this can give some clue as to why the problem occurs.

Comment by Cuchac [ 2024-01-07 ]

OK, I will try to get the binlog when the problem occurs again.
Is it OK to send mysqlbinlog output filtered to just the one database and table where the problem happened? For example using the following command?

mysqlbinlog -v --base64-output=decode-rows -d database -T table

We have very large binlog files and I did not find any quick way to "copy" some filtered events into a separate raw binlog file.

Comment by Cuchac [ 2024-01-10 ]

Problem appeared again.
I've attached following files:

  • maria_slave_bug_slave_status.txt - "SHOW SLAVE STATUS" and table structure
  • maria_slave_bug_master_binlog.txt - master binary log from table where error happened
  • maria_slave_bug_slave_binlog.txt - binlog from problematic slave
  • maria_slave_bug_slave_relaylog.txt - relaylog from problematic slave

If I read it correctly, just before the problematic insert instruction (GTID 2-100-8777986841) there was a DELETE statement (GTID 2-100-8777986839) that should have deleted the duplicate row. But the row was not deleted and is still present in the table; I verified it using a SELECT on the problematic node.

I compared the slave relay log and the master binlog and they are the same.
I compared the master binlog and the slave binlog and they differ! That is very weird. To me it seems that the slave deleted the wrong row.
Transaction 2-100-8777986839 on master deleted:

### DELETE FROM `wpjshop_prod_oveckarna_db`.`sections_relation`
### WHERE
###   @1=8
###   @2=7
###   @3=1

which is correct, but in the slave binlog I can see

### DELETE FROM `wpjshop_prod_oveckarna_db`.`sections_relation`
### WHERE
###   @1=440
###   @2=7
###   @3=17

some unrelated row.

I also checked the other slaves' binlogs, and they correctly deleted the same row as the master.

What do you think about this?

Comment by Kristian Nielsen [ 2024-01-10 ]

Thanks a lot cuchac for the detailed additional information and nice analysis.

I agree with your analysis. The slave has the correct delete event in its relaylog to delete row (8,7,1). But somehow instead it deletes a completely different row (440,7,17), according to what is in the binlog. This seems to be the real problem, the duplicate key error two transactions later is just a symptom.

The mystery is how the DELETE ends up deleting the completely wrong row?!?

I looked at the table definition. The only special thing I see is that the table does not have a primary key; it has a unique key with a nullable second column. But that doesn't really seem able to explain why the wrong row would be deleted; there are no NULLs involved, it seems.

It seems a very fundamental issue if the matching of replication row event to table data is wrong. Not sure at this point how this could turn up only very occasionally on a single slave, unfortunately.

The only thing I can think of is to analyze a couple more instances of this, to learn if there is any pattern. Is it always the same table? The same number of rows deleted/inserted? Similar values? Is there any specific slave server that tends to get this error, or is it random across all slaves? Maybe you already have a bit of data on this from previous failures?

I noticed that both the original report and the last analyzed instance get a duplicate on the key "8-7". If it's always that key, then that's an interesting data point, or just a random coincidence?

Again, thanks for your efforts in tracking this down.

- Kristian.

Comment by Cuchac [ 2024-01-11 ]

Hello,
I was hoping you would find some error in my analysis, because otherwise this seems like a "freaking scary" bug :)

The error happens on a random slave node. I did not record which nodes suffered the issue in the past, but I think all of them have.
It is mostly the same table and the same database, but I think I have seen the same table in a different database. These SQL queries are the result of an e-shop synchronization that deletes all relations of the synchronized section and adds new ones (mostly the same). The same synchronizations happen on more e-shops/databases.
If I remember correctly, it is regularly key "8-7", but I'm not sure that other pairs did not happen, nor what the problematic pair was in another database.

Do you think there is some way to compare the binary and relay logs? This bug should be discoverable whenever the relay log differs from the binary log. Or can some other valid actions cause differences between the binary and relay logs?

Comment by Kristian Nielsen [ 2024-01-11 ]

Yes, agree, scary bug when a row event seemingly is deleting the wrong row.
There's probably something else going on too, given that this happens so
infrequently and seemingly randomly; the tricky part is figuring out what
that "else" can be...

There's some special code for handling row-event replication for tables with
triggers. What is your value of --slave-run-triggers-for-rbr? I assume it
is "NO" and thus should not be relevant, but checking just in case...

Let's see what we know:

1. From the mysqldump of the relaylog we know that the delete_row event
references the row (8,7,1).

2. The table has no primary key, so the code will select one of the two
available indexes based on their estimated selectivity
(key->rec_per_key[last_part] in Rows_log_event::find_key()). Not 100%
sure but I would guess the unique index on (id_section,id_topsection) would
be chosen.

3. Rows_log_event::find_row() is called to find the row to delete. Since
the table has no primary key, the code will instead use the index selected
in (2) and do an index lookup using values from the row event.

4a. If the unique index was selected, then the code will check if there are
any NULL values in the key. In the row (8,7,1) there are not, so the row
will be selected without any further checks.

4b. If the non-unique index was selected, each row that matches the index
lookup is compared with the target row until found.

5. The found row is deleted from the table and logged to the slave's binlog

I notice that in the case (4a), the first row returned from the lookup in
the unique index will be used without any further checks. Thus, if InnoDB
would somehow return the row (440,7,17) from the lookup of (8,7,1), that
could result in the observed behaviour. That still does not explain why the
index lookup would misfunction in the first place.

In the case (4b), it is plausible that the row (440,7,17) could be returned,
as it matches on the indexed column. However, the code in find_key() skips
rows that do not match, which would cause this row not to be selected;
also it seems somewhat unlikely that the non-unique index would be selected
over the unique one, which contains an extra column.
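The index-selection and lookup behaviour in steps 2-4 above can be sketched in Python (a simplified, illustrative model only, not the actual server code; the `Index` class and `find_row` function are invented for this example):

```python
# Simplified model of how a slave applies a Delete_rows event to a table
# without a primary key. Illustrative sketch; not the MariaDB source.

class Index:
    def __init__(self, columns, unique, rows):
        self.columns = columns   # tuple of column positions forming the key
        self.unique = unique
        self.rows = rows         # all table rows, as tuples

    def lookup_all(self, key_values):
        """Return every row whose key columns match key_values."""
        return [r for r in self.rows
                if tuple(r[c] for c in self.columns) == key_values]


def find_row(index, event_row):
    """Locate the row a Delete_rows event refers to (steps 3-4)."""
    key_values = tuple(event_row[c] for c in index.columns)
    matches = index.lookup_all(key_values)
    if index.unique and None not in key_values:
        # Step 4a: unique key, no NULLs -> trust the first hit WITHOUT
        # comparing the remaining columns. A corrupted index returning
        # (440, 7, 17) for the key of (8, 7, 1) would go undetected here.
        return matches[0] if matches else None
    # Step 4b: non-unique key -> compare the full row image until found.
    for row in matches:
        if row == event_row:
            return row
    return None


rows = [(8, 7, 1), (440, 7, 17)]
# Unique two-column index (as on sections_relation), columns 0 and 1:
unique_idx = Index(columns=(0, 1), unique=True, rows=rows)
print(find_row(unique_idx, (8, 7, 1)))  # prints (8, 7, 1) on a healthy index
```

On a healthy index both paths find the correct row; the point of the sketch is that only the 4b path re-checks the full row, which is why a wrong result from the unique-index lookup would be applied unquestioned.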

Other remarks:

It's curious that it's often the same row that triggers it, though it could
be coincidence.

I noticed that you wrote "All slaves started from same snapshot". Is this a
raw copy of InnoDB tablespaces, like from xtrabackup/mariabackup or from
filecopy of the data directory? I'm wondering if there could be some wierd
table/index corruption in InnoDB for the affected row/table that
occasionally causes the wrong row to be returned? Re-generating the table
might be able to test this, though it's a long shot perhaps.

Unfortunately I can't think of an easy way to compare relaylog to binlog.
They need not be exactly identical, though of course they should be
"equivalent", meaning they will replicate the same way (which is obviously
not the case here when you experience the bug).
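One rough way to approximate such a comparison (an illustrative sketch; real `mysqlbinlog -v` output has more line types than are handled here, and positions/timestamps legitimately differ between binlog and relay log) is to strip the volatile lines from the decoded output and diff what remains:

```python
# Sketch: compare decoded binlog vs relaylog by dropping lines that
# legitimately differ (positions, timestamps) and diffing the remaining
# row-event content. Illustrative only; not a complete parser.
import difflib


def normalize(decoded_log: str) -> list[str]:
    keep = []
    for line in decoded_log.splitlines():
        line = line.strip()
        # '# at N' and '#240111 ...' lines carry positions/timestamps
        # that differ even between equivalent logs.
        if line.startswith("# at") or line.startswith("#2"):
            continue
        # Keep the decoded row images and annotated queries.
        if line.startswith("### ") or line.startswith("#Q>"):
            keep.append(line)
    return keep


def diff_logs(master_decoded: str, slave_decoded: str) -> list[str]:
    return list(difflib.unified_diff(normalize(master_decoded),
                                     normalize(slave_decoded),
                                     lineterm=""))


master = "# at 100\n### DELETE FROM `db`.`t`\n### WHERE\n###   @1=8\n"
slave = "# at 999\n### DELETE FROM `db`.`t`\n### WHERE\n###   @1=440\n"
print("\n".join(diff_logs(master, slave)))  # shows the @1=8 / @1=440 mismatch
```

This would only catch divergence after decoding both logs with the same `mysqlbinlog` options; it cannot prove equivalence, only flag suspicious differences in the row images.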

Comment by Cuchac [ 2024-01-12 ]

Hello,
thank you for your detailed explanation.

Error happened again on different node, with same results:

#240111 14:51:11 server id 100  end_log_pos 1044601736 CRC32 0x5b469095         Annotate_rows:
#Q> DELETE FROM sections_relation WHERE  id_section = '8'
#240111 14:51:11 server id 100  end_log_pos 1044601819 CRC32 0xc8e0204a         Table_map: `wpjshop_prod_oveckarna_db`.`sections_relation` mapped to number 2652 (has triggers)
# at 1044601819
#240111 14:51:11 server id 100  end_log_pos 1044601865 CRC32 0x9ce89a21         Delete_rows: table id 2652 flags: STMT_END_F
 
BINLOG '
XwCgZRNkAAAAUwAAANtbQz4AAFwKAAAAAAFAGXdwanNob3BfcHJvZF9vdmVja2FybmFfZGIAEXNl
Y3Rpb25zX3JlbGF0aW9uAAMDAwMAAkog4Mg=
XwCgZRlkAAAALgAAAAlcQz4AAFwKAAAAAAEAA//4uAEAAAcAAAARAAAAIZronA==
'/*!*/;
### DELETE FROM `wpjshop_prod_oveckarna_db`.`sections_relation`
### WHERE
###   @1=440
###   @2=7
###   @3=17
# Number of rows: 1

All the slaves are bootstrapped in the same way, but from different mariabackup snapshots. When one slave breaks, I make a fresh mariabackup of a random slave and use that new snapshot to avoid a long resync time.

A few days ago we experienced "General error: 1712 Index .... is corrupted". It had never happened before, and it has now happened for the 3rd time in the last 2 months .... What if the real cause is a corrupted index that causes the wrong row deletion, which in turn causes the duplicate row error? A very interesting chain of events, but from my current point of view it seems quite possible.

Comment by Cuchac [ 2024-01-12 ]

What if, for example, https://jira.mariadb.org/browse/MDEV-30531 caused corruption of the indexes on these tables, and all of these problems are caused by this index corruption? And when I use mariabackup to bootstrap a slave, I just copy the corrupted index with it, and it causes the same error again ....

Comment by Kristian Nielsen [ 2024-01-12 ]

Agree. That you started to see index corruption problems, and more or less simultaneously started to see replication problems that could be caused by wrong index lookups, makes it more plausible that the root cause is the index corruption.

marko, do you know of any bug that would cause InnoDB index corruption like this?

Summary: There is an InnoDB table with a two-column unique index (nullable second column, no primary key). The full table structure is in attachment maria_slave_bug_slave_status.txt. A replication row event applies incorrectly: it finds and deletes the wrong row. The error is consistent with the lookup in the unique index returning the wrong row. The problem has occurred multiple times affecting a specific row, but does not appear to happen consistently. The server also occasionally reports corrupt index errors.

I remember seeing discussions about index corruption bugs in innodb, but not sure which of them would be consistent with this observed replication behaviour, if any?

Comment by Cuchac [ 2024-01-15 ]

Hello,
I've run OPTIMIZE TABLE on all tables/all nodes to fix possible problems from previous versions. So now if my theory is correct, I should not see this error again. Fingers crossed.

Comment by Kristian Nielsen [ 2024-01-15 ]

Ok, hope that works out for you so you can get rid of those failures.

Here's a comment from Marko (InnoDB developer) about a few recent bugs that could cause wrong index results:

"Yes, corrupted indexes may result in wrong results. Apart from MDEV-30531 (which requires FOREIGN KEY), a good candidate for corrupting secondary indexes is the change buffer. The latest fixed bug in that area is MDEV-32132 which was originally fixed as part of MDEV-30009."

Comment by Cuchac [ 2024-01-15 ]

Hello,
bad news

One slave experienced the same error again. I'm pretty sure that I ran OPTIMIZE TABLE on this slave.

This time the row with the duplicate error was different, BUT the actually deleted row was the same!

#240115 22:20:21 server id 102  end_log_pos 379665068 CRC32 0x73030264  Annotate_rows:
#Q> DELETE FROM sections_relation WHERE  id_section = '12'
#240115 22:20:21 server id 102  end_log_pos 379665151 CRC32 0xdf002b7f  Table_map: `wpjshop_prod_oveckarna_db`.`sections_relation` mapped to number 166437 (has triggers)
# at 379665151
#240115 22:20:21 server id 102  end_log_pos 379665197 CRC32 0xf101c7db  Delete_rows: table id 166437 flags: STMT_END_F
 
BINLOG '
pa+lZRNmAAAAUwAAAP86oRYAACWKAgAAAAFAGXdwanNob3BfcHJvZF9vdmVja2FybmFfZGIAEXNl
Y3Rpb25zX3JlbGF0aW9uAAMDAwMAAn8rAN8=
pa+lZRlmAAAALgAAAC07oRYAACWKAgAAAAEAA//4uAEAAAcAAAARAAAA28cB8Q==
'/*!*/;
### DELETE FROM `wpjshop_prod_oveckarna_db`.`sections_relation`
### WHERE
###   @1=440
###   @2=7
###   @3=17
# Number of rows: 1

That is quite interesting, isn't it? Is there any way to check whether the index is corrupted? Or some way to find out whether, for example, row (440, 7, 17) is the first entry in the index? It is weird that two different index lookups return the same row. It seems the returned row could be a clue to the problem.
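One candidate for checking index consistency (a sketch; CHECK TABLE ... EXTENDED for InnoDB cross-checks index records against the table rows in recent MariaDB releases, but can be expensive on large tables):

```sql
-- Cross-check secondary-index entries against the table rows
-- on the problematic slave.
CHECK TABLE wpjshop_prod_oveckarna_db.sections_relation EXTENDED;
```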

Comment by Cuchac [ 2024-01-16 ]

I can send you the raw file of the problematic table. Will that help? There is no sensitive information in it, so it should be no problem. You can then look at the index and see whether it is corrupted, or how row (440, 7, 17) can get selected.

Comment by Marko Mäkelä [ 2024-01-17 ]

cuchac, OPTIMIZE TABLE will not rebuild an InnoDB table if the parameters innodb_defragment (MDEV-30545) or innodb_optimize_fulltext_only are set. ALTER TABLE…FORCE would. But I do not think that this is the issue.

There is a hard-to-reproduce bug, MDEV-33189 or MDEV-33208, for which I have a hypothesis. While those cases involve undo log pages, I think that the bug could affect any pages located in the buffer pool. If my hypothesis is correct, this corruption would be prevented by setting innodb_adaptive_flushing=OFF in the server configuration. Can you try that?
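For reference, innodb_adaptive_flushing is a dynamic variable, so the suggestion can be tried without a restart (a sketch; persist it in the server config as well if the test is to survive restarts):

```sql
-- Test the hypothesis without restarting the server:
SET GLOBAL innodb_adaptive_flushing = OFF;
-- To persist across restarts, also add to the server configuration:
--   [mariadb]
--   innodb_adaptive_flushing = OFF
```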

Comment by Cuchac [ 2024-01-17 ]

Hi Marko,

both innodb_defragment and innodb_optimize_fulltext_only are OFF, so the rebuild should have happened, and the corrupted-index theory now seems less probable.

Maybe the crash I experienced in MDEV-33260 has the same source as this bug: memory corruption. I will try to reproduce MDEV-33260 and get a decent stack trace. We are quite blind here, it seems.

The only idea I have is to issue the same SELECT instead of the DELETE (SELECT * FROM sections_relation WHERE id_section = '12'), forcing the index that does not check the full row values, issue it every second, and wait for the moment it returns the invalid row that got deleted (440, 7, 17) instead of the correct output. What do you think? That would find the exact timestamp when the database starts to return the invalid row for `WHERE id_section = '12'`.
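That probe might look roughly like this (a sketch; the index name `id_section_2` is taken from the duplicate-key error in the original description and may differ from the actual index name on sections_relation):

```sql
-- Run periodically (e.g. every second) and log the output; the moment
-- the forced-index lookup returns (440, 7, 17) instead of the rows
-- with id_section = 12 marks when the index lookup went wrong.
SELECT NOW(), t.*
FROM sections_relation AS t FORCE INDEX (id_section_2)
WHERE id_section = '12';
```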

Comment by Marko Mäkelä [ 2024-01-17 ]

While it is possible that some code outside InnoDB could cause corruption of InnoDB data structures, I would really like you to try innodb_adaptive_flushing=OFF as well. As far as InnoDB is concerned, I think that it is a rather low-hanging fruit with unpredictable and potentially devastating consequences.

Comment by Cuchac [ 2024-01-17 ]

OK, I turned off adaptive flushing. We will see.

Comment by Cuchac [ 2024-01-22 ]

Hello, 5 days without the problem. So far so good. innodb_adaptive_flushing=OFF seems to have fixed the issue.

Generated at Thu Feb 08 10:36:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.