|
Note: I had a theory in the original description, I have removed it because it could not cause this.
|
|
Ticket created on behalf of javiercasares, please ping him for more info.
|
|
This reminds me of MDEV-31072, but here I would assume that a normal B-tree index with no hidden indexed virtual columns is being used.
What does CHECK TABLE wp_067d483e_woocommerce_sessions EXTENDED; (MDEV-24402) report on the table?
If the InnoDB change buffer had ever been enabled and any CREATE INDEX was executed on the table in the past, one possible explanation might be a finding that was fixed in MDEV-30009.
|
|
MariaDB [grao]> CHECK TABLE wp_067d483e_woocommerce_sessions EXTENDED;
+----------------------------------------+-------+----------+----------+
| Table                                  | Op    | Msg_type | Msg_text |
+----------------------------------------+-------+----------+----------+
| grao.wp_067d483e_woocommerce_sessions  | check | status   | OK       |
+----------------------------------------+-------+----------+----------+
1 row in set (0.484 sec)
|
|
About the indexes and everything… the table was created by WooCommerce, and no change has been made to the table since its creation (at least, not by a person) 
BTW: I'm using:
mariadb Ver 15.1 Distrib 10.11.2-MariaDB, for debian-linux-gnu (x86_64)
|
|
Also, for more context...
Last week this happened. So, a few days ago I did a TRUNCATE TABLE, so the tables were empty, and restarted the sync (with mariadb-backup). The sync was going fine until this morning, when I found the error.
I just executed this:
INSERT INTO wp_067d483e_woocommerce_sessions (session_key, session_value, session_expiry) VALUES ('mariadb-test', 'nothing to do here', 1704067200);
Query OK, 1 row affected (0.011 sec)
MariaDB [grao]> INSERT INTO wp_067d483e_woocommerce_sessions (session_key, session_value, session_expiry) VALUES ('mariadb-test', 'nothing to do here', 1704067201);
ERROR 1062 (23000): Duplicate entry 'mariadb-test' for key 'session_key'
|
|
javiercasares, thank you. Can you clarify what are the source and target of mariadb-backup? Are both running MariaDB Server 10.11.2?
Duplicate key errors should be detected by the storage engine (in this case, InnoDB). I think that we can rule out the InnoDB change buffer. The next suspect would be the collation functions that InnoDB is calling. I think that utf8mb4_general_ci is rather basic, and the examples of duplicate key values have been simple ASCII. So, I would be very surprised if this were a collation bug.
The next suspect would be that replication somehow gets out of sync. What does your replication setup look like? Can you attach the configuration files?
|
|
Context:
This is a WordPress Multisite with WooCommerce.
Two weeks ago, we decided to go from a single master to master-slave. I set up the master-slave, and some time later I detected that the slave was not synced. I checked, and it was that error. I left it until last Saturday.
I synced again, and some minutes later (it had been synced and working fine for an hour) I got the same thing. So, I went to the master and saw the duplicate key.
I did a TRUNCATE TABLE on all the woocommerce_sessions tables (I'm reporting the main one because it is where I found the sync problem).
So, now, with all the session tables empty, I did the backup-restore again and everything was working fine… from Saturday until a few hours ago. That is usually when there is more traffic on the site, so that fits.
The sessions are simple plain text because that is what WooCommerce adds... (I'm going to paste the config in a next comment)
|
|
master.txt
slave.txt
If you want any other file… I'm using Ubuntu 22 on a VPS dedicated to MariaDB, so there shouldn't be any other software there.
|
|
Did you run CHECK TABLE … EXTENDED on both nodes? Maybe the unique index is somehow corrupted on the master, so that it would let the duplicate key be inserted in the first place. If that is the case, I would suggest ensuring that SET unique_checks=0 was never invoked. That setting would be part of files created by mariadb-dump and mysqldump. There might be some bug due to MDEV-24621 that you are hitting. Several bug fixes in that area will be part of the upcoming 10.11.3 release.
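For reference, this is roughly what the session preamble of a mariadb-dump/mysqldump file looks like (a sketch from memory; the exact variable set varies between versions):

```sql
-- Dump files typically disable uniqueness and foreign-key checks for the
-- duration of the restore and restore the old values at the end of the file.
SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0;
SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0;
-- ... dump body ...
SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS;
SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS;
```

While unique_checks=0 is in effect, InnoDB may buffer inserts into secondary unique indexes without verifying uniqueness, which is how a restore from such a file could in principle plant a duplicate.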
|
|
Yes, we checked that before with Andrew.
About the EXTENDED, everything was fine in both.
About unique_checks: no, I don't run that... I usually execute this:
On the master:
RESET MASTER;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
then, I do the mariabackup
When the backup finishes
UNLOCK TABLES;
and then, I restore everything in the slave (execute the "prepare" and "move-back") with the server stopped, and then:
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_HOST='10.20.10.5', MASTER_USER='whatever', MASTER_PASSWORD='whatever', MASTER_LOG_FILE='db1-bin.000001', MASTER_LOG_POS=12345, MASTER_CONNECT_RETRY=10;
START SLAVE;
Usually, that works fine for some hours (and yes, the syncing is working during that time)… until that duplication happens on the master…
The main thing for me here is "how is it possible to do a SELECT and see the same ID twice on the master", because the problem is that the master contains the same ID two times in the UNIQUE column.
|
|
Edited the description and highlighted that the duplicate is somehow being allowed on the master, which is the problem.
|
|
marko if it helps, I just looked into the Woocommerce source and this is the insert query used:
"INSERT INTO {$wpdb->prefix}woocommerce_sessions (`session_key`, `session_value`, `session_expiry`) VALUES (%s, %s, %d)
|
ON DUPLICATE KEY UPDATE `session_value` = VALUES(`session_value`), `session_expiry` = VALUES(`session_expiry`)"
|
Not sure if InnoDB does some magic for INSERT...ON DUPLICATE KEY UPDATE queries, but it might be a point to look at?
|
|
TheLinuxJedi, thank you. This starts to look like a replication bug. See MDEV-17073, MDEV-17603, MDEV-17614.
|
|
marko I'm not sure about that, unless somehow replication causes the master to execute the insert twice?
|
|
The simplest explanation that I can come up with for this is incorrect replication of INSERT…ON DUPLICATE KEY UPDATE statements. But I do not know much about the binlog or replication; Elkin does. I see that both master.txt and slave.txt specify binlog_format=row. The earlier tickets that I mentioned were about disallowing statement-based replication, or about making binlog_format=mixed choose the row-based format.
|
|
At first there was binlog_format=mixed, but when I got the errors I switched to row in order to have an event-by-event sync and check whether something was wrong; the problem was also there with the standard configuration.
|
|
TheLinuxJedi, no, the slave does not have any such policy that could cause the master to obtain duplicates. marko, if the screenshot that catches the duplicate unique index is indeed from the master, then replication obviously cannot be involved.
|
|
Is there proof that we have one or more duplicate entries in the unique column on master? CHECK TABLE ... EXTENDED or similar?
Or does the screenshot just show the state on the slave?
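If it helps, a direct way to look for logical duplicates on each node is a grouping query like the following (a sketch; it only proves duplicate values are visible, while CHECK TABLE ... EXTENDED is still needed to validate the index structure itself):

```sql
SELECT session_key, COUNT(*) AS cnt
FROM wp_067d483e_woocommerce_sessions
GROUP BY session_key
HAVING cnt > 1;
```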
|
|
When I checked this, there was no duplicated value in the primary database, but there was the duplicated one in the secondary database.
The tables were exactly the same. Everything looked OK; the binary and hexadecimal tests were the same… so "there were really two identical lines".
This should be "easily" reproducible with WordPress + WooCommerce (plugin) and some fake-data generator.
|
|
mleich, as I understand it, the attached screenshot is from the master.
I am not sure we need that, but javiercasares might confirm with a piece of the master binlog containing:
- the errored event (the event's master log db1-bin.000001, end_log_pos 109407704)
- and one that was binlogged before and applied successfully.
|
|
Attached MTR case produces results in ~20 seconds:

10.11-3430767e006383aac035c2aefd67db6afaba92dd mtr MDEV-31120.test

connection conn5;
INSERT INTO WP (session_key, session_value, session_expiry) VALUES (433, 'jack', NOW()) ON DUPLICATE KEY UPDATE session_value = 'jackie', session_expiry = NOW();
connection conn2;
connection conn3;
connection conn4;
connection conn5;
At 437
connection server_2;
SHOW REPLICA STATUS;
Slave_IO_State	Waiting for master to send event
Master_Host	127.0.0.1
Master_User	root
Master_Port	16020
Connect_Retry	1
Master_Log_File	master-bin.000001
Read_Master_Log_Pos	206765
Relay_Log_File	slave-relay-bin.000002
Relay_Log_Pos	104057
Relay_Master_Log_File	master-bin.000001
Slave_IO_Running	Yes
Slave_SQL_Running	No
Replicate_Rewrite_DB
Replicate_Do_DB
Replicate_Ignore_DB
Replicate_Do_Table
Replicate_Ignore_Table
Replicate_Wild_Do_Table
Replicate_Wild_Ignore_Table
Last_Errno	1942
Last_Error	Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-535' for key 'PRIMARY', Error_code: 1062; the event's master log master-bin.000001, end_log_pos 202579
Skip_Counter	0
Exec_Master_Log_Pos	202203
Relay_Log_Space	106705
Until_Condition	None
Until_Log_File
Until_Log_Pos	0
Master_SSL_Allowed	No
Master_SSL_CA_File
Master_SSL_CA_Path
Master_SSL_Cert
Master_SSL_Cipher
Master_SSL_Key
Seconds_Behind_Master	NULL
Master_SSL_Verify_Server_Cert	No
Last_IO_Errno	0
Last_IO_Error
Last_SQL_Errno	1942
Last_SQL_Error	Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-535' for key 'PRIMARY', Error_code: 1062; the event's master log master-bin.000001, end_log_pos 202579
Replicate_Ignore_Server_Ids
Master_Server_Id	1
Master_SSL_Crl
Master_SSL_Crlpath
Using_Gtid	Slave_Pos
Gtid_IO_Pos	0-1-546
Replicate_Do_Domain_Ids
Replicate_Ignore_Domain_Ids
Parallel_Mode	optimistic
SQL_Delay	0
SQL_Remaining_Delay	NULL
Slave_SQL_Running_State
Slave_DDL_Groups	1
Slave_Non_Transactional_Groups	0
Slave_Transactional_Groups	538
CHECK TABLE WP EXTENDED;
Table	Op	Msg_type	Msg_text
test.WP	check	status	OK
connection conn1;
CHECK TABLE WP EXTENDED;
Table	Op	Msg_type	Msg_text
test.WP	check	status	OK
|
|
|
10.9, 10.10, 11.0 took considerably longer to fail, ~5 min.
No pre-11.0 test failed on the master. 11.0-51a85e7e73a36d024cc231df0e280fe009945fbf was the only one to fail on the master (during the CHECK), in addition to the replica:

CHECK TABLE WP EXTENDED;
Table	Op	Msg_type	Msg_text
test.WP	check	status	OK
connection conn1;
CHECK TABLE WP EXTENDED;
Table	Op	Msg_type	Msg_text
test.WP	check	Warning	InnoDB: Clustered index record with stale history in table `test`.`WP`: COMPACT RECORD(info_bits=0, 6 fields): {[8] I(0x000000000006AE49),[6] H (0x0000000D480B),[7] (0xFF000004E2208A),[32]348155 (0x3334383135352020202020202020202020202020202020202020202020202020),[5]bruce(0x6272756365),[8] fT (0x0000126654877FA4)}
test.WP	check	status	OK
main.MDEV-31120 'innodb,mix' [ fail ] Found warnings/errors in server log file!
Test ended at 2023-07-14 13:03:44
|
|
11.1-26b96094ecccc5683a22674e290aea109cfbd65a

Last_SQL_Error	Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-1212750' for key 'PRIMARY', Error_code: 1062; the event's master log master-bin.000001, end_log_pos 466616726
Replicate_Ignore_Server_Ids
Master_Server_Id	1
Master_SSL_Crl
Master_SSL_Crlpath
Using_Gtid	Slave_Pos
Gtid_IO_Pos	0-1-1212756
Replicate_Do_Domain_Ids
Replicate_Ignore_Domain_Ids
Parallel_Mode	optimistic
SQL_Delay	0
SQL_Remaining_Delay	NULL
Slave_SQL_Running_State
Slave_DDL_Groups	1
Slave_Non_Transactional_Groups	0
Slave_Transactional_Groups	1212817
CHECK TABLE WP EXTENDED;
Table	Op	Msg_type	Msg_text
test.WP	check	Warning	InnoDB: Clustered index record with stale history in table `test`.`WP`: COMPACT RECORD(info_bits=0, 6 fields): {[8] V(0x000000000010E956),[6] ! i(0x00000021BE69),[7] :. (0xCC0000163A2EE6),[32]884536 (0x3838343533362020202020202020202020202020202020202020202020202020),[6]barney(0x6261726E6579),[8] fT G(0x0000126654878147)}
test.WP	check	status	OK
connection conn1;
CHECK TABLE WP EXTENDED;
Table	Op	Msg_type	Msg_text
test.WP	check	status	OK
main.MDEV-31120 'innodb,mix' [ fail ] Found warnings/errors in server log file!
Test ended at 2023-07-14 13:08:13

Relevant server log lines:
2023-07-14 13:08:08 11 [Warning] Slave: Record has changed since last read in table 'WP' Error_code: 1020
2023-07-14 13:08:08 11 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
2023-07-14 13:08:08 11 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2023-07-14 13:08:08 9 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2023-07-14 13:08:08 10 [Warning] Slave: Commit failed due to failure of an earlier commit on which this one depends Error_code: 1964
2023-07-14 13:08:09 5 [Warning] InnoDB: Clustered index record with stale history in table `test`.`WP`: COMPACT RECORD(info_bits=0, 6 fields): {[8] V(0x000000000010E956),[6] ! i(0x00000021BE69),[7] :. (0xCC0000163A2EE6),[32]884536 (0x3838343533362020202020202020202020202020202020202020202020202020),[6]barney(0x6261726E6579),[8] fT G(0x0000126654878147)}
^ Found warnings in /home/dan/repos/build-mariadb-server-11.1/mysql-test/var/log/mysqld.2.err
|
|
|
It looks like something went wrong with --slave-parallel-threads.
However, as slave_parallel_mode is not set, slave-parallel-threads 'should' be safe.
There is still a chance that something got into the wrong order in the binary log.
I tried to repeat the issue with the MDEV-31120.test on MariaDB 10.6. It did fail with:
2023-07-16 15:17:01 9 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1062: Duplicate entry '0-9770' for key 'PRIMARY', Error_code: 1062; the event's master log master-bin.000001, end_log_pos 3717726, Gtid 0-1-9770, Internal MariaDB error code: 1942
I checked gtid_slave_pos, and it did not have any duplicates or errors. However, the number 9770 was in the table, so it is correct that there is a duplicate error.
One possible reason for the error is if XA gets a failure, does a rollback, and tries again.
That would work with InnoDB but not with Aria, and could be a reason for the failure with Aria.
I noticed that mtr is changing gtid_slave_pos to Aria.
After commenting out the following row in mtr:
mtr_tofile($bootstrap_sql_file, "ALTER TABLE gtid_slave_pos ENGINE=Aria transactional=0;\n");
the test case does not fail anymore.
It looks like the MTR test case does not generate any duplicate values in any unique index and thus fails to reproduce the original issue.
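If the Aria-backed mysql.gtid_slave_pos is indeed the trigger, one way to test that hypothesis outside mtr would be to move the table to a transactional engine, so that the GTID state update commits atomically with the replicated transaction (a sketch; on a production replica, the gtid_pos_auto_engines variable is the supported way to control this):

```sql
-- Move the GTID position table to a crash-safe, transactional engine.
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
```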
|
|
Retested 10.11, 11.0, and 11.1 with an InnoDB gtid_slave_pos, running CHECK TABLE ... EXTENDED in a loop, and didn't see a clustered index warning (or any other errors) in 50k rows.
Leaving the gtid_slave_pos duplicate as a problem for MDEV-27849.
|
|
The culprit is innodb_change_buffering, which is set to all both in this report and in MDEV-31713. inserts is sufficient for duplicate keys to show up.
Most likely the problem was introduced, or at least became visible, in 10.6.12. For some reason it is very poorly reproducible on 10.6.12, so I couldn't bisect, but I did see it happen on 10.6.12, while never on 10.6.11 (or 10.5). And there were indeed a bunch of seemingly related changes in 10.6.12, so it fits.
On 10.6.14 and the current tree it is reproducible somewhat better, although maybe that is coincidental.
Not reproducible on 11.0, which is not surprising, since innodb_change_buffering was removed there.
The basic test flow is:
- do a lot of inserts in concurrent threads with potentially duplicate keys;
- restart the server;
- do some more inserts (sometimes even a few non-concurrent ones are sufficient);
- observe duplicate keys.
The test below is for reproducing only; don't put it into the regression suite!
I'm not even sure it is suitable for debugging, but hopefully InnoDB experts will be able to create a better one with this as a starting point.
I've made the test as tunable as possible, because different builds seem to demand different tweaks – some want more concurrency but are okay with fewer queries, some vice versa, and so on.
For some builds the parameters set in the test would be excessive.
The test case doesn't have to be run with --repeat=N; instead, the max_trials value can be increased.
The failure happens both on optimized and debug builds, although of course the test runs much faster on optimized ones.
I doubt it is rr-able, but I only tried a few times (without success).
--source include/have_innodb.inc

let $restart_parameters= --innodb-change-buffering=insert --innodb-buffer-pool-size=1G --innodb-log-file-size=128M;
let $threads= 16;
let $queries= 640000; # Total amount, across all threads
let $max_trials= 20;

let $my_slap_command= $MYSQL_SLAP -uroot --create-schema=test --port=$MASTER_MYPORT --protocol=tcp --concurrency=$threads --query="INSERT INTO test.woocommerce_sessions (session_key, session_value, session_expiry) VALUES (ROUND(RAND()*5000), REPEAT(CHR(ROUND(@@timestamp)%26 + 97),2141), ROUND(@@timestamp)) ON DUPLICATE KEY UPDATE session_value = VALUES(session_value), session_expiry = VALUES(session_expiry)";

CREATE TABLE woocommerce_sessions (
  session_id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  session_key char(32) NOT NULL,
  session_value longtext NOT NULL,
  session_expiry bigint(20) unsigned NOT NULL,
  PRIMARY KEY (session_id),
  UNIQUE KEY session_key (session_key)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

# This restart is to pick up innodb-change-buffering,
# it seems to work (better? or only?) than dynamic setting
echo #;
source include/restart_mysqld.inc;

let $trial= 0;

while ($trial < $max_trials)
{
  inc $trial;
  echo ##################;
  echo # Trial $trial/$max_trials: running ~$queries queries via $threads parallel threads;

  --disable_result_log
  exec $my_slap_command --number-of-queries=$queries;

  echo #;
  source include/restart_mysqld.inc;

  echo #;
  echo # Sending some more queries;

  --disable_result_log
  exec $my_slap_command --number-of-queries=100;

  if (`SELECT COUNT(*) FROM (SELECT session_key, COUNT(*) cnt FROM test.woocommerce_sessions GROUP BY session_key HAVING cnt > 1) sq`)
  {
    enable_query_log;
    enable_result_log;
    echo #######################;
    echo # Found duplicate keys:;
    echo #######################;
    SELECT session_key, COUNT(*) cnt FROM woocommerce_sessions GROUP BY session_key HAVING cnt > 1;
    CHECK TABLE woocommerce_sessions EXTENDED;
    die # Found duplicate keys, see above;
  }
  echo #;
  echo # No duplicate keys yet;
}

# Cleanup
connection default;
DROP TABLE woocommerce_sessions;
|
Example of the output:

10.6 a03ce7b9 non-debug

# Trial 4/20: running ~640000 queries via 16 parallel threads
#
# restart: --innodb-change-buffering=insert --innodb-buffer-pool-size=1G --innodb-log-file-size=128M
#
# Sending some more queries
#######################
# Found duplicate keys:
#######################
SELECT session_key, COUNT(*) cnt FROM woocommerce_sessions GROUP BY session_key HAVING cnt > 1;
session_key	cnt
1374	2
1405	2
<...cut...>
603	2
67	2
CHECK TABLE woocommerce_sessions EXTENDED;
Table	Op	Msg_type	Msg_text
test.woocommerce_sessions	check	status	OK
bug.t 'innodb' [ fail ]
Test ended at 2023-07-19 04:02:48

CURRENT_TEST: bug.t
mysqltest: At line 57: # Found duplicate keys, see above
|
|
|
I can reproduce this rather easily on my system, in 57 seconds when using a RelWithDebInfo build, and some 4 to 5 times longer when using an optimized Debug build, each build failing in the first round. The following patch seems to plug this:
diff --git a/storage/innobase/include/ibuf0ibuf.inl b/storage/innobase/include/ibuf0ibuf.inl
index 1e21f74ff2b..801353bb5fa 100644
--- a/storage/innobase/include/ibuf0ibuf.inl
+++ b/storage/innobase/include/ibuf0ibuf.inl
@@ -103,7 +103,7 @@ ibuf_should_try(
 	&& !(index->type & (DICT_CLUSTERED | DICT_IBUF))
 	&& ibuf.max_size != 0
 	&& index->table->quiesce == QUIESCE_NONE
-	&& (ignore_sec_unique || !dict_index_is_unique(index)));
+	&& (/*ignore_sec_unique ||*/ !dict_index_is_unique(index)));
 }

 /******************************************************************//**
It is yet unclear to me why we would have ignore_sec_unique=true here, because there is no SET unique_checks=0 in the test case. I will investigate this more deeply once I have better confidence that this is it. So far, 5 out of 20 trials have completed without any error.
|
|
With the above patch, the test was interrupted due to a timeout in 15 minutes:

10.6 6e484c3bd99a6709e3554f32a99167042ea88496
# Trial 18/20: running ~640000 queries via 16 parallel threads
worker[1] Trying to dump core for [mysqltest - pid: 107364, winpid: 107364]
worker[1] Trying to dump core for [mysqld.1 - pid: 137456, winpid: 137456]
innodb.mdev-31120 'innodb' [ fail ] timeout after 900 seconds
|
|
|
There is only one caller of ibuf_should_try():

if (btr_op != BTR_NO_OP &&
    ibuf_should_try(index(), btr_op != BTR_INSERT_OP))

Here, we have btr_op = BTR_INSERT_IGNORE_UNIQUE_OP, so the insert buffering would be allowed in a unique secondary index. It turns out that an incorrect parameter was passed earlier in the call stack:
diff --git a/storage/innobase/row/row0ins.cc b/storage/innobase/row/row0ins.cc
index 3188c8c27ce..d9ba8bca64e 100644
--- a/storage/innobase/row/row0ins.cc
+++ b/storage/innobase/row/row0ins.cc
@@ -3005,8 +3005,8 @@ row_ins_sec_index_entry_low(
 		search_mode = btr_latch_mode(
 			search_mode
 			| (thr_get_trx(thr)->check_unique_secondary
-			   ? BTR_INSERT | BTR_IGNORE_SEC_UNIQUE
-			   : BTR_INSERT));
+			   ? BTR_INSERT
+			   : BTR_INSERT | BTR_IGNORE_SEC_UNIQUE));
 	}

 	err = cursor.search_leaf(entry, PAGE_CUR_LE, search_mode,

This is something that I broke in MDEV-29603. There used to be a negation before the thr_get_trx(thr)->check_unique_secondary check.
|