Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.2.12
-
None
-
CentOS 7.4
Description
We have a 5 master Galera cluster with all of the writes going to a single master. During our unit test run last night, we observed a total cluster failure. Further investigation revealed that all of the masters other than the one that took the initial write did this:
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Warning] WSREP: Failed to apply app buffer: seqno: 31740050, status: 1 |
Feb 6 00:13:57 bm1 mysqld[27265]: at galera/src/trx_handle.cpp:apply():351 |
Feb 6 00:13:57 bm1 mysqld[27265]: Retrying 2th time |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Warning] WSREP: Failed to apply app buffer: seqno: 31740050, status: 1 |
Feb 6 00:13:57 bm1 mysqld[27265]: at galera/src/trx_handle.cpp:apply():351 |
Feb 6 00:13:57 bm1 mysqld[27265]: Retrying 3th time |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Warning] WSREP: Failed to apply app buffer: seqno: 31740050, status: 1 |
Feb 6 00:13:57 bm1 mysqld[27265]: at galera/src/trx_handle.cpp:apply():351 |
Feb 6 00:13:57 bm1 mysqld[27265]: Retrying 4th time |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050 |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [ERROR] WSREP: Failed to apply trx: source: 350e0cb2-054e-11e8-8f40-9b418f81e428 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 4366705 trx_id: 141931452 seqnos (l: 1919314, g: 31740050, s: 31740043, d: 31739920, ts: 2800471558683974) |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [ERROR] WSREP: Failed to apply trx 31740050 4 times |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [ERROR] WSREP: Node consistency compromised, aborting... |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Note] WSREP: Closing send monitor... |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Note] WSREP: Closed send monitor. |
Feb 6 00:13:57 bm1 mysqld[27265]: 2018-02-06 0:13:57 140359886825216 [Note] WSREP: gcomm: terminating thread |
Said other masters would then go on to attempt to rejoin the cluster but fail (because there were no functional members). The master which took the original write didn't have this error, but due to all of the other slaves dying simultaneously, shifted itself to non-primary (and so refused to serve any queries or act as a donor for any SST/IST requests).
We also have several non-Galera standard replication slaves which experienced a very similar issue at the same time, in that their replication stopped with the following error:
Last_SQL_Error: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log binary-log.000476, end_log_pos 3994011 |
Looking back at the binlog on the master that took the original write, I can see this:
#180206 0:13:56 server id 31 end_log_pos 3983510 CRC32 0x8465144b GTID 0-31-57504329 trans |
/*!100001 SET @@session.gtid_seq_no=57504329*//*!*/; |
BEGIN
|
/*!*/; |
# at 3983510 |
# at 3983584 |
#180206 0:13:56 server id 31 end_log_pos 3983584 CRC32 0x60f580f7 Annotate_rows: |
#Q> DELETE FROM `account` WHERE `account_id` = '720951' |
#180206 0:13:56 server id 31 end_log_pos 3983707 CRC32 0x209aea84 Table_map: `pollstream_data_2`.`account` mapped to number 1147 |
# at 3983707 |
#180206 0:13:56 server id 31 end_log_pos 3983794 CRC32 0x05524cd3 Table_map: `pollstream_data_2`.`access_tokens` mapped to number 6197 |
# at 3983794 |
#180206 0:13:56 server id 31 end_log_pos 3983877 CRC32 0x1d1dbdc2 Table_map: `pollstream_data_2`.`account_additional_info` mapped to number 6198 |
# at 3983877 |
#180206 0:13:56 server id 31 end_log_pos 3983949 CRC32 0x9e5dfc0c Table_map: `pollstream_data_2`.`account_forms` mapped to number 6199 |
# at 3983949 |
#180206 0:13:56 server id 31 end_log_pos 3984040 CRC32 0xf5f50e3a Table_map: `pollstream_data_2`.`campaign` mapped to number 5051 |
# at 3984040 |
#180206 0:13:56 server id 31 end_log_pos 3984144 CRC32 0x419e5916 Table_map: `pollstream_data_2`.`configurable_scheduled_task` mapped to number 1354 |
# at 3984144 |
#180206 0:13:56 server id 31 end_log_pos 3984222 CRC32 0xbd0213a1 Table_map: `pollstream_data_2`.`counter` mapped to number 6200 |
# at 3984222 |
#180206 0:13:56 server id 31 end_log_pos 3984314 CRC32 0xcf27c9d7 Table_map: `pollstream_data_2`.`folder_system_object` mapped to number 6201 |
# at 3984314 |
#180206 0:13:56 server id 31 end_log_pos 3984385 CRC32 0x643217c3 Table_map: `pollstream_data_2`.`poll_themes` mapped to number 1355 |
# at 3984385 |
#180206 0:13:56 server id 31 end_log_pos 3984466 CRC32 0xc17415c9 Table_map: `pollstream_data_2`.`widget` mapped to number 6202 |
# at 3984466 |
#180206 0:13:56 server id 31 end_log_pos 3984552 CRC32 0x124cd79b Table_map: `pollstream_data_2`.`user_background_reports` mapped to number 5865 |
# at 3984552 |
#180206 0:13:56 server id 31 end_log_pos 3984653 CRC32 0x8e0cfe08 Table_map: `pollstream_data_2`.`campaign_update_log` mapped to number 5049 |
# at 3984653 |
#180206 0:13:56 server id 31 end_log_pos 3984744 CRC32 0x4953174a Table_map: `pollstream_data_2`.`publish_schedule` mapped to number 6149 |
# at 3984744 |
#180206 0:13:56 server id 31 end_log_pos 3984837 CRC32 0xdbd40d27 Table_map: `pollstream_data_2`.`configurable_scheduled_task_log` mapped to number 1366 |
# at 3984837 |
#180206 0:13:56 server id 31 end_log_pos 3990138 CRC32 0x6c906d83 Delete_rows: table id 1147 flags: STMT_END_F |
There's a pile of foreign key action here, but most important to note is that this delete is cascaded to the counter table.
Now, when this happens shortly afterwards:
#180206 0:13:57 server id 31 end_log_pos 3993518 CRC32 0x7f2c60e0 GTID 0-31-57504337 trans waited |
/*!100001 SET @@session.gtid_seq_no=57504337*//*!*/; |
BEGIN
|
/*!*/; |
# at 3993518 |
# at 3993847 |
#180206 0:13:56 server id 31 end_log_pos 3993847 CRC32 0xce690654 Annotate_rows: |
#Q> INSERT INTO `counter` (`counter_id`, `account_id`, `counter_type`, `campaign_id`,`media_buy_id`, `language_id`, `tracker_value`,`counter_period`, `counter_period_hour`, `counter_impressions`) VALUES (720951, 720951, 'frontend_template_impression', '-1', '0', '-1', '_default_language', '2018-02-06', 00, 2) |
#180206 0:13:56 server id 31 end_log_pos 3993925 CRC32 0x51566b41 Table_map: `pollstream_data_2`.`counter` mapped to number 6200 |
# at 3993925 |
#180206 0:13:56 server id 31 end_log_pos 3994011 CRC32 0x8ced7637 Write_rows: table id 6200 flags: STMT_END_F |
It should be rejected as the foreign key constraint should fail (this account_id was just deleted above). This is what happens on the other masters (resulting in their untimely demise). However, on the master which took the initial write, I can see that the query was not rejected, and this data (from the above query) is in the counter table and violating the foreign key constraint, which should be impossible. Also, this is the only invalid data in that table as far as the foreign key goes (we always run with foreign keys enabled).
I have not tried to reproduce this yet, but I suspect it could be reproduced by running two processes in parallel, with one process continually inserting and deleting values from the account table, and the other process trying to insert the aforementioned "hot" values into the counter table (we do quite a lot of this when running our unit test suite).
Attachments
Issue Links
- relates to
-
MDEV-21466 BF-BF X lock conflict, replication fails, cluster unusable
- Closed