Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15227

Inserted data violates foreign key constraint and breaks (Galera and regular async) replication

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.2.12
    • N/A
    • Server
    • None
    • CentOS 7.4

    Description

      We have a 5 master Galera cluster with all of the writes going to a single master. During our unit test run last night, we observed a total cluster failure. Further investigation revealed that all of the masters other than the one that took the initial write did this:

      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Warning] WSREP: Failed to apply app buffer: seqno: 31740050, status: 1
      Feb  6 00:13:57 bm1 mysqld[27265]: at galera/src/trx_handle.cpp:apply():351
      Feb  6 00:13:57 bm1 mysqld[27265]: Retrying 2th time
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Warning] WSREP: Failed to apply app buffer: seqno: 31740050, status: 1
      Feb  6 00:13:57 bm1 mysqld[27265]: at galera/src/trx_handle.cpp:apply():351
      Feb  6 00:13:57 bm1 mysqld[27265]: Retrying 3th time
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Warning] WSREP: Failed to apply app buffer: seqno: 31740050, status: 1
      Feb  6 00:13:57 bm1 mysqld[27265]: at galera/src/trx_handle.cpp:apply():351
      Feb  6 00:13:57 bm1 mysqld[27265]: Retrying 4th time
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [ERROR] Slave SQL: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log FIRST, end_log_pos 481, Internal MariaDB error code: 1452
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Warning] WSREP: RBR event 3 Write_rows_v1 apply warning: 151, 31740050
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [ERROR] WSREP: Failed to apply trx: source: 350e0cb2-054e-11e8-8f40-9b418f81e428 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 4366705 trx_id: 141931452 seqnos (l: 1919314, g: 31740050, s: 31740043, d: 31739920, ts: 2800471558683974)
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [ERROR] WSREP: Failed to apply trx 31740050 4 times
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [ERROR] WSREP: Node consistency compromised, aborting...
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Note] WSREP: Closing send monitor...
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Note] WSREP: Closed send monitor.
      Feb  6 00:13:57 bm1 mysqld[27265]: 2018-02-06  0:13:57 140359886825216 [Note] WSREP: gcomm: terminating thread
      

      Said other masters would then go on to attempt to rejoin the cluster but fail (because there were no functional members). The master which took the original write didn't have this error, but due to all of the other slaves dying simultaneously, shifted itself to non-primary (and so refused to serve any queries or act as a donor for any SST/IST requests).

      We also have several non-Galera standard replication slaves which experienced a very similar issue at the same time, in that their replication stopped with the following error:

      Last_SQL_Error: Could not execute Write_rows_v1 event on table pollstream_data_2.counter; Cannot add or update a child row: a foreign key constraint fails (`pollstream_data_2`.`counter`, CONSTRAINT `fk_counter_account` FOREIGN KEY (`account_id`) REFERENCES `account` (`account_id`) ON DELETE CASCADE), Error_code: 1452; handler error HA_ERR_NO_REFERENCED_ROW; the event's master log binary-log.000476, end_log_pos 3994011
      

      Looking back at the binlog on the master that took the original write, I can see this:

      #180206  0:13:56 server id 31  end_log_pos 3983510 CRC32 0x8465144b     GTID 0-31-57504329 trans
      /*!100001 SET @@session.gtid_seq_no=57504329*//*!*/;
      BEGIN
      /*!*/;
      # at 3983510
      # at 3983584
      #180206  0:13:56 server id 31  end_log_pos 3983584 CRC32 0x60f580f7     Annotate_rows:
      #Q> DELETE FROM `account` WHERE `account_id` = '720951'
      #180206  0:13:56 server id 31  end_log_pos 3983707 CRC32 0x209aea84     Table_map: `pollstream_data_2`.`account` mapped to number 1147
      # at 3983707
      #180206  0:13:56 server id 31  end_log_pos 3983794 CRC32 0x05524cd3     Table_map: `pollstream_data_2`.`access_tokens` mapped to number 6197
      # at 3983794
      #180206  0:13:56 server id 31  end_log_pos 3983877 CRC32 0x1d1dbdc2     Table_map: `pollstream_data_2`.`account_additional_info` mapped to number 6198
      # at 3983877
      #180206  0:13:56 server id 31  end_log_pos 3983949 CRC32 0x9e5dfc0c     Table_map: `pollstream_data_2`.`account_forms` mapped to number 6199
      # at 3983949
      #180206  0:13:56 server id 31  end_log_pos 3984040 CRC32 0xf5f50e3a     Table_map: `pollstream_data_2`.`campaign` mapped to number 5051
      # at 3984040
      #180206  0:13:56 server id 31  end_log_pos 3984144 CRC32 0x419e5916     Table_map: `pollstream_data_2`.`configurable_scheduled_task` mapped to number 1354
      # at 3984144
      #180206  0:13:56 server id 31  end_log_pos 3984222 CRC32 0xbd0213a1     Table_map: `pollstream_data_2`.`counter` mapped to number 6200
      # at 3984222
      #180206  0:13:56 server id 31  end_log_pos 3984314 CRC32 0xcf27c9d7     Table_map: `pollstream_data_2`.`folder_system_object` mapped to number 6201
      # at 3984314
      #180206  0:13:56 server id 31  end_log_pos 3984385 CRC32 0x643217c3     Table_map: `pollstream_data_2`.`poll_themes` mapped to number 1355
      # at 3984385
      #180206  0:13:56 server id 31  end_log_pos 3984466 CRC32 0xc17415c9     Table_map: `pollstream_data_2`.`widget` mapped to number 6202
      # at 3984466
      #180206  0:13:56 server id 31  end_log_pos 3984552 CRC32 0x124cd79b     Table_map: `pollstream_data_2`.`user_background_reports` mapped to number 5865
      # at 3984552
      #180206  0:13:56 server id 31  end_log_pos 3984653 CRC32 0x8e0cfe08     Table_map: `pollstream_data_2`.`campaign_update_log` mapped to number 5049
      # at 3984653
      #180206  0:13:56 server id 31  end_log_pos 3984744 CRC32 0x4953174a     Table_map: `pollstream_data_2`.`publish_schedule` mapped to number 6149
      # at 3984744
      #180206  0:13:56 server id 31  end_log_pos 3984837 CRC32 0xdbd40d27     Table_map: `pollstream_data_2`.`configurable_scheduled_task_log` mapped to number 1366
      # at 3984837
      #180206  0:13:56 server id 31  end_log_pos 3990138 CRC32 0x6c906d83     Delete_rows: table id 1147 flags: STMT_END_F
      

      There's a pile of foreign key action here, but most important to note is that this delete is cascaded to the counter table.

      Now, when this happens shortly afterwards:

      #180206  0:13:57 server id 31  end_log_pos 3993518 CRC32 0x7f2c60e0     GTID 0-31-57504337 trans waited
      /*!100001 SET @@session.gtid_seq_no=57504337*//*!*/;
      BEGIN
      /*!*/;
      # at 3993518
      # at 3993847
      #180206  0:13:56 server id 31  end_log_pos 3993847 CRC32 0xce690654     Annotate_rows:
      #Q> INSERT INTO `counter` (`counter_id`, `account_id`, `counter_type`, `campaign_id`,`media_buy_id`, `language_id`, `tracker_value`,`counter_period`, `counter_period_hour`, `counter_impressions`) VALUES (720951, 720951, 'frontend_template_impression', '-1', '0', '-1', '_default_language', '2018-02-06', 00, 2)
      #180206  0:13:56 server id 31  end_log_pos 3993925 CRC32 0x51566b41     Table_map: `pollstream_data_2`.`counter` mapped to number 6200
      # at 3993925
      #180206  0:13:56 server id 31  end_log_pos 3994011 CRC32 0x8ced7637     Write_rows: table id 6200 flags: STMT_END_F
      

      It should be rejected as the foreign key constraint should fail (this account_id was just deleted above). This is what happens on the other masters (resulting in their untimely demise). However, on the master which took the initial write, I can see that the query was not rejected, and this data (from the above query) is in the counter table and violating the foreign key constraint, which should be impossible. Also, this is the only invalid data in that table as far as the foreign key goes (we always run with foreign keys enabled).

      I have not tried to reproduce this yet, but I suspect it could be reproduced by running two processes in parallel, with one process continually inserting and deleting values from the account table, and the other process trying to insert the aforementioned "hot" values into the counter table (we do quite a lot of this when running our unit test suite).

      Attachments

        1. my.cnf
          3 kB
        2. kill-galera.sh
          0.5 kB
        3. kill-galera.dump
          2 kB

        Issue Links

          Activity

            People

              seppo Seppo Jaakola
              neiltembo Neil Skrypuch
              Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.