Total Galera Cluster failure on Delete_rows_v1

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 10.2.40, 10.2.41
    • None
    • Galera
    • Ubuntu 18.04

    Description

      We've run a five-node multi-primary Galera Cluster on MariaDB 10.1 for the last five years, without major issues.

      Last month, we started a rolling upgrade to 10.2.40, intending to continue up to 10.4 or 10.5.

      We upgraded one machine every one or two days, giving each one time to fill its RAM again, in order to disrupt our clients as little as possible.

      Last Thursday, November 11th, the last machine was upgraded, and it picked up 10.2.41 instead of the 10.2.40 running on the other machines.

      The next day, Friday the 12th, at one of our peak hours, the whole cluster came crashing down.

      Slave SQL: Could not execute Write_rows_v1 event on table moloni_geral.documentos; Duplicate entry '27:409702:52248:0' for key 'doc_unico', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 1518, Internal MariaDB error code: 1062
      

      The attached logNovember12th file shows the output seen on all the servers, which strikes me as odd: shouldn't at least one of them show something else instead?

      We know how this happened: an API customer tried to insert two documents at the same time, and the doc_unico unique key didn't end up unique.

      However, this had worked for years without problems, and our load that day was nowhere near the worst we've had.

      We managed to reproduce the issue (inadvertently, while trying to implement a request throttler) on the morning of November 17th, with the exact same results.

      We've also hit another error, four times now: once on the afternoon of the 12th, after recovering some of the cluster machines, and three more times today.
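
      The columns behind doc_unico are not shown in this report. As a minimal sketch, using only the table and key names from the error message above, the key definition could be inspected on each node like this, to confirm that every node carries the same unique key:

      ```sql
      -- List the columns that make up the doc_unico key
      -- (table and key names are taken from the error message).
      SHOW INDEX FROM moloni_geral.documentos WHERE Key_name = 'doc_unico';

      -- Full table definition, for comparison between nodes.
      SHOW CREATE TABLE moloni_geral.documentos;
      ```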

      Slave SQL: Could not execute Delete_rows_v1 event on table moloni_geral.AC_remain_cookies; Can't find record in 'AC_remain_cookies', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 501, Internal MariaDB error code: 1032
      

      The attached logNovember18th file contains the message that, again, appears on all the servers. Between today's events, we tried setting wsrep_slave_threads=1, to no avail: the error still occurred.

      The query that triggers this error is the following:

      DELETE FROM AC_remain_cookies
      WHERE hash = [some user hash]
        OR expires < [a year ago]
      

      And the structure is the following:

      CREATE TABLE `AC_remain_cookies` (
        `hash` varchar(128) NOT NULL,
        `ip` varchar(250) NOT NULL,
        `user_id` int(11) NOT NULL,
        `created` datetime NOT NULL,
        `expires` datetime NOT NULL
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
       
      ALTER TABLE `AC_remain_cookies`
        ADD PRIMARY KEY (`hash`),
        ADD KEY `expires` (`expires`),
        ADD KEY `user_id` (`user_id`);
      

      This table is very small, averaging about 15k rows.
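
      Since error 1032 means the applier could not find a row that the originating node deleted, one way to look for diverging data is to run the same read-only checks on every node and compare the results. This is only a sketch: the hash literal and the cutoff timestamp are placeholder values, not taken from our application, and a fixed cutoff is used instead of NOW() so that all nodes evaluate the same condition.

      ```sql
      -- Placeholder values (assumptions for illustration only).
      SET @user_hash = 'example-user-hash';
      SET @cutoff    = '2020-11-18 00:00:00';

      -- How many rows would the DELETE match on this node?
      SELECT COUNT(*) AS matching_rows
      FROM moloni_geral.AC_remain_cookies
      WHERE hash = @user_hash
         OR expires < @cutoff;

      -- Rough whole-table comparison; cheap on a table this small.
      SELECT COUNT(*) AS total_rows FROM moloni_geral.AC_remain_cookies;
      CHECKSUM TABLE moloni_geral.AC_remain_cookies;
      ```

      Different counts between nodes would point to rows missing on some of them; a checksum mismatch alone is only a hint, since the table keeps changing while the checks run.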

      Our Galera configuration is as follows:

      [mysqld]
      binlog_format=ROW
      binlog_checksum=NONE
      default-storage-engine=innodb
      innodb_autoinc_lock_mode=2
      bind-address=0.0.0.0
      slave_type_conversions="ALL_NON_LOSSY,ALL_LOSSY"
       
      # Galera Provider Configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_provider_options="gcache.size=4096M;gcache.recover=yes"
       
      # Galera Cluster Configuration
      wsrep_cluster_name="moloni_cluster"
      wsrep_cluster_address=gcomm://<usually five servers here>
       
      # Galera Synchronization Configuration
      wsrep_sst_method=mariabackup
      wsrep_sst_auth="<username:password>"
      wsrep_sst_donor="fiona"
       
      # Galera Node Configuration
      wsrep_node_address="10.32.0.27"
      wsrep_node_name="grace"
      wsrep_slave_threads=1
      wsrep_slave_UK_checks=ON
       
      # Retry autocommit queries
      wsrep_retry_autocommit=4
      

      Analysing the timing of the crashes and the order in which we upgraded or recovered the machines, we're tempted to point at the machine running 10.2.41: the first crashes came the morning after the dawn on which that machine was upgraded from 10.1, our test on the 17th crashed the cluster when we had just rejoined this machine, and this morning's errors came after this machine was recovered again.

      Between the 12th and the dawn of the 17th, for instance, the cluster ran on three machines, all on 10.2.40, with no major issues. The same holds, of course, for the period since we started the rolling upgrade three weeks ago, and for the years before that.

      This feels like a bug, because nothing like it happened in years on 10.1, and the queries that triggered the problems have been in our codebase since forever.
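
      To tie an observation like this to a specific node, the server version and Galera state can be recorded on each machine when it rejoins. A minimal sketch, using standard statements and wsrep status variables (nothing here is specific to our setup):

      ```sql
      -- Server version of this node (10.2.40 vs 10.2.41).
      SELECT VERSION();

      -- Galera provider version and cluster membership as seen by this node.
      SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
      SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
      SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';

      -- Confirm the applier thread setting actually in effect.
      SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';
      ```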

      Attachments

        1. logNovember12th
          6 kB
          Marco Amado
        2. logNovember18th
          7 kB
          Marco Amado

          People

            Unassigned Unassigned
            mjamado Marco Amado