MariaDB Server / MDEV-6451

"Error 'Table 't1' already exists' on query" with slave_ddl_exec_mode=IDEMPOTENT

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.0.12
    • Fix Version/s: 10.0.13
    • Component/s: None
    • Labels: None

    Description

      Hi!

      I set up a simple replication topology with 3 hosts, all running MariaDB 10.0.12.

      node1 <--> node2
        \         /
         \       /
          \     /
           node3

      node1 and node2 are in active master-master with each other.

      node3 is set up in one-way multi-source from node1 and node2.

      Replication is set up using old-style binlog positions (GTIDs are not used for replication).
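
      For reference, a minimal sketch of how node3's two connections might be configured with old-style positions (host names, port, user and coordinates below are placeholders, not taken from this report):

          CHANGE MASTER 'm1' TO MASTER_HOST='node1', MASTER_PORT=3306, MASTER_USER='repl',
              MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4;
          CHANGE MASTER 'm2' TO MASTER_HOST='node2', MASTER_PORT=3306, MASTER_USER='repl',
              MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=4;
          START ALL SLAVES;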

      I execute a CREATE TABLE statement on node1, which is replicated to node2 as expected and logged to its binary log due to log-slave-updates.

      node3 receives this CREATE TABLE event from both streams. node3 is running with gtid_ignore_duplicates=OFF, so it tries executing the CREATE TABLE event both times. I expect this.

      However, I am surprised to see this:

                     Last_SQL_Error: Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'create table t1 (id int unsigned not null auto_increment primary key, server_id int) comment='created on node1''

      If I do START ALL SLAVES on node3, the CREATE TABLE statement executes without any errors. I then see two copies of this event in node3's binary log (node3 has log-slave-updates enabled).

      Apparently, if I execute subsequent CREATE TABLE statements on node1 or node2, they are executed twice on the slaves without any intervention.

      So, it seems that the first CREATE TABLE fails, but subsequent CREATE TABLE statements are executed with the slave_ddl_exec_mode=IDEMPOTENT behavior?
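
      For reference, slave_ddl_exec_mode=IDEMPOTENT (the default) is documented to make the slave apply a replicated CREATE TABLE as CREATE OR REPLACE TABLE and a replicated DROP TABLE as DROP TABLE IF EXISTS, so the duplicate event arriving over the second stream was expected to be absorbed roughly like this (a sketch, using the table definition from the query above):

          -- what the slave SQL thread is expected to execute in IDEMPOTENT mode
          CREATE OR REPLACE TABLE t1 (id int unsigned not null auto_increment primary key,
                                      server_id int) comment='created on node1';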

      Attachments

        Activity

          I could reproduce it manually reliably, but when it's put in MTR, it turns out to be weirdly sporadic. I need to look more at it.

          elenst Elena Stepanova added a comment

          So, when the scenario is run manually, the problem is sporadic too.
          Apparently, there is a race condition between two slave SQL threads executing CREATE nearly simultaneously, caused by the specific (maybe erroneous) locking algorithm when it comes to converting CREATE to CREATE OR REPLACE inside the slave with IDEMPOTENT DDL.

          I wasn't able to produce a deterministic test case because I cannot trigger sync points inside slave SQL threads, and running regular CREATE OR REPLACE statements doesn't seem to cause the same problem since it's processed differently.

          Below is a crude MTR test case to reproduce the problem (not to be included into the suite!).

          --let $rpl_topology=1->2->3
          --source include/rpl_init.inc
          --source include/have_binlog_format_statement.inc
           
          --enable_connect_log
           
          connection server_3;
          --source include/stop_slave.inc
          reset slave all;
          eval change master 'm1' to master_host='127.0.0.1', master_port=$SERVER_MYPORT_1, master_user='root';
          eval change master 'm2' to master_host='127.0.0.1', master_port=$SERVER_MYPORT_2, master_user='root';
          start all slaves;
           
          connection server_1;
           
          let $count = 100;
          while ($count)
          {
           
          	connection server_1;
          	eval create table t$count (i int);
          	save_master_pos;
           
          	connection server_2;
          	sync_with_master;
           
          	connection server_3;
          	sync_with_master 0, 'm1';
           
          	connection server_2;
          	save_master_pos;
           
          	connection server_3;
          	sync_with_master 0, 'm2';
           
          	dec $count;
          }
           
          connection server_1;
          drop database test;
          create database test;
           
          --source include/rpl_end.inc

          elenst Elena Stepanova added a comment - edited

          Hi!

          I was not able to repeat the problem.
          How should the above program fail?

          I was trying to test this in 10.0, but noticed that the required test.cnf file was missing.

          I created the following file mysql-test/suite/rpl/t/skr.cnf

          !include suite/rpl/my.cnf

          [mysqld.1]
          log-slave-updates
          loose-innodb

          [mysqld.2]
          log-slave-updates
          loose-innodb
          replicate-ignore-table=db.t2

          [mysqld.3]
          log-slave-updates
          loose-innodb

          [ENV]
          SERVER_MYPORT_3= @mysqld.3.port
          SERVER_MYSOCK_3= @mysqld.3.socket

          When I run the test case as:

          mysql-test-run rpl.skr

          This crashed in rpl_end with:
          t line 149: sync_slave_with_master failed: 'select master_pos_wait('slave-bin.000001', 13044, 300, '')' returned NULL indicating slave SQL thread failure

          However, I don't have any problems with create table failing (as far as I can see)

          monty Michael Widenius added a comment

          Here is my cnf file:

          !include suite/rpl/my.cnf
           
          [mysqld.1]
          log-slave-updates
          loose-innodb
           
          [mysqld.2]
          log-slave-updates
          loose-innodb
           
          [mysqld.3]
          log-slave-updates
          loose-innodb
           
          [ENV]
          SERVER_MYPORT_3=                @mysqld.3.port
          SERVER_MYSOCK_3=                @mysqld.3.socket

          For me, the test also fails like this:

          analyze: sync_with_master
          mysqltest: At line 37: sync_with_master failed: 'select master_pos_wait('master-bin.000001', 689, 300, 'm1')' returned NULL indicating slave SQL thread failure

          And the error logs shows this:

          140730 17:59:22 [ERROR] Master 'm1': Slave SQL: Error 'Table 't98' already exists' on query. Default database: 'test'. Query: 'create table t98 (i int)', Gtid 0-1-3, Internal MariaDB error code: 1050
          140730 17:59:22 [Warning] Master 'm1': Slave: Table 't98' already exists Error_code: 1050
          140730 17:59:22 [ERROR] Master 'm1': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 564

          elenst Elena Stepanova added a comment

          The race condition was in lock_table_names(), which failed when we tried to do a CREATE TABLE within the replication thread while the table was in use.

          Fixed by propagating the REPLACE flag to lock_table_names().
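
          A quick way to sanity-check on node3 that the reported symptom is gone (a sketch; the connection names 'm1'/'m2' and table test.t1 are taken from the test case and description above):

          -- both connections should show Slave_SQL_Running: Yes and an empty Last_SQL_Error
          SHOW ALL SLAVES STATUS\G
          -- the table should exist exactly once
          SHOW CREATE TABLE test.t1\G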

          monty Michael Widenius added a comment

          Pushed into 10.0 tree

          monty Michael Widenius added a comment

          Hi

          I am encountering the same issue but with a VIEW.

          Master is Server version: 10.0.13-MariaDB-log Homebrew
          on OS X 10.10.3

          Slave is Server version: 10.0.19-MariaDB-1~precise-log mariadb.org binary distribution
          on Ubuntu 14.04.1 LTS

          Jun 13 16:54:49 whktest mysqld: 150613 16:54:49 [Warning] Slave: Table 'listview1' already exists Error_code: 1050
          Jun 13 16:54:49 whktest mysqld: 150613 16:54:49 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000002' position 55904540

          The statement causing this is

          CREATE ALGORITHM = UNDEFINED
          DEFINER = `xxxxxxxx`@`localhost`
          SQL SECURITY DEFINER
          VIEW `listview1`
          AS
          SELECT id ,
          from_fld ,
          to_fld ,
          subj_fld ,
          date_fld ,
          mailbox ,
          body_fld ,
          numAttach ,
          attachNames,
          attachText ,
          headings
          FROM email
          WHERE id IN(197854,
          197853,
          197852)

          Stepping over the statement does not work.

          greetings, el

          ondohotola Dr Eberhard W Lisse added a comment

          ondohotola,

          I am encountering the same issue but with a VIEW

          By "the same issue" do you mean that you also have two masters executing the statement, and observe the error despite slave_ddl_exec_mode=IDEMPOTENT?
          Or do you just mean that you are getting replication failure with a similar error text?

          This bug report was about a very specific situation. The conflict was expected to happen in the given setup, but it was supposed to be suppressed by the slave configuration.

          If you mean the same – that you indeed create a duplicate view but don't expect it to cause a problem due to IDEMPOTENT mode – then I think the expectation is not valid; according to the variable description, it does not affect views.

          If you just mean that you are getting the error when you think there should be no conflict, we will need much more information to find out why it might be happening, and it's not related to this bug report. Please create another JIRA issue, attach your binary logs (or their fragments around this view creation, the wider the better), cnf files and error logs.
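
          As an illustration only, outside the scope of this ticket: since the variable is described as rewriting only CREATE TABLE / DROP TABLE, a view has to be made idempotent in the statement itself on the master, e.g. (column list copied from the statement quoted above):

          CREATE OR REPLACE
            ALGORITHM = UNDEFINED
            DEFINER = `xxxxxxxx`@`localhost`
            SQL SECURITY DEFINER
            VIEW `listview1` AS
            SELECT id, from_fld, to_fld, subj_fld, date_fld, mailbox, body_fld,
                   numAttach, attachNames, attachText, headings
            FROM email
            WHERE id IN (197854, 197853, 197852);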

          elenst Elena Stepanova added a comment

          People

            Assignee: monty Michael Widenius
            Reporter: kolbe Kolbe Kegel (Inactive)
            Votes: 1
            Watchers: 6

