  1. MariaDB Server
  2. MDEV-32356

Setting gtid_slave_pos is not atomic

Details

    • Type: Bug
    • Status: Stalled
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.4
    • Fix Version/s: 10.4
    • Component/s: Galera, Replication
    • Labels: None

    Description

      Consider a normal master-slave topology with gtid_strict_mode=0, where the user stops the slave and runs:

      SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6';
      

      The value may be entirely wrong; for example, there might be no node with domain_id 1 or 2 at all. The statement is executed through the following call stack:

      #0  rpl_slave_state::record_gtid (this=0x55fb9f8f6c90, thd=0x7f2e80031480, gtid=0x7f2e89efa710, sub_id=2, in_transaction=false, 
          in_statement=true, out_hton=0x7f2e89efa6f8) at /home/jan/work/mariadb/10.4/sql/rpl_gtid.cc:690
      #1  0x000055fb9b9ff12e in rpl_slave_state::load (this=0x55fb9f8f6c90, thd=0x7f2e80031480, state_from_master=0x7f2e8003e053 "", len=11, 
          reset=true, in_statement=true) at /home/jan/work/mariadb/10.4/sql/rpl_gtid.cc:1409
      #2  0x000055fb9b81d972 in rpl_gtid_pos_update (thd=0x7f2e80031480, str=0x7f2e8003e048 "1-2-3,2-4-6", len=11)
          at /home/jan/work/mariadb/10.4/sql/sql_repl.cc:4728
      #3  0x000055fb9b99469a in Sys_var_gtid_slave_pos::global_update (this=0x55fb9d1fde20 <Sys_gtid_slave_pos>, thd=0x7f2e80031480, 
          var=0x7f2e8003dff8) at /home/jan/work/mariadb/10.4/sql/sys_vars.cc:1858
      #4  0x000055fb9b6a8c5e in sys_var::update (this=0x55fb9d1fde20 <Sys_gtid_slave_pos>, thd=0x7f2e80031480, var=0x7f2e8003dff8)
          at /home/jan/work/mariadb/10.4/sql/set_var.cc:208
      #5  0x000055fb9b6aab8e in set_var::update (this=0x7f2e8003dff8, thd=0x7f2e80031480) at /home/jan/work/mariadb/10.4/sql/set_var.cc:837
      #6  0x000055fb9b6aa7f0 in sql_set_variables (thd=0x7f2e80031480, var_list=0x7f2e80036360, free=true)
          at /home/jan/work/mariadb/10.4/sql/set_var.cc:740
      #7  0x000055fb9b7db3f1 in mysql_execute_command (thd=0x7f2e80031480) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:5047
      #8  0x000055fb9b7e5303 in mysql_parse (thd=0x7f2e80031480, rawbuf=0x7f2e8003de68 "SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6'", length=40, 
          parser_state=0x7f2e89efb300, is_com_multi=false, is_next_command=false) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:8012
      #9  0x000055fb9b7e499d in wsrep_mysql_parse (thd=0x7f2e80031480, rawbuf=0x7f2e8003de68 "SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6'", length=40, 
          parser_state=0x7f2e89efb300, is_com_multi=false, is_next_command=false) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:7814
      #10 0x000055fb9b7d0979 in dispatch_command (command=COM_QUERY, thd=0x7f2e80031480, 
          packet=0x7f2e8004fa01 "SET GLOBAL gtid_slave_pos= '1-2-3,2-4-6'", packet_length=40, is_com_multi=false, is_next_command=false)
          at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:1843
      #11 0x000055fb9b7cf2ce in do_command (thd=0x7f2e80031480) at /home/jan/work/mariadb/10.4/sql/sql_parse.cc:1378
      #12 0x000055fb9b975bbe in do_handle_one_connection (connect=0x55fb9fded720) at /home/jan/work/mariadb/10.4/sql/sql_connect.cc:1420
      #13 0x000055fb9b97591a in handle_one_connection (arg=0x55fb9fded720) at /home/jan/work/mariadb/10.4/sql/sql_connect.cc:1324
      #14 0x000055fb9bf1154b in pfs_spawn_thread (arg=0x55fb9f972430) at /home/jan/work/mariadb/10.4/storage/perfschema/pfs.cc:1869
      #15 0x00007f2e97c97ada in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
      #16 0x00007f2e97d282e4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
      

      The update is not atomic because rpl_slave_state::load loops over the GTIDs: for each one it calls rpl_slave_state::record_gtid, which contains:

          if (err || (err= ha_commit_trans(thd, FALSE)))
            ha_rollback_trans(thd, FALSE);
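
The effect of committing inside the loop can be illustrated with a small standalone model. This is only a sketch and not MariaDB code: `Gtid`, `ToyTable`, `load_per_gtid_commit`, and the `crash_after` parameter are hypothetical names introduced for illustration. Only the shape of the logic (one commit per GTID) is taken from the description above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for one GTID (domain_id-server_id-seq_no).
struct Gtid {
    uint32_t domain_id;
    uint32_t server_id;
    uint64_t seq_no;
};

// Hypothetical stand-in for the gtid_slave_pos table.
// "durable" survives a crash; "pending" is the open transaction.
struct ToyTable {
    std::map<uint32_t, Gtid> durable;
    std::map<uint32_t, Gtid> pending;

    void write(const Gtid &g) { pending[g.domain_id] = g; }
    void commit() {
        for (const auto &kv : pending) durable[kv.first] = kv.second;
        pending.clear();
    }
    void rollback() { pending.clear(); }
};

// Sentinel meaning "no simulated crash".
const std::size_t kNoCrash = static_cast<std::size_t>(-1);

// Models the loop described above: every GTID is written and committed
// in its own transaction. crash_after = N simulates a crash once N GTIDs
// have been committed; uncommitted work is discarded.
void load_per_gtid_commit(ToyTable &t, const std::vector<Gtid> &gtids,
                          std::size_t crash_after = kNoCrash) {
    std::size_t committed = 0;
    for (const Gtid &g : gtids) {
        if (committed == crash_after) {
            t.rollback();
            return;
        }
        t.write(g);
        t.commit();
        ++committed;
    }
}
```

Calling `load_per_gtid_commit` with the GTIDs 1-2-3 and 2-4-6 and `crash_after` set to 1 leaves domain 1 in `durable` and domain 2 absent, so the stored position is only part of what the statement asked for.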
      

      Because the GTIDs are not stored atomically, problems may arise in the following cases:

      • In Galera, the gtid_slave_pos table is replicated to the other nodes and is assumed to be InnoDB. This replication is required at least when the slave node is configured with skip_slave_start=0, goes down, and then starts again.
      • What happens if the first GTID position has been stored and its transaction committed, and the node then crashes?
      • Galera needs a Galera transaction, so one is started in rpl_slave_state::record_gtid, but it is lost because of the ha_commit_trans or ha_rollback_trans call. This could perhaps be fixed by cleaning the Galera transaction context and starting a new transaction, but that is not optimal because the GTID position update would still not be atomic.
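
For contrast with the cases listed above, an atomic update can be pictured as writing every GTID inside one transaction and committing once at the end. The following is a minimal sketch under stated assumptions, not a proposed patch and not the server's API: `GtidPos`, `StagedTable`, `load_single_commit`, and `crash_at` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for one GTID (domain_id-server_id-seq_no).
struct GtidPos {
    uint32_t domain_id;
    uint32_t server_id;
    uint64_t seq_no;
};

// Hypothetical stand-in for the gtid_slave_pos table.
// "durable" survives a crash; "pending" is the open transaction.
struct StagedTable {
    std::map<uint32_t, GtidPos> durable;
    std::map<uint32_t, GtidPos> pending;

    void write(const GtidPos &g) { pending[g.domain_id] = g; }
    void commit() {
        for (const auto &kv : pending) durable[kv.first] = kv.second;
        pending.clear();
    }
    void rollback() { pending.clear(); }
};

// Sentinel meaning "no simulated crash".
const std::size_t kNoCrash = static_cast<std::size_t>(-1);

// Writes all GTIDs into one open transaction and commits exactly once.
// crash_at = i simulates a crash just before the i-th write; nothing
// becomes durable in that case. Returns true if the commit happened.
bool load_single_commit(StagedTable &t, const std::vector<GtidPos> &gtids,
                        std::size_t crash_at = kNoCrash) {
    for (std::size_t i = 0; i < gtids.size(); ++i) {
        if (i == crash_at) {
            t.rollback();
            return false;
        }
        t.write(gtids[i]);
    }
    t.commit();
    return true;
}
```

In this model a crash at any point before the final commit leaves `durable` unchanged, so the table holds either the old position or the complete new one. Because only one transaction exists for the whole statement, it also stays open until every GTID has been written, which is the property the Galera case depends on.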

            People

              Assignee: knielsen Kristian Nielsen
              Reporter: janlindstrom Jan Lindström
              Votes: 1
              Watchers: 2
