Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9952

Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption

    XMLWordPrintable

Details

    Description

      The same user who experienced MDEV-9670 has started experiencing similar issues, but in these new cases, server_id is not mysteriously set to 0 in the binlog event.

      The latest error showed a bunch of out-of-order sequence number errors in a short period of time. Each time, the user attempted to skip the problem. After each of those, another out-of-order sequence number error occurred shortly after.

      For example:

      160417 10:33:07 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Gtid 35-7735-937125, Internal MariaDB error code: 1950
      160417 10:33:07 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Error_code: 1950
      160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
      160417 10:33:07 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
      TRANSACTION 1521353596, not started
      mysql tables in use 1, locked 4
      MySQL thread id 329862, OS thread handle 0x7efc5a4da700, query id 137799390 Waiting for slave mutex on exit
      ...
      160417 10:44:43 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95365805', GTID '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='48656', master_log_name='mariadb-bin.007022', master_log_pos='95413434', GTID '21-7521-7596336,0-7725-126,35-7735-937125,11-7511-6574751,17-7725-102939928,15-7715-29236017';
      160417 10:44:43 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Gtid 35-7735-937137, Internal MariaDB error code: 1950
      160417 10:44:43 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Error_code: 1950
      160417 10:44:43 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95860806; GTID position '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019'
      160417 10:44:43 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
      TRANSACTION 1521379111, not started
      mysql tables in use 1, locked 4
      MySQL thread id 348591, OS thread handle 0x7efc5a4da700, query id 138021377 Waiting for slave mutex on exit
      ...
      160417 10:44:57 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95860806', GTID '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='23166', master_log_name='mariadb-bin.007022', master_log_pos='95882945', GTID '21-7521-7596336,0-7725-126,35-7735-937137,11-7511-6574751,17-7725-102939930,15-7715-29236019';
      160417 10:44:57 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Gtid 35-7735-937151, Internal MariaDB error code: 1950
      160417 10:44:57 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Error_code: 1950
      160417 10:44:57 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 96808225; GTID position '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023'
      160417 10:44:57 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
      TRANSACTION 1521379598, not started
      mysql tables in use 1, locked 4
      MySQL thread id 348606, OS thread handle 0x7efc5a4da700, query id 138026480 Waiting for slave mutex on exit
      ...
      160417 10:45:21 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='96808225', GTID '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='277601', master_log_name='mariadb-bin.007022', master_log_pos='97085707', GTID '21-7521-7596336,0-7725-126,35-7735-937151,11-7511-6574751,17-7725-102939940,15-7715-29236023';
      160417 10:45:21 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Gtid 35-7735-937161, Internal MariaDB error code: 1950
      160417 10:45:21 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Error_code: 1950
      160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
      160417 10:45:21 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 6.
      TRANSACTION 1521380481, not started
      mysql tables in use 1, locked 6
      MySQL thread id 348640, OS thread handle 0x7efc5a4da700, query id 138034243 Waiting for slave mutex on exit
      

      I've noticed some strange things about the latest failure at 10:45:21:

      1.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the slave is stopped at master_log_name=''mariadb-bin.007022' master_log_pos='97244987', GTID '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'

      160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
      

      The strange thing is that GTID '35-7735-937161' is not present in the binary log until much later: master_log_name='mariadb-bin.007022' master_log_pos='97246276'

      # at 97246276
      #160417 10:33:08 server id 7735  end_log_pos 97246318 CRC32 0x0e07bf7f  GTID 35-7735-937161
      /*!100001 SET @@session.gtid_domain_id=35*//*!*/;
      /*!100001 SET @@session.server_id=7735*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=937161*//*!*/;
      

      If the slave supposedly failed executing an event at master_log_name='mariadb-bin.007022' master_log_pos='97246276', why would it stop all the way back at master_log_name='mariadb-bin.007022' master_log_pos='97244987'?

      2.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'. Notice that the sequence number for gtid_domain_id 35 is 937160, which is less than the out-of-order sequence number 937161.

      If the highest sequence number for the domain was only 937160 at the time, why would a sequence number of 937161 cause an out-of-order sequence number?

      These items also make me wonder if these strange observations also hold for the first failure that happened at 10:33, so I backtracked in the binlogs to look into it.

      1.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', it stops at position master_log_name='mariadb-bin.007022', master_log_pos='95365805'

      160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
      

      This position does actually seem to be correct, since the GTID '35-7735-937125' is at that position:

      # at 95365805
      #160417 10:33:07 server id 7735  end_log_pos 95365847 CRC32 0x021ffb43  GTID 35-7735-937125
      /*!100001 SET @@session.gtid_seq_no=937125*//*!*/;
      BEGIN
      /*!*/;
      

      2.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'. Notice here that the sequence number for gtid_domain_id 35 is only 937124, so this out-of-order sequence error also seems to be unusual. Additionally, I can see that the transaction with the sequence number 937124 is directly preceding the failing one in the binlog:

      # at 95327974
      #160417 10:33:07 server id 7735  end_log_pos 95328016 CRC32 0x60f374c6  GTID 35-7735-937124
      /*!100001 SET @@session.gtid_seq_no=937124*//*!*/;
      BEGIN
      /*!*/;
      

      Let's see if observation #1 above holds for the other two failures at 10:44:43 and 10:44:57.

      The failure at 10:44:43 involved GTID '35-7735-937137', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='95860806'.

      This position also seems to be correct, since the GTID '35-7735-937137' is at the position:

      # at 95860806
      #160417 10:33:07 server id 7735  end_log_pos 95860848 CRC32 0x5a30278a  GTID 35-7735-937137
      /*!100001 SET @@session.gtid_seq_no=937137*//*!*/;
      BEGIN
      /*!*/;
      

      The failure at 10:44:57 involved GTID '35-7735-937151', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='96808225'.

      This GTID doesn't appear until position 96809133, a bit later than position 96808225.

      # at 96809133
      #160417 10:33:08 server id 7735  end_log_pos 96809175 CRC32 0xe44d2cfa  GTID 35-7735-937151
      /*!100001 SET @@session.gtid_domain_id=35*//*!*/;
      /*!100001 SET @@session.server_id=7735*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=937151*//*!*/;
      BEGIN
      /*!*/;
      

      These observations make me wonder if the SQL thread's position was corrupted after the 10:44:57 failure. I know of at least one open bug (MDEV-9138) that could cause problems with the SQL thread position, but it does not seem similar to this problem.

      The user does not have parallel replication enabled.

      Some other observations:

      • Sometimes when this problem occurs, the mysqld process has to be killed because the slave thread, which is killed when the out-of-order sequence error occurs and no longer exists, still seems to have table locks and/or a slave mutex.
      • Other times when this problem occurs, the event can simply be skipped and the slave thread will happily restart.

      Attachments

        1. 5-17-crash-errorlog
          4 kB
          Matt Neth
        2. 5-17-crash-gdb-bt
          127 kB
          Matt Neth
        3. gdb-bt-during-slave-status-lock.txt
          90 kB
          Matt Neth

        Issue Links

          Activity

            People

              Elkin Andrei Elkin
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              4 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.