Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9952

Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption

    XMLWordPrintable

Details

    Description

      The same user who experienced MDEV-9670 has started experiencing similar issues, but in these new cases, server_id is not mysteriously set to 0 in the binlog event.

      The latest error showed a bunch of out-of-order sequence number errors in a short period of time. Each time, the user attempted to skip the problem. After each of those, another out-of-order sequence number error occurred shortly after.

      For example:

      160417 10:33:07 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Gtid 35-7735-937125, Internal MariaDB error code: 1950
      160417 10:33:07 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937125 which would create an out-of-order sequence number with existing GTID 35-7735-937125, and gtid strict mode is enabled. Error_code: 1950
      160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
      160417 10:33:07 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
      TRANSACTION 1521353596, not started
      mysql tables in use 1, locked 4
      MySQL thread id 329862, OS thread handle 0x7efc5a4da700, query id 137799390 Waiting for slave mutex on exit
      ...
      160417 10:44:43 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95365805', GTID '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='48656', master_log_name='mariadb-bin.007022', master_log_pos='95413434', GTID '21-7521-7596336,0-7725-126,35-7735-937125,11-7511-6574751,17-7725-102939928,15-7715-29236017';
      160417 10:44:43 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Gtid 35-7735-937137, Internal MariaDB error code: 1950
      160417 10:44:43 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937137 which would create an out-of-order sequence number with existing GTID 35-7735-937137, and gtid strict mode is enabled. Error_code: 1950
      160417 10:44:43 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95860806; GTID position '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019'
      160417 10:44:43 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
      TRANSACTION 1521379111, not started
      mysql tables in use 1, locked 4
      MySQL thread id 348591, OS thread handle 0x7efc5a4da700, query id 138021377 Waiting for slave mutex on exit
      ...
      160417 10:44:57 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='95860806', GTID '21-7521-7596336,0-7725-126,35-7735-937136,11-7511-6574751,17-7725-102939930,15-7715-29236019';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='23166', master_log_name='mariadb-bin.007022', master_log_pos='95882945', GTID '21-7521-7596336,0-7725-126,35-7735-937137,11-7511-6574751,17-7725-102939930,15-7715-29236019';
      160417 10:44:57 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Gtid 35-7735-937151, Internal MariaDB error code: 1950
      160417 10:44:57 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937151 which would create an out-of-order sequence number with existing GTID 35-7735-937151, and gtid strict mode is enabled. Error_code: 1950
      160417 10:44:57 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 96808225; GTID position '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023'
      160417 10:44:57 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 4.
      TRANSACTION 1521379598, not started
      mysql tables in use 1, locked 4
      MySQL thread id 348606, OS thread handle 0x7efc5a4da700, query id 138026480 Waiting for slave mutex on exit
      ...
      160417 10:45:21 [Note] Master 'dc2master': 'SQL_SLAVE_SKIP_COUNTER=1' executed at relay_log_file='/data/logs/relay/relay-bin-dc2master.000001', relay_log_pos='4', master_log_name='mariadb-bin.007022', master_log_pos='96808225', GTID '21-7521-7596336,0-7725-126,35-7735-937150,11-7511-6574751,17-7725-102939940,15-7715-29236023';  and new position at relay_log_file='/data/logs/relay/relay-bin-dc2master.000002', relay_log_pos='277601', master_log_name='mariadb-bin.007022', master_log_pos='97085707', GTID '21-7521-7596336,0-7725-126,35-7735-937151,11-7511-6574751,17-7725-102939940,15-7715-29236023';
      160417 10:45:21 [ERROR] Master 'dc2master': Slave SQL: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Gtid 35-7735-937161, Internal MariaDB error code: 1950
      160417 10:45:21 [Warning] Master 'dc2master': Slave: An attempt was made to binlog GTID 35-7735-937161 which would create an out-of-order sequence number with existing GTID 35-7735-937161, and gtid strict mode is enabled. Error_code: 1950
      160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
      160417 10:45:21 [ERROR] Master 'dc2master': InnoDB: MySQL is freeing a thd though trx->n_mysql_tables_in_use is 1 and trx->mysql_n_tables_locked is 6.
      TRANSACTION 1521380481, not started
      mysql tables in use 1, locked 6
      MySQL thread id 348640, OS thread handle 0x7efc5a4da700, query id 138034243 Waiting for slave mutex on exit
      

      I've noticed some strange things about the latest failure at 10:45:21:

      1.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the slave is stopped at master_log_name=''mariadb-bin.007022' master_log_pos='97244987', GTID '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'

      160417 10:45:21 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 97244987; GTID position '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
      

      The strange thing is that GTID '35-7735-937161' is not present in the binary log until much later: master_log_name='mariadb-bin.007022' master_log_pos='97246276'

      # at 97246276
      #160417 10:33:08 server id 7735  end_log_pos 97246318 CRC32 0x0e07bf7f  GTID 35-7735-937161
      /*!100001 SET @@session.gtid_domain_id=35*//*!*/;
      /*!100001 SET @@session.server_id=7735*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=937161*//*!*/;
      

      If the slave supposedly failed executing an event at master_log_name='mariadb-bin.007022' master_log_pos='97246276', why would it stop all the way back at master_log_name='mariadb-bin.007022' master_log_pos='97244987'?

      2.) When the slave stops due to out-of-order sequence number with existing GTID 35-7735-937161, the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'. Notice that the sequence number for gtid_domain_id 35 is 937160, which is less than the out-of-order sequence number 937161.

      If the highest sequence number for the domain was only 937160 at the time, why would a sequence number of 937161 cause an out-of-order sequence number?

      These items also make me wonder if these strange observations also hold for the first failure that happened at 10:33, so I backtracked in the binlogs to look into it.

      1.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', it stops at position master_log_name='mariadb-bin.007022', master_log_pos='95365805'

      160417 10:33:07 [ERROR] Master 'dc2master': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mariadb-bin.007022' position 95365805; GTID position '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'
      

      This position does actually seem to be correct, since the GTID '35-7735-937125' is at that position:

      # at 95365805
      #160417 10:33:07 server id 7735  end_log_pos 95365847 CRC32 0x021ffb43  GTID 35-7735-937125
      /*!100001 SET @@session.gtid_seq_no=937125*//*!*/;
      BEGIN
      /*!*/;
      

      2.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'. Notice here that the sequence number for gtid_domain_id 35 is only 937124, so this out-of-order sequence error also seems to be unusual. Additionally, I can see that the transaction with the sequence number 937124 is directly preceding the failing one in the binlog:

      # at 95327974
      #160417 10:33:07 server id 7735  end_log_pos 95328016 CRC32 0x60f374c6  GTID 35-7735-937124
      /*!100001 SET @@session.gtid_seq_no=937124*//*!*/;
      BEGIN
      /*!*/;
      

      Let's see if observation #1 above holds for the other two failures at 10:44:43 and 10:44:57.

      The failure at 10:44:43 involved GTID '35-7735-937137', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='95860806'.

      This position also seems to be correct, since the GTID '35-7735-937137' is at the position:

      # at 95860806
      #160417 10:33:07 server id 7735  end_log_pos 95860848 CRC32 0x5a30278a  GTID 35-7735-937137
      /*!100001 SET @@session.gtid_seq_no=937137*//*!*/;
      BEGIN
      /*!*/;
      

      The failure at 10:44:57 involved GTID '35-7735-937151', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='96808225'.

      This GTID doesn't appear until position 96809133, a bit later than position 96808225.

      # at 96809133
      #160417 10:33:08 server id 7735  end_log_pos 96809175 CRC32 0xe44d2cfa  GTID 35-7735-937151
      /*!100001 SET @@session.gtid_domain_id=35*//*!*/;
      /*!100001 SET @@session.server_id=7735*//*!*/;
      /*!100001 SET @@session.gtid_seq_no=937151*//*!*/;
      BEGIN
      /*!*/;
      

      These observations make me wonder if the SQL thread's position was corrupted after the 10:44:57 failure. I know of at least one open bug (MDEV-9138) that could cause problems with the SQL thread position, but it does not seem similar to this problem.

      The user does not have parallel replication enabled.

      Some other observations:

      • Sometimes when this problem occurs, the mysqld process has to be killed because the slave thread, which is killed when the out-of-order sequence error occurs and no longer exists, still seems to have table locks and/or a slave mutex.
      • Other times when this problem occurs, the event can simply be skipped and the slave thread will happily restart.

      Attachments

        Issue Links

          Activity

            People

              Elkin Andrei Elkin
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              4 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.