[MDEV-15426] Event scheduler disabled on a slave possibly breaking replication Created: 2018-02-27  Updated: 2021-11-02  Resolved: 2019-01-22

Status: Closed
Project: MariaDB Server
Component/s: Events, Replication, Server
Affects Version/s: 10.1.31
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Manuel Arostegui Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: event, replication
Environment:

debian stretch


Issue Links:
Relates
relates to MDEV-17718 Setting event_scheduler to DISABLED c... Closed

 Description   

The environment is:

master:
mariadb 10.1.31
debian stretch
event_scheduler=ON

slave:
mariadb 10.1.31
debian stretch
event_scheduler=OFF

Replication: ROW based

The master was originally a standalone server and we wanted to add a slave.
To do so we stopped mariadb and copied all the content over to the future slave.

We do run events on the master and we didn't want those to run on the slave, as otherwise row based replication will break.

When setting up the slave we started it for the first time with event_scheduler=0
We started replication and it was working finely until an event was created on the master.
It broke replication with:

               Last_SQL_Error: Error 'Cannot proceed because system tables used by Event Scheduler were found damaged at server start' on query. Default database: 'tendril'. Query: 'CREATE DEFINER=`root`@`10.64.32.13` event db2093_codfw_wmnet_3306_schema_prep
  on schedule every 1 day starts date(now()) + interval floor(rand() * 23) hour
  do begin
 
    select @server_id := id, @enabled := enabled from servers where host = 'db2093.codfw.wmnet' and port = 3306;
 
    delete from schemata_check where server_id = @server_id;
    insert into schemata_check select @server_id, t.schema_name from db2093_codfw_wmnet_3306_schemata t;
 
  end'

At first we thought the mysql.event  table could be corrupted because of the data transfer between the two servers. But it wasn't or doesn't look like it (as per check table event command)
These are all the actions we have tried to try to get this fixed.

  • Repair event table + restarting mariadb
  • Analyze event table + restarting mariadb
  • Dropping + creating the event table + restarting mariadb
  • mysqldump from the master + load on the slave + restarting mariadb
  • altering event table to make it InnoDB + restarting mariadb
  • mysql_upgrade --force -s

The table doesn't look like it is corrupted:

mysql:root@localhost [mysql]> check table event;
+-------------+-------+----------+----------+
| Table       | Op    | Msg_type | Msg_text |
+-------------+-------+----------+----------+
| mysql.event | check | status   | OK       |
+-------------+-------+----------+----------+
1 row in set (0.05 sec)
 
mysql:root@localhost [mysql]> select count(*) from event;
+----------+
| count(*) |
+----------+
|     1812 |
+----------+
1 row in set (0.00 sec)
 
mysql:root@localhost [mysql]> select * from event limit 1\G
*************************** 1. row ***************************
                  db: tendril
                name: es1011_eqiad_wmnet_3306_usage
                body: begin
 
    if (get_lock('es1011_eqiad_wmnet_3306_usage', 1) = 0) then
        signal sqlstate value '45000' set message_text = 'get_lock';
    end if;

Yet mariadb thinks so:

 
mysql:root@localhost [tendril]> show events;
ERROR 1577 (HY000): Cannot proceed because system tables used by Event Scheduler were found damaged at server start

And upon server restart

Feb 27 13:26:03 db2093 mysqld[6921]: 2018-02-27 13:26:03 139909807241984 [ERROR] Slave SQL: Error 'Cannot proceed because system tables used by Event Scheduler were found damaged at server start' on query. Default database: 'tendril'. Query: 'CREATE DEFINER=`root`@`10.64.
Feb 27 13:26:03 db2093 mysqld[6921]:   on schedule every 1 day starts date(now()) + interval floor(rand() * 23) hour
Feb 27 13:26:03 db2093 mysqld[6921]:   do begin
Feb 27 13:26:03 db2093 mysqld[6921]:     select @server_id := id, @enabled := enabled from servers where host = 'db2093.codfw.wmnet' and port = 3306;
Feb 27 13:26:03 db2093 mysqld[6921]:     delete from schemata_check where server_id = @server_id;
Feb 27 13:26:03 db2093 mysqld[6921]:     insert into schemata_check select @server_id, t.schema_name from db2093_codfw_wmnet_3306_schemata t;
Feb 27 13:26:03 db2093 mysqld[6921]:   end', Gtid 171966586-171966586-319085319, Internal MariaDB error code: 1577
Feb 27 13:26:03 db2093 mysqld[6921]: 2018-02-27 13:26:03 139909807241984 [Warning] Slave: Cannot proceed because system tables used by Event Scheduler were found damaged at server start Error_code: 1577

This looks pretty similar to: https://bugs.mysql.com/bug.php?id=70975 which affects 5.7 and is still open.

We could probably try to see if this doesn't happen if the slave has event_scheduler=ON (even though replication will be broken because of data drifts)



 Comments   
Comment by Manuel Arostegui [ 2018-02-28 ]

The event table

mysql:root@localhost [mysql]> show create table event\G
*************************** 1. row ***************************
       Table: event
Create Table: CREATE TABLE `event` (
  `db` char(64) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
  `name` char(64) NOT NULL DEFAULT '',
  `body` longblob NOT NULL,
  `definer` char(141) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
  `execute_at` datetime DEFAULT NULL,
  `interval_value` int(11) DEFAULT NULL,
  `interval_field` enum('YEAR','QUARTER','MONTH','DAY','HOUR','MINUTE','WEEK','SECOND','MICROSECOND','YEAR_MONTH','DAY_HOUR','DAY_MINUTE','DAY_SECOND','HOUR_MINUTE','HOUR_SECOND','MINUTE_SECOND','DAY_MICROSECOND','HOUR_MICROSECOND','MINUTE_MICROSECOND','SECOND_MICROSECOND') DEFAULT NULL,
  `created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `modified` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `last_executed` datetime DEFAULT NULL,
  `starts` datetime DEFAULT NULL,
  `ends` datetime DEFAULT NULL,
  `status` enum('ENABLED','DISABLED','SLAVESIDE_DISABLED') NOT NULL DEFAULT 'ENABLED',
  `on_completion` enum('DROP','PRESERVE') NOT NULL DEFAULT 'DROP',
  `sql_mode` set('REAL_AS_FLOAT','PIPES_AS_CONCAT','ANSI_QUOTES','IGNORE_SPACE','IGNORE_BAD_TABLE_OPTIONS','ONLY_FULL_GROUP_BY','NO_UNSIGNED_SUBTRACTION','NO_DIR_IN_CREATE','POSTGRESQL','ORACLE','MSSQL','DB2','MAXDB','NO_KEY_OPTIONS','NO_TABLE_OPTIONS','NO_FIELD_OPTIONS','MYSQL323','MYSQL40','ANSI','NO_AUTO_VALUE_ON_ZERO','NO_BACKSLASH_ESCAPES','STRICT_TRANS_TABLES','STRICT_ALL_TABLES','NO_ZERO_IN_DATE','NO_ZERO_DATE','INVALID_DATES','ERROR_FOR_DIVISION_BY_ZERO','TRADITIONAL','NO_AUTO_CREATE_USER','HIGH_NOT_PRECEDENCE','NO_ENGINE_SUBSTITUTION','PAD_CHAR_TO_FULL_LENGTH') NOT NULL DEFAULT '',
  `comment` char(64) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '',
  `originator` int(10) unsigned NOT NULL,
  `time_zone` char(64) CHARACTER SET latin1 NOT NULL DEFAULT 'SYSTEM',
  `character_set_client` char(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `collation_connection` char(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `db_collation` char(32) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  `body_utf8` longblob,
  PRIMARY KEY (`db`,`name`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COMMENT='Events'

And this is the full error log during the start:

Feb 27 13:25:58 db2093 systemd[1]: Starting MariaDB database server...
Feb 27 13:25:58 db2093 mysqld[6921]: 2018-02-27 13:25:58 139909886306560 [Note] /opt/wmf-mariadb101/bin/mysqld (mysqld 10.1.31-MariaDB) starting as process 6921 ...
Feb 27 13:25:58 db2093 mysqld[6921]: 2018-02-27 13:25:58 139909886306560 [Warning] No argument was provided to --log-bin and neither --log-basename or --log-bin-index where used;  This may cause repliction to break when this server acts as a master and has its hostname ch
Feb 27 13:25:58 db2093 mysqld[6921]: 2018-02-27 13:25:58 139909886306560 [ERROR] Plugin 'SEQUENCE' already installed
Feb 27 13:25:58 db2093 mysqld[6921]: 2018-02-27 13:25:58 139909886306560 [Note] CONNECT: Version 1.06.0005 October 14, 2017
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: The InnoDB memory heap is disabled
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: Compressed tables use zlib 1.2.3
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: Using Linux native AIO
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: Using SSE crc32 instructions
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: Initializing buffer pool, size = 10.0G
Feb 27 13:25:59 db2093 mysqld[6921]: 2018-02-27 13:25:59 139909886306560 [Note] InnoDB: Completed initialization of buffer pool
Feb 27 13:26:00 db2093 mysqld[6921]: 2018-02-27 13:26:00 139909886306560 [Note] InnoDB: Highest supported file format is Barracuda.
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139909886306560 [Note] InnoDB: 128 rollback segment(s) are active.
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139909886306560 [Note] InnoDB: Waiting for purge to start
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139909886306560 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-83.0 started; log sequence number 318187207646484
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139894462576384 [Note] InnoDB: Dumping buffer pool(s) not yet started
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 7f3bb7bf7700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139909886306560 [Note] Plugin 'FEEDBACK' is disabled.
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139909886306560 [Note] Server socket created on IP: '::'.
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139909886306560 [Warning] 'proxies_priv' entry '@% root@db1011' ignored in --skip-name-resolve mode.
Feb 27 13:26:01 db2093 mysqld[6921]: 2018-02-27 13:26:01 139909886306560 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--log-basename=#' or '--relay
Feb 27 13:26:02 db2093 mysqld[6921]: 2018-02-27 13:26:02 139909807241984 [Note] Slave SQL thread initialized, starting replication in log 'db1115-bin.000296' at position 108070704, relay log './db2093-relay-bin.000001' position: 4; GTID position '171966586-171966586-31908
Feb 27 13:26:02 db2093 mysqld[6921]: 2018-02-27 13:26:02 139909886306560 [Note] /opt/wmf-mariadb101/bin/mysqld: ready for connections.
Feb 27 13:26:02 db2093 mysqld[6921]: Version: '10.1.31-MariaDB'  socket: '/run/mysqld/mysqld.sock'  port: 3306  MariaDB Server
Feb 27 13:26:02 db2093 systemd[1]: Started MariaDB database server.
Feb 27 13:26:02 db2093 mysqld[6921]: 2018-02-27 13:26:02 139909885184768 [Note] Slave I/O thread: connected to master 'repl@db1115.eqiad.wmnet:3306',replication starts at GTID position '171966586-171966586-319085318'
Feb 27 13:26:03 db2093 mysqld[6921]: 2018-02-27 13:26:03 139909807241984 [ERROR] Slave SQL: Error 'Cannot proceed because system tables used by Event Scheduler were found damaged at server start' on query. Default database: 'tendril'. Query: 'CREATE DEFINER=`root`@`10.64.
Feb 27 13:26:03 db2093 mysqld[6921]:   on schedule every 1 day starts date(now()) + interval floor(rand() * 23) hour
Feb 27 13:26:03 db2093 mysqld[6921]:   do begin
Feb 27 13:26:03 db2093 mysqld[6921]:     select @server_id := id, @enabled := enabled from servers where host = 'db2093.codfw.wmnet' and port = 3306;
Feb 27 13:26:03 db2093 mysqld[6921]:     delete from schemata_check where server_id = @server_id;
Feb 27 13:26:03 db2093 mysqld[6921]:     insert into schemata_check select @server_id, t.schema_name from db2093_codfw_wmnet_3306_schemata t;
Feb 27 13:26:03 db2093 mysqld[6921]:   end', Gtid 171966586-171966586-319085319, Internal MariaDB error code: 1577
Feb 27 13:26:03 db2093 mysqld[6921]: 2018-02-27 13:26:03 139909807241984 [Warning] Slave: Cannot proceed because system tables used by Event Scheduler were found damaged at server start Error_code: 1577
Feb 27 13:26:03 db2093 mysqld[6921]: 2018-02-27 13:26:03 139909807241984 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'db1115-bin.000296' position 108070704; GTID position '1
Feb 27 13:26:03 db2093 mysqld[6921]: 2018-02-27 13:26:03 139909807241984 [Note] Slave SQL thread exiting, replication stopped in log 'db1115-bin.000296' at position 108070704; GTID position '171966586-171966586-319085318'
Feb 27 13:26:42 db2093 mysqld[6921]: 2018-02-27 13:26:42 7f3bb7bf7700 InnoDB: Buffer pool(s) load completed at 180227 13:26:42
 

Comment by Manuel Arostegui [ 2018-02-28 ]

So I have done another test.
As kinda expected, if we start the slave with the event_scheduler enabled, there is no complain about broken system tables.
Unfortunately, as I stated, this breaks replication as the scheduler deletes stuff and when the deletion comes from the master, when using ROW based replication, the row is no longer there and it breaks.

If I stop the slave, disable the scheduler, and start it again, the problem is gone. Unfortunately, this is too late as replication is broken already.

So with this bug, there is no way to have a slave, cloned directly, from the master (with events), and start it from 0 with the event_scheduler disabled.

Comment by Elena Stepanova [ 2018-03-05 ]

When you start the slave, what exactly does SELECT @@event_scheduler say, even before any hurtful binlog events come from the master? Is it OFF, or is it DISABLED? There is a difference between these two values.

Comment by Manuel Arostegui [ 2018-03-06 ]

It was showing DISABLED as far as I remember.

Comment by Elena Stepanova [ 2018-03-06 ]

Okay, that makes a difference, for better or worse, DISABLED is supposed to do that. The question is how you got DISABLED – the error log doesn't show anything that would explain that. Could you please double- and triple-check that you don't have it anywhere in config files?

Comment by Manuel Arostegui [ 2018-03-08 ]

I checked all over and it is not on config files or systemctl variables.
But the error log does complain about it:

Feb 27 13:26:03 db2093 mysqld[6921]: 2018-02-27 13:26:03 139909807241984 [Warning] Slave: Cannot proceed because system tables used by Event Scheduler were found damaged at server start Error_code: 1577

Even if it doesn't clearly says it has been switched off, the strange part is:

 
mysql:root@localhost [mysql]> check table event;
+-------------+-------+----------+----------+
| Table       | Op    | Msg_type | Msg_text |
+-------------+-------+----------+----------+
| mysql.event | check | status   | OK       |
+-------------+-------+----------+----------+
1 row in set (0.05 sec)
 
mysql:root@localhost [mysql]> select count(*) from event;
+----------+
| count(*) |
+----------+
|     1812 |
+----------+
1 row in set (0.00 sec)
 
mysql:root@localhost [mysql]> select * from event limit 1\G
*************************** 1. row ***************************
                  db: tendril
                name: es1011_eqiad_wmnet_3306_usage
                body: begin
 
    if (get_lock('es1011_eqiad_wmnet_3306_usage', 1) = 0) then
        signal sqlstate value '45000' set message_text = 'get_lock';
    end if;
 
 
mysql:root@localhost [tendril]> show events;
ERROR 1577 (HY000): Cannot proceed because system tables used by Event Scheduler were found damaged at server start

Comment by Elena Stepanova [ 2018-03-08 ]

But the error log does complain about it:

I know, but that's the effect, not the cause. This (admittedly stupid) error message is issued whenever event_scheduler is set to DISABLED and you try to do something event-related. You can get it by simply starting server with event_scheduler=DISABLED and running CREATE EVENT:

MariaDB [test]> select @@event_scheduler;
+-------------------+
| @@event_scheduler |
+-------------------+
| DISABLED          |
+-------------------+
1 row in set (0.00 sec)
 
MariaDB [test]> create event ev on schedule every 1 minute do select 1 into @a;
ERROR 1577 (HY000): Cannot proceed because system tables used by Event Scheduler were found damaged at server start

The question is why it got disabled to begin with. The easy way to achieve it is having it in the config files, that's why I asked about it.
It can also be disabled upon server startup, but I'd expect it to produce an error message saying that it gets disabled, and why.

Now it occurred to me – it might be producing the message at the very beginning of startup, before it starts writing to the error log. Could you please look around, I'm not sure where it writes those first few lines in your case – journal, messages, syslog, stdout, etc., depending on how you start it?

Comment by Manuel Arostegui [ 2018-03-14 ]

There is nothing around the logs that indicates why the event scheduler has been disabled or stopped.
I have looked all over the logs and possible logs sources.
All the stuff that happens during startup is posted at https://jira.mariadb.org/browse/MDEV-15426?focusedCommentId=107788&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-107788

I haven't found anything more than that.

Comment by Elena Stepanova [ 2018-03-17 ]

Could you please paste the output of

select * from information_schema.system_variables where variable_name='event_scheduler' \G

Comment by Manuel Arostegui [ 2018-03-23 ]

That output will not be relevant anymore, as we fixed the system already as stated at: https://jira.mariadb.org/browse/MDEV-15426?focusedCommentId=107796&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-107796

Comment by Elena Stepanova [ 2019-01-22 ]

Unfortunately, without more details on how to reproduce it, there isn't anything we can work with. I've checked the basic scenario again – it does work: slave created by copying the initial data from the master, master running with event scheduler = ON, slave running with event_scheduler = OFF – nothing bad happens on replicated event creation. We are missing something.
With what we can see (no errors in the log and event_scheduler set to DISABLED, the most likely reason is that it's done somewhere in server configuration upon startup, maybe there is an extra configuration file which wasn't noticed;
Hypothetically, there might something non-standard in configuration which causes it (possibly reveals an actual bug), or some error messages about disabling event scheduler didn't make it to the error log on whatever reason.

If somebody else experiences this problem, please comment, preferably including right away all configuration files, exact server version(s), and the output of select * from information_schema.system_variables where variable_name='event_scheduler'.

Generated at Thu Feb 08 08:21:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.