[MDEV-13577] slave_parallel_mode=optimistic should not report the mode's specific temporary errors Created: 2017-08-18  Updated: 2020-08-25  Resolved: 2018-06-12

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.1, 10.1.22, 10.1.25, 10.2
Fix Version/s: 10.1.34, 10.2.16, 10.3.8

Type: Bug Priority: Major
Reporter: Sergey Chernomorets Assignee: Andrei Elkin
Resolution: Fixed Votes: 1
Labels: None
Environment:

centos7 x86_64


Issue Links:
Duplicate
duplicates MDEV-16468 Can't find record in optimistic repli... Closed
Relates
relates to MDEV-20628 If temporary error occurs with optimi... Closed
Sprint: 10.2.14, 10.3.6-1

 Description   

I have server mariadb-10.1.22 with slave_parallel_mode=optimistic and see (bogus?) errors in log:

2017-08-16  9:34:05 139902864206592 [ERROR] mysqld: Can't find record in 'hr_user_request_log'
2017-08-16  9:36:03 139902865722112 [ERROR] mysqld: Can't find record in 'sms_log'
2017-08-16  9:43:11 139902864812800 [ERROR] mysqld: Can't find record in 'sms_log'
2017-08-17  3:14:05 139902864812800 [ERROR] mysqld: Can't find record in 'sms_log'
2017-08-17  4:01:17 139902864206592 [ERROR] mysqld: Can't find record in 'sms_log'
2017-08-17 10:02:02 139902865419008 [ERROR] mysqld: Can't find record in 'sms_log'
2017-08-17 10:14:06 139902864206592 [ERROR] mysqld: Can't find record in 'reg_user_tracked_stat'
2017-08-17 11:06:03 139902865419008 [ERROR] mysqld: Can't find record in 'cv_journal'

Replication is not crash and still works.

CREATE TABLE `sms_log` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `sms_type` enum('','admin_login','admin_vpn','resume_buy','after_registration_reg_user','after_registration_hr','hr_welcome_after_registration','hr_start_after_registration','password_recovery','phone_confirmation','friends_resume','friends_resume_app','invite_reg_user','response_to_a_demo_vacancy_no_read','verification_code','my_resume','link_to_app','hr_basic_500_after_registration','small_town_after_registration','small_region_after_registration','hr_promotional') DEFAULT NULL,
  `receiver_type` enum('hr','reg_user','guest','admin') NOT NULL DEFAULT 'guest',
  `sms_text` mediumtext NOT NULL,
  `date_create` datetime NOT NULL,
  `date_sent` datetime NOT NULL,
  `sms_link_clicked` tinyint(3) NOT NULL DEFAULT '0',
  `sms_source` enum('desktop','admin','api','mobile_app_reg_user','mobile_app_hr','mobile_site_reg_user','mobile_site_hr','cron') DEFAULT NULL,
  `id_town` smallint(5) unsigned NOT NULL DEFAULT '0',
  `mobile_operator` enum('mts','beeline','megafon','tele2','other') NOT NULL DEFAULT 'other',
  `phone` varchar(20) NOT NULL DEFAULT '',
  `id_receiver` int(10) unsigned NOT NULL DEFAULT '0',
  `sms_gate` enum('superjobvip','superjobnew','devinotele') NOT NULL DEFAULT 'superjobnew',
  `sender_type` enum('hr','reg_user','guest','admin') DEFAULT NULL,
  `id_sender` int(10) DEFAULT NULL,
  `id_client` int(10) DEFAULT NULL,
  `last_error` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `date_create` (`date_create`),
  KEY `date_sent` (`date_sent`),
  KEY `mobile_operator` (`mobile_operator`),
  KEY `id_receiver` (`id_receiver`),
  KEY `id_sender` (`sender_type`,`id_client`,`id_sender`)
) ENGINE=InnoDB AUTO_INCREMENT=13685411 DEFAULT CHARSET=utf8

binlog:

# at 460866962
#170817  3:14:05 server id 86  end_log_pos 460867000    GTID 0-86-586027225 trans
/*!100001 SET @@session.gtid_seq_no=586027225*//*!*/;
BEGIN
/*!*/;
# at 460867000
# at 460867086
#170817  3:14:05 server id 86  end_log_pos 460867206    Write_rows: table id 25 flags: STMT_END_F
### INSERT INTO `site_stat`.`sms_log`
### SET
###   @1=13676743
###   @2=10
###   @3=3
###   @4='---cleared---'
###   @5=2017-08-17 03:14:05
###   @6=0000-00-00 00:00:00
###   @7=0
###   @8=1
###   @9=0
###   @10=4
###   @11='79595959595'
###   @12=0
###   @13=3
###   @14=3
###   @15=NULL
###   @16=NULL
###   @17=''
# at 460867206
#170817  3:14:05 server id 86  end_log_pos 460867233    Xid = 678520262
COMMIT/*!*/;
# at 460867233
#170817  3:14:05 server id 86  end_log_pos 460867273    GTID 0-86-586027226 cid=678520266 trans
/*!100001 SET @@session.gtid_seq_no=586027226*//*!*/;
BEGIN
/*!*/;
# at 460867273
# at 460867359
#170817  3:14:05 server id 86  end_log_pos 460867571    Update_rows: table id 25 flags: STMT_END_F
### UPDATE `site_stat`.`sms_log`
### WHERE
###   @1=13676743
###   @2=10
###   @3=3
###   @4='---cleared---'
###   @5=2017-08-17 03:14:05
###   @6=0000-00-00 00:00:00
###   @7=0
###   @8=1
###   @9=0
###   @10=4
###   @11='79595959595'
###   @12=0
###   @13=3
###   @14=3
###   @15=NULL
###   @16=NULL
###   @17=''
### SET
###   @1=13676743
###   @2=10
###   @3=3
###   @4='---cleared---'
###   @5=2017-08-17 03:14:05
###   @6=2017-08-17 03:14:05
###   @7=0
###   @8=1
###   @9=0
###   @10=4
###   @11='79595959595'
###   @12=0
###   @13=3
###   @14=3
###   @15=NULL
###   @16=NULL
###   @17=''
# at 460867571
#170817  3:14:05 server id 86  end_log_pos 460867598    Xid = 678520266
COMMIT/*!*/;

MariaDB [(none)]> show variables like 'slave%';
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| slave_compressed_protocol     | OFF        |
| slave_ddl_exec_mode           | IDEMPOTENT |
| slave_domain_parallel_threads | 0          |
| slave_exec_mode               | STRICT     |
| slave_load_tmpdir             | /base/tmp  |
| slave_max_allowed_packet      | 1073741824 |
| slave_net_timeout             | 3600       |
| slave_parallel_max_queued     | 131072     |
| slave_parallel_mode           | optimistic |
| slave_parallel_threads        | 4          |
| slave_run_triggers_for_rbr    | NO         |
| slave_skip_errors             | OFF        |
| slave_sql_verify_checksum     | ON         |
| slave_transaction_retries     | 10         |
| slave_type_conversions        |            |
+-------------------------------+------------+



 Comments   
Comment by Elena Stepanova [ 2017-08-31 ]

The error as such is easily reproducible with a simple concurrent test which inserts / updates rows with the same PK (see the test below). The result seems to be all right at the end, no inconsistencies; so, I think that the appearance of these errors in optimistic mode is most likely expected – by definition any DML is allowed to run in parallel, so it can happen that UPDATE is attempted before INSERT, causes an error and is retried later.

I will assign it to Elkin to confirm (or not) that this behavior is indeed expected. Also, if it is, maybe it makes sense to convert the error into a warning or even a note. It is printed on warning level 2, which has become a default in 10.2.

mdev13577.yy

query_init:
  CREATE TABLE IF NOT EXISTS `t1` (
    `id` BIGINT UNSIGNED PRIMARY KEY,
    `c` char(120) NOT NULL default ''
  ) ENGINE=InnoDB
;
 
query:
  update | delete | insert | insert;
 
insert:
  INSERT IGNORE INTO `t1` (`id`) VALUES ( UNIX_TIMESTAMP() );
 
update:
  UPDATE `t1` SET `c` = _string WHERE `id` = UNIX_TIMESTAMP();
 
delete:
    DELETE FROM `t1` WHERE `id` = UNIX_TIMESTAMP();

To reproduce,

git clone https://github.com/elenst/rqg --branch experimental rqg-mdev13577
cd rqg-mdev13577
# Place the grammar above somewhere, e.g. ./mdev13577.yy
perl ./runall-new.pl \
--threads=8 --duration=600 --queries=100M \
--grammar=mdev13577.yy --skip-gendata \
--rpl-mode=row --mysqld=--slave-parallel-mode=optimistic --mysqld=--slave-parallel-threads=4 \
--mysqld=--log-warnings=2 \
--vardir=/tmp/mdev13577 \
--basedir=<your basedir>

(change paths as needed).
The slave error log will be /tmp/mdev13577_slave/mysqld.err or alike, depending on the vardir path that you choose.

Comment by Kristian Nielsen [ 2017-11-27 ]

Elena is right, these errors will occur regularly internally for optimistic parallel replication, and will be handled by retrying the failed transactions.

But I think the error should never appear in the error log. An error that is handled by optimistic parallel replication as a temporary error that merely causes a transaction retry should not be put into the logs. There is no actual error, and the user cannot do anything about it anyway - and there is a retry count status available to monitor this behaviour.

Maybe historycally, slave retries (in single-threaded replication) were logged like this, and so optimistic parallel replication logs retries the same way - but it seems wrong, and should be safe to silence.

Comment by Elena Stepanova [ 2017-11-27 ]

I am also voting for not logging these pseudo-errors, we have enough noise in the error log as it is, and if there is nothing users are expected to do, there is no point alerting them. (I also have my own agenda, since the error messages create false positives in my tests, which I have to remember about and work around).

Comment by Andrei Elkin [ 2018-06-06 ]

Sergei, salute.

I start with offering you to review the patch whose main accent is
on my_message_sql(). knielsen would be the 2nd option though I understood he
apparently okayed an idea to downgrade unnecessary error level in
case of the slave applier running optimistically.

I hope you'll find time to check out.
elenst needs this patch which hurts RockDB testing.

Cheers,

Andrei

Comment by Andrei Elkin [ 2018-06-12 ]

Fixes are pushed as 7bbe324fc17 commit.

Generated at Thu Feb 08 08:06:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.