[MDEV-28681] bin log interrupted and hung for minutes after transaction commit - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Won't Fix
Affects Version/s: 10.4.21
Fix Version/s: N/A
Component/s: Data Manipulation - Update, Platform RedHat, Storage Engine - InnoDB
Labels:
None

Description

*Hi,

In our production lab we found one issue that the bin log update got interrupted and hung there for minutes after C++ API mysql_query with "commit" param returned success.

Timeline:*

1. Transaction committed by calling of mysql_query with "commit" param which returned success (product log which confirmed successful return of mysql_query):

+++ 2022/05/26 11:32:46.065 DB MEDIUM remotedb:11310 E:105110 S:2657 (RemotedbNotify.cpp 1026 v-0001 169.254.64.1 oam-a)
INFO: Completed notify connection attempt.

2. bin log started to be updated, but only updated the first 3 records:

at 2296562
#220526 11:32:46 server id 1 end_log_pos 2296604 CRC32 0x30537424 GTID 0-1-29252 trans
/Unable to render embedded object: File (100001 SET @@session.gtid_seq_no=29252*//*) not found./;
START TRANSACTION
/!/;
at 2296604
#220526 11:32:46 server id 1 end_log_pos 2297037 CRC32 0xed2efe0a Query thread_id=11783 exec_time=0 error_code=0
SET TIMESTAMP=1653564766/!/;
update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4101000000000000000000000000000000', schema_class = 'cfg' where operation = 1 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
/!/;
at 2297037
#220526 11:32:46 server id 1 end_log_pos 2297470 CRC32 0xa716708c Query thread_id=11783 exec_time=0 error_code=0
SET TIMESTAMP=1653564766/!/;
update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4101000000000000000000000000000000', schema_class = 'cfg' where operation = 2 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
/!/;
at 2297470
#220526 11:32:46 server id 1 end_log_pos 2297903 CRC32 0x3144ec9e Query thread_id=11783 exec_time=0 error_code=0
SET TIMESTAMP=1653564766/!/;
update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4101000000000000000000000000000000', schema_class = 'cfg' where operation = 3 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
/!/;
at 2297903
#220526 11:32:46 server id 1 end_log_pos 2297934 CRC32 0x3214e514 Xid = 249264
COMMIT/!/;

*3. About >6m later, it started to write bin log for the remaining records for that transaction (Note in-between there were bin logs writes for other transactions): *

at 2443522
#220526 11:39:18 server id 1 end_log_pos 2443961 CRC32 0xd83cc209 Query thread_id=72 exec_time=0 error_code=0
SET TIMESTAMP=1653565158/!/;
update db01host.notify_remdb_operation set process_id = 11312, objlen = 824, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4001000000000000000000000000000000', schema_class = 'cfg' where operation = 1 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'gateway_family_base' and process_name = 'SSPcntl_agt'
/!/;
......
at 2498796
#220526 11:39:20 server id 1 end_log_pos 2499229 CRC32 0x2b4c10a9 Query thread_id=72 exec_time=0 error_code=0
SET TIMESTAMP=1653565160/!/;
update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4001000000000000000000000000000000', schema_class = 'cfg' where operation = 3 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
/!/;

Then finally the DB contents were overwritten incorrectly by the delayed bin log updates causing product failures.

*Mariadb Server version: 10.4.21-MariaDB-log Source distribution

Attached config file my.cnf and bin log as well as decoded bin log.

Your advice is highly appreciated.*

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

bin.log
5.06 MB
2022-05-27 10:51
my.cnf
3 kB
2022-05-27 10:51
mysql-lcpdb-bin.000003
4.38 MB
2022-05-27 10:52

Activity

Ascending order - Click to sort in descending order

View 5 older comments

Andy Liu added a comment - 2022-07-26 01:29

Thanks for the advice.
We'll try to reproduce with general log enabled for further clue. Meanwhile we'll be also looking at this from product perspective.

Andy Liu added a comment - 2022-07-26 01:29 Thanks for the advice. We'll try to reproduce with general log enabled for further clue. Meanwhile we'll be also looking at this from product perspective.

Elena Stepanova added a comment - 2022-08-24 23:00

liberiul, any new discoveries?

Elena Stepanova added a comment - 2022-08-24 23:00 liberiul , any new discoveries?

Andy Liu added a comment - 2022-08-25 01:49

It's found that the replication got delayed significantly under extreme high CPU usage. The statement executed on master host was not applied to slave host until minutes later. The delayed replication events finally overrode the statements queried to slave and caused the issue.

Would you advise:
1) Are there any tweaking options to improve mariadb performance or reduce CPU usage? It is noticed that the commit may take long time to finish (>1m).
2) How to assure the replication timing based on timestamp?
e.g.
Event A: master update tuple T at 1:00;
Event B: slave updates tuple T at 1:02;
The event A should be replicated and applied to slave no later than event B. Otherwise the DB value on slave would be mistakenly overridden by the obsolete one.

Andy Liu added a comment - 2022-08-25 01:49 It's found that the replication got delayed significantly under extreme high CPU usage. The statement executed on master host was not applied to slave host until minutes later. The delayed replication events finally overrode the statements queried to slave and caused the issue. Would you advise: 1) Are there any tweaking options to improve mariadb performance or reduce CPU usage? It is noticed that the commit may take long time to finish (>1m). 2) How to assure the replication timing based on timestamp? e.g. Event A: master update tuple T at 1:00; Event B: slave updates tuple T at 1:02; The event A should be replicated and applied to slave no later than event B. Otherwise the DB value on slave would be mistakenly overridden by the obsolete one.

Andrei Elkin added a comment - 2022-08-25 13:21

liberiul, to the 2nd of the two, it looks like you'd be interested in master_wait_gtid. That is the B's updater would invoke the function prior to submit the update. That obviously requires A and B updaters coordination on the matter of the sequence number of replication events. (B must be aware of the last A's update's gtid).

To the 1st question, it looks like the request is about the slave threads scheduling prioritization (which we don't provide unfortunately) unless your whole slave server was busy.

Andrei Elkin added a comment - 2022-08-25 13:21 liberiul , to the 2nd of the two, it looks like you'd be interested in master_wait_gtid . That is the B's updater would invoke the function prior to submit the update. That obviously requires A and B updaters coordination on the matter of the sequence number of replication events. (B must be aware of the last A's update's gtid). To the 1st question, it looks like the request is about the slave threads scheduling prioritization (which we don't provide unfortunately) unless your whole slave server was busy.

Andy Liu added a comment - 2022-08-29 03:34

Thanks for the explanation. This ticket can be closed.

Andy Liu added a comment - 2022-08-29 03:34 Thanks for the explanation. This ticket can be closed.

MariaDB Server

bin log interrupted and hung for minutes after transaction commit

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration