Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28681

bin log interrupted and hung for minutes after transaction commit

Details

    Description

      *Hi,

      In our production lab we found one issue that the bin log update got interrupted and hung there for minutes after C++ API mysql_query with "commit" param returned success.

      Timeline:*

      1. Transaction committed by calling of mysql_query with "commit" param which returned success (product log which confirmed successful return of mysql_query):

      +++ 2022/05/26 11:32:46.065 DB MEDIUM remotedb:11310 E:105110 S:2657 (RemotedbNotify.cpp 1026 v-0001 169.254.64.1 oam-a)
      INFO: Completed notify connection attempt.

      2. bin log started to be updated, but only updated the first 3 records:

      1. at 2296562
        #220526 11:32:46 server id 1 end_log_pos 2296604 CRC32 0x30537424 GTID 0-1-29252 trans
        /Unable to render embedded object: File (100001 SET @@session.gtid_seq_no=29252*//*) not found./;
        START TRANSACTION
        /!/;
      2. at 2296604
        #220526 11:32:46 server id 1 end_log_pos 2297037 CRC32 0xed2efe0a Query thread_id=11783 exec_time=0 error_code=0
        SET TIMESTAMP=1653564766/!/;
        update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4101000000000000000000000000000000', schema_class = 'cfg' where operation = 1 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
        /!/;
      3. at 2297037
        #220526 11:32:46 server id 1 end_log_pos 2297470 CRC32 0xa716708c Query thread_id=11783 exec_time=0 error_code=0
        SET TIMESTAMP=1653564766/!/;
        update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4101000000000000000000000000000000', schema_class = 'cfg' where operation = 2 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
        /!/;
      4. at 2297470
        #220526 11:32:46 server id 1 end_log_pos 2297903 CRC32 0x3144ec9e Query thread_id=11783 exec_time=0 error_code=0
        SET TIMESTAMP=1653564766/!/;
        update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4101000000000000000000000000000000', schema_class = 'cfg' where operation = 3 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
        /!/;
      5. at 2297903
        #220526 11:32:46 server id 1 end_log_pos 2297934 CRC32 0x3214e514 Xid = 249264
        COMMIT/!/;

      *3. About >6m later, it started to write bin log for the remaining records for that transaction (Note in-between there were bin logs writes for other transactions): *

      1. at 2443522
        #220526 11:39:18 server id 1 end_log_pos 2443961 CRC32 0xd83cc209 Query thread_id=72 exec_time=0 error_code=0
        SET TIMESTAMP=1653565158/!/;
        update db01host.notify_remdb_operation set process_id = 11312, objlen = 824, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4001000000000000000000000000000000', schema_class = 'cfg' where operation = 1 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'gateway_family_base' and process_name = 'SSPcntl_agt'
        /!/;
        ......
      2. at 2498796
        #220526 11:39:20 server id 1 end_log_pos 2499229 CRC32 0x2b4c10a9 Query thread_id=72 exec_time=0 error_code=0
        SET TIMESTAMP=1653565160/!/;
        update db01host.notify_remdb_operation set process_id = 11312, objlen = 48, flag = 0, socket = 171, retry_count = 0, host_ip = x'00a9fe4001000000000000000000000000000000', schema_class = 'cfg' where operation = 3 and notification_type = 2 and remote_ip = x'00a9fe4001000000000000000000000000000000' and table_name = 'resource_parms' and process_name = 'SSPcntl_agt'
        /!/;

      Then finally the DB contents were overwritten incorrectly by the delayed bin log updates causing product failures.

      *Mariadb Server version: 10.4.21-MariaDB-log Source distribution

      Attached config file my.cnf and bin log as well as decoded bin log.

      Your advice is highly appreciated.*

      Attachments

        1. bin.log
          5.06 MB
        2. my.cnf
          3 kB
        3. mysql-lcpdb-bin.000003
          4.38 MB

        Activity

          liberiul Andy Liu added a comment -

          Thanks for the advice.
          We'll try to reproduce with general log enabled for further clue. Meanwhile we'll be also looking at this from product perspective.

          liberiul Andy Liu added a comment - Thanks for the advice. We'll try to reproduce with general log enabled for further clue. Meanwhile we'll be also looking at this from product perspective.

          liberiul, any new discoveries?

          elenst Elena Stepanova added a comment - liberiul , any new discoveries?
          liberiul Andy Liu added a comment -

          It's found that the replication got delayed significantly under extreme high CPU usage. The statement executed on master host was not applied to slave host until minutes later. The delayed replication events finally overrode the statements queried to slave and caused the issue.

          Would you advise:
          1) Are there any tweaking options to improve mariadb performance or reduce CPU usage? It is noticed that the commit may take long time to finish (>1m).
          2) How to assure the replication timing based on timestamp?
          e.g.
          Event A: master update tuple T at 1:00;
          Event B: slave updates tuple T at 1:02;
          The event A should be replicated and applied to slave no later than event B. Otherwise the DB value on slave would be mistakenly overridden by the obsolete one.

          liberiul Andy Liu added a comment - It's found that the replication got delayed significantly under extreme high CPU usage. The statement executed on master host was not applied to slave host until minutes later. The delayed replication events finally overrode the statements queried to slave and caused the issue. Would you advise: 1) Are there any tweaking options to improve mariadb performance or reduce CPU usage? It is noticed that the commit may take long time to finish (>1m). 2) How to assure the replication timing based on timestamp? e.g. Event A: master update tuple T at 1:00; Event B: slave updates tuple T at 1:02; The event A should be replicated and applied to slave no later than event B. Otherwise the DB value on slave would be mistakenly overridden by the obsolete one.
          Elkin Andrei Elkin added a comment -

          liberiul, to the 2nd of the two, it looks like you'd be interested in master_wait_gtid. That is the B's updater would invoke the function prior to submit the update. That obviously requires A and B updaters coordination on the matter of the sequence number of replication events. (B must be aware of the last A's update's gtid).

          To the 1st question, it looks like the request is about the slave threads scheduling prioritization (which we don't provide unfortunately) unless your whole slave server was busy.

          Elkin Andrei Elkin added a comment - liberiul , to the 2nd of the two, it looks like you'd be interested in master_wait_gtid . That is the B's updater would invoke the function prior to submit the update. That obviously requires A and B updaters coordination on the matter of the sequence number of replication events. (B must be aware of the last A's update's gtid). To the 1st question, it looks like the request is about the slave threads scheduling prioritization (which we don't provide unfortunately) unless your whole slave server was busy.
          liberiul Andy Liu added a comment -

          Thanks for the explanation. This ticket can be closed.

          liberiul Andy Liu added a comment - Thanks for the explanation. This ticket can be closed.

          People

            Unassigned Unassigned
            liberiul Andy Liu
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.