MariaDB Server / MDEV-33509

Failed to apply write set with flags = (rollback | pa_unsafe)

Details

    Description

      Message from the customer:

      Description: Hello,
      One of our heavily loaded 3-node Galera clusters ran into an inconsistency issue on 2 of its 3 nodes.
       
      MariaDB CS 10.6.16, Galera Provider 26.4.16.
       
      The nodes have the following names and roles:
       
      db3 - Application master node for DMLs
      db4 - wsrep replica for 99.99% of the queries, except for the aggregation DML, which is "offloaded" to it
      db5 - wsrep replica and async replication master
       
      All application traffic goes to node db3.
       
      We use node db4 to offload the data aggregation functions, which run DML on dedicated tables; nothing else changes those Aggr tables.
       
      Today we started our standard procedure to perform a Live Alter on one of the Aggr tables. The Live Alter script was executed on db3 instead of db4. 10 minutes after the Live Alter was started, the db4 node became inconsistent.
       
      The Live Alter script continued to run on db3.
       
      The background Aggr functions then started using db5 for the aggregation DML. 10 minutes after that, db5 also became inconsistent.
       
      Live Alter is used to apply DDL changes to big tables without hanging the cluster. It works with triggers and INSERT ... SELECT.

      Judging by the customer's logs, we are dealing with the following failure:

      From db4:
       
      2024-02-16 12:05:30 6 [ERROR] WSREP: Failed to apply write set: gtid: a8f0f00f-842d-11eb-b5a7-7a1763e4c1e6:45662493750 server_id: 269fd913-9633-11ee-9629-87d0be11dc45 client_id: 18446744073709551615 trx_id: 48609383118 flags: 20 (rollback | pa_unsafe)
      ...
       
      From db5:
       
      2024-02-16 12:05:30 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
      2024-02-16 12:05:30 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: eab3a4dc-ccb2-11ee-a3ba-c612c5ccc31d
      2024-02-16 12:05:30 0 [Note] WSREP: STATE EXCHANGE: sent state msg: eab3a4dc-ccb2-11ee-a3ba-c612c5ccc31d
      2024-02-16 12:05:30 0 [Note] WSREP: STATE EXCHANGE: got state msg: eab3a4dc-ccb2-11ee-a3ba-c612c5ccc31d from 0 (fx112_db5)
      2024-02-16 12:05:30 0 [Note] WSREP: STATE EXCHANGE: got state msg: eab3a4dc-ccb2-11ee-a3ba-c612c5ccc31d from 1 (fx112_db3)
      2024-02-16 12:05:30 0 [Note] WSREP: Quorum results:
      version = 6,
      component = PRIMARY,
      conf_id = 281,
      members = 2/2 (joined/total),
      act_id = 45662493751,
      last_appl. = 45662493731,
      protocols = 2/10/4 (gcs/repl/appl),
      vote policy= 0,
      group UUID = a8f0f00f-842d-11eb-b5a7-7a1763e4c1e6
      2024-02-16 12:05:30 0 [Note] WSREP: Flow-control interval: [6000, 6000]
      2024-02-16 12:05:30 21 [Note] WSREP: ####### processing CC 45662493752, local, ordered
      2024-02-16 12:05:30 21 [Note] WSREP: ####### My UUID: 184a029b-9622-11ee-8c61-6f91e19335c1
      2024-02-16 12:05:30 21 [Note] WSREP: Skipping cert index reset
      2024-02-16 12:05:30 21 [Note] WSREP: REPL Protocols: 10 (5)
      2024-02-16 12:05:30 21 [Note] WSREP: ####### Adjusting cert position: 45662493751 -> 45662493752
      2024-02-16 12:05:30 0 [Note] WSREP: Service thread queue flushed.
      2024-02-16 12:05:30 21 [Note] WSREP: ================================================
      View:
      id: a8f0f00f-842d-11eb-b5a7-7a1763e4c1e6:45662493752
      status: primary
      protocol_version: 4
      capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
      final: no
      own_index: 0
      members(2):
      0: 184a029b-9622-11ee-8c61-6f91e19335c1, fx112_db5
      1: b8c19c60-962e-11ee-a025-af00b051f68f, fx112_db3
      =================================================
      2024-02-16 12:05:30 21 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2024-02-16 12:05:30 21 [Note] WSREP: Lowest cert index boundary for CC from group: 45662493732
      2024-02-16 12:05:30 21 [Note] WSREP: Min available from gcache for CC from group: 45614618803
      2024-02-16 12:05:36 0 [Note] WSREP: cleaning up 269fd913-9629 (tcp://xxx.xxx.xxx.xxx:yyyy)
      2024-02-16 12:15:46 35 [ERROR] WSREP: Failed to apply write set: gtid: a8f0f00f-842d-11eb-b5a7-7a1763e4c1e6:45662930080 server_id: 184a029b-9622-11ee-8c61-6f91e19335c1 client_id: 18446744073709551615 trx_id: 81448619022 flags: 20 (rollback | pa_unsafe)
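
      The two "Failed to apply write set" lines above carry the same fields, so it can help to pull them apart mechanically when comparing nodes. The following is a minimal sketch, not part of the server or of any MariaDB tooling: the regular expression, the function names, and the flag table are assumptions made for illustration. In particular, the bit values in WSREP_FLAGS are assumed; the logs above only confirm that 20 is rendered as "rollback | pa_unsafe".

```python
import re

# Assumed bit values (illustrative table, not taken from this ticket).
# The only thing the logs confirm is that 20 prints as "rollback | pa_unsafe".
WSREP_FLAGS = (
    (1 << 0, "start_transaction"),
    (1 << 1, "commit"),
    (1 << 2, "rollback"),
    (1 << 3, "isolation"),
    (1 << 4, "pa_unsafe"),
)

# Matches the field layout of the "Failed to apply write set" lines quoted above.
APPLY_FAILURE_RE = re.compile(
    r"Failed to apply write set:\s+gtid:\s+(?P<uuid>[0-9a-f-]+):(?P<seqno>\d+)"
    r"\s+server_id:\s+(?P<server_id>[0-9a-f-]+)"
    r"\s+client_id:\s+(?P<client_id>\d+)"
    r"\s+trx_id:\s+(?P<trx_id>\d+)"
    r"\s+flags:\s+(?P<flags>\d+)"
)


def decode_flags(value):
    """Return the names of the bits set in a numeric flags value."""
    return [name for bit, name in WSREP_FLAGS if value & bit]


def parse_apply_failure(line):
    """Parse one error line into a dict, or return None if it does not match."""
    match = APPLY_FAILURE_RE.search(line)
    if match is None:
        return None
    return {
        "cluster_uuid": match["uuid"],
        "seqno": int(match["seqno"]),
        "server_id": match["server_id"],
        "client_id": int(match["client_id"]),
        "trx_id": int(match["trx_id"]),
        "flags": decode_flags(int(match["flags"])),
    }


# The two error lines from the db4 and db5 logs, copied verbatim.
DB4_LINE = (
    "2024-02-16 12:05:30 6 [ERROR] WSREP: Failed to apply write set: gtid: "
    "a8f0f00f-842d-11eb-b5a7-7a1763e4c1e6:45662493750 server_id: "
    "269fd913-9633-11ee-9629-87d0be11dc45 client_id: 18446744073709551615 "
    "trx_id: 48609383118 flags: 20 (rollback | pa_unsafe)"
)
DB5_LINE = (
    "2024-02-16 12:15:46 35 [ERROR] WSREP: Failed to apply write set: gtid: "
    "a8f0f00f-842d-11eb-b5a7-7a1763e4c1e6:45662930080 server_id: "
    "184a029b-9622-11ee-8c61-6f91e19335c1 client_id: 18446744073709551615 "
    "trx_id: 81448619022 flags: 20 (rollback | pa_unsafe)"
)

if __name__ == "__main__":
    db4 = parse_apply_failure(DB4_LINE)
    db5 = parse_apply_failure(DB5_LINE)
    print(db4["flags"])                 # ['rollback', 'pa_unsafe']
    print(db5["seqno"] - db4["seqno"])  # 436330 write sets between the two failures
```

      Parsed this way, both failures show the same flags value and the same cluster UUID, and the sequence numbers can be compared directly across the two nodes.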

      Attachments

        Activity

          sysprg Julius Goryavsky created issue -
          sysprg Julius Goryavsky made changes -
          Field Original Value New Value
          sysprg Julius Goryavsky made changes -
          Description: "Message from the client:" changed to "Message from the customer:" and "Judging by the client's logs" changed to "Judging by the customer's logs"; the rest of the text repeats the Description above.
          janlindstrom Jan Lindström made changes -
          Assignee: Seppo Jaakola [ seppo ] -> Jan Lindström [ JIRAUSER53125 ]
          Status: Open [ 1 ] -> Needs Feedback [ 10501 ]
          valerii Valerii Kravchuk made changes -
          Attachment provide_to_mariadb_obfuscated.tar [ 73181 ]
          valerii Valerii Kravchuk made changes -
          Attachment wsrep_variables.txt [ 73183 ]
          valerii Valerii Kravchuk made changes -
          Status: Needs Feedback [ 10501 ] -> Open [ 1 ]
          janlindstrom Jan Lindström made changes -
          Status: Open [ 1 ] -> Confirmed [ 10101 ]
          janlindstrom Jan Lindström made changes -
          Status: Confirmed [ 10101 ] -> In Progress [ 3 ]
          janlindstrom Jan Lindström made changes -
          Status: In Progress [ 3 ] -> Needs Feedback [ 10501 ]
          valerii Valerii Kravchuk made changes -
          Status: Needs Feedback [ 10501 ] -> Open [ 1 ]
          janlindstrom Jan Lindström made changes -
          Status: Open [ 1 ] -> Needs Feedback [ 10501 ]
          valerii Valerii Kravchuk made changes -
          Status: Needs Feedback [ 10501 ] -> Open [ 1 ]
          janlindstrom Jan Lindström made changes -
          Status: Open [ 1 ] -> Confirmed [ 10101 ]
          janlindstrom Jan Lindström made changes -
          Assignee: Jan Lindström [ JIRAUSER53125 ] -> Daniele Sciascia [ sciascid ]
          sciascid Daniele Sciascia made changes -
          Status: Confirmed [ 10101 ] -> In Review [ 10002 ]
          sysprg Julius Goryavsky made changes -
          Assignee: Daniele Sciascia [ sciascid ] -> Julius Goryavsky [ sysprg ]
          sysprg Julius Goryavsky made changes -
          Status: In Review [ 10002 ] -> Stalled [ 10000 ]
          sysprg Julius Goryavsky made changes -
          Fix Version/s 10.6.18 [ 29627 ]
          Fix Version/s 10.6 [ 24028 ]
          Resolution Fixed [ 1 ]
          Status Stalled [ 10000 ] Closed [ 6 ]
          JIraAutomate JiraAutomate made changes -
          Fix Version/s 10.11.8 [ 29630 ]
          Fix Version/s 11.0.6 [ 29628 ]
          Fix Version/s 11.1.5 [ 29629 ]
          Fix Version/s 11.2.4 [ 29631 ]
          mariadb-jira-automation Jira Automation (IT) made changes -
          Zendesk Related Tickets 201774
          Zendesk active tickets 201774

          People

            sysprg Julius Goryavsky
            sysprg Julius Goryavsky
            Votes: 1
            Watchers: 7
