
MDEV-26298: Galera cluster slave abort with Assertion `mode_ == m_local || transaction_.is_streaming()' failed despite primary key added for all tables

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.4.20, 10.4.21
    • Fix Version/s: 10.4.23
    • Component/s: Galera
    • Environment: Centos8

    Description

      Hi. Despite having added a primary key to all tables, we still encounter the `Assertion `mode_ == m_local || transaction_.is_streaming()' failed` Galera failure.
      We operate three MariaDB Server 10.4.20 nodes (galera-4-26.4.6-1.el8.x86_64). Our applications reach the cluster through MaxScale 2.2.9 configured with a read/write split router.

      It seems to be related to a specific request, and sometimes both slaves fail and produce a core:

      mysqld: 
      /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.4.20/wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
      210802 11:55:00 [ERROR] mysqld got signal 6 ;
      

      Please find attached:

      • the master general query log rd-devops-sql2.log
      • a slave core
      • a slave mysql error log

    Attachments

    Issue Links

    Activity

            seppo Seppo Jaakola added a comment -

            mreibert your first attached stacktrace (reibert_mysqld-abort-stacktrace.txt) has one replication applier thread (Thread 73) in the middle of processing a foreign key constraint's cascade operation (row_ins_foreign_check_on_constraint). This suggests that this crash relates to issue MDEV-26803, where the reason for the crashes was that cascading executions did not reliably record all rows manipulated through the cascade operation in the replication write set, resulting in unsafe parallel applying in replica nodes.
            The second stacktrace also shows ongoing foreign key checking, but at a different stage.
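
            As an illustration of the kind of schema involved (a hypothetical sketch; the table and column names below are not taken from the attached logs), a DELETE on a parent table with an ON DELETE CASCADE foreign key triggers cascaded deletes on the child table, and it is those cascaded child rows that MDEV-26803 describes as not being reliably recorded in the write set:

            -- Hypothetical parent/child tables with a cascading foreign key.
            CREATE TABLE parent (
              id INT NOT NULL PRIMARY KEY
            ) ENGINE=InnoDB;

            CREATE TABLE child (
              id INT NOT NULL PRIMARY KEY,
              parent_id INT NOT NULL,
              CONSTRAINT fk_child_parent FOREIGN KEY (parent_id)
                REFERENCES parent (id) ON DELETE CASCADE
            ) ENGINE=InnoDB;

            -- Deleting a parent row cascades into child rows; this is the
            -- row_ins_foreign_check_on_constraint path seen in the applier stack trace.
            DELETE FROM parent WHERE id = 1;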

            mreibert Mark Reibert added a comment -

            seppo — Thank you for the update and the reference to MDEV-26803. I see in that issue that setting cert.optimistic_pa = no is a suggested workaround, as has been discussed in comments here. With the fix for MDEV-26803 in 10.4.23, would it be considered "safe" to return to cert.optimistic_pa = yes? To ask the question a little differently, is there a downside to staying with cert.optimistic_pa = no? After 1–2 months of very frequent crashes with 10.4.19–22, I have been rock solid for the past week after setting cert.optimistic_pa = no. So I am (I think understandably) hesitant to turn that back on … ever.

            seppo Seppo Jaakola added a comment -

            mreibert "cert.optimistic_pa = no" configuration is a viable workaround for this and also as permanent configuration, if replication performance remains in acceptable level.
            The "cert.optimistic_pa = no" configuration has global effect on all transactions, and will limit the concurrency of replication applying in replica nodes. Whereas, the fix for MDEV-26803, will monitor the transaction's foreign key constraint processing in the first node, and only when cascading delete operation really takes place, then that specific transaction only is marked as not safe for parallel applying.

            Your stack trace contains at least one ongoing FK cascade delete operation, and this makes it plausible candidate for MDEV-26803 scenario.
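
            For reference (a minimal sketch, not part of the fix itself): wsrep_provider_options is a dynamic system variable, so cert.optimistic_pa can be inspected and changed at runtime on a node. A runtime change does not survive a restart, so it should also be mirrored in the configuration file.

            -- Show the full provider option string (cert.optimistic_pa appears inside it).
            SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';

            -- Change the option at runtime on this node; this is lost on restart,
            -- so also add it to the server configuration file.
            SET GLOBAL wsrep_provider_options = 'cert.optimistic_pa=no';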

            mreibert Mark Reibert added a comment -

            seppo — thank you for the feedback.

            At present my replication performance is holding up fine with cert.optimistic_pa = no, and I haven't experienced a single SIGABRT/SIGSEGV failure since making that change, so I will leave it disabled in my configuration. But if I do at some point start seeing replication fall behind, I will consider re-enabling cert.optimistic_pa (after upgrading to a version that contains the fix for MDEV-26803, of course).


            ccounotte COUNOTTE CEDRIC added a comment -

            @Mark Reibert, we are experiencing the same issue from time to time and would like to set cert.optimistic_pa = no, but I'm not sure how to safely apply this as a permanent configuration.

            Do I need to apply the setting on all nodes or can one node be set for testing?

            Is adding a line like this to /etc/mysql/mariadb.conf.d/50-server.cnf OK:

            wsrep_provider_options = "cert.optimistic_pa = no"
            

            Or should I copy/paste all the existing values from an already running node, which look like the listing below, with a lot of options set:

            wsrep_provider_options  base_dir = /var/lib/mysql/; base_host = 192.168.0.6; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.causal_keepalive_period = PT1S; evs.debug_log_mask = 0x1; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT7.5S; evs.join_retrans_period = PT1S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 1; evs.view_forget_timeout = P1D; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.segment = 0; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 192.168.0.6; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT20S; pc.npvo = false; pc.recovery = true; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = PT30S; pc.weight = 1; protonet.version = 0; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.max_ws_size = 2147483647; repl.proto_max = 10; socket.checksum = 2; socket.recv_buf_size = auto; socket.send_buf_size = auto;
            

            Thanks for any tips.
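
            A minimal configuration-file sketch along the lines asked about above (the [galera] section name is an assumption; use whichever section your server already reads its wsrep settings from). As I understand it, provider options not listed in wsrep_provider_options keep their defaults, so listing only the option to change is enough; if a wsrep_provider_options line already exists, append the option to it, separated by ';':

            # Hypothetical snippet for /etc/mysql/mariadb.conf.d/50-server.cnf
            [galera]
            wsrep_provider_options = "cert.optimistic_pa=no"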


    People

      Assignee: seppo Seppo Jaakola
      Reporter: jfillatre jordane fillatre
      Votes: 3
      Watchers: 14
