MariaDB Server / MDEV-26803

Galera crash - Assertion. Possible parallel writeset problem

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.4.21, 10.4.22, 10.5, 10.6
    • Fix Version/s: 10.4.23, 10.5.14, 10.6.6, 10.7.2
    • Component/s: Galera
    • Labels: None
    • Environment: Ubuntu Bionic, using community packages from MariaDB repo.
      Also reproduced with build_43208 of 10.4.22

    Description

      We are experiencing a crash of all Galera nodes receiving write sets. The operation is a "last resort" clean-up stored procedure that deletes many rows from the same set of related tables. It generally takes 4-5 minutes to run based on our data size, but crashes within 10-20 seconds if it is going to go wrong.

      We have been using this stored procedure reasonably regularly, without problem, on 10.1 for several years. As suggested by Enterprise support, I have also tried this on the latest 10.4 build, for which they provided me a URL. This also exhibits the problem.

      Unfortunately, I have been unable to reproduce this either with simplified steps or from a different system of ours. However, I have been able to take a "mariabackup" (i.e. a physical backup) and reproduce the fault on 2 other clusters. The original and the first reproduction were on VMware machines; the third system is an AWS EC2 setup. All 3 have the same MariaDB configuration. I suspect the problem is exposed due to the particular on-disk data.

      Attached is the log of one of the nodes receiving the writeset.

      In the first round of testing, I found that autocommit needs to be ON.
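
      For anyone retesting, the autocommit state can be confirmed per session before calling the procedure (a minimal sketch using standard server variables):

      -- check the effective setting for this session and globally
      SELECT @@session.autocommit, @@global.autocommit;
      -- the crash was only observed with autocommit enabled in these tests
      SET SESSION autocommit = 1;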

      Suspecting the data, and knowing that our QA team were trying to delete rows, I started my test again and used "OPTIMIZE TABLE" on the tables that are touched. This caused

      [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table mediator.SEQUENCE; Deadlock found when trying to get lock; try restarting transaction, Error_code: 1213; handler error HA_ERR_LOCK_DEADLOCK; the event's master log FIRST, end_log_pos 276, Internal MariaDB error code: 1213
      

      to appear in the log, at an unusual point in the crash logging.

      Because of finding that info, I have now set wsrep_slave_threads = 1, and this completes successfully. Previously the value was 12. I have also tested a value of 4, which also crashed.
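
      For reference, wsrep_slave_threads is a dynamic variable, so the values tested above can be switched at runtime (a minimal sketch; SUPER privilege assumed):

      SET GLOBAL wsrep_slave_threads = 1;   -- completed successfully here; 4 and 12 both crashed
      SHOW GLOBAL VARIABLES LIKE 'wsrep_slave_threads';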

      Therefore, with this additional knowledge, I am presuming that something in Galera assumes it can apply certain writesets in parallel when it cannot.

      Attachments

        1. gdb.txt
          205 kB
        2. mysql-receivingNode.log
          6 kB
        3. second-of-crash.combined.log
          110 kB
        4. storedProcedure.sql
          1.0 kB
        5. table-structure.sql
          2 kB
        6. unable-to-read-page.fatal.log
          4.96 MB

        Issue Links

          Activity

            brendon Brendon Abbott created issue -

            I realise there is probably very little that can be done without reproducible steps, but thought best to report the issue regardless, especially as I had been following MDEV-26298 which was the same error message.

            I'll get details of the table relations, and the stored procedure and comment further tomorrow.

            valerii Valerii Kravchuk made changes -
            Field Original Value New Value
            Affects Version/s 10.4.22 [ 26031 ]
            brendon Brendon Abbott added a comment - - edited

            The deadlock error is potentially not part of the problem, or is something else entirely. I have just tried to reproduce again, whilst enabling `innodb_print_all_deadlocks`; however, the original line above did not show.

            brendon Brendon Abbott made changes -
            Attachment table-structure.sql [ 59718 ]

            brendon Brendon Abbott added a comment -

            I have added the table structure, in the form of an SQL statement. The tables have had columns removed, to concentrate on the inter-dependencies.

            • The structure is essentially a collection of lists - LISTs.
            • A LIST's ordering is determined by many TBL_ROW rows. Each has Next/Prev IDs, which are not FKs and which I have not included.
            • A TBL_ROW has a TBL_DATASET, which in turn is made up of a collection of TBL_ENTRY rows. Only a single DATASET is referenced by a single ROW.
            • TBL_REGION represents a selection of a number of rows, by holding top and bottom IDs.
              • Multiple, and nested "selections" can occur, such that a TBL_ROW ID can appear in more than one REGION.
              • A REGION can consist of one row. In this case TOP == BOTTOM.
            • An INSTANCE is something observing the list, and acting on it.
            • A STATUS holds data regarding the many-to-many link between an INSTANCE and ENTRY.

            I have also attached the stored procedure which is being used to clear down a LIST. I realise that some of the statements are not optimal and are redundant due to the DELETE CASCADEs, but I have left them as is, in the hope it highlights what the issue may be.

            brendon Brendon Abbott made changes -
            Attachment storedProcedure.sql [ 59719 ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Ramesh Sivaraman [ JIRAUSER48189 ]
            ramesh Ramesh Sivaraman added a comment - - edited

            brendon Could not reproduce the crash using the given table structure. Please let me know if there is anything missing in the test case.
            Data load

            CREATE TABLE LIST (
              ID INTEGER PRIMARY KEY AUTO_INCREMENT
            );
            CREATE TABLE TBL_DATASET (
              ID INTEGER PRIMARY KEY AUTO_INCREMENT
            );
            CREATE TABLE TBL_INSTANCE (
              ID INTEGER PRIMARY KEY AUTO_INCREMENT
            );
            CREATE TABLE TBL_ENTRY(
              ID INTEGER PRIMARY KEY AUTO_INCREMENT,
              TBL_DATASET_ID INTEGER NOT NULL,
              CONSTRAINT TBL_ENTRY_TBL_DATASET_FK FOREIGN KEY (TBL_DATASET_ID) REFERENCES TBL_DATASET (ID) ON DELETE CASCADE
            );
            CREATE TABLE TBL_ROW(
              ID INTEGER PRIMARY KEY AUTO_INCREMENT,
              LIST_ID INTEGER NOT NULL,
              TBL_DATASET_ID INTEGER NOT NULL,
              CONSTRAINT SEQ_TBL_DATASET_FK FOREIGN KEY (TBL_DATASET_ID) REFERENCES TBL_ROW (ID),
              CONSTRAINT LIST_FK FOREIGN KEY (LIST_ID) REFERENCES LIST (ID)
            );
            CREATE TABLE TBL_REGION(
              ID INTEGER PRIMARY KEY AUTO_INCREMENT,
              TOP INTEGER NOT NULL,
              BOTTOM INTEGER NOT NULL,
              TBL_DATASET_ID INTEGER DEFAULT NULL,
              CONSTRAINT TOP_FK FOREIGN KEY (TOP) REFERENCES TBL_ROW (ID),
              CONSTRAINT BOTTOM_FK FOREIGN KEY (BOTTOM) REFERENCES TBL_ROW (ID),
              CONSTRAINT TBL_REGION_TBL_DATASET_ID_FK FOREIGN KEY (TBL_DATASET_ID) REFERENCES TBL_DATASET (ID) ON DELETE CASCADE  
            );
            CREATE TABLE STATUS (
              TBL_ROW_ID INTEGER DEFAULT NULL,
              TBL_DATASET_ID INTEGER DEFAULT NULL,
              TBL_ENTRY_ID INTEGER NOT NULL,
              TBL_INSTANCE_ID INTEGER NOT NULL,
              PRIMARY KEY (TBL_ENTRY_ID,TBL_INSTANCE_ID),
              KEY STATUS_IDX2 (TBL_ROW_ID),
              KEY STATUS_IDX3 (TBL_DATASET_ID),
              KEY STATUS_TBL_INSTANCE (TBL_INSTANCE_ID),
              CONSTRAINT TBL_INSTANCE_FK FOREIGN KEY (TBL_INSTANCE_ID) REFERENCES TBL_INSTANCE (ID),
              CONSTRAINT STATUS_TBL_ENTRY_FK FOREIGN KEY (TBL_ENTRY_ID) REFERENCES TBL_ENTRY (ID) ON DELETE CASCADE,
              CONSTRAINT STATUS_TBL_DATASET_FK1 FOREIGN KEY (TBL_DATASET_ID) REFERENCES TBL_DATASET (ID) ON DELETE CASCADE,
              CONSTRAINT STATUS_TBL_ROW_FK1 FOREIGN KEY (TBL_ROW_ID) REFERENCES TBL_ROW (ID) ON DELETE CASCADE
            );
             
            DELIMITER //
             
            DROP PROCEDURE IF EXISTS CLEAR_LIST;
            CREATE PROCEDURE CLEAR_LIST(
             in in_list_id int
            )
            BEGIN
             
                DECLARE done INT DEFAULT FALSE;
                DECLARE deleted INT DEFAULT 0;
                DECLARE rowId INT;
                DECLARE dataId INT;
                
                DECLARE cur CURSOR FOR SELECT ID, TBL_DATASET_ID FROM TBL_ROW WHERE LIST_ID = in_list_id;
                DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
                     
                OPEN cur;
                      
               read_loop: LOOP
                    FETCH cur INTO rowId, dataId;
                    IF done THEN
                        LEAVE read_loop;
                    END IF;
                    SET deleted = deleted + 1;
                    DELETE from TBL_REGION where TOP = rowId;
                    DELETE from TBL_REGION where BOTTOM = rowId;
                    DELETE from TBL_ROW where ID = rowId;
                    DELETE from TBL_ENTRY where TBL_DATASET_ID = dataId;
                    DELETE from TBL_DATASET where ID = dataId;
                    COMMIT;
                END LOOP;
              
                CLOSE cur;
                  
               SELECT concat(deleted, ' rows deleted');
                  
            END; //
              
            DELIMITER ;
             
            DELIMITER //
            FOR i IN 1..500
            DO
              INSERT INTO LIST VALUES (i);
            END FOR;
            //
             
            FOR i IN 1..1000
            DO
              INSERT INTO TBL_DATASET VALUES (i);
            END FOR;
            //
             
            FOR i IN 1..500
            DO
              INSERT INTO TBL_INSTANCE VALUES (i);
            END FOR;
            //
             
            SET @val = 1;
            FOR i IN 1..500
            DO
              FOR j IN 1..10
              DO  
                INSERT INTO TBL_ENTRY VALUES (@val,i);
              SET @val = @val + 1;
              END FOR;
            END FOR;
            //
             
            SET @val = 1;
            FOR i IN 1..500
            DO
              FOR j IN 1..10
              DO  
                INSERT INTO TBL_ROW VALUES (@val,i,i);
              SET @val = @val + 1;
              END FOR;
            END FOR;
            //
             
            SET @val = 1;
            FOR i IN 1..500
            DO
              FOR j IN 1..50
              DO  
                INSERT INTO TBL_REGION VALUES (@val,i,i,j);
              SET @val = @val + 1;
              END FOR;
            END FOR;
            //
             
            SET @val = 1;
            FOR i IN 1..500
            DO
              FOR j IN 1..10
              DO  
                INSERT INTO STATUS VALUES (i,i,@val,i);
              SET @val = @val + 1;
              END FOR;
            END FOR;
            //
             
            DELIMITER ;
            

            SP to execute CLEAR_LIST and the OPTIMIZE TABLE commands.

            DROP PROCEDURE IF EXISTS run_sp;
            DELIMITER //
            CREATE PROCEDURE `run_sp`()
            BEGIN	
                SET @num = FLOOR(RAND()*(500-5+1)+5);
                call CLEAR_LIST(@num);
                OPTIMIZE TABLE LIST;
                OPTIMIZE TABLE TBL_DATASET;
                OPTIMIZE TABLE TBL_ENTRY;
                OPTIMIZE TABLE TBL_INSTANCE;
                OPTIMIZE TABLE TBL_REGION ;
                OPTIMIZE TABLE TBL_ROW;
            END//
            DELIMITER ;
            

            EVENTS to run run_sp at regular intervals

            create event ev_1 on schedule every 1 minute do call run_sp();
            create event ev_2 on schedule every 1 minute do call run_sp();
            create event ev_3 on schedule every 1 minute do call run_sp();
            create event ev_4 on schedule every 1 minute do call run_sp();
            create event ev_5 on schedule every 1 minute do call run_sp();
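
            Note that these events only fire if the scheduler is running; a sketch of enabling it, in case it is not already on in the test environment:

            SET GLOBAL event_scheduler = ON;
            SHOW GLOBAL VARIABLES LIKE 'event_scheduler';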
            

            Roel Roel Van de Paar made changes -
            Labels need_feedback
            Roel Roel Van de Paar added a comment - - edited

            brendon What was the exact revision of the latest 10.4 build you tried this on please? Thank you.

            Also, you mentioned seeing the assertion from MDEV-26298? I wonder if data corruption could have been introduced, now resulting in the observed issue(s)?
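
            If corruption is a suspect, a first pass could be to run CHECK TABLE against the affected tables (a sketch using the obfuscated table names from the attached structure; EXTENDED requests the most thorough check the engine supports):

            CHECK TABLE LIST, TBL_DATASET, TBL_ROW, TBL_ENTRY, TBL_REGION, STATUS EXTENDED;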

            brendon Brendon Abbott added a comment - - edited

            Apologies for the slow response, I have been on annual leave.

            The support ticket asked me to provide this output:

            MariaDB [(none)]> SHOW VARIABLES LIKE '%revision%';
            | version_source_revision | 53c8d559a596993cf3c7b020b19e2c345952fb71 |
            

            I have also wondered about data corruption, as I have never been able to replicate the problem. I have always had to go back to the "physical" copy that was originally causing problems. I purely mentioned MDEV-26298, because that is what our Enterprise Support engineer originally mentioned.
            Prior to going on leave, I also had one of our engineers report this exact same failure on a completely separate system. They were using the same conditions to cause the problem - i.e. deleting a large quantity of rows, with autocommit turned on.

            I will have a read through the test case in more detail. Because I obfuscated the table names, I need to map them back. I originally did not mention the other columns from the tables, presuming that something key-related was the issue.

            valerii Valerii Kravchuk made changes -
            Labels need_feedback
            brendon Brendon Abbott added a comment - - edited

            I am going to need to come back to you with some better test data - somehow. Unfortunately I have been asked to spend my time elsewhere, on critical deadlines we have.

            I think I am going to need to try and get you the exact data we are using, in an obfuscated format.
            I have just taken a mysqldump and piped it into a new database schema. Running the problematic process on the new schema has caused the crash, so I think it is safe to say that it is not a corrupt on-disk structure that causes the problem.

            In your example above, the data structure is very uniform compared to ours. Doing a simple COUNT(*), without really representing the relations, we have (a sketch of the corresponding queries follows this list):

            • 26 LIST
            • 344,067 DATASET
            • 138,379 ROW
            • Additionally, in the REGION case it is common for the DATASET to be NULL.
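
            A sketch of the corresponding queries (against the obfuscated table names; the production names differ):

            SELECT COUNT(*) FROM LIST;
            SELECT COUNT(*) FROM TBL_DATASET;
            SELECT COUNT(*) FROM TBL_ROW;
            SELECT COUNT(*) FROM TBL_REGION WHERE TBL_DATASET_ID IS NULL;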

            Roel Roel Van de Paar added a comment -

            brendon Thank you. Yes, real obfuscated data that actually produces the problem would be great. We have tools to quickly reduce millions of SQL lines in a testcase (which produces the problem) to a minimal testcase which still produces the problem. Please dump one INSERT per line rather than multiple values per INSERT/line. I know this greatly increases the overall dump/data size (though likely not if compressed afterwards), but it allows us to do this reduction much better. Thank you.
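
            That is, the dump should contain one statement per row, for example (illustrative only):

            INSERT INTO TBL_ROW VALUES (1,1,1);
            INSERT INTO TBL_ROW VALUES (2,1,1);

            rather than a single multi-row INSERT INTO TBL_ROW VALUES (1,1,1),(2,1,1),...;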

            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.4 [ 22408 ]
            Roel Roel Van de Paar made changes -
            Labels need_feedback

            brendon Brendon Abbott added a comment -

            Just to acknowledge I have seen this, and I will try to get back to this as soon as possible.
            You will be pleased to know I have ruled out this being a "physical" DB problem, having been able to mysqldump the subset of tables and reimport them into a new schema. I haven't got my notes in front of me, but I then excluded one of the tables and the problem didn't occur.

            I now need to obfuscate the tables and data before posting when I can get back onto this.


            brendon Brendon Abbott added a comment -

            For anyone non-MariaDB following, I have had to follow this ticket up with MariaDB commercial support so I can provide test data in a timely, secure fashion.
            I have managed to reduce the test dataset size considerably, both in terms of rows and BLOB columns.

            valerii Valerii Kravchuk made changes -
            Labels need_feedback
            Roel Roel Van de Paar added a comment - - edited

            ramesh Can you give this a spin please (inc later versions), and reduce with reducer.sh (MODE=2) if reproducible? Thank you
            brendon Thank you very much, we're on it trying to reproduce (and reduce), if possible.

            ramesh Ramesh Sivaraman added a comment - - edited

            brendon Thank you for the reproducible testcase.
            jplindst Reproduced the crash using the given test case.

            10.4.22 d352bc5b6772f735a7e4bc540f578e0efc280e71 (Optimized)

            #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
            #1  0x0000563408a2ed8f in my_write_core (sig=sig@entry=6) at /test/10.4_opt/mysys/stacktrace.c:386
            #2  0x0000563408414fa8 in handle_fatal_signal (sig=6) at /test/10.4_opt/sql/signal_handler.cc:356
            #3  <signal handler called>
            #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
            #5  0x000015343b36a859 in __GI_abort () at abort.c:79
            #6  0x000015343b36a729 in __assert_fail_base (fmt=0x15343b500588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x563408b29d50 "mode_ == m_local || transaction_.is_streaming()", 
                file=0x563408afb1c0 "/test/10.4_opt/wsrep-lib/include/wsrep/client_state.hpp", line=668, function=<optimized out>) at assert.c:92
            #7  0x000015343b37bf36 in __GI___assert_fail (assertion=assertion@entry=0x563408b29d50 "mode_ == m_local || transaction_.is_streaming()", 
                file=file@entry=0x563408afb1c0 "/test/10.4_opt/wsrep-lib/include/wsrep/client_state.hpp", line=line@entry=668, function=function@entry=0x563408b29d80 "int wsrep::client_state::bf_abort(wsrep::seqno)")
                at assert.c:101
            #8  0x000056340838fcec in wsrep::client_state::bf_abort (bf_seqno=..., this=0x1533a4006be0) at /test/10.4_opt/sql/wsrep_thd.cc:365
            #9  wsrep_bf_abort (bf_thd=bf_thd@entry=0x153390000c48, victim_thd=victim_thd@entry=0x1533a4000c48) at /test/10.4_opt/sql/wsrep_thd.cc:382
            #10 0x0000563408397b03 in wsrep_thd_bf_abort (bf_thd=bf_thd@entry=0x153390000c48, victim_thd=victim_thd@entry=0x1533a4000c48, signal=<optimized out>) at /test/10.4_opt/sql/service_wsrep.cc:223
            #11 0x00005634085e39cf in bg_wsrep_kill_trx (void_arg=<optimized out>) at /test/10.4_opt/storage/innobase/handler/ha_innodb.cc:18708
            #12 0x00005634081e00ae in handle_manager (arg=<optimized out>) at /test/10.4_opt/sql/sql_manager.cc:112
            #13 0x000015343b878609 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #14 0x000015343b467293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            Core dump is available in (galapq) /test/mtest/GAL_MD111121-mariadb-10.4.21-linux-x86_64-opt/node2/.

            Debug build crashes with a different assertion (similar to MDEV-22105).

            10.4.22 d352bc5b6772f735a7e4bc540f578e0efc280e71 (Debug)

            (gdb) bt
            #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
            #1  0x0000562ed0d1ac23 in my_write_core (sig=sig@entry=6) at /test/10.4_dbg/mysys/stacktrace.c:386
            #2  0x0000562ed0401396 in handle_fatal_signal (sig=6) at /test/10.4_dbg/sql/signal_handler.cc:356
            #3  <signal handler called>
            #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
            #5  0x00001521681b2859 in __GI_abort () at abort.c:79
            #6  0x00001521681b2729 in __assert_fail_base (fmt=0x152168348588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
                assertion=0x562ed0e544b8 "(&(&thd->LOCK_thd_data)->m_mutex)->count > 0 && pthread_equal(pthread_self(), (&(&thd->LOCK_thd_data)->m_mutex)->thread)", 
                file=0x562ed0e54230 "/test/10.4_dbg/sql/service_wsrep.cc", line=271, function=<optimized out>) at assert.c:92
            #7  0x00001521681c3f36 in __GI___assert_fail (assertion=assertion@entry=0x562ed0e544b8 "(&(&thd->LOCK_thd_data)->m_mutex)->count > 0 && pthread_equal(pthread_self(), (&(&thd->LOCK_thd_data)->m_mutex)->thread)", 
                file=file@entry=0x562ed0e54230 "/test/10.4_dbg/sql/service_wsrep.cc", line=line@entry=271, function=function@entry=0x562ed0e54538 "my_bool wsrep_thd_is_aborting(const THD*)") at assert.c:101
            #8  0x0000562ed0358fb1 in wsrep_thd_is_aborting (thd=0x152140000d90) at /usr/include/pthread.h:1167
            #9  0x0000562ed0706563 in wsrep_assert_no_bf_bf_wait (lock=lock@entry=0x1521642354a0, trx=trx@entry=0x152164234118) at /test/10.4_dbg/storage/innobase/lock/lock0lock.cc:691
            #10 0x0000562ed0707e99 in lock_rec_has_to_wait (lock_is_on_supremum=false, lock2=0x1521642354a0, type_mode=1027, trx=0x152164234118, for_locking=true) at /test/10.4_dbg/storage/innobase/lock/lock0lock.cc:842
            #11 lock_rec_other_has_conflicting (mode=mode@entry=1027, block=block@entry=0x152119d26928, heap_no=heap_no@entry=84, trx=trx@entry=0x152164234118) at /test/10.4_dbg/storage/innobase/lock/lock0lock.cc:1168
            #12 0x0000562ed0714d5c in lock_rec_lock (impl=impl@entry=false, mode=mode@entry=1027, block=block@entry=0x152119d26928, heap_no=heap_no@entry=84, index=index@entry=0x1520f8518a40, thr=thr@entry=0x1520dc03ba20)
                at /test/10.4_dbg/storage/innobase/lock/lock0lock.cc:1949
            #13 0x0000562ed0715f2e in lock_clust_rec_read_check_and_lock (flags=flags@entry=0, block=block@entry=0x152119d26928, rec=rec@entry=0x15211a93d9e4 "\211\301ë®»\232", <incomplete sequence \312>, 
                index=index@entry=0x1520f8518a40, offsets=0x1521180c7490, mode=mode@entry=LOCK_X, gap_mode=1024, thr=0x1520dc03ba20) at /test/10.4_dbg/storage/innobase/lock/lock0lock.cc:5889
            #14 0x0000562ed07162ff in lock_clust_rec_read_check_and_lock_alt (flags=flags@entry=0, block=block@entry=0x152119d26928, rec=rec@entry=0x15211a93d9e4 "\211\301ë®»\232", <incomplete sequence \312>, 
                index=index@entry=0x1520f8518a40, mode=mode@entry=LOCK_X, gap_mode=gap_mode@entry=1024, thr=0x1520dc03ba20) at /test/10.4_dbg/storage/innobase/lock/lock0lock.cc:5937
            #15 0x0000562ed07ebf2f in row_ins_foreign_check_on_constraint (mtr=0x1521180c7df0, entry=0x1520dc0f8c40, pcur=0x1521180c7900, foreign=0x1520f851cd70, thr=0x1520dc03ba20)
                at /test/10.4_dbg/storage/innobase/row/row0ins.cc:1174
            #16 row_ins_check_foreign_constraint (check_ref=check_ref@entry=0, foreign=foreign@entry=0x1520f851cd70, table=table@entry=0x1520f8515fa0, entry=entry@entry=0x1520dc0f8c40, thr=thr@entry=0x1520dc03ba20)
                at /test/10.4_dbg/storage/innobase/row/row0ins.cc:1746
            #17 0x0000562ed08731f4 in row_upd_check_references_constraints (node=node@entry=0x1520dc03b2c8, pcur=pcur@entry=0x1520dc0443d0, table=0x1520f8515fa0, index=index@entry=0x1520f8517140, 
                offsets=offsets@entry=0x1521180c8400, thr=thr@entry=0x1520dc03ba20, mtr=0x1521180c88c0) at /test/10.4_dbg/storage/innobase/row/row0upd.cc:298
            #18 0x0000562ed0873e3c in row_upd_del_mark_clust_rec (mtr=0x1521180c88c0, foreign=false, referenced=true, thr=0x1520dc03ba20, offsets=0x1521180c8400, index=0x1520f8517140, node=0x1520dc03b2c8)
                at /test/10.4_dbg/storage/innobase/row/row0upd.cc:3020
            #19 row_upd_clust_step (node=node@entry=0x1520dc03b2c8, thr=thr@entry=0x1520dc03ba20) at /test/10.4_dbg/storage/innobase/row/row0upd.cc:3173
            #20 0x0000562ed08776d9 in row_upd (thr=0x1520dc03ba20, node=0x1520dc03b2c8) at /test/10.4_dbg/storage/innobase/row/row0upd.cc:3295
            #21 row_upd_step (thr=thr@entry=0x1520dc03ba20) at /test/10.4_dbg/storage/innobase/row/row0upd.cc:3439
            #22 0x0000562ed0815f9f in row_update_for_mysql (prebuilt=0x1520dc03a6d0) at /test/10.4_dbg/storage/innobase/row/row0mysql.cc:1809
            #23 0x0000562ed066dd2c in ha_innobase::delete_row (this=0x1520dc036158, record=0x1520dc036948 "\232b\377<") at /test/10.4_dbg/storage/innobase/handler/ha_innodb.cc:9014
            #24 0x0000562ed041496b in handler::ha_delete_row (this=0x1520dc036158, buf=0x1520dc036948 "\232b\377<") at /test/10.4_dbg/sql/handler.cc:6892
            #25 0x0000562ed0591ef5 in Delete_rows_log_event::do_exec_row (this=0x1520dc023ad0, rgi=<optimized out>) at /test/10.4_dbg/sql/log_event.cc:14495
            #26 0x0000562ed057ef76 in Rows_log_event::do_apply_event (this=0x1520dc023ad0, rgi=0x1520dc0210b0) at /test/10.4_dbg/sql/log_event.cc:11656
            #27 0x0000562ed034eacc in Log_event::apply_event (rgi=0x1520dc0210b0, this=0x1520dc023ad0) at /test/10.4_dbg/sql/log_event.h:1492
            #28 wsrep_apply_events (thd=thd@entry=0x1520dc000d90, rli=<optimized out>, events_buf=<optimized out>, buf_len=0) at /test/10.4_dbg/sql/wsrep_applier.cc:200
            #29 0x0000562ed0333424 in Wsrep_applier_service::apply_write_set (this=0x1521180cad50, ws_meta=..., data=...) at /test/10.4_dbg/wsrep-lib/include/wsrep/buffer.hpp:48
            #30 0x0000562ed0db034b in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /test/10.4_dbg/wsrep-lib/src/server_state.cpp:327
            #31 0x0000562ed0db1297 in wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /test/10.4_dbg/wsrep-lib/src/server_state.cpp:1145
            #32 0x0000562ed0dc1c3b in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x1521180cad50) at /test/10.4_dbg/wsrep-lib/include/wsrep/high_priority_service.hpp:47
            #33 (anonymous namespace)::apply_cb (ctx=ctx@entry=0x1521180cad50, wsh=wsh@entry=0x1521180c9dc0, flags=flags@entry=65, buf=buf@entry=0x1521180c9dd0, meta=meta@entry=0x1521180ca080, 
                exit_loop=exit_loop@entry=0x1521180ca03f) at /test/10.4_dbg/wsrep-lib/src/wsrep_provider_v26.cpp:504
            #34 0x0000152167a29765 in galera::TrxHandleSlave::apply (this=this@entry=0x1520dc07a210, recv_ctx=recv_ctx@entry=0x1521180cad50, 
                apply_cb=0x562ed0dc1ab8 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., 
                exit_loop=exit_loop@entry=@0x1521180ca03f: false) at /test/10.4_galera_dbg/galera/src/trx_handle.cpp:391
            #35 0x0000152167a39442 in galera::ReplicatorSMM::apply_trx (this=0x562ed209a350, recv_ctx=0x1521180cad50, ts=...) at /test/10.4_galera_dbg/galera/src/replicator_smm.cpp:504
            #36 0x0000152167a41378 in galera::ReplicatorSMM::process_trx (this=0x562ed209a350, recv_ctx=0x1521180cad50, ts_ptr=...) at /test/10.4_galera_dbg/galera/src/replicator_smm.cpp:2123
            #37 0x0000152167a71a43 in galera::GcsActionSource::process_writeset (this=0x562ed2093390, recv_ctx=0x1521180cad50, act=..., exit_loop=@0x1521180ca92f: false)
                at /test/10.4_galera_dbg/galera/src/gcs_action_source.cpp:62
            #38 0x0000152167a7241b in galera::GcsActionSource::dispatch (this=<optimized out>, recv_ctx=<optimized out>, act=..., exit_loop=<optimized out>) at /test/10.4_galera_dbg/galera/src/gcs_action_source.cpp:110
            #39 0x0000152167a72622 in galera::GcsActionSource::process (this=0x562ed2093390, recv_ctx=0x1521180cad50, exit_loop=@0x1521180ca92f: false) at /test/10.4_galera_dbg/galera/src/gcs_action_source.cpp:183
            #40 0x0000152167a3e3a0 in galera::ReplicatorSMM::async_recv (this=0x562ed209a350, recv_ctx=0x1521180cad50) at /test/10.4_galera_dbg/galera/src/replicator_smm.cpp:390
            #41 0x0000152167a158e1 in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at /test/10.4_galera_dbg/galera/src/wsrep_provider.cpp:236
            #42 0x0000562ed0dc209c in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at /test/10.4_dbg/wsrep-lib/src/wsrep_provider_v26.cpp:796
            #43 0x0000562ed03511a3 in wsrep_replication_process (thd=0x1520dc000d90, arg=<optimized out>) at /test/10.4_dbg/sql/wsrep_server_state.h:51
            #44 0x0000562ed033eab9 in start_wsrep_THD (arg=0x562ed2d3bfe0) at /test/10.4_dbg/sql/wsrep_mysqld.h:428
            #45 0x00001521686c0609 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #46 0x00001521682af293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            ramesh Ramesh Sivaraman made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]

            Roel Roel Van de Paar added a comment -

            Note that there may be two issues at play here: the 1213 error and the assert. We're looking at reducing towards either/both.

            ramesh Ramesh Sivaraman made changes -
            Affects Version/s 10.5 [ 23123 ]
            Affects Version/s 10.6 [ 24028 ]
            ramesh Ramesh Sivaraman made changes -
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            ramesh Ramesh Sivaraman made changes -
            Assignee Ramesh Sivaraman [ JIRAUSER48189 ] Jan Lindström [ jplindst ]
            wdoekes Walter Doekes added a comment -

            Chiming in here. We're seeing the same assertion failure on 10.4.22.

            I'm happy to learn that this isn't caused by local data corruption.

            mysqld: /home/buildbot/buildbot/build/mariadb-10.4.22/wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
            

            #0  __GI_abort () at abort.c:107
            #1  0x00007fa9817a748a in __assert_fail_base (fmt=0x7fa98192e750 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55c8ebb499c8 "mode_ == m_local || transaction_.is_streaming()", file=file@entry=0x55c8ebb1a130 "/home/buildbot/buildbot/build/mariadb-10.4.22/wsrep-lib/include/wsrep/client_state.hpp", line=line@entry=668, function=function@entry=0x55c8ebb49ae0 <wsrep::client_state::bf_abort(wsrep::seqno)::__PRETTY_FUNCTION__> "int wsrep::client_state::bf_abort(wsrep::seqno)") at assert.c:92
            #2  0x00007fa9817a7502 in __GI___assert_fail (assertion=assertion@entry=0x55c8ebb499c8 "mode_ == m_local || transaction_.is_streaming()", file=file@entry=0x55c8ebb1a130 "/home/buildbot/buildbot/build/mariadb-10.4.22/wsrep-lib/include/wsrep/client_state.hpp", line=line@entry=668, function=function@entry=0x55c8ebb49ae0 <wsrep::client_state::bf_abort(wsrep::seqno)::__PRETTY_FUNCTION__> "int wsrep::client_state::bf_abort(wsrep::seqno)") at assert.c:101
            #3  0x000055c8eb45192b in wsrep::client_state::bf_abort (bf_seqno=..., this=0x7fa7f8006ba0) at ./wsrep-lib/include/wsrep/client_state.hpp:668
            #4  wsrep_bf_abort (bf_thd=bf_thd@entry=0x7fa504000c08, victim_thd=victim_thd@entry=0x7fa7f8000c08) at ./sql/wsrep_thd.cc:399
            #5  0x000055c8eb459ac9 in wsrep_thd_bf_abort (bf_thd=bf_thd@entry=0x7fa504000c08, victim_thd=victim_thd@entry=0x7fa7f8000c08, signal=signal@entry=1 '\001') at ./sql/service_wsrep.cc:241
            #6  0x000055c8eb687f1f in wsrep_kill_victim (signal=1 '\001', victim_trx=0x7fa7ed4a5108, thd=0x7fa7f8000c08, bf_thd=0x7fa504000c08) at ./storage/innobase/handler/ha_innodb.cc:18628
            #7  wsrep_innobase_kill_one_trx (bf_thd=0x7fa504000c08, victim_trx=0x7fa7ed4a5108, signal=signal@entry=1 '\001') at ./storage/innobase/handler/ha_innodb.cc:18723
            #8  0x000055c8eb6ca75d in wsrep_kill_victim (lock=<optimized out>, trx=<optimized out>) at ./storage/innobase/lock/lock0lock.cc:1134
            #9  0x000055c8eb6cf566 in lock_rec_other_has_conflicting (mode=mode@entry=1026, block=block@entry=0x7fa6bc171240, heap_no=heap_no@entry=5, trx=trx@entry=0x7fa7ed4b16a8) at ./storage/innobase/lock/lock0lock.cc:1175
            #10 0x000055c8eb6d4a11 in lock_rec_lock (impl=impl@entry=false, mode=mode@entry=1026, block=0x7fa6bc171240, heap_no=5, index=index@entry=0x7fa7f802fe48, thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/lock/lock0lock.cc:1953
            #11 0x000055c8eb6d5700 in lock_clust_rec_read_check_and_lock (flags=flags@entry=0, block=block@entry=0x7fa6bc171240, rec=rec@entry=0x7fa6c08e417d <error: Cannot access memory at address 0x7fa6c08e417d>, index=index@entry=0x7fa7f802fe48, offsets=offsets@entry=0x7fa5c5125d00, mode=mode@entry=LOCK_S, gap_mode=1024, thr=0x7fa50407adb0) at ./storage/innobase/lock/lock0lock.cc:5889
            #12 0x000055c8eb743f7a in row_ins_set_shared_rec_lock (thr=0x7fa50407adb0, offsets=0x7fa5c5125d00, index=0x7fa7f802fe48, rec=<optimized out>, block=0x7fa6bc171240, type=1024) at ./storage/innobase/row/row0ins.cc:1400
            #13 row_ins_check_foreign_constraint (check_ref=check_ref@entry=0, foreign=foreign@entry=0x7fa7f8036638, table=table@entry=0x7fa50408f5d8, entry=entry@entry=0x7fa5041f1968, thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0ins.cc:1676
            #14 0x000055c8eb77ac58 in row_upd_check_references_constraints (node=node@entry=0x7fa504042858, pcur=pcur@entry=0x7fa5040e8640, table=0x7fa50408f5d8, index=index@entry=0x7fa5040903d8, offsets=offsets@entry=0x7fa5c5126e80, thr=thr@entry=0x7fa50407adb0, mtr=0x7fa5c5127340) at ./storage/innobase/row/row0upd.cc:298
            #15 0x000055c8eb77fbf6 in row_upd_check_references_constraints (mtr=0x7fa5c5127340, thr=0x7fa50407adb0, offsets=0x7fa5c5126e80, index=0x7fa5040903d8, table=<optimized out>, pcur=0x7fa5040e8640, node=0x7fa504042858) at ./storage/innobase/row/row0upd.cc:234
            #16 row_upd_del_mark_clust_rec (mtr=0x7fa5c5127340, foreign=false, referenced=true, thr=0x7fa50407adb0, offsets=0x7fa5c5126e80, index=0x7fa5040903d8, node=0x7fa504042858) at ./storage/innobase/row/row0upd.cc:3020
            #17 row_upd_clust_step (node=node@entry=0x7fa504042858, thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0upd.cc:3173
            #18 0x000055c8eb781edc in row_upd (thr=0x7fa50407adb0, node=0x7fa504042858) at ./storage/innobase/row/row0upd.cc:3295
            #19 row_upd_step (thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0upd.cc:3439
            #20 0x000055c8eb757059 in row_update_cascade_for_mysql (thr=thr@entry=0x7fa50407adb0, node=node@entry=0x7fa504042858, table=<optimized out>) at ./storage/innobase/row/row0mysql.cc:2235
            #21 0x000055c8eb7446b7 in row_ins_foreign_check_on_constraint (mtr=0x7fa5c5128700, entry=<optimized out>, pcur=0x7fa5c5127a90, foreign=<optimized out>, thr=0x7fa50407adb0) at ./storage/innobase/row/row0ins.cc:1338
            #22 row_ins_check_foreign_constraint (check_ref=check_ref@entry=0, foreign=foreign@entry=0x7fa504143c88, table=table@entry=0x7fa5040c3a98, entry=entry@entry=0x7fa5040c9ab0, thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0ins.cc:1746
            #23 0x000055c8eb77ac58 in row_upd_check_references_constraints (node=node@entry=0x7fa4c82414d8, pcur=pcur@entry=0x7fa504042a30, table=0x7fa5040c3a98, index=index@entry=0x7fa5040c4978, offsets=offsets@entry=0x7fa5c5128d00, thr=thr@entry=0x7fa50407adb0, mtr=0x7fa5c51291c0) at ./storage/innobase/row/row0upd.cc:298
            #24 0x000055c8eb77fbf6 in row_upd_check_references_constraints (mtr=0x7fa5c51291c0, thr=0x7fa50407adb0, offsets=0x7fa5c5128d00, index=0x7fa5040c4978, table=<optimized out>, pcur=0x7fa504042a30, node=0x7fa4c82414d8) at ./storage/innobase/row/row0upd.cc:234
            #25 row_upd_del_mark_clust_rec (mtr=0x7fa5c51291c0, foreign=false, referenced=true, thr=0x7fa50407adb0, offsets=0x7fa5c5128d00, index=0x7fa5040c4978, node=0x7fa4c82414d8) at ./storage/innobase/row/row0upd.cc:3020
            #26 row_upd_clust_step (node=node@entry=0x7fa4c82414d8, thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0upd.cc:3173
            #27 0x000055c8eb781edc in row_upd (thr=0x7fa50407adb0, node=0x7fa4c82414d8) at ./storage/innobase/row/row0upd.cc:3295
            #28 row_upd_step (thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0upd.cc:3439
            #29 0x000055c8eb757059 in row_update_cascade_for_mysql (thr=thr@entry=0x7fa50407adb0, node=node@entry=0x7fa4c82414d8, table=<optimized out>) at ./storage/innobase/row/row0mysql.cc:2235
            #30 0x000055c8eb7446b7 in row_ins_foreign_check_on_constraint (mtr=0x7fa5c512a580, entry=<optimized out>, pcur=0x7fa5c5129910, foreign=<optimized out>, thr=0x7fa50407adb0) at ./storage/innobase/row/row0ins.cc:1338
            #31 row_ins_check_foreign_constraint (check_ref=check_ref@entry=0, foreign=foreign@entry=0x7fa5040c3858, table=table@entry=0x7fa50408f5d8, entry=entry@entry=0x7fa5041f0a78, thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0ins.cc:1746
            #32 0x000055c8eb77ac58 in row_upd_check_references_constraints (node=node@entry=0x7fa50407a8a8, pcur=pcur@entry=0x7fa52811e730, table=0x7fa50408f5d8, index=index@entry=0x7fa5040903d8, offsets=offsets@entry=0x7fa5c512ab80, thr=thr@entry=0x7fa50407adb0, mtr=0x7fa5c512b040) at ./storage/innobase/row/row0upd.cc:298
            #33 0x000055c8eb77fbf6 in row_upd_check_references_constraints (mtr=0x7fa5c512b040, thr=0x7fa50407adb0, offsets=0x7fa5c512ab80, index=0x7fa5040903d8, table=<optimized out>, pcur=0x7fa52811e730, node=0x7fa50407a8a8) at ./storage/innobase/row/row0upd.cc:234
            #34 row_upd_del_mark_clust_rec (mtr=0x7fa5c512b040, foreign=false, referenced=true, thr=0x7fa50407adb0, offsets=0x7fa5c512ab80, index=0x7fa5040903d8, node=0x7fa50407a8a8) at ./storage/innobase/row/row0upd.cc:3020
            #35 row_upd_clust_step (node=node@entry=0x7fa50407a8a8, thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0upd.cc:3173
            #36 0x000055c8eb781edc in row_upd (thr=0x7fa50407adb0, node=0x7fa50407a8a8) at ./storage/innobase/row/row0upd.cc:3295
            #37 row_upd_step (thr=thr@entry=0x7fa50407adb0) at ./storage/innobase/row/row0upd.cc:3439
            #38 0x000055c8eb758e24 in row_update_for_mysql (prebuilt=0x7fa504079f08) at ./storage/innobase/row/row0mysql.cc:1809
            #39 0x000055c8eb68a005 in ha_innobase::delete_row (this=0x7fa504075650, record=0x7fa504078540 "\327\005") at ./storage/innobase/handler/ha_innodb.cc:9013
            #40 0x000055c8eb4e895f in handler::ha_delete_row (this=0x7fa504075650, buf=0x7fa504078540 "\327\005") at ./sql/handler.cc:6892
            #41 0x000055c8eb5e05fd in Delete_rows_log_event::do_exec_row (this=0x7fa504046318, rgi=<optimized out>) at ./sql/log_event.cc:14495
            #42 0x000055c8eb5d39a7 in Rows_log_event::do_apply_event (this=0x7fa504046318, rgi=0x7fa50401c4b0) at ./sql/log_event.cc:11656
            #43 0x000055c8eb44f848 in Log_event::apply_event (rgi=0x7fa50401c4b0, this=0x7fa504046318) at ./sql/log_event.h:1492
            #44 wsrep_apply_events (thd=thd@entry=0x7fa504000c08, rli=<optimized out>, events_buf=<optimized out>, buf_len=0) at ./sql/wsrep_applier.cc:200
            #45 0x000055c8eb435e5d in Wsrep_applier_service::apply_write_set (this=0x7fa5c512ccf0, ws_meta=..., data=...) at ./sql/wsrep_high_priority_service.cc:550
            #46 0x000055c8ebae577a in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:328
            #47 0x000055c8ebae6505 in wsrep::server_state::on_apply (this=<optimized out>, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at ./wsrep-lib/src/server_state.cpp:1146
            #48 0x000055c8ebafafb4 in wsrep::high_priority_service::apply (data=..., ws_meta=..., ws_handle=..., this=0x7fa5c512ccf0) at ./wsrep-lib/include/wsrep/high_priority_service.hpp:47
            #49 (anonymous namespace)::apply_cb (ctx=0x7fa5c512ccf0, wsh=<optimized out>, flags=<optimized out>, buf=<optimized out>, meta=<optimized out>, exit_loop=0x7fa5c512c00f) at ./wsrep-lib/src/wsrep_provider_v26.cpp:504
            #50 0x00007fa954a15275 in galera::TrxHandleSlave::apply (this=this@entry=0x7fa4f401fb00, recv_ctx=recv_ctx@entry=0x7fa5c512ccf0, apply_cb=0x55c8ebafae80 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7fa5c512c00f: false) at ./galera/src/trx_handle.cpp:391
            #51 0x00007fa954a26c10 in galera::ReplicatorSMM::apply_trx (this=this@entry=0x55c8ed421ba0, recv_ctx=recv_ctx@entry=0x7fa5c512ccf0, ts=...) at ./galera/src/replicator_smm.cpp:504
            #52 0x00007fa954a2d1b5 in galera::ReplicatorSMM::process_trx (this=0x55c8ed421ba0, recv_ctx=0x7fa5c512ccf0, ts_ptr=...) at ./galera/src/replicator_smm.cpp:2123
            #53 0x00007fa954a5f825 in galera::GcsActionSource::process_writeset (this=0x55c8ed42f0c0, recv_ctx=0x7fa5c512ccf0, act=..., exit_loop=@0x7fa5c512c8ff: false) at ./galera/src/gcs_action_source.cpp:62
            #54 0x00007fa954a60432 in galera::GcsActionSource::dispatch (this=this@entry=0x55c8ed42f0c0, recv_ctx=recv_ctx@entry=0x7fa5c512ccf0, act=..., exit_loop=@0x7fa5c512c8ff: false) at ./galera/src/gcs_action_source.cpp:110
            #55 0x00007fa954a606f2 in galera::GcsActionSource::process (this=0x55c8ed42f0c0, recv_ctx=0x7fa5c512ccf0, exit_loop=@0x7fa5c512c8ff: false) at ./galera/src/gcs_action_source.cpp:183
            #56 0x00007fa954a2dbd0 in galera::ReplicatorSMM::async_recv (this=0x55c8ed421ba0, recv_ctx=0x7fa5c512ccf0) at ./galera/src/replicator_smm.cpp:390
            #57 0x00007fa9549ff9cb in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at ./galera/src/wsrep_provider.cpp:236
            #58 0x000055c8ebafbbfe in wsrep::wsrep_provider_v26::run_applier (this=<optimized out>, applier_ctx=<optimized out>) at ./wsrep-lib/src/wsrep_provider_v26.cpp:796
            #59 0x000055c8eb451a2f in wsrep_replication_process (thd=0x7fa504000c08, arg=<optimized out>) at ./sql/wsrep_thd.cc:57
            #60 0x000055c8eb440883 in start_wsrep_THD (arg=arg@entry=0x55c8f89b4170) at ./sql/wsrep_mysqld.cc:2932
            #61 0x000055c8eba180aa in pfs_spawn_thread (arg=0x55c8f89d3558) at ./storage/perfschema/pfs.cc:1869
            #62 0x00007fa98249a6db in start_thread (arg=0x7fa5c512d700) at pthread_create.c:463
            #63 0x00007fa98189871f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            Interesting to see how it visits row_upd_step three times (hence the relatively long backtrace).
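            The nested row_update_cascade_for_mysql / row_upd_step frames suggest the DELETE is walking a chain of ON DELETE foreign-key actions. A minimal, purely hypothetical schema that would produce the same nested pattern (the real tables, shown further down in this thread, have a more involved foreign-key graph):

            CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;
            CREATE TABLE child (
              id INT PRIMARY KEY,
              parent_id INT,
              FOREIGN KEY (parent_id) REFERENCES parent (id) ON DELETE CASCADE
            ) ENGINE=InnoDB;
            CREATE TABLE grandchild (
              id INT PRIMARY KEY,
              child_id INT,
              FOREIGN KEY (child_id) REFERENCES child (id) ON DELETE CASCADE
            ) ENGINE=InnoDB;

            -- A single row-level delete on parent fires a cascaded operation on child,
            -- which in turn fires one on grandchild; each cascade level runs its own
            -- update node, so row_upd_step appears once per level in the backtrace.
            DELETE FROM parent WHERE id = 1;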

            According to the abort-handler, the query executed was:

            DELETE FROM `elements` WHERE (NOT (`dateDeleted` IS NULL)) AND (`dateDeleted` < '2021-10-26 18:57:37')
            

            According to the full backtrace, it was doing DELETE_ROWS_EVENT_V1.

            Let me know if it helps if I fetch additional info from the dump.

            And, good to know you're already working on this.

            Cheers,
            Walter

            wdoekes Walter Doekes added a comment -

            wdoekes Can we have the table structure of the elements table? Can you reproduce this problem and how? Can we have the error log, preferably with the --wsrep-debug=1 setting?

            jplindst Jan Lindström (Inactive) added a comment -
            jplindst Jan Lindström (Inactive) made changes -
            Labels need_feedback
            jplindst Jan Lindström (Inactive) made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            wdoekes Walter Doekes added a comment -

            Can we have the table structure of the elements table?

            (gdb) frame 13
            #13 row_ins_check_foreign_constraint (check_ref=check_ref@entry=0, foreign=foreign@entry=0x7fa7f8036638, table=table@entry=0x7fa50408f5d8, entry=entry@entry=0x7fa5041f1968, thr=thr@entry=0x7fa50407adb0)
                at ./storage/innobase/row/row0ins.cc:1676
             
            (gdb) info args
            check_ref = 0
            foreign = 0x7fa7f8036638
            table = 0x7fa50408f5d8
            entry = 0x7fa5041f1968
            thr = 0x7fa50407adb0
             
            (gdb) print table
            $1 = (dict_table_t *) 0x7fa50408f5d8
             
            (gdb) print *table
            $2 = {id = 27254, id_hash = 0x0, name = {m_name = 0x7fa504175ea0 "XXX_2021/elements", static part_suffix = "#P#"}, ...
             
            (gdb) print *foreign
            $3 = {heap = 0x7fa7f80364f0, id = 0x7fa7f80366a8 "XXX_2021/fk_zwtrgocenvwvzjfhlecdejrqyipdqnxugjoh", n_fields = 1, type = 1, foreign_table_name = 0x7fa7f80366e0 "XXX_2021/entries", 
              foreign_table_name_lookup = 0x7fa7f80366e0 "XXX_2021/entries", foreign_table = 0x7fa7f8029278, foreign_col_names = 0x7fa7f8036710, referenced_table_name = 0x7fa7f80366f8 "XXX_2021/elements", 
              referenced_table_name_lookup = 0x7fa7f80366f8 "XXX_2021/elements", referenced_table = 0x7fa50408f5d8, referenced_col_names = 0x7fa7f8036718, foreign_index = 0x7fa7f802fe48, 
              referenced_index = 0x7fa5040903d8, v_cols = 0x0}
            

            structure:

            MariaDB [(none)]> show create table XXX_2021.elements;
             
            CREATE TABLE `elements` (
              `id` int(11) NOT NULL AUTO_INCREMENT,
              `canonicalId` int(11) DEFAULT NULL,
              `draftId` int(11) DEFAULT NULL,
              `revisionId` int(11) DEFAULT NULL,
              `fieldLayoutId` int(11) DEFAULT NULL,
              `type` varchar(255) NOT NULL,
              `enabled` tinyint(1) NOT NULL DEFAULT 1,
              `archived` tinyint(1) NOT NULL DEFAULT 0,
              `dateCreated` datetime NOT NULL,
              `dateUpdated` datetime NOT NULL,
              `dateLastMerged` datetime DEFAULT NULL,
              `dateDeleted` datetime DEFAULT NULL,
              `uid` char(36) NOT NULL DEFAULT '0',
              PRIMARY KEY (`id`),
              KEY `idx_lyxzzoporzhezehtmrgbzibwuugxuzzqkuus` (`dateDeleted`),
              KEY `idx_kepcjeufntzfbvurnklnyrshlkauvlsbbhyp` (`fieldLayoutId`),
              KEY `idx_jsflfbvgimumupwbfhjkaewefrtzqzemfcxf` (`type`),
              KEY `idx_wswfnejwarlvqlyzkegmzyoskpovqrhvnuhh` (`enabled`),
              KEY `idx_luwcwrtqmkvxbnbwovadsalqdeybuvymrnqp` (`archived`,`dateCreated`),
              KEY `fk_pdqpxunbdgpmmxafqjlntwvjnntdfmpyfrpw` (`draftId`),
              KEY `fk_surxehxvjuzuskcukwuhvtvpnrmyafwtzdzc` (`revisionId`),
              KEY `fk_dygcjtouufcsitultbhinztkxfcnzpfukuhg` (`canonicalId`),
              KEY `idx_ccmkvbunjtifciakrjazqzzetgbtxiqzhjty` (`archived`,`dateDeleted`,`draftId`,`revisionId`,`canonicalId`),
              CONSTRAINT `fk_dygcjtouufcsitultbhinztkxfcnzpfukuhg` FOREIGN KEY (`canonicalId`) REFERENCES `elements` (`id`) ON DELETE SET NULL,
              CONSTRAINT `fk_onuebrbzmawzrbfgcteaqzqnfkiinqrcnfyv` FOREIGN KEY (`fieldLayoutId`) REFERENCES `fieldlayouts` (`id`) ON DELETE SET NULL,
              CONSTRAINT `fk_pdqpxunbdgpmmxafqjlntwvjnntdfmpyfrpw` FOREIGN KEY (`draftId`) REFERENCES `drafts` (`id`) ON DELETE CASCADE,
              CONSTRAINT `fk_surxehxvjuzuskcukwuhvtvpnrmyafwtzdzc` FOREIGN KEY (`revisionId`) REFERENCES `revisions` (`id`) ON DELETE CASCADE
            ) ENGINE=InnoDB AUTO_INCREMENT=2696 DEFAULT CHARSET=utf8
            

            –

            Can you reproduce this problem and how?

            Not that I'm aware of. This is a mariadb/galera setup in K8S. I'm not intimately familiar with which queries are executed and how often. And this is the first time I've seen this particular crash.

            However, now that I'm looking for error logs, I do see that this particular crash happened on 2/3 nodes. I had only seen the one crash, but apparently I have a similar core dump on another node (100% identical backtrace).

            Apparently, the crash recovery went as follows:

            • node-0 crashed
            • node-2 crashed
            • node-1 was lonely
            • node-2 recovered and agreed with node-1
            • node-0 did sst and recovered later

            –

            Can we have the error log, preferably with the --wsrep-debug=1 setting?

            I have the error logs from stdout/err, but they contain nothing of interest until the crash. (Only the occasional Aborted connection NNN to db in there.)

            I'll try to get them to enable --wsrep-debug=1 before the next occurrence.
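
            For reference, a minimal sketch of what that amounts to (assumption: on 10.4.3+ wsrep_debug is an enumerated, dynamic variable, so the numeric level used on the command line can also be set at runtime; putting wsrep_debug = 1 under [mysqld] in the config file would make it persistent):

            -- enable the requested wsrep debug logging on a running node
            SET GLOBAL wsrep_debug = 1;
            -- confirm the new setting
            SHOW GLOBAL VARIABLES LIKE 'wsrep_debug';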

            wdoekes Walter Doekes made changes -
            Attachment unable-to-read-page.fatal.log [ 60989 ]
            wdoekes Walter Doekes added a comment -

            Ok. So we enabled --wsrep-debug=1.

            I'm not sure if this was a coincidence or related, but immediately during startup one node crashed with a seemingly unrelated backtrace. After various warnings about corrupt data:

            2021-11-30 11:28:08 7 [ERROR] InnoDB: Space id and page no stored in the page, read in are [page id: space=9142, page number=327453354], should be [page id: space=9142, page number=0]
            2021-11-30 11:28:08 7 [ERROR] InnoDB: Database page corruption on disk or a failed file read of tablespace XXXmuseum/FTS_00000000000023c3_0000000000005ea1_INDEX_2 page [page id: space=9142, page number=0]. You may have to recover from a backup.
            ...
            

            It ended in this fatal error:

                                    ib::fatal() << "Unable to read page " << page_id
                                            << " into the buffer pool after "
                                            << BUF_PAGE_READ_MAX_RETRIES
                                            << ". The most probable cause"
                                            " of this error may be that the"
                                            " table has been corrupted."
                                            " See https://mariadb.com/kb/en/library/innodb-recovery-modes/";
            

            #0  __GI_abort () at abort.c:107
            #1  0x000055e0545a4ea1 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at ./storage/innobase/ut/ut0ut.cc:604
            #2  0x000055e0545e1aeb in buf_page_get_low (page_id=..., zip_size=0, rw_latch=rw_latch@entry=2, guess=<optimized out>, mode=10, file=0x55e054a6a570 "/home/buildbot/buildbot/build/mariadb-10.4.22/storage/innobase/fil/fil0fil.cc", line=3830, mtr=0x7f83252a6f80, err=0x0) at ./storage/innobase/buf/buf0buf.cc:4402
            #3  0x000055e0545e1ea6 in buf_page_get_gen (page_id=..., zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=2, guess=guess@entry=0x0, mode=mode@entry=10, file=file@entry=0x55e054a6a570 "/home/buildbot/buildbot/build/mariadb-10.4.22/storage/innobase/fil/fil0fil.cc", line=<optimized out>, mtr=<optimized out>, err=<optimized out>) at ./storage/innobase/buf/buf0buf.cc:4874
            #4  0x000055e054651fd3 in fsp_flags_try_adjust (space=space@entry=0x55e061f96740, flags=flags@entry=33) at ./storage/innobase/fil/fil0fil.cc:3828
            #5  0x000055e05465346f in fil_space_for_table_exists_in_mem (id=9142, name=0x7f82b0035680 "XXXmuseum/FTS_", '0' <repeats 12 times>, "23c3_", '0' <repeats 12 times>, "5ea1_INDEX_2", table_flags=<optimized out>) at ./storage/innobase/fil/fil0fil.cc:3897
            #6  0x000055e054626b9e in dict_load_tablespace (ignore_err=DICT_ERR_IGNORE_FK_NOKEY, table=0x7f82b0035778) at ./storage/innobase/dict/dict0load.cc:2797
            #7  dict_load_table_one (name=..., ignore_err=ignore_err@entry=DICT_ERR_IGNORE_FK_NOKEY, fk_tables=std::deque with 0 elements) at ./storage/innobase/dict/dict0load.cc:2947
            #8  0x000055e054629101 in dict_load_table (name=0x7f82b000be40 "XXXmuseum/FTS_", '0' <repeats 12 times>, "23c3_", '0' <repeats 12 times>, "5ea1_INDEX_2", ignore_err=ignore_err@entry=DICT_ERR_IGNORE_FK_NOKEY) at ./storage/innobase/dict/dict0load.cc:2756
            #9  0x000055e05462a2f4 in dict_load_table_on_id (table_id=table_id@entry=9157, ignore_err=DICT_ERR_IGNORE_FK_NOKEY) at ./storage/innobase/dict/dict0load.cc:3194
            #10 0x000055e0546190b6 in dict_table_open_on_id_low (cached_only=false, ignore_err=DICT_ERR_IGNORE_FK_NOKEY, table_id=9157) at ./storage/innobase/dict/dict0dict.cc:237
            #11 dict_table_open_on_id (table_id=9157, dict_locked=dict_locked@entry=0, table_op=table_op@entry=DICT_TABLE_OP_NORMAL) at ./storage/innobase/dict/dict0dict.cc:747
            #12 0x000055e054536774 in row_purge_parse_undo_rec (thr=0x55e061e51c10, updated_extern=0x7f83252a9b37, undo_rec=0x7f82b8011da8 "", node=0x55e061e51d38) at ./storage/innobase/row/row0purge.cc:1074
            #13 row_purge (thr=0x55e061e51c10, undo_rec=0x7f82b8011da8 "", node=0x55e061e51d38) at ./storage/innobase/row/row0purge.cc:1261
            #14 row_purge_step (thr=thr@entry=0x55e061e51c10) at ./storage/innobase/row/row0purge.cc:1342
            #15 0x000055e0544ed4d2 in que_thr_step (thr=0x55e061e51c10) at ./storage/innobase/que/que0que.cc:1038
            #16 que_run_threads_low (thr=0x55e061e51c10) at ./storage/innobase/que/que0que.cc:1100
            #17 que_run_threads (thr=thr@entry=0x55e061e51c10) at ./storage/innobase/que/que0que.cc:1140
            #18 0x000055e05455b8e7 in srv_task_execute () at ./storage/innobase/srv/srv0srv.cc:2465
            #19 srv_worker_thread (arg=<optimized out>) at ./storage/innobase/srv/srv0srv.cc:2520
            #20 0x00007f86e08606db in start_thread (arg=0x7f83252aa700) at pthread_create.c:463
            #21 0x00007f86dfc5e71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            I have since checked this table – the ..000023c3 (9155) table_id in information_schema.innodb_sys_tables pointed to XXXmuseum.searchindex (a SQL sketch of these checks follows the list):

            • analyze table was happy;
            • check table was happy;
            • I mysqldumped and reimported it anyway.
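
            A sketch of the checks listed above (the table name is taken from the log; the INNODB_SYS_TABLES lookup maps the internal table id back to a name; the dump/reload step itself was done with mysqldump and mysql from the shell):

            SELECT name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE TABLE_ID = 0x23c3;
            ANALYZE TABLE XXXmuseum.searchindex;
            CHECK TABLE XXXmuseum.searchindex;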

            The full log is attached as unable-to-read-page.fatal.log. (The warnings about innodb_checksum_algorithm refer to it being set to none, because the data is stored on ZFS, which does its own checksumming.)

            For now, I'll assume this is not related to this particular bug.

            marko Marko Mäkelä added a comment - - edited

            wdoekes, your corruption seems to affect an internal table that implements a partition of the inverted index for a fulltext index. Those tables are only read by special queries, not by ANALYZE or CHECK. The table (SELECT name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE TABLE_ID=0x23c3) could have been repaired by dropping and re-creating the fulltext index that has the ID 0x5ea1.

            I have stated in MDEV-24845 my belief that Galera rsync SST is unsafe to use unless you configure skip_innodb_use_native_aio (which would destroy performance). It is also possible that there are bugs in mariadb-backup. I believe that once the race condition with rsync snapshot transfer has been fixed, it should be safer than mariadb-backup. Come to think of it, if snapshot transfers are being executed from different sources (not always from the same "original master server"), corruption could accumulate.
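
            For what it's worth, a sketch of the targeted repair described above (assumptions: the affected table is XXXmuseum.searchindex, and ft_keywords / keywords are placeholder names for the fulltext index and its column; the real index name can be read from INNODB_SYS_INDEXES):

            -- map the index id from the aux-table name back to the user-visible index
            SELECT name, table_id FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES WHERE INDEX_ID = 0x5ea1;

            -- drop and re-create only the affected FULLTEXT index (placeholder names)
            ALTER TABLE XXXmuseum.searchindex DROP INDEX ft_keywords;
            ALTER TABLE XXXmuseum.searchindex ADD FULLTEXT INDEX ft_keywords (keywords);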

            wdoekes Walter Doekes added a comment -

            Hi Marko, thanks for your feedback.

            Those tables are only read by special queries, not by ANALYZE or CHECK.

            Okay. I was unsure whether to expect new corruption anyway, as the system auto-healed after pod restart with a new IST.

            The table (SELECT name FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES WHERE TABLE_ID=0x23c3) could have been repaired by dropping and re-creating the fulltext index that has the ID 0x5ea1.

            I found the table (0x23c3) before I found the index (0x5ea1), so I thought it safest to DROP and re-CREATE. This was probably not much slower than removing the only index and re-adding it. (Only 300K records.)

            I guess the table/index was already ok after the IST though (because of fresh IST data?), otherwise I would've expected the same fatal() error.

            I have stated [...] my belief that Galera rsync SST is unsafe to use when you use. [...]

            Not sure where that sentence ends. But we're using wsrep_sst_method = mariabackup, so if we're dealing with corruption from SSTs, it would be from mariadb-backup.

            Also, you say "different sources". In our case, that is continuously the case: in the multi node setup in K8S, we're disconnecting/reconnecting nodes at will, so there is/was never a true "original master". I'll go check out MDEV-24845 to see if there is anything that appears relevant to us.


            wdoekes, sorry, please disregard the "when you use" or see my edited comment.

            It is unnecessary to rebuild or reload an entire table if the corruption is known to affect only some indexes. The index ID would only matter if there were multiple fulltext indexes defined on the table, and you wanted to avoid dropping and re-creating some of them. I do not think that InnoDB has any self-healing capabilities. Corruption can remain dormant for a long time until the data is being accessed. I do not think that applying an incremental snapshot transfer (IST) would have healed the fulltext index, unless it included a DROP INDEX statement or similar.

            marko Marko Mäkelä added a comment -

            #4  0x00005653df4e76a5 in wsrep::client_state::bf_abort (bf_seqno=..., this=0x29ad24006be0)
                at /home/buildbot/buildbot/build/wsrep-lib/include/wsrep/client_state.hpp:668
            668	/home/buildbot/buildbot/build/wsrep-lib/include/wsrep/client_state.hpp: No such file or directory.
            (rr) p mode_
            $1 = wsrep::client_state::m_high_priority
            (rr) p transaction_.streaming_context_
            $2 = {fragments_certified_ = 0, fragments_ = std::vector of length 0, capacity 0, rollback_replicated_for_ = {id_ = 18446744073709551615}, 
              fragment_unit_ = wsrep::streaming_context::bytes, fragment_size_ = 0, unit_counter_ = 0, log_position_ = 0}
            

            jplindst Jan Lindström (Inactive) added a comment -

            2021-12-01 14:32:27 0 [ERROR] WSREP: async IST sender failed to serve tcp://172.20.2.35:4568: ist send failed: ', asio error 'Failed to write: Connection reset by peer: 104 (Connection reset by peer)
            	 at /home/buildbot/buildbot/build/galerautils/src/gu_asio_stream_react.cpp:throw_sync_op_error():141': 104 (Connection reset by peer)
            	 at /home/buildbot/buildbot/build/galera/src/ist.cpp:send():704
            2021-12-01 14:32:27 0 [Note] WSREP: async IST sender served
            2021-12-01 14:32:32 0 [Note] WSREP:  cleaning up 99490f01-ae4c (tcp://172.20.2.35:4567)
            2021-12-01 14:33:10 20 [Note] WSREP: wsrep_cs().bf_abort: BF: thread 20 client_state exec client_mode high priority trans_state executing query DELETE from SEQUENCE where id =  NAME_CONST('seqId',7123015)
            2021-12-01 14:33:10 20 [Note] WSREP: wsrep_cs().bf_abort: victim: thread 18 client_state exec client_mode high priority trans_state executing query DELETE from EVENT where parcel_id =  NAME_CONST('pclId',7216042)
            2021-12-01 14:33:10 20 [Note] WSREP: wsrep_cs().bf_abort victim:BF 1 local 0 applying 1 toi 0 local_toi 0 SR 0
            2021-12-01 14:33:10 20 [Note] WSREP: wsrep_cs().bf_abort bf:BF 1 local 0 applying 1 toi 0 local_toi 0 SR 0
            mariadbd: /test/mtest/10.4_opt/wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
            211201 14:33:10 [ERROR] mysqld got signal 6 ;
            

            jplindst Jan Lindström (Inactive) added a comment -
            jplindst Jan Lindström (Inactive) made changes -
            Attachment gdb.txt [ 61018 ]
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Seppo Jaakola [ seppo ]
            wdoekes Walter Doekes made changes -
            Attachment second-of-crash.combined.log [ 61024 ]
            wdoekes Walter Doekes added a comment -

            Ok. So we enabled --wsrep-debug=1.

            So, we had another crash – this time on node-1 – that looks like MDEV-27115, which crashed somewhere due to SELECT RELEASE_LOCK('6727d278c0ccd054a69ff97d2725bce3404a599c'). Backtrace:

            Program terminated with signal SIGSEGV, Segmentation fault.
            #0  MDL_ticket::destroy (ticket=0x7fc2ec0e8fa0) at ./sql/mdl.cc:1043
            #1  0x000055f985670c7c in Item_func_release_lock::val_int (this=0x7fc2ec010800) at ./sql/item_func.cc:4256
            ...
            

            But when scraping together all the logs I found this query:

            command: 161 query: DELETE FROM `searchindex` WHERE ((`elementId`=2152330) AND (`siteId`=2))
            

            and this log entry:

            2021-12-02T09:23:38.546912+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: wsrep_thd_set_wsrep_aborter called
            

            So, it looks very related to this bug after all.

            See the second-of-crash.combined.log where the action starts around 2021-12-02T09:23:38.546834:

            2021-12-02T09:23:38.546834+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: cluster conflict due to high priority abort for threads:
            2021-12-02T09:23:38.546849+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: Winning thread: 
            2021-12-02T09:23:38.546854+0100 node-1:    THD: 27, mode: high priority, state: exec, conflict: executing, seqno: 34787904
            2021-12-02T09:23:38.546860+0100 node-1: [144B blob data]
            2021-12-02T09:23:38.546865+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: Victim thread: 
            2021-12-02T09:23:38.546869+0100 node-1:    THD: 440952, mode: local, state: idle, conflict: executing, seqno: -1
            2021-12-02T09:23:38.546875+0100 node-1:    SQL: NULL
            2021-12-02T09:23:38.546882+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: context: /home/buildbot/buildbot/build/mariadb-10.4.22/storage/innobase/handler/ha_innodb.cc:18695
            2021-12-02T09:23:38.546892+0100 node-1: [331B blob data]
            2021-12-02T09:23:38.546907+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: wsrep_innobase_kill_one_trx: Victim normal trx_id: 520524882 thread: 440952 seqno: -1 client_state: idle client_mode: local trx_state executing query: NULL
            2021-12-02T09:23:38.546912+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: wsrep_thd_set_wsrep_aborter called
            2021-12-02T09:23:38.546916+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: wsrep_bf_abort BF aborter before
            2021-12-02T09:23:38.546921+0100 node-1:     thd: 27 thd_ptr: 0x7fc338000c08 client_mode: high priority client_state: exec trx_state: executing
            2021-12-02T09:23:38.546928+0100 node-1:     next_trx_id: 40649945 trx_id: 41550118 seqno: 34787904
            2021-12-02T09:23:38.546933+0100 node-1:     is_streaming: 0 fragments: 0
            2021-12-02T09:23:38.546937+0100 node-1:     sql_errno: 0 message: 
            2021-12-02T09:23:38.546943+0100 node-1:     command: 161 query: DELETE FROM `searchindex` WHERE ((`elementId`=2152330) AND (`siteId`=2))
            2021-12-02T09:23:38.546947+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: wsrep_bf_abort victim before
            2021-12-02T09:23:38.546950+0100 node-1:     thd: 440952 thd_ptr: 0x7fc2ec000c08 client_mode: local client_state: idle trx_state: executing
            2021-12-02T09:23:38.546953+0100 node-0: 2021-12-02  8:23:38 11 [Note] WSREP: wsrep_commit_empty(11)
            2021-12-02T09:23:38.546956+0100 node-1:     next_trx_id: 40649944 trx_id: 40644041 seqno: -1
            2021-12-02T09:23:38.546962+0100 node-1:     is_streaming: 0 fragments: 0
            2021-12-02T09:23:38.546963+0100 node-0: 2021-12-02  8:23:38 11 [Note] WSREP: assigned new next trx id: 40934578
            2021-12-02T09:23:38.546966+0100 node-1:     sql_errno: 0 message: 
            2021-12-02T09:23:38.546970+0100 node-1:     command: 0 query: NULL
            2021-12-02T09:23:38.546974+0100 node-1: 2021-12-02  8:23:38 27 [Note] WSREP: enqueuing trx abort for (440952)
            2021-12-02T09:23:38.546978+0100 node-1: 2021-12-02  8:23:38 440952 [Note] WSREP: Rollbacker aborting local thd (440952 140494756587264)
            

            2021-12-02T09:23:38.578175+0100 node-1: 2021-12-02  8:23:38 34 [Note] WSREP: assigned new next trx id: 40649949
            2021-12-02T09:23:38.578190+0100 node-1: 2021-12-02  8:23:38 440952 [Note] WSREP: do_command enter found BF aborted
            2021-12-02T09:23:38.578197+0100 node-1:     thd: 440952 thd_ptr: 0x7fc2ec000c08 client_mode: local client_state: exec trx_state: aborted
            2021-12-02T09:23:38.578204+0100 node-1:     next_trx_id: 40649944 trx_id: -1 seqno: -1
            2021-12-02T09:23:38.578211+0100 node-1:     is_streaming: 0 fragments: 0
            2021-12-02T09:23:38.578218+0100 node-1:     sql_errno: 0 message: 
            2021-12-02T09:23:38.578226+0100 node-1:     command: 52 query: NULL
            2021-12-02T09:23:38.578234+0100 node-1: 2021-12-02  8:23:38 440952 [Note] WSREP: Deadlock error for: (null)
            

            I have earlier logs if needed. I have a core dump. But maybe this is sufficient info already? Let me know.

            wdoekes Walter Doekes made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 126127 ] MariaDB v4 [ 144419 ]
            julien.fritsch Julien Fritsch made changes -
            Status Confirmed [ 10101 ] Open [ 1 ]
            julien.fritsch Julien Fritsch made changes -
            Status Open [ 1 ] Needs Feedback [ 10501 ]
            julien.fritsch Julien Fritsch made changes -
            Labels need_feedback
            jplindst Jan Lindström (Inactive) made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]
            jplindst Jan Lindström (Inactive) made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            seppo Seppo Jaakola made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            seppo Seppo Jaakola added a comment -

            I was able to reproduce the issue. The offending use case appears to be a structure where a foreign key child table has two foreign key constraints with the ON DELETE CASCADE option, pointing to two separate parent tables. Two client sessions then each delete a row from one of the parent tables, and both parent rows have the same dependent row in the foreign key child table. Both delete operations will cascade to delete that same row in the child table, but only one of the sessions actually performs the cascading delete (the later session no longer finds the row). As a result, the generated replicated write sets do not carry the key information that would show the dependency between them. These write sets are not safe to apply in parallel, but the current MariaDB version does not restrict this.
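
            To make the scenario concrete, here is a minimal hypothetical sketch of such a schema and of the two concurrent sessions (the table and column names are illustrative and are not taken from the attached table-structure.sql):

                CREATE TABLE parent_a (id INT PRIMARY KEY) ENGINE=InnoDB;
                CREATE TABLE parent_b (id INT PRIMARY KEY) ENGINE=InnoDB;

                -- Child table with two ON DELETE CASCADE constraints, one per parent
                CREATE TABLE child (
                    id   INT PRIMARY KEY,
                    a_id INT,
                    b_id INT,
                    FOREIGN KEY (a_id) REFERENCES parent_a (id) ON DELETE CASCADE,
                    FOREIGN KEY (b_id) REFERENCES parent_b (id) ON DELETE CASCADE
                ) ENGINE=InnoDB;

                INSERT INTO parent_a VALUES (1);
                INSERT INTO parent_b VALUES (1);
                INSERT INTO child    VALUES (1, 1, 1);  -- depends on both parent rows

                -- Session 1 (autocommit=ON):
                DELETE FROM parent_a WHERE id = 1;  -- cascade deletes child row 1

                -- Session 2, concurrently (autocommit=ON):
                DELETE FROM parent_b WHERE id = 1;  -- its cascade no longer finds the child row

            If the description above is accurate, both DELETEs replicate as independent write sets, but only one of them carries a key for the removed child row, so certification on the other nodes sees no dependency between them and parallel appliers may process them concurrently.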

            seppo Seppo Jaakola added a comment -

            Using the provider configuration cert.optimistic_pa=false is an effective workaround for the above scenario until a better fix is worked out.
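
            A sketch of how this workaround can be applied, assuming the usual wsrep_provider_options mechanism (append the option to any wsrep_provider_options string you already set rather than replacing it; if in doubt about changing it at runtime, set it in the configuration file and restart the node):

                # my.cnf on each node: disable optimistic parallel applying in the Galera provider
                [mysqld]
                wsrep_provider_options="cert.optimistic_pa=false"

                -- or, at runtime (wsrep_provider_options is dynamic; options not named keep their values):
                SET GLOBAL wsrep_provider_options='cert.optimistic_pa=false';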

            wdoekes Walter Doekes added a comment -

            Nice work!

            I'm looking at the schema dump here, and I do not see any FKs on the searchindex table that I mentioned (neither on it nor pointing to it). But I do see plenty of other tables in this schema that have FKs with ON DELETE CASCADE to multiple other tables. I'll assume the problem is then triggered before we even get to the DELETE FROM `searchindex`.
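
            For anyone who wants to check their own schema for tables matching this pattern, something along these lines should list child tables with ON DELETE CASCADE foreign keys to two or more distinct parent tables ('your_schema' is a placeholder, not a name from this report):

                SELECT rc.TABLE_NAME,
                       COUNT(*)                                 AS cascade_fks,
                       COUNT(DISTINCT rc.REFERENCED_TABLE_NAME) AS distinct_parents
                FROM information_schema.REFERENTIAL_CONSTRAINTS rc
                WHERE rc.CONSTRAINT_SCHEMA = 'your_schema'   -- placeholder schema name
                  AND rc.DELETE_RULE = 'CASCADE'
                GROUP BY rc.TABLE_NAME
                HAVING COUNT(DISTINCT rc.REFERENCED_TABLE_NAME) >= 2;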

            seppo Seppo Jaakola added a comment -

            I have a fix for this now under testing; it looks effective with the provided sample and stored procedure, which exercises some 50K such vulnerable cascading FK delete operations. I will work on developing a simple, deterministic regression test for this.

            seppo Seppo Jaakola made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            seppo Seppo Jaakola added a comment -

            PR submitted on GitHub.

            seppo Seppo Jaakola made changes -
            Assignee Seppo Jaakola [ seppo ] Jan Lindström [ jplindst ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            marko Marko Mäkelä added a comment -

            seppo and jplindst, please try to make the changes inside an existing Galera hook (wsrep_append_foreign_key()), possibly simplifying its interface a little.
            jplindst Jan Lindström (Inactive) added a comment - https://github.com/MariaDB/server/commit/82f798157884ad6a8764e6a854f17464e3e9d517

            brendon Brendon Abbott added a comment -

            Thanks everyone for the persistence in finding the issue.
            jplindst Jan Lindström (Inactive) made changes -
            issue.field.resolutiondate 2021-12-20 13:13:01.0 2021-12-20 13:13:01.747
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.4.23 [ 26807 ]
            Fix Version/s 10.5.14 [ 26809 ]
            Fix Version/s 10.6.6 [ 26811 ]
            Fix Version/s 10.7.2 [ 26813 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status In Review [ 10002 ] Closed [ 6 ]
            elenst Elena Stepanova made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 145774

            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: brendon Brendon Abbott
              Votes: 1
              Watchers: 12

