MariaDB Server / MDEV-31038

Parallel Replication Breaks if XA PREPARE Fails Updating Slave GTID State

Details

    Description

      Similar to MDEV-29642, but instead of a slave crashing: if a parallel slave executing an XA PREPARE fails the follow-up autocommit transaction due to a temporary error that can be retried (e.g. a deadlock when trying to lock mysql.gtid_slave_pos), then the retry runs into the same issue as MDEV-29642, i.e. an out-of-order GTID attempt (if GTID strict mode is enabled) or an "XID already exists" error (otherwise).

      This is because the XA PREPARE had already committed successfully to the binlog and storage engines; only the slave GTID state update failed. On retry, the transaction has already been seen, so we error out.
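
      For context, here is a minimal sketch of the kind of master workload involved (the table name t1 is made up for illustration). On the slave, the XA PREPARE is applied as its own event group, and the GTID is then recorded in mysql.gtid_slave_pos in a separate autocommit transaction, which is the step that fails here:

      -- Hypothetical master workload; t1 is an example InnoDB table.
      XA START 'xid1';
      INSERT INTO t1 VALUES (1);
      XA PREPARE 'xid1';  -- binlogged as its own event group; after applying it,
                          -- the slave updates mysql.gtid_slave_pos in a separate
                          -- autocommit transaction (the step that can deadlock)
      XA COMMIT 'xid1';   -- arrives later as an independent event group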

      Filed as a separate ticket (rather than extending the scope of MDEV-29642) because this is specific to the parallel slave, and may need a separate effort.

      Attachments

        Issue Links

          Activity

            bnestere Brandon Nesterenko added a comment -

            Hi Andrei,

            This is ready for review as PR-2610.
            Elkin Andrei Elkin added a comment -

            A review round suggested a couple of improvements.

            bnestere Brandon Nesterenko added a comment -

            Pushed into 10.5 as 31f09e3.
            vlad.lesin Vladislav Lesin added a comment - edited

            Guys, I think the fix does not actually fix the root cause. From the user's perspective it just replaces one error with another.

            The underlying problem can be approached from two sides:

            1. Replication side. The binlogging of XA PREPARE and the insert into gtid_slave_pos are not atomic. If the XA PREPARE is binlogged and the insert into gtid_slave_pos then fails for some reason, the binlog and gtid_slave_pos are left inconsistent. It would be great if the binlog write and the gtid_slave_pos update were part of the same transaction, which could then be committed or rolled back as a whole.

            2. InnoDB side.

            If the bug is caused by a transaction retry, and there was an error during the gtid_slave_pos insert, the error can be either a deadlock or a lock wait timeout (see the code around the slave_transaction_retry_errors array and the is_parallel_retry_error() call in Xid_apply_log_event::do_record_gtid() for details). Both errors can be caused by InnoDB locks.
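
            For reference, both error codes are in the default retry list, which can be verified at runtime (a quick sanity check, not part of any fix):

            -- 1205 = ER_LOCK_WAIT_TIMEOUT, 1213 = ER_LOCK_DEADLOCK; both appear
            -- in the default value, so either failure triggers a retry.
            SHOW GLOBAL VARIABLES LIKE 'slave_transaction_retry_errors';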

            But when a record is inserted into gtid_slave_pos, a persistent cursor is opened with the PAGE_CUR_LE search mode, see row_ins_clust_index_entry_low() for details. I.e. it tries to find the record less than or equal to the inserted GTID. During the search, btr_pcur_open_low() fills the cursor->up_match and cursor->low_match fields. Here are the comments about these fields:

            ulint   up_match;   /*!< if search mode was PAGE_CUR_LE,
                                     the number of matched fields to the
                                     first user record to the right of
                                     the cursor record after
                                     btr_cur_search_to_nth_level;
                                     ... */
            ulint   low_match;  /*!< if search mode was PAGE_CUR_LE,
                                     the number of matched fields to the
                                     first user record AT THE CURSOR or
                                     to the left of it after
                                     btr_cur_search_to_nth_level;
                                     ... */
            

            gtid_slave_pos has the following primary key:

            *************************** 1. row ***************************
                   Table: gtid_slave_pos
            Create Table: CREATE TABLE `gtid_slave_pos` (
              `domain_id` int(10) unsigned NOT NULL,
              `sub_id` bigint(20) unsigned NOT NULL,
              `server_id` int(10) unsigned NOT NULL,
              `seq_no` bigint(20) unsigned NOT NULL,
              PRIMARY KEY (`domain_id`,`sub_id`)
            ) ENGINE=InnoDB DEFAULT CHARSET=latin1 PAGE_CHECKSUM=1 TRANSACTIONAL=0 COMMENT='Replication slave GTID position'
            

            I.e. if we have the records (0, 1, ...), (0, 4, ...) in gtid_slave_pos and want to insert (0, 3, ...), both cursor->low_match and cursor->up_match will be equal to 1, as both (0, 1, ...) and (0, 4, ...) share the same first field with (0, 3, ...). If gtid_slave_pos had also contained a (0, 3, ...) record, cursor->low_match would equal 1 and cursor->up_match would equal 2, which makes the (cursor->up_match >= n_uniq || cursor->low_match >= n_uniq) condition in row_ins_clust_index_entry_low() true, causing a non-gap shared lock to be acquired on (0, 3, ...) in order to compare all fields (see row_ins_duplicate_error_in_clust()) and check the 'deleted' flag.

            Non-gap locks are converted to gap locks only on record deletion, page split/merge, or page discard; see lock_rec_inherit_to_gap() for details. So if (0, 3, ...) is delete-marked and purged (or the page is split/merged or discarded) after some thread checked it for duplicates, committed the mini-transaction, but has not released its locks yet, an insert of (0, 2, ...) could be blocked.
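
            To make the duplicate-check path concrete, here is a hedged two-session sketch against a toy table with the same primary-key shape (the table name is made up; this demonstrates the non-gap S-lock taken by the duplicate check, not the purge race itself):

            CREATE TABLE t_gtid (
              domain_id INT UNSIGNED NOT NULL,
              sub_id BIGINT UNSIGNED NOT NULL,
              PRIMARY KEY (domain_id, sub_id)
            ) ENGINE=InnoDB;
            INSERT INTO t_gtid VALUES (0,1),(0,4);

            -- Session 1: insert (0,3) and keep the transaction open, so the
            -- record exists in the index but is not yet committed:
            BEGIN;
            INSERT INTO t_gtid VALUES (0,3);

            -- Session 2: the duplicate check (row_ins_duplicate_error_in_clust())
            -- requests a shared, non-gap lock on the conflicting (0,3) record:
            INSERT INTO t_gtid VALUES (0,3);
            -- blocks; fails with ER_DUP_ENTRY if session 1 commits,
            -- succeeds if session 1 rolls back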

            In other words, a non-gap S-lock on a gtid_slave_pos record can happen only if the table contains a duplicate (even a delete-marked one), i.e. a record with the same (`domain_id`, `sub_id`) pair as the record being inserted. If I understand correctly (Elkin, correct me if I am wrong), this is impossible, as (`domain_id`, `sub_id`) is supposed to be unique.

            In the absence of a duplicate (even a delete-marked one), only insert-intention locks are acquired (they don't conflict with each other). Insert-intention locks are not inherited (see lock_rec_inherit_to_gap(), lock_rec_inherit_to_gap_if_gap_lock()). Insert-intention locks conflict only with gap locks. But for that conflict to happen, there must be duplicates in gtid_slave_pos, which is supposed to be impossible.

            I would add printing of transaction lock info (i.e. which locks the transaction holds, on which lock it deadlocked or timed out, and the locks of the conflicting transaction) to understand whether the insert failure was caused by locks, and by which locks exactly. It may also make sense to print the statement of the conflicting transaction, as, theoretically, the locks can be caused not only by slave workers but also by third-party queries.
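
            As a stopgap at the SQL level, live lock waits can be inspected with the INFORMATION_SCHEMA lock tables, for example along these lines (only live waits are visible; an already-resolved deadlock needs SHOW ENGINE INNODB STATUS or the server-side printing suggested above):

            SELECT r.trx_id    AS waiting_trx,
                   r.trx_query AS waiting_query,
                   b.trx_id    AS blocking_trx,
                   b.trx_query AS blocking_query,
                   l.lock_mode, l.lock_type, l.lock_table, l.lock_index
            FROM information_schema.innodb_lock_waits w
            JOIN information_schema.innodb_trx   r ON r.trx_id  = w.requesting_trx_id
            JOIN information_schema.innodb_trx   b ON b.trx_id  = w.blocking_trx_id
            JOIN information_schema.innodb_locks l ON l.lock_id = w.blocking_lock_id;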

            vlad.lesin Vladislav Lesin added a comment - edited

            Here is my attempt to cause the "out-of-order" error: rpl_parallel_optimistic.test.

            I launched it with 30 instances; it ran for about 8 hours and did not catch the bug.


            marko Marko Mäkelä added a comment -

            The rpl_parallel_optimistic.test produced an anomaly that I posted some analysis of in MDEV-28430. It could explain the mystery locking conflicts.

            bnestere Brandon Nesterenko added a comment -

            Closing the issue, as a patch has gone out in a release. Any further work on the corresponding crash-safety concern should be tracked in MDEV-21777.
            Elkin Andrei Elkin added a comment -

            Correcting the relation after discussion with vlad.lesin.

            People

              vlad.lesin Vladislav Lesin
              bnestere Brandon Nesterenko
              Votes: 2
              Watchers: 12

