Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-19344

innodb.innodb-change-buffer-recovery-innodb fails with incorrect results or crash

Details

    Description

      In debug mode running a test with --repeat results in an every second run failed like this:

      innodb.innodb-change-buffer-recovery 'innodb' [ fail ]
              Test ended at 2019-04-26 12:23:01
       
      CURRENT_TEST: innodb.innodb-change-buffer-recovery
      --- /home/kevg/work/mariadb/mysql-test/suite/innodb/r/innodb-change-buffer-recovery.result	2019-04-26 12:14:05.804197310 +0300
      +++ /home/kevg/work/mariadb/mysql-test/suite/innodb/r/innodb-change-buffer-recovery.reject	2019-04-26 12:23:01.364749083 +0300
      @@ -38,7 +38,7 @@
       ERROR HY000: Lost connection to MySQL server during query
       disconnect con1;
       connection default;
      -FOUND 1 /Wrote log record for ibuf update in place operation/ in my_restart.err
      +FOUND 2 /Wrote log record for ibuf update in place operation/ in my_restart.err
       CHECK TABLE t1;
       Table	Op	Msg_type	Msg_text
       test.t1	check	status	OK
       
      mysqltest: Result content mismatch
       
       - saving '/home/kevg/work/debug/mysql-test/var/log/innodb.innodb-change-buffer-recovery-innodb/' to '/home/kevg/work/debug/mysql-test/var/log/innodb.innodb-change-buffer-recovery-innodb/'
      

      In debug mode with -O2 it sometime crashes like this:

      #6  0x00007f4016d55012 in __GI___assert_fail (assertion=0xfadc93 "field != 16 || val", file=0xf9048b "/work/mariadb/storage/innobase/include/page0page.ic", line=176, function=0xfadc3f "void page_header_set_field(ib_page_t *, page_zip_des_t *, ulint, ulint)") at assert.c:101
      #7  0x0000000000a1bbcb in page_header_set_field (page=<optimized out>, page_zip=0x0, field=16, val=<optimized out>) at page0page.ic:176
      #8  page_mem_free (page=<optimized out>, page_zip=0x0, rec=<optimized out>, index=<optimized out>, offsets=<optimized out>) at page0page.ic:1047
      #9  page_cur_delete_rec (cursor=0x7f40106e8d01, index=<optimized out>, offsets=0x7ffe97abb990, mtr=0x7ffe97abc230) at page0cur.cc:2417
      #10 0x0000000000b79fab in btr_cur_optimistic_delete_func (cursor=0x7ffe97abc760, flags=<optimized out>, mtr=0x7ffe97abc230) at btr0cur.cc:5163
      #11 0x00000000009c4387 in ibuf_delete_rec (space=4, page_no=27, pcur=0x7ffe97abc760, search_tuple=0x26faf98, mtr=0x7ffe97abc230) at ibuf0ibuf.cc:4261
      #12 0x00000000009c11de in ibuf_merge_or_delete_for_page (block=0x7f40102526a8, page_id=..., page_size=0x7f40102526b8, update_ibuf_bitmap=1) at ibuf0ibuf.cc:4645
      #13 0x00000000009fcd6f in mlog_init_t::ibuf_merge (this=<optimized out>, mtr=...) at log0recv.cc:330
      #14 0x00000000009f44c5 in recv_apply_hashed_log_recs (last_batch=<optimized out>) at log0recv.cc:2516
      #15 0x0000000000ae989b in innobase_start_or_create_for_mysql () at srv0start.cc:2224
      #16 0x000000000097d77f in innobase_init (p=<optimized out>) at ha_innodb.cc:4377
      #17 0x00000000007af955 in ha_initialize_handlerton (plugin=0x2367ca8) at handler.cc:521
      #18 0x000000000060cc59 in plugin_initialize (tmp_root=0x7ffe97ac26b8, plugin=0x2367ca8, argc=0x15d1f68 <remaining_argc>, argv=0x2225428, options_only=<optimized out>) at sql_plugin.cc:1416
      #19 0x000000000060c77e in plugin_init (argc=0x15d1f68 <remaining_argc>, argv=0x2225428, flags=2) at sql_plugin.cc:1697
      #20 0x00000000005525b5 in init_server_components () at mysqld.cc:5311
      #21 mysqld_main (argc=<optimized out>, argv=0x2225428) at mysqld.cc:5907
      

      Adding -mysqld=-debug=d,ib_log_checkpoint_avoid make debug version fail with the same backtrace.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

            The ‘extra’ call to ibuf_insert_low() is explained by the following:

            	err = ibuf_insert_low(BTR_MODIFY_PREV, op, no_counter,
            			      entry, entry_size,
            			      index, page_id, zip_size, thr);
            	if (err == DB_FAIL) {
            		err = ibuf_insert_low(BTR_MODIFY_TREE | BTR_LATCH_FOR_INSERT,
            				      op, no_counter, entry, entry_size,
            				      index, page_id, zip_size, thr);
            	}
            

            That is, no duplicate record was inserted into the change buffer for either the DELETE (@r, 36) or the DELETE_MARK (NULL, 521). Only 14 records were present when the DELETE (NULL, 520) was wrongly buffered.

            In ibuf_get_volume_buffered() we check also the preceding change buffer page. This worries me a little, because normally, a backward B-tree scan in InnoDB could lead to a deadlock of threads, and we normally save the cursor position and search for the preceding record to move to the preceding page. Only if the caller ibuf_insert_low() was invoked with mode = BTR_MODIFY_TREE | BTR_LATCH_FOR_INSERT would there likely be no concern, because we would be holding ibuf_mutex and ibuf_pessimistic_insert_mutex. However, that call has been there since the inclusion of InnoDB in MySQL. If I remember correctly the discussions between 2005 and 2007, going back one more page could lead to deadlocks. So, I guess that we are fine with regard to deadlocks here. (Also, I am not aware of any actual deadlocks in this code.)

            The following hash function ought to work. False positives are not a correctness problem. For example, ('fu','bar'), (NULL,'fubar'),('fub','ar') would be treated as equal, and we would increment min_n_rec by at most 1 for those 3 distinct keys.

            /** Determine if a change buffer record has been encountered already.
            @param rec   change buffer record in the MySQL 5.5 format
            @param hash  hash table of encountered records
            @param size  number of elements in hash
            @retval true if a distinct record
            @retval false if this may be duplicating an earlier record */
            static bool ibuf_get_volume_buffered_hash(const rec_t *rec, ulint *hash,
                                                      ulint size)
            {
              ut_ad(rec_get_n_fields_old(rec) > IBUF_REC_FIELD_USER);
              const ulint start= rec_get_field_start_offs(rec, IBUF_REC_FIELD_USER);
              const ulint len= rec_get_data_size_old(rec) - start;
              const uint32_t fold= ut_crc32(rec + start, len);
              hash+= (fold / (CHAR_BIT * sizeof *hash)) % size;
              ulint bitmask= static_cast<ulint>(1) << (fold % (CHAR_BIT * sizeof(*hash)));
             
              if (*hash & bitmask)
                return false;
             
              /* We have not seen this record yet. Remember it. */
              *hash|= bitmask;
              return true;
            }
            

            The comparison min_n_recs < 2 in ibuf_insert_low() actually does check for underflow, because the variable is a signed type (lint, not ulint).

            With the corrected hash function, the estimate of min_n_recs looks conservative to me, even if there were multiple change buffer pages of buffered changes and the estimate did not consider them all. In the worst case, we missed many preceding pages filled with IBUF_DELETE_OP records that had been buffered for the user page earlier. But, it would have been an error to buffer those operations if the records did not exist in the first place. So, it is OK to ignore any preceding pages of buffered operations and start with a conservative estimate that there exist at least min_n_recs=0 records in the page.

            Update: I tried 10.2 with the following patch:

            diff --git a/storage/innobase/include/page0page.ic b/storage/innobase/include/page0page.ic
            index 4efb92f7182..276f85df69a 100644
            --- a/storage/innobase/include/page0page.ic
            +++ b/storage/innobase/include/page0page.ic
            @@ -173,9 +173,7 @@ page_header_set_field(
             {
             	ut_ad(page);
             	ut_ad(field <= PAGE_N_RECS);
            -#if 0 /* FIXME: MDEV-19344 hits this */
             	ut_ad(field != PAGE_N_RECS || val);
            -#endif
             	ut_ad(field == PAGE_N_HEAP || val < srv_page_size);
             	ut_ad(field != PAGE_N_HEAP || (val & 0x7fff) < srv_page_size);
             
            

            but the test innodb.innodb-change-buffer-recovery did not fail in 70 runs even without my fix. I will try harder, possibly with an older version, so that I can test my fix.

            marko Marko Mäkelä added a comment - - edited The ‘extra’ call to ibuf_insert_low() is explained by the following: err = ibuf_insert_low(BTR_MODIFY_PREV, op, no_counter, entry, entry_size, index, page_id, zip_size, thr); if (err == DB_FAIL) { err = ibuf_insert_low(BTR_MODIFY_TREE | BTR_LATCH_FOR_INSERT, op, no_counter, entry, entry_size, index, page_id, zip_size, thr); } That is, no duplicate record was inserted into the change buffer for either the DELETE (@r, 36) or the DELETE_MARK (NULL, 521) . Only 14 records were present when the DELETE (NULL, 520) was wrongly buffered. In ibuf_get_volume_buffered() we check also the preceding change buffer page. This worries me a little, because normally, a backward B-tree scan in InnoDB could lead to a deadlock of threads, and we normally save the cursor position and search for the preceding record to move to the preceding page. Only if the caller ibuf_insert_low() was invoked with mode = BTR_MODIFY_TREE | BTR_LATCH_FOR_INSERT would there likely be no concern, because we would be holding ibuf_mutex and ibuf_pessimistic_insert_mutex . However, that call has been there since the inclusion of InnoDB in MySQL . If I remember correctly the discussions between 2005 and 2007, going back one more page could lead to deadlocks. So, I guess that we are fine with regard to deadlocks here. (Also, I am not aware of any actual deadlocks in this code.) The following hash function ought to work. False positives are not a correctness problem. For example, ('fu','bar'), (NULL,'fubar'),('fub','ar') would be treated as equal, and we would increment min_n_rec by at most 1 for those 3 distinct keys. /** Determine if a change buffer record has been encountered already. @param rec change buffer record in the MySQL 5.5 format @param hash hash table of encountered records @param size number of elements in hash @retval true if a distinct record @retval false if this may be duplicating an earlier record */ static bool ibuf_get_volume_buffered_hash( const rec_t *rec, ulint *hash, ulint size) { ut_ad(rec_get_n_fields_old(rec) > IBUF_REC_FIELD_USER); const ulint start= rec_get_field_start_offs(rec, IBUF_REC_FIELD_USER); const ulint len= rec_get_data_size_old(rec) - start; const uint32_t fold= ut_crc32(rec + start, len); hash+= (fold / (CHAR_BIT * sizeof *hash)) % size; ulint bitmask= static_cast <ulint>(1) << (fold % (CHAR_BIT * sizeof (*hash)));   if (*hash & bitmask) return false ;   /* We have not seen this record yet. Remember it. */ *hash|= bitmask; return true ; } The comparison min_n_recs < 2 in ibuf_insert_low() actually does check for underflow, because the variable is a signed type ( lint , not ulint ). With the corrected hash function, the estimate of min_n_recs looks conservative to me, even if there were multiple change buffer pages of buffered changes and the estimate did not consider them all. In the worst case, we missed many preceding pages filled with IBUF_DELETE_OP records that had been buffered for the user page earlier. But, it would have been an error to buffer those operations if the records did not exist in the first place. So, it is OK to ignore any preceding pages of buffered operations and start with a conservative estimate that there exist at least min_n_recs=0 records in the page. Update: I tried 10.2 with the following patch: diff --git a/storage/innobase/include/page0page.ic b/storage/innobase/include/page0page.ic index 4efb92f7182..276f85df69a 100644 --- a/storage/innobase/include/page0page.ic +++ b/storage/innobase/include/page0page.ic @@ -173,9 +173,7 @@ page_header_set_field( { ut_ad(page); ut_ad(field <= PAGE_N_RECS); -#if 0 /* FIXME: MDEV-19344 hits this */ ut_ad(field != PAGE_N_RECS || val); -#endif ut_ad(field == PAGE_N_HEAP || val < srv_page_size); ut_ad(field != PAGE_N_HEAP || (val & 0x7fff) < srv_page_size); but the test innodb.innodb-change-buffer-recovery did not fail in 70 runs even without my fix. I will try harder, possibly with an older version, so that I can test my fix.

            So far, I have been unable to repeat the assertion failure on 10.2. It looks like the result difference was a caused by the test not removing the file my_restart.err between runs, causing a failure on ./mtr --repeat. That test problem was fixed in 10.2.28, 10.3.19, 10.4.10, 10.5.0.

            marko Marko Mäkelä added a comment - So far, I have been unable to repeat the assertion failure on 10.2. It looks like the result difference was a caused by the test not removing the file my_restart.err between runs, causing a failure on ./mtr --repeat . That test problem was fixed in 10.2.28, 10.3.19, 10.4.10, 10.5.0 .

            The rr trace mentioned above belongs to some test run on a 10.5 develoment tree where the server aborts with (snip from server error log slightly edited)
            [ERROR] InnoDB: Unable to purge a record
            mysqld: /home/mleich/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3994: void ibuf_delete(const dtuple_t*, buf_block_t*, dict_index_t*, mtr_t*): Assertion `0' failed.
            [ERROR] mysqld got signal 6 ; 
            

            mleich Matthias Leich added a comment - The rr trace mentioned above belongs to some test run on a 10.5 develoment tree where the server aborts with (snip from server error log slightly edited) [ERROR] InnoDB: Unable to purge a record mysqld: /home/mleich/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3994: void ibuf_delete(const dtuple_t*, buf_block_t*, dict_index_t*, mtr_t*): Assertion `0' failed. [ERROR] mysqld got signal 6 ;

            mleich reported the assertion on crash recovery. It turns out that the assertion was missing || recv_recovery_is_on() at the end. The reason for this is the following condition, which has been removed in 10.5 (MDEV-12353):

            	if (page_get_n_recs(page) == 1 && !recv_recovery_is_on()) {
            		/* Empty the page, unless we are applying the redo log
            		during crash recovery. During normal operation, the
            		page_create_empty() gets logged as one of MLOG_PAGE_CREATE,
            		MLOG_COMP_PAGE_CREATE, MLOG_ZIP_PAGE_COMPRESS. */
            		ut_ad(page_is_leaf(page));
            		/* Usually, this should be the root page,
            		and the whole index tree should become empty.
            		However, this could also be a call in
            		btr_cur_pessimistic_update() to delete the only
            		record in the page and to insert another one. */
            		page_cur_move_to_next(cursor);
            		ut_ad(page_cur_is_after_last(cursor));
            		page_create_empty(page_cur_get_block(cursor),
            				  const_cast<dict_index_t*>(index), mtr);
            		return;
            	}
            

            Only during crash recovery, we can decrement PAGE_N_RECS from 1 to 0. In all other cases, we would invoke page_create_empty(). The assertion failed on the change buffer tree, because the change buffer was emptied during crash recovery.

            marko Marko Mäkelä added a comment - mleich reported the assertion on crash recovery. It turns out that the assertion was missing || recv_recovery_is_on() at the end. The reason for this is the following condition, which has been removed in 10.5 ( MDEV-12353 ): if (page_get_n_recs(page) == 1 && !recv_recovery_is_on()) { /* Empty the page, unless we are applying the redo log during crash recovery. During normal operation, the page_create_empty() gets logged as one of MLOG_PAGE_CREATE, MLOG_COMP_PAGE_CREATE, MLOG_ZIP_PAGE_COMPRESS. */ ut_ad(page_is_leaf(page)); /* Usually, this should be the root page, and the whole index tree should become empty. However, this could also be a call in btr_cur_pessimistic_update() to delete the only record in the page and to insert another one. */ page_cur_move_to_next(cursor); ut_ad(page_cur_is_after_last(cursor)); page_create_empty(page_cur_get_block(cursor), const_cast <dict_index_t*>(index), mtr); return ; } Only during crash recovery, we can decrement PAGE_N_RECS from 1 to 0. In all other cases, we would invoke page_create_empty() . The assertion failed on the change buffer tree, because the change buffer was emptied during crash recovery.

            Summary: The test was not tolerant of ./mtr --repeat=2 and the assertion that was added in a hope to catch the bug that I now filed and fixed as MDEV-22497 was bogus. I removed the assertion that was already bracketed inside #if 0.

            marko Marko Mäkelä added a comment - Summary: The test was not tolerant of ./mtr --repeat=2 and the assertion that was added in a hope to catch the bug that I now filed and fixed as MDEV-22497 was bogus. I removed the assertion that was already bracketed inside #if 0 .

            People

              marko Marko Mäkelä
              kevg Eugene Kosov (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.