Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27865

buf_page_read_complete assertions during during innodb_init

Details

    Description

      From: faf

      20 instances from last report 2022-02-15.

      Two stack traces, similar at frame 11 (buf_page_get_low) and lower.

      Example:

      faf report 314639

      Frame # 	Function 	Binary 	Source or offset 	Line
      1 	
      kill
      	/lib64/libc.so.6 	0x4298b
      Build id: 644dac2c66a6e0b32674f0ec71e7431bd0c06a63 	-
      2 	
      handle_fatal_signal
      	/usr/libexec/mariadbd 	0x98265d
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      3 	
      	/lib64/libc.so.6 	0x42750
      Build id: 644dac2c66a6e0b32674f0ec71e7431bd0c06a63 	-
      4 	
      __pthread_kill_implementation
      	/lib64/libc.so.6 	0x8f84c
      Build id: 644dac2c66a6e0b32674f0ec71e7431bd0c06a63 	-
      5 	
      raise
      	/lib64/libc.so.6 	0x426a6
      Build id: 644dac2c66a6e0b32674f0ec71e7431bd0c06a63 	-
      6 	
      abort
      	/lib64/libc.so.6 	0x2c7d3
      Build id: 644dac2c66a6e0b32674f0ec71e7431bd0c06a63 	-
      7 	
      ib_wqueue_add(ib_wqueue_t*, void*, mem_block_info_t*, bool) [clone .cold]
      	/usr/libexec/mariadbd 	0x598826
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      8 	
      buf_page_read_complete(buf_page_t*, fil_node_t const&) [clone .cold]
      	/usr/libexec/mariadbd 	0x59e056
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      9 	
      buf_read_page_low(dberr_t*, fil_space_t*, bool, unsigned long, page_id_t, unsigned long, bool) [clone .lto_priv.0]
      	/usr/libexec/mariadbd 	0xd5e967
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      10 	
      buf_read_page(page_id_t, unsigned long)
      	/usr/libexec/mariadbd 	0xd5f4c8
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      11 	
      buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool)
      	/usr/libexec/mariadbd 	0xd5f71f
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      12 	
      dict_boot()
      	/usr/libexec/mariadbd 	0x5a0910
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      13 	
      srv_start(bool)
      	/usr/libexec/mariadbd 	0x592326
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      14 	
      innodb_init(void*) [clone .lto_priv.0]
      	/usr/libexec/mariadbd 	0xc3b939
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      15 	
      ha_initialize_handlerton(st_plugin_int*)
      	/usr/libexec/mariadbd 	0x983902
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      16 	
      plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool)
      	/usr/libexec/mariadbd 	0x74a3fe
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      17 	
      plugin_init(int*, char**, int)
      	/usr/libexec/mariadbd 	0x74f334
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      18 	
      init_server_components()
      	/usr/libexec/mariadbd 	0x6050ab
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      19 	
      mysqld_main(int, char**)
      	/usr/libexec/mariadbd 	0x60a2fb
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      20 	
      __libc_start_call_main
      	/lib64/libc.so.6 	0x2d560
      Build id: 644dac2c66a6e0b32674f0ec71e7431bd0c06a63 	-
      21 	
      __libc_start_main@@GLIBC_2.34
      	/lib64/libc.so.6 	0x2d60c
      Build id: 644dac2c66a6e0b32674f0ec71e7431bd0c06a63 	-
      22 	
      _start
      	/usr/libexec/mariadbd 	0x5f5a35
      Build id: 7246c3e7348acc8b2e3e369599d9a84f006bae17 	-
      

      The assertion was added in 2d98b967e316 / MDEV-26865

      Report Build ID MariaDB version Stating point
      faf report 314639 7246c3e7348acc8b2e3e369599d9a84f006bae17 10.5.13 dict_boot
      faf report 352736 7246c3e7348acc8b2e3e369599d9a84f006bae17 10.5.13 trx_rseg_array_init

      BuildID to MariaDB version

      $ DEBUGINFOD_URLS=https://debuginfod.fedoraproject.org/ debuginfod-find executable 7246c3e7348acc8b2e3e369599d9a84f006bae17
       
      /home/dan/.cache/debuginfod_client/7246c3e7348acc8b2e3e369599d9a84f006bae17/executable
       
      $ chmod a+x /home/dan/.cache/debuginfod_client/7246c3e7348acc8b2e3e369599d9a84f006bae17/executable
       
      $ /home/dan/.cache/debuginfod_client/7246c3e7348acc8b2e3e369599d9a84f006bae17/executable --version
      /home/dan/.cache/debuginfod_client/7246c3e7348acc8b2e3e369599d9a84f006bae17/executable  Ver 10.5.13-MariaDB for Linux on x86_64 (MariaDB Server)
      

      I don't have access to original stack traces. Just copying from faf with a little information extraction.

      Attachments

        Issue Links

          Activity

            Thank you, this is an interesting concept.

            The stack traces look invalid. The only callers of ib_wqueue_add() in mariadb-10.5.13 are add_msg() in fts0opt.cc and trx_finalize_for_fts() on transaction commit. The buffer pool (such as buf_page_read_complete()) does not directly interact with the fts_optimize_wq or dict_table_t::fts->add_wq.

            Can you try to determine the correct stack traces?

            marko Marko Mäkelä added a comment - Thank you, this is an interesting concept. The stack traces look invalid. The only callers of ib_wqueue_add() in mariadb-10.5.13 are add_msg() in fts0opt.cc and trx_finalize_for_fts() on transaction commit. The buffer pool (such as buf_page_read_complete() ) does not directly interact with the fts_optimize_wq or dict_table_t::fts->add_wq . Can you try to determine the correct stack traces?
            danblack Daniel Black added a comment -

            Yep, I see. Taking the compiler can take identical looking bits of code and make them the same, sure, calling ib_wqueue_add wrong.

            Looking at offsets as authentic:

            buf_page_read_complete 0x59e056 from faf report 314639

             
               0x000000000059e03a <+391>:	call   0x56c36c <_ZN2ib6loggerC2Ev.lto_priv.0>
               0x000000000059e03f <+396>:	lea    0xb7804a(%rip),%rsi        # 0x1116090
               0x000000000059e046 <+403>:	mov    %r14,%rdi
               0x000000000059e049 <+406>:	call   0x56de26 <_ZN2ib6loggerlsILm9EEERS0_RAT__Kc>
               0x000000000059e04e <+411>:	mov    %r14,%rdi
               0x000000000059e051 <+414>:	call   0x5987d2 <_ZN2ib5fatalD2Ev>
               0x000000000059e056 <+419>:	mov    $0x43,%r14d
             
            (gdb) p ((char *)0x1116090)
            $12 = 0x1116090 "Aborting because of a corrupt database page."
            

            So the following call in buf_page_read_complete after this error is buf_corrupt_page_release. Would this be alias of ib_wqueue_add? Doesn't look similar enough, but I don't have a compiler mind, but at least it has a cold attribute and contains an abort. clone seems to be a copy of the function under specific preconditions (why a compiler would do that to a tiny cold function - no idea).

            So a close as random un-reproducable corruption?

            danblack Daniel Black added a comment - Yep, I see. Taking the compiler can take identical looking bits of code and make them the same, sure, calling ib_wqueue_add wrong. Looking at offsets as authentic: buf_page_read_complete 0x59e056 from faf report 314639   0x000000000059e03a <+391>: call 0x56c36c <_ZN2ib6loggerC2Ev.lto_priv.0> 0x000000000059e03f <+396>: lea 0xb7804a(%rip),%rsi # 0x1116090 0x000000000059e046 <+403>: mov %r14,%rdi 0x000000000059e049 <+406>: call 0x56de26 <_ZN2ib6loggerlsILm9EEERS0_RAT__Kc> 0x000000000059e04e <+411>: mov %r14,%rdi 0x000000000059e051 <+414>: call 0x5987d2 <_ZN2ib5fatalD2Ev> 0x000000000059e056 <+419>: mov $0x43,%r14d   (gdb) p ((char *)0x1116090) $12 = 0x1116090 "Aborting because of a corrupt database page." So the following call in buf_page_read_complete after this error is buf_corrupt_page_release . Would this be alias of ib_wqueue_add ? Doesn't look similar enough, but I don't have a compiler mind, but at least it has a cold attribute and contains an abort. clone seems to be a copy of the function under specific preconditions (why a compiler would do that to a tiny cold function - no idea). So a close as random un-reproducable corruption?

            Thank you. It could simply be that the database was corrupted for whatever reason, and we failed to start up. Another case of MDEV-13542.

            We could try to fix the crash (just make InnoDB fail to start up), possibly as part of MDEV-13542, or as a special case for this particular call trace.

            marko Marko Mäkelä added a comment - Thank you. It could simply be that the database was corrupted for whatever reason, and we failed to start up. Another case of MDEV-13542 . We could try to fix the crash (just make InnoDB fail to start up), possibly as part of MDEV-13542 , or as a special case for this particular call trace.

            I think that this crash should be prevented in the MDEV-13542 branch that is currently under testing. But, the development branch is based on 10.6. Porting and testing the changes for earlier major versions could involve a lot of risk and effort.

            marko Marko Mäkelä added a comment - I think that this crash should be prevented in the MDEV-13542 branch that is currently under testing. But, the development branch is based on 10.6. Porting and testing the changes for earlier major versions could involve a lot of risk and effort.

            People

              marko Marko Mäkelä
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.