  MariaDB Server / MDEV-37453

Parallel Replication Crash During Backup

Details

    • Can result in hang or crash
    • Q4/2025 Server Maintenance

    Description

      A parallel replica (optimistic parallel mode) can crash while a backup is being taken with mariadb-backup. On initial inspection, it appears that thd->backup_commit_lock->ticket is nullified by another thread while wait_for_commit::wait_for_prior_commit2 is trying to release it.

      The call stack of the crashing thread:

      #17 0x000055a6d18c433d in handle_fatal_signal (sig=11) at /usr/src/debug/MariaDB-/src_0/sql/signal_handler.cc:227
      #18 <signal handler called>
      No symbol table info available.
      #19 0x000055a6d1791f04 in inline_mysql_prlock_wrlock (src_line=1846, src_file=0x55a6d1eef8d0 "/home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/sql/mdl.cc", that=0xb2020db2ea080243) at /usr/src/debug/MariaDB-/src_0/include/mysql/psi/mysql_thread.h:946
      No locals.
      #20 MDL_lock::remove_ticket (this=0xb2020db2ea0800b3, pins=0x7ef72c020f88, list=&MDL_lock::m_granted, ticket=0x7efb94e2e0a8) at /usr/src/debug/MariaDB-/src_0/sql/mdl.cc:1846
      No locals.
      #21 0x000055a6d1792845 in MDL_context::release_lock (this=<optimized out>, duration=<optimized out>, ticket=0x7efb94e2e0a8) at /usr/src/debug/MariaDB-/src_0/sql/mdl.cc:2915
      #22 0x000055a6d179287d in MDL_context::release_lock (this=<optimized out>, ticket=<optimized out>) at /usr/src/debug/MariaDB-/src_0/sql/mdl.cc:2935
      No locals.
      #23 0x000055a6d160944f in wait_for_commit::wait_for_prior_commit2 (this=this@entry=0x7ef647fc0ce8, thd=thd@entry=0x7efb1c004418, allow_kill=allow_kill@entry=true) at /usr/src/debug/MariaDB-/src_0/sql/sql_class.cc:8336
      #24 0x000055a6d17ff426 in wait_for_commit::wait_for_prior_commit (allow_kill=true, thd=0x7efb1c004418, this=0x7ef647fc0ce8) at /usr/src/debug/MariaDB-/src_0/sql/sql_class.h:2408
      No locals.
      #25 THD::wait_for_prior_commit (allow_kill=true, this=0x7efb1c004418) at /usr/src/debug/MariaDB-/src_0/sql/sql_class.h:5346
      No locals.
      #26 retry_event_group (rgi=<optimized out>, rpt=<optimized out>, orig_qev=<optimized out>) at /usr/src/debug/MariaDB-/src_0/sql/rpl_parallel.cc:955
      #27 0x000055a6d180270b in handle_rpl_parallel_thread (arg=arg@entry=0x7efb081af568) at /usr/src/debug/MariaDB-/src_0/sql/rpl_parallel.cc:1561
      #28 0x000055a6d1b19329 in pfs_spawn_thread (arg=0x7efb081b0708) at /usr/src/debug/MariaDB-/src_0/storage/perfschema/pfs.cc:2201
      #29 0x00007f005aebb1ca in start_thread () from /lib64/libpthread.so.0
      No symbol table info available.
      #30 0x00007f005a1fb8d3 in clone () from /lib64/libc.so.6
      No symbol table info available.
      

      The release_lock call happens when wait_for_prior_commit2 releases the backup commit lock (note that it is reached via retry_event_group):

      int
      wait_for_commit::wait_for_prior_commit2(THD *thd, bool allow_kill)
      {
        ...
        /*
          Release MDL_BACKUP_COMMIT LOCK while waiting for other threads to commit
          This is needed to avoid deadlock between the other threads (which not
          yet have the MDL_BACKUP_COMMIT_LOCK) and any threads using
          BACKUP LOCK BLOCK_COMMIT.
        */
        if (thd->backup_commit_lock && thd->backup_commit_lock->ticket)
        {
          backup_lock_released= true;
          thd->mdl_context.release_lock(thd->backup_commit_lock->ticket);
          thd->backup_commit_lock->ticket= 0;
        }
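
      The excerpt above checks the ticket pointer, passes it to release_lock, and only then clears it, so a second thread touching the same pointer in between could invalidate it. The following is a minimal sketch of one generic way to close such a check-then-act window, assuming the slot could be made atomic. Ticket, Request, Context, release_if_held and demo_release_count are hypothetical stand-ins invented for illustration; they are not MariaDB's MDL classes, and this is not a proposed fix for the reported crash, whose root cause is only hypothesized here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for illustration only; NOT MariaDB's MDL classes.
struct Ticket { int id; };

struct Request {
  // The slot that the report suspects is cleared by another thread.
  std::atomic<Ticket*> ticket{nullptr};
};

struct Context {
  int releases = 0;  // counts how many times a ticket was released
  void release_lock(Ticket* t) {
    assert(t != nullptr);  // releasing a cleared ticket would be a bug
    ++releases;
  }
};

// Claim the ticket with one atomic exchange, so that the null check and the
// clearing of the slot cannot be separated by another caller.
// Returns true if this caller obtained and released the ticket.
bool release_if_held(Context& ctx, Request& req) {
  Ticket* t = req.ticket.exchange(nullptr, std::memory_order_acq_rel);
  if (t == nullptr)
    return false;  // someone else already took or cleared it
  ctx.release_lock(t);
  return true;
}

// Calls release_if_held twice on the same slot and returns the number of
// releases performed; the second call finds the slot already empty.
int demo_release_count() {
  Ticket ticket{1};
  Request req;
  req.ticket.store(&ticket);
  Context ctx;
  bool first = release_if_held(ctx, req);
  bool second = release_if_held(ctx, req);
  assert(first && !second);
  return ctx.releases;
}
```

      Because the exchange is a single atomic step, at most one caller can obtain a given non-null ticket. The sketch does not cover the case where the request object itself is destroyed while another thread still holds a pointer to it.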
      

      Prior to the crash, the metadata lock info shows the backup locks that were held:

      +-----------+-------------------+---------------+---------------------+--------------+---------------------+
      | THREAD_ID | LOCK_MODE         | LOCK_DURATION | LOCK_TYPE           | TABLE_SCHEMA | TABLE_NAME          |
      +-----------+-------------------+---------------+---------------------+--------------+---------------------+
      |  14856913 | MDL_BACKUP_START  | NULL          | Backup lock         |              |                     |
      |   2277384 | MDL_BACKUP_COMMIT | NULL          | Backup lock         |              |                     |
      |   2277384 | MDL_SHARED_WRITE  | NULL          | Table metadata lock | <somedb>     | <some_t1>           |
      |   2277384 | MDL_SHARED_WRITE  | NULL          | Table metadata lock | mysql        | gtid_slave_pos      |
      |  14861424 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t2>           |
      |  14861480 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t2>           |
      |  14861424 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t3>           |
      |  14861480 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t3>           |
      |  14861424 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t4>           |
      |  14861480 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t4>           |
      |  14861424 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t5>           |
      |  14861480 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t5>           |
      |  14861424 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t6>           |
      |  14861480 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t6>           |
      |  14861424 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t7>           |
      |  14861480 | MDL_SHARED_READ   | NULL          | Table metadata lock | <somedb>     | <some_t7>           |
      +-----------+-------------------+---------------+---------------------+--------------+---------------------+
      

            People

              Elkin Andrei Elkin
              bnestere Brandon Nesterenko