Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-12773

xtrabackup --prepare may hang with 'Waiting for 1 active transactions to finish'

Details

    Description

      Xtrabackup 2.4.7 may hang in --prepare phase with MariaDB 10.2 in this test
      https://github.com/percona/percona-xtrabackup/blob/2.4/storage/innobase/xtrabackup/test/t/undo_tablespaces.sh

      /usr/bin/innobackupex version 2.4.7 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 6f7a799)
      xtrabackup: cd to /dev/shm/__var0/var1/backup/
      xtrabackup: This target seems to be not prepared yet.
      InnoDB: Number of pools: 1
      xtrabackup: xtrabackup_logfile detected: size=10092544, start_lsn=(1504807)
      xtrabackup: using the following InnoDB configuration for recovery:
      xtrabackup:   innodb_data_home_dir = .
      xtrabackup:   innodb_data_file_path = ibdata1:12M:autoextend
      xtrabackup:   innodb_log_group_home_dir = .
      xtrabackup:   innodb_log_files_in_group = 1
      xtrabackup:   innodb_log_file_size = 10092544
      xtrabackup: using the following InnoDB configuration for recovery:
      xtrabackup:   innodb_data_home_dir = .
      xtrabackup:   innodb_data_file_path = ibdata1:12M:autoextend
      xtrabackup:   innodb_log_group_home_dir = .
      xtrabackup:   innodb_log_files_in_group = 1
      xtrabackup:   innodb_log_file_size = 10092544
      xtrabackup: Starting InnoDB instance for recovery.
      xtrabackup: Using 104857600 bytes for buffer pool (set by --use-memory parameter)
      InnoDB: PUNCH HOLE support available
      InnoDB: Mutexes and rw_locks use GCC atomic builtins
      InnoDB: Uses event mutexes
      InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
      InnoDB: Compressed tables use zlib 1.2.8
      InnoDB: Number of pools: 1
      InnoDB: Using CPU crc32 instructions
      InnoDB: Initializing buffer pool, total size = 100M, instances = 1, chunk size = 100M
      InnoDB: Completed initialization of buffer pool
      InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
      InnoDB: Opened 4 undo tablespaces
      InnoDB: 4 undo tablespaces made active
      InnoDB: Highest supported file format is Barracuda.
      InnoDB: Log scan progressed past the checkpoint lsn 1504807
      InnoDB: Doing recovery: scanned up to log sequence number 6682112 (57%)
      InnoDB: Doing recovery: scanned up to log sequence number 10416535 (99%)
      InnoDB: Database was not shutdown normally!
      InnoDB: Starting crash recovery.
      InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 16049 row operations to undo
      InnoDB: Trx id counter is 1792
      InnoDB: Starting an apply batch of log records to the database...
      InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 
      InnoDB: Apply batch completed
      InnoDB: xtrabackup: Last MySQL binlog file position 3503344, file name ./mysql-bin.000002
      InnoDB: Creating shared tablespace for temporary commit
      InnoDB: Creating shared tablespace for temporary commit
      InnoDB: Rolling back trx with id 1466, 16049 rows to undo
       
      InnoDB: Progress in percents: 1InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
       2 3InnoDB: File './ibtmp1' size is now 12 MB.
      InnoDB: 96 redo rollback segment(s) found. 1 redo rollback segment(s) are active.
      InnoDB: 32 non-redo rollback segment(s) are active.
       4InnoDB: 5.7.13 started; log sequence number 10416535
      InnoDB: xtrabackup: Last MySQL binlog file position 3503344, file name ./mysql-bin.000002
       
      xtrabackup: starting shutdown with innodb_fast_shutdown = 1
      InnoDB: FTS optimize thread exiting.
      InnoDB: Starting shutdown...
      InnoDB: Waiting for 1 active transactions to finish
      InnoDB: Waiting for 1 active transactions to finish
      

      The problem is not stable. Sometimes I can repeat it 1/10, sometimes it is 9/10 .
      10.1.23 and 5.7.18 are not affected. 10.2.5 is affected.

      There is consideration that the problem is incompatibility introduced by MDEV-12289, after which InnoDB 5.7 is not expecting TRX_SYS page slots 1..32 to contain transactions.

      There is consideration that --prepare is not needed for simplest backup / restore scenario, and Server may be started directly on backup.

      Attachments

        1. data.zip
          7.14 MB
        2. MDEV-12773.zip
          8.26 MB

        Issue Links

          Activity

            MDEV-12773.zip is from slightly simplified test on current 10.2 :

            data - actual data directory (it probably may differ from backup as server was still running and doing background activities)
            backup_orig - copy of backup, taken immediately after --backup operation
            backup - actual backup with attempted --apply-log operation
            au.sh - actual script used for testing
            stdout.log - terminal's output

            anikitin Andrii Nikitin (Inactive) added a comment - MDEV-12773 .zip is from slightly simplified test on current 10.2 : data - actual data directory (it probably may differ from backup as server was still running and doing background activities) backup_orig - copy of backup, taken immediately after --backup operation backup - actual backup with attempted --apply-log operation au.sh - actual script used for testing stdout.log - terminal's output

            I am able to repeat the hang with the latest xtrabackup 2.4 branch as follows:

            mv MDEV-12773/backup_orig xtrabackup_backupfiles
            gdb xtrabackup
            

            …
            (gdb) run --prepare
            …
            InnoDB: 5.7.13 started; log sequence number 10602734
            [New Thread 0x7fff9ffff700 (LWP 22163)]
            [New Thread 0x7fff9f7fe700 (LWP 22164)]
            [New Thread 0x7fff9effd700 (LWP 22165)]
            InnoDB: xtrabackup: Last MySQL binlog file position 3503344, file name ./mysql-bin.000002
             
            xtrabackup: starting shutdown with innodb_fast_shutdown = 1
            InnoDB: FTS optimize thread exiting.
            [Thread 0x7fff9f7fe700 (LWP 22164) exited]
            [Thread 0x7fff9ffff700 (LWP 22163) exited]
            InnoDB: Starting shutdown...
            [Thread 0x7fffb9ffb700 (LWP 22160) exited]
            [Thread 0x7fffba7fc700 (LWP 22159) exited]
            [Thread 0x7fffb8ff9700 (LWP 22162) exited]
            [Thread 0x7fffb97fa700 (LWP 22161) exited]
             3[Thread 0x7fffd5ffb700 (LWP 22151) exited]
            [Thread 0x7fffbb7fe700 (LWP 22157) exited]
            [Thread 0x7fff9effd700 (LWP 22165) exited]
            [Thread 0x7fffbbfff700 (LWP 22156) exited]
            [Thread 0x7fffd4ff9700 (LWP 22155) exited]
            ^C
            Thread 1 "xtrabackup" received signal SIGINT, Interrupt.
            0x00007ffff7bcc7dd in nanosleep () at ../sysdeps/unix/syscall-template.S:84
            84	../sysdeps/unix/syscall-template.S: Tiedostoa tai hakemistoa ei ole.
            (gdb) bt
            #0  0x00007ffff7bcc7dd in nanosleep () at ../sysdeps/unix/syscall-template.S:84
            #1  0x00005555564e879d in os_thread_sleep (tm=100000)
                at /mariadb/percona-xtrabackup/storage/innobase/os/os0thread.cc:254
            #2  0x00005555564ba975 in logs_empty_and_mark_files_at_shutdown ()
                at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:2260
            #3  0x00005555565e0d35 in innobase_shutdown_for_mysql ()
                at /mariadb/percona-xtrabackup/storage/innobase/srv/srv0start.cc:2711
            #4  0x000055555625028d in innodb_end ()
                at /mariadb/percona-xtrabackup/storage/innobase/xtrabackup/src/xtrabackup.cc:2006
            #5  0x000055555625bf1d in xtrabackup_prepare_func ()
                at /mariadb/percona-xtrabackup/storage/innobase/xtrabackup/src/xtrabackup.cc:7758
            #6  0x000055555625d53d in main (argc=4, argv=0x7fffffffe678)
                at /mariadb/percona-xtrabackup/storage/innobase/xtrabackup/src/xtrabackup.cc:8396
            (gdb) p srv_shutdown_state 
            $9 = SRV_SHUTDOWN_CLEANUP
            (gdb) p trx_sys->rw_trx_list
            $2 = {count = 1, start = 0x555557ddd5c8, end = 0x555557ddd5c8, 
              node = &trx_t::trx_list, init = 51966}
            (gdb) p trx_sys->rw_trx_list.start.id
            $3 = 1466
            (gdb) p trx_sys->rw_trx_list.start.state
            $4 = TRX_STATE_ACTIVE
            (gdb) p trx_sys->rw_trx_list.start.undo_no
            $5 = 15656
            

            So, there is one transaction with 15656 rows to roll back, which the main thread is waiting to be rolled back.
            The background rollback thread is hung waiting for a redo log checkpoint to happen:

            Thread 14 (Thread 0x7fffd57fa700 (LWP 22154)):
            #0  buf_flush_wait_flushed (new_oldest=3587153)
                at /mariadb/percona-xtrabackup/storage/innobase/buf/buf0flu.cc:2055
            #1  0x00005555564b8d95 in log_preflush_pool_modified_pages (new_oldest=3587153)
                at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:1614
            #2  0x00005555564ba488 in log_checkpoint_margin ()
                at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:2131
            #3  0x00005555564ba832 in log_check_margins ()
                at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:2223
            #4  0x000055555651829b in log_free_check ()
                at /mariadb/percona-xtrabackup/storage/innobase/include/log0log.ic:521
            #5  0x000055555651a347 in que_run_threads_low (thr=0x7fffb4000f90)
                at /mariadb/percona-xtrabackup/storage/innobase/que/que0que.cc:1112
            #6  0x000055555651a508 in que_run_threads (thr=0x7fffb4000f90)
                at /mariadb/percona-xtrabackup/storage/innobase/que/que0que.cc:1158
            #7  0x000055555661a1a0 in trx_rollback_active (trx=0x555557ddd5c8)
                at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:678
            #8  0x000055555661a667 in trx_rollback_resurrected (trx=0x555557ddd5c8, all=1)
                at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:764
            #9  0x000055555661a8df in trx_rollback_or_clean_recovered (all=1)
                at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:826
            #10 0x000055555661aa39 in trx_rollback_or_clean_all_recovered (arg=0x0)
                at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:865
            #11 0x00007ffff7bc3494 in start_thread (arg=0x7fffd57fa700)
                at pthread_create.c:333
            #12 0x00007ffff60e693f in clone ()
                at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
            

            The reason why the checkpoint is hung is that the buf_flush_page_cleaner_coordinator thread already exited. I suspect that it is due to this condition:

            	if (srv_fast_shutdown == 2
            	    || srv_shutdown_state < SRV_SHUTDOWN_FLUSH_PHASE) {
            		/* In very fast shutdown or when innodb failed to start, we
            		simulate a crash of the buffer pool. We are not required to do
            		any flushing. */
            		goto thread_exit;
            	}
            

            The srv_shutdown_state == SRV_SHUTDOWN_CLEANUP fulfills the condition. I think that the page cleaner coordinator should stay around until the shutdown state has been moved beyond to SRV_SHUTDOWN_FLUSH_PHASE. That is what the code in MariaDB 10.2 is doing ever since the InnoDB code merge from MySQL 5.7.9:

            	if (srv_fast_shutdown == 2
            	    || srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS) {
            		/* In very fast shutdown or when innodb failed to start, we
            		simulate a crash of the buffer pool. We are not required to do
            		any flushing. */
            		goto thread_exit;
            	}
            

            This is also what Oracle MySQL 5.7 is doing ever since the condition was amended in May 2014.
            It appears that the bogus condition never existed in MySQL or MariaDB, but only in Percona’s code that is based on MySQL 5.7.
            The bogus condition was introduced in September 2016.

            Related to this hang is the following bug that to my knowledge has not been fixed in MySQL or XtraBackup:
            MDEV-12091 Shutdown fails to wait for rollback of recovered transactions to finish

            It seems to me that the bogus condition that was introduced into xtrabackup may be trying to work around MDEV-12091. I would suggest that Percona should revert that patch and apply my MDEV-12091 fix instead.

            marko Marko Mäkelä added a comment - I am able to repeat the hang with the latest xtrabackup 2.4 branch as follows: mv MDEV-12773/backup_orig xtrabackup_backupfiles gdb xtrabackup … (gdb) run --prepare … InnoDB: 5.7.13 started; log sequence number 10602734 [New Thread 0x7fff9ffff700 (LWP 22163)] [New Thread 0x7fff9f7fe700 (LWP 22164)] [New Thread 0x7fff9effd700 (LWP 22165)] InnoDB: xtrabackup: Last MySQL binlog file position 3503344, file name ./mysql-bin.000002   xtrabackup: starting shutdown with innodb_fast_shutdown = 1 InnoDB: FTS optimize thread exiting. [Thread 0x7fff9f7fe700 (LWP 22164) exited] [Thread 0x7fff9ffff700 (LWP 22163) exited] InnoDB: Starting shutdown... [Thread 0x7fffb9ffb700 (LWP 22160) exited] [Thread 0x7fffba7fc700 (LWP 22159) exited] [Thread 0x7fffb8ff9700 (LWP 22162) exited] [Thread 0x7fffb97fa700 (LWP 22161) exited] 3[Thread 0x7fffd5ffb700 (LWP 22151) exited] [Thread 0x7fffbb7fe700 (LWP 22157) exited] [Thread 0x7fff9effd700 (LWP 22165) exited] [Thread 0x7fffbbfff700 (LWP 22156) exited] [Thread 0x7fffd4ff9700 (LWP 22155) exited] ^C Thread 1 "xtrabackup" received signal SIGINT, Interrupt. 0x00007ffff7bcc7dd in nanosleep () at ../sysdeps/unix/syscall-template.S:84 84 ../sysdeps/unix/syscall-template.S: Tiedostoa tai hakemistoa ei ole. (gdb) bt #0 0x00007ffff7bcc7dd in nanosleep () at ../sysdeps/unix/syscall-template.S:84 #1 0x00005555564e879d in os_thread_sleep (tm=100000) at /mariadb/percona-xtrabackup/storage/innobase/os/os0thread.cc:254 #2 0x00005555564ba975 in logs_empty_and_mark_files_at_shutdown () at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:2260 #3 0x00005555565e0d35 in innobase_shutdown_for_mysql () at /mariadb/percona-xtrabackup/storage/innobase/srv/srv0start.cc:2711 #4 0x000055555625028d in innodb_end () at /mariadb/percona-xtrabackup/storage/innobase/xtrabackup/src/xtrabackup.cc:2006 #5 0x000055555625bf1d in xtrabackup_prepare_func () at /mariadb/percona-xtrabackup/storage/innobase/xtrabackup/src/xtrabackup.cc:7758 #6 0x000055555625d53d in main (argc=4, argv=0x7fffffffe678) at /mariadb/percona-xtrabackup/storage/innobase/xtrabackup/src/xtrabackup.cc:8396 (gdb) p srv_shutdown_state $9 = SRV_SHUTDOWN_CLEANUP (gdb) p trx_sys->rw_trx_list $2 = {count = 1, start = 0x555557ddd5c8, end = 0x555557ddd5c8, node = &trx_t::trx_list, init = 51966} (gdb) p trx_sys->rw_trx_list.start.id $3 = 1466 (gdb) p trx_sys->rw_trx_list.start.state $4 = TRX_STATE_ACTIVE (gdb) p trx_sys->rw_trx_list.start.undo_no $5 = 15656 So, there is one transaction with 15656 rows to roll back, which the main thread is waiting to be rolled back. The background rollback thread is hung waiting for a redo log checkpoint to happen: Thread 14 (Thread 0x7fffd57fa700 (LWP 22154)): #0 buf_flush_wait_flushed (new_oldest=3587153) at /mariadb/percona-xtrabackup/storage/innobase/buf/buf0flu.cc:2055 #1 0x00005555564b8d95 in log_preflush_pool_modified_pages (new_oldest=3587153) at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:1614 #2 0x00005555564ba488 in log_checkpoint_margin () at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:2131 #3 0x00005555564ba832 in log_check_margins () at /mariadb/percona-xtrabackup/storage/innobase/log/log0log.cc:2223 #4 0x000055555651829b in log_free_check () at /mariadb/percona-xtrabackup/storage/innobase/include/log0log.ic:521 #5 0x000055555651a347 in que_run_threads_low (thr=0x7fffb4000f90) at /mariadb/percona-xtrabackup/storage/innobase/que/que0que.cc:1112 #6 0x000055555651a508 in que_run_threads (thr=0x7fffb4000f90) at /mariadb/percona-xtrabackup/storage/innobase/que/que0que.cc:1158 #7 0x000055555661a1a0 in trx_rollback_active (trx=0x555557ddd5c8) at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:678 #8 0x000055555661a667 in trx_rollback_resurrected (trx=0x555557ddd5c8, all=1) at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:764 #9 0x000055555661a8df in trx_rollback_or_clean_recovered (all=1) at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:826 #10 0x000055555661aa39 in trx_rollback_or_clean_all_recovered (arg=0x0) at /mariadb/percona-xtrabackup/storage/innobase/trx/trx0roll.cc:865 #11 0x00007ffff7bc3494 in start_thread (arg=0x7fffd57fa700) at pthread_create.c:333 #12 0x00007ffff60e693f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97 The reason why the checkpoint is hung is that the buf_flush_page_cleaner_coordinator thread already exited. I suspect that it is due to this condition: if (srv_fast_shutdown == 2 || srv_shutdown_state < SRV_SHUTDOWN_FLUSH_PHASE) { /* In very fast shutdown or when innodb failed to start, we simulate a crash of the buffer pool. We are not required to do any flushing. */ goto thread_exit; } The srv_shutdown_state == SRV_SHUTDOWN_CLEANUP fulfills the condition. I think that the page cleaner coordinator should stay around until the shutdown state has been moved beyond to SRV_SHUTDOWN_FLUSH_PHASE. That is what the code in MariaDB 10.2 is doing ever since the InnoDB code merge from MySQL 5.7.9 : if (srv_fast_shutdown == 2 || srv_shutdown_state == SRV_SHUTDOWN_EXIT_THREADS) { /* In very fast shutdown or when innodb failed to start, we simulate a crash of the buffer pool. We are not required to do any flushing. */ goto thread_exit; } This is also what Oracle MySQL 5.7 is doing ever since the condition was amended in May 2014 . It appears that the bogus condition never existed in MySQL or MariaDB, but only in Percona’s code that is based on MySQL 5.7. The bogus condition was introduced in September 2016. Related to this hang is the following bug that to my knowledge has not been fixed in MySQL or XtraBackup: MDEV-12091 Shutdown fails to wait for rollback of recovered transactions to finish It seems to me that the bogus condition that was introduced into xtrabackup may be trying to work around MDEV-12091 . I would suggest that Percona should revert that patch and apply my MDEV-12091 fix instead.

            If I revert the one-liner Percona xtrabackup patch, the xtrabackup shutdown will not hang, but instead the recovered transaction will be rolled back:

            InnoDB: Rolling back trx with id 1466, 16049 rows to undo
            

            (Note that in the gdb output of my previous comment, some rows were already rolled back; only 15656 remained to be rolled back.)
            The progress report for the rollback is interleaved with other output.
            This problem also exists in MariaDB and will be fixed in MDEV-12323.

            I believe that even with this hang fixed, xtrabackup --prepare in 2.4 may work incorrectly with MariaDB 10.2.6 due to MDEV-12289.
            The ultimate fix would be to release Mariabackup for MariaDB 10.2.

            marko Marko Mäkelä added a comment - If I revert the one-liner Percona xtrabackup patch , the xtrabackup shutdown will not hang, but instead the recovered transaction will be rolled back: InnoDB: Rolling back trx with id 1466, 16049 rows to undo (Note that in the gdb output of my previous comment, some rows were already rolled back; only 15656 remained to be rolled back.) The progress report for the rollback is interleaved with other output. This problem also exists in MariaDB and will be fixed in MDEV-12323 . I believe that even with this hang fixed, xtrabackup --prepare in 2.4 may work incorrectly with MariaDB 10.2.6 due to MDEV-12289 . The ultimate fix would be to release Mariabackup for MariaDB 10.2.
            anikitin Andrii Nikitin (Inactive) added a comment - I've opened https://bugs.launchpad.net/percona-xtrabackup/+bug/1690333

            Fixed in 2.4.8

            anikitin Andrii Nikitin (Inactive) added a comment - Fixed in 2.4.8

            MDEV-12548 in MariaDB 10.2.7 implemented Mariabackup for the 10.2 series, removing the need to use xtrabackup with MariaDB.

            marko Marko Mäkelä added a comment - MDEV-12548 in MariaDB 10.2.7 implemented Mariabackup for the 10.2 series, removing the need to use xtrabackup with MariaDB.

            People

              anikitin Andrii Nikitin (Inactive)
              anikitin Andrii Nikitin (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.