Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 10.3.22, 10.2.32
    • Component/s: mariabackup
    • Labels: None
    • Environment: Ubuntu 18.04 LTS

    Description

      mariabackup --prepare --export --use-memory=1G --target-dir=/backupdb/2020-04-23-04-29-52
      

      The prepare gets stuck here for hours:
      .....
      .....
      2020-04-23 4:29:58 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
      2020-04-23 4:29:58 0 [Note] InnoDB: Uses event mutexes
      2020-04-23 4:29:58 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
      2020-04-23 4:29:58 0 [Note] InnoDB: Number of pools: 1
      2020-04-23 4:29:58 0 [Note] InnoDB: Using SSE2 crc32 instructions
      2020-04-23 4:29:58 0 [Note] InnoDB: Initializing buffer pool, total size = 1G, instances = 1, chunk size = 1G
      2020-04-23 4:29:59 0 [Note] InnoDB: Completed initialization of buffer pool
      2020-04-23 4:29:59 0 [Note] InnoDB: page_cleaner coordinator priority: -20

      Suspecting:
      https://jira.mariadb.org/browse/MDEV-19401

      I asked to remove the --export option; same result.

      We tried to collect info with 'gdb' and 'perf', but they did not return anything. In only one case gdb returned this snapshot (a single sample), attached as backtrace-txt.log.

      One of the three threads apparently is in xtrabackup.cc:4284:

      /* Wait for threads to exit */
      while (1) {
              os_thread_sleep(1000000);
              pthread_mutex_lock(&count_mutex);
              bool stop = count == 0;
              pthread_mutex_unlock(&count_mutex);
              if (stop) {
                      break;
              }
      }

      So I am wondering whether it is stuck in that sleep loop, although I am not sure this is compatible with the observations.

      Attachments

        Activity

          marko Marko Mäkelä added a comment - edited

          claudio.nanni, are you sure that the backtrace-txt.log is not from mariabackup --backup? The threads there would seem more like that to me, not --prepare. I see no InnoDB threads executing.

          The mariabackup --prepare of a backup is close to invoking InnoDB crash recovery. I do not see any recovery here; only log-copying and data-file-copying threads that should not be run during --prepare.

          For debugging --prepare, I would suggest invoking a debug version, to get verbose output from the recovery code:

          mariabackup --prepare --dbug=d,ib_log …
          

          claudio.nanni Claudio Nanni added a comment -

          marko, yes, it should be --prepare:

          mariabackup --prepare --dbug=d,ib_log --use-memory=4G --target-dir=/mybackup/backup/

          mariabackup based on MariaDB server 10.3.22-MariaDB debian-linux-gnu (x86_64)
          [00] 2020-04-30 17:07:14 cd to /mybackup/backup/
          [00] 2020-04-30 17:07:14 This target seems to be not prepared yet.
          [00] 2020-04-30 17:07:14 mariabackup: using the following InnoDB configuration for recovery:
          [00] 2020-04-30 17:07:14 innodb_data_home_dir = .
          [00] 2020-04-30 17:07:14 innodb_data_file_path = ibdata1:12M:autoextend
          [00] 2020-04-30 17:07:14 innodb_log_group_home_dir = .
          [00] 2020-04-30 17:07:14 InnoDB: Using Linux native AIO
          [00] 2020-04-30 17:07:14 Starting InnoDB instance for recovery.
          [00] 2020-04-30 17:07:14 mariabackup: Using 4294967296 bytes for buffer pool (set by --use-memory parameter)
          2020-04-30 17:07:14 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
          2020-04-30 17:07:14 0 [Note] InnoDB: Uses event mutexes
          2020-04-30 17:07:14 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
          2020-04-30 17:07:14 0 [Note] InnoDB: Number of pools: 1
          2020-04-30 17:07:14 0 [Note] InnoDB: Using SSE2 crc32 instructions
          2020-04-30 17:07:15 0 [Note] InnoDB: Initializing buffer pool, total size = 4G, instances = 1, chunk size = 4G
          2020-04-30 17:07:15 0 [Note] InnoDB: Completed initialization of buffer pool
          2020-04-30 17:07:15 0 [Note] InnoDB: page_cleaner coordinator priority: -20
          
          

          And hangs.


          vlad.lesin Vladislav Lesin added a comment -

          According to the strace log sent by the customer, mariabackup hangs after the following calls:

          1588252863.329651 openat(AT_FDCWD, "./ibdata1", O_RDWR|O_CLOEXEC) = 3 <0.000286>
          1588252863.329998 fcntl(3, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = ?
          1588253063.953793 +++ killed by SIGINT +++
          

          It turned out that the target backup directory is located on NFS, and advisory locks can hang on NFS volumes. There are several workarounds, but the hang was not caused by mariabackup code, which is why the issue is closed.
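
          For illustration only: the hanging fcntl() corresponds to the advisory write lock requested on the system tablespace file right after it is opened. Below is a minimal standalone sketch of that request (not the actual server code; the file name and flags are copied from the strace output above). F_SETLK is nominally non-blocking, but on NFS the request still goes through the remote lock manager, and if that is unreachable or unresponsive the call can block indefinitely, matching the observed hang:

          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>

          int main(void)
          {
                  /* Open the data file the same way as seen in the strace output. */
                  int fd = open("./ibdata1", O_RDWR | O_CLOEXEC);
                  if (fd < 0) {
                          perror("open ./ibdata1");
                          return 1;
                  }

                  /* Request an exclusive advisory write lock on the whole file
                     (l_len == 0 means "to end of file"). F_SETLK does not wait
                     for a conflicting lock, but on NFS the request is forwarded
                     to the network lock manager and can hang there. */
                  struct flock lk = {
                          .l_type   = F_WRLCK,
                          .l_whence = SEEK_SET,
                          .l_start  = 0,
                          .l_len    = 0,
                  };

                  if (fcntl(fd, F_SETLK, &lk) == -1) {
                          perror("fcntl(F_SETLK)");
                          close(fd);
                          return 1;
                  }

                  puts("advisory write lock acquired");
                  close(fd);
                  return 0;
          }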

          spikestabber Brendan P added a comment -

          Unfortunately, I can consistently reproduce this bug on a good Galera SST dump, 6 times out of 10, on a very large data set of approx. 16 TB, on a 1 TB system with --use-memory=800G, all stored on a 24 TB fast SSD RAID 0 array. mariabackup either hangs with no CPU usage, waiting on a thread, or hangs spinning on 2 or more cores like this:

          [pid 18785] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
          [pid 18785] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1) = 0
          [pid 18787] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
          [pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18787] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
          [pid 18786] <... futex resumed> ) = 0
          [pid 18787] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18785] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18787] <... futex resumed> ) = 0
          [pid 18785] <... futex resumed> ) = 0
          [pid 18519] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
          [pid 18787] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
          [pid 18786] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
          [pid 18787] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
          [pid 18786] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
          [pid 18787] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18787] <... futex resumed> ) = 0
          [pid 18786] <... futex resumed> ) = 0
          [pid 18785] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18784] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18785] <... futex resumed> ) = 0
          [pid 18784] <... futex resumed> ) = 0
          [pid 18519] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
          [pid 18787] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
          [pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18787] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
          [pid 18786] <... futex resumed> ) = 0
          [pid 18787] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18519] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18787] <... futex resumed> ) = 0
          [pid 18519] <... futex resumed> ) = 0
          [pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
          [pid 18519] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>

          No errors in the log; all is well until this point in the crash recovery phase (being an SST):
          Dec 7 02:28:09 x -innobackupex-apply: 2021-12-07 2:28:09 139656444233472 [Note] InnoDB: To recover: 9930460 pages from log
          Dec 7 02:28:24 x -innobackupex-apply: 2021-12-07 2:28:24 139656444233472 [Note] InnoDB: To recover: 9792342 pages from log
          Dec 7 02:28:39 x -innobackupex-apply: 2021-12-07 2:28:39 139656444233472 [Note] InnoDB: To recover: 9622947 pages from log
          Dec 7 02:28:54 x -innobackupex-apply: 2021-12-07 2:28:54 139656444233472 [Note] InnoDB: To recover: 9510081 pages from log
          Dec 7 02:29:09 x -innobackupex-apply: 2021-12-07 2:29:09 139656444233472 [Note] InnoDB: To recover: 9503157 pages from log
          Dec 7 02:29:24 x -innobackupex-apply: 2021-12-07 2:29:24 139656427448064 [Note] InnoDB: To recover: 9402421 pages from log

          Each attempt can freeze at a random point, never with the same number of pages left to recover, and sometimes it just works: the prepare passes and the member rejoins the cluster as if this were never an issue.

          example of a hung mariabackup:
          PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
          18519 mysql 15 -5 0.939t 0.853t 4456 R 507.1 86.8 2780:00 mariabackup

          I will rerun with --dbug=d,ib_log during the log apply phase to see what can be found. This used to happen rarely, but now it is a far more regular occurrence. I've seen other reports of this happening but no real follow-ups. I figured at first it was a RAM limitation, but it can happen earlier on, in the first batch read, before the process even consumes a large amount of memory.

          ├─mysqld─┬─sh───wsrep_sst_maria─┬─logger
          │ │ ├─mariabackup───12*[{mariabackup}]
          │ │ └─wsrep_sst_maria───logger
          │ └─7*[{mysqld}]

          Thanks in advance,

          Bren


          People

            Assignee: vlad.lesin Vladislav Lesin
            Reporter: claudio.nanni Claudio Nanni
            Votes: 2
            Watchers: 6

