[MDEV-22423] mariabackup hangs on prepare Created: 2020-04-30  Updated: 2021-12-07  Resolved: 2020-05-12

Status: Closed
Project: MariaDB Server
Component/s: mariabackup
Affects Version/s: 10.3.22
Fix Version/s: 10.2.32

Type: Bug Priority: Major
Reporter: Claudio Nanni Assignee: Vladislav Lesin
Resolution: Not a Bug Votes: 2
Labels: None
Environment:

Ubuntu 18.04 LTS


Attachments: File backtrace-txt.log    

 Description   

mariabackup --prepare --export --use-memory=1G --target-dir=/backupdb/2020-04-23-04-29-52

The prepare gets stuck here for hours:
.....
.....
2020-04-23 4:29:58 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-04-23 4:29:58 0 [Note] InnoDB: Uses event mutexes
2020-04-23 4:29:58 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2020-04-23 4:29:58 0 [Note] InnoDB: Number of pools: 1
2020-04-23 4:29:58 0 [Note] InnoDB: Using SSE2 crc32 instructions
2020-04-23 4:29:58 0 [Note] InnoDB: Initializing buffer pool, total size = 1G, instances = 1, chunk size = 1G
2020-04-23 4:29:59 0 [Note] InnoDB: Completed initialization of buffer pool
2020-04-23 4:29:59 0 [Note] InnoDB: page_cleaner coordinator priority: -20

Suspecting:
https://jira.mariadb.org/browse/MDEV-19401

I asked them to remove --export; same result.

We tried to collect information with gdb and perf, but they returned nothing useful. In only one case did gdb return a snapshot (a single sample), attached as backtrace-txt.log.

One of the three threads is apparently in xtrabackup.cc:4284:

/* Wait for threads to exit */
while (1) {
	os_thread_sleep(1000000);
	pthread_mutex_lock(&count_mutex);
	bool stop = count == 0;
	pthread_mutex_unlock(&count_mutex);
	if (stop) {
		break;
	}
}

So I am wondering whether it is stuck in this sleep loop, though I am not sure that is compatible with the observations.



 Comments   
Comment by Marko Mäkelä [ 2020-04-30 ]

claudio.nanni, are you sure that backtrace-txt.log is not from mariabackup --backup? The threads there look to me like --backup, not --prepare; I see no InnoDB threads executing.

The mariabackup --prepare of a backup is close to invoking InnoDB crash recovery. I do not see any recovery here, only log-copying and data-file-copying threads, which should not be running during --prepare.

For debugging --prepare, I would suggest invoking a debug version, to get verbose output from the recovery code:

mariabackup --prepare --dbug=d,ib_log …

Comment by Claudio Nanni [ 2020-04-30 ]

marko, yes, it should be --prepare:

mariabackup --prepare --dbug=d,ib_log --use-memory=4G --target-dir=/mybackup/backup/

mariabackup based on MariaDB server 10.3.22-MariaDB debian-linux-gnu (x86_64)
[00] 2020-04-30 17:07:14 cd to /mybackup/backup/
[00] 2020-04-30 17:07:14 This target seems to be not prepared yet.
[00] 2020-04-30 17:07:14 mariabackup: using the following InnoDB configuration for recovery:
[00] 2020-04-30 17:07:14 innodb_data_home_dir = .
[00] 2020-04-30 17:07:14 innodb_data_file_path = ibdata1:12M:autoextend
[00] 2020-04-30 17:07:14 innodb_log_group_home_dir = .
[00] 2020-04-30 17:07:14 InnoDB: Using Linux native AIO
[00] 2020-04-30 17:07:14 Starting InnoDB instance for recovery.
[00] 2020-04-30 17:07:14 mariabackup: Using 4294967296 bytes for buffer pool (set by --use-memory parameter)
2020-04-30 17:07:14 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2020-04-30 17:07:14 0 [Note] InnoDB: Uses event mutexes
2020-04-30 17:07:14 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2020-04-30 17:07:14 0 [Note] InnoDB: Number of pools: 1
2020-04-30 17:07:14 0 [Note] InnoDB: Using SSE2 crc32 instructions
2020-04-30 17:07:15 0 [Note] InnoDB: Initializing buffer pool, total size = 4G, instances = 1, chunk size = 4G
2020-04-30 17:07:15 0 [Note] InnoDB: Completed initialization of buffer pool
2020-04-30 17:07:15 0 [Note] InnoDB: page_cleaner coordinator priority: -20

And it hangs.

Comment by Vladislav Lesin [ 2020-05-12 ]

According to the strace log sent by the customer, mariabackup hangs after the following calls:

1588252863.329651 openat(AT_FDCWD, "./ibdata1", O_RDWR|O_CLOEXEC) = 3 <0.000286>
1588252863.329998 fcntl(3, F_SETLK, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = ?
1588253063.953793 +++ killed by SIGINT +++

It turned out that the target backup directory was located on NFS, and advisory locks can hang on NFS volumes. There are several workarounds, but since the hang was not caused by mariabackup code, the issue is closed.

Comment by Brendan P [ 2021-12-07 ]

Unfortunately, I can consistently reproduce this bug (6 times out of 10) on a good Galera SST dump: a very large data set of approx. 16 TB, on a system with 1 TB of RAM and --use-memory=800G, all stored on a 24 TB fast SSD RAID 0 array. mariabackup either hangs with no CPU usage, waiting on a thread, or hangs spinning on 2 or more cores like this:

[pid 18785] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18785] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 18787] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18787] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18786] <... futex resumed> ) = 0
[pid 18787] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18785] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18787] <... futex resumed> ) = 0
[pid 18785] <... futex resumed> ) = 0
[pid 18519] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18787] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 18786] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 18787] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18786] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18787] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18787] <... futex resumed> ) = 0
[pid 18786] <... futex resumed> ) = 0
[pid 18785] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18784] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18785] <... futex resumed> ) = 0
[pid 18784] <... futex resumed> ) = 0
[pid 18519] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18787] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18787] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 18786] <... futex resumed> ) = 0
[pid 18787] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18519] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18787] <... futex resumed> ) = 0
[pid 18519] <... futex resumed> ) = 0
[pid 18786] futex(0x7ff13810fc50, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 18519] futex(0x7ff13810fc50, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>

No errors in the log; all is well until this point in the crash recovery phase (this being an SST):
Dec 7 02:28:09 x -innobackupex-apply: 2021-12-07 2:28:09 139656444233472 [Note] InnoDB: To recover: 9930460 pages from log
Dec 7 02:28:24 x -innobackupex-apply: 2021-12-07 2:28:24 139656444233472 [Note] InnoDB: To recover: 9792342 pages from log
Dec 7 02:28:39 x -innobackupex-apply: 2021-12-07 2:28:39 139656444233472 [Note] InnoDB: To recover: 9622947 pages from log
Dec 7 02:28:54 x -innobackupex-apply: 2021-12-07 2:28:54 139656444233472 [Note] InnoDB: To recover: 9510081 pages from log
Dec 7 02:29:09 x -innobackupex-apply: 2021-12-07 2:29:09 139656444233472 [Note] InnoDB: To recover: 9503157 pages from log
Dec 7 02:29:24 x -innobackupex-apply: 2021-12-07 2:29:24 139656427448064 [Note] InnoDB: To recover: 9402421 pages from log

Each attempt can freeze at a random point, never with the same number of pages left to recover; and sometimes it just works: the prepare passes and the member rejoins the cluster as if this were never an issue.

Example of a hung mariabackup:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18519 mysql 15 -5 0.939t 0.853t 4456 R 507.1 86.8 2780:00 mariabackup

I will rerun with --dbug=d,ib_log during the log-apply phase to see what can be found. This used to happen rarely, but now it is a far more regular occurrence. I've seen other reports of this happening, but no real follow-ups. At first I figured it was a RAM-limitation matter, but it can happen early on, in the first batch read, before the process even consumes a large amount of memory.

├─mysqld─┬─sh───wsrep_sst_maria─┬─logger
│        │                      ├─mariabackup───12*[{mariabackup}]
│        │                      └─wsrep_sst_maria───logger
│        └─7*[{mysqld}]

Thanks in advance,

Bren

Generated at Thu Feb 08 09:14:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.