[MDEV-16115] encryption.create_or_replace fails in buildbot with [ERROR] InnoDB: Trying to access tablespace Created: 2018-05-07  Updated: 2020-10-06  Resolved: 2020-08-10

Status: Closed
Project: MariaDB Server
Component/s: Encryption, Storage Engine - InnoDB, Tests
Affects Version/s: 10.1, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.1.48, 10.2.35, 10.3.26, 10.4.16, 10.5.7

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: rr-profile-analyzed

Issue Links:
Blocks
blocks MDEV-20399 SEGV in fil_system_t::keyrotate_next ... Closed
Relates
relates to MDEV-22258 Limit innodb_encryption_threads to 255 Closed

 Description   

http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2-big/builds/2027

10.1 f214d365121f1d9b5835134f4bead929

encryption.create_or_replace 'ctr,xtradb' w1 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2018-05-06 08:11:15
line
2018-05-06  8:09:40 2794449728 [ERROR] InnoDB: Trying to access tablespace [space=909: page=5] but the tablespace does not exist or is just being dropped.
^ Found warnings in /mnt/buildbot/build/mariadb-10.1.33/mysql-test/var/1/log/mysqld.1.err
ok



 Comments   
Comment by Marko Mäkelä [ 2019-02-04 ]

Other tests are affected too:
https://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/16158/steps/mtr_ps/logs/stdio

10.1 213ece2f2e8c23ee0e727cc02dcecfae6f510ca2

encryption.innodb_encryption_discard_import 'xtradb' w2 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2019-02-02 12:01:52
line
2019-02-02 12:01:30 2730486592 [ERROR] InnoDB: Trying to access tablespace [space=6: page=675] but the tablespace does not exist or is just being dropped.
2019-02-02 12:01:30 2713701184 [ERROR] InnoDB: Trying to access tablespace [space=6: page=745] but the tablespace does not exist or is just being dropped.

Comment by Daniel Black [ 2020-06-03 ]

Is this the same as:

10.5-d5e8b4d7f97c7a15be5d58b6f088c4c64787a385

 
encryption.create_or_replace 'ctr,innodb' w57 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2020-06-03 14:03:32
line
2020-06-03 14:02:39 6 [Warning] InnoDB: Trying to delete tablespace 'test/create_or_replace_t' but there are 1 pending operations on it.
^ Found warnings in /home/dan/build-mariadb-server-10.5-debug-gccO0/mysql-test/var/57/log/mysqld.1.err
ok

Comment by Marko Mäkelä [ 2020-08-07 ]

While trying to repeat this, I got a slightly different type of failure, which looks related:

while ./mtr --rr --rr-arg=-h --parallel=auto encryption.{innodb_encryption_discard_import,create_or_replace}{,,,,,}; do :; done

10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

CURRENT_TEST: encryption.create_or_replace
mysqltest: At line 111: query 'SET GLOBAL innodb_encrypt_tables = OFF' failed: 2006: MySQL server has gone away
Version: '10.2.33-MariaDB-debug-log'  socket: '/dev/shm/10.2o/mysql-test/var/tmp/7/mysqld.1.sock'  port: 16120  Source distribution
2020-08-07 12:31:46 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 1.
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 4.
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #2 encryption thread id 41880383772416 total threads 4.
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #3 encryption thread id 140100598744832 total threads 4.
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #4 encryption thread id 98690424022784 total threads 4.
2020-08-07 12:54:17 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it.
2020-08-07 13:08:26 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it.
InnoDB: ###### Diagnostic info printed to the standard error stream
2020-08-07 13:08:26 7689782224640 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

I found something interesting while analyzing the rr replay trace:

10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

#0  fil_space_acquire_low (id=<optimized out>, silent=false) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:2029
#1  0x000055d522b05fd2 in fil_space_acquire (id=168) at /mariadb/10.2o/storage/innobase/include/fil0fil.h:772
#2  buf_read_ahead_linear (page_id=<optimized out>, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, inside_ibuf=0) at /mariadb/10.2o/storage/innobase/buf/buf0rea.cc:579
#3  0x000055d522ada954 in buf_page_get_low (page_id={m_space = 168, m_page_no = 0}, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, rw_latch=<optimized out>, 
    guess=0x3be57341bd00, mode=<optimized out>, file=0x55d522f25852 "/mariadb/10.2o/storage/innobase/fil/fil0fil.cc", line=4497, mtr=0x2b677dc7f818, err=0x0)
    at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4866
#4  0x000055d522adc481 in buf_page_get_gen (page_id=<optimized out>, page_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, file=<optimized out>, line=3222298992, 
    mtr=0x7f6bc0105de8, err=0x7f6bc0106100) at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4907
#5  0x000055d522b68039 in fsp_flags_try_adjust (space_id=94373927550768, flags=33) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4495
#6  0x000055d522b69d15 in fil_space_for_table_exists_in_mem (id=168, name=0x7f6bc01a32b0 "test/table0_int_autoinc", table_flags=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4555
#7  0x000055d5229ba3e5 in row_drop_single_table_tablespace (space_id=168, tablename=0x2b677dc84700 "", filepath=0x7f6bc00d8400 "./test/table0_int_autoinc.ibd", table_flags=33)
    at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3253
#8  row_drop_table_for_mysql (name=0x2b677dc80620 "test/table0_int_autoinc", trx=0x7fd511fc0390, sqlcom=SQLCOM_DROP_TABLE, create_failed=200, nonatomic=<optimized out>)
    at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3740
#9  0x000055d5228b3538 in ha_innobase::delete_table (this=<optimized out>, name=0x2b677dc82330 "./test/table0_int_autoinc", sqlcom=SQLCOM_DROP_TABLE)
    at /mariadb/10.2o/storage/innobase/handler/ha_innodb.cc:13304

This handle was properly closed later (so this is not the root cause of the failure), but it should make absolutely no sense to invoke the MDEV-11623 fixup when we had already opened the tablespace, and especially not when we are about to drop the tablespace.
Before that, we were several times accessing the tablespace (acquiring and releasing the handle) in:

10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

#0  0x000055d522b6d553 in fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5911
#1  0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
#2  fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
#3  0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

After painstaking debugging, I seem to have found the root cause of this. It appears to be a race condition between two encryption threads.

10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 0
New value = 1
fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
5937		mutex_exit(&fil_system->mutex);
Current event: 7260715
(rr) bt
#0  fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
#1  0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
#2  fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
#3  0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr) n
2174				if (thr.space) {
(rr) 
2175					fil_crypt_complete_rotate_space(&new_state, &thr);
(rr) when
Current event: 7260723
(rr) n
[Switching to Thread 26959.27675]
 
Thread 32 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 1
New value = 2
fil_space_next (prev_space=0x7f6bc00cab50)
    at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
5937		mutex_exit(&fil_system->mutex);
Current event: 7260750
(rr) bt
#0  fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
#1  0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
#2  fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
#3  0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr) thread 29
[Switching to thread 29 (Thread 26959.27672)]
#0  0x0000000070000002 in ?? ()
(rr) bt
#0  0x0000000070000002 in ?? ()
#1  0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120
#2  0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:229
#3  0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355
#4  syscall_hook_internal (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2861
#5  syscall_hook (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2987
#6  0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282
#7  0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417
#8  0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428
#9  0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183
#10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508
#11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638
#12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158
#13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=573) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325
#14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507
#15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987
#16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x753c09233840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956
#17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071
#18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175
#19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

After this condition, the n_pending_ops would still reach 0 once more. Note, the Thread numbering may switch here, because I restarted the trace in between:

10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

Thread 1 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 1
New value = 0
0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920
5920			space->n_pending_ops--;
Current event: 7277146

Soon after that, we hit a permanent problem:

10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 1
New value = 2
fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
5937		mutex_exit(&fil_system->mutex);
Current event: 7279390
(rr) thread 30
[Switching to thread 30 (Thread 26959.27674)]
#0  0x0000000070000002 in ?? ()
(rr) bt
#0  0x0000000070000002 in ?? ()
#1  0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120
#2  0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:229
#3  0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355
#4  syscall_hook_internal (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2861
#5  syscall_hook (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2987
#6  0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282
#7  0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417
#8  0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428
#9  0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183
#10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508
#11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638
#12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158
#13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=578) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325
#14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507
#15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987
#16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x7f6bb66bb840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956
#17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071
#18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175
#19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr) c
Continuing.
 
Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 2
New value = 1
0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920
5920			space->n_pending_ops--;
Current event: 7279390
(rr) 
Continuing.
 
Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
 
Old value = 1
New value = 2
fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
5937		mutex_exit(&fil_system->mutex);
Current event: 7281359

From this point on, the count will never reach 0 until the first warning message is issued. It seems to alternate between 1 and 2 all the time.

Why do we have multiple concurrently running fil_crypt_thread at all? All they are doing is marking pages in the buffer pool dirty. The actual work is being done by the page flushing. Especially if axel can confirm that a single encryption thread will not perform significantly worse than multiple ones, I think that we should deprecate and ignore the parameter.

Comment by Marko Mäkelä [ 2020-08-07 ]

To prove that the counter never reached 0 again, I let the execution continue for several more minutes with an additional condition on the watchpoint:

10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

(rr) when
Current event: 7279390
(rr) i b
Num     Type           Disp Enb Address            What
1       breakpoint     keep y   0x000055d522a773e0 in ib::warn::~warn() at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623
2       hw watchpoint  keep y                      -location sp->n_pending_ops
	breakpoint already hit 37 times
        when
(rr) cond 2 !$5->n_pending_ops
(rr) enable
(rr) c
Continuing.
[Switching to Thread 26959.27283]
 
Thread 28 hit Breakpoint 1, ib::warn::~warn (this=0x2b677dc7f620) at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623
623	{
(rr) enable display
(rr) display
1: $5->n_pending_ops = 1
(rr) when
Current event: 21286261

Comment by Marko Mäkelä [ 2020-08-10 ]

I had forgotten that a benchmark was recently run in MDEV-22258, showing that marking pages dirty in multiple threads can slightly improve throughput. Thus, we cannot fix this bug by limiting the number of fil_crypt_thread to 1.

The failure is very random. Today, I seem to be unable to repeat it on either 10.1 or 10.2 (without using rr record; maybe that made the difference last week). On buildbot.askmonty.org, the test encryption.innodb_encryption_discard_import fails with this message only on 10.1 and not in later versions. The latest failure of that type was a week ago: http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/23955

10.1 8bca92c8845212ea96be404e664f4cbb

encryption.innodb_encryption_discard_import 'xtradb' w4 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2020-08-03 10:19:41
line
2020-08-03 10:19:07 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=4: page=541] but the tablespace does not exist or is just being dropped.
2020-08-03 10:19:12 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=5: page=506] but the tablespace does not exist or is just being dropped.
^ Found warnings in /mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/mysqld.1.err
ok
 
 - skipping '/mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/encryption.innodb_encryption_discard_import-xtradb/'
 
Retrying test encryption.innodb_encryption_discard_import, attempt(2/3)...

I think that I will ignore that failure, because 10.1 will reach EOL soon, and because that failure seems to be extremely rare on buildbot. The previous occurrence in the main 10.1 branch was 1½ months earlier; in some bb-10.1- staging branches it failed 2 times during that time frame. There are no failures of this kind in later branches. (That test does fail in other ways every now and then.)

I believe that the root cause of the failure of encryption.create_or_replace is that the test is reducing the value of innodb_encryption_threads. When that happens during the execution of fil_crypt_find_space_to_rotate(), it can happen that the loop will be terminated without us ever reducing the reference count (invoking fil_space_release(state->space)). It appears that to improve the chances of that happening, we should have very many encrypted tables or partitions.

Comment by Marko Mäkelä [ 2020-08-10 ]

I have pushed the fix up to 10.5.

Generated at Thu Feb 08 08:26:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.