Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Description
http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2-big/builds/2027
10.1 f214d365121f1d9b5835134f4bead929 |
encryption.create_or_replace 'ctr,xtradb' w1 [ fail ] Found warnings/errors in server log file!
|
Test ended at 2018-05-06 08:11:15
|
line
|
2018-05-06 8:09:40 2794449728 [ERROR] InnoDB: Trying to access tablespace [space=909: page=5] but the tablespace does not exist or is just being dropped.
|
^ Found warnings in /mnt/buildbot/build/mariadb-10.1.33/mysql-test/var/1/log/mysqld.1.err
|
ok
|
Attachments
Issue Links
- blocks
-
MDEV-20399 SEGV in fil_system_t::keyrotate_next , InnoDB, table encrypted
-
- Closed
-
- relates to
-
MDEV-22258 Limit innodb_encryption_threads to 255
-
- Closed
-
Activity
Is this the same as:
10.5-d5e8b4d7f97c7a15be5d58b6f088c4c64787a385 |
|
encryption.create_or_replace 'ctr,innodb' w57 [ fail ] Found warnings/errors in server log file!
|
Test ended at 2020-06-03 14:03:32
|
line
|
2020-06-03 14:02:39 6 [Warning] InnoDB: Trying to delete tablespace 'test/create_or_replace_t' but there are 1 pending operations on it.
|
^ Found warnings in /home/dan/build-mariadb-server-10.5-debug-gccO0/mysql-test/var/57/log/mysqld.1.err
|
ok
|
|
While trying to repeat this, I got a slightly different type of failure, which looks related:
while ./mtr --rr --rr-arg=-h --parallel=auto encryption.{innodb_encryption_discard_import,create_or_replace}{,,,,,}; do :; done
|
10.2 1e31d74833d56609f8711022394c1eb2eb25a19a |
CURRENT_TEST: encryption.create_or_replace
|
mysqltest: At line 111: query 'SET GLOBAL innodb_encrypt_tables = OFF' failed: 2006: MySQL server has gone away
|
…
|
Version: '10.2.33-MariaDB-debug-log' socket: '/dev/shm/10.2o/mysql-test/var/tmp/7/mysqld.1.sock' port: 16120 Source distribution
|
2020-08-07 12:31:46 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 1.
|
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 4.
|
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #2 encryption thread id 41880383772416 total threads 4.
|
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #3 encryption thread id 140100598744832 total threads 4.
|
2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #4 encryption thread id 98690424022784 total threads 4.
|
2020-08-07 12:54:17 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it.
|
…
|
2020-08-07 13:08:26 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it.
|
InnoDB: ###### Diagnostic info printed to the standard error stream
|
2020-08-07 13:08:26 7689782224640 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
|
I found something interesting while analyzing the rr replay trace:
10.2 1e31d74833d56609f8711022394c1eb2eb25a19a |
#0 fil_space_acquire_low (id=<optimized out>, silent=false) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:2029
|
#1 0x000055d522b05fd2 in fil_space_acquire (id=168) at /mariadb/10.2o/storage/innobase/include/fil0fil.h:772
|
#2 buf_read_ahead_linear (page_id=<optimized out>, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, inside_ibuf=0) at /mariadb/10.2o/storage/innobase/buf/buf0rea.cc:579
|
#3 0x000055d522ada954 in buf_page_get_low (page_id={m_space = 168, m_page_no = 0}, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, rw_latch=<optimized out>,
|
guess=0x3be57341bd00, mode=<optimized out>, file=0x55d522f25852 "/mariadb/10.2o/storage/innobase/fil/fil0fil.cc", line=4497, mtr=0x2b677dc7f818, err=0x0)
|
at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4866
|
#4 0x000055d522adc481 in buf_page_get_gen (page_id=<optimized out>, page_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, file=<optimized out>, line=3222298992,
|
mtr=0x7f6bc0105de8, err=0x7f6bc0106100) at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4907
|
#5 0x000055d522b68039 in fsp_flags_try_adjust (space_id=94373927550768, flags=33) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4495
|
#6 0x000055d522b69d15 in fil_space_for_table_exists_in_mem (id=168, name=0x7f6bc01a32b0 "test/table0_int_autoinc", table_flags=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4555
|
#7 0x000055d5229ba3e5 in row_drop_single_table_tablespace (space_id=168, tablename=0x2b677dc84700 "", filepath=0x7f6bc00d8400 "./test/table0_int_autoinc.ibd", table_flags=33)
|
at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3253
|
#8 row_drop_table_for_mysql (name=0x2b677dc80620 "test/table0_int_autoinc", trx=0x7fd511fc0390, sqlcom=SQLCOM_DROP_TABLE, create_failed=200, nonatomic=<optimized out>)
|
at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3740
|
#9 0x000055d5228b3538 in ha_innobase::delete_table (this=<optimized out>, name=0x2b677dc82330 "./test/table0_int_autoinc", sqlcom=SQLCOM_DROP_TABLE)
|
at /mariadb/10.2o/storage/innobase/handler/ha_innodb.cc:13304
|
This handle was properly closed later (so this is not the root cause of the failure), but it should make absolutely no sense to invoke the MDEV-11623 fixup when we had already opened the tablespace, and especially not when we are about to drop the tablespace.
Before that, we were several times accessing the tablespace (acquiring and releasing the handle) in:
10.2 1e31d74833d56609f8711022394c1eb2eb25a19a |
#0 0x000055d522b6d553 in fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5911
|
#1 0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
|
#2 fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
|
#3 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
|
#4 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
|
After painstaking debugging, I seem to have found the root cause of this. It appears to be a race condition between two encryption threads.
10.2 1e31d74833d56609f8711022394c1eb2eb25a19a |
Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
|
|
Old value = 0
|
New value = 1
|
fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
|
5937 mutex_exit(&fil_system->mutex);
|
Current event: 7260715
|
(rr) bt
|
#0 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
|
#1 0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
|
#2 fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
|
#3 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
|
#4 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
|
(rr) n
|
…
|
2174 if (thr.space) {
|
(rr)
|
2175 fil_crypt_complete_rotate_space(&new_state, &thr);
|
(rr) when
|
Current event: 7260723
|
(rr) n
|
[Switching to Thread 26959.27675]
|
|
Thread 32 hit Hardware watchpoint 2: -location sp->n_pending_ops
|
|
Old value = 1
|
New value = 2
|
fil_space_next (prev_space=0x7f6bc00cab50)
|
at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
|
5937 mutex_exit(&fil_system->mutex);
|
Current event: 7260750
|
(rr) bt
|
#0 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
|
#1 0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
|
#2 fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
|
#3 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
|
#4 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
|
(rr) thread 29
|
[Switching to thread 29 (Thread 26959.27672)]
|
#0 0x0000000070000002 in ?? ()
|
(rr) bt
|
#0 0x0000000070000002 in ?? ()
|
#1 0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120
|
#2 0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:229
|
#3 0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355
|
#4 syscall_hook_internal (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2861
|
#5 syscall_hook (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2987
|
#6 0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282
|
#7 0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417
|
#8 0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428
|
#9 0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183
|
#10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508
|
#11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638
|
#12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158
|
#13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=573) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325
|
#14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507
|
#15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987
|
#16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x753c09233840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956
|
#17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071
|
#18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175
|
#19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
|
#20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
|
After this condition, the n_pending_ops would still reach 0 once more. Note, the Thread numbering may switch here, because I restarted the trace in between:
10.2 1e31d74833d56609f8711022394c1eb2eb25a19a |
Thread 1 hit Hardware watchpoint 2: -location sp->n_pending_ops
|
|
Old value = 1
|
New value = 0
|
0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920
|
5920 space->n_pending_ops--;
|
Current event: 7277146
|
Soon after that, we hit a permanent problem:
10.2 1e31d74833d56609f8711022394c1eb2eb25a19a |
Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
|
|
Old value = 1
|
New value = 2
|
fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
|
5937 mutex_exit(&fil_system->mutex);
|
Current event: 7279390
|
(rr) thread 30
|
[Switching to thread 30 (Thread 26959.27674)]
|
#0 0x0000000070000002 in ?? ()
|
(rr) bt
|
#0 0x0000000070000002 in ?? ()
|
#1 0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120
|
#2 0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:229
|
#3 0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355
|
#4 syscall_hook_internal (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2861
|
#5 syscall_hook (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2987
|
#6 0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282
|
#7 0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417
|
#8 0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428
|
#9 0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183
|
#10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508
|
#11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638
|
#12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158
|
#13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=578) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325
|
#14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507
|
#15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987
|
#16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x7f6bb66bb840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956
|
#17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071
|
#18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175
|
#19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
|
#20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
|
(rr) c
|
Continuing.
|
|
Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
|
|
Old value = 2
|
New value = 1
|
0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920
|
5920 space->n_pending_ops--;
|
Current event: 7279390
|
(rr)
|
Continuing.
|
|
Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
|
|
Old value = 1
|
New value = 2
|
fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
|
5937 mutex_exit(&fil_system->mutex);
|
Current event: 7281359
|
From this point on, the count will never reach 0 until the first warning message is issued. It seems to alternate between 1 and 2 all the time.
Why do we have multiple concurrently running fil_crypt_thread at all? All they are doing is marking pages in the buffer pool dirty. The actual work is being done by the page flushing. Especially if axel can confirm that a single encryption thread will not perform significantly worse than multiple ones, I think that we should deprecate and ignore the parameter.
To prove that the counter never reached 0 again, I let the execution continue for several more minutes with an additional condition on the watchpoint:
10.2 1e31d74833d56609f8711022394c1eb2eb25a19a |
(rr) when
|
Current event: 7279390
|
(rr) i b
|
Num Type Disp Enb Address What
|
1 breakpoint keep y 0x000055d522a773e0 in ib::warn::~warn() at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623
|
2 hw watchpoint keep y -location sp->n_pending_ops
|
breakpoint already hit 37 times
|
when
|
(rr) cond 2 !$5->n_pending_ops
|
(rr) enable
|
(rr) c
|
Continuing.
|
[Switching to Thread 26959.27283]
|
|
Thread 28 hit Breakpoint 1, ib::warn::~warn (this=0x2b677dc7f620) at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623
|
623 {
|
(rr) enable display
|
(rr) display
|
1: $5->n_pending_ops = 1
|
(rr) when
|
Current event: 21286261
|
I had forgotten that a benchmark was recently run in MDEV-22258, showing that marking pages dirty in multiple threads can slightly improve throughput. Thus, we cannot fix this bug by limiting the number of fil_crypt_thread to 1.
The failure is very random. Today, I seem to be unable to repeat it on either 10.1 or 10.2 (without using rr record; maybe that made the difference last week). On buildbot.askmonty.org, the test encryption.innodb_encryption_discard_import fails with this message only on 10.1 and not in later versions. The latest failure of that type was a week ago: http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/23955
10.1 8bca92c8845212ea96be404e664f4cbb |
encryption.innodb_encryption_discard_import 'xtradb' w4 [ fail ] Found warnings/errors in server log file!
|
Test ended at 2020-08-03 10:19:41
|
line
|
2020-08-03 10:19:07 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=4: page=541] but the tablespace does not exist or is just being dropped.
|
2020-08-03 10:19:12 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=5: page=506] but the tablespace does not exist or is just being dropped.
|
^ Found warnings in /mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/mysqld.1.err
|
ok
|
|
- skipping '/mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/encryption.innodb_encryption_discard_import-xtradb/'
|
|
Retrying test encryption.innodb_encryption_discard_import, attempt(2/3)...
|
I think that I will ignore that failure, because 10.1 will reach EOL soon, and because that failure seems to be extremely rare on buildbot. The previous occurrence in the main 10.1 branch was 1½ months earlier; in some bb-10.1- staging branches it failed 2 times during that time frame. There are no failures of this kind in later branches. (That test does fail in other ways every now and then.)
I believe that the root cause of the failure of encryption.create_or_replace is that the test is reducing the value of innodb_encryption_threads. When that happens during the execution of fil_crypt_find_space_to_rotate(), it can happen that the loop will be terminated without us ever reducing the reference count (invoking fil_space_release(state->space)). It appears that to improve the chances of that happening, we should have very many encrypted tables or partitions.
Other tests are affected too:
https://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/16158/steps/mtr_ps/logs/stdio
10.1 213ece2f2e8c23ee0e727cc02dcecfae6f510ca2
encryption.innodb_encryption_discard_import 'xtradb' w2 [ fail ] Found warnings/errors in server log file!
Test ended at 2019-02-02 12:01:52
line
2019-02-02 12:01:30 2730486592 [ERROR] InnoDB: Trying to access tablespace [space=6: page=675] but the tablespace does not exist or is just being dropped.
2019-02-02 12:01:30 2713701184 [ERROR] InnoDB: Trying to access tablespace [space=6: page=745] but the tablespace does not exist or is just being dropped.