Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-16115

encryption.create_or_replace fails in buildbot with [ERROR] InnoDB: Trying to access tablespace

Details

    Description

      http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2-big/builds/2027

      10.1 f214d365121f1d9b5835134f4bead929

      encryption.create_or_replace 'ctr,xtradb' w1 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2018-05-06 08:11:15
      line
      2018-05-06  8:09:40 2794449728 [ERROR] InnoDB: Trying to access tablespace [space=909: page=5] but the tablespace does not exist or is just being dropped.
      ^ Found warnings in /mnt/buildbot/build/mariadb-10.1.33/mysql-test/var/1/log/mysqld.1.err
      ok
      

      Attachments

        Issue Links

          Activity

            Other tests are affected too:
            https://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/16158/steps/mtr_ps/logs/stdio

            10.1 213ece2f2e8c23ee0e727cc02dcecfae6f510ca2

            encryption.innodb_encryption_discard_import 'xtradb' w2 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2019-02-02 12:01:52
            line
            2019-02-02 12:01:30 2730486592 [ERROR] InnoDB: Trying to access tablespace [space=6: page=675] but the tablespace does not exist or is just being dropped.
            2019-02-02 12:01:30 2713701184 [ERROR] InnoDB: Trying to access tablespace [space=6: page=745] but the tablespace does not exist or is just being dropped.
            

            marko Marko Mäkelä added a comment - Other tests are affected too: https://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/16158/steps/mtr_ps/logs/stdio 10.1 213ece2f2e8c23ee0e727cc02dcecfae6f510ca2 encryption.innodb_encryption_discard_import 'xtradb' w2 [ fail ] Found warnings/errors in server log file! Test ended at 2019-02-02 12:01:52 line 2019-02-02 12:01:30 2730486592 [ERROR] InnoDB: Trying to access tablespace [space=6: page=675] but the tablespace does not exist or is just being dropped. 2019-02-02 12:01:30 2713701184 [ERROR] InnoDB: Trying to access tablespace [space=6: page=745] but the tablespace does not exist or is just being dropped.
            danblack Daniel Black added a comment -

            Is this the same as:

            10.5-d5e8b4d7f97c7a15be5d58b6f088c4c64787a385

             
            encryption.create_or_replace 'ctr,innodb' w57 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2020-06-03 14:03:32
            line
            2020-06-03 14:02:39 6 [Warning] InnoDB: Trying to delete tablespace 'test/create_or_replace_t' but there are 1 pending operations on it.
            ^ Found warnings in /home/dan/build-mariadb-server-10.5-debug-gccO0/mysql-test/var/57/log/mysqld.1.err
            ok
            
            

            danblack Daniel Black added a comment - Is this the same as: 10.5-d5e8b4d7f97c7a15be5d58b6f088c4c64787a385   encryption.create_or_replace 'ctr,innodb' w57 [ fail ] Found warnings/errors in server log file! Test ended at 2020-06-03 14:03:32 line 2020-06-03 14:02:39 6 [Warning] InnoDB: Trying to delete tablespace 'test/create_or_replace_t' but there are 1 pending operations on it. ^ Found warnings in /home/dan/build-mariadb-server-10.5-debug-gccO0/mysql-test/var/57/log/mysqld.1.err ok

            While trying to repeat this, I got a slightly different type of failure, which looks related:

            while ./mtr --rr --rr-arg=-h --parallel=auto encryption.{innodb_encryption_discard_import,create_or_replace}{,,,,,}; do :; done
            

            10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

            CURRENT_TEST: encryption.create_or_replace
            mysqltest: At line 111: query 'SET GLOBAL innodb_encrypt_tables = OFF' failed: 2006: MySQL server has gone away
            Version: '10.2.33-MariaDB-debug-log'  socket: '/dev/shm/10.2o/mysql-test/var/tmp/7/mysqld.1.sock'  port: 16120  Source distribution
            2020-08-07 12:31:46 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 1.
            2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 4.
            2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #2 encryption thread id 41880383772416 total threads 4.
            2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #3 encryption thread id 140100598744832 total threads 4.
            2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #4 encryption thread id 98690424022784 total threads 4.
            2020-08-07 12:54:17 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it.
            2020-08-07 13:08:26 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it.
            InnoDB: ###### Diagnostic info printed to the standard error stream
            2020-08-07 13:08:26 7689782224640 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            

            I found something interesting while analyzing the rr replay trace:

            10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

            #0  fil_space_acquire_low (id=<optimized out>, silent=false) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:2029
            #1  0x000055d522b05fd2 in fil_space_acquire (id=168) at /mariadb/10.2o/storage/innobase/include/fil0fil.h:772
            #2  buf_read_ahead_linear (page_id=<optimized out>, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, inside_ibuf=0) at /mariadb/10.2o/storage/innobase/buf/buf0rea.cc:579
            #3  0x000055d522ada954 in buf_page_get_low (page_id={m_space = 168, m_page_no = 0}, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, rw_latch=<optimized out>, 
                guess=0x3be57341bd00, mode=<optimized out>, file=0x55d522f25852 "/mariadb/10.2o/storage/innobase/fil/fil0fil.cc", line=4497, mtr=0x2b677dc7f818, err=0x0)
                at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4866
            #4  0x000055d522adc481 in buf_page_get_gen (page_id=<optimized out>, page_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, file=<optimized out>, line=3222298992, 
                mtr=0x7f6bc0105de8, err=0x7f6bc0106100) at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4907
            #5  0x000055d522b68039 in fsp_flags_try_adjust (space_id=94373927550768, flags=33) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4495
            #6  0x000055d522b69d15 in fil_space_for_table_exists_in_mem (id=168, name=0x7f6bc01a32b0 "test/table0_int_autoinc", table_flags=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4555
            #7  0x000055d5229ba3e5 in row_drop_single_table_tablespace (space_id=168, tablename=0x2b677dc84700 "", filepath=0x7f6bc00d8400 "./test/table0_int_autoinc.ibd", table_flags=33)
                at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3253
            #8  row_drop_table_for_mysql (name=0x2b677dc80620 "test/table0_int_autoinc", trx=0x7fd511fc0390, sqlcom=SQLCOM_DROP_TABLE, create_failed=200, nonatomic=<optimized out>)
                at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3740
            #9  0x000055d5228b3538 in ha_innobase::delete_table (this=<optimized out>, name=0x2b677dc82330 "./test/table0_int_autoinc", sqlcom=SQLCOM_DROP_TABLE)
                at /mariadb/10.2o/storage/innobase/handler/ha_innodb.cc:13304
            

            This handle was properly closed later (so this is not the root cause of the failure), but it should make absolutely no sense to invoke the MDEV-11623 fixup when we had already opened the tablespace, and especially not when we are about to drop the tablespace.
            Before that, we were several times accessing the tablespace (acquiring and releasing the handle) in:

            10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

            #0  0x000055d522b6d553 in fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5911
            #1  0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
            #2  fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
            #3  0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #4  0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            After painstaking debugging, I seem to have found the root cause of this. It appears to be a race condition between two encryption threads.

            10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

            Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 0
            New value = 1
            fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
            5937		mutex_exit(&fil_system->mutex);
            Current event: 7260715
            (rr) bt
            #0  fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
            #1  0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
            #2  fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
            #3  0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #4  0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            (rr) n
            2174				if (thr.space) {
            (rr) 
            2175					fil_crypt_complete_rotate_space(&new_state, &thr);
            (rr) when
            Current event: 7260723
            (rr) n
            [Switching to Thread 26959.27675]
             
            Thread 32 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 1
            New value = 2
            fil_space_next (prev_space=0x7f6bc00cab50)
                at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
            5937		mutex_exit(&fil_system->mutex);
            Current event: 7260750
            (rr) bt
            #0  fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
            #1  0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447
            #2  fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145
            #3  0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #4  0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            (rr) thread 29
            [Switching to thread 29 (Thread 26959.27672)]
            #0  0x0000000070000002 in ?? ()
            (rr) bt
            #0  0x0000000070000002 in ?? ()
            #1  0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120
            #2  0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:229
            #3  0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355
            #4  syscall_hook_internal (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2861
            #5  syscall_hook (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2987
            #6  0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282
            #7  0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417
            #8  0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428
            #9  0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183
            #10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508
            #11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638
            #12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158
            #13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=573) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325
            #14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507
            #15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987
            #16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x753c09233840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956
            #17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071
            #18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175
            #19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            After this condition, the n_pending_ops would still reach 0 once more. Note, the Thread numbering may switch here, because I restarted the trace in between:

            10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

            Thread 1 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 1
            New value = 0
            0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920
            5920			space->n_pending_ops--;
            Current event: 7277146
            

            Soon after that, we hit a permanent problem:

            10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

            Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 1
            New value = 2
            fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
            5937		mutex_exit(&fil_system->mutex);
            Current event: 7279390
            (rr) thread 30
            [Switching to thread 30 (Thread 26959.27674)]
            #0  0x0000000070000002 in ?? ()
            (rr) bt
            #0  0x0000000070000002 in ?? ()
            #1  0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120
            #2  0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:229
            #3  0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355
            #4  syscall_hook_internal (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2861
            #5  syscall_hook (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2987
            #6  0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282
            #7  0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417
            #8  0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428
            #9  0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183
            #10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508
            #11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638
            #12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158
            #13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=578) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325
            #14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507
            #15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987
            #16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x7f6bb66bb840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956
            #17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071
            #18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175
            #19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            (rr) c
            Continuing.
             
            Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 2
            New value = 1
            0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920
            5920			space->n_pending_ops--;
            Current event: 7279390
            (rr) 
            Continuing.
             
            Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops
             
            Old value = 1
            New value = 2
            fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937
            5937		mutex_exit(&fil_system->mutex);
            Current event: 7281359
            

            From this point on, the count will never reach 0 until the first warning message is issued. It seems to alternate between 1 and 2 all the time.

            Why do we have multiple concurrently running fil_crypt_thread at all? All they are doing is marking pages in the buffer pool dirty. The actual work is being done by the page flushing. Especially if axel can confirm that a single encryption thread will not perform significantly worse than multiple ones, I think that we should deprecate and ignore the parameter.

            marko Marko Mäkelä added a comment - While trying to repeat this, I got a slightly different type of failure, which looks related: while ./mtr --rr --rr-arg=-h --parallel=auto encryption.{innodb_encryption_discard_import,create_or_replace}{,,,,,}; do :; done 10.2 1e31d74833d56609f8711022394c1eb2eb25a19a CURRENT_TEST: encryption.create_or_replace mysqltest: At line 111: query 'SET GLOBAL innodb_encrypt_tables = OFF' failed: 2006: MySQL server has gone away … Version: '10.2.33-MariaDB-debug-log' socket: '/dev/shm/10.2o/mysql-test/var/tmp/7/mysqld.1.sock' port: 16120 Source distribution 2020-08-07 12:31:46 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 1. 2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #1 encryption thread id 128900711794432 total threads 4. 2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #2 encryption thread id 41880383772416 total threads 4. 2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #3 encryption thread id 140100598744832 total threads 4. 2020-08-07 12:32:53 47723491903232 [Note] InnoDB: Creating #4 encryption thread id 98690424022784 total threads 4. 2020-08-07 12:54:17 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it. … 2020-08-07 13:08:26 47723491903232 [Warning] InnoDB: Trying to close/delete/truncate tablespace 'test/table0_int_autoinc' but there are 1 pending operations on it. InnoDB: ###### Diagnostic info printed to the standard error stream 2020-08-07 13:08:26 7689782224640 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung. I found something interesting while analyzing the rr replay trace: 10.2 1e31d74833d56609f8711022394c1eb2eb25a19a #0 fil_space_acquire_low (id=<optimized out>, silent=false) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:2029 #1 0x000055d522b05fd2 in fil_space_acquire (id=168) at /mariadb/10.2o/storage/innobase/include/fil0fil.h:772 #2 buf_read_ahead_linear (page_id=<optimized out>, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, inside_ibuf=0) at /mariadb/10.2o/storage/innobase/buf/buf0rea.cc:579 #3 0x000055d522ada954 in buf_page_get_low (page_id={m_space = 168, m_page_no = 0}, page_size=@0x2b677dc7f6a0: {m_physical = 16384, m_logical = 16384, m_is_compressed = 0}, rw_latch=<optimized out>, guess=0x3be57341bd00, mode=<optimized out>, file=0x55d522f25852 "/mariadb/10.2o/storage/innobase/fil/fil0fil.cc", line=4497, mtr=0x2b677dc7f818, err=0x0) at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4866 #4 0x000055d522adc481 in buf_page_get_gen (page_id=<optimized out>, page_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, file=<optimized out>, line=3222298992, mtr=0x7f6bc0105de8, err=0x7f6bc0106100) at /mariadb/10.2o/storage/innobase/buf/buf0buf.cc:4907 #5 0x000055d522b68039 in fsp_flags_try_adjust (space_id=94373927550768, flags=33) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4495 #6 0x000055d522b69d15 in fil_space_for_table_exists_in_mem (id=168, name=0x7f6bc01a32b0 "test/table0_int_autoinc", table_flags=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:4555 #7 0x000055d5229ba3e5 in row_drop_single_table_tablespace (space_id=168, tablename=0x2b677dc84700 "", filepath=0x7f6bc00d8400 "./test/table0_int_autoinc.ibd", table_flags=33) at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3253 #8 row_drop_table_for_mysql (name=0x2b677dc80620 "test/table0_int_autoinc", trx=0x7fd511fc0390, sqlcom=SQLCOM_DROP_TABLE, create_failed=200, nonatomic=<optimized out>) at /mariadb/10.2o/storage/innobase/row/row0mysql.cc:3740 #9 0x000055d5228b3538 in ha_innobase::delete_table (this=<optimized out>, name=0x2b677dc82330 "./test/table0_int_autoinc", sqlcom=SQLCOM_DROP_TABLE) at /mariadb/10.2o/storage/innobase/handler/ha_innodb.cc:13304 This handle was properly closed later (so this is not the root cause of the failure), but it should make absolutely no sense to invoke the MDEV-11623 fixup when we had already opened the tablespace, and especially not when we are about to drop the tablespace. Before that, we were several times accessing the tablespace (acquiring and releasing the handle) in: 10.2 1e31d74833d56609f8711022394c1eb2eb25a19a #0 0x000055d522b6d553 in fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5911 #1 0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447 #2 fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145 #3 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477 #4 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 After painstaking debugging, I seem to have found the root cause of this. It appears to be a race condition between two encryption threads. 10.2 1e31d74833d56609f8711022394c1eb2eb25a19a Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = 0 New value = 1 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937 5937 mutex_exit(&fil_system->mutex); Current event: 7260715 (rr) bt #0 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937 #1 0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447 #2 fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145 #3 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477 #4 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (rr) n … 2174 if (thr.space) { (rr) 2175 fil_crypt_complete_rotate_space(&new_state, &thr); (rr) when Current event: 7260723 (rr) n [Switching to Thread 26959.27675]   Thread 32 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = 1 New value = 2 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937 5937 mutex_exit(&fil_system->mutex); Current event: 7260750 (rr) bt #0 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937 #1 0x000055d522b71866 in fil_crypt_find_space_to_rotate (key_state=<optimized out>, state=<optimized out>, recheck=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1447 #2 fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2145 #3 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477 #4 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (rr) thread 29 [Switching to thread 29 (Thread 26959.27672)] #0 0x0000000070000002 in ?? () (rr) bt #0 0x0000000070000002 in ?? () #1 0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120 #2 0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:229 #3 0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355 #4 syscall_hook_internal (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2861 #5 syscall_hook (call=0x7f6bab9fffa0) at ./src/preload/syscallbuf.c:2987 #6 0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282 #7 0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417 #8 0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428 #9 0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183 #10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508 #11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638 #12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158 #13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=573) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325 #14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507 #15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987 #16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x753c09233840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956 #17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071 #18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175 #19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477 #20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 After this condition, the n_pending_ops would still reach 0 once more. Note, the Thread numbering may switch here, because I restarted the trace in between: 10.2 1e31d74833d56609f8711022394c1eb2eb25a19a Thread 1 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = 1 New value = 0 0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920 5920 space->n_pending_ops--; Current event: 7277146 Soon after that, we hit a permanent problem: 10.2 1e31d74833d56609f8711022394c1eb2eb25a19a Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = 1 New value = 2 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937 5937 mutex_exit(&fil_system->mutex); Current event: 7279390 (rr) thread 30 [Switching to thread 30 (Thread 26959.27674)] #0 0x0000000070000002 in ?? () (rr) bt #0 0x0000000070000002 in ?? () #1 0x00000000682053a6 in _raw_syscall () at /build/rr-2KB12a/rr-5.3.0/src/preload/raw_syscall.S:120 #2 0x000000006820130f in traced_raw_syscall (call=call@entry=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:229 #3 0x000000006820436d in sys_futex (call=<optimized out>) at ./src/preload/syscallbuf.c:1355 #4 syscall_hook_internal (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2861 #5 syscall_hook (call=0x7f6bab4fffa0) at ./src/preload/syscallbuf.c:2987 #6 0x00000000682011ea in _syscall_hook_trampoline () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:282 #7 0x000000006820121a in __morestack () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:417 #8 0x0000000068201235 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /build/rr-2KB12a/rr-5.3.0/src/preload/syscall_hook.S:428 #9 0x00004657595ae7b8 in futex_wait_cancelable (private=0, expected=0, futex_word=0x55d5248f5dc8) at ../sysdeps/nptl/futex-internal.h:183 #10 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x55d5248f5d78, cond=0x55d5248f5da0) at pthread_cond_wait.c:508 #11 __pthread_cond_wait (cond=0x55d5248f5da0, mutex=0x55d5248f5d78) at pthread_cond_wait.c:638 #12 0x000055d522941e96 in os_event::wait (this=0x55d5248f5d60) at /mariadb/10.2o/storage/innobase/os/os0event.cc:158 #13 os_event::wait_low (this=0x55d5248f5d60, reset_sig_count=578) at /mariadb/10.2o/storage/innobase/os/os0event.cc:325 #14 0x000055d522942195 in os_event_wait_low (event=0x55d5248f5dc8, reset_sig_count=128) at /mariadb/10.2o/storage/innobase/os/os0event.cc:507 #15 0x000055d522af5454 in buf_flush_wait_batch_end (buf_pool=<optimized out>, type=<optimized out>) at /mariadb/10.2o/storage/innobase/buf/buf0flu.cc:1987 #16 0x000055d522b73c40 in fil_crypt_flush_space (state=0x7f6bb66bb840) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:1956 #17 fil_crypt_complete_rotate_space (key_state=<optimized out>, state=<optimized out>) at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2071 #18 0x000055d522b72ed0 in fil_crypt_thread () at /mariadb/10.2o/storage/innobase/fil/fil0crypt.cc:2175 #19 0x00004657595a7ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477 #20 0x000060461686eeaf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (rr) c Continuing.   Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = 2 New value = 1 0x000055d522b6d512 in fil_space_next (prev_space=0x7f6bc00da5a0) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5920 5920 space->n_pending_ops--; Current event: 7279390 (rr) Continuing.   Thread 29 hit Hardware watchpoint 2: -location sp->n_pending_ops   Old value = 1 New value = 2 fil_space_next (prev_space=0x7f6bc00cab50) at /mariadb/10.2o/storage/innobase/fil/fil0fil.cc:5937 5937 mutex_exit(&fil_system->mutex); Current event: 7281359 From this point on, the count will never reach 0 until the first warning message is issued. It seems to alternate between 1 and 2 all the time. Why do we have multiple concurrently running fil_crypt_thread at all? All they are doing is marking pages in the buffer pool dirty. The actual work is being done by the page flushing. Especially if axel can confirm that a single encryption thread will not perform significantly worse than multiple ones, I think that we should deprecate and ignore the parameter.

            To prove that the counter never reached 0 again, I let the execution continue for several more minutes with an additional condition on the watchpoint:

            10.2 1e31d74833d56609f8711022394c1eb2eb25a19a

            (rr) when
            Current event: 7279390
            (rr) i b
            Num     Type           Disp Enb Address            What
            1       breakpoint     keep y   0x000055d522a773e0 in ib::warn::~warn() at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623
            2       hw watchpoint  keep y                      -location sp->n_pending_ops
            	breakpoint already hit 37 times
                    when
            (rr) cond 2 !$5->n_pending_ops
            (rr) enable
            (rr) c
            Continuing.
            [Switching to Thread 26959.27283]
             
            Thread 28 hit Breakpoint 1, ib::warn::~warn (this=0x2b677dc7f620) at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623
            623	{
            (rr) enable display
            (rr) display
            1: $5->n_pending_ops = 1
            (rr) when
            Current event: 21286261
            

            marko Marko Mäkelä added a comment - To prove that the counter never reached 0 again, I let the execution continue for several more minutes with an additional condition on the watchpoint: 10.2 1e31d74833d56609f8711022394c1eb2eb25a19a (rr) when Current event: 7279390 (rr) i b Num Type Disp Enb Address What 1 breakpoint keep y 0x000055d522a773e0 in ib::warn::~warn() at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623 2 hw watchpoint keep y -location sp->n_pending_ops breakpoint already hit 37 times when (rr) cond 2 !$5->n_pending_ops (rr) enable (rr) c Continuing. [Switching to Thread 26959.27283]   Thread 28 hit Breakpoint 1, ib::warn::~warn (this=0x2b677dc7f620) at /mariadb/10.2o/storage/innobase/ut/ut0ut.cc:623 623 { (rr) enable display (rr) display 1: $5->n_pending_ops = 1 (rr) when Current event: 21286261

            I had forgotten that a benchmark was recently run in MDEV-22258, showing that marking pages dirty in multiple threads can slightly improve throughput. Thus, we cannot fix this bug by limiting the number of fil_crypt_thread to 1.

            The failure is very random. Today, I seem to be unable to repeat it on either 10.1 or 10.2 (without using rr record; maybe that made the difference last week). On buildbot.askmonty.org, the test encryption.innodb_encryption_discard_import fails with this message only on 10.1 and not in later versions. The latest failure of that type was a week ago: http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/23955

            10.1 8bca92c8845212ea96be404e664f4cbb

            encryption.innodb_encryption_discard_import 'xtradb' w4 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2020-08-03 10:19:41
            line
            2020-08-03 10:19:07 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=4: page=541] but the tablespace does not exist or is just being dropped.
            2020-08-03 10:19:12 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=5: page=506] but the tablespace does not exist or is just being dropped.
            ^ Found warnings in /mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/mysqld.1.err
            ok
             
             - skipping '/mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/encryption.innodb_encryption_discard_import-xtradb/'
             
            Retrying test encryption.innodb_encryption_discard_import, attempt(2/3)...
            

            I think that I will ignore that failure, because 10.1 will reach EOL soon, and because that failure seems to be extremely rare on buildbot. The previous occurrence in the main 10.1 branch was 1½ months earlier; in some bb-10.1- staging branches it failed 2 times during that time frame. There are no failures of this kind in later branches. (That test does fail in other ways every now and then.)

            I believe that the root cause of the failure of encryption.create_or_replace is that the test is reducing the value of innodb_encryption_threads. When that happens during the execution of fil_crypt_find_space_to_rotate(), it can happen that the loop will be terminated without us ever reducing the reference count (invoking fil_space_release(state->space)). It appears that to improve the chances of that happening, we should have very many encrypted tables or partitions.

            marko Marko Mäkelä added a comment - I had forgotten that a benchmark was recently run in MDEV-22258 , showing that marking pages dirty in multiple threads can slightly improve throughput. Thus, we cannot fix this bug by limiting the number of fil_crypt_thread to 1. The failure is very random. Today, I seem to be unable to repeat it on either 10.1 or 10.2 (without using rr record ; maybe that made the difference last week). On buildbot.askmonty.org, the test encryption.innodb_encryption_discard_import fails with this message only on 10.1 and not in later versions. The latest failure of that type was a week ago: http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/23955 10.1 8bca92c8845212ea96be404e664f4cbb encryption.innodb_encryption_discard_import 'xtradb' w4 [ fail ] Found warnings/errors in server log file! Test ended at 2020-08-03 10:19:41 line 2020-08-03 10:19:07 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=4: page=541] but the tablespace does not exist or is just being dropped. 2020-08-03 10:19:12 2738875200 [ERROR] InnoDB: Trying to access tablespace [space=5: page=506] but the tablespace does not exist or is just being dropped. ^ Found warnings in /mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/mysqld.1.err ok   - skipping '/mnt/buildbot/build/mariadb-10.1.46/mysql-test/var/4/log/encryption.innodb_encryption_discard_import-xtradb/'   Retrying test encryption.innodb_encryption_discard_import, attempt(2/3)... I think that I will ignore that failure, because 10.1 will reach EOL soon, and because that failure seems to be extremely rare on buildbot. The previous occurrence in the main 10.1 branch was 1½ months earlier; in some bb-10.1- staging branches it failed 2 times during that time frame. There are no failures of this kind in later branches. (That test does fail in other ways every now and then.) I believe that the root cause of the failure of encryption.create_or_replace is that the test is reducing the value of innodb_encryption_threads . When that happens during the execution of fil_crypt_find_space_to_rotate() , it can happen that the loop will be terminated without us ever reducing the reference count (invoking fil_space_release(state->space) ). It appears that to improve the chances of that happening, we should have very many encrypted tables or partitions.
            marko Marko Mäkelä added a comment - - edited

            I have pushed the fix up to 10.5.

            marko Marko Mäkelä added a comment - - edited I have pushed the fix up to 10.5.

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.