MariaDB Server / MDEV-27466

Deadlock when the purge thread races the flush thread for a page lock under InnoDB redo log space pressure

Details

    Description

      This is not a full bug report yet, though I do have a reproducible scenario on a test server. I am working on a smaller test case.

      To cause the problem, we do this (see the SQL sketch after this list):

      set innodb_log_file_size to 4M to quickly simulate a log space shortage
      use a table with 72M rows
      alter it with the INSTANT algorithm, adding a few columns
      set innodb_buffer_pool_size to 1G
      run an update modifying one of the instant-added columns (everything goes well, the query takes 3 minutes)
      then set innodb_buffer_pool_size to 512M and wait for the dust to settle so that it actually shows as 512M
      then re-run the above query
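
      A minimal SQL sketch of the steps above (the table and column names are made up; the real test used a 72M-row table, and innodb_log_file_size has to be set in the config file since it is not dynamic here):

      -- Sketch only: t1, c1, c2 are hypothetical names.
      -- In my.cnf (requires a restart): innodb_log_file_size = 4M

      -- Add a couple of columns instantly to the large table:
      ALTER TABLE t1
        ADD COLUMN c1 INT NOT NULL DEFAULT 0,
        ADD COLUMN c2 INT NOT NULL DEFAULT 0,
        ALGORITHM=INSTANT;

      -- With a 1G buffer pool, the update of an instant-added column succeeds:
      SET GLOBAL innodb_buffer_pool_size = 1073741824;
      UPDATE t1 SET c1 = c1 + 1;

      -- Shrink the pool, wait for the resize to complete (e.g. check
      -- SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_resize_status'),
      -- then re-run the same statement; this run hangs as described below:
      SET GLOBAL innodb_buffer_pool_size = 536870912;
      UPDATE t1 SET c1 = c1 + 1;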

      This time we go into the deadlock timeout, and from the gdb trace we can see this race:

      Thread 23 (Thread 0x7f39f4ff5700 (LWP 38128)):
      #0  0x00007f3a20687a35 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
      #1  0x0000560a685f3b89 in wait (this=0x7f3a0c429bb0)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/include/sync0types.h:463
      #2  os_event::wait_low(long) () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/os/os0event.cc:325
      #3  0x0000560a685f3e65 in os_event_wait_low (event=<optimized out>, reset_sig_count=<optimized out>)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/os/os0event.cc:502
      #4  0x0000560a68670c21 in sync_array_wait_event(sync_array_t*, sync_cell_t*&) ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/include/ib0mutex.h:496
      #5  0x0000560a6867350a in rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/sync/sync0rw.cc:833
      #6  0x0000560a686ea9c9 in pfs_rw_lock_sx_lock_func (pass=2, 
          file_name=0x560a68cd2a38 "/home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc", line=1186, 
          lock=0x7f39d88fb248) at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/include/sync0rw.ic:674
      #7  buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool) ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1186
      #8  0x0000560a686eb201 in buf_flush_try_neighbors (n_to_flush=<optimized out>, n_flushed=<optimized out>, 
          flush_type=BUF_FLUSH_LIST, page_id=...)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1431
      #9  buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*) [clone .isra.0] ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1505
      #10 0x0000560a686ec109 in buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long) ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1763
      #11 0x0000560a686edef9 in buf_flush_batch (n=0x7f39f4ff4c50, lsn_limit=1446564044560, min_n=18446744073709551614, 
          flush_type=BUF_FLUSH_LIST, buf_pool=0x560a6b20c620)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1834
      #12 buf_flush_do_batch (n=0x7f39f4ff4c50, lsn_limit=1446564044560, min_n=18446744073709551614, type=BUF_FLUSH_LIST, 
          buf_pool=0x560a6b20c620) at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1992
      #13 buf_flush_do_batch (type=BUF_FLUSH_LIST, n=0x7f39f4ff4c50, lsn_limit=<optimized out>, min_n=<optimized out>, 
          buf_pool=0x560a6b20c620) at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:1975
      #14 pc_flush_slot () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:2791
      #15 0x0000560a680b1d54 in buf_flush_page_cleaner_coordinator ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/buf/buf0flu.cc:3160
      #16 0x00007f3a20683ea5 in start_thread () from /lib64/libpthread.so.0
      #17 0x00007f3a1f11396d in clone () from /lib64/libc.so.6
      
      

      The flush thread wants a page lock that is held by another thread.
      Following the writer_thread member of the rw_lock, we find that it is held by the purge thread, with this trace:

      Thread 16 (Thread 0x7f39ec7f8700 (LWP 38147)):
      #0  0x00007f3a2068ae9d in nanosleep () from /lib64/libpthread.so.0
      #1  0x0000560a685f401c in os_thread_sleep (tm=tm@entry=100000)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/os/os0thread.cc:231
      #2  0x0000560a685d96ca in log_margin_checkpoint_age(unsigned long) ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/log/log0log.cc:283
      #3  0x0000560a685eb4f3 in prepare_write (this=0x7f39ec7f6c50)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/mtr/mtr0mtr.cc:699
      #4  mtr_t::commit() () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/mtr/mtr0mtr.cc:439
      #5  0x0000560a6869c445 in trx_undo_truncate_start(trx_rseg_t*, unsigned long, unsigned long, unsigned long) ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0undo.cc:1018
      #6  0x0000560a6867cae4 in trx_purge_truncate_rseg_history (limit=..., rseg=...)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0purge.cc:465
      #7  trx_purge_truncate_history() () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0purge.cc:564
      #8  0x0000560a6867f9e5 in trx_purge(unsigned long, bool) ()
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/trx/trx0purge.cc:1329
      #9  0x0000560a6866cbad in srv_do_purge (n_total_purged=<synthetic pointer>)
          at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/srv/srv0srv.cc:2613
      #10 srv_purge_coordinator_thread () at /home/bdempsey/samurai/var/10.4.20samurai/src/storage/innobase/srv/srv0srv.cc:2752
      #11 0x00007f3a20683ea5 in start_thread () from /lib64/libpthread.so.0
      #12 0x00007f3a1f11396d in clone () from /lib64/libc.so.6
      

      So the purge thread is apparently waiting for log space to become available, but that will never happen: it is waiting while holding the lock on the very page that the flush thread needs to flush in order to free up that space.
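
      While the second UPDATE is hung, the symptom is also visible from SQL: the gap between the log sequence number and the last checkpoint stays pinned near the tiny redo log capacity because the checkpoint cannot advance. A quick way to watch it (assuming the Innodb_checkpoint_* status variables are available in the build):

      -- Checkpoint age stops shrinking while the purge and flush threads
      -- block each other; "Last checkpoint at" in the LOG section of
      -- SHOW ENGINE INNODB STATUS stops advancing as well.
      SHOW GLOBAL STATUS LIKE 'Innodb_checkpoint%';
      SHOW ENGINE INNODB STATUS;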

      We are working on isolating this into something manageable that we can post, but since we are aware of this bug I figured I would open an issue now and hopefully get some tips on the code in question. In particular, I would like to know how to get a call to trx_undo_truncate_start() to do something meaningful - I can get a call to happen, but it returns immediately because the limit argument is 0.

      Attachments

        1. tpcc1.pdf
          98 kB
        2. tpcc2.pdf
          97 kB


          Activity

            spachev Sasha Pachev added a comment -

            marko - done, also added the safety explanation comment as you requested.


            stephane@skysql.com VAROQUI Stephane added a comment -

            Can 10.3 be affected as well? I hit a similar long semaphore wait (600 seconds) on a read-only replica with conservative and multi-source GTID replication.
            The purge thread was waiting for another thread doing a long CREATE OR REPLACE:

            create or replace table SANDBOX.T1
            (PRIMARY KEY (ID,DATE,ID_ORDER)
            ,KEY `FKIndex1` (ID_ORDER))
             
            SELECT ...
            

            Unfortunately the stack trace did not get written to disk, but from monitoring we get:

            =====================================
            2022-02-16 19:55:19 0x7efdd17fc700 INNODB MONITOR OUTPUT
            =====================================
            Per second averages calculated from the last 20 seconds
            -----------------
            BACKGROUND THREAD
            -----------------
            srv_master_thread loops: 35904 srv_active, 0 srv_shutdown, 18927 srv_idle
            srv_master_thread log flush and writes: 54830
            ----------
            SEMAPHORES
            ----------
            OS WAIT ARRAY INFO: reservation count 15411906
             
             
            --Thread 139628552120064 has waited at row0purge.cc line 1067 for 536.00 seconds the semaphore:
            S-lock on RW-latch at 0x562746c6a9e0 created in file dict0dict.cc line 920
            a writer (thread id 139628598245120) has reserved it in mode  exclusive
            number of readers 0, waiters flag 1, lock_word: 0
            Last time write locked in file dict0stats.cc line 2488
             
            --Thread 139628633913088 has waited at srv0srv.cc line 2023 for 939.00 seconds the semaphore:
            X-lock on RW-latch at 0x562746c6a9e0 created in file dict0dict.cc line 920
            a writer (thread id 139628598245120) has reserved it in mode  exclusive
            number of readers 0, waiters flag 1, lock_word: 0
            Last time write locked in file dict0stats.cc line 2488
             
             
            --Thread 139628539541248 has waited at row0purge.cc line 1067 for 536.00 seconds the semaphore:
            S-lock on RW-latch at 0x562746c6a9e0 created in file dict0dict.cc line 920
            a writer (thread id 139628598245120) has reserved it in mode  exclusive
             
            number of readers 0, waiters flag 1, lock_word: 0
            Last time write locked in file dict0stats.cc line 2488
            OS WAIT ARRAY INFO: signal count 19724996
            RW-shared spins 18501644, rounds 75969116, OS waits 1318974
            RW-excl spins 37487656, rounds 126621842, OS waits 1242171
            RW-sx spins 334085, rounds 5011848, OS waits 47900
            Spin rounds per wait: 4.11 RW-shared, 3.38 RW-excl, 15.00 RW-sx
            ------------
            TRANSACTIONS
            ------------
            Trx id counter 21851754826
            Purge done for trx's n:o < 21851754824 undo n:o < 0 state: running
            History list length 2
            LIST OF TRANSACTIONS FOR EACH SESSION:
            ---TRANSACTION 421161563984184, COMMITTED IN MEMORY flushing log
            0 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 8
            ---TRANSACTION 21851754821, ACTIVE (PREPARED) 958 sec
            2 lock struct(s), heap size 1128, 2 row lock(s), undo log entries 3
            MySQL thread id 31, OS thread handle 139628805043968, query id 6098577 Init
            COMMIT
             
            --------
            FILE I/O
            --------
            I/O thread 0 state: waiting for i/o request (insert buffer thread)
            I/O thread 1 state: waiting for i/o request (log thread)
            I/O thread 6 state: waiting for i/o request (write thread)
            ....
            I/O thread 9 state: waiting for i/o request (write thread)
            Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
             ibuf aio reads:, log i/o's:, sync i/o's:
            Pending flushes (fsync) log: 0; buffer pool: 1
            829392 OS file reads, 7730749 OS file writes, 426108 OS fsyncs
            0 pending reads, 1 pending writes
            0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
            -------------------------------------
            INSERT BUFFER AND ADAPTIVE HASH INDEX
            -------------------------------------
            Ibuf: size 1, free list len 5, seg size 7, 53437 merges
            merged operations:
             insert 75980, delete mark 61216, delete 1244
            discarded operations:
             insert 0, delete mark 0, delete 0
            Hash table size 6375037, node heap has 13856 buffer(s)
            Hash table size 6375037, node heap has 7638 buffer(s)
            Hash table size 6375037, node heap has 5463 buffer(s)
            Hash table size 6375037, node heap has 13341 buffer(s)
            Hash table size 6375037, node heap has 36025 buffer(s)
            Hash table size 6375037, node heap has 52283 buffer(s)
            Hash table size 6375037, node heap has 55725 buffer(s)
            Hash table size 6375037, node heap has 69834 buffer(s)
            0.00 hash searches/s, 0.00 non-hash searches/s
            ---
            LOG
            ---
            Log sequence number 12370514365421
            Log flushed up to   12370514363873
            Pages flushed up to 12370509062381
            Last checkpoint at  12370509062381
            0 pending log flushes, 0 pending chkp writes
            2118121 log i/o's done, 0.00 log i/o's/second
             
            ----------------------------
            END OF INNODB MONITOR OUTPUT
            ============================
            InnoDB: ###### Diagnostic info printed to the standard error stream
            2022-02-16 19:55:32 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            220216 19:55:32 [ERROR] mysqld got signal 6 ;
             
            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x30000
            /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5627464767ce]
            /usr/sbin/mysqld(handle_fatal_signal+0x54d)[0x562745f879bd]
            /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730)[0x7f0b5a7ae730]
            /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x10b)[0x7f0b5a6127bb]
            /lib/x86_64-linux-gnu/libc.so.6(abort+0x121)[0x7f0b5a5fd535]
            /usr/sbin/mysqld(+0x4eac5b)[0x562745cbac5b]
            /usr/sbin/mysqld(+0x4e1022)[0x562745cb1022]
            

            spachev Sasha Pachev added a comment -

            stephane@skysql.com - I think your case is different. From the status we see that the difference between "Log sequence number" and "Last checkpoint at" is only a bit over 5 M. Even if the log is very small, the smallest log capacity (2 logs, 4 MB each) is 8 M, and even at that low a capacity you would still have about 3 M of log space left. But it looks like something might have gone wrong due to a slow or stuck log flush, which would produce similar symptoms.
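
            For reference, the arithmetic can be checked directly against the LOG section of the monitor output above with a plain SELECT:

            -- Checkpoint age = Log sequence number - Last checkpoint at, in MiB:
            SELECT (12370514365421 - 12370509062381) / 1024 / 1024 AS checkpoint_age_mib;
            -- A bit over 5 MiB, so even a minimal 2 x 4 MB redo log would still
            -- leave roughly 3 MB of headroom; the log was not full here.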

            axel Axel Schwenke added a comment - edited

            Added performance test results for PR#2021. For sysbench-tpcc we even see increased performance when the data set fits in the buffer pool (tpcc1.pdf). Unfortunately the effect reverses when the data set is bigger than the buffer pool (i.e. IO-bound, tpcc2.pdf).

            Not shown: for standard sysbench workloads no change in performance was observed.

            To avoid negative effects for customers who are still on 10.4, I'd prefer offering the patched MariaDB 10.4 as a custom build to interested parties.


            marko Marko Mäkelä added a comment -

            For the record, MDEV-32050 and MDEV-34515 in 10.6 streamlined this area further.

            People

              marko Marko Mäkelä
              spachev Sasha Pachev
              Votes: 0
              Watchers: 5

