MariaDB Server
MDEV-24188

Hang in buf_page_create() after reusing a previously freed page

Details

    Description

      Work flow:
      1. Start the server
      2. One session creates a table, switches Autocommit off and starts to insert records into the table.
           There is no concurrent activity in the server.
      3. After fewer than 100 (?) records have been inserted (apparently without any commit or rollback), InnoDB reports
           [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 300 seconds.
           and the server aborts.
           
      origin/bb-10.2-MDEV-24182 2b62e15e479aca04326b40e3090c97cdd2f0c1c3 2020-11-10
      Per Marko: It could be that this bug does not exist in 10.5.
       
       
      RQG
      ====
      git clone https://github.com/mleich1/rqg --branch experimental RQG
       
      perl rqg.pl \
      --grammar=MDEV-24182.yy \
      --gendata=MDEV-24182.zz \
      --mysqld=--innodb_use_native_aio=1 \
      --mysqld=--innodb_lock_schedule_algorithm=fcfs \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--connect_timeout=60 \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--net_read_timeout=30 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--lock-wait-timeout=86400 \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --no-mask \
      --queries=10000000 \
      --seed=random \
      --reporters=Backtrace \
      --reporters=ErrorLog \
      --reporters=Deadlock1 \
      --validators=None \
      --mysqld=--log_output=none \
      --mysqld=--log-bin \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --engine=InnoDB \
      --sqltrace=MarkErrors \
      --restart_timeout=120 \
      --mysqld=--plugin-load-add=file_key_management.so \
      --mysqld=--loose-file-key-management-filename=/RQG/conf/mariadb/encryption_keys.txt \
      --duration=300 \
      --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      --mysqld=--innodb_stats_persistent=off \
      --mysqld=--loose-max-statement-time=30 \
      --threads=2 \
      --mysqld=--innodb_page_size=4K \
      --mysqld=--innodb-buffer-pool-size=8M \
      --duration=300 \
      --no_mask \
      --workdir=<local settings> \
      --vardir=<local settings> \
      --mtr-build-thread=<local settings> \
      --basedir1=<local settings> \
      --script_debug=_nix_ \
      --rr=Extended \
      --rr_options=--chaos
       
      Please note that many of the values set had zero impact on the failing RQG run, because they take effect in a phase of the workflow that was never reached at all.
      Examples:
      --grammar=MDEV-24182.yy \
      --queries=10000000 \
      --threads=2 \
      --seed=random \
      --reporters=...
      But the tool rqg.pl might insist on these values being set.
      
      

      Attachments

        1. MDEV-24182.cc
          11 kB
          Matthias Leich
        2. MDEV-24182.yy
          2 kB
          Matthias Leich
        3. MDEV-24182.zz
          1 kB
          Matthias Leich
        4. screenshot-1.png
          78 kB
          Olaf Buitelaar

        Issue Links

          Activity

            olafbuitelaar Olaf Buitelaar added a comment - - edited

            I seem to suffer from the same issue;

            2020-11-16 13:23:35 0 [Note] InnoDB: A semaphore wait:
            --Thread 139696564532992 has waited at btr0cur.cc line 1480 for 122.00 seconds the semaphore:
            SX-lock on RW-latch at 0x55943e5fbbb0 created in file dict0dict.cc line 2160
            a writer (thread id 139627021195008) has reserved it in mode  SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time write locked in file dict0stats.cc line 1969
             
            If the lock is held for more than 10 minutes, the server crashes (however, this has happened only in 10.5.7 and has not yet been observed in 10.5.8);
            2020-11-13 12:51:37 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            201113 12:51:37 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed,
            something is definitely wrong and this may fail.
             
            Server version: 10.5.7-MariaDB-1:10.5.7+maria~focal-log
            key_buffer_size=12582912
            read_buffer_size=131072
            max_used_connections=41
            max_threads=1202
            thread_count=43
            It is possible that mysqld could use up to
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2658343 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x49000
            mysqld(my_print_stacktrace+0x32)[0x55cb65360742]
            mysqld(handle_fatal_signal+0x485)[0x55cb64db7e95]
            

            Any chance a release might be pushed forward to address this? Also, is there a configuration option to disable the forced shutdown after 10 minutes?


            marko Marko Mäkelä added a comment

            olafbuitelaar, thank you for the report. You can use

            SET GLOBAL innodb_fatal_semaphore_wait_threshold=300;
            

            to have the server abort in 5 minutes (300 seconds) instead of the default timeout. I do not think that it is useful to let the server continue in a livelocked state, because service to some connections will be denied, depending on which latches the hung buf_page_create() threads are holding. Eventually, all I/O threads would be blocked and nothing could be accessed. If the hung buf_page_create() is executed as part of a CREATE TABLE operation, then all other InnoDB threads will be blocked, waiting for the data dictionary latch.

            Today, we double-checked that the wait loop that was originally added in MDEV-23452 is indeed necessary. If buf_page_create() did not check that no io-fix is set, it could acquire the block->lock in exclusive mode after another thread that is executing buf_flush_page() set the write-fix and released buf_pool.mutex but did not yet acquire the block->lock in shared-exclusive mode. Note: The io_fix is protected by the buffer pool mutex, which buf_page_create() is holding.

            The hang was caused because in MDEV-23456 we did not add a condition before the wait loop: if the buf_page_create() thread already holds an exclusive latch on the block (because the page was freed earlier during the mini-transaction before being reallocated), we would in vain wait for the io-fix to be removed.
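
            The corrected precondition described in the two comments above can be sketched as follows. This is a deliberately simplified model, not the actual InnoDB code: the type and function names (io_fix_t, block_t, must_wait_for_io, x_latched_by_caller) are invented for illustration.

```cpp
#include <cassert>

// Hypothetical, simplified model of the io-fix check in buf_page_create().
enum class io_fix_t { NONE, READ, WRITE };

struct block_t {
  io_fix_t io_fix = io_fix_t::NONE;  // in InnoDB, protected by buf_pool.mutex
  bool x_latched_by_caller = false;  // does this thread already hold
                                     // block->lock in exclusive mode?
};

// MDEV-23452 added a wait loop: buf_page_create() must not proceed while an
// io-fix is set, or it could acquire block->lock ahead of a buf_flush_page()
// thread that has already set the write-fix. The MDEV-24188 fix adds the
// missing precondition: if this thread already holds the exclusive latch
// (the page was freed and reallocated within the same mini-transaction),
// the pending write can never complete, so waiting would hang forever.
bool must_wait_for_io(const block_t& block) {
  if (block.x_latched_by_caller)
    return false;  // the missing check: never wait on our own latched block
  return block.io_fix != io_fix_t::NONE;  // otherwise wait until the
                                          // io-fix is cleared
}
```

            With the extra check, a write-fixed page that the caller itself re-latched no longer enters the wait loop, which is exactly the hang scenario described above.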

            The probability of this hang can be reduced by configuring some parameters related to page flushing, but I do not think that it can be prevented completely.


            olafbuitelaar Olaf Buitelaar added a comment

            Thank you for your reply. If I can provide more information, please let me know. I'll try to tweak the parameters related to page flushing. We use CREATE TABLE regularly to create temporary tables.
            I'm not sure if it's related, but since I upgraded from 10.4.14 to 10.5.7 (and on the 12th and 13th to 10.5.8), the row locking times seem to be very irregular;


            marko Marko Mäkelä added a comment

            olafbuitelaar, please be aware that due to MDEV-24096 it is not safe to use 10.5.7 if any indexed virtual columns exist (including any that were created automatically). I think that row locking occurs at a higher level, above the buffer pool and page latch layer. This bug causes a real hang (livelock) that will remain until the server is killed. If any transaction that is holding row locks is being blocked due to this livelock, then of course other transactions could end up waiting for those row locks longer (until a lock wait timeout terminates the waits). As far as I can tell, such affected row locks would never be released (until the entire server process is killed).

            MDEV-24227 could be a duplicate of this report. Given that this hang seems to be relatively easy to reproduce, I think that we should consider issuing an unscheduled release.


            marko Marko Mäkelä added a comment

            MDEV-24227 turned out to be unrelated to this hang.

            The scenario of this hang is that a page had been freed, a page write (or thanks to MDEV-15528, eviction) had been initiated, and the same page (with the same tablespace identifier and page number) was allocated and reused for something else. Until MDEV-12227 is fixed, we are unnecessarily writing pages of temporary tables to the data file. I think that repeatedly creating, populating and dropping InnoDB temporary tables should be a good way of reproducing this hang.
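
            A reproduction attempt along the lines suggested above might look like the following. This is a hypothetical sketch, not a confirmed reproducer: the procedure name, table name, and iteration count are invented, and seq_1_to_1000 assumes the MariaDB Sequence storage engine is available.

```sql
-- Hypothetical stress loop: repeatedly create, populate and drop InnoDB
-- temporary tables so that freed pages are quickly reallocated and reused.
SET GLOBAL innodb_fatal_semaphore_wait_threshold=300;

DELIMITER //
CREATE PROCEDURE churn_temp_tables(IN iterations INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < iterations DO
    CREATE TEMPORARY TABLE t_churn
      (id INT PRIMARY KEY, pad VARCHAR(255)) ENGINE=InnoDB;
    INSERT INTO t_churn SELECT seq, REPEAT('x', 255) FROM seq_1_to_1000;
    DROP TEMPORARY TABLE t_churn;
    SET i = i + 1;
  END WHILE;
END//
DELIMITER ;

CALL churn_temp_tables(10000);
```

            A small buffer pool (as in the RQG command above, innodb-buffer-pool-size=8M with innodb_page_size=4K) should increase the chance that a freed page is written out or evicted before being reused.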


            People

              marko Marko Mäkelä
              mleich Matthias Leich
              Votes: 0
              Watchers: 15
