MariaDB Server / MDEV-24378

Crashes on Semaphore wait > 600 seconds

Details

    Description

      Since upgrading to 10.4.17 we have been getting long semaphore waits which frequently result in a crash. Sometimes they resolve before a crash, but they still cause many lock wait timeouts and other related problems.

      The most frequent latches referenced are in btr0cur.cc and dict0dict.cc.

      We have 4 production and 4 staging servers, each with a different workload, and we have seen these errors on at least 3 of the production servers and 2 of the staging servers. It tends to happen during high write load, but it has also happened when load was fairly low.

      It has occurred at least 8 times in our production cluster since the upgrade to 10.4.17, which we installed on 2020/11/21. Prior to that upgrade we had been on 10.4.14. That version had also been crashing unexplainedly, following a similar pattern, but we were not getting the semaphore wait warnings in our logs. Those crashes had started with the 10.4.14 upgrade, but with no way to reproduce them and no significant log messages, we had not yet filed a bug report. I suspect it was the same issue, but I don't know how to test that theory.

      InnoDB: ###### Diagnostic info printed to the standard error stream
      2020-12-09  9:55:45 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      201209  9:55:45 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.4.17-MariaDB-log
      key_buffer_size=31457280
      read_buffer_size=131072
      max_used_connections=152
      max_threads=4002
      thread_count=161
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 8837909 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x560adb3d04ae]
      /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x560adae63e1f]
      /lib64/libpthread.so.0(+0xf600)[0x7feabd373600]
      /lib64/libc.so.6(gsignal+0x37)[0x7feabb89c3a7]
      /lib64/libc.so.6(abort+0x148)[0x7feabb89da98]
      /usr/sbin/mysqld(+0xc1bd60)[0x560adb159d60]
      /usr/sbin/mysqld(+0xbd4232)[0x560adb112232]
      /lib64/libpthread.so.0(+0x7e75)[0x7feabd36be75]
      /lib64/libc.so.6(clone+0x6d)[0x7feabb9648fd]
      The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      

      There are warnings with InnoDB monitor output numerous times preceding the crash. The list of stuck queries grows quite long as the crash approaches, so I figure the output most likely to be helpful is the earliest one. I have censored the query strings due to sensitive data.
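
      For anyone trying to capture the same diagnostics on demand, the periodic monitor output can also be enabled explicitly; a minimal sketch using the standard InnoDB status variables (this only gathers evidence, it is not a fix):

      -- Print InnoDB monitor output to the error log periodically
      SET GLOBAL innodb_status_output = ON;
      SET GLOBAL innodb_status_output_locks = ON;
      -- Or take a one-off snapshot interactively
      SHOW ENGINE INNODB STATUS\G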

      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending reads 0, writes 0
       
      =====================================
      2020-12-09 09:43:53 0x7fe4bcbfc700 INNODB MONITOR OUTPUT
      =====================================
      Per second averages calculated from the last 60 seconds
      -----------------
      BACKGROUND THREAD
      -----------------
      srv_master_thread loops: 102524 srv_active, 0 srv_shutdown, 221160 srv_idle
      srv_master_thread log flush and writes: 323650
      ----------
      SEMAPHORES
      ----------
      OS WAIT ARRAY INFO: reservation count 3217156
      --Thread 140620618213120 has waited at buf0flu.cc line 1176 for 241.00 seconds the semaphore:
      SX-lock on RW-latch at 0x7fe6e6989208 created in file buf0buf.cc line 1568
      a writer (thread id 140620299949824) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: e0000000
      Last time write locked in file btr0cur.cc line 2417
      --Thread 140620387563264 has waited at fsp0fsp.cc line 1984 for 227.00 seconds the semaphore:
      X-lock on RW-latch at 0x7fe4a280aec0 created in file fil0fil.cc line 1266
      a writer (thread id 140620299949824) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: dfffffff
      Last time write locked in file fsp0fsp.cc line 2402
      OS WAIT ARRAY INFO: signal count 22529095
      RW-shared spins 41276936, rounds 128593064, OS waits 1031233
      RW-excl spins 27792813, rounds 117073205, OS waits 1025437
      RW-sx spins 1932526, rounds 13959835, OS waits 165716
      Spin rounds per wait: 3.12 RW-shared, 4.21 RW-excl, 7.22 RW-sx
      ------------
      TRANSACTIONS
      ------------
      Trx id counter 17462604867
      Purge done for trx's n:o < 17462603855 undo n:o < 0 state: running
      History list length 487
      LIST OF TRANSACTIONS FOR EACH SESSION:
      ---TRANSACTION 17462604866, ACTIVE 67 sec setting auto-inc lock
      mysql tables in use 1, locked 1
      LOCK WAIT 1 lock struct(s), heap size 1128, 0 row lock(s)
      MySQL thread id 18649, OS thread handle 140620601120512, query id 777866038 172.31.66.100 dbworkerfactorysecure Executing
      INSERT INTO ******
      ------- TRX HAS BEEN WAITING 67 SEC FOR THIS LOCK TO BE GRANTED:
      TABLE LOCK table `leanload_landingzone`.`database_action_log` trx id 17462604866 lock mode AUTO-INC waiting
      ------------------
      ---TRANSACTION 17462604865, ACTIVE 144 sec inserting
      mysql tables in use 1, locked 1
      2 lock struct(s), heap size 1128, 0 row lock(s)
      MySQL thread id 18646, OS thread handle 140619220154112, query id 777865925 172.31.66.223 dbworkerfactorysecure Executing
      INSERT INTO ******
      ---TRANSACTION 422120853290824, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853286568, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 17462604837, ACTIVE 234 sec inserting
      mysql tables in use 19, locked 13
      2 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 203216
      MySQL thread id 18631, OS thread handle 140619223607040, query id 777476378 172.31.66.68 dbworkerfactorysecure Creating sort index
      INSERT *****
      Trx read view will not see trx with id >= 17462604819, sees < 17462603853
      ---TRANSACTION 17462603853, ACTIVE 325 sec inserting
      mysql tables in use 5, locked 2
      2 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 3713879
      MySQL thread id 18627, OS thread handle 140620299949824, query id 777462590 172.31.66.142 dbworkerfactorysecure Sending data
      INSERT INTO ****
      Trx read view will not see trx with id >= 17462603853, sees < 17462603853
      ---TRANSACTION 422120853273800, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853269544, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853265288, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853261032, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853167400, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853256776, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853252520, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853248264, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853244008, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853239752, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853235496, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853231240, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853226984, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853222728, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853218472, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853214216, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853209960, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853205704, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853201448, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853197192, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853192936, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853188680, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853184424, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853180168, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853175912, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853171656, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853163144, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853158888, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853154632, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853150376, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853146120, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853141864, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853137608, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853133352, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853129096, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853124840, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853120584, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853116328, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853112072, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853107816, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853103560, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853099304, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853095048, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853090792, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 17462604858, ACTIVE 220 sec inserting
      mysql tables in use 1, locked 1
      3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
      MySQL thread id 10, OS thread handle 140646178735872, query id 777865730
      ---TRANSACTION 422120853082280, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      --------
      FILE I/O
      --------
      I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
      I/O thread 1 state: waiting for completed aio requests (log thread)
      I/O thread 2 state: waiting for completed aio requests (read thread)
      I/O thread 3 state: waiting for completed aio requests (read thread)
      I/O thread 4 state: waiting for completed aio requests (read thread)
      I/O thread 5 state: waiting for completed aio requests (read thread)
      I/O thread 6 state: waiting for completed aio requests (write thread)
      I/O thread 7 state: waiting for completed aio requests (write thread)
      I/O thread 8 state: waiting for completed aio requests (write thread)
      I/O thread 9 state: waiting for completed aio requests (write thread)
      Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
       ibuf aio reads:, log i/o's:, sync i/o's:
      Pending flushes (fsync) log: 0; buffer pool: 0
      24188890 OS file reads, 29232470 OS file writes, 4530146 OS fsyncs
      0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
      -------------------------------------
      INSERT BUFFER AND ADAPTIVE HASH INDEX
      -------------------------------------
      Ibuf: size 1, free list len 209942, seg size 209944, 1509741 merges
      merged operations:
       insert 3862925, delete mark 11675526, delete 1201855
      discarded operations:
       insert 0, delete mark 0, delete 0
      Hash table size 5843809, node heap has 3791 buffer(s)
      Hash table size 5843809, node heap has 5577 buffer(s)
      Hash table size 5843809, node heap has 3297 buffer(s)
      Hash table size 5843809, node heap has 5687 buffer(s)
      Hash table size 5843809, node heap has 3953 buffer(s)
      Hash table size 5843809, node heap has 7772 buffer(s)
      Hash table size 5843809, node heap has 18181 buffer(s)
      Hash table size 5843809, node heap has 10925 buffer(s)
      7.32 hash searches/s, 2.65 non-hash searches/s
      ---
      LOG
      ---
      Log sequence number 102422876480104
      Log flushed up to   102422876480104
      Pages flushed up to 102422800703925
      Last checkpoint at  102422800703925
      0 pending log flushes, 0 pending chkp writes
      1642503 log i/o's done, 0.00 log i/o's/second
      ----------------------
      BUFFER POOL AND MEMORY
      ----------------------
      Total large memory allocated 23655874560
      Dictionary memory allocated 4009296
      Buffer pool size   1412752
      Free buffers       4632
      Database pages     1348936
      Old database pages 497782
      Modified db pages  5220
      Percent of dirty pages(LRU & free pages): 0.386
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 2, single page 0
      Pages made young 12285071, not young 736294742
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 24091045, created 6416261, written 26204033
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 1348936, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ----------------------
      INDIVIDUAL BUFFER POOL INFO
      ----------------------
      ---BUFFER POOL 0
      Buffer pool size   176594
      Free buffers       575
      Database pages     168645
      Old database pages 62233
      Modified db pages  825
      Percent of dirty pages(LRU & free pages): 0.488
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1530158, not young 89652355
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3000704, created 820651, written 4459033
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168645, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 1
      Buffer pool size   176594
      Free buffers       546
      Database pages     168545
      Old database pages 62196
      Modified db pages  661
      Percent of dirty pages(LRU & free pages): 0.391
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1548223, not young 93971088
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3064144, created 797585, written 2957721
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168545, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 2
      Buffer pool size   176594
      Free buffers       621
      Database pages     168578
      Old database pages 62208
      Modified db pages  608
      Percent of dirty pages(LRU & free pages): 0.359
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1553285, not young 91180805
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3023249, created 801831, written 2881988
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168578, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 3
      Buffer pool size   176594
      Free buffers       623
      Database pages     168501
      Old database pages 62180
      Modified db pages  621
      Percent of dirty pages(LRU & free pages): 0.367
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1544401, not young 92782155
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3055426, created 798407, written 3548916
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168501, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 4
      Buffer pool size   176594
      Free buffers       570
      Database pages     168653
      Old database pages 62236
      Modified db pages  605
      Percent of dirty pages(LRU & free pages): 0.358
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1530555, not young 93768585
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3023945, created 798087, written 3336562
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168653, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 5
      Buffer pool size   176594
      Free buffers       530
      Database pages     168808
      Old database pages 62293
      Modified db pages  641
      Percent of dirty pages(LRU & free pages): 0.379
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 2, single page 0
      Pages made young 1547158, not young 91440491
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 2993753, created 800748, written 2959769
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168808, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 6
      Buffer pool size   176594
      Free buffers       537
      Database pages     168617
      Old database pages 62223
      Modified db pages  636
      Percent of dirty pages(LRU & free pages): 0.376
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1514961, not young 89712669
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 2978655, created 800363, written 2979454
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168617, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 7
      Buffer pool size   176594
      Free buffers       630
      Database pages     168589
      Old database pages 62213
      Modified db pages  623
      Percent of dirty pages(LRU & free pages): 0.368
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1516330, not young 93786594
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 2951169, created 798589, written 3080590
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168589, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      --------------
      ROW OPERATIONS
      --------------
      0 queries inside InnoDB, 0 queries in queue
      2 read views open inside InnoDB
      Process ID=2563, Main thread ID=140620421134080, state: sleeping
      Number of rows inserted 291442737, updated 43530027, deleted 16560971, read 18102275991
      0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 21.62 reads/s
      Number of system rows inserted 839287, updated 0, deleted 839231, read 839237
      0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
      ----------------------------
      END OF INNODB MONITOR OUTPUT
      ============================
      

      Potentially related issues?
      https://jira.mariadb.org/browse/MDEV-24142
      https://jira.mariadb.org/browse/MDEV-24225

      Attachments

        1. gdb-thread-apply-all-bt.txt.gz
          161 kB
        2. mariadb_10417.37345.3.gdb.bt
          38 kB
        3. mariadb_10417.37345.3.gdb.bt-full
          127 kB
        4. maria-sem.txt
          10 kB
        5. MDEV-24378.perf.tgz
          417 kB
        6. normal-load.perf.script.svg
          294 kB
        7. oc_1_innodb.txt
          11 kB
        8. oc_1_threads.txt
          49 kB
        9. read-spike.perf.script.svg
          99 kB
        10. SEIS.dict0stats.out
          6 kB


          Activity

            marko Marko Mäkelä added a comment -

            jacob.williams, can you please try to produce a stack trace of all threads during the hang? This report could be a duplicate of MDEV-24188, but I cannot confirm that based on the available information yet.

            Side note: in MariaDB 10.6, I plan to remove the InnoDB semaphore output as part of MDEV-21452.

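            To buy time to attach gdb and capture "thread apply all bt" before the watchdog aborts the server, one option is to raise the fatal semaphore wait threshold; a sketch, assuming your build exposes the MariaDB variable innodb_fatal_semaphore_wait_threshold (default 600):

            -- Check the current watchdog threshold
            SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold';
            -- In the versions I have checked this variable is not dynamic, so raising it
            -- means setting e.g. innodb_fatal_semaphore_wait_threshold = 7200 under
            -- [mysqld] in the server configuration and restarting.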

            octavian.cerna Octavian Cerna added a comment -

            I can confirm, I'm seeing long semaphore waits on multiple machines running both 10.3.27 and 10.4.17,
            to the point where there are semaphore-related crashes almost daily.
            This didn't happen before on 10.3.24, which was stable in our environment (Debian buster on bare metal).

            I have attached a monitor output: oc_1_innodb.txt and gdb backtrace: oc_1_threads.txt of a server
            running 10.4.17-MariaDB-1:10.4.17+maria~buster-log. The server is a slave replicating an ETL job;
            long-running queries examining tens of millions of rows are normal for this machine.
            In this particular case the query didn't even finish before the server was taken down by the lock monitor.

            There was a single query running, waiting for a RW-lock. Looking at the backtrace of the thread holding that RW-lock
            I noticed it was doing a lot of I/O in dict_stats_update_persistent, so I straced that thread and saw that it made
            a lot of pread64() calls, confirming that the thread was making progress and wasn't stalled by hung I/O or other OS-related issue.

            This led me to believe that the background stats update is starving the query threads by holding a write lock to
            a highly-contended mutex (a system dictionary lock?).
            Disabling innodb_stats_persistent has fixed the issue for me, and the server has managed to finish the workload and not crash.
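
            For reference, the workaround described above can be applied at runtime with standard statements; a sketch (put the setting in the server configuration as well if it should survive a restart):

            -- Check whether persistent statistics are currently enabled
            SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent';
            -- Workaround, not a fix: stop background persistent statistics recalculation
            SET GLOBAL innodb_stats_persistent = OFF;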

            I suspect that this high contention on that internal mutex might be related to MDEV-23991
            but I'm not that knowledgeable on InnoDB internals.

            I'm seeing similar behaviour on a server running 10.3.27 - long semaphore waits in btr0cur.cc resulting in crashes;
            that server was recently upgraded from 10.3.24, which was very stable for us and didn't have this issue.

            harri.korpi.c harri korpi added a comment - - edited

            Hello,
            I think we are having this same problem in a production environment where the server crashes daily on CentOS 7. This started occurring after upgrading to 10.4.17 from 10.4.14 - our app reads/inserts/deletes quite a large number of rows twice a day by copying the data from one schema to another (the crash seems to happen on roughly every 2nd run).

            Maria log shows the crash as:

            ..
            Pages read ahead 6.40/s, evicted without access 1.65/s, Random read ahead 0.00/s
            LRU len: 61456, unzip_LRU len: 0
            I/O sum[175215]:cur[3038], unzip sum[0]:cur[0]
            --------------
            ROW OPERATIONS
            --------------
            0 queries inside InnoDB, 0 queries in queue
            0 read views open inside InnoDB
            Process ID=25897, Main thread ID=140580188915456, state: sleeping
            Number of rows inserted 17557314, updated 65985, deleted 8711548, read 159080556
            0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
            Number of system rows inserted 0, updated 0, deleted 0, read 0
            0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
            ----------------------------
            END OF INNODB MONITOR OUTPUT
            ============================
            InnoDB: ###### Diagnostic info printed to the standard error stream
            2020-12-22  6:16:56 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            201222  6:16:56 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed,
            something is definitely wrong and this may fail.
             
            Server version: 10.4.17-MariaDB
            key_buffer_size=268435456
            read_buffer_size=1048576
            max_used_connections=5
            max_threads=153
            thread_count=7
            It is possible that mysqld could use up to
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 579242 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x49000
            /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x556b8b498ffe]
            /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x556b8af2a9ef]
            /lib64/libpthread.so.0(+0xf630)[0x7fddb2a68630]
            /lib64/libc.so.6(gsignal+0x37)[0x7fddb0d39387]
            /lib64/libc.so.6(abort+0x148)[0x7fddb0d3aa78]
            /usr/sbin/mysqld(+0xbc1d50)[0x556b8b221d50]
            /usr/sbin/mysqld(+0xb7a362)[0x556b8b1da362]
            /lib64/libpthread.so.0(+0x7ea5)[0x7fddb2a60ea5]
            /lib64/libc.so.6(clone+0x6d)[0x7fddb0e018dd]
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
            information that should help you find out what is causing the crash.
            Writing a core file...
            Working directory at /var/lib/mysql
            Resource Limits:
            Fatal signal 11 while backtracing
            ..
            

            Would you be able to confirm whether this is the same or a different issue?

            Thank you.

            mikeininfo Mike Ingle added a comment -

            We had this problem on 10.4.17, Ubuntu 18.04 x86_64. On one server I could not even load a large 250GB import file from mysqldump, as it would throw the >600 error and restart at about the 25% mark every time. I downgraded that machine to 10.4.15 and was able to load it.

            Our production server had been upgraded to 10.4.17 to avoid another issue with a complex stored procedure causing a restart. It was getting long semaphore timeouts causing rollbacks every day and eventually had a >600 restart during a large batch insert. We got a lot of log entries like this:
            OS WAIT ARRAY INFO: reservation count 8089822
            --Thread 140441005106944 has waited at btr0cur.cc line 1492 for 543.00 seconds the semaphore:
            SX-lock on RW-latch at 0x7fbae40162e8 created in file dict0dict.cc line 1954
            a writer (thread id 140442200524544) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time write locked in file dict0stats.cc line 1968

            The lock was always from dict0stats.cc, which is table/index statistics. This file has been modified a lot between 10.4.15 and 10.4.17. I turned off table/index statistics:

            set global innodb_stats_persistent = off;
            set global innodb_stats_auto_recalc = off;
            set global innodb_stats_traditional = off;

            The errors, which had been appearing nightly, did not happen last night. I think what is happening is that when some percentage of the table has been changed, MariaDB decides to update the statistics. Because the insert or update is still going on, the statistics update gets into a deadlock with the insert/update and crashes the server. It looks like the dict0stats.cc file has a lot of exception bailouts in it. Perhaps one of them is not releasing a lock? Until MariaDB gets this fixed, turning off InnoDB stats may be a workaround to keep your server alive.
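
            If turning statistics off globally is too broad, the same thing can be done per table with the standard InnoDB table options; a sketch with a placeholder table name:

            -- Disable persistent stats only for the heavily written table (name is a placeholder)
            ALTER TABLE big_batch_table STATS_PERSISTENT=0, STATS_AUTO_RECALC=0;
            -- Revert to the server-wide default later
            ALTER TABLE big_batch_table STATS_PERSISTENT=DEFAULT, STATS_AUTO_RECALC=DEFAULT;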

            shalogo Leonard Sinquer added a comment - - edited

            We are having exactly the same issues on our multi-master set-up, restarts happening daily on random masters/slaves (all running 10.4.17). We previously used 10.4.2 and this never happened. I will try the suggestion by @mikeininfo and will let you know the results. Thanks!

            InnoDB: ###### Diagnostic info printed to the standard error stream 
            2020-12-09 16:01:22 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            

            mysql-error.log snippet:

            ===================================== 
            2020-12-09 16:01:10 0x7f924a7fc700 INNODB MONITOR OUTPUT 
            ===================================== 
            Per second averages calculated from the last 20 seconds 
            ----------------- 
            BACKGROUND THREAD 
            ----------------- 
            srv_master_thread loops: 162828 srv_active, 0 srv_shutdown, 5635 srv_idle 
            srv_master_thread log flush and writes: 168462 
            ---------- 
            SEMAPHORES 
            ---------- 
            OS WAIT ARRAY INFO: reservation count 891960 
            --Thread 140378791868160 has waited at btr0cur.cc line 1492 for 941.00 seconds the semaphore: 
            SX-lock on RW-latch at 0x7f923801f0a8 created in file dict0dict.cc line 1954 
            a writer (thread id 140266283448064) has reserved it in mode  SX 
            number of readers 0, waiters flag 1, lock_word: 10000000 
            Last time write locked in file dict0stats.cc line 1968 
            OS WAIT ARRAY INFO: signal count 44950247 
            RW-shared spins 216698389, rounds 768920256, OS waits 382090 
            RW-excl spins 108692241, rounds 463496787, OS waits 57256 
            RW-sx spins 50881, rounds 607691, OS waits 5555 
            Spin rounds per wait: 3.55 RW-shared, 4.26 RW-excl, 11.94 RW-sx 
            ------------ 
            TRANSACTIONS 
            ------------ 
            Trx id counter 66812968602 
            Purge done for trx's n:o < 66812968601 undo n:o < 0 state: running but idle 
            History list length 1 
            LIST OF TRANSACTIONS FOR EACH SESSION: 
            ---TRANSACTION 421853774909944, not started 
            0 lock struct(s), heap size 1128, 0 row lock(s) 
            ---TRANSACTION 66812968601, ACTIVE 941 sec updating or deleting 
            mysql tables in use 5, locked 5 
            3 lock struct(s), heap size 1128, 2 row lock(s) 
            MySQL thread id 10, OS thread handle 140378791868160, query id 579492716 Updating 
            UPDATE loginlogs SET logouttime = 1607525129, rxbytes = rxbytes + 10004012, txbytes = txbytes + 675244 WHERE logid = 2538556223701127079 
            ---TRANSACTION 421853774901432, not started 
            0 lock struct(s), heap size 1128, 0 row lock(s)
            

             
            Thread pointer: 0x0 
            Attempting backtrace. You can use the following information to find out 
            where mysqld died. If you see no messages after this, something went 
            terribly wrong... 
            stack_bottom = 0x0 thread_stack 0x49000 
            /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55784d20bffe] 
            /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55784cc9d9ef] 
            /lib64/libpthread.so.0(+0xf630)[0x7facbbbb4630] 
            /lib64/libc.so.6(gsignal+0x37)[0x7facb9e84387] 
            /lib64/libc.so.6(abort+0x148)[0x7facb9e85a78] 
            /usr/sbin/mysqld(+0xbc1d50)[0x55784cf94d50] 
            /usr/sbin/mysqld(+0xb7a362)[0x55784cf4d362] 
            /lib64/libpthread.so.0(+0x7ea5)[0x7facbbbacea5] 
            /lib64/libc.so.6(clone+0x6d)[0x7facb9f4c96d] 
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains 
            information that should help you find out what is causing the crash. 
            Writing a core file... 
            Working directory at /var/lib/mysql 
            Resource Limits: 
            Fatal signal 11 while backtracing
            

            Checking stack trace:

            Program terminated with signal 11, Segmentation fault.
            #0  my_read (Filedes=291, 
                Buffer=0x7ef87aefd3e0 "Limit", ' ' <repeats 21 times>, "Soft Limit", ' ' <repeats 11 times>, "Hard Limit", ' ' <repeats 11 times>, "Units     \nMax cpu time", ' ' <repeats 14 times>, "unlimited", ' ' <repeats 12 times>, "unlimited", ' ' <repeats 12 times>, "seconds   \nMax file size", ' ' <repeats 13 times>, "unlimited       "..., Count=4096, MyFlags=<optimized out>) at /usr/src/debug/MariaDB-10.4.17/src_0/mysys/my_read.c:63
            63            int got_errno= my_errno= errno;
            Missing separate debuginfos, use: debuginfo-install systemd-libs-219-78.el7_9.2.x86_64
            (gdb) bt
            #0  my_read (Filedes=291, 
                Buffer=0x7ef87aefd3e0 "Limit", ' ' <repeats 21 times>, "Soft Limit", ' ' <repeats 11 times>, "Hard Limit", ' ' <repeats 11 times>, "Units     \nMax cpu time", ' ' <repeats 14 times>, "unlimited", ' ' <repeats 12 times>, "unlimited", ' ' <repeats 12 times>, "seconds   \nMax file size", ' ' <repeats 13 times>, "unlimited       "..., Count=4096, MyFlags=<optimized out>) at /usr/src/debug/MariaDB-10.4.17/src_0/mysys/my_read.c:63
            #1  0x000055d89333065d in output_core_info () at /usr/src/debug/MariaDB-10.4.17/src_0/sql/signal_handler.cc:66
            #2  0x000055d893330a50 in handle_fatal_signal (sig=6) at /usr/src/debug/MariaDB-10.4.17/src_0/sql/signal_handler.cc:339
            #3  <signal handler called>
            #4  0x00007f0d9a06a387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
            #5  0x00007f0d9a06ba78 in __GI_abort () at abort.c:90
            #6  0x000055d893627d50 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>)
                at /usr/src/debug/MariaDB-10.4.17/src_0/storage/innobase/ut/ut0ut.cc:601
            #7  0x000055d8935e0362 in srv_error_monitor_thread () at /usr/src/debug/MariaDB-10.4.17/src_0/storage/innobase/srv/srv0srv.cc:1836
            #8  0x00007f0d9bd92ea5 in start_thread (arg=0x7ef87aeff700) at pthread_create.c:307
            #9  0x00007f0d9a13296d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
            

            mikeininfo Mike Ingle added a comment -

            @shalogo Did that happen before or after you made the change? I have not seen it again since the change. Trying to determine if this is the only place it happens.


            shalogo Leonard Sinquer added a comment -

            @Mike not sure what change you mean? The semaphore waits started happening immediately after we upgraded to 10.4.17 - at least once every 24 hours since the upgrade; sometimes after 4 hours, sometimes it takes a full day. In the end, I was forced to roll back to 10.4.15 (which has its own issues, like innodb_io_capacity being interpreted in a weird way) but at least the semaphore waits are not present. I only have a production set-up, so I cannot afford to experiment much, sorry.

            digdilem Simon Avery added a comment - - edited

            Another sufferer of this reporting in.

            We have around a dozen MariaDB servers, all running on CentOS 7, which started exhibiting this behavior only after we upgraded from 10.2 a few months back.

            The current version is 10.3.27-MariaDB, and we started at 10.3.14 when we moved up.

            It happens under heavy load - sometimes under high concurrency, sometimes with a single active connection. Resources are fine: lots of free RAM, adequate CPU load, fast disk and network; nothing near capacity. Some slow queries are generated (default timeouts) but nothing terrible.

            There's nothing unusual in the work being asked of the databases, similar queries, data and workload worked fine before this upgrade.

            We have 10.3.x running on CentOS 8 servers too. I can't say that it doesn't crash on those, only that the ones I'm having the most problems with are on CentOS 7.

            Have been chasing this for a while, so relieved to see we are not the only one.

            Jan 15 00:21:10 serverA maria 2021-01-15 0:21:05 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

            Partial other logs: maria-sem.txt


            jacob.williams Jacob Williams added a comment -

            I have not been able to dedicate the time to get the stack trace, but I still plan to. We downgraded our servers to 10.4.13, which was the last version that was stable for us. It has been stable for several weeks now on the older version. Versions 10.4.14-17 each had slightly different symptoms, but we saw 1-2 crashes per week across 8 servers, with significant locking issues preceding the crashes. Prior to 10.4.17 the crashes did not leave much in the logs, either nothing or a generic crash output with no stack trace. Starting with .17 the crashes were always preceded by the semaphore warnings, although sometimes the semaphore warnings resolved without a crash.

            When I get a chance, I will upgrade our staging environment again and set it up to get a proper stack trace. I still don't know how to trigger the problem at will in our environment. It happens often enough to cause problems, but infrequently enough to make it hard to catch. And it happens much less frequently in staging, since we don't maintain nearly as much load as production.

            nostromo Santiago Gala added a comment -

            We have experienced the same crash twice. We have a server dealing with an archive: a big database where records are inserted continuously. While this insertion process is going on, periodic processes copy those records to new ones on a backup volume and delete the original ones, and users query the same table.

            We suffered the problem twice in 2 weeks, soon after starting the first "export" batch process following the upgrade from 10.4.13.

            The table that holds the files tracks about 400 million rows; up to 10 rows per second are added during normal operation, and about 20 per second while files are being sent to long-term storage.

            Previous results with 10.4.13 were quite satisfactory, except that we experienced MDEV-24117 a few times and were working around it (simplifying some user queries).

            rantal Anton Rozhkov added a comment -

            We have experienced the same issue after upgrading to 10.5.8 (in the 10.5.x line) and 10.4.17 (in the 10.4.x line). The issue appears periodically under normal workload, and it is difficult to see any difference in workload patterns between the times when the problem is absent and the times when we hit a lock. We see the same issue on different servers with different workloads; on some it may appear several times a day, on others once a week. The difference from the cases described here is that we do not crash on "Semaphore wait > 600 seconds"; in our case all database insert/update operations are blocked for some time (from several seconds to 2-3 minutes), after which the locks are released and write operations resume.

            10.5.8 :
            --Thread 140650289166080 has waited at btr0cur.cc line 1480 for 2.00 seconds the semaphore:
            SX-lock on RW-latch at 0x7feb64253140 created in file dict0dict.cc line 2161
            a writer (thread id 140649492657920) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time write locked in file dict0stats.cc line 1969

            10.4.17:
            --Thread 139856299718400 has waited at btr0cur.cc line 1492 for 9.00 seconds the semaphore:
            SX-lock on RW-latch at 0x7f329c3729d8 created in file dict0dict.cc line 1954
            a writer (thread id 139857212626688) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time write locked in file dict0stats.cc line 1968

            We have tried compiling binaries with the cherry-picked patches for MDEV-24188 by Marko Mäkelä applied to the sources of 10.5.8 and 10.4.17, but that did not help in our case. About two weeks ago we tried the workaround proposed by Octavian Cerna and disabled innodb_stats_persistent. That did the trick: the locks have gone away in our case.
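
            For reference, a minimal sketch of how that kind of workaround can be applied at runtime (this assumes a local server and a client account with sufficient privileges; the setting must also be added to the server configuration file to survive a restart):

              # Hedged sketch: turn off InnoDB persistent statistics at runtime.
              mysql -e "SET GLOBAL innodb_stats_persistent = OFF;"
              # A narrower alternative is to keep persistent statistics but stop the
              # automatic background recalculation:
              mysql -e "SET GLOBAL innodb_stats_auto_recalc = OFF;"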

            Zimeon Zimeon added a comment -

            We seem to have the same issue after upgrading to 10.2.36-1.el7.centos from MariaDB-server-10.2.31-1.el7.centos.

            ==================================================

            InnoDB: ###### Diagnostic info printed to the standard error stream
            2021-01-31 15:46:36 140389001185024 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            210131 15:46:36 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.

            To report this bug, see https://mariadb.com/kb/en/reporting-bugs

            ==================================================

            InnoDB: ###### Diagnostic info printed to the standard error stream
            2021-01-31 15:38:20 140389001185024 [Warning] InnoDB: A long semaphore wait:
            --Thread 140388810667776 has waited at btr0cur.cc line 1143 for 458.00 seconds the semaphore:
            SX-lock on RW-latch at 0x5639d4c538a8 created in file dict0dict.cc line 2185
            a writer (thread id 140388976006912) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time read locked in file btr0cur.cc line 1175
            Last time write locked in file dict0stats.cc line 1972
            2021-01-31 15:38:20 140389001185024 [Warning] InnoDB: A long semaphore wait:
            --Thread 140388812203776 has waited at btr0cur.cc line 1143 for 458.00 seconds the semaphore:
            SX-lock on RW-latch at 0x5639d4c538a8 created in file dict0dict.cc line 2185
            a writer (thread id 140388976006912) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time read locked in file btr0cur.cc line 1175
            Last time write locked in file dict0stats.cc line 1972
            2021-01-31 15:38:20 140389001185024 [Warning] InnoDB: A long semaphore wait:

            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed,
            something is definitely wrong and this may fail.

            Server version: 10.2.36-MariaDB
            key_buffer_size=33554432
            read_buffer_size=131072
            max_used_connections=573
            max_threads=1202
            thread_count=579
            It is possible that mysqld could use up to
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2673944 K bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.

            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x49000
            /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x56398f3ed84e]
            /usr/sbin/mysqld(handle_fatal_signal+0x30d)[0x56398ee858ad]
            /lib64/libpthread.so.0(+0xf630)[0x7fbc35de5630]
            /lib64/libc.so.6(gsignal+0x37)[0x7fbc340b5387]
            /lib64/libc.so.6(abort+0x148)[0x7fbc340b6a78]
            /usr/sbin/mysqld(+0x999ac3)[0x56398f17bac3]
            /usr/sbin/mysqld(+0x949602)[0x56398f12b602]
            /lib64/libpthread.so.0(+0x7ea5)[0x7fbc35dddea5]
            /lib64/libc.so.6(clone+0x6d)[0x7fbc3417d96d]
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
            information that should help you find out what is causing the crash.
            ==================================================

            I'll have to roll back to the previously installed version.

            marko Marko Mäkelä added a comment - - edited

            Releases with the fix of MDEV-24188 should be out in a few days.

            But I see that rantal mentioned that the fix did not help. Would it be possible to produce a stack trace of all threads during the hang, as I asked on December 9?
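
            For anyone able to catch the hang live, a minimal sketch of capturing such a trace (assuming a single mysqld instance, with gdb and the matching debuginfo packages installed; the output file name is arbitrary):

              # Attach to the hung server, dump a backtrace of every thread, then
              # detach so the process keeps running.
              gdb --batch -p "$(pidof mysqld)" \
                  -ex "thread apply all backtrace" \
                  -ex "detach" > mysqld-all-threads.bt 2>&1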

            Zimeon Zimeon added a comment -

            Unfortunately I won't be able to test; I had to downgrade in the hope that it won't crash. It has been stable for about 16 hours now; I hope it stays that way. 10.2.34 was the oldest version available in the repository for a quick replacement.

            ---> Package MariaDB-common.x86_64 0:10.2.34-1.el7.centos will be a downgrade
            ---> Package MariaDB-common.x86_64 0:10.2.36-1.el7.centos will be erased
            ---> Package MariaDB-compat.x86_64 0:10.2.34-1.el7.centos will be a downgrade
            ---> Package MariaDB-compat.x86_64 0:10.2.36-1.el7.centos will be erased
            ---> Package MariaDB-server.x86_64 0:10.2.34-1.el7.centos will be a downgrade
            ---> Package MariaDB-server.x86_64 0:10.2.36-1.el7.centos will be erased
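
            (For reference, a downgrade that produces a transaction like the one above can be driven roughly as follows, assuming the target version is still present in the configured repository:)

              # Hedged sketch: roll the MariaDB packages back to 10.2.34 on CentOS 7.
              yum downgrade MariaDB-server-10.2.34-1.el7.centos \
                            MariaDB-common-10.2.34-1.el7.centos \
                            MariaDB-compat-10.2.34-1.el7.centos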


            sergey.kuzmichev Sergey Kuzmichev added a comment -

            I've seen a performance-impact flavor of this issue (as described by Anton Rozhkov here) on a couple of different occasions. Both were on 10.4.17, and both had persistent stats and recalc enabled. One of the environments experienced semaphore waits (~250 seconds long) under normal write load (which is, however, heavily insert-based). The other environment was a new replica build process importing a relatively large database. The latter environment showed a peculiar alternating load: roughly every 30 minutes it would go from all write load (as expected for the dump restoration) to getting stuck on semaphore waits while reading in pages, not progressing writes at all, and then back again. Disabling stats recalc makes the load predictably write-heavy 100% of the time during the restore. The buffer pool is 32G and the largest table is ~50G.

            The semaphore waits we faced are a perfect copy of what this and other comments show: https://jira.mariadb.org/browse/MDEV-24378?focusedCommentId=177911&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-177911

            What I have in terms of the data gathered are some perf outputs and flamegraphs generated on the "dump-restore" environment during the write and read loads.
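
            A minimal sketch of how that kind of data can be collected (assuming perf and a checkout of Brendan Gregg's FlameGraph scripts; the sampling rate, duration, and file names are arbitrary):

              # Sample on-CPU stacks of the server for 60 seconds, then fold them
              # into a flamegraph SVG.
              perf record -F 99 -g -p "$(pidof mysqld)" -- sleep 60
              perf script > mysqld.perf.script
              ./stackcollapse-perf.pl mysqld.perf.script | ./flamegraph.pl > mysqld.svg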

            I've been checking this locally, but either the size of the table is still too small, or the I/O performance is more favorable. I get some semaphore waits, but none longer than ~30 seconds. The write thread is in the same state as seen in the already attached oc_1_threads.txt, while the stats calculation thread is usually in dict_stats_analyze_index for me.

            I will attach SEIS output (SEIS.dict0stats.out) and backtraces (mariadb_10417.37345.3.gdb.bt and mariadb_10417.37345.3.gdb.bt-full) from my limited testing, and also flamegraphs and raw perf data from the real affected environment: read-spike.perf.script.svg, normal-load.perf.script.svg, MDEV-24378.perf.tgz. The read-spike flamegraphs were generated while the dump restore was stuck on a semaphore wait, and the normal-load ones while it was progressing as expected. Perhaps the >600-second wait crashes are just an extreme example of what I've seen, but I'm not sure at this point.


            marko Marko Mäkelä added a comment -

            sergey.kuzmichev, a quick look at mariadb_10417.37345.3.gdb.bt suggests that the only non-idle threads are Thread 2 and Thread 16. It is likely that Thread 2 is waiting for a page latch that is held by Thread 16 (the persistent statistics recalculation thread). It is a known fact that the recalculation of persistent statistics can disrupt the system. Also, as noted in MDEV-15020, in some cases it may hang the entire server, due to the way the interface to the statistics tables has been implemented. (That is why I think that it is somewhat risky to enable the automatic recalculation of InnoDB statistics.)

            However, I see no evidence of a deadlock, livelock, or busy loop in this case. It definitely does not look like MDEV-24188, because there is no buf_page_create() being executed.

            marko Marko Mäkelä added a comment -

            sergey.kuzmichev, I just realized that mariadb_10417.37345.3.gdb.bt could match MDEV-24275.

            marko Marko Mäkelä added a comment -

            Did anyone have a chance to test 10.4.18, which was released this week? That should fix both MDEV-24275 and MDEV-24188.


            shalogo Leonard Sinquer added a comment -

            @Marko - I will install this next week if I do not see any issues in this thread. I have only production servers, so it's always high risk, low reward for me. We are currently on 10.4.15 and it has been stable.
            mhubert Mario Hubert added a comment -

            I had been using 10.5.8; the problem has been resolved after upgrading to 10.5.9.


            marko Marko Mäkelä added a comment -

            Thank you, mhubert.

            jacob.williams, shalogo, sergey.kuzmichev, did any of you try the newer releases? Can we close this ticket?

            jacob.williams Jacob Williams added a comment - - edited

            Apologies for not getting back on this. We had downgraded to 10.4.13 to avoid crashes and held off on upgrading again for a while. We finally updated to 10.4.21 last weekend and started crashing again yesterday. I am working on a stack trace now from the core dump we took. The error log message was not very detailed, so I can't say it is the exact same issue, but the symptoms were otherwise identical and it occurred twice within 24 hours, starting about 5 days after we upgraded from 10.4.13 to 10.4.21.

            I don't see the semaphore messages in the logs, only a generic crash report. But the symptom is that several queries hang for long periods, clients cannot start new sessions, and some existing sessions may continue to run queries for a time. The server process does not respond to a normal shutdown request from systemd; kill -9 is required to stop it, and then it starts back up fine. This behavior is the same as what we usually saw when we first updated to 10.4.17. We are going to downgrade back to 10.4.13 to avoid more crashes. I will attach the stack trace as soon as I have ensured it doesn't contain any sensitive data that I can't share.
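
            A sketch of pulling an all-threads backtrace out of such a core dump (the core path and output file name are placeholders; the mysqld binary and debuginfo from the crashing version are assumed to be installed):

              # Load the core against the matching server binary and dump every
              # thread's stack.
              gdb --batch /usr/sbin/mysqld /path/to/core \
                  -ex "thread apply all backtrace full" > all-threads-bt.txt 2>&1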

            jacob.williams Jacob Williams added a comment - gdb-thread-apply-all-bt.txt.gz

            marko Marko Mäkelä added a comment -

            I do not see any InnoDB hang in gdb-thread-apply-all-bt.txt.gz. In fact, I am not convinced that there is a hang at all. Based on only one sample, it is very difficult to tell whether anything is hung.

            I see that some InnoDB threads are waiting for something inside malloc (apparently in the jemalloc library).

            If you find evidence of a hang, please file a new ticket, and identify the supposedly hung threads.


            People

              marko Marko Mäkelä
              jacob.williams Jacob Williams
              Votes: 15
              Watchers: 29

