MariaDB Server / MDEV-24378

Crashes on Semaphore wait > 600 seconds

Details

    Description

      Since upgrading to 10.4.17 we have been getting long semaphore waits which frequently result in a crash. Sometimes they resolve before a crash, but they still cause many lock wait timeouts and other related problems.

      The most frequent latches referenced are in btr0cur.cc and dict0dict.cc.

      We have 4 production and 4 staging servers, each with a different workload, and we have seen these errors on at least 3 of the production servers and 2 of the staging servers. It tends to happen during high write load, but it has also happened when load was fairly low.

      It has occurred at least 8 times in our production cluster since the upgrade to 10.4.17, which we installed on 2020/11/21. Prior to that upgrade we had been on 10.4.14. That version had also been crashing unexplainedly, following a similar pattern, but we were not getting the semaphore wait warnings in our logs. Those crashes had started with the 10.4.14 upgrade, but with no way to reproduce them and no significant log messages, we had not yet filed a bug report. I suspect it was the same issue, but I don't know how to test that theory.

      InnoDB: ###### Diagnostic info printed to the standard error stream
      2020-12-09  9:55:45 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      201209  9:55:45 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.4.17-MariaDB-log
      key_buffer_size=31457280
      read_buffer_size=131072
      max_used_connections=152
      max_threads=4002
      thread_count=161
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 8837909 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x560adb3d04ae]
      /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x560adae63e1f]
      /lib64/libpthread.so.0(+0xf600)[0x7feabd373600]
      /lib64/libc.so.6(gsignal+0x37)[0x7feabb89c3a7]
      /lib64/libc.so.6(abort+0x148)[0x7feabb89da98]
      /usr/sbin/mysqld(+0xc1bd60)[0x560adb159d60]
      /usr/sbin/mysqld(+0xbd4232)[0x560adb112232]
      /lib64/libpthread.so.0(+0x7e75)[0x7feabd36be75]
      /lib64/libc.so.6(clone+0x6d)[0x7feabb9648fd]
      The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      

      There are warnings with InnoDB monitor output numerous times preceding the crash. The list of stuck queries grows quite long as the crash approaches, so I figure the output most likely to be helpful is the earliest one. I have censored the query strings due to sensitive data.
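
      For anyone trying to capture the same diagnostics on demand, the periodic monitor output can also be enabled explicitly; a minimal sketch using the standard InnoDB status variables (this only gathers evidence, it is not a fix):

      -- Print InnoDB monitor output to the error log periodically
      SET GLOBAL innodb_status_output = ON;
      SET GLOBAL innodb_status_output_locks = ON;
      -- Or take a one-off snapshot interactively
      SHOW ENGINE INNODB STATUS\G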

      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending reads 0, writes 0
       
      =====================================
      2020-12-09 09:43:53 0x7fe4bcbfc700 INNODB MONITOR OUTPUT
      =====================================
      Per second averages calculated from the last 60 seconds
      -----------------
      BACKGROUND THREAD
      -----------------
      srv_master_thread loops: 102524 srv_active, 0 srv_shutdown, 221160 srv_idle
      srv_master_thread log flush and writes: 323650
      ----------
      SEMAPHORES
      ----------
      OS WAIT ARRAY INFO: reservation count 3217156
      --Thread 140620618213120 has waited at buf0flu.cc line 1176 for 241.00 seconds the semaphore:
      SX-lock on RW-latch at 0x7fe6e6989208 created in file buf0buf.cc line 1568
      a writer (thread id 140620299949824) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: e0000000
      Last time write locked in file btr0cur.cc line 2417
      --Thread 140620387563264 has waited at fsp0fsp.cc line 1984 for 227.00 seconds the semaphore:
      X-lock on RW-latch at 0x7fe4a280aec0 created in file fil0fil.cc line 1266
      a writer (thread id 140620299949824) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: dfffffff
      Last time write locked in file fsp0fsp.cc line 2402
      OS WAIT ARRAY INFO: signal count 22529095
      RW-shared spins 41276936, rounds 128593064, OS waits 1031233
      RW-excl spins 27792813, rounds 117073205, OS waits 1025437
      RW-sx spins 1932526, rounds 13959835, OS waits 165716
      Spin rounds per wait: 3.12 RW-shared, 4.21 RW-excl, 7.22 RW-sx
      ------------
      TRANSACTIONS
      ------------
      Trx id counter 17462604867
      Purge done for trx's n:o < 17462603855 undo n:o < 0 state: running
      History list length 487
      LIST OF TRANSACTIONS FOR EACH SESSION:
      ---TRANSACTION 17462604866, ACTIVE 67 sec setting auto-inc lock
      mysql tables in use 1, locked 1
      LOCK WAIT 1 lock struct(s), heap size 1128, 0 row lock(s)
      MySQL thread id 18649, OS thread handle 140620601120512, query id 777866038 172.31.66.100 dbworkerfactorysecure Executing
      INSERT INTO ******
      ------- TRX HAS BEEN WAITING 67 SEC FOR THIS LOCK TO BE GRANTED:
      TABLE LOCK table `leanload_landingzone`.`database_action_log` trx id 17462604866 lock mode AUTO-INC waiting
      ------------------
      ---TRANSACTION 17462604865, ACTIVE 144 sec inserting
      mysql tables in use 1, locked 1
      2 lock struct(s), heap size 1128, 0 row lock(s)
      MySQL thread id 18646, OS thread handle 140619220154112, query id 777865925 172.31.66.223 dbworkerfactorysecure Executing
      INSERT INTO ******
      ---TRANSACTION 422120853290824, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853286568, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 17462604837, ACTIVE 234 sec inserting
      mysql tables in use 19, locked 13
      2 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 203216
      MySQL thread id 18631, OS thread handle 140619223607040, query id 777476378 172.31.66.68 dbworkerfactorysecure Creating sort index
      INSERT *****
      Trx read view will not see trx with id >= 17462604819, sees < 17462603853
      ---TRANSACTION 17462603853, ACTIVE 325 sec inserting
      mysql tables in use 5, locked 2
      2 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 3713879
      MySQL thread id 18627, OS thread handle 140620299949824, query id 777462590 172.31.66.142 dbworkerfactorysecure Sending data
      INSERT INTO ****
      Trx read view will not see trx with id >= 17462603853, sees < 17462603853
      ---TRANSACTION 422120853273800, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853269544, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853265288, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853261032, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853167400, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853256776, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853252520, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853248264, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853244008, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853239752, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853235496, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853231240, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853226984, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853222728, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853218472, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853214216, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853209960, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853205704, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853201448, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853197192, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853192936, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853188680, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853184424, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853180168, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853175912, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853171656, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853163144, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853158888, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853154632, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853150376, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853146120, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853141864, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853137608, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853133352, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853129096, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853124840, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853120584, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853116328, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853112072, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853107816, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853103560, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853099304, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853095048, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 422120853090792, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 17462604858, ACTIVE 220 sec inserting
      mysql tables in use 1, locked 1
      3 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
      MySQL thread id 10, OS thread handle 140646178735872, query id 777865730
      ---TRANSACTION 422120853082280, not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      --------
      FILE I/O
      --------
      I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
      I/O thread 1 state: waiting for completed aio requests (log thread)
      I/O thread 2 state: waiting for completed aio requests (read thread)
      I/O thread 3 state: waiting for completed aio requests (read thread)
      I/O thread 4 state: waiting for completed aio requests (read thread)
      I/O thread 5 state: waiting for completed aio requests (read thread)
      I/O thread 6 state: waiting for completed aio requests (write thread)
      I/O thread 7 state: waiting for completed aio requests (write thread)
      I/O thread 8 state: waiting for completed aio requests (write thread)
      I/O thread 9 state: waiting for completed aio requests (write thread)
      Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
       ibuf aio reads:, log i/o's:, sync i/o's:
      Pending flushes (fsync) log: 0; buffer pool: 0
      24188890 OS file reads, 29232470 OS file writes, 4530146 OS fsyncs
      0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
      -------------------------------------
      INSERT BUFFER AND ADAPTIVE HASH INDEX
      -------------------------------------
      Ibuf: size 1, free list len 209942, seg size 209944, 1509741 merges
      merged operations:
       insert 3862925, delete mark 11675526, delete 1201855
      discarded operations:
       insert 0, delete mark 0, delete 0
      Hash table size 5843809, node heap has 3791 buffer(s)
      Hash table size 5843809, node heap has 5577 buffer(s)
      Hash table size 5843809, node heap has 3297 buffer(s)
      Hash table size 5843809, node heap has 5687 buffer(s)
      Hash table size 5843809, node heap has 3953 buffer(s)
      Hash table size 5843809, node heap has 7772 buffer(s)
      Hash table size 5843809, node heap has 18181 buffer(s)
      Hash table size 5843809, node heap has 10925 buffer(s)
      7.32 hash searches/s, 2.65 non-hash searches/s
      ---
      LOG
      ---
      Log sequence number 102422876480104
      Log flushed up to   102422876480104
      Pages flushed up to 102422800703925
      Last checkpoint at  102422800703925
      0 pending log flushes, 0 pending chkp writes
      1642503 log i/o's done, 0.00 log i/o's/second
      ----------------------
      BUFFER POOL AND MEMORY
      ----------------------
      Total large memory allocated 23655874560
      Dictionary memory allocated 4009296
      Buffer pool size   1412752
      Free buffers       4632
      Database pages     1348936
      Old database pages 497782
      Modified db pages  5220
      Percent of dirty pages(LRU & free pages): 0.386
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 2, single page 0
      Pages made young 12285071, not young 736294742
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 24091045, created 6416261, written 26204033
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 1348936, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ----------------------
      INDIVIDUAL BUFFER POOL INFO
      ----------------------
      ---BUFFER POOL 0
      Buffer pool size   176594
      Free buffers       575
      Database pages     168645
      Old database pages 62233
      Modified db pages  825
      Percent of dirty pages(LRU & free pages): 0.488
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1530158, not young 89652355
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3000704, created 820651, written 4459033
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168645, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 1
      Buffer pool size   176594
      Free buffers       546
      Database pages     168545
      Old database pages 62196
      Modified db pages  661
      Percent of dirty pages(LRU & free pages): 0.391
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1548223, not young 93971088
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3064144, created 797585, written 2957721
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168545, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 2
      Buffer pool size   176594
      Free buffers       621
      Database pages     168578
      Old database pages 62208
      Modified db pages  608
      Percent of dirty pages(LRU & free pages): 0.359
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1553285, not young 91180805
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3023249, created 801831, written 2881988
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168578, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 3
      Buffer pool size   176594
      Free buffers       623
      Database pages     168501
      Old database pages 62180
      Modified db pages  621
      Percent of dirty pages(LRU & free pages): 0.367
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1544401, not young 92782155
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3055426, created 798407, written 3548916
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168501, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 4
      Buffer pool size   176594
      Free buffers       570
      Database pages     168653
      Old database pages 62236
      Modified db pages  605
      Percent of dirty pages(LRU & free pages): 0.358
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1530555, not young 93768585
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 3023945, created 798087, written 3336562
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168653, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 5
      Buffer pool size   176594
      Free buffers       530
      Database pages     168808
      Old database pages 62293
      Modified db pages  641
      Percent of dirty pages(LRU & free pages): 0.379
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 2, single page 0
      Pages made young 1547158, not young 91440491
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 2993753, created 800748, written 2959769
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168808, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 6
      Buffer pool size   176594
      Free buffers       537
      Database pages     168617
      Old database pages 62223
      Modified db pages  636
      Percent of dirty pages(LRU & free pages): 0.376
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1514961, not young 89712669
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 2978655, created 800363, written 2979454
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168617, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      ---BUFFER POOL 7
      Buffer pool size   176594
      Free buffers       630
      Database pages     168589
      Old database pages 62213
      Modified db pages  623
      Percent of dirty pages(LRU & free pages): 0.368
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0, single page 0
      Pages made young 1516330, not young 93786594
      0.00 youngs/s, 0.00 non-youngs/s
      Pages read 2951169, created 798589, written 3080590
      0.00 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 168589, unzip_LRU len: 0
      I/O sum[0]:cur[0], unzip sum[0]:cur[0]
      --------------
      ROW OPERATIONS
      --------------
      0 queries inside InnoDB, 0 queries in queue
      2 read views open inside InnoDB
      Process ID=2563, Main thread ID=140620421134080, state: sleeping
      Number of rows inserted 291442737, updated 43530027, deleted 16560971, read 18102275991
      0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 21.62 reads/s
      Number of system rows inserted 839287, updated 0, deleted 839231, read 839237
      0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
      ----------------------------
      END OF INNODB MONITOR OUTPUT
      ============================
      

      Potentially related issues?
      https://jira.mariadb.org/browse/MDEV-24142
      https://jira.mariadb.org/browse/MDEV-24225

      Attachments

        1. gdb-thread-apply-all-bt.txt.gz
          161 kB
        2. mariadb_10417.37345.3.gdb.bt
          38 kB
        3. mariadb_10417.37345.3.gdb.bt-full
          127 kB
        4. maria-sem.txt
          10 kB
        5. MDEV-24378.perf.tgz
          417 kB
        6. normal-load.perf.script.svg
          294 kB
        7. oc_1_innodb.txt
          11 kB
        8. oc_1_threads.txt
          49 kB
        9. read-spike.perf.script.svg
          99 kB
        10. SEIS.dict0stats.out
          6 kB


          Activity

            marko Marko Mäkelä added a comment -

            jacob.williams, can you please try to produce a stack trace of all threads during the hang? This report could be a duplicate of MDEV-24188, but I cannot confirm that based on the available information yet.

            Side note: in MariaDB 10.6, I plan to remove the InnoDB semaphore output as part of MDEV-21452.

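            To buy time to attach gdb and capture "thread apply all bt" before the watchdog aborts the server, one option is to raise the fatal semaphore wait threshold; a sketch, assuming your build exposes the MariaDB variable innodb_fatal_semaphore_wait_threshold (default 600):

            -- Check the current watchdog threshold
            SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold';
            -- In the versions I have checked this variable is not dynamic, so raising it
            -- means setting e.g. innodb_fatal_semaphore_wait_threshold = 7200 under
            -- [mysqld] in the server configuration and restarting.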

            octavian.cerna Octavian Cerna added a comment -

            I can confirm, I'm seeing long semaphore waits on multiple machines running both 10.3.27 and 10.4.17,
            to the point where there are semaphore-related crashes almost daily.
            This didn't happen before on 10.3.24, which was stable in our environment (Debian buster on bare metal).

            I have attached a monitor output: oc_1_innodb.txt and gdb backtrace: oc_1_threads.txt of a server
            running 10.4.17-MariaDB-1:10.4.17+maria~buster-log. The server is a slave replicating an ETL job;
            long-running queries examining tens of millions of rows are normal for this machine.
            In this particular case the query didn't even finish before the server was taken down by the lock monitor.

            There was a single query running, waiting for a RW-lock. Looking at the backtrace of the thread holding that RW-lock
            I noticed it was doing a lot of I/O in dict_stats_update_persistent, so I straced that thread and saw that it made
            a lot of pread64() calls, confirming that the thread was making progress and wasn't stalled by hung I/O or other OS-related issue.

            This led me to believe that the background stats update is starving the query threads by holding a write lock to
            a highly-contended mutex (a system dictionary lock?).
            Disabling innodb_stats_persistent has fixed the issue for me, and the server has managed to finish the workload and not crash.
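
            For reference, the workaround described above can be applied at runtime with standard statements; a sketch (put the setting in the server configuration as well if it should survive a restart):

            -- Check whether persistent statistics are currently enabled
            SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent';
            -- Workaround, not a fix: stop background persistent statistics recalculation
            SET GLOBAL innodb_stats_persistent = OFF;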

            I suspect that this high contention on that internal mutex might be related to MDEV-23991
            but I'm not that knowledgeable on InnoDB internals.

            I'm seeing similar behaviour on a server running 10.3.27 - long semaphore waits in btr0cur.cc resulting in crashes;
            that server was recently upgraded from 10.3.24, which was very stable for us and didn't have this issue.

            harri.korpi.c harri korpi added a comment - - edited

            Hello,
            I think we are having this same problem in a production environment where the server crashes daily on CentOS 7. This started occurring after upgrading to 10.4.17 from 10.4.14 - our app reads/inserts/deletes quite a large number of rows twice a day by copying the data from one schema to another (the crash seems to happen on roughly every 2nd run).

            Maria log shows the crash as:

            ..
            Pages read ahead 6.40/s, evicted without access 1.65/s, Random read ahead 0.00/s
            LRU len: 61456, unzip_LRU len: 0
            I/O sum[175215]:cur[3038], unzip sum[0]:cur[0]
            --------------
            ROW OPERATIONS
            --------------
            0 queries inside InnoDB, 0 queries in queue
            0 read views open inside InnoDB
            Process ID=25897, Main thread ID=140580188915456, state: sleeping
            Number of rows inserted 17557314, updated 65985, deleted 8711548, read 159080556
            0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
            Number of system rows inserted 0, updated 0, deleted 0, read 0
            0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
            ----------------------------
            END OF INNODB MONITOR OUTPUT
            ============================
            InnoDB: ###### Diagnostic info printed to the standard error stream
            2020-12-22  6:16:56 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            201222  6:16:56 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed,
            something is definitely wrong and this may fail.
             
            Server version: 10.4.17-MariaDB
            key_buffer_size=268435456
            read_buffer_size=1048576
            max_used_connections=5
            max_threads=153
            thread_count=7
            It is possible that mysqld could use up to
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 579242 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x49000
            /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x556b8b498ffe]
            /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x556b8af2a9ef]
            /lib64/libpthread.so.0(+0xf630)[0x7fddb2a68630]
            /lib64/libc.so.6(gsignal+0x37)[0x7fddb0d39387]
            /lib64/libc.so.6(abort+0x148)[0x7fddb0d3aa78]
            /usr/sbin/mysqld(+0xbc1d50)[0x556b8b221d50]
            /usr/sbin/mysqld(+0xb7a362)[0x556b8b1da362]
            /lib64/libpthread.so.0(+0x7ea5)[0x7fddb2a60ea5]
            /lib64/libc.so.6(clone+0x6d)[0x7fddb0e018dd]
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
            information that should help you find out what is causing the crash.
            Writing a core file...
            Working directory at /var/lib/mysql
            Resource Limits:
            Fatal signal 11 while backtracing
            ..
            

            Would you be able to confirm whether this is the same or a different issue?

            Thank you.

            mikeininfo Mike Ingle added a comment -

            We had this problem on 10.4.17, Ubuntu 18.04 x86_64. On one server I could not even load a large 250GB import file from mysqldump, as it would throw the >600 error and restart at about the 25% mark every time. I downgraded that machine to 10.4.15 and was able to load it.

            Our production server had been upgraded to 10.4.17 to avoid another issue with a complex stored procedure causing a restart. It was getting long semaphore timeouts causing rollbacks every day and eventually had a >600 restart during a large batch insert. We got a lot of log entries like this:
            OS WAIT ARRAY INFO: reservation count 8089822
            --Thread 140441005106944 has waited at btr0cur.cc line 1492 for 543.00 seconds the semaphore:
            SX-lock on RW-latch at 0x7fbae40162e8 created in file dict0dict.cc line 1954
            a writer (thread id 140442200524544) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time write locked in file dict0stats.cc line 1968

            The lock was always from dict0stats.cc, which is table/index statistics. This file has been modified a lot between 10.4.15 and 10.4.17. I turned off table/index statistics:

            set global innodb_stats_persistent = off;
            set global innodb_stats_auto_recalc = off;
            set global innodb_stats_traditional = off;

            The errors, which had been appearing nightly, did not happen last night. I think what is happening is that when some percentage of the table has been changed, MariaDB decides to update the statistics. Because the insert or update is still going on, the statistics update gets into a deadlock with the insert/update and crashes the server. It looks like the dict0stats.cc file has a lot of exception bailouts in it. Perhaps one of them is not releasing a lock? Until MariaDB gets this fixed, turning off InnoDB stats may be a workaround to keep your server alive.
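
            If turning statistics off globally is too broad, the same thing can be done per table with the standard InnoDB table options; a sketch with a placeholder table name:

            -- Disable persistent stats only for the heavily written table (name is a placeholder)
            ALTER TABLE big_batch_table STATS_PERSISTENT=0, STATS_AUTO_RECALC=0;
            -- Revert to the server-wide default later
            ALTER TABLE big_batch_table STATS_PERSISTENT=DEFAULT, STATS_AUTO_RECALC=DEFAULT;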

            shalogo Leonard Sinquer added a comment - - edited

            We are having exactly the same issues on our multi-master set-up, restarts happening daily on random masters/slaves (all running 10.4.17). We previously used 10.4.2 and this never happened. I will try the suggestion by @mikeininfo and will let you know the results. Thanks!

            InnoDB: ###### Diagnostic info printed to the standard error stream 
            2020-12-09 16:01:22 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            

            mysql-error.log snippet:

            ===================================== 
            2020-12-09 16:01:10 0x7f924a7fc700 INNODB MONITOR OUTPUT 
            ===================================== 
            Per second averages calculated from the last 20 seconds 
            ----------------- 
            BACKGROUND THREAD 
            ----------------- 
            srv_master_thread loops: 162828 srv_active, 0 srv_shutdown, 5635 srv_idle 
            srv_master_thread log flush and writes: 168462 
            ---------- 
            SEMAPHORES 
            ---------- 
            OS WAIT ARRAY INFO: reservation count 891960 
            --Thread 140378791868160 has waited at btr0cur.cc line 1492 for 941.00 seconds the semaphore: 
            SX-lock on RW-latch at 0x7f923801f0a8 created in file dict0dict.cc line 1954 
            a writer (thread id 140266283448064) has reserved it in mode  SX 
            number of readers 0, waiters flag 1, lock_word: 10000000 
            Last time write locked in file dict0stats.cc line 1968 
            OS WAIT ARRAY INFO: signal count 44950247 
            RW-shared spins 216698389, rounds 768920256, OS waits 382090 
            RW-excl spins 108692241, rounds 463496787, OS waits 57256 
            RW-sx spins 50881, rounds 607691, OS waits 5555 
            Spin rounds per wait: 3.55 RW-shared, 4.26 RW-excl, 11.94 RW-sx 
            ------------ 
            TRANSACTIONS 
            ------------ 
            Trx id counter 66812968602 
            Purge done for trx's n:o < 66812968601 undo n:o < 0 state: running but idle 
            History list length 1 
            LIST OF TRANSACTIONS FOR EACH SESSION: 
            ---TRANSACTION 421853774909944, not started 
            0 lock struct(s), heap size 1128, 0 row lock(s) 
            ---TRANSACTION 66812968601, ACTIVE 941 sec updating or deleting 
            mysql tables in use 5, locked 5 
            3 lock struct(s), heap size 1128, 2 row lock(s) 
            MySQL thread id 10, OS thread handle 140378791868160, query id 579492716 Updating 
            UPDATE loginlogs SET logouttime = 1607525129, rxbytes = rxbytes + 10004012, txbytes = txbytes + 675244 WHERE logid = 2538556223701127079 
            ---TRANSACTION 421853774901432, not started 
            0 lock struct(s), heap size 1128, 0 row lock(s)
            

             
            Thread pointer: 0x0 
            Attempting backtrace. You can use the following information to find out 
            where mysqld died. If you see no messages after this, something went 
            terribly wrong... 
            stack_bottom = 0x0 thread_stack 0x49000 
            /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55784d20bffe] 
            /usr/sbin/mysqld(handle_fatal_signal+0x30f)[0x55784cc9d9ef] 
            /lib64/libpthread.so.0(+0xf630)[0x7facbbbb4630] 
            /lib64/libc.so.6(gsignal+0x37)[0x7facb9e84387] 
            /lib64/libc.so.6(abort+0x148)[0x7facb9e85a78] 
            /usr/sbin/mysqld(+0xbc1d50)[0x55784cf94d50] 
            /usr/sbin/mysqld(+0xb7a362)[0x55784cf4d362] 
            /lib64/libpthread.so.0(+0x7ea5)[0x7facbbbacea5] 
            /lib64/libc.so.6(clone+0x6d)[0x7facb9f4c96d] 
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains 
            information that should help you find out what is causing the crash. 
            Writing a core file... 
            Working directory at /var/lib/mysql 
            Resource Limits: 
            Fatal signal 11 while backtracing
            

            Checking stack trace:

            Program terminated with signal 11, Segmentation fault.
            #0  my_read (Filedes=291, 
                Buffer=0x7ef87aefd3e0 "Limit", ' ' <repeats 21 times>, "Soft Limit", ' ' <repeats 11 times>, "Hard Limit", ' ' <repeats 11 times>, "Units     \nMax cpu time", ' ' <repeats 14 times>, "unlimited", ' ' <repeats 12 times>, "unlimited", ' ' <repeats 12 times>, "seconds   \nMax file size", ' ' <repeats 13 times>, "unlimited       "..., Count=4096, MyFlags=<optimized out>) at /usr/src/debug/MariaDB-10.4.17/src_0/mysys/my_read.c:63
            63            int got_errno= my_errno= errno;
            Missing separate debuginfos, use: debuginfo-install systemd-libs-219-78.el7_9.2.x86_64
            (gdb) bt
            #0  my_read (Filedes=291, 
                Buffer=0x7ef87aefd3e0 "Limit", ' ' <repeats 21 times>, "Soft Limit", ' ' <repeats 11 times>, "Hard Limit", ' ' <repeats 11 times>, "Units     \nMax cpu time", ' ' <repeats 14 times>, "unlimited", ' ' <repeats 12 times>, "unlimited", ' ' <repeats 12 times>, "seconds   \nMax file size", ' ' <repeats 13 times>, "unlimited       "..., Count=4096, MyFlags=<optimized out>) at /usr/src/debug/MariaDB-10.4.17/src_0/mysys/my_read.c:63
            #1  0x000055d89333065d in output_core_info () at /usr/src/debug/MariaDB-10.4.17/src_0/sql/signal_handler.cc:66
            #2  0x000055d893330a50 in handle_fatal_signal (sig=6) at /usr/src/debug/MariaDB-10.4.17/src_0/sql/signal_handler.cc:339
            #3  <signal handler called>
            #4  0x00007f0d9a06a387 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
            #5  0x00007f0d9a06ba78 in __GI_abort () at abort.c:90
            #6  0x000055d893627d50 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>)
                at /usr/src/debug/MariaDB-10.4.17/src_0/storage/innobase/ut/ut0ut.cc:601
            #7  0x000055d8935e0362 in srv_error_monitor_thread () at /usr/src/debug/MariaDB-10.4.17/src_0/storage/innobase/srv/srv0srv.cc:1836
            #8  0x00007f0d9bd92ea5 in start_thread (arg=0x7ef87aeff700) at pthread_create.c:307
            #9  0x00007f0d9a13296d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
            

            mikeininfo Mike Ingle added a comment -

            @shalogo Did that happen before or after you made the change? I have not seen it again since the change. Trying to determine if this is the only place it happens.


            shalogo Leonard Sinquer added a comment -

            @Mike not sure what change you mean? The semaphore waits started happening immediately after we upgraded to 10.4.17 - at least once every 24 hours since the upgrade; sometimes after 4 hours, sometimes it takes a full day. In the end, I was forced to roll back to 10.4.15 (which has its own issues, like innodb_io_capacity being interpreted in a weird way) but at least the semaphore waits are not present. I only have a production set-up, so I cannot afford to experiment much, sorry.

            digdilem Simon Avery added a comment - - edited

            Another sufferer of this reporting in.

            We have around a dozen MariaDB servers, all running on CentOS 7, which started exhibiting this behavior only after we upgraded from 10.2 a few months back.

            The current version is 10.3.27-MariaDB, and we started at 10.3.14 when we moved up.

            It happens under heavy load - sometimes under high concurrency, sometimes with a single active connection. Resources are fine: lots of free RAM, adequate CPU load, fast disk and network; nothing near capacity. Some slow queries are generated (default timeouts) but nothing terrible.

            There's nothing unusual in the work being asked of the databases, similar queries, data and workload worked fine before this upgrade.

            We have 10.3.x running on CentOS 8 servers too. I can't say that it doesn't crash on those, only that the ones I'm having the most problems with are on CentOS 7.

            Have been chasing this for a while, so relieved to see we are not the only one.

            Jan 15 00:21:10 serverA maria 2021-01-15 0:21:05 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

            Partial other logs: maria-sem.txt


            jacob.williams Jacob Williams added a comment -

            I have not been able to dedicate the time to get the stack trace, but I still plan to. We downgraded our servers to 10.4.13, which was the last version that was stable for us. It has been stable for several weeks now on the older version. Versions 10.4.14-17 each had slightly different symptoms, but we saw 1-2 crashes per week across 8 servers, with significant locking issues preceding the crashes. Prior to 10.4.17 the crashes did not leave much in the logs, either nothing or a generic crash output with no stack trace. Starting with .17 the crashes were always preceded by the semaphore warnings, although sometimes the semaphore warnings resolved without a crash.

            When I get a chance, I will upgrade our staging environment again and set it up to get a proper stack trace. I still don't know how to trigger the problem at will in our environment. It happens often enough to cause problems, but infrequently enough to make it hard to catch. And it happens much less frequently in staging, since we don't maintain nearly as much load as production.

            nostromo Santiago Gala added a comment -

            We have experienced the same crash twice. We have a server dealing with an archive: a big database where records are inserted continuously. While this insertion process is going on, periodic processes copy those records to new ones on a backup volume and delete the original ones, and users query the same table.

            We suffered the problem twice in 2 weeks, soon after starting the first "export" batch process following the upgrade from 10.4.13.

            The table that holds the files tracks about 400 million rows; up to 10 rows per second are added during normal operation, and about 20 per second while files are being sent to long-term storage.

            Previous results with 10.4.13 were quite satisfactory, except that we experienced MDEV-24117 a few times and were working around it (simplifying some user queries).

            rantal Anton Rozhkov added a comment -

            We have experienced the same issue after upgrading to 10.5.8 (in the 10.5.x line) and 10.4.17 (in the 10.4.x line). The issue appears periodically under normal workload, and it is difficult to see any difference in workload patterns between the times when the problem is absent and the times when we hit a lock. We see the same issue on different servers with different workloads; on some it may appear several times a day, on others once a week. The difference from the cases described here is that we do not crash on "Semaphore wait > 600 seconds"; in our case all database insert/update operations are blocked for some time (from several seconds to 2-3 minutes), after which the locks are released and write operations resume.

            10.5.8 :
            --Thread 140650289166080 has waited at btr0cur.cc line 1480 for 2.00 seconds the semaphore:
            SX-lock on RW-latch at 0x7feb64253140 created in file dict0dict.cc line 2161
            a writer (thread id 140649492657920) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time write locked in file dict0stats.cc line 1969

            10.4.17:
            --Thread 139856299718400 has waited at btr0cur.cc line 1492 for 9.00 seconds the semaphore:
            SX-lock on RW-latch at 0x7f329c3729d8 created in file dict0dict.cc line 1954
            a writer (thread id 139857212626688) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time write locked in file dict0stats.cc line 1968

            We have tried compiling binaries with the cherry-picked patches for MDEV-24188 by Marko Mäkelä applied to the sources of 10.5.8 and 10.4.17, but that did not help in our case. About two weeks ago we tried the workaround proposed by Octavian Cerna and disabled innodb_stats_persistent. That did the trick: the locks have gone away in our case.
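
            For reference, a minimal sketch of how that kind of workaround can be applied at runtime (this assumes a local server and a client account with sufficient privileges; the setting must also be added to the server configuration file to survive a restart):

              # Hedged sketch: turn off InnoDB persistent statistics at runtime.
              mysql -e "SET GLOBAL innodb_stats_persistent = OFF;"
              # A narrower alternative is to keep persistent statistics but stop the
              # automatic background recalculation:
              mysql -e "SET GLOBAL innodb_stats_auto_recalc = OFF;"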

            Zimeon Zimeon added a comment -

            We seem to have the same issue after upgrading to 10.2.36-1.el7.centos from MariaDB-server-10.2.31-1.el7.centos.

            ==================================================

            InnoDB: ###### Diagnostic info printed to the standard error stream
            2021-01-31 15:46:36 140389001185024 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            210131 15:46:36 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.

            To report this bug, see https://mariadb.com/kb/en/reporting-bugs

            ==================================================

            InnoDB: ###### Diagnostic info printed to the standard error stream
            2021-01-31 15:38:20 140389001185024 [Warning] InnoDB: A long semaphore wait:
            --Thread 140388810667776 has waited at btr0cur.cc line 1143 for 458.00 seconds the semaphore:
            SX-lock on RW-latch at 0x5639d4c538a8 created in file dict0dict.cc line 2185
            a writer (thread id 140388976006912) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time read locked in file btr0cur.cc line 1175
            Last time write locked in file dict0stats.cc line 1972
            2021-01-31 15:38:20 140389001185024 [Warning] InnoDB: A long semaphore wait:
            --Thread 140388812203776 has waited at btr0cur.cc line 1143 for 458.00 seconds the semaphore:
            SX-lock on RW-latch at 0x5639d4c538a8 created in file dict0dict.cc line 2185
            a writer (thread id 140388976006912) has reserved it in mode SX
            number of readers 0, waiters flag 1, lock_word: 10000000
            Last time read locked in file btr0cur.cc line 1175
            Last time write locked in file dict0stats.cc line 1972
            2021-01-31 15:38:20 140389001185024 [Warning] InnoDB: A long semaphore wait:

            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed,
            something is definitely wrong and this may fail.

            Server version: 10.2.36-MariaDB
            key_buffer_size=33554432
            read_buffer_size=131072
            max_used_connections=573
            max_threads=1202
            thread_count=579
            It is possible that mysqld could use up to
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2673944 K bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.

            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x49000
            /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x56398f3ed84e]
            /usr/sbin/mysqld(handle_fatal_signal+0x30d)[0x56398ee858ad]
            /lib64/libpthread.so.0(+0xf630)[0x7fbc35de5630]
            /lib64/libc.so.6(gsignal+0x37)[0x7fbc340b5387]
            /lib64/libc.so.6(abort+0x148)[0x7fbc340b6a78]
            /usr/sbin/mysqld(+0x999ac3)[0x56398f17bac3]
            /usr/sbin/mysqld(+0x949602)[0x56398f12b602]
            /lib64/libpthread.so.0(+0x7ea5)[0x7fbc35dddea5]
            /lib64/libc.so.6(clone+0x6d)[0x7fbc3417d96d]
            The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
            information that should help you find out what is causing the crash.
            ==================================================

            I'll have to roll back to the previously installed version.

            marko Marko Mäkelä added a comment - - edited

            Releases with the fix of MDEV-24188 should be out in a few days.

            But I see that rantal mentioned that the fix did not help. Would it be possible to produce a stack trace of all threads during the hang, as I asked on December 9?
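
            For anyone able to catch the hang live, a minimal sketch of capturing such a trace (assuming a single mysqld instance, with gdb and the matching debuginfo packages installed; the output file name is arbitrary):

              # Attach to the hung server, dump a backtrace of every thread, then
              # detach so the process keeps running.
              gdb --batch -p "$(pidof mysqld)" \
                  -ex "thread apply all backtrace" \
                  -ex "detach" > mysqld-all-threads.bt 2>&1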

            Zimeon Zimeon added a comment -

            Unfortunately I won't be able to test; I had to downgrade in the hope that it won't crash. It has been stable for about 16 hours now; I hope it stays that way. 10.2.34 was the oldest version available in the repository for a quick replacement.

            ---> Package MariaDB-common.x86_64 0:10.2.34-1.el7.centos will be a downgrade
            ---> Package MariaDB-common.x86_64 0:10.2.36-1.el7.centos will be erased
            ---> Package MariaDB-compat.x86_64 0:10.2.34-1.el7.centos will be a downgrade
            ---> Package MariaDB-compat.x86_64 0:10.2.36-1.el7.centos will be erased
            ---> Package MariaDB-server.x86_64 0:10.2.34-1.el7.centos will be a downgrade
            ---> Package MariaDB-server.x86_64 0:10.2.36-1.el7.centos will be erased
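
            (For reference, a downgrade that produces a transaction like the one above can be driven roughly as follows, assuming the target version is still present in the configured repository:)

              # Hedged sketch: roll the MariaDB packages back to 10.2.34 on CentOS 7.
              yum downgrade MariaDB-server-10.2.34-1.el7.centos \
                            MariaDB-common-10.2.34-1.el7.centos \
                            MariaDB-compat-10.2.34-1.el7.centos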


            sergey.kuzmichev Sergey Kuzmichev added a comment -

            I've seen a performance-impact flavor of this issue (as described by Anton Rozhkov here) on a couple of different occasions. Both were on 10.4.17, and both had persistent stats and recalc enabled. One of the environments experienced semaphore waits (~250 seconds long) under normal write load (which is, however, heavily insert-based). The other environment was a new replica build process importing a relatively large database. The latter environment showed a peculiar alternating load: roughly every 30 minutes it would go from all write load (as expected for the dump restoration) to getting stuck on semaphore waits while reading in pages, not progressing writes at all, and then back again. Disabling stats recalc makes the load predictably write-heavy 100% of the time during the restore. The buffer pool is 32G and the largest table is ~50G.

            The semaphore waits we faced are a perfect copy of what this and other comments show: https://jira.mariadb.org/browse/MDEV-24378?focusedCommentId=177911&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-177911

            What I have in terms of the data gathered are some perf outputs and flamegraphs generated on the "dump-restore" environment during the write and read loads.
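
            A minimal sketch of how that kind of data can be collected (assuming perf and a checkout of Brendan Gregg's FlameGraph scripts; the sampling rate, duration, and file names are arbitrary):

              # Sample on-CPU stacks of the server for 60 seconds, then fold them
              # into a flamegraph SVG.
              perf record -F 99 -g -p "$(pidof mysqld)" -- sleep 60
              perf script > mysqld.perf.script
              ./stackcollapse-perf.pl mysqld.perf.script | ./flamegraph.pl > mysqld.svg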

            I've been checking this locally, but either the size of the table is still too small, or the I/O performance is more favorable. I get some semaphore waits, but none longer than ~30 seconds. The write thread is in the same state as seen in the already attached oc_1_threads.txt, while the stats calculation thread is usually in dict_stats_analyze_index for me.

            I will attach SEIS output (SEIS.dict0stats.out) and backtraces (mariadb_10417.37345.3.gdb.bt and mariadb_10417.37345.3.gdb.bt-full) from my limited testing, and also flamegraphs and raw perf data from the real affected environment: read-spike.perf.script.svg, normal-load.perf.script.svg, MDEV-24378.perf.tgz. The read-spike flamegraphs were generated while the dump restore was stuck on a semaphore wait, and the normal-load ones while it was progressing as expected. Perhaps the >600-second wait crashes are just an extreme example of what I've seen, but I'm not sure at this point.


            marko Marko Mäkelä added a comment -

            sergey.kuzmichev, a quick look at mariadb_10417.37345.3.gdb.bt suggests that the only non-idle threads are Thread 2 and Thread 16. It is likely that Thread 2 is waiting for a page latch that is held by Thread 16 (the persistent statistics recalculation thread). It is a known fact that the recalculation of persistent statistics can disrupt the system. Also, as noted in MDEV-15020, in some cases it may hang the entire server, due to the way the interface to the statistics tables has been implemented. (That is why I think that it is somewhat risky to enable the automatic recalculation of InnoDB statistics.)

            However, I see no evidence of a deadlock, livelock, or busy loop in this case. It definitely does not look like MDEV-24188, because there is no buf_page_create() being executed.

            marko Marko Mäkelä added a comment -

            sergey.kuzmichev, I just realized that mariadb_10417.37345.3.gdb.bt could match MDEV-24275.

            marko Marko Mäkelä added a comment -

            Did anyone have a chance to test 10.4.18, which was released this week? That should fix both MDEV-24275 and MDEV-24188.


            shalogo Leonard Sinquer added a comment -

            @Marko - I will install this next week if I do not see any issues in this thread. I have only production servers, so it's always high risk, low reward for me. We are currently on 10.4.15 and it has been stable.
            mhubert Mario Hubert added a comment -

            I had been using 10.5.8; the problem has been resolved after upgrading to 10.5.9.


            marko Marko Mäkelä added a comment -

            Thank you, mhubert.

            jacob.williams, shalogo, sergey.kuzmichev, did any of you try the newer releases? Can we close this ticket?

            jacob.williams Jacob Williams added a comment - - edited

            Apologies for not getting back on this. We had downgraded to 10.4.13 to avoid crashes and held off on upgrading again for a while. We finally updated to 10.4.21 last weekend and started crashing again yesterday. I am working on a stack trace now from the core dump we took. The error log message was not very detailed, so I can't say it is the exact same issue, but the symptoms were otherwise identical and it occurred twice within 24 hours, starting about 5 days after we upgraded from 10.4.13 to 10.4.21.

            I don't see the semaphore messages in the logs, only a generic crash report. But the symptom is that several queries hang for long periods, clients cannot start new sessions, and some existing sessions may continue to run queries for a time. The server process does not respond to a normal shutdown request from systemd; kill -9 is required to stop it, and then it starts back up fine. This behavior is the same as what we usually saw when we first updated to 10.4.17. We are going to downgrade back to 10.4.13 to avoid more crashes. I will attach the stack trace as soon as I have ensured it doesn't contain any sensitive data that I can't share.
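
            A sketch of pulling an all-threads backtrace out of such a core dump (the core path and output file name are placeholders; the mysqld binary and debuginfo from the crashing version are assumed to be installed):

              # Load the core against the matching server binary and dump every
              # thread's stack.
              gdb --batch /usr/sbin/mysqld /path/to/core \
                  -ex "thread apply all backtrace full" > all-threads-bt.txt 2>&1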

            jacob.williams Jacob Williams added a comment - gdb-thread-apply-all-bt.txt.gz

            marko Marko Mäkelä added a comment -

            I do not see any InnoDB hang in gdb-thread-apply-all-bt.txt.gz. In fact, I am not convinced that there is a hang at all. Based on only one sample, it is very difficult to tell whether anything is hung.

            I see that some InnoDB threads are waiting for something inside malloc (apparently in the jemalloc library).

            If you find evidence of a hang, please file a new ticket, and identify the supposedly hung threads.


            People

              marko Marko Mäkelä
              jacob.williams Jacob Williams
              Votes: 15
              Watchers: 29

