[MDEV-12722] Maria DB 10.1.16 freeze Created: 2017-05-07  Updated: 2020-08-25  Resolved: 2017-08-10

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.1.16
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Stefan Assignee: Marko Mäkelä
Resolution: Duplicate Votes: 0
Labels: None
Environment:

CentOS release 6.8 (Final)

Linux localhost 2.6.32-642.3.1.el6.x86_64 #1 SMP Tue Jul 12 18:30:56 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux


Attachments: Text File 1_crash.log     Text File 2_crash.log     HTML File Test structures and SQL     Text File innodb stats.txt     Text File maria server configuration.txt     File stack.sh     Text File statisctics.txt    
Issue Links:
Duplicate
duplicates MDEV-12496 mtflush thread's hang cause mysqld crash Closed
Relates
relates to MDEV-10843 XtraDB Semaphore Stalls with innodb_u... Closed

 Description   

Hi,

I think we found a potential bug in MariaDB server 10.1.16, in production.

Checking the logs we observed that no application could execute sql. After we got max user connections limit reached.
The processlist showed that there were over 2500 transactions in different state having over 6 hours duration

I attached the innodb stats from the server, together with the server configuration.

Unfortunately no stack trace or anything is available.



 Comments   
Comment by Stefan [ 2017-05-08 ]

InnoDB: Warning: a long semaphore wait:
--Thread 139885753087744 has waited at buf0flu.cc line 1099 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 139885753087744 has waited at buf0flu.cc line 1099 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 139885550978816 has waited at row0ins.cc line 2410 for 236.00 seconds the semaphore:
X-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 139885633599232 has waited at row0ins.cc line 2410 for 236.00 seconds the semaphore:
X-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 139885646182144 has waited at row0ins.cc line 2410 for 233.00 seconds the semaphore:
X-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

.................

-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 3220490 srv_active, 0 srv_shutdown, 5060747 srv_idle
srv_master_thread log flush and writes: 8281237
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 4030789
--Thread 139885753087744 has waited at buf0flu.cc line 1099 for 939.00 seconds the semaphore:
S-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
--Thread 139885550978816 has waited at row0ins.cc line 2410 for 934.00 seconds the semaphore:
X-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
--Thread 139885633599232 has waited at row0ins.cc line 2410 for 934.00 seconds the semaphore:
X-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
--Thread 139885646182144 has waited at row0ins.cc line 2410 for 931.00 seconds the semaphore:
X-lock on RW-latch at 0x7f3a13755940 '&block->lock'
a writer (thread id 139885552043776) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file buf0flu.cc line 1093
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0ins.cc line 2410
Holder thread 0 file not yet reserved line 0
OS WAIT ARRAY INFO: signal count 7563270
Mutex spin waits 57944255, rounds 191420993, OS waits 1707054
RW-shared spins 4170955, rounds 91034528, OS waits 2142812
RW-excl spins 2504676, rounds 17939004, OS waits 70531
Spin rounds per wait: 3.30 mutex, 21.83 RW-shared, 7.16 RW-excl
------------
TRANSACTIONS
------------

.............
---TRANSACTION 454073972, not started
MySQL thread id 3388936, OS thread handle 0x7f38f537d700, query id 119099404 10.220.62.13 root cleaning up
---TRANSACTION 454073970, not started
MySQL thread id 3388914, OS thread handle 0x7f38f45b6700, query id 119099402 10.220.62.13 root cleaning up
---TRANSACTION 381495043, not started
MySQL thread id 1, OS thread handle 0x7f39ab9eb700, query id 0 Waiting for background binlog tasks
---TRANSACTION 480931009, ACTIVE 931 sec inserting, thread declared inside InnoDB 5000
mysql tables in use 1, locked 1
1 lock struct(s), heap size 360, 0 row lock(s)
MySQL thread id 4805271, OS thread handle 0x7f39aa3ff700, query id 119082211 10.220.62.12 root update
....
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
---TRANSACTION 480931005, ACTIVE 934 sec inserting, thread declared inside InnoDB 5000
mysql tables in use 1, locked 1
1 lock struct(s), heap size 360, 0 row lock(s)
MySQL thread id 4805272, OS thread handle 0x7f39a97ff700, query id 119082207 10.220.62.12 root update
...
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
---TRANSACTION 480931002, ACTIVE 934 sec inserting, thread declared inside InnoDB 5000
mysql tables in use 1, locked 1
1 lock struct(s), heap size 360, 0 row lock(s)
MySQL thread id 4805273, OS thread handle 0x7f39a4934700, query id 119082204 10.220.62.12 root update
..
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
---TRANSACTION 480930999, ACTIVE 939 sec inserting, thread declared inside InnoDB 5000
mysql tables in use 1, locked 1
1 lock struct(s), heap size 360, 0 row lock(s)
MySQL thread id 4805274, OS thread handle 0x7f39a4a38700, query id 119082192 10.220.62.12 root update
..
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC
....

Comment by Stefan [ 2017-05-08 ]

Few days before, I found this trace in the slave log:

-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 5297, seg size 5299, 3633181 merges
merged operations:
insert 5753424, delete mark 44578355, delete 1959174
discarded operations:
insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s

LOG

Log sequence number 178892306277
Log flushed up to 178892306277
Pages flushed up to 178892306227
Last checkpoint at 178892306227
Max checkpoint age 651231499
Checkpoint age target 630880515
Modified age 50
Checkpoint age 50
0 pending log writes, 0 pending chkp writes
30422864 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 4504682496; in additional pool allocated 0
Total memory allocated by read views 296
Internal hash tables (constant factor + variable factor)
Adaptive hash index 75696848 (75692648 + 4200)
Page hash 1107208 (buffer pool 0 only)
Dictionary cache 19098461 (17702416 + 1396045)
File system 85360 (82672 + 2688)
Lock system 10627608 (10625704 + 1904)
Recovery system 0 (0 + 0)
Dictionary memory allocated 1396045
Buffer pool size 262140
Buffer pool size, bytes 4294901760
Free buffers 1594
Database pages 260546
Old database pages 96256
Modified db pages 1
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 2, single page 0
Pages made young 18104148, not young 2494888820
0.00 youngs/s, 0.00 non-youngs/s
Pages read 105397641, created 1260374, written 58688419
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 260546, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
----------------------
INDIVIDUAL BUFFER POOL INFO
----------------------
---BUFFER POOL 0
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 0
Database pages 65535
Old database pages 24211
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4672836, not young 650179854
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26451331, created 424123, written 29642653
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 65535, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 1
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 624
Database pages 64911
Old database pages 23981
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4393568, not young 620223514
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26177531, created 275940, written 7307879
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 64911, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 2
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 440
Database pages 65095
Old database pages 24049
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4622285, not young 629264811
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26555198, created 274203, written 5879473
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 65095, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 3
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 530
Database pages 65005
Old database pages 24015
Modified db pages 1
Percent of dirty pages(LRU & free pages): 0.002
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 2, single page 0
Pages made young 4415459, not young 595220641
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26213581, created 286108, written 15858414
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 65005, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
2 RW transactions active inside InnoDB
0 RO transactions active inside InnoDB
2 out of 1000 descriptors used
--OLDEST VIEW--
Normal read view
Read view low limit trx n:o 189855222
Read view up limit trx id 189855222
Read view low limit trx id 189855222
Read view individually stored trx ids:
-----------------
2 tablespace extents now reserved for B-tree split operations
Main thread process no. 11595, id 139675259873024, state: sleeping
Number of rows inserted 21120995, updated 7889795, deleted 18987619, read 27297144161
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Number of system rows inserted 23916553, updated 0, deleted 23916550, read 23916551
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Warning: a long semaphore wait:
--Thread 139675238893312 has waited at buf0flu.cc line 1099 for 923.00 seconds the semaphore:
S-lock on RW-latch at 0x7f08c81725e0 '&block->lock'
a writer (thread id 139675079538432) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0cur.cc line 258
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0upd.cc line 2665
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 139675238893312 has waited at buf0flu.cc line 1099 for 923.00 seconds the semaphore:
S-lock on RW-latch at 0x7f08c81725e0 '&block->lock'
a writer (thread id 139675079538432) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0cur.cc line 258
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0upd.cc line 2665
Holder thread 0 file not yet reserved line 0
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

=====================================
2017-05-02 01:18:11 7f08aedfb700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 16 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 3379979 srv_active, 0 srv_shutdown, 4459065 srv_idle
srv_master_thread log flush and writes: 7839044
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 5879353
--Thread 139675238893312 has waited at buf0flu.cc line 1099 for 938.00 seconds the semaphore:
S-lock on RW-latch at 0x7f08c81725e0 '&block->lock'
a writer (thread id 139675079538432) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0cur.cc line 258
Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.1.16/storage/xtradb/row/row0upd.cc line 2665
Holder thread 0 file not yet reserved line 0
OS WAIT ARRAY INFO: signal count 10102134
Mutex spin waits 39553646, rounds 221124095, OS waits 2613300
RW-shared spins 5383460, rounds 122080875, OS waits 3034741
RW-excl spins 3278636, rounds 22329155, OS waits 83698
Spin rounds per wait: 5.59 mutex, 22.68 RW-shared, 6.81 RW-excl
------------
TRANSACTIONS
------------
Trx id counter 189855499
Purge done for trx's n:o < 189855223 undo n:o < 2 state: running but idle
History list length 766
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 139495683, not started
MySQL thread id 1, OS thread handle 0x7f08a4deb700, query id 0 Waiting for background binlog tasks
---TRANSACTION 189855498, ACTIVE 938 sec updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1184, 1 row lock(s), undo log entries 2
MySQL thread id 7, OS thread handle 0x7f08a37ff700, query id 55764971
Trx #rec lock waits 0 #table lock waits 0
Trx total rec lock wait time 0 SEC
Trx total table lock wait time 0 SEC

--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
I/O thread 3 state: waiting for completed aio requests (read thread)
I/O thread 4 state: waiting for completed aio requests (read thread)
I/O thread 5 state: waiting for completed aio requests (read thread)
I/O thread 6 state: waiting for completed aio requests (read thread)
I/O thread 7 state: waiting for completed aio requests (read thread)
I/O thread 8 state: waiting for completed aio requests (read thread)
I/O thread 9 state: waiting for completed aio requests (read thread)
I/O thread 10 state: waiting for completed aio requests (write thread)
I/O thread 11 state: waiting for completed aio requests (write thread)
I/O thread 12 state: waiting for completed aio requests (write thread)
I/O thread 13 state: waiting for completed aio requests (write thread)
Pending normal aio reads: 0 [0, 0, 0, 0, 0, 0, 0, 0] , aio writes: 0 [0, 0, 0, 0] ,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
105397133 OS file reads, 89109812 OS file writes, 11133036 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 5297, seg size 5299, 3633181 merges
merged operations:
insert 5753424, delete mark 44578355, delete 1959174
discarded operations:
insert 0, delete mark 0, delete 0
0.00 hash searches/s, 0.00 non-hash searches/s

LOG

Log sequence number 178892306277
Log flushed up to 178892306277
Pages flushed up to 178892306227
Last checkpoint at 178892306227
Max checkpoint age 651231499
Checkpoint age target 630880515
Modified age 50
Checkpoint age 50
0 pending log writes, 0 pending chkp writes
30422864 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 4504682496; in additional pool allocated 0
Total memory allocated by read views 296
Internal hash tables (constant factor + variable factor)
Adaptive hash index 75696848 (75692648 + 4200)
Page hash 1107208 (buffer pool 0 only)
Dictionary cache 19098461 (17702416 + 1396045)
File system 85360 (82672 + 2688)
Lock system 10627608 (10625704 + 1904)
Recovery system 0 (0 + 0)
Dictionary memory allocated 1396045
Buffer pool size 262140
Buffer pool size, bytes 4294901760
Free buffers 1594
Database pages 260546
Old database pages 96256
Modified db pages 1
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 2, single page 0
Pages made young 18104148, not young 2494888820
0.00 youngs/s, 0.00 non-youngs/s
Pages read 105397641, created 1260374, written 58688419
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 260546, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
----------------------
INDIVIDUAL BUFFER POOL INFO
----------------------
---BUFFER POOL 0
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 0
Database pages 65535
Old database pages 24211
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4672836, not young 650179854
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26451331, created 424123, written 29642653
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 65535, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 1
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 624
Database pages 64911
Old database pages 23981
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4393568, not young 620223514
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26177531, created 275940, written 7307879
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 64911, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 2
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 440
Database pages 65095
Old database pages 24049
Modified db pages 0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 4622285, not young 629264811
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26555198, created 274203, written 5879473
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 65095, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
---BUFFER POOL 3
Buffer pool size 65535
Buffer pool size, bytes 1073725440
Free buffers 530
Database pages 65005
Old database pages 24015
Modified db pages 1
Percent of dirty pages(LRU & free pages): 0.002
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 2, single page 0
Pages made young 4415459, not young 595220641
0.00 youngs/s, 0.00 non-youngs/s
Pages read 26213581, created 286108, written 15858414
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 65005, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
1 read views open inside InnoDB
2 RW transactions active inside InnoDB
0 RO transactions active inside InnoDB
2 out of 1000 descriptors used
--OLDEST VIEW--
Normal read view
Read view low limit trx n:o 189855222
Read view up limit trx id 189855222
Read view low limit trx id 189855222
Read view individually stored trx ids:
-----------------
2 tablespace extents now reserved for B-tree split operations
Main thread process no. 11595, id 139675259873024, state: sleeping
Number of rows inserted 21120995, updated 7889795, deleted 18987619, read 27297144161
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
Number of system rows inserted 23916553, updated 0, deleted 23916550, read 23916551
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================

InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2017-05-02 01:18:26 7f08af7fc700 InnoDB: Assertion failure in thread 139675280852736 in file srv0srv.cc line 2365
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
170502 1:18:26 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.16-MariaDB
key_buffer_size=16777216
read_buffer_size=262144
max_used_connections=3
max_threads=3001
thread_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 6992161 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f09e566faab]
/usr/sbin/mysqld(handle_fatal_signal+0x475)[0x7f09e51ccef5]
/lib64/libpthread.so.0(+0xf7e0)[0x7f09e47d27e0]
/lib64/libc.so.6(gsignal+0x35)[0x7f09e2bf95e5]
/lib64/libc.so.6(abort+0x175)[0x7f09e2bfadc5]
/usr/sbin/mysqld(+0x805af5)[0x7f09e5407af5]
/lib64/libpthread.so.0(+0x7aa1)[0x7f09e47caaa1]
/lib64/libc.so.6(clone+0x6d)[0x7f09e2cafaad]

Comment by Marko Mäkelä [ 2017-05-09 ]

I see that multiple threads are waiting for some (not necessarily the same) buffer page lock (block->lock). These rw-locks are supposed to be very short-term ones. The root cause of the hang should be some thread that is holding the block->lock and either stuck for some reason, or it forgot to release the lock, possibly as part of some error handling. It appears to me that the output was truncated, so the data on the culprit thread(s) is not visible.

Is it possible to obtain a core dump of the server while it is hung? For that, you would have to enable core dumps and then kill -ABRT the mysqld process. You will not have to send the core dump to us (it necessarily contains a partial copy of your database), but you would have to be prepared to extract some data from it using GDB. For starters, I would like to see the stacks.txt produced by the following:

gdb /path/to/mysqld core
set height 0
set log file stacks.txt
set log on
thread apply all backtrace full
quit

Alternatively, would it be possible to get stack traces using http://poormansprofiler.org/ or similar?

Comment by Stefan [ 2017-05-11 ]

We tried to reproduce the crash. I don't think we could get it in the same state as first time, but here are the stack traces from the 2 crashes we had.
If you need more info we also have the coredump saved and we can provide you extracts.

Comment by Stefan [ 2017-05-12 ]

It seems that if we disable innodb_use_mtflush (let as default) together with innodb_use_fallocate (let as default).
Server is more stable (crash isn'y happening).

Comment by Marko Mäkelä [ 2017-05-15 ]

In 1_crash.log, one of the threads is mtflush_service_io() which indeed looks idle, waiting in ib_wqueue_wait(mtflush_io->wq). Normally, this work queue appears to only be appended to by buf_mtflu_flush_work_items(). The mtflush was introduced in MariaDB 10.1, and my understanding is that 10.2 uses a different multi-threaded flushing mechanism.

Among the next threads we have row_ins_clust_index_entry_low(), executed as part of an INSERT or REPLACE, which is apparently waiting for an S-latch on the index->lock of the clustered index of the table.

Then we have quite a few threads that are waiting for a ticket to enter InnoDB (innodb_thread_concurrency or innodb_commit_concurrency throttling).
Here is the potentially interesting subset of the 1_crash.log output:

      7 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,reset_sig_count=<value,ib_wqueue_wait,mtflush_service_io,at,start_thread,clone
      4 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,reset_sig_count=<value,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,level=0,,level=0,,level=0,,row_ins_clust_index_entry_low,row_ins_clust_index_entry,at,at,at,at,row_insert_for_mysql,ha_innobase::write_row,handler::ha_write_row,write_record,mysql_insert,mysql_execute_command,mysql_parse,dispatch_command,do_command,threadpool_process_request,handle_event,at,start_thread,clone
      1 select,os_thread_sleep,buf_LRU_get_free_block,buf_page_init_for_read,buf_read_page_low,buf_read_ahead_linear,buf_page_get_gen,btr_block_get_func,mtr=0x7fee232b8000),btr_pcur_move_to_next,row_search_for_mysql,ha_innobase::general_fetch,handler::ha_index_next,join_read_next,sub_select,do_select,JOIN::exec_inner,JOIN::exec,mysql_select,handle_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,threadpool_process_request,handle_event,at,start_thread,clone
      1 select,os_thread_sleep,buf_LRU_get_free_block,buf_page_create,fsp_page_create,fseg_alloc_free_page_low,fseg_alloc_free_page_general,btr_page_alloc_low,cursor=0x7fee4cb38690,,cursor=0x7fee4cb38690,,btr_cur_pessimistic_insert,row_ins_clust_index_entry_low,row_ins_clust_index_entry,at,at,at,at,row_insert_for_mysql,ha_innobase::write_row,handler::ha_write_row,write_record,mysql_insert,mysql_execute_command,mysql_parse,dispatch_command,do_command,threadpool_process_request,handle_event,at,start_thread,clone
      1 select,os_thread_sleep,btr_defragment_thread,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,reset_sig_count=<value,sync_array_wait_event,rw_lock_x_lock_func,pfs_rw_lock_x_lock_func,"",,"",,btr_get_size_and_reserved,at,at,dict_stats_analyze_index,dict_stats_update_persistent,stats_upd_option=),dict_stats_process_entry_from_recalc_pool,at,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,reset_sig_count=<value,sync_array_wait_event,rw_lock_s_lock_spin,rw_lock_s_lock_func,bpage=0x7feeb7ad3180,,bpage=0x7feeb7ad3180,,bpage=0x7feeb7ad3180,,buf_flush_try_neighbors,buf_flush_page_and_try_neighbors,min_n=401,,buf_flush_batch,buf_mtflu_flush_pool_instance,at,at,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,reset_sig_count=<value,ib_wqueue_wait,buf_mtflu_flush_work_items,buf_mtflu_flush_work_items,lsn_limit=222145676465,,buf_flush_list,page_cleaner_do_flush_batch,at,at,start_thread,clone
      1 __lll_lock_wait,_L_lock_995,pthread_mutex_lock,pfs_os_fast_mutex_lock,at,buf_flush_LRU_tail,buf_flush_lru_manager_thread,start_thread,clone

Possibly the system is trying to flush dirty blocks from the buffer pool to make some space available to an INSERT or REPLACE that needs to split a page (see btr_cur_pessimistic_insert() above). Maybe this thread is holding the clustered index X-latch, blocking the S-latch holders? Maybe it is also holding some buffer pool page X-latches, preventing the flushing from proceeding?

How small is the buffer pool compared to the working set? Do you have any buffer pool usage statistics? How many dirty pages?

I must say that while I have been working on InnoDB internals since 2003, I do not know if this kind of a scenario (running out of buffer pool in the middle of a page split while most of the buffer pool is dirty) can lead to a hang. I think that it is very well possible, but I may be missing something. If my fear is warranted, disabling the mtflush should not help, other than by affecting timings and probabilities of such scenarios.

In 2_crash.log it is a bit different: a number of threads are stuck in ha_innobase::open(), apparently waiting for dict_sys->mutex or dict_operation_lock. Two threads are waiting for a free page in the buffer pool:

      1 select,os_thread_sleep,buf_LRU_get_free_block,buf_page_init_for_read,buf_read_page_low,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level,btr_pcur_open_low,at,dict_load_table,dict_table_open_on_name,ha_innobase::open,handler::ha_open,open_table_from_share,open_table,open_and_process_table,options=...,,open_tables,tables=0x7f4d8cf5c170,,mysqld_list_fields,dispatch_command,do_command,threadpool_process_request,handle_event,at,start_thread,clone
      1 select,os_thread_sleep,buf_LRU_get_free_block,buf_page_create,fsp_page_create,fseg_alloc_free_page_low,fseg_alloc_free_page_general,btr_page_alloc_low,cursor=0x7f4da78bf790,,cursor=0x7f4da78bf790,,btr_cur_pessimistic_insert,row_ins_sec_index_entry_low,row_ins_sec_index_entry,row_ins_index_entry,at,at,at,row_insert_for_mysql,ha_innobase::write_row,handler::ha_write_row,write_record,send_data,values=),end_send,join_tab=),evaluate_join_record,sub_select,do_select,JOIN::exec_inner,JOIN::exec,mysql_select,handle_select,mysql_execute_command,mysql_parse,dispatch_command,do_command,threadpool_process_request,handle_event,at,start_thread,clone

The first one, dict_load_table(), is probably holding one or both of dict_sys->mutex and dict_operation_lock. That explains why all the other ha_innobase::open() are stuck (that is, any SQL statement that is trying to use an InnoDB table would be stuck).

I do not see anything strange in the attached server configuration, but then again, I am not that familiar with the page flushing.

Comment by Marko Mäkelä [ 2017-05-15 ]

The function buf_flush_page() appears to skip X-latched blocks, because it is calling rw_lock_s_lock_gen_nowait() [in 10.2, rw_lock_sx_lock_nowait()]. This looks reasonable to me. The only question is if log_preflush_pool_modified_pages() can get stuck due to this.

In 10.2, the page cleaner workers in pc_flush_slot() are invoking buf_flush_do_batch() to do the work. This function seems to wrap together a few function calls that the multi-threaded flushing is invoking directly. The MariaDB 10.1 multi-threaded flushing code does not seem to significantly differ from the MySQL 5.7 page_cleaner code that MariaDB 10.2 inherited.

Note: in 10.2, log_preflush_pool_modified_pages() is delegating actual work to the page_cleaner if it is active, but never to the multi-threaded flusher. Maybe this is the reason of the hang?

Also, I wonder if this hang repeats more easily with large values of innodb_buffer_pool_instances.

Comment by Stefan [ 2017-05-16 ]

I attached the buffer pool statistics from the one of the crashes that we simulated on the VM.
I'm not sure if that statistics are related to first or second crash. But we have the full core dump of the first 2 crashed

Comment by Marko Mäkelä [ 2017-05-16 ]

Hi stefan2017, it is hard to explain how to examine the core dump. I would like to see if any flushing threads are blocked because of buf_block_t::lock S-latch held by some thread.

It looks like buf_LRU_get_free_block() is not actively flushing anything; it is waiting for the flushing to happen. It is unclear to me why some threads would be stuck there even if the flushing was suspended, as long as we have some clean pages that can be evicted from the buffer pool. In the buffer pool statistics, I see that less than ⅓ of the buffer pool is dirty:

Dictionary memory allocated 624581
Buffer pool size        131070
Buffer pool size, bytes 2147450880
Free buffers            31
Database pages          129982
Old database pages      48019
Modified db pages       37397
Percent of dirty pages(LRU & free pages): 28.764
Max dirty pages percent: 75.000

In MariaDB 10.1 and 10.2, when innodb_use_mtlush=ON, the function log_preflush_pool_modified_pages() is not delegating the work to the flushing threads, like MySQL 5.7 is delegating it to the innodb_page_cleaners. I am not sure if this can lead to a hang, though. The MySQL 5.7 commit message only mentions performance, not correctness.

It is difficult to move forward if we cannot repeat this.
Does the workload involve any DDL statements, such as ALTER TABLE, DROP TABLE, OPTIMIZE TABLE, TRUNCATE TABLE?

Comment by Stefan [ 2017-05-17 ]

Hi Marko,
We reproduce this kind of crash by having a table with ~ 2 mil rows (10 columns varchar, 10 text columns) and doing a
truncate t1;
insert into t1 values ..... from select values from t2;

This query is execute simultaneously by 2-3 mysql clients.
Then same time we run a mysqldump for all tables.

Basically this was the loading during the night of the crash (moving date from one table to other and doing a backup not really in same time but we tried to stress more the locks).

Regards,
Stefan

Comment by Marko Mäkelä [ 2017-05-18 ]

stefan2017, that is good news! Can you please provide SHOW CREATE TABLE? You can obfuscate the table and column names, of course.
Please also include SHOW INDEXES IN tablename for the table as it is before TRUNCATE, in case the key distribution matters.

It would be ideal if you could upload a script that reproduces the hang. If that is not possible, it means some more extra work for us.

Comment by Elena Stepanova [ 2017-05-19 ]

stefan2017 : also, you said "same time we run a mysqldump for all tables". How many tables are "all tables", how big the total amount of data is? Does it involve system tables as well? If possible, please paste the mysqldump command line.

Comment by Marko Mäkelä [ 2017-06-13 ]

stefan2017, can you please provide more information. We were unable to reproduce this, and without being able to reproduce, we cannot fix it.

Comment by Stefan [ 2017-06-14 ]

Hi,

I was very busy.
I attached the structure and SQL used to test starting multiple clients same time.
You have to generate around 2 mil transactions.

I used a 2 instance 1 GB buffer pool in testing.

Also during the tests the Frontend app was put in loading test generating selects on other databases, and inserts in the database used in crash (t1). It is a Java Tomcat will connection pool making HTTP request and using the DB in order to add more stress.

I think you can replicate without the app but if you have a high rate inser client you can use it to stress the table t1.

Hope it help,

Comment by Elena Stepanova [ 2017-06-20 ]

stefan2017,

Thank you for this, but the question about mysqldump command still stands. If mysqldump is important in the scenario, as it appears to be, then its options are very important too. mysqldump can use different kinds of locking and/or transactions depending on the set of options.

I tried to reproduce with some most common mysqldump variations, but don't have any luck.

Comment by Stefan [ 2017-06-22 ]

Hi Elena,

mysqldump was used in order the add more stress, but it seems that only mysqldump is not enough and is not the reason.
The logs with the crashes I provided were reproduced by me having the FE app loaded 200 req/s (per request there are different 10 selects, 3 inserts on different table + 1 insert on the test table) - in total ~100 connection open in the mysql pool. the FE app is on other VM.

I tried to replicate only with mysqldump + the insert/select but is not working.
I haven't time to test without FE app.

In all cases disabling the parameters mentioned this cannot be replicated.

Regards,
Stefan

Comment by Marko Mäkelä [ 2017-06-22 ]

stefan2017, is it possible for you to provide to us some way of reproducing this? Could you privately provide us a copy of the program and necessary data?
Or can you set up a test environment that we can log into over ssh, attach gdb to a server, and then run a script that invokes the client-side programs to trigger the problem?

Marko

Comment by Marko Mäkelä [ 2017-08-10 ]

This bug has not been fixed yet, but further findings should be reported in MDEV-12496 (which this report duplicates) or MDEV-10843.

Generated at Thu Feb 08 07:59:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.