[MDEV-11296] InnoDB stalls under OLTP RW on P8 Created: 2016-11-16  Updated: 2016-11-25  Resolved: 2016-11-25

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2
Fix Version/s: 10.2.3

Type: Bug Priority: Major
Reporter: Sergey Vojtovich Assignee: Sergey Vojtovich
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mysqld10.err     Text File mysqld10.txt     File run.mysqld.p8    

 Description   

Running 10 threads slightly modified OLTP RW on Power8 hangs:

sysbench 1.0:  multi-threaded system evaluation benchmark
 
Running the test with following options:
Number of threads: 10
Report intermediate results every 10 second(s)
Initializing random number generator from seed (42).
 
 
Initializing worker threads...
 
Threads started!
 
[  10s] threads: 10, tps: 1514.48, reads: 7665.91, writes: 54882.19, response time: 12.08ms (95%), errors: 17.70, reconnects:  0.00
[  20s] threads: 10, tps: 1595.89, reads: 8061.07, writes: 57769.51, response time: 11.35ms (95%), errors: 16.40, reconnects:  0.00
[  30s] threads: 10, tps: 1623.60, reads: 8194.80, writes: 58757.63, response time: 11.28ms (95%), errors: 15.30, reconnects:  0.00
[  40s] threads: 10, tps: 1612.90, reads: 8150.98, writes: 58388.66, response time: 11.23ms (95%), errors: 17.30, reconnects:  0.00
[  50s] threads: 10, tps: 1630.11, reads: 8225.34, writes: 58968.35, response time: 11.13ms (95%), errors: 15.10, reconnects:  0.00
[  60s] threads: 10, tps: 1618.00, reads: 8182.58, writes: 58589.23, response time: 11.28ms (95%), errors: 18.50, reconnects:  0.00
[  80s] threads: 10, tps: 1433.49, reads: 7240.15, writes: 51899.38, response time: 11.13ms (95%), errors: 14.40, reconnects:  0.00
[  90s] threads: 10, tps: 1620.61, reads: 8171.75, writes: 58636.19, response time: 11.13ms (95%), errors: 13.70, reconnects:  0.00
[ 100s] threads: 10, tps: 1309.59, reads: 6610.07, writes: 47377.28, response time: 11.13ms (95%), errors: 12.60, reconnects:  0.00
[ 110s] threads: 10, tps: 1370.81, reads: 6926.83, writes: 49623.40, response time: 11.11ms (95%), errors: 14.60, reconnects:  0.00
[ 120s] threads: 10, tps: 1634.79, reads: 8248.27, writes: 59145.77, response time: 10.98ms (95%), errors: 14.70, reconnects:  0.00
[ 130s] threads: 10, tps: 1628.70, reads: 8224.02, writes: 58949.45, response time: 11.13ms (95%), errors: 16.20, reconnects:  0.00
[ 140s] threads: 10, tps: 1636.70, reads: 8266.78, writes: 59239.05, response time: 10.98ms (95%), errors: 16.50, reconnects:  0.00
[ 150s] threads: 10, tps: 1637.31, reads: 8270.93, writes: 59281.33, response time: 11.03ms (95%), errors: 16.90, reconnects:  0.00
[ 160s] threads: 10, tps: 1631.79, reads: 8239.67, writes: 59051.96, response time: 11.13ms (95%), errors: 16.30, reconnects:  0.00
[ 170s] threads: 10, tps: 1629.71, reads: 8231.64, writes: 58985.49, response time: 11.09ms (95%), errors: 16.50, reconnects:  0.00
[ 180s] threads: 10, tps: 1641.50, reads: 8285.48, writes: 59382.38, response time: 11.09ms (95%), errors: 15.60, reconnects:  0.00
[ 190s] threads: 10, tps: 1629.10, reads: 8234.81, writes: 58995.58, response time: 11.15ms (95%), errors: 17.90, reconnects:  0.00
[ 200s] threads: 10, tps: 1628.10, reads: 8213.18, writes: 58892.87, response time: 11.17ms (95%), errors: 14.60, reconnects:  0.00
[ 210s] threads: 10, tps: 1637.01, reads: 8263.33, writes: 59227.40, response time: 11.03ms (95%), errors: 15.60, reconnects:  0.00
[ 220s] threads: 10, tps: 1610.09, reads: 8132.07, writes: 58272.12, response time: 11.05ms (95%), errors: 16.50, reconnects:  0.00
[ 230s] threads: 10, tps: 1492.80, reads: 7534.02, writes: 54017.07, response time: 11.07ms (95%), errors: 14.10, reconnects:  0.00
[ 240s] threads: 10, tps: 1628.69, reads: 8220.87, writes: 58926.31, response time: 11.12ms (95%), errors: 15.30, reconnects:  0.00
[ 250s] threads: 10, tps: 1632.60, reads: 8253.01, writes: 59127.11, response time: 11.14ms (95%), errors: 17.90, reconnects:  0.00
[ 260s] threads: 10, tps: 1639.50, reads: 8268.29, writes: 59305.79, response time: 10.98ms (95%), errors: 14.10, reconnects:  0.00
[ 270s] threads: 10, tps: 1617.61, reads: 8171.73, writes: 58562.89, response time: 11.19ms (95%), errors: 16.70, reconnects:  0.00
[ 280s] threads: 10, tps: 1628.29, reads: 8238.47, writes: 58998.69, response time: 11.12ms (95%), errors: 19.60, reconnects:  0.00
[ 290s] threads: 10, tps: 306.50, reads: 1543.90, writes: 11068.33, response time: 11.03ms (95%), errors: 2.50, reconnects:  0.00
[ 300s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 310s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 320s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 330s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 340s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 350s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 360s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 370s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 380s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 390s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 400s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 410s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 420s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 430s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 440s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 450s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 460s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 470s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 480s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 490s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 500s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 510s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 520s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 530s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 540s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 550s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00
[ 560s] threads: 10, tps: 0.00, reads: 0.00, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects:  0.00

Attached please see error log with monitor dump, gdb traces and script that was used to run sysbench and mysqld.



 Comments   
Comment by Sergey Vojtovich [ 2016-11-16 ]

Assigned to Jan for now, but I'll keep analysing this issue meanwhile.

Comment by Sergey Vojtovich [ 2016-11-22 ]

From mysqld10.err:

--Thread 70334649302368 has waited at trx0undo.ic line 191 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x3ffde1a37148 created in file buf0buf.cc line 1504
number of readers 0, waiters flag 1, lock_word: 20000000
Last time read locked in file trx0undo.ic line 191
Last time write locked in file /home/mariadb/svoj/mariadb/storage/innobase/include/trx0undo.ic line 171

Thread is waiting for S-lock on 0x3ffde1a37148, which is unlocked (lock_word= 20000000). There're no concurrent X-locks waiting, which likely means there is again this mess around waiters flag. Pretty much like we had with mutexes.

Versions before 10.2 are not that badly affected, because we had this periodic wakeup.

Comment by Sergey Vojtovich [ 2016-11-22 ]

Another interesting observation:

--Thread 70334619417952 has waited at btr0sea.ic line 128 for 235.00 seconds the semaphore:
S-lock on RW-latch at 0x1009bdc5de0 created in file btr0sea.cc line 195
a writer (thread id 70334619417952) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file btr0sea.ic line 128
Last time write locked in file /home/mariadb/source/server/storage/innobase/include/btr0sea.ic line 90

Thread is attempting to acquire S-lock, while it is holding X-lock. This doesn't look like acceptable scenario.

Comment by Sergey Vojtovich [ 2016-11-24 ]

Fixes are in bb-10.2-svoj. Will assign to Jan for review after confirmation from IBM side that the fixes work.

Comment by Sergey Vojtovich [ 2016-11-25 ]

jplindst, please review 5 patches for this bug at https://github.com/MariaDB/server/commits/bb-10.2-svoj

First 4 patches are just clean-ups. Last patch actually fixes this problem.

Comment by Jan Lindström (Inactive) [ 2016-11-25 ]

ok to push.

Generated at Thu Feb 08 07:48:51 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.