Details
-
Bug
-
Status: Needs Feedback (View Workflow)
-
Critical
-
Resolution: Unresolved
-
10.11.11
-
None
-
None
Description
This is a regression affecting the open source OpenSVC monitoring database
The issue is triggered with a 20 rows scheduler table that is not inside opensvc code but a generic web2py internal scheduler
CREATE TABLE `scheduler_task` ( |
`id` int(11) NOT NULL AUTO_INCREMENT, |
`uuid` varchar(255) DEFAULT NULL, |
`args` longtext DEFAULT NULL, |
`vars` longtext DEFAULT NULL, |
`enabled` char(1) DEFAULT NULL, |
`start_time` datetime DEFAULT NULL, |
`next_run_time` datetime DEFAULT NULL, |
`stop_time` datetime DEFAULT NULL, |
`repeats` int(11) DEFAULT NULL, |
`retry_failed` int(11) DEFAULT NULL, |
`period` int(11) DEFAULT NULL, |
`timeout` int(11) DEFAULT NULL, |
`sync_output` int(11) DEFAULT NULL, |
`times_run` int(11) DEFAULT NULL, |
`times_failed` int(11) DEFAULT NULL, |
`last_run_time` datetime DEFAULT NULL, |
`prevent_drift` char(1) DEFAULT NULL, |
`group_name` varchar(255) DEFAULT NULL, |
`function_name` varchar(255) DEFAULT NULL, |
`status` varchar(255) DEFAULT NULL, |
`task_name` varchar(255) DEFAULT NULL, |
`application_name` varchar(255) DEFAULT NULL, |
`assigned_worker_name` varchar(512) DEFAULT '', |
PRIMARY KEY (`id`), |
UNIQUE KEY `uuid` (`uuid`), |
KEY `idx_dispatch` (`assigned_worker_name`(255),`status`), |
KEY `idx_expire` (`status`,`stop_time`) |
) ENGINE=InnoDB AUTO_INCREMENT=166 DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci; |
The affected user capture a partial processlist before restarting the server
-e "SHOW FULL PROCESSLIST"
|
+---------+---------+-----------------+---------+---------+---------+---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
+---------+---------+-----------------+---------+---------+---------+---------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|
| 11 | opensvc | 127.0.0.1:32928 | opensvc | Sleep | 11 | | NULL | 0.000 |
|
| 45 | opensvc | 127.0.0.1:33220 | opensvc | Sleep | 7 | | NULL | 0.000 |
|
| 121 | opensvc | 127.0.0.1:33862 | opensvc | Sleep | 50 | | NULL | 0.000 |
|
| 186 | opensvc | 127.0.0.1:34456 | opensvc | Query | 1261600 | Updating | UPDATE `scheduler_task` SET `status`='RUNNING',`last_run_time`='2025-03-03 00:39:36' WHERE (`scheduler_task`.`id` = 19) | 0.000 |
|
| 187 | opensvc | 127.0.0.1:34460 | opensvc | Sleep | 17 | | NULL | 0.000 |
|
| 193 | opensvc | 127.0.0.1:34522 | opensvc | Sleep | 50 | | NULL | 0.000 |
|
| 195 | opensvc | 127.0.0.1:34540 | opensvc | Sleep | 49 | | NULL | 0.000 |
|
| 201 | opensvc | 127.0.0.1:51328 | opensvc | Sleep | 8 | | NULL | 0.000 |
|
| 206 | opensvc | 127.0.0.1:51374 | opensvc | Query | 1261585 | Searching rows for update | UPDATE `scheduler_task` SET `status`='EXPIRED' WHERE ((`scheduler_task`.`status` IN ('ASSIGNED','QUEUED')) AND (`scheduler_task`.`stop_time` < '2025-03-03 00:39:51')) | 0.000 |
|
| 207 | opensvc | 127.0.0.1:33216 | opensvc | Sleep | 1 | | NULL | 0.000 |
|
| 209 | opensvc | 127.0.0.1:33224 | opensvc | Sleep | 19 | | NULL | 0.000 |
|
| 208 | opensvc | 127.0.0.1:33222 | opensvc | Sleep | 19 | | NULL | 0.000 |
|
| 71278 | opensvc | 127.0.0.1:53904 | opensvc | Query | 1264427 | Updating | UPDATE `nodes` SET `last_comm`='2025-03-02 23:52:29' WHERE (`nodes`.`node_id` = 'c0994f20-1e29-42a1-9c85-68f2d512df85') | 0.000 |
|
| 71282 | opensvc | 127.0.0.1:47578 | opensvc | Query | 1264419 | Updating | UPDATE `nodes` SET `last_comm`='2025-03-02 23:52:37' WHERE (`nodes`.`node_id` = '755b77c4-9f98-4114-b993-2438cc85771d') | 0.000 |
|
| 71286 | opensvc | 127.0.0.1:56316 | opensvc | Query | 1264365 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71292 | opensvc | 127.0.0.1:59498 | opensvc | Query | 1264354 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71298 | opensvc | 127.0.0.1:34314 | opensvc | Query | 1264295 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71302 | opensvc | 127.0.0.1:59928 | opensvc | Query | 1264241 | Updating | UPDATE `nodes` SET `last_comm`='2025-03-02 23:55:35' WHERE (`nodes`.`node_id` = '775975b4-b85b-4211-ad58-278c5b51774a') | 0.000 |
|
| 71304 | opensvc | 127.0.0.1:59948 | opensvc | Query | 1264235 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71310 | opensvc | 127.0.0.1:59450 | opensvc | Query | 1264178 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71314 | opensvc | 127.0.0.1:41286 | opensvc | Query | 1264170 | Updating | UPDATE `nodes` SET `last_comm`='2025-03-02 23:56:45' WHERE (`nodes`.`node_id` = 'bce5b121-24e8-4874-bf97-97eca661c52f') | 0.000 |
|
| 71332 | opensvc | 127.0.0.1:60228 | opensvc | Query | 1264043 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71336 | opensvc | 127.0.0.1:40108 | opensvc | Query | 1263986 | Updating | UPDATE `nodes` SET `last_comm`='2025-03-02 23:59:49' WHERE (`nodes`.`node_id` = '5fe0a2ad-d284-4a94-ac7e-998a39286ecc') | 0.000 |
|
| 71340 | opensvc | 127.0.0.1:38404 | opensvc | Query | 1263981 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71346 | opensvc | 127.0.0.1:37182 | opensvc | Query | 1263923 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71350 | opensvc | 127.0.0.1:45466 | opensvc | Query | 1263869 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
| 71354 | opensvc | 127.0.0.1:57502 | opensvc | Query | 1263861 | Update | INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}') | 0.000 |
|
....
|
The following query look at innodb TRX state
|
-e "SELECT r.trx_id AS blocking_transaction, r.trx_mysql_thread_id AS blocking_thread, r.trx_state AS blocking_state, b.trx_id AS blocked_transaction, b.trx_mysql_thread_id AS blocked_thread, b.trx_state AS blocked_state FROM information_schema.innodb_lock_waits w JOIN information_schema.innodb_trx b ON w.requesting_trx_id = b.trx_id JOIN information_schema.innodb_trx r ON w.blocking_trx_id = r.trx_id;"
|
+----------------------+-----------------+----------------+---------------------+----------------+---------------+
|
| blocking_transaction | blocking_thread | blocking_state | blocked_transaction | blocked_thread | blocked_state |
|
+----------------------+-----------------+----------------+---------------------+----------------+---------------+
|
| 5650814 | 71675 | LOCK WAIT | 5650815 | 71698 | LOCK WAIT |
|
| 5650813 | 71674 | LOCK WAIT | 5650815 | 71698 | LOCK WAIT |
|
| 5650813 | 71674 | LOCK WAIT | 5650814 | 71675 | LOCK WAIT |
|
| 5622544 | 206 | RUNNING | 5650815 | 71698 | LOCK WAIT |
|
| 5622544 | 206 | RUNNING | 5650814 | 71675 | LOCK WAIT |
|
| 5622544 | 206 | RUNNING | 5650813 | 71674 | LOCK WAIT |
|
+----------------------+-----------------+----------------+---------------------+----------------+---------------+
|
4:22 PM
|
-e "SHOW EXPLAIN FOR 71675;"
|
--------------
|
SHOW EXPLAIN FOR 71675
|
--------------
|
|
ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction
|
-e "SHOW EXPLAIN FOR 71674;"
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
| 1 | SIMPLE | scheduler_task | range | idx_expire | idx_expire | 774 | NULL | 2 | Using where; Using buffer |
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
-e "SHOW EXPLAIN FOR 71698;"
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
| 1 | SIMPLE | scheduler_task | range | idx_expire | idx_expire | 774 | NULL | 2 | Using where; Using buffer |
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
-e "SHOW EXPLAIN FOR 71675;"
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|
+------+-------------+----------------+-------+---------------+------------+---------+------+------+---------------------------+
|
From a previous crash freeze on 10.11.7
*************************** 1. row ***************************
|
Type: InnoDB
|
Name:
|
Status:
|
=====================================
|
2025-02-10 15:53:40 0x7ff936f4c640 INNODB MONITOR OUTPUT
|
=====================================
|
Per second averages calculated from the last 37 seconds
|
-----------------
|
BACKGROUND THREAD
|
-----------------
|
srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 363824 srv_idle
|
srv_master_thread log flush and writes: 363735
|
----------
|
SEMAPHORES
|
----------
|
------------
|
TRANSACTIONS
|
------------
|
Trx id counter 3279232
|
Purge done for trx's n:o < 3272353 undo n:o < 0 state: running but idle
|
History list length 0
|
LIST OF TRANSACTIONS FOR EACH SESSION:
|
---TRANSACTION 3279231, ACTIVE 30 sec starting index read
|
mysql tables in use 1, locked 1
|
LOCK WAIT 2 lock struct(s), heap size 1128, 1 row lock(s)
|
MariaDB thread id 1013514, OS thread handle 140708360341056, query id 2850795 127.0.0.1 opensvc Updating
|
UPDATE `nodes` SET `last_comm`='2025-02-10 15:53:09' WHERE (`nodes`.`node_id` = '755b77c4-9f98-4114-b993-2438cc85771d')
|
Trx read view will not see trx with id >= 3279231, sees < 3272354
|
------- TRX HAS BEEN WAITING 30598033 ns FOR THIS LOCK TO BE GRANTED:
|
RECORD LOCKS space id 99 page no 4 n bits 16 index k_node_id of table `opensvc`.`nodes` trx id 3279231 lock_mode X waiting
|
Record lock, heap no 5 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
|
0: len 30; hex 37353562373763342d396639382d343131342d623939332d323433386363; asc 755b77c4-9f98-4114-b993-2438cc; (total 36 bytes);
|
1: len 4; hex 8011653e; asc e>;;
|
|
.......
|
Update
|
INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}')
|
Trx read view will not see trx with id >= 3272372, sees < 3272354
|
---TRANSACTION 3272365, ACTIVE 271477 sec inserting
|
mysql tables in use 1, locked 1
|
1 lock struct(s), heap size 1128, 0 row lock(s)
|
MariaDB thread id 136431, OS thread handle 140708701066816, query id 2300994 127.0.0.1 opensvc Update
|
INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}')
|
Trx read view will not see trx with id >= 3272365, sees < 3272354
|
---TRANSACTION 3272362, ACTIVE 271483 sec inserting
|
mysql tables in use 1, locked 1
|
1 lock struct(s), heap size 1128, 0 row lock(s)
|
MariaDB thread id 136419, OS thread handle 140708719388224, query id 2300936 127.0.0.1 opensvc Update
|
INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}')
|
Trx read view will not see trx with id >= 3272362, sees < 3272354
|
---TRANSACTION 3272356, ACTIVE 271541 sec starting index read
|
mysql tables in use 1, locked 1
|
2 lock struct(s), heap size 1128, 1 row lock(s)
|
MariaDB thread id 136357, OS thread handle 140708337927744, query id 2300670 127.0.0.1 opensvc Updating
|
UPDATE `nodes` SET `last_comm`='2025-02-07 12:27:59' WHERE (`nodes`.`node_id` = 'c0994f20-1e29-42a1-9c85-68f2d512df85')
|
Trx read view will not see trx with id >= 3272354, sees < 3272354
|
---TRANSACTION 3272355, ACTIVE 271544 sec inserting
|
mysql tables in use 1, locked 1
|
1 lock struct(s), heap size 1128, 0 row lock(s)
|
MariaDB thread id 136352, OS thread handle 140708301592128, query id 2300646 127.0.0.1 opensvc Update
|
INSERT INTO `log`(`svc_id`,`log_level`,`log_user`,`node_id`,`log_action`,`log_fmt`,`log_dict`) VALUES (NULL,'warning','feed',NULL,'node.auth','node authentication error: %(e)s from %(nodename)s','{"e": "agent xxx not registered", "nodename": "xxx"}')
|
Trx read view will not see trx with id >= 3272354, sees < 3272354
|
---TRANSACTION 3272354, ACTIVE 271545 sec updating or deleting
|
mysql tables in use 1, locked 1
|
3 lock struct(s), heap size 1128, 2 row lock(s)
|
MariaDB thread id 136349, OS thread handle 140708720309824, query id 2300631 127.0.0.1 opensvc Updating
|
UPDATE `nodes` SET `last_comm`='2025-02-07 12:27:55' WHERE (`nodes`.`node_id` = 'bce5b121-24e8-4874-bf97-97eca661c52f')
|
Trx read view will not see trx with id >= 3272354, sees < 3272354
|
---TRANSACTION (0x7ffa5cf36780), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7ffa5cf37d80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION 3272557, ACTIVE 270013 sec updating or deleting
|
mysql tables in use 1, locked 1
|
2 lock struct(s), heap size 1128, 1 row lock(s)
|
MariaDB thread id 19, OS thread handle 140708703524416, query id 2308189 127.0.0.1 opensvc Updating
|
UPDATE `scheduler_task` SET `status`='RUNNING',`last_run_time`='2025-02-07 12:53:27' WHERE (`scheduler_task`.`id` = 19)
|
Trx read view will not see trx with id >= 3272557, sees < 3272354
|
---TRANSACTION 3272558, ACTIVE 269998 sec starting index read
|
mysql tables in use 1, locked 1
|
2 lock struct(s), heap size 1128, 1 row lock(s)
|
MariaDB thread id 188, OS thread handle 140708336084544, query id 2308239 127.0.0.1 opensvc Searching rows for update
|
UPDATE `scheduler_task` SET `status`='EXPIRED' WHERE ((`scheduler_task`.`status` IN ('ASSIGNED','QUEUED')) AND (`scheduler_task`.`stop_time` < '2025-02-07 12:53:42'))
|
---TRANSACTION (0x7ffa5cf34680), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7ffa5cf33b80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
--------
|
FILE I/O
|
--------
|
Pending flushes (fsync): 0
|
1857 OS file reads, 451773 OS file writes, 257778 OS fsyncs
|
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
|
---
|
LOG
|
---
|
Log sequence number 3505933275
|
Log flushed up to 3505933275
|
Pages flushed up to 3505926452
|
Last checkpoint at 3505926452
|
----------------------
|
BUFFER POOL AND MEMORY
|
----------------------
|
Total large memory allocated 4328521728
|
Dictionary memory allocated 26740392
|
Buffer pool size 259584
|
Free buffers 257536
|
Database pages 2048
|
Old database pages 768
|
Modified db pages 4
|
Percent of dirty pages(LRU & free pages): 0.002
|
Max dirty pages percent: 40.000
|
Pending reads 0
|
Pending writes: LRU 0, flush list 4
|
Pages made young 592, not young 47425
|
0.00 youngs/s, 0.00 non-youngs/s
|
Pages read 1676, created 92317, written 279544
|
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
|
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 0 / 1000
|
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
|
LRU len: 2048, unzip_LRU len: 0
|
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
|
--------------
|
ROW OPERATIONS
|
--------------
|
138 read views open inside InnoDB
|
state: sleeping
|
----------------------------
|
END OF INNODB MONITOR OUTPUT
|
============================
|
Found in 10.11.10 that may trigger such case
MDEV-34515
This is so far all what we have but we will report progress on the issue any help on is welcome
Hi Stephane! Would it be possible to try to obtain stack traces from the mariadbd process at the time it is found to be hung? We have a fix of MDEV-36122 under review, and I was thinking that it could potentially affect this kind of hangs as well. The data that you provided so far does not allow us to draw any conclusion about this.
I checked that the default value of lock_wait_timeout seems to be 24 hours. I thought that metadata locks (MDL) originally used an even longer default timeout (1 year), but I suppose that 24 hours is already "infinite enough" for most practical purposes. If my guess is on the right track, even after MDEV-36122 was fixed, we might need a fix of MDEV-34253.