Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.11.11, 10.6.22
-
None
-
debian bookworm
-
Can result in hang or crash
-
A semi-sync primary could become unresponsive and hang indefinitely if both a transaction commits and semi-sync switches off at the same time (e.g. if all replicas have disconnected).
-
Q3/2025 Maintenance
Description
This is something we've been observing in 10.6 for many versions but we never got the chance to capture information, as it was bringing down production.
See attached the topology we are seeing this mostly happening on.
Basically a primary master with some replicas and then an intermediate master in a different datacenter with some replicas.
Most of the time, whenever we stop one of the replicas, the master becomes totally unresponsive, to the point that we need to kill -9 the process, otherwise the host won't recover back.
To prevent this we have to disable semi-sync in order to be able to do topology changes or even stopping a replica. Any maintenance that involves stopping a replica (the daemon) normally brings down the master.
This topology only receive INSERTS (no deletes, no updates) and REPLACE (only for the pt-heartbeat table) the INSERTS are like this:
#Q> INSERT /* ExternalStoreDB::store */ INTO `blobs_cluster31` (blob_text) VALUES ('<AB><AE>vO<CD>O/J,<C8>PH+<CA><CF>U(.)*M.)-JMQHI,I<AC><AD>劎vN,IM<CF>/<AA><B4>r,<CE>HI<AC>T<F0>I<CC>K<D5><CE>/-<C9><C8>//J̍<8D>EV<E4><91>_Z<9C>Z<AC><90><99><87><A2>^B\0')
|
The table structure:
CREATE TABLE `blobs_cluster31` (
|
`blob_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
|
`blob_text` longblob DEFAULT NULL,
|
PRIMARY KEY (`blob_id`)
|
) ENGINE=InnoDB AUTO_INCREMENT=35529070 DEFAULT CHARSET=binary
|
Today, we had yet again, another issue and I was able to capture a bit of what was going on on the master in case it helps:
First a processlist when the master is starting to hung
show processlist;
|
+---------+-----------------+----------------------+--------------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
+---------+-----------------+----------------------+--------------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|
| 3 | event_scheduler | localhost | NULL | Daemon | 102217 | Waiting for next activation | NULL | 0.000 |
|
| 10 | orchestrator | 208.80.155.103:58412 | NULL | Query | 82 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 11 | orchestrator | 208.80.155.103:58420 | NULL | Query | 82 | starting | show master status | 0.000 |
|
| 13 | orchestrator | 208.80.155.103:58428 | NULL | Query | 82 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 18 | root | localhost | heartbeat | Query | 85 | Commit | REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_mas | 0.000 |
|
| 633 | repl2024 | 10.64.164.3:55152 | NULL | Binlog Dump | 102184 | Master has sent all binlog to slave; waiting for more updates | NULL | 0.000 |
|
| 2282497 | wikiadmin2023 | 10.64.0.157:34120 | NULL | Sleep | 1627 | | NULL | 0.000 |
|
| 2282498 | wikiadmin2023 | 10.64.0.157:34130 | enwiki | Sleep | 12 | | NULL | 0.000 |
|
| 2317648 | repl2024 | 10.192.48.126:52118 | NULL | Binlog Dump | 491 | Master has sent all binlog to slave; waiting for more updates | NULL | 0.000 |
|
| 2322108 | wikiadmin2023 | 10.64.32.40:45520 | NULL | Sleep | 392 | | NULL | 0.000 |
|
| 2322109 | wikiadmin2023 | 10.64.32.40:45526 | commonswiki | Sleep | 0 | | NULL | 0.000 |
|
| 2329823 | wikiadmin2023 | 10.64.32.40:54186 | NULL | Sleep | 203 | | NULL | 0.000 |
|
| 2329824 | wikiadmin2023 | 10.64.32.40:54194 | wikidatawiki | Sleep | 0 | | NULL | 0.000 |
|
| 2334618 | repl2024 | 10.64.162.3:40068 | NULL | Killed | 86 | starting | NULL | 0.000 |
|
| 2334643 | nagios | localhost | NULL | Query | 85 | Filling schema table | SHOW GLOBAL STATUS like 'Uptime' | 0.000 |
|
| 2334920 | prometheus | localhost | NULL | Query | 77 | Filling schema table | SHOW GLOBAL STATUS | 0.000 |
|
| 2335116 | orchestrator | 208.80.155.103:37680 | NULL | Query | 71 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 2335118 | orchestrator | 208.80.155.103:37692 | NULL | Query | 71 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 2335167 | orchestrator | 208.80.155.103:37698 | NULL | Query | 71 | starting | show master status | 0.000 |
|
| 2335592 | orchestrator | 208.80.155.103:53948 | NULL | Query | 60 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 2335637 | orchestrator | 208.80.155.103:53964 | NULL | Query | 60 | starting | show master status | 0.000 |
|
| 2335638 | orchestrator | 208.80.155.103:53976 | NULL | Query | 60 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 2336054 | orchestrator | 208.80.155.103:35690 | NULL | Query | 49 | starting | show master status | 0.000 |
|
| 2336057 | orchestrator | 208.80.155.103:35704 | NULL | Query | 49 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 2336058 | orchestrator | 208.80.155.103:35706 | NULL | Query | 49 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 2336061 | prometheus | localhost | NULL | Query | 50 | Filling schema table | SHOW GLOBAL STATUS | 0.000 |
|
| 2336534 | orchestrator | 208.80.155.103:60256 | NULL | Query | 38 | starting | show master status | 0.000 |
|
| 2336537 | orchestrator | 208.80.155.103:60266 | NULL | Query | 38 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 2336538 | orchestrator | 208.80.155.103:60272 | NULL | Query | 38 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 2336632 | nagios | localhost | NULL | Query | 37 | Filling schema table | SHOW GLOBAL STATUS like 'Uptime' | 0.000 |
|
| 2336988 | orchestrator | 208.80.155.103:50100 | NULL | Query | 28 | starting | show master status | 0.000 |
|
| 2336989 | orchestrator | 208.80.155.103:50104 | NULL | Query | 28 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 2336990 | orchestrator | 208.80.155.103:50120 | NULL | Query | 28 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 2337083 | nagios | localhost | NULL | Query | 26 | Filling schema table | SHOW GLOBAL STATUS like 'Uptime' | 0.000 |
|
| 2337088 | repl2024 | 10.64.162.3:44224 | NULL | Binlog Dump | 26 | starting | NULL | 0.000 |
|
| 2337217 | cumin2024 | 10.64.48.98:51970 | NULL | Query | 0 | starting | show processlist | 0.000 |
|
| 2337342 | orchestrator | 208.80.155.103:58136 | NULL | Query | 17 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 2337381 | prometheus | localhost | NULL | Query | 17 | Filling schema table | SHOW GLOBAL STATUS | 0.000 |
|
| 2337382 | orchestrator | 208.80.155.103:58148 | NULL | Query | 17 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 2337385 | orchestrator | 208.80.155.103:58160 | NULL | Query | 17 | starting | show master status | 0.000 |
|
| 2337710 | orchestrator | 208.80.155.103:42160 | NULL | Query | 6 | Filling schema table | show global status like 'rpl_semi_sync_%' | 0.000 |
|
| 2337750 | orchestrator | 208.80.155.103:52934 | NULL | Query | 6 | starting | show master status | 0.000 |
|
| 2337752 | orchestrator | 208.80.155.103:52944 | NULL | Query | 6 | Filling schema table | show global status like 'Uptime' | 0.000 |
|
| 2337847 | wikiuser2023 | 10.67.166.25:57532 | NULL | Sleep | 3 | | NULL | 0.000 |
|
| 2337848 | wikiuser2023 | 10.67.166.25:57542 | wikidatawiki | Sleep | 3 | | NULL | 0.000 |
|
| 2337900 | wikiuser2023 | 10.67.133.90:33902 | NULL | Sleep | 0 | | NULL | 0.000 |
|
| 2337902 | wikiuser2023 | 10.67.133.90:33908 | zhwiki | Sleep | 0 | | NULL | 0.000 |
|
| 2337903 | wikiuser2023 | 10.67.151.107:36602 | NULL | Sleep | 0 | | NULL | 0.000 |
|
| 2337904 | wikiuser2023 | 10.67.151.107:36606 | ruwiki | Sleep | 0 | | NULL | 0.000 |
|
| 2337911 | wikiuser2023 | 10.67.135.219:40606 | NULL | Sleep | 0 | | NULL | 0.000 |
|
| 2337912 | wikiuser2023 | 10.67.135.219:40620 | frwiki | Sleep | 0 | | NULL | 0.000 |
|
| 2337919 | wikiuser2023 | 10.67.190.243:34464 | NULL | Sleep | 0 | | NULL | 0.000 |
|
| 2337920 | wikiuser2023 | 10.67.190.243:34476 | enwiki | Sleep | 0 | | NULL | 0.000 |
|
| 2337921 | wikiuser2023 | 10.67.176.214:34130 | NULL | Sleep | 0 | | NULL | 0.000 |
|
| 2337922 | wikiuser2023 | 10.67.176.214:34140 | wikidatawiki | Sleep | 0 | | NULL | 0.000 |
|
+---------+-----------------+----------------------+--------------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|
55 rows in set (0.001 sec)
|
|
|
| InnoDB | |
|
=====================================
|
2025-06-03 11:57:47 0x7f2c151066c0 INNODB MONITOR OUTPUT
|
=====================================
|
Per second averages calculated from the last 7 seconds
|
-----------------
|
BACKGROUND THREAD
|
-----------------
|
srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 102207 srv_idle
|
srv_master_thread log flush and writes: 102207
|
----------
|
SEMAPHORES
|
----------
|
------------
|
TRANSACTIONS
|
------------
|
Trx id counter 84539341991
|
Purge done for trx's n:o < 84539341990 undo n:o < 0 state: running but idle
|
History list length 0
|
LIST OF TRANSACTIONS FOR EACH SESSION:
|
---TRANSACTION (0x7f2c1cb59680), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6f680), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6d580), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb64680), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb5e380), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb70c80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6e080), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb60480), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6bf80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6ca80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb67d80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb5a180), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb5b780), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6a980), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb70180), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb61a80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb65180), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb66780), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb67280), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb60f80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb5ee80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb5ac80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6b480), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb69380), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb68880), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb65c80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb5c280), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb62580), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb63b80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb69e80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb6eb80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb5f980), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb63080), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION 84539341990, ACTIVE (PREPARED) 115 sec
|
mysql tables in use 1, locked 1
|
2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
|
MariaDB thread id 18, OS thread handle 139617629841088, query id 13380306 localhost root Commit
|
REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2025-06-03T11:55:52.001680', '172005380', 'es1035-bin.002722', '988815510', NULL, NULL, 'es7', 'eqiad')
|
---TRANSACTION (0x7f2c1cb5cd80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
---TRANSACTION (0x7f2c1cb58b80), not started
|
0 lock struct(s), heap size 1128, 0 row lock(s)
|
--------
|
FILE I/O
|
--------
|
Pending flushes (fsync) log: 0; buffer pool: 0
|
5906126 OS file reads, 1201298 OS file writes, 1004218 OS fsyncs
|
72.13 reads/s, 16384 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
|
-------------------------------------
|
INSERT BUFFER AND ADAPTIVE HASH INDEX
|
-------------------------------------
|
Ibuf: size 1, free list len 0, seg size 2, 0 merges
|
merged operations:
|
insert 0, delete mark 0, delete 0
|
discarded operations:
|
insert 0, delete mark 0, delete 0
|
0.00 hash searches/s, 0.00 non-hash searches/s
|
---
|
LOG
|
---
|
Log sequence number 397685236779204
|
Log flushed up to 397685236779204
|
Pages flushed up to 397683737361130
|
Last checkpoint at 397683737361130
|
0 pending log flushes, 0 pending chkp writes
|
992399 log i/o's done, 0.00 log i/o's/second
|
----------------------
|
BUFFER POOL AND MEMORY
|
----------------------
|
Total large memory allocated 202970759168
|
Dictionary memory allocated 1206734704
|
Buffer pool size 12265344
|
Free buffers 6053136
|
Database pages 6212208
|
Old database pages 2293198
|
Modified db pages 103290
|
Percent of dirty pages(LRU & free pages): 0.842
|
Max dirty pages percent: 75.000
|
Pending reads 0
|
Pending writes: LRU 0, flush list 0
|
Pages made young 121898, not young 2297288
|
5.28 youngs/s, 71.70 non-youngs/s
|
Pages read 5906507, created 306349, written 209188
|
72.13 reads/s, 0.00 creates/s, 0.00 writes/s
|
Buffer pool hit rate 783 / 1000, young-making rate 15 / 1000 not 215 / 1000
|
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
|
LRU len: 6212208, unzip_LRU len: 0
|
I/O sum[4949]:cur[505], unzip sum[0]:cur[0]
|
--------------
|
ROW OPERATIONS
|
--------------
|
0 read views open inside InnoDB
|
Process ID=0, Main thread ID=0, state: sleeping
|
Number of rows inserted 785353, updated 102131, deleted 15, read 3357447
|
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 81.42 reads/s
|
Number of system rows inserted 0, updated 0, deleted 0, read 6
|
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
|
----------------------------
|
END OF INNODB MONITOR OUTPUT
|
============================
|
|
Find also attached the backtraces of the process on the master.
Our semi-sync config
master
|
mysql:root@localhost [(none)]> show global variables like '%semi%';
|
+---------------------------------------+--------------+
|
| Variable_name | Value |
|
+---------------------------------------+--------------+
|
| rpl_semi_sync_master_enabled | ON |
|
| rpl_semi_sync_master_timeout | 100 |
|
| rpl_semi_sync_master_trace_level | 32 |
|
| rpl_semi_sync_master_wait_no_slave | OFF |
|
| rpl_semi_sync_master_wait_point | AFTER_COMMIT |
|
| rpl_semi_sync_slave_delay_master | OFF |
|
| rpl_semi_sync_slave_enabled | OFF |
|
| rpl_semi_sync_slave_kill_conn_timeout | 5 |
|
| rpl_semi_sync_slave_trace_level | 32 |
|
+---------------------------------------+--------------+
|
Slaves:
+---------------------------------------+--------------+
|
| Variable_name | Value |
|
+---------------------------------------+--------------+
|
| rpl_semi_sync_master_enabled | OFF |
|
| rpl_semi_sync_master_timeout | 100 |
|
| rpl_semi_sync_master_trace_level | 32 |
|
| rpl_semi_sync_master_wait_no_slave | OFF |
|
| rpl_semi_sync_master_wait_point | AFTER_COMMIT |
|
| rpl_semi_sync_slave_delay_master | OFF |
|
| rpl_semi_sync_slave_enabled | ON |
|
| rpl_semi_sync_slave_kill_conn_timeout | 5 |
|
| rpl_semi_sync_slave_trace_level | 32 |
|
+---------------------------------------+--------------+
|
9 rows in set (0.002 sec)
|
If there is some more information neeed, please let me know.
Attachments
Issue Links
- is caused by
-
MDEV-33551 Semi-sync Wait Point AFTER_COMMIT Slow on Workloads with Heavy Concurrency
-
- Closed
-
- relates to
-
MDEV-34122 Assertion `entry' failed in Active_tranx::assert_thd_is_waiter
-
- Closed
-