Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-36934

semi sync makes the master unresponsive when a replica is stopped

    XMLWordPrintable

Details

    • Can result in hang or crash
    • A semi-sync primary could become unresponsive and hang indefinitely if both a transaction commits and semi-sync switches off at the same time (e.g. if all replicas have disconnected).
    • Q3/2025 Maintenance

    Description

      This is something we've been observing in 10.6 for many versions but we never got the chance to capture information, as it was bringing down production.

      See attached the topology we are seeing this mostly happening on.
      Basically a primary master with some replicas and then an intermediate master in a different datacenter with some replicas.

      Most of the time, whenever we stop one of the replicas, the master becomes totally unresponsive, to the point that we need to kill -9 the process, otherwise the host won't recover back.

      To prevent this we have to disable semi-sync in order to be able to do topology changes or even stopping a replica. Any maintenance that involves stopping a replica (the daemon) normally brings down the master.

      This topology only receive INSERTS (no deletes, no updates) and REPLACE (only for the pt-heartbeat table) the INSERTS are like this:

      #Q> INSERT /* ExternalStoreDB::store  */ INTO `blobs_cluster31` (blob_text) VALUES ('<AB><AE>vO<CD>O/J,<C8>PH+<CA><CF>U(.)*M.)-JMQHI,I<AC><AD>劎vN,IM<CF>/<AA><B4>r,<CE>HI<AC>T<F0>I<CC>K<D5><CE>/-<C9><C8>//J̍<8D>EV<E4><91>_Z<9C>Z<AC><90><99><87><A2>^B\0')
      

      The table structure:

      CREATE TABLE `blobs_cluster31` (
        `blob_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
        `blob_text` longblob DEFAULT NULL,
        PRIMARY KEY (`blob_id`)
      ) ENGINE=InnoDB AUTO_INCREMENT=35529070 DEFAULT CHARSET=binary
      

      Today, we had yet again, another issue and I was able to capture a bit of what was going on on the master in case it helps:

      First a processlist when the master is starting to hung

      show processlist;
      +---------+-----------------+----------------------+--------------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
      | Id      | User            | Host                 | db           | Command     | Time   | State                                                         | Info                                                                                                 | Progress |
      +---------+-----------------+----------------------+--------------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
      |       3 | event_scheduler | localhost            | NULL         | Daemon      | 102217 | Waiting for next activation                                   | NULL                                                                                                 |    0.000 |
      |      10 | orchestrator    | 208.80.155.103:58412 | NULL         | Query       |     82 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      |      11 | orchestrator    | 208.80.155.103:58420 | NULL         | Query       |     82 | starting                                                      | show master status                                                                                   |    0.000 |
      |      13 | orchestrator    | 208.80.155.103:58428 | NULL         | Query       |     82 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      |      18 | root            | localhost            | heartbeat    | Query       |     85 | Commit                                                        | REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_mas |    0.000 |
      |     633 | repl2024        | 10.64.164.3:55152    | NULL         | Binlog Dump | 102184 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |    0.000 |
      | 2282497 | wikiadmin2023   | 10.64.0.157:34120    | NULL         | Sleep       |   1627 |                                                               | NULL                                                                                                 |    0.000 |
      | 2282498 | wikiadmin2023   | 10.64.0.157:34130    | enwiki       | Sleep       |     12 |                                                               | NULL                                                                                                 |    0.000 |
      | 2317648 | repl2024        | 10.192.48.126:52118  | NULL         | Binlog Dump |    491 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |    0.000 |
      | 2322108 | wikiadmin2023   | 10.64.32.40:45520    | NULL         | Sleep       |    392 |                                                               | NULL                                                                                                 |    0.000 |
      | 2322109 | wikiadmin2023   | 10.64.32.40:45526    | commonswiki  | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2329823 | wikiadmin2023   | 10.64.32.40:54186    | NULL         | Sleep       |    203 |                                                               | NULL                                                                                                 |    0.000 |
      | 2329824 | wikiadmin2023   | 10.64.32.40:54194    | wikidatawiki | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2334618 | repl2024        | 10.64.162.3:40068    | NULL         | Killed      |     86 | starting                                                      | NULL                                                                                                 |    0.000 |
      | 2334643 | nagios          | localhost            | NULL         | Query       |     85 | Filling schema table                                          | SHOW GLOBAL STATUS like 'Uptime'                                                                     |    0.000 |
      | 2334920 | prometheus      | localhost            | NULL         | Query       |     77 | Filling schema table                                          | SHOW GLOBAL STATUS                                                                                   |    0.000 |
      | 2335116 | orchestrator    | 208.80.155.103:37680 | NULL         | Query       |     71 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      | 2335118 | orchestrator    | 208.80.155.103:37692 | NULL         | Query       |     71 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      | 2335167 | orchestrator    | 208.80.155.103:37698 | NULL         | Query       |     71 | starting                                                      | show master status                                                                                   |    0.000 |
      | 2335592 | orchestrator    | 208.80.155.103:53948 | NULL         | Query       |     60 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      | 2335637 | orchestrator    | 208.80.155.103:53964 | NULL         | Query       |     60 | starting                                                      | show master status                                                                                   |    0.000 |
      | 2335638 | orchestrator    | 208.80.155.103:53976 | NULL         | Query       |     60 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      | 2336054 | orchestrator    | 208.80.155.103:35690 | NULL         | Query       |     49 | starting                                                      | show master status                                                                                   |    0.000 |
      | 2336057 | orchestrator    | 208.80.155.103:35704 | NULL         | Query       |     49 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      | 2336058 | orchestrator    | 208.80.155.103:35706 | NULL         | Query       |     49 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      | 2336061 | prometheus      | localhost            | NULL         | Query       |     50 | Filling schema table                                          | SHOW GLOBAL STATUS                                                                                   |    0.000 |
      | 2336534 | orchestrator    | 208.80.155.103:60256 | NULL         | Query       |     38 | starting                                                      | show master status                                                                                   |    0.000 |
      | 2336537 | orchestrator    | 208.80.155.103:60266 | NULL         | Query       |     38 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      | 2336538 | orchestrator    | 208.80.155.103:60272 | NULL         | Query       |     38 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      | 2336632 | nagios          | localhost            | NULL         | Query       |     37 | Filling schema table                                          | SHOW GLOBAL STATUS like 'Uptime'                                                                     |    0.000 |
      | 2336988 | orchestrator    | 208.80.155.103:50100 | NULL         | Query       |     28 | starting                                                      | show master status                                                                                   |    0.000 |
      | 2336989 | orchestrator    | 208.80.155.103:50104 | NULL         | Query       |     28 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      | 2336990 | orchestrator    | 208.80.155.103:50120 | NULL         | Query       |     28 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      | 2337083 | nagios          | localhost            | NULL         | Query       |     26 | Filling schema table                                          | SHOW GLOBAL STATUS like 'Uptime'                                                                     |    0.000 |
      | 2337088 | repl2024        | 10.64.162.3:44224    | NULL         | Binlog Dump |     26 | starting                                                      | NULL                                                                                                 |    0.000 |
      | 2337217 | cumin2024       | 10.64.48.98:51970    | NULL         | Query       |      0 | starting                                                      | show processlist                                                                                     |    0.000 |
      | 2337342 | orchestrator    | 208.80.155.103:58136 | NULL         | Query       |     17 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      | 2337381 | prometheus      | localhost            | NULL         | Query       |     17 | Filling schema table                                          | SHOW GLOBAL STATUS                                                                                   |    0.000 |
      | 2337382 | orchestrator    | 208.80.155.103:58148 | NULL         | Query       |     17 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      | 2337385 | orchestrator    | 208.80.155.103:58160 | NULL         | Query       |     17 | starting                                                      | show master status                                                                                   |    0.000 |
      | 2337710 | orchestrator    | 208.80.155.103:42160 | NULL         | Query       |      6 | Filling schema table                                          | show global status like 'rpl_semi_sync_%'                                                            |    0.000 |
      | 2337750 | orchestrator    | 208.80.155.103:52934 | NULL         | Query       |      6 | starting                                                      | show master status                                                                                   |    0.000 |
      | 2337752 | orchestrator    | 208.80.155.103:52944 | NULL         | Query       |      6 | Filling schema table                                          | show global status like 'Uptime'                                                                     |    0.000 |
      | 2337847 | wikiuser2023    | 10.67.166.25:57532   | NULL         | Sleep       |      3 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337848 | wikiuser2023    | 10.67.166.25:57542   | wikidatawiki | Sleep       |      3 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337900 | wikiuser2023    | 10.67.133.90:33902   | NULL         | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337902 | wikiuser2023    | 10.67.133.90:33908   | zhwiki       | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337903 | wikiuser2023    | 10.67.151.107:36602  | NULL         | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337904 | wikiuser2023    | 10.67.151.107:36606  | ruwiki       | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337911 | wikiuser2023    | 10.67.135.219:40606  | NULL         | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337912 | wikiuser2023    | 10.67.135.219:40620  | frwiki       | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337919 | wikiuser2023    | 10.67.190.243:34464  | NULL         | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337920 | wikiuser2023    | 10.67.190.243:34476  | enwiki       | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337921 | wikiuser2023    | 10.67.176.214:34130  | NULL         | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      | 2337922 | wikiuser2023    | 10.67.176.214:34140  | wikidatawiki | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
      +---------+-----------------+----------------------+--------------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
      55 rows in set (0.001 sec)
       
      
      

      | InnoDB |      |
      =====================================
      2025-06-03 11:57:47 0x7f2c151066c0 INNODB MONITOR OUTPUT
      =====================================
      Per second averages calculated from the last 7 seconds
      -----------------
      BACKGROUND THREAD
      -----------------
      srv_master_thread loops: 0 srv_active, 0 srv_shutdown, 102207 srv_idle
      srv_master_thread log flush and writes: 102207
      ----------
      SEMAPHORES
      ----------
      ------------
      TRANSACTIONS
      ------------
      Trx id counter 84539341991
      Purge done for trx's n:o < 84539341990 undo n:o < 0 state: running but idle
      History list length 0
      LIST OF TRANSACTIONS FOR EACH SESSION:
      ---TRANSACTION (0x7f2c1cb59680), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6f680), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6d580), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb64680), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb5e380), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb70c80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6e080), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb60480), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6bf80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6ca80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb67d80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb5a180), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb5b780), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6a980), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb70180), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb61a80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb65180), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb66780), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb67280), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb60f80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb5ee80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb5ac80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6b480), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb69380), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb68880), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb65c80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb5c280), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb62580), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb63b80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb69e80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb6eb80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb5f980), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb63080), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION 84539341990, ACTIVE (PREPARED) 115 sec
      mysql tables in use 1, locked 1
      2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
      MariaDB thread id 18, OS thread handle 139617629841088, query id 13380306 localhost root Commit
      REPLACE INTO `heartbeat`.`heartbeat` (ts, server_id, file, position, relay_master_log_file, exec_master_log_pos, shard, datacenter) VALUES ('2025-06-03T11:55:52.001680', '172005380', 'es1035-bin.002722', '988815510', NULL, NULL, 'es7', 'eqiad')
      ---TRANSACTION (0x7f2c1cb5cd80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      ---TRANSACTION (0x7f2c1cb58b80), not started
      0 lock struct(s), heap size 1128, 0 row lock(s)
      --------
      FILE I/O
      --------
      Pending flushes (fsync) log: 0; buffer pool: 0
      5906126 OS file reads, 1201298 OS file writes, 1004218 OS fsyncs
      72.13 reads/s, 16384 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
      -------------------------------------
      INSERT BUFFER AND ADAPTIVE HASH INDEX
      -------------------------------------
      Ibuf: size 1, free list len 0, seg size 2, 0 merges
      merged operations:
       insert 0, delete mark 0, delete 0
      discarded operations:
       insert 0, delete mark 0, delete 0
      0.00 hash searches/s, 0.00 non-hash searches/s
      ---
      LOG
      ---
      Log sequence number 397685236779204
      Log flushed up to   397685236779204
      Pages flushed up to 397683737361130
      Last checkpoint at  397683737361130
      0 pending log flushes, 0 pending chkp writes
      992399 log i/o's done, 0.00 log i/o's/second
      ----------------------
      BUFFER POOL AND MEMORY
      ----------------------
      Total large memory allocated 202970759168
      Dictionary memory allocated 1206734704
      Buffer pool size   12265344
      Free buffers       6053136
      Database pages     6212208
      Old database pages 2293198
      Modified db pages  103290
      Percent of dirty pages(LRU & free pages): 0.842
      Max dirty pages percent: 75.000
      Pending reads 0
      Pending writes: LRU 0, flush list 0
      Pages made young 121898, not young 2297288
      5.28 youngs/s, 71.70 non-youngs/s
      Pages read 5906507, created 306349, written 209188
      72.13 reads/s, 0.00 creates/s, 0.00 writes/s
      Buffer pool hit rate 783 / 1000, young-making rate 15 / 1000 not 215 / 1000
      Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
      LRU len: 6212208, unzip_LRU len: 0
      I/O sum[4949]:cur[505], unzip sum[0]:cur[0]
      --------------
      ROW OPERATIONS
      --------------
      0 read views open inside InnoDB
      Process ID=0, Main thread ID=0, state: sleeping
      Number of rows inserted 785353, updated 102131, deleted 15, read 3357447
      0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 81.42 reads/s
      Number of system rows inserted 0, updated 0, deleted 0, read 6
      0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
      ----------------------------
      END OF INNODB MONITOR OUTPUT
      ============================
      
      

      Find also attached the backtraces of the process on the master.

      Our semi-sync config
      master

       
      mysql:root@localhost [(none)]> show global variables like '%semi%';
      +---------------------------------------+--------------+
      | Variable_name                         | Value        |
      +---------------------------------------+--------------+
      | rpl_semi_sync_master_enabled          | ON           |
      | rpl_semi_sync_master_timeout          | 100          |
      | rpl_semi_sync_master_trace_level      | 32           |
      | rpl_semi_sync_master_wait_no_slave    | OFF          |
      | rpl_semi_sync_master_wait_point       | AFTER_COMMIT |
      | rpl_semi_sync_slave_delay_master      | OFF          |
      | rpl_semi_sync_slave_enabled           | OFF          |
      | rpl_semi_sync_slave_kill_conn_timeout | 5            |
      | rpl_semi_sync_slave_trace_level       | 32           |
      +---------------------------------------+--------------+
      

      Slaves:

      +---------------------------------------+--------------+
      | Variable_name                         | Value        |
      +---------------------------------------+--------------+
      | rpl_semi_sync_master_enabled          | OFF          |
      | rpl_semi_sync_master_timeout          | 100          |
      | rpl_semi_sync_master_trace_level      | 32           |
      | rpl_semi_sync_master_wait_no_slave    | OFF          |
      | rpl_semi_sync_master_wait_point       | AFTER_COMMIT |
      | rpl_semi_sync_slave_delay_master      | OFF          |
      | rpl_semi_sync_slave_enabled           | ON           |
      | rpl_semi_sync_slave_kill_conn_timeout | 5            |
      | rpl_semi_sync_slave_trace_level       | 32           |
      +---------------------------------------+--------------+
      9 rows in set (0.002 sec)
      

      If there is some more information neeed, please let me know.

      Attachments

        1. Captura de pantalla 2025-06-03 a las 14.16.47.png
          55 kB
          Manuel Arostegui
        2. mariadbd_full_bt_all_threads.txt
          2.98 MB
          Manuel Arostegui
        3. mariadbd_full_bt_all_threads.txt.1
          5.67 MB
          Manuel Arostegui
        4. mariadbd_full_bt_all_threads.txt.2
          1.95 MB
          Manuel Arostegui
        5. mariadbd_full_bt_all_threads-1.txt
          4.68 MB
          Manuel Arostegui
        6. master_mariadbd_full_bt_all_threads.txt
          2.21 MB
          Manuel Arostegui
        7. master_mariadbd_full_bt_all_threads.txt.1
          2.54 MB
          Manuel Arostegui
        8. master_mariadbd_full_bt_all_threads.txt.2
          2.64 MB
          Manuel Arostegui
        9. master_shutdown.txt
          2.70 MB
          Manuel Arostegui
        10. mdev_36934_test.patch
          10 kB
          Brandon Nesterenko
        11. mdev36934-v1.patch
          7 kB
          Kristian Nielsen
        12. replica_mariadbd_full_bt_all_threads.txt
          1.48 MB
          Manuel Arostegui
        13. replica_mariadbd_full_bt_all_threads.txt.1
          1.58 MB
          Manuel Arostegui

        Issue Links

          Activity

            People

              knielsen Kristian Nielsen
              marostegui Manuel Arostegui
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.