[MDEV-33202] Slave SQL replication hung/unresponsive Created: 2024-01-08  Updated: 2024-01-14

Status: Needs Feedback
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.6.16, 10.11.6
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Oliver Kelly Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Server: Ubuntu 20.04.6 LTS (GNU/Linux 5.4.0-169-generic x86_64) as well as Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-91-generic x86_64)
DB Package: mariadb-server 1:10.11.6+maria~ubu2204 and/or 1:10.11.6+maria~ubu2004 (also observed on 10.6.16+maria~ubu2004)
Underlying Platform: Virtual Machine running on VMware ESXi, 6.7.0, 14320388, VM compatibility set to both VM version 11 (6.0+) and VM version 15 (6.7U2+)



 Description   

We operate a number of MariaDB server instances across two data centres, primarily version 10.6 but also some 10.11 instances. Most, but not all, instances run Drupal 8 or Drupal 10 sites. We generally run PROD workloads in our primary data centre and SLAVE replicas in our secondary data centre; the hardware and VMware ESXi versions are the same across both data centres.

Starting sometime within the past 4-6 weeks, our MariaDB SLAVE replicas have begun hanging with the slave SQL thread unresponsive. The slave SQL thread makes no progress and `Seconds_Behind_Master` grows continuously. The stuck query cannot be killed within MariaDB, and the `mariadb.service` or `mariadbd` process cannot be stopped or restarted at the OS level; nor can the OS be rebooted, as systemd waits indefinitely for the `mariadb` service to stop. The only remedy appears to be a power reset at the VM level.

Below is an example of a stuck query; we have not seen any significant pattern in the databases/tables or query types affected by this issue (note: DB name redacted).

*************************** 2. row ***************************
      Id: 6
    User: system user
    Host: 
      db: ????
 Command: Slave_SQL
    Time: 6869
   State: Update
    Info: INSERT INTO "cache_data" ("cid", "expire", "created", "tags", "checksum", "data", "serialized") VALUES ('route:[language]=en:/community-and-safety/grants-awards-and-sponsorships/community-grants/lord-mayors-community-fund:', '-1', '1704702224.983', 'route_match', '133545', 'a:3:{s:4:\"path\";s:11:\"/node/22131\";s:5:\"query\";a:0:{}s:6:\"routes\";O:41:\"Symfony\\Component\\Routing\\RouteCollection\":4:{s:49:\"\0Symfony\\Component\\Routing\\RouteCollection\0routes\";a:1:{s:21:\"entity.node.canonical\";O:31:\"Symfony\\Component\\Routing\\Route\":9:{s:4:\"path\";s:12:\"/node/{node}\";s:4:\"host\";s:0:\"\";s:8:\"defaults\";a:2:{s:11:\"_controller\";s:48:\"\\Drupal\\node\\Controller\\NodeViewController::view\";s:15:\"_title_callback\";s:49:\"\\Drupal\\node\\Controller\\NodeViewController::title\";}s:12:\"requirements\";a:3:{s:4:\"node\";s:3:\"\\d+\";s:14:\"_entity_access\";s:9:\"node.view\";s:39:\"_access_preview_link_canonical_rerouter\";s:4:\"node\";}s:7:\"options\";a:4:{s:14:\"compiler_class\";s:33:\"Drupal\\Core\\Routing\\RouteCompiler\";s:10:\"parameters\";a:1:{s:4:\"node\";a:2:{s:4:\"type\";s:11:\"entity:node\";s:9:\"converter\";s:21:\"paramconverter.entity\";}}s:14:\"_access_checks\";a:2:{i:0;s:19:\"access_check.entity\";i:1;s:44:\"access_check.preview_link.canonical_rerouter\";}s:4:\"utf8\";b:1;}s:7:\"schemes\";a:0:{}s:7:\"methods\";a:2:{i:0;s:3:\"GET\";i:1;s:4:\"POST\";}s:9:\"condition\";s:0:\"\";s:8:\"compiled\";O:33:\"Drupal\\Core\\Routing\\CompiledRoute\":11:{s:4:\"vars\";a:1:{i:0;s:4:\"node\";}s:11:\"path_prefix\";s:0:\"\";s:10:\"path_regex\";s:26:\"{^/node/(?P<node>\\d+)$}sDu\";s:11:\"path_tokens\";a:2:{i:0;a:5:{i:0;s:8:\"variable\";i:1;s:1:\"/\";i:2;s:3:\"\\d+\";i:3;s:4:\"node\";i:4;b:1;}i:1;a:2:{i:0;s:4:\"text\";i:1;s:5:\"/node\";}}s:9:\"path_vars\";a:1:{i:0;s:4:\"node\";}s:10:\"host_regex\";N;s:11:\"host_tokens\";a:0:{}s:9:\"host_vars\";a:0:{}s:3:\"fit\";i:2;s:14:\"patternOutline\";s:7:\"/node/%\";s:8:\"numParts\";i:2;}}}s:50:\"\0Symfony\\Compo
nent\\Routing\\RouteCollection\0aliases\";a:0:{}s:52:\"\0Symfony\\Component\\Routing\\RouteCollection\0resources\";a:0:{}s:53:\"\0Symfony\\Component\\Routing\\RouteCollection\0priorities\";a:0:{}}}', '1') ON DUPLICATE KEY UPDATE "cid" = VALUES("cid"), "expire" = VALUES("expire"), "created" = VALUES("created"), "tags" = VALUES("tags"), "checksum" = VALUES("checksum"), "data" = VALUES("data"), "serialized" = VALUES("serialized")
Progress: 0.000

Running strace on the `mariadbd` PID once the process is in this unresponsive state reveals the following repeated EAGAIN errors:

$ ps -ef | egrep mariadb
mysql       6462       1  2 10:19 ?        00:14:42 /usr/sbin/mariadbd
root       25489   25422  0 20:17 pts/0    00:00:00 grep -E --color=auto mariadb
[root@db-dr-01.bcc.atech.int:~][0]$ strace -p 6462
strace: Process 6462 attached
restart_syscall(<... resuming interrupted read ...>) = 1
accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
futex(0x562301a9224c, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])
accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
futex(0x562301a92248, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])

Just to cover the bases: there are no errors in the mariadb error log, nor in syslog or dmesg. There are no issues with available disk space, memory, CPU, etc. Additionally, we have seen no issues or errors at the VMware layer, and the rest of the OS appears to function correctly during and after these queries become stuck.

We have also experimented with different values of `slave_parallel_mode`, including `none`, which have had no impact. The mariadbd defaults are below:

$ mariadbd --print-defaults
mariadbd would have been started with the following arguments:
--user=mysql --default-storage-engine=InnoDB --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --collation_server=utf8_unicode_ci --character_set_server=utf8 --key-buffer-size=32M --myisam-recover=FORCE,BACKUP --max-allowed-packet=256M --max-connect-errors=1000000 --sql-mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY --sysdate-is-now=1 --innodb=FORCE --innodb-strict-mode=1 --datadir=/var/lib/mysql/ --log-bin=/var/lib/mysql/mysql-bin --expire-logs-days=10 --max_binlog_size=128M --sync-binlog=1 --server-id=2 --slave_parallel_mode=none --tmp-table-size=32M --max-heap-table-size=32M --query-cache-type=0 --query-cache-size=0 --max-connections=500 --thread-cache-size=50 --open-files-limit=65535 --table-definition-cache=1024 --table-open-cache=2048 --innodb_adaptive_hash_index=0 --innodb-flush-method=O_DIRECT --innodb-log-files-in-group=2 --innodb-log-file-size=128M --innodb-flush-log-at-trx-commit=1 --innodb-file-per-table=1 --innodb-buffer-pool-size=8354M --log-error=/var/log/mysql/error.log --log-queries-not-using-indexes=0 --slow-query-log=1 --slow-query-log-file=/var/log/mysql/mysql-slow.log
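For reference, this is roughly how we switched between parallel modes during testing (a sketch only; the SQL thread must be stopped before `slave_parallel_mode` can be changed, and the exact client invocation/credentials on the replica are assumptions):

```shell
# Sketch: the statements used to switch slave_parallel_mode between tests.
# The slave SQL thread must be stopped before the variable can be changed.
sql="STOP SLAVE SQL_THREAD; SET GLOBAL slave_parallel_mode = 'none'; START SLAVE SQL_THREAD;"
echo "$sql"    # pipe this into the mysql client on the replica
```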

The replication hung/unresponsive issue does not occur at any specific time, but generally recurs on each replica within 48-72 hours.
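As a stopgap while we diagnose, we are considering a crude cron check to flag the stalled state early (a sketch; it assumes two samples of the standard `Exec_Master_Log_Pos` field from `SHOW SLAVE STATUS`, taken some seconds apart):

```shell
# Sketch: a stuck SQL thread shows no movement in Exec_Master_Log_Pos
# while Seconds_Behind_Master keeps climbing. Compare two samples.
stalled() {
    # $1 = earlier Exec_Master_Log_Pos sample, $2 = later sample
    if [ "$1" = "$2" ]; then
        echo "stalled"
    else
        echo "progressing"
    fi
}

# The samples would come from something like (assumed invocation):
#   mysql -e "SHOW SLAVE STATUS\G" | awk '/Exec_Master_Log_Pos/ {print $2}'
stalled 4242 4242    # prints "stalled"
stalled 4242 5000    # prints "progressing"
```

Note this would also fire on an idle master with no new events, so in practice it would need to be combined with a check that `Seconds_Behind_Master` is growing.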

Please let me know if there is any additional information I can provide.



 Comments   
Comment by Marko Mäkelä [ 2024-01-09 ]

The reason for this could be that some threads inside the mariadbd process are deadlocked. This would be best diagnosed by attaching a debugger and producing stack traces of all threads during the hang. Some such hangs have been fixed rather recently. A very prominent one was MDEV-29835. The versions that you tested include that fix, but not the fix of MDEV-32899, for example.
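For anyone else hitting this, capturing the traces described above could look something like the following (a sketch; it assumes gdb and the mariadbd debug symbol package are installed, and that the hang still lets gdb attach):

```shell
# Sketch: dump a backtrace of every thread in the hung mariadbd process.
pid=$(pidof mariadbd || true)
if [ -n "$pid" ]; then
    # --batch exits after running the commands; save output to attach to the ticket
    gdb --batch -p "$pid" -ex "thread apply all bt full" > /tmp/mariadbd_stacks.txt 2>&1
    msg="stack traces written to /tmp/mariadbd_stacks.txt"
else
    msg="mariadbd not running"
fi
echo "$msg"
```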

Comment by Oliver Kelly [ 2024-01-14 ]

Thanks Team,

I will try to capture a stack trace if possible.

One thing to note is that we have a number of DB replica servers running Ubuntu 20.04 with MariaDB 10.6.16 and 10.11.6, and Ubuntu 22.04 with MariaDB 10.11.6, all exhibiting this issue. In addition, we have one older replica server which we haven't been able to update yet, running Ubuntu 18.04 and MariaDB 10.6.15, and this server has yet to encounter the error. Is there some way to see where the error may have been introduced in the changes from 10.6.15 -> 10.6.16 ?
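If it helps, the commit range between the two releases can be inspected in the MariaDB/server git repository (a sketch; the tag names follow the project's `mariadb-X.Y.Z` convention, and the path filter is just a guess at where a replication/InnoDB hang might originate):

```shell
# Sketch: build the git command that lists commits between the two release tags.
range="mariadb-10.6.15..mariadb-10.6.16"
cmd="git log --oneline $range -- sql/ storage/innobase/"
echo "$cmd"    # run inside a clone of https://github.com/MariaDB/server
```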

Generated at Thu Feb 08 10:37:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.