We operate a number of MariaDB server instances across two data centres, primarily version 10.6 but also some 10.11 instances. Most, but not all, instances host Drupal 8 or Drupal 10 sites. We generally run the production workload in our primary data centre and slave replicas in our secondary data centre; the hardware and VMware ESXi versions are the same across both data centres.
Starting sometime within the past 4-6 weeks, our MariaDB slave replicas have begun hanging: the slave SQL thread becomes completely unresponsive and `Seconds_Behind_Master` keeps growing without any progress. The stuck query cannot be killed within MariaDB, and the `mariadb.service`/`mariadbd` process cannot be stopped or restarted at the OS level; nor can the OS be rebooted, as systemd waits forever for the `mariadb` service to stop. The only remedy appears to be a power reset at the VM level.
Below is an example of a stuck query; note that we have not seen any significant pattern in the databases/tables or query types affected by this issue (DB name redacted).
```
*************************** 2. row ***************************
      Id: 6
    User: system user
    Host:
      db: ????
 Command: Slave_SQL
    Time: 6869
   State: Update
    Info: INSERT INTO "cache_data" ("cid", "expire", "created", "tags", "checksum", "data", "serialized") VALUES ('route:[language]=en:/community-and-safety/grants-awards-and-sponsorships/community-grants/lord-mayors-community-fund:', '-1', '1704702224.983', 'route_match', '133545', 'a:3:{s:4:\"path\";s:11:\"/node/22131\";s:5:\"query\";a:0:{}s:6:\"routes\";O:41:\"Symfony\\Component\\Routing\\RouteCollection\":4:{s:49:\"\0Symfony\\Component\\Routing\\RouteCollection\0routes\";a:1:{s:21:\"entity.node.canonical\";O:31:\"Symfony\\Component\\Routing\\Route\":9:{s:4:\"path\";s:12:\"/node/{node}\";s:4:\"host\";s:0:\"\";s:8:\"defaults\";a:2:{s:11:\"_controller\";s:48:\"\\Drupal\\node\\Controller\\NodeViewController::view\";s:15:\"_title_callback\";s:49:\"\\Drupal\\node\\Controller\\NodeViewController::title\";}s:12:\"requirements\";a:3:{s:4:\"node\";s:3:\"\\d+\";s:14:\"_entity_access\";s:9:\"node.view\";s:39:\"_access_preview_link_canonical_rerouter\";s:4:\"node\";}s:7:\"options\";a:4:{s:14:\"compiler_class\";s:33:\"Drupal\\Core\\Routing\\RouteCompiler\";s:10:\"parameters\";a:1:{s:4:\"node\";a:2:{s:4:\"type\";s:11:\"entity:node\";s:9:\"converter\";s:21:\"paramconverter.entity\";}}s:14:\"_access_checks\";a:2:{i:0;s:19:\"access_check.entity\";i:1;s:44:\"access_check.preview_link.canonical_rerouter\";}s:4:\"utf8\";b:1;}s:7:\"schemes\";a:0:{}s:7:\"methods\";a:2:{i:0;s:3:\"GET\";i:1;s:4:\"POST\";}s:9:\"condition\";s:0:\"\";s:8:\"compiled\";O:33:\"Drupal\\Core\\Routing\\CompiledRoute\":11:{s:4:\"vars\";a:1:{i:0;s:4:\"node\";}s:11:\"path_prefix\";s:0:\"\";s:10:\"path_regex\";s:26:\"{^/node/(?P<node>\\d+)$}sDu\";s:11:\"path_tokens\";a:2:{i:0;a:5:{i:0;s:8:\"variable\";i:1;s:1:\"/\";i:2;s:3:\"\\d+\";i:3;s:4:\"node\";i:4;b:1;}i:1;a:2:{i:0;s:4:\"text\";i:1;s:5:\"/node\";}}s:9:\"path_vars\";a:1:{i:0;s:4:\"node\";}s:10:\"host_regex\";N;s:11:\"host_tokens\";a:0:{}s:9:\"host_vars\";a:0:{}s:3:\"fit\";i:2;s:14:\"patternOutline\";s:7:\"/node/%\";s:8:\"numParts\";i:2;}}}s:50:\"\0Symfony\\Component\\Routing\\RouteCollection\0aliases\";a:0:{}s:52:\"\0Symfony\\Component\\Routing\\RouteCollection\0resources\";a:0:{}s:53:\"\0Symfony\\Component\\Routing\\RouteCollection\0priorities\";a:0:{}}}', '1') ON DUPLICATE KEY UPDATE "cid" = VALUES("cid"), "expire" = VALUES("expire"), "created" = VALUES("created"), "tags" = VALUES("tags"), "checksum" = VALUES("checksum"), "data" = VALUES("data"), "serialized" = VALUES("serialized")
Progress: 0.000
```
Running strace against the `mariadbd` PID once the process is in this unresponsive state shows it looping on `accept4()` calls that return EAGAIN:
```
$ ps -ef | egrep mariadb
mysql     6462      1  2 10:19 ?        00:14:42 /usr/sbin/mariadbd
root     25489  25422  0 20:17 pts/0    00:00:00 grep -E --color=auto mariadb

[root@db-dr-01.bcc.atech.int:~][0]$ strace -p 6462
strace: Process 6462 attached
restart_syscall(<... resuming interrupted read ...>) = 1
accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
futex(0x562301a9224c, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])
accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
futex(0x562301a92248, FUTEX_WAKE_PRIVATE, 1) = 1
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])
```
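One caveat with the trace above: `strace -p <pid>` attaches only to the single thread given (the main thread here, which is why we mostly see the connection-accept loop), while the stuck slave SQL thread is a different thread. On the next occurrence we plan to capture per-thread state along these lines (PID discovery, output paths, and the gdb step are illustrative, not something we have run yet):

```shell
# Sketch of the per-thread capture planned for the next hang.
# Falls back to this shell's PID purely so the snippet runs anywhere.
pid=$(pgrep -x mariadbd || echo $$)

# Scheduler state of every thread: a thread stuck in uninterruptible
# sleep ("D") would point at a kernel/IO wait rather than a userspace lock.
ps -Lo tid,stat,wchan:32,comm -p "$pid"

# Follow *all* threads with strace, and dump userspace stacks with gdb
# (commented out; both need root and debug symbols on the affected host):
# strace -f -p "$pid" -o /tmp/mariadbd-strace.txt
# gdb -p "$pid" -batch -ex 'thread apply all bt' > /tmp/mariadbd-stacks.txt
```

A thread shown in "D" state would also explain why the process is unkillable and why systemd blocks on shutdown.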
Just to cover the bases: there are no errors in the MariaDB error log, syslog, or dmesg. The servers have ample disk space, memory, and CPU headroom. We have also seen no issues or errors at the VMware layer, and the rest of the OS continues to function normally during and after these queries become stuck.
We have also experimented with different values of `slave_parallel_mode`, including `none`, with no impact. The mariadbd defaults are below:
```
$ mariadbd --print-defaults
mariadbd would have been started with the following arguments:
--user=mysql --default-storage-engine=InnoDB --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --collation_server=utf8_unicode_ci --character_set_server=utf8 --key-buffer-size=32M --myisam-recover=FORCE,BACKUP --max-allowed-packet=256M --max-connect-errors=1000000 --sql-mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY --sysdate-is-now=1 --innodb=FORCE --innodb-strict-mode=1 --datadir=/var/lib/mysql/ --log-bin=/var/lib/mysql/mysql-bin --expire-logs-days=10 --max_binlog_size=128M --sync-binlog=1 --server-id=2 --slave_parallel_mode=none --tmp-table-size=32M --max-heap-table-size=32M --query-cache-type=0 --query-cache-size=0 --max-connections=500 --thread-cache-size=50 --open-files-limit=65535 --table-definition-cache=1024 --table-open-cache=2048 --innodb_adaptive_hash_index=0 --innodb-flush-method=O_DIRECT --innodb-log-files-in-group=2 --innodb-log-file-size=128M --innodb-flush-log-at-trx-commit=1 --innodb-file-per-table=1 --innodb-buffer-pool-size=8354M --log-error=/var/log/mysql/error.log --log-queries-not-using-indexes=0 --slow-query-log=1 --slow-query-log-file=/var/log/mysql/mysql-slow.log
```
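For reference, the `slave_parallel_mode` changes were applied per replica along these lines (a sketch; connection details and error handling omitted, and MariaDB requires the slave to be stopped before this variable can be changed):

```shell
# Illustrative only: cycle the slave so slave_parallel_mode can be set.
sql="STOP SLAVE;
SET GLOBAL slave_parallel_mode = 'none';
START SLAVE;"

# Run only where a mariadb client is actually present:
if command -v mariadb >/dev/null 2>&1; then
  printf '%s\n' "$sql" | mariadb
fi
```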
The replication hang does not occur at any specific time of day, but generally recurs on each replica within 48-72 hours.
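In the meantime we are considering a cron check to catch the hang quickly, since its signature is the execution position freezing while lag keeps growing. A rough sketch (the 300-second threshold, the local passwordless client access, and the alerting hook are all placeholders):

```shell
# Sketch of a stuck-SQL-thread detector for the replicas.

# Print one field from "SHOW SLAVE STATUS\G" text read on stdin.
slave_field() {
  awk -v f="$1:" '$1 == f { print $2; exit }'
}

if command -v mariadb >/dev/null 2>&1; then
  pos1=$(mariadb -e 'SHOW SLAVE STATUS\G' | slave_field Exec_Master_Log_Pos)
  sleep 60
  status=$(mariadb -e 'SHOW SLAVE STATUS\G')
  pos2=$(printf '%s\n' "$status" | slave_field Exec_Master_Log_Pos)
  behind=$(printf '%s\n' "$status" | slave_field Seconds_Behind_Master)
  # Frozen execution position plus growing lag matches the hang we see.
  if [ "$pos1" = "$pos2" ] && [ "${behind:-0}" -gt 300 ]; then
    echo "slave SQL thread appears stuck on $(hostname)"  # hook alerting here
  fi
fi
```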
Please let me know if there is any additional information I can provide.