MariaDB Server / MDEV-33202

Slave SQL replication hung/unresponsive


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 10.6.16, 10.11.6
    • Fix Version/s: N/A
    • Component/s: Replication
    • Labels: None

    Description

      We operate a number of MariaDB server instances across two data centres, primarily version 10.6 but also some 10.11 instances. Most, but not all, instances are running Drupal 8 or Drupal 10 sites. We generally run production workloads in our primary data centre and slave replicas in our secondary data centre; the hardware and VMware ESXi version are the same across both data centres.

      Starting sometime within the past 4-6 weeks, our MariaDB slave replicas have begun hanging with the slave SQL thread completely unresponsive: `Seconds_Behind_Master` continues to grow without any progress. The stuck query cannot be killed within MariaDB, and the `mariadb.service` unit or `mariadb` process cannot be stopped or restarted at the OS level; nor can the OS be rebooted, as systemd waits forever for the `mariadb` service to stop. The only remedy appears to be a power reset at the VM level.
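
      The symptom described above (growing `Seconds_Behind_Master` while the SQL thread makes no progress) can be distinguished from ordinary lag by sampling `SHOW SLAVE STATUS\G` twice and comparing `Exec_Master_Log_Pos`. A minimal sketch, using canned snapshots in place of a live server (the field names match the client's `\G` output; everything else is illustrative):

      ```shell
      # Extract Exec_Master_Log_Pos from a SHOW SLAVE STATUS\G dump on stdin.
      exec_pos() {
          awk -F': ' '/Exec_Master_Log_Pos/ {print $2}'
      }

      # Two snapshots taken some interval apart. An unchanged position while
      # Seconds_Behind_Master keeps rising suggests the hang, not mere lag.
      snap1='Seconds_Behind_Master: 120
      Exec_Master_Log_Pos: 4096'
      snap2='Seconds_Behind_Master: 180
      Exec_Master_Log_Pos: 4096'

      p1=$(printf '%s\n' "$snap1" | exec_pos)
      p2=$(printf '%s\n' "$snap2" | exec_pos)
      if [ "$p1" = "$p2" ]; then
          echo "SQL thread appears stuck at pos $p1"
      fi
      ```

      In practice the snapshots would come from `mariadb -e "SHOW SLAVE STATUS\G"` run a minute or so apart.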

      Below is an example query; note that we have not seen any significant pattern in the databases, tables, or query types affected by this issue (DB name redacted).

      *************************** 2. row ***************************
            Id: 6
          User: system user
          Host: 
            db: ????
       Command: Slave_SQL
          Time: 6869
         State: Update
          Info: INSERT INTO "cache_data" ("cid", "expire", "created", "tags", "checksum", "data", "serialized") VALUES ('route:[language]=en:/community-and-safety/grants-awards-and-sponsorships/community-grants/lord-mayors-community-fund:', '-1', '1704702224.983', 'route_match', '133545', 'a:3:{s:4:\"path\";s:11:\"/node/22131\";s:5:\"query\";a:0:{}s:6:\"routes\";O:41:\"Symfony\\Component\\Routing\\RouteCollection\":4:{s:49:\"\0Symfony\\Component\\Routing\\RouteCollection\0routes\";a:1:{s:21:\"entity.node.canonical\";O:31:\"Symfony\\Component\\Routing\\Route\":9:{s:4:\"path\";s:12:\"/node/{node}\";s:4:\"host\";s:0:\"\";s:8:\"defaults\";a:2:{s:11:\"_controller\";s:48:\"\\Drupal\\node\\Controller\\NodeViewController::view\";s:15:\"_title_callback\";s:49:\"\\Drupal\\node\\Controller\\NodeViewController::title\";}s:12:\"requirements\";a:3:{s:4:\"node\";s:3:\"\\d+\";s:14:\"_entity_access\";s:9:\"node.view\";s:39:\"_access_preview_link_canonical_rerouter\";s:4:\"node\";}s:7:\"options\";a:4:{s:14:\"compiler_class\";s:33:\"Drupal\\Core\\Routing\\RouteCompiler\";s:10:\"parameters\";a:1:{s:4:\"node\";a:2:{s:4:\"type\";s:11:\"entity:node\";s:9:\"converter\";s:21:\"paramconverter.entity\";}}s:14:\"_access_checks\";a:2:{i:0;s:19:\"access_check.entity\";i:1;s:44:\"access_check.preview_link.canonical_rerouter\";}s:4:\"utf8\";b:1;}s:7:\"schemes\";a:0:{}s:7:\"methods\";a:2:{i:0;s:3:\"GET\";i:1;s:4:\"POST\";}s:9:\"condition\";s:0:\"\";s:8:\"compiled\";O:33:\"Drupal\\Core\\Routing\\CompiledRoute\":11:{s:4:\"vars\";a:1:{i:0;s:4:\"node\";}s:11:\"path_prefix\";s:0:\"\";s:10:\"path_regex\";s:26:\"{^/node/(?P<node>\\d+)$}sDu\";s:11:\"path_tokens\";a:2:{i:0;a:5:{i:0;s:8:\"variable\";i:1;s:1:\"/\";i:2;s:3:\"\\d+\";i:3;s:4:\"node\";i:4;b:1;}i:1;a:2:{i:0;s:4:\"text\";i:1;s:5:\"/node\";}}s:9:\"path_vars\";a:1:{i:0;s:4:\"node\";}s:10:\"host_regex\";N;s:11:\"host_tokens\";a:0:{}s:9:\"host_vars\";a:0:{}s:3:\"fit\";i:2;s:14:\"patternOutline\";s:7:\"/node/%\";s:8:\"numParts\";i:2;}}}s:50:\"\0Symfony\
\Component\\Routing\\RouteCollection\0aliases\";a:0:{}s:52:\"\0Symfony\\Component\\Routing\\RouteCollection\0resources\";a:0:{}s:53:\"\0Symfony\\Component\\Routing\\RouteCollection\0priorities\";a:0:{}}}', '1') ON DUPLICATE KEY UPDATE "cid" = VALUES("cid"), "expire" = VALUES("expire"), "created" = VALUES("created"), "tags" = VALUES("tags"), "checksum" = VALUES("checksum"), "data" = VALUES("data"), "serialized" = VALUES("serialized")
      Progress: 0.000
      

      Running strace on the `mariadbd` PID once the process is in this unresponsive state shows it looping on EAGAIN returns from `accept4`:

      $ ps -ef | egrep mariadb
      mysql       6462       1  2 10:19 ?        00:14:42 /usr/sbin/mariadbd
      root       25489   25422  0 20:17 pts/0    00:00:00 grep -E --color=auto mariadb
      [root@db-dr-01.bcc.atech.int:~][0]$ strace -p 6462
      strace: Process 6462 attached
      restart_syscall(<... resuming interrupted read ...>) = 1
      accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
      futex(0x562301a9224c, FUTEX_WAKE_PRIVATE, 1) = 1
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])
      accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
      futex(0x562301a92248, FUTEX_WAKE_PRIVATE, 1) = 1
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])
      
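      For what it's worth, a repeating `accept4` → EAGAIN → `poll` sequence looks like the normal idle loop of a non-blocking connection listener rather than an error in itself; note also that `strace -p PID` without `-f` attaches to a single thread, so `strace -f -p PID` would capture the other threads, including the stuck SQL thread. A rough sketch for tallying the accept churn in a saved trace (the log path and sample lines are illustrative):

      ```shell
      # Count accept4 calls that returned EAGAIN in a saved strace log.
      count_eagain() {
          grep -c 'accept4(.*EAGAIN' "$1"
      }

      # Tiny sample standing in for a real capture (e.g. from
      # `strace -f -p <pid> -o /tmp/mariadbd.trace`).
      cat > /tmp/mariadbd.trace.sample <<'EOF'
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      poll([{fd=25, events=POLLIN}], 1, -1) = 1
      EOF
      count_eagain /tmp/mariadbd.trace.sample
      ```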

      Just to cover the bases: there are no errors in the MariaDB error log, nor in syslog or dmesg. The servers have ample disk space, memory, and CPU. We have also seen no issues or errors at the VMware layer, and the rest of the OS continues to function correctly during and after these queries become stuck.

      We have also experimented with different values of `slave_parallel_mode`, including `none`, which so far has had no impact. The mariadbd defaults are below:

      $ mariadbd --print-defaults
      mariadbd would have been started with the following arguments:
      --user=mysql --default-storage-engine=InnoDB --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --collation_server=utf8_unicode_ci --character_set_server=utf8 --key-buffer-size=32M --myisam-recover=FORCE,BACKUP --max-allowed-packet=256M --max-connect-errors=1000000 --sql-mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY --sysdate-is-now=1 --innodb=FORCE --innodb-strict-mode=1 --datadir=/var/lib/mysql/ --log-bin=/var/lib/mysql/mysql-bin --expire-logs-days=10 --max_binlog_size=128M --sync-binlog=1 --server-id=2 --slave_parallel_mode=none --tmp-table-size=32M --max-heap-table-size=32M --query-cache-type=0 --query-cache-size=0 --max-connections=500 --thread-cache-size=50 --open-files-limit=65535 --table-definition-cache=1024 --table-open-cache=2048 --innodb_adaptive_hash_index=0 --innodb-flush-method=O_DIRECT --innodb-log-files-in-group=2 --innodb-log-file-size=128M --innodb-flush-log-at-trx-commit=1 --innodb-file-per-table=1 --innodb-buffer-pool-size=8354M --log-error=/var/log/mysql/error.log --log-queries-not-using-indexes=0 --slow-query-log=1 --slow-query-log-file=/var/log/mysql/mysql-slow.log
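
      For reference, the `slave_parallel_mode` experiment above can also be applied at runtime rather than through the defaults file. A hedged sketch: the SQL is shown as comments since it needs a live server and SUPER privileges, and the helper merely checks a candidate value against the documented set (`none`, `minimal`, `conservative`, `optimistic`, `aggressive`):

      ```shell
      # Runtime change (requires a live server; connection details omitted):
      #   STOP SLAVE SQL_THREAD;
      #   SET GLOBAL slave_parallel_mode = 'none';
      #   START SLAVE SQL_THREAD;

      # Sanity-check a candidate mode against the documented values.
      is_valid_mode() {
          case "$1" in
              none|minimal|conservative|optimistic|aggressive) echo valid ;;
              *) echo invalid ;;
          esac
      }

      is_valid_mode none
      is_valid_mode nonee
      ```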

      The replication hang does not occur at any specific time, but generally recurs on each replica within 48-72 hours.

      Please let me know if there is any additional information I can provide.

      Attachments

        Activity

          People

            Assignee: Sergei Golubchik (serg)
            Reporter: Oliver Kelly (oliver.kelly)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
