MariaDB Server / MDEV-33202

Slave SQL replication hung/unresponsive


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 10.6.16, 10.11.6
    • Fix Version/s: N/A
    • Component/s: Replication
    • Labels: None

    Description

      We operate a number of MariaDB server instances across two data centres, primarily version 10.6 but also some 10.11 instances. Most, but not all, instances are running Drupal 8 or Drupal 10 sites. We generally run production workloads in our primary data centre and slave replicas in our secondary data centre; the hardware and VMware ESXi version are the same across both data centres.

      Starting sometime within the past 4-6 weeks, our MariaDB slave replicas have begun hanging with the slave SQL thread completely unresponsive: `Seconds_Behind_Master` continues to grow without any progress. The stuck query cannot be killed within MariaDB, and the `mariadb.service` unit or `mariadb` process cannot be stopped or restarted at the OS level; nor can the OS be rebooted, as systemd waits forever for the `mariadb` service to stop. The only remedy appears to be a power reset at the VM level.
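
      The symptom described above (growing `Seconds_Behind_Master` while the SQL thread makes no progress) can be distinguished from ordinary lag by sampling `SHOW SLAVE STATUS\G` twice and comparing `Exec_Master_Log_Pos`. A minimal sketch, using canned snapshots in place of a live server (the field names match the client's `\G` output; everything else is illustrative):

      ```shell
      # Extract Exec_Master_Log_Pos from a SHOW SLAVE STATUS\G dump on stdin.
      exec_pos() {
          awk -F': ' '/Exec_Master_Log_Pos/ {print $2}'
      }

      # Two snapshots taken some interval apart. An unchanged position while
      # Seconds_Behind_Master keeps rising suggests the hang, not mere lag.
      snap1='Seconds_Behind_Master: 120
      Exec_Master_Log_Pos: 4096'
      snap2='Seconds_Behind_Master: 180
      Exec_Master_Log_Pos: 4096'

      p1=$(printf '%s\n' "$snap1" | exec_pos)
      p2=$(printf '%s\n' "$snap2" | exec_pos)
      if [ "$p1" = "$p2" ]; then
          echo "SQL thread appears stuck at pos $p1"
      fi
      ```

      In practice the snapshots would come from `mariadb -e "SHOW SLAVE STATUS\G"` run a minute or so apart.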

      Below is an example query; note that we have not seen any significant pattern in the databases, tables, or query types affected by this issue (DB name redacted).

      *************************** 2. row ***************************
            Id: 6
          User: system user
          Host: 
            db: ????
       Command: Slave_SQL
          Time: 6869
         State: Update
          Info: INSERT INTO "cache_data" ("cid", "expire", "created", "tags", "checksum", "data", "serialized") VALUES ('route:[language]=en:/community-and-safety/grants-awards-and-sponsorships/community-grants/lord-mayors-community-fund:', '-1', '1704702224.983', 'route_match', '133545', 'a:3:{s:4:\"path\";s:11:\"/node/22131\";s:5:\"query\";a:0:{}s:6:\"routes\";O:41:\"Symfony\\Component\\Routing\\RouteCollection\":4:{s:49:\"\0Symfony\\Component\\Routing\\RouteCollection\0routes\";a:1:{s:21:\"entity.node.canonical\";O:31:\"Symfony\\Component\\Routing\\Route\":9:{s:4:\"path\";s:12:\"/node/{node}\";s:4:\"host\";s:0:\"\";s:8:\"defaults\";a:2:{s:11:\"_controller\";s:48:\"\\Drupal\\node\\Controller\\NodeViewController::view\";s:15:\"_title_callback\";s:49:\"\\Drupal\\node\\Controller\\NodeViewController::title\";}s:12:\"requirements\";a:3:{s:4:\"node\";s:3:\"\\d+\";s:14:\"_entity_access\";s:9:\"node.view\";s:39:\"_access_preview_link_canonical_rerouter\";s:4:\"node\";}s:7:\"options\";a:4:{s:14:\"compiler_class\";s:33:\"Drupal\\Core\\Routing\\RouteCompiler\";s:10:\"parameters\";a:1:{s:4:\"node\";a:2:{s:4:\"type\";s:11:\"entity:node\";s:9:\"converter\";s:21:\"paramconverter.entity\";}}s:14:\"_access_checks\";a:2:{i:0;s:19:\"access_check.entity\";i:1;s:44:\"access_check.preview_link.canonical_rerouter\";}s:4:\"utf8\";b:1;}s:7:\"schemes\";a:0:{}s:7:\"methods\";a:2:{i:0;s:3:\"GET\";i:1;s:4:\"POST\";}s:9:\"condition\";s:0:\"\";s:8:\"compiled\";O:33:\"Drupal\\Core\\Routing\\CompiledRoute\":11:{s:4:\"vars\";a:1:{i:0;s:4:\"node\";}s:11:\"path_prefix\";s:0:\"\";s:10:\"path_regex\";s:26:\"{^/node/(?P<node>\\d+)$}sDu\";s:11:\"path_tokens\";a:2:{i:0;a:5:{i:0;s:8:\"variable\";i:1;s:1:\"/\";i:2;s:3:\"\\d+\";i:3;s:4:\"node\";i:4;b:1;}i:1;a:2:{i:0;s:4:\"text\";i:1;s:5:\"/node\";}}s:9:\"path_vars\";a:1:{i:0;s:4:\"node\";}s:10:\"host_regex\";N;s:11:\"host_tokens\";a:0:{}s:9:\"host_vars\";a:0:{}s:3:\"fit\";i:2;s:14:\"patternOutline\";s:7:\"/node/%\";s:8:\"numParts\";i:2;}}}s:50:\"\0Symfony\
\Component\\Routing\\RouteCollection\0aliases\";a:0:{}s:52:\"\0Symfony\\Component\\Routing\\RouteCollection\0resources\";a:0:{}s:53:\"\0Symfony\\Component\\Routing\\RouteCollection\0priorities\";a:0:{}}}', '1') ON DUPLICATE KEY UPDATE "cid" = VALUES("cid"), "expire" = VALUES("expire"), "created" = VALUES("created"), "tags" = VALUES("tags"), "checksum" = VALUES("checksum"), "data" = VALUES("data"), "serialized" = VALUES("serialized")
      Progress: 0.000
      

      Running strace on the `mariadbd` PID once the process is in this unresponsive state shows it looping on EAGAIN returns from `accept4`:

      $ ps -ef | egrep mariadb
      mysql       6462       1  2 10:19 ?        00:14:42 /usr/sbin/mariadbd
      root       25489   25422  0 20:17 pts/0    00:00:00 grep -E --color=auto mariadb
      [root@db-dr-01.bcc.atech.int:~][0]$ strace -p 6462
      strace: Process 6462 attached
      restart_syscall(<... resuming interrupted read ...>) = 1
      accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
      futex(0x562301a9224c, FUTEX_WAKE_PRIVATE, 1) = 1
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])
      accept4(25, {sa_family=AF_UNIX}, [128->2], SOCK_CLOEXEC) = 67
      futex(0x562301a92248, FUTEX_WAKE_PRIVATE, 1) = 1
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      poll([{fd=23, events=POLLIN}, {fd=24, events=POLLIN}, {fd=25, events=POLLIN}], 3, -1) = 1 ([{fd=25, revents=POLLIN}])
      
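      For what it's worth, a repeating `accept4` → EAGAIN → `poll` sequence looks like the normal idle loop of a non-blocking connection listener rather than an error in itself; note also that `strace -p PID` without `-f` attaches to a single thread, so `strace -f -p PID` would capture the other threads, including the stuck SQL thread. A rough sketch for tallying the accept churn in a saved trace (the log path and sample lines are illustrative):

      ```shell
      # Count accept4 calls that returned EAGAIN in a saved strace log.
      count_eagain() {
          grep -c 'accept4(.*EAGAIN' "$1"
      }

      # Tiny sample standing in for a real capture (e.g. from
      # `strace -f -p <pid> -o /tmp/mariadbd.trace`).
      cat > /tmp/mariadbd.trace.sample <<'EOF'
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      accept4(25, 0x7ffc8900a130, [128], SOCK_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
      poll([{fd=25, events=POLLIN}], 1, -1) = 1
      EOF
      count_eagain /tmp/mariadbd.trace.sample
      ```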

      Just to cover the bases: there are no errors in the MariaDB error log, nor in syslog or dmesg. The servers have ample disk space, memory, and CPU. We have also seen no issues or errors at the VMware layer, and the rest of the OS continues to function correctly during and after these queries become stuck.

      We have also experimented with different values of `slave_parallel_mode`, including `none`, which so far has had no impact. The mariadbd defaults are below:

      $ mariadbd --print-defaults
      mariadbd would have been started with the following arguments:
      --user=mysql --default-storage-engine=InnoDB --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --collation_server=utf8_unicode_ci --character_set_server=utf8 --key-buffer-size=32M --myisam-recover=FORCE,BACKUP --max-allowed-packet=256M --max-connect-errors=1000000 --sql-mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY --sysdate-is-now=1 --innodb=FORCE --innodb-strict-mode=1 --datadir=/var/lib/mysql/ --log-bin=/var/lib/mysql/mysql-bin --expire-logs-days=10 --max_binlog_size=128M --sync-binlog=1 --server-id=2 --slave_parallel_mode=none --tmp-table-size=32M --max-heap-table-size=32M --query-cache-type=0 --query-cache-size=0 --max-connections=500 --thread-cache-size=50 --open-files-limit=65535 --table-definition-cache=1024 --table-open-cache=2048 --innodb_adaptive_hash_index=0 --innodb-flush-method=O_DIRECT --innodb-log-files-in-group=2 --innodb-log-file-size=128M --innodb-flush-log-at-trx-commit=1 --innodb-file-per-table=1 --innodb-buffer-pool-size=8354M --log-error=/var/log/mysql/error.log --log-queries-not-using-indexes=0 --slow-query-log=1 --slow-query-log-file=/var/log/mysql/mysql-slow.log
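
      For reference, the `slave_parallel_mode` experiment above can also be applied at runtime rather than through the defaults file. A hedged sketch: the SQL is shown as comments since it needs a live server and SUPER privileges, and the helper merely checks a candidate value against the documented set (`none`, `minimal`, `conservative`, `optimistic`, `aggressive`):

      ```shell
      # Runtime change (requires a live server; connection details omitted):
      #   STOP SLAVE SQL_THREAD;
      #   SET GLOBAL slave_parallel_mode = 'none';
      #   START SLAVE SQL_THREAD;

      # Sanity-check a candidate mode against the documented values.
      is_valid_mode() {
          case "$1" in
              none|minimal|conservative|optimistic|aggressive) echo valid ;;
              *) echo invalid ;;
          esac
      }

      is_valid_mode none
      is_valid_mode nonee
      ```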

      The replication hang does not occur at any specific time, but generally recurs on each replica within 48-72 hours.

      Please let me know if there is any additional information I can provide.

      Attachments

        Activity

          People

            Assignee: Sergei Golubchik (serg)
            Reporter: Oliver Kelly (oliver.kelly)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved:
