MariaDB Server / MDEV-22796

asio problems and IST failures



    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.4.13
    • Fix Version/s: 10.4
    • Component/s: Galera
    • Labels:
    • Environment:
      Linux 4.19.124-gentoo x86_64 AMD EPYC 7451, Intel I350 Gigabit Ethernet


      Note: the most recent affected Galera version is 1.0.4/26.4.4, but the same applies to earlier versions, at least 26.4.3.

      We have a cluster of 3 nodes working with a large (above 1 TB) data volume. All nodes have the same hardware and software. Sometimes nodes run SST and IST to transfer data.
      It was noticed that SST frequently fails and has to be restarted due to the error:

      [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)

      We tried setting a bigger Galera gcache size, but in some cases the error happened again and in others it didn't. Moreover, sometimes a simple restart of mysqld on the receiving node (and thus a restart of SST once the donor node had returned to the synced state) led to a successful SST and the joiner managed to join the cluster, but sometimes it failed.

      • It was noticed that the gcache size and the number of transactions happening on cluster nodes have no effect on the issue.
      • Disabling or enabling compression of state transfer data, as well as attempts to flush logs, had no effect either.
      • It was also noticed that whenever IST failed, the same error message was always logged 20 (+/- 1) minutes after starting mysqld on the joining node (thus, 20 minutes after the state transfer request). This error was:

      2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231
      WSREP_SST: [INFO] Evaluating /usr/bin/mariabackup --innobackupex --defaults-file=/etc/mysql/my.cnf     $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2> /var/lib/mysql//mariabackup.backup.log | /usr/bin/zstd --fast=3 | socat -u stdio TCP:***.*.***.*:4444; RC=( ${PIPESTATUS[@]} ) (20200601 21:50:53.977)
      2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve tcp://***.*.***.*:4568: ist send failed: asio.system:110', asio error 'write: Connection timed out': 110 (Connection timed out)
           at galera/src/ist.cpp:send():887
      2020-06-01 22:10:59 0 [Note] WSREP: async IST sender served
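
      The roughly 20-minute delay can be checked directly from the timestamps in the excerpt above. The following is a minimal Python sketch, not part of the original report: the helper names parse_timestamp and elapsed_seconds are hypothetical, and the two log lines are copied from the excerpt with the addresses masked as in the original.

      ```python
      from datetime import datetime

      # Log lines copied from the excerpt above (addresses masked as in the report).
      LOG_LINES = [
          "2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231",
          "2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve "
          "tcp://***.*.***.*:4568: ist send failed: asio.system:110', asio error "
          "'write: Connection timed out': 110 (Connection timed out)",
      ]


      def parse_timestamp(line):
          """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a mysqld log line.

          split() collapses repeated spaces, so lines with a single-digit hour
          such as '2020-06-02  1:49:58' are handled as well.
          """
          date_part, time_part = line.split()[:2]
          return datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")


      def elapsed_seconds(start_line, end_line):
          """Return the whole seconds elapsed between two log lines."""
          delta = parse_timestamp(end_line) - parse_timestamp(start_line)
          return int(delta.total_seconds())


      delay = elapsed_seconds(LOG_LINES[0], LOG_LINES[1])
      print(f"{delay} s ({delay // 60} min {delay % 60} s)")
      ```

      For the two lines shown, the gap between the IST sender start and the asio write failure is 1217 seconds, i.e. 20 minutes 17 seconds, which matches the 20 (+/- 1) minute observation.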

      The appearance of these last two lines (error + note) in the mysqld log file was always followed by a failed state transfer, with the following errors logged:

      2020-06-02  1:49:58 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:232217728, protocol version: 5
      2020-06-02  1:49:58 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.system:104
      2020-06-02  1:49:58 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 232234231 last received: 232221423
      2020-06-02  1:49:58 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
           at galera/src/replicator_smm.hpp:pop_front():314. Null event.
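
      To put numbers on how much of the IST stream arrived before the connection dropped, the sequence numbers from the sender and receiver lines can be compared. This is only a sketch under two assumptions that the report does not state explicitly: that both lines describe the same transfer (the start and end seqnos match), and that the range is contiguous and inclusive. The function names ist_range and ist_shortfall are hypothetical.

      ```python
      import re

      # Lines copied from the log excerpts in this report.
      SENDER_LINE = "2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231"
      RECEIVER_LINE = (
          "2020-06-02  1:49:58 0 [ERROR] WSREP: IST didn't contain all write sets, "
          "expected last: 232234231 last received: 232221423"
      )


      def ist_range(line):
          """Extract (first, last) seqno from an 'IST sender A -> B' line."""
          match = re.search(r"IST sender (\d+) -> (\d+)", line)
          if match is None:
              raise ValueError("not an IST sender line")
          return int(match.group(1)), int(match.group(2))


      def ist_shortfall(line):
          """Extract (expected_last, last_received) from the receiver error line."""
          match = re.search(r"expected last: (\d+) last received: (\d+)", line)
          if match is None:
              raise ValueError("not an IST shortfall line")
          return int(match.group(1)), int(match.group(2))


      first, last = ist_range(SENDER_LINE)
      expected, received = ist_shortfall(RECEIVER_LINE)

      # Assumption: seqnos are contiguous and the range is inclusive on both ends.
      total = last - first + 1
      delivered = received - first + 1
      missing = expected - received

      print(f"total={total} delivered={delivered} missing={missing}")
      ```

      Under those assumptions, 3695 of 16503 write sets were received and 12808 were still outstanding when the stream was cut.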

      So, the open questions are:

      1. How can such situations be avoided? Nodes require a manual restart after failed transfers!
      2. Why is this asio error always logged 20 minutes after the state transfer starts?
      3. The reported failure is 'Connection timed out', while the connection is stable and no service or monitoring tool reports connection issues.
      4. The issue is intermittent: on some restarts it appears and on others it doesn't; this was also true for the previous version of the Galera library. No configuration change seems to cause or solve it.
      5. It was also noted that the asio library used by Galera is version 1.10.8 and this version can't be changed, although version 1.18 is already out.
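
      As a reading aid for the two asio.system codes in the logs above, they can be decoded as Linux errno values. The sketch below is illustrative only: the table is hard-coded for Linux (the platform named in the Environment field) because errno numbering differs between operating systems, and the helper names asio_system_codes and describe are hypothetical.

      ```python
      import re

      # Linux errno values for the two codes seen in this report.
      # Hard-coded on purpose: the numbers are platform-specific.
      LINUX_ERRNO = {
          104: ("ECONNRESET", "Connection reset by peer"),
          110: ("ETIMEDOUT", "Connection timed out"),
      }

      DONOR_LINE = (
          "2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve "
          "tcp://***.*.***.*:4568: ist send failed: asio.system:110', asio error "
          "'write: Connection timed out': 110 (Connection timed out)"
      )
      JOINER_LINE = (
          "2020-06-02  1:49:58 0 [ERROR] WSREP: got asio system error while "
          "reading IST stream: asio.system:104"
      )


      def asio_system_codes(line):
          """Return every numeric code that follows 'asio.system:' in a log line."""
          return [int(code) for code in re.findall(r"asio\.system:(\d+)", line)]


      def describe(line):
          """Map each asio.system code in the line to its Linux errno name and text."""
          result = []
          for code in asio_system_codes(line):
              name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
              result.append(f"{code} {name}: {text}")
          return result


      for label, line in (("donor", DONOR_LINE), ("joiner", JOINER_LINE)):
          print(label, describe(line))
      ```

      Read this way, the sending side reports a write that timed out (110), while the receiving side reports that the peer reset the connection (104).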


              jplindst Jan Lindström
              euglorg Eugene