[MDEV-22796] asio problems and IST failures Created: 2020-06-04  Updated: 2023-09-11  Resolved: 2023-09-11

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.13
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Eugene Assignee: Teemu Ollakka
Resolution: Incomplete Votes: 1
Labels: None
Environment:

Linux 4.19.124-gentoo x86_64 AMD EPYC 7451, Intel I350 Gigabit Ethernet


Attachments: File mcc-mariadb-logs.tar.gz    
Issue Links:
Relates
relates to MDEV-22797 galera uses old version of asio library Open

 Description   

Note: the most recent Galera version affected is 1.0.4/26.4.4, but the same is relevant for previous versions, at least 26.4.3.

We have a cluster of 3 nodes working with a large (above 1 TB) data volume. All the nodes have the same hardware and software. From time to time nodes run SST and IST to transfer data.
It was noticed that SST frequently fails and has to be restarted due to the error:

[ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)

We tried setting a bigger galera gcache size, but in some cases the error happened again and in others it didn't. Moreover, sometimes a simple restart of mysqld on the receiving node (thus restarting SST once the donor node returned to the synced state) led to a successful SST and the joiner managed to join the cluster, but sometimes it failed.
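For reference, the gcache size mentioned above is adjusted through wsrep_provider_options in the server configuration; a minimal sketch (the sizes here are illustrative placeholders, not the values actually used in this cluster):

```ini
# my.cnf sketch; sizes are illustrative, not a recommendation
[mysqld]
wsrep_provider_options="gcache.size=2G;gcache.page_size=128M"
```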

  • It was noticed that gcache size and the amount of transactions happening on the cluster nodes have no effect on the issue.
  • Disabling or enabling compression of state transfer data, as well as attempts to flush logs, also had no effect.
  • It was also noticed that whenever IST failed, the same error message could always be found logged 20 (+/- 1) minutes after starting mysqld on the joining node (thus, 20 minutes after the state transfer request). This error was:

2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231
...
WSREP_SST: [INFO] Evaluating /usr/bin/mariabackup --innobackupex --defaults-file=/etc/mysql/my.cnf     $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2> /var/lib/mysql//mariabackup.backup.log | /usr/bin/zstd --fast=3 | socat -u stdio TCP:***.*.***.*:4444; RC=( ${PIPESTATUS[@]} ) (20200601 21:50:53.977)
2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve tcp://***.*.***.*:4568: ist send failed: asio.system:110', asio error 'write: Connection timed out': 110 (Connection timed out)
     at galera/src/ist.cpp:send():887
2020-06-01 22:10:59 0 [Note] WSREP: async IST sender served

The appearance of these last two lines (error + note) in the mysqld log file always meant the state transfer ended in failure, with the following errors logged:

2020-06-02  1:49:58 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:232217728, protocol version: 5
2020-06-02  1:49:58 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.system:104
2020-06-02  1:49:58 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 232234231 last received: 232221423
2020-06-02  1:49:58 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
     at galera/src/replicator_smm.hpp:pop_front():314. Null event.

So, the open questions are:

  1. How can such situations be avoided? Nodes require a manual restart on failed transfers!
  2. Why is this asio error always logged 20 minutes after the state transfer starts?
  3. The reported failure is 'Connection timed out', while the connection is stable and no service or monitoring tool reports connection issues.
  4. The issue is floating: it appears on some restarts and not on others, and this was also true for the previous version of the galera library. No configuration change seems to cause or solve it.
  5. It was also noted that the asio library used by galera is 1.10.8 and this version can't be changed; however, version 1.18 is already out.
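One possible explanation for the suspiciously constant ~20-minute delay in question 2 (purely an assumption, not confirmed anywhere in this ticket): on Linux, a TCP connection whose peer stops acknowledging data is only declared dead after net.ipv4.tcp_retries2 retransmissions (default 15) with exponential backoff, which adds up to roughly 15–20 minutes before the write fails with ETIMEDOUT — matching the 'Connection timed out' seen on the donor. A back-of-the-envelope sketch:

```python
def tcp_give_up_estimate(retries=15, rto_init=1.0, rto_max=120.0):
    """Rough total time (seconds) the kernel keeps retransmitting before
    reporting ETIMEDOUT: exponential RTO backoff capped at rto_max."""
    total, rto = 0.0, rto_init
    for _ in range(retries):
        total += rto
        rto = min(rto * 2, rto_max)
    return total

# With the default tcp_retries2 = 15 this lands near 18 minutes.
print(round(tcp_give_up_estimate() / 60, 1))  # → 18.1
```

If this guess holds, lowering tcp_retries2 would only surface the failure sooner; the underlying stall on the IST connection would still need to be fixed.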


 Comments   
Comment by Ranjan Ghosh [ 2020-07-21 ]

This also happens for me very often when I just try to start a node. I start the first node. Works. The DB is running. Then I try to start the second node (with a completely empty /var/lib/mysql folder). The mariabackup SST starts. But on the second node I find this in the logs:

Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 3 [Note] WSREP: SST received: a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896595604
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Joiner monitor thread ended with total time 482 sec
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Installed new state from SST: a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896595604
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Cert. index preload up to 2896595604
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: ####### IST applying starts with 2896595605
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: ####### IST current seqno initialized to 2896576828
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Receiving IST...  0.0% ( 0/37 events) complete.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: IST preload starting at 2896576828
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Service thread queue flushed.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:2896576827, protocol version: 5
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.system:104
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 2896576864 last received: 2896576840
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
Jul 21 18:43:00 yak2 mariadbd[778444]:          at galera/src/replicator_smm.hpp:pop_front():315. Null event.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closing send monitor...
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closed send monitor.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: terminating thread
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: joining thread
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: closing backend
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: PC protocol downgrade 1 -> 0
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: view((empty))
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: closed
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: New SELF-LEAVE.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Flow-control interval: [0, 0]
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 2896595604)
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: RECV thread exiting 0: Success
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: recv_thread() joined.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closing replication queue.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closing slave action queue.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Cert. index preloaded up to 2896576840
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [ERROR] WSREP: gcs/src/gcs.cpp:s_join():929: Sending JOIN failed: -103 (Software caused connection abort).
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Warning] WSREP: Failed to JOIN the cluster after SST gcs_join(a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896595604) failed: 103 (Software caused connection abort)
Jul 21 18:43:00 yak2 mariadbd[778444]:          at galera/src/galera_gcs.hpp:join():218

After that, the cluster is in state "INCONSISTENT" and doesn't work. Neither on the first nor on the second node. I usually resolve this by killing the mariadbd on both nodes, then switching back to sst_method: rsync.

Comment by Ranjan Ghosh [ 2020-07-21 ]

PS: On MDEV-22797 I read some talk about timeouts. Perhaps this is useful: we have 2 technically completely identical clusters (exact same hardware, OS, setup etc.). On one cluster this problem never happens; on the other, it happens very frequently. The only difference is that the database of the problematic cluster (where the log excerpt I just posted is from) is much larger (lots more schemas and tables), so the SST runs much longer (a few minutes). So I find it not unlikely that this has something to do with a timeout or timing problem in general. That might be why it doesn't surface for everyone.

Comment by Alexey [ 2020-07-22 ]

Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 2896576864 last received: 2896576840

What does donor log say about that?

Comment by Ranjan Ghosh [ 2020-07-22 ]

@Alexey. I just looked it up in the logs. First there's a long list of lines like these repeating for all schemas and tables:

Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 Streaming ./schemaxyz/drupal_anyredirect_group.frm to <STDOUT>
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 ...done

And then, finally:

Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:37 Finished backing up non-InnoDB tables and files
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 Streaming ./aria_log.00000001 to <STDOUT>
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37         ...done
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 Streaming ./aria_log_control to <STDOUT>
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37         ...done
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:37 Waiting for log copy thread to read lsn 18807917230561
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 >> log scanned up to (18807917230561)
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming xtrabackup_galera_info
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38         ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 mariabackup: The latest check point (for incremental): '18807917230549'
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: mariabackup: Stopping log copying thread.[00] 2020-07-21 18:42:38 >> log scanned up to (18807917230561)
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: 
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 >> log scanned up to (18807917230561)
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Executing BACKUP STAGE END
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 All tables unlocked
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming ib_buffer_pool to <STDOUT>
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38         ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Backup created in directory '/tmp/tmp.tGaft5xC75/'
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming backup-my.cnf
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38         ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming xtrabackup_info
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38         ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Redo log (from LSN 18807858713703 to 18807917230561) was copied.
Jul 21 18:42:39 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:39 completed OK!
Jul 21 18:42:39 yak1 mariadbd[132692]: 2020-07-21 18:42:39 0 [Note] WSREP: SST sent: a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896576865
Jul 21 18:42:39 yak1 mariadbd[132692]: 2020-07-21 18:42:39 0 [Warning] WSREP: server: yak1 unallowed state transition: joined -> joined
Jul 21 18:42:39 yak1 mariadbd[132692]: mariadbd: /home/buildbot/buildbot/build/mariadb-10.5.4/wsrep-lib/src/server_state.cpp:1356: void wsrep::server_state::state(wsrep:>
Jul 21 18:42:39 yak1 mariadbd[132692]: 200721 18:42:39 [ERROR] mysqld got signal 6 ;
Jul 21 18:42:39 yak1 mariadbd[132692]: This could be because you hit a bug. It is also possible that this binary
Jul 21 18:42:39 yak1 mariadbd[132692]: or one of the libraries it was linked against is corrupt, improperly built,
Jul 21 18:42:39 yak1 mariadbd[132692]: or misconfigured. This error can also be caused by malfunctioning hardware.
Jul 21 18:42:39 yak1 mariadbd[132692]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Jul 21 18:42:39 yak1 mariadbd[132692]: We will try our best to scrape up some info that will hopefully help
Jul 21 18:42:39 yak1 mariadbd[132692]: diagnose the problem, but since we have already crashed,
Jul 21 18:42:39 yak1 mariadbd[132692]: something is definitely wrong and this may fail.
Jul 21 18:42:39 yak1 mariadbd[132692]: Server version: 10.5.4-MariaDB-1:10.5.4+maria~focal
Jul 21 18:42:39 yak1 mariadbd[132692]: key_buffer_size=134217728
Jul 21 18:42:39 yak1 mariadbd[132692]: read_buffer_size=131072
Jul 21 18:42:39 yak1 mariadbd[132692]: max_used_connections=21
Jul 21 18:42:39 yak1 mariadbd[132692]: max_threads=65537
Jul 21 18:42:39 yak1 mariadbd[132692]: thread_count=6
Jul 21 18:42:39 yak1 mariadbd[132692]: It is possible that mysqld could use up to
Jul 21 18:42:39 yak1 mariadbd[132692]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142752116 K  bytes of memory
Jul 21 18:42:39 yak1 mariadbd[132692]: Hope that's ok; if not, decrease some variables in the equation.
Jul 21 18:42:39 yak1 mariadbd[132692]: Thread pointer: 0x7f5a90000c58
Jul 21 18:42:39 yak1 mariadbd[132692]: Attempting backtrace. You can use the following information to find out
Jul 21 18:42:39 yak1 mariadbd[132692]: where mysqld died. If you see no messages after this, something went
Jul 21 18:42:39 yak1 mariadbd[132692]: terribly wrong...
Jul 21 18:42:39 yak1 mariadbd[132692]: stack_bottom = 0x7f5ab5ffaaa8 thread_stack 0x49000
Jul 21 18:42:39 yak1 -wsrep-sst-donor[177111]: Total time on donor: 0 seconds
Jul 21 18:42:39 yak1 -wsrep-sst-donor[177113]: Cleaning up temporary directories
Jul 21 18:42:39 yak1 mariadbd[132692]: /usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x5559d678d9c2]
Jul 21 18:42:39 yak1 mariadbd[132692]: ??:0(my_print_stacktrace)[0x5559d61db495]
Jul 21 18:42:39 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f641ef1d3c0]
Jul 21 18:42:39 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f641ea2418b]
Jul 21 18:42:39 yak1 mariadbd[132692]: ??:0(gsignal)[0x7f641ea03859]
Jul 21 18:42:39 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libc.so.6(+0x25729)[0x7f641ea03729]
Jul 21 18:42:39 yak1 mariadbd[132692]: ??:0(__assert_fail)[0x7f641ea14f36]
Jul 21 18:42:40 yak1 mariadbd[132692]: /usr/sbin/mariadbd(_ZN5wsrep12server_state5stateERNS_11unique_lockINS_5mutexEEENS0_5stateE+0x657)[0x5559d6810397]
Jul 21 18:42:40 yak1 mariadbd[132692]: ??:0(wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state))[0x5559d6810b5f]
Jul 21 18:42:40 yak1 mariadbd[132692]: ??:0(wsrep::server_state::sst_sent(wsrep::gtid const&, int))[0x5559d64a837c]
Jul 21 18:42:40 yak1 mariadbd[132692]: ??:0(wsrep_sst_donate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, wsrep::gtid const&, >
Jul 21 18:42:40 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f641ef11609]
Jul 21 18:42:40 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f641eb00103]
Jul 21 18:42:40 yak1 mariadbd[132692]: Trying to get some variables.
Jul 21 18:42:40 yak1 mariadbd[132692]: Some pointers may be invalid and cause the dump to abort.
Jul 21 18:42:40 yak1 mariadbd[132692]: Query (0x0):
Jul 21 18:42:40 yak1 mariadbd[132692]: Connection ID (thread ID): 0
Jul 21 18:42:40 yak1 mariadbd[132692]: Status: NOT_KILLED
Jul 21 18:42:40 yak1 mariadbd[132692]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_inter>
Jul 21 18:42:40 yak1 mariadbd[132692]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
Jul 21 18:42:40 yak1 mariadbd[132692]: information that should help you find out what is causing the crash.
Jul 21 18:42:40 yak1 mariadbd[132692]: Writing a core file...
Jul 21 18:42:40 yak1 mariadbd[132692]: Working directory at /var/lib/mysql
Jul 21 18:42:40 yak1 mariadbd[132692]: Resource Limits:
Jul 21 18:42:40 yak1 mariadbd[132692]: Limit                     Soft Limit           Hard Limit           Units
Jul 21 18:42:40 yak1 mariadbd[132692]: Max cpu time              unlimited            unlimited            seconds
Jul 21 18:42:40 yak1 mariadbd[132692]: Max file size             unlimited            unlimited            bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max data size             unlimited            unlimited            bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max stack size            8388608              unlimited            bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max core file size        0                    unlimited            bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max resident set          unlimited            unlimited            bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max processes             513841               513841               processes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max open files            16364                16364                files
Jul 21 18:42:40 yak1 mariadbd[132692]: Max locked memory         65536                65536                bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max address space         unlimited            unlimited            bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max file locks            unlimited            unlimited            locks
Jul 21 18:42:40 yak1 mariadbd[132692]: Max pending signals       513841               513841               signals
Jul 21 18:42:40 yak1 mariadbd[132692]: Max msgqueue size         819200               819200               bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max nice priority         0                    0
Jul 21 18:42:40 yak1 mariadbd[132692]: Max realtime priority     0                    0
Jul 21 18:42:40 yak1 mariadbd[132692]: Max realtime timeout      unlimited            unlimited            us
Jul 21 18:42:40 yak1 mariadbd[132692]: Core pattern: core
Jul 21 18:42:41 yak1 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
Jul 21 18:42:41 yak1 systemd[1]: mariadb.service: Failed with result 'signal'.

Comment by Alexey [ 2020-07-22 ]

As for the main report: the error log snippets start too late for a certain diagnosis; however, they are consistent with the hypothesis that the donor was heading for an SST bypass (i.e. the actual transfer was going to happen through IST), but for some reason the SST was taking too long (when it should have been momentary), and the joiner, unable to apply IST events yet, was just idling on the connection until it timed out.

Comment by Ranjan Ghosh [ 2020-07-22 ]

Perhaps also related to MDEV-23053 ? I see some similarities...

Comment by Vasyl Saienko [ 2021-05-20 ]

We've hit the same issue:

2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - 2021-05-19  9:48:11 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.misc:2
2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - 2021-05-19  9:48:11 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 5470044 last received: 5457306
2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - 2021-05-19  9:48:11 10 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - 	 at galera/src/replicator_smm.hpp:pop_front():315. Null event.

On the donor side:

2021-05-19 09:48:11,670 - OpenStack-Helm Mariadb - INFO - 2021-05-19  9:48:11 0 [ERROR] WSREP: async IST sender failed to serve tcp://192.168.44.112:4568: ist send failed: asio.system:14', asio error 'write: Bad address': 14 (Bad address)

Comment by Rick Pizzi [ 2023-06-14 ]

The same happened to a customer running 10.5.6.
On donor side, the following message appeared:

2023-06-14  7:59:37 0 [ERROR] WSREP: async IST sender failed to serve ssl://10.29.95.53:4568: Unrecognized writeset version: -1: 71 (Protocol error)

Comment by Jan Lindström [ 2023-08-07 ]

In Galera library version 26.4.15, asio was updated to 1.14.1; maybe that can be tested with a more recent version of the MariaDB server. Does the issue still reproduce?

Comment by Eugene [ 2023-08-07 ]

Just for the record: since we upgraded to mariadb-10.6 (and thus asio-1.26), we haven't seen this particular issue for at least half a year.

Comment by Vasyl Saienko [ 2023-08-14 ]

Hello Eugene,

We tried a more recent mariadb version and still see the same issue during an abnormal shutdown with kill -9.

mysql@mariadb-server-0:/$ dpkg -l |grep galera
ii  galera-4                   26.4.14-ubu2004                   amd64        Replication framework for transactional applications
mysql@mariadb-server-0:/$ dpkg -l |grep maria
ii  libdbd-mariadb-perl        1.11-3ubuntu2                     amd64        Perl5 database interface to the MariaDB/MySQL databases
ii  libmariadb3:amd64          1:10.6.14+maria~ubu2004           amd64        MariaDB database client library
ii  mariadb-backup             1:10.6.14+maria~ubu2004           amd64        Backup tool for MariaDB server
ii  mariadb-client-10.6        1:10.6.14+maria~ubu2004           amd64        MariaDB database client binaries
ii  mariadb-client-core-10.6   1:10.6.14+maria~ubu2004           amd64        MariaDB database core client binaries
ii  mariadb-common             1:10.6.14+maria~ubu2004           all          MariaDB common configuration files
ii  mariadb-server             1:10.6.14+maria~ubu2004           all          MariaDB database server (metapackage depending on the latest version)
ii  mariadb-server-10.6        1:10.6.14+maria~ubu2004           amd64        MariaDB database server binaries
ii  mariadb-server-core-10.6   1:10.6.14+maria~ubu2004           amd64        MariaDB database core server files
ii  mysql-common               1:10.6.14+maria~ubu2004           all          MariaDB database common files (e.g. /etc/mysql/my.cnf)

Comment by Vasyl Saienko [ 2023-08-14 ]

Oh, there is 26.4.15; let me try it and get back with feedback.

Comment by Vasyl Saienko [ 2023-08-15 ]

I can confirm the issue is fixed with galera 26.4.15. Do you know when mariadb with this galera version will be released officially?

Generated at Thu Feb 08 09:17:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.