Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Incomplete
- Affects Version: 10.4.13
- Fix Version: None
- Environment: Linux 4.19.124-gentoo x86_64, AMD EPYC 7451, Intel I350 Gigabit Ethernet
Description
Note: the most recently affected Galera version is 1.0.4/26.4.4, but the same applies to previous versions, at least 26.4.3.
We have a cluster of 3 nodes working with a large (above 1 TB) data volume. All nodes have the same hardware and software. From time to time the nodes run SST and IST to transfer data.
It was noticed that SST frequently fails and has to be restarted due to the error:
[ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
We tried setting a bigger Galera gcache size, but in some cases the error happened again and in others it did not. Moreover, sometimes a simple restart of mysqld on the receiving node (and thus a restart of SST once the donor node had returned to the synced state) led to a successful SST and the joiner managed to join the cluster, but sometimes it failed again.
- It was noticed that the gcache size and the number of transactions happening on the cluster nodes have no effect on the issue (a sketch of the settings we varied follows the question list below).
- Disabling or enabling compression of the state transfer data, as well as attempts to flush logs, also had no effect.
- It was also noticed that whenever IST failed, the same error message could always be found in the log about 20 (+/- 1) minutes after starting mysqld on the joining node (thus, about 20 minutes after the state transfer request). The error was:
2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231
...
WSREP_SST: [INFO] Evaluating /usr/bin/mariabackup --innobackupex --defaults-file=/etc/mysql/my.cnf $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2> /var/lib/mysql//mariabackup.backup.log | /usr/bin/zstd --fast=3 | socat -u stdio TCP:***.*.***.*:4444; RC=( ${PIPESTATUS[@]} ) (20200601 21:50:53.977)
2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve tcp://***.*.***.*:4568: ist send failed: asio.system:110', asio error 'write: Connection timed out': 110 (Connection timed out)
at galera/src/ist.cpp:send():887
2020-06-01 22:10:59 0 [Note] WSREP: async IST sender served
The appearance of these last two lines (error + note) in the mysqld log file always meant that the state transfer failed, with the following errors logged:
2020-06-02 1:49:58 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:232217728, protocol version: 5
2020-06-02 1:49:58 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.system:104
2020-06-02 1:49:58 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 232234231 last received: 232221423
2020-06-02 1:49:58 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
at galera/src/replicator_smm.hpp:pop_front():314. Null event.
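For orientation, the two TCP channels involved are visible in the excerpts above: the SST stream goes over port 4444 (the mariabackup | zstd | socat pipeline) and IST uses a separate connection on port 4568. A small, hypothetical way to watch both sockets while a joiner is being provisioned (ports taken from the log lines, defaults assumed):
# run on the donor or the joiner during the transfer
ss -tnp | grep -E ':(4444|4568)'
# or keep refreshing to see when either connection disappears
watch -n 5 "ss -tn '( sport = :4444 or dport = :4444 or sport = :4568 or dport = :4568 )'"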
So, the open questions are:
- How can such situations be avoided? Nodes currently require a manual restart after a failed transfer.
- Why is this asio error always logged 20 minutes after the state transfer starts?
- The reported failure is 'Connection timed out', while the connection is stable and no service or monitoring tool reports any connection issues.
- The issue is intermittent: on some restarts it appears and on others it doesn't, and this was also the case with the previous version of the Galera library. No configuration change seems to cause or solve it.
- It was also noted that the asio library used by Galera is version 1.10.8 and cannot be changed, even though version 1.18 is already out.
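For reference, a minimal sketch of the two settings we varied during testing. The file path, the gcache value and the compressor command are examples only (not our exact production values); the [sst] compressor/decompressor options are assumed to be what produces the zstd pipeline visible in the SST log line above.
# example only: enlarge the gcache and toggle SST stream compression
cat >> /etc/mysql/my.cnf <<'EOF'
[mysqld]
wsrep_provider_options="gcache.size=4G"

[sst]
compressor='zstd --fast=3'
decompressor='zstd -d'
EOF
# (restart mysqld afterwards so the new provider options take effect)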
Attachments
Issue Links
- relates to MDEV-22797 "galera uses old version of asio library" (Open)
Activity
PS: On MDEV-22797 I read some talk about timeouts. Perhaps this is useful: we have two technically identical clusters (exact same hardware, OS, setup, etc.). On one cluster this problem never happens; on the other it happens very frequently. The only difference is that the database of the problematic cluster (where the log excerpt I just posted is from) is much larger (many more schemas and tables), so the SST runs considerably longer (a few minutes). So I find it quite plausible that this has something to do with a timeout, or a timing problem in general. That might be why it doesn't surface for everyone.
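If it helps to check the timing angle, the SST duration is already logged on both sides, so the two clusters can be compared directly. A rough sketch (journal unit name is an assumption; the grep patterns are taken from log lines quoted elsewhere in this report):
# donor side: the SST script reports the total donor time
journalctl -u mariadb | grep 'Total time on donor'
# joiner side: the joiner monitor thread reports its total time
journalctl -u mariadb | grep 'Joiner monitor thread ended with total time'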
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 2896576864 last received: 2896576840
What does the donor log say about that?
@Alexey: I just looked it up in the logs. First there is a long list of lines like these, repeating for all schemas and tables:
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 Streaming ./schemaxyz/drupal_anyredirect_group.frm to <STDOUT>
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 ...done
And then, finally:
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:37 Finished backing up non-InnoDB tables and files
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 Streaming ./aria_log.00000001 to <STDOUT>
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 ...done
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 Streaming ./aria_log_control to <STDOUT>
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [01] 2020-07-21 18:42:37 ...done
Jul 21 18:42:37 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:37 Waiting for log copy thread to read lsn 18807917230561
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 >> log scanned up to (18807917230561)
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming xtrabackup_galera_info
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 mariabackup: The latest check point (for incremental): '18807917230549'
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: mariabackup: Stopping log copying thread.[00] 2020-07-21 18:42:38 >> log scanned up to (18807917230561)
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]:
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 >> log scanned up to (18807917230561)
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Executing BACKUP STAGE END
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 All tables unlocked
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming ib_buffer_pool to <STDOUT>
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Backup created in directory '/tmp/tmp.tGaft5xC75/'
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming backup-my.cnf
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Streaming xtrabackup_info
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 ...done
Jul 21 18:42:38 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:38 Redo log (from LSN 18807858713703 to 18807917230561) was copied.
Jul 21 18:42:39 yak1 -innobackupex-backup[136486]: [00] 2020-07-21 18:42:39 completed OK!
Jul 21 18:42:39 yak1 mariadbd[132692]: 2020-07-21 18:42:39 0 [Note] WSREP: SST sent: a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896576865
Jul 21 18:42:39 yak1 mariadbd[132692]: 2020-07-21 18:42:39 0 [Warning] WSREP: server: yak1 unallowed state transition: joined -> joined
Jul 21 18:42:39 yak1 mariadbd[132692]: mariadbd: /home/buildbot/buildbot/build/mariadb-10.5.4/wsrep-lib/src/server_state.cpp:1356: void wsrep::server_state::state(wsrep:>
Jul 21 18:42:39 yak1 mariadbd[132692]: 200721 18:42:39 [ERROR] mysqld got signal 6 ;
Jul 21 18:42:39 yak1 mariadbd[132692]: This could be because you hit a bug. It is also possible that this binary
Jul 21 18:42:39 yak1 mariadbd[132692]: or one of the libraries it was linked against is corrupt, improperly built,
Jul 21 18:42:39 yak1 mariadbd[132692]: or misconfigured. This error can also be caused by malfunctioning hardware.
Jul 21 18:42:39 yak1 mariadbd[132692]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Jul 21 18:42:39 yak1 mariadbd[132692]: We will try our best to scrape up some info that will hopefully help
Jul 21 18:42:39 yak1 mariadbd[132692]: diagnose the problem, but since we have already crashed,
Jul 21 18:42:39 yak1 mariadbd[132692]: something is definitely wrong and this may fail.
Jul 21 18:42:39 yak1 mariadbd[132692]: Server version: 10.5.4-MariaDB-1:10.5.4+maria~focal
Jul 21 18:42:39 yak1 mariadbd[132692]: key_buffer_size=134217728
Jul 21 18:42:39 yak1 mariadbd[132692]: read_buffer_size=131072
Jul 21 18:42:39 yak1 mariadbd[132692]: max_used_connections=21
Jul 21 18:42:39 yak1 mariadbd[132692]: max_threads=65537
Jul 21 18:42:39 yak1 mariadbd[132692]: thread_count=6
Jul 21 18:42:39 yak1 mariadbd[132692]: It is possible that mysqld could use up to
Jul 21 18:42:39 yak1 mariadbd[132692]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142752116 K bytes of memory
Jul 21 18:42:39 yak1 mariadbd[132692]: Hope that's ok; if not, decrease some variables in the equation.
Jul 21 18:42:39 yak1 mariadbd[132692]: Thread pointer: 0x7f5a90000c58
Jul 21 18:42:39 yak1 mariadbd[132692]: Attempting backtrace. You can use the following information to find out
Jul 21 18:42:39 yak1 mariadbd[132692]: where mysqld died. If you see no messages after this, something went
Jul 21 18:42:39 yak1 mariadbd[132692]: terribly wrong...
Jul 21 18:42:39 yak1 mariadbd[132692]: stack_bottom = 0x7f5ab5ffaaa8 thread_stack 0x49000
Jul 21 18:42:39 yak1 -wsrep-sst-donor[177111]: Total time on donor: 0 seconds
Jul 21 18:42:39 yak1 -wsrep-sst-donor[177113]: Cleaning up temporary directories
Jul 21 18:42:39 yak1 mariadbd[132692]: /usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x5559d678d9c2]
Jul 21 18:42:39 yak1 mariadbd[132692]: ??:0(my_print_stacktrace)[0x5559d61db495]
Jul 21 18:42:39 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f641ef1d3c0]
Jul 21 18:42:39 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f641ea2418b]
Jul 21 18:42:39 yak1 mariadbd[132692]: ??:0(gsignal)[0x7f641ea03859]
Jul 21 18:42:39 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libc.so.6(+0x25729)[0x7f641ea03729]
Jul 21 18:42:39 yak1 mariadbd[132692]: ??:0(__assert_fail)[0x7f641ea14f36]
Jul 21 18:42:40 yak1 mariadbd[132692]: /usr/sbin/mariadbd(_ZN5wsrep12server_state5stateERNS_11unique_lockINS_5mutexEEENS0_5stateE+0x657)[0x5559d6810397]
Jul 21 18:42:40 yak1 mariadbd[132692]: ??:0(wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, wsrep::server_state::state))[0x5559d6810b5f]
Jul 21 18:42:40 yak1 mariadbd[132692]: ??:0(wsrep::server_state::sst_sent(wsrep::gtid const&, int))[0x5559d64a837c]
Jul 21 18:42:40 yak1 mariadbd[132692]: ??:0(wsrep_sst_donate(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, wsrep::gtid const&, >
Jul 21 18:42:40 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f641ef11609]
Jul 21 18:42:40 yak1 mariadbd[132692]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f641eb00103]
Jul 21 18:42:40 yak1 mariadbd[132692]: Trying to get some variables.
Jul 21 18:42:40 yak1 mariadbd[132692]: Some pointers may be invalid and cause the dump to abort.
Jul 21 18:42:40 yak1 mariadbd[132692]: Query (0x0):
Jul 21 18:42:40 yak1 mariadbd[132692]: Connection ID (thread ID): 0
Jul 21 18:42:40 yak1 mariadbd[132692]: Status: NOT_KILLED
Jul 21 18:42:40 yak1 mariadbd[132692]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_inter>
Jul 21 18:42:40 yak1 mariadbd[132692]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
Jul 21 18:42:40 yak1 mariadbd[132692]: information that should help you find out what is causing the crash.
Jul 21 18:42:40 yak1 mariadbd[132692]: Writing a core file...
Jul 21 18:42:40 yak1 mariadbd[132692]: Working directory at /var/lib/mysql
Jul 21 18:42:40 yak1 mariadbd[132692]: Resource Limits:
Jul 21 18:42:40 yak1 mariadbd[132692]: Limit Soft Limit Hard Limit Units
Jul 21 18:42:40 yak1 mariadbd[132692]: Max cpu time unlimited unlimited seconds
Jul 21 18:42:40 yak1 mariadbd[132692]: Max file size unlimited unlimited bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max data size unlimited unlimited bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max stack size 8388608 unlimited bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max core file size 0 unlimited bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max resident set unlimited unlimited bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max processes 513841 513841 processes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max open files 16364 16364 files
Jul 21 18:42:40 yak1 mariadbd[132692]: Max locked memory 65536 65536 bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max address space unlimited unlimited bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max file locks unlimited unlimited locks
Jul 21 18:42:40 yak1 mariadbd[132692]: Max pending signals 513841 513841 signals
Jul 21 18:42:40 yak1 mariadbd[132692]: Max msgqueue size 819200 819200 bytes
Jul 21 18:42:40 yak1 mariadbd[132692]: Max nice priority 0 0
Jul 21 18:42:40 yak1 mariadbd[132692]: Max realtime priority 0 0
Jul 21 18:42:40 yak1 mariadbd[132692]: Max realtime timeout unlimited unlimited us
Jul 21 18:42:40 yak1 mariadbd[132692]: Core pattern: core
Jul 21 18:42:41 yak1 systemd[1]: mariadb.service: Main process exited, code=killed, status=6/ABRT
Jul 21 18:42:41 yak1 systemd[1]: mariadb.service: Failed with result 'signal'.
As for the main report: the error log snippets start too late for a certain diagnosis. However, they are consistent with the hypothesis that the donor was heading for an SST bypass (i.e. the actual transfer was going to be done through IST), but for some reason the SST phase was taking too long (when it should have been momentary), and the joiner, unable to apply IST events yet, was just idling on the connection until it timed out.
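If someone can reproduce this on a test pair, one way to check that hypothesis is to correlate the donor's IST sender messages with the joiner's IST receiver messages by timestamp. A rough sketch (the error log path is an assumption; adjust to your setup):
# donor: when did the IST sender start serving and when did it give up?
grep -E 'WSREP: (async )?IST sender' /var/log/mysql/error.log
# joiner: when was the state transfer requested and when did reading the IST stream fail?
grep -E 'WSREP: (State transfer|Receiving IST|got asio system error)' /var/log/mysql/error.log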
We've run into the same issue:
2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - 2021-05-19 9:48:11 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.misc:2
2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - 2021-05-19 9:48:11 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 5470044 last received: 5457306
2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - 2021-05-19 9:48:11 10 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
2021-05-19 09:48:11,721 - OpenStack-Helm Mariadb - INFO - at galera/src/replicator_smm.hpp:pop_front():315. Null event.
On the donor side:
2021-05-19 09:48:11,670 - OpenStack-Helm Mariadb - INFO - 2021-05-19 9:48:11 0 [ERROR] WSREP: async IST sender failed to serve tcp://192.168.44.112:4568: ist send failed: asio.system:14', asio error 'write: Bad address': 14 (Bad address)
The same happened to a customer running 10.5.6.
On the donor side, the following message appeared:
2023-06-14 7:59:37 0 [ERROR] WSREP: async IST sender failed to serve ssl://10.29.95.53:4568: Unrecognized writeset version: -1: 71 (Protocol error)
Galera library version 26.4.15 updates asio to 1.14.1; maybe that can be tested with a more recent version of the MariaDB server. Does the issue still reproduce?
For what it's worth: since we upgraded to mariadb-10.6 (and thus asio-1.26), we haven't seen this particular issue for at least half a year.
Hello Eugene,
We tried a more recent MariaDB version and still see the same issue during an abnormal shutdown with kill -9 (see the sketch after the package list below).
mysql@mariadb-server-0:/$ dpkg -l |grep galera
ii galera-4 26.4.14-ubu2004 amd64 Replication framework for transactional applications
mysql@mariadb-server-0:/$ dpkg -l |grep maria
ii libdbd-mariadb-perl 1.11-3ubuntu2 amd64 Perl5 database interface to the MariaDB/MySQL databases
ii libmariadb3:amd64 1:10.6.14+maria~ubu2004 amd64 MariaDB database client library
ii mariadb-backup 1:10.6.14+maria~ubu2004 amd64 Backup tool for MariaDB server
ii mariadb-client-10.6 1:10.6.14+maria~ubu2004 amd64 MariaDB database client binaries
ii mariadb-client-core-10.6 1:10.6.14+maria~ubu2004 amd64 MariaDB database core client binaries
ii mariadb-common 1:10.6.14+maria~ubu2004 all MariaDB common configuration files
ii mariadb-server 1:10.6.14+maria~ubu2004 all MariaDB database server (metapackage depending on the latest version)
ii mariadb-server-10.6 1:10.6.14+maria~ubu2004 amd64 MariaDB database server binaries
ii mariadb-server-core-10.6 1:10.6.14+maria~ubu2004 amd64 MariaDB database core server files
ii mysql-common 1:10.6.14+maria~ubu2004 all MariaDB database common files (e.g. /etc/mysql/my.cnf)
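For completeness, a rough sketch of the scenario in which we hit it. The commands are illustrative only; the service name and environment will differ (in a Kubernetes deployment like ours the pod is simply killed), the point being that an unclean kill forces the node to rejoin via SST/IST:
# simulate the abnormal shutdown on one node of a healthy cluster
kill -9 "$(pidof mariadbd)"
# restart the node so it rejoins and requests a state transfer
systemctl start mariadb
# watch the rejoining node for the IST failure signature
journalctl -u mariadb -f | grep -E 'WSREP: (Receiving IST|IST didn|got asio system error)'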
I confirm that with Galera 26.4.15 the issue is fixed. Do you know when MariaDB with this Galera version will be released officially?
This also happens for me very often when I just try to start a node. I start the first node: it works, the DB is running. Then I try to start the second node (with a completely empty /var/lib/mysql folder). The mariabackup SST starts, but on the second node I find this in the logs:
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 3 [Note] WSREP: SST received: a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896595604
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Joiner monitor thread ended with total time 482 sec
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Installed new state from SST: a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896595604
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Cert. index preload up to 2896595604
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: ####### IST applying starts with 2896595605
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: ####### IST current seqno initialized to 2896576828
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Receiving IST... 0.0% ( 0/37 events) complete.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: IST preload starting at 2896576828
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Service thread queue flushed.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:2896576827, protocol version: 5
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.system:104
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 2896576864 last received: 2896576840
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
Jul 21 18:43:00 yak2 mariadbd[778444]: at galera/src/replicator_smm.hpp:pop_front():315. Null event.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closing send monitor...
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closed send monitor.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: terminating thread
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: joining thread
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: closing backend
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: PC protocol downgrade 1 -> 0
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: view((empty))
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: gcomm: closed
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: New SELF-LEAVE.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Flow-control interval: [0, 0]
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 2896595604)
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 0 [Note] WSREP: RECV thread exiting 0: Success
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: recv_thread() joined.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closing replication queue.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Closing slave action queue.
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Note] WSREP: Cert. index preloaded up to 2896576840
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [ERROR] WSREP: gcs/src/gcs.cpp:s_join():929: Sending JOIN failed: -103 (Software caused connection abort).
Jul 21 18:43:00 yak2 mariadbd[778444]: 2020-07-21 18:43:00 2 [Warning] WSREP: Failed to JOIN the cluster after SST gcs_join(a6a91495-1b7f-11ea-a0be-7f78e01c49de:2896595604) failed: 103 (Software caused connection abort)
Jul 21 18:43:00 yak2 mariadbd[778444]: at galera/src/galera_gcs.hpp:join():218
After that, the cluster is in state "INCONSISTENT" and doesn't work, neither on the first nor on the second node. I usually resolve this by killing mariadbd on both nodes and then switching back to sst_method: rsync.
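In case it helps anyone else, a minimal sketch of that workaround as I apply it (the file location is an example; note that rsync SST blocks the donor for the whole transfer, unlike mariabackup):
# example only: switch the SST method back to rsync
cat >> /etc/mysql/my.cnf <<'EOF'
[mysqld]
wsrep_sst_method=rsync
EOF
# then restart mariadbd on the affected nodes and let the second node rejoin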