[MDEV-22796] asio problems and IST failures - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Incomplete
Affects Version/s: 10.4.13
Fix Version/s: N/A
Component/s: Galera
Labels: None
Environment: Linux 4.19.124-gentoo x86_64, AMD EPYC 7451, Intel I350 Gigabit Ethernet

Description

Note: the most recent affected Galera version is 1.0.4/26.4.4, but the same applies to earlier versions, at least 26.4.3.

We have a cluster of 3 nodes working with a large (above 1 TB) data volume. All nodes have the same hardware and software. From time to time the nodes run SST and IST to transfer data.
We noticed that SST frequently fails and has to be restarted because of this error:

[ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)

We tried a larger Galera gcache size, but the error still occurred in some cases and not in others. Moreover, simply restarting mysqld on the receiving node (and thus restarting SST once the donor node had returned to the synced state) sometimes led to a successful SST, with the joiner joining the cluster, and sometimes failed again.

We found that neither the gcache size nor the number of transactions running on the cluster nodes has any effect on the issue.
Enabling or disabling compression of the state transfer data, and attempts to flush logs, had no effect either.
We also noticed that whenever IST failed, the same error message was always logged 20 (+/- 1) minutes after mysqld was started on the joining node (that is, 20 minutes after the state transfer request). The relevant log lines were:

2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231

...

WSREP_SST: [INFO] Evaluating /usr/bin/mariabackup --innobackupex --defaults-file=/etc/mysql/my.cnf     $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2> /var/lib/mysql//mariabackup.backup.log | /usr/bin/zstd --fast=3 | socat -u stdio TCP:***.*.***.*:4444; RC=( ${PIPESTATUS[@]} ) (20200601 21:50:53.977)

2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve tcp://***.*.***.*:4568: ist send failed: asio.system:110', asio error 'write: Connection timed out': 110 (Connection timed out)

     at galera/src/ist.cpp:send():887

2020-06-01 22:10:59 0 [Note] WSREP: async IST sender served
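
To check the interval, here is a minimal Python sketch that computes the time between the "IST sender" note and the asio write timeout, using the two timestamps from the log lines above. The function name `elapsed_seconds` is only illustrative and is not part of Galera or MariaDB.

```python
from datetime import datetime

# Timestamp layout used by the mysqld log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S"

def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

# Values copied from the log excerpt in this report.
ist_sender_started = "2020-06-01 21:50:42"
ist_sender_failed = "2020-06-01 22:10:59"

delta = elapsed_seconds(ist_sender_started, ist_sender_failed)
print(f"{delta:.0f} s = {delta / 60:.1f} min")
```

For this excerpt the gap is 1217 seconds, a little over 20 minutes, which matches the interval described above.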

Whenever these last two lines (the error and the note) appeared in the mysqld log file, the state transfer always ended in failure, with the following errors logged:

2020-06-02  1:49:58 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:232217728, protocol version: 5

2020-06-02  1:49:58 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.system:104

2020-06-02  1:49:58 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 232234231 last received: 232221423

2020-06-02  1:49:58 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)

     at galera/src/replicator_smm.hpp:pop_front():314. Null event.
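
The numbers in these messages show how far the IST got before the connection was reset. Below is a small Python sketch of that arithmetic, assuming the range in "IST sender 232217729 -> 232234231" is inclusive at both ends (an assumption, although it is consistent with the initial position 232217728). The helper name `ist_gap` is hypothetical.

```python
def ist_gap(first: int, last: int, last_received: int) -> tuple[int, int, int]:
    """Return (requested, received, missing) write-set counts.

    Assumes the IST range first..last is inclusive at both ends.
    """
    requested = last - first + 1
    received = last_received - first + 1
    return requested, received, requested - received

# Sequence numbers copied from the log lines in this report.
requested, received, missing = ist_gap(
    first=232217729, last=232234231, last_received=232221423
)
print(f"requested={requested} received={received} missing={missing}")
```

Under that assumption, 3695 of the 16503 requested write sets arrived before the failure, leaving 12808 undelivered.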

So the open questions are:

How can such situations be avoided? Nodes require a manual restart after a failed transfer.
Why is this asio error always logged 20 minutes after the state transfer starts?
The reported failure is 'Connection timed out', yet the connection is stable and no service or monitoring tool reports connection issues.
The issue is intermittent: it appears on some restarts and not on others. This was also true of the previous version of the Galera library. No configuration change seems to cause or solve it.
We also noted that the asio library used by Galera is version 1.10.8 and cannot be changed, although version 1.18 has already been released.
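
For reference when reading the logs, the numeric codes in the messages above (110, 104 and 71) can be mapped to Linux errno names. The Python sketch below hard-codes the Linux values rather than querying the local platform, and assumes that `asio.system:N` carries the operating system errno, which is how asio's system category behaves on POSIX systems. The table and the `describe` helper are illustrative only.

```python
# Linux errno values for the three codes that appear in this report.
LINUX_ERRNO = {
    71: ("EPROTO", "Protocol error"),
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

def describe(code: int) -> str:
    """Return 'NAME (message)' for a known code, or a placeholder otherwise."""
    name, message = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
    return f"{code}: {name} ({message})"

for code in (110, 104, 71):
    print(describe(code))
```

Read this way, the donor's write timed out (110), the joiner then saw the connection reset (104), and the IST receiver reported the result as a protocol error (71).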

Attachments

mcc-mariadb-logs.tar.gz
35 kB
2021-05-20 09:12

Issue Links

relates to

MDEV-22797 galera uses old version of asio library

Open

People

Assignee: Teemu Ollakka

Reporter: Eugene

Votes: 1

Watchers: 9

Dates

Created: 2020-06-04 11:57

Updated: 2024-07-07 22:00

Resolved: 2023-09-11 06:10
