Note: the most recent affected Galera version is 1.0.4/26.4.4, but the same applies to earlier versions, at least 26.4.3.
We have a cluster of 3 nodes working with a large (above 1 TB) data volume. All nodes have the same hardware and software. From time to time the nodes run SST and IST to transfer data.
We noticed that SST frequently fails and has to be restarted because of this error:
[ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
We tried setting a bigger Galera gcache size, but in some cases the error happened again and in others it did not. Moreover, sometimes a simple restart of mysqld on the receiving node (which restarts SST once the donor node has returned to the Synced state) led to a successful SST and the joiner managed to join the cluster, but sometimes it failed.
- Neither the gcache size nor the number of transactions happening on the cluster nodes had any effect on the issue.
- Disabling or enabling compression of the state transfer data, as well as attempts to flush logs, had no effect either.
- Whenever IST failed, the same error message could always be found in the log 20 (+/- 1) minutes after starting mysqld on the joining node (that is, 20 minutes after the state transfer request). The relevant log lines were:
2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231
...
WSREP_SST: [INFO] Evaluating /usr/bin/mariabackup --innobackupex --defaults-file=/etc/mysql/my.cnf $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2> /var/lib/mysql//mariabackup.backup.log | /usr/bin/zstd --fast=3 | socat -u stdio TCP:***.*.***.*:4444; RC=( ${PIPESTATUS[@]} ) (20200601 21:50:53.977)
2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve tcp://***.*.***.*:4568: ist send failed: asio.system:110', asio error 'write: Connection timed out': 110 (Connection timed out)
    at galera/src/ist.cpp:send():887
2020-06-01 22:10:59 0 [Note] WSREP: async IST sender served
Whenever these last two lines (error + note) appeared in the mysqld log file, the state transfer ended in failure with the following errors logged:
2020-06-02 1:49:58 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:232217728, protocol version: 5
2020-06-02 1:49:58 0 [ERROR] WSREP: got asio system error while reading IST stream: asio.system:104
2020-06-02 1:49:58 0 [ERROR] WSREP: IST didn't contain all write sets, expected last: 232234231 last received: 232221423
2020-06-02 1:49:58 2 [ERROR] WSREP: Receiving IST failed, node restart required: IST receiver reported failure: 71 (Protocol error)
    at galera/src/replicator_smm.hpp:pop_front():314. Null event.
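To put the failed transfer in numbers, the seqnos from the log lines above can be compared directly. This is a minimal Python sketch, assuming the IST range is inclusive on both ends. The function name `ist_progress` is made up for illustration and is not part of Galera.

```python
def ist_progress(first, last, last_received):
    """Return (total, received, missing) write-set counts for an IST range.

    Assumption: the range first..last is inclusive on both ends.
    """
    total = last - first + 1
    # Clamp at zero in case nothing from the range was received at all.
    received = max(0, last_received - first + 1)
    missing = total - received
    return total, received, missing


# Seqnos copied from the "IST sender" and "IST didn't contain all write sets" lines.
total, received, missing = ist_progress(232217729, 232234231, 232221423)
print(f"received {received} of {total} write sets, {missing} missing")
```

With these values the joiner received 3695 of 16503 write sets, so less than a quarter of the range arrived before the stream broke.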
So, the open questions are:
- How can such situations be avoided? Nodes require a manual restart after a failed transfer!
- Why is this asio error always logged 20 minutes after the state transfer starts?
- The reported failure is 'Connection timed out', although the connection is stable and no service or monitoring tool reports connection issues.
- The issue is intermittent: it appears on some restarts and not on others, and this was also true for the previous version of the Galera library. No configuration change seems to cause or solve it.
- It was also noted that the asio library used by Galera is version 1.10.8 and this version cannot be changed, although version 1.18 is already out.
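The 20-minute interval raised in the questions above can be checked against the donor log timestamps. This is a minimal Python sketch. The helper names `log_time` and `elapsed_minutes` are hypothetical, and the second sample line is shortened because only its timestamp is parsed.

```python
import re
from datetime import datetime

# Matches the leading "YYYY-MM-DD H:MM:SS" stamp of a mysqld log line.
# The hour may be a single digit, as in "2020-06-02 1:49:58".
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(\d{1,2}:\d{2}:\d{2})")


def log_time(line):
    """Parse the timestamp at the start of a log line."""
    m = STAMP.match(line)
    if m is None:
        raise ValueError(f"no timestamp in: {line!r}")
    return datetime.strptime(f"{m.group(1)} {m.group(2)}", "%Y-%m-%d %H:%M:%S")


def elapsed_minutes(start_line, end_line):
    """Minutes between the timestamps of two log lines."""
    return (log_time(end_line) - log_time(start_line)).total_seconds() / 60


# Timestamps taken from the donor log; the message text of the second line is shortened.
start = "2020-06-01 21:50:42 0 [Note] WSREP: IST sender 232217729 -> 232234231"
fail = "2020-06-01 22:10:59 0 [ERROR] WSREP: async IST sender failed to serve"
print(round(elapsed_minutes(start, fail), 1))
```

For the excerpt in this report the result is about 20.3 minutes, which matches the 20 (+/- 1) minute observation. Running the same check over several failed attempts would show whether the interval is really constant.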
Relates to: MDEV-22797 "galera uses old version of asio library" (Open)
In Galera library version 26.4.15, asio is at version 1.14.1, so this could be tested with a more recent version of MariaDB server. Does the issue still reproduce?