[MDEV-15383] Galera SST Broken on 10.2.13 Created: 2018-02-21  Updated: 2021-04-18  Resolved: 2021-04-18

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.2.13
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Todd Stoffel (Inactive) Assignee: Julius Goryavsky
Resolution: Incomplete Votes: 3
Labels: need_feedback
Environment:

Centos 7, xtrabackup-v2


Attachments: Text File sst_mariabackup_fail.txt    

 Description   

SST scripts are not firing on donor nodes. Joiner fails as though it cannot prepare a backup that never actually got created. No error on the donor as the joiner never seems to make contact. Rolling back to 10.2.12 resolves the issue. So NOT a communications problem.

----------------------------

WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180221 23:37:58.184)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180221 23:37:58.187)
2018-02-21 23:37:58 140308544943872 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.10.10.11' --datadir '/var/lib/mysql/' --parent '17375' --binlog '/var/log/mysql/mariadb-bin'
Read: '(null)'
2018-02-21 23:37:58 140308544943872 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.10.10.11' --datadir '/var/lib/mysql/' --parent '17375' --binlog '/var/log/mysql/mariadb-bin' : 32 (Broken pipe)
2018-02-21 23:37:58 140308545246976 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2018-02-21 23:37:58 140308545246976 [ERROR] Aborting

Error in my_thread_global_end(): 1 threads didn't exit

----------------------------

[root@node2 mysql]# mysql --version
mysql Ver 15.1 Distrib 10.2.13-MariaDB, for Linux (x86_64) using readline 5.1

----------------------------

I also tested this using 10.2.12 SST scripts on 10.2.13 with NO success.

Could this be an lsof loop problem?



 Comments   
Comment by Elena Stepanova [ 2018-02-21 ]

It appears to be a duplicate of MDEV-15254.

Comment by Todd Stoffel (Inactive) [ 2018-02-21 ]

Not necessary a duplicate because previous version of SST scripts do not work on 10.2.13 which seemed to be the work around on MDEV-15254

Comment by Ilshat Karazbaev [ 2018-03-01 ]

Good Day
We've also hit the bug
Log from joining node:
2018-03-01 11:05:17 139928247703296 [Note] WSREP: Flow-control interval: [28, 28]
2018-03-01 11:05:17 139928247703296 [Note] WSREP: Trying to continue unpaused monitor
2018-03-01 11:05:17 139928247703296 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 0)
2018-03-01 11:05:17 139929437337344 [Note] WSREP: State transfer required:
Group state: 43f70abb-1d27-11e8-9ae9-078170937149:0
Local state: 00000000-0000-0000-0000-000000000000:-1
2018-03-01 11:05:17 139929437337344 [Note] WSREP: New cluster view: global state: 43f70abb-1d27-11e8-9ae9-078170937149:0, view# 3: Primary, number of nodes: 3, my index: 2, protocol version 3
2018-03-01 11:05:17 139929437337344 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-03-01 11:05:17 139928239310592 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.0.129' --datadir '/var/lib/mysql/' --parent '27035' --binlog '/var/log/mysql/mariadb-bin' '
WSREP_SST: [INFO] Streaming with xbstream (20180301 11:05:17.873)
WSREP_SST: [INFO] Using socat as streamer (20180301 11:05:17.875)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180301 11:05:17.903)
2018-03-01 11:05:19 139928256096000 [Note] WSREP: (46ff4a4e, 'tcp://0.0.0.0:4572') connection to peer 46ff4a4e with addr tcp://192.168.0.129:4572 timed out, no messages seen in PT3S
2018-03-01 11:05:19 139928256096000 [Note] WSREP: (46ff4a4e, 'tcp://0.0.0.0:4572') turning message relay requesting off
Terminated
WSREP_SST: [ERROR] Removing /tmp/tmp.SfeqE6ca6Q/xtrabackup_galera_info file due to signal (20180301 11:06:46.657)
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 143 143 (20180301 11:06:46.661)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180301 11:06:46.663)
2018-03-01 11:06:46 139928239310592 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.0.129' --datadir '/var/lib/mysql/' --parent '27035' --binlog '/var/log/mysql/mariadb-bin'
Read: '(null)'
2018-03-01 11:06:46 139928239310592 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.0.129' --datadir '/var/lib/mysql/' --parent '27035' --binlog '/var/log/mysql/mariadb-bin' : 32 (Broken pipe)
2018-03-01 11:06:46 139929437337344 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2018-03-01 11:06:46 139929437337344 [ERROR] Aborting

Error in my_thread_global_end(): 1 threads didn't exit

Log from donor node:
2018-03-01 11:05:17 139958632097536 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2018-03-01 11:05:17 139958632097536 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 474c5f92-1d27-11e8-98e3-268766cef611
2018-03-01 11:05:17 139958632097536 [Note] WSREP: STATE EXCHANGE: sent state msg: 474c5f92-1d27-11e8-98e3-268766cef611
2018-03-01 11:05:17 139958632097536 [Note] WSREP: STATE EXCHANGE: got state msg: 474c5f92-1d27-11e8-98e3-268766cef611 from 0 (galeratest-n1)
2018-03-01 11:05:17 139958632097536 [Note] WSREP: STATE EXCHANGE: got state msg: 474c5f92-1d27-11e8-98e3-268766cef611 from 1 (garb)
2018-03-01 11:05:17 139958632097536 [Note] WSREP: STATE EXCHANGE: got state msg: 474c5f92-1d27-11e8-98e3-268766cef611 from 2 (galeratest-n2)
2018-03-01 11:05:17 139958632097536 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 2,
members = 2/3 (joined/total),
act_id = 0,
last_appl. = 0,
protocols = 0/8/3 (gcs/repl/appl),
group UUID = 43f70abb-1d27-11e8-9ae9-078170937149
2018-03-01 11:05:17 139958632097536 [Note] WSREP: Flow-control interval: [28, 28]
2018-03-01 11:05:17 139958632097536 [Note] WSREP: Trying to continue unpaused monitor
2018-03-01 11:05:17 139958623704832 [Note] WSREP: New cluster view: global state: 43f70abb-1d27-11e8-9ae9-078170937149:0, view# 3: Primary, number of nodes: 3, my index: 0, protocol version 3
2018-03-01 11:05:17 139958623704832 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-03-01 11:05:17 139958623704832 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-03-01 11:05:17 139958623704832 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2018-03-01 11:05:17 139958674061056 [Note] WSREP: Service thread queue flushed.
2018-03-01 11:05:20 139958640490240 [Note] WSREP: (43f6a5f6, 'tcp://0.0.0.0:4572') turning message relay requesting off
2018-03-01 11:06:48 139958640490240 [Note] WSREP: declaring 44d5dbcd at tcp://192.168.0.152:4572 stable
2018-03-01 11:06:48 139958640490240 [Note] WSREP: forgetting 46ff4a4e (tcp://192.168.0.129:4572)
2018-03-01 11:06:48 139958640490240 [Note] WSREP: Node 43f6a5f6 state prim
2018-03-01 11:06:48 139958640490240 [Note] WSREP: view(view_id(PRIM,43f6a5f6,4) memb

{ 43f6a5f6,0 44d5dbcd,0 }

joined {
} left {
} partitioned

{ 46ff4a4e,0 }

)
2018-03-01 11:06:48 139958640490240 [Note] WSREP: save pc into disk
2018-03-01 11:06:48 139958640490240 [Note] WSREP: forgetting 46ff4a4e (tcp://192.168.0.129:4572)
2018-03-01 11:06:48 139958632097536 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2018-03-01 11:06:48 139958632097536 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 7dd64ee1-1d27-11e8-ae63-1e794acab87f
2018-03-01 11:06:48 139958632097536 [Note] WSREP: STATE EXCHANGE: sent state msg: 7dd64ee1-1d27-11e8-ae63-1e794acab87f
2018-03-01 11:06:48 139958632097536 [Note] WSREP: STATE EXCHANGE: got state msg: 7dd64ee1-1d27-11e8-ae63-1e794acab87f from 0 (galeratest-n1)
2018-03-01 11:06:48 139958632097536 [Note] WSREP: STATE EXCHANGE: got state msg: 7dd64ee1-1d27-11e8-ae63-1e794acab87f from 1 (garb)
2018-03-01 11:06:48 139958632097536 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 3,
members = 2/2 (joined/total),
act_id = 0,
last_appl. = 0,
protocols = 0/8/3 (gcs/repl/appl),
group UUID = 43f70abb-1d27-11e8-9ae9-078170937149
2018-03-01 11:06:48 139958632097536 [Note] WSREP: Flow-control interval: [23, 23]
2018-03-01 11:06:48 139958632097536 [Note] WSREP: Trying to continue unpaused monitor
2018-03-01 11:06:48 139958623704832 [Note] WSREP: New cluster view: global state: 43f70abb-1d27-11e8-9ae9-078170937149:0, view# 4: Primary, number of nodes: 2, my index: 0, protocol version 3
2018-03-01 11:06:48 139958623704832 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-03-01 11:06:48 139958623704832 [Note] WSREP: REPL Protocols: 8 (3, 2)
2018-03-01 11:06:48 139958623704832 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2018-03-01 11:06:48 139958674061056 [Note] WSREP: Service thread queue flushed.
2018-03-01 11:06:54 139958640490240 [Note] WSREP: cleaning up 46ff4a4e (tcp://192.168.0.129:4572)

Comment by Daniel Black [ 2018-03-02 ]

Can you try the fix from MDEV-15254?

https://jira.mariadb.org/secure/attachment/45259/45259_wsrep_sst_xtrabackup-v2.sh

Comment by Ilshat Karazbaev [ 2018-03-02 ]

I have already used script from mariadb 10.2.8 and it works fine. Thank you. I just want to have this bug fixed in the next release

Comment by Daniel Black [ 2018-03-02 ]

It will be. MDEV-15254 already lists 10.2.14 as the fixed version.

toddstoffel your description of previous versions do not work is a bit vague. Does the fixed script complete a SST?

Comment by Todd Stoffel (Inactive) [ 2018-03-02 ]

@danblack,

As I stated previously, we tried the SST scripts from several previous versions. The link you provided is a little late as the client has already rolled back their install to 10.2.12.

Creating custom shell scripts as a "workaround", or offering ANY "workaround" for a product that was broken on release is not really adequate. Let's hope 10.2.14 goes better than 10.2.13 or 10.2.10* went.

*See (https://jira.mariadb.org/browse/MDEV-14256)

Comment by Daniel Black [ 2018-03-02 ]

Agree. FYI MDEV-15409 is blocker

Comment by Aurélien LEQUOY [ 2018-03-20 ]

FOUND : the problem come from --defaults-file with : (look it's not loaded in particular order or combined with an other option).

First I tried to upgrade my galera cluster from 10.2.12 to 10.2.13, i stopped one node to upgrade and put back, but the node fail to IST and then fail to SST. (I pushed a bug about this : https://jira.mariadb.org/browse/MDEV-15453?filter=-2)

Then when we tried to deploy a new cluster :
At begining we asked ourself why the upgrade of password failed with SALT (https://docs.saltstack.com/en/latest/ref/states/all/salt.states.mysql_user.html). in fact I think this trouble and the sst is exactly the same the cannot connect to mysql (particular sign, you can see : ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO)
(or whatever you put in /root/.my.cnf)
It's mean that all stuff witch use salt are falling also. I don't know where it is in the code but i know it's linked with : "libmariadbclient18" and special options of client to connect to MySQL/MariaDB

Work :

ii  libmariadbclient18:amd64          10.1.26-0+deb9u1               amd64        MariaDB database client library

(i don't know why Debian installed this version for mariadb 10.2.13)

Not Work :
(with apt-get upgrade)

ii  libmariadbclient18                10.2.13+maria~stretch          amd64        Virtual package to satisfy external depends

Full listing :
Work :

 dpkg -l | grep -E "mysql|maria" 
ii  libdbd-mysql-perl                 4.041-2                        amd64        Perl5 database interface to the MariaDB/MySQL database
ii  libmariadbclient18:amd64          10.1.26-0+deb9u1               amd64        MariaDB database client library
ii  mariadb-backup-10.2               10.2.13+maria~stretch          amd64        Backup tool for MariaDB server
ii  mariadb-client-10.2               10.2.13+maria~stretch          amd64        MariaDB database client binaries
ii  mariadb-client-core-10.2          10.2.13+maria~stretch          amd64        MariaDB database core client binaries
ii  mariadb-common                    10.2.13+maria~stretch          all          MariaDB database common files (e.g. /etc/mysql/conf.d/mariadb.cnf)
ii  mariadb-server-10.2               10.2.13+maria~stretch          amd64        MariaDB database server binaries
ii  mariadb-server-core-10.2          10.2.13+maria~stretch          amd64        MariaDB database core server files
ii  mysql-common                      10.2.13+maria~stretch          all          MariaDB database common files (e.g. /etc/mysql/my.cnf)
ii  python-mysqldb                    1.3.7-1.1                      amd64        Python interface to MySQL

not work :

ii  libdbd-mysql-perl                 4.041-2                        amd64        Perl5 database interface to the MariaDB/MySQL database
ii  libmariadb3                       10.2.13+maria~stretch          amd64        MariaDB database client library
ii  libmariadbclient18                10.2.13+maria~stretch          amd64        Virtual package to satisfy external depends
ii  mariadb-backup-10.2               10.2.13+maria~stretch          amd64        Backup tool for MariaDB server
ii  mariadb-client-10.2               10.2.13+maria~stretch          amd64        MariaDB database client binaries
ii  mariadb-client-core-10.2          10.2.13+maria~stretch          amd64        MariaDB database core client binaries
ii  mariadb-common                    10.2.13+maria~stretch          all          MariaDB database common files (e.g. /etc/mysql/conf.d/mariadb.cnf)
ii  mariadb-server-10.2               10.2.13+maria~stretch          amd64        MariaDB database server binaries
ii  mariadb-server-core-10.2          10.2.13+maria~stretch          amd64        MariaDB database core server files
ii  mysql-common                      10.2.13+maria~stretch          all          MariaDB database common files (e.g. /etc/mysql/my.cnf)
ii  python-mysqldb                    1.3.7-1.1                      amd64        Python interface to MySQL

Comment by Aurélien LEQUOY [ 2018-03-26 ]

any news ?

Comment by Tortue [ 2018-04-13 ]

Hi,
i tried this with the new version

libmariadbclient18                10.2.14+maria~stretch

the salt.states.mysql_user can't connect with this lib client with any parameters: connection_host, connection_port, connection_user, connection_pass or connection_default_file
but it's working with :

libmariadbclient18:amd64          10.1.26-0+deb9u1

About the SST all working well with lib "libmariadbclient18 10.2.14+maria~stretch"

Comment by nbulusanjr [ 2018-07-05 ]

Resolutions:
1. Disable SELinux on all nodes
2. Temporary disable firewall on all nodes
2. Adjust timeout of mysql.service (useful when you already have huge databases)
4. Upgrade to 10.2.16 and install "socat" and "nc" . If you are planning to use "mariabackup" please do install "MariaDB-backup".
5. On your MariaDB config under galera section
wsrep_debug=1
#mariabackup or xtrabackup-v2
wsrep_sst_method=mariabackup

5. On your MariaDB config under client section
user=root
#enclosed your password with double qoutes here
password="password-here"

Hope someone find it useful.
Actually this one gave me a headache for almost a week.

I haven't tested yet if this will work with user with this server privileges: RELOAD , PROCESS , LOCK TABLES , REPLICATION SLAVE , REPLICATION CLIENT .

I find it strange that "wsrep_sst_auth" is not being used by mariabackup and xtrabackup-v2 even if you have defined it in your configuration, according to the MariaDB documentation this should be sufficient.

Comment by brianr [ 2018-10-09 ]

Hi

Sure it doesnt use wsrep_sst_auth ?

I have just upgraded from 10.2.9 -A 10.2.18 and would claim that mariabackup indeed does use the wsrep_sst_auth values.

I admit that wsrep_sst_method=mariabackup is still, however, not working in any way. It tries to read and use the info from that _auth variable. But if you enclose it in ".." it seems to end up trying to set user = nothing .... And when removing the double quotes, it just fails with "access denied" ...
Frustrating as he11.

I'm still trying to figure out where "I took a wrong turn", because it *did" work in my test environment ...

Comment by Aurélien LEQUOY [ 2018-10-09 ]

mariabackup working for me. don't this problem anymore in 10.3.9

Comment by brianr [ 2018-10-22 ]

Never mind - I found the reason for the last two weeks fighting with my Galera/MariaDB after the latest (or, first ever) upgrade since going into production!

I have now created a /etc/systemd/system/mariadb.service.d/timeout.conf
with the sole content:
[Service]
TimeoutSec=infinity "Yes, "I N F I N I T Y" , not "zero" like so many other notes say, no: infinity.

- Would it have been an idea to write something in startup log, to someone using an OLD value for this parameter, if you decide to change the design?? (And that's a low kick to the systemd devels, not the MariaDB ones, you go that, right )
- It could have saved us a LOT of headache, until I stumbled over somewhere stating that "systemd" from versionx is using "infinity and not 0 ..

Fairly on the "uncool" side of documentation, as documentation goes

Could I have suggestions on what to troubleshoot on my GaleraCluster, because I am lost for ideas now.
I have a test env with 3 nodes (ubuntu (16.04) / MariaDB (10.2.18) ) It works fine (On VirtualBox).

But I have a production system with similar versions, on VMWare ESX as only difference, and it just cannot do SST with neither rsync nor mariabackup.
It used to be able to do SST (with the default option rsync), before the latest apt update/apt upgrade and reboots, then Galera died (apt sessions were successful, on all three nodes)

Now neither rsync nor mariabackup can do SST !?

It just gets connection aborted/ pipe broken. I suspect it to be some kind of timeout function somewhere, but can't find where it could be,
Both servers are on same VMware host, so network is not the cause- I have tried sending 10GB over a "nc" stream on port 4444, and it finished successfully, but MariaDB/Galera fails doing "the same", so I suspect reason is somewhere in OS/MariaDB/Galera settings.

On DONOR log it will die on " -wsrep-sst-donor: socat[10625] E write(6, 0x68dd30, 8192): Broken pipe"
On JOINER log it will report:
"Oct 15 16:20:00 DB-6 -wsrep-sst-joiner: Waiting for SST streaming to complete!
Oct 15 16:20:05 DB-6 systemd[1]: mariadb.service: Start operation timed out. Terminating."

All ports required, are open in firewall on both servers
the libmariadbclient mentioned here in ticket is apparently libmysqlclient20:amd64 (5.7.23-0ubuntu0.16.04.1)

I will try to paste the results of trying an (mariadb) SST run here, and hope some one will feel like running through and maybe suggest what I could look for next?

sst_mariabackup_fail.txt

Comment by Daniel Black [ 2018-10-22 ]

systemd code around TimeoutSec (before and after this change, https://github.com/systemd/systemd/commit/fb27be3f60147685fe5272ce3fdb1384f92390b5) seems to indicate 0 is treated as infinity.

Comment by Daniel Black [ 2018-10-22 ]

MDEV-15607 will extend the timeout of systemd without a configuration change in systemd v236+

Comment by Daniel Black [ 2018-10-22 ]

So the joiner is timing out and causing the donor to die. Can you show `systemctl show mariadb.service` on DB-6?

Comment by brianr [ 2018-10-23 ]

Thanks for pointing to the code, Daniel - accustomed to Oracle, I completely forgot that's possible with OpenSource ....

However, followiing multiple boots, trying to figure out what's causing my grief, I'd say that the TimeoutSec=0 in my /etc/mysql/conf.d/galera.cnf definitely did not do the trick, until I edited it to TimeoutSec=infinity, reloaded systemd and started MariaDB.

The only thing I wonder is, will systemd be able to ignore a change to its config files if one does not run the daemon-reload? That could explain it - because I don't recall if I remembered doing exactly that reload or simply rebooted Ubuntu...

Yes, the Joiner timed out, but only caused the Join to fail. The DONOR was unaffected the whole time.
Before setting timoutsec=infinity, I remember checking the "systemctl show mariadb.service|grep -i timeout" and it responded "1m 30s" - despite that config file having TimeoutSec=0

Thanks,
Brian

Comment by brianr [ 2018-10-23 ]

Ahh, you said MDEV-15607 , in v236+ will extend timeout w/o config change, so I tried cheking my systemd version:
it is apparently "229-4ubuntu21.4" - would that explain something too ?

Could you please reveal how I check that I would need 236+, to have this fix, I don't think I can see references in the https://jira.mariadb.org/browse/MDEV-15607 ?

I just figured out that 236+ seems to be for Ubuntu 18, so I can assume that this will not be fixed for the Ubuntu 16.04 that I'm using?

Comment by Daniel Black [ 2018-10-23 ]

v236 is from when the systemd feature was added that extends timeouts: https://github.com/systemd/systemd/commit/a327431bd168b2f327f3cd422379e213c643f2a5 . The Timeoutsec= wil suffice in the meantime.

Comment by Jan Lindström (Inactive) [ 2021-03-17 ]

Hi, is this issue resolved if not can you try with more recent version of mariadb 10.2 server and Galera library,

Generated at Thu Feb 08 08:20:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.