[MDEV-20276] MariaDB-Galera 10.3 -> 10.4 | Startup time problem (hangs) Created: 2019-08-07  Updated: 2019-12-11  Resolved: 2019-12-11

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST, Replication, Server
Affects Version/s: 10.4.7
Fix Version/s: N/A

Type: Bug
Priority: Critical
Reporter: Florian Strankowski
Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug
Votes: 1
Labels: galera
Environment:

CentOS 7


Attachments: Text File Galera_Cluster_10.4_node_restart.log     HTML File messages     Text File sql.txt    

 Description   

We've upgraded from MariaDB 10.3 to the latest version of 10.4 (10.4.7). After the upgrade completed, we noticed some strange behavior during cluster initialisation. No matter which node of our 7-node cluster we start, it takes approximately 8 minutes for the node to join the cluster, every time. Another test cluster of ours also has increased joining and sync times, but I'll stick to our production cluster here.

The first hang occurs at the following point (see attached log):

Aug 7 12:59:36 MariaDB-PROD-001 mysqld: 2019-08-07 12:59:36 2 [Note] WSREP: ####### drain monitors upto -1

Getting past this point takes 4 minutes and 33 seconds (273 seconds).

The second hang during the join phase occurs here:

Aug 7 13:04:10 MariaDB-PROD-001 mysqld: 2019-08-07 13:04:10 0 [Note] WSREP: ####### drain monitors upto 0
Aug 7 13:04:10 MariaDB-PROD-001 systemd: Started MariaDB 10.4.7 database server.
Aug 7 13:04:12 MariaDB-PROD-001 mysqld: 2019-08-07 13:04:12 0 [Note] InnoDB: Buffer pool(s) load completed at 190807 13:04:12
Aug 7 13:08:43 MariaDB-PROD-001 mysqld: 2019-08-07 13:08:43 0 [Note] WSREP: REPL Protocols: 10 (5, 3)

This one took 4 minutes and 31 seconds (271 seconds), almost exactly as long as the first one, so the similarity may be worth keeping an eye on.

The same problem applies to every node of our 7-node cluster; our second cluster is affected as well.
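For anyone who wants to locate such stalls in a longer log, here is a minimal Python sketch that scans syslog-style lines and reports silent gaps above a threshold. The names `find_gaps`, `threshold_s` and `year` are hypothetical and not part of any MariaDB or Galera tooling; the year is an assumption because the syslog prefix omits it, and English month abbreviations are assumed.

```python
import re
from datetime import datetime

# Matches the syslog prefix, e.g. "Aug 7 12:59:36" or "Aug 16 08:54:31".
SYSLOG_TS = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def find_gaps(lines, year=2019, threshold_s=60):
    """Return (seconds, previous_line, next_line) for each silent gap.

    `year` is an assumption: the syslog prefix does not carry one.
    """
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        m = SYSLOG_TS.match(line)
        if not m:
            continue  # skip blank or non-syslog lines
        stamp = f"{year} {m.group(1)} {m.group(2)} {m.group(3)}"
        ts = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta >= threshold_s:
                gaps.append((int(delta), prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps


# The four log lines quoted for the second hang.
sample = [
    "Aug 7 13:04:10 MariaDB-PROD-001 mysqld: 2019-08-07 13:04:10 0 [Note] WSREP: ####### drain monitors upto 0",
    "Aug 7 13:04:10 MariaDB-PROD-001 systemd: Started MariaDB 10.4.7 database server.",
    "Aug 7 13:04:12 MariaDB-PROD-001 mysqld: 2019-08-07 13:04:12 0 [Note] InnoDB: Buffer pool(s) load completed at 190807 13:04:12",
    "Aug 7 13:08:43 MariaDB-PROD-001 mysqld: 2019-08-07 13:08:43 0 [Note] WSREP: REPL Protocols: 10 (5, 3)",
]

for seconds, before, after in find_gaps(sample):
    print(f"{seconds}s of silence between:\n  {before}\n  {after}")
```

Run against the full attached log instead of `sample`, this should list every stall longer than one minute together with the log lines on either side of it.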

Regards



 Comments   
Comment by Thorsten Krohn [ 2019-08-16 ]

Hi,
I'm a colleague of Florian and have attached a full log of stopping and starting one node of a 7-node cluster.
It takes about 10 minutes to bring a node back online. In the log you can see that there are no errors or warnings. The startup process stalls twice for a couple of minutes. If I remove the database directory and restart the node, it is back after a minute. So SST seems to work, but IST does not. We use multicast here, but we also have a smaller cluster with 3 nodes and unicast, and we see the same problem there.

Comment by Thorsten Krohn [ 2019-09-06 ]

Hi,
is there any news on this bug?
We keep running into trouble in our virtualization environment. Whenever we move nodes around, we have problems bringing the node back in sync.

Can we provide more information to help analyze the problem?

Comment by Jan Lindström (Inactive) [ 2019-09-06 ]

To me it looks like InnoDB is loading the buffer pool contents from disk:

Aug 16 08:54:31 MariaDB-PROD-006 mysqld: 2019-08-16  8:54:31 0 [Note] WSREP: ####### drain monitors upto 0
Aug 16 08:54:31 MariaDB-PROD-006 systemd: Started MariaDB 10.4.7 database server.
Aug 16 08:54:53 MariaDB-PROD-006 mysqld: 2019-08-16  8:54:53 0 [Note] InnoDB: Buffer pool(s) load completed at 190816  8:54:53
Aug 16 08:55:48 MariaDB-PROD-006 mysqld: 2019-08-16  8:55:48 50 [Warning] Aborted connection 50 to db: 'customer_location' user: 'qlik' host: '172.17.64.11' (Got an error reading communication packets)
Aug 16 08:59:57 MariaDB-PROD-006 mysqld: 2019-08-16  8:59:57 0 [Note] WSREP: REPL Protocols: 10 (5, 3)

In this example it took ~20 s to load the buffer pool. If this is problematic for your system, you could try disabling buffer pool loading at startup.

Comment by Thorsten Krohn [ 2019-09-06 ]

Hi,
thanks for your response. I tested this and disabled dumping/loading of the buffer pools, but it doesn't change anything.

Comment by Jan Lindström (Inactive) [ 2019-09-06 ]

Can you try running with --wsrep-debug=server and provide me with a full error log?

Comment by Florian Strankowski [ 2019-09-18 ]

I've attached the startup log with wsrep-debug enabled. Sorry for the delay.

Comment by Thorsten Krohn [ 2019-10-10 ]

Since we have repeatedly encountered problems with multicast operation here, I have now changed the cluster to unicast operation. Now the nodes start normally again, without hangs.
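For readers checking their own setup, here is a small Python sketch that tests whether a `wsrep_provider_options` string enables Galera multicast, which is controlled by the `gmcast.mcast_addr` provider option. The helper names and the example option strings are hypothetical; the actual configuration of the clusters in this report is not shown in the ticket.

```python
def parse_provider_options(options):
    """Split a 'key=value; key=value' provider options string into a dict."""
    result = {}
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled semicolons
        key, _, value = item.partition("=")
        result[key.strip()] = value.strip()
    return result


def uses_multicast(options):
    """True if gmcast.mcast_addr is set to a non-empty value."""
    return bool(parse_provider_options(options).get("gmcast.mcast_addr"))


# Hypothetical example values, not taken from the reporter's configuration.
multicast_example = "gcache.size=1G; gmcast.mcast_addr=239.192.0.11"
unicast_example = "gcache.size=1G"

print(uses_multicast(multicast_example))
print(uses_multicast(unicast_example))
```

With unicast, the option is simply absent and the nodes are listed explicitly in `wsrep_cluster_address` (for example `gcomm://host1,host2,host3`, with placeholder host names).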
