[MDEV-27457] MariaDB Galera Cluster galera.cache file getting bigger than specified gcache.size Created: 2022-01-10 Updated: 2022-01-24 Resolved: 2022-01-21 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.3.22, 10.4.12 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | ALEJANDRO NÚÑEZ SÁNCHEZ | Assignee: | Daniel Black |
| Resolution: | Done | Votes: | 0 |
| Labels: | Galera, galera.cache | ||
| Environment: |
Kubernetes, 10.3.22-debian-10-r1 Bitnami Docker image |
||
| Attachments: |
|
| Description |
|
We have a 3-node Galera Cluster running on Kubernetes behind 2 HAProxy PODs, configured so that all queries are executed on the first POD/node of the cluster if available, while the other 2 nodes provide HA (HAProxy backend backup nodes). In the config file, gcache.size is set to 5 GB, and when a new node is deployed the galera.cache file is 5.1 GB, so it seems to pick up that configuration correctly. However, we are seeing galera.cache grow to 80 GB or more on that first node of the cluster. As far as we know, this file should not increase in size. The problem also reproduces when scaling the cluster down to a single node; the file does not stop growing. The deployed version is 10.3.22 (10.3.22-debian-10-r1 Bitnami Docker image). These are the wsrep provider options specified in my.cnf:
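(The my.cnf excerpt itself was not preserved in this export. An illustrative fragment consistent with the description above — a 5 GB gcache — might look like the following; the option names are standard Galera settings, but the provider path and the page_size value here are assumptions, not the reporter's actual config.)

```ini
# galera.cnf -- illustrative sketch only; the reporter's actual file was not preserved
[mysqld]
wsrep_on=ON
# Provider path assumed for the Bitnami image layout
wsrep_provider=/opt/bitnami/mariadb/lib/libgalera_smm.so
# gcache.size caps the on-disk ring buffer (galera.cache) used for IST;
# the bug reported here is that the file grows far beyond this limit
wsrep_provider_options="gcache.size=5G;gcache.page_size=128M"
```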
We've been dealing with this situation for some time now. We can remove the first node (POD), which recreates galera.cache and frees the disk space: the first node syncs through IST from one of the other 2 nodes while HAProxy points to a backup node in the meantime, then switches back to the first node once it has recovered, with no downtime. However, we want to avoid doing this, and we can't figure out why the galera.cache file keeps growing. |
| Comments |
| Comment by ALEJANDRO NÚÑEZ SÁNCHEZ [ 2022-01-12 ] | |||
|
I upgraded the cluster to a 10.4 version but I'm still facing the same issue. Could pointing our HAProxy at a single node be what is causing the issue? defaults frontend frontend_galera_write I guess this should not make galera.cache grow over the size specified in my.cnf, but it does. | |||
| Comment by Daniel Black [ 2022-01-13 ] | |||
|
AlexNS thanks for reporting the bug. How many nodes HAProxy points to shouldn't make a difference, though a single node is definitely the simplest. I am assuming that the 10.4 version is the current 10.4.22. If you have a log file it might be useful, as the default galera is fairly verbose. You can try https://mariadb.com/kb/en/wsrep_provider_options/#debug; this might be too noisy, but at least it's dynamic (runtime-changeable). | |||
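(Since wsrep_provider_options is dynamic, the debug option from the KB page above can be toggled at runtime without restarting the node; a sketch using standard MariaDB syntax:)

```sql
-- Enable verbose wsrep/Galera debug logging at runtime (no restart needed)
SET GLOBAL wsrep_provider_options = 'debug=yes';

-- ...reproduce the galera.cache growth while collecting the error log...

-- Turn it back off once enough has been captured, as it is very noisy
SET GLOBAL wsrep_provider_options = 'debug=no';
```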
| Comment by ALEJANDRO NÚÑEZ SÁNCHEZ [ 2022-01-13 ] | |||
|
Hi @Daniel Black, the deployed version is 10.4.12. It would take some time to upgrade to 10.4.22 because of required HELM chart changes and so on. As this issue already happened on 10.3.22, I guess the 10.4 minor version won't make a difference; however, if you think upgrading to 10.4.22 or a newer major version may solve the issue, we will have a look at it. I just attached the logs from the first node after setting the debug=yes option for wsrep. Right now, as you will see in the logs, we have 3 nodes that replicate correctly through IST when a node is redeployed. I redeployed all 3 of them to change the gcache size back to 128MB and set debug mode. As soon as queries from services were run on the node (around 9:10 in the logs), the galera.cache file started growing over the specified size of 128MB (180 MB at the moment, and growing...). We are really stuck with this issue... | |||
| Comment by Daniel Black [ 2022-01-14 ] | |||
|
10.4.12 was released a full 2 years ago. If you don't want to jump the MariaDB version, try getting the updated galera library from https://packages.debian.org/buster-backports/galera-4 (26.4.5), or better, a 26.4.9 version from http://deb.mariadb.org/mariadb-10.4.22/galera-26.4.9/deb/, and put it at the Bitnami provider location (or change the config). Provided ldd on the library in the container doesn't show missing files, it should work compatibly as the 4.3 provider (26 was added as a version prefix). Please test in a test environment, because I haven't tried this. The galera gcache implementation is contained entirely within this one shared library. | |||
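(A sketch of what swapping the provider library could look like, assuming the Bitnami image keeps it at /opt/bitnami/mariadb/lib/libgalera_smm.so — the exact .deb filename and the path inside the package are assumptions; check the deb.mariadb.org directory listing and your image for the real names:)

```
# Download and unpack the newer galera package
# (the exact .deb filename below is an assumption -- check the directory listing)
wget http://deb.mariadb.org/mariadb-10.4.22/galera-26.4.9/deb/galera-4_26.4.9_amd64.deb
dpkg-deb -x galera-4_26.4.9_amd64.deb extracted/

# Verify the new library's dynamic dependencies resolve inside the container,
# as Daniel suggests: any "not found" line means a missing shared library
ldd extracted/usr/lib/galera/libgalera_smm.so | grep "not found"

# Replace the provider at the (assumed) Bitnami location, then restart the node
cp extracted/usr/lib/galera/libgalera_smm.so /opt/bitnami/mariadb/lib/libgalera_smm.so
```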
| Comment by ALEJANDRO NÚÑEZ SÁNCHEZ [ 2022-01-14 ] | |||
|
Hi @Daniel Black, thanks for the help. Jumping the MariaDB version shouldn't be much trouble, but using HELM v3 is, and it seems I can't use the latest Bitnami mariadb-galera Docker images with the older HELM v2 Bitnami-based charts I use. I looked into changing wsrep_provider_version as you suggested, but I wouldn't know how to change it within the container. I'm using a testing environment, so if you could instruct me on how to perform the upgrade, I'll be very happy to give it a try. | |||
| Comment by Daniel Black [ 2022-01-15 ] | |||
|
Building a new container image with the Dockerfile below and the extracted libgalera_smm.so shared library in the same directory:
docker build --tag bitnami/mariadb-galera:10.4.12-test . Push that to a common repository, and adjust the helm chart to use the image with this adjusted tag. | |||
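(The Dockerfile referenced above was not attached in this export. A minimal sketch of what it would likely contain, assuming the provider lives at /opt/bitnami/mariadb/lib/libgalera_smm.so in the Bitnami image — both the base tag behaviour and that destination path are assumptions:)

```dockerfile
# Base on the image version currently deployed in the cluster
FROM bitnami/mariadb-galera:10.4.12

# Bitnami images run as a non-root user, so switch to root for the copy
USER root
# Overwrite the bundled provider with the newer extracted 26.4.9 library
# (destination path assumed for the Bitnami image layout)
COPY libgalera_smm.so /opt/bitnami/mariadb/lib/libgalera_smm.so
# Restore the non-root user the Bitnami image normally runs as
USER 1001
```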
| Comment by ALEJANDRO NÚÑEZ SÁNCHEZ [ 2022-01-17 ] | |||
|
Hi @Daniel Black, I tried that and, for now, it seems to be working. I will wait some more time just in case, and let you know. | |||
| Comment by ALEJANDRO NÚÑEZ SÁNCHEZ [ 2022-01-18 ] | |||
|
Hi @Daniel Black, it is still working, so I think it's safe to say the issue is gone after upgrading the Galera library. Thank you very much! | |||
| Comment by Daniel Black [ 2022-01-21 ] | |||
|
I've never run this combination, but the Galera principle is at least that the major version corresponds to an ABI, so it should be compatible; any issues I'd expect at startup. Judging by your https://dba.stackexchange.com/questions/305669/mariadb-galera-cluster-galera-cache-file-getting-bigger-than-specified-gcache-si/306387#306387 answer you seem pretty happy. | |||
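(One quick way to confirm which provider a node actually loaded after a swap like this, using the standard Galera status variable:)

```sql
-- Reports the version string of the wsrep provider the running node loaded,
-- so a successful swap should show the 26.4.9 library rather than the bundled one
SHOW GLOBAL STATUS LIKE 'wsrep_provider_version';
```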
| Comment by ALEJANDRO NÚÑEZ SÁNCHEZ [ 2022-01-24 ] | |||
|
Yes Daniel, pretty happy thanks to your support. We'll keep this combination as it is working well so far. |