[MDEV-19084] Galera cluster gcache.page files creation at startup/restart Created: 2019-03-29 Updated: 2021-11-04 Resolved: 2021-01-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.3.13, 10.3.20 |
| Fix Version/s: | 10.3.25, 10.4.14 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Roope Pääkkönen (Inactive) | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | None | ||
| Environment: |
CentOS 7 |
||
| Description |
|
We're experiencing the following behavior, seemingly at random: on a MariaDB 10.3 Galera cluster with 3 nodes, we restart one node and it rejoins the cluster normally (in less than a minute), but immediately after rejoining it begins creating gcache.page.XXXXX file(s). The gcache size is 2 gigabytes and database usage during the restart was quite small. What could be causing this kind of behaviour? I believe the biggest "problem" this causes for us is that, as I understand it, a server node that is using gcache.page files cannot serve IST to a joiner and always does a full SST. |
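A quick way to check whether a node has fallen back to page files is to look for gcache.page.* files on disk. The following is only a minimal sketch: the helper names are hypothetical, and it assumes the page files live directly in the directory you pass in (the MariaDB data directory, unless gcache.dir points elsewhere).

```python
from pathlib import Path


def find_gcache_pages(directory):
    """Return the gcache.page.* files found directly in `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).glob("gcache.page.*") if p.is_file()
    )


def summarize_gcache_pages(directory):
    """Summarize page-file usage: how many files exist and how many bytes they occupy."""
    pages = find_gcache_pages(directory)
    return {
        "count": len(pages),
        "total_bytes": sum(p.stat().st_size for p in pages),
        "names": [p.name for p in pages],
    }
```

Calling `summarize_gcache_pages("/var/lib/mysql")` shortly after a restart would show whether page files appeared; the path is an assumption and should be replaced with the node's actual data directory.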
| Comments |
| Comment by Roope Pääkkönen (Inactive) [ 2019-04-09 ] | |||||
|
I googled around and found this issue from Percona XtraDB Cluster, which has a very similar description to what I'm seeing: That is, it feels like the server sometimes forgets the actual gcache file on startup/restart and uses only page files. Restarting the service once more usually fixes it, and gcache.page files are no longer created. | |||||
| Comment by Roope Pääkkönen (Inactive) [ 2019-05-17 ] | |||||
|
wsrep-debug-log.txt
So far I've been able to make it happen after I reboot a VM. I'm not sure if that is related in some way.
On one node, start this command:
sysbench oltp_update_index --tables=10 --table-size=10000000 prepare
Wait until the sbtest1 table is about 500 MB in size, and then run the same sysbench command again. Now gcache.page files are created immediately. I've attached the MariaDB log with wsrep_debug=on.
After this, if I restart the MariaDB service and begin the same sysbench run again, gcache.page files are not created.
wsrep_provider_options="gcache.size=1G;gcache.recover=yes;pc.recovery=TRUE" | |||||
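For anyone comparing settings across nodes, the wsrep_provider_options value quoted in this comment can be split into individual settings. This is a rough sketch with hypothetical helper names; it assumes options are separated by semicolons and that size suffixes K, M, G and T are powers of 1024.

```python
# Assumed multipliers for size suffixes (powers of 1024).
_SIZE_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_provider_options(raw):
    """Split a 'key=value;key=value' provider options string into a dict."""
    options = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing or doubled semicolon
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"option without '=': {item!r}")
        options[key.strip()] = value.strip()
    return options


def size_to_bytes(text):
    """Convert a size such as '1G' or '128M' to bytes; plain digits pass through."""
    text = text.strip().upper()
    if text and text[-1] in _SIZE_UNITS:
        return int(text[:-1]) * _SIZE_UNITS[text[-1]]
    return int(text)
```

With the string from this comment, `parse_provider_options(...)["gcache.size"]` is `"1G"`, which `size_to_bytes` turns into 1073741824 bytes under the assumption above.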
| Comment by Roope Pääkkönen (Inactive) [ 2019-08-09 ] | |||||
|
Recently we've seen this on our servers very often whenever they are restarted for minor version upgrades. I believe whenever lines similar to these are logged on startup:
The server begins to create gcache.page.XXX files immediately, irrespective of load and even if it receives no client connections (e.g. nodes that are only used for failover do it as well). | |||||
| Comment by Mark Reibert [ 2020-06-17 ] | |||||
|
I recently ran into this following an upgrade from MariaDB 10.4.12 ⟶ 10.4.13. I performed a rolling upgrade of a three-node cluster, and because I did not know to look for this problem I am now wedged. The scenario: I upgraded node 1, and upon restart it began using the gcache.page files. At this point it is no good as a donor, because a single 128M gcache.page file holds no more than about a second of writesets on my busy cluster. Then I upgraded node 2, and after receiving an IST from node 3 (the lone remaining "good" node) it too began using the gcache.page files. So now neither node 1 nor node 2 is effectively available as a donor. Because I didn't know this, I then attempted to upgrade node 3, but of course I cannot get it to rejoin the cluster because of the issue with nodes 1 and 2. Effectively, I am left with a two-node cluster in which neither node can donate, so I am dead in the water. The only way to fix this is complete downtime on the cluster (a non-rolling restart). Assuming the root cause is the same as discussed in https://jira.percona.com/browse/PXC-887, can that fix be ported to MariaDB? For those of us bitten by this it causes much pain. | |||||
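The "about a second of writesets" remark can be put in numbers with a back-of-the-envelope model: the time a cache covers is roughly its size divided by the write rate. This is only a sketch. The function names are hypothetical, and the model ignores per-writeset overhead and anything else Galera does internally.

```python
def retention_seconds(cache_bytes, write_bytes_per_second):
    """Rough estimate of how many seconds of writesets fit in a cache of this size."""
    if write_bytes_per_second <= 0:
        raise ValueError("write rate must be positive")
    return cache_bytes / write_bytes_per_second


def covers_downtime(cache_bytes, write_bytes_per_second, downtime_seconds):
    """True if, under this simple model, the cache still holds everything a joiner missed."""
    return retention_seconds(cache_bytes, write_bytes_per_second) >= downtime_seconds
```

As an illustration with an invented write rate of 100 MiB/s, a 128 MiB page file covers about 1.28 seconds, while a 2 GiB ring buffer covers about 20 seconds. Neither figure is a measurement from this cluster.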
| Comment by Roope Pääkkönen (Inactive) [ 2020-06-17 ] | |||||
|
I just saw that the latest WSREP library patches from Galera include fixes to gcache recovery, so maybe these might help here. http://releases.galeracluster.com/galera-3/release-notes-galera-25.3.30.txt | |||||
| Comment by Mark Reibert [ 2020-08-10 ] | |||||
|
Still waiting for some kind of movement here. | |||||
| Comment by Roope Pääkkönen (Inactive) [ 2020-11-28 ] | |||||
|
For me, it seems that updating MariaDB to 10.3.25, with the updated Galera libraries, has fixed the issue. | |||||
| Comment by Alexey [ 2021-01-04 ] | |||||
|
Until Galera 3.30 there was a bug in GCache ring buffer recovery that could make most of the ring buffer unavailable, effectively making it tiny, and as a result GCache needed to allocate page files right from the start. This is fixed in 3.30 and later, released May 2020. | |||||
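To check a node against the 3.30 threshold mentioned here, the provider version can be compared programmatically. This is a sketch with hypothetical function names. It assumes the version string starts with three dot-separated numbers (for example 25.3.30, where the second and third numbers correspond to Galera 3.30), and it returns None for anything outside the Galera 3 series because this comment gives no threshold for other series.

```python
import re


def parse_galera_version(text):
    """Extract the leading three numbers from a provider version string."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if not match:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_ring_buffer_recovery_fix(text):
    """True/False for the Galera 3 series (fix in 3.30 and later); None if unknown."""
    _api, major, minor = parse_galera_version(text)
    if major == 3:
        return minor >= 30
    return None  # no threshold stated for this series, so do not guess
```

The string to pass in would typically come from the node's wsrep_provider_version status variable; any suffix after the three numbers is ignored by this sketch.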
| Comment by Mark Reibert [ 2021-01-04 ] | |||||
|
Yes, I do not believe I have encountered this issue since upgrading to MariaDB 10.4.14 (which brings along Galera 26.4.5). | |||||
| Comment by Mark Reibert [ 2021-11-03 ] | |||||
|
I recently upgraded to Ubuntu 20.04 running MariaDB 10.4.21 / Galera 26.4.9 and this problem has resurfaced. So it looks like we have a regression. I opened |