[MDEV-24159] Millions of GRA_*.log files Created: 2020-11-06 Updated: 2022-01-05 Resolved: 2021-12-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, wsrep |
| Affects Version/s: | 10.4.14 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Mark Reibert | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Incomplete | Votes: | 2 |
| Labels: | None | ||
| Description |
|
I am running a three-node MariaDB/Galera cluster on Ubuntu 16.04 with MariaDB 10.4.14 and Galera 26.4.5. The cluster is set up in "master/slave" mode such that all connections go to one (master) node. At times the two slave nodes start generating GRA_*.log files at an alarming rate, on the order of 100k files per day. Something seems to trigger the situation; that is, I typically have some information in the server log corresponding to the first GRA_*.log file. One such example is:
After this initial event the server generates GRA_*.log files at an alarming rate, 1–2 per second or about 100k per day. There are no further entries in the server log corresponding to the flood of GRA_*.log files. If this goes unnoticed you'll have millions of GRA_*.log files in /var/lib/mysql in a matter of weeks, and the server will grind to a halt. I have not been able to determine reproduction steps; however, it is important to note that if I set wsrep_slave_threads=1 (dynamically, without a server restart) the creation of GRA_*.log files immediately stops. I can then return wsrep_slave_threads to a value >1 and no GRA_*.log files are created (well, until a few days later when the situation repeats). It is also important to note that the cluster remains in sync throughout (wsrep_local_state_comment reports Synced on all nodes), as if all these GRA_*.log files are being generated erroneously. |
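The workaround described above can be sketched as a pair of statements run on an affected node. This is a hedged sketch, not a fix: the restore value of 4 is an assumption — use whatever wsrep_slave_threads was set to before.

```sql
-- Sketch of the reporter's workaround, run on an affected slave node.
-- Dropping to a single applier thread stops the GRA_*.log flood:
SET GLOBAL wsrep_slave_threads = 1;

-- After a moment, restore the previous value
-- (4 here is an assumption; use your original setting):
SET GLOBAL wsrep_slave_threads = 4;
```

Because the variable is dynamic, neither statement requires a server restart, which matches the behavior the reporter observed.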
| Comments |
| Comment by Mark Reibert [ 2020-11-06 ] |
|
Here are the relevant parameters set on the server:
|
| Comment by Walter Doekes [ 2021-04-06 ] |
|
vvv - IGNORE THIS REPLY - ISSUE TURNED OUT TO BE POOR CONFIG - vvv We also have a cluster with millions of GRA_*.log files with 10.4.17+maria~bionic and galera 26.4.6-bionic. In our case, however, the cluster is not in sync. We've identified at least one query that was executed on two nodes (node2, node3) but not on the third (node1, the donor). When browsing backwards in the logs, I see that the gtid_seq_no (and gtid) is in sync until that happens – we're using wsrep_gtid_mode=OFF, for the record:
(No idea if the gtid_seq_no is relevant, but it appears to match the point when the replication goes out of sync.) Thoughts:
Notes:
^^^ - IGNORE THIS REPLY - ISSUE TURNED OUT TO BE POOR CONFIG - ^^^ |
| Comment by Walter Doekes [ 2021-04-09 ] |
|
Sooo... ignore my reply. It turned out that node1 had a stray replicate_wild_ignore_table= setting that everyone missed. Move along, people, nothing to see here... |
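For readers hitting the same trap: a replication filter left behind in one node's configuration makes that node silently skip matching events while the rest of the cluster applies them, so the node diverges even though Galera reports Synced. A hedged my.cnf sketch of the kind of stray line involved — the pattern shown is a made-up illustration, not the actual value from node1:

```ini
# /etc/mysql/my.cnf on node1 (hypothetical illustration)
[mysqld]
# A leftover filter like this causes the node to ignore writes to
# matching tables, producing exactly the "one node missing a query"
# divergence described in the comment above:
replicate_wild_ignore_table = somedb.%
```

Auditing every node's config for replicate_* filters is a cheap first check before suspecting Galera itself.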
| Comment by Jan Lindström (Inactive) [ 2021-04-09 ] |
|
mreibert I need something to analyze: at least some steps for how to reproduce, and error logs from when something goes wrong (preferably with the --wsrep-debug=1 setting). |
| Comment by Mark Reibert [ 2021-04-16 ] |
|
jplindst I completely understand your request; unfortunately, this problem only occurs on my production cluster, which is not conducive to debugging since the primary database is 7 TB in size. (SSTs are 8-hour events!) That being said, I can enable the error log and set wsrep_debug, since according to the docs it is dynamic, and then just wait for the next time the GRA_*.log files spew out. What enumeration value do you want for wsrep_debug? |
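For reference, in 10.4 wsrep_debug is indeed an enumeration rather than a boolean (the 10.4 documentation lists NONE, SERVER, TRANSACTION, STREAMING, CLIENT). A hedged sketch of enabling it dynamically — mapping the numeric 1 to SERVER is an assumption based on NONE being the zero value:

```sql
-- wsrep_debug is dynamic in 10.4, so no restart is needed.
-- Numerically, 1 would map to the first value after NONE:
SET GLOBAL wsrep_debug = 'SERVER';

-- Confirm the setting took effect:
SHOW GLOBAL VARIABLES LIKE 'wsrep_debug';
```

Remember to set it back to NONE afterwards, since debug logging is verbose on a busy cluster.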
| Comment by Mark Reibert [ 2022-01-03 ] |
|
Hello jplindst — I see you closed this issue as incomplete, but note that I was waiting for a response from you as to what value you want for wsrep_debug. |
| Comment by Jan Lindström (Inactive) [ 2022-01-04 ] |
|
mreibert It was in my request, i.e. wsrep-debug=1. I closed this because we could not reproduce it. |
| Comment by Mark Reibert [ 2022-01-05 ] |
|
jplindst — I apologize for the confusion. You requested wsrep-debug=1; however, the documentation indicates this is an enumeration, and I am unsure which enumeration value maps to 1. This was a very frequent problem for me when I was running 10.4.14; however, since I am now running 10.4.22 I generally do not see the creation of GRA_*.log files. So somewhere between 10.4.14 and 10.4.22 — which brings along with it new versions of the Galera library — things improved! |