Hello, either this is a bug, or I've been looking in the wrong places while troubleshooting.
This issue has manifested twice, once on 10.3 and once on 10.4.17. The environment is a three-node production Galera cluster, and so far the errors in question have appeared on the node to which we send the writes (i.e. the "master").
At the time of the failure, the entire cluster seems to stall, and apps depending on database functions start to fail as well. We monitor several MariaDB status variables (through a slightly modified https://github.com/uvoteam/mysql-monitoring), and only `Innodb_row_lock_waits` seems to fluctuate significantly while the issue is ongoing. The stall lasted between 5 and 10 minutes. The MariaDB process was not killed, the cluster self-healed without admin intervention, and the only entry in the logs that seems relevant is several repetitions of the following:
Our setup does involve a lot of short-lived connections from PHP scripts (peaking at 4-5K during the day).
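For anyone trying to correlate the stall with connection churn: `Connections` and `Innodb_row_lock_waits` are both cumulative counters in `SHOW GLOBAL STATUS`, so the useful number is the change between two samples rather than the raw value. Below is a minimal sketch of that calculation. The function name `counter_rates` and the sample values are made up for illustration and are not taken from our monitoring scripts.

```python
def counter_rates(before, after, seconds):
    """Return per-second rates for counters present in both samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    rates = {}
    for name, old in before.items():
        if name not in after:
            continue
        delta = after[name] - old
        # A negative delta suggests a server restart or counter reset,
        # so that counter is skipped instead of reporting a bogus rate.
        if delta >= 0:
            rates[name] = delta / seconds
    return rates


# Hypothetical samples taken 60 seconds apart (values are invented).
sample_a = {"Connections": 120000, "Innodb_row_lock_waits": 40}
sample_b = {"Connections": 123000, "Innodb_row_lock_waits": 100}
rates = counter_rates(sample_a, sample_b, 60)
print(rates)
```

Comparing these rates just before and during an incident would show whether the lock waits rise together with the connection rate or independently of it.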
Our monitoring service indicated that the host had about 8GiB of available memory (although this includes filesystem page cache) at the moment these lines were printed in the error.log.
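To separate truly free memory from page cache, the figures can be read from `/proc/meminfo`. The sketch below assumes a Linux guest and that the monitoring figure corresponds to `MemAvailable`, which I have not verified. The helper names and the sample text are hypothetical; the sample is not output captured from the affected host.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(text):
    """Report free, page cache and available memory in GiB."""
    info = parse_meminfo(text)
    kb_per_gib = 1024 ** 2
    return {
        "free_gib": info["MemFree"] / kb_per_gib,
        "page_cache_gib": info["Cached"] / kb_per_gib,
        "available_gib": info["MemAvailable"] / kb_per_gib,
    }


# Invented sample, shaped like /proc/meminfo on a 40GiB machine.
SAMPLE = """MemTotal:       41943040 kB
MemFree:         1048576 kB
MemAvailable:    8388608 kB
Cached:          7340032 kB
"""
summary = memory_summary(SAMPLE)
print(summary)
```

On a live host the same function can be fed the contents of `/proc/meminfo` directly, e.g. `memory_summary(open("/proc/meminfo").read())`.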
MariaDB runs in a VM configured with 40GiB of RAM, which is presented as two virtual NUMA nodes. This mirrors the VM host's hardware architecture, in the hope of improving performance by helping the kernel schedule processes with memory and cache locality taken into account.
Some relevant configuration options are the following:
If there's any other information I can produce to determine if this is a mariadb-server bug or a configuration issue, do let me know. Thank you.