[MDEV-24397] MariaDB Galera Server Crashes on Large DELETE Created: 2020-12-11 Updated: 2022-03-16 Resolved: 2021-12-23 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Manipulation - Delete, Galera |
| Affects Version/s: | 10.5.8 |
| Fix Version/s: | 10.5.13 |
| Type: | Bug | Priority: | Major |
| Reporter: | Larry Adams | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | crash, galera | ||
| Environment: |
CentOS 8.2, 3 Node Galera Cluster |
||
| Description |
|
We are attempting to deploy Galera in a large, high-I/O environment, and the cluster keeps losing state due to failing MariaDB servers. The server repeatedly segfaults while trying to roll forward transactions. It eventually comes up, but then other servers are expelled from the cluster. We are running MariaDB-Server-10.5.8-1 from the MariaDB.repo. The backtrace looks like the following:
I've truncated the query as it's fairly large. |
| Comments |
| Comment by Larry Adams [ 2020-12-11 ] |
|
I have added the debug RPMs and, since then, have not experienced this bug again. Once I get a better backtrace, I'll upload it. I've got some core files, but they are pretty big. |
| Comment by Daniel Black [ 2020-12-11 ] |
|
Can you include `SHOW CREATE TABLE poller_output`, the server configuration, and the Galera library version? If you have a core dump from this crash (is it in the coredumpctl output?), you can get the backtrace from it after the event occurred - https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#analyzing-a-core-file-with-gdb-on-linux |
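For reference, extracting a backtrace from an existing core file can be sketched as below. This is a minimal sketch, not the exact procedure from the linked page: the helper name `full_backtrace` is hypothetical, and it assumes `gdb` and the matching debug symbols are installed.

```shell
#!/bin/sh
# Sketch: print a full backtrace of every thread from an existing core file.
# Assumption (not from the report): the helper name full_backtrace.

full_backtrace() {
    binary="$1"   # server binary that produced the core
    core="$2"     # core file written at crash time

    # Refuse to run if either file is missing or unreadable.
    if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
        echo "usage: full_backtrace <server-binary> <core-file>" >&2
        return 1
    fi

    # --batch runs the listed commands and exits without an interactive prompt.
    gdb --batch \
        -ex "set pagination off" \
        -ex "thread apply all bt full" \
        "$binary" "$core"
}
```

A call such as `full_backtrace /usr/sbin/mariadbd /path/to/core > gdb.txt` would write the result to a file that can be attached; both paths are placeholders. If systemd captured the dump, `coredumpctl list` should show it.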
| Comment by Larry Adams [ 2020-12-12 ] |
|
Daniel, I'll do that. Here is the simple backtrace. The full thread backtrace I'll upload shortly. #0 0x00007f597f8fb70f in raise () from /lib64/libc.so.6 |
| Comment by Larry Adams [ 2020-12-12 ] |
|
The inserts can be fairly large, up to 16MB or so. The number of rows affected can be > 100k. CREATE TABLE `poller_output` ( |
| Comment by Larry Adams [ 2020-12-12 ] |
|
Daniel, I have half a dozen cores per server. There is another ticket as well, with the same kind of issue, in that the server loses sync but does not crash, until the OOM killer takes it, that is. Server is a DELL, with CentOS 8.2 [root@rtm-db-03 tmp]# free -g [root@rtm-db-03 tmp]# cat /proc/cpuinfo | grep cpuid | wc -l Larry |
| Comment by Larry Adams [ 2020-12-12 ] |
|
Okay, on the tertiary server, 3 of the cores had the same short bt stack count of 42, and ended with the call to clone(), like the attached. The others were from me doing a kill -9, so not so relevant. |
| Comment by Larry Adams [ 2020-12-12 ] |
|
gdb.txt -> issue with gcache recovery leading to a memory leak and then to the OOM killer |
| Comment by Larry Adams [ 2021-01-12 ] |
|
We are still not in production. So, I have a small window to test a fix before I'll have to stick with 10.5.6, which has been stable. |
| Comment by Larry Adams [ 2021-03-09 ] |
|
I've switched from Galera to Async Parallel Replication due to the OPTIMIZE causing the entire cluster to pause. Replication is working, but I'm losing my hair slowly but surely trying to figure things out, or maybe it's just my age... |
| Comment by Jan Lindström (Inactive) [ 2022-03-16 ] |
|
Roel This is not possible to reproduce after |
| Comment by Roel Van de Paar [ 2022-03-16 ] |
|
jplindst Thank you! |