[MDEV-9825] mariadb 10.1 crashing Created: 2016-03-29  Updated: 2020-10-20  Resolved: 2020-10-20

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 10.1.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Alex Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None


 Description   

Hello,
Recently I switched to 10.1 and CentOS 7; however, I had to urgently downgrade to 10.0.

I faced a few problems:

1. Master crashing
My master servers are MariaDB 10.1 with Blackhole tables and around 20k qps.
Under load spikes (from many web servers, e.g. when the web app is restarted), MariaDB 10.1 crashes. It does not matter whether the load is 20k qps or 5k qps.
Moreover, it seems related to the number of IPs (web app servers) that work with the master, since a different server that has still not been downgraded (because it's production and has never crashed) has fewer web app IPs accessing it. A few other masters crashed from time to time without any information in the logs; systemd notices MariaDB is down and restarts it.
After a few tries/changes I decided to downgrade to 10.0, and it has been stable for a few days already no matter what happens on the web app servers (the single 10.1 master is still alive, but I am afraid it may begin crashing when I add more web app servers).
All masters are the same.

2. Backup slaves crashing
I have a few backup slaves that replicate from those masters into MyISAM tables with disabled indexes. A single slave replicates from 3 masters: 2 production ones (described above) and 1 different master that uses InnoDB. I am using parallel replication. Out of the blue, a different slave (I have a few slaves like that) crashes and restarts. I can't tell for sure whether this happens as a result of the backups that run daily, but it's occasional, and I've already downgraded it to 10.0.

I have seen nothing in dmesg, the logs, or anywhere else. I know I need to compile/install a debug version and tune my.cnf.
I'd like to help you debug this and need some guidance. I am ready to let a production server crash once again.

But for now all systems run on 10.0 without any problem or any crash seen so far (maybe this will happen some day and it relates to CentOS 7 or other settings I've missed... who knows).

Thanks
Alex



 Comments   
Comment by Elena Stepanova [ 2016-03-30 ]

ShivaS,
The only known reason for MariaDB to crash without anything in logs is when somebody SIGKILLs it manually, which obviously isn't the case here.
Most often people miss the complaint from the OOM killer, which also sends SIGKILL; its own message is usually short, not easily noticeable, and tends to hide deep in the system logs. OOM would also explain the randomness of the problem, and could even explain the relation to the number of masters, if the leak is somehow related to the master connections.
Do you have any kind of memory monitoring on the machines where the crash occurred?

With other signals, even with a non-debug, stripped build, there is always something in the logs, at least a note about receiving the signal; you just need to find it.
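The OOM killer's note can usually be found after the fact with something like this (a sketch; the exact message wording varies by kernel version):

```shell
# Look for OOM-killer activity in the kernel ring buffer.
# Typical lines look like: "Out of memory: Kill process 1234 (mysqld) ..."
dmesg -T | grep -iE 'out of memory|killed process'

# On systemd-based systems such as CentOS 7, kernel messages are also in the journal:
journalctl -k | grep -iE 'out of memory|oom'
```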

Comment by Alex [ 2016-03-30 ]

Hi Elena,

About the masters:

I have never seen any OOM with that setup.
The same setup with 10.0 on either CentOS 6 or CentOS 7 runs just fine. (I haven't tried 10.1 on CentOS 6, though.)

This is how top looks on the still-running 10.1 setup:

top - 12:22:24 up 7 days, 30 min,  1 user,  load average: 4.25, 4.18, 4.43
Tasks: 158 total,   2 running, 156 sleeping,   0 stopped,   0 zombie
%Cpu0  : 16.2 us,  5.9 sy,  0.0 ni, 76.5 id,  0.0 wa,  0.0 hi,  0.0 si,  1.5 st
%Cpu1  : 19.4 us,  7.3 sy,  0.0 ni, 71.8 id,  0.0 wa,  0.0 hi,  0.0 si,  1.5 st
%Cpu2  : 18.3 us,  6.7 sy,  0.0 ni, 73.6 id,  0.0 wa,  0.0 hi,  0.0 si,  1.4 st
%Cpu3  : 34.1 us, 13.4 sy,  0.0 ni, 31.8 id,  0.0 wa,  0.0 hi, 20.3 si,  0.5 st
%Cpu4  : 32.3 us, 14.3 sy,  0.0 ni, 34.1 id,  0.0 wa,  0.0 hi, 18.9 si,  0.5 st
%Cpu5  : 13.8 us,  5.3 sy,  0.0 ni, 80.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.9 st
%Cpu6  : 13.2 us,  6.2 sy,  0.0 ni, 79.3 id,  0.0 wa,  0.0 hi,  0.0 si,  1.3 st
%Cpu7  : 11.7 us,  5.3 sy,  0.0 ni, 81.6 id,  0.0 wa,  0.0 hi,  0.0 si,  1.5 st
KiB Mem : 15137948 total,   157392 free, 12078528 used,  2902028 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2693708 avail Mem
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11154 mysql     20   0 18.874g 0.010t   4540 R 416.3 74.0  22235:50 mysqld

This server runs at 20k qps and has never crashed (yet).

And this is on a different machine, which was previously upgraded to double its capacity (I thought it could help) and is running 10.0 because it didn't help and 10.1 kept crashing:

top - 12:28:42 up 3 days, 15:04,  1 user,  load average: 1.20, 1.26, 1.18
Tasks: 435 total,   1 running, 434 sleeping,   0 stopped,   0 zombie
%Cpu0  :  6.0 us,  1.5 sy,  0.0 ni, 92.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  3.7 us,  2.9 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.7 st
%Cpu2  :  4.5 us,  0.8 sy,  0.0 ni, 94.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.8 st
%Cpu3  :  4.4 us,  2.2 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.7 st
%Cpu4  :  3.8 us,  1.5 sy,  0.0 ni, 94.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.8 st
%Cpu5  :  3.8 us,  1.5 sy,  0.0 ni, 93.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.8 st
%Cpu6  :  3.8 us,  1.5 sy,  0.0 ni, 93.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.8 st
%Cpu7  :  3.8 us,  2.3 sy,  0.0 ni, 94.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.7 us,  1.4 sy,  0.0 ni, 97.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.7 st
%Cpu9  :  0.7 us,  0.7 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  2.9 us,  0.7 sy,  0.0 ni, 95.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.7 st
%Cpu11 :  7.0 us,  3.1 sy,  0.0 ni, 83.7 id,  0.0 wa,  0.0 hi,  6.2 si,  0.0 st
%Cpu12 :  7.7 us,  3.1 sy,  0.0 ni, 83.1 id,  0.0 wa,  0.0 hi,  5.4 si,  0.8 st
%Cpu13 :  0.0 us,  1.5 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 :  2.2 us,  0.7 sy,  0.0 ni, 97.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  2.2 us,  1.5 sy,  0.0 ni, 96.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 30606744 total,   228300 free,  9592484 used, 20785960 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 20311264 avail Mem
 
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
105532 mysql     20   0 31.010g 7.579g   4716 S 226.5 26.0   9787:40 mysqld

The above 10.0 machine runs at 16k qps.

It may happen under any load, whether it's 3k/sec or 20k/sec.

The pattern is simple INSERT DELAYED statements, and I am using thread-per-connection because the pool of threads seems ineffective for quick queries (the server stops responding and goes to a high load average and CPU usage).

Even if I missed an OOM (which I believe I did not) or misconfigured the server by overcommitting the resources/settings it needs, 10.0 is far more stable anyway, either as a master or as a slave.

On the master, when the crash happens, the restarts can be endless: the database starts and crashes over and over again, until I manually stop it.

All I see in the messages log is something like this:

Mar 25 13:12:27 mdb03 systemd: mariadb.service: main process exited, code=killed, status=11/SEGV
Mar 25 13:12:27 mdb03 systemd: Unit mariadb.service entered failed state.
Mar 25 13:12:27 mdb03 systemd: mariadb.service failed.
Mar 25 13:12:32 mdb03 systemd: mariadb.service holdoff time over, scheduling restart.
Mar 25 13:12:32 mdb03 systemd: Starting MariaDB database server...
Mar 25 13:12:32 mdb03 journal: Suppressed 1650 messages from /system.slice/mariadb.service
Mar 25 13:12:34 mdb03 systemd: Started MariaDB database server.
Mar 25 13:22:16 mdb03 systemd: mariadb.service: main process exited, code=killed, status=11/SEGV
Mar 25 13:22:16 mdb03 systemd: Unit mariadb.service entered failed state.
Mar 25 13:22:16 mdb03 systemd: mariadb.service failed.
Mar 25 13:22:21 mdb03 systemd: mariadb.service holdoff time over, scheduling restart.
Mar 25 13:22:21 mdb03 systemd: Starting MariaDB database server...
Mar 25 13:22:22 mdb03 systemd: Started MariaDB database server.

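To break that endless restart loop while investigating, systemd's auto-restart can be disabled temporarily with a drop-in (a sketch, assuming the mariadb.service unit name from the log above):

```shell
# Create a drop-in that turns off automatic restarts for mariadb.service;
# delete the file and daemon-reload again to restore the original behaviour.
mkdir -p /etc/systemd/system/mariadb.service.d
cat <<'EOF' > /etc/systemd/system/mariadb.service.d/norestart.conf
[Service]
Restart=no
EOF
systemctl daemon-reload
```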
I think that signal 11 (SIGSEGV) isn't related to OOM.

As for the slave, here you are right: it's OOM. However, it doesn't happen with 10.0, and even with 10.1 it's quite rare: it could be twice a day or once in a few days.
It looks like it's something about memory access/usage in both cases.

Thanks,
Alex

Comment by Alex [ 2016-04-07 ]

Hi Elena,
I think I found the problem with the crashes.
The problem is the persistent connections used in the apps: under load MySQL gets a lot of connections from many app servers.
Since I had missed the sort_buffer_size default setting (which is 2MB, allocated per thread), it crashes (runs MariaDB out of memory). So far I have had a single crash of 10.0, and its error log was full of relevant details about the crash.
I am using Blackhole on the masters, so I can safely decrease that setting.
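The worst case is easy to estimate: each connection can allocate its own sort buffer, so the upper bound for that buffer alone is roughly max_connections × sort_buffer_size (a back-of-the-envelope sketch; the values below are assumed, check SHOW VARIABLES for the real ones):

```shell
# Back-of-the-envelope: per-thread sort buffers alone can add
# max_connections * sort_buffer_size on top of the global buffers.
max_connections=1000                   # assumed value; see SHOW VARIABLES LIKE 'max_connections'
sort_buffer_size=$((2 * 1024 * 1024))  # the 2 MB per-thread default mentioned above
echo "worst-case sort buffers: $(( max_connections * sort_buffer_size / 1024 / 1024 )) MB"
```

With 1000 persistent connections that is already 2 GB on top of the buffer pool, which fits the OOM pattern under connection spikes.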
Pool of threads is not good for my load, and neither is disabling persistence (too many resources are used by MariaDB for every connection and the server gets a high load average). So persistent connections with delayed inserts are used (thanks for keeping delayed inserts in MariaDB!).

So I guess the case can be closed.

However, 10.1 crashes more frequently, and I couldn't find where it over-allocates memory compared to 10.0.
It crashes everywhere, whether master or slave. So I still consider 10.0 more stable under sudden loads of any type.

Also, I noticed that under sudden loads on the masters I can get a duplicated binary log entry (statement-based logging is used). The whole app level and its logs have been rechecked; it's just a double entry caused by MariaDB, and it happens in both 10.0 and 10.1.
(I noticed something similar happens to MaxScale under load: a duplicate entry may appear in its binary logs.)
That's off-topic, I guess, and requires a different ticket, but since I've spread the load a bit more between servers and tuned sort_buffer_size, the problems disappeared, so I won't be able to simulate it again at this point.
For now 2 masters with a load of 20k qps are used, and a single slave that, thanks to multi-source parallel replication, can handle that load and never delays. Going higher than the above qps rate causes the slave to start delaying. So that's the limit I have currently reached.

I'll give 10.1 another try on the master servers, but a bit later. Too many problems with it (in my humble opinion), including a freeze on shutdown if multi-source replication wasn't stopped in advance, and parallel worker threads not displaying 'Seconds_Behind_Master' info.

Comment by Elena Stepanova [ 2016-04-08 ]

ShivaS,
I think you misunderstood me a bit. While saying that it's probably OOM, I don't deny at all that it's a bug; I'm just trying to narrow down the problem. If your 10.1 server goes OOM more often than 10.0 with the same configuration, at the very least it is worth investigating. Hypothetically, if your 10.0 server operates on the edge of its memory (I'm not saying it does), then 10.1 might indeed go OOM more often simply because every new version has higher defaults and requires more resources, so where your 10.0 stays within limits, 10.1 grows bigger. But even then, we need to find out exactly which difference in defaults causes the difference in stability.

Comment by Alex [ 2016-04-10 ]

Hi Elena,
Well, I assumed that you were too busy and/or my problem was not a problem at all, so I have already downgraded all 10.1 servers to 10.0 (you know, this task is not my only problem with 10.1) and found the problem with the sort_buffer_size default setting.

However, I am here for anything you want me to test/check. I can't promise I'll be able to sacrifice production once again, but I'll do my best to help with everything I can.

Thanks,
Alex

Comment by Elena Stepanova [ 2020-10-20 ]

We've never been able to get even close to solving the mystery; closing as incomplete.

Generated at Thu Feb 08 07:37:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.