[MXS-3070] regular aborts with DCB_STATE_POLLING failed: 104 Created: 2020-07-07 Updated: 2021-08-02 Resolved: 2021-08-02 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | readwritesplit |
| Affects Version/s: | 2.4.10 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Louis Kidd | Assignee: | markus makela |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | crash, need_feedback | ||
| Environment: |
Debian 10 (Buster); 3 MaxScale servers on top of 3 Galera nodes. |
||
| Attachments: |
|
| Issue Links: |
|
| Sprint: | MXS-SPRINT-113 |
| Description |
|
MaxScale unexpectedly segfaults with the following error; this appears to happen only on nodes that are not on the elected master node.
Unfortunately this is so badly service-affecting that it requires us to jump to another load-balancing solution, but I thought I would leave this bug report in the hope that it gets a fix sooner rather than later. |
| Comments |
| Comment by Johan Wikman [ 2020-07-08 ] |
|
The stack trace and signal number 6 suggest it's the systemd watchdog that kills MaxScale. You could try a longer timeout to see whether it removes the problem, although the fact that the default timeout is not long enough suggests there is a real problem of some sort. |
| Comment by markus makela [ 2020-07-08 ] |
|
That's most likely the systemd watchdog killing the MaxScale process. If you're not seeing any other "negative" behavior, you could try commenting out the WatchdogSec=60s line in /lib/systemd/system/maxscale.service as a temporary workaround. |
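As a side note, rather than editing the packaged unit file (which a package upgrade can overwrite), the same workaround can be applied as a systemd drop-in override. A minimal sketch, assuming the standard unit name maxscale.service; the drop-in path and file name are illustrative:

```ini
# /etc/systemd/system/maxscale.service.d/watchdog.conf
# Drop-in override: WatchdogSec=0 disables the systemd watchdog;
# a larger value such as 120s would extend the timeout instead.
[Service]
WatchdogSec=0
```

After creating the file, `systemctl daemon-reload` followed by `systemctl restart maxscale` applies the override.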
| Comment by Louis Kidd [ 2020-07-08 ] |
|
Honestly, these just happen out of nowhere: no logs, no errors beforehand; they literally just stop. Something else has come to light: we use Laravel queues running on Redis, and when these servers crashed they released about 100 jobs that were stuck. |
| Comment by markus makela [ 2020-07-08 ] |
|
Can you attach the maxscale.cnf that you're using? Knowing which features are in use would help us figure out where the problem might be. When the systemd watchdog kills MaxScale, it's almost always due to some bug in MaxScale that causes a thread to get stuck waiting on something. You should be able to get some extra information with log_info; this will at least show where the connections stop. |
| Comment by Louis Kidd [ 2020-07-08 ] |
|
I have attached the maxscale.cnf file |
| Comment by markus makela [ 2020-07-08 ] |
|
The configuration looks to be a pretty standard one. Just to rule this out as a possible cause: do you run any maxadmin commands regularly to monitor the MaxScale instance? Given that you have stuck jobs that are only released after the crash, I suspect maxadmin isn't the cause, but better safe than sorry. Which service do you use, the readconnroute one or the readwritesplit one? If both services are used, which service were the stuck jobs using? |
| Comment by markus makela [ 2020-07-08 ] |
|
You can add log_info=true under the [maxscale] section of your configuration. This enables a more verbose log level that can produce quite a few messages and can fill up disks relatively fast if you have a lot of traffic. You can also enable it at runtime with either maxadmin or maxctrl. |
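For reference, the static variant described above might look like this (a sketch; everything except the [maxscale] section name and the log_info key, which come from the comment, is illustrative):

```ini
# maxscale.cnf -- enable info-level logging for troubleshooting.
# Note: this can grow the log quickly under heavy traffic, so it is
# usually turned off again once the problem has been captured.
[maxscale]
log_info=true
```

The runtime route would go through maxctrl's alter command against the maxscale object; consult the maxctrl documentation for your version for the exact syntax.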
| Comment by Louis Kidd [ 2020-07-08 ] |
|
Yep, all pretty standard stuff in the configuration. We only used the readwritesplit service; this was for our clustered web control services, which needed constant read/write access. Our primary traffic, which is read-only and in much greater demand than the webserver traffic, would have used the read-only service, but it was not migrated to MaxScale because we haven't been able to place all of our trust in the load balancers due to this signal 6 issue. We didn't run any automated maxadmin commands; I was from time to time running maxctrl list sessions / servers / services to keep an eye on things, and I found it would sometimes return a socket error in the JSON output. |
| Comment by markus makela [ 2020-07-08 ] |
|
How high is the CPU usage on that system? Is it under a great deal of stress? The only "valid" reason for this kind of crash is when MaxScale has so much traffic that it cannot answer the systemd daemon in time to let it know it's alive. I'll assume it's not under heavy load, as that would be very obvious and the problem would go away with the load. The maxctrl errors are most likely due to a socket timeout; the default timeout is "only" 10 seconds (it can be increased with the -t option; the argument is in milliseconds). The strange thing is that it's quite unexpected for the list commands to time out. This might point to some problem causing one of the threads to not respond in time to the diagnostic commands that maxctrl uses. It would also support the theory that something in MaxScale isn't responding properly, which causes both the maxctrl timeouts and the crashes. The simplest, and best, way to figure out what is going on would be to disable the systemd watchdog and attach GDB to the maxscale process when it hangs. This should immediately tell us where each thread is and what it is doing. Another option is to enable core dumps on the system and use GDB on the resulting core file to get the thread states at the time of the crash. |
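The GDB step suggested above could be scripted along these lines. A sketch only: it assumes gdb is installed, MaxScale runs under the process name maxscale, the systemd watchdog has already been disabled, and the script runs with sufficient privileges to attach a debugger.

```shell
# Hedged sketch: capture backtraces of every MaxScale thread while it hangs.
# Run as root (or as the user MaxScale runs as) so gdb is allowed to attach.
pid=$(pgrep -x maxscale || true)
if [ -z "$pid" ]; then
    echo "maxscale not running"
else
    # -batch runs the -ex commands non-interactively and detaches afterwards;
    # "thread apply all bt full" dumps a full backtrace of every thread.
    gdb -p "$pid" -batch -ex "thread apply all bt full" \
        > "maxscale_threads_$pid.txt"
    echo "Backtraces written to maxscale_threads_$pid.txt"
fi
```

The same `thread apply all bt full` command works on a core file (`gdb /usr/bin/maxscale core`), which covers the core-dump route mentioned above.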
| Comment by markus makela [ 2020-08-20 ] |
|
MooseHmooseh, have you been able to measure the CPU usage on the system when these sorts of problems appear? You mentioned that:
Does this mean that the MaxScale instances are located on the same machine as the Galera nodes? |
| Comment by markus makela [ 2020-08-24 ] |
|
Can you try to reproduce this with the latest 2.4 release? There was a bug fix to galeramon in 2.4.11 that could lead to undefined behavior and in theory it could explain this. |
| Comment by markus makela [ 2020-10-19 ] |
|
Could theoretically be caused by |
| Comment by markus makela [ 2021-07-12 ] |
|
MooseHmooseh, have you had a chance to test one of the newer releases? |
| Comment by markus makela [ 2021-08-02 ] |
|
Closing as Cannot Reproduce as we haven't been able to reproduce this and there's been no feedback on whether newer releases suffer from the same problem. |