MaxScale OOM Crash when using readwritesplit with stuck thread on node
MaxScale was consistently using ~1 GB of RAM before this.
The MaxScale server has 4 cores (and threads=4).
We also have router_options=disable_sescmd_history=true enabled.
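For reference, the relevant parts of the configuration look roughly like the sketch below. Only threads=4 and the disable_sescmd_history router option come from this report; the service and server names are placeholders:

```ini
# Sketch of the relevant maxscale.cnf fragment. Section and server
# names are hypothetical; threads=4 and disable_sescmd_history=true
# are the settings described above.
[maxscale]
threads=4

[RW-Split-Service]
type=service
router=readwritesplit
servers=server1,server2
router_options=disable_sescmd_history=true
```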
Scenario just prior to the MaxScale crash:
1. Server 1, which is normally master, had a stuck thread that could not be killed.
2. This resulted in over 100 threads on Server 1 waiting to write data.
3. To resolve that situation, Server 1 was set to maintenance mode in MaxScale to allow the proxy to direct all write traffic to Server 2.
4. MariaDB was restarted on Server 1.
5. Once the restart was complete and the node had synced with the cluster, maintenance mode on Server 1 was cleared.
6. At this point, all write traffic was again directed to Server 1 and cluster was behaving normally.
7. At this point it was noticed that RAM usage on the proxy was unusually high, about 3 GB in use.
8. About 1.5 hours later, MaxScale crashed with the following error.
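Since nothing reached the MaxScale log (see below), the growth from ~1 GB to ~3 GB and beyond could only be observed from outside the process. A minimal sketch, not part of the original report and assuming a Linux host, that reads a process's resident set from /proc:

```python
# Minimal sketch: read a process's resident set size (VmRSS) from
# /proc/<pid>/status on Linux. Polling this for the maxscale PID
# would have shown the climb from ~1 GB toward the ~16 GB the
# OOM killer eventually reported.
def rss_kb(pid):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value is reported in kB
    return None
```

For example, rss_kb(11982) for the PID named in the oom-killer message below.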
Note that nothing was logged to the MaxScale log; the following snippet is from syslog:
... kernel: [6153611.517881] Out of memory: Kill process 11982 (maxscale) score 977 or sacrifice child
... kernel: [6153611.524155] Killed process 11982 (maxscale) total-vm:18820824kB, anon-rss:16009168kB, file-rss:48kB
... kernel: [6153611.534542] in:imklog invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
... kernel: [6153611.534544] in:imklog cpuset=/ mems_allowed=0
... kernel: [6153611.534547] CPU: 0 PID: 816 Comm: in:imklog Not tainted 3.13.0-74-generic #118-Ubuntu
... kernel: [6153611.534548] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/12/2016
... kernel: [6153611.534549] 0000000000000000 ffff8803f849d980 ffffffff81724b70 ffff8803f698b000
... kernel: [6153611.534552] ffff8803f849da08 ffffffff8171f177 0000000000000000 0000000000000000
... kernel: [6153611.534554] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
... kernel: [6153611.534557] Call Trace:
... kernel: [6153611.534563] [<ffffffff81724b70>] dump_stack+0x45/0x56
... kernel: [6153611.534566] [<ffffffff8171f177>] dump_header+0x7f/0x1f1
... kernel: [6153611.534570] [<ffffffff8115308e>] oom_kill_process+0x1ce/0x330
... kernel: [6153611.534575] [<ffffffff812d85a5>] ? security_capable_noaudit+0x15/0x20
... kernel: [6153611.534577] [<ffffffff811537c4>] out_of_memory+0x414/0x450
... kernel: [6153611.534579] [<ffffffff81159b00>] __alloc_pages_nodemask+0xa60/0xb80
... kernel: [6153611.534583] [<ffffffff81198073>] alloc_pages_current+0xa3/0x160
... kernel: [6153611.534587] [<ffffffff8114fc47>] __page_cache_alloc+0x97/0xc0
... kernel: [6153611.534588] [<ffffffff81151655>] filemap_fault+0x185/0x410
... kernel: [6153611.534592] [<ffffffff8117652f>] __do_fault+0x6f/0x530
... kernel: [6153611.534595] [<ffffffff81371384>] ? vsnprintf+0x1f4/0x610
... kernel: [6153611.534597] [<ffffffff8117a3b2>] handle_mm_fault+0x482/0xf10
... kernel: [6153611.534601] [<ffffffff810bc01c>] ? print_time.part.8+0x6c/0x90
... kernel: [6153611.534604] [<ffffffff810bc0af>] ? print_prefix+0x6f/0xb0
... kernel: [6153611.534608] [<ffffffff81730cb4>] __do_page_fault+0x184/0x570
... kernel: [6153611.534610] [<ffffffff810be82d>] ? do_syslog+0x4fd/0x600
... kernel: [6153611.534614] [<ffffffff810ab460>] ? prepare_to_wait_event+0x100/0x100
... kernel: [6153611.534616] [<ffffffff817310ba>] do_page_fault+0x1a/0x70
... kernel: [6153611.534619] [<ffffffff8172d3e8>] page_fault+0x28/0x30
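Converting the oom-killer figures above from kB to GiB makes the jump explicit: a process that normally sat around 1 GB had grown to roughly 15 GiB of anonymous resident memory. A quick check of the arithmetic:

```python
# Figures taken from the kernel oom-killer line above (values in kB)
total_vm_kb = 18820824   # total-vm
anon_rss_kb = 16009168   # anon-rss
KB_PER_GIB = 1024 * 1024

print(f"total-vm ≈ {total_vm_kb / KB_PER_GIB:.1f} GiB")   # ≈ 17.9 GiB
print(f"anon-rss ≈ {anon_rss_kb / KB_PER_GIB:.1f} GiB")   # ≈ 15.3 GiB
```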