[MXS-886] MaxScale OOM Crash when using readwritesplit with stuck thread on node Created: 2016-10-11  Updated: 2016-12-14  Resolved: 2016-12-14

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 1.4.3
Fix Version/s: N/A

Type: Bug Priority: Blocker
Reporter: Chris Calender (Inactive) Assignee: markus makela
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

MariaDB MaxScale 1.4.3
Ubuntu 14 64-bit


Sprint: 2016-23, 2016-24

 Description   

MaxScale OOM Crash when using readwritesplit with stuck thread on node

MaxScale was consistently using ~1 GB of RAM before this incident.
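For reference, one generic way to track the proxy's resident memory over time and catch this kind of growth early (a sketch only; the sampling interval and usage are assumptions, not part of the original report):

```sh
#!/bin/sh
# Periodically log the resident set size (RSS, in kB) of a process.
# Usage: ./watch_rss.sh <pid>
pid="$1"
while kill -0 "$pid" 2>/dev/null; do
    # "ps -o rss=" prints only the RSS column, with no header
    printf '%s %s kB\n' "$(date '+%F %T')" "$(ps -o rss= -p "$pid" | tr -d ' ')"
    sleep 60
done
```

Pointing this at the maxscale PID would have shown whether the growth from ~1 GB to ~3 GB was gradual or tied to the failover events in the scenario below.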

4 cores in the maxscale server (and threads=4).

The service also has router_options=disable_sescmd_history=true enabled.
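For context, in MaxScale 1.4 that option sits in the readwritesplit service section of maxscale.cnf, roughly like this (a sketch; the section name and server list are placeholders, not from the report):

```
[Splitter Service]
type=service
router=readwritesplit
servers=server1,server2
router_options=disable_sescmd_history=true
```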

Scenario just prior to the MaxScale crash:

1. Server 1, which is normally master, had a stuck thread that could not be killed.
2. This resulted in over 100 threads on Server 1 waiting to write data.
3. To resolve that situation, Server 1 was set to maintenance mode in MaxScale to allow the proxy to direct all write traffic to Server 2.
4. MariaDB was restarted on Server 1.
5. Once the restart was complete and the node had synced with the cluster, maintenance mode on Server 1 was cleared.
6. At this point, all write traffic was again directed to Server 1 and the cluster was behaving normally.
7. It was then noticed that RAM usage on the proxy was unusually high, about 3 GB in use.
8. About 1.5 hours later, MaxScale crashed with the following out-of-memory error.
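Steps 3 and 5 above would typically be done via maxadmin; in MaxScale 1.4 the maintenance-mode commands look roughly like this (a sketch; the server name server1 is a placeholder, not taken from the report):

```
# Step 3: stop routing traffic to Server 1
maxadmin set server server1 maintenance

# Step 5: after restart and resync, put Server 1 back into rotation
maxadmin clear server server1 maintenance
```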

Note that nothing was logged to the MaxScale log itself.

The following snippet is from syslog:

... kernel: [6153611.517881] Out of memory: Kill process 11982 (maxscale) score 977 or sacrifice child
... kernel: [6153611.524155] Killed process 11982 (maxscale) total-vm:18820824kB, anon-rss:16009168kB, file-rss:48kB
... kernel: [6153611.534542] in:imklog invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
... kernel: [6153611.534544] in:imklog cpuset=/ mems_allowed=0
... kernel: [6153611.534547] CPU: 0 PID: 816 Comm: in:imklog Not tainted 3.13.0-74-generic #118-Ubuntu
... kernel: [6153611.534548] Hardware name: Xen HVM domU, BIOS 4.2.amazon 05/12/2016
... kernel: [6153611.534549] 0000000000000000 ffff8803f849d980 ffffffff81724b70 ffff8803f698b000
... kernel: [6153611.534552] ffff8803f849da08 ffffffff8171f177 0000000000000000 0000000000000000
... kernel: [6153611.534554] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
... kernel: [6153611.534557] Call Trace:
... kernel: [6153611.534563] [<ffffffff81724b70>] dump_stack+0x45/0x56
... kernel: [6153611.534566] [<ffffffff8171f177>] dump_header+0x7f/0x1f1
... kernel: [6153611.534570] [<ffffffff8115308e>] oom_kill_process+0x1ce/0x330
... kernel: [6153611.534575] [<ffffffff812d85a5>] ? security_capable_noaudit+0x15/0x20
... kernel: [6153611.534577] [<ffffffff811537c4>] out_of_memory+0x414/0x450
... kernel: [6153611.534579] [<ffffffff81159b00>] __alloc_pages_nodemask+0xa60/0xb80
... kernel: [6153611.534583] [<ffffffff81198073>] alloc_pages_current+0xa3/0x160
... kernel: [6153611.534587] [<ffffffff8114fc47>] __page_cache_alloc+0x97/0xc0
... kernel: [6153611.534588] [<ffffffff81151655>] filemap_fault+0x185/0x410
... kernel: [6153611.534592] [<ffffffff8117652f>] __do_fault+0x6f/0x530
... kernel: [6153611.534595] [<ffffffff81371384>] ? vsnprintf+0x1f4/0x610
... kernel: [6153611.534597] [<ffffffff8117a3b2>] handle_mm_fault+0x482/0xf10
... kernel: [6153611.534601] [<ffffffff810bc01c>] ? print_time.part.8+0x6c/0x90
... kernel: [6153611.534604] [<ffffffff810bc0af>] ? print_prefix+0x6f/0xb0
... kernel: [6153611.534608] [<ffffffff81730cb4>] __do_page_fault+0x184/0x570
... kernel: [6153611.534610] [<ffffffff810be82d>] ? do_syslog+0x4fd/0x600
... kernel: [6153611.534614] [<ffffffff810ab460>] ? prepare_to_wait_event+0x100/0x100
... kernel: [6153611.534616] [<ffffffff817310ba>] do_page_fault+0x1a/0x70
... kernel: [6153611.534619] [<ffffffff8172d3e8>] page_fault+0x28/0x30
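The OOM-killer figures above can be put into more readable units; a small sketch of the arithmetic (the kB values are taken directly from the syslog lines above):

```python
# Memory figures reported by the OOM killer for the maxscale process,
# in kB as logged by the kernel.
total_vm_kb = 18820824   # total virtual memory
anon_rss_kb = 16009168   # anonymous resident set (RAM actually held)

# Convert kB to GiB (1 GiB = 2**20 kB).
total_vm_gib = total_vm_kb / 2**20
anon_rss_gib = anon_rss_kb / 2**20

print(f"total-vm: {total_vm_gib:.1f} GiB")   # ~17.9 GiB
print(f"anon-rss: {anon_rss_gib:.1f} GiB")   # ~15.3 GiB
```

So by the time the kernel intervened, the process had grown well past the ~3 GB noted in step 7, to roughly 15 GiB resident.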



 Comments   
Comment by markus makela [ 2016-12-05 ]

Has this happened more than once? Does it happen with 2.0.2?

Comment by Chris Calender (Inactive) [ 2016-12-05 ]

It only happened the one time, which was on 1.4.3.

Comment by markus makela [ 2016-12-06 ]

So far we haven't been able to reproduce this.

Comment by markus makela [ 2016-12-14 ]

We couldn't reproduce this.

Generated at Thu Feb 08 04:02:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.