[MXS-2585] MaxScale dies with fatal signal 11 Created: 2019-07-02  Updated: 2020-12-08  Resolved: 2020-07-07

Status: Closed
Project: MariaDB MaxScale
Component/s: cache
Affects Version/s: 2.2.16, 2.3.4, 2.3.6, 2.3.9, 2.4.1, 2.4.4, 2.4.6, 2.4.10
Fix Version/s: 2.4.11

Type: Bug Priority: Critical
Reporter: Daniel Assignee: markus makela
Resolution: Fixed Votes: 1
Labels: None
Environment:

Ubuntu 16.04 LTS, Ubuntu 18.04 LTS


Attachments: File caching_rules.json     Text File maxscale-2-4-10-verbose-log.txt     File maxscale-verbose-2-4-6-1-debug.log     File maxscale-verbose-log.cnf     File maxscale.cnf     File maxscale.log    
Sprint: MXS-SPRINT-96, MXS-SPRINT-98, MXS-SPRINT-101

 Description   

I am experiencing an issue where MaxScale periodically dies with the following backtrace:

2019-07-02 15:25:02   alert  : (169770) Fatal: MaxScale 2.3.6 received fatal signal 11. Attempting backtrace.
2019-07-02 15:25:02   alert  : (169770) Commit ID: 03dc969cf203decb4d0bc65280fe4f901da87830 System name: Linux Release string: Ubuntu 18.04.2 LTS
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:02   alert  : (169770)   /usr/bin/maxscale(_ZN7maxbase15dump_stacktraceESt8functionIFvPKcS2_EE+0x3a) [0x558bd60300da]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:02   alert  : (169770)   /usr/bin/maxscale(_ZN7maxbase15dump_stacktraceEPFvPKcS1_E+0x48) [0x558bd60304b8]: 
sh: 1: addr2line: not found
2019-07-02 15:25:02   alert  : (169770)   /usr/bin/maxscale(+0x99f0) [0x558bd602c9f0]: 
sh: 1: addr2line: not found
2019-07-02 15:25:02   alert  : (169770)   /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7fc2afbc7890]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:03   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libcache.so(_ZN10LRUStorage12do_put_valueERK9cache_keyPK5gwbuf+0xe8) [0x7fc2a6281178]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:03   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libcache.so(_ZN18CacheFilterSession12store_resultEv+0x39) [0x7fc2a627b049]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:03   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libcache.so(_ZN18CacheFilterSession21handle_expecting_rowsEv+0xea) [0x7fc2a627c2ca]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:03   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libcache.so(_ZN8maxscale6FilterI11CacheFilter18CacheFilterSessionE11clientReplyEP10mxs_filterP18mxs_filter_sessionP5gwbuf+0x14) [0x7fc2a6279c44]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:03   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(session_route_reply+0x15) [0x7fc2b0392db5]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:04   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSession11clientReplyEP5gwbufP3dcb+0x127) [0x7fc2a68be077]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:04   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN8maxscale6RouterI7RWSplit14RWSplitSessionE11clientReplyEP10mxs_routerP18mxs_router_sessionP5gwbufP3dcb+0x27) [0x7fc2a68b7687]: 
sh: 1: addr2line: not found
2019-07-02 15:25:04   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libmariadbbackend.so(+0x4138) [0x7fc2a649c138]: 
sh: 1: addr2line: not found
2019-07-02 15:25:04   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x82f77) [0x7fc2b0344f77]: 
sh: 1: addr2line: not found
2019-07-02 15:25:04   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x83031) [0x7fc2b0345031]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:05   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x1b6) [0x7fc2b03a1ce6]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:05   alert  : (169770)   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x4f) [0x7fc2b03a1eff]: 
sh: 1: addr2line: not found
2019-07-02 15:25:05   alert  : (169770)   /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fc2af16866f]: 
sh: 1: addr2line: not found
2019-07-02 15:25:05   alert  : (169770)   /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fc2afbbc6db]: 
sh: 1: nm: not found
sh: 1: addr2line: not found
2019-07-02 15:25:05   alert  : (169770)   /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc2ae4c088f]: 

Right after that, monit detects that the service isn't running and starts MaxScale again.

The issue most likely relates to memory consumption by the cache, but even though I limited it in the configuration, MaxScale keeps dying. I tried different max_size values as well as both the threaded and shared operation modes. At the beginning of my investigation the system was running out of memory (OOM), but I thought that should not happen given the hard and soft TTL settings.

I have attached the MaxScale configuration as well as the only filter rule in use.

The system itself has 8 cores/threads with 16 GB of RAM and is virtualized with Qemu/KVM.
Kernel-Version: 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The backend database servers run: Server version: 10.2.18-MariaDB-10.2.18+maria~xenial

Looking forward to any hints.


Greetings



 Comments   
Comment by markus makela [ 2019-10-07 ]

Can you reproduce this without the cache filter?

Comment by Daniel [ 2019-10-08 ]

Hello,

yes, the issue also appears without the cache filter.

Greetings

Comment by markus makela [ 2019-10-08 ]

Can you provide the stacktrace without the cache filter?

Comment by Daniel [ 2019-10-11 ]

Hello Markus,

the service restart was done at 23:13:30, without the cache filter.

2019-10-10 23:13:30   notice : Loaded server states from journal file: /var/lib/maxscale/MariaDB-Monitor/monitor.dat
2019-10-10 23:28:07   error  : (9257) [readwritesplit] Lost connection to the master server 'db01-c', closing session. Lost connection to master server while connection was idle. Connection has been idle for 600.1 seconds. Error caused by: #HY000: Lost connection to backend server.
2019-10-10 23:42:00   error  : (35291) [readwritesplit] Lost connection to the master server 'db01-c', closing session. Lost connection to master server while connection was idle. Connection has been idle for 600.1 seconds. Error caused by: #HY000: Lost connection to backend server.
2019-10-10 23:44:20   error  : (36388) [readwritesplit] Lost connection to the master server 'db01-c', closing session. Lost connection to master server while connection was idle. Connection has been idle for 600.1 seconds. Error caused by: #HY000: Lost connection to backend server.
2019-10-10 23:44:44   error  : (33898) [readwritesplit] Lost connection to the master server 'db01-c', closing session. Lost connection to master server while connection was idle. Connection has been idle for 600.1 seconds. Error caused by: #HY000: Lost connection to backend server.
2019-10-11 00:28:07   error  : (115406) [readwritesplit] Lost connection to the master server 'db01-c', closing session. Lost connection to master server while connection was idle. Connection has been idle for 600.1 seconds. Error caused by: #HY000: Lost connection to backend server.
2019-10-11 00:56:06   error  : (160126) [readwritesplit] Lost connection to the master server 'db01-c', closing session. Lost connection to master server while connection was idle. Connection has been idle for 600.1 seconds. Error caused by: #HY000: Lost connection to backend server.
2019-10-11 01:20:44   alert  : Fatal: MaxScale 2.4.1 received fatal signal 11. Attempting backtrace.
2019-10-11 01:20:44   alert  : Commit ID: b47b7c15edbbc70d9137520e3c590c4f16b3a4eb System name: Linux Release string: Ubuntu 18.04.3 LTS
sh: 1: /bin/test: not found
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSession17handle_got_targetEP5GWBUFPN8maxscale9RWBackendEb+0xba) [0x7f90fbe9c58a]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSession17route_single_stmtEP5GWBUF+0x667) [0x7f90fbe9d0d7]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSession10routeQueryEP5GWBUF+0x1fe) [0x7f90fbe95bde]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN8maxscale6RouterI7RWSplit14RWSplitSessionE10routeQueryEP10mxs_routerP18mxs_router_sessionP5GWBUF+0x21) [0x7f90fbe8e6f1]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0xfb720) [0x7f9106049720]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker4tickEv+0x24f) [0x7f910605d08f]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase11WorkerTimer6handleEPNS_6WorkerEj+0x67) [0x7f910605b587]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x1a6) [0x7f910605c116]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53) [0x7f910605c333]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7f9104d7e66f]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f91058256db]: <binutils not installed>
2019-10-11 01:20:44   alert  :   /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f91040d688f]: <binutils not installed>

Comment by markus makela [ 2019-10-11 ]

Just to confirm, you tested this with the latest 2.3 release as well?

Comment by Daniel [ 2019-10-11 ]

Hello,

here are the tested versions, as far back as the log files reach:

root@maxscale01-a:~# cat /var/log/maxscale/maxscale.* |grep 'notice : MariaDB MaxScale' |awk {'print $7 "\t" $10'} |uniq |sort -n
2.3.4	aea64aede280558ca6b55500dfa7eb049ec9c377)
2.3.6	03dc969cf203decb4d0bc65280fe4f901da87830)
2.4.1	b47b7c15edbbc70d9137520e3c590c4f16b3a4eb)
2.4.2	aad4148d77bf2dfbaa0042bc45abda30c101cad2) <-- installed during this night and running since 3:29 am
 
root@maxscale01-b:/# cat /var/log/maxscale/maxscale.* |grep 'notice : MariaDB MaxScale' |awk {'print $7 "\t" $10'} |uniq |sort -n
2.2.5	
2.3.9	66d6ce6fa598a43b20af052ef6c9f637a3279fa1)
2.2.16	47e7595a08e4672501dc577b05e5d74de7dc11d2)
2.2.18	136c0c3062eb0bb51be4f427ec24303b62b71a9e)
2.2.20	0df7228e5eb8a4ce56e43bcbe57b9deebe29a081)
2.2.21	a7a22156810433c5f161efe2121e0a4f509a9548)

Comment by markus makela [ 2019-10-11 ]

Strange, it still looks like a null or a corrupted buffer is passed to the handle_got_target function.

Comment by markus makela [ 2019-10-11 ]

Would it be possible for you to add log_info=true under the [maxscale] section in the configuration? This enables verbose logging and it should help figure out why the crash is happening. Please note that the verbose logging will take up a sizable amount of disk space.
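As a configuration fragment, the suggested change would look roughly like this (a sketch only; the rest of the [maxscale] section is omitted):

```ini
[maxscale]
# Enables verbose info-level logging; expect a sizable increase in log volume.
log_info=true
```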

Comment by Daniel [ 2019-12-04 ]

Hello,

sorry for the late response. I will try to find a time window in which to enable the verbose logging.

Greetings

Comment by Daniel [ 2019-12-05 ]

Hello,

attached are the last lines before the sudden termination.

2019-12-05 13:21:51   info   : (446685) [readwritesplit] Route query to slave: db01-b 	[10.0.0.22]:3306 <
2019-12-05 13:21:51   alert  : Fatal: MaxScale 2.4.4 received fatal signal 11. Commit ID: 231f68b6dc70804b785df356679601ab3eb8e379 System name: Linux Release string: Ubuntu 18.04.3 LTS
2019-12-05 13:21:51   info   : (447459) [readwritesplit] Route query to slave: db01-c 	[10.0.0.23]:3306 <
2019-12-05 13:21:51   info   : (447506) [readwritesplit] Route query to master: db01-a 	[10.0.0.21]:3306 <
2019-12-05 13:21:51   info   : (446685) [readwritesplit] Reply complete, last reply from db01-b
2019-12-05 13:21:51   info   : (447642) [readwritesplit] Reply complete, last reply from db01-b
2019-12-05 13:21:51   info   : (447707) [readwritesplit] Reply not yet complete. Waiting for 1 replies, got one from db01-c
2019-12-05 13:21:51   info   : (447671) [readwritesplit] Route query to slave: db01-b 	[10.0.0.22]:3306 <
 
<---shortened here-->
 
2019-12-05 13:21:55   alert  : 
  /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSession17handle_got_targetEP5GWBUFPN8maxscale9RWBackendEb+0xba): include/maxscale/protocol/mysql.hh:633
  /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSession17route_single_stmtEP5GWBUF+0x667): server/modules/routing/readwritesplit/rwsplit_route_stmt.cc:375
  /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN14RWSplitSession10routeQueryEP5GWBUF+0x20e): server/modules/routing/readwritesplit/rwsplitsession.cc:162
  /usr/lib/x86_64-linux-gnu/maxscale/libreadwritesplit.so(_ZN8maxscale6RouterI7RWSplit14RWSplitSessionE10routeQueryEP10mxs_routerP18mxs_router_sessionP5GWBUF+0x21): include/maxscale/router.hh:452
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(+0x103890): server/core/session.cc:1065
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker4tickEv+0x24f): maxutils/maxbase/include/maxbase/worker.hh:790
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase11WorkerTimer6handleEPNS_6WorkerEj+0x67): maxutils/maxbase/src/worker.cc:256
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x1a6): maxutils/maxbase/src/worker.cc:858
  /usr/lib/x86_64-linux-gnu/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53): maxutils/maxbase/src/worker.cc:559
  /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f): ??:?
  /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db): ??:?
  /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f): ??:0

In case it's relevant, the backends are running:
3x Server version: 10.2.18-MariaDB-10.2.18+maria~xenial (db01-a/b/c)

Greetings

Comment by markus makela [ 2019-12-10 ]

This looks to happen in the delayed routing code.

Comment by markus makela [ 2019-12-10 ]

The only place where this event can be triggered with this configuration is in the retry_query call inside RWSplitSession::handle_error_new_connection. All uses of the m_current_query variable appear to handle the buffer ownership correctly so it's possible that something else is going on.

Comment by markus makela [ 2019-12-17 ]

Can you try installing the 2.4.4 debug packages? They can be found here: https://downloads.mariadb.com/MaxScale/2.4.4-debug/

If you can test the debug packages, we might get more information than we have now.

Comment by Daniel [ 2019-12-18 ]

Hello Markus,

thanks for reply - I installed the debug packages and enabled log_info=true.

Best regards,
dm.

Comment by Daniel [ 2019-12-18 ]

Hello,

I had to abort running the debug version because the application was no longer able to work normally.
The following errors were already logged:

root@maxscale01-a:/var/log/maxscale# tail -n10000 maxscale.log |grep 'error'
2019-12-18 10:27:45   error  : (1) [readwritesplit] Couldn't find suitable Master from 3 candidates.
2019-12-18 10:27:45   error  : (1) Failed to create new router session for service 'Read-Write-Service'. See previous errors for more details.
2019-12-18 10:27:45   error  : (4) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:45   error  : (5) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:46   error  : (1) [readwritesplit] Couldn't find suitable Master from 3 candidates.
2019-12-18 10:27:46   error  : (1) Failed to create new router session for service 'Read-Write-Service'. See previous errors for more details.
2019-12-18 10:27:46   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:46   error  : (4) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:46   error  : (3) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:47   error  : (5) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:47   error  : (1) [readwritesplit] Couldn't find suitable Master from 3 candidates.
2019-12-18 10:27:47   error  : (1) Failed to create new router session for service 'Read-Write-Service'. See previous errors for more details.
2019-12-18 10:27:47   error  : (3) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:47   error  : (4) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:48   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:48   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:49   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:49   error  : (4) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:50   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:50   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:50   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:51   error  : (3) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:51   error  : (1) [readwritesplit] Couldn't find suitable Master from 3 candidates.
2019-12-18 10:27:51   error  : (1) Failed to create new router session for service 'Read-Write-Service'. See previous errors for more details.
2019-12-18 10:27:51   error  : (3) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:51   error  : (4) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:52   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:52   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:53   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:53   error  : (3) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:53   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:53   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:54   error  : (1) [readwritesplit] Couldn't find suitable Master from 3 candidates.
2019-12-18 10:27:54   error  : (1) Failed to create new router session for service 'Read-Write-Service'. See previous errors for more details.
2019-12-18 10:27:54   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:54   error  : (3) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:55   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:55   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:55   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:55   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:56   error  : (1) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2019-12-18 10:27:56   error  : (2) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)

Greetings

Comment by Daniel [ 2019-12-18 ]

I also noticed some segfault errors in dmesg. I do not know if they are relevant. For the record: the RAM is okay, and the issue appears on different host systems.

[Do Dez  5 10:49:39 2019] maxscale[3084]: segfault at 0 ip 00007f8ad7896a6e sp 00007ffd86f5f4d0 error 4 in libmaxscale-common.so.1.0.0[7f8ad7784000+233000]
[Di Dez 10 08:51:12 2019] maxscale[7142]: segfault at 10 ip 00007f746d3f989a sp 00007f7469562090 error 4 in libreadwritesplit.so.1.0.2[7f746d3d9000+2c000]
[Di Dez 10 16:21:51 2019] maxscale[9598]: segfault at 10 ip 00007f100474c89a sp 00007f10020b8090 error 4 in libreadwritesplit.so.1.0.2[7f100472c000+2c000]

Comment by markus makela [ 2019-12-18 ]

Can you post the full stacktrace that's found in the error log?

Comment by Daniel [ 2019-12-18 ]

Hello,

attached is the log, excluding some sensitive data.

Greetings maxscale.log

Comment by markus makela [ 2019-12-30 ]

Can you also add the configuration that you used when those errors were logged?

Comment by markus makela [ 2019-12-31 ]

We think that the debug assertion might be wrong and that the code in question is not what causes the crashes. We'll fix the debug assertion and also improve the likelihood of stacktrace generation when multiple signals are sent in close proximity. We can provide a custom debug package for testing if you'd be interested in helping us figure out why MaxScale crashes.

Comment by Daniel [ 2020-01-20 ]

Hello Markus,

attached is the configuration that was used (maxscale-verbose-log.cnf), excluding comments and passwords.
Thanks for the offer of a debug package; I am interested.

Greetings

Comment by markus makela [ 2020-01-24 ]

Does the crash still happen if you remove the line filters=Cache from the configuration? I was under the impression that the cache filter was not in use and it would be good to first isolate the component that is causing the problem.

Comment by Daniel [ 2020-01-24 ]

Hello,

yes, the issue also appears with the cache filter inactive.
The setup is running in production, which is why the line is not commented out in the last attached MaxScale config.

Greetings

Comment by markus makela [ 2020-01-27 ]

You can find the debug packages for the latest release of MaxScale here: https://downloads.mariadb.com/MaxScale/2.4.6-debug/
These contain fixes to the debug assertions so we should get more data this time.

Comment by Daniel [ 2020-02-13 ]

Hello,

I installed the new debug version and removed the filters. The system was creating large core dumps with each startup of MaxScale.
Database access through MaxScale was nearly impossible for the application; it just received 502 gateway timeout errors from the load balancer and backends due to the lack of database access. This happened only a few seconds after MaxScale started and the host was activated for use.

The following lines were logged a few times:

2020-02-13 01:22:19   error  : (5) debug assert at /home/vagrant/MaxScale/server/core/buffer.cc:48 failed: buf->tail != reinterpret_cast<GWBUF*>(0xdeadbeef)
2020-02-13 01:22:19   alert  : (5) Fatal: MaxScale 2.4.6 received fatal signal 6. Commit ID: 714dece7ddb025d2ed2cca945a607a02b80bc7a2 System name: Linux Release string: Ubuntu 18.04.4 LTS

Attached the shortened log: maxscale-verbose-2-4-6-1-debug.log

Greetings

Comment by markus makela [ 2020-02-13 ]

Thank you for providing the log, I think this will help us figure out what is going on since now we have a proper stacktrace.

Comment by markus makela [ 2020-02-13 ]

Can you provide one without the cache filter? This would help us figure out where the original problem comes from.

Make sure there are no files in /var/lib/maxscale/maxscale.cnf.d/ before restarting MaxScale. It looks like your setup still uses the cache filter which leads me to think there might be persisted runtime configuration files.

Comment by Daniel [ 2020-02-14 ]

Hello Markus,

I cleared the folders /etc/maxscale.modules.d and /var/lib/maxscale/maxscale.cnf.d/ before the attached log was created and MaxScale was started. They had no content.

Additionally, I commented out the specific rule: "#rules=caching_rules.json".

Greetings

Comment by markus makela [ 2020-03-09 ]

Could you try adding session_trace=200 under the [maxscale] section? This will keep a detailed log for any sessions that cause a crash. Hopefully we'll be able to know more about this with the extra output as right now it isn't clear what causes this.

Comment by markus makela [ 2020-03-19 ]

The 2.4.6 release had code that would rule out the cache filter as the culprit, but it seems there's a debug assertion that is hit by it. Would it be possible for you to test the normal MaxScale 2.4.8 packages? Hopefully that will reveal more information.

Comment by Daniel [ 2020-06-02 ]

Hello,

sorry for the late response. I just got a maintenance window in 2 weeks.

Greetings

Comment by Daniel [ 2020-06-18 ]

Hello,

last night I tested the latest release 2.4.10 with "session_trace=200" and "log_info=true" active. MaxScale was crashing as 'expected'.
Log attached maxscale-2-4-10-verbose-log.txt.

Greetings

Comment by markus makela [ 2020-07-02 ]

Looks like the crash happened with a relatively large resultset. So far I haven't been able to reproduce the crash with the same configuration and the same rule file.

The line numbers in the stacktrace look a bit odd and don't seem to correspond with what is in the source code. This might simply be due to the fact that the code is optimized.

Looking at the code, the stacktrace suggests that the value of pNode is either null or invalid. This means that the most probable place for the bug is either get_existing_node or get_new_node. Looking at the code in get_new_node, a possible reason for a null node is the use of new (std::nothrow) Node with no check of the return value. Forcibly returning a null pointer there and letting the program continue appears to reproduce the stacktrace.

This should only be possible if the system is configured to not kill a process when it runs out of resources. It would also mean that setting sysctl -w vm.overcommit_memory=1 should cause the crash to be replaced by an OOM kill of MaxScale by the system.

If you add writeq_high_water=16Mi and writeq_low_water=8Ki under the [maxscale] section of the configuration, does the problem happen less often?
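Expressed as a configuration fragment (a sketch only; the rest of the [maxscale] section is omitted):

```ini
[maxscale]
# Write queue throttling limits suggested in the comment above, to
# reduce memory pressure from large buffered resultsets.
writeq_high_water=16Mi
writeq_low_water=8Ki
```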

Comment by markus makela [ 2020-07-03 ]

If possible, please try and see if the following command prevents the crash:

sysctl -w vm.overcommit_memory=1
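Note that sysctl -w only applies until the next reboot. To make the setting persistent, it can be placed in a sysctl.d drop-in file (the filename here is arbitrary) and applied with sysctl --system:

```ini
# /etc/sysctl.d/90-overcommit.conf (arbitrary filename)
# 1 = always overcommit; malloc/new will not fail due to overcommit
# accounting, and the OOM killer handles genuine memory exhaustion.
vm.overcommit_memory = 1
```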

Comment by markus makela [ 2020-07-07 ]

The cache filter should now no longer crash upon running out of memory. In general this isn't something MaxScale deals with well, so we recommend configuring the system to always overcommit memory.

Comment by Daniel [ 2020-07-24 ]

Hello Markus,

sorry for re-opening this thread.
I had set up a maintenance window for testing the values you provided.

writeq_high_water=16Mi and writeq_low_water=8Ki under [maxscale], as well as vm.overcommit_memory=1
-> no, unfortunately there is no difference in the number of crashes.

The system has enough RAM to keep the process running. At the very beginning of this issue we raised the available RAM for that VM to 32 GB to observe the behavior, but MaxScale does not consume it.

Greetings

Comment by markus makela [ 2020-07-24 ]

Did you use MaxScale 2.4.11? Please try with the latest release.

Generated at Thu Feb 08 04:15:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.