[MXS-3068] memory leak Created: 2020-07-06  Updated: 2021-05-02  Resolved: 2021-04-08

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 2.3.20, 2.4.13
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Massimo Assignee: markus makela
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CentOS 7.6.181


Attachments: PNG File Screenshot 2020-07-06 at 13.35.20.png    
Issue Links:
Duplicate
is duplicated by MXS-3253 Maxscale uses half of my server memory Closed
Sprint: MXS-SPRINT-111, MXS-SPRINT-112, MXS-SPRINT-113, MXS-SPRINT-114, MXS-SPRINT-115, MXS-SPRINT-116, MXS-SPRINT-117, MXS-SPRINT-118, MXS-SPRINT-120, MXS-SPRINT-121, MXS-SPRINT-122, MXS-SPRINT-123

 Description   

We are seeing a memory leak in MaxScale.

Here is the configuration:

cat /etc/maxscale.cnf
# Globals
[maxscale]
threads=auto
log_augmentation=1
ms_timestamp=1
syslog=1
admin_host=0.0.0.0
admin_port=8989
writeq_high_water=32M
writeq_low_water=512K
# Servers
[te-db01]
type=server
address=10.5.0.11
port=3306
protocol=mariadbbackend
persistpoolmax=10000
persistmaxtime=3600
[te-db02]
type=server
address=10.5.0.12
port=3306
protocol=mariadbbackend
persistpoolmax=10000
persistmaxtime=3600
[te-db03]
type=server
address=10.5.0.13
port=3306
protocol=mariadbbackend
persistpoolmax=10000
persistmaxtime=3600
# Monitoring for the servers
[Galera-Monitor]
type=monitor
module=galeramon
servers=te-db01,te-db02,te-db03
user=maxscale
password=**
monitor_interval=1000
available_when_donor=true
disable_master_failback=true
# Galera router service
[Galera-Service]
type=service
router=readwritesplit
servers=te-db01,te-db02,te-db03
user=maxscale
password=**
master_failure_mode=fail_on_write
#disable_sescmd_history=true
master_reconnection=true
max_sescmd_history=1000
prune_sescmd_history=true
# Galera cluster listener
[Galera-Listener]
type=listener
service=Galera-Service
protocol=mariadbclient
port=3306

and:

cat global-options.cnf
[maxscale]
auth_connect_timeout=3
auth_read_timeout=1
auth_write_timeout=2
admin_auth=true
passive=0

and

cat Galera-Service.cnf
[Galera-Service]
type=service
servers=te-db01,te-db02,te-db03
router=readwritesplit
user=maxscale
password=**
master_failure_mode=fail_on_write
master_reconnection=true
max_sescmd_history=500
prune_sescmd_history=true
enable_root_user=false
max_retry_interval=3600
max_connections=0
connection_timeout=500
auth_all_servers=false
strip_db_esc=true
localhost_match_wildcard_host=true
log_auth_warnings=true
retry_on_failure=true
session_track_trx_state=false
retain_last_statements=-1
use_sql_variables_in=all
slave_selection_criteria=LEAST_CURRENT_OPERATIONS
max_slave_replication_lag=-1
max_slave_connections=255
retry_failed_reads=true
disable_sescmd_history=false
strict_multi_stmt=false
strict_sp_calls=false
master_accept_reads=false
connection_keepalive=250
causal_reads=false
causal_reads_timeout=10
delayed_retry=false
delayed_retry_timeout=10
transaction_replay=false
transaction_replay_max_size=1Mi
optimistic_trx=false
session_trace=0

and

[te-db01]
type=server
persistpoolmax=13000
priority=200
address=10.5.0.11
port=3306
protocol=mariadbbackend
persistmaxtime=3600
extra_port=0
proxy_protocol=false
ssl=false
ssl_version=MAX
ssl_cert_verify_depth=9
ssl_verify_peer_certificate=true
priority=200
[te-db02]
type=server
persistpoolmax=13000
priority=100
address=10.5.0.12
port=3306
protocol=mariadbbackend
persistmaxtime=3600
extra_port=0
proxy_protocol=false
ssl=false
ssl_version=MAX
ssl_cert_verify_depth=9
ssl_verify_peer_certificate=true
priority=100
[te-db03]
type=server
persistpoolmax=13000
priority=300
address=10.5.0.13
port=3306
protocol=mariadbbackend
persistmaxtime=3600
extra_port=0
proxy_protocol=false
ssl=false
ssl_version=MAX
ssl_cert_verify_depth=9
ssl_verify_peer_certificate=true
priority=300



 Comments   
Comment by markus makela [ 2020-10-16 ]

Persistent connections have been ruled out as a cause.

Comment by markus makela [ 2020-10-16 ]

Possibly caused by MXS-2917.

Comment by markus makela [ 2020-10-16 ]

Unlikely to be caused by MXS-2917 as the leak is still visible with 2.3.20.

Comment by markus makela [ 2020-10-16 ]

There appears to be a leak in 2.3.20 that is very similar to MXS-2917 but does not appear in the latest 2.4 release. The following query causes a leak with 2.3 but not with 2.4.

CREATE TABLE IF NOT EXISTS X ( v1 DOUBLE AS (IF(NULLIF( f2, f2 ) AND f1, ( ( ( f3 )  AND f3) DIV f6 ), f3 )) PERSISTENT);

Comment by markus makela [ 2020-10-16 ]

Backporting the fix for MXS-2508 seems to fix the leak. Why it does so needs further analysis.

Comment by markus makela [ 2020-10-16 ]

johan.wikman can you analyze why the following memory leak occurs in 2.3 but doesn't in 2.4?

==1286591== 256 (80 direct, 176 indirect) bytes in 1 blocks are definitely lost in loss record 1,044 of 1,437
==1286591==    at 0x483A809: malloc (vg_replace_malloc.c:307)
==1286591==    by 0x6BEF1CD: sqlite3MemMalloc (sqlite3.c:18650)
==1286591==    by 0x6BEF793: sqlite3Malloc (sqlite3.c:22353)
==1286591==    by 0x6BEFE52: dbMallocRawFinish (sqlite3.c:22676)
==1286591==    by 0x6BEFFCD: sqlite3DbMallocRawNN (sqlite3.c:22744)
==1286591==    by 0x6C334FF: sqlite3ExprAlloc (sqlite3.c:87168)
==1286591==    by 0x6C3393A: sqlite3ExprAnd (sqlite3.c:87314)
==1286591==    by 0x6C3377B: sqlite3PExpr (sqlite3.c:87258)
==1286591==    by 0x6C759A0: spanBinaryExpr (sqlite3.c:128536)
==1286591==    by 0x6C78FA0: yy_reduce (sqlite3.c:132821)
==1286591==    by 0x6C7C38D: sqlite3Parser (sqlite3.c:134363)
==1286591==    by 0x6C7D943: sqlite3RunParser (sqlite3.c:135758)

Comment by Johan Wikman [ 2020-10-19 ]

markus makela There is a bug in sqlite: it may leak memory if it fails to parse a statement. When sqlite is used as a database that is not a problem because, for obvious reasons, you will not use statements it cannot parse.

However, it may be a problem in MaxScale: a statement that is legal MariaDB SQL may be used over and over again, each time potentially causing a leak in sqlite if the sqlite parser cannot fully parse it.

The default sqlite does not recognize DIV; / must be used instead. As

CREATE TABLE IF NOT EXISTS X ( v1 DOUBLE AS (IF(NULLIF( f2, f2 ) AND f1, ( ( ( f3 )  AND f3) DIV f6 ), f3 )) PERSISTENT);

uses DIV, sqlite was not able to fully parse it, which in this case apparently caused a leak.

The fix for MXS-2508 added DIV and MOD as synonyms for / and % respectively. With that fix backported to 2.3, the parsing no longer fails, so the leak (caused by the cleanup bug in sqlite) no longer occurs.
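
The DIV and MOD behaviour described above can be checked against a stock SQLite build. The sketch below uses the SQLite library bundled with Python's sqlite3 module, which is an assumption made for illustration: it is not the modified sqlite that MaxScale's query classifier embeds, and it only shows the parse failure, not the leak. The helper name stock_sqlite_parses is hypothetical.

```python
import sqlite3


def stock_sqlite_parses(sql):
    """Return True if the SQLite bundled with Python can parse and run `sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(sql)
        return True
    except sqlite3.OperationalError:
        # The sqlite3 module reports SQL syntax errors as OperationalError.
        return False
    finally:
        conn.close()


# Compare the symbolic operators with the MariaDB keyword forms.
for stmt in ("SELECT 7 / 2", "SELECT 7 DIV 2", "SELECT 7 % 2", "SELECT 7 MOD 2"):
    print(f"{stmt!r}: parsed={stock_sqlite_parses(stmt)}")
```

With a stock parser the keyword forms are rejected while / and % are accepted, which matches the explanation that the statement only parses fully once DIV and MOD are treated as synonyms.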

Comment by Johan Wikman [ 2020-10-19 ]

This may also indicate that some sqlite lemon destructor is still missing, although MXS-2917 was supposed to have fixed that.

Comment by markus makela [ 2020-10-20 ]

Assigning this to you johan.wikman since it looks like it's SQLite related.

Comment by markus makela [ 2020-12-28 ]

With a new configuration that doesn't use persistent connections and has a correctly configured max_sescmd_history, memory usage does not appear to grow as fast as before. It may even flatten out eventually as the query classifier cache is populated.
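
One rough way to tell whether memory use is still climbing or has flattened out is to sample the resident set size of the MaxScale process at intervals. This is a minimal sketch, assuming a Linux host where /proc/<pid>/status is readable; the helper names are hypothetical and not part of MaxScale.

```python
import re


def parse_vmrss_kib(status_text):
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS not found in status text")
    return int(match.group(1))


def read_rss_kib(pid):
    """Read the current resident set size of process `pid` (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kib(status_file.read())


def growth_per_sample(samples):
    """Return the difference between each pair of consecutive samples."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]
```

Calling read_rss_kib at a fixed interval and passing the collected values to growth_per_sample gives a series that should shrink towards zero if the growth is cache warm-up, and stay positive if memory is actually leaking.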

Comment by Johan Wikman [ 2021-02-16 ]

massimo.disaro Could you find out whether they are using window functions that use PRECEDING or FOLLOWING?
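
A rough way to check this from a statement log, assuming the statements are available as plain text, is a keyword scan. The function name below is hypothetical and the regular expressions are only a heuristic: keywords inside string literals or comments would produce false positives.

```python
import re

# OVER marks a window function call; PRECEDING/FOLLOWING mark a frame bound.
OVER_RE = re.compile(r"\bOVER\b", re.IGNORECASE)
FRAME_RE = re.compile(r"\b(?:PRECEDING|FOLLOWING)\b", re.IGNORECASE)


def uses_window_frame_bounds(sql):
    """Heuristically report whether `sql` has a window frame using PRECEDING or FOLLOWING."""
    return bool(OVER_RE.search(sql)) and bool(FRAME_RE.search(sql))
```

Running this over each logged statement and collecting the ones that return True would give a short list to inspect by hand.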
