[MXS-4342] Binlog router causing database slowness Created: 2022-10-07  Updated: 2022-10-18  Resolved: 2022-10-12

Status: Closed
Project: MariaDB MaxScale
Component/s: binlogrouter
Affects Version/s: 2.5.19
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Bryan Bancroft (Inactive) Assignee: Unassigned
Resolution: Duplicate Votes: 1
Labels: triage

Attachments: PNG File image.png    
Issue Links:
Duplicate
duplicates MXS-4000 Binlogrouter creates malformed replic... Closed

 Description   

Situation is a local cluster of 2 nodes, 2 mxs nodes with configured with keepalived, and an external skysql instance.

external db is connecting to maxscales vip via a binlog router

┌─────────────────────┬────────────────────────────────────────────────────────────────────────┐
│ Service             │ Replication-Proxy                                                      │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Router              │ binlogrouter                                                           │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ State               │ Started                                                                │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Started At          │ Tue Sep  6 16:11:38 2022
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Current Connections │ 0
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Total Connections   │ 190653
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Max Connections     │ 3
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Cluster             │ MariaDB-Monitor                                                        │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Servers             │                                                                        │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Services            │                                                                        │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Filters             │                                                                        │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Parameters          │ {                                                                      │
│                     │     "auth_all_servers": false,                                         │
│                     │     "cluster": "MariaDB-Monitor",                                      │
│                     │     "connection_keepalive": 300,                                       │
│                     │     "connection_timeout": 0,                                           │
│                     │     "datadir": "/var/lib/maxscale/binlogs/",                           │
│                     │     "enable_root_user": false,                                         │
│                     │     "expire_log_duration": 172800000,                                  │
│                     │     "expire_log_minimum_files": 16,                                    │
│                     │     "localhost_match_wildcard_host": true,                             │
│                     │     "log_auth_warnings": true,                                         │
│                     │     "max_connections": 0,                                              │
│                     │     "net_timeout": 10000,                                              │
│                     │     "net_write_timeout": 0,                                            │
│                     │     "password": "*****",                                               │
│                     │     "purge_poll_timeout": 120000,                                      │
│                     │     "purge_startup_delay": 120000,                                     │
│                     │     "rank": "primary",                                                 │
│                     │     "retain_last_statements": -1,                                      │
│                     │     "router_options": null,                                            │
│                     │     "select_master": true,                                             │
│                     │     "server_id": 109019485,                                            │
│                     │     "session_trace": false,                                            │
│                     │     "session_track_trx_state": false,                                  │
│                     │     "strip_db_esc": true,                                              │
│                     │     "targets": null,                                                   │
│                     │     "user": "maxscale",                                                │
│                     │     "version_string": null
│                     │ }                                                                      │
├─────────────────────┼────────────────────────────────────────────────────────────────────────┤
│ Router Diagnostics  │ {                                                                      │
│                     │     "current_binlog": "/var/lib/maxscale/binlogs//mariadb-bin.000313", │
│                     │     "gtid_io_pos": "0-2-10323413",                                     │
│                     │     "master_config": {                                                 │
│                     │         "host": "10.90.194.27",                                        │
│                     │         "port": 3306,                                                  │
│                     │         "ssl": false,                                                  │
│                     │         "user": "maxscale"
│                     │     }                                                                  │
│                     │ }                                                                      │
└─────────────────────┴────────────────────────────────────────────────────────────────────────┘

When replication is enabled these errors spam.

2022-10-07 19:45:19   error  : Write to Client DCB 10.15.1.115 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:49:14   error  : Write to Client DCB 10.15.1.115 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:58:44   error  : Write to Client DCB 10.15.1.111 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:58:46   error  : Write to Client DCB 10.15.1.111 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:58:49   error  : Write to Client DCB 10.15.1.111 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:58:53   error  : Write to Client DCB 10.15.1.111 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:58:55   error  : Write to Client DCB 10.15.1.111 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:59:04   error  : Write to Client DCB 10.15.1.111 in state DCB::State::POLLING failed: 104, Connection reset by peer
2022-10-07 19:59:07   error  : Write to Client DCB 10.15.1.111 in state DCB::State::POLLING failed: 104, Connection reset by peer

Cause is a spike in CPU usage on the database and overall slowness in the DB.



 Comments   
Comment by markus makela [ 2022-10-08 ]

Can you include the MaxScale configuration and the versions of the MariaDB servers?

Comment by Bryan Bancroft (Inactive) [ 2022-10-10 ]

markus makela

Local cluster 10.5.13
local mxs 2.5.19
skysql 10.6.7

# MaxScale documentation:
# https://mariadb.com/kb/en/mariadb-maxscale-25/
 
# Global parameters
#
# Complete list of configuration options:
# https://mariadb.com/kb/en/mariadb-maxscale-25-mariadb-maxscale-configuration-guide/
#
[maxscale]
admin_host=0.0.0.0
admin_secure_gui=false
log_info=true
admin_auth=true
 
# Server definitions
#
# Set the address of the server to the network
# address of a MariaDB server.
#
 
[db1.]
type=server
address=10.90.194.124
port=3306
protocol=MariaDBBackend
proxy_protocol=yes
 
[db2.]
type=server
address=10.90.194.27
port=3306
protocol=MariaDBBackend
proxy_protocol=yes
 
# Monitor for the servers
#
# This will keep MaxScale aware of the state of the servers.
# MariaDB Monitor documentation:
# https://mariadb.com/kb/en/maxscale-25-monitors/
 
[MariaDB-Monitor]
type=monitor
module=mariadbmon
servers=db1.,db2.
user=maxscale
password=
replication_user=maxscale
replication_password=
auto_failover=true
auto_rejoin=true
monitor_interval=100
failcount=2
switchover_timeout=20
failover_timeout=20
 
# Service definitions
#
# Service Definition for a read-only service and
# a read/write splitting service.
#
 
# ReadConnRoute documentation:
# https://mariadb.com/kb/en/mariadb-maxscale-25-readconnroute/
 
[Read-Only-Service]
type=service
router=readconnroute
servers=db1.,db2.
user=maxscale
password=
router_options=slave
 
# ReadWriteSplit documentation:
# https://mariadb.com/kb/en/mariadb-maxscale-25-readwritesplit/
 
[Read-Write-Service]
max_connections=20000
type=service
router=readwritesplit
servers=db1.,db2.
user=maxscale
password=
master_reconnection=1
delayed_retry=1
#causal_reads=true
##transaction_replay=true
 
[Replication-Proxy]
type=service
router=binlogrouter
cluster=MariaDB-Monitor
select_master=true
expire_log_duration=48h
expire_log_minimum_files=16
user=maxscale
password=
server_id=109019485
 
 
# Listener definitions for the services
#
# These listeners represent the ports the
# services will listen on.
#
 
[Read-Only-Listener]
type=listener
service=Read-Only-Service
protocol=MariaDBClient
port=4008
 
[Read-Write-Listener]
type=listener
service=Read-Write-Service
protocol=MariaDBClient
port=4006
 
 
 
[Replication-Listener]
type=listener
service=Replication-Proxy
protocol=MariaDBClient
port=3307

Comment by Bryan Bancroft (Inactive) [ 2022-10-10 ]

Some context, replication was working fine for over 1 month

Comment by markus makela [ 2022-10-10 ]

Any significant changes over this one month? Increased binlog volume or something else that would hint at what causes it?

If at all possible, please try profiling the running process. This should help pinpoint what is taking so much CPU time.

Comment by Bryan Bancroft (Inactive) [ 2022-10-10 ]

markus makela Nothing abnormal about the binary logs

-rw-rw---- 1 mysql mysql 1073742500 Sep 21 23:16 mariadb-bin.002820
-rw-rw---- 1 mysql mysql 1073742985 Sep 28 17:11 mariadb-bin.002821
-rw-rw---- 1 mysql mysql 1073742299 Oct  4 09:49 mariadb-bin.002822
-rw-rw---- 1 mysql mysql 1073751013 Oct  9 20:57 mariadb-bin.002823
-rw-rw---- 1 mysql mysql  195752147 Oct 10 21:32 mariadb-bin.002824

Working on a timeframe we can try to collect more process info. Any ideal format or commands run for you to debug?

Comment by markus makela [ 2022-10-11 ]

One thing that I could use is a clarification on whether it's the DB or MaxScale whose CPU usage has increased. If it's the database, are there any errors reported in the replication (e.g. in SHOW SLAVE STATUS)?

Comment by Bryan Bancroft (Inactive) [ 2022-10-11 ]

It is on the database server in the local cluster (target master). If we set the replica to maint mode and send the connection to local cluster master, same cpu spike there

Error skysql replica side is
Slave_IO_State: Reconnecting after a failed master event read

No errors on the pegged DB and to clarify, the target master is hitting performance issues

Comment by markus makela [ 2022-10-11 ]

Ah, so the CPU spike is on the primary node (db1. and db2. in the config) where MaxScale is replicating from?

Comment by markus makela [ 2022-10-11 ]

This might be something that deterministically happens with the current binlogs stored in the binlogrouter. This would explain the repeated failure to start replication and the same write failure message being logged: if the other end abruptly closes the TCP socket, MaxScale logs this error since it most of the time means something is wrong somewhere. This behavior could be explained by MXS-4000 which was fixed for 2.5.20 and which involved undefined behavior. If the uninitialized memory used by the binlogrouter was set to the byte value 0xff, the replicating server would end up treating an otherwise perfectly valid binlog event as an error and would reconnect immediately.

Comment by Bryan Bancroft (Inactive) [ 2022-10-11 ]

The slave process is interesting. Never seen that checking version status. Is this as simple as a version incompatibility bug?

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Checking master version
                   Master_Host: 10.90.194.8
                   Master_User: skysql_replication_db00006557
                   Master_Port: 3307
                 Connect_Retry: 60
               Master_Log_File: mariadb-bin.000313
           Read_Master_Log_Pos: 362999295
                Relay_Log_File: mariadb-relay-bin.000002
                 Relay_Log_Pos: 600
         Relay_Master_Log_File: mariadb-bin.000313
              Slave_IO_Running: Preparing
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 299
               Relay_Log_Space: 4844
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 109019485
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-2-10012483
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 26
1 row in set (0.000 sec)
 
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Reconnecting after a failed master event read
                   Master_Host: 10.90.194.8
                   Master_User: skysql_replication_db00006557
                   Master_Port: 3307
                 Connect_Retry: 60
               Master_Log_File: mariadb-bin.000313
           Read_Master_Log_Pos: 362999295
                Relay_Log_File: mariadb-relay-bin.000002
                 Relay_Log_Pos: 600
         Relay_Master_Log_File: mariadb-bin.000313
              Slave_IO_Running: Connecting
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 299
               Relay_Log_Space: 5296
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 109019485
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-2-10012483
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 26
1 row in set (0.000 sec)
 
MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 10.90.194.8
                   Master_User: skysql_replication_db00006557
                   Master_Port: 3307
                 Connect_Retry: 60
               Master_Log_File: mariadb-bin.000313
           Read_Master_Log_Pos: 362999295
                Relay_Log_File: mariadb-relay-bin.000002
                 Relay_Log_Pos: 600
         Relay_Master_Log_File: mariadb-bin.000313
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 299
               Relay_Log_Space: 5296
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 109019485
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-2-10012483
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: optimistic
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 26
1 row in set (0.000 sec)

Comment by markus makela [ 2022-10-11 ]

Since 2.5.19 suffers from MXS-4000, it would be smart to uprgade and rule it out as a source of problems.

Comment by Bryan Bancroft (Inactive) [ 2022-10-11 ]

Upgrade to 2.5 latest solved the issue.

Comment by markus makela [ 2022-10-12 ]

Closing as a duplicate of MXS-4000.

Generated at Thu Feb 08 04:27:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.