[MDEV-33185] Connection timeout/slowdown upon increased load Created: 2024-01-05  Updated: 2024-02-08

Status: Needs Feedback
Project: MariaDB Server
Component/s: None
Affects Version/s: 11.2.2
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Erlandas Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: performance
Environment:

Ubuntu 22.04.3 LTS (GNU/Linux 5.15.0-91-generic x86_64)


Attachments: Text File log_before_timeout.txt    
Issue Links:
Relates
relates to MDEV-31139 MariaDB hangs with multiple threads open Closed

 Description   

Once I got Server version: 11.2.2-MariaDB-1:11.2.2+maria~ubu2204-log mariadb.org binary distribution installed I have started noticing periodic reports about connection timeout to Master DB and at times connection timeout to Replica DB (1 master , 2 replicas).

We had similar (but more severe) problem reported at MDEV-31139 where it was fixed and everything was working more or less fine for at least a couple next releases.

Since that time we have figured out that server would recover/not hang up if we stop using it once we see connection time going above 50ms (a usual connection time is under 1ms). With 11.1.2 this was not happening so often or at least much less often to get noticed as 11.2.2 slows down drastically 10-12 times a day where we have to stop using Master (or some replica) to let it recover.

The problem occurs when intense writing starts with about 500+ inserts per second and there are 8000+ QPS in SELECTs.

Master server is given a number 12 here and connection time is reported 142 ms (and a switch away from it happens as replicas have 0.3ms connection time and application starts using them instead of a slow master for reading)

    "connections_info": {
        "server_id": 12,
        "visible": true,
        "writable": true,
        "readable": false,
        "master": true,
        "is_local_network": true,
        "slave_io_running": "",
        "slave_sql_running": "",
        "seconds_behind_master": 0,
        "slave_io_state": "",
        "slave_sql_state": "",
        "mysqli_exception": "",
        "last_error": "",
        "last_io_error": "",
        "last_sql_error": "",
        "connection_time": 142.21,
        "write_time": 3.8,
        "threads_connected": 144,
        "max_connections": 1200,
        "remote_clients": [
            {
                "server_id": 13,
                "visible": true,
                "writable": false,
                "readable": true,
                "master": false,
                "is_local_network": true,
                "slave_io_running": "Yes",
                "slave_sql_running": "Yes",
                "seconds_behind_master": 0,
                "slave_io_state": "Waiting for master to send event",
                "slave_sql_state": "Slave has read all relay log; waiting for more updates",
                "mysqli_exception": "",
                "last_error": "",
                "last_io_error": "",
                "last_sql_error": "",
                "connection_time": 0.38,
                "write_time": 0,
                "threads_connected": 30,
                "max_connections": 1200,
                "remote_clients": []
            },
            {
                "server_id": 11,
                "visible": true,
                "writable": false,
                "readable": true,
                "master": false,
                "is_local_network": true,
                "slave_io_running": "Yes",
                "slave_sql_running": "Yes",
                "seconds_behind_master": 0,
                "slave_io_state": "Waiting for master to send event",
                "slave_sql_state": "Slave has read all relay log; waiting for more updates",
                "mysqli_exception": "",
                "last_error": "",
                "last_io_error": "",
                "last_sql_error": "",
                "connection_time": 0.31,
                "write_time": 0,
                "threads_connected": 19,
                "max_connections": 1200,
                "remote_clients": []
            }
        ]
    }



 Comments   
Comment by Marko Mäkelä [ 2024-01-05 ]

Can you enable some instrumentation on the server? For example:

performance-schema-consumer-events_waits_current=on
performance-schema-consumer-events_statements_current=off
performance-schema-consumer-statements_digest=off
performance-schema-instrument='%=off'
performance-schema-instrument='wait/synch/mutex/%=on'
performance-schema-instrument='wait/synch/rwlock/%=on'
performance-schema=on

and then query it by the following (executed every few seconds):

select * from performance_schema.events_waits_summary_global_by_event_name
order by sum_timer_wait desc limit 20;
truncate table performance_schema.events_waits_summary_global_by_event_name;

It could also be helpful to record the output of SHOW ENGINE INNODB STATUS or SELECT * FROM information_schema.global_status once every few seconds. Hopefully there will be some correlation between the reduced performance and some of this data.

Comment by Erlandas [ 2024-01-07 ]

I have run the queries for a day noted the times when our apps detected MariaDB connection timeouts. Performance schema logging queries were executed every 5 seconds aprox. The log file is 164 Mb in size - where can I upload it safely as I see it has some of our queries logged as well?

Attaching log_before_timeout.txt one output part extracted just 4 seconds before the connection timeout detected.

Times when connection timeout was detected for you to check when you have the full log file:

2024-01-06 01:52:13
2024-01-06 03:51:05
2024-01-06 05:21:39
2024-01-06 05:27:19
2024-01-06 05:49:01
2024-01-06 09:49:10
2024-01-06 09:49:16
2024-01-06 11:59:43
2024-01-06 12:00:03
2024-01-06 12:07:34
2024-01-06 13:06:04
2024-01-06 13:21:39
2024-01-06 13:27:25
2024-01-06 13:49:13
2024-01-06 15:46:03
2024-01-06 17:09:03
2024-01-06 17:26:49
2024-01-06 17:48:26
2024-01-06 19:54:26

Comment by Erlandas [ 2024-01-11 ]

I still have the log of queries you have asked for and if needed can upload it where you could examine it.

Have been monitoring servers more frequently and have noticed that even replicas connecting to itself on LOCALHOST do get slower connection times 20-240ms really frequently (about once every every minute or so). Same happens with write time to master - very periodic slowdown. A much greater slowdown comes when there is more activity going where connection time can go up to 1.2-1.5 seconds instead of 0.5 ms.

Comment by Erlandas [ 2024-02-08 ]

Upgraded to 11.2.3-MariaDB-1:11.2.3+maria~ubu2204-log but the part of the issue persists.
On a first glance it looks like slow connection is not being seen anymore, but will need to monitor for longer.
Slow writing is seen every few seconds when moderate/intense writing to DB starts. The writing test is done by a tiny insert into a separate InnoDB table every second and time of such insert is measured.

Normally insert happens in 0.5-0.9ms, but every 2-3 seconds that time jumps to 50-140ms (when inserting into other tables is happening). The disks and CPU are not even near their full load capacity .

Generated at Thu Feb 08 10:37:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.