[MXS-3808] Improve Rest API performance Created: 2021-10-07  Updated: 2022-03-04  Resolved: 2022-03-04

Status: Closed
Project: MariaDB MaxScale
Component/s: REST-API
Affects Version/s: 6.1.0, 6.1.1, 6.1.2, 6.1.3
Fix Version/s: 2.5.20, 6.2.3, 22.08.0

Type: Bug Priority: Minor
Reporter: Phil Porada Assignee: markus makela
Resolution: Fixed Votes: 1
Labels: performance
Environment:

Multiple MaxScale nodes connecting to multiple MariaDB 10.5.x nodes


Attachments: PNG File Screenshot_2021-10-29_12-31-40.png     PNG File maxctrl-stats-exporter.png    

 Description   

We have the following application and database topology. On each MaxScale node we deploy https://github.com/Vetal1977/maxctrl_exporter to scrape the REST API and turn its output into Prometheus metrics every 30 seconds. Under periods of high connection load on our MaxScale instances, we find that the maxctrl_exporter is unable to scrape the REST API in a timely fashion. Attached is a screenshot showing dropped/missing stats for an example period.
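For reference, the exporter's job is essentially to turn the JSON returned by the REST API's /v1/servers endpoint into Prometheus text-format metrics. A minimal sketch of that transformation; the JSON field layout and metric name here are illustrative assumptions, not the exporter's actual mapping:

```python
import json

# Example /v1/servers response body; the attribute layout is an assumption
# based on MaxScale's JSON API style, not a verbatim capture.
SAMPLE = """
{
  "data": [
    {"id": "db1", "attributes": {"state": "Master, Running",
                                 "statistics": {"connections": 42}}},
    {"id": "db2", "attributes": {"state": "Slave, Running",
                                 "statistics": {"connections": 17}}}
  ]
}
"""

def to_prometheus(body: str) -> str:
    """Render per-server connection counts as Prometheus gauge samples."""
    doc = json.loads(body)
    lines = ["# TYPE maxscale_server_connections gauge"]
    for server in doc["data"]:
        conns = server["attributes"]["statistics"]["connections"]
        lines.append(
            f'maxscale_server_connections{{server="{server["id"]}"}} {conns}'
        )
    return "\n".join(lines)

print(to_prometheus(SAMPLE))
```

The scrape itself is a plain HTTP GET against the MaxScale admin port, so any slowness observed by Prometheus is dominated by how long MaxScale takes to assemble the response.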

Would it be possible to add a dedicated admin thread that serves the REST API and is guaranteed to respond in a timely fashion?

 
[app1] [app2] [app3] [app4]
  |       |     |      |
  +-------+-----+------+
             |
[DNS record proxy.mariadb.example.com]
           |       |
        [max1]-[max2]
          |   X   |
          [db1] [db2]



 Comments   
Comment by markus makela [ 2021-10-08 ]

The REST API already runs on a separate thread but, depending on what is scraped, it can interact with the worker threads. We'd need more information to know why it appears to slow down.

Comment by markus makela [ 2021-10-11 ]

SneakyPhil can you find out which endpoint causes this problem? A quick look at that exporter reveals that it queries multiple endpoints, and figuring out which one of them is responsible would greatly help us fix any inefficiencies in it.

Comment by Phil Porada [ 2021-10-29 ]

I let this one sit and I'm sorry about that. A few days after I posted this issue we stopped using maxscale altogether. Not due to maxscale itself, but because our application design is unable to take advantage of maxscale's strengths.

As for this exporter, the stat from the original attached image hits `/servers` and can fail to return data when maxscale is under high session creation/deletion load.

Here's another picture showing that not all stats are dropped.

Comment by markus makela [ 2022-02-25 ]

Having had the time to look into this more closely, the /servers endpoint does indeed seem to be the worst offender, mostly because it calculates the connection pool statistics by asking each thread for its locally cached version.

In addition, I believe the data ended up being generated twice by accident, which doubled the amount of work for no good reason. Fixing this should cut the delay roughly in half, as the data only needs to be requested once.

It also seems that in 2.5 a call to the /servers endpoint causes the connection pool to be cleared of stale connections. This is probably a remnant from the old maxadmin days, when it was used mainly for testing and to get accurate counts, but in practice it's not worth doing.
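The cross-thread collection pattern described above can be sketched as follows. This is an illustrative model of the mechanism, not MaxScale's actual code: each worker owns a local counter and answers statistics requests through its event queue, so a request queues up behind whatever else the worker is doing.

```python
import queue
import threading

class Worker:
    """Event-loop thread that owns a local connection count (illustrative)."""
    def __init__(self, local_connections: int):
        self.local_connections = local_connections
        self.inbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Statistics requests wait in line behind the worker's other events,
        # which is why a heavily loaded worker stalls the REST API response.
        while True:
            reply_to = self.inbox.get()
            if reply_to is None:
                break
            reply_to.put(self.local_connections)

    def stop(self):
        self.inbox.put(None)
        self._thread.join()

def collect_connection_stats(workers):
    """Ask every worker for its cached count and sum the replies."""
    reply = queue.Queue()
    for w in workers:
        w.inbox.put(reply)
    return sum(reply.get() for _ in workers)

workers = [Worker(n) for n in (10, 20, 12)]
print(collect_connection_stats(workers))  # 42
for w in workers:
    w.stop()
```

If the same round-trip is accidentally issued twice per REST call, the collection cost doubles, which matches the "cut the delay roughly in half" estimate above.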

Comment by markus makela [ 2022-03-01 ]

I'm changing this to a bug since a few of the endpoints do some pretty inefficient stuff that's not really needed.

Comment by markus makela [ 2022-03-04 ]

The /servers endpoint is now more efficient in how it collects data that is spread across the other threads. It also no longer purges the persistent connection pool, as that is now done automatically (MXS-4034). In 6.2 the pool statistics are also read directly instead of waiting for each thread to send them when it is no longer busy. This should improve performance under heavy load.
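Reading the statistics directly amounts to the REST API thread taking a snapshot of each worker's counter itself, with no queueing behind busy workers. A hedged sketch, again illustrative rather than MaxScale's actual implementation; in C++ these counters would likely be atomics, while here a plain int behind a lock stands in because CPython has no atomic integer type:

```python
import threading

class WorkerStats:
    """Per-worker counter, updated by its owning thread (illustrative).

    A real C++ implementation could use std::atomic<int64_t> with relaxed
    ordering; the lock here only models safe cross-thread reads.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._connections = 0

    def add(self, n: int):
        with self._lock:
            self._connections += n

    def read(self) -> int:
        with self._lock:
            return self._connections

def collect_directly(stats_list):
    """The REST API thread reads every counter itself: no event-queue
    round-trip, no waiting for busy workers, just current values."""
    return sum(s.read() for s in stats_list)

stats = [WorkerStats() for _ in range(3)]
for s, n in zip(stats, (10, 20, 12)):
    s.add(n)
print(collect_directly(stats))  # 42
```

The trade-off is that the snapshot may be slightly stale relative to in-flight updates, which is acceptable for monitoring statistics.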

Generated at Thu Feb 08 04:24:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.