[MXS-4204] Maxscale Schema Router is slow to respond for normal querying. Created: 2022-07-14  Updated: 2022-09-16  Resolved: 2022-09-16

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 6.4.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Edward Stoever Assignee: markus makela
Resolution: Not a Bug Votes: 0
Labels: None


 Description   

Tested on Maxscale 6.4.0

Customer has requested that maxscale schema router be configured so that they can shard schemas by moving them to separate Mariadb instances.

Current customer configuration: single instance, 300 schemas, each with approximately 75 tables. Approximately 1,000 user accounts. Each account with varied granted privileges.
Desired customer configuration: Move entire schemas to new instances and use schema router as an entry-point for all connections.

To test this scenario, I created the following:

                                          SCHEMAS  TABLES  USERS  Select_priv  SCHEMA_PRIVILEGES  TABLE_PRIVILEGES
f1.edw.ee - replicates to --> f2.edw.ee   87       6732    1103   22           3841               40269
f3.edw.ee - replicates to --> f4.edw.ee   93       9011    1103   100          5029               405
f5.edw.ee                                 102      7258    1103   14           29320              288609
f6.edw.ee                                 88       7093    1103   24           45512              3847
mx.edw.ee maxscale as schema router

Select_priv = grant select on *.* to
SCHEMA_PRIVILEGES = grant select on my_schema.* to
TABLE_PRIVILEGES = grant select on my_schema.my_table to

I created scripts that would connect repeatedly and select, insert, update, delete. Each query a new connection from a separate account. Example:

mariadb -uofficer_h_dreadhiss -ppassword -h192.168.20.125 -P3306 -ABNe 'select * from hatchettine_is_half.damp_we_units_cynomolgus;'
mariadb -ug_yarona -ppassword -h192.168.20.125 -P3306 -ABNe 'select * from campy_by_thaw.hospitable_so_problems_unvitreousness;'
mariadb -ucounselor_s_frannk -ppassword -h192.168.20.125 -P3306 -ABNe 'select * from television_be_bite.tremendous_do_childs_thromboembolic;'
mariadb -udean_m_signe -ppassword -h192.168.20.125 -P3306 -ABNe 'select * from crizzling_or_stab.erratic_by_passengers_ovigerm;'
mariadb -uo_ellil -ppassword -h192.168.20.125 -P3306 -ABNe 'select * from ceratin_am_snub.rotten_of_informations_ecthyma;'
mariadb -ud_hector -ppassword -h192.168.20.125 -P3306 -ABNe 'select * from humpbacks_as_hook.exciting_up_extents_subucula;'

Running a typical script with 605 lines connecting directly to the database completes in 10 to 15 seconds.
Running a typical script with 605 lines connecting through maxscale schema router completes in more than 10 minutes.



 Comments   
Comment by markus makela [ 2022-07-15 ]

10 minutes is definitely too slow if it happens after the initial building of the map of all shards and their locations. One thing that would help evaluate some of the latency issues is to run the following query and see how long it takes:

SELECT LOWER(t.table_schema), LOWER(t.table_name) FROM information_schema.tables t 
UNION ALL 
SELECT LOWER(s.schema_name), '' FROM information_schema.schemata s;

This is the query that the schemarouter executes to determine which tables are visible for which users.

Comment by markus makela [ 2022-07-15 ]

One thing I did notice is that if you have log_info=true, the performance will be much slower as the majority of the time is spent writing debug log data into the log files. Checking whether it's on via maxctrl show maxscale would be a good sanity check to make.

The given test case is pretty much the worst-case performance you can get from schemarouter: the shard map cache won't be of any use as all users are unique and each of them builds the shard map when it starts. The default cache validity of 300 seconds isn't helping here that much either so increasing that to something like refresh_interval=24h would also be sensible (assuming the set of tables and databases insn't constantly changing).

Comment by markus makela [ 2022-07-15 ]

I did find some odd slowness in the schemarouter even in the case where the shard map is found in the cache: compared to a readwritesplit service which took 4 seconds, the schemarouter took around 7 seconds to iterate over 1000 users and do a SELECT 1. This was with the command line client so one of the queries it does might cause this slowness; further testing is requried.

Comment by markus makela [ 2022-09-16 ]

Closing as Not a Bug since this is very much expected behavior with the way the schemarouter was designed to work.

Sequentially executing SQL statements by using a separate user for each statement is expected to take longer as it forces the schemarouter to build the shard location map every time the user logs in. This effectively bypasses the whole caching that is in place for it and is comparable to using InnoDB with no buffer pool.

Generated at Thu Feb 08 04:26:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.