Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Description
TL;DR
We re-implemented the Spider connection pool with a read-write-locked hash map whose values are mutex-protected stacks.
In our tests, this design reduces p99 latency and increases overall QPS. The performance is significantly better than the original design when there are far more query threads than the connection pool's capacity.
Introduction
The Spider storage engine creates and maintains a connection pool that caches connection objects (pointers) to multiple remote backends.
Spider can send multiple SQL statements to one backend simultaneously, so there may be numerous connection objects to the same backend in the connection pool. These connection objects are interchangeable as long as they point to the same backend.
When receiving a request with a specified backend target bk, a Spider thread searches the connection pool for a connection object to `bk`. If one exists, Spider uses that object to connect to the backend and send the SQL to it. After the SQL execution completes, Spider puts the connection object back into the connection pool.
Otherwise, Spider either creates a new connection or waits for another thread to put one back. The variable spider_max_connection determines which of the two happens, and the variable spider_conn_wait_timeout specifies the maximum waiting time.
Periodically, Spider recycles connection objects that have not been used for a certain interval.
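As a rough illustration of this acquisition flow, here is a minimal C++ sketch of the decision between reusing, creating, and waiting. Every name in it (acquire_conn, get_conn_from_pool, create_new_conn, wait_for_returned_conn, open_conn_count) is a hypothetical placeholder, not an actual Spider symbol.

/* Hypothetical sketch of the acquisition flow described above. */
#include <string>

typedef std::string key_type;   /* identifies a backend */
typedef void *conn_ptr;         /* connection pointer */

/* hypothetical helpers standing in for the real Spider routines */
conn_ptr get_conn_from_pool(const key_type &bk);              /* search the pool */
conn_ptr create_new_conn(const key_type &bk);                 /* open a new connection */
conn_ptr wait_for_returned_conn(const key_type &bk, int timeout_sec);
int open_conn_count(const key_type &bk);                      /* connections already open to bk */

extern int spider_max_connection;    /* capacity per backend */
extern int spider_conn_wait_timeout; /* max waiting time in seconds */

conn_ptr acquire_conn(const key_type &bk)
{
  if (conn_ptr conn = get_conn_from_pool(bk))
    return conn;                                  /* reuse a cached connection */
  if (open_conn_count(bk) < spider_max_connection)
    return create_new_conn(bk);                   /* below capacity: create a new one */
  /* at capacity: wait until another thread puts a connection back */
  return wait_for_returned_conn(bk, spider_conn_wait_timeout);
}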
The Original Connection Pool
Spider used a hash map to represent the connection pool, in which the hash key is remote_name#version and the hash value is a connection object pointer.
For example, consider the following configuration:
mysql> select * from mysql.servers;
+-------------+-----------+----+----------+----------+-------+--------+---------+-------+
| Server_name | Host      | Db | Username | Password | Port  | Socket | Wrapper | Owner |
+-------------+-----------+----+----------+----------+-------+--------+---------+-------+
| SPT0        | 127.0.0.1 |    | mysql    | mysql    | 20000 |        | mysql   |       |
| SPT1        | 127.0.0.1 |    | mysql    | mysql    | 20001 |        | mysql   |       |
| SPT2        | 127.0.0.1 |    | mysql    | mysql    | 20002 |        | mysql   |       |
| SPT3        | 127.0.0.1 |    | mysql    | mysql    | 20003 |        | mysql   |       |
+-------------+-----------+----+----------+----------+-------+--------+---------+-------+
4 rows in set (0.00 sec)
The name of the remote server 127.0.0.1:20000 is SPT0. By default, the version is 0, so the hash key is 'SPT0#0'. If 127.0.0.1:20000 goes down, we perform an active/standby failover: the standby server inherits the name SPT0, but the version is bumped to 1, so the new hash key becomes 'SPT0#1'.
To get a connection pointer from the connection pool, Spider searches the hash map by server name and, if an entry exists, removes it from the map and returns it. To put a connection pointer back, Spider constructs a new hash item from the 'servername#version' key and the pointer, and inserts it into the hash map.
The hash map is therefore a multi-map: one hash key can correspond to multiple values.
Spider uses a single mutex lock to guarantee thread safety whenever it accesses the hash map, whether for search, insertion, or deletion. This is not efficient, because the whole hash structure is locked on every access. It is especially slow when recycling connections, since Spider has to iterate over the entire hash map and delete every connection object whose idle time exceeds the threshold.
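For comparison, the following is a minimal sketch of the original scheme, assuming std::unordered_multimap and a single global pthread mutex stand in for the server's own HASH implementation; the real Spider code differs in data structures and names.

#include <pthread.h>
#include <string>
#include <unordered_map>

typedef std::string key_type;   /* servername#version */
typedef void *conn_ptr;         /* connection pointer */

/* sketch of the original design: one multi-map guarded by one global mutex */
static std::unordered_multimap<key_type, conn_ptr> conn_pool;
static pthread_mutex_t pool_mutex = PTHREAD_MUTEX_INITIALIZER;

conn_ptr old_get_conn(const key_type &key)
{
  conn_ptr conn = NULL;
  pthread_mutex_lock(&pool_mutex);        /* the whole pool is locked */
  auto it = conn_pool.find(key);
  if (it != conn_pool.end())
  {
    conn = it->second;
    conn_pool.erase(it);                  /* remove the entry and hand it out */
  }
  pthread_mutex_unlock(&pool_mutex);
  return conn;
}

void old_put_conn(const key_type &key, conn_ptr conn)
{
  pthread_mutex_lock(&pool_mutex);        /* again the whole pool is locked */
  conn_pool.emplace(key, conn);
  pthread_mutex_unlock(&pool_mutex);
}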
Optimization
We propose a new way to represent the connection pool that essentially splits the single mutex lock into multiple mutex locks to reduce locking time. In C++ pseudocode, it looks like this:
#include <pthread.h>
#include <stack>
#include <string>
#include <unordered_map>

typedef std::string key_type;   /* backend_name#version */
typedef void *conn_ptr;         /* connection pointer */

typedef struct {
  std::stack<conn_ptr> stk;     /* cached connection pointers for one backend */
  pthread_mutex_t mutex;        /* guards push/pop on this stack */
} mutex_stack;                  /* a stack with a mutex lock */

typedef struct {
  pthread_rwlock_t rwlock;      /* guards the hash map itself */
  std::unordered_map<key_type, mutex_stack> hashmap;
} spider_connection_pool;       /* a hashmap with a rwlock */
The key of the hash map remains unchanged (remote_name#version), whereas the hash value is now a stack that stores connection pointers. In this design the hash map is not a multi-map but a unique hash map: each hash key corresponds to exactly one value.
The hash map is guarded by a read-write lock, and each stack is guarded by a mutex lock.
For instance, if we have four remote backends, the connection pool contains four items, and each remote backend corresponds to one mutex stack.
The stack stores connection pointers. To put a connection pointer back, we push it onto the corresponding stack; to get a connection pointer, we pop it from the corresponding stack. The push and pop operations are wrapped by mutex locking and unlocking.
Finding the 'corresponding' stack works as follows: first acquire the read lock, then search the hash map by key, and finally release the read lock. If the key does not exist in the hash map, we initialize a new mutex stack, construct a hash item <key, mutex_stack>, acquire the write lock, and try to insert it into the hash map. Note that many threads may insert hash items with the same key; since the hash map is a unique map, the insertion can fail, and in that case we simply re-search the hash map.
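Here is a sketch of this lookup-or-insert step, built directly on the structures above. The function name get_or_create_stack and the use of std::unordered_map::emplace are our own illustration: emplace on a unique map simply fails for the losing threads and hands back the existing entry, which plays the role of the re-search described above.

/* Sketch only: find the mutex_stack for `key`, creating it on first use.
   get_or_create_stack is a hypothetical name, not a Spider symbol. */
mutex_stack *get_or_create_stack(spider_connection_pool *pool, const key_type &key)
{
  /* fast path: read lock, search, release */
  pthread_rwlock_rdlock(&pool->rwlock);
  auto it = pool->hashmap.find(key);
  if (it != pool->hashmap.end())
  {
    mutex_stack *found = &it->second;     /* element references stay valid after unlock */
    pthread_rwlock_unlock(&pool->rwlock);
    return found;
  }
  pthread_rwlock_unlock(&pool->rwlock);

  /* slow path: the key is new, so insert an empty mutex_stack under the write lock.
     Several threads may race here; only one emplace succeeds, and the others get
     the already-inserted element back. */
  pthread_rwlock_wrlock(&pool->rwlock);
  auto res = pool->hashmap.emplace(key, mutex_stack());
  mutex_stack *stk = &res.first->second;
  if (res.second)                          /* we created it: initialize its mutex */
    pthread_mutex_init(&stk->mutex, NULL);
  pthread_rwlock_unlock(&pool->rwlock);
  return stk;
}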
We define two essential functions: put_conn (put a connection pointer back into the pool) and get_conn (get a connection pointer for a given key).
The write lock is acquired only in put_conn. If get_conn does not find a corresponding hash item, it can directly return 'not found', and the caller creates a new connection instead of taking one from the connection pool.
Unless a remote backend fails, we only need to acquire the write lock n times in total (assuming we have n remote backends). Therefore, at run time nearly 100 percent of the operations on the hash map are reads, which are effectively non-blocking. Compared with the original design, we also split one big mutex lock into n mutex locks, one per stack, which eliminates most of the contention.
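Under the same assumptions, get_conn and put_conn then reduce to a locked pop and push on the per-backend stack. This sketch reuses the hypothetical get_or_create_stack from above and is not the actual Spider implementation.

/* Sketch of the two essential operations. */
conn_ptr get_conn(spider_connection_pool *pool, const key_type &key)
{
  pthread_rwlock_rdlock(&pool->rwlock);    /* read lock only: readers never block each other */
  auto it = pool->hashmap.find(key);
  if (it == pool->hashmap.end())
  {
    pthread_rwlock_unlock(&pool->rwlock);
    return NULL;                           /* not found: the caller creates a new connection */
  }
  mutex_stack *stk = &it->second;
  pthread_rwlock_unlock(&pool->rwlock);

  conn_ptr conn = NULL;
  pthread_mutex_lock(&stk->mutex);         /* only this backend's stack is locked */
  if (!stk->stk.empty())
  {
    conn = stk->stk.top();
    stk->stk.pop();
  }
  pthread_mutex_unlock(&stk->mutex);
  return conn;                             /* may still be NULL if the stack was empty */
}

void put_conn(spider_connection_pool *pool, const key_type &key, conn_ptr conn)
{
  /* the only path that may take the write lock (first use of a key) */
  mutex_stack *stk = get_or_create_stack(pool, key);
  pthread_mutex_lock(&stk->mutex);
  stk->stk.push(conn);
  pthread_mutex_unlock(&stk->mutex);
}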
Performance
We test the performance of point selects on a topology of 1 Spider node with 16 remote backends. The Spider machine has a 12-core Intel Xeon Cascade Lake 8255C (2.5 GHz) and 15 GB of RAM. All other physical resources are fast and large enough that they cannot be the bottleneck.
The target table sbtest1 holds 10 GB of data and has the following structure.
CREATE TABLE `sbtest1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT 0,
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=SPIDER DEFAULT CHARSET=utf8
We perform point selects using the Sysbench benchmark tool.
SELECT * FROM `sbtest1` WHERE id = ?;
Test 1
We use the Linux perf profiling tool to trace the share of CPU time spent in the concurrency-related functions of the connection pool. The results are as follows.
|        | pthread_mutex_lock | pthread_getspecific | pthread_rwlock_lock | Total     |
|--------|--------------------|---------------------|---------------------|-----------|
| Before | 0.355996%          | 0.060372%           | N/A                 | 0.416368% |
| After  | 0.138516%          | 0.07392%            | 0.00003%            | 0.212466% |
If no remote backend fails, all operations on the RWLock are reads, so the RWLock is almost never blocking. Although we spend more time in pthread_getspecific (because we have more mutex locks than before), the considerable reduction in pthread_mutex_lock time lowers the total CPU share of these functions to 51.03% of the original, which is a significant improvement.
Test 2
We keep the number of Sysbench client threads fixed at 300 and run point-select stress tests with different values of the spider_max_connection variable.
spider_max_connection specifies the maximum number of Spider connections to a single remote backend, i.e. the maximum capacity per remote backend in the connection pool.
QPS tests:
| spider_max_connection | Before    | After     | After / Before |
|-----------------------|-----------|-----------|----------------|
| 100                   | 124912.36 | 125323.45 | 100.32%        |
| 50                    | 124804.67 | 125407.97 | 100.48%        |
| 20                    | 124437.48 | 125411.6  | 100.78%        |
| 10                    | 98681.96  | 117253.38 | 118.81%        |
| 5                     | 83200.37  | 98597.90  | 118.50%        |
It can be seen from the above test that when the number of client threads exceeds the total capacity of the connection pool (16 remote backends × spider_max_connection = 10 gives a capacity of 160 < 300 client threads), the performance improvement of the optimized design is significant.
Test 3
We keep spider_max_connection fixed at 100, run stress tests with 50, 100, 200, and 300 Sysbench client threads, and use the Sysbench summary to record the QPS and p99 latency.
p99 latency tests:
| Client threads | Before    | After     | After / Before |
|----------------|-----------|-----------|----------------|
| 50 threads     | 1.04      | 1.04      | 100%           |
| 100 threads    | 1.96      | 1.96      | 100%           |
| 200 threads    | 8.13      | 7.84      | 96.43%         |
| 300 threads    | 124644.19 | 125723.51 | 89.76%         |
It can be seen from the results that the greater the number of threads, the more pronounced the reduction in p99 latency, indicating that the overall query time is more stable.
QPS tests:
| Client threads | Before    | After     | After / Before |
|----------------|-----------|-----------|----------------|
| 50 threads     | 57168.74  | 57670.93  | 100.88%        |
| 100 threads    | 99769.4   | 99937.68  | 100.17%        |
| 200 threads    | 122404.09 | 123153.73 | 100.61%        |
| 300 threads    | 124644.19 | 125723.51 | 100.87%        |
The overall QPS is slightly higher than with the original design when the connection pool capacity is large enough.