|
Can you upload the configuration you use? Which connector do you use and do you have a way to reproduce the problem on a test system?
|
|
I uploaded the config.
I was not successful in reproducing this on our test system, but that system does not get the same traffic load as the main one.
|
|
Hi,
I have the exact same issue. For me it only occurs when the load (number of queries) is higher and when I use persistpoolmax=(number > 0); without persistence I get no errors.
I just started using MaxScale (directly with v2.4.5), but downgrading to 2.4.4 did not help in my case. I have a Galera cluster with 4 "10.3.18-MariaDB-0+deb10u1" nodes and a Splitter-Service on MaxScale.
My test-case webpage runs the queries below (don't ask me why, it's not my page). The queries themselves are probably not important, but they give an idea of the number of queries:
SET NAMES utf8
|
SET NAMES utf8
|
SELECT value FROM sessions WHERE sesskey = 'xxxx' AND expiry + sess_lifetime > 1579096703
|
SELECT count(*) as aantal FROM not_the_sessions_table WHERE a = '' AND b = ''
|
REPLACE INTO sessions VALUES ('xxxx', 1579096703, 'some string', 1440)
|
There are 4 clients (servers) connecting to maxscale.
Using a simple load generator fetching that webpage, I see no errors when doing 400 requests in serial, but when I start increasing the number of parallel requests (starting with about 10), and thus parallel queries, I see the errors. The more requests in parallel, the more errors I get:
2020-01-15 15:20:23 error : (15168) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-3' when no response was expected. Command: 0x00 Query: <not available>
|
2020-01-15 15:20:24 error : (15607) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-4' when no response was expected. Command: 0x00 Query: <not available>
|
2020-01-15 15:20:24 error : (15607) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-3' when no response was expected. Command: 0x00 Query: <not available>
|
2020-01-15 15:20:24 error : (15660) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-3' when no response was expected. Command: 0x00 Query: <not available>
|
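(The post doesn't name the load generator, so the following ApacheBench-style invocations are illustrative only, with a placeholder URL; they reproduce the serial and parallel cases described above.)
ab -n 400 -c 1 http://webserver/testpage.php    # 400 requests in serial: no errors
ab -n 400 -c 50 http://webserver/testpage.php   # 50 in parallel: errors appear
|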
The errors only come from the read-nodes, never from the readwrite master.
When I add a query cache on MaxScale, that starts complaining as well:
2020-01-15 15:25:39 error : (1280) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-4' when no response was expected. Command: 0x00 Query: <not available>
|
2020-01-15 15:25:39 warning: (1280) [cache] Received data from the backend although filter is expecting nothing. Packet size is 105 bytes long.
|
2020-01-15 15:25:39 error : (1434) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
|
2020-01-15 15:25:39 warning: (1434) [cache] Received data from the backend although filter is expecting nothing. Packet size is 105 bytes long.
|
etc
|
400 requests with 50 in parallel take 9.5 seconds; at 5 queries per request that is 2000 queries, or roughly 210 queries/sec. But I suspect the problem might not be the number of queries/sec but the fact that multiple clients are requesting the same things from multiple hosts...
I've been trying to replicate this issue more 'scientifically', but no reproducible results yet.
Hope this helps,
Alex.
|
|
Maikel Punie and qwark, can you add session_trace=200 under the [maxscale] section? This should cause the last 200 lines of the more verbose log to be dumped when that problem appears.
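A minimal sketch of where that parameter goes, assuming an otherwise unchanged global section (threads=auto is just a stand-in for whatever is already there):
[maxscale]
threads=auto
session_trace=200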
|
|
Hi Markus,
I have logs, but I ran this briefly on our prod server and it's a lot, almost a million lines. It looks like it wasn't only logging the unexpected-state errors with added detail, because there were only 5 'Unexpected internal state' errors among the million lines.
Perhaps you can tell me what you're looking for, so I can tune down the volume a bit?
For starters, here are the 10 lines that actually have 'Unexpected internal state' in them:
Jan 16 15:29:19 galera1-1 maxscale[10240]: (12884) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
|
Jan 16 15:29:19 galera1-1 maxscale[10240]: (12884) Session log for session (12884): #0122020-01-16 15:29:19 error : (12884) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>#0122020-01-16 15:29:19 info : (12884) Started Splitter-Service client session [12884] for 'user1' from ::ffff:10.1.1.105#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Selected Slave: galera1-3#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Connected to 'galera1-3'#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Selected Slave: galera1-2#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Connected to 'galera1-2'#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Selected Slave: galera1-4#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Connected to 'galera1-4'#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Selected Master: galera1-1#0122020-01-16 15:29:19 info : (12884) [readwritesplit] Connected to 'galera1-1'#012
|
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13652) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
|
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13652) Session log for session (13652): #0122020-01-16 15:29:21 error : (13652) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>#0122020-01-16 15:29:21 info : (13652) Started Splitter-Service client session [13652] for 'user1' from ::ffff:10.1.1.106#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Selected Slave: galera1-4#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Connected to 'galera1-4'#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Selected Slave: galera1-3#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Connected to 'galera1-3'#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Selected Slave: galera1-2#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Connected to 'galera1-2'#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Selected Master: galera1-1#0122020-01-16 15:29:21 info : (13652) [readwritesplit] Connected to 'galera1-1'#012
|
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13682) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
|
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13682) Session log for session (13682): #0122020-01-16 15:29:21 error : (13682) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>#0122020-01-16 15:29:21 info : (13682) Started Splitter-Service client session [13682] for 'user1' from ::ffff:10.1.1.105#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Selected Slave: galera1-4#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Connected to 'galera1-4'#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Selected Slave: galera1-3#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Connected to 'galera1-3'#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Selected Slave: galera1-2#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Connected to 'galera1-2'#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Selected Master: galera1-1#0122020-01-16 15:29:21 info : (13682) [readwritesplit] Connected to 'galera1-1'#012
|
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13908) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
|
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13908) Session log for session (13908): #0122020-01-16 15:29:21 error : (13908) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>#0122020-01-16 15:29:21 info : (13908) Started Splitter-Service client session [13908] for 'user1' from ::ffff:10.1.1.106#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Selected Slave: galera1-3#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Connected to 'galera1-3'#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Selected Slave: galera1-2#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Connected to 'galera1-2'#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Selected Slave: galera1-4#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Connected to 'galera1-4'#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Selected Master: galera1-1#0122020-01-16 15:29:21 info : (13908) [readwritesplit] Connected to 'galera1-1'#012
|
Jan 16 15:29:27 galera1-1 maxscale[10240]: (17281) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
|
Jan 16 15:29:27 galera1-1 maxscale[10240]: (17281) Session log for session (17281): #0122020-01-16 15:29:27 error : (17281) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>#0122020-01-16 15:29:27 info : (17281) Started Splitter-Service client session [17281] for 'user1' from ::ffff:10.1.1.106#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Selected Slave: galera1-4#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Connected to 'galera1-4'#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Selected Slave: galera1-3#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Connected to 'galera1-3'#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Selected Slave: galera1-2#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Connected to 'galera1-2'#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Selected Master: galera1-1#0122020-01-16 15:29:27 info : (17281) [readwritesplit] Connected to 'galera1-1'#012
|
We too have proxy_protocol=on, but I tried it both on and off and it had no effect on this problem.
For this test I only enabled persistence on the galera1-2 server, like so:
persistpoolmax=200
persistmaxtime=3600
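(For context, those two lines belong in the server definition; a sketch of the full section, with placeholder address/port since they weren't posted:)
[galera1-2]
type=server
address=10.1.1.2
port=3306
protocol=MariaDBBackend
persistpoolmax=200
persistmaxtime=3600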
Alex.
|
|
Looks like the problem happened right when a new session was created. Perhaps this is somehow related to how the persistent connections are initialized?
|
|
There were loads more log lines; this was just the result of a grep for 'Unexpected internal' in the log. I can provide more info if needed. Let me know what to grep for or where I can upload the full log file.
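(Illustrative grep invocations for pulling more surrounding context, assuming the default log location; adjust the path if MaxScale logs elsewhere:)
grep -n 'Unexpected internal' /var/log/maxscale/maxscale.log
grep -n -B 20 'Unexpected internal' /var/log/maxscale/maxscale.log   # with 20 lines of leading context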
|
|
Maikel Punie have you done any runtime configuration changes that would enable the persistent connection pool?
|
|
This is the only runtime config that we have:
[ReadWriteSVC]
type=service
router=readwritesplit
user=maxscale
password=OdLINynAg5RpwoEV
enable_root_user=1
max_retry_interval=3600s
max_connections=0
connection_timeout=0
net_write_timeout=0
auth_all_servers=false
strip_db_esc=true
localhost_match_wildcard_host=true
log_auth_warnings=true
retry_on_failure=true
session_track_trx_state=false
retain_last_statements=-1
session_trace=false
use_sql_variables_in=all
slave_selection_criteria=LEAST_CURRENT_OPERATIONS
master_failure_mode=fail_on_write
max_slave_replication_lag=30
max_slave_connections=90%
retry_failed_reads=true
prune_sescmd_history=false
disable_sescmd_history=false
max_sescmd_history=500
strict_multi_stmt=false
strict_sp_calls=false
master_accept_reads=true
connection_keepalive=30
causal_reads=true
causal_reads_timeout=60
master_reconnection=true
delayed_retry=false
delayed_retry_timeout=10s
transaction_replay=false
transaction_replay_max_size=1Mi
transaction_replay_attempts=5
optimistic_trx=false
lazy_connect=true
servers=einstein,albert,hans,eduard,maric,brain
|
|
|
Do you have the possibility to test if disabling some of those feature (causal_reads, lazy_connect, master_reconnection, connection_keepalive) solves the problem?
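For reference, a sketch of the corresponding service-section overrides (assumption: connection_keepalive=0 disables the keepalive ping, while the others take boolean values):
causal_reads=false
lazy_connect=false
master_reconnection=false
connection_keepalive=0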
|
|
For what it's worth, I have the problem with these (mostly default) settings on the Splitter-Service (sorry, the formatting is not cooperating):
{
"router_options": null,
"user": "maxscale",
"password": "XXXX",
"enable_root_user": false,
"max_retry_interval": 3600,
"max_connections": 0,
"connection_timeout": 0,
"net_write_timeout": 0,
"auth_all_servers": false,
"strip_db_esc": true,
"localhost_match_wildcard_host": true,
"version_string": null,
"weightby": null,
"log_auth_warnings": true,
"retry_on_failure": true,
"session_track_trx_state": true,
"retain_last_statements": -1,
"session_trace": false,
"cluster": null,
"use_sql_variables_in": "all",
"slave_selection_criteria": "LEAST_CURRENT_OPERATIONS",
"master_failure_mode": "fail_instantly",
"max_slave_replication_lag": 0,
"max_slave_connections": "255",
"retry_failed_reads": true,
"prune_sescmd_history": false,
"disable_sescmd_history": true,
"max_sescmd_history": 50,
"strict_multi_stmt": false,
"strict_sp_calls": false,
"master_accept_reads": false,
"connection_keepalive": 300,
"causal_reads": false,
"causal_reads_timeout": 10,
"master_reconnection": false,
"delayed_retry": false,
"delayed_retry_timeout": 10,
"transaction_replay": false,
"transaction_replay_max_size": "1Mi",
"transaction_replay_attempts": 5,
"optimistic_trx": false,
"lazy_connect": false
}
|
|
I can only test this during a maintenance window (as I do not see this on the dev machines),
but of course the dev machines get a lot less traffic than the main system.
|
|
Given that there have been no functional changes to readwritesplit between 2.4.4 and 2.4.5, I suspect this commit might have caused unforeseen problems. One theory is that a ping sent by readwritesplit is not handled correctly by the protocol layer, which causes an extra OK packet to be returned. If this theory is at all true, then the removal of connection_keepalive should prevent it.
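(A sketch of the protocol detail behind this theory: the COM_PING opcode and the OK header byte are MySQL protocol facts, the failure sequence itself is the hypothesis.)
client idle  ->  readwritesplit sends COM_PING (0x0e) to the backend
backend      ->  replies with an OK packet whose header byte is 0x00
If the router has already stopped expecting that reply, the OK arrives "when no
response was expected", matching the logged "received response 0x00".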
|
|
After modifying the code to repeatedly send the connection keepalive packets I was able to reproduce the problem. This means that disabling connection_keepalive should prevent the problem.
The interesting thing is that reverting that change didn't appear to solve the problem, but I am not certain of this: further testing is needed. An alternate implementation of ignorable packets, where the result state is explicitly known, seems to provide better stability. I wasn't able to reproduce the problem when the implementation was moved inside readwritesplit. Since this is not a trivial change, more testing is needed to gauge whether moving the keepalive ping tracking into readwritesplit is the way to go.
|
|
The current code used the time of the last network read as the only input to the idle time calculation. Instead, the latest network read and write should both be used to measure the total idle time. This should prevent cases where multiple keepalive pings are sent to the backend before a reply is read.
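A minimal sketch of the corrected calculation, assuming hypothetical last_read/last_write timestamps on the backend connection (the names are illustrative, not MaxScale's actual internals):
#include <algorithm>
#include <chrono>

using Clock = std::chrono::steady_clock;

struct BackendConn
{
    Clock::time_point last_read;   // updated on every network read
    Clock::time_point last_write;  // updated on every network write
};

// Measure idleness from the most recent read *or* write. Using only
// last_read lets a connection look idle while a ping is still in flight,
// so a second ping can be sent before the first reply is read.
Clock::duration idle_time(const BackendConn& conn)
{
    return Clock::now() - std::max(conn.last_read, conn.last_write);
}

// Send a keepalive ping only when the total idle time reaches the
// configured connection_keepalive interval.
bool needs_keepalive(const BackendConn& conn, Clock::duration keepalive)
{
    return idle_time(conn) >= keepalive;
}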
|