[MXS-2822] Unexpected internal state: received response
Created: 2020-01-04  Updated: 2020-03-13  Resolved: 2020-02-12

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 2.4.5
Fix Version/s: 2.4.7

Type: Bug Priority: Blocker
Reporter: Maikel Punie Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

centos


Attachments: File maxscale.cnf    
Sprint: MXS-SPRINT-98, MXS-SPRINT-99

 Description   

After upgrading MaxScale from 2.4.4 to 2.4.5 we get a lot of errors from the readwritesplit router:

2020-01-04 11:47:37 error : (6833251) [readwritesplit] Unexpected internal state: received response 0x00 from server 'hans' when no response was expected. Command: 0x03 Query: <not available>

When we downgrade again, all of these errors disappear.

The configuration is exactly the same, and so are the servers.
All servers except the master have proxy_protocol enabled.
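
For reference, proxy_protocol is enabled per server in maxscale.cnf; a minimal sketch only (the address is a placeholder, 'hans' is the server named in the error above, and this is not the attached config):

```ini
# Sketch only, not the attached maxscale.cnf.
# proxy_protocol is a per-server parameter; the address is a placeholder.
[hans]
type=server
address=192.0.2.10
port=3306
proxy_protocol=true
```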



 Comments   
Comment by markus makela [ 2020-01-15 ]

Can you upload the configuration you use? Which connector do you use and do you have a way to reproduce the problem on a test system?

Comment by Maikel Punie [ 2020-01-15 ]

I uploaded the config.

I was not successful in reproducing this on our test system, but that system does not get the same traffic load as the main one.

Comment by Qwark [ 2020-01-15 ]

Hi,

I have the exact same issue. For me it only occurs when the load (number of queries) is higher and when I use persistpoolmax=(number > 0); without persistence I get no errors.

I just started using MaxScale (I started directly with v2.4.5), but downgrading to 2.4.4 did not help in my case. I have a Galera cluster with four "10.3.18-MariaDB-0+deb10u1" nodes and a Splitter-Service on MaxScale.

My test-case webpage runs the queries below (don't ask me why, it's not my page). While the queries themselves are probably not important, they give an idea of the number of queries:

SET NAMES utf8
SET NAMES utf8
SELECT value FROM sessions WHERE sesskey = 'xxxx' AND expiry + sess_lifetime > 1579096703
SELECT count(*) as aantal  FROM not_the_sessions_table WHERE a  = '' AND b = ''
REPLACE INTO sessions VALUES ('xxxx', 1579096703, 'some string', 1440)

There are 4 clients (servers) connecting to MaxScale.

Using a simple load generator fetching that webpage, I see no errors when doing 400 requests serially, but when I start increasing the number of parallel requests (and thus queries), starting with about 10, I see the errors. The more requests in parallel, the more errors I get:

2020-01-15 15:20:23   error  : (15168) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-3' when no response was expected. Command: 0x00 Query: <not available>
2020-01-15 15:20:24   error  : (15607) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-4' when no response was expected. Command: 0x00 Query: <not available>
2020-01-15 15:20:24   error  : (15607) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-3' when no response was expected. Command: 0x00 Query: <not available>
2020-01-15 15:20:24   error  : (15660) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-3' when no response was expected. Command: 0x00 Query: <not available>

The errors only ever come from the read nodes, never from the read-write master.
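
The load test described above can be sketched as follows. This is a minimal, self-contained sketch: fetch() is a stub standing in for the real HTTP GET against the test webpage, and the URL and function names are illustrative, not from this ticket:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stub: the real test issues an HTTP GET that triggers the ~5 queries
    # shown above; here we just return a fake status code so the sketch runs.
    return 200

def run_load(url, total=400, parallel=50):
    # Fire `total` requests with at most `parallel` in flight at once.
    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(fetch, [url] * total))

statuses = run_load("http://maxscale-host/testpage", total=400, parallel=50)
```

Increasing `parallel` from 1 toward 50 mirrors the ramp-up the reporter describes.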

When I add a query cache on MaxScale, that starts complaining as well:

2020-01-15 15:25:39   error  : (1280) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-4' when no response was expected. Command: 0x00 Query: <not available>
2020-01-15 15:25:39   warning: (1280) [cache] Received data from the backend although filter is expecting nothing. Packet size is 105 bytes long.
2020-01-15 15:25:39   error  : (1434) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
2020-01-15 15:25:39   warning: (1434) [cache] Received data from the backend although filter is expecting nothing. Packet size is 105 bytes long.
etc

400 requests with 50 in parallel takes 9.5 seconds. At 5 queries per request, that gives ~200 queries/sec. But I suspect the problem is not the number of queries per second but rather that multiple clients are requesting the same data from multiple hosts...
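
A quick sanity check of those numbers:

```python
# 400 requests completed in 9.5 seconds, at 5 queries per request.
total_requests = 400
elapsed_seconds = 9.5
queries_per_request = 5

qps = total_requests / elapsed_seconds * queries_per_request
# qps is roughly 210 queries/sec, matching the "~200 queries/sec" estimate.
```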

I've been trying to replicate this issue more scientifically, but I have no reproducible results yet.

Hope this helps,
Alex.

Comment by markus makela [ 2020-01-16 ]

Maikel Punie and Qwark, can you add session_trace=200 under the [maxscale] section? This should cause the last 200 lines of the more verbose log to be dumped when the problem appears.
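
For reference, a minimal sketch of where that setting goes in maxscale.cnf:

```ini
# The global [maxscale] section; 200 is the number of trailing verbose
# log lines to dump for a session when an error occurs in it.
[maxscale]
session_trace=200
```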

Comment by Qwark [ 2020-01-16 ]

Hi Markus,

I have logs, but I ran this briefly on our prod server and it's a lot: almost a million lines. It looks like it wasn't only logging the unexpected-state errors with added detail, because there were only 5 'Unexpected internal state' errors among the million lines.

Perhaps you can tell me what you're looking for, so I can tune down the volume a bit?

For starters, here are the 10 lines that actually contain 'Unexpected internal state'.

Jan 16 15:29:19 galera1-1 maxscale[10240]: (12884) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
Jan 16 15:29:19 galera1-1 maxscale[10240]: (12884) Session log for session (12884):
2020-01-16 15:29:19   error  : (12884) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
2020-01-16 15:29:19   info   : (12884) Started Splitter-Service client session [12884] for 'user1' from ::ffff:10.1.1.105
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Selected Slave: galera1-3
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Connected to 'galera1-3'
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Selected Slave: galera1-2
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Connected to 'galera1-2'
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Selected Slave: galera1-4
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Connected to 'galera1-4'
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Selected Master: galera1-1
2020-01-16 15:29:19   info   : (12884) [readwritesplit] Connected to 'galera1-1'

Jan 16 15:29:21 galera1-1 maxscale[10240]: (13652) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13652) Session log for session (13652):
2020-01-16 15:29:21   error  : (13652) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
2020-01-16 15:29:21   info   : (13652) Started Splitter-Service client session [13652] for 'user1' from ::ffff:10.1.1.106
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Selected Slave: galera1-4
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Connected to 'galera1-4'
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Selected Slave: galera1-3
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Connected to 'galera1-3'
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Selected Slave: galera1-2
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Connected to 'galera1-2'
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Selected Master: galera1-1
2020-01-16 15:29:21   info   : (13652) [readwritesplit] Connected to 'galera1-1'

Jan 16 15:29:21 galera1-1 maxscale[10240]: (13682) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13682) Session log for session (13682):
2020-01-16 15:29:21   error  : (13682) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
2020-01-16 15:29:21   info   : (13682) Started Splitter-Service client session [13682] for 'user1' from ::ffff:10.1.1.105
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Selected Slave: galera1-4
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Connected to 'galera1-4'
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Selected Slave: galera1-3
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Connected to 'galera1-3'
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Selected Slave: galera1-2
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Connected to 'galera1-2'
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Selected Master: galera1-1
2020-01-16 15:29:21   info   : (13682) [readwritesplit] Connected to 'galera1-1'

Jan 16 15:29:21 galera1-1 maxscale[10240]: (13908) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
Jan 16 15:29:21 galera1-1 maxscale[10240]: (13908) Session log for session (13908):
2020-01-16 15:29:21   error  : (13908) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
2020-01-16 15:29:21   info   : (13908) Started Splitter-Service client session [13908] for 'user1' from ::ffff:10.1.1.106
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Selected Slave: galera1-3
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Connected to 'galera1-3'
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Selected Slave: galera1-2
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Connected to 'galera1-2'
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Selected Slave: galera1-4
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Connected to 'galera1-4'
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Selected Master: galera1-1
2020-01-16 15:29:21   info   : (13908) [readwritesplit] Connected to 'galera1-1'

Jan 16 15:29:27 galera1-1 maxscale[10240]: (17281) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
Jan 16 15:29:27 galera1-1 maxscale[10240]: (17281) Session log for session (17281):
2020-01-16 15:29:27   error  : (17281) [readwritesplit] Unexpected internal state: received response 0x00 from server 'galera1-2' when no response was expected. Command: 0x00 Query: <not available>
2020-01-16 15:29:27   info   : (17281) Started Splitter-Service client session [17281] for 'user1' from ::ffff:10.1.1.106
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Selected Slave: galera1-4
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Connected to 'galera1-4'
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Selected Slave: galera1-3
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Connected to 'galera1-3'
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Selected Slave: galera1-2
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Connected to 'galera1-2'
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Selected Master: galera1-1
2020-01-16 15:29:27   info   : (17281) [readwritesplit] Connected to 'galera1-1'

We too have proxy_protocol=on, but I tried it both on and off and it has no effect on this problem.
For this test I only enabled persistence on the galera1-2 server, like so:
persistpoolmax=200
persistmaxtime=3600

Alex.

Comment by markus makela [ 2020-01-20 ]

It looks like the problem happened right when a new session was created. Perhaps this is somehow related to how persistent connections are initialized?

Comment by Qwark [ 2020-01-23 ]

There were loads more log lines; this was just the result of grepping for 'Unexpected internal' in the log. I can provide more info if needed. Let me know what to grep for, or where I can upload the full log file.

Comment by markus makela [ 2020-02-06 ]

Maikel Punie, have you made any runtime configuration changes that would enable the persistent connection pool?

Comment by Maikel Punie [ 2020-02-06 ]

This is the only runtime config that we have:

[ReadWriteSVC]
type=service
router=readwritesplit
user=maxscale
password=OdLINynAg5RpwoEV
enable_root_user=1
max_retry_interval=3600s
max_connections=0
connection_timeout=0
net_write_timeout=0
auth_all_servers=false
strip_db_esc=true
localhost_match_wildcard_host=true
log_auth_warnings=true
retry_on_failure=true
session_track_trx_state=false
retain_last_statements=-1
session_trace=false
use_sql_variables_in=all
slave_selection_criteria=LEAST_CURRENT_OPERATIONS
master_failure_mode=fail_on_write
max_slave_replication_lag=30
max_slave_connections=90%
retry_failed_reads=true
prune_sescmd_history=false
disable_sescmd_history=false
max_sescmd_history=500
strict_multi_stmt=false
strict_sp_calls=false
master_accept_reads=true
connection_keepalive=30
causal_reads=true
causal_reads_timeout=60
master_reconnection=true
delayed_retry=false
delayed_retry_timeout=10s
transaction_replay=false
transaction_replay_max_size=1Mi
transaction_replay_attempts=5
optimistic_trx=false
lazy_connect=true
servers=einstein,albert,hans,eduard,maric,brain

Comment by markus makela [ 2020-02-10 ]

Could you test whether disabling some of those features (causal_reads, lazy_connect, master_reconnection, connection_keepalive) solves the problem?
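
A sketch of how those settings could be toggled in the [ReadWriteSVC] service section from the config above (setting connection_keepalive to 0 is assumed here to disable the keepalive pings):

```ini
# Candidate features to disable one at a time when testing.
causal_reads=false
lazy_connect=false
master_reconnection=false
connection_keepalive=0
```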

Comment by Qwark [ 2020-02-10 ]

For what it's worth, I have the problem with these (mostly default) settings on the Splitter-Service (sorry, the formatting is not cooperating):

{
    "router_options": null,
    "user": "maxscale",
    "password": "XXXX",
    "enable_root_user": false,
    "max_retry_interval": 3600,
    "max_connections": 0,
    "connection_timeout": 0,
    "net_write_timeout": 0,
    "auth_all_servers": false,
    "strip_db_esc": true,
    "localhost_match_wildcard_host": true,
    "version_string": null,
    "weightby": null,
    "log_auth_warnings": true,
    "retry_on_failure": true,
    "session_track_trx_state": true,
    "retain_last_statements": -1,
    "session_trace": false,
    "cluster": null,
    "use_sql_variables_in": "all",
    "slave_selection_criteria": "LEAST_CURRENT_OPERATIONS",
    "master_failure_mode": "fail_instantly",
    "max_slave_replication_lag": 0,
    "max_slave_connections": "255",
    "retry_failed_reads": true,
    "prune_sescmd_history": false,
    "disable_sescmd_history": true,
    "max_sescmd_history": 50,
    "strict_multi_stmt": false,
    "strict_sp_calls": false,
    "master_accept_reads": false,
    "connection_keepalive": 300,
    "causal_reads": false,
    "causal_reads_timeout": 10,
    "master_reconnection": false,
    "delayed_retry": false,
    "delayed_retry_timeout": 10,
    "transaction_replay": false,
    "transaction_replay_max_size": "1Mi",
    "transaction_replay_attempts": 5,
    "optimistic_trx": false,
    "lazy_connect": false
}

Comment by Maikel Punie [ 2020-02-11 ]

I can only test this during a maintenance window (as I do not see this on the dev machines),
but of course the dev machines get a lot less traffic than the main system.

Comment by markus makela [ 2020-02-11 ]

Given that there have been no functional changes to readwritesplit between 2.4.4 and 2.4.5, I suspect this commit might have caused unforeseen problems. One theory is that a ping sent by readwritesplit is not handled correctly by the protocol layer, which causes an extra OK packet to be returned. If this theory is true, then removing connection_keepalive should prevent it.

Comment by markus makela [ 2020-02-11 ]

After modifying the code to repeatedly send the connection keepalive packets, I was able to reproduce the problem. This means that disabling connection_keepalive should prevent it.

The interesting thing is that reverting that change did not appear to solve the problem, but I am not certain of this; further testing is needed. An alternative implementation of ignorable packets, where the result state is explicitly known, seems to provide better stability: I was not able to reproduce the problem when the implementation was moved inside readwritesplit. Since this is not a trivial change, more testing is needed to gauge whether moving the keepalive ping tracking into readwritesplit is the way to go.

Comment by markus makela [ 2020-02-12 ]

The current code used only the time of the last network read in the idle-time calculation. Instead, both the latest network read and the latest network write should be used to measure the total idle time. This should prevent cases where multiple keepalive pings are sent to the backend before a reply is read.
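
The described fix can be sketched as follows (illustrative only, not the actual MaxScale code; timestamps are arbitrary example values):

```python
def idle_time(now, last_read, last_write):
    # Measure idleness from the most recent activity in either direction,
    # not from the last read alone.
    return now - max(last_read, last_write)

# With only last_read considered, a connection that just sent a keepalive
# ping (a write) still looks idle and gets pinged again before the reply
# arrives; using both timestamps avoids that.
idle = idle_time(now=100.0, last_read=40.0, last_write=95.0)  # 5.0, not 60.0
```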

Generated at Thu Feb 08 04:16:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.