[MXS-3759] Client hangs forever when server failed or restarted Created: 2021-09-08  Updated: 2021-11-11  Resolved: 2021-09-10

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 6.1.1
Fix Version/s: 2.5.16, 6.1.2

Type: Bug Priority: Critical
Reporter: Serhii Prykhodko Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

maxscale version 6.1.1

server protocol - MariaDBBackend
monitor module - mariadbmon
service router - readwritesplit
retry_failed_reads = false
delayed_retry = false
transaction_replay = false

listener protocol - MariaDBClient

backend servers - mariadb in docker
Server version: 5.5.5-10.5.12-MariaDB-1:10.5.12+maria~focal-log mariadb.org binary distribution



 Description   

How to reproduce

1. run long running query:

MariaDB [zabbix]> select count(1) from history;

2. restart mariadb server query running on

service mysql-zabbix restart

Actual behaviour

Query hangs forever in client.

Expected behaviour

I think client should receive error (when retry_failed_reads=false) or maxscale retry query on live server (when retry_failed_reads=true). Is it correct or I missed something in documentation ?



 Comments   
Comment by markus makela [ 2021-09-08 ]

If you enable log_info in MaxScale, does it report that it noticed a broken connection on that server?

Comment by Serhii Prykhodko [ 2021-09-08 ]

logs after restarting server:

2021-09-08 09:16:01.018   info   : (1) [readwritesplit] (zabbix-rw-service); (clientReply): Reply complete from 'zabbix1'
2021-09-08 09:16:01.018   info   : (1) [readwritesplit] (zabbix-rw-service); (handleError): Slave 'zabbix1' failed: #HY000: Lost connection to backend server: connection closed by peer (zabbix1)
2021-09-08 09:16:03.065   error  : (log_connect_error): Monitor was unable to connect to server zabbix1[10.0.4.32:3306] : 'Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 11'
2021-09-08 09:16:03.066   notice : (log_state_change): Server changed state: zabbix1[10.0.4.32:3306]: slave_down. [Slave, Running] -> [Down]
2021-09-08 09:16:09.367   notice : (log_state_change): Server changed state: zabbix1[10.0.4.32:3306]: slave_up. [Down] -> [Slave, Running]

Comment by markus makela [ 2021-09-08 ]

Have you tested if this happens with 2.5?

Comment by markus makela [ 2021-09-08 ]

I think I've managed to reproduce this, I'll have to continue testing.

Comment by Serhii Prykhodko [ 2021-09-08 ]

logs from 2.5.14

it says (clientReply): Reply complete, last reply from zabbix2 but client still waiting for response

2021-09-08 09:59:36.222   info   : (124) (zabbix-rw-service) (log_transaction_status): > Autocommit: [enabled], trx is [not open], cmd: (0x03) COM_QUERY, plen: 33, type: QUERY_TYPE_READ, stmt: select count(1) from history
2021-09-08 09:59:36.223   info   : (124) [readwritesplit] (zabbix-rw-service) (handle_got_target): Route query to master: zabbix2 <
2021-09-08 09:59:49.165   info   : (124) [readwritesplit] (zabbix-rw-service) (clientReply): Reply complete, last reply from zabbix2
2021-09-08 09:59:49.166   info   : (124) [readwritesplit] (zabbix-rw-service) (handleError): Master 'zabbix2' failed: #HY000: Lost connection to backend server: connection closed by peer (zabbix2)
2021-09-08 09:59:50.193   error  : (log_connect_error): Monitor was unable to connect to server zabbix2[10.0.4.33:3306] : 'Can't connect to MySQL server on '10.0.4.33' (115)'
2021-09-08 09:59:50.195   notice : (log_state_change): Server changed state: zabbix2[10.0.4.33:3306]: master_down. [Master, Running] -> [Down]
2021-09-08 09:59:50.195   warning: [mariadbmon] (handle_auto_failover): Master has failed. If master does not return in 4 monitor tick(s), failover begins.
2021-09-08 09:59:54.395   warning: [mariadbmon] (update_master): The current master server 'zabbix2' is no longer valid because it is in read-only mode, but there is no valid alternative to swap to.
2021-09-08 09:59:54.395   notice : (log_state_change): Server changed state: zabbix2[10.0.4.33:3306]: slave_up. [Down] -> [Slave, Running]

Comment by markus makela [ 2021-09-08 ]

OK, this seems to be a regression in version 6 that is not present in 2.5 when retry_failed_reads=true. I suspect 2.5 will also fail if retry_failed_reads is disabled.

Comment by Serhii Prykhodko [ 2021-09-08 ]

I have same bug with retry_failed_reads=true on 2.5.14 and 6.1.1

Generated at Thu Feb 08 04:23:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.