Details
-
Type: Bug
-
Status: Closed
-
Priority: Major
-
Resolution: Incomplete
-
Affects Version/s: 23.08.5
-
Fix Version/s: None
-
MXS-CAPACITY-26-1
Description
-
- Summary
MaxScale logs spurious ERROR-level messages `"Lost connection to the primary server, closing session... idle for 0 seconds.
Error caused by: Read from backend failed"` when short-lived client sessions send COM_QUIT while one or more backend connections are still authenticating.
The error is caused by a race condition in the session teardown path and is not indicative of an actual backend failure.
The issue manifests when `causal_reads` is enabled (any mode other than `none`) because the extra `SET session_track_system_variables` query sent during `prepare_connection()` increases the authentication time window, making the race highly reproducible.
The issue produces dozens of ERROR-level messages per minute, polluting MaxScale logs and obscuring real issues.
-
- Steps to Reproduce
1. Configure MaxScale readwritesplit with a 3-node Galera cluster:
```
[AppRouter]
router=readwritesplit
causal_reads=fast
targets=Server1,Server2,Server3
```
2. Use a PHP/mysqlnd application that creates short-lived connections:
- Connect
- `SET NAMES utf8mb4`
- `SET SESSION sql_mode = 'STRICT_ALL_TABLES'`
- COM_QUIT (all within < 1 second)
3. Observe MaxScale error log.
-
- Expected Result
Session teardown after COM_QUIT should complete cleanly without error-level log messages. Backend connection closures caused by COM_QUIT are expected behavior and should not be reported as errors.
-
- Actual Result
MaxScale logs:
```
error: (121074) (AppRouter); Lost connection to the primary server, closing session.
Lost connection to primary server while connection was idle.
Connection from 'appuser'@'192.0.2.10' has been idle for 0 seconds.
Error caused by: Read from backend failed (node2, session=121074, conn_id=129219).
Last error:
```
The message is misleading:
- "idle for 0 seconds" — the connection was not idle, COM_QUIT was just sent
- "Read from backend failed" — the backend closed correctly in response to COM_QUIT
- "Lost connection to the primary server" — the connection was not lost, it was intentionally terminated
-
- Root Cause Analysis (from source code review of MaxScale 24.08)
-
- The race condition
-
The bug is a race between session teardown and backend epoll events, triggered by a backend still authenticating when COM_QUIT arrives.
*Sequence of events* (derived from a full session trace with `log_info=true`):
```
T+0ms Client connects, MaxScale opens connections to all 3 Galera nodes
T+0ms Node2 (primary) + Node1 — connected and authenticated
T+0ms Node3 — still in state 'Authenticating'
T+0ms Client sends COM_QUIT
T+0ms COM_QUIT routed to Node1 and Node2 (ROUTING state) → backends close TCP
T+0ms COM_QUIT stored for Node3 (still Authenticating)
T+0ms kill() called → ClientDCB added to zombie queue
T+0ms delete_zombies(): can_close_dcb() checks ALL backends
→ Node3 is Authenticating → can_close() returns FALSE
→ ClientDCB destruction DELAYED (put in slow_zombies)
T+0ms Next epoll iteration: Node1 and Node2 TCP close events fire
→ Session is still alive (not yet destroyed)
→ ready_for_reading() → normal_read() → read returns EOF
→ do_handle_error("Read from backend failed")
→ handleError → handle_master_error → master_exception thrown
→ Session::handleError logs ERROR
T+0ms Node3 authentication finally completes
T+0ms Session stopped
```
-
- Code path details
-
1. *`prepare_connection()`* (`rwsplit_route_stmt.cc:53-56`): With `causal_reads != NONE`, sends extra `SET @@session.session_track_system_variables` query with `IGNORE_RESPONSE` to every backend during connection setup. This adds work to the authentication phase.
2. *COM_QUIT routing* (`rwsplit_route_stmt.cc:469`): COM_QUIT is routed as a session write to all in-use backends. For backends still authenticating, the command is stored (`mariadb_backend.cc`: "Storing COM_QUIT while in state 'Authenticating'").
3. *`MXS_SESSION::kill()`* (`session.cc:114-142`): Called after COM_QUIT routing. Puts ClientDCB into zombie queue.
4. *`delete_zombies()`* (`routingworker.cc:956-992`): Before destroying the ClientDCB, checks all backend connections via `can_close_dcb()` → `MariaDBBackendConnection::can_close()` (`mariadb_backend.cc:1348-1352`). This function returns `false` for `State::AUTHENTICATING` (not in the allowed-states list). If any backend cannot close, the ClientDCB is put into `slow_zombies` and destruction is postponed.
5. *Backend close events* (`mariadb_backend.cc:692-704`): In the next epoll iteration, the backends that received COM_QUIT have closed their TCP connections. The epoll events fire, `ready_for_reading()` is called, `normal_read()` attempts to read from the closed socket, gets EOF, and calls `do_handle_error("Read from backend failed")`.
6. *Error escalation* (`rwsplitsession.cc:898-900, 1045-1047`): `handleError()` → `handle_master_error()` → with default `master_failure_mode=fail_instantly` → `master_exception()` is thrown → caught in `handleError()` (line 848) → `mxs::RouterSession::handleError()` → `Session::handleError()` (`session.cc:1456`) logs `MXB_ERROR`.
7. *Misleading idle time* (`rwsplitsession.cc:1154`): `m_last_write` was updated by the COM_QUIT write (via `RWBackend::write()` at `rwbackend.cc:34`), so `idle = now - m_last_write` = 0 seconds.
-
- Why causal_reads triggers it
-
Without `causal_reads`, the 3 backends authenticate faster (no extra `SET session_track_system_variables` query). By the time the fast PHP client sends COM_QUIT, all 3 backends have typically completed authentication and are in `State::ROUTING`. Then `can_close()` returns `true` for all → zombie deletion proceeds immediately → backend DCBs are closed before epoll events fire → no error.
With `causal_reads`, the extra `IGNORE_RESPONSE` query during `prepare_connection()` adds a round-trip to the authentication phase. The slowest backend (Node3) is still authenticating when COM_QUIT arrives. This blocks zombie deletion, allowing backend close events to fire while the session is still alive.