[MDEV-33207] Killing application cannot remove DB locked session Created: 2024-01-10  Updated: 2024-01-12

Status: Stalled
Project: MariaDB Server
Component/s: Locking, Server
Fix Version/s: None

Type: Task Priority: Major
Reporter: Chow King Tak Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: kill, locking


 Description   

When a DB session running an UPDATE on locked row(s), killing the corresponding application (e.g. MySQL client) does not kill/remove that session in DB. The session is gone when the innodb_lock_wait_timeout is reached.

Scenario:
1. Create a simple table and insert a row:
CREATE TABLE `t1` (
`idx` int(11) NOT NULL,
`name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`idx`)
) ;
INSERT INTO t1 VALUES (1, 'abc'); COMMIT;

2. Run MySQL client to create session A to lock the row:
> SET AUTOCOMMIT=OFF;
> UPDATE t1 SET name='sess A' WHERE idx=1;

3. Run another MySQL client to create sesssion B, update the locked row by session A:
> SET AUTOCOMMIT=ON;
> UPDATE t1 SET name='sess B' WHERE idx=1;

4. Run SHOW FULL PROCESSLIST on another MySQL client, it shows session B is running running the UPDATE statement (waiting for the row to be unlocked).

5. On the OS shell, run "kill -9 <Session B MySQL client PID>" to kill the MySQL client of session B.

6. Run SHOW FULL PROCESSLIST on another MySQL client again, it shows session B still remains in the DB, running the UPDATE statement. Session B is gone until innodb_lock_wait_timeout is reached.

7. If session A commited the UPDATE before session B's innodb_lock_wait_timeout is reached, the row will be updated by session B.

I would suppose that killing application can kill the session in DB but it doesn't in the above scenario. Please advise the above behavior is normal or not.

Thanks.



 Comments   
Comment by Daniel Black [ 2024-01-10 ]

When a connected process is killed, the mariadb server doesn't receive an indication that the socket session has been removed.

I tested this an in a strace

1039777 <... poll resumed>)             = 1 ([{fd=47, revents=POLLIN}])
1039777 recvfrom(47, "(\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4 
1039777 recvfrom(47, "\3UPDATE t1 SET name='sess B' WHERE idx=1", 40, MSG_DONTWAIT, NULL, NULL) = 40

No error on file descriptor 47 or the poll of it return any error after the process was killed.

so normal behaviour it is.

With tcp keep alives you should be able to detect this earlier - https://mariadb.com/kb/en/server-system-variables/#tcp_keepalive_interval

Comment by Vladislav Vaintroub [ 2024-01-10 ]

I think the suspected bug is that server thread is stuck in innodb lock, while executing UPDATE, and while it is stuck, it does not read or write, or check the status of the socket.
For example , a SLEEP() wait is implemented such that it wakes up every 5 seconds and checks a) whether connection was killed and b) THD::is_connected(), which ends up in vio_is_connected.
(s. Interruptible_wait::wait() in item_func.cc) . If network connection is broken, SLEEP() is interrupted.

Currently, only SLEEP() waits are aware of THD::is_connected, as far as I can see, Would be an improvement if other long-ish waits (table, user, row locks) were aware, too, so they do not hog resources, and disappear faster.

Comment by Daniel Black [ 2024-01-10 ]

Nice! There's one other use in MDL_context::acquire_lock too.

Comment by Vladislav Vaintroub [ 2024-01-10 ]

Right. I had a vague idea of having something like poll() loop that would only wait for POLLERR (polldf.events=0, if this works) for active connections, and once error is returned, initiate KILL_CONNECTION with new kill_errno = ER_NETWORK_DISCONNECT .

That's because "kill connection" works much better generally, to get rid of a connection. THD::killed is getting periodically checked, and KILL interrupts lock waits as well, at least usually it does usually, so converting network error to KILL might be a good idea.

Needs something on Windows, too, There is WSAEventSelect to detect socket closure,. for named pipes I'm not sure yet. perhaps just periodically check health with PeekNamedPipe.

Comment by Daniel Black [ 2024-01-10 ]

Yes, I like that too. Its almost sounding 1/2 way to async programming.

Generated at Thu Feb 08 10:37:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.