[MDEV-33207] Killing application cannot remove DB locked session - Jira

Details

Type: Task
Status: Stalled (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Locking, Server
Labels:
- kill
- locking

Description

When a DB session running an UPDATE on locked row(s), killing the corresponding application (e.g. MySQL client) does not kill/remove that session in DB. The session is gone when the innodb_lock_wait_timeout is reached.

Scenario:
1. Create a simple table and insert a row:
CREATE TABLE `t1` (
`idx` int(11) NOT NULL,
`name` varchar(100) DEFAULT NULL,
PRIMARY KEY (`idx`)
) ;
INSERT INTO t1 VALUES (1, 'abc'); COMMIT;

2. Run MySQL client to create session A to lock the row:
> SET AUTOCOMMIT=OFF;
> UPDATE t1 SET name='sess A' WHERE idx=1;

3. Run another MySQL client to create sesssion B, update the locked row by session A:
> SET AUTOCOMMIT=ON;
> UPDATE t1 SET name='sess B' WHERE idx=1;

4. Run SHOW FULL PROCESSLIST on another MySQL client, it shows session B is running running the UPDATE statement (waiting for the row to be unlocked).

5. On the OS shell, run "kill -9 <Session B MySQL client PID>" to kill the MySQL client of session B.

6. Run SHOW FULL PROCESSLIST on another MySQL client again, it shows session B still remains in the DB, running the UPDATE statement. Session B is gone until innodb_lock_wait_timeout is reached.

7. If session A commited the UPDATE before session B's innodb_lock_wait_timeout is reached, the row will be updated by session B.

I would suppose that killing application can kill the session in DB but it doesn't in the above scenario. Please advise the above behavior is normal or not.

Thanks.

Attachments

Activity

Ascending order - Click to sort in descending order

Daniel Black added a comment - 2024-01-10 04:03

When a connected process is killed, the mariadb server doesn't receive an indication that the socket session has been removed.

I tested this an in a strace

1039777 <... poll resumed>)             = 1 ([{fd=47, revents=POLLIN}])

1039777 recvfrom(47, "(\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4

1039777 recvfrom(47, "\3UPDATE t1 SET name='sess B' WHERE idx=1", 40, MSG_DONTWAIT, NULL, NULL) = 40

No error on file descriptor 47 or the poll of it return any error after the process was killed.

so normal behaviour it is.

With tcp keep alives you should be able to detect this earlier - https://mariadb.com/kb/en/server-system-variables/#tcp_keepalive_interval

Daniel Black added a comment - 2024-01-10 04:03 When a connected process is killed, the mariadb server doesn't receive an indication that the socket session has been removed. I tested this an in a strace 1039777 <... poll resumed>) = 1 ([{fd=47, revents=POLLIN}]) 1039777 recvfrom(47, "(\0\0\0", 4, MSG_DONTWAIT, NULL, NULL) = 4 1039777 recvfrom(47, "\3UPDATE t1 SET name='sess B' WHERE idx=1", 40, MSG_DONTWAIT, NULL, NULL) = 40 No error on file descriptor 47 or the poll of it return any error after the process was killed. so normal behaviour it is. With tcp keep alives you should be able to detect this earlier - https://mariadb.com/kb/en/server-system-variables/#tcp_keepalive_interval

Vladislav Vaintroub added a comment - 2024-01-10 12:07 - edited

I think the suspected bug is that server thread is stuck in innodb lock, while executing UPDATE, and while it is stuck, it does not read or write, or check the status of the socket.
For example , a SLEEP() wait is implemented such that it wakes up every 5 seconds and checks a) whether connection was killed and b) THD::is_connected(), which ends up in vio_is_connected.
(s. Interruptible_wait::wait() in item_func.cc) . If network connection is broken, SLEEP() is interrupted.

Currently, only SLEEP() waits are aware of THD::is_connected, as far as I can see, Would be an improvement if other long-ish waits (table, user, row locks) were aware, too, so they do not hog resources, and disappear faster.

Vladislav Vaintroub added a comment - 2024-01-10 12:07 - edited I think the suspected bug is that server thread is stuck in innodb lock, while executing UPDATE, and while it is stuck, it does not read or write, or check the status of the socket. For example , a SLEEP() wait is implemented such that it wakes up every 5 seconds and checks a) whether connection was killed and b) THD::is_connected(), which ends up in vio_is_connected. (s. Interruptible_wait::wait() in item_func.cc) . If network connection is broken, SLEEP() is interrupted. Currently, only SLEEP() waits are aware of THD::is_connected, as far as I can see, Would be an improvement if other long-ish waits (table, user, row locks) were aware, too, so they do not hog resources, and disappear faster.

Daniel Black added a comment - 2024-01-10 22:32

Nice! There's one other use in MDL_context::acquire_lock too.

Daniel Black added a comment - 2024-01-10 22:32 Nice! There's one other use in MDL_context::acquire_lock too.

Vladislav Vaintroub added a comment - 2024-01-10 22:56 - edited

Right. I had a vague idea of having something like poll() loop that would only wait for POLLERR (polldf.events=0, if this works) for active connections, and once error is returned, initiate KILL_CONNECTION with new kill_errno = ER_NETWORK_DISCONNECT .

That's because "kill connection" works much better generally, to get rid of a connection. THD::killed is getting periodically checked, and KILL interrupts lock waits as well, at least usually it does usually, so converting network error to KILL might be a good idea.

Needs something on Windows, too, There is WSAEventSelect to detect socket closure,. for named pipes I'm not sure yet. perhaps just periodically check health with PeekNamedPipe.

Vladislav Vaintroub added a comment - 2024-01-10 22:56 - edited Right. I had a vague idea of having something like poll() loop that would only wait for POLLERR (polldf.events=0, if this works) for active connections, and once error is returned, initiate KILL_CONNECTION with new kill_errno = ER_NETWORK_DISCONNECT . That's because "kill connection" works much better generally, to get rid of a connection. THD::killed is getting periodically checked, and KILL interrupts lock waits as well, at least usually it does usually, so converting network error to KILL might be a good idea. Needs something on Windows, too, There is WSAEventSelect to detect socket closure,. for named pipes I'm not sure yet. perhaps just periodically check health with PeekNamedPipe.

Daniel Black added a comment - 2024-01-10 23:38 - edited

Yes, I like that too. Its almost sounding 1/2 way to async programming.

Daniel Black added a comment - 2024-01-10 23:38 - edited Yes, I like that too. Its almost sounding 1/2 way to async programming.

People

Assignee:: Unassigned

Reporter:: Chow King Tak

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2024-01-10 01:02

Updated:: 2024-01-12 20:12

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server