[MCOL-5105] Reduced systemd timeouts results in corrupted EM Created: 2022-05-27  Updated: 2024-01-26  Resolved: 2022-06-21

Status: Closed
Project: MariaDB ColumnStore
Component/s: DMLProc
Affects Version/s: 5.6.5, 6.3.1
Fix Version/s: 6.4.1

Type: Bug Priority: Major
Reporter: Roman Assignee: Roman
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by MCOL-4775 DMLproc is unable to complete a rollb... Closed
is duplicated by MCOL-4785 ROLLBACK of a long lasting DML left c... Closed
is duplicated by MCOL-4867 DMLProc failed to rollback a txn Closed
is duplicated by MCOL-5093 mariadb services on column cluster no... Closed
is duplicated by MCOL-5267 loadbrm - increase timeout Closed
Relates
relates to MCOL-5594 Interactive "mcs cluster stop" comman... In Progress
Sprint: 2021-17

 Description   

There were multiple cases when low default startup/shutdown systemd timeouts, namely mcs-dmlproc, mcs-workernode, result in a lost or corrupted extent map.



 Comments   
Comment by Roman [ 2022-05-27 ]

Plz review.

Comment by Daniel Lee (Inactive) [ 2022-06-07 ]

Build tested: 6.4.1-1 (PR #4500, cron #4590)

1. set autocommit=0
2. update 1 billion rows
3. restart both mariadb and mariadb-columnstore services

ColumnStore tried to rollback during startup. From the debug.log, it kept trying the rollback every 2 minutes. I waited for 16 iteration before stopping the test.

When I manually issued a rollback after update, it look 10 minutes to complete.

Jun  6 21:53:46 rocky8 DMLProc[8343]: 46.026225 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Jun  6 21:53:46 rocky8 DMLProc[8343]: 46.050007 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
Jun  6 21:53:46 rocky8 DMLProc[8343]: 46.050091 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 11
Jun  6 21:55:46 rocky8 ddlpackageproc[8340]: 46.115514 |0|0|0| C 23 CAL0008: Timer expired
Jun  6 21:55:46 rocky8 DMLProc[8368]: 46.277344 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Jun  6 21:55:46 rocky8 DMLProc[8368]: 46.302319 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
Jun  6 21:55:46 rocky8 DMLProc[8368]: 46.302397 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 11
Jun  6 21:57:46 rocky8 ddlpackageproc[8365]: 46.360537 |0|0|0| C 23 CAL0008: Timer expired
Jun  6 21:57:46 rocky8 DMLProc[8391]: 46.528691 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Jun  6 21:57:46 rocky8 DMLProc[8391]: 46.551910 |0|0|0| I 20 CAL0002: DMLProc will rollback 1 tables.
Jun  6 21:57:46 rocky8 DMLProc[8391]: 46.551991 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 11
.
.
.

Comment by Daniel Lee (Inactive) [ 2022-06-07 ]

Reopen per my last comment

Comment by Roman [ 2022-06-09 ]

The issue caused by a hardcoded pipe read operation limit that implicitly limits the effective timeout for startup operations.
We have to raise CMAPI aliases timeouts also so that systemd TO and this timeouts are in sync.
The scope of the issue doesn't include containers so they are still affected.

Comment by alexey vorovich (Inactive) [ 2022-06-09 ]

drrtuy
Sorry, please clarify . Are containers affected or not. Unclear from your message
Do we need extra work here or this ticket should be in review

Comment by Roman [ 2022-06-10 ]

This issue is about systemd irrespective whether it is a container installation or not.

Comment by Daniel Lee (Inactive) [ 2022-06-10 ]

Build tested: 6.4.1-1 (#4623)

The test was done on VM, not Docker container.

Repeated the last test case. It took 10 minutes to rollback the transaction successfully.

Jun 10 15:01:09 rocky8 DMLProc[8531]: 09.464816 |0|0|0| I 20 CAL0002: DMLProc is rolling back transaction 4
Jun 10 15:11:50 rocky8 DMLProc[8531]: 50.298740 |0|0|0| I 20 CAL0002: DMLProc rolled back transaction 4 and is releasing table lock id 1
Jun 10 15:11:50 rocky8 DMLProc[8531]: 50.301034 |0|0|0| I 20 CAL0002: DMLProc rolled back transaction 4 and table lock id 1 is released.
Jun 10 15:11:50 rocky8 DMLProc[8531]: 50.318591 |0|0|0| I 20 CAL0002: DMLProc finished rollbackAll.

Comment by Daniel Lee (Inactive) [ 2022-06-16 ]

Build tested: 6.4.1-1 (#4660)

Did another two tests in a Docker container.

Test #1
set autocommit=0 and update only 25 rows in a terminal session. In another terminal session, issued a 'mcsShutdown' command. Transaction got rollback during the mcsShutdown process.

Test #2
set autocommit=0 and update only 25 rows in a terminal session. In another terminal session, issued a 'kill -9' command to killed the ColumnStore process. Transaction got rollback during the next mcsStart process.

Generated at Thu Feb 08 02:55:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.