[MCOL-3917] DDLProc/DMLProc must return a meaningful error if a local/remote workernode/controllernode was restarted Created: 2020-04-02  Updated: 2023-10-25  Resolved: 2023-10-25

Status: Closed
Project: MariaDB ColumnStore
Component/s: DDLProc, DMLProc
Affects Version/s: None
Fix Version/s: 23.10

Type: Task Priority: Major
Reporter: Roman Assignee: Roman
Resolution: Won't Fix Votes: 0
Labels: None

Epic Link: Columnstore OAM replacement
Sprint: 2021-2, 2021-3, 2021-4

 Description   

DMLProc and DDLProc don’t reestablish their connections to WriteEngine when those connections fail, so they won’t survive a WriteEngine restart. DMLProc/DDLProc must survive this outage by failing their current operations before they reestablish their connections. The user must be notified about the failed operations.

Update: Make sure ColumnStore services survive restarts of other services as well, including controllernode, workernode, PrimProc, and ExeMgr.
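
The behavior requested in the description (fail the in-flight operation, reestablish the connection, notify the user) can be illustrated with a minimal sketch. This is not ColumnStore code: `FakeWriteEngine`, `Client`, `ConnectionLost`, `OperationFailed`, and the error text are all hypothetical names chosen for illustration, and a restart is modeled only as a counter change.

```python
class ConnectionLost(Exception):
    """Raised when the peer (a stand-in for WriteEngine) is unreachable."""


class OperationFailed(Exception):
    """Error surfaced to the user when the in-flight operation is aborted."""


class FakeWriteEngine:
    """Hypothetical stand-in for a WriteEngine endpoint; not ColumnStore code."""

    def __init__(self):
        self.generation = 0  # incremented on every simulated restart

    def restart(self):
        self.generation += 1


class Client:
    """Hypothetical DMLProc-like client that holds one connection."""

    def __init__(self, engine):
        self.engine = engine
        self.connected_generation = engine.generation

    def _send(self, statement):
        # A connection opened before a restart is stale and cannot be used.
        if self.connected_generation != self.engine.generation:
            raise ConnectionLost("peer was restarted")
        return f"ok: {statement}"

    def _reconnect(self):
        self.connected_generation = self.engine.generation

    def execute(self, statement):
        try:
            return self._send(statement)
        except ConnectionLost as exc:
            # Fail the current operation, then repair the connection so the
            # next operation can succeed.
            self._reconnect()
            raise OperationFailed(
                f"'{statement}' failed: WriteEngine connection lost ({exc}); "
                "connection reestablished, please retry"
            ) from exc
```

In this sketch the first statement issued after `engine.restart()` raises `OperationFailed` with an explanatory message, and the next statement succeeds because the connection was repaired before the error was raised.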



 Comments   
Comment by Roman [ 2020-04-24 ]

We should extend the scope of this task to test all dependencies across services.

Comment by Jose Rojas (Inactive) [ 2020-04-24 ]

You will find all changes in the MCOL-3836 branch.

Comment by Roman [ 2020-09-02 ]

The changes don't really allow services to survive WriteEngine restarts. They are just systemd workarounds that don't work in non-systemd environments.

Comment by David Hall (Inactive) [ 2022-03-04 ]

We believe this works correctly. It needs to be properly tested.

Comment by David Hall (Inactive) [ 2022-05-25 ]

QA: Please force a crash of PrimProc and the other processes, then bring them back up. Check whether DDLProc and DMLProc still function after the other processes come back up.
Do this in Develop-6.

Comment by Daniel Lee (Inactive) [ 2022-05-31 ]

Build tested: 6.4.1-1 (drone #4524)

1. Restart DDLProc, DMLProc, writeengine, StorageManager
query, DDL, DML continue to work

2. Restart PrimProc
DDL continues to work
DMLProc failed

MariaDB [mytest]> insert into t2 values (1),(2);
ERROR 1815 (HY000): Internal error: CAL0001: Insert Failed:  MCS-2043: An internal error occurred.  Check the error log file & contact support.  

The 2nd try of DML statement worked
After restarting PrimProc, execute a query first, then DML, it also worked

3. Restart ExeMgr
Both DDL and DML would fail on first try
The 2nd try of DDL or DML statement worked
After restarting ExeMgr, execute a query first, then DDL or DML, it also works

4. Restart workernode
All queries, DDL, and DML would hang

5. Restart controllernode
System in not-ready state

MariaDB [mytest]> select count(*) from t2;
ERROR 1815 (HY000): Internal error: The system is not yet ready to accept queries

Comment by Roman [ 2022-07-27 ]

The goals are quite feasible, except that a crashed workernode/controllernode might result in a split-brain in how the cluster nodes see the extent map.
So the goal would be to return a meaningful error if workernode/controllernode needs to be restarted (using a cluster restart).
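
As a rough illustration of the distinction drawn in this comment, the sketch below maps a restarted component to a user-facing message. The function name `describe_restart`, the set name, and the message wording are assumptions made for this example; they are not existing ColumnStore identifiers or error codes.

```python
# Components whose restart, per the comment above, calls for a cluster restart
# because the nodes may disagree about the extent map.
CLUSTER_RESTART_REQUIRED = frozenset({"workernode", "controllernode"})


def describe_restart(component: str) -> str:
    """Return a hypothetical user-facing message for a restarted component."""
    if component.lower() in CLUSTER_RESTART_REQUIRED:
        return (
            f"{component} was restarted; the extent map may be inconsistent "
            "across nodes. Restart the cluster before retrying."
        )
    # Other services are assumed to be recoverable by retrying the statement.
    return f"{component} was restarted; the operation failed. Retry the statement."
```

For example, `describe_restart("workernode")` yields a message asking for a cluster restart, while `describe_restart("PrimProc")` only asks the user to retry.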

Generated at Thu Feb 08 02:46:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.