[MCOL-5352] Truncate table failed after PrimProc restarted Created: 2022-12-14  Updated: 2024-02-07

Status: In Progress
Project: MariaDB ColumnStore
Component/s: DMLProc
Affects Version/s: 22.08.4
Fix Version/s: 23.10

Type: Bug Priority: Major
Reporter: Daniel Lee (Inactive) Assignee: Denis Khalikov
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Relates
relates to MCOL-5339 DDLProc[xx]: Could not connect to pmX... Stalled
Sprint: 2023-12

 Description   

Build tested: 22.08.4, as well as the latest in develop

engine: 15f65eff157f8fce48c0dfb30548dc787b259eb2
server: d3049350bb5c61340f5a7518b155d3c9dacdcb33
buildNo: 6257

The TRUNCATE command fails after PrimProc is restarted on a single-node setup,
or on the primary node of a multi-node cluster. If PrimProc is restarted on a slave node, TRUNCATE still succeeds.

MariaDB [mytest]> truncate lineitem;
ERROR 1815 (HY000): Internal error: CAL0009: Truncate table failed:  MCS-2045: At least one PrimProc closed the connection unexpectedly.  

Repeating the TRUNCATE command keeps returning the error until a CREATE TABLE command has been processed.

MariaDB [mytest]> truncate lineitem;
ERROR 1815 (HY000): Internal error: CAL0009: Truncate table failed:  MCS-2045: At least one PrimProc closed the connection unexpectedly.  
MariaDB [mytest]> create table t1 (c1 int) engine=columnstore;
Query OK, 0 rows affected (0.085 sec)
 
MariaDB [mytest]> truncate lineitem;
Query OK, 0 rows affected (0.089 sec)



 Comments   
Comment by Daniel Lee (Inactive) [ 2023-05-24 ]

Build tested:

23.02.3
develop branch
engine: a90535e1a7ffefa0e5ae808fdd0d38d30cffc017
server: 805750b3a90ed4aecbf475025e63674aaab7f7f7
buildNo: 7829

systemctl restart mcs-primproc

[rocky8:root@rocky8~]# mariadb mytest
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 22
Server version: 10.6.13-8-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [mytest]> truncate lineitem;
ERROR 1815 (HY000): Internal error: CAL0009: Truncate table failed: MCS-2045: At least one PrimProc closed the connection unexpectedly.

TRUNCATE TABLE works fine after "mcs cluster restart".

Comment by Roman [ 2024-01-17 ]

After a discussion with leonid.fedorov, we concluded that the problem is caused by a TCP socket in DML/DDLProc that gets stuck when one of the PrimProc (PP) instances in a cluster restarts.
DMLProc and DDLProc both establish a connection with PP. When PP goes away, the TCP sockets used by DML/DDLProc presumably linger for some time (although they should go down). There are two approaches to resolving the root cause:

  • investigate why the socket to PP doesn't go away and which state it is in: TIME-WAIT or something else. We might add a socket state machine event listener.
  • if the socket is waiting for the remote end to close, there should be a way to notify all remote services that PP has been restarted. Controllernode is a good candidate for distributing this event. Another way to distribute the info reliably is via dKVS.
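
To illustrate the first approach, here is a minimal sketch of how a client can check whether the remote end of an established TCP connection has gone away before reusing the socket. It is not ColumnStore code: the loopback listener is a hypothetical stand-in for PP, and peer_closed and demo are names made up for this example. For reference, a socket whose peer has closed while the local side still holds it open normally sits in CLOSE-WAIT rather than TIME-WAIT.

```python
import select
import socket


def peer_closed(sock, timeout=0.0):
    """Return True if the remote end has closed or reset the connection.

    Hypothetical helper for illustration only, not part of ColumnStore.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        # Nothing pending within the timeout: no close has been seen yet.
        return False
    try:
        # An empty read means the peer sent FIN. MSG_PEEK leaves any
        # real payload in the receive buffer for the normal reader.
        return sock.recv(1, socket.MSG_PEEK) == b""
    except (ConnectionResetError, BrokenPipeError):
        return True


def demo():
    # Stand-in for PP: a listener on an ephemeral loopback port (assumption).
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    # Stand-in for the long-lived connection held by DML/DDLProc.
    client = socket.create_connection(server.getsockname())
    accepted, _ = server.accept()

    before = peer_closed(client)

    # Simulate the PP restart: the remote side closes its end.
    accepted.close()
    server.close()

    after = peer_closed(client, timeout=1.0)
    client.close()
    return before, after


if __name__ == "__main__":
    print(demo())
```

A check like this only detects a close that the kernel has already seen. If the peer disappears without sending FIN or RST, the socket still looks alive, which is where the notification approach in the second bullet would help.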
Generated at Thu Feb 08 02:57:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.