[MCOL-5263] primproc/exemgr restart stuck following ROLLBACK Created: 2022-10-13  Updated: 2023-11-17  Resolved: 2022-12-14

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 6.3.1, 6.4.5-dompe
Fix Version/s: 22.08.7

Type: Bug Priority: Critical
Reporter: Roman Assignee: Denis Khalikov
Resolution: Fixed Votes: 0
Labels: None

Assigned for Review: Roman
Assigned for Testing: Daniel Lee (Inactive)

 Description   

DMLProc starts a ROLLBACK when the SELECT part of an UPDATE fails because the EM facility in PP was restarted. Unfortunately, this ROLLBACK gets stuck if EM/PP are not yet available. DMLProc must have a timeout with retry when doing the ROLLBACK.
Here is the log excerpt showing the moment when DMLProc gets stuck trying to roll back an update that fails because EM restarts.

Oct 12 05:02:41 pixid-csx3 dmlpackageproc[25196]: 41.662215 |16896900|240438|0| D 21 CAL0001: End SQL statement with error
Oct 12 05:02:41 pixid-csx3 messagequeue[25187]: 41.689404 |0|0|0| W 31 CAL0000: MessageQueueClient::write: error writing 70 bytes to IOSocket: sd: 15 inet: 10.10.1.93 port: 8620. Socket error was InetStreamSocket::write error: Broken pipe -- write from InetStreamSocket: sd: 15 inet: 10.10.1.93 port: 8620
Oct 12 05:02:41 pixid-csx3 joblist[25187]: 41.700286 |2147724086|0|0| C 05 CAL0000: st: 0 TupleBPS::sendPrimitiveMessages() caught DistributedEngineComm::write: Broken Pipe error
Oct 12 05:02:41 pixid-csx3 dmlpackageproc[25196]: 41.745564 |0|0|0| E 21 ClientRotator caught exception: InetStreamSocket::write error: Broken pipe -- write from InetStreamSocket: sd: 24 inet: 10.10.1.93 port: 8601
Oct 12 05:02:41 pixid-csx3 dmlpackageproc[25196]: 41.850234 |16896900|240438|0| D 21 CAL0001: Start SQL statement:  ROLLBACK
Oct 12 05:02:41 pixid-csx3 messagequeue[24776]: 41.977122 |0|0|0| W 31 CAL0000: MessageQueueClient::write: error writing 4 bytes to IOSocket: sd: 142 inet: 10.10.1.93 port: 8601. Socket error was InetStreamSocket::write error: Broken pipe -- write from InetStreamSocket: sd: 142 inet: 10.10.1.93 port: 8601

At the same time both EM and PP were restarted.
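
The timeout-with-retry behaviour requested above can be sketched as follows. This is only an illustration of the general pattern, not the actual DMLProc fix (DMLProc is not written in Python): the names `rollback_with_retry`, `ConnectionUnavailable`, `timeout_s` and `retry_interval_s` are hypothetical, and `do_rollback` stands in for whatever call sends the ROLLBACK to EM/PP.

```python
import time


class ConnectionUnavailable(Exception):
    """Hypothetical error: the peer (EM/PP) is restarting and not reachable yet."""


def rollback_with_retry(do_rollback, timeout_s=30.0, retry_interval_s=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Call do_rollback() until it succeeds or timeout_s has elapsed.

    Returns the number of attempts made. Raises TimeoutError if the peer
    never becomes available within the time budget, so the caller gets an
    error instead of blocking forever.
    """
    deadline = clock() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        try:
            do_rollback()
            return attempts
        except ConnectionUnavailable as exc:
            # Give up if waiting one more interval would pass the deadline.
            if clock() + retry_interval_s > deadline:
                raise TimeoutError(
                    f"rollback failed after {attempts} attempts"
                ) from exc
            sleep(retry_interval_s)
```

With a bounded deadline, a peer that comes back during the retry window lets the rollback complete, while a peer that stays down produces a reported failure rather than a hang.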



 Comments   
Comment by Daniel Lee (Inactive) [ 2022-12-14 ]

Build verified: 22.08.7

engine: 15f65eff157f8fce48c0dfb30548dc787b259eb2
server: d3049350bb5c61340f5a7518b155d3c9dacdcb33
buildNo: 6257

1. Tested on single-node and 3PM clusters
2. Used 1gb lineitem for testing
3. Tested 2 scenarios: restarting PrimProc before and during rollback.

Reproduced the reported issue in 22.08.4 and verified the fix in 22.08.7.
As expected, if PrimProc remains down (not available at all), the rollback fails.

During testing, I also discovered a similar issue with TRUNCATE. MCOL-5352 has been opened to track it separately, since it is an existing issue and was not caused by this fix.

22.08.4

 

MariaDB [mytest]> start transaction;
Query OK, 0 rows affected (0.000 sec)

MariaDB [mytest]> load data infile "/data/qa/source/dbt3/1g/lineitem.tbl" into table lineitem fields terminated by "|";
Query OK, 6001215 rows affected (1 min 30.367 sec)
Records: 6001215 Deleted: 0 Skipped: 0 Warnings: 0

MariaDB [mytest]> rollback;
ERROR 1815 (HY000): Internal error: CAL0001: ROLLBACK failed due to: Network error reading WEClient
MariaDB [mytest]> select count(*) from lineitem;
+----------+
| count(*) |
+----------+
|  6001215 |
+----------+
1 row in set (13.212 sec)

MariaDB [mytest]> truncate lineitem;
ERROR 1815 (HY000): Internal error: System is not ready yet. Please try again.
MariaDB [mytest]>
