[MCOL-4798] ExeMgr hit cpu, cluster in read only and reported PrimProc error reading file, Error reading compression header. Created: 2021-07-05  Updated: 2023-11-17  Resolved: 2022-07-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 5.5.2
Fix Version/s: 22.08.1

Type: Bug
Priority: Critical
Reporter: Massimo
Assignee: David Hall (Inactive)
Resolution: Duplicate
Votes: 1
Labels: None

Issue Links:
Duplicate
duplicates MCOL-5093 mariadb services on column cluster no... Closed
duplicates MCOL-5163 Increase the stability of writing pro... Closed
Sprint: 2021-16, 2021-17

 Description   

The customer reported a CPU alert from the ExeMgr process. When we checked the cluster status, it was in READ-ONLY mode and reporting the following errors:

Jul 5 08:24:49 pixid-csx2 messagequeue[1366]: 49.012334 |0|0|0| W 31 CAL0000: MessageQueueClient::write: error writing 4790 bytes to IOSocket: sd: 100 inet: 10.10.1.92 port: 8601. Socket error was InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 100 inet: 10.10.1.92 port: 8601
Jul 5 08:24:51 pixid-csx2 PrimProc[7950]: 51.434867 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 1001; Error reading compression header. rc=-5, idx=0, ptr.size=0
Jul 5 08:24:51 pixid-csx2 PrimProc[7950]: 51.435140 |0|0|0| C 28 CAL0000: Error reading compression header. rc=-5, idx=0, ptr.size=0
Jul 5 08:24:51 pixid-csx2 ExeMgr[8071]: 51.449678 |2181490086|0|0| D 16 CAL0042: End SQL statement
Jul 5 08:24:51 pixid-csx2 ExeMgr[8071]: 51.475415 |34005651|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. IDB-2033: Error occurred when calling system catalog.
Jul 5 08:24:51 pixid-csx2 PrimProc[7950]: 51.475907 |0|0|0| I 28 CAL0061: PrimProc error reading file for OID 1001; retry updateptrs for /var/lib/columnstore/data1/000.dir/000.dir/003.dir/233.dir/000.dir/FILE000.cdf. rc=0, idx=0, ptr.size=0

Looking back through the log, there appear to have been many earlier errors, for example:

Jul 5 06:40:03 pixid-csx2 IDBFile[7950]: 03.159455 |0|0|0| D 35 CAL0002: Failed to open file: (dbroot 3 offline)/000.dir/000.dir/014.dir/099.dir/000.dir/FILE000.cdf, exception: unable to open Unbuffered file

We had to restart the entire cluster, which fixed the issue.
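
For triage, the critical entries can be pulled out of excerpts like the ones above with a small script. The following is only a minimal sketch: the line layout is inferred from the excerpts in this ticket, the level letter C is assumed to mark critical entries, and the names LOG_RE, parse_line and critical_entries are hypothetical helpers, not part of ColumnStore.

```python
import re

# Assumed layout, inferred only from the excerpts in this ticket:
# "<Mon> <day> <HH:MM:SS> <host> <proc>[<pid>]: <sec.usec> |n|n|n| <level> <num> <CALnnnn>: <message>"
LOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+"
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]:\s+"
    r"\d+\.\d+\s+\|\d+\|\d+\|\d+\|\s+"
    r"(?P<level>[A-Z])\s+\d+\s+"
    r"(?P<code>CAL\d{4}):\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = LOG_RE.match(line.strip())
    return match.groupdict() if match else None


def critical_entries(lines):
    """Keep only parsed entries whose level letter is 'C' (assumed: critical)."""
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["level"] == "C":
            entries.append(entry)
    return entries
```

Feeding it the lines of /var/log/messages, for instance via critical_entries(open(path)), would return one dict per matching entry with the keys stamp, host, proc, pid, level, code and message. Lines that do not follow the assumed layout are skipped silently.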



 Comments   
Comment by Massimo [ 2021-07-16 ]

manjot toddstoffel we are not able to reproduce the case, which is why we collected all the logs you requested plus /var/log/messages. The customer does not have a test environment, or we do not have access to it. Everything is in the logs, which should point to the problem.

Comment by Massimo [ 2021-09-01 ]

drrtuy toddstoffel any update on this?

Comment by David Hall (Inactive) [ 2022-03-04 ]

We can't reproduce

Generated at Thu Feb 08 02:53:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.