[MCOL-5587] Primproc disappears on large selects / special workload
Created: 2023-10-06
Updated: 2024-01-25

Status: Confirmed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 23.02.3, 23.02.4, 23.10.0
Fix Version/s: Icebox

Type: Bug
Priority: Critical
Reporter: Allen Herrera
Assignee: Sergey Zefirov
Resolution: Unresolved
Votes: 1
Labels: rm_stability, triage
Environment: SkySQL, Docker and AWS EC2; 4 CPUs, 16 GB RAM
Attachments: Right-before-server-crash.png (PNG file), primproc.log (text file)
Sprint: 2023-11, 2023-12

 Description   

Summary: Running the supplied workload intermittently causes primproc to disappear. The workload can complete once or twice without problems, but then primproc "crashes" at an unpredictable point (no core dump or stack trace found).

Expectation: ColumnStore stays stable. Acceptable outcomes include rejecting or erroring out queries that are too large, recovering automatically, or returning an error message that says what is needed to complete the query (CPU/RAM). A subprocess disappearing and leaving the system broken until someone manually restarts it is not acceptable.

Workaround: restart columnstore

Reproduction: See developer comment

Client Side error:

ERROR 1815 (HY000) at line 1: Internal error: MCS-2004: Cannot connect to ExeMgr.

primproc.log

getFreeMemory : returned from  getMemUsageFromCGroup : usage 5211672576 (GIB) 4
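
For reference, the byte count in the log line above can be converted to GiB as sketched below. This is only an illustration of the arithmetic: the reading that the trailing "4" is the usage truncated to whole GiB is an assumption, not something confirmed from the primproc source, and the helper name bytes_to_gib is hypothetical.

```python
# Convert the cgroup usage reported in primproc.log to GiB.
# Assumption: the trailing "(GIB) 4" in the log is this value
# truncated to whole GiB; this is inferred, not taken from the source code.
GIB = 1024 ** 3


def bytes_to_gib(n_bytes: int) -> float:
    """Return n_bytes expressed in GiB (2**30 bytes)."""
    return n_bytes / GIB


usage_bytes = 5211672576  # value from the primproc.log line
usage_gib = bytes_to_gib(usage_bytes)
print(f"{usage_gib:.2f} GiB, truncated: {int(usage_gib)}")
# prints "4.85 GiB, truncated: 4"
```

Under that assumption, the logged figure is about 4.85 GiB of usage on a host with 16 GB of RAM.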

debug.log

Oct  6 17:29:47 mcs1 messagequeue[794]: 47.156748 |0|0|0| W 31 CAL0071: InetStreamSocket::read: timeout during first read: socket read error: Success; InetStreamSocket: sd: 65 inet: 127.0.0.1 port: 8601; Will retry.

mariadb-error.log

ClientRotator caught exception: InetStreamSocket::connect: connect() error: Connection refused to: InetStreamSocket: sd: 64 inet: 127.0.0.1 port: 8601
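
Both log excerpts above point at 127.0.0.1 port 8601. A quick way to confirm whether anything is still accepting connections there is a plain TCP probe, sketched below. The function name port_accepts_connections is hypothetical, and the check says nothing about why the listener went away; it only distinguishes "listening" from "refused or timed out".

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # or a timeout) when nothing is accepting connections.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 127.0.0.1:8601 is the address shown in the log lines above.
    if port_accepts_connections("127.0.0.1", 8601):
        print("port 8601 is accepting connections")
    else:
        print("port 8601 refused the connection or timed out")
```

Running this before and after the workload would show whether the "Connection refused" state persists until the restart described in the workaround.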



 Comments   
Comment by Roman [ 2023-10-07 ]

Did I get it right that the memory allowance in Sky for the pod is 5211672576 bytes?

Comment by Leonid Fedorov [ 2023-10-09 ]

kirill.perov@mariadb.com please try the attached reproduction script on a similar AWS EC2 VM, and on a bigger one as well.

Comment by Kirill Perov [ 2023-10-10 ]

I ran the replay 4 times on the same 4-CPU, 16 GB AWS VM.
No crashes.

mariadb-plugin-columnstore 10.6.15.10-23.02.4+maria~ubu2204 amd64

The only errors I see are from malformed queries.
