[MCOL-4845] ExeMgr becomes temporarily unavailable for some seconds [happens rarely], causing some queries to fail in the application Created: 2021-08-26  Updated: 2021-12-13

Status: Open
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 5.6.2
Fix Version/s: Icebox

Type: Bug
Priority: Major
Reporter: Eugênio Pacceli Reis da Fonseca
Assignee: Unassigned
Resolution: Unresolved
Votes: 3
Labels: None
Environment:

Ubuntu 20.04.3 LTS (GNU/Linux 5.8.0-1039-gcp x86_64)

64GB RAM
64 vCPUs
2TB SSD

mariadb-plugin-columnstore/unknown,now 1:10.5.12-5.6.2+maria~focal

mariadb-server-10.5/unknown,now 1:10.5.12+maria~focal amd64
mariadb-server-core-10.5/unknown,now 1:10.5.12+maria~focal amd64
mariadb-server/unknown,now 1:10.5.12+maria~focal all
mysql-common/unknown,now 1:10.5.12+maria~focal all


Attachments: XML File Columnstore.xml    

 Description   

Hello!

We have a MariaDB 10.5.12 server with ColumnStore 5.6.2 running on a Google Compute Engine instance (64GB RAM, 64 vCPUs, 2TB SSD). It serves queries for an API that is accessed about a dozen times per day, with each access issuing roughly another dozen queries.

The queries include mixes of InnoDB tables and ColumnStore tables.

For some time now, the following error has occurred (rarely) while our backend executes a query (this is an example):

"message": "(conn=123745, no: 1815, SQLState: HY000) Internal error: Lost connection to ExeMgr. Please contact your administrator
sql:
 
SELECT
    c.c4_id,
    c.c4_name,
    ROUND(SUM(quantity)) as quantity,
    ROUND(SUM(total_value),2) as total_value
 
FROM transaction_daily AS sales
    JOIN store s USING (store_id)
    JOIN product p USING (product_id, ean)
    JOIN latest_product_category pc1 USING (product_id, ean) JOIN client_category_tree c ON pc1.category_id = c.c6_id
    JOIN latest_product_category pc2 USING (product_id, ean) JOIN client_origin_tree o ON pc2.category_id = o.o4_id
 
WHERE date BETWEEN '2021-07-01' AND '2021-07-31' AND (s.store_id=1 OR s.store_id=2 OR s.store_id=3 OR s.store_id=4 OR s.store_id=5 OR s.store_id=6 OR s.store_id=7 OR s.store_id=8 OR s.store_id=9 OR s.store_id=10 OR s.store_id=11 OR s.store_id=12 OR s.store_id=13 OR s.store_id=14 OR s.store_id=15 OR s.store_id=16 OR s.store_id=17 OR s.store_id=18 OR s.store_id=19 OR s.store_id=20 OR s.store_id=21 OR s.store_id=22 OR s.store_id=23 OR s.store_id=24 OR s.store_id=25 OR s.store_id=26 OR s.store_id=27 OR s.store_id=28 OR s.store_id=29 OR s.store_id=30 OR s.store_id=31 OR s.store_id=33 OR s.store_id=34 OR s.store_id=35 OR s.store_id=36 OR s.store_id=37 OR s.store_id=38 OR s.store_id=39 OR s.store_id=40 OR s.store_id=41 OR s.store_id=42 OR s.store_id=43 OR s.store_id=44 OR s.store_id=45 OR s.store_id=46 OR s.store_id=47 OR s.store_id=48 OR s.store_id=49 OR s.store_id=50 OR s.store_id=51 OR s.store_id=52 OR s.store_id=53 OR s.store_id=54 OR s.store_id=55 OR s.store_id=56 OR s.store_id=57 OR s.store_id=58 OR s.store_id=59 OR s.store_id=60 OR s.store_id=61 OR s.store_id=62 OR s.store_id=63 OR s.store_id=64 OR s.store_id=65 OR s.store_id=66 OR s.store_id=67 OR s.store_id=68 OR s.store_id=69 OR s.store_id=70 OR s.store_id=71 OR s.store_id=72 OR s.store_id=73 OR s.store_id=74 OR s.store_id=77 OR s.store_id=82 OR s.store_id=83) 
 
GROUP BY c.c4_id, c.c4_name
ORDER BY total_value DESC, quantity DESC, c.c4_name
 
parameters:[]",

Related entry in
/var/log/mariadb/columnstore/crit.log

Aug 26 14:55:36 mariadb-ubuntu-2004-2-vm ExeMgr[656717]: 36.860088 |2147606742|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable
Aug 26 14:55:36 mariadb-ubuntu-2004-2-vm ExeMgr[656717]: 36.860148 |2147607379|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable
Aug 26 14:55:36 mariadb-ubuntu-2004-2-vm ExeMgr[656717]: 36.860186 |2147607393|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable
Aug 26 14:55:36 mariadb-ubuntu-2004-2-vm ExeMgr[656717]: 36.860242 |2147606750|0|0| C 16 CAL0055: ERROR: ExeMgr has caught an exception. Resource temporarily unavailable
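
To see how often these entries cluster, the critical log can be summarized per second and per ExeMgr PID. The following is only a minimal sketch: the function name `summarize` and the regular expression are illustrative assumptions modeled on the four lines above, not part of ColumnStore, and the pipe-delimited ID fields are captured without interpreting them.

```python
import re
from collections import Counter

# Pattern modeled on the crit.log lines quoted above (an assumption, not an
# official log format specification). The first pipe-delimited number is kept
# as an opaque identifier.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"ExeMgr\[(?P<pid>\d+)\]: .*?\|(?P<ident>\d+)\|\d+\|\d+\| "
    r"C \d+ (?P<code>CAL\d+): (?P<text>.*)$"
)


def summarize(lines):
    """Count critical ExeMgr entries per (timestamp, pid, error code)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # lines that do not fit the pattern are skipped
            key = (match["stamp"], int(match["pid"]), match["code"])
            counts[key] += 1
    return counts


def report(path="/var/log/mariadb/columnstore/crit.log"):
    """Print one line per (timestamp, pid, code) bucket found in the log."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for (stamp, pid, code), n in sorted(summarize(handle).items()):
            print(f"{stamp} pid={pid} {code} x{n}")
```

Calling `report()` on the affected machine would show whether the errors always arrive in short bursts from a single ExeMgr process, as in the excerpt above.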

The database feeds an API that often generates a few OLAP queries on request. The machine had been running for 20 days straight without a reboot. I noticed that the most recent ExeMgr PIDs were around the 4 million mark. After the machine was rebooted, the problem seems to have gone away for now.

This is a rare event, but it causes issues on some of our dashboards.

Any idea what could be causing this? I can provide any additional information needed. I've attached our Columnstore.xml.



 Comments   
Comment by Roman [ 2021-12-13 ]

Greetings.
You might be hitting the RAM allowance limit while doing a JOIN, GROUP BY, or ORDER BY. The first two can fall back to disk-based versions (there are Columnstore.xml settings to enable disk-based join and group by). There is no disk-based ORDER BY yet; it is a work in progress.
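
As a quick way to check whether those disk-based fallbacks are switched on, the configuration file can be inspected with a short script. This is a sketch under assumptions: the element names `HashJoin/AllowDiskBasedJoin` and `RowAggregation/AllowDiskBasedAggregation` are what I believe the settings are called and should be verified against the attached Columnstore.xml for this version; the function name `disk_based_settings` is purely illustrative.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: these section/element names are believed to be the relevant
# Columnstore.xml settings; confirm them in your own file before relying on this.
SETTINGS = [
    ("HashJoin", "AllowDiskBasedJoin"),
    ("RowAggregation", "AllowDiskBasedAggregation"),
]


def disk_based_settings(xml_text):
    """Return {setting name: value or None if the element is absent}."""
    root = ET.fromstring(xml_text)
    result = {}
    for section, name in SETTINGS:
        node = root.find(f"{section}/{name}")
        has_value = node is not None and node.text
        result[name] = node.text.strip() if has_value else None
    return result
```

Passing the contents of Columnstore.xml to `disk_based_settings` returns the current value of each setting, or `None` where the element is missing from the file.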

Generated at Thu Feb 08 02:53:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.