[MCOL-3711] Columnstore System Status remains Active and no Alarms are set when mcs becomes not write capable Created: 2020-01-06  Updated: 2020-01-31  Resolved: 2020-01-31

Status: Closed
Project: MariaDB ColumnStore
Component/s: writeengine
Affects Version/s: None
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Environment:

GKE


Attachments: File columnstoreSupportReport.columnstore-1-all.tar.gz     File core.editem.7z    
Issue Links:
Relates
relates to MCOL-3459 think about data integrity (partial d... Closed

 Description   

Columnstore System Status remains Active when mcs becomes not write capable.

This issue was observed with a DBAAS Columnstore SingleNode deployment.
The Columnstore system started to return errors on attempts to insert data, but the problem is not
detected in the mcsadmin System Status and no Alarms are set.

$ mariadb --host=sky0001379.mdb0001293.test.skysql.net     --port=5001 --user=DB00002091   --password='xxxxxx'  --ssl=1 -e "create database abc ; create table abc.abc ( a int , b varchar(64)) engine columnstore ;"
 
$ mariadb --host=sky0001379.mdb0001293.test.skysql.net     --port=5001 --user=DB00002091   --password='xxxxxxxx'  --ssl=1 -e "insert into abc.abc values (1,'aa'),(2,'bbb'), (3,'cccc'); "
ERROR 1815 (HY000) at line 1: Internal error: CAL0001: Insert Failed:  Error occured when calling makeJobList

From the Columnstore PM Pod: the Columnstore system is in Active status and no Alarms are set.

[root@expmcs0001-mdb-cs-single-0 /]# mcsadmin getsystemi
getsysteminfo   Mon Jan  6 11:35:43 2020
 
System columnstore-1
 
System and Module statuses
 
Component     Status                       Last Status Change
------------  --------------------------   ------------------------
System        ACTIVE                       Thu Jan  2 12:48:21 2020
 
Module pm1    ACTIVE                       Thu Jan  2 11:32:54 2020
 
 
MariaDB ColumnStore Process statuses
 
Process             Module    Status            Last Status Change        Process ID
------------------  ------    ---------------   ------------------------  ----------
ProcessMonitor      pm1       ACTIVE            Thu Jan  2 11:32:06 2020         101
ProcessManager      pm1       ACTIVE            Thu Jan  2 11:32:13 2020         208
StorageManager      pm1       ACTIVE            Thu Jan  2 12:07:18 2020       13702
DBRMControllerNode  pm1       ACTIVE            Thu Jan  2 12:47:59 2020       26541
ServerMonitor       pm1       ACTIVE            Thu Jan  2 11:32:35 2020         780
DBRMWorkerNode      pm1       ACTIVE            Thu Jan  2 11:32:36 2020         801
PrimProc            pm1       ACTIVE            Thu Jan  2 11:32:40 2020         891
ExeMgr              pm1       ACTIVE            Thu Jan  2 11:32:44 2020         962
WriteEngineServer   pm1       ACTIVE            Thu Jan  2 12:11:01 2020       16610
DDLProc             pm1       ACTIVE            Thu Jan  2 12:48:09 2020       26681
DMLProc             pm1       ACTIVE            Thu Jan  2 12:48:20 2020       26799
mysqld              pm1       ACTIVE            Thu Jan  2 11:32:14 2020         511
 
Active Alarm Counts: Critical = 0, Major = 0, Minor = 0, Warning = 0, Info = 0



 Comments   
Comment by Zdravelina Sokolovska (Inactive) [ 2020-01-06 ]

Attached the columnstore support report and the core file generated when the report was collected.

Comment by Andrew Hutchings (Inactive) [ 2020-01-06 ]

ColumnStore technically is write capable here, just that there are corrupted data files which are causing various issues.

If this ticket is for stricter detection of when writes should suspend, rather than for the corruption, it should be a feature request. If it is for the corruption, the title and description should be changed.
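
To illustrate what stricter write detection could look like, here is a minimal sketch of a write probe: it attempts a small insert, rolls it back, and reports whether the write succeeded, instead of relying on process status alone. This is not an existing ColumnStore or mcsadmin feature. The function name `write_probe` and the idea of wiring its result into the System Status or an alarm are assumptions, and `sqlite3` is used only as a stand-in for any DB-API 2.0 connection. The sketch also assumes the engine can roll back the probe statement; otherwise a dedicated probe table would be needed.

```python
import sqlite3  # stand-in only; any DB-API 2.0 connection could be passed in
from typing import Tuple


def write_probe(conn, insert_sql: str) -> Tuple[bool, str]:
    """Hypothetical probe: try one insert, undo it, and report the outcome.

    Returns (True, "") if the statement executed, or (False, error text)
    if the engine rejected the write.
    """
    cur = conn.cursor()
    try:
        cur.execute(insert_sql)
        conn.rollback()  # leave no probe rows behind
        return True, ""
    except Exception as exc:
        conn.rollback()  # discard any partial transaction state
        return False, str(exc)
    finally:
        cur.close()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.cursor().execute("create table abc (a int, b varchar(64))")
    conn.commit()
    ok, err = write_probe(conn, "insert into abc values (1, 'aa')")
    print("write capable" if ok else "not write capable: " + err)
```

A monitor could run such a probe periodically and raise an alarm when it fails, which would have surfaced the ERROR 1815 condition described in this ticket.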

Comment by Patrick LeBlanc (Inactive) [ 2020-01-06 ]

The main problem is that StorageManager crashed amidst all this, and it seems that it must have been in the middle of a write when that happened. The good news is that half-written data would only be in local storage, not in S3.

The config file error I noticed and a couple other bugs have been fixed since 1.4.1. I don't know what the cause of the crash was, but it's possible that it's been fixed already.

Integrity checking is a planned feature already; I'll escalate and make that its own ticket. I'll also add a requirement for more robustness against a crash during a write. It doesn't have to be complex; we could just write to tmp files, then rename once done. Then only completed writes will be visible.
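
The write-to-tmp-then-rename idea can be sketched as follows. This is a minimal illustration in Python, not StorageManager's code. The helper name `atomic_write` is hypothetical, and the sketch assumes a POSIX filesystem with the temporary file created in the same directory as the target, so that the rename stays on one filesystem.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Publish data at path so readers see the old file or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename does not cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before publishing
        os.replace(tmp_path, path)  # rename over the target; atomic on POSIX
    except BaseException:
        # On any failure, drop the partial temp file; the target is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process crashes before `os.replace`, the target still holds its previous contents and only a stray `.tmp-` file remains, which a startup cleanup pass could delete.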

Generated at Thu Feb 08 02:44:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.