[MCOL-976] System in DBRM_READ_ONLY mode after Non-parent PM recovery under DataRedundancy Created: 2017-10-13  Updated: 2017-11-15  Resolved: 2017-11-15

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.0.11, 1.1.0
Fix Version/s: 1.0.12, 1.1.2

Type: Bug Priority: Major
Reporter: Ben Thompson (Inactive) Assignee: David Hill (Inactive)
Resolution: Fixed Votes: 0
Labels: relnote

Sprint: 2017-21, 2017-22, 2017-23

 Description   

Setup a system with DataRedundancy and shutdown a non-parent PM module. Wait for system to report Active. Restart the shutdown module. After returning to Active the system will show an alarm:

AlarmID = 31
Brief Description = DBRM_READ_ONLY
Alarm Severity = CRITICAL
Time Issued = Fri Oct 13 17:20:15 2017
Reporting Module = pm1
Reporting Process = DBRMControllerNode
Reported Device = System

Oct 13 17:19:54 testPM1 joblist[14318]: 54.472863 |0|0|0| E 05 CAL0000: /home/test/mariadb/centOS7/mariadb-columnstore-engine/writeengine/client/we_clients.cpp @ 268 Could not connect to pm3_WriteEngineServer: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 13 inet: 192.168.56.213 port: 8630
Oct 13 17:19:54 testPM1 joblist[14318]: 54.598050 |0|0|0| E 05 CAL0000: /home/test/mariadb/centOS7/mariadb-columnstore-engine/writeengine/client/we_clients.cpp @ 268 Could not connect to pm3_WriteEngineServer: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 15 inet: 192.168.56.213 port: 8630
Oct 13 17:19:55 testPM1 oamcpp[14318]: 55.499143 |0|0|0| E 08 CAL0000: OamCache::checkReload shows state for pm3 as AUTO_DISABLED
Oct 13 17:19:55 testPM1 DMLProc[14318]: 55.502492 |0|0|0| I 20 CAL0002: DMLProc will rollback 0 tables.
Oct 13 17:20:15 testPM1 controllernode[30650]: 15.525003 |0|0|0| C 29 CAL0000: DBRM Controller: network error distributing command to worker 3
Oct 13 17:20:35 testPM1 controllernode[30650]: 35.536485 |0|0|0| C 29 CAL0000: DBRM Controller: undo(): warning, could not contact worker number 3
Oct 13 17:20:35 testPM1 controllernode[30650]: 35.536577 |0|0|0| C 29 CAL0000: DBRM Controller: Caught network error. Sending command 17, length 1. Setting read-only mode.
Oct 13 17:20:35 testPM1 DMLProc[14318]: 35.539316 |0|0|0| I 20 CAL0002: DMLProc finished rollbackAll.
Oct 13 17:20:35 testPM1 ProcessMonitor[4463]: 35.623039 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DMLProc State = ACTIVE
Oct 13 17:20:35 testPM1 ProcessMonitor[4463]: 35.623154 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DMLProc State = ACTIVE PID = 14318
Oct 13 17:20:35 testPM1 ProcessMonitor[4463]: 35.623202 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = ACTIVE



 Comments   
Comment by Ben Thompson (Inactive) [ 2017-10-24 ]

Further investigation this is happening when the failover occurs on loss of connection to a node and not when the node returns.

Comment by David Hill (Inactive) [ 2017-10-31 ]

when testing, I brought back to pm2 and ended up with 2 HOT_STANDBY ProcMgr

Primary Front-End MariaDB ColumnStore Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 ACTIVE Tue Oct 31 21:02:03 2017 31524
ProcessManager pm1 ACTIVE Tue Oct 31 21:02:10 2017 31702
DBRMControllerNode pm1 ACTIVE Tue Oct 31 21:49:20 2017 11241
ServerMonitor pm1 ACTIVE Tue Oct 31 21:03:02 2017 771
DBRMWorkerNode pm1 ACTIVE Tue Oct 31 21:03:02 2017 799
DecomSvr pm1 ACTIVE Tue Oct 31 21:03:06 2017 955
PrimProc pm1 ACTIVE Tue Oct 31 21:03:08 2017 1033
ExeMgr pm1 ACTIVE Tue Oct 31 22:29:31 2017 4272
WriteEngineServer pm1 ACTIVE Tue Oct 31 22:28:19 2017 3124
DDLProc pm1 ACTIVE Tue Oct 31 22:30:00 2017 5148
DMLProc pm1 BUSY_INIT Tue Oct 31 22:30:06 2017 5253
mysqld pm1 ACTIVE Tue Oct 31 22:29:46 2017 4676

ProcessMonitor pm2 ACTIVE Tue Oct 31 22:28:30 2017 1413
ProcessManager pm2 HOT_STANDBY Tue Oct 31 22:28:43 2017 2202
DBRMControllerNode pm2 COLD_STANDBY Tue Oct 31 22:28:46 2017
ServerMonitor pm2 ACTIVE Tue Oct 31 22:28:49 2017 2534
DBRMWorkerNode pm2 ACTIVE Tue Oct 31 22:28:51 2017 2590
DecomSvr pm2 ACTIVE Tue Oct 31 22:28:55 2017 2605
PrimProc pm2 ACTIVE Tue Oct 31 22:28:57 2017 2618
ExeMgr pm2 ACTIVE Tue Oct 31 22:29:01 2017 2695
WriteEngineServer pm2 ACTIVE Tue Oct 31 22:29:05 2017 2717
DDLProc pm2 COLD_STANDBY Tue Oct 31 22:29:07 2017
DMLProc pm2 COLD_STANDBY Tue Oct 31 22:29:07 2017
mysqld pm2 ACTIVE Tue Oct 31 22:29:59 2017 3135

ProcessMonitor pm3 ACTIVE Tue Oct 31 21:21:43 2017 1395
ProcessManager pm3 HOT_STANDBY Tue Oct 31 21:30:40 2017 4780
DBRMControllerNode pm3 COLD_STANDBY Tue Oct 31 21:30:43 2017
ServerMonitor pm3 ACTIVE Tue Oct 31 21:22:08 2017 2561
DBRMWorkerNode pm3 ACTIVE Tue Oct 31 21:22:10 2017 2591
DecomSvr pm3 ACTIVE Tue Oct 31 21:22:14 2017 2630

Comment by Ben Thompson (Inactive) [ 2017-11-01 ]

This is not tied to Data Redundancy. This is tied to shutdown or disconnection of PM2 in 3PM system.

Comment by Andrew Hutchings (Inactive) [ 2017-11-01 ]

The cause appears to be that the restart of the DBRM processes is causing the SHM file for the VSS system catalog lock to have a new file name. The DBRM instance in PrimProc caches the key to the old SHM and causes exceptions to be fired when trying to access it.

Comment by David Hill (Inactive) [ 2017-11-01 ]

Addition to Bens comment, problem doesn't occur when you lose PM3 on a PM1-PM3 system. Create table works, but it takes 2 minutes, which is way to long

Comment by Ben Thompson (Inactive) [ 2017-11-08 ]

Steps to test.
Setup 3PM system
shutdown or disconnect PM2
allow system to return to active state and attempt to create table and insert data.
reboot or reconnect PM2
allow system to return to active state and attempt to create table and insert data.

Comment by David Hill (Inactive) [ 2017-11-08 ]

doing a build from pull and will test before merging..

Comment by David Hill (Inactive) [ 2017-11-08 ]

testing out changes, pm2 down/up worked , pm3 down/up worked, issues with pm1 down, pm2 didnt take over. failure occurred

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Wed Nov 8 22:55:39 2017
Module pm1 AUTO_DISABLED/DEGRADED Wed Nov 8 22:54:57 2017
Module pm2 FAILED Wed Nov 8 22:55:30 2017
Module pm3 ACTIVE Wed Nov 8 22:52:26 2017
Active Parent OAM Performance Module is 'pm2'
Primary Front-End MariaDB ColumnStore Module is 'pm2'
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ProcessManager pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DBRMControllerNode pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ServerMonitor pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DBRMWorkerNode pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DecomSvr pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
PrimProc pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ExeMgr pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
WriteEngineServer pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DDLProc pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DMLProc pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
mysqld pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ProcessMonitor pm2 ACTIVE Wed Nov 8 22:38:48 2017 1308
ProcessManager pm2 ACTIVE Wed Nov 8 22:56:36 2017 17413
DBRMControllerNode pm2 ACTIVE Wed Nov 8 22:55:16 2017 19209
ServerMonitor pm2 ACTIVE Wed Nov 8 22:55:18 2017 19225
DBRMWorkerNode pm2 ACTIVE Wed Nov 8 22:55:18 2017 19246
DecomSvr pm2 ACTIVE Wed Nov 8 22:55:22 2017 19273
PrimProc pm2 ACTIVE Wed Nov 8 22:55:24 2017 19291
ExeMgr pm2 ACTIVE Wed Nov 8 22:55:28 2017 19343
WriteEngineServer pm2 MAN_OFFLINE Wed Nov 8 22:55:07 2017
DDLProc pm2 MAN_OFFLINE Wed Nov 8 22:55:07 2017
DMLProc pm2 MAN_OFFLINE Wed Nov 8 22:55:07 2017
mysqld pm2 ACTIVE Wed Nov 8 22:56:16 2017 19092
ProcessMonitor pm3 ACTIVE Wed Nov 8 22:50:31 2017 980
ProcessManager pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
DBRMControllerNode pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
ServerMonitor pm3 ACTIVE Wed Nov 8 22:50:43 2017 9132
DBRMWorkerNode pm3 ACTIVE Wed Nov 8 22:51:18 2017 12458
DecomSvr pm3 ACTIVE Wed Nov 8 22:50:48 2017 12286
PrimProc pm3 ACTIVE Wed Nov 8 22:51:33 2017 12536
ExeMgr pm3 ACTIVE Wed Nov 8 22:51:46 2017 12603
WriteEngineServer pm3 ACTIVE Wed Nov 8 22:52:00 2017 12783
DDLProc pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
DMLProc pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
mysqld pm3 ACTIVE Wed Nov 8 22:52:30 2017 8924
Active Alarm Counts: Critical = 1, Major = 2, Minor = 3, Warni
Add Comment Collapse

david.hill
[5:04 PM]
on pm2, somehow ended up with 2 ProcMgrs running

root     17407  1308  0 22:52 ?        00:00:01 [ProcMgr]
root     17413  1308  0 22:52 ?        00:00:01 [ProcMgr]

error log from pm2
Nov  8 22:50:38 ip-172-30-0-152 ProcessMonitor[1308]: 38.125677 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ProcessManager, old PID = 13879
Nov  8 22:52:21 ip-172-30-0-152 ProcessMonitor[1308]: 21.653912 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ProcessManager, old PID = 16915
Nov  8 22:54:02 ip-172-30-0-152 joblist[17229]: 02.567070 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 172.30.0.161
Nov  8 22:54:02 ip-172-30-0-152 joblist[17229]: 02.567810 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 172.30.0.161
Nov  8 22:54:31 ip-172-30-0-152 ProcessManager[17413]: 31.759238 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 22:54:35 ip-172-30-0-152 oamcpp[17407]: 35.878849 |0|0|0| E 08 CAL0000: ERROR: setPmDbrootConfig failure - clear reside ID
Nov  8 22:54:35 ip-172-30-0-152 ProcessManager[17407]: 35.878939 |0|0|0| E 17 CAL0000: line: 9327 EXCEPTION ERROR on autoMovePmDbroot: Caught unknown exception!
Nov  8 22:54:40 ip-172-30-0-152 ProcessManager[17407]: 40.884008 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov  8 22:54:56 ip-172-30-0-152 ProcessManager[17413]: 56.959888 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov  8 22:55:31 ip-172-30-0-152 ProcessManager[17413]: 31.759655 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 22:55:37 ip-172-30-0-152 ProcessManager[17413]: 37.923985 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov  8 22:55:39 ip-172-30-0-152 ProcessManager[17407]: 39.207292 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov  8 22:56:31 ip-172-30-0-152 ProcessManager[17413]: 31.760205 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 22:57:31 ip-172-30-0-152 ProcessManager[17413]: 31.760619 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 22:58:31 ip-172-30-0-152 ProcessManager[17413]: 31.760997 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 22:59:31 ip-172-30-0-152 ProcessManager[17413]: 31.761341 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 23:00:31 ip-172-30-0-152 ProcessManager[17413]: 31.761709 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 23:01:31 ip-172-30-0-152 ProcessManager[17413]: 31.762202 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 23:02:31 ip-172-30-0-152 ProcessManager[17413]: 31.762579 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov  8 23:03:31 ip-172-30-0-152 ProcessManager[17413]: 31.762935 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use

Comment by David Hill (Inactive) [ 2017-11-08 ]

need some more testing pm1 failover test case

Comment by David Hill (Inactive) [ 2017-11-09 ]

while we look into the pm1 failover issues, Im ytesting out the new changes on amazon AMI setup

Comment by David Hill (Inactive) [ 2017-11-14 ]

latest fixed being tested by Daniel and David Hill. Daniel testing the original fix, the 3 um/pm DR issue. David Hill testing the Amazon 2um/3pm system

Comment by David Hill (Inactive) [ 2017-11-14 ]

this test case still failing...

um2 as active module (ddl/dml are active)
take down pm3 and bring back up.

Ends up with um1 as active module, but the replication master/slave arent setup correct. has um1 as slave and um2 as master

Comment by Daniel Lee (Inactive) [ 2017-11-15 ]

Build verified: 1.1.2 GitHub source

/root/columnstore/mariadb-columnstore-server
commit 9dd5265d8a059657e166377d88b1bf9177f42c22
Merge: ed21e67 b5b886c
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Tue Nov 14 21:55:09 2017 +0000

Merge pull request #76 from mariadb-corporation/MCOL-963

MCOL-963 correction for regression errors

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit dadcd6e9690533cfe71d99c60a893a7830e90748
Merge: f5a2306 9a01244
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Tue Nov 14 21:55:01 2017 +0000

Merge pull request #319 from mariadb-corporation/MCOL-963

MCOL-963 fix for regressions introduced

Comment by David Hill (Inactive) [ 2017-11-15 ]

David Hill is testing original BUG - went into read-only mode during pm2 down/up.

system DR 3 um/pm system

1. took pm2 down
2. no read-only mode and could create tables
3. brought pm2 up
4. no read-only mode and could create tables

Comment by David Hill (Inactive) [ 2017-11-15 ]

pass test of original issue

Generated at Thu Feb 08 02:25:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.