|
Further investigation shows this happens when the failover occurs on loss of connection to a node, not when the node returns.
|
|
While testing, I brought pm2 back and ended up with 2 HOT_STANDBY ProcMgrs:
Primary Front-End MariaDB ColumnStore Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 ACTIVE Tue Oct 31 21:02:03 2017 31524
ProcessManager pm1 ACTIVE Tue Oct 31 21:02:10 2017 31702
DBRMControllerNode pm1 ACTIVE Tue Oct 31 21:49:20 2017 11241
ServerMonitor pm1 ACTIVE Tue Oct 31 21:03:02 2017 771
DBRMWorkerNode pm1 ACTIVE Tue Oct 31 21:03:02 2017 799
DecomSvr pm1 ACTIVE Tue Oct 31 21:03:06 2017 955
PrimProc pm1 ACTIVE Tue Oct 31 21:03:08 2017 1033
ExeMgr pm1 ACTIVE Tue Oct 31 22:29:31 2017 4272
WriteEngineServer pm1 ACTIVE Tue Oct 31 22:28:19 2017 3124
DDLProc pm1 ACTIVE Tue Oct 31 22:30:00 2017 5148
DMLProc pm1 BUSY_INIT Tue Oct 31 22:30:06 2017 5253
mysqld pm1 ACTIVE Tue Oct 31 22:29:46 2017 4676
ProcessMonitor pm2 ACTIVE Tue Oct 31 22:28:30 2017 1413
ProcessManager pm2 HOT_STANDBY Tue Oct 31 22:28:43 2017 2202
DBRMControllerNode pm2 COLD_STANDBY Tue Oct 31 22:28:46 2017
ServerMonitor pm2 ACTIVE Tue Oct 31 22:28:49 2017 2534
DBRMWorkerNode pm2 ACTIVE Tue Oct 31 22:28:51 2017 2590
DecomSvr pm2 ACTIVE Tue Oct 31 22:28:55 2017 2605
PrimProc pm2 ACTIVE Tue Oct 31 22:28:57 2017 2618
ExeMgr pm2 ACTIVE Tue Oct 31 22:29:01 2017 2695
WriteEngineServer pm2 ACTIVE Tue Oct 31 22:29:05 2017 2717
DDLProc pm2 COLD_STANDBY Tue Oct 31 22:29:07 2017
DMLProc pm2 COLD_STANDBY Tue Oct 31 22:29:07 2017
mysqld pm2 ACTIVE Tue Oct 31 22:29:59 2017 3135
ProcessMonitor pm3 ACTIVE Tue Oct 31 21:21:43 2017 1395
ProcessManager pm3 HOT_STANDBY Tue Oct 31 21:30:40 2017 4780
DBRMControllerNode pm3 COLD_STANDBY Tue Oct 31 21:30:43 2017
ServerMonitor pm3 ACTIVE Tue Oct 31 21:22:08 2017 2561
DBRMWorkerNode pm3 ACTIVE Tue Oct 31 21:22:10 2017 2591
DecomSvr pm3 ACTIVE Tue Oct 31 21:22:14 2017 2630
|
|
This is not tied to Data Redundancy. It is tied to shutdown or disconnection of PM2 in a 3-PM system.
|
|
The cause appears to be that restarting the DBRM processes gives the SHM file for the VSS system catalog lock a new file name. The DBRM instance in PrimProc caches the key to the old SHM segment, so exceptions fire when it tries to access it.
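The failure mode can be illustrated in miniature (a Python stand-in; the segment names and the `vss_lock` label are hypothetical, not ColumnStore's actual SHM naming): a process that caches a shared-memory identifier keeps using it after the owner destroys and recreates the segment under a new name.

```python
from multiprocessing import shared_memory

# The "DBRM" side creates the lock segment; a client (standing in for
# PrimProc) caches its name as the key it will use later.
seg = shared_memory.SharedMemory(name="vss_lock_v1", create=True, size=64)
cached_name = seg.name  # the client's cached key

# Simulated DBRM restart: the old segment is destroyed and a new one
# is created under a different name.
seg.close()
seg.unlink()
new_seg = shared_memory.SharedMemory(name="vss_lock_v2", create=True, size=64)

# The client still holds the stale name, so attaching raises an
# exception, analogous to PrimProc's cached-key failures after the
# DBRM restart.
try:
    shared_memory.SharedMemory(name=cached_name)
    stale_attach_failed = False
except FileNotFoundError:
    stale_attach_failed = True

new_seg.close()
new_seg.unlink()
```

The fix direction implied by the comment is to re-resolve the key after a DBRM restart instead of trusting the cached one.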
|
|
In addition to Ben's comment, the problem doesn't occur when you lose PM3 on a PM1-PM3 system. CREATE TABLE works, but it takes 2 minutes, which is far too long.
|
|
Steps to test:
1. Set up a 3-PM system.
2. Shut down or disconnect PM2.
3. Allow the system to return to the ACTIVE state, then attempt to create a table and insert data.
4. Reboot or reconnect PM2.
5. Allow the system to return to the ACTIVE state, then attempt to create a table and insert data.
|
|
Doing a build from the pull request and will test before merging.
|
|
Testing out the changes: pm2 down/up worked and pm3 down/up worked, but there were issues with pm1 down. pm2 didn't take over, and a failure occurred:
Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Wed Nov 8 22:55:39 2017
Module pm1 AUTO_DISABLED/DEGRADED Wed Nov 8 22:54:57 2017
Module pm2 FAILED Wed Nov 8 22:55:30 2017
Module pm3 ACTIVE Wed Nov 8 22:52:26 2017
Active Parent OAM Performance Module is 'pm2'
Primary Front-End MariaDB ColumnStore Module is 'pm2'
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ProcessManager pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DBRMControllerNode pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ServerMonitor pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DBRMWorkerNode pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DecomSvr pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
PrimProc pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ExeMgr pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
WriteEngineServer pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DDLProc pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
DMLProc pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
mysqld pm1 AUTO_OFFLINE Wed Nov 8 22:55:02 2017
ProcessMonitor pm2 ACTIVE Wed Nov 8 22:38:48 2017 1308
ProcessManager pm2 ACTIVE Wed Nov 8 22:56:36 2017 17413
DBRMControllerNode pm2 ACTIVE Wed Nov 8 22:55:16 2017 19209
ServerMonitor pm2 ACTIVE Wed Nov 8 22:55:18 2017 19225
DBRMWorkerNode pm2 ACTIVE Wed Nov 8 22:55:18 2017 19246
DecomSvr pm2 ACTIVE Wed Nov 8 22:55:22 2017 19273
PrimProc pm2 ACTIVE Wed Nov 8 22:55:24 2017 19291
ExeMgr pm2 ACTIVE Wed Nov 8 22:55:28 2017 19343
WriteEngineServer pm2 MAN_OFFLINE Wed Nov 8 22:55:07 2017
DDLProc pm2 MAN_OFFLINE Wed Nov 8 22:55:07 2017
DMLProc pm2 MAN_OFFLINE Wed Nov 8 22:55:07 2017
mysqld pm2 ACTIVE Wed Nov 8 22:56:16 2017 19092
ProcessMonitor pm3 ACTIVE Wed Nov 8 22:50:31 2017 980
ProcessManager pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
DBRMControllerNode pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
ServerMonitor pm3 ACTIVE Wed Nov 8 22:50:43 2017 9132
DBRMWorkerNode pm3 ACTIVE Wed Nov 8 22:51:18 2017 12458
DecomSvr pm3 ACTIVE Wed Nov 8 22:50:48 2017 12286
PrimProc pm3 ACTIVE Wed Nov 8 22:51:33 2017 12536
ExeMgr pm3 ACTIVE Wed Nov 8 22:51:46 2017 12603
WriteEngineServer pm3 ACTIVE Wed Nov 8 22:52:00 2017 12783
DDLProc pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
DMLProc pm3 COLD_STANDBY Wed Nov 8 22:52:26 2017
mysqld pm3 ACTIVE Wed Nov 8 22:52:30 2017 8924
Active Alarm Counts: Critical = 1, Major = 2, Minor = 3, Warni
david.hill [5:04 PM]
On pm2, we somehow ended up with 2 ProcMgrs running:
root 17407 1308 0 22:52 ? 00:00:01 [ProcMgr]
root 17413 1308 0 22:52 ? 00:00:01 [ProcMgr]
Error log from pm2:
Nov 8 22:50:38 ip-172-30-0-152 ProcessMonitor[1308]: 38.125677 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ProcessManager, old PID = 13879
Nov 8 22:52:21 ip-172-30-0-152 ProcessMonitor[1308]: 21.653912 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ProcessManager, old PID = 16915
Nov 8 22:54:02 ip-172-30-0-152 joblist[17229]: 02.567070 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 172.30.0.161
Nov 8 22:54:02 ip-172-30-0-152 joblist[17229]: 02.567810 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 172.30.0.161
Nov 8 22:54:31 ip-172-30-0-152 ProcessManager[17413]: 31.759238 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 22:54:35 ip-172-30-0-152 oamcpp[17407]: 35.878849 |0|0|0| E 08 CAL0000: ERROR: setPmDbrootConfig failure - clear reside ID
Nov 8 22:54:35 ip-172-30-0-152 ProcessManager[17407]: 35.878939 |0|0|0| E 17 CAL0000: line: 9327 EXCEPTION ERROR on autoMovePmDbroot: Caught unknown exception!
Nov 8 22:54:40 ip-172-30-0-152 ProcessManager[17407]: 40.884008 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov 8 22:54:56 ip-172-30-0-152 ProcessManager[17413]: 56.959888 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov 8 22:55:31 ip-172-30-0-152 ProcessManager[17413]: 31.759655 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 22:55:37 ip-172-30-0-152 ProcessManager[17413]: 37.923985 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov 8 22:55:39 ip-172-30-0-152 ProcessManager[17407]: 39.207292 |0|0|0| E 17 CAL0000: line: 6236 sendMsgProcMon ping failure
Nov 8 22:56:31 ip-172-30-0-152 ProcessManager[17413]: 31.760205 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 22:57:31 ip-172-30-0-152 ProcessManager[17413]: 31.760619 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 22:58:31 ip-172-30-0-152 ProcessManager[17413]: 31.760997 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 22:59:31 ip-172-30-0-152 ProcessManager[17413]: 31.761341 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 23:00:31 ip-172-30-0-152 ProcessManager[17413]: 31.761709 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 23:01:31 ip-172-30-0-152 ProcessManager[17413]: 31.762202 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 23:02:31 ip-172-30-0-152 ProcessManager[17413]: 31.762579 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
Nov 8 23:03:31 ip-172-30-0-152 ProcessManager[17413]: 31.762935 |0|0|0| E 17 CAL0000: line: 456 EXCEPTION ERROR on MessageQueueServer for ProcMgr:InetStreamSocket::bind: bind() error: Address already in use
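The repeated `bind() error: Address already in use` entries are the textbook symptom of the two ProcMgr instances above contending for one listening port. A minimal sketch of the condition (the port choice is arbitrary; this is not ColumnStore code):

```python
import socket

# The first "ProcMgr" binds its message-queue port successfully.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))        # let the OS pick a free port
first.listen()
port = first.getsockname()[1]

# A duplicate instance of the same daemon tries the same port and
# fails with EADDRINUSE, just as the second ProcMgr does in the log.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    bind_failed = False
except OSError:
    bind_failed = True   # "Address already in use"
finally:
    second.close()
first.close()
```

Whichever instance loses the bind race can never serve the ProcMgr message queue, which matches pm2 failing to take over.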
|
|
Need some more testing on the pm1 failover test case.
|
|
While we look into the pm1 failover issues, I'm testing out the new changes on the Amazon AMI setup.
|
|
Latest fixes are being tested by Daniel and David Hill: Daniel is testing the original fix (the 3-UM/PM Data Redundancy issue), and David Hill is testing the Amazon 2-UM/3-PM system.
|
|
This test case is still failing:
um2 is the active module (DDLProc/DMLProc are active).
Take down pm3 and bring it back up.
It ends up with um1 as the active module, but the replication master/slave aren't set up correctly: it has um1 as slave and um2 as master.
|
|
Build verified: 1.1.2 GitHub source
/root/columnstore/mariadb-columnstore-server
commit 9dd5265d8a059657e166377d88b1bf9177f42c22
Merge: ed21e67 b5b886c
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Tue Nov 14 21:55:09 2017 +0000
Merge pull request #76 from mariadb-corporation/MCOL-963
MCOL-963 correction for regression errors
/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit dadcd6e9690533cfe71d99c60a893a7830e90748
Merge: f5a2306 9a01244
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Tue Nov 14 21:55:01 2017 +0000
Merge pull request #319 from mariadb-corporation/MCOL-963
MCOL-963 fix for regressions introduced
|
|
David Hill is testing the original bug - the system went into read-only mode during pm2 down/up.
System: Data Redundancy 3-UM/PM system.
1. Took pm2 down.
2. No read-only mode, and could create tables.
3. Brought pm2 up.
4. No read-only mode, and could create tables.
|
|
Passed the test of the original issue.
|