[MCOL-1523] OAM Process failover logic for DDLproc is incorrect - causing DDL to stop working Created: 2018-07-02 Updated: 2023-10-26 Resolved: 2018-09-18 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ? |
| Affects Version/s: | 1.1.5 |
| Fix Version/s: | 1.1.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | David Hill (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
2 um 3 pm system on redhat 7 |
||
| Issue Links: |
|
||||
| Sprint: | 2018-14, 2018-15, 2018-16, 2018-17 | ||||
| Description |
|
A customer reported problems with DDL commands: CREATE TABLE was failing. Investigation showed that the DDLProc IP address was pointing to the wrong UM module. It was pointing to UM2 when it should have been pointing to UM1. The issue is caused by the OAM process failover logic. This is a DDLProc crash and failover: DDLProc on um1 is crashing and OAM is trying to start DDLProc on um2, which changes the IP.
Jun 28 23:57:12 x01sibgadb3a ProcessMonitor[7877]: 12.138592 |0|0|0| D 18 CAL0000: statusControl: Set Process um1/DDLProc State = AUTO_OFFLINE PID = 0
Jun 28 23:57:13 x01sibgadb3a ProcessManager[8303]: 13.192889 |0|0|0| D 17 CAL0000: setPMProcIPs called for um2 |
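The two log lines above are the signature of the problem: DDLProc goes AUTO_OFFLINE on one module and setPMProcIPs is then called for a different one. A minimal sketch of how a syslog excerpt could be scanned for that sequence follows. The regular expressions are assumptions inferred only from the two lines quoted in this ticket, and `find_ip_switches` is a hypothetical helper name, not part of ColumnStore.

```python
import re

# Assumed message shapes, inferred only from the two log lines in this ticket.
OFFLINE_RE = re.compile(
    r"Set Process (?P<module>\w+)/(?P<proc>\w+) State = AUTO_OFFLINE"
)
SET_IPS_RE = re.compile(r"setPMProcIPs called for (?P<module>\w+)")


def find_ip_switches(lines, process="DDLProc"):
    """Return (offline_module, new_module) pairs.

    A pair is recorded when `process` goes AUTO_OFFLINE on one module and
    the next setPMProcIPs call names a different module.
    """
    switches = []
    pending = None  # module on which `process` last went AUTO_OFFLINE
    for line in lines:
        offline = OFFLINE_RE.search(line)
        if offline and offline.group("proc").lower() == process.lower():
            pending = offline.group("module")
            continue
        set_ips = SET_IPS_RE.search(line)
        if set_ips and pending is not None:
            if set_ips.group("module") != pending:
                switches.append((pending, set_ips.group("module")))
            pending = None
    return switches


# The two lines quoted in the description.
SAMPLE = [
    "Jun 28 23:57:12 x01sibgadb3a ProcessMonitor[7877]: 12.138592 |0|0|0| D 18 "
    "CAL0000: statusControl: Set Process um1/DDLProc State = AUTO_OFFLINE PID = 0",
    "Jun 28 23:57:13 x01sibgadb3a ProcessManager[8303]: 13.192889 |0|0|0| D 17 "
    "CAL0000: setPMProcIPs called for um2",
]

if __name__ == "__main__":
    for old, new in find_ip_switches(SAMPLE):
        print(f"DDLProc went AUTO_OFFLINE on {old}; process IPs then set for {new}")
```

A match only shows that the IP was switched after a crash. Whether the switch was wrong still has to be checked against which UM is actually the active one.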
| Comments |
| Comment by David Hill (Inactive) [ 2018-07-31 ] | ||
|
How to test: start with a system of 2 UMs and 1 or more PMs, with MariaDB replication enabled. On um1, kill DDLProc and do not allow it to restart.
Once DDLProc fails to restart on um1, um1 will be disabled and um2 will be made the master UM with DDLProc and DMLProc active. Make sure you can create tables on um2. Then get um1 back in service
on pm1:
| ||
| Comment by David Hill (Inactive) [ 2018-07-31 ] | ||
|
https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/529 | ||
| Comment by Daniel Lee (Inactive) [ 2018-08-09 ] | ||
|
Build tested: 1.1.6-1 source
/root/columnstore/mariadb-columnstore-server
Merge pull request #123 from drrtuy/
/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
Merge pull request #523 from mariadb-corporation/

Could not reproduce the same issue on 1.1.5-1. Instead, DDLProc failed on UM1. UM1 did not get into a disabled state, and DDLProc and DMLProc on UM2 remained in COLD_STANDBY state.

Tested on 1.1.6-1. The following happened:
1) DDLProc and DMLProc on UM1 did fail over to UM2
MariaDB [mytest]> create table t (c1 int, c2 char(50)) engine=columnstore;
MariaDB [mytest]> insert into t values (1, 'hello');
MariaDB [mytest]> select * from t;

Re-enabling UM1 from PM1 hung and timed out.
mcsadmin> altersystem-enablemodule um1
This command starts the processing of applications on a Module within the MariaDB ColumnStore System
Enabling Modules
Starting Modules

At this point, status on UMs:
Process         Module  Status  Last Status Change        Process ID
ProcessMonitor  um2     ACTIVE  Thu Aug 9 15:04:16 2018   3306

I was able to log in to the mysql client on UM1, but the table I created on UM2 was not there. Schema sync was enabled during postConfigure. | ||
| Comment by David Hill (Inactive) [ 2018-08-10 ] | ||
|
I did reproduce Daniel's issue.
MariaDB [(none)]> use david
MariaDB [david]> insert into tmp values (1);
MariaDB [david]> select * from tmp; | ||
| Comment by David Hill (Inactive) [ 2018-08-10 ] | ||
|
fixed...
MariaDB [(none)]> use david
MariaDB [david]> insert into tmp1 values (1);
MariaDB [david]> select * from tmp;
------
------ MariaDB [david]> | ||
| Comment by David Hill (Inactive) [ 2018-08-10 ] | ||
|
1. fixed query issue | ||
| Comment by David Hill (Inactive) [ 2018-08-10 ] | ||
|
https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/536 | ||
| Comment by Daniel Lee (Inactive) [ 2018-08-22 ] | ||
|
Build tested: 1.1.6-1 source
/root/columnstore/mariadb-columnstore-server
Merge pull request #126 from mariadb-corporation/
/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
Merge pull request #542 from drrtuy/

1) DDLProc and DMLProc on UM1 did fail over to UM2
AlarmID = 31 | ||
| Comment by David Hill (Inactive) [ 2018-08-28 ] | ||
|
I am able to reproduce the issue.
mcsadmin> altersystem-di um1 y
This command stops the processing of applications on the Primary User Module, which is where DDL/DML are performed
Stopping Modules
Disabling Modules
New Primary User Module = um2

ALL'S GOOD

mcsadmin> altersystem-en um1 y
Enabling Modules
Starting Modules

Aug 28 20:57:56 ip-172-31-46-144 controllernode[16164]: 56.957547 |0|0|0| C 29 CAL0000: DBRM Controller: network error distributing command to worker 2 | ||
| Comment by David Hill (Inactive) [ 2018-08-30 ] | ||
|
Arg, this is tougher to fix than I would have thought. As part of the testing, I wanted to make sure that DDL, DML, and queries all worked after a disablemodule and an enablemodule of a UM. It took a while, but I got disablemodule working. enablemodule still has an issue where I can't create a new table. Still working on that. Once I have disable/enable fully functioning, I will go back to the failover test cases. | ||
| Comment by David Hill (Inactive) [ 2018-09-04 ] | ||
|
still in progress
| ||
| Comment by David Hill (Inactive) [ 2018-09-05 ] | ||
|
Got both disablemodule and enablemodule working when run manually. Now testing the automatic disable and enable of a module to make sure create and query work. | ||
| Comment by David Hill (Inactive) [ 2018-09-05 ] | ||
|
To get enablemodule to work, I changed the code to do a restartsystem as part of the enable module.
mcsadmin> altersystem-en um2 y
Enabling Modules
Restarting System ..
mcsadmin>

from um1
mcsmysql < test.sql | ||
| Comment by David Hill (Inactive) [ 2018-09-05 ] | ||
|
The failover case is working. Now testing that MySQL replication works in all scenarios. | ||
| Comment by Daniel Lee (Inactive) [ 2018-09-14 ] | ||
|
Build tested: 1.1.7-1 (built on Sept 13, 2018)
Stack: 2UM3PM
Issue found
Other tests worked fine: | ||
| Comment by David Hill (Inactive) [ 2018-09-15 ] | ||
|
https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/561 | ||
| Comment by Daniel Lee (Inactive) [ 2018-09-18 ] | ||
|
Build verified: 1.1.7-1 (released to QA on 09/17/2018)
Verified the following items:
1) rename DDLProc and pkill DDLProc to cause failover |