[MCOL-1523] OAM Process failover logic for DDLproc is incorrect - causing DDL to stop working Created: 2018-07-02  Updated: 2023-10-26  Resolved: 2018-09-18

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.1.5
Fix Version/s: 1.1.7

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

2 UM, 3 PM system on Red Hat 7


Issue Links:
Blocks
Sprint: 2018-14, 2018-15, 2018-16, 2018-17

 Description   

Customer reported problems with DDL commands: CREATE TABLE was failing. Investigation showed that the DDLProc IP address was pointing to the wrong UM module. It was pointing to UM2 but should have been pointing to UM1.

The issue is caused by the OAM process failover logic.

It's a DDLProc crash and failover: DDLProc on um1 is crashing and we are trying to start DDLProc on um2, and that changes the IP.

Jun 28 23:57:12 x01sibgadb3a ProcessMonitor[7877]: 12.138592 |0|0|0| D 18 CAL0000: statusControl: Set Process um1/DDLProc State = AUTO_OFFLINE PID = 0

Jun 28 23:57:13 x01sibgadb3a ProcessManager[8303]: 13.192889 |0|0|0| D 17 CAL0000: setPMProcIPs called for um2
Jun 28 23:57:13 x01sibgadb3a ProcessManager[8303]: 13.195507 |0|0|0| D 17 CAL0000: setPMProcIPs: DDLProc to 10.91.134.124
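
For anyone checking their own logs for the same symptom, here is a minimal sketch (not part of the fix) that pulls the setPMProcIPs assignments out of ProcessManager log text. It assumes that a `setPMProcIPs: <process> to <ip>` line belongs to the most recent `setPMProcIPs called for <module>` line; the function name and regular expressions are illustrative only.

```python
import re

# The two ProcessManager lines quoted above, used as sample input.
SAMPLE_LOG = """\
Jun 28 23:57:13 x01sibgadb3a ProcessManager[8303]: 13.192889 |0|0|0| D 17 CAL0000: setPMProcIPs called for um2
Jun 28 23:57:13 x01sibgadb3a ProcessManager[8303]: 13.195507 |0|0|0| D 17 CAL0000: setPMProcIPs: DDLProc to 10.91.134.124
"""

CALLED_RE = re.compile(r"setPMProcIPs called for (\S+)")
ASSIGN_RE = re.compile(r"setPMProcIPs: (\S+) to (\S+)")


def proc_ip_changes(log_text):
    """Return (module, process, ip) tuples, one per setPMProcIPs assignment."""
    changes = []
    module = None  # module named by the most recent "called for" line
    for line in log_text.splitlines():
        called = CALLED_RE.search(line)
        if called:
            module = called.group(1)
            continue
        assign = ASSIGN_RE.search(line)
        if assign:
            changes.append((module, assign.group(1), assign.group(2)))
    return changes


if __name__ == "__main__":
    for module, process, ip in proc_ip_changes(SAMPLE_LOG):
        print(f"{process} repointed to {ip} (module {module})")
```

Any DDLProc entry whose module is not the intended primary UM is worth a closer look.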



 Comments   
Comment by David Hill (Inactive) [ 2018-07-31 ]

How to test: start with a 2 UM, 1+ PM system with MariaDB replication enabled.

On um1, kill DDLProc and prevent it from restarting:

  1. mv DDLProc DDLProc.save
  2. pkill DDLProc

Because DDLProc now fails to restart on um1, um1 will be disabled and um2 will become the master UM with DDLProc and DMLProc active.

Make sure you can create tables on um2.

Bring um1 back into service:

  1. mv DDLProc.save DDLProc

On pm1:

  1. ma altersystem-enablemodule um1
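
The procedure above can be written down as an ordered dry-run plan, which is handy when repeating the test across builds. This is only a sketch: it executes nothing, the function name is hypothetical, and the `cd` into a bin directory is an assumption (the default path is the one that appears in a later comment on this issue and may differ on other installs).

```python
def failover_test_plan(bin_dir="/usr/local/mariadb/columnstore/bin"):
    """Return the DDLProc failover test as ordered (host, action) pairs.

    Nothing is executed; the caller decides how to run each step.
    bin_dir is an assumed location of the DDLProc binary.
    """
    return [
        # Stop DDLProc on um1 and keep it from restarting.
        ("um1", f"cd {bin_dir} && mv DDLProc DDLProc.save"),
        ("um1", "pkill DDLProc"),
        # um2 should now be the master UM.
        ("um2", "verify: create a table"),
        # Bring um1 back into service.
        ("um1", f"cd {bin_dir} && mv DDLProc.save DDLProc"),
        ("pm1", "ma altersystem-enablemodule um1"),
    ]


if __name__ == "__main__":
    for step, (host, action) in enumerate(failover_test_plan(), start=1):
        print(f"{step}. [{host}] {action}")
```
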
Comment by David Hill (Inactive) [ 2018-07-31 ]

https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/529

Comment by Daniel Lee (Inactive) [ 2018-08-09 ]

Build tested: 1.1.6-1 source

/root/columnstore/mariadb-columnstore-server
commit 513775738f72ec990d055a5d47e2511e3c0e34dd
Merge: 3c37210 9236098
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Wed Jul 18 09:37:17 2018 +0100

Merge pull request #123 from drrtuy/MCOL-970

MCOL-970 Slow query log now contains original query even in vtable mode

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit ee40c3ac050ad7b64302673fc4ab08640f64892f
Merge: 0df1b92 979d00a
Author: benthompson15 <ben.thompson@mariadb.com>
Date: Mon Aug 6 13:02:08 2018 -0500

Merge pull request #523 from mariadb-corporation/MCOL-1579

MCOL-1579 Remove chmod of /dev/shm

Could not reproduce the same issue on 1.1.5-1. Instead, DDLProc failed on UM1, UM1 did not enter a disabled state, and DDLProc and DMLProc on UM2 remained in COLD_STANDBY state.

Tested on 1.1.6-1. The following happened:

1) DDLProc and DMLProc on UM1 failed over to UM2
2) UM1 was in disabled state
3) DDLProc and DMLProc on UM2 started
4) On UM2, I was able to create a table and insert a row
5) On UM2, the query failed

MariaDB [mytest]> create table t (c1 int, c2 char(50)) engine=columnstore;
Query OK, 0 rows affected (0.91 sec)

MariaDB [mytest]> insert into t values (1, 'hello');
Query OK, 1 row affected (0.48 sec)

MariaDB [mytest]> select * from t;
ERROR 1815 (HY000): Internal error: The system is not yet ready to accept queries

Reenabling UM1 from PM1 hung and timed out.

mcsadmin> altersystem-enablemodule um1
altersystem-enablemodule Thu Aug 9 15:35:03 2018

This command starts the processing of applications on a Module within the MariaDB ColumnStore System
Do you want to proceed: (y or n) [n]: y

Enabling Modules
Successful enable of Modules

Starting Modules

        • startModule Failed : Timeout error from startModule API

At this point, status on UMs:

Process            Module Status          Last Status Change       Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor     um1    ACTIVE          Thu Aug 9 15:04:24 2018  3307
ServerMonitor      um1    ACTIVE          Thu Aug 9 15:35:49 2018  15643
DBRMWorkerNode     um1    ACTIVE          Thu Aug 9 15:35:49 2018  15668
ExeMgr             um1    ACTIVE          Thu Aug 9 15:35:54 2018  15704
DDLProc            um1    COLD_STANDBY    Thu Aug 9 15:35:56 2018
DMLProc            um1    COLD_STANDBY    Thu Aug 9 15:35:56 2018
mysqld             um1    ACTIVE          Thu Aug 9 15:35:46 2018  15578

ProcessMonitor     um2    ACTIVE          Thu Aug 9 15:04:16 2018  3306
ServerMonitor      um2    ACTIVE          Thu Aug 9 15:04:41 2018  3752
DBRMWorkerNode     um2    ACTIVE          Thu Aug 9 15:35:16 2018  12740
ExeMgr             um2    ACTIVE          Thu Aug 9 15:24:11 2018  11344
DDLProc            um2    ACTIVE          Thu Aug 9 15:24:26 2018  11435
DMLProc            um2    MAN_OFFLINE     Thu Aug 9 15:35:37 2018
mysqld             um2    ACTIVE          Thu Aug 9 15:24:09 2018  11277
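
To spot the odd rows in a listing like this without reading it by eye, a small parser helps. The sketch below is an illustration, not an OAM tool: it assumes each data row is whitespace-separated as process, module, status, a five-token timestamp, and an optional PID, and that module names start with `um` or `pm`. The function names are hypothetical. Note that COLD_STANDBY on the standby UM is normal, so a non-ACTIVE row is a prompt to check, not proof of a fault.

```python
# Four rows copied from the listing above, used as sample input.
SAMPLE_STATUS = """\
DDLProc um1 COLD_STANDBY Thu Aug 9 15:35:56 2018
DMLProc um1 COLD_STANDBY Thu Aug 9 15:35:56 2018
DDLProc um2 ACTIVE Thu Aug 9 15:24:26 2018 11435
DMLProc um2 MAN_OFFLINE Thu Aug 9 15:35:37 2018
"""


def parse_process_status(text):
    """Parse process status rows into dicts; header and rule lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 8 fields (no PID) or 9 fields (with PID).
        if len(fields) not in (8, 9) or not fields[1].startswith(("um", "pm")):
            continue
        rows.append({
            "process": fields[0],
            "module": fields[1],
            "status": fields[2],
            "changed": " ".join(fields[3:8]),
            "pid": int(fields[8]) if len(fields) == 9 else None,
        })
    return rows


def not_active(rows):
    """Return (process, module, status) for every row whose status is not ACTIVE."""
    return [(r["process"], r["module"], r["status"])
            for r in rows if r["status"] != "ACTIVE"]


if __name__ == "__main__":
    for process, module, status in not_active(parse_process_status(SAMPLE_STATUS)):
        print(f"{module}/{process}: {status}")
```

In the listing above this highlights the mismatch on um2, where DDLProc is ACTIVE but DMLProc is MAN_OFFLINE.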

I was able to log in to the mysql client on UM1, but the table I created on UM2 was not there.

Schema sync was enabled during postConfigure.

Comment by David Hill (Inactive) [ 2018-08-10 ]

I did reproduce Daniel's issue.

MariaDB [(none)]> use david
Database changed
MariaDB [david]> create table tmp (c1 int) engine=columnstore;
Query OK, 0 rows affected (0.67 sec)

MariaDB [david]> insert into tmp values (1);
Query OK, 1 row affected (0.14 sec)

MariaDB [david]> select * from tmp;
ERROR 1815 (HY000): Internal error: The system is not yet ready to accept queries
MariaDB [david]>

Comment by David Hill (Inactive) [ 2018-08-10 ]

fixed...

MariaDB [(none)]> use david
Database changed
MariaDB [david]> create table tmp1 (c1 int) engine=columnstore;
Query OK, 0 rows affected (0.62 sec)

MariaDB [david]> insert into tmp1 values (1);
Query OK, 1 row affected (0.13 sec)

MariaDB [david]> select * from tmp;
+------+
| c1   |
+------+
|    1 |
+------+
1 row in set (0.07 sec)

MariaDB [david]>

Comment by David Hill (Inactive) [ 2018-08-10 ]

1. Fixed the query issue.
2. Found and fixed another issue: after a failover that disabled um1, um1 failed to return to the enabled state because of a code bug.

Comment by David Hill (Inactive) [ 2018-08-10 ]

https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/536

Comment by Daniel Lee (Inactive) [ 2018-08-22 ]

Build tested: 1.1.6-1 source

/root/columnstore/mariadb-columnstore-server
commit bab181f892fdbfba7ca287115bd26581c3bd2e67
Merge: 5137757 a035b4a
Author: David.Hall <david.hall@mariadb.com>
Date: Wed Aug 15 10:22:39 2018 -0500

Merge pull request #126 from mariadb-corporation/MCOL-1615

MCOL-1615 Merge MariaDB 10.2.17 into develop-1.1

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit fdb4ef7f796d9c8ad664b71544743da6f64f480d
Merge: b5c3888 a98aec0
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Fri Aug 17 08:02:02 2018 +0100

Merge pull request #542 from drrtuy/MCOL-1655

MCOL-1655 removed hardcoded %debug from ddl.y.

1) DDLProc and DMLProc on UM1 failed over to UM2
2) UM1 was in disabled state
3) DDLProc and DMLProc on UM2 started
4) On UM2, I was able to create a table and insert a row
5) On UM2, the query was successful (it failed during my last test; it gets further now with this release)
6) On PM1, ran altersystem-enablemodule um1; the stack seemed to be normal according to the output of getsystemstatus
7) When starting the admin console (ma), there is an alarm about DBRM being read-only

AlarmID = 31
Brief Description = DBRM_READ_ONLY
Alarm Severity = CRITICAL
Time Issued = Wed Aug 22 20:44:34 2018
Reporting Module = pm1
Reporting Process = DBRMControllerNode
Reported Device = System

Comment by David Hill (Inactive) [ 2018-08-28 ]

I am able to reproduce the issue.

mcsadmin> altersystem-di um1 y
altersystem-disablemodule Tue Aug 28 20:56:37 2018

This command stops the processing of applications on the Primary User Module, which is where DDL/DML are performed
If there is another module that can be changed to a new Primary User Module, this will be done
Do you want to proceed: (y or n) [n]: y

Stopping Modules
Successful stop of Modules

Disabling Modules
Successful disable of Modules

New Primary User Module = um2

All good so far.

mcsadmin> altersystem-en um1 y
altersystem-enablemodule Tue Aug 28 20:57:34 2018

Enabling Modules
Successful enable of Modules

Starting Modules
Successful start of Modules

Aug 28 20:57:56 ip-172-31-46-144 controllernode[16164]: 56.957547 |0|0|0| C 29 CAL0000: DBRM Controller: network error distributing command to worker 2
Aug 28 20:57:56 ip-172-31-46-144 controllernode[16164]: 56.962575 |0|0|0| C 29 CAL0000: DBRM Controller: undo(): warning, could not contact worker number 2
Aug 28 20:57:56 ip-172-31-46-144 controllernode[16164]: 56.962870 |0|0|0| C 29 CAL0000: DBRM Controller: Caught network error. Sending command 17, length 1. Setting read-only mode.
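
A quick way to scan controllernode log text for this condition is sketched below. It is an illustration only: the function name and regular expression are assumptions, matched against the wording of the three messages quoted above and nothing else.

```python
import re

# The three controllernode lines quoted above, used as sample input.
SAMPLE_LOG = """\
Aug 28 20:57:56 ip-172-31-46-144 controllernode[16164]: 56.957547 |0|0|0| C 29 CAL0000: DBRM Controller: network error distributing command to worker 2
Aug 28 20:57:56 ip-172-31-46-144 controllernode[16164]: 56.962575 |0|0|0| C 29 CAL0000: DBRM Controller: undo(): warning, could not contact worker number 2
Aug 28 20:57:56 ip-172-31-46-144 controllernode[16164]: 56.962870 |0|0|0| C 29 CAL0000: DBRM Controller: Caught network error. Sending command 17, length 1. Setting read-only mode.
"""

# Matches both "worker 2" and "worker number 2" in DBRM Controller messages.
WORKER_RE = re.compile(r"DBRM Controller: .*\bworker (?:number )?(\d+)")


def summarize_dbrm(log_text):
    """Return (sorted worker numbers mentioned in errors, read_only flag)."""
    workers = set()
    read_only = False
    for line in log_text.splitlines():
        match = WORKER_RE.search(line)
        if match:
            workers.add(int(match.group(1)))
        if "Setting read-only mode" in line:
            read_only = True
    return sorted(workers), read_only


if __name__ == "__main__":
    workers, read_only = summarize_dbrm(SAMPLE_LOG)
    print(f"workers with errors: {workers}, read-only set: {read_only}")
```
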

Comment by David Hill (Inactive) [ 2018-08-30 ]

Argh, this is tougher to fix than I would have thought. As part of the testing, I wanted to make sure that DDL, DML, and queries all worked after a disablemodule and enablemodule of a UM. It took a while, but I got disablemodule working. enablemodule still has an issue where I can't create a new table; still working on that. Once disable/enable is fully functioning, I will go back to the failover test cases.

Comment by David Hill (Inactive) [ 2018-09-04 ]

Still in progress.

The fix is in branch MCOL-1523. After disablemodule, I can still do create table, insert, and select, but enablemodule fails.

Comment by David Hill (Inactive) [ 2018-09-05 ]

Got both disablemodule and enablemodule working when run manually. Now testing the automatic disable and enable of a module to make sure create and query work.

Comment by David Hill (Inactive) [ 2018-09-05 ]

To get enablemodule to work, I changed the code to do a restartsystem as part of the enable module.

mcsadmin> altersystem-en um2 y
altersystem-enablemodule Wed Sep 5 14:47:28 2018

Enabling Modules
Successful enable of Modules

Restarting System ..
Successful restart of System

mcsadmin>

From um1:

mcsmysql < test.sql
c1
1
root@ip-172-31-36-5:/usr/local/mariadb/columnstore/bin# cat test.sql
use david;
create table tmp86 (c1 int) engine=columnstore;
insert into tmp86 values (1);
select * from tmp86;
root@ip-172-31-36-5:/usr/local/mariadb/columnstore/bin#

Comment by David Hill (Inactive) [ 2018-09-05 ]

The failover case is working. Now testing that MySQL replication works in all scenarios.

Comment by Daniel Lee (Inactive) [ 2018-09-14 ]

Build tested: 1.1.7-1 (built on Sept 13, 2018)

Stack: 2UM3PM

Issue found

  • On UM1, after renaming DDLProc to DDLProc.save and running pkill DDLProc, the processes appeared to fail over to UM2 correctly, but the system was stuck in the BUSY_INIT state. UM2 would not accept DDL statements or queries because of a "system not ready" error.

Other tests worked fine:
enable um1, after renaming DDLProc back
disable um2, failover to um1
enable um2

Comment by David Hill (Inactive) [ 2018-09-15 ]

https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/561

Comment by Daniel Lee (Inactive) [ 2018-09-18 ]

Build verified: 1.1.7-1 (released to QA on 09/17/2018)

Verified the following items:

1) rename DDLProc and pkill DDLProc to cause failover
2) disable and enable UMs
3) took one of the UMs out of service by suspending the VM and then resuming it
4) query, DDL, DML, cpimport after these failover tests
5) stop, shutdown, and start systems
6) suspending and resuming PM1. No failover occurred since I am using local storage; I just wanted to make sure it did not cause unexpected issues.
