MariaDB ColumnStore
MCOL-4939

Add a method to disable failover facility in CMAPI.

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 6.1.1
    • Fix Version: cmapi-6.4.1
    • Component: cmapi
    • Labels: None
    • Sprint: 2021-15, 2021-16, 2021-17

    Description

      There are known customer installations that don't use shared storage, so the failover mechanism might break such clusters.
      There must be a knob in the CMAPI configuration file to disable the failover facility if needed.

      New

      The following changes have been made:

      • Added an [application] section with an auto_failover = False parameter to the default cmapi_server.conf.
      • Failover is now turned off by default, even if there is no [application] section or no auto_failover parameter in cmapi_server.conf.
      • Failover now has three logical states:
        • Turned off: no failover thread is started. To turn it on, set auto_failover=True in the [application] section of the cmapi_server.conf file on each node and restart CMAPI (see the sketch after this list).
        • Turned on and inactive: the failover thread exists but does not act. It becomes active automatically if the node count is >= 3.
        • Turned on and active: the failover thread is running and monitoring the cluster. It is deactivated automatically if the node count drops below 3.
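      A minimal sketch of what turning failover back on could look like on one node, assuming the default config path /etc/columnstore/cmapi_server.conf, that no [application] section exists there yet, and the restart command used elsewhere in this ticket:

      # Append the [application] section with failover enabled, then restart
      # CMAPI so it picks up the change. Repeat on every node of the cluster.
      sudo tee -a /etc/columnstore/cmapi_server.conf <<'EOF'
      [application]
      auto_failover = True
      EOF
      sudo systemctl restart mariadb-columnstore-cmapi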

      Activity

            drrtuy Roman created issue -
            gdorman Gregory Dorman (Inactive) made changes -
            Field Original Value New Value
            Rank Ranked higher
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-15 [ 587 ] 2021-15, 2021-16 [ 587, 598 ]
            alan.mologorsky Alan Mologorsky made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            alan.mologorsky Alan Mologorsky made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            drrtuy Roman added a comment - - edited

            4QA: Previously there was no way to disable failover for clusters with >= 3 nodes. This significantly affects clusters with non-shared storage.

            drrtuy Roman made changes -
            Status In Review [ 10002 ] In Testing [ 10301 ]
            drrtuy Roman made changes -
            Assignee Roman [ drrtuy ] Daniel Lee [ dleeyh ]
            dleeyh Daniel Lee (Inactive) made changes -
            Rank Ranked higher
            alan.mologorsky Alan Mologorsky made changes -
            alan.mologorsky Alan Mologorsky made changes -
            Comment [ Fixed is_shared_storage behaviour.
            Now failover uses the SM config file to check whether storage is shared or not.
            Previously the 'yes' value was hardcoded. ]

            alan.mologorsky Alan Mologorsky added a comment -

            dleeyh
            The following changes have been made:

            • Added an [application] section with an auto_failover = False parameter to the default cmapi_server.conf.
            • Failover is now turned off by default, even if there is no [application] section or no auto_failover parameter in cmapi_server.conf.
            • Failover now has three logical states:
              • Turned off: no failover thread is started. To turn it on, set auto_failover=True in the [application] section of the cmapi_server.conf file on each node and restart CMAPI.
              • Turned on and inactive: the failover thread exists but does not act. It becomes active automatically if the node count is >= 3.
              • Turned on and active: the failover thread is running and monitoring the cluster. It is deactivated automatically if the node count drops below 3.

            dleeyh Daniel Lee (Inactive) added a comment -

            Build tested: ColumnStore Engine (build 3561)
            CMAPI (585)

            Test cluster: 3-node cluster

            Reproduced the reported issue in 6.2.2-1, with CMAPI 1.6 as released.

            Non-shared storage (local dbroot)
            With auto_failover=False, failover did not occur when PM1 was suspended.

            With auto_failover=True, this is a misconfiguration since non-shared storage is used. When PM1 was suspended, I expected failover to occur and the cluster to end up in a non-operational state, but it did not occur. Was it because CMAPI detected non-shared storage and did not kick off the failover process, or did failover simply not occur?
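            One possible way to check what Storage Manager is configured for is sketched below; the /etc/columnstore/storagemanager.cnf path and the service key are assumptions based on a stock ColumnStore install, not something spelled out in this ticket:

            # "service = LocalStorage" indicates local (non-shared) dbroots,
            # while "service = S3" indicates shared object storage.
            grep -i '^service' /etc/columnstore/storagemanager.cnf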

            Glusterfs
            With auto_failover=False, failover did not occur when PM1 was suspended.
            With auto_failover=True, I expected failover to occur, with PM2 taking over as the master node. It did not happen.
            The same test worked in 6.1.1 and 6.2.2.

            dleeyh Daniel Lee (Inactive) made changes -
            Assignee Daniel Lee [ dleeyh ] Alan Mologorsky [ JIRAUSER49150 ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-15, 2021-16 [ 587, 598 ] 2021-15, 2021-16, 2021-17 [ 587, 598, 614 ]
            alan.mologorsky Alan Mologorsky made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            David.Hall I moved this to testing. Was that incorrect?

            Is the action item for alan.mologorsky instead?

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            alan.mologorsky I moved it to testing by mistake.

            Please post the status.
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            alan.mologorsky Alan Mologorsky made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            alan.mologorsky Alan Mologorsky made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Daniel Lee [ dleeyh ]
            dleeyh Daniel Lee (Inactive) made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            dleeyh Question: did all 3 scenarios work in the previous version? Which one?
            alan.mologorsky please see Daniel's request for steps.

            Note that we are discussing a possible regression in the overall failover functionality in this release, not the actual change that Alan made on this ticket.
            toddstoffel should MaxScale be involved in this?
            Eventually we will need to review this in Sky as well, petko.vasilev.
            dleeyh Daniel Lee (Inactive) added a comment - - edited

            Build tested: 6.3.1-1 (#4101), cmapi 1.6 (#580), 3-node glusterfs

            alexey.vorovich Yes, it was tested before and it worked fine. Just in case, I retested the same build of ColumnStore 6.3.1-1 using an older build of CMAPI.

            1. When restarting ColumnStore (mcsShutdown and mcsStart, not a failover situation), PM1 remained as the master node. ColumnStore continued to function properly as expected.

            2. Failover scenario: PM1 also came back as the master node. ColumnStore continued to function properly as expected.

            MariaDB [mytest]> select count(*) from lineitem;
            +----------+
            | count(*) |
            +----------+
            |  6001215 |
            +----------+
            1 row in set (0.186 sec)
             
            MariaDB [mytest]> create table t1 (c1 int) engine=columnstore;
            Query OK, 0 rows affected (1.619 sec)
             
            use near 'table t1 values (1)' at line 1
            MariaDB [mytest]> insert t1 values (1);
            Query OK, 1 row affected (0.159 sec)
             
            MariaDB [mytest]> insert t1 values (2);
            Query OK, 1 row affected (0.074 sec)
             
            MariaDB [mytest]> select * from t1;
            +------+
            | c1   |
            +------+
            |    1 |
            |    2 |
            +------+
            2 rows in set (0.521 sec)
            

            dleeyh Daniel Lee (Inactive) made changes -
            Assignee Daniel Lee [ dleeyh ] Alan Mologorsky [ JIRAUSER49150 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            Guys, I would suggest this.

            Start with something that works for at least one of you: "ColumnStore 6.3.1-1 using an older build of CMAPI".

            Indeed, Daniel, please create an annotated script in your repo to do the actions for tests 1, 2, and 3, to minimize the chance of miscommunication.

            Then Alan can try to repeat them on the version he believes cannot work, and we go from there.

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            And please always use build numbers in notes instead of "latest" and "older build".

            dleeyh Daniel Lee (Inactive) added a comment -

            I always start my comments for a test with a line like the following:

            Build tested: 6.3.1-1 (#4101), cmapi 1.6 (#580)

            When closing a ticket, I use:

            Build verified: 6.3.1-1 (#4101), cmapi 1.6 (#580)

            The number in () is the build number from Drone.

            For example, the CMAPI build that I had issues with was "cmapi 1.6.2 (#612)" and the older one that I retested was "cmapi 1.6 (#580)".
            alexey.vorovich alexey vorovich (Inactive) added a comment - - edited

            alan.mologorsky
            Yes, in the future we will share QA scripts (and they will include k8s commands as well).

            For today, test 1 has these 4 steps that you can run one after the other. Please try them with "cmapi 1.6 (#580)" and see if you can confirm what Daniel is seeing (he sees it working and you believe it cannot work). Let's start from there.

            1. Set auto_failover to True in /etc/columnstore/cmapi_server.conf on all nodes
            2. "systemctl restart mariadb-columnstore-cmapi" on all nodes
            3. mcsShutdown
            4. mcsStart

            As a side note: commands that are supposed to be run on multiple separate hosts could be executed via a loop of SSH or via kubectl (a rough sketch of the SSH approach follows below). We will need to decide how to do that in the future. There is also the SkyTf framework for this, created by georgi.harizanov.
            Eventually we will integrate multi-node tests with that framework. For now, this is just a heads-up.
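            A minimal sketch of that SSH loop for the four steps above; the host names, root SSH access, and the assumption that an auto_failover line already exists in the [application] section are all hypothetical, not part of this ticket:

            # Set auto_failover = True and restart CMAPI on every node,
            # then restart ColumnStore once from the primary node.
            NODES="pm1 pm2 pm3"
            for host in $NODES; do
              ssh "root@$host" "sed -i 's/^auto_failover *=.*/auto_failover = True/' /etc/columnstore/cmapi_server.conf && systemctl restart mariadb-columnstore-cmapi"
            done
            ssh "root@pm1" "mcsShutdown && mcsStart"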


            alexey.vorovich alexey vorovich (Inactive) added a comment -

            I kind of doubt that Rocky and the types of data loaded are important, but they could be.

            What is important is to have the same common scripts to install the system to begin with.

            If Daniel has these scripts, then Alan should use them. I will try this install script as well.

            After these scripts are used, I would start with the older build that works for both and then move to newer builds.
            alan.mologorsky Alan Mologorsky made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            Well, if Alan can reproduce the problem using a separate setup, then good.

            However, I would definitely invest in a common installation script.

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            We have 3 candidates for a common install script for multi-node:

            1. Daniel's QA setup. Needs work, as per Daniel, to make it really standalone.
            2. Direct MOE that brings up a cluster with pods. Pending this week, I hope and pray.
            3. Docker Compose from toddstoffel. If this supports shared disk, then we could use it to start/test/validate and share an identical setup between different people, short term and maybe long term as well.

            Todd, do you agree?

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            The link leads to step 2 of a 9-step procedure. Many of these steps require the user to execute commands on each host; this takes a lot of time and leaves many opportunities for error.

            By contrast, Direct MOE will allow something along these lines:

            moe -topology CS -replicas 3 -nodetype verylarge

            This will create all the nodes, S3, NFS, config files, etc.

            I suspect the Docker Compose approach is simple to use; Todd will clarify.

            My concern is actually debugging: how can one do symbolic debugging in Docker? There are tools for that as well (at least for Python):

            https://code.visualstudio.com/docs/containers/debug-common

            georgi.harizanov Georgi Harizanov (Inactive) added a comment -

            alan.mologorsky can you point me to a repo where your scripts are so I can have a look?
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Description: appended a "New" section listing the changes made (the [application] section with auto_failover, the default-off behavior, and the three logical failover states).
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Description: reformatted the "New" section as a bulleted list under "Changes has been made:".
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Description: emphasized that failover is *turned off by default*.

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            alan.mologorsky did I summarize the meeting correctly? If so, please do the reversal of the default and pass it to dleeyh so that we can move on.

            toddstoffel Here is a suggestion/question from gdorman: what if we ALWAYS require MaxScale to be present to enable HA? This is currently the case for Sky.

            This would reduce the number of options. Before we discuss this in dev, what is our take from the PM point of view?
            drrtuy Roman added a comment -

            alexey.vorovich As a result, we decided that I will ask toddstoffel offline about the default behavior and wait for his decision.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s cmapi-1.7 [ 26900 ]
            Fix Version/s 6.3.1 [ 25801 ]
            nedyalko.petrov Nedyalko Petrov (Inactive) made changes -
            drrtuy Roman made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            drrtuy Roman made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            drrtuy Roman made changes -
            Assignee Roman [ drrtuy ] Daniel Lee [ dleeyh ]
            drrtuy Roman added a comment -

            4QA: Please use the latest CMAPI build.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s cmapi-1.6.3 [ 27900 ]
            Fix Version/s cmapi-6.4.1 [ 26900 ]

            dleeyh Daniel Lee (Inactive) added a comment -

            Build tested: 6.3.1-1 (#4234), CMAPI-1.6.3 (619)

            Cluster: 3-node

            Test #1, PASSED
            Default auto_failover value

            In /etc/columnstore/cmapi_server.conf, auto_failover is set to True by default.
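            A simple way to confirm the shipped default on a node (assuming the stock config path; the section and key names come from this ticket):

            # Show the [application] section and the line that follows it
            grep -A1 '^\[application\]' /etc/columnstore/cmapi_server.conf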

            ------

            Test #2, PASSED
            Setup: no shared storage
            auto_failover: False

            This is the use case in which the user does not have a shared storage setup and failover is not desired.

            Failover did not occur

            ------

            Test #3, PASSED
            Setup: gluster
            auto_failover: False

            This is the use case in which the user has a glusterfs setup and failover IS NOT desired.

            Failover did not occur

            ------

            Test #4, FAILED
            Setup: gluster
            auto_failover: True

            This is the use case in which the user has a glusterfs setup and failover IS desired.

            Observation:

            When PM1, which is the master node, was taken offline:
            mcsStatus on PM2 showed PM2 as the master and PM3 as a slave (2-node cluster).
            MaxScale showed PM2 as the master and PM1 as down.
            So far, this is expected.

            When PM1 was put back online:
            mcsStatus showed that PM1 eventually became the master node again,
            but MaxScale still showed PM2 as the master node.

            It was expected that ColumnStore would set the master according to what MaxScale selected, but this did not happen. Now ColumnStore and MaxScale are out of sync.
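            A quick way to compare the two views in this scenario is sketched below; it assumes maxctrl is available on the MaxScale host and mcsStatus is run on a PM node (the exact invocations are not spelled out in this ticket):

            # MaxScale's view: which backend server is currently Master
            maxctrl list servers
            # ColumnStore's view: which node reports dbrm_mode "master"
            mcsStatus | grep dbrm_mode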

            ------

            At the time of this writing, the fixVersion of the ticket has been set to cmapi-1.6.3, but the package has been named for 1.6.2, such as MariaDB-columnstore-cmapi-1.6.2.x86_64.rpm. The package name should be corrected.

            dleeyh Daniel Lee (Inactive) made changes -
            Assignee Daniel Lee [ dleeyh ] Todd Stoffel [ toddstoffel ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Comment [ A comment with security level 'Developers' was removed. ]
            alexey.vorovich alexey vorovich (Inactive) added a comment - - edited

            Guys,

            1. I tend to agree that the default file section created at a new install should be empty.
            2. Let's go back to Test #4, which failed, in https://jira.mariadb.org/browse/MCOL-4939?focusedCommentId=219832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-219832

            I am trying to repro it myself, but it has been inconclusive so far.

            dleeyh and toddstoffel

            Besides the discrepancy between MCS and MXS with respect to the master node choice, what issues with DDL/DML updates do we observe?
            Please list what has been found. My understanding is that MaxScale will direct updates to PM2.

            Also, Daniel, for whatever symptoms we see, please confirm in which old release we did not see them.

            alan.mologorsky drrtuy gdorman FYI

            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee Todd Stoffel [ toddstoffel ] Alan Mologorsky [ JIRAUSER49150 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Comment [ A comment with security level 'Developers' was removed. ]
            alan.mologorsky Alan Mologorsky made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            alan.mologorsky dleeyh I opened a new ticket, https://jira.mariadb.org/browse/MCOL-5052, for that mismatch discussion.

            The only remaining item here is for Alan and is described above.

            dleeyh Daniel Lee (Inactive) added a comment -

            Build tested: 6.3.1-1 (#4234), CMAPI-1.6.3-1 (#623)

            Preliminary test results for failover behavior. More functional tests will be done.

            3-node cluster, with gluster, schema replication, MaxScale

            For each of the following tests, a newly installed 3-node cluster is used.

            Test #1
            Default installation: the auto_failover parameter has been removed from /etc/columnstore/cmapi_server.conf, so the default behavior is auto failover enabled.

            Failover now works the same way as it used to. When PM1 was put back online, PM2 remained as the master node, in sync with MaxScale.

            Test #2
            On each node, added the following to /etc/columnstore/cmapi_server.conf

            [application]
            auto_failover = False
            

            and restarted cmapi

            systemctl restart mariadb-columnstore-cmapi
            

            mcsStatus on all three (3) nodes showed that there is only one (1) node (pm1) in the cluster; pm2 and pm3 are no longer part of the cluster. The output looks like the following:

            [rocky8:root~]# mcsStatus
            {
              "timestamp": "2022-04-14 00:43:02.932548",
              "s1pm1": {
                "timestamp": "2022-04-14 00:43:02.938951",
                "uptime": 1149,
                "dbrm_mode": "master",
                "cluster_mode": "readwrite",
                "dbroots": [],
                "module_id": 1,
                "services": [
                  {
                    "name": "workernode",
                    "pid": 9290
                  },
                  {
                    "name": "controllernode",
                    "pid": 9301
                  },
                  {
                    "name": "PrimProc",
                    "pid": 9317
                  },
                  {
                    "name": "ExeMgr",
                    "pid": 9365
                  },
                  {
                    "name": "WriteEngine",
                    "pid": 9382
                  },
                  {
                    "name": "DDLProc",
                    "pid": 9413
                  }
                ]
              },
              "num_nodes": 1
            }
            

            I tried the same test again and all nodes returned something like the following:

            [rocky8:root~]# mcsStatus
            {
              "timestamp": "2022-04-14 01:46:02.956786",
              "s1pm1": {
                "timestamp": "2022-04-14 01:46:02.963366",
                "uptime": 1631,
                "dbrm_mode": "offline",
                "cluster_mode": "readonly",
                "dbroots": [],
                "module_id": 1,
                "services": []
              },
              "num_nodes": 1
            }
            

            Failover was not tested since there is only one node in the cluster now.

            Test #3
            On each node, added the following to /etc/columnstore/cmapi_server.conf

            [application]
            auto_failover = True
            

            and restarted cmapi

            systemctl restart mariadb-columnstore-cmapi
            

            I got the same result as in Test #1 above.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s cmapi-6.4.1 [ 26900 ]
            Fix Version/s cmapi-1.6.3 [ 27900 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee Roman [ drrtuy ] Alan Mologorsky [ JIRAUSER49150 ]
            drrtuy Roman made changes -
            Status In Review [ 10002 ] In Testing [ 10301 ]
            drrtuy Roman made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            drrtuy Roman made changes -
            Assignee Roman [ drrtuy ] Daniel Lee [ dleeyh ]

            dleeyh Daniel Lee (Inactive) added a comment -

            Build verified: ColumnStore 6.3.1-1 (#4278), cmapi (#625)

            Following the steps above and using the new cmapi build, test #2 worked as expected: failover did not take place, as it is disabled in the cmapi_server.conf file.

            dleeyh Daniel Lee (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status In Testing [ 10301 ] Closed [ 6 ]
            dleeyh Daniel Lee (Inactive) added a comment - - edited

            Build verified: ColumnStore 6.3.1-1 (#4299), cmapi 1.6.3 (#626)

            The cmapi package name has been corrected from 1.6.2 to 1.6.3: MariaDB-columnstore-cmapi-1.6.3-1.x86_64.rpm.

            Verified along with the latest build of ColumnStore, using a 3-node Docker cluster.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            toddstoffel Todd Stoffel (Inactive) made changes -
            toddstoffel Todd Stoffel (Inactive) made changes -
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher

            People

              Assignee: dleeyh Daniel Lee (Inactive)
              Reporter: drrtuy Roman
              Votes: 0
              Watchers: 7

