[MCOL-4939] Add a method to disable failover facility in CMAPI. Created: 2021-12-06  Updated: 2022-10-25  Resolved: 2022-04-18
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | cmapi |
| Affects Version/s: | 6.1.1 |
| Fix Version/s: | cmapi-6.4.1 |
| Type: | New Feature | Priority: | Major |
| Reporter: | Roman | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: | |
| Sprint: | 2021-15, 2021-16, 2021-17 |
| Description |

There are known customer installations that don't use shared storage, so the failover mechanism can break such clusters. The following changes have been made:
| Comments |
| Comment by Roman [ 2021-12-14 ] |

For QA: previously there was no way to disable failover for clusters with >= 3 nodes. This strongly affects clusters with non-shared storage.
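As a sketch of what the new switch looks like, failover could be disabled via the CMAPI server config. The section name and key layout below are assumptions for illustration (the exact format may differ by CMAPI version); verify against your installed `/etc/columnstore/cmapi_server.conf`:

```ini
# /etc/columnstore/cmapi_server.conf (fragment)
# Assumed section/key names -- check your CMAPI version before relying on this.
[application]
auto_failover = False
```

The file would need to be edited on every node, and cmapi restarted, for the setting to take effect.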
| Comment by Alan Mologorsky [ 2021-12-17 ] |

dleeyh
| Comment by Daniel Lee (Inactive) [ 2021-12-17 ] |

Build tested: ColumnStore Engine (build 3561)
Test cluster: 3-node cluster

Reproduced the reported issue in 6.2.2-1, with CMAPI 1.6 as released.

Non-shared storage (local dbroot): with auto_failover=True this is a misconfiguration, since non-shared storage is used. When PM1 was suspended, I expected failover to occur and the cluster to end up in a non-operational state, but it did not occur. Was it because CMAPI detected non-shared storage and did not kick off the failover process, or did failover simply not occur?

Glusterfs
| Comment by alexey vorovich (Inactive) [ 2022-03-09 ] |

David.Hall I moved this to testing. Was that incorrect? Is the action item for alan.mologorsky instead?
| Comment by alexey vorovich (Inactive) [ 2022-03-15 ] |

alan.mologorsky I moved this to testing by mistake. Please post the status.
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

dleeyh Question: did all 3 scenarios work in the previous version? Which one? Note that we are discussing a possible regression in the overall failover functionality in this release, not the actual change that Alan made on this ticket.
| Comment by Daniel Lee (Inactive) [ 2022-03-23 ] |

Build tested: 6.3.1-1 (#4101), cmapi 1.6 (#580), 3-node glusterfs

alexey.vorovich Yes, it was tested before and it worked fine. Just in case, I retested the same build of ColumnStore 6.3.1-1 using an older build of CMAPI.

1. When restarting ColumnStore (mcsShutdown and mcsStart, not a failover situation), PM1 remained the master node. ColumnStore continued to function properly as expected.
2. In the failover scenario, PM1 also came back as the master node. ColumnStore continued to function properly as expected.
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

Guys, I would suggest this: start with something that works for at least one of you, "ColumnStore 6.3.1-1 using an older build of CMAPI." Indeed, Daniel, please create an annotated script in your repo to perform the actions for tests 1, 2, and 3, to minimize the chance of miscommunication. Then Alan can try to repeat them on the version he believes cannot work, and we go from there.
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

And please always use build numbers in notes instead of "latest" and "older build".
| Comment by Daniel Lee (Inactive) [ 2022-03-23 ] |

I always start my comments for a test with a line like the following:

Build tested: 6.3.1-1 (#4101), cmapi 1.6 (#580)

When closing a ticket, I use:

Build verified: 6.3.1-1 (#4101), cmapi 1.6 (#580)

The number in () is the build number from Drone. For example, the CMAPI build that I had issues with was "cmapi 1.6.2 (#612)" and the older one that I retested was "cmapi 1.6 (#580)".
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

alan.mologorsky For today, test 1 has these 4 steps that you can run one after the other. Please try these with "cmapi 1.6 (#580)" and see if you can confirm what Daniel is seeing (he sees it working and you believe it cannot work). Let's start from there.

1. Set auto_failover to True in /etc/columnstore/cmapi_server.conf on all nodes

As a side note: running commands that are supposed to execute on multiple separate hosts could be done via a loop of SSH or via kubectl. We will need to decide how to do that in the future. There is also the SkyTf framework for this, created by georgi.harizanov.
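The SSH-loop idea in the side note can be sketched as below. The host names (pm1..pm3), the systemd unit name, and the DRY_RUN switch are assumptions for illustration, not a tested deployment script; by default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: apply the same config change on every node via an SSH loop.
# Host names (pm1..pm3) and the systemd unit name are assumptions.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {  # run "<host>" "<command>": echo in dry-run mode, otherwise ssh
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run on $1: $2"
  else
    ssh "$1" "$2"
  fi
}

for host in pm1 pm2 pm3; do
  # Step 1 above: set auto_failover = True in cmapi_server.conf
  run "$host" "sudo sed -i 's/^auto_failover *=.*/auto_failover = True/' /etc/columnstore/cmapi_server.conf"
  # Restart cmapi so the setting takes effect
  run "$host" "sudo systemctl restart mariadb-columnstore-cmapi"
done
```

Set DRY_RUN=0 to actually execute over SSH; the same loop shape works for any per-node command.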
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

I rather doubt that Rocky and the types of data loaded are important, though they could be. What is important is to have the same common scripts to install the system to begin with. If Daniel has these scripts, then Alan should use them. I will try the install script as well. After these scripts are used, I would start with the older build that works for both and then move to newer builds.
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

Well, if Alan can reproduce the problem using a separate setup, then good. However, I would definitely invest in a common installation script.
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

We have 3 candidates for a common multi-node install script.

Todd, do you agree?
| Comment by alexey vorovich (Inactive) [ 2022-03-23 ] |

The link leads to step 2 of a 9-step procedure. Many of these steps require the user to execute commands on each host; this takes much time and leaves many possibilities for error. By contrast, the Direct MOE would allow something along these lines:

moe -topology CS -replicas 3 -nodetype verylarge

This would create all the nodes, S3, NFS, config files, etc. I suspect the docker compose approach is simple to use; Todd will clarify. My concern is actually debugging: how can one do symbolic debugging inside docker? There are tools for that as well (at least for Python).
| Comment by Georgi Harizanov (Inactive) [ 2022-03-24 ] |

alan.mologorsky can you point me to a repo where your scripts are so I can have a look?
| Comment by alexey vorovich (Inactive) [ 2022-03-31 ] |

alan.mologorsky Did I summarize the meeting correctly? If so, please do the reversal of the default and pass it to dleeyh so that we can move on.

toddstoffel Here is a suggestion/question from gdorman: what if we ALWAYS require MaxScale to be present to enable HA? This is currently the case for Sky, and it would reduce the number of options. Before we discuss this in dev, what is our take from the PM point of view?
| Comment by Roman [ 2022-04-01 ] |

alexey.vorovich As a result, we decided that I will ask toddstoffel offline regarding the default behavior and wait for his decision.
| Comment by Roman [ 2022-04-06 ] |

For QA: please use the latest CMAPI build.
| Comment by Daniel Lee (Inactive) [ 2022-04-08 ] |

Build tested: 6.3.1-1 (#4234), CMAPI 1.6.3 (#619)
Cluster: 3-node

Test #1, PASSED
In /etc/columnstore/cmapi_server.conf, auto_failover is set to True by default.
------
Test #2, PASSED
This is the use case in which the user does not have shared storage set up and failover is not desired. Failover did not occur.
------
Test #3, PASSED
This is the use case in which the user has glusterfs set up and failover IS NOT desired. Failover did not occur.
------
Test #4, FAILED
This is the use case in which the user has glusterfs set up and failover IS desired.
Observation: when PM1, the master node, was taken offline and then put back online, it was expected that ColumnStore would set the master according to what MaxScale selected, but this did not happen. Now ColumnStore and MaxScale are out of sync.
------
At the time of this writing, the fixVersion of the ticket has been set to cmapi-1.6.3, but the package has been named for 1.6.2, such as MariaDB-columnstore-cmapi-1.6.2.x86_64.rpm. The package name should be corrected.
| Comment by alexey vorovich (Inactive) [ 2022-04-12 ] |

Guys,
1. I tend to agree that the default file section created at a new install should be empty. I am trying to reproduce this myself, but the results are inconclusive so far.

dleeyh and toddstoffel Besides the discrepancy between MCS and MXS with respect to the master node choice, what issues with DDL/DML updates do we observe? Also, Daniel, for whatever symptoms we see, please confirm in which old release we did not see them.
| Comment by alexey vorovich (Inactive) [ 2022-04-13 ] |

alan.mologorsky dleeyh I opened a new ticket, https://jira.mariadb.org/browse/MCOL-5052, for that mismatch discussion. The only remaining item here is for Alan and is described above.
| Comment by Daniel Lee (Inactive) [ 2022-04-14 ] |

Build tested: 6.3.1-1 (#4234), CMAPI-1.6.3-1 (#623)

Preliminary test results for failover behavior. More functional tests will be done.
3-node cluster, with gluster, schema replication, and MaxScale. For each of the following tests, a newly installed 3-node cluster is used.

Test #1
Failover now works the same way as it used to. When putting PM1 back online, PM2 remained the master node, in sync with MaxScale.

Test #2
and restarted cmapi.
mcsStatus on all three (3) nodes showed there is only one (1) node (pm1) in the cluster; pm2 and pm3 are no longer part of the cluster. Output is like the following:
I tried the same test again and all nodes returned something like the following.
Failover was not tested since there is only one node in the cluster now.

Test #3
and restarted cmapi.
I got the same result as in Test #1 above.
| Comment by Daniel Lee (Inactive) [ 2022-04-18 ] |

Build verified: ColumnStore 6.3.1-1 (#4278), cmapi (#625)

Following the steps above and using the new cmapi build, test #2 worked as expected: failover did not take place, as it is disabled in the cmapi_server.conf file.
| Comment by Daniel Lee (Inactive) [ 2022-04-20 ] |

Build verified: ColumnStore 6.3.1-1 (#4299), cmapi 1.6.3 (#626)

The cmapi package name has been corrected from 1.6.2 to 1.6.3: MariaDB-columnstore-cmapi-1.6.3-1.x86_64.rpm. Verified along with the latest build of ColumnStore. Created a 3-node docker cluster.