MariaDB ColumnStore
MCOL-4939

Add a method to disable failover facility in CMAPI.

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 6.1.1
    • Fix Version: cmapi-6.4.1
    • Component: cmapi
    • Labels: None
    • Sprint: 2021-15, 2021-16, 2021-17

    Description

      There are known customer installations that don't use shared storage, so the failover mechanism might break such clusters.
      There must be a knob in the CMAPI configuration file to disable the failover facility if needed.

      New

      The following changes have been made:

      • Added an [application] section with an auto_failover = False parameter to the default cmapi_server.conf.
      • Failover is now turned off by default, even if there is no [application] section or no auto_failover parameter in cmapi_server.conf.
      • Failover now has three logical states:
        • Turned off: no failover thread is started. To turn it on, set auto_failover=True in the [application] section of the cmapi_server.conf file on each node and restart CMAPI (see the sketch after this list).
        • Turned on and inactive: the failover thread exists but does not act. It becomes active automatically if the node count is >= 3.
        • Turned on and active: the failover thread is running and monitoring the cluster. It is deactivated automatically if the node count drops below 3.
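      A minimal sketch of what turning failover back on could look like on one node, assuming the default config path /etc/columnstore/cmapi_server.conf, that no [application] section exists there yet, and the restart command used elsewhere in this ticket:

      # Append the [application] section with failover enabled, then restart
      # CMAPI so it picks up the change. Repeat on every node of the cluster.
      sudo tee -a /etc/columnstore/cmapi_server.conf <<'EOF'
      [application]
      auto_failover = True
      EOF
      sudo systemctl restart mariadb-columnstore-cmapi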

      Activity

            drrtuy Roman created issue -
            gdorman Gregory Dorman (Inactive) made changes -
            Field Original Value New Value
            Rank Ranked higher
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-15 [ 587 ] 2021-15, 2021-16 [ 587, 598 ]
            alan.mologorsky Alan Mologorsky made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            alan.mologorsky Alan Mologorsky made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            drrtuy Roman added a comment - - edited

            4QA: Previously there was no way to disable failover for clusters with >= 3 nodes. This significantly affects clusters with non-shared storage.

            drrtuy Roman made changes -
            Status In Review [ 10002 ] In Testing [ 10301 ]
            drrtuy Roman made changes -
            Assignee Roman [ drrtuy ] Daniel Lee [ dleeyh ]
            dleeyh Daniel Lee (Inactive) made changes -
            Rank Ranked higher
            alan.mologorsky Alan Mologorsky made changes -
            alan.mologorsky Alan Mologorsky made changes -
            Comment [ Fixed is_shared_storage behaviour.
            Now failover uses the SM config file to check whether storage is shared or not.
            Previously the 'yes' value was hardcoded. ]

            alan.mologorsky Alan Mologorsky added a comment -

            dleeyh
            The following changes have been made:

            • Added an [application] section with an auto_failover = False parameter to the default cmapi_server.conf.
            • Failover is now turned off by default, even if there is no [application] section or no auto_failover parameter in cmapi_server.conf.
            • Failover now has three logical states:
              • Turned off: no failover thread is started. To turn it on, set auto_failover=True in the [application] section of the cmapi_server.conf file on each node and restart CMAPI.
              • Turned on and inactive: the failover thread exists but does not act. It becomes active automatically if the node count is >= 3.
              • Turned on and active: the failover thread is running and monitoring the cluster. It is deactivated automatically if the node count drops below 3.

            dleeyh Daniel Lee (Inactive) added a comment -

            Build tested: ColumnStore Engine (build 3561)
            CMAPI (585)

            Test cluster: 3-node cluster

            Reproduced the reported issue in 6.2.2-1, with CMAPI 1.6 as released.

            Non-shared storage (local dbroot)
            With auto_failover=False, failover did not occur when PM1 was suspended.

            With auto_failover=True, this is a misconfiguration since non-shared storage is used. When PM1 was suspended, I expected failover to occur and the cluster to end up in a non-operational state, but it did not occur. Was it because CMAPI detected non-shared storage and did not kick off the failover process, or did failover simply not occur?
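            One possible way to check what Storage Manager is configured for is sketched below; the /etc/columnstore/storagemanager.cnf path and the service key are assumptions based on a stock ColumnStore install, not something spelled out in this ticket:

            # "service = LocalStorage" indicates local (non-shared) dbroots,
            # while "service = S3" indicates shared object storage.
            grep -i '^service' /etc/columnstore/storagemanager.cnf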

            Glusterfs
            With auto_failover=False, failover did not occur when PM1 was suspended.
            With auto_failover=True, I expected failover to occur, with PM2 taking over as the master node. It did not happen.
            The same test worked in 6.1.1 and 6.2.2.

            dleeyh Daniel Lee (Inactive) made changes -
            Assignee Daniel Lee [ dleeyh ] Alan Mologorsky [ JIRAUSER49150 ]
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-15, 2021-16 [ 587, 598 ] 2021-15, 2021-16, 2021-17 [ 587, 598, 614 ]
            alan.mologorsky Alan Mologorsky made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            David.Hall I moved this to testing. Was that incorrect?

            Is the action item for alan.mologorsky instead?

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            alan.mologorsky I moved it to testing by mistake.

            Please post the status.
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            alan.mologorsky Alan Mologorsky made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            alan.mologorsky Alan Mologorsky made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Daniel Lee [ dleeyh ]
            dleeyh Daniel Lee (Inactive) made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            dleeyh Question: did all 3 scenarios work in the previous version? Which one?
            alan.mologorsky please see Daniel's request for steps.

            Note that we are discussing a possible regression in the overall failover functionality in this release, not the actual change that Alan made on this ticket.
            toddstoffel should MaxScale be involved in this?
            Eventually we will need to review this in Sky as well, petko.vasilev.
            dleeyh Daniel Lee (Inactive) added a comment - - edited

            Build tested: 6.3.1-1 (#4101), cmapi 1.6 (#580), 3-node glusterfs

            alexey.vorovich Yes, it was tested before and it worked fine. Just in case, I retested the same build of ColumnStore 6.3.1-1 using an older build of CMAPI.

            1. When restarting ColumnStore (mcsShutdown and mcsStart, not a failover situation), PM1 remained as the master node. ColumnStore continued to function properly as expected.

            2. Failover scenario: PM1 also came back as the master node. ColumnStore continued to function properly as expected.

            MariaDB [mytest]> select count(*) from lineitem;
            +----------+
            | count(*) |
            +----------+
            |  6001215 |
            +----------+
            1 row in set (0.186 sec)
             
            MariaDB [mytest]> create table t1 (c1 int) engine=columnstore;
            Query OK, 0 rows affected (1.619 sec)
             
            use near 'table t1 values (1)' at line 1
            MariaDB [mytest]> insert t1 values (1);
            Query OK, 1 row affected (0.159 sec)
             
            MariaDB [mytest]> insert t1 values (2);
            Query OK, 1 row affected (0.074 sec)
             
            MariaDB [mytest]> select * from t1;
            +------+
            | c1   |
            +------+
            |    1 |
            |    2 |
            +------+
            2 rows in set (0.521 sec)
            

            dleeyh Daniel Lee (Inactive) made changes -
            Assignee Daniel Lee [ dleeyh ] Alan Mologorsky [ JIRAUSER49150 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            Guys, I would suggest this.

            Start with something that works for at least one of you: "ColumnStore 6.3.1-1 using an older build of CMAPI".

            Indeed, Daniel, please create an annotated script in your repo to do the actions for tests 1, 2, and 3, to minimize the chance of miscommunication.

            Then Alan can try to repeat them on the version he believes cannot work, and we go from there.

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            And please always use build numbers in notes instead of "latest" and "older build".

            dleeyh Daniel Lee (Inactive) added a comment -

            I always start my comments for a test with a line like the following:

            Build tested: 6.3.1-1 (#4101), cmapi 1.6 (#580)

            When closing a ticket, I use:

            Build verified: 6.3.1-1 (#4101), cmapi 1.6 (#580)

            The number in () is the build number from Drone.

            For example, the CMAPI build that I had issues with was "cmapi 1.6.2 (#612)" and the older one that I retested was "cmapi 1.6 (#580)".
            alexey.vorovich alexey vorovich (Inactive) added a comment - - edited

            alan.mologorsky
            Yes, in the future we will share QA scripts (and they will include k8s commands as well).

            For today, test 1 has these 4 steps that you can run one after the other. Please try them with "cmapi 1.6 (#580)" and see if you can confirm what Daniel is seeing (he sees it working and you believe it cannot work). Let's start from there.

            1. Set auto_failover to True in /etc/columnstore/cmapi_server.conf on all nodes
            2. "systemctl restart mariadb-columnstore-cmapi" on all nodes
            3. mcsShutdown
            4. mcsStart

            As a side note: commands that are supposed to be run on multiple separate hosts could be executed via a loop of SSH or via kubectl (a rough sketch of the SSH approach follows below). We will need to decide how to do that in the future. There is also the SkyTf framework for this, created by georgi.harizanov.
            Eventually we will integrate multi-node tests with that framework. For now, this is just a heads-up.
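            A minimal sketch of that SSH loop for the four steps above; the host names, root SSH access, and the assumption that an auto_failover line already exists in the [application] section are all hypothetical, not part of this ticket:

            # Set auto_failover = True and restart CMAPI on every node,
            # then restart ColumnStore once from the primary node.
            NODES="pm1 pm2 pm3"
            for host in $NODES; do
              ssh "root@$host" "sed -i 's/^auto_failover *=.*/auto_failover = True/' /etc/columnstore/cmapi_server.conf && systemctl restart mariadb-columnstore-cmapi"
            done
            ssh "root@pm1" "mcsShutdown && mcsStart"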


            alexey.vorovich alexey vorovich (Inactive) added a comment -

            I kind of doubt that Rocky and the types of data loaded are important, but they could be.

            What is important is to have the same common scripts to install the system to begin with.

            If Daniel has these scripts, then Alan should use them. I will try this install script as well.

            After these scripts are used, I would start with the older build that works for both and then move to newer builds.
            alan.mologorsky Alan Mologorsky made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            Well, if Alan can reproduce the problem using a separate setup, then good.

            However, I would definitely invest in a common installation script.

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            We have 3 candidates for a common install script for multi-node:

            1. Daniel's QA setup. Needs work, as per Daniel, to make it really standalone.
            2. Direct MOE that brings up a cluster with pods. Pending this week, I hope and pray.
            3. Docker Compose from toddstoffel. If this supports shared disk, then we could use it to start/test/validate and share an identical setup between different people, short term and maybe long term as well.

            Todd, do you agree?

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            The link leads to step 2 of a 9-step procedure. Many of these steps require the user to execute commands on each host; this takes a lot of time and leaves many opportunities for error.

            By contrast, Direct MOE will allow something along these lines:

            moe -topology CS -replicas 3 -nodetype verylarge

            This will create all the nodes, S3, NFS, config files, etc.

            I suspect the Docker Compose approach is simple to use; Todd will clarify.

            My concern is actually debugging: how can one do symbolic debugging in Docker? There are tools for that as well (at least for Python):

            https://code.visualstudio.com/docs/containers/debug-common

            georgi.harizanov Georgi Harizanov (Inactive) added a comment -

            alan.mologorsky can you point me to a repo where your scripts are so I can have a look?
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Description: appended a "New" section listing the changes made (the [application] section with auto_failover, the default-off behavior, and the three logical failover states).
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Description: reformatted the "New" section as a bulleted list under "Changes has been made:".
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Description: emphasized that failover is *turned off by default*.

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            alan.mologorsky did I summarize the meeting correctly? If so, please do the reversal of the default and pass it to dleeyh so that we can move on.

            toddstoffel Here is a suggestion/question from gdorman: what if we ALWAYS require MaxScale to be present to enable HA? This is currently the case for Sky.

            This would reduce the number of options. Before we discuss this in dev, what is our take from the PM point of view?
            drrtuy Roman added a comment -

            alexey.vorovich As a result, we decided that I will ask toddstoffel offline about the default behavior and wait for his decision.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s cmapi-1.7 [ 26900 ]
            Fix Version/s 6.3.1 [ 25801 ]
            nedyalko.petrov Nedyalko Petrov (Inactive) made changes -
            drrtuy Roman made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            drrtuy Roman made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            drrtuy Roman made changes -
            Assignee Roman [ drrtuy ] Daniel Lee [ dleeyh ]
            drrtuy Roman added a comment -

            4QA: Please use the latest CMAPI build.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s cmapi-1.6.3 [ 27900 ]
            Fix Version/s cmapi-6.4.1 [ 26900 ]

            dleeyh Daniel Lee (Inactive) added a comment -

            Build tested: 6.3.1-1 (#4234), CMAPI-1.6.3 (619)

            Cluster: 3-node

            Test #1, PASSED
            Default auto_failover value

            In /etc/columnstore/cmapi_server.conf, auto_failover is set to True by default.
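            A simple way to confirm the shipped default on a node (assuming the stock config path; the section and key names come from this ticket):

            # Show the [application] section and the line that follows it
            grep -A1 '^\[application\]' /etc/columnstore/cmapi_server.conf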

            ------

            Test #2, PASSED
            Setup: no shared storage
            auto_failover: False

            This is the use case in which the user does not have a shared storage setup and failover is not desired.

            Failover did not occur

            ------

            Test #3, PASSED
            Setup: gluster
            auto_failover: False

            This is the use case in which the user has a glusterfs setup and failover IS NOT desired.

            Failover did not occur

            ------

            Test #4, FAILED
            Setup: gluster
            auto_failover: True

            This is the use case in which the user has a glusterfs setup and failover IS desired.

            Observation:

            When PM1, which is the master node, was taken offline:
            mcsStatus on PM2 showed PM2 as the master and PM3 as a slave (2-node cluster).
            MaxScale showed PM2 as the master and PM1 as down.
            So far, this is expected.

            When PM1 was put back online:
            mcsStatus showed that PM1 eventually became the master node again,
            but MaxScale still showed PM2 as the master node.

            It was expected that ColumnStore would set the master according to what MaxScale selected, but this did not happen. Now ColumnStore and MaxScale are out of sync.
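            A quick way to compare the two views in this scenario is sketched below; it assumes maxctrl is available on the MaxScale host and mcsStatus is run on a PM node (the exact invocations are not spelled out in this ticket):

            # MaxScale's view: which backend server is currently Master
            maxctrl list servers
            # ColumnStore's view: which node reports dbrm_mode "master"
            mcsStatus | grep dbrm_mode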

            ------

            At the time of this writing, the fixVersion of the ticket has been set to cmapi-1.6.3, but the package has been named for 1.6.2, such as MariaDB-columnstore-cmapi-1.6.2.x86_64.rpm. The package name should be corrected.

            dleeyh Daniel Lee (Inactive) made changes -
            Assignee Daniel Lee [ dleeyh ] Todd Stoffel [ toddstoffel ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Comment [ A comment with security level 'Developers' was removed. ]
            alexey.vorovich alexey vorovich (Inactive) added a comment - - edited

            Guys,

            1. I tend to agree that the default file section created at a new install should be empty.
            2. Let's go back to Test #4, which failed, in https://jira.mariadb.org/browse/MCOL-4939?focusedCommentId=219832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-219832

            I am trying to repro it myself, but it has been inconclusive so far.

            dleeyh and toddstoffel

            Besides the discrepancy between MCS and MXS with respect to the master node choice, what issues with DDL/DML updates do we observe?
            Please list what has been found. My understanding is that MaxScale will direct updates to PM2.

            Also, Daniel, for whatever symptoms we see, please confirm in which old release we did not see them.

            alan.mologorsky drrtuy gdorman FYI

            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee Todd Stoffel [ toddstoffel ] Alan Mologorsky [ JIRAUSER49150 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status In Testing [ 10301 ] Stalled [ 10000 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Comment [ A comment with security level 'Developers' was removed. ]
            alan.mologorsky Alan Mologorsky made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            alexey.vorovich alexey vorovich (Inactive) added a comment -

            alan.mologorsky dleeyh I opened a new ticket, https://jira.mariadb.org/browse/MCOL-5052, for that mismatch discussion.

            The only remaining item here is for Alan and is described above.

            dleeyh Daniel Lee (Inactive) added a comment -

            Build tested: 6.3.1-1 (#4234), CMAPI-1.6.3-1 (#623)

            Preliminary test results for failover behavior. More functional tests will be done.

            3-node cluster, with gluster, schema replication, MaxScale

            For each of the following tests, a newly installed 3-node cluster is used.

            Test #1
            Default installation: the auto_failover parameter has been removed from /etc/columnstore/cmapi_server.conf, so the default behavior is auto failover enabled.

            Failover now works the same way as it used to. When PM1 was put back online, PM2 remained as the master node, in sync with MaxScale.

            Test #2
            On each node, added the following to /etc/columnstore/cmapi_server.conf

            [application]
            auto_failover = False
            

            and restarted cmapi

            systemctl restart mariadb-columnstore-cmapi
            

            mcsStatus on all three (3) nodes showed that there is only one (1) node (pm1) in the cluster; pm2 and pm3 are no longer part of the cluster. The output looks like the following:

            [rocky8:root~]# mcsStatus
            {
              "timestamp": "2022-04-14 00:43:02.932548",
              "s1pm1": {
                "timestamp": "2022-04-14 00:43:02.938951",
                "uptime": 1149,
                "dbrm_mode": "master",
                "cluster_mode": "readwrite",
                "dbroots": [],
                "module_id": 1,
                "services": [
                  {
                    "name": "workernode",
                    "pid": 9290
                  },
                  {
                    "name": "controllernode",
                    "pid": 9301
                  },
                  {
                    "name": "PrimProc",
                    "pid": 9317
                  },
                  {
                    "name": "ExeMgr",
                    "pid": 9365
                  },
                  {
                    "name": "WriteEngine",
                    "pid": 9382
                  },
                  {
                    "name": "DDLProc",
                    "pid": 9413
                  }
                ]
              },
              "num_nodes": 1
            }
            

            I tried the same test again and all nodes returned something like the following:

            [rocky8:root~]# mcsStatus
            {
              "timestamp": "2022-04-14 01:46:02.956786",
              "s1pm1": {
                "timestamp": "2022-04-14 01:46:02.963366",
                "uptime": 1631,
                "dbrm_mode": "offline",
                "cluster_mode": "readonly",
                "dbroots": [],
                "module_id": 1,
                "services": []
              },
              "num_nodes": 1
            }
            

            Failover was not tested since there is only one node in the cluster now.

            Test #3
            On each node, added the following to /etc/columnstore/cmapi_server.conf

            [application]
            auto_failover = True
            

            and restarted cmapi

            systemctl restart mariadb-columnstore-cmapi
            

            I got the same result as in Test #1 above.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Fix Version/s cmapi-6.4.1 [ 26900 ]
            Fix Version/s cmapi-1.6.3 [ 27900 ]
            alexey.vorovich alexey vorovich (Inactive) made changes -
            Assignee Roman [ drrtuy ] Alan Mologorsky [ JIRAUSER49150 ]
            drrtuy Roman made changes -
            Status In Review [ 10002 ] In Testing [ 10301 ]
            drrtuy Roman made changes -
            Assignee Alan Mologorsky [ JIRAUSER49150 ] Roman [ drrtuy ]
            drrtuy Roman made changes -
            Assignee Roman [ drrtuy ] Daniel Lee [ dleeyh ]

            dleeyh Daniel Lee (Inactive) added a comment -

            Build verified: ColumnStore 6.3.1-1 (#4278), cmapi (#625)

            Following the steps above and using the new cmapi build, test #2 worked as expected: failover did not take place, as it is disabled in the cmapi_server.conf file.

            dleeyh Daniel Lee (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status In Testing [ 10301 ] Closed [ 6 ]
            dleeyh Daniel Lee (Inactive) added a comment - - edited

            Build verified: ColumnStore 6.3.1-1 (#4299), cmapi 1.6.3 (#626)

            The cmapi package name has been corrected from 1.6.2 to 1.6.3: MariaDB-columnstore-cmapi-1.6.3-1.x86_64.rpm.

            Verified along with the latest build of ColumnStore, using a 3-node Docker cluster.

            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            toddstoffel Todd Stoffel (Inactive) made changes -
            toddstoffel Todd Stoffel (Inactive) made changes -
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher

            People

              Assignee: dleeyh Daniel Lee (Inactive)
              Reporter: drrtuy Roman
              Votes: 0
              Watchers: 7

