[MCOL-4226] CMAPI fails to configure multi-node CS cluster - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.0.0
Fix Version/s: 5.4.1
Component/s: None
Labels:
None

Sprint:
2020-8

Description

NB: This is a long (and sad) ticket. Getting a coffee, coke or beer is recommended.

Due to the lack of any official documentation (why?), we are attempting to install a multi-PM CS with MariaDB Enterprise 10.5.4 following the guide as provided by Todd Stoffel:
https://mdbcdt.com/DOCS-2085/deploy/enterprise-multi-columnstore

We are on CentOS-7 latest with SELinux completely disabled and firewall completely stopped.

We have properly set short and full DNS name for each node, even with matching reverse resolving:

[root@p2w1 columnstore]# uname -a
Linux p2w1.xentio.lan 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@p2w1 columnstore]# hostname
p2w1.xentio.lan
[root@p2w1 columnstore]# host p2w1.xentio.lan
p2w1.xentio.lan has address 172.20.2.21
[root@p2w1 columnstore]# host p2w1
p2w1.xentio.lan has address 172.20.2.21
[root@p2w1 columnstore]# host 172.20.2.21
21.2.20.172.in-addr.arpa domain name pointer p2w1.xentio.lan.

We have set up and verified replication, created cross-engine join user etc. - all these are irrelevant to the inability to create a CS cluster as described below. We ran the whole process twice just to make sure we haven't missed anything; same errors appear every time and building the cluster is impossible.

Adding the first (and any subsequent node) fails:

curl -k -s -X PUT https://p2w1:8640/cmapi/0.4.0/cluster/add-node
--header 'Content-Type:application/json' \
--header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
--data '

{"timeout":60, "node": "p2w1"}

' \

jq .

Yes, we are first adding a node to self - we follow the example from the docs. We use the node short name as the document says.

After several minutes (which is much more than the specified timeout - still increased here 3 times compared to the doc!), the console returns with

{ "error": "got an error during cluster startup when broadcasting config: (422, None)" }

The system journal opens with the following entry:

Aug 01 19:33:26 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020 19:33:26] cmapi_server DEBUG put_add_node starts

after which the following block is repeated many dozens of times over the next few minutes:

Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:34:03] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:34:03] cmapi_server DEBUG put_begin starts
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:34:03] cmapi_server DEBUG put_begin JSON body

{'id': 793082, 'timeout': 263}

Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:34:03] cmapi_server DEBUG put_begin returns

{'timestamp': '2020-08-01 19:34:03.403018'}

Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:34:03] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 200 43 "" "python-requests/2.23.0"
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:34:03] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 422 42 "" "python-requests/2.23.0"
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:34:03] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 200 43 "" "python-requests/2.23.0"

As we see, even with debug mode enabled these messages are completely useless and reveal nothing why is 422 generated - and by whom (is the CMAPI service posting to some other service? To self?). Eventually, the journal says:

Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root add_node(): replacing 127.0.0.1/localhost with 172.20.2.21/p2w1.xentio.lan as this node's name. Be sure p2w1.xentio.lan resolves to 172.20.2.21 on all other nodes in the cluster.
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root _add_Module_entries(): using ip address 172.20.2.21 and hostname p2w1
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root _add_Module_entries(): found ip address already at ModuleIPAddr1-1-3
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root _add_Module_entries(): hostname doesn't match, updating address to 172.20.2.21

The last message is confusing - see above, our host settings are perfectly correct. Whatever it means, please, fix it in cmapi_server/node_manipulation.py.

The journal continues with an attempt to push a new Columnstore.xml file, which also "fails" for unknown reason - however, note tthat the file is actually written to /etc/columnsore/Columnstore.xml - no differences are foudn between the logged line in system journal and the updated /etc/columnsore/Columnstore.xml file!

Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server DEBUG put_config starts
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server DEBUG put_config JSON body

{'manager': 'p2w1.xentio.lan', 'revision': '1', 'timeout': 300, 'config': '<?xml version="1.0" ?>\n<Columnstore Version="V1.0.0"> ... </Columnstore>', 'cs_config_filename': '/etc/columnstore/Columnstore.xml'}

Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server ERROR put_config PUT /config called outside of an operation.
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:38:27] "PUT /cmapi/0.4.0/node/config HTTP/1.1" 422 56 "" "python-requests/2.23.0"
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:38:27] "PUT /cmapi/0.4.0/node/config HTTP/1.1" 422 56 "" "python-requests/2.23.0"
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root broadcast_new_config(): got an error pushing config file to p2w1.xentio.lan: 422 Client Error: Unprocessable Entity for url: https://p2w1.xentio.lan:8640/cmapi/0.4.0/node/config
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root broadcast_new_config(): got an error pushing config file to p2w1: 422 Client Error: Unprocessable Entity for url: https://p2w1:8640/cmapi/0.4.0/node/config
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root broadcast_new_config(): failed to push the new config file to ['p2w1.xentio.lan', 'p2w1']
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:38:27] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 422 38 "" "python-requests/2.23.0"
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root rollback_txn_attempt(): got error during request to p2w1.xentio.lan: 422 Client Error: Unprocessable Entity for url: https://p2w1.xentio.lan:8640/cmapi/0.4.0/node/rollback

The above is again repeated few dozen times, until

Aug 01 19:38:48 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020 19:38:48] cmapi_server ERROR put_add_node got an error during cluster startup when broadcasting config: (422, None)
Aug 01 19:38:48 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020:19:38:48] "PUT /cmapi/0.4.0/cluster/add-node HTTP/1.1" 422 86 "" "curl/7.69.1"

The CS logs are completely bare at this time, save for few lines from the first run regarding file /var/lib/columnstore/data1/systemFiles/dbrm/tablelocks missing and rollbackAll being completed by DMLProc.

At that point, the CMAPI happily reports the server being added and being in read-write mode:

"p2w1": {
"timestamp": "2020-08-01 19:47:11.039577",
"uptime": 2504,
"dbrm_mode": "master",
"cluster_mode": "readwrite",
"dbroots": [],
"module_id": 1,
"services": [

{ "name": "workernode", "pid": 6180 }

,

{ "name": "controllernode", "pid": 6182 }

,

{ "name": "PrimProc", "pid": 6184 }

,

{ "name": "WriteEngine", "pid": 6243 }

,

{ "name": "ExeMgr", "pid": 6249 }

,

{ "name": "DMLProc", "pid": 6282 }

,

{ "name": "DDLProc", "pid": 6283 }

]
}

However, a simple restart of CS via "systemctl restart mariadb-columnstore" returns the cluster in read-only mode with our first PM being read-only too:

{
"timestamp": "2020-08-01 19:58:34.866725",
"p2w1.xentio.lan": {
"timestamp": "2020-08-01 19:58:34.873586",
"uptime": 3187,
"dbrm_mode": "master",
"cluster_mode": "readonly",
...
"p2w1": {
"timestamp": "2020-08-01 19:58:34.921834",
"uptime": 4111,
"dbrm_mode": "master",
"cluster_mode": "readonly",

The crit.log of CS now has the following lines:

Aug 1 19:57:03 p2w1 controllernode[9211]: 03.930742 |0|0|0| C 29 CAL0000: DBRM Controller: network error distributing command to worker 2
Aug 1 19:57:23 p2w1 controllernode[9211]: 23.952825 |0|0|0| C 29 CAL0000: DBRM Controller: undo(): warning, could not contact worker number 2
Aug 1 19:57:23 p2w1 controllernode[9211]: 23.952867 |0|0|0| C 29 CAL0000: DBRM Controller: Caught network error. Sending command 17, length 1. Setting read-only mode.

What is "node 2" if we only have added one node?! What is "network error"?! The node is still the only one and if needs to talk, it has to talks to itself - what network error?!

Adding a second node repeats all the above errors in the system journal, after that CMAPI reports the service as being read-only - this time, with two nodes, of which the second is given as read-write:

{
"timestamp": "2020-08-01 20:13:58.140446",
"p2w1.xentio.lan":

{ "timestamp": "2020-08-01 20:13:58.147523", "uptime": 4111, "dbrm_mode": "master", "cluster_mode": "readonly", }

,
"p2w1":

{ "timestamp": "2020-08-01 20:13:58.195228", "uptime": 4111, "dbrm_mode": "master", "cluster_mode": "readonly", ... }

,
"p2w2":

{ "timestamp": "2020-08-01 20:13:58.246645", "uptime": 4115, "dbrm_mode": "master", "cluster_mode": "readwrite", ... }

Along this, on the second node the Columnstore.xml remains untouched. The journal on the second node is full with the same errors as the journal on the first node; a new Columnstore.xml is pushed to the second and logged to its journal, but this time not written to the disk (like it was on the first node).

Finally, we cleaned up one VM and attempted to install a CS cluster once again, this time using the FQDN of the host - and it ended in another cryptic error:

Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:47] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020 22:07:47] cmapi_server DEBUG put_begin starts
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020 22:07:47] cmapi_server DEBUG put_begin JSON body

{'id': 227129, 'timeout': 0}

Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020 22:07:47] cmapi_server DEBUG put_begin returns

{'timestamp': '2020-08-01 22:07:47.866914'}

Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020:22:07:47] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 200 43 "" "python-requests/2.23.0"
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020:22:07:47] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 422 42 "" "python-requests/2.23.0"
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020:22:07:47] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 200 43 "" "python-requests/2.23.0"
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root add_node(): replacing 127.0.0.1/localhost with 172.20.2.21/p2w1.xentio.lan as this node's name.
Be sure p2w1.xentio.lan resolves to 172.20.2.21 on all other nodes in the cluster.
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root _add_node_to_PMS(): node p2w1.xentio.lan already exists
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root add_node(): Caught exception: '1', config file is unchanged
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: 127.0.0.1 - - [01/Aug/2020:22:07:52] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 422 38 "" "python-requests/2.23.0"
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root rollback_txn_attempt(): got error during request to 127.0.0.1: 422 Client Error: Unprocessable
Entity for url: https://127.0.0.1:8640/cmapi/0.4.0/node/rollback

Proper documentation and a working procedure for multinode CS install will be highly appreciated as MariaDB has released something it claims is enterprise quality withttu any working docs and being, from my perspective, completely broken.

CMAPI fails to configure multi-node CS cluster

Details

Description

Attachments

Activity

People

Dates

Git Integration