[MCOL-4226] CMAPI fails to configure multi-node CS cluster Created: 2020-08-01  Updated: 2020-10-13  Resolved: 2020-09-09

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.0.0
Fix Version/s: 5.4.1

Type: Bug Priority: Critical
Reporter: Assen Totin (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 6
Labels: None

Sprint: 2020-8

 Description   

NB: This is a long (and sad) ticket. Getting a coffee, coke or beer is recommended.

Due to the lack of any official documentation (why?), we are attempting to install a multi-PM CS with MariaDB Enterprise 10.5.4 following the guide provided by Todd Stoffel:
https://mdbcdt.com/DOCS-2085/deploy/enterprise-multi-columnstore

We are on CentOS-7 latest with SELinux completely disabled and firewall completely stopped.

We have properly set short and full DNS name for each node, even with matching reverse resolving:

[root@p2w1 columnstore]# uname -a
Linux p2w1.xentio.lan 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@p2w1 columnstore]# hostname
p2w1.xentio.lan
[root@p2w1 columnstore]# host p2w1.xentio.lan
p2w1.xentio.lan has address 172.20.2.21
[root@p2w1 columnstore]# host p2w1
p2w1.xentio.lan has address 172.20.2.21
[root@p2w1 columnstore]# host 172.20.2.21
21.2.20.172.in-addr.arpa domain name pointer p2w1.xentio.lan.

We have set up and verified replication, created cross-engine join user etc. - all these are irrelevant to the inability to create a CS cluster as described below. We ran the whole process twice just to make sure we haven't missed anything; same errors appear every time and building the cluster is impossible.

Adding the first (and any subsequent) node fails:

curl -k -s -X PUT https://p2w1:8640/cmapi/0.4.0/cluster/add-node \
  --header 'Content-Type:application/json' \
  --header 'x-api-key:93816fa66cc2d8c224e62275bd4f248234dd4947b68d4af2b29671dd7d5532dd' \
  --data '{"timeout":60, "node": "p2w1"}' | jq .

Yes, we first add the node to itself - we follow the example from the docs, using the node's short name as the document instructs.

After several minutes (far longer than the specified timeout, which we had already tripled compared to the doc's value!), the call returns with

{ "error": "got an error during cluster startup when broadcasting config: (422, None)" }

The system journal opens with the following entry:

Aug 01 19:33:26 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020 19:33:26] cmapi_server DEBUG put_add_node starts

after which the following block is repeated many dozens of times over the next few minutes:

Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:34:03] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:34:03] cmapi_server DEBUG put_begin starts
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:34:03] cmapi_server DEBUG put_begin JSON body

{'id': 793082, 'timeout': 263}

Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:34:03] cmapi_server DEBUG put_begin returns

{'timestamp': '2020-08-01 19:34:03.403018'}

Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:34:03] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 200 43 "" "python-requests/2.23.0"
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:34:03] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 422 42 "" "python-requests/2.23.0"
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:34:03] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 200 43 "" "python-requests/2.23.0"

As we can see, even with debug mode enabled these messages are completely useless: they reveal nothing about why the 422 is generated, or by whom (is the CMAPI service posting to some other service? To itself?). Eventually, the journal says:

Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root add_node(): replacing 127.0.0.1/localhost with 172.20.2.21/p2w1.xentio.lan as this node's name. Be sure p2w1.xentio.lan resolves to 172.20.2.21 on all other nodes in the cluster.
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root _add_Module_entries(): using ip address 172.20.2.21 and hostname p2w1
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root _add_Module_entries(): found ip address already at ModuleIPAddr1-1-3
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root _add_Module_entries(): hostname doesn't match, updating address to 172.20.2.21

The last message is confusing - see above, our host settings are perfectly correct. Whatever it means, please fix it in cmapi_server/node_manipulation.py.
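Note what the two preceding journal lines suggest: the config already holds this node under its short name ("hostname p2w1") while add_node() is working with the FQDN. A hypothetical sketch (invented function names, not the actual node_manipulation.py logic) of why a literal string comparison reports a mismatch even when DNS is perfectly configured:

```python
# Hypothetical illustration, NOT the real cmapi code: how a stored short
# hostname can fail a naive comparison against the FQDN, producing a
# spurious "hostname doesn't match" even when forward/reverse DNS agree.
def names_match_naive(stored: str, candidate: str) -> bool:
    # What a literal comparison sees: "p2w1" != "p2w1.xentio.lan"
    return stored == candidate

def names_match_by_label(stored: str, candidate: str) -> bool:
    # Comparing only the first DNS label treats short name and FQDN as equal.
    return stored.split(".")[0] == candidate.split(".")[0]
```

If the comparison in cmapi behaves like `names_match_naive`, the "updating address" path would fire on every run regardless of how correct the DNS setup is.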

The journal continues with an attempt to push a new Columnstore.xml file, which also "fails" for an unknown reason - note, however, that the file is actually written to /etc/columnstore/Columnstore.xml: no differences are found between the line logged in the system journal and the updated /etc/columnstore/Columnstore.xml file!

Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server DEBUG put_config starts
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server DEBUG put_config JSON body

{'manager': 'p2w1.xentio.lan', 'revision': '1', 'timeout': 300, 'config': '<?xml version="1.0" ?>\n<Columnstore Version="V1.0.0"> ... </Columnstore>', 'cs_config_filename': '/etc/columnstore/Columnstore.xml'}

Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server ERROR put_config PUT /config called outside of an operation.
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:38:27] "PUT /cmapi/0.4.0/node/config HTTP/1.1" 422 56 "" "python-requests/2.23.0"
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:38:27] "PUT /cmapi/0.4.0/node/config HTTP/1.1" 422 56 "" "python-requests/2.23.0"
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root broadcast_new_config(): got an error pushing config file to p2w1.xentio.lan: 422 Client Error: Unprocessable Entity for url: https://p2w1.xentio.lan:8640/cmapi/0.4.0/node/config
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root broadcast_new_config(): got an error pushing config file to p2w1: 422 Client Error: Unprocessable Entity for url: https://p2w1:8640/cmapi/0.4.0/node/config
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root broadcast_new_config(): failed to push the new config file to ['p2w1.xentio.lan', 'p2w1']
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:38:27] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 422 38 "" "python-requests/2.23.0"
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root rollback_txn_attempt(): got error during request to p2w1.xentio.lan: 422 Client Error: Unprocessable Entity for url: https://p2w1.xentio.lan:8640/cmapi/0.4.0/node/rollback

The above is repeated a few dozen times, until

Aug 01 19:38:48 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020 19:38:48] cmapi_server ERROR put_add_node got an error during cluster startup when broadcasting config: (422, None)
Aug 01 19:38:48 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020:19:38:48] "PUT /cmapi/0.4.0/cluster/add-node HTTP/1.1" 422 86 "" "curl/7.69.1"
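The 200/422 pairs in the journal are consistent with a transaction guard: a second PUT /node/begin is rejected while a transaction is already open, and PUT /node/config or /node/rollback is rejected when no matching transaction exists (hence "PUT /config called outside of an operation"). A minimal sketch of such a guard, with invented names - the real logic lives somewhere in cmapi_server and may differ:

```python
# Hedged sketch of a per-node transaction guard that would produce the
# observed 200/422 response pattern. Class and method names are invented.
class NodeTxn:
    def __init__(self):
        self.active_id = None  # id of the currently open transaction, if any

    def begin(self, txn_id):
        if self.active_id is not None:
            return 422  # a transaction is already open -> second begin fails
        self.active_id = txn_id
        return 200

    def put_config(self, txn_id):
        if self.active_id != txn_id:
            return 422  # "PUT /config called outside of an operation"
        return 200

    def rollback(self, txn_id):
        if self.active_id != txn_id:
            return 422  # nothing (matching) to roll back
        self.active_id = None
        return 200
```

Under this model, the doubled begin calls seen in the journal would open one transaction and reject the second, after which any config push carrying the rejected transaction's id fails with 422 - matching the log above.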

The CS logs are completely bare at this time, save for a few lines from the first run about the missing file /var/lib/columnstore/data1/systemFiles/dbrm/tablelocks and rollbackAll being completed by DMLProc.

At that point, CMAPI happily reports the node as added and in read-write mode:

"p2w1": {
    "timestamp": "2020-08-01 19:47:11.039577",
    "uptime": 2504,
    "dbrm_mode": "master",
    "cluster_mode": "readwrite",
    "dbroots": [],
    "module_id": 1,
    "services": [
        { "name": "workernode", "pid": 6180 },
        { "name": "controllernode", "pid": 6182 },
        { "name": "PrimProc", "pid": 6184 },
        { "name": "WriteEngine", "pid": 6243 },
        { "name": "ExeMgr", "pid": 6249 },
        { "name": "DMLProc", "pid": 6282 },
        { "name": "DDLProc", "pid": 6283 }
    ]
}

However, a simple restart of CS via "systemctl restart mariadb-columnstore" puts the cluster into read-only mode, with our first PM read-only too:

{
"timestamp": "2020-08-01 19:58:34.866725",
"p2w1.xentio.lan": {
"timestamp": "2020-08-01 19:58:34.873586",
"uptime": 3187,
"dbrm_mode": "master",
"cluster_mode": "readonly",
...
"p2w1": {
"timestamp": "2020-08-01 19:58:34.921834",
"uptime": 4111,
"dbrm_mode": "master",
"cluster_mode": "readonly",

The crit.log of CS now has the following lines:

Aug 1 19:57:03 p2w1 controllernode[9211]: 03.930742 |0|0|0| C 29 CAL0000: DBRM Controller: network error distributing command to worker 2
Aug 1 19:57:23 p2w1 controllernode[9211]: 23.952825 |0|0|0| C 29 CAL0000: DBRM Controller: undo(): warning, could not contact worker number 2
Aug 1 19:57:23 p2w1 controllernode[9211]: 23.952867 |0|0|0| C 29 CAL0000: DBRM Controller: Caught network error. Sending command 17, length 1. Setting read-only mode.

What is "worker 2" if we have only added one node?! What is this "network error"?! The node is still the only one, and if it needs to talk, it has to talk to itself - what network error?!

Adding a second node repeats all of the above errors in the system journal; afterwards CMAPI reports the cluster as read-only - this time with two nodes, of which the second is reported as read-write:

{
    "timestamp": "2020-08-01 20:13:58.140446",
    "p2w1.xentio.lan": { "timestamp": "2020-08-01 20:13:58.147523", "uptime": 4111, "dbrm_mode": "master", "cluster_mode": "readonly" },
    "p2w1": { "timestamp": "2020-08-01 20:13:58.195228", "uptime": 4111, "dbrm_mode": "master", "cluster_mode": "readonly", ... },
    "p2w2": { "timestamp": "2020-08-01 20:13:58.246645", "uptime": 4115, "dbrm_mode": "master", "cluster_mode": "readwrite", ... }
}
Alongside this, the Columnstore.xml on the second node remains untouched. The journal on the second node is full of the same errors as on the first; a new Columnstore.xml is pushed to the second node and logged in its journal, but this time it is not written to disk (as it was on the first node).

Finally, we cleaned up one VM and attempted to install a CS cluster once again, this time using the host's FQDN - and it ended with another cryptic error:

Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:47] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020 22:07:47] cmapi_server DEBUG put_begin starts
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020 22:07:47] cmapi_server DEBUG put_begin JSON body

{'id': 227129, 'timeout': 0}

Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020 22:07:47] cmapi_server DEBUG put_begin returns

{'timestamp': '2020-08-01 22:07:47.866914'}

Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020:22:07:47] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 200 43 "" "python-requests/2.23.0"
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020:22:07:47] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 422 42 "" "python-requests/2.23.0"
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020:22:07:47] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 200 43 "" "python-requests/2.23.0"
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root add_node(): replacing 127.0.0.1/localhost with 172.20.2.21/p2w1.xentio.lan as this node's name. Be sure p2w1.xentio.lan resolves to 172.20.2.21 on all other nodes in the cluster.
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root _add_node_to_PMS(): node p2w1.xentio.lan already exists
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root add_node(): Caught exception: '1', config file is unchanged
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: 127.0.0.1 - - [01/Aug/2020:22:07:52] "PUT /cmapi/0.4.0/node/rollback HTTP/1.1" 422 38 "" "python-requests/2.23.0"
Aug 01 22:07:52 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:52] root rollback_txn_attempt(): got error during request to 127.0.0.1: 422 Client Error: Unprocessable Entity for url: https://127.0.0.1:8640/cmapi/0.4.0/node/rollback

Proper documentation and a working procedure for a multi-node CS install would be highly appreciated, as MariaDB has released something it claims is enterprise quality without any working docs - and which is, from my perspective, completely broken.



 Comments   
Comment by Jose Rojas (Inactive) [ 2020-09-03 ]

Pulling in MariaDB Enterprise 10.5.4 and using the documentation linked in the description https://mdbcdt.com/DOCS-2085/deploy/enterprise-multi-columnstore, I have verified that this documentation was written with functionality in mind that is not included in the mariadb-columnstore-cmapi package that is downloaded alongside MariaDB Enterprise 10.5.4.

Specifically, you should not be adding the first node explicitly in this version (it is already included in the cluster by default). Doing so will cause mariadb-columnstore-cmapi to hang (this is a known bug, fixed in the next cmapi release).

Regarding the second problem: once more than one node exists in a cluster, the systemctl start/stop/restart mariadb-columnstore calls should not be used, as they cause reconnection errors between nodes. These are known reconnection limitations currently being worked on: https://jira.mariadb.org/browse/MCOL-3917 https://jira.mariadb.org/browse/MCOL-4015

As of now, the only safe way to restart a cluster is to use the cluster/shutdown and cluster/start endpoints.
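For illustration, the sketch below builds (but does not send) the corresponding PUT requests with Python's stdlib urllib, mirroring the headers used by the add-node call in the description. The endpoint names come from the comment above; the api key value and timeout are placeholders:

```python
import json
import urllib.request

# Build a PUT request against a CMAPI cluster endpoint. Header names and the
# port mirror the add-node curl call from the description; "<api-key>" is a
# placeholder for the real x-api-key value.
def cmapi_request(host: str, endpoint: str, api_key: str, body: dict):
    return urllib.request.Request(
        f"https://{host}:8640/cmapi/0.4.0/cluster/{endpoint}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="PUT",
    )

# Restart sequence: shutdown, then start.
shutdown_req = cmapi_request("p2w1", "shutdown", "<api-key>", {"timeout": 60})
start_req = cmapi_request("p2w1", "start", "<api-key>", {"timeout": 60})
```

Actually sending these with urllib.request.urlopen() would additionally need an SSL context that skips certificate verification (the equivalent of the report's curl -k), since the service's certificate is evidently not trusted by the client.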

Comment by Jose Rojas (Inactive) [ 2020-09-03 ]

dleeyh Testing for this will verify that you can get a working cluster when explicitly adding the first node.

Comment by Daniel Lee (Inactive) [ 2020-09-08 ]

Build tested: 1.5.4-1 (Drone #587), cmapi (Drone #251)

Using VMs in Vagrant, unable to create tables after configuring a 2pm stack

create table returned:

ERROR 1815 (HY000) at line 7: Internal error: CAL0009: Error while calling getSysCatDBRoot

err.log

[centos7:root~]# cat err.log
Sep 8 16:18:45 localhost controllernode[13593]: 45.354411 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR )
Sep 8 16:18:45 localhost controllernode[13593]: 45.375355 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR )
Sep 8 16:20:00 localhost controllernode[13928]: 00.674397 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR )
Sep 8 16:20:00 localhost controllernode[13928]: 00.698102 |0|0|0| C 29 CAL0000: InetStreamSocket::readToMagic(): I/O error1: rc-1; poll signal interrupt ( POLLHUP POLLERR )
Sep 8 16:20:11 localhost ddlpackageproc[14036]: 11.649388 |0|0|0| E 23 CAL0009: Error while calling getSysCatDBRoot
Sep 8 16:20:11 localhost ddlpackageproc[14036]: 11.677014 |0|0|0| E 23 CAL0009: Error while calling getSysCatDBRoot
Sep 8 16:20:11 localhost ddlpackageproc[14036]: 11.702801 |0|0|0| E 23 CAL0009: Error while calling getSysCatDBRoot
Sep 8 16:20:11 localhost ddlpackageproc[14036]: 11.728657 |0|0|0| E 23 CAL0009: Error while calling getSysCatDBRoot
Sep 8 16:20:11 localhost writeenginesplit[14156]: 11.825906 |0|0|0| E 33 CAL0097: mytest.lineitem : ERROR : IDB-2006: 'mytest.lineitem' does not exist in Columnstore..

Comment by Daniel Lee (Inactive) [ 2020-09-08 ]

The online documentation is missing python3-requests as a requirement. After installing it, I was able to configure a 2pm cluster.

Comment by Daniel Lee (Inactive) [ 2020-09-09 ]

Build tested: 1.5.4-1 (Drone #587), cmapi (Drone #251)

I have been able to create a 3pm cluster by adding all 3 nodes using host names.
