[MCOL-4226] CMAPI fails to configure multi-node CS cluster Created: 2020-08-01 Updated: 2020-10-13 Resolved: 2020-09-09 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 1.0.0 |
| Fix Version/s: | 5.4.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Assen Totin (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 6 |
| Labels: | None | ||
| Sprint: | 2020-8 |
| Description |
|
NB: This is a long (and sad) ticket. Getting a coffee, coke or beer is recommended.

Due to the lack of any official documentation (why?), we are attempting to install a multi-PM CS with MariaDB Enterprise 10.5.4 following the guide as provided by Todd Stoffel:

We are on the latest CentOS 7 with SELinux completely disabled and the firewall completely stopped. We have properly set the short and full DNS names for each node, with matching reverse resolution:

[root@p2w1 columnstore]# uname -a

We have set up and verified replication, created the cross-engine join user, etc.; all of this is irrelevant to the inability to create a CS cluster as described below. We ran the whole process twice just to make sure we hadn't missed anything; the same errors appear every time and building the cluster is impossible.

Adding the first node (and any subsequent node) fails:

curl -k -s -X PUT https://p2w1:8640/cmapi/0.4.0/cluster/add-node ' \
Yes, we are first adding a node to itself; we follow the example from the docs. We use the node short name, as the document says. After several minutes (much longer than the specified timeout, which we had already increased threefold compared to the doc!), the console returns with

{ "error": "got an error during cluster startup when broadcasting config: (422, None)" }

The system journal opens with the following entry:

Aug 01 19:33:26 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020 19:33:26] cmapi_server DEBUG put_add_node starts

after which the following block is repeated many dozens of times over the next few minutes:

Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:34:03] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:34:03] cmapi_server DEBUG put_begin returns {'timestamp': '2020-08-01 19:34:03.403018'}
Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020:19:34:03] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 200 43 "" "python-requests/2.23.0"

As we see, even with debug mode enabled these messages are completely useless and reveal nothing about why the 422 is generated, or by whom (is the CMAPI service posting to some other service? To itself?). Eventually, the journal says:

Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:38:27] root get_module_net_address Module 1 network address 127.0.0.1

The last message is confusing: see above, our host settings are perfectly correct. Whatever it means, please fix it in cmapi_server/node_manipulation.py.

The journal continues with an attempt to push a new Columnstore.xml file, which also "fails" for an unknown reason. However, note that the file is actually written to /etc/columnstore/Columnstore.xml: no differences are found between the line logged in the system journal and the updated /etc/columnstore/Columnstore.xml file!
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server DEBUG put_config starts
Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server ERROR put_config PUT /config called outside of an operation.

The above is again repeated a few dozen times, until

Aug 01 19:38:48 p2w1.xentio.lan python3[7474]: 172.20.0.42 - - [01/Aug/2020 19:38:48] cmapi_server ERROR put_add_node got an error during cluster startup when broadcasting config: (422, None)

The CS logs are completely bare at this time, save for a few lines from the first run regarding the file /var/lib/columnstore/data1/systemFiles/dbrm/tablelocks being missing and rollbackAll being completed by DMLProc.

At that point, the CMAPI happily reports the server as added and in read-write mode:

"p2w1": { , , , , , , { "name": "DDLProc", "pid": 6283 } ]

However, a simple restart of CS via "systemctl restart mariadb-columnstore" returns the cluster in read-only mode, with our first PM being read-only too:

{

The crit.log of CS now has the following lines:

Aug 1 19:57:03 p2w1 controllernode[9211]: 03.930742 |0|0|0| C 29 CAL0000: DBRM Controller: network error distributing command to worker 2

What is "node 2" if we have only added one node?! What is "network error"?! The node is still the only one and, if it needs to talk, it has to talk to itself. What network error?!

Adding a second node repeats all the above errors in the system journal; after that, CMAPI reports the service as read-only, this time with two nodes, of which the second is given as read-write:

{ , ,

Along with this, on the second node the Columnstore.xml remains untouched. The journal on the second node is full of the same errors as the journal on the first node; a new Columnstore.xml is pushed to the second node and logged to its journal, but this time it is not written to disk (as it was on the first node).
Finally, we cleaned up one VM and attempted to install a CS cluster once again, this time using the FQDN of the host, and it ended in another cryptic error:

Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: [01/Aug/2020 22:07:47] root get_module_net_address Module 1 network address 127.0.0.1
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020 22:07:47] cmapi_server DEBUG put_begin returns {'timestamp': '2020-08-01 22:07:47.866914'}
Aug 01 22:07:47 p2w1.xentio.lan python3[6335]: 172.20.2.21 - - [01/Aug/2020:22:07:47] "PUT /cmapi/0.4.0/node/begin HTTP/1.1" 200 43 "" "python-requests/2.23.0"

Proper documentation and a working procedure for a multi-node CS install will be highly appreciated, as MariaDB has released something it claims is enterprise quality without any working docs and which is, from my perspective, completely broken. |
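Since the same journal blocks repeat dozens of times, it can help to collapse them into counts per log level and handler before reading further. The following is a minimal sketch, assuming the journal has been exported as plain text and that cmapi lines keep the `cmapi_server LEVEL handler message` layout seen in the excerpts above; the regular expression and the `summarize` helper are illustrative and not part of CMAPI.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts in this ticket:
#   "... cmapi_server ERROR put_config PUT /config called outside of an operation."
CMAPI_LINE = re.compile(r"cmapi_server (DEBUG|INFO|WARNING|ERROR|CRITICAL) (\w+) (.*)$")


def summarize(lines):
    """Count cmapi_server messages by (level, handler); other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = CMAPI_LINE.search(line)
        if match:
            level, handler, _message = match.groups()
            counts[(level, handler)] += 1
    return counts


# Three lines copied from the journal excerpts in the description.
sample = [
    "Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server DEBUG put_config starts",
    "Aug 01 19:38:27 p2w1.xentio.lan python3[7474]: 172.20.2.21 - - [01/Aug/2020 19:38:27] cmapi_server ERROR put_config PUT /config called outside of an operation.",
    "Aug 01 19:34:03 p2w1.xentio.lan python3[7474]: [01/Aug/2020 19:34:03] root get_module_net_address Module 1 network address 127.0.0.1",
]

for (level, handler), count in sorted(summarize(sample).items()):
    print(f"{level:8} {handler:20} {count}")
```

Lines that do not come from the `cmapi_server` logger, such as the `root get_module_net_address` entries, are skipped by this pattern and would need their own rule.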
| Comments |
| Comment by Jose Rojas (Inactive) [ 2020-09-03 ] |
|
Pulling in MariaDB Enterprise 10.5.4 and using the documentation linked in the description (https://mdbcdt.com/DOCS-2085/deploy/enterprise-multi-columnstore), I have verified that this documentation was written with functionality in mind that is not included in the mariadb-columnstore-cmapi package that is downloaded alongside MariaDB Enterprise 10.5.4.

Specifically, you should not add the first node explicitly in this version (it is already included in the cluster by default). Doing so causes mariadb-columnstore-cmapi to hang; this is a known bug that is fixed in the next cmapi release.

Regarding the second problem: once more than one node exists in a cluster, the systemctl start/stop/restart mariadb-columnstore calls should not be used, as they cause reconnection errors between nodes. These are known reconnection limitations that are currently being worked on:

https://jira.mariadb.org/browse/MCOL-3917
https://jira.mariadb.org/browse/MCOL-4015

As of now, the only safe way to restart a cluster is to use the cluster/shutdown and cluster/start endpoints. |
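The restart sequence described above could be scripted roughly as follows. This is a minimal sketch, not a confirmed client: the host, port and API version are taken from the curl command in the description, while the `x-api-key` header, the JSON body with a `timeout` field, and the helper names are assumptions that this ticket does not confirm. The code only builds the two PUT requests; it does not send them.

```python
import json
import urllib.request

CMAPI_PORT = 8640        # port seen in the ticket's curl command
CMAPI_VERSION = "0.4.0"  # API version seen in the ticket's curl command


def build_cluster_request(host, action, api_key, timeout_s=20):
    """Build (but do not send) a PUT request for cluster/<action>."""
    if action not in ("shutdown", "start"):
        raise ValueError(f"unsupported cluster action: {action}")
    url = f"https://{host}:{CMAPI_PORT}/cmapi/{CMAPI_VERSION}/cluster/{action}"
    # Assumption: the endpoint accepts a JSON body carrying a timeout in seconds.
    body = json.dumps({"timeout": timeout_s}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "x-api-key": api_key,  # assumption: header name not shown in this ticket
    }
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


def restart_requests(host, api_key):
    """Return the shutdown and start requests in the order they should be sent."""
    return [build_cluster_request(host, action, api_key)
            for action in ("shutdown", "start")]
```

Each request could then be passed to `urllib.request.urlopen`. The curl command in the description uses `-k`, which skips certificate verification; the equivalent here would be an explicitly unverified SSL context, which is only reasonable on a test cluster.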
| Comment by Jose Rojas (Inactive) [ 2020-09-03 ] |
|
dleeyh: Testing for this will consist of verifying that you can get a working cluster when explicitly adding the first node. |
| Comment by Daniel Lee (Inactive) [ 2020-09-08 ] |
|
Build tested: 1.5.4-1 (Drone #587), cmapi (Drone #251)

Using VMs in Vagrant, I was unable to create tables after configuring a 2pm stack. The create table statement returned:

ERROR 1815 (HY000) at line 7: Internal error: CAL0009: Error while calling getSysCatDBRoot

err.log

[centos7:root~]# cat err.log |
| Comment by Daniel Lee (Inactive) [ 2020-09-08 ] |
|
The online documentation is missing python3-requests as a requirement. I installed it and was able to configure a 2pm cluster. |
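A quick pre-flight check for this kind of missing dependency might look like the sketch below. It only tests whether a module can be found by the Python interpreter that runs the script, and says nothing about which interpreter the cmapi service itself uses. The helper name `missing_modules` is hypothetical, and treating `requests` as the import name provided by python3-requests is an assumption about the distribution's packaging.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: python3-requests provides the "requests" module.
    missing = missing_modules(["requests"])
    if missing:
        print("missing Python modules:", ", ".join(missing))
    else:
        print("all required Python modules are importable")
```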
| Comment by Daniel Lee (Inactive) [ 2020-09-09 ] |
|
Build tested: 1.5.4-1 (Drone #587), cmapi (Drone #251)

I have been able to create a 3pm cluster by adding all 3 nodes using host names. |