[MCOL-4337] Controllernode must establish connection with his workernodes on its startup Created: 2020-10-07  Updated: 2020-11-20  Resolved: 2020-11-20

Status: Closed
Project: MariaDB ColumnStore
Component/s: Build
Affects Version/s: 1.5.3
Fix Version/s: 5.5.1

Type: New Feature Priority: Minor
Reporter: Roman Assignee: Jose Rojas (Inactive)
Resolution: Fixed Votes: 0
Labels: stability

Issue Links:
Problem/Incident
is caused by MCOL-3836 Columnstore OAM replacement Closed
Relates
relates to MCOL-3836 Columnstore OAM replacement Closed
Sprint: 2020-8

 Description   

Controllernode currently establishes DBRM_Worker connections lazily, so it does not wait for them to come up. This calls for an additional startup check in cluster setups: controllernode must wait for all workernodes (WNs) to come up before the controllernode (CN) is ready to process requests.



 Comments   
Comment by Roman [ 2020-10-09 ]

Please review.

Comment by Roman [ 2020-10-15 ]

For QA: this should be tested with the 5.5 release.
To test the feature, manually start mcs-loadbrm, then controllernode, and then, after some delay, workernode on another console. controllernode should delay its startup while waiting for the workernode to come up. A new XML setting, DBRM_Controller.WorkerConnectionTimeout, tells controllernode how long to wait for workernodes on startup. The unit is seconds; its default value is 30.

Comment by Daniel Lee (Inactive) [ 2020-10-30 ]

Builds tested: 5.5.1-1
Drone #1013, branch develop-1.5
Drone #1017, branch develop

Tested on both centos 8 and ubuntu18.04

1. Stopped the mcs-controllernode and mcs-workernode@1 services
2. Started the mcs-loadbrm service
3. Started the mcs-controllernode service. It did not wait for the workernode@1 service to start; mcs-controllernode started in about 7 seconds.

Instead of using systemctl to start the services, I did another round of tests by running load_brm and controllernode from /usr/bin directly. controllernode started immediately.

I noticed there is no DBRM_Controller.WorkerConnectionTimeout entry in the Columnstore.xml file. I added the entry with a value of 30 and ran another round of tests; controllernode still did not wait for workernode to start.

Comment by Roman [ 2020-11-10 ]

toddstoffel I would like to suggest changing the label to stability or something similar.

Comment by Roman [ 2020-11-10 ]

dleeyh I need to confess that my test scenario was totally misleading.
The patch itself enables controllernode to establish connections with its workernodes on startup. Before the patch, controllernode made lazy connections, establishing them only when a new request arrived at controllernode.
So to test the patch, follow this scenario:
1. Stop the mcs-controllernode and mcs-workernode@1 services
2. Start the mcs-loadbrm service
3. Start the mcs-controllernode service. At this point it should complain once, into /var/log/mariadb/columnstore/error.log, about the workernode it cannot connect to.
4. Start the mcs-workernode service. At this point controllernode should log the established connection.

JFYI: the test you tried will work after MCOL-4170, which is in its last stages.

Comment by Daniel Lee (Inactive) [ 2020-11-10 ]

Build tested: 5.5.1-1 (Drone)

engine: 1ffca618dfba15d1edda4b21a2d9f9713d4f7262
server: 10b2d5726fa21675362596ff4f52f2eca748bdc9
buildNo: 1097

A few issues:

1. The error message is in the warning.log file, not the err.log file

Nov 10 22:00:15 centos-8 controllernode[4810]: 15.905223 |0|0|0| D 29 CAL0000: DBRM Controller: Connected to DBRM_Worker1

2. After both controller and worker started, the cluster is in a system-not-ready state

I logged into the MySQL client after both services were started

MariaDB [mytest]> select count from lineitem;
ERROR 1815 (HY000): Internal error: The system is not yet ready to accept queries

[centos8:root~]# systemctl status mariadb
● mariadb.service - MariaDB 10.5.8 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/mariadb.service.d
└─migrated-from-my.cnf-settings.conf
Active: active (running) since Tue 2020-11-10 21:53:40 UTC; 7min ago
Docs: man:mariadbd(8)
https://mariadb.com/kb/en/library/systemd/
Main PID: 3920 (mariadbd)
Status: "Taking your SQL requests now..."
Tasks: 12 (limit: 50823)
Memory: 799.2M
CGroup: /system.slice/mariadb.service
└─3920 /usr/sbin/mariadbd

Nov 10 21:53:40 centos-8 mariadbd[3920]: 2020-11-10 21:53:40 0 [Note] Added new Master_info '' to hash table
Nov 10 21:53:40 centos-8 mariadbd[3920]: 2020-11-10 21:53:40 0 [Note] /usr/sbin/mariadbd: ready for connections.
Nov 10 21:53:40 centos-8 mariadbd[3920]: Version: '10.5.8-MariaDB' socket: '/var/lib/mysql/mysql.sock' port: 3306 MariaDB Server
Nov 10 21:53:40 centos-8 systemd[1]: Started MariaDB 10.5.8 database server.
Nov 10 21:54:40 centos-8 dbcon[3920]: 40.145388 |4|0|0| D 24 CAL0001: Start SQL statement: load data infile "/data/qa/source/dbt3/1g/lineitem.tbl" in>
Nov 10 21:55:58 centos-8 writeenginesplit[4425]: 58.648139 |0|0|0| I 33 CAL0000: Send EOD message to All PMs
Nov 10 21:55:59 centos-8 writeenginesplit[4425]: 59.023209 |0|0|0| I 33 CAL0098: Received a Cpimport Pass from PM1.
Nov 10 21:55:59 centos-8 writeenginesplit[4425]: 59.024888 |0|0|0| I 33 CAL0000: Released Table Lock
Nov 10 21:55:59 centos-8 dbcon[3920]: 59.935069 |4|0|0| D 24 CAL0001: End SQL statement
Nov 10 22:01:16 centos-8 mariadbd[3920]: DBRM::send_recv: controller node closed the connection

Restarting the mariadb service did not resolve the issue, but restarting mariadb-columnstore did.

Comment by Roman [ 2020-11-11 ]

dleeyh I need the exact steps you took, because my list of actions didn't include starting or testing the whole cluster, so I'm not sure exactly what you are doing.

Comment by Daniel Lee (Inactive) [ 2020-11-11 ]

I did the steps you specified and tried to run a query. The cluster was in a "not ready" state.

Your change may be doing what you want specifically, but you need to ensure the cluster is not broken.

Comment by Roman [ 2020-11-11 ]

How did you start the cluster?

Comment by Daniel Lee (Inactive) [ 2020-11-11 ]

The "not ready" issue occurred right after I performed the steps you specified. No start or restart was performed.

I did restart the cluster later, attempting to recover it.
It was a single-node setup:

systemctl restart mariadb
systemctl restart mariadb-columnstore

Generated at Thu Feb 08 02:49:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.