[MCOL-4015] ExeMgr must re-establish its PrimProc connections. Created: 2020-05-22  Updated: 2021-01-25  Resolved: 2020-07-14

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: None
Fix Version/s: 1.5.3

Type: Task Priority: Major
Reporter: Roman Assignee: Gregory Dorman (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: test_docs.docx
Issue Links:
Relates
relates to MCOL-3729 ExeMgr and PrimProc reconnect Closed

 Description   

ExeMgr currently requires a HUP signal to re-establish its connections to PrimProcs.
ExeMgr must instead reconnect on the next command received from the FE, without an explicit HUP signal.
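
To illustrate the requested behavior, here is a minimal Python sketch of reconnect-on-demand: the connection is (re)opened lazily when the next request arrives, and one retry follows a failed attempt. It is not the ExeMgr code (which is C++); the class `LazyConnection`, its methods, and the fixed-length reply framing are all hypothetical names and simplifications chosen for the example.

```python
import socket


class LazyConnection:
    """Hypothetical wrapper: reconnects on the next request, no signal needed."""

    def __init__(self, host, port, timeout=5.0):
        self.host, self.port, self.timeout = host, port, timeout
        self.sock = None
        self.reconnects = 0  # counts every successful (re)connect

    def _ensure_connected(self):
        if self.sock is None:
            self.sock = socket.create_connection(
                (self.host, self.port), timeout=self.timeout)
            self.reconnects += 1

    def _drop(self):
        if self.sock is not None:
            try:
                self.sock.close()
            except OSError:
                pass
            self.sock = None

    def _recv_exact(self, n):
        buf = b""
        while len(buf) < n:
            chunk = self.sock.recv(n - len(buf))
            if not chunk:  # peer closed the connection
                raise ConnectionError("peer closed the connection")
            buf += chunk
        return buf

    def request(self, payload, reply_len):
        """Send payload and read reply_len bytes; reconnect and retry once on failure."""
        last_error = None
        for _ in range(2):
            try:
                self._ensure_connected()
                self.sock.sendall(payload)
                return self._recv_exact(reply_len)
            except OSError as exc:  # covers timeouts, resets, refused connects
                self._drop()
                last_error = exc
        raise ConnectionError("request failed after reconnect") from last_error
```

Retrying a request blindly is only safe when the request is idempotent; whether that holds for a given ExeMgr-to-PrimProc message is not something this sketch decides.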



 Comments   
Comment by Roman [ 2020-05-22 ]

Plz review.

Comment by Roman [ 2020-05-27 ]

While looking into the DEC code, I found that our whole ExeMgr-to-PrimProc state machine is fragile. For example, I was tinkering with iptables blocks to emulate custom network outages, and in one case ExeMgr got stuck in an indefinitely blocking read from the socket, waiting for the magic while the network traffic was blackholed. I didn't come up with an appropriate solution.
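
For reference, one common mitigation for a read that never returns is to put an overall deadline on it. The Python sketch below shows the idea only; it is not a proposed fix for the DEC code. `read_magic`, `deadline_s`, and the `MAGIC` value are hypothetical placeholders, not the real ColumnStore marker or API.

```python
import socket
import time

MAGIC = b"MAGC"  # hypothetical 4-byte marker, NOT the real ColumnStore value


def read_magic(sock, deadline_s=2.0):
    """Read len(MAGIC) bytes, giving up once deadline_s seconds have passed in total."""
    deadline = time.monotonic() + deadline_s
    buf = b""
    while len(buf) < len(MAGIC):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no magic received before the deadline")
        sock.settimeout(remaining)  # bound this recv by the time that is left
        try:
            chunk = sock.recv(len(MAGIC) - len(buf))
        except socket.timeout:
            raise TimeoutError("no magic received before the deadline") from None
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    if buf != MAGIC:
        raise ValueError("unexpected bytes where the magic was expected")
    return buf
```

With blackholed traffic no bytes and no FIN/RST ever arrive, so a read without a deadline has nothing to wake it up; the deadline turns that into an error the caller can handle, for example by dropping the connection and reconnecting.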

Comment by Patrick LeBlanc (Inactive) [ 2020-05-27 ]

I'm not entirely sure what the patch was for, or whether it corrects a problem QA can replicate and test. Roman, could you advise Daniel how to test this?

Comment by Daniel Lee (Inactive) [ 2020-06-02 ]

Yes, instructions or any info would be great. Thx

Comment by Roman [ 2020-06-03 ]

Ehhm. It doesn't look easily testable now. Here is the recipe though.

  • Set up two 1.5 single-node installs (nodes A and B).
  • Shut A and B down.
  • Take the config from a 1.4 two-node cluster and use it to replace the 1.5 configs on A and B. You need to replace the 1.4 nodes' IPs with the IPs of A and B; A's IP must go into DBRM_Controller.IpAddr.
  • Place a file named "module" into /var/lib/columnstore/local on each node, one containing pm1 and the other pm2.
  • Start A and B. Test that SELECT works.
  • Shut down secondary node B. Wait a minute. SELECT queries must now fail.
  • Start node B. Test that SELECT works.

Comment by Gregory Dorman (Inactive) [ 2020-06-19 ]

drrtuy, I am afraid Daniel cannot do it in that manner. We will first get Jose's procs.

I will attempt to do it myself, though. Are the packages in build 153 ready, and do they include this change?

Comment by Gregory Dorman (Inactive) [ 2020-06-21 ]

OK, I had limited success with this: not exactly as written, but it seems close (SELECTs work after PM2 shutdown, but CRUDs don't, and they start working again upon restart).

To make it work I had to inject poor man's synchronization (PM1 and PM2 waiting on each other's ExeMgrs, and on PM1's controllernode port (8616) to open). Columnstore.xml was hand-crafted (by comparing a one-node config to a two-node one).
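
For anyone repeating this, the "wait for a port to open" part of such synchronization can be sketched in a few lines of Python. This is an illustration, not the script that was actually used; `wait_for_port` and its parameters are hypothetical names, and only the port number 8616 comes from the setup above.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=0.5):
    """Poll until a TCP connect to host:port succeeds; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A start-up wrapper could call something like `wait_for_port("pm1", 8616)` before launching the dependent processes, where `pm1` stands in for whatever hostname or IP PM1 has in the test environment. An open port only shows that a listener exists, not that the service behind it is fully initialized.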

I did the end-to-end: docker build, start containers, push two-node .xml to both, push appropriate modules to both; called start-columnstore on both.

Lowlights: I still cannot make it start up with local storage for some reason (this is S3). Also, something is going funky on initialization; I had to restart both nodes to make them talk to each other.

But good so far.

Generated at Thu Feb 08 02:47:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.