[MCOL-484] DBRMControllerNode FAILED , CONN_FAILURE and PROCESS_INIT_FAILURE Created: 2016-12-23  Updated: 2017-01-10  Resolved: 2017-01-09

Status: Closed
Project: MariaDB ColumnStore
Component/s: DDLProc, DMLProc, ExeMgr, MariaDB Server, writeengine
Affects Version/s: 1.0.6
Fix Version/s: 1.0.6

Type: Bug Priority: Major
Reporter: SANJAY SONTAKKE Assignee: David Hill (Inactive)
Resolution: Done Votes: 0
Labels: need_feedback
Environment: RHEL 7.1
Attachments: File columnstoreSupportReport.tar    
Issue Links:
Relates
relates to MCOL-396 AlarmConfig.xml file zero'd out when ... Closed

 Description   

Process             Module  Status           Last Status Change        Process ID
------------------  ------  ---------------  ------------------------  ----------
ProcessMonitor      pm1     ACTIVE           Fri Dec 23 12:07:34 2016  8590
ProcessManager      pm1     ACTIVE           Fri Dec 23 12:07:40 2016  8678
DBRMControllerNode  pm1     FAILED           Fri Dec 23 12:32:18 2016
ServerMonitor       pm1     ACTIVE           Fri Dec 23 12:32:21 2016  21483
DBRMWorkerNode      pm1     MAN_INIT         Fri Dec 23 15:38:49 2016  24937
DecomSvr            pm1     ACTIVE           Fri Dec 23 12:32:25 2016  21526
PrimProc            pm1     MAN_INIT         Fri Dec 23 15:39:04 2016  26834
ExeMgr              pm1     INITIAL
WriteEngineServer   pm1     INITIAL
DDLProc             pm1     INITIAL
DMLProc             pm1     INITIAL
mysqld              pm1     ACTIVE           Fri Dec 23 15:39:09 2016
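
For a quick triage of a status listing like the one above, the rows can be split on whitespace and filtered for anything that is not ACTIVE. The following is a minimal sketch, assuming each row has the process name, module, and status as its first three tokens, optionally followed by a five-token timestamp and a PID. The names `ProcessStatus`, `parse_status_line`, and `not_active` are hypothetical helpers, not part of ColumnStore.

```python
from typing import List, NamedTuple, Optional


class ProcessStatus(NamedTuple):
    process: str
    module: str
    status: str
    last_change: Optional[str]  # e.g. "Fri Dec 23 12:07:34 2016", if present
    pid: Optional[int]          # missing for processes that are not running


def parse_status_line(line: str) -> ProcessStatus:
    """Parse one whitespace-separated row of the process status listing."""
    fields = line.split()
    process, module, status = fields[:3]
    rest = fields[3:]
    # Assumption: the timestamp is always five tokens (weekday, month, day, time, year).
    last_change = " ".join(rest[:5]) if len(rest) >= 5 else None
    pid = int(rest[5]) if len(rest) >= 6 else None
    return ProcessStatus(process, module, status, last_change, pid)


def not_active(lines: List[str]) -> List[ProcessStatus]:
    """Return the rows whose status is anything other than ACTIVE."""
    return [p for p in map(parse_status_line, lines) if p.status != "ACTIVE"]


# A few rows copied from the listing in this report.
SAMPLE = """\
ProcessMonitor pm1 ACTIVE Fri Dec 23 12:07:34 2016 8590
DBRMControllerNode pm1 FAILED Fri Dec 23 12:32:18 2016
ExeMgr pm1 INITIAL
mysqld pm1 ACTIVE Fri Dec 23 15:39:09 2016
""".splitlines()

for p in not_active(SAMPLE):
    print(p.process, p.status)
```

Because the parser splits on runs of whitespace, it should accept both single-space and column-aligned versions of the listing.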

Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.961526 |0|0|0| D 18 CAL0000: Process location: /usr/local/mariadb/columnstore/bin/ExeMgr
Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.962403 |0|0|0| D 18 CAL0000: Dependent process of PrimProc/pm1 is 2
Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.963211 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral
Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.963285 |0|0|0| I 18 CAL0000: STARTALL: ACK back to ProcMgr, return status = 8
Dec 23 15:51:18 awsmltdbprd03 ProcessManager[8678]: 18.963401 |0|0|0| D 17 CAL0000: pm1 module failed to start!!
Dec 23 15:51:18 awsmltdbprd03 ProcessManager[8678]: 18.963453 |0|0|0| D 17 CAL0000: ACK received from 'pm1' Process-Monitor, return status = 8
Dec 23 15:51:18 awsmltdbprd03 ProcessManager[8678]: 18.964334 |0|0|0| D 17 CAL0000: Set Module pm1 State = 2
Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.964619 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_INIT
Dec 23 15:51:18 awsmltdbprd03 ProcessManager[8678]: 18.965026 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
Dec 23 15:51:19 awsmltdbprd03 ProcessMonitor[8590]: 19.282644 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: DBRMWorkerNode, old PID = 19299
Dec 23 15:51:19 awsmltdbprd03 ProcessMonitor[8590]: 19.282767 |0|0|0| D 18 CAL0000: StatusUpdate of Process DBRMWorkerNode State = 1 PID = 0
Dec 23 15:51:19 awsmltdbprd03 ProcessMonitor[8590]: 19.282845 |0|0|0| C 18 CAL0000: *****Process continually dying, stopped trying to restore it: DBRMWorkerNode
Dec 23 15:51:19 awsmltdbprd03 ProcessMonitor[8590]: 19.283183 |0|0|0| D 18 CAL0000: Send SET Alarm ID 13 on device DBRMWorkerNode
Dec 23 15:51:19 awsmltdbprd03 snmpmanager[8590]: 19.284337 |0|0|0| E 11 CAL0000: configAlarm error: Config::parseDoc: error parsing config file /usr/local/mariadb/columnstore/etc/AlarmConfig.xml
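
To pull only the most serious entries out of a log excerpt like this, the lines can be matched with a regular expression and filtered by the single-letter severity field. This is a rough sketch based only on the line layout visible above. Treating `C` and `E` as the severities worth surfacing is an assumption from how they are used in this excerpt, and `LOG_RE` and `critical_and_errors` are hypothetical names.

```python
import re

# Layout assumed from the excerpt:
# <syslog timestamp> <host> <process>[<pid>]: <seconds> |n|n|n| <sev> <subsystem> <code>: <message>
LOG_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]:\s+\S+\s+\|\d+\|\d+\|\d+\|\s+"
    r"(?P<sev>[A-Z])\s+(?P<subsys>\d+)\s+(?P<code>CAL\d+):\s*(?P<msg>.*)$"
)


def critical_and_errors(lines):
    """Return (process, severity, message) for lines whose severity is C or E."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m and m.group("sev") in ("C", "E"):
            hits.append((m.group("proc"), m.group("sev"), m.group("msg")))
    return hits


# Four lines copied from the excerpt in this report.
SAMPLE = """\
Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.963211 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral
Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.963285 |0|0|0| I 18 CAL0000: STARTALL: ACK back to ProcMgr, return status = 8
Dec 23 15:51:19 awsmltdbprd03 ProcessMonitor[8590]: 19.282845 |0|0|0| C 18 CAL0000: *****Process continually dying, stopped trying to restore it: DBRMWorkerNode
Dec 23 15:51:19 awsmltdbprd03 snmpmanager[8590]: 19.284337 |0|0|0| E 11 CAL0000: configAlarm error: Config::parseDoc: error parsing config file /usr/local/mariadb/columnstore/etc/AlarmConfig.xml
""".splitlines()

for proc, sev, msg in critical_and_errors(SAMPLE):
    print(sev, proc, msg)
```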



 Comments   
Comment by David Hill (Inactive) [ 2016-12-23 ]

Please ignore the previous comment.

Comment by David Hill (Inactive) [ 2016-12-23 ]

The logs report some errors that explain why the system doesn't come up, but I don't know what is causing them. The AlarmConfig.xml issue has been logged in an MCOL in the past.

Dec 22 21:43:03 awsmltdbprd03 ProcessMonitor[2831]: 03.800981 |0|0|0| E 11 CAL0000: configAlarm error: Config::parseDoc: error parsing config file /usr/local/mariadb/columnstore/etc/AlarmConfig.xml

Dec 22 21:47:12 awsmltdbprd03 ProcessManager[2937]: 12.100865 |0|0|0| E 17 CAL0000: line: 6136 sendMsgProcMon: ProcMon Msg timeout on module pm1
Dec 22 21:47:12 awsmltdbprd03 ProcessManager[2937]: 12.101002 |0|0|0| D 17 CAL0000: pm1 module failed to start!!

One item of note: this OS is RHEL 7.1. No certification has been done on this OS. It has been done on CentOS 7, which should be very similar, but RHEL 7 might still be related to the issue.

Comment by David Thompson (Inactive) [ 2016-12-24 ]

Sanjay, can you send us AlarmConfig.xml or let us know if it's empty?
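
One way to answer this is to check the file's size and whether it parses as XML. Below is a minimal sketch in Python, assuming the path shown in the log messages above. The function name `check_xml_config` is hypothetical and is not a ColumnStore tool.

```python
import os
import xml.etree.ElementTree as ET


def check_xml_config(path: str) -> str:
    """Classify a config file as 'missing', 'empty', 'unparseable', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        # A zero-byte file is the case asked about in this ticket.
        return "empty"
    try:
        ET.parse(path)
    except ET.ParseError:
        return "unparseable"
    return "ok"


# Path taken from the log messages in this report.
print(check_xml_config("/usr/local/mariadb/columnstore/etc/AlarmConfig.xml"))
```

This only tells whether the file is well-formed XML; it does not validate that the contents are a correct alarm configuration.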

Comment by SANJAY SONTAKKE [ 2016-12-24 ]

Hi David,
AlarmConfig.xml is empty. Is there any way to re-create it?

Comment by SANJAY SONTAKKE [ 2016-12-24 ]

RHEL 7 might not be the problem, because the database was working well with a large amount of data. I tried to load a table of approximately 20G with a column mismatch (which I found out about while checking warnings), and after that the database went down. It has not started since.

Comment by David Thompson (Inactive) [ 2016-12-24 ]

We've had a report of this happening before but have not been able to reproduce it. If this is a multi-node install, the following KB article outlines the procedure: https://mariadb.com/kb/en/mariadb/columnstore-configuration-file-redistribution/

However, this won't work if you are on a single server, which I think you are. In that case, probably the simplest solution is to rerun postConfigure and specify the same options. You'll need to shut down ColumnStore first, of course.

Comment by David Thompson (Inactive) [ 2016-12-24 ]

I don't think RHEL is the issue either. We've had plenty of other users using it, and CentOS is derived from RHEL.

Comment by David Thompson (Inactive) [ 2016-12-24 ]

Also, Sanjay, can you confirm how you did the very large load: cpimport, direct SQL, or LOAD DATA INFILE?

Comment by David Thompson (Inactive) [ 2016-12-24 ]

From the logs, I think you were using either LDI or a bulk insert on MariaDB Server, but please confirm. If you are able to use cpimport for this use case, it will be faster and scale better.

Comment by David Hill (Inactive) [ 2016-12-24 ]

The alarm config file is packaged in the release, so you could install the package on a different server and then copy the file to the server you are working on.
If this is a multi-node system, it can also be copied from another server.

Comment by David Thompson (Inactive) [ 2016-12-24 ]

Good catch. Yes, it seems like we don't generate a new copy during postConfigure.

Comment by SANJAY SONTAKKE [ 2016-12-25 ]

I am using LOAD DATA INFILE; in one case I used cpimport. The major part of the loading is done by LDI. I will switch over to cpimport once this DB is recovered.

Comment by SANJAY SONTAKKE [ 2016-12-26 ]

I installed ColumnStore on another machine with the same OS and copied AlarmConfig.xml to the main machine. Two of the major alarms are gone, but the critical CONN_FAILURE alarm still exists. I set <Logfile> to 'on', but still no detailed information is available.

Comment by SANJAY SONTAKKE [ 2016-12-26 ]

I executed "postConfigure", but the database is still not coming up. Is there any way that I could remove only one database and start the system? Right now I can't do anything because of that problem database/table.

Comment by David Thompson (Inactive) [ 2016-12-26 ]

What do getSystemStatus and getProcessStatus show?

Comment by David Hill (Inactive) [ 2016-12-26 ]

Since the connection issue is still occurring, have you checked whether the local firewall and SELinux are disabled? That generally comes into play on multi-node systems, but they should still be disabled on single-node installs. Check the firewall considerations in this document:

https://mariadb.com/kb/en/mariadb/preparing-for-columnstore-installation/

And if the system is still failing to come up due to a possible DB problem, then we suggest doing a fresh install and making a copy of what is currently there for us to investigate further, if you like.

Steps to make a copy and do a fresh install:

  1. ma shutdown y
  2. pkill mysqld // just to make sure it is down
  3. cd /usr/local/
  4. tar -zcvf mariadb.tar.gz mariadb
  5. cd /root
  6. uninstall rpm or binary package
  7. rm -rf /usr/local/mariadb
  8. reinstall package and run postConfigure again
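
Step 4 is meant to archive the existing install before step 7 removes it, so it is worth confirming that the archive exists and lists the expected contents before deleting anything. The same copy step can be sketched in Python with the standard `tarfile` module. This is an illustration only: `archive_directory` is a hypothetical helper, and the paths in the usage comment are the ones from the steps above.

```python
import os
import tarfile
from typing import List


def archive_directory(src_dir: str, archive_path: str) -> List[str]:
    """Create a gzip-compressed tar of src_dir and return the archived member names.

    archive_path should live outside src_dir so the archive does not include itself.
    """
    arcname = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=arcname)
    # Re-open the archive for reading to confirm what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Example usage with the paths from the steps above (run with ColumnStore stopped):
# names = archive_directory("/usr/local/mariadb", "/usr/local/mariadb.tar.gz")
# print(len(names), "entries archived")
```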

I believe this should get the system up and running again. You will need to rebuild the DB.
And if this all works, would you have an issue providing the mariadb.tar.gz to us for further analysis here?

David Hill

Comment by SANJAY SONTAKKE [ 2016-12-28 ]

Is there any way to recover that database? What precautions do I need to take beforehand so that I do not end up in a similar situation?

Comment by David Hill (Inactive) [ 2016-12-28 ]

Just to clarify, I didn't mean you have to do the clean install. You asked how to delete a DB, so I didn't know if this was what you were looking for.

To continue to investigate the issue:

1. Did you get a chance to check the firewall settings?
2. What is the current state of the system? Can you do the following and provide the status?

  1. ma shutdownsystem y
  2. ma startsystem // let it go as far as it can go
  3. ma getsystemi // provide output

3. If the problem continues, what do you think of the idea of tarring up the install so we can check it out here?

Comment by SANJAY SONTAKKE [ 2017-01-03 ]

I checked the firewall setting; turning it on or off makes no difference. Shutdownsystem also takes a long time and does not return. The system stops only with the "columnstore stop" command or a reboot.

There is business data in this database, so let me check with our security team whether we can share the complete data with you. It is also 111GB in size. How would we share something this large?

Comment by David Hill (Inactive) [ 2017-01-03 ]

Would it be possible to set up a shared access session to help debug the issues?

We use TeamViewer for that, but you may have other tools available in your company.

David Hill

Comment by SANJAY SONTAKKE [ 2017-01-04 ]

Yes, it's possible. We use GoTo Meeting. I will share the link with you. Please let me know the timings.

Rgds,

Sanjay

Comment by David Hill (Inactive) [ 2017-01-04 ]

Looks like the main issue will be the timing, due to our different locations.

I work and am available from 8 am to 6 pm CST here in the States. I could do later, after 6 pm, if needed.

So just let me know what works for you.

Thanks, David Hill

Comment by SANJAY SONTAKKE [ 2017-01-05 ]

Thnx... David,

17:30 to 19:00 India time. Please confirm.

Rgds,

Sanjay

Comment by Dipti Joshi (Inactive) [ 2017-01-05 ]

Sanjay:

17:30 India time is 6:00 am US Central time (CST). 6 am will be too early for David.

David is requesting between 8:00 am and 6:00 pm US CST, which is 19:30 to 05:30 India time.

Regards,
Dipti

Comment by SANJAY SONTAKKE [ 2017-01-06 ]

Dipti,

I can make time on Monday, 9 Jan 2017, around 20:45 India time, for 1 hr 30 mins.

Rgds,

Sanjay

Comment by David Hill (Inactive) [ 2017-01-06 ]

Let's plan for that: Monday, 9 Jan 2017, around 20:45 India time, which is 9:15 AM CST here.

Do you have a shared session tool you generally use? If not, we have a few.

David Hill

Comment by SANJAY SONTAKKE [ 2017-01-06 ]

We have a GoTo Meeting tool. I will share a link with you on Monday.

Let's aim to fix the problem in that same session. Making time at such odd hours is difficult.

Rgds,

Sanjay

Comment by David Hill (Inactive) [ 2017-01-06 ]

Agreed..

David Hill

Comment by SANJAY SONTAKKE [ 2017-01-09 ]

https://accelyaindia.webex.com/accelyaindia/j.php?MTID=m8e86f7185fb8fb513ae617a6be75294e

Please go through some quick points regarding your Web-Ex session:
1) You and your participants can access the link 10 minutes before the meeting starts. Whoever joins first will be the presenter by default.

2) Clicking the link redirects you to a webpage where you need to enter your name and email ID and click Join. If any fields other than these are shown, please close the web browser and restart the process by clicking the link again.

3) The next step will say "Starting Web-Ex" with the statement "Still having trouble? Run a temporary application to join this meeting immediately." You need to click on Run a temporary application so that an .exe file is downloaded. After downloading, click Run to install the downloaded .exe file.

4) After installation, Web-Ex will be launched on your system. This process of downloading and installation should not take more than 5 minutes.

Rgds,

Sanjay

Comment by David Hill (Inactive) [ 2017-01-09 ]

I'm currently connected on WebEx. Is the meeting going to take place?

David Hill

Comment by SANJAY SONTAKKE [ 2017-01-09 ]

yes, give me 5mins... Just connecting...

Comment by David Hill (Inactive) [ 2017-01-09 ]

Worked with the customer in an online session and got the system running again.

The startsystem was failing/hanging because the load_brm command was hanging at startup.
load_brm was trying to load the B copies of the dbrm files, which means that the system wasn't cleanly started in the past.
To clean it up, we did the following:
1. stopped the columnstore service
2. killed all remaining ColumnStore processes and the load_brm process
3. ran load_brm with the B copy, which actually worked. I thought it would hang.
4. ran save_brm to get back to using the main copy, not the A or B backup copy
5. ran startsystem

Comment by SANJAY SONTAKKE [ 2017-01-10 ]

Thnx... David,

You can close this call..

Rgds,

Sanjay

Comment by SANJAY SONTAKKE [ 2017-01-10 ]

Can we know how to avoid these scenarios? What are the A and B copies?

Rgds,

Sanjay

Generated at Thu Feb 08 02:21:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.