[MCOL-484] DBRMControllerNode FAILED , CONN_FAILURE and PROCESS_INIT_FAILURE Created: 2016-12-23 Updated: 2017-01-10 Resolved: 2017-01-09 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | DDLProc, DMLProc, ExeMgr, MariaDB Server, writeengine |
| Affects Version/s: | 1.0.6 |
| Fix Version/s: | 1.0.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | SANJAY SONTAKKE | Assignee: | David Hill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | need_feedback | ||
| Environment: |
RHEL 7.1 |
||
| Description |
|
Process Module Status Last Status Change Process ID Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.961526 |0|0|0| D 18 CAL0000: Process location: /usr/local/mariadb/columnstore/bin/ExeMgr |
| Comments |
| Comment by David Hill (Inactive) [ 2016-12-23 ] |
|
please ignore previous comment... |
| Comment by David Hill (Inactive) [ 2016-12-23 ] |
|
Some errors are reported in the logs that explain why the system doesn't come up, but I don't know what is causing them. The alarm.xml issue has been logged in an MCOL in the past.
Dec 22 21:43:03 awsmltdbprd03 ProcessMonitor[2831]: 03.800981 |0|0|0| E 11 CAL0000: configAlarm error: Config::parseDoc: error parsing config file /usr/local/mariadb/co
Dec 22 21:47:12 awsmltdbprd03 ProcessManager[2937]: 12.100865 |0|0|0| E 17 CAL0000: line: 6136 sendMsgProcMon: ProcMon Msg timeout on module pm1
One item of note: this OS is RHEL 7.1. No certification has been done on this OS; it has been done on CentOS 7, which should be very similar. But RHEL 7 might be related to the issue. |
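The log entries quoted above share one layout: a syslog timestamp, host, process name with PID, a sub-second stamp, three pipe-delimited fields, a one-letter severity, a numeric id, and a CAL code before the message. A minimal Python sketch for pulling the error entries out of such a log follows. It assumes that this layout holds for every line and that severity `E` marks an error; the names `LOG_RE` and `error_entries` are illustrative and not part of ColumnStore.

```python
import re

# Pattern inferred from the log lines quoted in this ticket (an assumption,
# not an official ColumnStore log specification).
LOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<process>\w+)\[(?P<pid>\d+)\]: "
    r"\S+ \|\d+\|\d+\|\d+\| "
    r"(?P<severity>[A-Z]) (?P<id>\d+) "
    r"(?P<code>CAL\d+): (?P<message>.*)$"
)


def error_entries(lines):
    """Return (process, message) pairs for entries whose severity is 'E'."""
    found = []
    for line in lines:
        match = LOG_RE.match(line.strip())
        # Lines that do not fit the assumed layout are skipped silently.
        if match and match.group("severity") == "E":
            found.append((match.group("process"), match.group("message")))
    return found


sample = [
    "Dec 22 21:47:12 awsmltdbprd03 ProcessManager[2937]: 12.100865 |0|0|0| "
    "E 17 CAL0000: line: 6136 sendMsgProcMon: ProcMon Msg timeout on module pm1",
    "Dec 23 15:51:18 awsmltdbprd03 ProcessMonitor[8590]: 18.961526 |0|0|0| "
    "D 18 CAL0000: Process location: /usr/local/mariadb/columnstore/bin/ExeMgr",
]

for process, message in error_entries(sample):
    print(process, "->", message)
```

Filtering by process name as well (for example only `ProcessManager`) would narrow the output further when a log is large.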
| Comment by David Thompson (Inactive) [ 2016-12-24 ] |
|
Sanjay, can you send us AlarmConfig.xml or let us know if it's empty. |
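A quick way to answer the question above without opening the file by hand is to check whether it exists, has any content, and parses as XML. The sketch below uses only the Python standard library; the function name `check_config` is illustrative, and the path must be supplied by the user because the full location of AlarmConfig.xml is not given in this ticket.

```python
import os
import xml.etree.ElementTree as ET


def check_config(path):
    """Classify an XML config file as 'missing', 'empty', 'unparseable' or 'ok'."""
    if not os.path.isfile(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    try:
        ET.parse(path)
    except ET.ParseError:
        # Covers truncated or otherwise malformed XML.
        return "unparseable"
    return "ok"


# Hypothetical relative path; replace it with the real location on the server.
print(check_config("AlarmConfig.xml"))
```

A result of "empty" or "unparseable" would be consistent with the `Config::parseDoc` parsing error seen in the logs, although this check alone does not prove that the alarm file is the cause.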
| Comment by SANJAY SONTAKKE [ 2016-12-24 ] |
|
Hi David, |
| Comment by SANJAY SONTAKKE [ 2016-12-24 ] |
|
RHEL 7 might not be the problem, because the database was working well with a large amount of data. I tried to load a table of approximately 20 GB with a column mismatch (which I discovered while checking the warnings), and after that the database went down. It has not started since. |
| Comment by David Thompson (Inactive) [ 2016-12-24 ] |
|
We've had a report of this happening before but have not been able to reproduce it. If this is a multi-node install, the following KB article outlines the procedure: https://mariadb.com/kb/en/mariadb/columnstore-configuration-file-redistribution/ However, this won't work on a single server, which I think is your setup. In that case, probably the simplest solution is to rerun postConfigure and specify the same options. You'll need to shut down ColumnStore first, of course. |
| Comment by David Thompson (Inactive) [ 2016-12-24 ] |
|
I don't think RHEL is the issue either; we've had plenty of other users on it, and CentOS is derived from RHEL. |
| Comment by David Thompson (Inactive) [ 2016-12-24 ] |
|
Also, Sanjay, can you confirm how you did the very large load: cpimport, direct SQL, or LOAD DATA INFILE? |
| Comment by David Thompson (Inactive) [ 2016-12-24 ] |
|
From the logs I think you were using either LDI or a bulk insert on MariaDB Server, but please confirm. If you can use cpimport for this use case, it will be faster and scale better. |
| Comment by David Hill (Inactive) [ 2016-12-24 ] |
|
The alarm config file is packaged in the release, so you could install the package on a different server and then copy the file to the server you are working on. |
| Comment by David Thompson (Inactive) [ 2016-12-24 ] |
|
Good catch. Yes, it seems we don't generate a new copy during postConfigure. |
| Comment by SANJAY SONTAKKE [ 2016-12-25 ] |
|
I am using LOAD DATA INFILE; in one case I used cpimport. The major part of the loading is done by LDI. I will switch over to cpimport once this DB is recovered. |
| Comment by SANJAY SONTAKKE [ 2016-12-26 ] |
|
I installed ColumnStore on another machine with the same OS and copied AlarmConfig.xml to the main machine. Two of the major alarms are gone, but the critical CONN_FAILURE alarm still exists. I set <Logfile> to 'on', but still no detailed information is available. |
| Comment by SANJAY SONTAKKE [ 2016-12-26 ] |
|
I executed "postConfigure", but the database is still not coming up. Is there any way that I could remove only one database and start the system? Right now I can't do anything because of that problem database/table. |
| Comment by David Thompson (Inactive) [ 2016-12-26 ] |
|
What does getSystemStatus and getProcessStatus show? |
| Comment by David Hill (Inactive) [ 2016-12-26 ] |
|
Since the connection issue is still occurring, have you checked whether the local firewall and SELinux are disabled? That generally comes into play on multi-node systems, but they should still be disabled on single-node installs. Check the firewall considerations in this document: https://mariadb.com/kb/en/mariadb/preparing-for-columnstore-installation/ If the system is still failing to come up due to a possible DB problem, then we suggest doing a fresh install and, if you like, making a copy of what is currently there for us to investigate further. Steps for making a copy and doing a fresh install:
I believe this should get the system up and running again. You will need to rebuild the DB. David Hill |
| Comment by SANJAY SONTAKKE [ 2016-12-28 ] |
|
Is there any way to recover that database? What precautions do I need to take so that I do not end up in a similar situation again? |
| Comment by David Hill (Inactive) [ 2016-12-28 ] |
|
Just to clarify, I didn't mean you have to do the clean install. You asked how to delete a DB, so I didn't know if this was what you were looking for. To continue investigating the issue:
1. Did you get a chance to check the firewall settings?
3. If the problem continues, what do you think of the idea of tarring up the install so that we can check it out here? |
| Comment by SANJAY SONTAKKE [ 2017-01-03 ] |
|
I checked the firewall setting; keeping it on or off makes no difference. shutdownsystem also takes a long time and does not return. The system stops only with the command "columnstore stop" or a reboot, and nothing else. There is business data in this database, so let me check with our security team whether we can share the complete data with you. It is also 111 GB in size, which is quite large. How would we share it? |
| Comment by David Hill (Inactive) [ 2017-01-03 ] |
|
Would it be possible to set up a shared access session to help debug the issues? We use TeamViewer for that, but you may have other tools available in your company. David Hill |
| Comment by SANJAY SONTAKKE [ 2017-01-04 ] |
|
Yes, it's possible; we use GoToMeeting. I will share a link with you. Please let me know the timings. Rgds, Sanjay |
| Comment by David Hill (Inactive) [ 2017-01-04 ] |
|
It looks like the main issue will be the timing, due to our different locations. I work and am available 8 am to 6 pm CST here in the States. I could do later, after 6, if needed. Just let me know what works for you. Thanks, David Hill |
| Comment by SANJAY SONTAKKE [ 2017-01-05 ] |
|
Thanks, David. 17:30 to 19:00 India time. Please confirm. Rgds, Sanjay |
| Comment by Dipti Joshi (Inactive) [ 2017-01-05 ] |
|
Sanjay: 17:30 India time is 6:00 am US Central time (CST), which will be too early for David. David is requesting a time between 8:00 am and 6:00 pm US CST, which is 19:30 to 05:30 India time. Regards, |
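The conversions in the comment above can be checked with a few lines of Python. This is a sketch using fixed offsets, assuming India Standard Time is UTC+5:30 and US Central Standard Time is UTC-6 (no daylight saving in January); the function names and the example date are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets (assumptions stated above); no time zone database is needed.
IST = timezone(timedelta(hours=5, minutes=30), "IST")
CST = timezone(timedelta(hours=-6), "CST")


def ist_to_cst(year, month, day, hour, minute):
    """Convert an India wall-clock time to US Central Standard Time."""
    return datetime(year, month, day, hour, minute, tzinfo=IST).astimezone(CST)


def cst_to_ist(year, month, day, hour, minute):
    """Convert a US Central Standard wall-clock time to India time."""
    return datetime(year, month, day, hour, minute, tzinfo=CST).astimezone(IST)


# Proposed 17:30 India time, seen from US Central.
print(ist_to_cst(2017, 1, 5, 17, 30).strftime("%H:%M %Z"))           # 06:00 CST
# Availability window of 8:00 am to 6:00 pm CST, seen from India.
print(cst_to_ist(2017, 1, 5, 8, 0).strftime("%Y-%m-%d %H:%M %Z"))    # 2017-01-05 19:30 IST
print(cst_to_ist(2017, 1, 5, 18, 0).strftime("%Y-%m-%d %H:%M %Z"))   # 2017-01-06 05:30 IST
```

Note that the end of the window falls on the next calendar day in India, which is why the range reads 19:30 to 05:30.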
| Comment by SANJAY SONTAKKE [ 2017-01-06 ] |
|
Dipti, I can make time on Monday, 9 Jan 2017, around 20:45 India time, for 1 hr 30 min. Rgds, Sanjay |
| Comment by David Hill (Inactive) [ 2017-01-06 ] |
|
Let's plan for that: Monday, 9 Jan 2017, around 20:45 India time, which is 9:15 am CST here. Do you have a shared session tool you generally use? If not, we have a few. David Hill |
| Comment by SANJAY SONTAKKE [ 2017-01-06 ] |
|
We have a GoToMeeting tool. I will share a link with you on Monday. Let's aim to fix the problem in that same session; making time at such odd hours is difficult. Rgds, Sanjay |
| Comment by David Hill (Inactive) [ 2017-01-06 ] |
|
Agreed.. David Hill |
| Comment by SANJAY SONTAKKE [ 2017-01-09 ] |
|
https://accelyaindia.webex.com/accelyaindia/j.php?MTID=m8e86f7185fb8fb513ae617a6be75294e
Please go through some quick points regarding your Web-Ex session:
2) Clicking the link redirects you to a web page where you need to enter your name and email ID and click Join. If any fields other than these appear, please wait, close the web browser, and restart the process by clicking the link again.
3) The next step will say "Starting Web-Ex" with the statement "Still having trouble? Run a temporary application to join this meeting immediately." Click "Run a temporary application" so that an .exe file is downloaded. After downloading, click Run to install the downloaded .exe file.
4) After installation, Web-Ex will launch on your system. The download and installation should not take more than 5 minutes.
Rgds, Sanjay |
| Comment by David Hill (Inactive) [ 2017-01-09 ] |
|
I'm currently connected on WebEx. Is the meeting going to take place? David Hill |
| Comment by SANJAY SONTAKKE [ 2017-01-09 ] |
|
yes, give me 5mins... Just connecting... |
| Comment by David Hill (Inactive) [ 2017-01-09 ] |
|
Worked with customer on an online session and got the system running again. The startsystem was failing/hanging because the load_brm command was hanging at startup. |
| Comment by SANJAY SONTAKKE [ 2017-01-10 ] |
|
Thanks, David. You can close this call. Rgds, Sanjay |
| Comment by SANJAY SONTAKKE [ 2017-01-10 ] |
|
Can you tell us how to avoid these scenarios? What are the A and B copies? Rgds, Sanjay |