[MCOL-1369] DBRM goes to read only mode and restart fails with error wrong password on UM1 Created: 2018-04-26  Updated: 2019-12-04  Resolved: 2019-12-04

Status: Closed
Project: MariaDB ColumnStore
Component/s: DMLProc, PrimProc, ProcMgr
Affects Version/s: 1.0.11
Fix Version/s: Icebox

Type: Bug Priority: Blocker
Reporter: Abhinav santi Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Environment:

AWS AMI, CentOS 4PM, 1UM c4.8xlarge, Production



 Description   

Hello,

I have observed strange behavior with ColumnStore lately. When a DML statement fails, the PM1 module reports DBRM read-only mode and all queries on UM1 time out (SELECT queries too).
As suggested by the documentation, I tried restarting the system, but restart/stop/shutdown all failed with the error message "Wrong password on UM1", even though I have configured SSH keys for authentication and there is no password.

Thanks in Advance.



 Comments   
Comment by David Hill (Inactive) [ 2018-04-26 ]

The password issue is a bit strange. Let's check that out first.

Can you try this, just to test the SSH key between pm1 and um1?
On pm1:

mcsadmin> getsystemn
getsystemnetworkconfig Thu Apr 26 18:09:03 2018

System Network Configuration

Module Name  Module Description         NIC ID  Host Name            IP Address
-----------  -------------------------  ------  -------------------  ---------------
um1          User Module #1             1       i-0ee2d5aa54e673033  172.31.36.237
pm1          Performance Module #1      1       i-080375f167cce415c  172.31.33.49
pm2          Performance Module #2      1       i-0cb8609e45e3ebe9c  172.31.36.223

// run ssh providing the ip address of um1, like so

  ssh 172.31.36.237

// curious whether that works, and whether there is any password prompt or any failure
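
A minimal sketch of the same check as a script, assuming OpenSSH on pm1. The helper name check_key_login is hypothetical. BatchMode=yes makes ssh fail rather than prompt, so a fallback to password authentication shows up as FAIL instead of an interactive prompt:

```shell
# Hypothetical helper: test key-only SSH from this host to another module.
check_key_login() {
    host="$1"
    # BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
        echo "OK $host"
    else
        echo "FAIL $host"
        return 1
    fi
}
```

Run it on pm1 as the user that runs mcsadmin, for example check_key_login 172.31.36.237 for um1.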

And for startsystem, are you running it with any password, like

mcsadmin> startsystem

On the DML issue: that is a concern too. You are on an older build, 1.0.11. You might want to consider upgrading to the latest build, 1.1.4. Hopefully that will resolve the original problem of DML failing.

Comment by Abhinav santi [ 2018-04-26 ]

I can confirm that SSH works. I have been using pm1 to hop into other servers through ssh.
mcsadmin restartsystem is successful and systemstatus says all the modules are active, but UM1 queries fail. A server reboot is the only option that works now.

I can share the support tool tar with you if you'd like. Is there any file sharing system other than the open Jira?

Thanks

Comment by David Thompson (Inactive) [ 2018-05-07 ]

Hi abhinav.santi, were you able to resolve this? If you can ssh to um1 and pm2 from pm1, then you should not get this error. Assuming this is a root install, can you confirm that you verified SSH key access as root and that you ran mcsadmin as root from pm1? If any of this is off, that could explain the error.
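
One way to rule out a user mismatch, sketched under the assumption that the install user name is known (root for a root install). The helper name running_as is hypothetical; it only compares the current user against the expected one and does not touch ColumnStore:

```shell
# Hypothetical helper: confirm the current shell runs as the expected install user.
running_as() {
    expected="$1"
    current="$(id -un)"   # effective user name of this shell
    if [ "$current" = "$expected" ]; then
        echo "match: $current"
    else
        echo "mismatch: running as $current, expected $expected"
        return 1
    fi
}
```

For example, run running_as root on pm1 before starting mcsadmin; the user it reports should be the same one whose SSH key access to um1 and pm2 was verified.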

Also, on the failure: how are you running the DML, and does the client perform a rollback in the absence of a DML exception?

Comment by Abhinav santi [ 2018-05-22 ]

Hi, I installed from the AWS AMI. It is a non-root install with the user mariadb-user. I am able to ssh into the other instances as this user and run mcsadmin commands.
The DMLs are just plain table alter commands. I usually run them from the command line.

I saw another issue this morning. Your documentation suggests restarting the system after a DBRM read-only state, but both restart and stop from mcsadmin fail. I did a server reboot to get the system up.

Thanks

Comment by Andrew Hutchings (Inactive) [ 2019-12-04 ]

Closing as info was requested but none was provided.
