[MCOL-781] Crash on SystemState and BRMShmImpl after Update to 1.09 Created: 2017-06-21  Updated: 2017-08-09  Resolved: 2017-08-09

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: None
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Christian2 Assignee: David Hill (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Red Hat Enterprise Linux Server release 7.3, single Server CS 1.09


Attachments: Zip Archive columnstoreSupportReport.columnstore-20170621.zip    

 Description   

Hello,

We are facing a lot of problems because of DBRM errors with a ColumnStore system created on 1.07. After the recent update from 1.07 via 1.08 (start and stop with no issues, including the Boost 1.57 installation) to 1.09, the system was available and answered some queries correctly, and creating a table worked as well. During an import from an existing table (cpimport), the system crashed in the same manner as it had on version 1.07.
We also tried some changes in Columnstore.xml, as well as clearShm and the load/save_brm options, but nothing helps.

Part of the log:

Jun 21 13:03:56 kmodekarlsap001 controllernode[64512]: 56.857546 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:03:57 kmodekarlsap001 ProcessMonitor[63488]: 57.244550 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: DBRMControllerNode, old PID = 63955
Jun 21 13:03:57 kmodekarlsap001 controllernode[64512]: 57.858125 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:03:57 kmodekarlsap001 controllernode[64512]: 57.858125 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:03:58 kmodekarlsap001 controllernode[64512]: 58.858710 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:03:58 kmodekarlsap001 controllernode[64512]: 58.858710 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:03:59 kmodekarlsap001 controllernode[64512]: 59.859228 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:03:59 kmodekarlsap001 controllernode[64512]: 59.859228 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:04:00 kmodekarlsap001 controllernode[64512]: 00.859836 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:04:00 kmodekarlsap001 controllernode[64512]: 00.859836 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:04:01 kmodekarlsap001 controllernode[64512]: 01.860425 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:04:01 kmodekarlsap001 controllernode[64512]: 01.860425 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jun 21 13:17:09 kmodekarlsap001 workernode[68482]: 09.661177 |0|0|0| C 30 CAL0000: BRMShmImpl::BRMShmImpl(): retrying on size==0
Jun 21 13:21:59 kmodekarlsap001 controllernode[68595]: 59.956067 |0|0|0| C 29 CAL0000: DBRM Controller: Network error reading from node 1. Reading response to command 6, length 13. 0 length response, possible time-out. Will see if retry is possible.
Jun 21 13:26:59 kmodekarlsap001 controllernode[68595]: 59.036969 |0|0|0| C 29 CAL0000: A node is unresponsive for cmd = 6, no reconfigure in at least 300 seconds. Setting read-only mode.
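
To see at a glance which messages dominate an excerpt like the one above, the lines can be grouped by process and message text. The following is only a sketch: the function name summarize_cal_log is hypothetical, and it assumes the syslog layout shown here (process name in the fifth field, message text after "CAL0000: ").

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ColumnStore: count log messages per
# process and message text. Assumes the syslog layout of the excerpt above.
summarize_cal_log() {
  awk '
    /CAL0000: / {
      proc = $5                      # e.g. controllernode[64512]:
      sub(/\[.*/, "", proc)          # drop the PID suffix
      msg = $0
      sub(/^.*CAL0000: /, "", msg)   # keep only the message text
      count[proc " | " msg]++
    }
    END { for (k in count) printf "%d\t%s\n", count[k], k }
  ' "$@" | sort -rn                  # most frequent message first
}
```

It reads the files given as arguments, or standard input when none are given, for example: summarize_cal_log /var/log/messages (the log path is an assumption and depends on the syslog configuration).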

The support output is attached as well.
Since this will be a production environment, we need a lasting solution to the issue.

Thanks a lot in advance



 Comments   
Comment by David Hill (Inactive) [ 2017-06-21 ]

From the logs, it does look like there is some DBRM file problem.
Possible resolutions:

1. Try one of the local copies and see if one of those works better.

Here are the dbrm files. The system is currently using the A version, as indicated by the BRM_saves_current file shown below. By date, the A version is the latest. B is an older version, but its em file is the same size. The remaining set (BRM_saves) is older still, and its em file is a different size. That is the version you could try bringing the system up with to see if that resolves things.

total 7584
-rwxr-xr-x. 1 root root 191 Jun 21 13:38 +
-rwxr-xr-x. 1 root root 52 May 11 17:30 backup_BRM_saves_current
-rwxr-xr-x. 1 root root 1866540 Jun 21 13:40 BRM_savesA_em
-rw-rw-rw-. 1 root root 36 Jun 21 13:40 BRM_savesA_vbbm
-rw-rw-rw-. 1 root root 8 Jun 21 13:40 BRM_savesA_vss
-rwxr-xr-x. 1 root root 1866540 Jun 21 13:37 BRM_savesB_em
-rw-rw-rw-. 1 root root 1620 Jun 21 13:37 BRM_savesB_vbbm
-rw-rw-rw-. 1 root root 2384 Jun 21 13:37 BRM_savesB_vss
-rw-rw-r--. 1 root root 65 Jun 21 13:40 BRM_saves_current
-rwxr-xr-x. 1 root root 1873004 Jun 21 13:35 BRM_saves_em
-rwxr-xr-x. 1 root root 0 Jun 21 13:40 BRM_saves_journal
-rw-rw-rw-. 1 root root 12 Jun 21 13:35 BRM_saves_vbbm
-rw-rw-rw-. 1 root root 8 Jun 21 13:35 BRM_saves_vss
-rwxr-xr-x. 1 root root 392 Jun 1 08:53 clean_dbrm.sh
drwxr-xr-x. 2 root root 4096 May 11 17:30 odi_save
-rwxr-xr-x. 1 root root 2099204 Jun 21 13:37 oidbitmap
-rw-rw-r--. 1 root root 12 Jun 21 13:46 SMTxnID
-rwxr-xr-x. 1 root root 4 Jun 21 13:40 tablelocks

# cat /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves_current

/usr/local/mariadb/columnstore/data1/systemFiles/dbrm/BRM_savesA
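
The comparison above (which save set is current, and how the em, vbbm and vss file sizes differ between sets) can be reproduced with a small helper. This is a sketch only: list_brm_sets is a hypothetical name, and it assumes the file naming visible in the listing (BRM_saves, BRM_savesA and BRM_savesB, each with _em, _vbbm and _vss parts).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ColumnStore: print the byte size of each
# BRM save-set file, then the save set named in BRM_saves_current.
list_brm_sets() {
  local dir=$1 prefix part f
  for prefix in BRM_saves BRM_savesA BRM_savesB; do
    for part in em vbbm vss; do
      f="$dir/${prefix}_${part}"
      if [ -f "$f" ]; then
        # wc -c may pad its output with spaces on some systems
        printf '%s\t%s\n' "${prefix}_${part}" "$(wc -c < "$f" | tr -d ' ')"
      fi
    done
  done
  printf 'current -> %s\n' "$(cat "$dir/BRM_saves_current")"
}
```

Called as list_brm_sets /usr/local/mariadb/columnstore/data1/systemFiles/dbrm, it only reads the files and changes nothing.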

You can try these steps to get it working with the differently sized version. Hopefully that version is good to use.

  1. cd /usr/local/mariadb/columnstore/data1/systemFiles/
  2. cp -r dbrm dbrm.backup
  3. ma shutdownsystem y
  4. /usr/local/mariadb/columnstore/bin/clearShm
  5. cd /usr/local/mariadb/columnstore/data1/systemFiles/
  6. mv dbrm dbrm.backup1
  7. cp -r dbrm.backup dbrm
  8. cd dbrm
  9. vi /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves_current
    change to
    /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves
  10. ma startsystem

Try this to see if the system stabilizes.
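
For repeatability, the numbered steps could be wrapped in a script along the following lines. This is a sketch under stated assumptions, not a tested procedure: it assumes that "ma" is the interactive alias for mcsadmin in the ColumnStore bin directory, the names run, switch_brm_set, CS_HOME, SYSFILES and DRY_RUN are all hypothetical, and it only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Sketch of the manual steps above; prints the commands unless DRY_RUN=0.
# Assumption: "ma" is an alias for $CS_HOME/bin/mcsadmin (aliases are not
# expanded in scripts, so the binary is called directly).
CS_HOME=${CS_HOME:-/usr/local/mariadb/columnstore}
SYSFILES=${SYSFILES:-$CS_HOME/data1/systemFiles}
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    printf 'would run: %s\n' "$*"
  else
    "$@" || { echo "failed: $*" >&2; return 1; }
  fi
}

# $1 is the save-set prefix to switch to, e.g. BRM_saves
switch_brm_set() {
  local target=$1
  run cp -r "$SYSFILES/dbrm" "$SYSFILES/dbrm.backup" &&
  run "$CS_HOME/bin/mcsadmin" shutdownsystem y &&
  run "$CS_HOME/bin/clearShm" &&
  run mv "$SYSFILES/dbrm" "$SYSFILES/dbrm.backup1" &&
  run cp -r "$SYSFILES/dbrm.backup" "$SYSFILES/dbrm" &&
  # replaces the "vi" edit: point BRM_saves_current at the chosen set
  run sh -c 'printf "%s\n" "$1" > "$2"' sh \
      "$SYSFILES/dbrm/$target" "$SYSFILES/dbrm/BRM_saves_current" &&
  run "$CS_HOME/bin/mcsadmin" startsystem
}
```

Running switch_brm_set BRM_saves prints the seven commands for review, and DRY_RUN=0 switch_brm_set BRM_saves would execute them, stopping at the first failure. The sketch does not guard against a dbrm.backup directory that already exists, in which case cp -r would copy into it rather than create it.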

2. If you take maintenance backups that include the dbrm files and all the data files, you would need to restore from those.

These are the two options for recovering from what looks to be a DBRM file issue.
