[MCOL-1169] Lost all columnstore tables after reboot Created: 2018-01-17  Updated: 2018-02-16  Resolved: 2018-02-06

Status: Closed
Project: MariaDB ColumnStore
Component/s: ProcMgr
Affects Version/s: 1.1.2
Fix Version/s: Icebox

Type: Bug Priority: Blocker
Reporter: Nicola Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

2 UM and 2 PM in Vmware environment with NFS


Attachments: File columnstoreSupportReport.dwh-cstore01.tar.gz    

 Description   

Hi all,
after executing RestartSystem, all tables using the ColumnStore engine lost their data (only the data; the table definitions remain), but the data still exists on the PM storage (at the extent level):
[root@dwh-cstore02 dbrm]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg_mysql-mysql_us_data 197G 33G 155G 18% /usr/local/mariadb/columnstore/mysql/db
172.16.30.215:/vol/vol_cstore01 300G 204G 97G 68% /usr/local/mariadb/columnstore/data1

In the debug.log i see this query:
Start SQL statement: select objectid,columnname from syscolumn where schema='c6msg' and tablename='c6_messaggio_in' --columnRIDs/FE; ||

But the syscolumn table does not exist in any of the databases (mysql, information_schema and infini_vtable).
It seems that the catalog mapping the ColumnStore tables has been lost, and consequently I can no longer find the data.
The table names are found, but without data.

I've uploaded the support report.
Thanks.
Regards.
Nicola



 Comments   
Comment by David Thompson (Inactive) [ 2018-01-19 ]

Hi, is there some additional information you can provide around the lead-up to this? It looks like you may have had some hardware issues on the 13th and the system failed over to pm2, but you have now recovered. The current state of the system seems to be that pm2 is still the primary PM node, but it can't find the DBRM and extent map files. It looks like you have set internal storage, so it should not have failed over, because pm2 can't mount pm1's data volume. I wonder: if you switch the active PM node back to pm1 using mcsadmin switchParentOAMModule and restart the system, would that work?

Comment by Nicola [ 2018-01-19 ]

mcsadmin> getProcessStatus
getprocessstatus Fri Jan 19 07:31:07 2018

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 ACTIVE Wed Jan 17 11:25:40 2018 46547
ProcessManager pm1 HOT_STANDBY Wed Jan 17 11:25:45 2018 46677
DBRMControllerNode pm1 COLD_STANDBY Wed Jan 17 11:25:51 2018
ServerMonitor pm1 ACTIVE Wed Jan 17 11:25:55 2018 47042
DBRMWorkerNode pm1 ACTIVE Wed Jan 17 11:25:56 2018 47055
DecomSvr pm1 ACTIVE Wed Jan 17 11:25:59 2018 47099
PrimProc pm1 ACTIVE Wed Jan 17 11:26:02 2018 47109
ExeMgr pm1 ACTIVE Wed Jan 17 11:26:06 2018 47168
WriteEngineServer pm1 ACTIVE Wed Jan 17 11:26:10 2018 47220
DDLProc pm1 COLD_STANDBY Wed Jan 17 11:26:11 2018
DMLProc pm1 COLD_STANDBY Wed Jan 17 11:26:11 2018
mysqld pm1 ACTIVE Wed Jan 17 11:25:57 2018 46950

ProcessMonitor pm2 ACTIVE Wed Jan 17 11:25:21 2018 27780
ProcessManager pm2 ACTIVE Wed Jan 17 11:25:27 2018 27929
DBRMControllerNode pm2 ACTIVE Wed Jan 17 11:25:49 2018 28865
ServerMonitor pm2 ACTIVE Wed Jan 17 11:25:50 2018 28913
DBRMWorkerNode pm2 ACTIVE Wed Jan 17 11:25:50 2018 28960
DecomSvr pm2 ACTIVE Wed Jan 17 11:25:54 2018 29123
PrimProc pm2 ACTIVE Wed Jan 17 11:25:57 2018 29204
ExeMgr pm2 ACTIVE Wed Jan 17 11:26:04 2018 30319
WriteEngineServer pm2 ACTIVE Wed Jan 17 11:26:08 2018 30476
DDLProc pm2 ACTIVE Wed Jan 17 11:26:12 2018 30668
DMLProc pm2 ACTIVE Wed Jan 17 11:26:16 2018 30803
mysqld pm2 ACTIVE Wed Jan 17 11:26:08 2018 28743
mcsadmin> switchParentOAMModule pm1
switchparentoammodule Fri Jan 19 07:31:20 2018

Switching to the Performance Module 'pm1'

This command switches the Active Parent OAM Module and should only be executed on an idle system.
Do you want to proceed: (y or n) [n]: y

Check for active transactions

Switch Active Parent OAM Module starting...

Switch Active Parent OAM to Module 'pm1', please wait...

        • switchParentOAMModule Failed : API Failure return in switchParentOAMModule API
          mcsadmin>

mcsadmin> getProcessStatus
getprocessstatus Fri Jan 19 07:33:01 2018

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 ACTIVE Wed Jan 17 11:25:40 2018 46547
ProcessManager pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
DBRMControllerNode pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
ServerMonitor pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
DBRMWorkerNode pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
DecomSvr pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
PrimProc pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
ExeMgr pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
WriteEngineServer pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
DDLProc pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
DMLProc pm1 MAN_OFFLINE Fri Jan 19 07:31:31 2018
mysqld pm1 MAN_OFFLINE Fri Jan 19 07:31:33 2018

ProcessMonitor pm2 ACTIVE Wed Jan 17 11:25:21 2018 27780
ProcessManager pm2 MAN_OFFLINE Fri Jan 19 07:31:53 2018
DBRMControllerNode pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
ServerMonitor pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
DBRMWorkerNode pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
DecomSvr pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
PrimProc pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
ExeMgr pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
WriteEngineServer pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
DDLProc pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
DMLProc pm2 MAN_OFFLINE Fri Jan 19 07:31:36 2018
mysqld pm2 MAN_OFFLINE Fri Jan 19 07:31:48 2018

switchParentOAMModule currently doesn't work.
What can I do now to recover the DBRM and extent map files?
Thanks.
Regards.
Nicola Battista

Comment by Nicola [ 2018-01-19 ]

[root@dwh-cstore02 ~]# tail -1000f /var/log/mariadb/columnstore/err.log
Jan 19 07:31:39 dwh-cstore02 controllernode[27929]: 39.691283 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Jan 19 07:31:42 dwh-cstore02 controllernode[17356]: 42.962138 |0|0|0| C 29 CAL0000: ExtentMap::save(): got request to save an empty BRM
Jan 19 07:31:42 dwh-cstore02 ProcessManager[27929]: 42.967786 |0|0|0| E 17 CAL0000: line: 6450 Error running DBRM save_brm
Jan 19 07:31:52 dwh-cstore02 ProcessMonitor[27780]: 52.830955 |0|0|0| E 18 CAL0000: unmount failed, device busy, DBRoot: 1
Jan 19 07:31:52 dwh-cstore02 oamcpp[27929]: 52.853620 |0|0|0| E 08 CAL0000: ERROR: unmount failed on dbroot1
Jan 19 07:31:52 dwh-cstore02 oamcpp[27929]: 52.853741 |0|0|0| E 08 CAL0000: ERROR: dbroot failed to unmount
Jan 19 07:31:52 dwh-cstore02 ProcessManager[27929]: 52.853824 |0|0|0| E 17 CAL0000: line: 8703 ERROR: manualMovePmDbroot Failed

Comment by Nicola [ 2018-01-19 ]

I've finally switched the parent OAM to pm1, but the tables are inaccessible.
I tried to create a new table:

Jan 19 09:26:04 dwh-cstore01 ExeMgr[5605]: 04.654544 |2147483664|0|0| D 16 CAL0042: End SQL statement
Jan 19 09:27:09 dwh-cstore01 ddlpackageproc[5924]: 09.562355 |16|3|0| D 23 CAL0041: Start SQL statement: CREATE TABLE `prova` (#015#012#011`Colonna` INT NULL#015#012)#015#012ENGINE=Columnstore;|c6msg|
Jan 19 09:27:09 dwh-cstore01 controllernode[5924]: 09.564509 |0|0|0| W 29 CAL0000: ExtentMap::getSysCatDBRoot(): OID not found: 1001
Jan 19 09:27:09 dwh-cstore01 ddlpackageproc[5924]: 09.564999 |16|3|0| D 23 CAL0042: End SQL statement
Jan 19 09:27:09 dwh-cstore01 ddlpackageproc[5924]: 09.566719 |0|0|0| E 23 CAL0009: Error while calling getSysCatDBRoot

Comment by David Hill (Inactive) [ 2018-01-19 ]

So from the report that you sent, it does look like the DBRM files are no longer there... Maybe a disk wasn't mounted when this was run and it's really showing what is on the local disk; can't tell.

Can you run the following command from both pm1 and pm2, and make sure that the disk, especially for dbroot1, is correctly mounted?

  1. ls -l /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/

and run from pm1

  1. mount
  2. cat /etc/fstab
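The two checks above (is the dbroot actually mounted, and does its dbrm directory contain anything?) can be scripted as below. This is a sketch only: the path is the default install location seen in this ticket, and `check_dbroot` is a hypothetical helper, not a ColumnStore tool.

```shell
# Hypothetical helper: verify that a dbroot path is an active mount and
# that its systemFiles/dbrm directory is non-empty.
# Usage: check_dbroot /usr/local/mariadb/columnstore/data1
check_dbroot() {
  local root="$1"
  # The dbroot must appear in the mount table; an unmounted path would
  # silently show the (empty) local directory underneath.
  if ! mount | grep -q " ${root} "; then
    echo "NOT_MOUNTED"
    return 1
  fi
  local dbrm="${root}/systemFiles/dbrm"
  # An empty dbrm directory means the BRM/extent map files are missing.
  if [ -d "${dbrm}" ] && [ -n "$(ls -A "${dbrm}" 2>/dev/null)" ]; then
    echo "OK"
  else
    echo "DBRM_EMPTY"
    return 1
  fi
}
```

Run on each PM; "NOT_MOUNTED" on pm2 would explain an empty `ls -l` there.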
Comment by David Hill (Inactive) [ 2018-01-19 ]

Actually, I did find the fstab in the report:

172.16.30.215:/vol/vol_cstore01 /usr/local/mariadb/columnstore/data1 nfs defaults,noatime,nodiratime,rw,hard,nointr,rsize=32768,wsize=32768,bg,nfsvers=4,tcp 0 0

NOTE: We don't recommend using NFS mounts for external storage because of the way the data is cached and stored to disk. That might be related, might not...

Comment by Nicola [ 2018-01-22 ]

Hi David,
[root@dwh-cstore01 ~]# ls -l /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/
total 2080
-rw-rw-rw- 1 root nobody      25 Jan 19 09:25 BRM_saves_journal
-rw-rw-rw- 1 root nobody 2099202 Jan 19 09:27 oidbitmap
-rw-r--r-- 1 root nobody 2099202 Jan 17 11:24 oidbitmap.bck
-rw-rw-r-- 1 root nobody      12 Jan 19 09:27 SMTxnID

[root@dwh-cstore01 ~]# mount
/dev/mapper/vg_mysql-mysql_us_data on /usr/local/mariadb/columnstore/mysql/db type ext4 (rw,noatime,nodiratime,data=ordered)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=1642276k,mode=700)
172.16.30.215:/vol/vol_cstore01 on /usr/local/mariadb/columnstore/data1 type nfs4 (rw,noatime,nodiratime,vers=4.0,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.16.30.139,local_lock=none,addr=172.16.30.215)

cat /etc/fstab on pm1 :
172.16.30.215:/vol/vol_cstore01 /usr/local/mariadb/columnstore/data1 nfs defaults,noatime,nodiratime,rw,hard,nointr,rsize=32768,wsize=32768,bg,nfsvers=4,tcp 0 0
172.16.30.215:/vol/vol_cstore02 /usr/local/mariadb/columnstore/data2 nfs defaults,noatime,nodiratime,rw,hard,nointr,rsize=32768,wsize=32768,bg,nfsvers=4,tcp 0 0

PM 2 :
[root@dwh-cstore02 ~]# ls -l /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/
total 0

Mount output :
/dev/mapper/vg_mysql-mysql_us_data on /usr/local/mariadb/columnstore/mysql/db type ext4 (rw,noatime,nodiratime,data=ordered)
tmpfs on /run/user/0 type tmpfs (rw,nosuid,nodev,relatime,size=1642276k,mode=700)
172.16.30.215:/vol/vol_cstore02 on /usr/local/mariadb/columnstore/data2 type nfs4 (rw,noatime,nodiratime,vers=4.0,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.16.30.140,local_lock=none,addr=172.16.30.215)

fstab output :
172.16.30.215:/vol/vol_cstore01 /usr/local/mariadb/columnstore/data1 nfs defaults,noatime,nodiratime,rw,hard,nointr,rsize=32768,wsize=32768,bg,nfsvers=4,tcp 0 0
172.16.30.215:/vol/vol_cstore02 /usr/local/mariadb/columnstore/data2 nfs defaults,noatime,nodiratime,rw,hard,nointr,rsize=32768,wsize=32768,bg,nfsvers=4,tcp 0 0

Comment by David Hill (Inactive) [ 2018-01-22 ]

Ok, that is not good. Your set of DBRM files is gone, and they can't be restored unless you were doing maintenance backups. If you were, then you can recover your full database from backups.

If you don't have any backups, then you will need to rebuild your database from scratch.
I don't know why they are gone; as mentioned, NFS is not a recommended filesystem to use.
But I can't say for sure that is the exact reason.

This is what the set of DBRM files should look like:

ll
total 2096
-rw-rw-rw- 1 root root    3436 Jan 17 19:38 BRM_savesA_em
-rw-rw-rw- 1 root root      12 Jan 17 19:38 BRM_savesA_vbbm
-rw-rw-rw- 1 root root       8 Jan 17 19:38 BRM_savesA_vss
-rw-rw-rw- 1 root root    3372 Jan 17 19:38 BRM_savesB_em
-rw-rw-rw- 1 root root      12 Jan 17 19:38 BRM_savesB_vbbm
-rw-rw-rw- 1 root root       8 Jan 17 19:38 BRM_savesB_vss
-rw-rw-r-- 1 root root      64 Jan 17 19:38 BRM_saves_current
-rw-rw-rw- 1 root root    3436 Jan 17 19:38 BRM_saves_em
-rw-rw-rw- 1 root root       0 Jan 17 19:38 BRM_saves_journal
-rw-rw-rw- 1 root root      12 Jan 17 19:38 BRM_saves_vbbm
-rw-rw-rw- 1 root root       8 Jan 17 19:38 BRM_saves_vss
-rw-rw-rw- 1 root root 2099202 Jan 17 19:38 oidbitmap
-rw-rw-r-- 1 root root      12 Jan 17 19:38 SMTxnID
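
A healthy dbrm directory can be compared against the file names in the listing above. The sketch below is illustrative only (the file-name list is taken from this ticket; `missing_dbrm_files` is a hypothetical helper, not a ColumnStore utility):

```shell
# Expected DBRM file set, per the healthy listing in this comment.
expected_dbrm_files="BRM_savesA_em BRM_savesA_vbbm BRM_savesA_vss \
BRM_savesB_em BRM_savesB_vbbm BRM_savesB_vss BRM_saves_current \
BRM_saves_em BRM_saves_journal BRM_saves_vbbm BRM_saves_vss \
oidbitmap SMTxnID"

# Print the expected DBRM files that are absent from the given directory.
# Usage: missing_dbrm_files /usr/local/mariadb/columnstore/data1/systemFiles/dbrm
missing_dbrm_files() {
  local dir="$1" f missing=""
  for f in $expected_dbrm_files; do
    [ -e "${dir}/${f}" ] || missing="${missing}${f} "
  done
  echo "${missing% }"   # empty output means the full set is present
}
```

Empty output means the DBRM set is complete; in this ticket's case almost every file would be reported missing.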

Comment by Nicola [ 2018-01-29 ]

Hi David,
in the current situation I cannot even create ColumnStore tables, because I get an error saying oidbitmap cannot be found.
Therefore I will have to reinstall the product and rebuild the database.
Could you also tell me the recommended filesystems (besides GlusterFS) for a multi-server architecture?
I had chosen NFS so that I did not have to worry about high availability, since it was handled at the storage level, without setting up GlusterFS and consuming more resources.

Thanks.
Regards.
Nicola Battista

Comment by David Hill (Inactive) [ 2018-01-29 ]

For internal storage with failover capabilities, Gluster is recommended.
For external storage, EXT2/EXT3/EXT4 SAN storage with shared mounting is recommended. But that can be a costly setup to invest in, so Gluster is often the best way to go if you have enough local storage to maintain multiple copies of the DBROOT data files.

Comment by Nicola [ 2018-02-06 ]

Hi David,
Thanks for your support. You can close the ticket.

Thanks again.
Best Regards.
Nicola Battista

Comment by David Hill (Inactive) [ 2018-02-06 ]

Closed per user's request.

Generated at Thu Feb 08 02:26:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.