[MCOL-1582] System failing to start up, DBRM controller node failing to start, problem with tablelock file Created: 2018-07-23  Updated: 2022-11-05  Resolved: 2022-11-05

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.1.5
Fix Version/s: Icebox

Type: New Feature Priority: Major
Reporter: David Hill (Inactive) Assignee: Todd Stoffel (Inactive)
Resolution: Won't Do Votes: 0
Labels: None
Environment:

1um 4 pm root install, but running commands as non-root



 Description   

Customer's system is failing to start up.

From the customer site: the error shows the DBRM controller node failing to start, probably while processing an empty tablelock file.

A possible enhancement: automatically handle this DBRM table lock issue, where the lock file is zero bytes long and the controller node can't handle that. Perhaps the file could be deleted to fix the issue automatically.

jaylenv@hexagon-flynn-01$sudo ./controllernode fg
Locale is : C
basic_ios::clear: iostream error... attempt #1/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #2/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #3/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #4/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #5/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #6/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #7/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #8/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #9/10 to restart the DBRM controller node
basic_ios::clear: iostream error... attempt #10/10 to restart the DBRM controller node
failed to notify OAM of server failure
Exiting...
jaylenv@hexagon-flynn-01$log
jaylenv@hexagon-flynn-01$tail -10 debug.log
Jul 23 07:21:09 hexagon-flynn-01 ProcessMonitor[13945]: 09.042353 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = ACTIVE
Jul 23 07:21:13 hexagon-flynn-01 controllernode[37019]: 13.067158 |0|0|0| D 29 CAL0000: TableLockServer::load(): could not load save file /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/tablelocks loaded 0/32764 entries
Jul 23 07:21:13 hexagon-flynn-01 controllernode[37019]: 13.067416 |0|0|0| C 29 CAL0000: basic_ios::clear: iostream error... attempt #10/10 to restart the DBRM controller node

---------------------------------------------------------------------------------

The fix was to delete the 'tablelocks' file in the dbrm directory; controllernode started successfully after that.



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2018-07-23 ]

There are multiple things we need to do here:

  1. TableLockServer::save() doesn't flush or close the file, so it could easily write a 0-byte file.
  2. TableLockServer::save() also doesn't delete the file if an exception occurs.
  3. TableLockServer::load() blindly assumes the file is >= 4 bytes; this needs checking (and a different error) to see if the file is empty. This is also causing the count in the error message to be incorrect.
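
The three points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual TableLockServer code: saveLocks, loadLocks and EmptySaveFile are hypothetical names, and the layout (a 4-byte count followed by 4-byte entries) is invented for the example. Writing to a temporary file and renaming it is an extra precaution not mentioned in the ticket, and it relies on POSIX rename semantics.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, distinct error for a zero-byte save file (point 3).
struct EmptySaveFile : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for save(): assumed layout is count + entries.
void saveLocks(const std::string& path, const std::vector<std::uint32_t>& locks)
{
    const std::string tmp = path + ".tmp";
    try
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out)
            throw std::runtime_error("saveLocks: cannot open " + tmp);
        const std::uint32_t count = static_cast<std::uint32_t>(locks.size());
        out.write(reinterpret_cast<const char*>(&count), sizeof(count));
        if (!locks.empty())
            out.write(reinterpret_cast<const char*>(locks.data()),
                      static_cast<std::streamsize>(locks.size() * sizeof(std::uint32_t)));
        out.flush();  // point 1: push buffered bytes out explicitly
        out.close();  // point 1: close here so a failure is detected
        if (out.fail())
            throw std::runtime_error("saveLocks: write failed for " + tmp);
        // POSIX rename replaces the old file only after a complete write.
        if (std::rename(tmp.c_str(), path.c_str()) != 0)
            throw std::runtime_error("saveLocks: cannot rename " + tmp);
    }
    catch (...)
    {
        std::remove(tmp.c_str());  // point 2: leave no partial file behind
        throw;
    }
}

// Hypothetical stand-in for load(): check the size before reading the count.
std::vector<std::uint32_t> loadLocks(const std::string& path)
{
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in)
        throw std::runtime_error("loadLocks: cannot open " + path);
    const std::streamoff size = in.tellg();
    const std::streamoff entrySize = sizeof(std::uint32_t);
    if (size == 0)
        throw EmptySaveFile("loadLocks: save file is empty: " + path);
    if (size < entrySize)
        throw std::runtime_error("loadLocks: save file is truncated: " + path);

    in.seekg(0);
    std::uint32_t count = 0;
    in.read(reinterpret_cast<char*>(&count), sizeof(count));
    // Reject a count that promises more entries than the file holds.
    if (size < entrySize + static_cast<std::streamoff>(count) * entrySize)
        throw std::runtime_error("loadLocks: entry count exceeds file size: " + path);

    std::vector<std::uint32_t> locks(count);
    if (count != 0)
        in.read(reinterpret_cast<char*>(locks.data()),
                static_cast<std::streamsize>(count * sizeof(std::uint32_t)));
    if (!in)
        throw std::runtime_error("loadLocks: read failed: " + path);
    return locks;
}
```

With a separate exception type, a caller can tell an empty file apart from a corrupt one and decide whether to start with no locks or refuse to start.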
Comment by Todd Stoffel (Inactive) [ 2022-11-05 ]

This item is being closed because it was well past its expiration date with no activity. If you suspect this was done in error, please create a new ticket.

Generated at Thu Feb 08 02:29:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.