[MCOL-4533] Research columnstore handling disk I/O errors Created: 2021-02-04  Updated: 2023-07-01

Status: Open
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: None
Fix Version/s: Icebox

Type: Task Priority: Minor
Reporter: Gregory Dorman (Inactive) Assignee: Sergey Zefirov
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
PartOf
includes MCOL-4523 Mock file system subsystems to test d... Open
Relates
relates to MCOL-4566 rebuildEM utility must support compre... Closed
Sub-Tasks:
Key        Summary                                    Type      Status  Assignee
MCOL-4523  Mock file system subsystems to test d...  Sub-Task  Open    Sergey Zefirov

 Description   

It appears that in many cases, when an NFS I/O subsystem returns errors, ColumnStore logs a message but takes no further action and proceeds as if nothing had gone wrong. This policy naturally leads to serious problems such as database corruption and possibly incorrect query results. Some examples of what appears in some customers' logs:

  • controllernode[1000861]: 49.535374 |0|0|0| C 29 CAL0000: VSS::load(): No such file or directory
  • writeengine[1074762]: 18.480080 |0|92|0| I 19 CAL0080: Compression Handling: Compressed data does not fit, caused a chunk shifting
  • IDBFile[1211506]: 31.500512 |0|0|0| D 35 CAL0002: Failed to open file: /var/lib/columnstore/data1/systemFiles/dbrm/DMLLog_182_1, exception: unable to open Buffered file
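
To compare such messages across customer logs, it may help to split each line into fields first. The sketch below is only an illustration in Python: the regular expression is inferred from the three sample lines above, and the field names (`severity`, `subsystem`, `msg_id`) are guesses at what the columns mean, not something taken from the ColumnStore logging code.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the three sample lines only; field names are assumptions.
LOG_LINE = re.compile(
    r"^(?P<process>\w+)\[(?P<pid>\d+)\]:\s+"      # e.g. controllernode[1000861]:
    r"(?P<stamp>\d+\.\d+)\s+"                     # e.g. 49.535374
    r"\|\d+\|\d+\|\d+\|\s+"                       # e.g. |0|92|0| (meaning not assumed here)
    r"(?P<severity>[A-Z])\s+(?P<subsystem>\d+)\s+"  # e.g. C 29
    r"(?P<msg_id>CAL\d+):\s+(?P<text>.*)$"        # e.g. CAL0000: <message text>
)


class LogEntry(NamedTuple):
    process: str
    pid: int
    severity: str
    msg_id: str
    text: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the parsed fields, or None if the line does not match the pattern."""
    cleaned = line.strip().lstrip("•").strip()  # tolerate a leading bullet character
    m = LOG_LINE.match(cleaned)
    if m is None:
        return None
    return LogEntry(m["process"], int(m["pid"]), m["severity"], m["msg_id"], m["text"])
```

Grouping parsed entries by `msg_id` would show, for example, how many different failures are reported under a generic code such as CAL0000 or CAL0002.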

This task is ONLY (repeat - ONLY) about studying the code and producing a written document describing what the code does when something abnormal is detected. There should be NO development actions. Opinions on what to change are welcome but not required or expected at this stage.

Among other things, check whether the text of each error message is faithful to the event or error code received (there is a concern that the messages are insufficiently fine-grained and end up masking the underlying problem instead of illuminating it). But primarily:

  • What do we actually do in each case? (It looks like we just proceed and pray hard that nothing bad will happen, but this needs validation.)
  • As we go on, what are the possible consequences? Can we later on get a corrupted extent map and blow up the database? Can we corrupt S3 metadata? Can we start returning wrong results for SELECT statements in the case of "compressed data does not fit"?
  • Other?
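
For reference while reading the code, the two behaviours in question can be contrasted in a small sketch. This is Python for brevity and is not ColumnStore code: `load_log_and_proceed`, `load_or_fail`, and `StorageError` are hypothetical names, and the first function only models the suspected "log and carry on" pattern that this task is meant to confirm or refute.

```python
import errno
import os


class StorageError(Exception):
    """Hypothetical error type that preserves the operation, path, and errno."""

    def __init__(self, op: str, path: str, err: int):
        self.op = op
        self.path = path
        self.errno = err
        name = errno.errorcode.get(err, str(err))
        super().__init__(f"{op} failed for {path}: [{name}] {os.strerror(err)}")


def load_log_and_proceed(path: str, log: list) -> bytes:
    """Suspected pattern: log a generic message, then continue with empty data."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        log.append(f"Failed to open file: {path}")  # errno is dropped here
        return b""  # the caller cannot tell "empty file" from "I/O error"


def load_or_fail(path: str) -> bytes:
    """Alternative: surface the failure with the original error code attached."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError as e:
        raise StorageError("open", path, e.errno) from e
```

With the first function a caller that later writes state derived from the returned bytes has no way to know the read failed. With the second, the message carries the symbolic error name (for example ENOENT versus EIO), which bears on the question of whether the logged text is faithful to the error code received.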

Generated at Thu Feb 08 02:51:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.