Details

Type: Task
Status: Open
Priority: Minor
Resolution: Unresolved
Description
It appears that in many cases, when the NFS I/O subsystem returns an error, Columnstore logs a message but takes no further action and proceeds as if nothing bad had happened. This behaviour naturally leads to serious problems such as database corruption and possibly incorrect query results. Some examples of what appears in some customers' logs:
- controllernode[1000861]: 49.535374 |0|0|0| C 29 CAL0000: VSS::load(): No such file or directory
- writeengine[1074762]: 18.480080 |0|92|0| I 19 CAL0080: Compression Handling: Compressed data does not fit, caused a chunk shifting
- IDBFile[1211506]: 31.500512 |0|0|0| D 35 CAL0002: Failed to open file: /var/lib/columnstore/data1/systemFiles/dbrm/DMLLog_182_1, exception: unable to open Buffered file
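To make the concern concrete, here is a minimal, hypothetical C++ sketch (not ColumnStore source; logMessage, openLogged, and openChecked are invented for illustration) contrasting the "log and continue" pattern the excerpts above suggest with one that propagates the failure to the caller:

```
// Hypothetical illustration only -- not ColumnStore code.
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Assumed stand-in for the real logging facility.
static void logMessage(const std::string& msg)
{
    std::cerr << msg << std::endl;
}

// Pattern A: the failure is logged and swallowed; callers receive an unusable
// stream and may proceed as if the open had succeeded.
std::ifstream openLogged(const std::string& path)
{
    std::ifstream f(path);
    if (!f)
        logMessage("Failed to open file: " + path + ", exception: unable to open Buffered file");
    return f;  // nothing forces the caller to check f again
}

// Pattern B: the failure is surfaced to the caller, including errno, so the
// operation can be aborted or retried instead of silently continuing.
std::ifstream openChecked(const std::string& path)
{
    std::ifstream f(path);
    if (!f)
        throw std::runtime_error("open(" + path + ") failed: " + std::strerror(errno));
    return f;
}
```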
This task is ONLY (repeat - ONLY) about studying the code and producing a written document describing what the code does when something abnormal is detected. There should be NO development actions. Opinions on what to change are welcome but not required or expected at this stage.
Among other things, the document should cover whether the text of the error messages is faithful to the event or error code received (there is a fear that they are insufficiently fine-grained and end up masking the underlying problem instead of illuminating it). But primarily:
- What do we actually do in each case? (It looks like we just proceed and pray hard that nothing bad will happen, but this needs validation.)
- As we go on, what are the possible consequences? Can we later end up with a corrupted extent map and blow up the database? Can we corrupt S3 metadata? Can we start returning wrong results to SELECT statements in the "compressed data does not fit" case? (See the sketch after this list for the kind of failure path meant here.)
- Other?
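As one illustration of the failure path the consequences question is asking about, here is a small hypothetical C++ sketch (writeBlock is invented for this ticket, not a real ColumnStore function): if a short or failed write is only logged and success is reported anyway, the on-disk structure (extent map, compressed chunk, DML log) ends up shorter than its metadata claims, and the damage surfaces later as corruption or wrong results rather than as an I/O error at the point of failure.

```
// Hypothetical sketch -- not ColumnStore code.
#include <cstdio>
#include <vector>

bool writeBlock(std::FILE* f, const std::vector<char>& block)
{
    size_t written = std::fwrite(block.data(), 1, block.size(), f);
    if (written != block.size())
    {
        // If this branch merely logged and returned true, the file would be
        // silently truncated relative to the metadata; a later reader would
        // decode garbage or stale data with no further error reported.
        std::perror("short write");
        return false;  // whether failures are propagated like this is what the study should verify
    }
    return true;
}
```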