Details

Type: Task
Status: Open
Priority: Minor
Resolution: Unresolved
Description
It appears that in many cases, when the NFS I/O subsystem returns an error, Columnstore logs a message but takes no further action and proceeds as if nothing bad had happened. This behaviour naturally leads to serious problems such as database corruption and possibly incorrect query results. Some examples of what appears in some customers' logs:
- controllernode[1000861]: 49.535374 |0|0|0| C 29 CAL0000: VSS::load(): No such file or directory
- writeengine[1074762]: 18.480080 |0|92|0| I 19 CAL0080: Compression Handling: Compressed data does not fit, caused a chunk shifting
- IDBFile[1211506]: 31.500512 |0|0|0| D 35 CAL0002: Failed to open file: /var/lib/columnstore/data1/systemFiles/dbrm/DMLLog_182_1, exception: unable to open Buffered file
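To make the concern concrete, here is a minimal, hypothetical C++ sketch (not ColumnStore source; logMessage, openLogged, and openChecked are invented for illustration) contrasting the "log and continue" pattern the excerpts above suggest with one that propagates the failure to the caller:

```
// Hypothetical illustration only -- not ColumnStore code.
#include <cerrno>
#include <cstring>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

// Assumed stand-in for the real logging facility.
static void logMessage(const std::string& msg)
{
    std::cerr << msg << std::endl;
}

// Pattern A: the failure is logged and swallowed; callers receive an unusable
// stream and may proceed as if the open had succeeded.
std::ifstream openLogged(const std::string& path)
{
    std::ifstream f(path);
    if (!f)
        logMessage("Failed to open file: " + path + ", exception: unable to open Buffered file");
    return f;  // nothing forces the caller to check f again
}

// Pattern B: the failure is surfaced to the caller, including errno, so the
// operation can be aborted or retried instead of silently continuing.
std::ifstream openChecked(const std::string& path)
{
    std::ifstream f(path);
    if (!f)
        throw std::runtime_error("open(" + path + ") failed: " + std::strerror(errno));
    return f;
}
```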
This task is ONLY (repeat - ONLY) about studying the code and producing a written document describing what the code does when something abnormal is detected. There should be NO development actions. Opinions on what to change are welcome but not required or expected at this stage.
Among other things, the document should cover whether the text of the error messages is faithful to the event or error code received (there is a fear that they are insufficiently fine-grained and end up masking the underlying problem instead of illuminating it). But primarily:
- What do we actually do in each case? (It looks like we just proceed and pray hard that nothing bad will happen, but this needs validation.)
- As we go on, what are the possible consequences? Can we later end up with a corrupted extent map and blow up the database? Can we corrupt S3 metadata? Can we start returning wrong results to SELECT statements in the "compressed data does not fit" case? (See the sketch after this list for the kind of failure path meant here.)
- Other?
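As one illustration of the failure path the consequences question is asking about, here is a small hypothetical C++ sketch (writeBlock is invented for this ticket, not a real ColumnStore function): if a short or failed write is only logged and success is reported anyway, the on-disk structure (extent map, compressed chunk, DML log) ends up shorter than its metadata claims, and the damage surfaces later as corruption or wrong results rather than as an I/O error at the point of failure.

```
// Hypothetical sketch -- not ColumnStore code.
#include <cstdio>
#include <vector>

bool writeBlock(std::FILE* f, const std::vector<char>& block)
{
    size_t written = std::fwrite(block.data(), 1, block.size(), f);
    if (written != block.size())
    {
        // If this branch merely logged and returned true, the file would be
        // silently truncated relative to the metadata; a later reader would
        // decode garbage or stale data with no further error reported.
        std::perror("short write");
        return false;  // whether failures are propagated like this is what the study should verify
    }
    return true;
}
```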