[MCOL-1853] ColumnStore breaks on S3 NFS mounts Created: 2018-11-03  Updated: 2021-02-20  Resolved: 2021-02-20

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: None
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Martin Adamec Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None

Attachments: File columnstoreSupportReport.csdb-pg1.tar.gz     File columnstoreSupportReport.cstore-pg1c.tar.gz     File columnstoreSupportReport_20181104.csdb-pg1.tar.gz    
Issue Links:
PartOf
is part of MCOL-1858 An `invalid` records indication when ... Closed
is part of MCOL-3392 select * from table : Internal error:... Closed

 Description   

When running the query:

MariaDB [tradealert]> SELECT sio.`id` AS option_id, o.`date` AS market_date, o.`time`, o.`size`, o.`price`, o.`spot`, o.open_int, o.theo, o.ivol, o.delta, o.vega, o.theta, o.`secid` FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-01-19' limit 10;

and this is what I got back:

ERROR 1815 (HY000): Internal error: IDB-2039: Data file does not exist, please contact your system administrator for more information.
I expected this amount of records to come back (well only 10 as I limited it but at least they would travel from PM to UM and I would know that it is OK):

MariaDB [tradealert]> SELECT count(1) FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-01-19';
----------

count(1)

----------

1501056

----------
1 row in set (3.22 sec)

In the stretch of all market dates of Q1 2018 it happened on Jan 19., Feb 5., Feb. 21. and from Mar 5. all the way through Mar 29. All other queries for all other days in this stretch (across the Q1) returned valid and full resultsets.

Attached is zipped support report file.

We also ran redistributeData but there was not much to distribute because of the configuration. So it ran for 10 ms and it did not help.
after that we created 2 extra DBROOTS and then re-run redistributeData - this time successfully but it did not help resolving an issue.



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2018-11-03 ]

The problem here is that you are using NFS mounts and particularly S3 NFS mounts do something very odd with file times that break ColumnStore. I highly recommend you do not use NFS for any database storage backend.

For eng team:

Config::checkReload() is using mtime which S3 NFS mounts often set to zero. Therefore the config never gets loaded in and the DBRoot part is empty upon file access. We should maybe do a checksum based test instead? Or at least don't rely on mtime being correct on a shared/distributed FS.

Comment by John Dutchover [ 2018-11-03 ]

So that we may move forward, I have done away with the EFS file systems and setup EBS (io1) volumes for dbroots 1,3 (pm1) and dbroots 2,4 (pm2).

Comment by Martin Adamec [ 2018-11-04 ]

Andrew,
Please help me figuring out what could be still wrong with our installation. We rebuilt the storage, I removed databases running drop database xyz from UM1, recreated them, recreated tables and started loading data from scratch.

There was one "weirdness" I noticed at the very initial loading. mcsimport reported in few cases quite a lot of invalid records. here is a copy of the reported result:
Execution time: 0.817828s
Rows inserted: 10064
Truncation count: 0
Saturated count: 0
Invalid count: 0
cat calendar-datadock_2018_Q1.csv | wc -l
10064

Execution time: 0.161977s
Rows inserted: 2077
Truncation count: 0
Saturated count: 0
Invalid count: 1
cat groups-datadock_2018_Q1.csv | wc -l
2077

Execution time: 4281.18s
Rows inserted: 87483118
Truncation count: 0
Saturated count: 0
Invalid count: 32809561
cat onelots-datadock_2018_Q1.csv | wc -l
87483118

Execution time: 1043.61s
Rows inserted: 53818130
Truncation count: 0
Saturated count: 0
Invalid count: 38551020
cat open_interest-datadock_2018_Q1.csv | wc -l
53818130

Execution time: 0.481345s
Rows inserted: 17387
Truncation count: 0
Saturated count: 0
Invalid count: 159
cat optimal_ex-datadock_2018_Q1.csv | wc -l
17387

Execution time: 2718.69s
Rows inserted: 101144576
Truncation count: 0
Saturated count: 0
Invalid count: 146538555
cat rates-datadock_2018_Q1.csv | wc -l
101144576

Execution time: 91.0303s
Rows inserted: 4278121
Truncation count: 0
Saturated count: 0
Invalid count: 34791
cat securities-datadock_2018_Q1.csv | wc -l
4278121

Execution time: 38.2951s
Rows inserted: 264387
Truncation count: 0
Saturated count: 0
Invalid count: 17
cat underlyings-datadock_2018_Q1.csv | wc -l
264387

Execution time: 26.1987s
Rows inserted: 494710
Truncation count: 61
Saturated count: 0
Invalid count: 1940652
cat usec_hist-datadock_2018_Q1.csv | wc -l
494710

I was looking for some detail LOG similar to csimport but did not find any. could you point me to the place I should be able to find these detail logs? I tried to check loaded data and ran few queries that counted rows also did come conditional counting and it all looked OK to me. Obviously that was not enough to be 100% sure that everything loaded correctly. If there are no logs it would be probably enough to know what triggeres that invalid count to grow when loading using mcsimport.

What is more scary is my next finding:
We had a case of missing files exactly the same as we had before when I opened this ticket for the first time. There is only a little difference between then and now. Not all dates between march 5 and 29 are with missing files. We have these files missing:

{{MariaDB [tradealert]> SELECT sio.`id` AS option_id, o.`date` AS market_date, o.`time`, o.`size`, o.`price`, o.`spot`, o.open_int, o.theo, o.ivol, o.delta, o.vega, o.theta, o.`secid` FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-01-19';
ERROR 1815 (HY000): Internal error: IDB-2039: Data file does not exist, please contact your system administrator for more information.

MariaDB [tradealert]> SELECT sio.`id` AS option_id, o.`date` AS market_date, o.`time`, o.`size`, o.`price`, o.`spot`, o.open_int, o.theo, o.ivol, o.delta, o.vega, o.theta, o.`secid` FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-02-05';
ERROR 1815 (HY000): Internal error: IDB-2039: Data file does not exist, please contact your system administrator for more information.

MariaDB [tradealert]> SELECT sio.`id` AS option_id, o.`date` AS market_date, o.`time`, o.`size`, o.`price`, o.`spot`, o.open_int, o.theo, o.ivol, o.delta, o.vega, o.theta, o.`secid` FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-02-21';
ERROR 1815 (HY000): Internal error: IDB-2039: Data file does not exist, please contact your system administrator for more information.

MariaDB [tradealert]> SELECT sio.`id` AS option_id, o.`date` AS market_date, o.`time`, o.`size`, o.`price`, o.`spot`, o.open_int, o.theo, o.ivol, o.delta, o.vega, o.theta, o.`secid` FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-03-09' limit 10;
ERROR 1815 (HY000): Internal error: IDB-2039: Data file does not exist, please contact your system administrator for more information.

MariaDB [tradealert]> SELECT sio.`id` AS option_id, o.`date` AS market_date, o.`time`, o.`size`, o.`price`, o.`spot`, o.open_int, o.theo, o.ivol, o.delta, o.vega, o.theta, o.`secid` FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-03-27' limit 10;
------------------------------------------------------------------------------------------------------------

option_id market_date time size price spot open_int theo ivol delta vega theta secid

------------------------------------------------------------------------------------------------------------

55344 2018-03-27 15:45:41.313 1 87.7 1002.01 1180 88.8097 0.268411 -0.432299 3.50937 -0.141225 20393363
5897 2018-03-27 15:45:41.325 1 59.05 1489 3039 60.4389 0.35868 0.374017 2.64117 -0.617602 23051456
1270919 2018-03-27 15:45:41.331 1 0.26 220.71 4681 0.261891 0.828654 0.047409 0.016148 -0.233111 25047193
1493877 2018-03-27 15:45:41.333 1 12.25 220.71 1201 11.9327 0.631092 0.620184 0.139078 -0.453351 25269065
1653542 2018-03-27 15:45:41.375 1 2.55 277.91 17 2.44285 0.742268 0.277963 0.069004 -1.39039 25430466
1548574 2018-03-27 15:45:41.388 1 9.85 213.22 60 9.61333 0.316144 -0.662974 0.162157 -0.154413 25314047
228803 2018-03-27 15:45:41.401 2 6.61 1002.01 5553 6.70226 0.336513 0.152676 0.606164 -0.430907 23298935
1697220 2018-03-27 15:45:41.444 1 0.26 6.905 668 0.238412 0.594428 0.29706 0.009021 -0.005256 25507298
1697220 2018-03-27 15:45:41.445 2 0.26 6.905 668 0.238412 0.594428 0.29706 0.009021 -0.005256 25507298
1697220 2018-03-27 15:45:41.445 1 0.26 6.905 668 0.238412 0.594428 0.29706 0.009021 -0.005256 25507298

------------------------------------------------------------------------------------------------------------
10 rows in set (3.59 sec)

MariaDB [tradealert]> SELECT sio.`id` AS option_id, o.`date` AS market_date, o.`time`, o.`size`, o.`price`, o.`spot`, o.open_int, o.theo, o.ivol, o.delta, o.vega, o.theta, o.`secid` FROM onelots AS o LEFT JOIN securities_occ AS so ON o.secid = so.secid LEFT JOIN dds_main.`sec_instrument_option` AS sio ON so.`occ_identifier` = sio.`occ_identifier` WHERE o.date = '2018-03-27';
ERROR 1815 (HY000): Internal error: IDB-2039: Data file does not exist, please contact your system administrator for more information.
}}

I ran the query one time with limit 10 and another time without and I had 2 different results for March 27.

We need to know what is causing this issue to come back. Without knowing the reason it makes the whole environment unpredictable for us and that is very risky proposition.

I attached a support report file (labeled with today's date for easier recognition) to document what we currently have configured.
columnstoreSupportReport_20181104.csdb-pg1.tar.gz

Comment by Andrew Hutchings (Inactive) [ 2018-11-05 ]

How did you move from one storage type to the other? It looks like you may have just moved the data files over to the new storage and started it? In which case you are trying to start with already damaged files. I highly recommend a clean install.

As for the invalid count. Please file a new ticket for this and attach an example CSV file and schema for it so we can take a look.

Comment by John Dutchover [ 2018-11-05 ]

The data had been retained when moved to EBS. It was a misunderstanding on my part.

A clean installation has been completed. I generated a support file columnstoreSupportReport.cstore-pg1c.tar.gz before turning over to @madamec. Note the new system name is cstore-pg1c to distinguish from previous install.

Generated at Thu Feb 08 02:31:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.