[MCOL-994] cpimport failed with a "new extent FBO too high for current file error" Created: 2017-10-30  Updated: 2017-12-08  Resolved: 2017-12-08

Status: Closed
Project: MariaDB ColumnStore
Component/s: cpimport
Affects Version/s: 1.0.11, 1.1.1
Fix Version/s: 1.0.12, 1.1.3

Type: Bug Priority: Major
Reporter: Daniel Lee (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: relnote

Sprint: 2017-22, 2017-23, 2017-24

 Description   

Build tested: 1.1.1.-1 package file

Stack: 1um4pm, each pm has 1 dbroot

1. created dbt3 tables
2. cpimport 10g source
3. cpimport 10g source again
4. redistribute start remove 2
5. cpimport 100g source and failure occurred

All other tables import successfully. I tried to load the partsupp table again and got the same error

OID-3073 is the ps_supplycost column.

The following is the extent map after cpimport failed and rolled back.

[root@localhost columnstore]# /usr/local/mariadb/columnstore/bin/editem -o 3073
Col OID = 3073, NumExtents = 4, width = 8
878592 - 886783 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 4, part#: 0, seg#: 0, HWM: 5863; status: avail
1977344 - 1985535 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 8191; status: avail
2006016 - 2014207 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 5863; status: avail
2051072 - 2059263 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 3, part#: 0, seg#: 3, HWM: 5863; status: avail



 Comments   
Comment by Daniel Lee (Inactive) [ 2017-10-30 ]

info on the partsupp table.

<Table tblName="tpch100.partsupp" tblOid="3069">
<Column colName="ps_partkey" colOid="3070" dataType="integer" compressType="2" width="4"/>
<Column colName="ps_suppkey" colOid="3071" dataType="integer" compressType="2" width="4"/>
<Column colName="ps_availqty" colOid="3072" dataType="integer" compressType="2" width="4"/>
<Column colName="ps_supplycost" colOid="3073" dataType="decimal" compressType="2" precision="12" scale="2" width="8"/>
<Column colName="ps_comment" colOid="3074" dataType="varchar" compressType="2" width="199" colType="D" dValOid="3075"/>
</Table>

extent map entries

Col OID = 3070, NumExtents = 4, width = 4
866304 - 870399 (4096) min: 1, max: 1992500, seqNum: 3, state: valid, fbo: 0, DBRoot: 4, part#: 0, seg#: 0, HWM: 2931; status: avail
1965056 - 1969151 (4096) min: 2501, max: 1995000, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 4095; status: avail
1993728 - 1997823 (4096) min: 5001, max: 1997500, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 2931; status: avail
2038784 - 2042879 (4096) min: 7501, max: 2000000, seqNum: 3, state: valid, fbo: 0, DBRoot: 3, part#: 0, seg#: 3, HWM: 2931; status: avail

Col OID = 3071, NumExtents = 4, width = 4
870400 - 874495 (4096) min: 2, max: 97558, seqNum: 3, state: valid, fbo: 0, DBRoot: 4, part#: 0, seg#: 0, HWM: 2931; status: avail
1969152 - 1973247 (4096) min: 1, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 4095; status: avail
1997824 - 2001919 (4096) min: 2, max: 97539, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 2931; status: avail
2042880 - 2046975 (4096) min: 1, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 3, part#: 0, seg#: 3, HWM: 2931; status: avail

Col OID = 3072, NumExtents = 4, width = 4
874496 - 878591 (4096) min: 1, max: 9999, seqNum: 3, state: valid, fbo: 0, DBRoot: 4, part#: 0, seg#: 0, HWM: 2931; status: avail
1973248 - 1977343 (4096) min: 1, max: 9999, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 4095; status: avail
2001920 - 2006015 (4096) min: 1, max: 9999, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 2931; status: avail
2046976 - 2051071 (4096) min: 1, max: 9999, seqNum: 3, state: valid, fbo: 0, DBRoot: 3, part#: 0, seg#: 3, HWM: 2931; status: avail

Col OID = 3073, NumExtents = 4, width = 8
878592 - 886783 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 4, part#: 0, seg#: 0, HWM: 5863; status: avail
1977344 - 1985535 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 8191; status: avail
2006016 - 2014207 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 5863; status: avail
2051072 - 2059263 (8192) min: 100, max: 100000, seqNum: 3, state: valid, fbo: 0, DBRoot: 3, part#: 0, seg#: 3, HWM: 5863; status: avail

Col OID = 3074, NumExtents = 4, width = 8
886784 - 894975 (8192) min: notset, max: notset, seqNum: 0, state: valid, fbo: 0, DBRoot: 4, part#: 0, seg#: 0, HWM: 5863; status: avail
1985536 - 1993727 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 8191; status: avail
2014208 - 2022399 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 5863; status: avail
2059264 - 2067455 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 0, DBRoot: 3, part#: 0, seg#: 3, HWM: 5863; status: avail

Dct OID = 3075
894976 - 903167 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 0, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: avail
2022400 - 2030591 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: avail
2030592 - 2038783 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 0, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: avail
2067456 - 2075647 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 0, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: avail
2075648 - 2083839 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 8192, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
2083840 - 2092031 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 8192, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
2092032 - 2100223 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 8192, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
2100224 - 2108415 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 8192, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
2108416 - 2116607 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 16384, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
2116608 - 2124799 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 16384, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
2124800 - 2132991 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 16384, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
2132992 - 2141183 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 16384, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
2141184 - 2149375 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 24576, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: avail
2149376 - 2157567 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 24576, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: avail
2157568 - 2165759 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 24576, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: avail
2165760 - 2173951 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 24576, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: avail
3806208 - 3814399 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 32768, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
3814400 - 3822591 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 32768, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
3822592 - 3830783 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 32768, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
3830784 - 3838975 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 32768, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
3838976 - 3847167 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 40960, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
3847168 - 3855359 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 40960, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
3855360 - 3863551 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 40960, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
3863552 - 3871743 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 40960, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
3871744 - 3879935 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 49152, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
3879936 - 3888127 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 49152, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
3888128 - 3896319 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 49152, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
3896320 - 3904511 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 49152, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
3904512 - 3912703 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 57344, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: avail
3912704 - 3920895 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 57344, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: avail
3920896 - 3929087 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 57344, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: avail
3929088 - 3937279 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 57344, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: avail
5542912 - 5551103 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 65536, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
5551104 - 5559295 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 65536, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
5559296 - 5567487 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 65536, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
5567488 - 5575679 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 65536, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
5575680 - 5583871 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 73728, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
5583872 - 5592063 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 73728, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
5592064 - 5600255 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 73728, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
5600256 - 5608447 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 73728, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
5608448 - 5616639 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 81920, DBRoot: 4, part#: 0, seg#: 0, HWM: 0; status: unavail
5616640 - 5624831 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 81920, DBRoot: 1, part#: 0, seg#: 1, HWM: 0; status: unavail
5624832 - 5633023 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 81920, DBRoot: 1, part#: 0, seg#: 2, HWM: 0; status: unavail
5633024 - 5641215 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 81920, DBRoot: 3, part#: 0, seg#: 3, HWM: 0; status: unavail
5641216 - 5649407 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 90112, DBRoot: 4, part#: 0, seg#: 0, HWM: 92986; status: avail
5649408 - 5657599 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 90112, DBRoot: 1, part#: 0, seg#: 1, HWM: 93015; status: avail
5657600 - 5665791 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 90112, DBRoot: 1, part#: 0, seg#: 2, HWM: 93010; status: avail
5665792 - 5673983 (8192) min: notset, max: notset, seqNum: 0, state: invalid, fbo: 90112, DBRoot: 3, part#: 0, seg#: 3, HWM: 92990; status: avail

Comment by David Hall (Inactive) [ 2017-11-02 ]

It would be useful to have the actual error line so that I could see what information was in it. There should be more than is shown in the title here.

Would also be useful to have the logs for the time of the problem.

Comment by David Hall (Inactive) [ 2017-11-02 ]

When loading 2 x 10g into partsupp, I get 2 segments. The extents above show 4 segments. Was there data already in the table? Was cpimport run more than twice before the redistribute?

Comment by Daniel Lee (Inactive) [ 2017-11-02 ]

I used a 1um4pm stack. I indicated that on the 2nd line of the bug description.

Stack: 1um4pm, each pm has 1 dbroot

Comment by David Hall (Inactive) [ 2017-11-13 ]

Update on investigation:
In this case, redistribute moved a partially full extent from dbroot 2 to dbroot 1. The extent is the first in the segment file. This leaves two partially filled extents, both as the first in their segment files on dbroot1.

The import logic adds rows starting at the HWM of the original segment file until the extent is full. The HWM is at an extent boundary at this point. This triggers an extension logic path. Normally, it would add an extent to the same file (we put two extents per file) and start filling it. However, it looks around for other files that might already exist and finds the second partially full segment file. It then tries to fill this. Unfortunately, it expects to start filling at an extent boundary and throws the error.

There are two bits of logic in the code that looks like someone or someones tried to work around this problem. In the import logic, theres a section that looks like it's trying to detect this problem. If found, it fills the partial extent out to an extent boundary (with empty data). Then the following code starts on an extent boundary and all is OK. However, the detection logic doesn't see the extent at all, for some reason. The comments seem to indicate it is specifically there to handle the moving of dbroots, though that seems a bit strange.

In the replication code, there's logic to handle the HWM-0 problem. It's mentioned throughout the code, but I'm not clear on what it's trying to compensate for. What the HWM-0 problem is i isn't clear.

I don't like either kludge above. It would be better to detect the switch to a partially full extent and adjust the HWM accordingly. The trick is for the code to know it's a moved extent and not a corrupt extent. I think that can be accomplished the same way the current error is being detected. Does the HWM match the file size?

Comment by David Hall (Inactive) [ 2017-11-16 ]

Once this error occurs, it leaves one of the HWM incorrect after rollback and the table is unusable.

Comment by David Hall (Inactive) [ 2017-11-16 ]

I tried to find a way to handle two partial extents in one dbroot. It can be done, but it's a monumental task requiring significant refactoring. The code is full of places that assume the existence of no more than one partial extent. Some of those places will take some work to fix up.

This problem has been seen before by someone. There is code to find such a situation and fill one of the partials to a full extent using the "empty" value and get it back on an extent boundary. Presumably, these rows would never be used in any query result. However, the detection code currently doesn't detect our situation and never attempts this kludge.

For now, I'll attempt to increase the detection sensitivity to include this situation.

Comment by David Hall (Inactive) [ 2017-11-20 ]

As I started thinking about possible fixes, it occurred to me that the problem also exists in DML. so I flipped the use import for LDI switch off and did LDI. When it hit the boundary, DMLProc crashed. Any fix needs to take this into account.

To test:
using only partsupp
cpimport 10g three times
redistribute start remove 2
set infinidb_use_import_for_batchinsert=0;
load data infile "<dir>/partsupp.tbl" into table partsupp fields terminated by "|";
this first one succeeds
load data infile "<dir>/partsupp.tbl" into table partsupp fields terminated by "|";
ERROR 1815 (HY000): Internal error: Lost connection to DMLProc really [1:InetStreamSocket::write error: Broken pipe -- write from InetStreamSocket: sd: 65 inet: 192.168.1.199 port: 8614]

Comment by David Hall (Inactive) [ 2017-12-07 ]

This error only occurs for compressed tables. It occurs when multiple segment files on a PM contains a partial extent in the first extent position of the file. There are two extents per file. This bug won't manifest if the partial extent is the second extent of the file. Note that almost every column will have exactly one partial extent on each dbroot – the one extent where new rows will be added. WriteEngineServer could have been written to handle multiple partial extents, but it wasn't.

When we do a mcsadmin redistribute start remove, all the segment files of one dbroot are moved to other dbroots, which means one will end up with two segment files with partial extents – its own and one moved there. If the lower number segment has the partial extent in segment one, the logic gets confused.

To test, you must assure that the partial extent is the first in the segment.

I use the partsupp table and a 10g and 100g load file. You don't need to load any other tables, though testing that way may be a good idea.

load 10g three times. This loads the table with enough data to cause the problem. Loading 4 times seems to make it go away, as enough blocks are allocated to cause the extent to be "full". Loading only twice functions for the test, but when doing a DML test, it just adds time to the test for no reason.

Next, run mcsadmin redistribute start remove 2

Then load 100g. Be sure to do some queries on the table after this load, as during development, it was noticed that it could break things.

DML test:

load 10g
redistribute
set infinidb_use_import_for_batchinsert=0; << important not to miss this step, else your just testing cpimport again.
load data infile with the same 10g data file
load data infile again, which always triggered the issue

A test should be devised to loop a single insert after redistribute until count reaches 40000000

These tests need to run on a separate multi pm stack.

Comment by Andrew Hutchings (Inactive) [ 2017-12-08 ]

Merged to 1.0. Will cross merge to 1.1 after pull request 337 is merged.

Comment by Daniel Lee (Inactive) [ 2017-12-08 ]

Builds verified: GitHub source

1.0.12-1

/root/columnstore/mariadb-columnstore-server
commit 7ec285d104a8e68320cbf14b44ee8509693fbda2
Merge: 25e9d05 f546eaf
Author: David.Hall <david.hall@mariadb.com>
Date: Wed Dec 6 10:04:56 2017 -0600

Merge pull request #81 from mariadb-corporation/MCOL-1082-1.0

MCOL-1082 Preserve row_count through vtable

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit d668ad04662f41e09cf30c784f09e51eecdcacab
Merge: b295c8f 34799d8
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Fri Dec 8 10:18:46 2017 +0000

Merge pull request #343 from mariadb-corporation/MCOL-994

MCOL-994 handle a second abbreviated extent in case it moved dbroots …

1.1.3-1

/root/columnstore/mariadb-columnstore-server
commit 0b3b26032aa60d2937cd06535946d7d8575cd4fd
Merge: 632e265 101ea14
Author: David.Hall <david.hall@mariadb.com>
Date: Wed Dec 6 10:05:32 2017 -0600

Merge pull request #80 from mariadb-corporation/MCOL-1082

MCOL-1082 Preserve row_count through vtable

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit 71a901e8b38bca584ece8d7b070ad263c4b65c3f
Merge: 4d80266 addd719
Author: David.Hall <david.hall@mariadb.com>
Date: Wed Dec 6 11:17:49 2017 -0600

Merge pull request #341 from mariadb-corporation/MCOL-1083

MCOL-1083 Fix NULL row init for TEXT/BLOB

Repeated mentioned test case, as well as few tests from autopilot for sanity tests.

Generated at Thu Feb 08 02:25:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.