[MCOL-994] cpimport failed with a "new extent FBO too high for current file error" Created: 2017-10-30 Updated: 2017-12-08 Resolved: 2017-12-08 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | cpimport |
| Affects Version/s: | 1.0.11, 1.1.1 |
| Fix Version/s: | 1.0.12, 1.1.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Daniel Lee (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | relnote | ||
| Sprint: | 2017-22, 2017-23, 2017-24 |
| Description |
|
Build tested: 1.1.1-1 package file
Stack: 1um4pm, each pm has 1 dbroot

1. created dbt3 tables

All other tables imported successfully. I tried to load the partsupp table again and got the same error.

OID-3073 is the ps_supplycost column. The following is the extent map after cpimport failed and rolled back.

[root@localhost columnstore]# /usr/local/mariadb/columnstore/bin/editem -o 3073 |
| Comments |
| Comment by Daniel Lee (Inactive) [ 2017-10-30 ] |
|
info on the partsupp table.

<Table tblName="tpch100.partsupp" tblOid="3069">

extent map entries
Col OID = 3070, NumExtents = 4, width = 4
Col OID = 3071, NumExtents = 4, width = 4
Col OID = 3072, NumExtents = 4, width = 4
Col OID = 3073, NumExtents = 4, width = 8
Col OID = 3074, NumExtents = 4, width = 8
Dct OID = 3075 |
| Comment by David Hall (Inactive) [ 2017-11-02 ] |
|
It would be useful to have the actual error line so that I can see what information it contains; there should be more than is shown in the title here. It would also be useful to have the logs from the time of the problem. |
| Comment by David Hall (Inactive) [ 2017-11-02 ] |
|
When loading 2 x 10g into partsupp, I get 2 segments. The extents above show 4 segments. Was there data already in the table? Was cpimport run more than twice before the redistribute? |
| Comment by Daniel Lee (Inactive) [ 2017-11-02 ] |
|
I used a 1um4pm stack. I indicated that on the 2nd line of the bug description. Stack: 1um4pm, each pm has 1 dbroot |
| Comment by David Hall (Inactive) [ 2017-11-13 ] |
|
Update on investigation:

The import logic adds rows starting at the HWM of the original segment file until the extent is full. The HWM is at an extent boundary at this point, which triggers an extension logic path. Normally, it would add an extent to the same file (we put two extents per file) and start filling it. However, it looks around for other files that might already exist and finds the second, partially full segment file. It then tries to fill this. Unfortunately, it expects to start filling at an extent boundary and throws the error.

There are two bits of logic in the code that look like someone tried to work around this problem.

In the import logic, there's a section that looks like it's trying to detect this problem. If found, it fills the partial extent out to an extent boundary (with empty data). Then the following code starts on an extent boundary and all is OK. However, the detection logic doesn't see the extent at all, for some reason. The comments seem to indicate it is specifically there to handle the moving of dbroots, though that seems a bit strange.

In the replication code, there's logic to handle the HWM-0 problem. It's mentioned throughout the code, but what the HWM-0 problem is, and what the logic is trying to compensate for, isn't clear to me.

I don't like either kludge above. It would be better to detect the switch to a partially full extent and adjust the HWM accordingly. The trick is for the code to know it's a moved extent and not a corrupt extent. I think that can be accomplished the same way the current error is being detected: does the HWM match the file size? |
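To make the boundary expectation described above concrete, here is a minimal toy model. It is not ColumnStore source code: the function names, the exception, and the constants (8192-byte blocks, 8M rows per extent) are assumptions chosen for illustration only.

```python
# Toy model only: names and constants are illustrative assumptions,
# not taken from the ColumnStore code base.
BLOCK_SIZE_BYTES = 8192            # assumed block size
ROWS_PER_EXTENT = 8 * 1024 * 1024  # assumed rows per extent


class FboNotOnExtentBoundary(Exception):
    """Raised when a fill would start in the middle of an extent."""


def blocks_per_extent(col_width: int) -> int:
    """Number of blocks one extent occupies for a column of this byte width."""
    return ROWS_PER_EXTENT * col_width // BLOCK_SIZE_BYTES


def is_extent_boundary(fbo: int, col_width: int) -> bool:
    """True if the file block offset (FBO) sits exactly on an extent boundary."""
    return fbo % blocks_per_extent(col_width) == 0


def begin_fill(start_fbo: int, col_width: int) -> int:
    """Mimic the expectation that filling starts on an extent boundary."""
    if not is_extent_boundary(start_fbo, col_width):
        raise FboNotOnExtentBoundary(
            f"FBO {start_fbo} is inside an extent "
            f"({blocks_per_extent(col_width)} blocks per extent)"
        )
    return start_fbo
```

Under these assumptions, an 8-byte column such as ps_supplycost has 8192 blocks per extent, so `begin_fill(8192, 8)` succeeds while `begin_fill(100, 8)` raises. The second call corresponds to switching to a moved, partially full segment file.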
| Comment by David Hall (Inactive) [ 2017-11-16 ] |
|
Once this error occurs, it leaves one of the HWMs incorrect after rollback and the table is unusable. |
| Comment by David Hall (Inactive) [ 2017-11-16 ] |
|
I tried to find a way to handle two partial extents in one dbroot. It can be done, but it's a monumental task requiring significant refactoring. The code is full of places that assume the existence of no more than one partial extent. Some of those places will take some work to fix up.

Someone has seen this problem before. There is code to find such a situation, fill one of the partials to a full extent using the "empty" value, and get it back on an extent boundary. Presumably, these rows would never be used in any query result. However, the detection code currently doesn't detect our situation and never attempts this kludge.

For now, I'll attempt to increase the detection sensitivity to include this situation. |
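The padding kludge described above can be sketched as follows. This is only an illustration of the idea: `EMPTY`, `rows_to_pad`, and `pad_partial_extent` are hypothetical names, and a Python list stands in for the on-disk extent.

```python
# Illustrative sketch; EMPTY is a hypothetical stand-in for the column's
# "empty" marker, and a list stands in for the rows of a segment file.
EMPTY = None


def rows_to_pad(rows_used: int, rows_per_extent: int) -> int:
    """How many empty rows are needed to reach the next extent boundary."""
    return (-rows_used) % rows_per_extent


def pad_partial_extent(rows: list, rows_per_extent: int) -> list:
    """Return the rows with the last partial extent filled out with EMPTY."""
    return rows + [EMPTY] * rows_to_pad(len(rows), rows_per_extent)
```

With a deliberately tiny extent of 4 rows, `pad_partial_extent([1, 2, 3], 4)` yields a list of 4 entries ending in one `EMPTY`, after which any further fill starts on an extent boundary.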
| Comment by David Hall (Inactive) [ 2017-11-20 ] |
|
As I started thinking about possible fixes, it occurred to me that the problem also exists in DML, so I switched off the "use import for LDI" setting and ran an LDI. When it hit the boundary, DMLProc crashed. Any fix needs to take this into account. To test: |
| Comment by David Hall (Inactive) [ 2017-12-07 ] |
|
This error only occurs for compressed tables. It occurs when multiple segment files on a PM contain a partial extent in the first extent position of the file. There are two extents per file. This bug won't manifest if the partial extent is the second extent of the file.

Note that almost every column will have exactly one partial extent on each dbroot: the one extent where new rows will be added. WriteEngineServer could have been written to handle multiple partial extents, but it wasn't. When we do a mcsadmin redistribute start remove, all the segment files of one dbroot are moved to other dbroots, which means a dbroot will end up with two segment files with partial extents: its own and one moved there. If the lower number segment has the partial extent in segment one, the logic gets confused.

To test, you must ensure that the partial extent is the first in the segment. I use the partsupp table and a 10g and a 100g load file. You don't need to load any other tables, though testing that way may be a good idea.

1. Load 10g three times. This loads the table with enough data to cause the problem. Loading 4 times seems to make it go away, as enough blocks are allocated to cause the extent to be "full". Loading only twice also works for the test, but when doing a DML test, it just adds time to the test for no reason.
2. Next, run mcsadmin redistribute start remove 2
3. Then load 100g. Be sure to run some queries on the table after this load, as during development it was noticed that it could break things.

DML test: load 10g
A test should be devised to loop a single insert after redistribute until count

These tests need to run on a separate multi pm stack. |
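The trigger conditions stated above can be written as a small predicate, which may help when checking whether a test setup is in the problematic state. This is a sketch over a made-up data model: `SegmentFile`, `partial_in_first_position`, and `is_at_risk` are hypothetical names and do not correspond to any ColumnStore API or to the output format of editem.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentFile:
    """Hypothetical model of one segment file of a column on a dbroot."""
    segment: int
    extents_full: List[bool]  # one flag per extent (at most two per file)


def partial_in_first_position(seg: SegmentFile) -> bool:
    """True if the file's first extent exists and is only partially full."""
    return len(seg.extents_full) >= 1 and not seg.extents_full[0]


def is_at_risk(compressed: bool, segment_files: List[SegmentFile]) -> bool:
    """Apply the conditions from the comment above: a compressed table with
    more than one segment file whose first extent is partial."""
    if not compressed:
        return False
    count = sum(1 for seg in segment_files if partial_in_first_position(seg))
    return count > 1
```

For example, a dbroot holding its own segment file `SegmentFile(0, [False])` plus a moved one `SegmentFile(1, [False])` is flagged for a compressed table, while `SegmentFile(1, [True, False])` (partial extent in the second position) is not counted.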
| Comment by Andrew Hutchings (Inactive) [ 2017-12-08 ] |
|
Merged to 1.0. Will cross merge to 1.1 after pull request 337 is merged. |
| Comment by Daniel Lee (Inactive) [ 2017-12-08 ] |
|
Builds verified: GitHub source

1.0.12-1
/root/columnstore/mariadb-columnstore-server
Merge pull request #81 from mariadb-corporation/
/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
Merge pull request #343 from mariadb-corporation/

1.1.3-1
/root/columnstore/mariadb-columnstore-server
Merge pull request #80 from mariadb-corporation/
/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
Merge pull request #341 from mariadb-corporation/

Repeated the test case mentioned above, as well as a few tests from autopilot as sanity tests. |