[MCOL-786] redistribute remove moved data from removed node to one node Created: 2017-06-25  Updated: 2017-10-30  Resolved: 2017-10-30

Status: Closed
Project: MariaDB ColumnStore
Component/s: writeengine
Affects Version/s: 1.0.9
Fix Version/s: 1.1.1

Type: Bug Priority: Major
Reporter: David Thompson (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Attachments: Zip Archive info.zip    
Issue Links:
Relates
relates to MCOL-788 redistributeData does nothing for par... Closed
Sprint: 2017-13, 2017-14, 2017-15, 2017-16, 2017-17, 2017-18, 2017-19, 2017-20, 2017-21, 2017-22

 Description   

Single-server node with 5 dbroots and a large amount of data loaded. Uncovered that dbroot2 was using the same physical storage as dbroot1, so ran:
Fri Jun 23 14:18:30 2017: redistributedata start remove 2
Fri Jun 23 15:02:33 2017: unassignDbrootPmConfig 2 pm1

However, after completion the volume for dbroot1 was still near capacity. Most of the data from dbroot2 was moved to dbroot1 rather than being spread out across dbroots 1, 3, 4, and 5 as expected.

Why did this happen? There is enough data that I believe this should not be an edge case caused by a small number of partitions.

Also, a subsequent redistribute does nothing, despite dbroot1 very clearly having more partitions than the others.



 Comments   
Comment by David Thompson (Inactive) [ 2017-06-25 ]

info.zip contains the info log for the redistribute run.

Before the redistribute, the extents per dbroot for the largest table were (ordered by dbroot, followed by the total):

ros_fin_mkt_insm	6203	6207	6202	6192	6200	31004

After the redistribute (dbroot2 removed; all of its extents went to dbroot1):

ros_fin_mkt_insm	12410	6202	6192	6200		31004

Comment by David Thompson (Inactive) [ 2017-06-26 ]

I think there are a couple of bugs causing the move to dbroot1 for most partitions.

For the first case in RedistributeControlThread::makeRedistributePlan():

  • L308: if (dbPartVec[*j].size() > average) — in this case every dbroot is below the calculated average of the target set (7751), so nothing ends up in the first-pass algorithm.
  • L397: all of the dbroot2 partitions fall into this logic, but it is only really designed to handle one lagging entry, so it always picks a single dbroot to move them to, in this case dbroot1.

Probably only one of these loops needs fixing.

I can't yet see why a subsequent redistribute does nothing, though, as dbroot1 very clearly has more partitions than the others, and there would be no remove to confuse the algorithm.
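For contrast with the two loops described above, the intended behavior can be sketched as a greedy plan that always sends the next moved partition to the least-loaded remaining dbroot. This is a hypothetical Python sketch (make_plan is an invented name, and the partition counts are taken from the columnstore_extents output later in this issue), not the actual makeRedistributePlan() C++ code:

```python
import heapq

def make_plan(counts, removed):
    """counts: {dbroot: partition_count}; removed: set of dbroots to drain.
    Returns a list of (src_dbroot, dst_dbroot) moves, one per partition."""
    # Min-heap of (current_load, dbroot) over the surviving dbroots.
    targets = [(n, d) for d, n in counts.items() if d not in removed]
    heapq.heapify(targets)
    plan = []
    for src in sorted(removed):
        for _ in range(counts[src]):
            load, dst = heapq.heappop(targets)   # least-loaded target first
            plan.append((src, dst))
            heapq.heappush(targets, (load + 1, dst))
    return plan

moves = make_plan({1: 23, 2: 11, 3: 11, 4: 11}, removed={2})
# dbroot2's 11 partitions go to dbroots 3 and 4; dbroot1, already the
# heaviest, receives none, instead of receiving all of them.
```

Keeping the targets in a min-heap means each move re-evaluates which dbroot is lightest, so a whole removed dbroot can never be dumped onto one destination the way the L397 path does.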

Comment by David Thompson (Inactive) [ 2017-06-26 ]

To reproduce both cases, use the following to generate a large amount of test data:

rm -f t.tbl
for m in {1..12}; do
    for i in {1..31}; do
       # seed: 10 rows for this date
       for j in {1..10}; do echo "2017-$(printf %02d $m)-$(printf %02d $i)|$j"; done > t0.tbl
       # multiply by 10, five times over (x100,000 -> 1,000,000 rows per date)
       for k in {1..5}; do
         cat t0.tbl t0.tbl t0.tbl t0.tbl t0.tbl t0.tbl t0.tbl t0.tbl t0.tbl t0.tbl > t1.tbl
         rm -f t0.tbl; cp t1.tbl t0.tbl
       done
       # append this date's rows to the final load file
       cat t0.tbl >> t.tbl; rm -f t0.tbl t1.tbl
    done
done

Create an instance with multiple dbroots, e.g. 4. Then cpimport the file into:

create table t1(dt date, i int) engine=columnstore;

It will complain about a few invalid values because the data generator is not month-aware, so it generates dates such as 2017-02-31.

Now do a redistribute remove:

mcsadmin
> redistributeData start remove 2
> redistributeData status

After this you'll see that dbroot1 has all the entries from dbroot2:

MariaDB [information_schema]> select object_id, dbroot, count(*) from columnstore_extents where object_id > 3000 group by 1,2;
+-----------+--------+----------+
| object_id | dbroot | count(*) |
+-----------+--------+----------+
|      3001 |      1 |       23 |
|      3001 |      3 |       11 |
|      3001 |      4 |       11 |
|      3002 |      1 |       23 |
|      3002 |      3 |       11 |
|      3002 |      4 |       11 |
+-----------+--------+----------+
6 rows in set (0.02 sec)

Now unassign dbroot2 and do another redistribute:

mcsadmin
> stopsystem
> unassignDbrootPmConfig 2 pm1
> startsystem
> redistributeData start

Nothing happens, and the dbroot info schema report is the same.

Comment by Daniel Lee (Inactive) [ 2017-06-27 ]

Build tested: Github source 1.1.0

[root@localhost mariadb-columnstore-server]# git show
commit 60f2f261f81d994307762d6d93380873513a0be8
Author: david hill <david.hill@mariadb.com>
Date: Tue Jun 20 16:05:32 2017 -0500

[root@localhost mariadb-columnstore-server]# cd mariadb-columnstore-engine/
[root@localhost mariadb-columnstore-engine]# git show
commit b424a02c2e431be6f4cf39c12f0c7fd0155c23f5
Author: david hill <david.hill@mariadb.com>
Date: Mon Jun 26 09:58:38 2017 -0500

Reproduced the issue with a 100GB dbt3 database.
Also noticed an issue where redistributedata (without the remove option) did not redistribute the data evenly.

before

[root@localhost columnstore]# du -sh *
111M bin
1.2M data
19G data1
15G data2
15G data3
15G data4
17G data5
176K etc

mcsadmin> redistributedata start remove 2
redistributedata Tue Jun 27 19:17:20 2017
redistributeData START Removing dbroots: 2
Source dbroots: 1 2 3 4 5
Destination dbroots: 1 3 4 5

mcsadmin> redistributedata status
redistributedata Tue Jun 27 19:21:38 2017
WriteEngineServer returned status 2: Redistribute is in progress: total 5 logical partitions are planned to move

after

[root@localhost columnstore]# du -sh *
111M bin
1.2M data
33G data1
8.0K data2
15G data3
15G data4
17G data5

mcsadmin> redistributedata start
redistributedata Tue Jun 27 19:23:36 2017
redistributeData START
Source dbroots: 1 2 3 4 5
Destination dbroots: 1 2 3 4 5

WriteEngineServer returned status 1: Cleared.
WriteEngineServer returned status 2: Redistribute is started.
mcsadmin> redistributedata status
redistributedata Tue Jun 27 19:23:42 2017
WriteEngineServer returned status 2: Redistribute is in progress: total 1 logical partition is planned to move.

[root@localhost columnstore]# du -sh *
111M bin
1.2M data
33G data1
4.2G data2
15G data3
15G data4
13G data5

Comment by David Hall (Inactive) [ 2017-09-15 ]

Given the example after the REDISTRIBUTE REMOVE action, further redistributes do nothing because the partitions are evenly distributed:

MariaDB [information_schema]> select object_id, dbroot, count(DISTINCT partition_id) from information_schema.columnstore_extents where object_id > 3000 group by 1,2;
+-----------+--------+------------------------------+
| object_id | dbroot | count(DISTINCT partition_id) |
+-----------+--------+------------------------------+
|      3001 |      1 |                            6 |
|      3001 |      3 |                            6 |
|      3001 |      4 |                            6 |
|      3002 |      1 |                            6 |
|      3002 |      3 |                            6 |
|      3002 |      4 |                            6 |
+-----------+--------+------------------------------+
6 rows in set (0.08 sec)
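The even distinct-partition counts explain the no-op: the plan is computed over distinct logical partitions per dbroot, so a layout can look balanced to the planner while raw extent counts stay skewed. A minimal illustration of the two metrics diverging (hypothetical data layout, not ColumnStore internals; the shape mirrors the 23/11/11 extent counts seen earlier in this issue):

```python
from collections import Counter

# (dbroot, partition_id) for every extent of one column: dbroot1 holds the
# same 6 logical partitions as dbroots 3 and 4, but with twice the extents.
extents = (
    [(1, p) for p in range(6) for _ in range(4)]    # 24 extents on dbroot1
    + [(3, p) for p in range(6) for _ in range(2)]  # 12 extents on dbroot3
    + [(4, p) for p in range(6) for _ in range(2)]  # 12 extents on dbroot4
)

extents_per_dbroot = Counter(d for d, p in extents)
partitions_per_dbroot = {d: len({p for dd, p in extents if dd == d})
                         for d in extents_per_dbroot}

print(dict(extents_per_dbroot))   # skewed:     {1: 24, 3: 12, 4: 12}
print(partitions_per_dbroot)      # "balanced": {1: 6, 3: 6, 4: 6}
```

A planner that only compares the second dict sees nothing to move, which matches the "nothing happens" result of the follow-up redistribute.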

Comment by David Thompson (Inactive) [ 2017-09-21 ]

Right, the point is that the first redistribute doesn't actually rebalance and instead loads up dbroot1. The second point is that there is no workaround: running another redistribute does not fix it. Behavior is still the same in 1.1.0.

Comment by David Hall (Inactive) [ 2017-10-04 ]

Remember to merge before testing.

Comment by Daniel Lee (Inactive) [ 2017-10-30 ]

Build verified: 1.1.1-1 package

Verified on a 1um4pm stack, with 1 dbroot in each PM.

After loading a 10GB source twice:

Before redistribute

[root@localhost columnstore]# du -sh data1
9.5G data1
[root@localhost columnstore]# du -sh data2
9.3G data2
[root@localhost columnstore]# du -sh data3
8.8G data3
[root@localhost columnstore]# du -sh data4
9.4G data4

After redistribute

[root@localhost columnstore]# du -sh data1
16G data1
[root@localhost columnstore]# du -sh data2
120K data2
[root@localhost columnstore]# du -sh data3
11G data3
[root@localhost columnstore]# du -sh data4
11G data4

After loading another 100GB source (partsupp did not load due to an error):

Before redistribute

[root@localhost columnstore]# du -sh data1
29G data1
[root@localhost columnstore]# du -sh data2
17G data2
[root@localhost columnstore]# du -sh data3
24G data3
[root@localhost columnstore]# du -sh data4
25G data4

After redistribute

[root@localhost columnstore]# du -sh data1
35G data1
[root@localhost columnstore]# du -sh data2
120K data2
[root@localhost columnstore]# du -sh data3
29G data3
[root@localhost columnstore]# du -sh data4
30G data4
