[MCOL-498] provide option to not preallocate disk on extents Created: 2017-01-10  Updated: 2020-02-16  Resolved: 2019-09-20

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.0.6
Fix Version/s: 1.4.0

Type: New Feature Priority: Major
Reporter: David Thompson (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 1
Labels: documentation, relnote

Issue Links:
Problem/Incident
causes MCOL-3317 test000 does not clear out all files ... Closed
Sprint: 2018-05, 2018-06, 2018-07, 2018-08, 2018-09, 2018-10, 2018-11, 2018-12, 2018-13, 2018-14, 2018-15, 2018-16, 2018-17, 2018-18, 2018-19, 2018-20, 2018-21, 2019-01, 2019-02, 2019-03, 2019-04, 2019-05, 2019-06

 Description   

A configuration should be provided to allow for not preallocating disk storage on extent creation. This would be useful for ssd storage where contiguous allocation is not important and this would also reduce wear slightly.



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2017-02-13 ]

I think there is a good middle ground here that means we won't need an option and will optimise SSD and spinning disk file creation... If we use fallocate to set the size upon extent creation (and expansion) then the continuous space will be allocated without zeroing out the space. We wouldn't need an option for this as the default behaviour is good for SSD and spinning disk, it will improve the file creation performance of both and will remove the needless writes for SSDs.

Comment by Allan [ 2017-05-10 ]

Some technical thoughts (not issues).

Assume that there are two processes that each have open handles to the same underlying files on the server. In one of them a truncate/delete of the file is done because it is doing a delete operation, and also assume that to do the delete it holds a lock on the files. The truncate completes and the kernel marks the appropriate inodes and the lock is released. That disk space is gone. But it can't actually be released because the other process has a handle to the files. The only way to actually release the space is for all processes that hold handles to close them so the kernel can actually release the disk space for other use.

My thought is that if the server holds on to the open files after a delete that that space will never be reclaimed until the server is restarted. You should not have to restart the server to reclaim the space.

And of course you may already be doing all this.

Comment by Roman [ 2018-03-06 ]

The tests shows that dedicated NULL values, used to fill the segment file up, aren't essential, so I implemented that good idea with fallocate() usage where possible. However some file systems lack fallocate() call,so I also add a setting to disable segment file preallocation on a per-PM basis.
I use the same set to optimize dictionary files enlarging.

Comment by Roman [ 2018-03-06 ]

Please, review the PR.

Comment by Andrew Hutchings (Inactive) [ 2018-03-06 ]

Unfortunately as it stands this patch causes a regression.

Aggregates are producing incorrect results. Likely because they are seeing whole blocks as used now rather than partly used when they were being filled with the 'empty' byte pattern.

This is one of the diffs from test000 failure:

--- tpchValidation.sql.ref.log	2016-08-16 11:18:47.000000000 +0100
+++ tpchValidation.sql.log	2018-03-06 11:58:48.832982170 +0000
@@ -1,51 +1,51 @@
 count(l_orderkey)	min(l_orderkey)	max(l_orderkey)	sum(l_orderkey)	avg(l_orderkey)
-6001215	1	6000000	18005322964949	3000279.6042
+6002688	0	6000000	18005322964949	2999543.3654
 count(l_partkey)	min(l_partkey)	max(l_partkey)	sum(l_partkey)	avg(l_partkey)
-6001215	1	200000	600229457837	100017.9893
+6002688	0	200000	600229457837	99993.4459
 count(l_suppkey)	min(l_suppkey)	max(l_suppkey)	sum(l_suppkey)	avg(l_suppkey)
-6001215	1	10000	30009691369	5000.6026
+6002688	0	10000	30009691369	4999.3755
 count(l_linenumber)	min(l_linenumber)	max(l_linenumber)	sum(l_linenumber)	avg(l_linenumber)
-6001215	1	7	18007100	3.0006
+6001664	0	7	18007100	3.0004
 count(l_quantity)	min(l_quantity)	max(l_quantity)	sum(l_quantity)	avg(l_quantity)
-6001215	1.00	50.00	153078795.00	25.507967
+6001664	0.00	50.00	153078795.00	25.506059
 count(l_extendedprice)	min(l_extendedprice)	max(l_extendedprice)	sum(l_extendedprice)	avg(l_extendedprice)
-6001215	901.00	104949.50	229577310901.20	38255.138485
+6001664	0.00	104949.50	229577310901.20	38252.276519
 count(l_discount)	min(l_discount)	max(l_discount)	sum(l_discount)	avg(l_discount)
-6001215	0.00	0.10	300057.33	0.049999
+6001664	0.00	0.10	300057.33	0.049996
 count(l_tax)	min(l_tax)	max(l_tax)	sum(l_tax)	avg(l_tax)
-6001215	0.00	0.08	240129.67	0.040014
+6001664	0.00	0.08	240129.67	0.040011
 count(l_returnflag)	min(l_returnflag)	max(l_returnflag)
-6001215	A	R
+6004736	NULL	R
 count(l_linestatus)	min(l_linestatus)	max(l_linestatus)
-6001215	F	O
+6004736	NULL	O
 count(l_shipdate)	min(l_shipdate)	max(l_shipdate)
-6001215	1992-01-02	1998-12-01
+6002688	0000-00-00	1998-12-01
 count(l_commitdate)	min(l_commitdate)	max(l_commitdate)
-6001215	1992-01-31	1998-10-31
+6002688	0000-00-00	1998-10-31
 count(l_receiptdate)	min(l_receiptdate)	max(l_receiptdate)
-6001215	1992-01-04	1998-12-31
+6002688	0000-00-00	1998-12-31
 count(*)
-6001215
+6004736
 count(o_orderkey)	min(o_orderkey)	max(o_orderkey)	sum(o_orderkey)	avg(o_orderkey)
-1500000	1	6000000	4499987250000	2999991.5000
+1501184	0	6000000	4499987250000	2997625.3744
 count(o_custkey)	min(o_custkey)	max(o_custkey)	sum(o_custkey)	avg(o_custkey)
-1500000	1	149999	112509060862	75006.0406
+1501184	0	149999	112509060862	74946.8825
 count(o_orderstatus)	min(o_orderstatus)	max(o_orderstatus)
-1500000	F	P
+1507328	NULL	P
 count(o_totalprice)	min(o_totalprice)	max(o_totalprice)	sum(o_totalprice)	avg(o_totalprice)
-1500000	857.71	555285.16	226829306447.46	151219.537632
+1500160	0.00	555285.16	226829306447.46	151203.409268
 count(o_orderdate)	min(o_orderdate)	max(o_orderdate)
-1500000	1992-01-01	1998-08-02
+1501184	0000-00-00	1998-08-02
 count(o_orderpriority)	min(o_orderpriority)	max(o_orderpriority)
 1500000	1-URGENT	5-LOW
 count(o_clerk)	min(o_clerk)	max(o_clerk)
 1500000	Clerk#000000001	Clerk#000001000
 count(o_shippriority)	min(o_shippriority)	max(o_shippriority)
-1500000	0	0
+1501184	0	0
 count(o_comment)	min(o_comment)	max(o_comment)
 1500000	 Tiresias about the blithely ironic a	zzle? furiously ironic instructions among the unusual t
 count(*)
-1500000
+1507328
 count(n_nationkey)	min(n_nationkey)	max(n_nationkey)	sum(n_nationkey)	avg(n_nationkey)
 25	0	24	300	12.0000
 count(n_name)	min(n_name)	max(n_name)
@@ -99,17 +99,17 @@
 count(*)
 10000
 count(ps_partkey)	min(ps_partkey)	max(ps_partkey)	sum(ps_partkey)	avg(ps_partkey)
-800000	1	200000	80000400000	100000.5000
+800768	0	200000	80000400000	99904.5916
 count(ps_suppkey)	min(ps_suppkey)	max(ps_suppkey)	sum(ps_suppkey)	avg(ps_suppkey)
-800000	1	10000	4000400000	5000.5000
+800768	0	10000	4000400000	4995.7041
 count(ps_availqty)	min(ps_availqty)	max(ps_availqty)	sum(ps_availqty)	avg(ps_availqty)
-800000	1	9999	4002581547	5003.2269
+800768	0	9999	4002581547	4998.4284
 count(ps_supplycost)	min(ps_supplycost)	max(ps_supplycost)	sum(ps_supplycost)	avg(ps_supplycost)
-800000	1.00	1000.00	400420638.54	500.525798
+800768	0.00	1000.00	400420638.54	500.045754
 count(ps_comment)	min(ps_comment)	max(ps_comment)
 800000	 Tiresias according to the quiet courts sleep against the ironic, final requests. carefully unusual requests affix fluffily quickly ironic packages. regular 	zzle. unusual decoys detect slyly blithely express frays. furiously ironic packages about the bold accounts are close requests. slowly silent reque
 count(*)
-800000
+800768
 count(p_partkey)	min(p_partkey)	max(p_partkey)	sum(p_partkey)	avg(p_partkey)
 200000	1	200000	20000100000	100000.5000
 count(p_name)	min(p_name)	max(p_name)

Comment by Roman [ 2018-12-20 ]

Please review the code change.

Comment by Roman [ 2018-12-20 ]

The feature allows one to disable space preallocation when CS adds new or expands abbreviated extents. Nevertheless CS preallocates initial abbreviated extents using fallocate (if filesystem supports it) or sequential write as fallback. Abbreviated extents uses 2MB of disk space only.
How to enable the feature. Space preallocation could be disabled dbroot-wise. To disable preallocation for DBRoot10 one must add this snippet into Columnstore.xml and reboot CS.

<DBRoot1>
     	<PreallocSpace>OFF</PreallocSpace>
</DBRoot1>

It is easy to check that the feature works. Disable space preallocation as said before. Then create a table and insert > 8kk rows into it. Check for disk space used with this query.

select * from information_schema.columnstore_files

Previous CS versions eventually allocate 64MB files and now it allocates the exact needed amount of disk space.

I used this sequence to create a test table.

drop table cs27; set infinidb_compression_type = 2; create table cs27 (o_comment varchar(79)) engine=columnstore;

Here is the link to a 1.5kk dataset to use with the table.

Comment by Roman [ 2018-12-21 ]

Please review the change.

Comment by Patrick LeBlanc (Inactive) [ 2019-01-04 ]

Reviewed it, requested a couple changes.

Comment by Roman [ 2019-03-26 ]

Please review the feature that allows to disable disk space preallocation.

Comment by Roman [ 2019-04-29 ]

For QA:

Here is the snippet from the XML config to disable disk space preallocation for dbroot 1:

<DBRoot1>
     	<PreallocSpace>OFF</PreallocSpace>
</DBRoot1>

Here is the scenario to check the feature works. Upload >= 300 000 records in the table then check the size of the segment/dictionary files:

  • using 'du' utility
  • calling columnstore_info.table_usage()
  • getting data from information_schema.columnstore_files
Comment by Daniel Lee (Inactive) [ 2019-09-20 ]

Build verified: 1.4.0-1

[dlee@master centos7]$ cat gitversionInfo.txt
engine commit:
975463c

Test #1

Create database only, no dataloading
(disconenct source data so data would not be loaded)

1.2.5-1

[root@localhost data1]# du -sh 000.dir
58M 000.dir
[root@localhost data1]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1m columnstore 1m
[root@localhost data1]# du -sh 000.dir
205M 000.dir
[root@localhost data1]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1g columnstore 1g
[root@localhost data1]# du -sh 000.dir
353M 000.dir

Each database creation used about 147MB

1.4.0-1

[root@localhost data1]# du -sh 000.dir
58M 000.dir
[root@localhost ~]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1m columnstore 1m
[root@localhost data1]# du -sh 000.dir
69M 000.dir
[root@localhost data1]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1g columnstore 1g
[root@localhost data1]# du -sh 000.dir
80M 000.dir

Each database created used 11MB

Test #2
Create table and load data

1.2.5-1

[root@localhost columnstore]# du -sh data1/000.dir
58M data1/000.dir
[root@localhost columnstore]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1m columnstore 1m
[root@localhost columnstore]# du -sh data1/000.dir
205M data1/000.dir
[root@localhost columnstore]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1g columnstore 1g
[root@localhost columnstore]# du -sh data1/000.dir
2.5G data1/000.dir

For the 1m database, it used 205M only, the same amount of disk space as create database only in test #1.
The 1m dataset (lineitem has only 6005 rows) fits into the pre-allocated space so no new extents are needed.

1.4.0-1

[root@localhost columnstore]# du -sh data1/000.dir
58M data1/000.dir
[root@localhost columnstore]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1m columnstore 1m
[root@localhost columnstore]# du -sh data1/000.dir
70M data1/000.dir
[root@localhost columnstore]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1g columnstore 1g
[root@localhost columnstore]# du -sh data1/000.dir
634M data1/000.dir

Test #3

Compare disk space utilizaiton on a 1um2pm stack.
dbroot #2 configured not to pre-allocate disk space

with no user database

[root@localhost columnstore]# du -sh data1/000.dir
58M data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
0 data2/000.dir

/data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1m columnstore 1m

  1. create database without data loading (source data disconnected)

[root@localhost columnstore]# du -sh data1/000.dir
123M data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
6.6M data2/000.dir

/data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1g columnstore 1g

  1. with data loading

[root@localhost columnstore]# du -sh data1/000.dir
2.8G data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
272M data2/000.dir

/data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch10g columnstore 10g

  1. with data loading

[root@localhost columnstore]# du -sh data1/000.dir
8.4G data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
2.4G data2/000.dir

Start with an empty stack again

[root@localhost columnstore]# du -sh data1/000.dir
58M data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
0 data2/000.dir

[root@localhost columnstore]# du -sh data1/000.dir
5.7G data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
2.3G data2/000.dir

Comment by Daniel Lee (Inactive) [ 2019-09-23 ]

Test results for S3 installation wiht localStorage, single server

1.4.0-1

MCOL-498 not used

empty system

[root@localhost storagemanager]# du -sh *
60M cache
60M fake-cloud
0 journal
232K metadata

[root@localhost storagemanager]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1 columnstore 1g

  • create database only, no data loading (disconnected source data)

MCOL-498 not used

[root@localhost storagemanager]# du -sh *
207M cache
207M fake-cloud
0 journal
588K metadata

MCOL-498 used

[root@localhost storagemanager]# du -sh *
71M cache
71M fake-cloud
0 journal
588K metadata

[root@localhost storagemanager]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1 columnstore 1g

  • create database and load data

MCOL-498 not used

[root@localhost storagemanager]# du -sh *
1.9G cache
2.3G fake-cloud
0 journal
600K metadata

MCOL-498 used

[root@localhost storagemanager]# du -sh *
496M cache
496M fake-cloud
0 journal
584K metadata

[root@localhost storagemanager]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch10 columnstore 10g

  • create database and load data

MCOL-498 not used

[root@localhost storagemanager]# du -sh *
1.8G cache
11G fake-cloud
0 journal
1.4M metadata

MCOL-498 used

[root@localhost storagemanager]# du -sh *
1.9G cache
5.0G fake-cloud
0 journal
1.3M metadata

Comment by Daniel Lee (Inactive) [ 2019-09-23 ]

Test results for gluster, 1um2pm configuration

1.4.0-1

dbroot 1 used MCOL-498, dbroot 2 did not

empty system

[root@localhost columnstore]# du -sh data1/000.dir
58M data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
1.5K data2/000.dir
[root@localhost columnstore]#

[root@localhost storagemanager]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1 columnstore 1g

  • create database only, no data loading (disconnected source data)

[root@localhost columnstore]# du -sh data1/000.dir
63M data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
83M data2/000.dir

[root@localhost storagemanager]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch1 columnstore 1g

  • create database and load data

[root@localhost columnstore]# du -sh data1/000.dir
279M data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
1.9G data2/000.dir

[root@localhost storagemanager]# /data/qa/autopilot/databases/dbt3/sh/buildDatabase.sh tpch10 columnstore 10g

  • create database and load data

[root@localhost columnstore]# du -sh data1/000.dir
2.5G data1/000.dir

[root@localhost columnstore]# du -sh data2/000.dir
7.1G data2/000.dir

Generated at Thu Feb 08 02:21:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.