I've extended my tests / profiling to also examine the performance impact of the multi-threaded mcsimport on Linux operating systems. The results differ from those for Windows.
First test case with CentOS 7 and Ubuntu 18.04 in a Virtual Box environment
A single server installation of ColumnStore 1.2.2-1 from the package repo was performed. mcsimport is executed on the same machine.
In the tables, 1.2.3 denotes the single-threaded mcsimport from develop-1.2 (the baseline), MCOL-2089 the new multi-threaded implementation, and -O3 the optimizer flag used at compile time. Each run executed load_test_2 from mcsimport's regression test suite, which imports a single 1.28GB CSV file with three columns of integers. Values in square brackets are the averaged baseline time or the change of a column's average relative to that baseline.
Installed kernels:
Linux centos7 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Linux ubuntu18 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Virtual Box tests against ColumnStore 1.2.2-1 on VMs with 8GiB of memory and 4 cores and 8 threads. [host maximum]
In the gcc-7 case of CentOS 7, mcsapi was also compiled with gcc-7. (load_test_2)
OS            Compiler         1.2.3     | MCOL-2089   1.2.3 -O3   | MCOL-2089 -O3
CentOS 7      gcc 4.8.5-16     444s      | 403s        334s        | 320s
                               429s      | 405s        326s        | 337s
                                         |             332s        | 337s
                               [436.5s]  | [-7.4%]     [-24.2%]    | [-24.1%]
CentOS 7      gcc-7 7.3.1-5    432s      | 380s        328s        | 333s
                               441s      | 371s        331s        | 340s
                               [0%]      | [-14%]      [-24.5%]    | [-22.9%]
Ubuntu 18.04  gcc-7 7.3.0-27   325s      | 236s        209s        | 184s
                               338s      | 239s        212s        | 169s
                               [331.5s]  | [-28.4%]    [-36.5%]    | [-46.8%]
In this over-threaded setup, the optimized multi-threaded mcsimport loses its advantage on CentOS 7: it is on par with or slightly slower than the optimized single-threaded build. Ubuntu 18.04 is the exception; it seems able to cope with over-threaded setups and performs similarly to the optimal case with 2 cores and 4 threads. It shows the same behaviour in the over-threaded buildbot sample. The CentOS 7 compiler difference is marginal.
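The bracketed values in these tables can be reproduced with two small helpers. This is only a sketch of the arithmetic; the function names `meanSeconds` and `relativeChangePercent` are made up for illustration and are not part of mcsimport or its test suite.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of run times in seconds.
double meanSeconds(const std::vector<double>& runs) {
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0) /
           static_cast<double>(runs.size());
}

// Change of `candidate` relative to `baseline`, in percent.
// Negative values mean the candidate finished faster than the baseline.
double relativeChangePercent(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}
```

For the CentOS 7 gcc 4.8.5-16 runs with 4 cores, `meanSeconds({444, 429})` gives the 436.5s baseline, and `relativeChangePercent(436.5, meanSeconds({403, 405}))` gives roughly -7.4, matching the MCOL-2089 column.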
Virtual Box tests against ColumnStore 1.2.2-1 on VMs with 8GiB of memory and 2 cores and 4 threads.
In the gcc-7 case of CentOS 7, mcsapi was also compiled with gcc-7. (load_test_2)
OS            Compiler         1.2.3     | MCOL-2089   1.2.3 -O3   | MCOL-2089 -O3
CentOS 7      gcc 4.8.5-16     440s      | 361s        348s        | 271s
                               437s      | 348s        345s        | 276s
                               [438.5s]  | [-19.2%]    [-21%]      | [-37.6%]
CentOS 7      gcc-7 7.3.1-5    451s      | 339s        345s        | 256s
                               446s      | 345s        339s        | 252s
                               [+2.3%]   | [-22%]      [-22%]      | [-42.1%]
Ubuntu 18.04  gcc-7 7.3.0-27   335s      | 295s        229s        | 189s
                               361s      | 303s        230s        | 192s
                               [348s]    | [-14.1%]    [-34.1%]    | [-45.3%]
This seems to be the optimal test case setup for the multi-threaded mcsimport: there is one thread for CS and three threads for mcsimport.
Here the multi-threaded mcsimport outperforms the single-threaded one. The CentOS 7 compiler difference only takes effect in the optimized multi-threaded use case.
Virtual Box tests against ColumnStore 1.2.2-1 on VMs with 8GiB of memory and 1 core and 2 threads.
In the gcc-7 case of CentOS 7, mcsapi was also compiled with gcc-7. (load_test_2)
OS            Compiler         1.2.3     | MCOL-2089   1.2.3 -O3   | MCOL-2089 -O3
CentOS 7      gcc 4.8.5-16     424s      | 545s        344s        | 434s
                               421s      | 558s        338s        | 429s
                               [422.5s]  | [+30.5%]    [-19.3%]    | [+2.1%]
CentOS 7      gcc-7 7.3.1-5    434s      | 562s        327s        | 400s
                               426s      | 568s        328s        | 407s
                               [+1.8%]   | [+33.7%]    [-22.5%]    | [-4.5%]
Ubuntu 18.04  gcc-7 7.3.0-27   357s      | 505s        219s        | 293s
                               359s      | 504s        220s        | 276s
                               [358s]    | [+40.9%]    [-38.7%]    | [-20.5%]
Not surprisingly, on an under-threaded machine the single-threaded mcsimport outperforms the multi-threaded one.
CentOS's gcc-7 compiler performs better in an under-threaded environment than the default version.
Second test case - buildbot execution times of load_test_2
Similar test as above, but using buildbot for the execution. The EC2 instances used by buildbot are c4.2xlarge, which have 8 vCPUs and 15GiB of memory and are therefore an over-threaded environment.
OS            Compiler         1.2.3     | MCOL-2089   1.2.3 -O3   | MCOL-2089 -O3
CentOS 7      gcc 4.8.5-16     207s      | 259s        173s        | 259s
                                         | [+25.1%]    [-16.4%]    | [+25.1%]
Debian 8      gcc-4.9 4.9.2-2  199s      | 261s        164s        | 272s
                                         | [+31.2%]    [-17.6%]    | [+36.7%]
Ubuntu 16.04  gcc-5 5.3.1-3    164s      | 261s        117s        | 204s
                                         | [+59.1%]    [-28.7%]    | [+24.4%]
Debian 9      gcc-6 6.3.0-9    165s      | 223s        124s        | 218s
                                         | [+35.2%]    [-24.8%]    | [+32.1%]
Ubuntu 18.04  gcc-7 7.3.0-27   158s      | 121s        115s        | 105s
                                         | [-23.4%]    [-27.2%]    | [-33.5%]
This shows us that the single-threaded mcsimport outperforms the multi-threaded mcsimport on every OS except Ubuntu 18.04 during the buildbot test execution. It further shows a performance gain of around 23% for the single-threaded mcsimport when using the optimization flag -O3. This directly contradicts the findings on my Virtual Box setup, from which I expected a difference of up to 10% between the multi-threaded and single-threaded execution, not more than 50%.
Third test case - mcsimport injection from Windows 10
Comparison of mcsimport injection from Windows 10 (4 cores) into ColumnStore 1.2.2-1 hosted on CentOS 7 and Ubuntu 18.04 (Virtual Box VMs). (load_test_2)
ColumnStore host    1.2.3     | MCOL-2089
CentOS 7 (CS)       167s      | 145s
                    164s      | 146s
                    [165.5s]  | [-12.1%]
Ubuntu 18.04 (CS)   129s      | 112s
                    127s      | 111s
                    [128s]    | [-12.9%]
This shows us that there is a performance difference of around 23% based solely on the operating system used to host ColumnStore.
One probable cause, among others, is the different C++ compiler version used to build the ColumnStore packages. This also shows that the multi-threaded implementation of mcsimport performs around 12.5% better than the single-threaded one on a Windows 10 machine with 4 cores. As the Windows build uses an optimizer by default, there is no -O3 column.
Fourth test case - mcsapi compiler / optimizer impact
CentOS 7 API 1.2.3 Million Row tests
API       gcc 4.8.5-16   gcc 4.8.5-16 -O3   gcc-7 7.3.1-5   gcc-7 7.3.1-5 -O3
cpp       19.59s         n/a                19.69s          19.57s
          19.55s         n/a                19.25s          20.43s
          20.60s         n/a                19.23s          19.62s
          [19.91s]                          [-2.6%]         [-0.2%]
python2   22.82s         n/a                20.72s          21.08s
                                            [-9.2%]         [-7.6%]
python3   54.19s         n/a                51.67s          52.66s
                                            [-4.7%]         [-2.8%]
java      17.55s         n/a                18.46s          18.12s
                                            [+5.2%]         [+3.25%]
This shows us that the choice of C++ compiler and optimizer option can have an effect of around 5% on the performance.
But more data points should be collected to verify this hypothesis.
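One way to judge whether a difference of a few percent is meaningful is to compare it with the run-to-run spread of a single configuration. The following is a minimal sketch of that check; the helper names are hypothetical, and the sample standard deviation is my choice of spread measure, not something used in the measurements above.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of the run times in seconds.
double runMean(const std::vector<double>& runs) {
    assert(!runs.empty());
    double sum = 0.0;
    for (double r : runs) sum += r;
    return sum / static_cast<double>(runs.size());
}

// Sample standard deviation (n - 1 in the denominator) in seconds.
double sampleStdDev(const std::vector<double>& runs) {
    assert(runs.size() >= 2);
    const double mean = runMean(runs);
    double sumSq = 0.0;
    for (double r : runs) sumSq += (r - mean) * (r - mean);
    return std::sqrt(sumSq / static_cast<double>(runs.size() - 1));
}

// Standard deviation expressed as a percentage of the mean.
double spreadPercent(const std::vector<double>& runs) {
    return sampleStdDev(runs) / runMean(runs) * 100.0;
}
```

For the three cpp runs with gcc 4.8.5-16 (19.59s, 19.55s, 20.60s), `spreadPercent` comes out at roughly 3%, which is the same order of magnitude as the compiler differences in the cpp row. That supports collecting more runs before drawing conclusions.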
My conclusion:
- Using the -O3 optimizer flag gives us an around 24.5% performance enhancement for the single-threaded mcsimport. [median of all tests]
- The multi-threaded optimized mcsimport can give performance enhancements between 37% and 46% under optimal conditions.
- In under-threaded environments the optimized multi-threaded mcsimport performs between 20% and 30% worse than the optimized single-threaded mcsimport. We could change the program to use either the multi-threaded or the single-threaded code path depending on the number of cores it detects at run time.
- In over-threaded environments the optimized multi-threaded mcsimport performance degrades on every Linux operating system except Ubuntu 18.04. The degradation is between 15% and more than 50% depending on the execution setup (local VM vs. buildbot). Therefore, here it is less efficient than the optimized single-threaded mcsimport. I don't have any explanation for this behaviour, but thanks to the CentOS 7 tests we can rule out that it is gcc-7/compiler related. Some advice on how to investigate further would be great.
- On Windows switching from single-threaded to multi-threaded mcsimport has a positive impact of around 12.5% on multi-core systems. The performance in under-threaded environments hasn't been evaluated yet.
- Switching the CentOS compiler to gcc-7 especially benefits the multi-threaded performance, but has little effect on the single-threaded performance.
- We might want to evaluate why there is a server-side performance difference of around 23% during the injection depending on the OS used to host ColumnStore (see the third test case), as well as the effect of using the -O3 flag for mcsapi.
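The run-time selection suggested for under-threaded environments could look roughly like the sketch below. Everything here is hypothetical: `ImportMode`, `chooseImportMode` and the default threshold of 4 hardware threads are illustrative assumptions and do not exist in mcsimport. The threshold only reflects that multi-threading lost with 2 threads and won with 4 threads in the Virtual Box tests; it does not address the over-threaded degradation.

```cpp
#include <cassert>
#include <thread>

enum class ImportMode { SingleThreaded, MultiThreaded };

// Hypothetical selection rule (not part of mcsimport): use the single-threaded
// path when fewer than `minThreadsForMulti` hardware threads are available.
// A count of 0 means the number of threads could not be determined.
ImportMode chooseImportMode(unsigned hardwareThreads,
                            unsigned minThreadsForMulti = 4) {
    assert(minThreadsForMulti > 0);
    if (hardwareThreads == 0 || hardwareThreads < minThreadsForMulti) {
        return ImportMode::SingleThreaded;
    }
    return ImportMode::MultiThreaded;
}

// Convenience wrapper that queries the current host.
// std::thread::hardware_concurrency() may return 0 if the value is unknown.
ImportMode chooseImportModeForThisHost() {
    return chooseImportMode(std::thread::hardware_concurrency());
}
```

Keeping the thread count as a parameter makes the rule testable without depending on the machine it runs on, and the threshold could later be tuned once the over-threaded behaviour is understood.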
TL;DR: We can get a 24.5% improvement right away by enabling -O3 for the single-threaded mcsimport. We could squeeze out 20% more performance if we use pipelining and figure out why the performance degrades when executing on over-threaded Linux operating systems (except Ubuntu 18.04). We also have to find a solution to minimize the performance degradation when executing in under-threaded environments.
My suggestion: Merge PR 34 and close PR 33 with the note that over-threaded and under-threaded environments need to be considered better. Then move MCOL-2089 to testing and create a new ticket to address the changes for multi-threaded.
Please update the "Affected Version" field in the jira item winstone