[MCOL-2089] High CPU usage and slow performance appears when load data with remote mcsimport Created: 2019-01-15  Updated: 2023-10-26  Resolved: 2019-03-22

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.2.2
Fix Version/s: 1.2.4

Type: Bug Priority: Major
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Jens Röwekamp (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

mcsimport tool run remotely to mcs single server


Attachments: Zip Archive mcsimport-benchmark.zip    
Issue Links:
Relates
relates to MCOL-2226 Improve performance of mcsimport - ma... Closed
relates to MCOL-2038 mcsimport load time is significantly... Closed
Sprint: 2019-02, 2019-03

 Description   

High CPU usage and slow performance appears when load data with remote mcsimport

Ran the autopilot cpimportLineitem test case group with the mcsimport option. All tests passed,
but high CPU usage was observed and the tests finished slowly, even in comparison to
MariaDB mysqlimport, which uses the SQL statement LOAD DATA LOCAL INFILE on MCS.

how to repeat:
run remotely autopilot cpimportLineitem test case group with option mcsimport
run remotely autopilot cpimportLineitem test case group with option mysqlimport
./autopilot.sh features cpimportLineitem

Remote Load Method   Elapsed Time [s]
MCSIMPORT            6918
MYSQLIMPORT          2180

High CPU usage was observed for the entire duration of the data load with mcsimport.

# top
top - 14:04:09 up 53 days,  2:36,  4 users,  load average: 0.83, 0.82, 0.62
Tasks: 180 total,   3 running, 167 sleeping,   8 stopped,   2 zombie
%Cpu(s): 10.3 us,  0.2 sy,  0.0 ni, 89.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 65975072 total,   454704 free, 13455012 used, 52065356 buff/cache
KiB Swap:  1048572 total,   745468 free,   303104 used. 49717568 avail Mem
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21814 root      20   0  500040 170864   2524 R  83.7  0.3   9:40.07 mcsimport
10284 root      20   0  162004   2320   1584 R   0.3  0.0   0:12.78 top
17218 mysql     20   0 4911300   1.0g  17944 S   0.3  1.6  76:38.07 mysqld
    1 root      20   0  191548   2920   1924 S   0.0  0.0   0:17.49 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.60 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:03.22 ksoftirqd/0
    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H
    7 root      rt   0       0      0      0 S   0.0  0.0   0:22.88 migration/0
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh

BF Passed rowCnt=1024 actRowCnt=1024
BC Passed rowCnt=1025 actRowCnt=1025
TF Passed rowCnt=253952 actRowCnt=253952
TC Passed rowCnt=253953 actRowCnt=253953
CF Passed rowCnt=516096 actRowCnt=516096
CC Passed rowCnt=516097 actRowCnt=516097
EF Passed rowCnt=8380416 actRowCnt=8380416
EC Passed rowCnt=8380417 actRowCnt=8380417
SF Passed rowCnt=33546240 actRowCnt=33546240
SW Passed rowCnt=33546241 actRowCnt=33546241
PF Passed rowCnt=67100672 actRowCnt=67100672
PC Passed rowCnt=67100673 actRowCnt=67100673
[root@cps tests]#

Stack trace captured during the loading of the EC test:

# gdb -batch -ex 'thr a a bt' -p=$(pgrep mcsimport)
[New LWP 21818]
[New LWP 21817]
[New LWP 21816]
[New LWP 21815]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fd4e3962cc9 in ____strtod_l_internal () from /lib64/libc.so.6
 
Thread 5 (Thread 0x7fd4e250a700 (LWP 21815)):
#0  0x00007fd4e2730995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd4e31a4fc9 in uv_cond_wait () from /lib64/libuv.so.1
#2  0x00007fd4e3194136 in worker () from /lib64/libuv.so.1
#3  0x00007fd4e272ce25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd4e3a22bad in clone () from /lib64/libc.so.6
 
Thread 4 (Thread 0x7fd4e1d09700 (LWP 21816)):
#0  0x00007fd4e2730995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd4e31a4fc9 in uv_cond_wait () from /lib64/libuv.so.1
#2  0x00007fd4e3194136 in worker () from /lib64/libuv.so.1
#3  0x00007fd4e272ce25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd4e3a22bad in clone () from /lib64/libc.so.6
 
Thread 3 (Thread 0x7fd4e1508700 (LWP 21817)):
#0  0x00007fd4e2730995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd4e31a4fc9 in uv_cond_wait () from /lib64/libuv.so.1
#2  0x00007fd4e3194136 in worker () from /lib64/libuv.so.1
#3  0x00007fd4e272ce25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd4e3a22bad in clone () from /lib64/libc.so.6
 
Thread 2 (Thread 0x7fd4e0d07700 (LWP 21818)):
#0  0x00007fd4e2730995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd4e31a4fc9 in uv_cond_wait () from /lib64/libuv.so.1
#2  0x00007fd4e3194136 in worker () from /lib64/libuv.so.1
#3  0x00007fd4e272ce25 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fd4e3a22bad in clone () from /lib64/libc.so.6
 
Thread 1 (Thread 0x7fd4e4967740 (LWP 21814)):
#0  0x00007fd4e3962cc9 in ____strtod_l_internal () from /lib64/libc.so.6
#1  0x00007fd4e453d0bb in mcsapi::ColumnStoreDataConvert::convert (toMeta=toMeta@entry=0x7ffd3f545630, cont=0x34db038, fromValue=...) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-api/src/util_dataconvert.cpp:1151
#2  0x00007fd4e45384bc in mcsapi::ColumnStoreBulkInsertImpl::setCharColumn (this=0xa8f1e0, columnNumber=6, value=..., status=0x7ffd3f545704) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-api/src/mcsapi_bulk.cpp:481
#3  0x00007fd4e45387a8 in mcsapi::ColumnStoreBulkInsert::setColumn (this=0xa905e0, columnNumber=<optimized out>, value=..., status=<optimized out>) at /data/buildbot/bb-worker/centos7/mariadb-columnstore-api/src/mcsapi_bulk.cpp:75
#4  0x0000000000431ef2 in MCSRemoteImport::import() ()
#5  0x000000000042ceab in main ()
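
To read a backtrace like the one above, it can help to reduce each thread to its innermost frame. The following is a minimal Python sketch and not part of mcsimport or autopilot; the function name top_frames and the regular expressions are illustrative assumptions matched to the gdb output format shown in this ticket.

```python
import re

# Matches gdb thread headers such as "Thread 5 (Thread 0x7fd4e250a700 (LWP 21815)):"
THREAD_RE = re.compile(r"^Thread (\d+) \(Thread 0x[0-9a-f]+ \(LWP \d+\)\):")
# Matches the innermost frame, e.g. "#0  0x00007f... in uv_cond_wait () from ..."
FRAME0_RE = re.compile(r"^#0\s+(?:0x[0-9a-f]+ in )?(\S+) \(")


def top_frames(backtrace):
    """Map each gdb thread number to the function name in its innermost frame (#0)."""
    result = {}
    current = None
    for raw in backtrace.splitlines():
        line = raw.strip()
        header = THREAD_RE.match(line)
        if header:
            current = int(header.group(1))
            continue
        frame = FRAME0_RE.match(line)
        if frame and current is not None:
            result[current] = frame.group(1)
            current = None  # ignore deeper frames of this thread
    return result


# Two threads excerpted from the trace in this ticket.
SAMPLE = """\
Thread 5 (Thread 0x7fd4e250a700 (LWP 21815)):
#0  0x00007fd4e2730995 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fd4e31a4fc9 in uv_cond_wait () from /lib64/libuv.so.1

Thread 1 (Thread 0x7fd4e4967740 (LWP 21814)):
#0  0x00007fd4e3962cc9 in ____strtod_l_internal () from /lib64/libc.so.6
"""

frames = top_frames(SAMPLE)
# Threads whose innermost frame is not a condition-variable wait are doing work.
busy = {tid: fn for tid, fn in frames.items() if not fn.startswith("pthread_cond_wait")}
print(busy)
```

Applied to the full trace, such a summary shows threads 2 to 5 (the libuv workers) waiting in pthread_cond_wait, while thread 1 is inside ____strtod_l_internal, called from ColumnStoreDataConvert::convert via setCharColumn.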



 Comments   
Comment by Dipti Joshi (Inactive) [ 2019-01-21 ]

Please update the "Affected Version" field in the jira item winstone

Comment by Jens Röwekamp (Inactive) [ 2019-02-15 ]

Made mcsimport multi threaded.
One thread reads the CSV file, one thread parses it into CSV fields, and one thread writes the CSV fields to CS.
They communicate through two FIFO queues implemented with ring buffers.
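
The three-stage design described above can be sketched as follows. This is a simplified Python illustration, not the mcsimport code (which is C++): it swaps in the standard library's bounded queue.Queue for the ring buffers, and the names run_pipeline and write_row are hypothetical. Error handling and multi-line quoted CSV fields are left out.

```python
import csv
import io
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def run_pipeline(csv_text, write_row, queue_size=1024):
    """Read lines, parse them into fields, and pass rows to write_row, one thread per stage."""
    raw_lines = queue.Queue(maxsize=queue_size)    # reader -> parser
    parsed_rows = queue.Queue(maxsize=queue_size)  # parser -> writer

    def reader():
        for line in io.StringIO(csv_text):
            raw_lines.put(line)  # blocks while the queue is full
        raw_lines.put(_DONE)

    def parser():
        while True:
            line = raw_lines.get()
            if line is _DONE:
                parsed_rows.put(_DONE)
                return
            # Parse one physical line into a list of fields.
            parsed_rows.put(next(csv.reader([line])))

    def writer():
        while True:
            row = parsed_rows.get()
            if row is _DONE:
                return
            write_row(row)  # stand-in for the write to ColumnStore

    threads = [threading.Thread(target=stage) for stage in (reader, parser, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


rows = []
run_pipeline("1,2,3\n4,5,6\n", rows.append)
print(rows)
```

The bounded queues are what cap memory use: a fast reader stalls once queue_size lines are waiting, rather than buffering the whole file.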

Performance gain is around 25% compared to the single threaded 1.2.2 implementation of mcsimport.
Used the test suite's load_test_2 (1.2GiB CSV file) as reference.
On the downside, because it uses more threads and buffers, the implementation now consumes around 10 times more RAM and 1.5 times more CPU cycles.

Test suite successfully executed on Windows 10 against a remote CS 1.2.2-1 instance on CentOS 7.

Comment by Jens Röwekamp (Inactive) [ 2019-02-15 ]

For QA:

  • execute test suite (or verify buildbot's execution)
  • as some major changes have been introduced please test it more extensively
  • also verify on bigger datasets if there is a performance gain compared to the old 1.2.2 implementation
Comment by Jens Röwekamp (Inactive) [ 2019-02-20 ]

I've extended my tests / profiling to also examine the performance impact of multi-threaded mcsimport on Linux operating systems. They differ from the results for Windows.

First test case with CentOS 7 and Ubuntu 18.04 in a Virtual Box environment
A single server installation of ColumnStore 1.2.2-1 from the package repo was performed. mcsimport is executed on the same machine.
In the tables below, 1.2.3 denotes the single-threaded mcsimport from develop-1.2 (the baseline), MCOL-2089 the new multi-threaded implementation, and -O3 the optimizer flag used during compilation. The test executed was load_test_2 from mcsimport's regression test suite, which imports a single 1.28GB CSV file with three columns of integers.

Installed kernels:
Linux centos7 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 29 14:49:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Linux ubuntu18 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Virtual Box tests against ColumnStore 1.2.2-1 on VMs with 8GiB of memory and 4 cores and 8 threads. [host maximum]
In the gcc-7 case of CentOS 7, mcsapi was also compiled with gcc-7. (load_test_2)

                                      without -O3           with -O3
                                      1.2.3 | MCOL-2089     1.2.3 | MCOL-2089
CentOS 7      gcc       4.8.5-16      444s  | 403s           334s | 320s
                                      429s  | 405s           326s | 337s
                                            |                332s | 337s
                                   [436.5s] | [-7.4%]    [-24.2%] | [-24.1%]
              gcc-7     7.3.1-5       432s  | 380s           328s | 333s
                                      441s  | 371s           331s | 340s
                                       [0%] | [-14%]     [-24.5%] | [-22.9%]
Ubuntu 18.04  gcc-7     7.3.0-27      325s  | 236s           209s | 184s
                                      338s  | 239s           212s | 169s
                                   [331.5s] | [-28.4%]   [-36.5%] | [-46.8%]

In an over-threaded setup, the single-threaded mcsimport outperforms the multi-threaded one, except on Ubuntu 18.04, which seems able to deal with over-threaded setups and shows performance similar to the optimal case with 2 cores and 4 threads. It also shows this behaviour in the over-threaded buildbot sample. The CentOS 7 compiler difference is marginal.
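
The bracketed percentages in these tables appear to be the change in mean run time relative to the mean of the 1.2.3 baseline runs in the same row group. A minimal sketch of that arithmetic, using the CentOS 7 gcc 4.8.5-16 rows of the table above (the function name relative_change is an assumption for illustration):

```python
from statistics import mean


def relative_change(baseline_runs, candidate_runs):
    """Percent change of the mean candidate time versus the mean baseline time.

    Negative values mean the candidate was faster than the baseline.
    """
    base = mean(baseline_runs)
    return (mean(candidate_runs) - base) / base * 100.0


baseline = [444, 429]  # 1.2.3 single-threaded, seconds (mean 436.5s)

print(round(relative_change(baseline, [403, 405]), 1))       # MCOL-2089
print(round(relative_change(baseline, [334, 326, 332]), 1))  # 1.2.3 with -O3
print(round(relative_change(baseline, [320, 337, 337]), 1))  # MCOL-2089 with -O3
```

Under this reading the three calls reproduce the table's -7.4%, -24.2% and -24.1%.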

Virtual Box tests against ColumnStore 1.2.2-1 on VMs with 8GiB of memory and 2 cores and 4 threads.
In the gcc-7 case of CentOS 7, mcsapi was also compiled with gcc-7. (load_test_2)

                                      without -O3           with -O3
                                      1.2.3 | MCOL-2089     1.2.3 | MCOL-2089
CentOS 7      gcc       4.8.5-16      440s  | 361s           348s | 271s
                                      437s  | 348s           345s | 276s
                                   [438.5s] | [-19.2%]     [-21%] | [-37.6%]
              gcc-7     7.3.1-5       451s  | 339s           345s | 256s
                                      446s  | 345s           339s | 252s
                                    [+2.3%] | [-22%]       [-22%] | [-42.1%]
Ubuntu 18.04  gcc-7     7.3.0-27      335s  | 295s           229s | 189s
                                      361s  | 303s           230s | 192s
                                     [348s] | [-14.1%]   [-34.1%] | [-45.3%]

This seems to be the optimal test case setup for the multi-threaded implementation: there is one thread for CS and three threads for mcsimport.
Here the multi-threaded mcsimport outperforms the single-threaded one. The CentOS 7 compiler difference only takes effect in the optimized multi-threaded use case.

Virtual Box tests against ColumnStore 1.2.2-1 on VMs with 8GiB of memory and 1 core and 2 threads.
In the gcc-7 case of CentOS 7, mcsapi was also compiled with gcc-7. (load_test_2)

                                      without -O3           with -O3
                                      1.2.3 | MCOL-2089     1.2.3 | MCOL-2089
CentOS 7      gcc       4.8.5-16      424s  | 545s           344s | 434s
                                      421s  | 558s           338s | 429s
                                   [422.5s] | [+30.5%]   [-19.3%] | [+2.1%]
              gcc-7     7.3.1-5       434s  | 562s           327s | 400s
                                      426s  | 568s           328s | 407s
                                    [+1.8%] | [+33.7%]   [-22.5%] | [-4.5%]
Ubuntu 18.04  gcc-7     7.3.0-27      357s  | 505s           219s | 293s
                                      359s  | 504s           220s | 276s
                                     [358s] | [+40.9%]   [-38.7%] | [-20.5%]

Not surprisingly, on an under-threaded machine the single-threaded mcsimport outperforms the multi-threaded one.
CentOS's gcc-7 compiler performs better in an under-threaded environment than the default version.

Second test case - buildbot execution times of load_test_2
Similar test as above, but using buildbot for the execution. The EC2 instances used by buildbot are c4.2xlarge ones which have 8vCPUs and 15GiB of memory. Therefore, an over-threaded environment.

                                      without -O3          with -O3
                                      1.2.3 | MCOL-2089    1.2.3 | MCOL-2089
CentOS 7      gcc       4.8.5-16      207s  | 259s          173s | 259s
                                            | [+25.1%]  [-16.4%] | [+25.1%]
Debian 8      gcc-4.9   4.9.2-2       199s  | 261s          164s | 272s
                                            | [+31.2%]  [-17.6%] | [+36.7%]
Ubuntu 16.04  gcc-5     5.3.1-3       164s  | 261s          117s | 204s
                                            | [+59.1%]  [-28.7%] | [+24.4%]
Debian 9      gcc-6     6.3.0-9       165s  | 223s          124s | 218s
                                            | [+35.2%]  [-24.8%] | [+32.1%]
Ubuntu 18.04  gcc-7     7.3.0-27      158s  | 121s          115s | 105s
                                            | [-23.4%]  [-27.2%] | [-33.5%]

This shows that the single-threaded mcsimport outperforms the multi-threaded mcsimport on every OS except Ubuntu 18.04 during the buildbot test execution.
It further shows a performance gain of around 23% for the single-threaded mcsimport when using the optimization flag -O3. This directly contradicts the findings on my Virtual Box setup, as I expected a difference of up to 10% between the multi-threaded and single-threaded execution, not more than 50%.

Third test case - mcsimport injection from Windows 10
CentOS 7, Ubuntu 18.04 ColumnStore 1.2.2-1 (Virtual Box VM) mcsimport injection from Windows 10 (4 cores) comparison (load_test_2)

                    1.2.3 | MCOL-2089
CentOS 7 (CS)       167s  | 145s
                    164s  | 146s
                 [165.5s] | [-12.1%]
Ubuntu 18.04 (CS)   129s  | 112s
                    127s  | 111s
                   [128s] | [-12.9%]

This shows a performance difference of around 23% based solely on the choice of operating system used for ColumnStore.
This is probably due, among other factors, to the different versions of the C++ compiler used to build the ColumnStore packages. It also shows that the multi-threaded implementation of mcsimport performs around 12.5% better than the single-threaded one on a Windows 10 machine with 4 cores. As Windows uses an optimizer by default, there is no -O3 flag.

Fourth test case - mcsapi compiler / optimizer impact
CentOS 7 API 1.2.3 Million Row tests

            4.8.5-16    -O3        7.3.1-5       -O3
cpp         19.59s      n/a        19.69s        19.57s
            19.55s      n/a        19.25s        20.43s
            20.60s      n/a        19.23s        19.62s
           [19.91s]                [-2.6%]       [-0.2%]
python2     22.82s      n/a        20.72s        21.08s
                                   [-9.2%]       [-7.6%]
python3     54.19s      n/a        51.67s        52.66s
                                   [-4.7%]       [-2.8%]
java        17.55s      n/a        18.46s        18.12s
                                   [+5.2%]       [+3.25%]

This shows that the choice of C++ compiler and optimizer option can have an effect of around 5% on performance.
But more data points should be collected to verify this hypothesis.

My conclusion:

  • Using the -O3 optimizer flag gives us an around 24.5% performance enhancement for the single-threaded mcsimport. [median of all tests]
  • The multi-threaded optimized mcsimport can give performance enhancements between 37% and 46% under optimal conditions.
  • In under-threaded environments the optimized multi-threaded mcsimport performs between 20% and 30% worse than the optimized single threaded mcsimport. We could change the program to use either the multi or single threaded mcsimport depending on the cores it detected during execution.
  • In over-threaded environments the optimized multi-threaded mcsimport performance degrades on every Linux operating system except Ubuntu 18.04. The degradation is between 15% and more than 50% depending on the execution setup (local VM vs. buildbot). Therefore, here it is less efficient than the single-threaded optimized mcsimport. I don't have any explanation for this behaviour. But based on the CentOS 7 tests, we can rule out that it is gcc-7/compiler related. Some advice on how to investigate further would be great.
  • On Windows switching from single-threaded to multi-threaded mcsimport has a positive impact of around 12.5% on multi-core systems. The performance in under-threaded environments hasn't been evaluated yet.
  • Switching the CentOS compiler to gcc-7 has an especially positive impact on the multi-threaded performance, but not so much on the single-threaded.
  • We might want to evaluate why there is a server side performance difference during the injection of around 12% depending on the OS used to host ColumnStore and the effect of using the -O3 flag for mcsapi as well.
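
The idea from the list above of choosing between the single-threaded and multi-threaded import path based on the detected cores could look roughly like this. It is a hypothetical Python sketch, not the mcsimport implementation: the function name and the threshold of 4 logical CPUs are assumptions, taken only from the observation that the 2-core/4-thread VM was the best case and the 1-core/2-thread VM the worst. It does not address the degradation seen in over-threaded Linux environments.

```python
import os


def choose_import_mode(logical_cpus=None, min_cpus_for_pipeline=4):
    """Return "multi-threaded" if enough logical CPUs are available, else "single-threaded".

    logical_cpus defaults to the count reported by the operating system.
    min_cpus_for_pipeline is an assumed threshold, not a measured one.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1  # cpu_count() may return None
    if logical_cpus >= min_cpus_for_pipeline:
        return "multi-threaded"
    return "single-threaded"


print(choose_import_mode(2))  # like the 1 core / 2 threads VM
print(choose_import_mode(4))  # like the 2 cores / 4 threads VM
```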

TL;DR: We can get a 24.5% improvement right away by enabling -O3 for the single-threaded mcsimport. We could squeeze out 20% more performance if we use pipelining and figure out why the performance degrades when executing on over-threaded Linux operating systems (except Ubuntu 18.04). We also have to find a solution to minimize the performance degradation when executing on under-threaded operating systems.

My suggestion: Merge PR 34 and close PR 33 with the note that over-threaded and under-threaded environments need to be considered better. Then move MCOL-2089 to testing and create a new ticket to address the changes for multi-threaded.

Comment by Jens Röwekamp (Inactive) [ 2019-03-08 ]

Attached logs verify that the multi threaded implementation of mcsimport has potential, but currently is still slower than the single threaded implementation on some operating systems.

Therefore, as indicated above the single threaded optimizations will be patched into 1.2.3 and the multi threaded implementation will be postponed to 1.2.4. It will be documented in MCOL-2226.

Comment by Zdravelina Sokolovska (Inactive) [ 2019-03-22 ]

1.2.2

Remote Load Method   Elapsed Time [s]
MCSIMPORT            6918
MYSQLIMPORT          2180

1.2.3

Remote Load Method   Elapsed Time [s]
MCSIMPORT            6303
MYSQLIMPORT          1914
*local CPIMPORT      924

BF Passed rowCnt=1024 actRowCnt=1024
BC Passed rowCnt=1025 actRowCnt=1025
TF Passed rowCnt=253952 actRowCnt=253952
TC Passed rowCnt=253953 actRowCnt=253953
CF Passed rowCnt=516096 actRowCnt=516096
CC Passed rowCnt=516097 actRowCnt=516097
EF Passed rowCnt=8380416 actRowCnt=8380416
EC Passed rowCnt=8380417 actRowCnt=8380417
SF Passed rowCnt=33546240 actRowCnt=33546240
SW Passed rowCnt=33546241 actRowCnt=33546241
PF Passed rowCnt=67100672 actRowCnt=67100672
PC Passed rowCnt=67100673 actRowCnt=67100673

Comment by Zdravelina Sokolovska (Inactive) [ 2019-03-22 ]

The issue is reopened because the test results on 1.2.3 show only a small improvement in mcsimport performance, less than 10% better than the 1.2.2 value.
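
The "less than 10%" figure can be checked from the elapsed times reported in the previous comment. A small sketch of the arithmetic (the helper name percent_faster is illustrative):

```python
def percent_faster(old_seconds, new_seconds):
    """Percent reduction in elapsed time going from old_seconds to new_seconds."""
    return (old_seconds - new_seconds) / old_seconds * 100.0


# Elapsed times in seconds from the 1.2.2 and 1.2.3 autopilot runs.
mcsimport_gain = percent_faster(6918, 6303)
mysqlimport_gain = percent_faster(2180, 1914)

# How many times slower remote mcsimport still is on 1.2.3.
vs_mysqlimport = 6303 / 1914
vs_local_cpimport = 6303 / 924

print(round(mcsimport_gain, 1), round(mysqlimport_gain, 1))
print(round(vs_mysqlimport, 1), round(vs_local_cpimport, 1))
```

This gives roughly 8.9% for mcsimport and 12.2% for mysqlimport, with mcsimport on 1.2.3 still about 3.3 times slower than mysqlimport and about 6.8 times slower than local cpimport.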

Comment by Andrew Hutchings (Inactive) [ 2019-03-22 ]

That is all the performance improvements we are going to get out of this ticket. The rest is being tracked in other tickets.

Generated at Thu Feb 08 02:33:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.