[MCOL-4450] cpimport on a 3 node system shows 10X as long as single node system Created: 2020-12-15  Updated: 2021-04-19  Resolved: 2020-12-15

Status: Closed
Project: MariaDB ColumnStore
Component/s: installation
Affects Version/s: 5.4.3
Fix Version/s: 5.4.3

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: David Hill (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Customer test setup:
1 node single-server system
3 node system with MaxScale

 Description   

Reported by customer:

I am seeing a major performance difference now that we have 3 nodes using GlusterFS. When we had a single node environment, everything was much quicker. Take a look at these cpimport logs:

2020-11-30 03:10:05 (3531772) INFO : For table 569_cdr.cisco: 350 rows processed and 350 rows inserted.
2020-11-30 03:10:05 (3531772) INFO : Bulk load completed, total run time : 1.50017 seconds
2020-12-10 14:46:29 (3394845) INFO : For table 522_cdr.cisco: 133 rows processed and 133 rows inserted.
2020-12-10 14:46:29 (3394845) INFO : Bulk load completed, total run time : 29.2482

The entry from 11/30 is from a cpimport on the single node; the one today is from the 3 node environment.

I would maybe expect it to be 3 times slower, but it is much worse than that. Before going multi-node, it did 350 rows in 1.5 seconds; in the 3 node environment, it took about 29 seconds to do 133 rows, roughly a third of the data it previously loaded in 1.5 seconds.
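The size of the slowdown can be quantified directly from the two log excerpts above (a standalone sanity check; the figures are taken from the quoted cpimport log lines):

```python
# Throughput from the two cpimport runs quoted in the logs.
single_node_rows, single_node_secs = 350, 1.50017   # 2020-11-30, single node
multi_node_rows, multi_node_secs = 133, 29.2482     # 2020-12-10, 3 node cluster

single_rate = single_node_rows / single_node_secs   # rows per second
multi_rate = multi_node_rows / multi_node_secs
slowdown = single_rate / multi_rate

print(f"single node: {single_rate:7.1f} rows/s")
print(f"3 node:      {multi_rate:7.1f} rows/s")
print(f"slowdown:    {slowdown:.1f}x")
```

So for these small batches the per-row throughput dropped by roughly a factor of 50, not the factor of 3 a naive "one node vs three" expectation would suggest.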

The odd thing is, I was taking a look at the logs from a bulk data load I did when I set up the cluster. I was able to import 62 million records in 457 seconds. I ran the same import today into a test database I created and it took about 1400 seconds.
Here is the output from a load right after setup:
2020-12-02 11:12:27 (16829) INFO : Reading input from STDIN to import into table 203_cdr.avaya
2020-12-02 11:12:27 (16829) INFO : Running distributed import (mode 1) on all PMs...
2020-12-02 11:20:04 (16829) INFO : For table 203_cdr.avaya: 62130952 rows processed and 62130952 rows inserted.
2020-12-02 11:20:04 (16829) INFO : Bulk load completed, total run time : 457.022 seconds
Here is the test load I did yesterday:
2020-12-10 16:27:12 (3442862) INFO : Reading input from STDIN to import into table test_load.avaya
2020-12-10 16:27:12 (3442862) INFO : Running distributed import (mode 1) on all PMs...
2020-12-10 16:50:22 (3442862) INFO : For table test_load.avaya: 62314457 rows processed and 62314457 rows inserted.
2020-12-10 16:50:22 (3442862) INFO : Bulk load completed, total run time : 1390.21 seconds
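Notably, the two big STDIN loads quoted above differ by only about 3x in throughput, close to the naive multi-node expectation, whereas the small imports are far slower per row. A quick calculation from the logged figures (standalone sketch, numbers copied from the log lines):

```python
# Throughput of the two ~62M-row bulk loads quoted in the logs.
setup_rows, setup_secs = 62_130_952, 457.022      # 2020-12-02, right after setup
retest_rows, retest_secs = 62_314_457, 1390.21    # 2020-12-10, re-test

setup_rate = setup_rows / setup_secs
retest_rate = retest_rows / retest_secs

print(f"after setup: {setup_rate:10.0f} rows/s")
print(f"re-test:     {retest_rate:10.0f} rows/s")
print(f"ratio:       {setup_rate / retest_rate:.2f}x")
```

The large loads degrading only ~3x while tiny loads degrade ~50x is consistent with a large fixed per-import cost in the cluster rather than a raw data-rate problem, though the ticket does not confirm a root cause beyond the eventual hardware finding.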

I actually did a lot of testing on the GlusterFS volumes using gluster top to check the read/write speeds and saw no performance issues. I also benchmarked the network connections between all 3 servers and confirmed we are getting between 9 and 10 Gbps. It is weird because the cpimport speeds vary widely even though tables with the same name have identical schemas:
2020-12-14 15:13:40 (82928) INFO : For table 203_cdr.avaya: 51 rows processed and 51 rows inserted
2020-12-14 15:13:40 (82928) INFO : Bulk load completed, total run time : 40.6569 seconds
2020-12-14 15:13:43 (83493) INFO : Running distributed import (mode 1) on all PMs...
2020-12-14 15:14:12 (83493) INFO : For table 203_cdr.avaya: 79 rows processed and 79 rows inserted.
2020-12-14 15:14:12 (83493) INFO : Bulk load completed, total run time : 29.4481 seconds

See where one import of 51 rows took 40 seconds, then the next did 79 rows in less time. There are also cases where it takes over 100 seconds to insert just 1 row:
2020-12-14 14:46:31 (62145) INFO : For table 644_cdr.cisco: 1 rows processed and 1 rows inserted.
2020-12-14 14:46:31 (62145) INFO : Bulk load completed, total run time : 185.843 seconds
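Tabulating the per-row cost of the three small imports quoted above makes the inconsistency obvious: run time does not track row count at all, which points to per-import overhead or contention rather than data volume (an inference from the logged numbers, not a confirmed diagnosis):

```python
# (rows, seconds) for the three small imports quoted in the ticket.
imports = [(51, 40.6569), (79, 29.4481), (1, 185.843)]

for rows, secs in imports:
    print(f"{rows:>3} rows in {secs:8.3f}s -> {secs / rows:9.3f} s/row")
```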

At first we ran only one cpimport process at a time, and that was fine because results used to be very fast, but now we have to run multiple at a time so the database can keep up with the stream.



 Comments   
Comment by David Hill (Inactive) [ 2020-12-15 ]

Update from the customer today:

We have been monitoring the servers and are finding that one of the slaves, which does not have quite as much power as the other slave and the master, seems to be doing most of the work. I thought the master would be the one doing the most work. How does that work?
So now it is probably system related. I'll keep the Jira up to date and will close it if it turns out to be their system.

Comment by David Hill (Inactive) [ 2020-12-15 ]

Resolved by the customer: hardware issue.
