[MDEV-8955] Problem with using parallel replication settings on master under heavy load Created: 2015-10-16 Updated: 2023-01-22 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.0.21 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Alex | Assignee: | Angelique Sklavounos (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | parallelslave | ||
| Environment: |
CentOS 6.7 |
||
| Description |
|
Hello. On the master I set binlog_commit_wait_count = 10 (also tested 20, 30, 100 and 1000) and binlog_commit_wait_usec = 10000 (also 20000 and the default of 1 sec).
MariaDB started responding slowly. Even the 'mysql' shell command took time to execute, and the PHP backends went down because of the slow performance. The load average stayed low (<1), however.
The test environment: MyISAM tables without indexes, delayed inserts, and ROW replication with the logs written to a different partition (no IO problems).
When I apply the binlog_commit_wait_count setting on the master, it sometimes takes a few seconds to take effect and sometimes applies immediately. I guess this is caused by heavy internal calculations over the small number of queries (10/20/30) that need to be group-committed. If the server performs at 10k qps, each query takes about 100 microseconds, so a binlog_commit_wait_count of 10/20/30 could be a very small number that keeps MariaDB busy calculating, which in turn causes problems with connections, threads and so on. Setting wait_count higher only "helps" MariaDB get stuck faster, though it is a matter of a few seconds in either case. Under such load, any binlog_commit_wait_usec setting seems to be a "just-in-case" limit that is quite irrelevant.
So, all in all, I ended up with a standard master and a slave that couldn't catch up, because its parallel threads had nothing to do since nothing else could be activated on the master. It seems that standard group commit isn't helping much in such a case.
Please advise. Thanks! |
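For reference, the settings described above can be applied at runtime roughly as follows. This is a minimal sketch using the values quoted in the report. The status counters at the end are an assumption about what is useful to watch, not something the reporter ran.

```sql
-- Ask the master to wait until 10 transactions are queued for group commit...
SET GLOBAL binlog_commit_wait_count = 10;
-- ...but for no longer than 10000 microseconds (10 ms).
SET GLOBAL binlog_commit_wait_usec = 10000;

-- Confirm the values currently in effect.
SHOW GLOBAL VARIABLES LIKE 'binlog_commit_wait%';

-- Compare total binlog commits with group commits. If the two numbers
-- are nearly equal, transactions are rarely being grouped together.
SHOW GLOBAL STATUS LIKE 'Binlog_commits';
SHOW GLOBAL STATUS LIKE 'Binlog_group_commits';
```

Sampling the two status counters before and after a change gives a rough measure of how many transactions end up in each group commit.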
| Comments |
| Comment by Alex [ 2015-10-17 ] | ||||
|
For now I've managed to stabilize things, more or less, by setting binlog_commit_wait_count=10 under low load and keeping usec = 1 sec (the default). It seems to work under heavy load. But if MariaDB is restarted or any of the settings are applied, the problems begin... | ||||
| Comment by Elena Stepanova [ 2015-10-18 ] | ||||
|
ShivaS, | ||||
| Comment by Alex [ 2015-10-18 ] | ||||
|
Hi Elena, | ||||
| Comment by Alex [ 2015-10-19 ] | ||||
|
Unfortunately, the problem is back at 15-20k qps ;( | ||||
| Comment by Alex [ 2015-10-19 ] | ||||
|
Elena, 1. It's easier to apply and work with the binlog settings on the master when it is using the old-style thread-per-connection model rather than the thread pool. So I assume it's something about internal counters (or whatever) that are applied to every arriving connection/thread and work much harder with the thread pool (after all, once the problem happens, the whole processlist is full of unauthenticated users). | ||||
| Comment by Alex [ 2015-10-20 ] | ||||
|
Another thing I've noticed: on the master I have the slow log threshold set to 100 msec, all tables are BLACKHOLE, and all incoming queries are delayed inserts. Slow queries still show up, which means they really do hit the usec timeout, which is set to 1 second. I understand that delayed insert is something that kind of behaves in its own unique way, but it is still strange to see slow queries hitting the binlog_commit_wait_usec default timeout. | ||||
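One way to check whether commits are really being released by the timeout rather than by the count is sketched below. It assumes the server version exposes the Binlog_group_commit_trigger_* status counters, which may not be present in every release. It was not part of the original report.

```sql
-- Slow log threshold in seconds (100 msec would show as 0.1).
SHOW GLOBAL VARIABLES LIKE 'long_query_time';

-- Current group commit wait settings.
SHOW GLOBAL VARIABLES LIKE 'binlog_commit_wait%';

-- Assumed counters: how many group commits were released because
-- binlog_commit_wait_count was reached, because binlog_commit_wait_usec
-- expired, or because a waiting transaction had to be let through.
SHOW GLOBAL STATUS LIKE 'Binlog_group_commit_trigger%';
```

If the timeout counter grows much faster than the count counter under load, the wait is usually ending on binlog_commit_wait_usec, which would be consistent with the slow log entries described above.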
| Comment by Alex [ 2015-10-26 ] | ||||
|
Elena, one more thing I wanted to draw your attention to, quoting from one of my previous comments, since I've spammed this thread a bit and it could have been missed:
Maybe a small fix could be introduced to make alter-to-blackhole happen faster regardless of the original table size? | ||||
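Until such a fix exists, a possible workaround is to avoid altering the large table in place and swap in an empty BLACKHOLE copy instead. This is only a sketch: the table names `events`, `events_bh` and `events_old` are hypothetical, and whether a swap fits this setup (delayed inserts, replication) is an assumption.

```sql
-- Create an empty table with the same definition as the original
-- (`events` is a hypothetical name).
CREATE TABLE events_bh LIKE events;

-- Converting the empty copy does not depend on the size of the original.
ALTER TABLE events_bh ENGINE = BLACKHOLE;

-- Swap the two tables in a single RENAME statement.
RENAME TABLE events TO events_old, events_bh TO events;

-- The original data stays in events_old until it is dropped explicitly.
```

The existing rows are not carried over, which matches what a BLACKHOLE table would store anyway.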
| Comment by Elena Stepanova [ 2015-10-26 ] | ||||
|
ShivaS, I cannot say right away whether it's possible to make it happen faster. Since it apparently has nothing to do with parallel replication, I suggest filing a separate JIRA issue about it. |