Following my recent thread about a problem with chained replication (MDEV-8929), a different problem has appeared.
On the master I set binlog_commit_wait_count = 10 (also tested 20, 30, 100 and 1000) and binlog_commit_wait_usec = 10000 (also 20000 and the default of 1 sec).
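For anyone trying to reproduce this, a minimal sketch of applying these values at runtime. It assumes the variables are changed with SET GLOBAL on a running master rather than in the config file; both are dynamic, global variables.

```sql
-- Assumption: run on the master by a user with the SUPER privilege.
-- Wait for up to 10 commits to queue before writing a group to the binlog...
SET GLOBAL binlog_commit_wait_count = 10;
-- ...but wait no longer than 10000 microseconds (10 ms) for them to arrive.
SET GLOBAL binlog_commit_wait_usec = 10000;

-- Confirm what is currently in effect.
SHOW GLOBAL VARIABLES LIKE 'binlog_commit_wait%';
```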
Everything worked fine until the load was increased. At that point something happened to MariaDB: the processlist started filling up with threads waiting to log in.
MariaDB started responding slowly. Even the 'mysql' shell command took time to execute. PHP backends went down because of the slow performance. The load average stayed low (<1), however.
I tried both pool-of-threads and one-thread-per-connection. It works fine until I reach something like 10k qps.
The test environment: MyISAM tables without indexes, delayed inserts and ROW replication, with logs written to a different partition (no IO problems).
When I apply the binlog_commit_wait_count settings on the master, they sometimes take a few seconds to have an effect and sometimes take effect immediately.
But the result is always the same: instead of 10k qps (or more) and around 1k threads connected from the test servers, I end up with 1-2 qps and >10k connected threads, all of which are those unauthenticated users. In the end everything stops working. All I see in the DB is a few sleeping processes and mostly unauthenticated ones.
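One way to put numbers on the pile-up is sketched below. It assumes the stuck connections show up in the processlist under the user name 'unauthenticated user', which is how the server labels connections that have not finished the login handshake.

```sql
-- Count connections that have not completed authentication yet.
SELECT COUNT(*) AS unauthenticated_threads
FROM information_schema.PROCESSLIST
WHERE USER = 'unauthenticated user';

-- Break the remaining connections down by command (Sleep, Query, ...).
SELECT COMMAND, COUNT(*) AS threads
FROM information_schema.PROCESSLIST
WHERE USER <> 'unauthenticated user'
GROUP BY COMMAND
ORDER BY threads DESC;
```

Sampling these every second or so while changing the settings would show how quickly the unauthenticated count grows.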
I guess this is caused by heavy internal bookkeeping for the small number of queries (10/20/30) that need to be group-committed. If the server handles 10k qps, each query takes about 100 microseconds, so a binlog_commit_wait_count of 10/20/30 may be too small a number and keep MariaDB busy calculating, which in turn causes problems with connections, threads and so on. Setting binlog_commit_wait_count higher only "helps" MariaDB get stuck faster, though it is a matter of a few seconds in either case. Under such load, any binlog_commit_wait_usec setting seems to be a "just-in-case" limit that is quite irrelevant.
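The arithmetic behind that estimate can be written out as follows. This is only a back-of-the-envelope sketch: it assumes the 10k qps figure mentioned earlier, evenly spaced arrivals, and that every query results in one binlog commit.

```sql
-- Assumptions: 10000 queries per second, evenly spaced, one commit each.
SELECT
  1000000 / 10000          AS usec_per_query,       -- 100 microseconds
  10   * (1000000 / 10000) AS usec_to_collect_10,   -- 1000 microseconds (1 ms)
  30   * (1000000 / 10000) AS usec_to_collect_30,   -- 3000 microseconds (3 ms)
  1000 * (1000000 / 10000) AS usec_to_collect_1000; -- 100000 microseconds (100 ms)
```

Under these assumptions a group of 10 to 30 commits would fill in 1 to 3 ms, well inside a 10000 usec wait, while a group of 1000 would not.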
So all in all I ended up with a standard master and a slave that couldn't catch up, because its parallel threads had nothing to do: nothing else could be activated on the master. Standard group commit doesn't seem to help much in this case.
Whatever settings I apply on the master, only the binlog_group_commit_trigger_timeout counter is ever updated (which in my opinion also indicates some problem under heavy load). This holds even when setting the count to 50 queries and keeping binlog_commit_wait_usec at its default of 1 sec.
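The counters in question can be read as sketched below, assuming a server version that exposes the Binlog_group_commit_trigger_* status variables (they were added in MariaDB 10.1.5).

```sql
-- Total number of commits written to the binary log.
SHOW GLOBAL STATUS LIKE 'Binlog_commits';

-- Number of group commits, plus what triggered them:
--   Binlog_group_commit_trigger_count     (binlog_commit_wait_count reached)
--   Binlog_group_commit_trigger_timeout   (binlog_commit_wait_usec expired)
--   Binlog_group_commit_trigger_lock_wait (a waiting commit was blocking another)
SHOW GLOBAL STATUS LIKE 'Binlog_group_commit%';
```

Comparing Binlog_commits with Binlog_group_commits before and after a test run gives the average group size, which shows whether any grouping is happening at all.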