[MDEV-16333] Slow replication after upgrade from 10.1.33 to 10.3.7 - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 10.3.7
Fix Version/s: 10.3.7
Component/s: Replication
Labels:
- replication
Environment:
centos7 64bit

Description

We have 2 identical slaves running, same hardware and same software.
After upgrading one of the slaves from 10.1.33 to 10.3.7 (now 48hours ago) the upgraded slave (which is set to maintenance mode, so is not receiving extra work on top of the replication) can not follow the master server. its behind and the backlog is constantly increasing.

Both servers have the exact same hardware and work fine in 10.1.33.
They are not cpu and not io bound.
Those servers never ran 10.2
99% of the tables are innodb

attached is a config diff for a show variables.
If anything else is needed, let me know

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

diff.txt
80 kB
2018-05-30 10:52
slave-slow-while-gtid_slave_pos-uses-innodb.png
166 kB
2019-09-18 12:41

Activity

Ascending order - Click to sort in descending order

View 25 older comments

Maikel Punie added a comment - 2019-07-17 06:55

this was not reproduceable after a few weeks, so not sure what happend, this report can be closed

Maikel Punie added a comment - 2019-07-17 06:55 this was not reproduceable after a few weeks, so not sure what happend, this report can be closed

Sergei Golubchik added a comment - 2019-08-19 11:59

let's give it one more month, just to be sure.
If we won't hear from you during that time, we'll close it.

Sergei Golubchik added a comment - 2019-08-19 11:59 let's give it one more month, just to be sure. If we won't hear from you during that time, we'll close it.

david schumacher added a comment - 2019-09-18 12:55 - edited

hi all and @Maikel Punie ,

i experienced a comparable behaviour when switching from MySQL 5.1 to MariaDB 10.3.17-0+deb10u1 (~~Debian 10~~ Debian 9), kernel 4.9.110-3+deb9u6.
replication with MariaDB 10.3.17 or 10.1.37-MariaDB-0+deb9u1 was multiple times slower on SSDs than before using MySQL 5.1 on spinning discs.

as a workaround i changed table mysql.gtid_slave_pos from engine=innodb to engine=myisam on the slave (vs-db-misc-multidb-02) and replication was fast as normal immediately, see graph of the behaviour while a hourly cronjob inserts data:

just guessing at the moment: there is something slowing down innodb-transactions (in general or only for gtid_slave_pos?) dramatically.
have not done more research on this yet.

environment:
master: vs-db-misc-multidb-01
slave: vs-db-misc-multidb-02
100% MyISAM (exception before the change: mysql.gtid_slave_pos)
MariaDB 10.3.17-0+deb10u1 inside LXC (host and guest: Debian 10)
innodb-config: debian defaults
myisam-config: key_buffer_size = 4Gb

parallel-replication specific settings (experiemented with different settings, changed more or less nothing regarding replication speed while gtid_slave_pos was innodb):
{{MariaDB [(none)]> show variables like '%parallel%';
-----------------------------------------+

Variable_name

Value

-----------------------------------------+

skip_parallel_replication	OFF
slave_domain_parallel_threads	20
slave_parallel_max_queued	32001072	(increased until "Waiting for room in worker thread event queue" vanished from "Slave_SQL_Running_State")
slave_parallel_mode	aggressive
slave_parallel_threads	20
slave_parallel_workers	20

-----------------------------------------+}}

any idea which information can be helpful or what to try to get helpful insights ?

thanks & regards,
david

david schumacher added a comment - 2019-09-18 12:55 - edited hi all and @Maikel Punie , i experienced a comparable behaviour when switching from MySQL 5.1 to MariaDB 10.3.17-0+deb10u1 ( Debian 10 Debian 9), kernel 4.9.110-3+deb9u6. replication with MariaDB 10.3.17 or 10.1.37-MariaDB-0+deb9u1 was multiple times slower on SSDs than before using MySQL 5.1 on spinning discs. as a workaround i changed table mysql.gtid_slave_pos from engine=innodb to engine=myisam on the slave (vs-db-misc-multidb-02) and replication was fast as normal immediately, see graph of the behaviour while a hourly cronjob inserts data: just guessing at the moment: there is something slowing down innodb-transactions (in general or only for gtid_slave_pos?) dramatically. have not done more research on this yet. environment: master: vs-db-misc-multidb-01 slave: vs-db-misc-multidb-02 100% MyISAM (exception before the change: mysql.gtid_slave_pos) MariaDB 10.3.17-0+deb10u1 inside LXC (host and guest: Debian 10) innodb-config: debian defaults myisam-config: key_buffer_size = 4Gb parallel-replication specific settings (experiemented with different settings, changed more or less nothing regarding replication speed while gtid_slave_pos was innodb): {{MariaDB [(none)] > show variables like '%parallel%'; ------------------------------ -----------+ Variable_name Value ------------------------------ -----------+ skip_parallel_replication OFF slave_domain_parallel_threads 20 slave_parallel_max_queued 32001072 (increased until "Waiting for room in worker thread event queue" vanished from "Slave_SQL_Running_State") slave_parallel_mode aggressive slave_parallel_threads 20 slave_parallel_workers 20 ------------------------------ -----------+}} any idea which information can be helpful or what to try to get helpful insights ? thanks & regards, david

Andrei Elkin added a comment - 2019-09-18 15:59

david_schumacher: what table type the replicated inserts go into?

Notice that aggressive may involve "internal" rollback and then the MyISAM type of gtid_slave_pos would fire back with a duplicate key error at replaying.
If it's pure insert load no rollback can occur though.

Andrei Elkin added a comment - 2019-09-18 15:59 david_schumacher : what table type the replicated inserts go into? Notice that aggressive may involve "internal" rollback and then the MyISAM type of gtid_slave_pos would fire back with a duplicate key error at replaying. If it's pure insert load no rollback can occur though.

david schumacher added a comment - 2019-09-20 13:04 - edited

hi,

i tested using blackhole, myisam and innodb for the insert-table, and even with blackhole, replication is extremely delayed as long gtid_slave_pos uses innodb.

important correction, sorry for that ... maybe this behaviour is completely unrelated to maikels findings .. or maikel upgraded his kernel in the meantime:
i just noticed the host-kernels are different (now corrected in my first post): both slaves on different physical servers are using 4.9.110-3+deb9u6 (Debian 9), while master uses 4.19.67-2 (Debian 10). hardware of all three servers is the same (4 sata-SSDs in md-raid10, xfs with discard mount option). after moving one slave to the master-host, innodb-performance was back to normal. moving back to older kernel, performance dropped again.

side-test: 10.1.37-MariaDB-0+deb9u1 on the same hardware using the older kernel suffers the same low innodb-performance like 10.3.17-0+deb10u1 - as long global innodb_flush_log_at_trx_commit is set to 1 (see below).

if gtid_slave_pos is myisam and only insert-table is innodb, all inserts into this insert-table are extremely slow too (25 minutes for 140k inserts, while myisam takes 18 seconds for the same inserts). slave_parallel_mode was conservative and slave_parallel_threads/workers = 2 for other tests, changed nearly nothing.

i tested direct inserts into the slave without using replication, the results are comparable... so i think its a pure innodb-problem and not replication-specific, and only happend on kernel 4.9.110-3+deb9u6, but with older mariadb too. myisam-performance and IO-benchmarks are showing identical results on both kernels.

after i set innodb_flush_log_at_trx_commit to 2 or 0, inserts into innodb-table where fast as usual (for both gtid_slave_pos and the normal insert-table .. test-inserts took 41 seconds which matches what i saw in other cases comparing myisam vs. innodb peformance ).

different values for innodb_use_native_aio and innodb_flush_method did not change much.

for me this is solved too, but as long i run this older kernels, i can do tests if needed.

cheers,
david

david schumacher added a comment - 2019-09-20 13:04 - edited hi, i tested using blackhole, myisam and innodb for the insert-table, and even with blackhole, replication is extremely delayed as long gtid_slave_pos uses innodb. important correction, sorry for that ... maybe this behaviour is completely unrelated to maikels findings .. or maikel upgraded his kernel in the meantime: i just noticed the host-kernels are different (now corrected in my first post): both slaves on different physical servers are using 4.9.110-3+deb9u6 (Debian 9), while master uses 4.19.67-2 (Debian 10). hardware of all three servers is the same (4 sata-SSDs in md-raid10, xfs with discard mount option). after moving one slave to the master-host, innodb-performance was back to normal. moving back to older kernel, performance dropped again. side-test: 10.1.37-MariaDB-0+deb9u1 on the same hardware using the older kernel suffers the same low innodb-performance like 10.3.17-0+deb10u1 - as long global innodb_flush_log_at_trx_commit is set to 1 (see below). if gtid_slave_pos is myisam and only insert-table is innodb, all inserts into this insert-table are extremely slow too (25 minutes for 140k inserts, while myisam takes 18 seconds for the same inserts). slave_parallel_mode was conservative and slave_parallel_threads/workers = 2 for other tests, changed nearly nothing. i tested direct inserts into the slave without using replication, the results are comparable... so i think its a pure innodb-problem and not replication-specific, and only happend on kernel 4.9.110-3+deb9u6, but with older mariadb too. myisam-performance and IO-benchmarks are showing identical results on both kernels. after i set innodb_flush_log_at_trx_commit to 2 or 0, inserts into innodb-table where fast as usual (for both gtid_slave_pos and the normal insert-table .. test-inserts took 41 seconds which matches what i saw in other cases comparing myisam vs. innodb peformance ). different values for innodb_use_native_aio and innodb_flush_method did not change much. for me this is solved too, but as long i run this older kernels, i can do tests if needed. cheers, david

MariaDB Server

Slow replication after upgrade from 10.1.33 to 10.3.7

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration