[MDEV-16333] Slow replication after upgrade from 10.1.33 to 10.3.7 Created: 2018-05-30 Updated: 2023-04-27
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.3.7 |
| Fix Version/s: | 10.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maikel Punie | Assignee: | Andrei Elkin |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | replication |
| Environment: | centos7 64bit |
| Attachments: | |
| Description |
We have 2 identical slaves running, same hardware and same software. Both servers worked fine on 10.1.33. Attached is a config diff of the SHOW VARIABLES output.
| Comments |
| Comment by Andrei Elkin [ 2018-05-30 ] |
There is a chance for
| Comment by Maikel Punie [ 2018-05-30 ] |
| Comment by Andrei Elkin [ 2018-05-30 ] |
Thanks for the partial confirmation. So we have the partitioned tables. As to the actual binlog format, we can't take for granted that the global binlog_format is the same as the session one for your master clients. Could you please double-check in what format exactly these INSERTs are binlogged on the master?
| Comment by Maikel Punie [ 2018-05-30 ] |
How exactly can we check this? If I just check the binlog on the master, I see the plain SQL command in there.
| Comment by Andrei Elkin [ 2018-05-30 ] |
Could you try `mysqlbinlog a-binlog-file | grep Write_rows`? Also grep for Update_rows and Delete_rows. In a binlog with row-based (RBR) events, a typical snippet contains those event names.
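As a complement to grepping the binlog file, the format can also be cross-checked from a client session; a sketch (the binlog file name below is illustrative, standard MariaDB statements):

```sql
-- On the master: the global format, and what the current session actually uses
-- (a client can override binlog_format per session).
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
SHOW SESSION VARIABLES LIKE 'binlog_format';

-- List event types in a binlog without leaving the server; row-based
-- changes appear as Write_rows / Update_rows / Delete_rows events.
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 100;
```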
| Comment by Maikel Punie [ 2018-05-31 ] |
There is nothing like that in the binlog ...
| Comment by Andrei Elkin [ 2018-05-31 ] |
Could you run SHOW PROCESSLIST on the slave to catch slow events or execution states? It is also possible to get such info from performance_schema, if it is configured.
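If performance_schema is enabled, statement digests give per-query timing without repeatedly sampling the processlist; a sketch using the standard performance_schema digest table (the LIMIT is illustrative):

```sql
-- Slowest statement shapes seen by the server, averaged per digest.
-- Timer columns are in picoseconds, hence the division to get seconds.
SELECT DIGEST_TEXT,
       COUNT_STAR              AS executions,
       AVG_TIMER_WAIT / 1e12   AS avg_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY AVG_TIMER_WAIT DESC
LIMIT 10;
```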
| Comment by Maikel Punie [ 2018-05-31 ] |
What exactly do you need? I'll enable the performance schema to get the info.
| Comment by Andrei Elkin [ 2018-05-31 ] |
Perhaps we have got what we need. Could you proceed from here and compare how long it actually takes to execute such an UPDATE on the master and on the upgraded slave?
| Comment by Maikel Punie [ 2018-05-31 ] |
That is not a partitioned table, but the table is a single file of almost 300 GB on disk.
| Comment by Andrei Elkin [ 2018-05-31 ] |
Thanks for clarifying that it's not partitioned. The UPDATE is nevertheless slow. We can see it running alone while the other workers wait for it to finish. If the pasted SHOW PROCESSLIST snapshot is typical, that is, UPDATEs on this big table are slow, it would make sense to compare UPDATE execution times between the slaves.
| Comment by Maikel Punie [ 2018-05-31 ] |
If I run these manually, the speed is about the same for the 2 servers ...
| Comment by Andrei Elkin [ 2018-05-31 ] |
0.2 seconds on the 2 servers: which ones, the two slaves? Have you also benchmarked the UPDATE by feeding it manually on the slow upgraded slave?
| Comment by Andrei Elkin [ 2018-05-31 ] |
Generally I could suspect rollbacks taking place on the slave because of the OPTIMISTIC parallel mode. You could switch the mode to estimate the rollback contribution to the slowness.
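A way to gauge whether optimistic scheduling is causing retries, and to switch modes for comparison; a sketch (variable and status names per MariaDB's parallel replication documentation; the slave must be stopped to change the mode):

```sql
-- Retry activity caused by conflicts under optimistic parallel apply.
SHOW GLOBAL STATUS LIKE 'Slave_retried_transactions';

-- Switch the upgraded slave to conservative scheduling for comparison.
STOP SLAVE;
SET GLOBAL slave_parallel_mode = 'conservative';
START SLAVE;
```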
| Comment by Maikel Punie [ 2018-06-01 ] |
Indeed, there are a lot of rollbacks happening. I switched to CONSERVATIVE on the 10.3 slave. On the 10.1 slave those rollbacks are also happening ...
| Comment by Andrei Elkin [ 2018-06-01 ] |
Let's see how CONSERVATIVE does. Since, as you say, the 10.1 slave can keep up, we may suspect the cost of rollback has grown in 10.3. I doubt that, though; better to consult with marko. Secondly, sorry to repeat the question: could you please confirm/clarify the 0.2 sec execution time of a manually fed UPDATE like the one in the processlist, on the slaves?
| Comment by Maikel Punie [ 2018-06-01 ] |
The 0.2 seconds is for a query similar to the one in the processlist. CONSERVATIVE is not helping.
| Comment by Andrei Elkin [ 2018-06-01 ] |
So we have to dig in and explain 0.2 vs 58 secs in your processlist.
| Comment by Maikel Punie [ 2018-06-01 ] |
I have a file like that, but it's too big to attach to this ticket.
| Comment by Andrei Elkin [ 2018-06-01 ] |
Thank you! We will need the table definitions as well. I would appreciate it if, after testing the file, you could ideally arrange it as a relay log (I sense we don't need the master and the slave IO thread), or use a "dumb" master server without the actual data, which would be close to my environment.
| Comment by Marko Mäkelä [ 2018-06-04 ] |
The cost of rolling back an update or delete may have slightly increased in MariaDB Server 10.3 due to
| Comment by Maikel Punie [ 2018-08-29 ] |
OK, digging this item back up: I have 2 slaves that have the exact same data; one is running 10.1.35 and the other is running 10.3.9. Both systems are idle, they only have replication running, and the data is coming from the same master server. They were both (when the test started) at about the same GTID.
| Comment by Andrei Elkin [ 2018-12-20 ] |
I think this case should be closed on the basis of can't-reproduce. It was suggested to the reporter to
| Comment by Andrei Elkin [ 2018-12-20 ] |
Reopening after having found the latest comments.
| Comment by Andrei Elkin [ 2018-12-20 ] |
Maikel Punie, if I got it right, the slowness is on the 10.3.9 server and it can be caused by a running query. Could you provide details of the query that hangs that way?
| Comment by Maikel Punie [ 2019-07-17 ] |
This was not reproducible after a few weeks, so I am not sure what happened; this report can be closed.
| Comment by Sergei Golubchik [ 2019-08-19 ] |
Let's give it one more month, just to be sure.
| Comment by david schumacher [ 2019-09-18 ] |
Hi all and @Maikel Punie, I experienced comparable behaviour when switching from MySQL 5.1 to MariaDB 10.3.17-0+deb10u1. As a workaround I changed the table mysql.gtid_slave_pos from engine=InnoDB to engine=MyISAM on the slave (vs-db-misc-multidb-02), and replication was immediately as fast as normal; see the graph of the behaviour while an hourly cronjob inserts data. Just guessing at the moment: there is something slowing down InnoDB transactions dramatically (in general, or only for gtid_slave_pos?). I experimented with different parallel-replication-specific settings; they changed more or less nothing regarding replication speed while gtid_slave_pos was InnoDB. Any idea which information could be helpful, or what to try to get helpful insights? Thanks & regards,
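The workaround described in the comment above can be expressed as a single statement; since MariaDB 10.3 there is also gtid_pos_auto_engines, which keeps per-engine position tables instead. A sketch of both, not a recommendation, since a MyISAM position table makes the position update non-transactional:

```sql
-- Workaround from the comment: move the replication position table off InnoDB.
ALTER TABLE mysql.gtid_slave_pos ENGINE=MyISAM;

-- Alternative available since 10.3: let the server maintain engine-specific
-- gtid_slave_pos tables so the position update stays in the same transaction
-- as the replicated data change.
SET GLOBAL gtid_pos_auto_engines = 'InnoDB';
```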
| Comment by Andrei Elkin [ 2019-09-18 ] |
david_schumacher: what table type do the replicated inserts go into? Notice that aggressive (optimistic) scheduling may involve an "internal" rollback, and then the MyISAM type of gtid_slave_pos would fire back with a duplicate key error at replay.
| Comment by david schumacher [ 2019-09-20 ] |
Hi, I tested using Blackhole, MyISAM and InnoDB for the insert table, and even with Blackhole, replication is extremely delayed as long as gtid_slave_pos uses InnoDB.

Important correction, sorry for that ... maybe this behaviour is completely unrelated to Maikel's findings, or Maikel upgraded his kernel in the meantime. Side test: 10.1.37-MariaDB-0+deb9u1 on the same hardware using the older kernel suffers the same low InnoDB performance as 10.3.17-0+deb10u1, as long as the global innodb_flush_log_at_trx_commit is set to 1 (see below).

If gtid_slave_pos is MyISAM and only the insert table is InnoDB, all inserts into that table are extremely slow too (25 minutes for 140k inserts, while MyISAM takes 18 seconds for the same inserts). slave_parallel_mode was conservative and slave_parallel_threads/workers = 2; other settings changed nearly nothing. I tested direct inserts into the slave without using replication, and the results are comparable. So I think it is a pure InnoDB problem, not replication-specific, and it only happened on kernel 4.9.110-3+deb9u6, but with older MariaDB too. MyISAM performance and IO benchmarks show identical results on both kernels.

After I set innodb_flush_log_at_trx_commit to 2 or 0, inserts into the InnoDB table were fast as usual, for both gtid_slave_pos and the normal insert table (the test inserts took 41 seconds, which matches what I saw in other cases comparing MyISAM vs. InnoDB performance). Different values for innodb_use_native_aio and innodb_flush_method did not change much. For me this is solved too, but as long as I run these older kernels, I can do tests if needed. Cheers,
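For reference, the setting change described above as a sketch. The variable is dynamic; values 0 and 2 flush/sync the redo log roughly once per second instead of at every commit, trading up to a second of committed transactions on a crash for throughput:

```sql
-- 1 = write and sync the redo log at every commit (fully durable, slow here)
-- 2 = write at commit, sync roughly once per second
-- 0 = write and sync roughly once per second
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```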