[MDEV-6210] Replication slows down significantly if select query is executed on slave using XtraDB Created: 2014-05-06  Updated: 2014-07-04  Resolved: 2014-07-04

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.37
Fix Version/s: 5.5.37

Type: Bug Priority: Major
Reporter: Jan Lindström (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug Votes: 0
Labels: upstream, xtradb


 Description   

We have a problem where a long running SELECT query running on one schema affects replication that is running on a different schema.

There are three hosts:

Host A: Sysbench
Host B: MySQL 5.5 Master
Host C: MariaDB 5.5.37 Slave

We have two schemas: test (4.4GB) and test2 (88GB).

This is the test we run on schema `test`:

sysbench \
--mysql-host=192.168.0.241 \
--mysql-port=3306 \
--mysql-user=root \
--mysql-password=123123 \
--mysql-socket=/tmp/mysql.sock \
--mysql-db=test \
--mysql-table-engine=innodb \
--mysql-ignore-duplicates=on \
--test=/usr/share/doc/sysbench/tests/db/oltp.lua \
--num-threads=128 \
--oltp-table-size=100000000 \
--oltp-read-only=off \
--oltp-test-mode=complex \
--max-requests=0 \
--max-time=3000 \
run

On the slave, we monitor Handler_Commit status variable. We run a perl script that just grabs us the value and prints out per-second averages:

[root@test01 ~]# perl supporttools-master/perl/stat_monitor.pl Handler_commit 10
3559.60
3598.40
3570.60
3597.40

With 10 second polling intervals, we see an average of almost 3600 commits per second. We also see seconds_behind_master at 0, and relay logs are not building up:

-rw-rw----. 1 mariadb dba 335 May 5 19:26 mariadb-relay.000821
-rw-rw----. 1 mariadb dba 27451766 May 5 19:27 mariadb-relay.000822

On the slave, we run a long query on database `test2`.

SELECT
...
FROM
  APP
  JOIN B
    ON B.cid = APP.cid
    AND B.p...d = 'string'
  JOIN META
    ON META.cid = APP.cid
  JOIN CTG
    ON CTG.d...o = APP.c...1
  JOIN MBR
    ON MBR.m...o = APP.d...d
  JOIN ITEM
    ON ITEM.cid = APP.CID
    AND ITEM.i...e = 'string'
  JOIN COMM
    ON CONCAT(COMM.c...d, COMM.c...d) = MBR.d...d
GROUP BY ITEM.cid ;

Here is the explain

+------+-------------+-------+--------+-----------------------------------------------------+------------------------------+---------+-----------------------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+--------+-----------------------------------------------------+------------------------------+---------+-----------------------+--------+---------------------------------+
| 1 | SIMPLE | COMM | ALL | NULL | NULL | NULL | NULL | 4447 | Using temporary; Using filesort |
| 1 | SIMPLE | B | ref | fk_...phone,key_...01 | key_...01 | 62 | const | 119788 | Using index condition |
| 1 | SIMPLE | META | ref | key_cid | key_cid | 30 | B.cid | 1 | |
| 1 | SIMPLE | APP | eq_ref | PRIMARY,KEY_...5,key_...id | PRIMARY | 30 | B.cid | 1 | Using where |
| 1 | SIMPLE | MBR | eq_ref | PRIMARY | PRIMARY | 92 | APP.d...id | 1 | Using where |
| 1 | SIMPLE | CTG | eq_ref | PRIMARY | PRIMARY | 62 | APP....1 | 1 | |
| 1 | SIMPLE | ITEM | ref | fk_...item | fk_...item | 30 | cid | 9 | Using where |
+------+-------------+-------+--------+-----------------------------------------------------+------------------------------+---------+-----------------------+--------+---------------------------------+
7 rows in set (0.00 sec)

Around 20-30 seconds in, we see a drop in commits per second:

3563.40
2419.60
490.60
3218.20
446.60
425.20

We see seconds_behind_master increasing, but replication is still going. We see Exec_Master_Log_Pos continually increasing. And relay logs building up:

-rw-rw----. 1 mariadb dba 335 May 5 19:26 mariadb-relay.000821
-rw-rw----. 1 mariadb dba 314572983 May 5 19:42 mariadb-relay.000822
-rw-rw----. 1 mariadb dba 988 May 5 19:42 mariadb-relay.000823
-rw-rw----. 1 mariadb dba 335 May 5 19:42 mariadb-relay.000824
-rw-rw----. 1 mariadb dba 35844766 May 5 19:44 mariadb-relay.000825

Killing the query, we see a jump in commits per second:

112.60
111.60
114.00
<killed query>
7439.40
28726.00
29618.00
...
26761.40
26279.00
18664.60
3567.00
3577.00
3599.00
3582.00

Then goes back to our normal replication speed. Relay logs are all cleaned up. We switch to the InnoDB plugin with this configuration setting:

ignore_builtin_innodb
plugin_load=innodb=ha_innodb.so

And run the same tests and monitors. No replication lag builds up. The queries cannot really be optimized, as they are dynamically generated. This issue is also present on two different data sets, so we believe the root cause of the issue is within XtraDB.

https://bugs.launchpad.net/percona-server/+bug/1316400


Generated at Thu Feb 08 07:10:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.