Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6210

Replication slows down significantly if select query is executed on slave using XtraDB

    XMLWordPrintable

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Not a Bug
    • 5.5.37
    • 5.5.37
    • None

    Description

      We have a problem where a long running SELECT query running on one schema affects replication that is running on a different schema.

      There are three hosts:

      Host A: Sysbench
      Host B: MySQL 5.5 Master
      Host C: MariaDB 5.5.37 Slave

      We have two schemas: test (4.4GB) and test2 (88GB).

      This is the test we run on schema `test`:

      sysbench \
      --mysql-host=192.168.0.241 \
      --mysql-port=3306 \
      --mysql-user=root \
      --mysql-password=123123 \
      --mysql-socket=/tmp/mysql.sock \
      --mysql-db=test \
      --mysql-table-engine=innodb \
      --mysql-ignore-duplicates=on \
      --test=/usr/share/doc/sysbench/tests/db/oltp.lua \
      --num-threads=128 \
      --oltp-table-size=100000000 \
      --oltp-read-only=off \
      --oltp-test-mode=complex \
      --max-requests=0 \
      --max-time=3000 \
      run

      On the slave, we monitor Handler_Commit status variable. We run a perl script that just grabs us the value and prints out per-second averages:

      [root@test01 ~]# perl supporttools-master/perl/stat_monitor.pl Handler_commit 10
      3559.60
      3598.40
      3570.60
      3597.40

      With 10 second polling intervals, we see an average of almost 3600 commits per second. We also see seconds_behind_master at 0, and relay logs are not building up:

      -rw-rw----. 1 mariadb dba 335 May 5 19:26 mariadb-relay.000821
      -rw-rw----. 1 mariadb dba 27451766 May 5 19:27 mariadb-relay.000822

      On the slave, we run a long query on database `test2`.

      SELECT
      ...
      FROM
        APP
        JOIN B
          ON B.cid = APP.cid
          AND B.p...d = 'string'
        JOIN META
          ON META.cid = APP.cid
        JOIN CTG
          ON CTG.d...o = APP.c...1
        JOIN MBR
          ON MBR.m...o = APP.d...d
        JOIN ITEM
          ON ITEM.cid = APP.CID
          AND ITEM.i...e = 'string'
        JOIN COMM
          ON CONCAT(COMM.c...d, COMM.c...d) = MBR.d...d
      GROUP BY ITEM.cid ;

      Here is the explain

      +------+-------------+-------+--------+-----------------------------------------------------+------------------------------+---------+-----------------------+--------+---------------------------------+
      | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
      +------+-------------+-------+--------+-----------------------------------------------------+------------------------------+---------+-----------------------+--------+---------------------------------+
      | 1 | SIMPLE | COMM | ALL | NULL | NULL | NULL | NULL | 4447 | Using temporary; Using filesort |
      | 1 | SIMPLE | B | ref | fk_...phone,key_...01 | key_...01 | 62 | const | 119788 | Using index condition |
      | 1 | SIMPLE | META | ref | key_cid | key_cid | 30 | B.cid | 1 | |
      | 1 | SIMPLE | APP | eq_ref | PRIMARY,KEY_...5,key_...id | PRIMARY | 30 | B.cid | 1 | Using where |
      | 1 | SIMPLE | MBR | eq_ref | PRIMARY | PRIMARY | 92 | APP.d...id | 1 | Using where |
      | 1 | SIMPLE | CTG | eq_ref | PRIMARY | PRIMARY | 62 | APP....1 | 1 | |
      | 1 | SIMPLE | ITEM | ref | fk_...item | fk_...item | 30 | cid | 9 | Using where |
      +------+-------------+-------+--------+-----------------------------------------------------+------------------------------+---------+-----------------------+--------+---------------------------------+
      7 rows in set (0.00 sec)

      Around 20-30 seconds in, we see a drop in commits per second:

      3563.40
      2419.60
      490.60
      3218.20
      446.60
      425.20

      We see seconds_behind_master increasing, but replication is still going. We see Exec_Master_Log_Pos continually increasing. And relay logs building up:

      -rw-rw----. 1 mariadb dba 335 May 5 19:26 mariadb-relay.000821
      -rw-rw----. 1 mariadb dba 314572983 May 5 19:42 mariadb-relay.000822
      -rw-rw----. 1 mariadb dba 988 May 5 19:42 mariadb-relay.000823
      -rw-rw----. 1 mariadb dba 335 May 5 19:42 mariadb-relay.000824
      -rw-rw----. 1 mariadb dba 35844766 May 5 19:44 mariadb-relay.000825

      Killing the query, we see a jump in commits per second:

      112.60
      111.60
      114.00
      <killed query>
      7439.40
      28726.00
      29618.00
      ...
      26761.40
      26279.00
      18664.60
      3567.00
      3577.00
      3599.00
      3582.00

      Then goes back to our normal replication speed. Relay logs are all cleaned up. We switch to the InnoDB plugin with this configuration setting:

      ignore_builtin_innodb
      plugin_load=innodb=ha_innodb.so

      And run the same tests and monitors. No replication lag builds up. The queries cannot really be optimized, as they are dynamically generated. This issue is also present on two different data sets, so we believe the root cause of the issue is within XtraDB.

      https://bugs.launchpad.net/percona-server/+bug/1316400

      Attachments

        Activity

          People

            jplindst Jan Lindström (Inactive)
            jplindst Jan Lindström (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.