[MDEV-5448] Performance regression between 10.0.4 and 10.0.5 (~8%) Created: 2013-12-14  Updated: 2013-12-16  Resolved: 2013-12-16

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.6
Fix Version/s: 10.0.8

Type: Bug Priority: Major
Reporter: Sergey Vojtovich Assignee: Sergey Vojtovich
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by MDEV-5388 Reduce usage of LOCK_open: unused_tables Closed

 Description   

As Axel mentioned in his e-mail, there is a performance regression between 10.0.4 and 10.0.5:

Date: Thu, 21 Nov 2013 18:32:45 +0100
From: Axel Schwenke <axel@askmonty.org>
To: "maria-developers@lists.launchpad.net" <maria-developers@lists.launchpad.net>
Subject: [Maria-developers] MariaDB-10.0-beta sysbench results

While looking for this regression, I can see a clear performance drop with the following revision:

revno: 3427.1.258
revision-id: knielsen@knielsen-hq.org-20130823120213-pbhsq4zc1h3jwa0i
parent: knielsen@knielsen-hq.org-20130823081643-f3yhupp15yw9cpy4
committer: knielsen@knielsen-hq.org
branch nick: work-10.0-mdev26
timestamp: Fri 2013-08-23 14:02:13 +0200
message:
  MDEV-26: Global transaction ID.
 
  Implement @@gtid_binlog_state. This is the internal state of the binlog
  (most recent GTID logged for every domain_id and server_id). This allows
  to save the state before RESET MASTER and restore it afterwards.

Specifically, the sys_vars.cc part:

static unsigned char opt_gtid_binlog_state_dummy;
static Sys_var_gtid_binlog_state Sys_gtid_binlog_state(
       "gtid_binlog_state",
       "The internal GTID state of the binlog, used to keep track of all "
       "GTIDs ever logged to the binlog.",
       GLOBAL_VAR(opt_gtid_binlog_state_dummy), NO_CMD_LINE);

If I comment it out, I get a nice performance boost. Note that it doesn't seem to have anything to do with the GTID functionality accessed by Sys_var_gtid_binlog_state methods: I removed all references to GTID code and still observed the performance degradation.

It seems to be caused somehow by the increase in the number of system variables. If I add a new system variable (on revision 3816), I can see the performance degradation:

static ulong table_cache_instances1;
static Sys_var_ulong Sys_table_cache_instances1(
       "table_open_cache_instances1",
       "MySQL 5.6 compatible option. Not used or needed in MariaDB",
       READ_ONLY GLOBAL_VAR(table_cache_instances1), CMD_LINE(REQUIRED_ARG),
       VALID_RANGE(1, 64), DEFAULT(1),
       BLOCK_SIZE(1), NO_MUTEX_GUARD, NOT_IN_BINLOG, ON_CHECK(NULL),
       ON_UPDATE(NULL), NULL);

The difference looks like this:
64 threads, time spent: 60s, queries executed: 9326530, qps: 155442, 1 thread qps: 2428

vs

64 threads, time spent: 60s, queries executed: 9879031, qps: 164650, 1 thread qps: 2572
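
As a quick check of the arithmetic behind these two lines, here is a minimal sketch. It assumes the first run is the build with the extra variable (slow) and the second is the build without it (fast); the ticket does not label them explicitly. The function and constant names are illustrative only.

```cpp
#include <cstdint>

// Queries per second, truncated to an integer as in the benchmark output.
constexpr std::int64_t qps(std::int64_t queries, std::int64_t seconds) {
    return queries / seconds;
}

// Relative throughput loss, in percent, going from `fast` to `slow`.
constexpr double drop_percent(std::int64_t fast, std::int64_t slow) {
    return 100.0 * static_cast<double>(fast - slow) / static_cast<double>(fast);
}

// Figures from the two 64-thread, 60-second runs quoted above.
// ASSUMPTION: the first run is the slow build, the second the fast one.
constexpr std::int64_t kSlowQps = qps(9326530, 60);
constexpr std::int64_t kFastQps = qps(9879031, 60);

// The derived values agree with the numbers printed by the benchmark.
static_assert(kSlowQps == 155442, "slow qps matches the report");
static_assert(kFastQps == 164650, "fast qps matches the report");
static_assert(kSlowQps / 64 == 2428, "slow per-thread qps matches the report");
static_assert(kFastQps / 64 == 2572, "fast per-thread qps matches the report");
```

Under that assumption, drop_percent(kFastQps, kSlowQps) comes out at roughly 5.6% for this particular pair of runs.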

I was unable to reproduce the performance boost with a fresh 10.0 by commenting out gtid_binlog_state.

An even simpler patch for revision 3816 that shows the performance degradation:

=== modified file 'sql/sys_vars.cc'
--- sql/sys_vars.cc	2013-08-14 08:48:50 +0000
+++ sql/sys_vars.cc	2013-12-14 18:24:15 +0000
@@ -2694,6 +2694,8 @@
        BLOCK_SIZE(1), NO_MUTEX_GUARD, NOT_IN_BINLOG, ON_CHECK(NULL),
        ON_UPDATE(NULL), NULL);
 
+char buf[sizeof(Sys_table_cache_instances)];
+
 static Sys_var_ulong Sys_thread_cache_size(
        "thread_cache_size",
        "How many threads we should keep in a cache for reuse",
 



 Comments   
Comment by Sergey Vojtovich [ 2013-12-15 ]

When we add a new system variable (e.g. ptr= 0x1061d40, size= 208), the addresses of other global C++ variables may change. Among other things, the addresses of LOCK_open and unused_tables change.

rev.3816 (fast):
LOCK_open: 0x1074120, size= 48 (cache line starts 0x1074100)
unused_tables: 0x1074150, size= 8 (cache line starts 0x1074140)

rev.3816 + "char buf[sizeof(Sys_table_cache_instances)]" (slow):
LOCK_open: 0x1074200, size= 48 (cache line starts 0x1074200)
unused_tables: 0x1074230, size= 8 (cache line starts 0x1074200)

Note that in the fast version LOCK_open spans 2 cache lines (32 bytes on the first + 16 bytes on the second). The second cache line is shared with unused_tables, but since these last 16 bytes are fairly static, there should be no false-sharing issues.

In the slow version LOCK_open resides entirely on 1 cache line, which is shared with unused_tables.
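
The layout described above can be re-derived from the reported addresses. The following is a minimal sketch, assuming 64-byte cache lines (which is what the reported "cache line starts" values imply); the helper names are illustrative and are not part of the server code.

```cpp
#include <cstdint>

constexpr std::uint64_t kCacheLine = 64;  // ASSUMPTION: 64-byte cache lines

// Start address of the cache line that contains `addr`.
constexpr std::uint64_t line_start(std::uint64_t addr) {
    return addr & ~(kCacheLine - 1);
}

constexpr std::uint64_t min_u(std::uint64_t a, std::uint64_t b) { return a < b ? a : b; }
constexpr std::uint64_t max_u(std::uint64_t a, std::uint64_t b) { return a > b ? a : b; }

// Number of bytes of the object [addr, addr + size) that lie on the
// cache line starting at `line`.
constexpr std::uint64_t bytes_on_line(std::uint64_t addr, std::uint64_t size,
                                      std::uint64_t line) {
    return min_u(addr + size, line + kCacheLine) > max_u(addr, line)
               ? min_u(addr + size, line + kCacheLine) - max_u(addr, line)
               : 0;
}

// Fast layout: LOCK_open at 0x1074120 (48 bytes), unused_tables at 0x1074150.
static_assert(line_start(0x1074120) == 0x1074100, "LOCK_open starts on the first line");
static_assert(line_start(0x1074150) == 0x1074140, "unused_tables is on the second line");
static_assert(bytes_on_line(0x1074120, 48, 0x1074100) == 32, "32 bytes on the first line");
static_assert(bytes_on_line(0x1074120, 48, 0x1074140) == 16, "16 bytes shared with unused_tables");

// Slow layout: LOCK_open at 0x1074200 (48 bytes), unused_tables at 0x1074230.
static_assert(line_start(0x1074200) == line_start(0x1074230), "both on the same line");
static_assert(bytes_on_line(0x1074200, 48, 0x1074200) == 48, "whole mutex shares the line");
```

The static assertions only restate the numbers from this comment: in the fast layout 16 of the 48 bytes of LOCK_open share a line with unused_tables, in the slow layout all 48 do.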

oprofile shows that LLC_MISSES increase in the slow version:
3816 (fast)
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name          symbol name
43387    37.4148  no-vmlinux          /no-vmlinux
21919    18.9019  libpthread-2.15.so  pthread_mutex_lock
6986     6.0244   libpthread-2.15.so  pthread_mutex_unlock
5427     4.6800   mysqld              tc_release_table(TABLE*)
3741     3.2261   mysqld              TABLE::init(THD*, TABLE_LIST*)
3168     2.7319   mysqld              tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
3014     2.5991   mysqld              open_tables(THD*, TABLE_LIST*, unsigned int, unsigned int, Prelocking_strategy*)
2199     1.8963   libpthread-2.15.so  pthread_rwlock_unlock
2151     1.8549   libpthread-2.15.so  __lll_lock_wait
2134     1.8403   mysqld              dispatch_command(enum_server_command, THD*, char*, unsigned int)

3816 (slow)
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name          symbol name
43059    39.1488  no-vmlinux          /no-vmlinux
20065    18.2429  libpthread-2.15.so  pthread_mutex_lock
5736     5.2151   mysqld              tc_release_table(TABLE*)
5633     5.1215   libpthread-2.15.so  pthread_mutex_unlock
3331     3.0285   mysqld              TABLE::init(THD*, TABLE_LIST*)
2913     2.6485   mysqld              open_tables(THD*, TABLE_LIST*, unsigned int, unsigned int, Prelocking_strategy*)
2666     2.4239   mysqld              tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
2198     1.9984   libpthread-2.15.so  pthread_rwlock_unlock
1998     1.8166   libpthread-2.15.so  __lll_lock_wait
1976     1.7966   mysqld              dispatch_command(enum_server_command, THD*, char*, unsigned int)

3816 (slow + padding)
CPU: Intel Sandy Bridge microarchitecture, speed 2.701e+06 MHz (estimated)
Counted LLC_MISSES events (Last level cache demand requests from this core that missed the LLC) with a unit mask of 0x41 (No unit mask) count 10000
samples  %        image name          symbol name
43144    37.7159  no-vmlinux          /no-vmlinux
21324    18.6412  libpthread-2.15.so  pthread_mutex_lock
5930     5.1839   libpthread-2.15.so  pthread_mutex_unlock
5889     5.1481   mysqld              tc_release_table(TABLE*)
3678     3.2153   mysqld              TABLE::init(THD*, TABLE_LIST*)
3469     3.0326   mysqld              tdc_acquire_share(THD*, char const*, char const*, char const*, unsigned int, unsigned int, TABLE**)
3221     2.8158   mysqld              open_tables(THD*, TABLE_LIST*, unsigned int, unsigned int, Prelocking_strategy*)
2418     2.1138   libpthread-2.15.so  pthread_rwlock_unlock
2165     1.8926   mysqld              dispatch_command(enum_server_command, THD*, char*, unsigned int)
2144     1.8743   libpthread-2.15.so  __lll_lock_wait

Adding dummy padding around LOCK_open restores performance:
+char pada[1024];
mysql_mutex_t LOCK_open;
+char padb[1024];
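
For illustration only, the same isolation can be expressed with C++11 alignas instead of manual padding arrays. This is a sketch under stated assumptions, not what was done for this ticket: std::mutex stands in for mysql_mutex_t, the 64-byte line size is assumed, and PaddedMutex, lock_open_example, unused_tables_example and line_of are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>

constexpr std::size_t kCacheLine = 64;  // ASSUMPTION: 64-byte cache lines

// Hypothetical wrapper. alignas forces the start address onto a cache-line
// boundary and rounds sizeof up to a multiple of the line size, so no other
// global can be placed on a line occupied by the mutex.
struct alignas(kCacheLine) PaddedMutex {
    std::mutex m;  // stand-in for mysql_mutex_t
};

static_assert(alignof(PaddedMutex) == kCacheLine, "starts on a line boundary");
static_assert(sizeof(PaddedMutex) % kCacheLine == 0, "occupies whole lines only");

// Hypothetical globals mirroring LOCK_open and unused_tables.
PaddedMutex lock_open_example;
void *unused_tables_example = nullptr;

// Start address of the cache line that contains the object at `p`.
inline std::uintptr_t line_of(const void *p) {
    return reinterpret_cast<std::uintptr_t>(p) &
           ~static_cast<std::uintptr_t>(kCacheLine - 1);
}
```

Unlike the 1024-byte pads, this costs at most one cache line of extra space per mutex and does not depend on the order in which the linker places neighbouring globals.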

Comment by Sergey Vojtovich [ 2013-12-16 ]

MDEV-5388 removes unused_tables, so this particular performance regression is fixed.
