[MDEV-11937] InnoDB flushes redo log too often Created: 2017-01-29  Updated: 2021-12-23  Resolved: 2017-08-07

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: 10.2.3
Fix Version/s: 10.2.8

Type: Bug Priority: Critical
Reporter: Axel Schwenke Assignee: Kristian Nielsen
Resolution: Fixed Votes: 1
Labels: None

Attachments: File commit group size.ods    
Issue Links:
Problem/Incident
causes MDEV-15740 Galera does not recover prepared XA-t... Closed
Relates
relates to MDEV-13190 InnoDB write-only performance regression Open
relates to MDEV-14339 Different Galera performance behavior... Closed
relates to MDEV-18959 Engine transaction recovery through p... Stalled

 Description   

When I did the benchmark tests for MDEV-11900 I checked the size of the commit groups for the InnoDB redo log and the server binlog. An unexpected result was, that for MariaDB 10.2.3 and low thread counts I saw InnoDB doing twice as many redo log fsyncs (Innodb_os_log_fsyncs) as COMMIT operations (Com_commit). MariaDB 10.1 and MySQL 5.7 don't show this behavior. They do only very few more fsyncs than COMMITs.

At higher thread counts the commit group size in MariaDB 10.2 increases as expected, but it stays well below the numbers I see for MariaDB 10.1 and MySQL 5.7. I suspect that InnoDB always does two fsyncs on the redo log when it should do only one.

The effect gets visible with innodb_flush_log_at_trx_commit=1 and enabled binlog (log-bin = xxx) no matter if sync-binlog = 0 or sync-binlog = 1. Disabling the binlog puts everything back to normal. Another visible effect of the excessive flushing is reduced throughput in read/write OLTP benchmarks in comparison to MariaDB 10.1 and MySQL 5.7.

The commit group size for the binlog is also smaller in MariaDB 10.2 (again in comparison to MariaDB 10.1 and MySQL 5.7).

Another unexpected observation is this: the InnoDB redo log commit group size decreases, when going from sync_binlog=0 to sync_binlog=1. For the binlog commit group size it's the other way round: sync_binlog=1 gives bigger groups than sync_binlog=0. Also both numbers differ significantly. I had expected to see roughly the same commit group size for the binlog and the redo log. The indicator I used here is Binlog_commits in relation to Binlog_group_commits.

I attach a spread sheet with my numbers. See also MDEV-11900 for throughput numbers.



 Comments   
Comment by Axel Schwenke [ 2017-01-29 ]

the my.cnf used:

[mysqld]
 
performance-schema=off
default_storage_engine = InnoDB
secure-file-priv = /tmp
 
#####non innodb options
max_connections = 200
table_open_cache = 400
query_cache_type = 0
key_buffer_size = 16M
read_buffer_size = 256K
read_rnd_buffer_size = 512K
 
#####innodb options
innodb-file-per-table = true
innodb-buffer-pool-size = 32G
innodb_buffer_pool_instances = 16
innodb_adaptive_hash_index_parts = 16
innodb_log_file_size = 2G
innodb_log_buffer_size = 8M
innodb-flush-method = O_DIRECT
innodb_flush_log_at_trx_commit=1
 
#####replication
server-id = 1
log-bin = binlog
binlog_cache_size = 32K
max_binlog_size = 1G
sync-binlog = 0

Comment by Marko Mäkelä [ 2017-02-10 ]

The following MySQL 5.7 change was not fully merged to MariaDB Server 10.2:

commit c60ac300ad2e5d1396694fb9f642b78b10bc3c93
Author: Bill Qu <bill.qu@Oracle.com>
Date:   Tue Nov 4 10:28:26 2014 +0800
 
    Bug#19424075    WRITE/SYNC REDO LOG BEFORE FLUSH THREAD CACHE TO BINLOG
    
    According to the crash recovery logic of mysql server, we only
    need to guarantee that write/sync prepared transaction to InnoDB
    redo log before writing to binary log. And prepared records of
    transactions were not committed into InnoDB redo log in a group.
    
    To improve performance, write/sync prepared records of transactions
    to InnoDB redo log in a group right before writing to binary log
    during binlog group commit flush stage.

This added a Boolean parameter to the handlerton::flush_logs operation:

diff --git a/sql/handler.cc b/sql/handler.cc
index b6d42f5add1..a049189f842 100644
--- a/sql/handler.cc
+++ b/sql/handler.cc
@@ -2165,25 +2165,27 @@ static my_bool flush_handlerton(THD *thd, plugin_ref plugin,
                                 void *arg)
 {
   handlerton *hton= plugin_data(plugin, handlerton *);
-  if (hton->state == SHOW_OPTION_YES && hton->flush_logs && 
-      hton->flush_logs(hton))
+  if (hton->state == SHOW_OPTION_YES && hton->flush_logs &&
+      hton->flush_logs(hton, *(static_cast<bool *>(arg))))
     return TRUE;
   return FALSE;
 }
 
 
-bool ha_flush_logs(handlerton *db_type)
+bool ha_flush_logs(handlerton *db_type, bool binlog_group_flush)
 {
   if (db_type == NULL)
   {
     if (plugin_foreach(NULL, flush_handlerton,
-                          MYSQL_STORAGE_ENGINE_PLUGIN, 0))
+                       MYSQL_STORAGE_ENGINE_PLUGIN,
+                       static_cast<void *>(&binlog_group_flush)))
       return TRUE;
   }
   else
   {
     if (db_type->state != SHOW_OPTION_YES ||
-        (db_type->flush_logs && db_type->flush_logs(db_type)))
+        (db_type->flush_logs &&
+         db_type->flush_logs(db_type, binlog_group_flush)))
       return TRUE;
   }
   return FALSE;

The merge from MySQL 5.7 to MariaDB Server 10.2 omitted these changes, and instead only added a dummy wrapper function to InnoDB:

commit 2e814d4702d71a04388386a9f591d14a35980bfe
Author: Jan Lindström <jan.lindstrom@mariadb.com>
Date:   Fri Aug 12 11:17:45 2016 +0300
 
    Merge InnoDB 5.7 from mysql-5.7.9.

static
bool
innobase_flush_logs(
	handlerton*	hton)
{
	return innobase_flush_logs(hton, true);
}

Comment by Jan Lindström (Inactive) [ 2017-02-10 ]

I tried to avoid server changes when InnoDB 5.7 was merged to 10.2.
Above changes are not yet enough. Current MariaDB replication does not contain that binlog_group_flush, in MySQL 5.7 there are many more flush_logs calls:

jan@jan-laptop-asus:~/mysql/mysql-server/sql$ grep ha_flush_logs *
binlog.cc:  if (ha_flush_logs(NULL))
binlog.cc:      && (error= ha_flush_logs(NULL)))
binlog.cc:      ha_flush_logs(NULL);
binlog.cc:    has a leader. We must do this before invoking ha_flush_logs(...)
binlog.cc:  ha_flush_logs(NULL, true);
andlerton *db_type, bool binlog_group_flush)
handler.h:bool ha_flush_logs(handlerton *db_type, bool binlog_group_flush= false);
log_event.cc:         ha_flush_logs(0); 
sql_reload.cc:    if (ha_flush_logs(NULL))
sql_table.cc:    ha_flush_logs(old_db_type);

Comment by Marko Mäkelä [ 2017-03-15 ]

Furthermore, in trx_commit_in_memory() in storage/innobase/trx/trx0trx.cc, MariaDB Server 10.2 is missing some code that exists in MySQL 5.7:

		} else if (srv_flush_log_at_trx_commit == 0
			   || thd_requested_durability(trx->mysql_thd)
			   == HA_IGNORE_DURABILITY) {
			/* Do nothing */
		} else {
			trx_flush_log_if_needed(lsn, trx);
		}

Comment by Marko Mäkelä [ 2017-03-15 ]

I think that in addition to the already mentioned commit we also want to merge this change that adds MYSQL_BIN_LOG::prepare, THD::durability_property, and thd_requested_durability(), among other things.

Comment by Sergei Golubchik [ 2017-04-06 ]

marko, I don't see how these changes in innobase_flush_logs() could possibly affect anything, MariaDB doesn't call flush_logs as often as MySQL does.

This code in trx_commit_in_memory() seems to be presern in MariaDB. But I don't see how it could help either.

Comment by Marko Mäkelä [ 2017-06-30 ]

Elkin, I hope you can investigate this.
My guess is that in a binlog-driven operation, InnoDB in MariaDB 10.2+ is unnecessarily doing flushing and fsync() of the redo log at both COMMIT and PREPARE, while it would suffice to do it at PREPARE only.

Related to this, I hope that you can import and adapt the public tests from MySQL 5.7 that you did when fixing MySQL Bug #12161 Xa recovery and client disconnection.

Comment by Andrei Elkin [ 2017-06-30 ]

Marko,

Sure, let me try to look at it from my end.

A.

Comment by Kristian Nielsen [ 2017-07-03 ]

Maybe this mentioned code from MySQL 5.7 is their optimisation to avoid having to fsync the innodb redo log after binlog commit?

MariaDB implements this in a different way, using binlog checkpoints. There is some info on that here: http://kristiannielsen.livejournal.com/16382.html

It really is critical that these issues are properly dealt with before merging any new InnoDB code into MariaDB.

Comment by Michael Widenius [ 2017-08-06 ]

Just a note: Kristian has promised to take a look at this.
Andrei, please coordinate any work on this issue with Kristian!

Comment by Kristian Nielsen [ 2017-08-07 ]

Patch: https://github.com/MariaDB/server/commit/c5aa11cbb9cba7807cdf77bb3e5223be2d7389de

Mailing list thread: https://lists.launchpad.net/maria-developers/msg10832.html

Generated at Thu Feb 08 07:53:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.