[MDEV-13080] [ERROR] InnoDB: Missing MLOG_CHECKPOINT between the checkpoint x and the end y Created: 2017-06-13  Updated: 2019-05-15  Resolved: 2019-05-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB, Tests
Affects Version/s: 10.2.6, 10.2.17, 10.3
Fix Version/s: 10.2.25, 10.3.16, 10.4.5

Type: Bug Priority: Major
Reporter: Farid UYAR Assignee: Eugene Kosov (Inactive)
Resolution: Fixed Votes: 2
Labels: None
Environment:

Ubuntu 14.04.4 LTS


Issue Links:
Duplicate
duplicates MDEV-17680 innodb.undo_truncate_recover fails in... Closed
Relates
relates to MDEV-15282 innodb.autoinc_persist failed in buil... Closed
relates to MDEV-18038 Assertion failure in innodb.undo_trun... Closed
relates to MDEV-13830 Assertion failed: recv_sys->mlog_chec... Closed
relates to MDEV-19346 Assertion `recv_sys->mlog_checkpoint_... Closed

 Description   

Hello,

I've been using MariaDB 10.2.6 with slave_run_triggers_for_rbr for some aggregation tables.
However, there's a big memory leak with this setup and the server regularly hangs.

Last time it hangs, innodb crash recovery could not start because MLOG_CHECKPOINT was not found.

My server has got a BBWC so any fsync should be persisted on disk. I dont understand how that could happen ?



 Comments   
Comment by Elena Stepanova [ 2017-06-13 ]

Can you paste the body of the trigger?
We have a confirmed report about a leak in a trigger which uses a subquery: MDEV-12992 . Can yours be similar?

Comment by Farid UYAR [ 2017-06-13 ]

Dont really understand russian so cant comment on the previous report. But yeah on my triggers I'm storing some variables like this :

SET @var = (SELECT IFNULL(total,0) from table_name where ...).

Then using those variables to insert them (simple aggregations). Could it be that those var are never freed from memory ?

Comment by Elena Stepanova [ 2017-06-13 ]

Sorry I forgot to mention that, you didn't need to read the whole report, there were a couple of comments at the end with the scenario and all.

That particular scenario involved a subquery in the trigger, although possibly there were other variations. Or, does your WHERE clause have a subquery?

Comment by Farid UYAR [ 2017-06-14 ]

Okay I show your comment on the other issue. I dont really have SELECT subqueries like in your example.
Most of my triggers are like this :

IF NEW.var1 = X AND NEW.var2 = Y THEN
     INSERT INTO or DELETE FROM
END IF;

Seems pretty simple to me.

The most complex ones are like this :

IF NEW.var1 = X THEN
   SET @x = (SELECT count(*) FROM other_table WHERE column1 = NEW.xx AND column2 = yy)
   IF @x > 0 THEN
          INSERT INTO agg_table VALUES (NEW.id, @x)
   ELSE
          DELETE FROM agg_table 
   END IF;
END IF;

So no subqueries, only @var usage to store count aggregate.

Comment by Elena Stepanova [ 2017-06-25 ]

Are other_table and agg_table here actual tables, or is any of them a view?
And in the first trigger, what does DELETE have in WHERE clause, and what kind of INSERT is it – INSERT .. VALUES or INSERT ... SELECT?

Comment by Elena Stepanova [ 2017-07-31 ]

Fuyar,

Did you have a chance to upgrade to 10.2.7 and if you did, did it resolve the problem?

Comment by Andrew Ramsay [ 2017-07-31 ]

I nearly logged a new fault, but the missing checkpoint message in the log says the problem is the same, even if the cause is not.
I believe 10.2.7 also has [the] problem, although possibly not to the same degree:

2017-07-29 11:58:59 4208 [Note] InnoDB: Mutexes and rw_locks use Windows interlocked functions
2017-07-29 11:58:59 4208 [Note] InnoDB: Uses event mutexes
2017-07-29 11:58:59 4208 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-07-29 11:58:59 4208 [Note] InnoDB: Number of pools: 1
2017-07-29 11:58:59 4208 [Note] InnoDB: Using generic crc32 instructions
2017-07-29 11:58:59 4208 [Note] InnoDB: Initializing buffer pool, total size = 16G, instances = 8, chunk size = 128M
2017-07-29 11:59:00 4208 [Note] InnoDB: Completed initialization of buffer pool
2017-07-29 11:59:00 4208 [Note] InnoDB: Highest supported file format is Barracuda.
2017-07-29 11:59:00 4208 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1070287781678
2017-07-29 11:59:00 4208 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 1070287781678 between the checkpoint 1070287781678 and the end 1070287807736.
2017-07-29 11:59:00 4208 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2017-07-29 11:59:00 4208 [Note] InnoDB: Starting shutdown...
2017-07-29 11:59:01 4208 [ERROR] Plugin 'InnoDB' init function returned error.
2017-07-29 11:59:01 4208 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2017-07-29 11:59:01 4208 [Note] Plugin 'FEEDBACK' is disabled.
2017-07-29 11:59:01 4208 [ERROR] Unknown/unsupported storage engine: InnoDB
2017-07-29 11:59:01 4208 [ERROR] Aborting

happened for the first time two days ago, although seems more stable than 10.2.6. This happened after I had taken the server up and down a few times in fairly quick succession due to rebooting win7x64 after multiple software upgrades (this could be a red herring though!). server refused to start until innodb_force_recovery=6.

Comment by Elena Stepanova [ 2017-07-31 ]

It appears that there are three problems (either related to each other or not) reported in this issue:

there's a big memory leak

the server regularly hangs

innodb crash recovery could not start because MLOG_CHECKPOINT was not found

amramsay, as I understand, you have had the third kind. Have you experienced the other two?

Comment by Andrew Ramsay [ 2017-07-31 ]

1. To my knowledge I have not experienced the memory leak, I have been watching memory usage quite closely, during my current processing - it has remained static at 18.5GB.
2.I am not aware of the server hanging, although on occasion it has run a bit slow, but I put that down to the amount of data being processed/imported - I have been moving fairly substantial amounts of data (for me anyway) over the last few weeks (~50GB binlogs per 24hour period) - repeated db imports, all has run smoothly.
3. as you said - yes, the checkpoint issue, once.

Comment by Elena Stepanova [ 2017-07-31 ]

amramsay, thanks.

In this case, let's treat this issue as being about the missing MLOG_CHECKPOINT (as the subject suggests, anyway), and consider other parts of it possibly irrelevant.

jplindst, any thoughts about the checkpoint problem?

Comment by Andrew Ramsay [ 2017-08-24 ]

For information - I note that MLOG_CHECKPOINT issue is still present in 10.2.8. I installed 10.2.8 on 18-Aug, crash occurred on 20-Aug, minimal db activity between upgrade and crash recovery:

– restart after crash, but no encryption key present –
2017-08-20 20:43:38 6472 [Note] InnoDB: Mutexes and rw_locks use Windows interlocked functions
2017-08-20 20:43:38 6472 [Note] InnoDB: Uses event mutexes
2017-08-20 20:43:38 6472 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-08-20 20:43:38 6472 [Note] InnoDB: Number of pools: 1
2017-08-20 20:43:38 6472 [Note] InnoDB: Using generic crc32 instructions
2017-08-20 20:43:38 6472 [Note] InnoDB: Initializing buffer pool, total size = 16G, instances = 8, chunk size = 128M
2017-08-20 20:43:39 6472 [Note] InnoDB: Completed initialization of buffer pool
2017-08-20 20:43:39 6472 [Note] InnoDB: Highest supported file format is Barracuda.
2017-08-20 20:43:39 6472 [Note] InnoDB: Starting crash recovery from checkpoint LSN=656734850618
2017-08-20 20:43:39 6472 [ERROR] InnoDB: Incorrect log record type:90
2017-08-20 20:43:39 6472 [Warning] InnoDB: Log scan aborted at LSN 656734850560
2017-08-20 20:43:39 6472 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2017-08-20 20:43:39 6472 [Note] InnoDB: Starting shutdown...
2017-08-20 20:43:40 6472 [ERROR] Plugin 'InnoDB' init function returned error.
2017-08-20 20:43:40 6472 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2017-08-20 20:43:40 6472 [Note] Plugin 'FEEDBACK' is disabled.
2017-08-20 20:43:40 6472 [ERROR] Unknown/unsupported storage engine: InnoDB
2017-08-20 20:43:40 6472 [ERROR] Aborting

– restart, now with encryption key present –
2017-08-20 21:15:29 4800 [Note] InnoDB: Mutexes and rw_locks use Windows interlocked functions
2017-08-20 21:15:29 4800 [Note] InnoDB: Uses event mutexes
2017-08-20 21:15:29 4800 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-08-20 21:15:29 4800 [Note] InnoDB: Number of pools: 1
2017-08-20 21:15:29 4800 [Note] InnoDB: Using generic crc32 instructions
2017-08-20 21:15:29 4800 [Note] InnoDB: Initializing buffer pool, total size = 16G, instances = 8, chunk size = 128M
2017-08-20 21:15:30 4800 [Note] InnoDB: Completed initialization of buffer pool
2017-08-20 21:15:30 4800 [Note] InnoDB: Highest supported file format is Barracuda.
2017-08-20 21:15:30 4800 [Note] InnoDB: Starting crash recovery from checkpoint LSN=656734850618
2017-08-20 21:15:30 4800 [ERROR] InnoDB: Incorrect log record type:90
2017-08-20 21:15:30 4800 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 656734850618 between the checkpoint 656734850618 and the end 656734861665.
2017-08-20 21:15:30 4800 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2017-08-20 21:15:30 4800 [Note] InnoDB: Starting shutdown...
2017-08-20 21:15:31 4800 [ERROR] Plugin 'InnoDB' init function returned error.
2017-08-20 21:15:31 4800 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2017-08-20 21:15:31 4800 [Note] Plugin 'FEEDBACK' is disabled.
2017-08-20 21:15:31 4800 [ERROR] Unknown/unsupported storage engine: InnoDB
2017-08-20 21:15:31 4800 [ERROR] Aborting

...I had to go to recovery level 6 to get database online for dumping.
I noted that the log message was slightly different on the first pass, presumably because data could not be decrypted for interpretation, due to lack of key (my fault)?

Comment by Elena Stepanova [ 2018-10-01 ]

http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/14336

innodb.undo_truncate_recover '16k,2,innodb' w1 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2018-09-21 04:45:22
line
2018-09-21  4:45:20 0 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 24653138 between the checkpoint 23916606 and the end 24653138.
2018-09-21  4:45:20 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[2012] with error Generic error
2018-09-21  4:45:21 0 [ERROR] Plugin 'InnoDB' init function returned error.
2018-09-21  4:45:21 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
^ Found warnings in /mnt/buildbot/build/mariadb-10.3.10/mysql-test/var/1/log/mysqld.1.err
ok

Comment by Daniel Black [ 2018-10-16 ]

 docker logs a80641d42796
MySQL init password for existing database...
MySQL init process in progress...
2018-10-15 13:12:54 70366838301360 [Note] mysqld (mysqld 10.2.17-MariaDB-1:10.2.17+maria~bionic) starting as process 43 ...
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Uses event mutexes
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Compressed tables use zlib 1.2.11
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Using Linux native AIO
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Number of pools: 1
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Using POWER8 crc32 instructions
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Initializing buffer pool, total size = 256M, instances = 1, chunk size = 128M
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Completed initialization of buffer pool
2018-10-15 13:12:54 70366127255728 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Highest supported file format is Barracuda.
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Starting crash recovery from checkpoint LSN=4101625
2018-10-15 13:12:54 70366838301360 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 4101625 between the checkpoint 4101305 and the end 4101632.
2018-10-15 13:12:54 70366838301360 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2018-10-15 13:12:55 70366838301360 [Note] InnoDB: Starting shutdown...
2018-10-15 13:12:55 70366838301360 [ERROR] Plugin 'InnoDB' init function returned error.
2018-10-15 13:12:55 70366838301360 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2018-10-15 13:12:55 70366838301360 [Note] Plugin 'FEEDBACK' is disabled.
2018-10-15 13:12:55 70366838301360 [ERROR] Unknown/unsupported storage engine: InnoDB
2018-10-15 13:12:55 70366838301360 [ERROR] Aborting

update:

uploaded files to ftp.mariadb.org:

  • MDEV-13080-docker-export-ppc64le.tar.xz - the result of the docker export of the container. Will be exceptionally close to the docker mariadb official mariadb:10.2.18-bionic - basic configuration.
  • MDEV-13080-datadir.tar.xz - datadir (unrecovered)
Comment by Marko Mäkelä [ 2018-11-20 ]

MDEV-13830 and MDEV-15282 could share the root cause with this.

Comment by Thirunarayanan Balathandayuthapani [ 2018-11-21 ]

MLOG_CHECKPOINT is present in the redo log partially. MLOG_CHECKPOINT record is 9 bytes. 1 byte for MLOG_CHECKPOINT marker and 8 bytes for CHECKPOINT LSN.

MLOG_CHECKPOINT record is present partially (only 3 bytes) at the end of log block and next block is empty.

Scenario could be that checkpoint info is written before MLOG_CHECKPOINT record is written in the next block and server could be killed
before writing the block.

Yet to write mtr test case for it and read the code too

Comment by Marko Mäkelä [ 2018-11-27 ]

I tried to review the code, and I did not find anything obviously wrong, except potential race conditions in log_write_up_to(), for example here:

	if (flush_to_disk
	    && (log_sys->n_pending_flushes > 0
		|| !os_event_is_set(log_sys->flush_event))) {
		/* Figure out if the current flush will do the job
		for us. */
		bool work_done = log_sys->current_flush_lsn >= lsn;
 
		log_write_mutex_exit();
 
		os_event_wait(log_sys->flush_event);
 
		if (work_done) {
			return;

How can we be sure that the desired work was actually completed here? Could there have been multiple pending writes, and could the event have been signaled for an earlier write, for something older than log_sys->current_flush_lsn? Perhaps we should check log_sys->flushed_to_disk_lsn before returning.

I tried to reproduce the failure, but in the end I only reproduced (and subsequently developed a fix for) MDEV-17849.

Comment by Eugene Kosov (Inactive) [ 2018-12-21 ]

To reproduce this bug I use mtr -mem -force -max-test-fail=9999 -suite=innodb -par=5 innodb.undo_truncate_recover{,,,} -repeat=100 --. Sometimes also https://jira.mariadb.org/browse/MDEV-18038 happens.

Both failures are rare.

log_buffer_extend() is called in this test. When I disable it with --innodb-log-buffer-size=100000000 I see no errors at all. But I'm unsure because it's not easy to reproduce. Anyway this is my current guess that it's data races during redo log extending.

Comment by Eugene Kosov (Inactive) [ 2018-12-28 ]

I put write_always = true; at the very beginning of log_checkpoint(). This makes checkpoints flushed more frequent. And both failures again disappear. Maybe the last checkpoint sometimes is not committed and this is the source of problems?

Comment by Marko Mäkelä [ 2019-03-08 ]

kevg, that is interesting. Could it be that we sometimes update last_checkpoint_lsn before the corresponding write to the checkpoint header becomes persistent? Could it be that fil_aio_wait() is invoked on the redo log pseudo-tablespace for something else than the latest pending checkpoint write? After all, we do allow multiple log_checkpoint() executions in parallel, interleaved with each other.

Could you add an assertion to log_complete_checkpoint() or its call path, to verify that the completed write was for the expected checkpoint?

Note: The redo log is not being extended while the server is running. Only the redo log write buffer is extended during this in 10.2 and 10.3. In 10.4, less redo log is being written thanks to the MDEV-17138 MLOG_MEMSET record.

Comment by Eugene Kosov (Inactive) [ 2019-03-26 ]

My assumption about `write_always = true` was incorrect at least for MDEV-18038 which sometimes happen.

Comment by Marko Mäkelä [ 2019-04-23 ]

I implemented debug instrumentation that should make bugs in crash recovery and mariabackup --prepare more repeatable.

Usage:

./mtr --mysqld=--debug=d,ib_log_checkpoint_avoid …

Comment by Eugene Kosov (Inactive) [ 2019-05-13 ]

The bug is inside log_buffer_extend(). MDEV-18038 too.

This patch makes log_buffer_extend() run on every mtr.commit().

diff --git a/storage/innobase/include/log0log.h b/storage/innobase/include/log0log.h
index 16399edad50..dd549baddce 100644
--- a/storage/innobase/include/log0log.h
+++ b/storage/innobase/include/log0log.h
@@ -81,7 +81,7 @@ log_free_check(void);
 
 /** Extends the log buffer.
 @param[in]     len     requested minimum size in bytes */
-void log_buffer_extend(ulong len);
+void log_buffer_extend();
 
 /** Check margin not to overwrite transaction log from the last checkpoint.
 If would estimate the log write to exceed the log_group_capacity,
diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
index 409b42c452a..4ffa906f86b 100644
--- a/storage/innobase/log/log0log.cc
+++ b/storage/innobase/log/log0log.cc
@@ -159,8 +159,10 @@ log_buf_pool_get_oldest_modification(void)
 
 /** Extends the log buffer.
 @param[in]     len     requested minimum size in bytes */
-void log_buffer_extend(ulong len)
+void log_buffer_extend()
 {
+       size_t len = srv_log_buffer_size + 1;
+
        byte    tmp_buf[OS_FILE_LOG_BLOCK_SIZE];
 
        log_mutex_enter_all();
diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc
index 92b1aa38a81..d4950ffba05 100644
--- a/storage/innobase/mtr/mtr0mtr.cc
+++ b/storage/innobase/mtr/mtr0mtr.cc
@@ -764,9 +764,7 @@ mtr_t::Command::prepare_write()
        ut_ad(len > 0);
        ut_ad(n_recs > 0);
 
-       if (len > srv_log_buffer_size / 2) {
-               log_buffer_extend(ulong((len + 1) * 2));
-       }
+       log_buffer_extend();
 
        ut_ad(m_impl->m_n_log_recs == n_recs);

And with that patch I always see MDEV-13080 or MDEV-18038 in 10.3 with this simple test:

create or replace table t (a int) engine=innodb;
insert into t values (1);
drop table t;

Comment by Marko Mäkelä [ 2019-05-15 ]

We will keep the log_buffer_extend() but remove the log file write from it.

Generated at Thu Feb 08 08:02:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.