Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Fixed
-
10.2.6, 10.2.17, 10.3(EOL)
-
None
-
Ubuntu 14.04.4 LTS
Description
Hello,
I've been using MariaDB 10.2.6 with slave_run_triggers_for_rbr for some aggregation tables.
However, there's a big memory leak with this setup and the server regularly hangs.
Last time it hangs, innodb crash recovery could not start because MLOG_CHECKPOINT was not found.
My server has got a BBWC so any fsync should be persisted on disk. I dont understand how that could happen ?
Attachments
Issue Links
- duplicates
-
MDEV-17680 innodb.undo_truncate_recover fails in buildbot with Missing MLOG_CHECKPOINT
-
- Closed
-
- relates to
-
MDEV-15282 innodb.autoinc_persist failed in buildbot, Assertion failed: recv_sys->mlog_checkpoint_lsn <= recv_sys->recovered_lsn
-
- Closed
-
-
MDEV-18038 Assertion failure in innodb.undo_truncate_recover: "pad_len >= len || i * 512U >= len - pad_len || log_block_get_hdr_no( buf + i * 512U) == log_block_get_hdr_no(buf) + i"
-
- Closed
-
-
MDEV-13830 Assertion failed: recv_sys->mlog_checkpoint_lsn <= recv_sys->recovered_lsn
-
- Closed
-
-
MDEV-19346 Assertion `recv_sys->mlog_checkpoint_lsn <= recv_sys->recovered_lsn' failed in recv_parse_log_recs during mariabackup --prepare
-
- Closed
-
Activity
Dont really understand russian so cant comment on the previous report. But yeah on my triggers I'm storing some variables like this :
SET @var = (SELECT IFNULL(total,0) from table_name where ...).
Then using those variables to insert them (simple aggregations). Could it be that those var are never freed from memory ?
Sorry I forgot to mention that, you didn't need to read the whole report, there were a couple of comments at the end with the scenario and all.
That particular scenario involved a subquery in the trigger, although possibly there were other variations. Or, does your WHERE clause have a subquery?
Okay I show your comment on the other issue. I dont really have SELECT subqueries like in your example.
Most of my triggers are like this :
IF NEW.var1 = X AND NEW.var2 = Y THEN
|
INSERT INTO or DELETE FROM
|
END IF;
|
Seems pretty simple to me.
The most complex ones are like this :
IF NEW.var1 = X THEN
|
SET @x = (SELECT count(*) FROM other_table WHERE column1 = NEW.xx AND column2 = yy)
|
IF @x > 0 THEN
|
INSERT INTO agg_table VALUES (NEW.id, @x)
|
ELSE
|
DELETE FROM agg_table
|
END IF;
|
END IF;
|
So no subqueries, only @var usage to store count aggregate.
Are other_table and agg_table here actual tables, or is any of them a view?
And in the first trigger, what does DELETE have in WHERE clause, and what kind of INSERT is it – INSERT .. VALUES or INSERT ... SELECT?
Did you have a chance to upgrade to 10.2.7 and if you did, did it resolve the problem?
I nearly logged a new fault, but the missing checkpoint message in the log says the problem is the same, even if the cause is not.
I believe 10.2.7 also has [the] problem, although possibly not to the same degree:
2017-07-29 11:58:59 4208 [Note] InnoDB: Mutexes and rw_locks use Windows interlocked functions
2017-07-29 11:58:59 4208 [Note] InnoDB: Uses event mutexes
2017-07-29 11:58:59 4208 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-07-29 11:58:59 4208 [Note] InnoDB: Number of pools: 1
2017-07-29 11:58:59 4208 [Note] InnoDB: Using generic crc32 instructions
2017-07-29 11:58:59 4208 [Note] InnoDB: Initializing buffer pool, total size = 16G, instances = 8, chunk size = 128M
2017-07-29 11:59:00 4208 [Note] InnoDB: Completed initialization of buffer pool
2017-07-29 11:59:00 4208 [Note] InnoDB: Highest supported file format is Barracuda.
2017-07-29 11:59:00 4208 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1070287781678
2017-07-29 11:59:00 4208 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 1070287781678 between the checkpoint 1070287781678 and the end 1070287807736.
2017-07-29 11:59:00 4208 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2017-07-29 11:59:00 4208 [Note] InnoDB: Starting shutdown...
2017-07-29 11:59:01 4208 [ERROR] Plugin 'InnoDB' init function returned error.
2017-07-29 11:59:01 4208 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2017-07-29 11:59:01 4208 [Note] Plugin 'FEEDBACK' is disabled.
2017-07-29 11:59:01 4208 [ERROR] Unknown/unsupported storage engine: InnoDB
2017-07-29 11:59:01 4208 [ERROR] Aborting
happened for the first time two days ago, although seems more stable than 10.2.6. This happened after I had taken the server up and down a few times in fairly quick succession due to rebooting win7x64 after multiple software upgrades (this could be a red herring though!). server refused to start until innodb_force_recovery=6.
It appears that there are three problems (either related to each other or not) reported in this issue:
there's a big memory leak
the server regularly hangs
innodb crash recovery could not start because MLOG_CHECKPOINT was not found
amramsay, as I understand, you have had the third kind. Have you experienced the other two?
1. To my knowledge I have not experienced the memory leak, I have been watching memory usage quite closely, during my current processing - it has remained static at 18.5GB.
2.I am not aware of the server hanging, although on occasion it has run a bit slow, but I put that down to the amount of data being processed/imported - I have been moving fairly substantial amounts of data (for me anyway) over the last few weeks (~50GB binlogs per 24hour period) - repeated db imports, all has run smoothly.
3. as you said - yes, the checkpoint issue, once.
amramsay, thanks.
In this case, let's treat this issue as being about the missing MLOG_CHECKPOINT (as the subject suggests, anyway), and consider other parts of it possibly irrelevant.
jplindst, any thoughts about the checkpoint problem?
For information - I note that MLOG_CHECKPOINT issue is still present in 10.2.8. I installed 10.2.8 on 18-Aug, crash occurred on 20-Aug, minimal db activity between upgrade and crash recovery:
– restart after crash, but no encryption key present –
2017-08-20 20:43:38 6472 [Note] InnoDB: Mutexes and rw_locks use Windows interlocked functions
2017-08-20 20:43:38 6472 [Note] InnoDB: Uses event mutexes
2017-08-20 20:43:38 6472 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-08-20 20:43:38 6472 [Note] InnoDB: Number of pools: 1
2017-08-20 20:43:38 6472 [Note] InnoDB: Using generic crc32 instructions
2017-08-20 20:43:38 6472 [Note] InnoDB: Initializing buffer pool, total size = 16G, instances = 8, chunk size = 128M
2017-08-20 20:43:39 6472 [Note] InnoDB: Completed initialization of buffer pool
2017-08-20 20:43:39 6472 [Note] InnoDB: Highest supported file format is Barracuda.
2017-08-20 20:43:39 6472 [Note] InnoDB: Starting crash recovery from checkpoint LSN=656734850618
2017-08-20 20:43:39 6472 [ERROR] InnoDB: Incorrect log record type:90
2017-08-20 20:43:39 6472 [Warning] InnoDB: Log scan aborted at LSN 656734850560
2017-08-20 20:43:39 6472 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2017-08-20 20:43:39 6472 [Note] InnoDB: Starting shutdown...
2017-08-20 20:43:40 6472 [ERROR] Plugin 'InnoDB' init function returned error.
2017-08-20 20:43:40 6472 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2017-08-20 20:43:40 6472 [Note] Plugin 'FEEDBACK' is disabled.
2017-08-20 20:43:40 6472 [ERROR] Unknown/unsupported storage engine: InnoDB
2017-08-20 20:43:40 6472 [ERROR] Aborting
– restart, now with encryption key present –
2017-08-20 21:15:29 4800 [Note] InnoDB: Mutexes and rw_locks use Windows interlocked functions
2017-08-20 21:15:29 4800 [Note] InnoDB: Uses event mutexes
2017-08-20 21:15:29 4800 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-08-20 21:15:29 4800 [Note] InnoDB: Number of pools: 1
2017-08-20 21:15:29 4800 [Note] InnoDB: Using generic crc32 instructions
2017-08-20 21:15:29 4800 [Note] InnoDB: Initializing buffer pool, total size = 16G, instances = 8, chunk size = 128M
2017-08-20 21:15:30 4800 [Note] InnoDB: Completed initialization of buffer pool
2017-08-20 21:15:30 4800 [Note] InnoDB: Highest supported file format is Barracuda.
2017-08-20 21:15:30 4800 [Note] InnoDB: Starting crash recovery from checkpoint LSN=656734850618
2017-08-20 21:15:30 4800 [ERROR] InnoDB: Incorrect log record type:90
2017-08-20 21:15:30 4800 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 656734850618 between the checkpoint 656734850618 and the end 656734861665.
2017-08-20 21:15:30 4800 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
2017-08-20 21:15:30 4800 [Note] InnoDB: Starting shutdown...
2017-08-20 21:15:31 4800 [ERROR] Plugin 'InnoDB' init function returned error.
2017-08-20 21:15:31 4800 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2017-08-20 21:15:31 4800 [Note] Plugin 'FEEDBACK' is disabled.
2017-08-20 21:15:31 4800 [ERROR] Unknown/unsupported storage engine: InnoDB
2017-08-20 21:15:31 4800 [ERROR] Aborting
...I had to go to recovery level 6 to get database online for dumping.
I noted that the log message was slightly different on the first pass, presumably because data could not be decrypted for interpretation, due to lack of key (my fault)?
http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/14336
innodb.undo_truncate_recover '16k,2,innodb' w1 [ fail ] Found warnings/errors in server log file!
|
Test ended at 2018-09-21 04:45:22
|
line
|
2018-09-21 4:45:20 0 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 24653138 between the checkpoint 23916606 and the end 24653138.
|
2018-09-21 4:45:20 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[2012] with error Generic error
|
2018-09-21 4:45:21 0 [ERROR] Plugin 'InnoDB' init function returned error.
|
2018-09-21 4:45:21 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
|
^ Found warnings in /mnt/buildbot/build/mariadb-10.3.10/mysql-test/var/1/log/mysqld.1.err
|
ok
|
docker logs a80641d42796
|
MySQL init password for existing database...
|
MySQL init process in progress...
|
2018-10-15 13:12:54 70366838301360 [Note] mysqld (mysqld 10.2.17-MariaDB-1:10.2.17+maria~bionic) starting as process 43 ...
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Uses event mutexes
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Compressed tables use zlib 1.2.11
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Using Linux native AIO
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Number of pools: 1
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Using POWER8 crc32 instructions
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Initializing buffer pool, total size = 256M, instances = 1, chunk size = 128M
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Completed initialization of buffer pool
|
2018-10-15 13:12:54 70366127255728 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Highest supported file format is Barracuda.
|
2018-10-15 13:12:54 70366838301360 [Note] InnoDB: Starting crash recovery from checkpoint LSN=4101625
|
2018-10-15 13:12:54 70366838301360 [ERROR] InnoDB: Missing MLOG_CHECKPOINT at 4101625 between the checkpoint 4101305 and the end 4101632.
|
2018-10-15 13:12:54 70366838301360 [ERROR] InnoDB: Plugin initialization aborted with error Generic error
|
2018-10-15 13:12:55 70366838301360 [Note] InnoDB: Starting shutdown...
|
2018-10-15 13:12:55 70366838301360 [ERROR] Plugin 'InnoDB' init function returned error.
|
2018-10-15 13:12:55 70366838301360 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
|
2018-10-15 13:12:55 70366838301360 [Note] Plugin 'FEEDBACK' is disabled.
|
2018-10-15 13:12:55 70366838301360 [ERROR] Unknown/unsupported storage engine: InnoDB
|
2018-10-15 13:12:55 70366838301360 [ERROR] Aborting
|
update:
uploaded files to ftp.mariadb.org:
MDEV-13080-docker-export-ppc64le.tar.xz - the result of the docker export of the container. Will be exceptionally close to the docker mariadb official mariadb:10.2.18-bionic - basic configuration.MDEV-13080-datadir.tar.xz - datadir (unrecovered)
MLOG_CHECKPOINT is present in the redo log partially. MLOG_CHECKPOINT record is 9 bytes. 1 byte for MLOG_CHECKPOINT marker and 8 bytes for CHECKPOINT LSN.
MLOG_CHECKPOINT record is present partially (only 3 bytes) at the end of log block and next block is empty.
Scenario could be that checkpoint info is written before MLOG_CHECKPOINT record is written in the next block and server could be killed
before writing the block.
Yet to write mtr test case for it and read the code too
I tried to review the code, and I did not find anything obviously wrong, except potential race conditions in log_write_up_to(), for example here:
if (flush_to_disk |
&& (log_sys->n_pending_flushes > 0
|
|| !os_event_is_set(log_sys->flush_event))) {
|
/* Figure out if the current flush will do the job |
for us. */
|
bool work_done = log_sys->current_flush_lsn >= lsn; |
|
log_write_mutex_exit();
|
|
os_event_wait(log_sys->flush_event);
|
|
if (work_done) { |
return; |
How can we be sure that the desired work was actually completed here? Could there have been multiple pending writes, and could the event have been signaled for an earlier write, for something older than log_sys->current_flush_lsn? Perhaps we should check log_sys->flushed_to_disk_lsn before returning.
I tried to reproduce the failure, but in the end I only reproduced (and subsequently developed a fix for) MDEV-17849.
To reproduce this bug I use mtr -mem -force -max-test-fail=9999 -suite=innodb -par=5 innodb.undo_truncate_recover{,,,} -repeat=100 --. Sometimes also https://jira.mariadb.org/browse/MDEV-18038 happens.
Both failures are rare.
log_buffer_extend() is called in this test. When I disable it with --innodb-log-buffer-size=100000000 I see no errors at all. But I'm unsure because it's not easy to reproduce. Anyway this is my current guess that it's data races during redo log extending.
I put write_always = true; at the very beginning of log_checkpoint(). This makes checkpoints flushed more frequent. And both failures again disappear. Maybe the last checkpoint sometimes is not committed and this is the source of problems?
kevg, that is interesting. Could it be that we sometimes update last_checkpoint_lsn before the corresponding write to the checkpoint header becomes persistent? Could it be that fil_aio_wait() is invoked on the redo log pseudo-tablespace for something else than the latest pending checkpoint write? After all, we do allow multiple log_checkpoint() executions in parallel, interleaved with each other.
Could you add an assertion to log_complete_checkpoint() or its call path, to verify that the completed write was for the expected checkpoint?
Note: The redo log is not being extended while the server is running. Only the redo log write buffer is extended during this in 10.2 and 10.3. In 10.4, less redo log is being written thanks to the MDEV-17138 MLOG_MEMSET record.
My assumption about `write_always = true` was incorrect at least for MDEV-18038 which sometimes happen.
I implemented debug instrumentation that should make bugs in crash recovery and mariabackup --prepare more repeatable.
Usage:
./mtr --mysqld=--debug=d,ib_log_checkpoint_avoid …
|
The bug is inside log_buffer_extend(). MDEV-18038 too.
This patch makes log_buffer_extend() run on every mtr.commit().
diff --git a/storage/innobase/include/log0log.h b/storage/innobase/include/log0log.h
|
index 16399edad50..dd549baddce 100644
|
--- a/storage/innobase/include/log0log.h
|
+++ b/storage/innobase/include/log0log.h
|
@@ -81,7 +81,7 @@ log_free_check(void);
|
|
/** Extends the log buffer.
|
@param[in] len requested minimum size in bytes */
|
-void log_buffer_extend(ulong len);
|
+void log_buffer_extend();
|
|
/** Check margin not to overwrite transaction log from the last checkpoint.
|
If would estimate the log write to exceed the log_group_capacity,
|
diff --git a/storage/innobase/log/log0log.cc b/storage/innobase/log/log0log.cc
|
index 409b42c452a..4ffa906f86b 100644
|
--- a/storage/innobase/log/log0log.cc
|
+++ b/storage/innobase/log/log0log.cc
|
@@ -159,8 +159,10 @@ log_buf_pool_get_oldest_modification(void)
|
|
/** Extends the log buffer.
|
@param[in] len requested minimum size in bytes */
|
-void log_buffer_extend(ulong len)
|
+void log_buffer_extend()
|
{
|
+ size_t len = srv_log_buffer_size + 1;
|
+
|
byte tmp_buf[OS_FILE_LOG_BLOCK_SIZE];
|
|
log_mutex_enter_all();
|
diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc
|
index 92b1aa38a81..d4950ffba05 100644
|
--- a/storage/innobase/mtr/mtr0mtr.cc
|
+++ b/storage/innobase/mtr/mtr0mtr.cc
|
@@ -764,9 +764,7 @@ mtr_t::Command::prepare_write()
|
ut_ad(len > 0);
|
ut_ad(n_recs > 0);
|
|
- if (len > srv_log_buffer_size / 2) {
|
- log_buffer_extend(ulong((len + 1) * 2));
|
- }
|
+ log_buffer_extend();
|
|
ut_ad(m_impl->m_n_log_recs == n_recs); |
And with that patch I always see MDEV-13080 or MDEV-18038 in 10.3 with this simple test:
create or replace table t (a int) engine=innodb; |
insert into t values (1); |
drop table t; |
We will keep the log_buffer_extend() but remove the log file write from it.
Can you paste the body of the trigger?
We have a confirmed report about a leak in a trigger which uses a subquery:
MDEV-12992. Can yours be similar?