MariaDB Server
MDEV-22169

Recovery fails after failing to insert into mlog_init

Details

    Description

      Workflow of the RQG test
       
1. Create a DB server and start it.
2. Three sessions start to run some SQL stream against this server.
     RQG code: INSERT IGNORE INTO _table ( _field_int ) VALUES ( _smallint_unsigned ) ;
     _table --> one of the tables created in SCHEMA test at the beginning of the RQG run
     _field_int --> a randomly chosen ("diced") column of type int
3. At some point in time, unrelated to the state of the sessions from step 2 (waiting for a result, sending a statement, ...), the server process gets killed (SIGKILL, not SIGTERM; see the sketch after the log excerpt below).
4. Make a copy of the data, logs, etc. of the killed server.
5. Try to restart the killed server.
       
Step 5 fails with:
      ...
      2020-04-06 12:44:07 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=29339963
      2020-04-06 12:44:22 0 [Note] InnoDB: Read redo log up to LSN=24178688
      2020-04-06 12:44:25 0 [Note] InnoDB: Starting a batch to recover 1429 pages from redo log.
      2020-04-06 12:44:37 0 [Note] InnoDB: To recover: 500 pages from log
      2020-04-06 12:44:38 0 [ERROR] InnoDB: Not applying INSERT_HEAP_DYNAMIC due to corruption on [page id: space=7, page number=110]
      2020-04-06 12:44:38 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
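
For illustration, the hard kill in step 3 boils down to the following minimal sketch (the pid is hypothetical; in the real run the test harness determines the server pid itself):

  /* Minimal sketch of step 3: SIGKILL gives mysqld no chance to flush
     buffers or shut down cleanly, so the next start must run crash
     recovery. The pid here is hypothetical. */
  #include <signal.h>
  #include <sys/types.h>

  int main()
  {
    pid_t mysqld_pid= 12345;          /* hypothetical server pid */
    return kill(mysqld_pid, SIGKILL); /* same effect as: kill -9 12345 */
  }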
       
      RQG
      git clone https://github.com/mleich1/rqg --branch experimental RQG_mleich1
      origin/experimental d417497b476e263428422b4640cf63b8e9d16afe 2020-03-30T17:17:54+02:00
       
MariaDB
origin/10.5 778a174e5e9cad846c47933fcf4727532487cab2 2020-04-06T10:47:11+03:00
built with debug
       
      perl rqg.pl \                    
      --duration=300 \
      --queries=10000000 \
      --threads=3 \
      --no_mask \
      --seed=random \
      --gendata=conf/mariadb/oltp.zz \ 
      --engine=InnoDB \
      --rpl_mode=none \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --mysqld=--log-output=none \
      --mysqld=--loose_innodb_lock_schedule_algorithm=fcfs \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--innodb_stats_persistent=off \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--lock-wait-timeout=86400 \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--connect_timeout=60 \
      --mysqld=--loose_innodb_use_native_aio=1 \
      --mysqld=--log-bin \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--net_read_timeout=30 \ 
      --mysqld=--innodb_page_size=8K \ 
      --mysqld=--innodb-buffer-pool-size=32M \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --reporters=CrashRecovery1,Deadlock1,ErrorLog,None,ServerDead \
      --validators=None \
      --grammar=rqg.yy \
      --workdir=<local settings> \
      --vardir=<local settings> \
      --mtr-build-thread=<local settings> \
      --basedir1=<local settings> \
      --script_debug=_nix_
       
The uploaded RQG protocol "rqg.log" shows exactly that RQG run.
      
      

      Attachments

        1. rqg.log
          82 kB
          Matthias Leich
        2. rqg.yy
          0.1 kB
          Matthias Leich


          Activity

mleich Matthias Leich added a comment:

Remnants, with rr trace, of the RQG run described above, on "sdp":
/home/mleich/RQG_RR/storage/1586176150/000361.tgz


marko Marko Mäkelä added a comment:

I painstakingly debugged the rr replay of the server that was killed before the recovery, and everything is fine there, as far as I can tell.

The problem occurs at crash recovery. My computer recovered the saved dataset just fine (only complaining about an internal XA transaction that was recovered in the XA PREPARE state), even after I reduced innodb_buffer_pool_size to 32M.

We have an rr replay of the failed recovery:

            cd /home/mleich/RQG_RR/storage/1586176150/tmp
_RR_TRACE_DIR=/dev/shm/vardir/1586176150/141/rr_trace rr replay
            

We switch to STORE_NO mode, and then we simply throw away the FREE_PAGE record that we successfully parsed for the problematic page:

# breakpoint 1: where the switch to STORE_NO mode happens (see "when 1586" below)
break log0recv.cc:2710
# breakpoint 2: where the parsed record for a page is handled
break log0recv.cc:1946
# restrict breakpoint 2 to [page id: space=7, page number=110];
# m_id packs the space id into the upper and the page number into the
# lower 32 bits: 0x7_0000006e
cond 2 id.m_id==0x70000006e
break sql_print_error
continue
# "when" prints the current rr event number at each stop
when
continue
when

There will be no attempt to re-read the redo log record. You should see that at rr event 1586 (the output of the "when" command) we switch to STORE_NO mode, and at event 1593 we throw away the FREE_PAGE record.

            The call to mlog_init.add() in recv_sys_t::add() will be skipped, because store==STORE_NO.
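
Roughly, the control flow described above looks like this (a simplified sketch; apart from STORE_NO and the reference to mlog_init.add(), all names are stand-ins, not the actual InnoDB signatures):

  /* Simplified sketch of the flow described above. */
  enum store_t { STORE_YES, STORE_IF_EXISTS, STORE_NO };

  struct page_id_t { unsigned space, page_no; };

  void buffer_record(const page_id_t &id);  /* stand-in: queue record for apply */
  void mlog_init_add(const page_id_t &id);  /* stand-in for mlog_init.add() */

  void on_parsed_record(store_t store, const page_id_t &id, bool init_page)
  {
    if (store == STORE_NO)
      return;            /* record thrown away: mlog_init never learns that
                            the page will be fully (re)initialized */
    buffer_record(id);
    if (init_page)
      mlog_init_add(id); /* the call that is skipped in the failing run */
  }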

            As far as I can tell, there could be a similar problem with page initialization records already in 10.2. Before MDEV-15528, the impact should be less severe, only causing unnecessary reads of pages that would be completely initialized by redo log records, which MDEV-12699 intended to avoid (and MDEV-20688, MDEV-21572 fixed in some special cases).
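
For comparison, the read-avoidance that MDEV-12699 introduced, and that depends on those mlog_init entries, can be sketched as follows (again with made-up names):

  struct page_id_t { unsigned space, page_no; };

  bool mlog_init_lookup(const page_id_t &id); /* made-up name for the lookup */

  bool must_read_page_from_file(const page_id_t &id)
  {
    /* With an mlog_init entry, recovery can start from a zero-filled page
       and apply redo; without one (e.g. because the add() was skipped
       under STORE_NO), it must read the old page image from the data
       file. In 10.5 that stale image makes applying INSERT_HEAP_DYNAMIC
       fail as "corruption"; before MDEV-15528 it should only have cost
       an unnecessary page read. */
    return !mlog_init_lookup(id);
  }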


            People

              marko Marko Mäkelä
              mleich Matthias Leich
Votes: 0
Watchers: 2

