Details
-
Bug
-
Status: Closed (View Workflow)
-
Blocker
-
Resolution: Fixed
-
10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
Description
mleich produced an rr replay trace where crash recovery fails like this:
2023-08-01 8:26:12 0 [Note] InnoDB: Multi-batch recovery needed at LSN 78671708
|
2023-08-01 8:26:12 0 [Note] InnoDB: End of log at LSN=88328740
|
2023-08-01 8:26:12 0 [Note] InnoDB: To recover: LSN 78752404/88328740; 875 pages
|
2023-08-01 8:26:13 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
|
The immediate reason for this is that recovery wants to read page 3 of tablespace 82 (test/B.ibd). That page had been left as all-zero by the server that had been killed right before the server restart. The buf_flush_page_cleaner() thread had invoked buf_flush_discard_page() on the page, with oldest_modification()=74626640 and FIL_PAGE_LSN=87174362. The reason for that is that the file was going to be deleted:
(rr) backtrace
|
#0 fil_space_t::check_pending_operations (id=82) at /data/Server/bb-11.2-MDEV-14795E/storage/innobase/fil/fil0fil.cc:1707
|
#1 0x000055c51e4f7f46 in fil_delete_tablespace (id=82) at /data/Server/bb-11.2-MDEV-14795E/storage/innobase/fil/fil0fil.cc:1774
|
#2 0x000055c51e4f2643 in trx_t::commit (this=this@entry=0xb3b200fc180, deleted=std::vector of length 0, capacity 0) at /data/Server/bb-11.2-MDEV-14795E/storage/innobase/dict/drop.cc:270
|
#3 0x000055c51e3d0982 in ha_innobase::delete_table (this=<optimized out>, name=<optimized out>) at /data/Server/bb-11.2-MDEV-14795E/storage/innobase/handler/ha_innodb.cc:13689
|
#4 0x000055c51e16ec05 in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /data/Server/bb-11.2-MDEV-14795E/sql/handler.cc:578
|
#5 0x000055c51e1742e9 in ha_delete_table (thd=thd@entry=0x461168000c68, hton=hton@entry=0x55c520a22b88, path=path@entry=0x18885dda5e80 "./test/B", db=db@entry=0x461168010b20, alias=alias@entry=0x461168010b30,
|
generate_warning=generate_warning@entry=false) at /data/Server/bb-11.2-MDEV-14795E/sql/handler.cc:3192
|
...
|
(rr) finish
|
Run till exit from #0 fil_space_t::check_pending_operations (id=82) at /data/Server/bb-11.2-MDEV-14795E/storage/innobase/fil/fil0fil.cc:1707
|
|
Thread 3 received signal SIGKILL, Killed.
|
The DDL statement was CREATE OR REPLACE TABLE B AS SELECT * FROM `t8` /* E_R Thread32 QNO 21 CON_ID 29 */ but I think that this can affect any DDL operation that can delete InnoDB .ibd files: DROP TABLE, TRUNCATE TABLE, ALTER TABLE, OPTIMIZE TABLE.
The problem appears to be that fil_delete_tablespace() is first setting the fil_space_t::STOPPING flag and only then writing the FILE_DELETE record. In the above execution, that record was never written and therefore never recovered.
I got copy of the data directory, but the data.tar.gz is 12MiB, while the maximum file size limit of this Jira is only 10MiB.
origin/bb-10.6-MDEV-31826 94cecbfc68698cd410ee7f8d03218d13cb9b88b5 2023-09-04T18:02:16+05:30 tends serious to DB server hangs and long semaphore waits.
sdp:/data1/results/1693843747/Freeze1/rqg.log (no crash recovery involved)
line 894 ERROR: Reporter 'Deadlock': Got no response from server to query '/*!100108 SET @@max_statement_time = 0 */' within 195s.
INFO: Reporter 'Deadlock': Killing <Db server> with pid 2100035 with SIGHUP in order to force debug output.
line 930 INFO: Reporter 'Deadlock': Killing <Db server> with pid 2100035 with SIGSEGV in order to capture core.
line 1012 Quite many entries
[rr 2100035 177568]2023-09-04 9:11:38 38 [Warning] InnoDB: Trying to delete tablespace './test/#sql-alter-200b43-26.ibd' but there are 19 pending operations
over a long timespan.
line 1034 debug output from the server error log
line 1508 backtrace
sdp:/data1/results/1693843747/Freeze1$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio
----------------------------------------------------
Scenario:
1. Several sessions run concurrent some DDL/DML mix.
2. During 1. is ongoing intentional SIGKILL of the DB server
3. Restart with success
4. Only one connection is active and checks tables, runs OPTIMIZE etc.
Than the server disappears during an "OPTIMIZE TABLE `test`.`t7`"
Within the server error log are
several
[Warning] InnoDB: Trying to delete tablespace './test/#sql-ib128.ibd' but there are 92 pending operations
followed by
[ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch.
sdp:/data1/results/1693843747/semaphore_wait$ _RR_TRACE_DIR=./1/rr rr replay --mark-stdio