[MDEV-26464] InnoDB: Failing assertion: UT_LIST_GET_LEN(buf_pool->flush_list) == 0 during shutdown Created: 2021-08-23  Updated: 2023-08-07  Resolved: 2023-08-07

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.4.22
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Incomplete Votes: 0
Labels: need_rr, not-10.5, not-10.6, not-10.7

Issue Links:
Relates
relates to MDEV-19514 Defer change buffer merge until pages... Closed

 Description   

origin/10.4 2b66cd249384221e2a892f655fe84bb58ddc31bd 2021-08-23T10:44:06+03:00

Workflow:

  1. DB server start
  2. Generate some initial data
  3. 33 sessions run a DDL/DML mix concurrent. They disconnect after 300s.
  4. Shutdown attempt which ends with

    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1955505]2021-08-23  3:51:27 0 [Note] /data/Server_bin/10.4_asan/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1956164]2021-08-23  3:51:27 83 [Warning] Aborted connection 83 to db: 'mysql' user: 'root' host: 'localhost' (This connection closed normally)
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957093]2021-08-23  3:51:27 0 [Note] InnoDB: FTS optimize thread exiting.
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957281]2021-08-23  3:51:27 0 [Note] Event Scheduler: Purging the queue. 0 events
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957790]2021-08-23  3:51:29 0 [Note] InnoDB: Starting shutdown...
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957905]2021-08-23  3:51:29 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/vardir/1629713923/217/1/data/ib_buffer_pool
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957917]2021-08-23  3:51:29 0 [Note] InnoDB: Instance 0, restricted to 95 pages due to innodb_buf_pool_dump_pct=25
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957929]2021-08-23  3:51:29 0 [Note] InnoDB: Buffer pool(s) dump completed at 210823  3:51:29
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1958065]2021-08-23 03:51:29 0x66b427d20700[rr 3469573 1958068]  InnoDB: Assertion failure in file /data/Server/10.4/storage/innobase/buf/buf0flu.cc line 3348
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1958070]InnoDB: Failing assertion: UT_LIST_GET_LEN(buf_pool->flush_list) == 0
    # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1958072]InnoDB: We intentionally generate a memory trap.
    

Per marko:
----------------
1. A page was read before the shutdown

#3  0x0000560a62536e21 in buf_read_ibuf_merge_pages (sync=false, 
    space_ids=0x7e8e3789f710, page_nos=0x7e8e3789f6b0, n_stored=1)
    at /data/Server/10.4/storage/innobase/buf/buf0rea.cc:805
#4  0x0000560a620290ba in ibuf_merge_pages (n_pages=0x7e8e3789fd90, sync=false)
    at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:2462
#5  0x0000560a62029e41 in ibuf_merge (n_pages=0x7e8e3789fd90, sync=false)
    at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:2567
#6  0x0000560a6202a407 in ibuf_merge_in_background (full=false)
    at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:2638
#7  0x0000560a62316a0b in srv_master_do_active_tasks ()
    at /data/Server/10.4/storage/innobase/srv/srv0srv.cc:2147

2. There happens a change buffer merge after the begin of the shutdown process

#7  0x0000560a620d2562 in mtr_t::commit (this=0x2f2d612b0830)
    at /data/Server/10.4/storage/innobase/mtr/mtr0mtr.cc:452
#8  0x0000560a6201eecd in btr_pcur_commit_specify_mtr (pcur=0x2f2d612b04c0, 
    mtr=0x2f2d612b0830)
    at /data/Server/10.4/storage/innobase/include/btr0pcur.ic:345
#9  0x0000560a6201feb1 in ibuf_btr_pcur_commit_specify_mtr (
    pcur=0x2f2d612b04c0, mtr=0x2f2d612b0830)
    at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:322
#10 0x0000560a62035092 in ibuf_merge_or_delete_for_page (block=0x5ac0422418b8, 
    page_id=..., zip_size=0)
    at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:4572
#11 0x0000560a624df00d in buf_page_io_complete (bpage=0x5ac0422418b8, 
    dblwr=true, evict=false)
    at /data/Server/10.4/storage/innobase/buf/buf0buf.cc:6201
#12 0x0000560a62610488 in fil_aio_wait (segment=2)
    at /data/Server/10.4/storage/innobase/fil/fil0fil.cc:4375

3. The bug does not seem to be a duplicate of MDEV-12630.
4. MDEV-19514 Resolved: 2019-10-11 16:57 , Fix Version/s: 10.5.0 should have fixed that trouble.

RQG
------

git clone https://github.com/mleich1/rqg --branch experimental RQG
 
perl rqg.pl \
--grammar=conf/mariadb/innodb_compression_encryption.yy \
--gendata=conf/mariadb/innodb_compression_encryption.zz \
--max_gd_duration=1800 \
--mysqld=--loose-innodb-encryption-threads=7 \
--mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--connect_timeout=60 \
--mysqld=--interactive_timeout=28800 \
--mysqld=--slave_net_timeout=60 \
--mysqld=--net_read_timeout=30 \
--mysqld=--net_write_timeout=60 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--wait_timeout=28800 \
--mysqld=--lock-wait-timeout=86400 \
--mysqld=--innodb-lock-wait-timeout=50 \
--no-mask \
--queries=10000000 \
--seed=random \
--reporters=Backtrace \
--reporters=ErrorLog \
--reporters=Deadlock1 \
--validators=None \
--mysqld=--log_output=none \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--engine=InnoDB \
--restart_timeout=240 \
--mysqld=--plugin-load-add=file_key_management.so \
--mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
--duration=300 \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
--mysqld=--loose-innodb-sync-debug \
--mysqld=--innodb_stats_persistent=on \
--mysqld=--innodb_adaptive_hash_index=on \
--mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
--mysqld=--loose-max-statement-time=30 \
--threads=33 \
--mysqld=--innodb-use-native-aio=0 \
--rr=Extended \
--rr_options=--chaos --wait \
--mysqld=--innodb_page_size=64K \
--mysqld=--innodb-buffer-pool-size=24M \
--no_mask \
--workdir=<local settings> \
--vardir=<local settings> \
--mtr-build-thread=<local settings> \
--basedir1=<local settings> \
--script_debug=_nix_

sdp:/data/Results/1629713923/TBR-1163/dev/shm/vardir/1629713923/217/1/rr



 Comments   
Comment by Marko Mäkelä [ 2022-08-02 ]

Sorry, I had neglected this too long, and the rr replay trace is no longer available. We would need a new one, if we still care to debug and fix this before 10.4 reaches its end of life.

Comment by Vitaliy Zyukin [ 2023-06-01 ]

I've provided a steps to reproduce this issue on oracle forum: https://bugs.mysql.com/bug.php?id=85585
Adding it here as well, hopefully it will help.

Comment by Marko Mäkelä [ 2023-06-02 ]

VitalyZyukin, thank you. I suspect that this bug is specific to the MariaDB Server 10.4 release only (and 10.2 and 10.3, which already reached EOL). In those releases, the I/O layer is very similar to MySQL 5.7. In MariaDB Server 10.5, the I/O subsystem was rewritten and there is only one buf_flush_page_cleaner thread and one buffer pool instance. Also, MDEV-19514 removed the function ibuf_merge_in_background() in 10.5.

A reproducible test case would help a lot. Our internal InnoDB testing efforts are concentrated on later releases, mainly MariaDB Server 10.6 and later.

A simple work-around would seem be to set innodb_change_buffering=none. The default was changed in MDEV-27734 only starting with 10.5.

Comment by Marko Mäkelä [ 2023-06-02 ]

VitalyZyukin, based on your comment in the MySQL bug system, you reproduced this bug on MariaDB Server 10.4.10. (They used to remove or make private comments that mention MariaDB. Luckily it had not happened here yet.)

Can you test a newer version? I think that this bug may have been fixed by MDEV-30422 in MariaDB Server 10.3.38 and 10.4.28.

Comment by Vitaliy Zyukin [ 2023-06-05 ]

Just in case I'm going to duplicate the content here:

I was able to reproduce it on MariaDB 10.4.10. Couple of times already. It doesn't always happen on our hosts, it happen once in 10 executions on average.

The way to reproduce on my end was creating 8 processes in python that will delete 100 records per each delete query in one table.

This procedure is taking place when database is prepared for a shutdown, meaning replication is OFF and no traffic coming in or going out.

Additional parameters:
sql_log_bin=0 and innodb_flush_log_at_trx_commit=0

Next code is capable of deleting ~5 million records in 2 minutes:

```
...
func_args = list()
for i in range(from_id, to_id, 100000):
last_id = (i + 100000) if (i + 100000) < to_id else to_id
func_args.append((table_name, i, last_id))

with multiprocessing.Pool(8) as p:
p.starmap(self.delete_range, func_args)

def delete_range(self, table_name, from_id, to_id):
conn = mysql.connector.connect(**self.connect_args)
cursor = conn.cursor()
cursor.execute("SET sql_log_bin=0;")
cursor.execute("SET GLOBAL innodb_flush_log_at_trx_commit=0;")

for i in range(from_id, to_id, 100):
last_id = (i + 100) if (i + 100) < to_id else to_id

  1. delete multiple ids at the time
    cursor.execute("DELETE FROM {} WHERE id IN {};".format(table_name, tuple(range(i, last_id))))

cursor.close()
conn.close()
```

On my end the bug is still for the same function: UT_LIST_GET_LEN but a different location

```
2023-05-31 8:11:21 0 [Note] InnoDB: Starting shutdown...
2023-05-31 08:11:28 0x7eb6d979d700 InnoDB: Assertion failure in file src/storage/innobase/buf/buf0flu.cc line 3353
InnoDB: Failing assertion: UT_LIST_GET_LEN(buf_pool->flush_list) == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
230531 8:11:28 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.4.10-MariaDB-log
key_buffer_size=8388608
read_buffer_size=16777216
max_used_connections=29
max_threads=65537
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1208067437 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
<Path>/bin/mysqld(my_print_stacktrace+0x29)[0x55d43315d7b9]
<Path>/bin/mysqld(handle_fatal_signal+0x4d0)[0x55d432c2e8f0]
/lib64/libpthread.so.0(+0x118e0)[0x7f01a7fc38e0]
/lib64/libc.so.6(gsignal+0x110)[0x7f01a715cca0]
/lib64/libc.so.6(abort+0x148)[0x7f01a715e148]
<Path>/bin/mysqld(+0x565705)[0x55d432965705]
<Path>/bin/mysqld(+0xb43a06)[0x55d432f43a06]
/lib64/libpthread.so.0(+0x744b)[0x7f01a7fb944b]
/lib64/libc.so.6(clone+0x3f)[0x7f01a721852f]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /mysql/data
Resource Limits:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 10485760 10485760 bytes
Max core file size 0 0 bytes
Max resident set unlimited unlimited bytes
Max processes unlimited unlimited processes
Max open files 65535 65535 files
Max locked memory unlimited unlimited bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 30446 30446 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Core pattern: core
```

Comment by Vitaliy Zyukin [ 2023-06-05 ]

Unfortunately, I do not have resources to reproduce it on different MariaDB versions. My solution was to keep `innodb_flush_log_at_trx_commit` as 1. And this slowed the deletion speed.

Comment by Marko Mäkelä [ 2023-06-06 ]

If you are running MariaDB Server 10.4.10 on some hosts, it could be a good idea to upgrade to at least some of them to the latest 10.4, to see if the bug is still reproducible.

Generated at Thu Feb 08 09:45:31 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.