[MDEV-26658] InnoDB: Failing assertion: srv_get_active_thread_type() == SRV_NONE Created: 2021-09-21  Updated: 2023-02-22  Resolved: 2023-02-22

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3.32, 10.7.0, 10.4.26
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Matthias Leich
Resolution: Cannot Reproduce Votes: 0
Labels: need_rr

Issue Links:
Relates
relates to MDEV-26327 Assertion `!"purge was not shut down"... Open

Description

InnoDB: Assertion failure in file /data/Server/10.3/storage/innobase/log/log0log.cc line 1819
InnoDB: Failing assertion: srv_get_active_thread_type() == SRV_NONE
hit during bootstrap on
- origin/10.3 25d6bbcd5172eb9af0a68be11fb662457a7d572b 2021-09-21T14:44:39+07:00
- origin/preview-10.7-MDEV-24621-innodb-bulk-insert ff373ddd28764dd6c58c1a29f52d728308eb1011 2021-09-20T14:05:09+03:00
   I am aware that the latter is a development tree.
   But given that the problem already exists in 10.3,
   I assume that it is present in all current trees 10.3 - 10.7.
 
(rr) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00001a1e08ee7859 in __GI_abort () at abort.c:79
#2  0x000055df28ec543c in ut_dbg_assertion_failed (expr=0x55df29d6f580 "srv_get_active_thread_type() == SRV_NONE", file=0x55df29d6cf00 "/data/Server/10.3/storage/innobase/log/log0log.cc", line=1819)
    at /data/Server/10.3/storage/innobase/ut/ut0dbg.cc:60
#3  0x000055df28b47913 in logs_empty_and_mark_files_at_shutdown () at /data/Server/10.3/storage/innobase/log/log0log.cc:1819
#4  0x000055df28dfd149 in innodb_shutdown () at /data/Server/10.3/storage/innobase/srv/srv0start.cc:2565
#5  0x000055df289b7b24 in innobase_end () at /data/Server/10.3/storage/innobase/handler/ha_innodb.cc:4416
#6  0x000055df284aea47 in ha_finalize_handlerton (plugin=0x62100002c548) at /data/Server/10.3/sql/handler.cc:488
#7  0x000055df27e05350 in plugin_deinitialize (plugin=0x62100002c548, ref_check=true) at /data/Server/10.3/sql/sql_plugin.cc:1236
#8  0x000055df27e05e8f in reap_plugins () at /data/Server/10.3/sql/sql_plugin.cc:1312
#9  0x000055df27e0a56d in plugin_shutdown () at /data/Server/10.3/sql/sql_plugin.cc:1981
#10 0x000055df27b4d6e9 in clean_up (print_message=false) at /data/Server/10.3/sql/mysqld.cc:2243
#11 0x000055df27b4d47a in unireg_abort (exit_code=0) at /data/Server/10.3/sql/mysqld.cc:2159
#12 0x000055df27b5968d in mysqld_main (argc=<error reading variable: Cannot access memory at address 0xfffffffffffffee0>, argv=<error reading variable: Cannot access memory at address 0xfffffffffffffef0>) at /data/Server/10.3/sql/mysqld.cc:6206
#13 0x000055df27b47c8d in main (argc=40, argv=0x7ffeae4bce98) at /data/Server/10.3/sql/main.cc:25
(rr)
 
Archive: pluto:/RQG/storage/1632223326/003024.tar.xz
pluto:/RQG/storage/1632223326/TBR-1196/dev/shm/vardir/1632223326/142/1/rr
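
For reference, a rough sketch of how such an archived rr trace can be replayed (paths taken from the archive locations above; the replay invocation is the same one quoted in the comments below):

# Unpack the archived trace (archive name/path as above):
tar -xJf 003024.tar.xz
cd TBR-1196/dev/shm/vardir/1632223326/142/1/rr
# rr replay drops us into gdb at the start of the recorded execution:
_RR_TRACE_DIR="." rr replay --mark-stdio
# Then run forward to the assertion and inspect it:
(rr) break ut_dbg_assertion_failed
(rr) continue
(rr) bt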
 
RQG
===
I do not think that RQG is required for replaying the problem,
but the CPU load generated by rqg_batch.pl may help to replay it faster.
 
# git clone https://github.com/mleich1/rqg --branch experimental RQG
#
# GIT_SHOW: HEAD -> experimental 3474f2510e1df557c97bb6f17d38d731950501f0 2021-09-14T13:18:24+02:00
# rqg.pl  : Version 3.4 (2021-09)
#
# $RQG_HOME/rqg.pl \
# --grammar=MDEV-18334.yy \
# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--wait_timeout=28800 \
# --mysqld=--lock-wait-timeout=86400 \
# --mysqld=--innodb-lock-wait-timeout=50 \
# --no-mask \
# --queries=10000000 \
# --seed=random \
# --reporters=Backtrace \
# --reporters=ErrorLog \
# --reporters=Deadlock1 \
# --validators=None \
# --mysqld=--log_output=none \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --engine=InnoDB \
# --restart_timeout=240 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --duration=300 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --mysqld=--loose-innodb-sync-debug \
# --mysqld=--innodb_stats_persistent=off \
# --mysqld=--innodb_adaptive_hash_index=on \
# --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
# --mysqld=--loose-max-statement-time=30 \
# --threads=9 \
# --mysqld=--innodb-use-native-aio=0 \
# --rr=Extended \
# --rr_options=--chaos --wait \
# --mysqld=--innodb_page_size=8K \
# --mysqld=--innodb-buffer-pool-size=8M \
# --no_mask \
# --workdir=<local settings> \
# --vardir=<local settings> \
# --mtr-build-thread=<local settings> \
# --basedir1=<local settings> \
# --script_debug=_nix



Comments
Comment by Marko Mäkelä [ 2022-08-02 ]

mleich, sorry, I had neglected this for too long, and the traces are no longer available.

This might occur due to an anomaly of rr record, which can cause very unfair thread scheduling and thus the starvation of some I/O request. But it could also demonstrate a genuine fault.
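
A sketch of one way to tell these apart, based on the RQG options quoted above: re-record without rr's chaos scheduling, since --chaos deliberately makes the thread scheduling unfair.

# As recorded above (unfair, randomized scheduling):
# --rr=Extended \
# --rr_options=--chaos --wait \
# Re-recording with rr's default scheduler (drop --chaos); if the
# assertion still fires, an rr scheduling artifact is less likely:
# --rr=Extended \
# --rr_options=--wait \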

Comment by Matthias Leich [ 2022-08-15 ]

origin/10.4 9a897335eb4387980aed7698b832a893dbaa3d81 2022-07-26T16:45:10+03:00
sdp:/data/results/1658847117/TBR-1196-MDEV-26658/dev/shm/rqg/1658847117/14/1/rr$ _RR_TRACE_DIR="." rr replay --mark-stdio

Comment by Marko Mäkelä [ 2022-08-16 ]

Quick analysis of the hang: several purge threads (including the coordinator) are waiting to exit. We have srv_buf_pool_instances=1, and buf_pool_ptr->flush_list is empty. The I/O completion threads seem to be idle. Without more debugging, it is hard to say whether this is just a simple anomaly caused by the sync_array, which was removed in MDEV-21452 (10.6.0).
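
A minimal sketch of how these observations can be re-checked in the rr trace (generic gdb commands; the variables are the ones named above — info threads and the per-thread backtraces show where the purge and I/O threads are waiting, srv_buf_pool_instances should print 1, and buf_pool_ptr->flush_list should be empty):

_RR_TRACE_DIR="." rr replay --mark-stdio
(rr) info threads
(rr) thread apply all bt
(rr) print srv_buf_pool_instances
(rr) print buf_pool_ptr->flush_list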

What did the crash or hang on 10.7.0 (or anything newer than 10.5) look like?

Comment by Marko Mäkelä [ 2023-02-22 ]

Unfortunately, the rr replay trace for 10.4 is no longer valid.

Does this occur on 10.6 or later? In 10.5, I believe that the sync_array related synchronization primitives can sometimes cause hangs due to lost signals.

Comment by Matthias Leich [ 2023-02-22 ]

The failure was last observed on 2022-09-12, on
origin/bb-10.4-MDEV-23801 bc5145982fda084901e4f924e6ad563067649842 2022-09-12T14:08:54+03:00
 
I have kept roughly reliable bookkeeping of RQG test results since 2021-04.
In that data I found only two hits of the problem, both on some 10.4 tree.
AFAIR, failures during bootstrap were more frequent before 2021-04.
 
So I assume that the problem is fixed.

Comment by Matthias Leich [ 2023-02-22 ]

No longer reproducible since ~2022-10.

