[MDEV-26327] Assertion `!"purge was not shut down"' in logs_empty_and_mark_files_at_shutdown during bootstrap Created: 2021-08-09  Updated: 2023-03-23

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2.40, 10.3.31
Fix Version/s: 10.3

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Thirunarayanan Balathandayuthapani
Resolution: Unresolved Votes: 0
Labels: need_rr

Issue Links:
Relates
relates to MDEV-14486 InnoDB hang on shutdown Confirmed
relates to MDEV-26658 InnoDB: Failing assertion: srv_get_ac... Closed

 Description   

origin/bb-10.2-MDEV-26131 eac954ccea83b42653acdffce8aab36125f7c513 2021-07-30T11:31:47+05:30
origin/bb-10.3-MDEV-26206-impl-gap-1 43da05ebff34b447618ea884d0506609ffc30903 2021-07-30T14:14:15+03:00
 
Bootstrap fails with
# 2021-08-09T12:43:22 [920403] | mysqld: /data/Server/bb-10.2-MDEV-26131/storage/innobase/log/log0log.cc:1912: void logs_empty_and_mark_files_at_shutdown(): Assertion `!"purge was not shut down"' failed.
# 2021-08-09T12:43:22 [920403] | 210809 12:42:45 [ERROR] mysqld got signal 6 ;
 
pluto:/data/Results/1628503850/TBR-1162/dev/shm/vardir/1628503850/63/1/rr
_RR_TRACE_DIR="." rr replay --mark-stdio
 
(rr) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000038354b779859 in __GI_abort () at abort.c:79
#2  0x000038354b779729 in __assert_fail_base (fmt=0x38354b90f588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5642e6ab2840 "!\"purge was not shut down\"", 
    file=0x5642e6ab04a0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/log/log0log.cc", line=1912, function=<optimized out>) at assert.c:92
#3  0x000038354b78af36 in __GI___assert_fail (assertion=0x5642e6ab2840 "!\"purge was not shut down\"", file=0x5642e6ab04a0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/log/log0log.cc", line=1912, 
    function=0x5642e6ab2380 "void logs_empty_and_mark_files_at_shutdown()") at assert.c:101
#4  0x00005642e5b932f8 in logs_empty_and_mark_files_at_shutdown () at /data/Server/bb-10.2-MDEV-26131/storage/innobase/log/log0log.cc:1912
#5  0x00005642e5de7010 in innodb_shutdown () at /data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0start.cc:2795
#6  0x00005642e5a433dc in innobase_end () at /data/Server/bb-10.2-MDEV-26131/storage/innobase/handler/ha_innodb.cc:4478
#7  0x00005642e558eca8 in ha_finalize_handlerton (plugin=0x62100002c3e8) at /data/Server/bb-10.2-MDEV-26131/sql/handler.cc:494
#8  0x00005642e501069f in plugin_deinitialize (plugin=0x62100002c3e8, ref_check=true) at /data/Server/bb-10.2-MDEV-26131/sql/sql_plugin.cc:1220
#9  0x00005642e50110f9 in reap_plugins () at /data/Server/bb-10.2-MDEV-26131/sql/sql_plugin.cc:1296
#10 0x00005642e5015280 in plugin_shutdown () at /data/Server/bb-10.2-MDEV-26131/sql/sql_plugin.cc:1957
#11 0x00005642e4dcf58c in clean_up (print_message=false) at /data/Server/bb-10.2-MDEV-26131/sql/mysqld.cc:2223
#12 0x00005642e4dcf322 in unireg_abort (exit_code=0) at /data/Server/bb-10.2-MDEV-26131/sql/mysqld.cc:2142
#13 0x00005642e4ddad87 in mysqld_main (argc=<error reading variable: Cannot access memory at address 0xfffffffffffffee0>, argv=<error reading variable: Cannot access memory at address 0xfffffffffffffef0>)
    at /data/Server/bb-10.2-MDEV-26131/sql/mysqld.cc:6115
#14 0x00005642e4dc9aad in main (argc=42, argv=0x7fffccc914d8) at /data/Server/bb-10.2-MDEV-26131/sql/main.cc:25
(rr)
 
Per RQG output the bootstrap command was
ulimit -c 0; cat "/dev/shm/vardir/1628503850/63/1/boot.sql" |  rr record --chaos --wait --mark-stdio "/data/Server_bin/bb-10.2-MDEV-26131_asan/bin/mysqld" "--no-defaults" "--basedir=/data/Server_bin/bb-10.2-MDEV-26131_asan" "--lc-messages-dir=/data/Server_bin/bb-10.2-MDEV-26131_asan/share" "--character-sets-dir=/data/Server_bin/bb-10.2-MDEV-26131_asan/share/charsets" "--tmpdir=/dev/shm/vardir/1628503850/63/1/tmp" "--datadir=/dev/shm/vardir/1628503850/63/1/data" "--loose-innodb_lock_schedule_algorithm=fcfs" "--loose-idle_write_transaction_timeout=0" "--loose-idle_transaction_timeout=0" "--loose-idle_readonly_transaction_timeout=0" "--connect_timeout=60" "--interactive_timeout=28800" "--slave_net_timeout=60" "--net_read_timeout=30" "--net_write_timeout=60" "--loose-table_lock_wait_timeout=50" "--wait_timeout=28800" "--lock-wait-timeout=86400" "--innodb-lock-wait-timeout=50" "--log_output=none" "--log_bin_trust_function_creators=1" "--loose-debug_assert_on_not_freed_memory=0" "--plugin-load-add=file_key_management.so" "--loose-file-key-management-filename=/home/mleich/RQG_O/conf/mariadb/encryption_keys.txt" "--loose-innodb_fatal_semaphore_wait_threshold=300" "--loose-innodb-sync-debug" "--innodb_stats_persistent=on" "--innodb_adaptive_hash_index=off" "--log-bin" "--sync-binlog=1" "--loose-innodb_evict_tables_on_commit_debug=off" "--loose-max-statement-time=30" "--innodb-use-native-aio=0" "--innodb_page_size=4K" "--innodb-buffer-pool-size=256M" "--sql-mode=no_engine_substitution" "--skip-log-bin" "--loose-innodb-encrypt-tables=OFF" "--loose-innodb-encrypt-log=OFF" "--log_error=/dev/shm/vardir/1628503850/63/1/boot.err" "--bootstrap" > "/dev/shm/vardir/1628503850/63/1/boot.log" 2>&1
 
The failure is quite rare because a few hundred tests used the same settings
during bootstrap and only two failed with the effect above.



 Comments   
Comment by Marko Mäkelä [ 2021-08-12 ]

This reminds me of MDEV-14486, where the purge coordinator thread has terminated but some worker threads exist. Does the rr replay trace match those symptoms?

Comment by Marko Mäkelä [ 2021-08-16 ]

This is not exactly the same as MDEV-14486, because thread apply all backtrace shows that the srv_purge_coordinator_thread still exists at the time when the signal was sent. One purge worker and the coordinator exist, and both were fighting over srv_sys.mutex at the time the SIGABRT was delivered:

10.2

Thread 15 (Thread 921343.924087):
#0  0x00002ae00a2ac8c7 in sched_yield () at /home/roc/rr/rr/src/preload/overrides.c:166
#1  0x00005642e5bf9729 in os_thread_yield () at /data/Server/bb-10.2-MDEV-26131/storage/innobase/os/os0thread.cc:212
#2  0x00005642e5aa511f in TTASEventMutex<GenericPolicy>::enter (this=0x5642e7601610 <srv_sys+144>, max_spins=90, max_delay=4, filename=0x5642e6bd86c0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc", line=1019) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/include/ib0mutex.h:496
#3  0x00005642e5a9e58c in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5642e7601610 <srv_sys+144>, n_spins=30, n_delay=4, name=0x5642e6bd86c0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc", line=1019) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/include/ib0mutex.h:637
#4  0x00005642e5dc9b7f in srv_free_slot (slot=0x5642e7601890 <srv_sys+784>) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc:1019
#5  0x00005642e5dcf935 in srv_worker_thread (arg=0x0) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc:2637
#6  0x00007f431544f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x000038354b876293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 
Thread 14 (Thread 921343.924085):
#0  0x00005642e5aacb69 in MutexDebug<TTASEventMutex<GenericPolicy> >::Context::locked (this=0x5642e7601630 <srv_sys+176>, mutex=0x5642e7601610 <srv_sys+144>, filename=0x5642e6bd86c0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc", line=962) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/include/sync0policy.h:79
#1  0x00005642e5aa8c8a in MutexDebug<TTASEventMutex<GenericPolicy> >::locked (this=0x5642e7601620 <srv_sys+160>, mutex=0x5642e7601610 <srv_sys+144>, name=0x5642e6bd86c0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc", line=962) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/include/sync0policy.ic:85
#2  0x00005642e5aa5366 in GenericPolicy<TTASEventMutex<GenericPolicy> >::locked (this=0x5642e7601620 <srv_sys+160>, mutex=..., filename=0x5642e6bd86c0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc", line=962) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/include/sync0policy.h:359
#3  0x00005642e5a9e5ae in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=0x5642e7601610 <srv_sys+144>, n_spins=30, n_delay=4, name=0x5642e6bd86c0 "/data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc", line=962) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/include/ib0mutex.h:639
#4  0x00005642e5dc96bc in srv_release_threads (type=SRV_WORKER, n=3) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc:962
#5  0x00005642e5dd10f2 in srv_purge_coordinator_thread (arg=0x0) at /data/Server/bb-10.2-MDEV-26131/storage/innobase/srv/srv0srv.cc:2931
#6  0x00007f431544f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x000038354b876293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment by Matthias Leich [ 2022-01-18 ]

The rr trace mentioned above was devaluated and had to be deleted.
sdp:
Some 10.2 with modifications which should not have an impact on MDEV-26327
STATUS_ALARM--rr--TBR-1162-MDEV-26327        /data/results/1642509111/002693.log    /data/results/1642509111/002693.tar.xz 142692KB

Comment by Marko Mäkelä [ 2023-03-23 ]

This might be related to MDEV-14486, which I do not think we have seen after 10.5 replaced the purge threads with thread pool tasks.

Generated at Thu Feb 08 09:44:28 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.