[MDEV-26293] InnoDB: Failing assertion: space->is_ready_to_close() || space->purpose == FIL_TYPE_TEMPORARY || srv_fast_shutdown == 2 || !srv_was_started Created: 2021-08-02  Updated: 2023-11-22  Resolved: 2022-06-30

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.5.9
Fix Version/s: 10.5.17, 10.6.9, 10.7.5, 10.8.4, 10.9.2

Type: Bug Priority: Major
Reporter: srinivas chandolu Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: crash, race, regression-10.5
Environment:

Microsoft Windows Server 2016 Standard, VMware x-64 based PC, 32 GB Memory


Attachments: File MDEV-26293.cfg     File MDEV-26293.yy     File VK-REL-NSM.err     File WIN-G57FI8VPMSD.err     File WIN104.err     Text File crashreport.txt     PNG File image-2021-08-11-16-01-07-223.png     File mariadbd.dmp     File mariadbd.exe     File mariadbd.pdb     File mariadbd104.dmp     File mariadbd182.dmp     Text File out.txt     PNG File screenshot-1.png     Zip Archive serverdll.zip     File serverpdb.zip.001     File serverpdb.zip.002     File serverpdb.zip.003     File serverpdb.zip.004    
Issue Links:
Problem/Incident
is caused by MDEV-23855 InnoDB log checkpointing causes regre... Closed
Relates
relates to MDEV-26381 InnoDB: Failing assertion: fsize != o... Open
relates to MDEV-32851 Assertion failure crash Closed

 Description   

MariaDB has crashed due to the following error. Is it possible to know the exact reason for this crash? Attached mariadbd dump and database error file for reference.

2021-08-01 04:09:31 0x11d8  InnoDB: Assertion failure in file D:\winx64-packages\build\src\storage\innobase\fil\fil0fil.cc line 488
InnoDB: Failing assertion: space->is_ready_to_close() || space->purpose == FIL_TYPE_TEMPORARY || srv_fast_shutdown == 2 || !srv_was_started
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
210801  4:09:31 [ERROR] mysqld got exception 0x80000003 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.5.9-MariaDB-log
key_buffer_size=33554432
read_buffer_size=8388608
max_used_connections=64
max_threads=65537
thread_count=56
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 52371 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x274c52f9038
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
server.dll!my_parameter_handler()
ucrtbase.dll!raise()
ucrtbase.dll!abort()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!json_valid()
server.dll!?ha_open@handler@@QEAAHPEAUTABLE@@PEBDHIPEAUst_mem_root@@PEAV?$List@VString@@@@@Z()
server.dll!?open_table_from_share@@YA?AW4open_frm_error@@PEAVTHD@@PEAUTABLE_SHARE@@PEBUst_mysql_const_lex_string@@IIIPEAUTABLE@@_NPEAV?$List@VString@@@@@Z()
server.dll!?open_table@@YA_NPEAVTHD@@PEAUTABLE_LIST@@PEAVOpen_table_context@@@Z()
server.dll!?open_and_lock_tables@@YA_NPEAVTHD@@AEBUDDL_options_st@@PEAUTABLE_LIST@@_NIPEAVPrelocking_strategy@@@Z()
server.dll!?open_tables@@YA_NPEAVTHD@@AEBUDDL_options_st@@PEAPEAUTABLE_LIST@@PEAIIPEAVPrelocking_strategy@@@Z()
server.dll!?mysqld_show_create_get_fields@@YA_NPEAVTHD@@PEAUTABLE_LIST@@PEAV?$List@VItem@@@@PEAVString@@@Z()
server.dll!?mysqld_show_create@@YA_NPEAVTHD@@PEAUTABLE_LIST@@@Z()
server.dll!?mysql_execute_command@@YAHPEAVTHD@@@Z()
server.dll!?mysql_parse@@YAXPEAVTHD@@PEADIPEAVParser_state@@_N3@Z()
server.dll!?dispatch_command@@YA_NW4enum_server_command@@PEAVTHD@@PEADI_N3@Z()
server.dll!?do_command@@YA_NPEAVTHD@@@Z()
server.dll!?pool_of_threads_scheduler@@YAXPEAUscheduler_functions@@PEAKPEAV?$Atomic_counter@I@@@Z()
server.dll!?tp_callback@@YAXPEAUTP_connection@@@Z()
KERNEL32.DLL!TermsrvOpenRegEntry()
ntdll.dll!RtlReleasePebLock()
ntdll.dll!RtlReleaseSRWLockExclusive()
KERNEL32.DLL!BaseThreadInitThunk()
ntdll.dll!RtlUserThreadStart()
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x274c2c215b0): show create table `iv_crl_ca_details`
Connection ID (thread ID): 39444
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file at C:\Program Files\McAfee\Network Security Manager\MariaDB\data\
Minidump written to C:\Program Files\McAfee\Network Security Manager\MariaDB\data\mariadbd.dmp



 Comments   
Comment by Vladislav Vaintroub [ 2021-08-02 ]

There is a mysqld.pdb or mariadbd.pdb depending on the version, that is distributed in all MariaDB packages by us.
schandol, it would be nice if you put that into your McAfee distribution, next to the executable, so that we do not always have to decipher the stacktrace

Comment by Vladislav Vaintroub [ 2021-08-02 ]

schandol, this does not seem to be our 10.5.9 either. At least both Visual Studio and cdb debugger detect mismatches. there is some noisy symbol loading information in out.txt that says that.

DBGHELP: C:\bug\server.dll - file not found
DBGHELP: c:\bug\mariadb-10.5.9-winx64\bin\server.dll - mismatched timestamp
DBGHELP: server.dll not found in c:\bug\mariadb-10.5.9-winx64\bin
DBGENG:  C:\Program Files\McAfee\Network Security Manager\MariaDB\bin\server.dll - Couldn't map image from disk.
DBGENG:  server.dll - Partial symbol image load missing image info
DBGHELP: Module is not fully loaded into memory.
DBGHELP: Searching for symbols using debugger-provided data.
DBGHELP: c:\bug\mariadb-10.5.9-winx64\bin\server.pdb - mismatched pdb
DBGHELP: c:\bug\mariadb-10.5.9-winx64\bin\dll\server.pdb - file not found
DBGHELP: c:\bug\mariadb-10.5.9-winx64\bin\symbols\dll\server.pdb - file not found
SYMSRV:  BYINDEX: 0xA

What's interesting is that the paths n assertion are correct (D:\winx64-packages\build\src\storage\innobase\fil\fil0fil.cc - this is where it is built from in our build)
So I do not really know, looks like executable was tampered by someone else, or maybe .dmp is somehow damaged.

Comment by srinivas chandolu [ 2021-08-02 ]

Vladislav Vaintroub, Thanks for the update. However, we are not bundling all the debug files. As suggested for deciphering will include mariadbd.pdb file also. Just curious to know whether any dependency files also needs to be bundled? If so can you please update.

We are using mariadb named binaries downloaded from https://downloads.mariadb.org/mariadb/10.5.9/. For the actual problem mentioned i.e., "that the paths are correct .... but looks like executable was tampered by someone else". Also as a note, we are able to restart the service and login to database. Whether this tampering has happened by some process while database is running?

Comment by Vladislav Vaintroub [ 2021-08-02 ]

I do not know, I can't load symbols for this dmp with the released version. I have no other clues than "mismatched binary" by cdb and Visual Studio. Maybe you have some better clues than I have, and I do something wrong, and your output of cdb command is different?

cdb -lines -z  C:\bug\mariadb-10.5.9-winx64\bin\mariadbd.dmp -i  C:\bug\mariadb-10.5.9-winx64\bin -y  C:\bug\mariadb-10.5.9-winx64\bin -logo out.txt -c "!sym noisy;!sym prompts;.reload;lml;!uniqstack -p;q"

Of course with fixed path, the "bin" directory (in my case is the C:\bug\mariadb-10.5.9-winx64\bin) is the one that contains the crashdump mariadbd.dmp, server.dll, server.pdb (also mariadbd.exe, but this one is no more important, as server.dll now has all the server functionality)

Comment by srinivas chandolu [ 2021-08-11 ]

Vladislav Vaintroub, we have hit this issue on two systems with similar error

InnoDB: Assertion failure in file D:\winx64-packages\build\src\storage\innobase\fil\fil0fil.cc line 488
InnoDB: Failing assertion: space->is_ready_to_close() || space->purpose == FIL_TYPE_TEMPORARY || srv_fast_shutdown == 2 || !srv_was_started
InnoDB: We intentionally generate a memory trap.

Also, the two systems uses windows 2016 as OS.

Attached the two crash dumps with err files for reference.

Can we have more priority for this one to know exact reason?

Comment by Vladislav Vaintroub [ 2021-08-11 ]

schandol, Well, analyzing a minidump is not a big deal. It is well documented, and it should be possible to do that with the exact command that I wrote in previous entry (cdb.exe should be installed). It does not work for me for your dumps, neither the old, nor the new.

Thus I alleged, your executable must be different somehow from what we ship. If I'm wrong, and you are right, you should easily execute a couple of cdb commands, as above https://jira.mariadb.org/browse/MDEV-26293?focusedCommentId=195859&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-195859 , or Visual Studio commands, and prove me wrong. However, the stackdumping code in the program itself, and cdb , and Visual Studio , none of those think they can do anything with that dump, and I'm relying on one of them to work, so there is nothing I can do.

Comment by Vladislav Vaintroub [ 2021-08-11 ]

See, "no matching binary found" for mariadbd.exe, for server.dll , and (you can't see it) for server_audit.dll . The symbol path includes 10.5.9 bin directory, and there is pdb for server.dll and for mariadbd.exe in there. Proof :

Comment by Vladislav Vaintroub [ 2021-08-11 ]

Ok I know what exactly happened. Some releases were rebuilt due to https://jira.mariadb.org/browse/MDEV-24944 , and downloads were fixed, so that they contained newer builds .

If you downloaded just after release, the timestamp of the executable is Feb.19
This is what cdb says for your .dmp

00007ff7`f0580000 00007ff7`f058a000   mariadbd   (deferred)             
    Image path: C:\Program Files\McAfee\Network Security Manager\MariaDB\bin\mariadbd.exe
    Image name: mariadbd.exe
    Timestamp:        Fri Feb 19 11:36:22 2021 (602F94A6)
 ...

the one I got downloaded now has the timestamp of March,22. Output of dumpbin /headers contains

FILE HEADER VALUES
            8664 machine (x64)
               7 number of sections
        6058F60E time date stamp Mon Mar 22 20:54:54 2021
... 

Both files are built from the same source revision 3a8ca9096ea82ca61811450775511533d6cb1bb4, but I can't use debug symbols of the new build on the old exe.
Thus schandol, if you did not save the server.dll,server.pdb and mariadbd.exe,mariadbd.pdb somewhere (which are in the ZIP distro that you downloaded) , if you have not got them, we are out of luck, because I don't that those files either. I would suggest to use the executables from the current download, or better yet, 10.5.12 or whatever is current. I would suggest not throw away the pdbs in the future, they can be useful.

If you saved them, can you please attach them to the ticket.

Comment by Vladislav Vaintroub [ 2021-08-12 ]

schandol, thanks, and server.dll/pdb also, if possible , please

Comment by srinivas chandolu [ 2021-08-12 ]

Vladislav Vaintroub, Thanks for the update. Added three files server.dll, mariadbd.exe,mariadbd.pdb. The server.dll file has been added as a zip file since uploading file limit is 10 MB. However, the server.pdb file size is 55.9 MB, Exceeding 10 MB for all the windows/linux zip formats. Is there any alternate way to upload this file?

Also, mentioned that the binary set of files have been updated for 10.5.9 in downloads section of mariadb. Curious to know the fixes that went into with the latest 10.5.9 when compared to that one used at our end.

Comment by Vladislav Vaintroub [ 2021-08-12 ]

schandol, thanks. you could just share the remaining file on a google/azure drive, and add the link here. Or, you can use 7zip to split large archives like here https://www.techulator.com/resources/6601-How-split-join-files-using-7-zip-free-tool.aspx

new 10.5.9 is built from the very same sources, the binaries and MSI was just signed differently, with timestamping. only MDEV-24944 was fixed, nothing else.

Comment by srinivas chandolu [ 2021-08-12 ]

Vladislav Vaintroub, Uploaded the file by splitting into 4 files using 7 zip.

Comment by Vladislav Vaintroub [ 2021-08-12 ]

So, marko.
One of the crashes is this

 	server.dll!ut_dbg_assertion_failed(const char * expr, const char * file, unsigned int line) Line 60	C++
 	server.dll!fil_node_t::prepare_to_close_or_detach() Line 487	C++
 	server.dll!fil_node_t::close() Line 466	C++
 	server.dll!fil_space_t::try_to_close(bool print_info) Line 110	C++
 	server.dll!fil_node_open_file(fil_node_t * node) Line 438	C++
 	server.dll!fil_space_t::prepare(bool have_mutex) Line 666	C++
 	server.dll!fil_space_t::get(unsigned __int64 id) Line 1446	C++
 	server.dll!buf_flush_space(const unsigned int id) Line 1226	C++
 	server.dll!buf_do_flush_list_batch(unsigned __int64 max_n, unsigned __int64 lsn) Line 1450	C++
 	server.dll!buf_flush_lists(unsigned __int64 max_n, unsigned __int64 lsn) Line 1580	C++
 	server.dll!buf_flush_page_cleaner(void * __formal) Line 2195	C++

another one is this

>	server.dll!my_parameter_handler(const wchar_t * expression, const wchar_t * function, const wchar_t * file, unsigned int line, unsigned __int64 pReserved) Line 278	C
 	ucrtbase.dll!raise()	Unknown
 	ucrtbase.dll!abort()	Unknown
 	server.dll!ut_dbg_assertion_failed(const char * expr, const char * file, unsigned int line) Line 60	C++
 	server.dll!fil_node_t::prepare_to_close_or_detach() Line 487	C++
 	server.dll!fil_node_t::close() Line 466	C++
 	server.dll!fil_space_t::try_to_close(bool print_info) Line 110	C++
 	server.dll!fil_node_open_file(fil_node_t * node) Line 438	C++
 	server.dll!fil_space_t::prepare(bool have_mutex) Line 666	C++
 	server.dll!fil_space_t::acquire() Line 535	C++
 	server.dll!btr_cur_prefetch_siblings(const buf_block_t * block, const dict_index_t * index) Line 3341	C++
 	server.dll!btr_cur_optimistic_delete_func(btr_cur_t * cursor, mtr_t * mtr) Line 5620	C++
 	server.dll!row_purge_remove_clust_if_poss_low(purge_node_t * node, unsigned __int64 mode) Line 138	C++
 	server.dll!row_purge_remove_clust_if_poss(purge_node_t * node) Line 185	C++
 	server.dll!row_purge_record_func(purge_node_t * node, unsigned char * undo_rec, bool updated_extern) Line 1047	C++
 	server.dll!row_purge(purge_node_t * node, unsigned char * undo_rec, que_thr_t * thr) Line 1112	C++
 	server.dll!row_purge_step(que_thr_t * thr) Line 1159	C++
 	server.dll!que_thr_step(que_thr_t * thr) Line 946	C++
 	server.dll!que_run_threads_low(que_thr_t * thr) Line 1011	C++
 	server.dll!que_run_threads(que_thr_t * thr) Line 1050	C++
 	server.dll!srv_task_execute() Line 1872	C++
 	server.dll!purge_worker_callback(void * __formal) Line 2034	C++
 	server.dll!tpool::task_group::execute(tpool::task * t) Line 57	C++
 	ntdll.dll!TppSimplepExecuteCallback()	Unknown
 	ntdll.dll!TppWorkerThread()	Unknown

There is a third one, with all stacks , it is in crashreport.txt, look around STACK_TEXT. There are also all threads stacktraces in this file

Comment by Marko Mäkelä [ 2021-08-12 ]

wlad, can you find all threads that are accessing the same tablespace object? Is the value of fil_space_t::n_pending available?

Comment by Vladislav Vaintroub [ 2021-08-12 ]

fil_space_t::n_pending is not available. The crashreport.txt can give an idea of what threads are doing. I'm not sure how to automate finding threads using the same object there.

Comment by Marko Mäkelä [ 2021-08-12 ]

Can we reproduce this with a small value of innodb_open_files? What is the value of that parameter, and roughly how many distinct .ibd files or InnoDB tablespaces are being accessed concurrently?

Comment by srinivas chandolu [ 2021-08-16 ]

Vladislav Vaintroub / Marko Mäkelä , Just curious to know if we found out the reason for the mentioned abrupt shutdowns. Can you please provide an update.

Comment by Vladislav Vaintroub [ 2021-08-16 ]

schandol, I guess the last marko's question was for your.

Also after producing the stacktrace, which marko asked for, I resort to uninvolved observer's role

Comment by srinivas chandolu [ 2021-08-16 ]

Vladislav Vaintroub, I was in the impression that the chat was between You and Marko Mäkelä. We are using innodb_open_files parameter with value as 300. There are 752 tables in the application. Also the configuration has been done using innodb_file_per_table format. So with this there will be 752 ibd files. Also, there will be one tablespace with a default value of 100M which will be autoextended.
Marko Mäkelä, Can you suggest a value for this parameter based on crash dumps or analysis that has been done from your end. Also, I think we are accessing roughly around 10 distinct .ibd files as per the database log or slowquery log file during this crash. Is there any way to find exact access count?

Comment by Marko Mäkelä [ 2021-08-16 ]

schandol, based on the information that is available so far, I cannot suggest a safe value for this parameter. I would not think that 300 is unreasonably small. Because of several layers of caching, there is no sharp correlation between SQL statements being executed and the number of files being open. It might be turn out that innodb_open_files is useless on modern operating systems, and having a large number of open file descriptors is not an actual problem anymore. If that is the case, setting the parameter to the maximum possible value should work around the problem.

wlad is our Windows expert, and he can perform tests on Windows. Because the file system APIs work differently between Windows and (say) Linux, and because we have not observed this problem on Linux, where most of our internal testing takes place, I think that our best chances to repeat this problem would be on Microsoft Windows.

What kind of storage is being used? HDD or SSD? How big are the tables?

Comment by Vladislav Vaintroub [ 2021-08-16 ]

The errors are usually portable, and reproducible elsewhere if reasonable efforts are made. The exception from there is real Windows APIs , like asynchronous IO.

Having said that, you can have innodb_open_files easily as million, the number of open handles in a Windows program is limited by 2^24.

Comment by Roel Van de Paar [ 2021-08-16 ]

I note (https://jira.mariadb.org/secure/attachment/58503/screenshot-1.png) that a Virus scanner (McAfee Network Security Manager) is operating on this system and that it somehow seems to be interacting or trying to interact with mariadb.exe/server.dll. Virus scanners may cause issues with mysqld. Please try excluding mysqld.exe/dll's (in fact the whole directory/subdirectories in which the server is installed as well as the data directory and it's subdirectories) from any virus/network scanner activity (in disk and memory), at the very least as a test. Other possible mysqld interaction causes are anti-virus/anti-malware, firewall, and UAC. Also ensure that the OS is fully patched/updated.

Comment by Roel Van de Paar [ 2021-08-16 ]

2021-08-11  5:20:15 39251 [Note] InnoDB: Cannot close file .\lf\iv_dev_prof_data.ibd because of 22 pending operations
....
2021-08-11  5:20:56 39251 [Note] InnoDB: Cannot close file .\lf\iv_dev_prof_data.ibd because of 0 pending operations and pending fsync
2021-08-11  5:20:56 39251 [Note] InnoDB: Cannot close file .\lf\iv_dev_prof_data.ibd because of 0 pending operations and pending fsync
2021-08-11  5:20:56 39251 [Note] InnoDB: Cannot close file .\lf\iv_dev_prof_data.ibd because of 0 pending operations and pending fsync
2021-08-11  5:20:56 39251 [Note] InnoDB: Cannot close file .\lf\iv_dev_prof_data.ibd because of 0 pending operations and pending fsync
2021-08-11  5:20:56 39251 [Note] InnoDB: Cannot close file .\lf\iv_dev_prof_data.ibd because of 0 pending operations and pending fsync
2021-08-11 05:21:49 0x187c  InnoDB: Assertion failure in file D:\winx64-packages\build\src\storage\innobase\fil\fil0fil.cc line 488
InnoDB: Failing assertion: space->is_ready_to_close() || space->purpose == FIL_TYPE_TEMPORARY || srv_fast_shutdown == 2 || !srv_was_started
InnoDB: We intentionally generate a memory trap.

There are regular pending operations and pending fsync's. In this case the crash happens very close to them. I would not be surpised if the issues were connected, and anti-virus or some other 3rd party interaction remains a major possibility imho. schandol What sort of SAN or similar are you using please?

Comment by Roel Van de Paar [ 2021-08-16 ]

Also interesting is that the FAILURE_BUCKET_ID is BREAKPOINT_80000003_server.dll!my_parameter_handler which may indicate underlying issues on the system or OS itself, or a memory leak in the server. Some limited info on these example URL's:
https://techcult.com/fix-error-0x80000003/
https://www.auslogics.com/en/articles/fix-breakpoint-error-0x80000003/ (I don't recommend using the driver updater!)
https://www.thewindowsclub.com/breakpoint-reached-error-0x80000003

Comment by srinivas chandolu [ 2021-08-16 ]

Marko Mäkelä, Vladislav Vaintroub and Roel Van de Paar, Thanks for the updates.

The kind of storage used in these systems is HDD. Generally there will be around six tables which are of in units to tens of GBs. In all the crashed systems the highest table data is around 10 GB and remaining 5 tables will vary between 2 GB to 7 GB. The remaining tables are of few MBs.

Regarding innodb_open_files parameter noticed that we can give up to 4096 in one of the system. However more than has been given (for example 8092) it is taking as 256 (equal to table_open_cache value). But in documentation (https://mariadb.com/kb/en/innodb-system-variables/#innodb_open_files) it has been mentioned that we can have upto 4294967295. Is it dependent on other parameters? Also how we can set with higher value?

As a note, still now we have not faced issues by having having 300 as its value in previous mariadb versions of 10.3.Now we are facing with 10.5 for the first time.

On other hand on other system if this parameter has been set with still lower value say 10, it is getting crashed with in 2 hours after raising semaphore and wait issues. We have changed to 4096 to check its behavior.

Also McAfee Network Security Manager is not an antivirus application. On all the affected systems the virus scan is not running.

However for other points mentioned will check with concerned and will update accordingly.

Comment by Marko Mäkelä [ 2021-08-16 ]

schandol, in 10.5 the code flow is different from older versions mostly due to MDEV-23855, which reduced contention on fil_system.mutex especially for write workloads.

Comment by Valerii Kravchuk [ 2022-06-21 ]

Customer got the same assertion failure in different context:

Thread 1 (Thread 0x7e6d68762700 (LWP 263532)):
#0  0x00007fefb327aaa1 in pthread_kill () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x0000562c5f35827e in handle_fatal_signal (sig=6) at /usr/src/debug/MariaDB-/src_0/sql/signal_handler.cc:330
        curr_time = 1655794807
        tm = {tm_sec = 7, tm_min = 0, tm_hour = 14, tm_mday = 21, tm_mon = 5, tm_year = 122, tm_wday = 2, tm_yday = 171, tm_isdst = 0, tm_gmtoff = 25200, tm_zone = 0x562c62d416d0 "+07"}
        thd = 0x7e6bb40009b8
        print_invalid_query_pointer = true
#2  <signal handler called>
No symbol table info available.
#3  0x00007fefb13453d7 in raise () from /lib64/libc.so.6
No symbol table info available.
#4  0x00007fefb1346ac8 in abort () from /lib64/libc.so.6
No symbol table info available.
#5  0x0000562c5f02d3af in ut_dbg_assertion_failed (expr=expr@entry=0x562c5fc0c130 "space->is_ready_to_close() || space->purpose == FIL_TYPE_TEMPORARY || srv_fast_shutdown == 2 || !srv_was_started", file=file@entry=0x562c5fc0c060 "/home/jenkins/workspace/MariaDBE-Custom-RPM/label/rhel-7/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/innobase/fil/fil0fil.cc", line=line@entry=488) at /usr/src/debug/MariaDB-/src_0/storage/innobase/ut/ut0dbg.cc:60
No locals.
#6  0x0000562c5f87aac8 in fil_node_t::prepare_to_close_or_detach (this=this@entry=0x7e6bd502fba8) at /usr/src/debug/MariaDB-/src_0/storage/innobase/fil/fil0fil.cc:487
No locals.
#7  0x0000562c5f87ab27 in fil_node_t::close (this=0x7e6bd502fba8) at /usr/src/debug/MariaDB-/src_0/storage/innobase/fil/fil0fil.cc:463
No locals.
#8  0x0000562c5f87aca3 in fil_space_t::try_to_close (print_info=false) at /usr/src/debug/MariaDB-/src_0/storage/innobase/fil/fil0fil.cc:109
        node = 0x7e6bd502fba8
        space = 0x7e6bd517f948
#9  0x0000562c5f87fc03 in fil_node_open_file (node=node@entry=0x7e6bc4307098) at /usr/src/debug/MariaDB-/src_0/storage/innobase/fil/fil0fil.cc:438
        count = 0
#10 0x0000562c5f03245c in fil_space_t::prepare (this=0x7e6bc4304238, have_mutex=have_mutex@entry=false) at /usr/src/debug/MariaDB-/src_0/storage/innobase/fil/fil0fil.cc:666
        node = 0x7e6bc4307098
        is_open = <optimized out>
#11 0x0000562c5f7f4535 in acquire (this=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/include/fil0fil.h:535
No locals.
#12 btr_cur_prefetch_siblings (block=block@entry=0x7eea2b818c00, index=index@entry=0x7e6b71ba93a0) at /usr/src/debug/MariaDB-/src_0/storage/innobase/btr/btr0cur.cc:3343
        page = <optimized out>
#13 0x0000562c5f7fcc11 in btr_cur_optimistic_insert (flags=flags@entry=0, cursor=cursor@entry=0x7e6d68760190, offsets=offsets@entry=0x7e6d687600f0, heap=heap@entry=0x7e6d687600e0, entry=entry@entry=0x7e6bade5d690, rec=rec@entry=0x7e6d687608b0, big_rec=big_rec@entry=0x7e6d687600d0, n_ext=n_ext@entry=0, thr=thr@entry=0x7e6b9efc8bc8, mtr=mtr@entry=0x7e6d68760b10) at /usr/src/debug/MariaDB-/src_0/storage/innobase/btr/btr0cur.cc:3470
        big_rec_vec = 0x0
        err = DB_FAIL
        max_size = <optimized out>
        n_recs = <optimized out>
        page_cursor = <optimized out>
        block = 0x7eea2b818c00
        page = 0x7eea2c068000 <Address 0x7eea2c068000 out of bounds>
        dummy = 0x7e6bb4a296f6 "\201\064\212Í€"
        rec_size = 196
        index = 0x7e6b71ba93a0
        reorg = <optimized out>
        inherit = true
#14 0x0000562c5f7341c4 in row_ins_clust_index_entry_low (flags=flags@entry=0, mode=<optimized out>, mode@entry=2, index=index@entry=0x7e6b71ba93a0, n_uniq=n_uniq@entry=1, entry=entry@entry=0x7e6bade5d690, n_ext=n_ext@entry=0, thr=thr@entry=0x7e6b9efc8bc8) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0ins.cc:2739
        insert_rec = 0x7f6a87477100 <Address 0x7f6a87477100 out of bounds>
        pcur = {btr_cur = {index = 0x7e6b71ba93a0, page_cur = {index = 0x0, rec = 0x7eea2c06b4eb <Address 0x7eea2c06b4eb out of bounds>, offsets = 0x0, block = 0x7eea2b818c00}, purge_node = 0x0, left_block = 0x0, thr = 0x7e6b9efc8bc8, flag = BTR_CUR_BINARY, tree_height = 3, up_match = 0, up_bytes = 0, low_match = 0, low_bytes = 0, n_fields = 0, n_bytes = 0, fold = 0, path_arr = 0x0, rtr_info = 0x0}, latch_mode = 2, old_stored = false, old_rec = 0x0, old_n_fields = 0, rel_pos = 0, block_when_stored = {m_block = 0x0, m_page_id = {m_id = 0}}, modify_clock = 0, pos_state = BTR_PCUR_IS_POSITIONED, search_mode = PAGE_CUR_LE, trx_if_known = 0x0, old_rec_buf = 0x0, buf_size = 0}
        cursor = 0x7e6d68760190
        err = <optimized out>
        big_rec = 0x0
        mtr = {m_last = 0x0, m_last_offset = 0, m_log_mode = 0, m_modifications = 0, m_made_dirty = 1, m_inside_ibuf = 0, m_trim_pages = 0, m_memo = {m_heap = 0x0, m_list = {<ilist<mtr_buf_t::block_t, void>> = {sentinel_ = {next = 0x7e6d68760b48, prev = 0x7e6d68760b48}}, size_ = 1}, m_size = 64, m_first_block = {<ilist_node<void>> = {next = 0x7e6d68760b28, prev = 0x7e6d68760b28}, m_data = "\000\000\000\000\000\000\000\000 \000\000\000m~\000\000\000\000\000\000\000\000\000\000\001\000\000\000\323\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\214\201+\352~\000\000\002", '\000' <repeats 15 times>, "\022\000\000\000k~\000\000\000j\337}\vmF\251", '\000' <repeats 16 times>, "p\021vhm~\000\000\230\376¶k~", '\000' <repeats 35 times>, "h\221k\246~\000\000\030\000\f_,V\000\000\000\000\000\000\000\000\000\000(\fvhm~\000\000(\fvhm~\000\000\001", '\000' <repeats 15 times>..., m_used = 64}}, m_log = {m_heap = 0x0, m_list = {<ilist<mtr_buf_t::block_t, void>> = {sentinel_ = {next = 0x7e6d68760d78, prev = 0x7e6d68760d78}}, size_ = 1}, m_size = 0, m_first_block = {<ilist_node<void>> = {next = 0x7e6d68760d58, prev = 0x7e6d68760d58}, m_data = "`\220E`,V\000\000pf\276\341\062V\000\000r\000\000\000\000\000\000\000\330J\324b,V\000\000\021\000\000\000\000\000\000\000\060\016vhm~\000\000\241\251\230_,V\000\000pf\276\341\062V\000\000\000\000\000\000\000\000\000\000n\000\000\000\022\000\000\000(o\250\267k~\000\000\310\065!\350j~\000\000\250\245\256\261k~\000\000\030\066[\324k~\000\000\212\001\213\001\214\001\215\001\060\016vhm~\000\000 o\250\267k~\000\000pf\276\341\062V\000\000\277O\230_,V\000\000 o\250\267k~\000\000pf\276\341\062V\000\000\300\016vhm~\000\000\021\355$_,V\000\000H\000\000\000\000\000\000\000\004\000\000\000\000\000\000\000"..., m_used = 0}}, m_user_space = 0x7e6bc4304238, m_commit_lsn = 0, m_freed_space = 0x0, m_freed_pages = 0x0}
        auto_inc = <optimized out>
        offsets_heap = 0x0
        offsets_ = {300, 66, 32814, 4, 10, 16, 23, 27, 34, 36, 46, 48, 51, 53, 56, 60, 64, 67, 99, 107, 123, 127, 131, 134, 138, 140, 142, 174, 175, 176, 177, 178, 179, 187, 199, 207, 219, 222, 225, 228, 230, 234, 237, 241, 245, 251, 257, 265, 266, 274, 275, 276, 277, 278, 280, 281, 283, 285, 287, 290, 295, 296, 297, 298, 299, 305, 306, 834, 841, 424, 427, 437, 447, 450, 453, 463, 473, 476, 479, 489, 300, 3, 32774, 4, 10, 42, 531, 541, 551, 554, 557, 567, 577, 586, 595, 604, 613, 622, 631, 640, 649, 658, 667, 676, 685, 694, 703, 712, 721, 730, 739, 748, 300, 8, 32777, 4, 10, 16, 23, 24, 678, 679, 686, 856, 865, 874, 883, 892, 901, 910, 919, 928, 937, 946, 955, 964, 973, 982, 991, 1000, 1009, 1018, 1027, 1036, 1045, 1054, 1063, 1072, 1081, 1090, 1099, 1108, 1117, 1126, 1135, 1144, 1153, 1162, 1171, 1180, 1189, 1198, 1207, 1216, 1225, 1234, 1243, 1252, 1261, 1273, 1289, 1301, 1317, 1321, 1393, 1399, 1519, 1522, 1523, 1645, 1648, 1673, 1676, 1684, 1693, 1697, 1705, 1708, 1738, 1768, 1828, 1832, 1839, 0, 0, 0, 19672, 40316, 32363, 0...}
        offsets = 0x7e6d68760650
#15 0x0000562c5f736149 in row_ins_clust_index_entry (index=index@entry=0x7e6b71ba93a0, entry=entry@entry=0x7e6bade5d690, thr=thr@entry=0x7e6b9efc8bc8, n_ext=n_ext@entry=0) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0ins.cc:3210
        err = <optimized out>
        n_uniq = 1
        flags = 0
        skip_locking = <optimized out>
        orig_n_fields = 25
#16 0x0000562c5f73685c in row_ins_index_entry (thr=0x7e6b9efc8bc8, entry=0x7e6bade5d690, index=0x7e6b71ba93a0) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0ins.cc:3333
No locals.
#17 row_ins_index_entry_step (thr=0x7e6b9efc8bc8, node=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0ins.cc:3502
No locals.
#18 row_ins (thr=0x7e6b9efc8bc8, node=<optimized out>) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0ins.cc:3661
        index = <optimized out>
        type = <optimized out>
#19 row_ins_step (thr=thr@entry=0x7e6b9efc8bc8) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0ins.cc:3807
        node = <optimized out>
        err = <optimized out>
        sel_node = <optimized out>
#20 0x0000562c5f74936b in row_insert_for_mysql (mysql_rec=<optimized out>, prebuilt=0x7e6b9efc7b70, ins_mode=ROW_INS_NORMAL) at /usr/src/debug/MariaDB-/src_0/storage/innobase/row/row0mysql.cc:1421
        err = 1604526928
        was_lock_wait = <optimized out>
        node = 0x7e6b9efc8478
        savept = {least_undo_no = 19}
        trx = 0x7fefac7c6ff0
        table = 0x7e6b701515e0
        blob_heap = <optimized out>
#21 0x0000562c5f67b604 in ha_innobase::write_row (this=0x7e6b9d9b6080, record=0x7e6b9d23c388 "BONEW1PI339J8GLW", ' ' <repeats 80 times>, "\201\064\212\315TRM         0351040", ' ' <repeats 14 times>, "\r") at /usr/src/debug/MariaDB-/src_0/storage/innobase/handler/ha_innodb.cc:7637
No locals.
#22 0x0000562c5f36547f in handler::ha_write_row (this=0x7e6b9d9b6080, buf=buf@entry=0x7e6b9d23c388 "BONEW1PI339J8GLW", ' ' <repeats 80 times>, "\201\064\212\315TRM         0351040", ' ' <repeats 14 times>, "\r") at /usr/src/debug/MariaDB-/src_0/sql/handler.cc:7210
No locals.
#23 0x0000562c5f5ca318 in ha_partition::write_row (this=0x7e6b9eaaf900, buf=0x7e6b9d23c388 "BONEW1PI339J8GLW", ' ' <repeats 80 times>, "\201\064\212\315TRM         0351040", ' ' <repeats 14 times>, "\r") at /usr/src/debug/MariaDB-/src_0/sql/ha_partition.cc:4492
        part_id = 431
        func_value = 3541341631
        sms = {thd = 0x7e6bb40009b8, old_mode = 524288}
        saved_auto_inc_field_not_null = true
        error = 0
        have_auto_increment = <optimized out>
        thd = 0x7e6bb40009b8
#24 0x0000562c5f36547f in handler::ha_write_row (this=0x7e6b9eaaf900, buf=0x7e6b9d23c388 "BONEW1PI339J8GLW", ' ' <repeats 80 times>, "\201\064\212\315TRM         0351040", ' ' <repeats 14 times>, "\r") at /usr/src/debug/MariaDB-/src_0/sql/handler.cc:7210
No locals.
#25 0x0000562c5f471705 in Rows_log_event::write_row (this=this@entry=0x7e6bef5c3848, rgi=rgi@entry=0x7e6a2f7fddc0, overwrite=<optimized out>) at /usr/src/debug/MariaDB-/src_0/sql/log_event_server.cc:7287
        key = {m_ptr = 0x0}
        table = 0x7e6b9e17d428
        error = <optimized out>
        keynum = <optimized out>
        invoke_triggers = false
#26 0x0000562c5f471b6d in Write_rows_log_event::do_exec_row (this=0x7e6bef5c3848, rgi=0x7e6a2f7fddc0) at /usr/src/debug/MariaDB-/src_0/sql/log_event_server.cc:7507
        tmp = 0x562c5fa27835 "Executing"
        message = <optimized out>
        error = <optimized out>
#27 0x0000562c5f466fb4 in Rows_log_event::do_apply_event (this=0x7e6bef5c3848, rgi=0x7e6a2f7fddc0) at /usr/src/debug/MariaDB-/src_0/sql/log_event_server.cc:5679
        old_thd = 0x7e6bb40009b8
        transactional_table = true
        after_image = <optimized out>
        saved_sql_mode = 1411383296
        rli = 0x5632e1d8cb68
        lex = <optimized out>
        table = 0x7e6b9e17d428
        error = <optimized out>
        new_trg_event_map = <optimized out>
#28 0x0000562c5f09b36f in apply_event_and_update_pos_apply (ev=0x7e6bef5c3848, thd=0x7e6bb40009b8, rgi=0x7e6a2f7fddc0, reason=<optimized out>) at /usr/src/debug/MariaDB-/src_0/sql/log_event.h:1498
No locals.
#29 0x0000562c5f294f6e in rpt_handle_event (qev=qev@entry=0x7e6bedb60138, rpt=0x7e6bec009c60) at /usr/src/debug/MariaDB-/src_0/sql/rpl_parallel.cc:62
        err = <optimized out>
        rli = 0x5632e1d8cb68
        rgi = 0x7e6a2f7fddc0
        thd = 0x7e6bb40009b8
        ev = 0x7e6bef5c3848
#30 0x0000562c5f29750b in handle_rpl_parallel_thread (arg=arg@entry=0x7e6bec009c60) at /usr/src/debug/MariaDB-/src_0/sql/rpl_parallel.cc:1363
        event_type = WRITE_ROWS_EVENT_V1
        rgi = 0x7e6a2f7fddc0
        entry = 0x7e6bec01d318
        end_of_group = <optimized out>
        group_ending = 0
        wait_count = <optimized out>
        qev = <optimized out>
        next_qev = 0x7e6bef53cdd8
        in_event_group = true
        group_rgi = 0x7e6a2f7fddc0
        event_gtid_sub_id = 1752521350
        sql_info = {cached_charset = "!\000!\000!", rpl_filter = 0x562c62d548b0}
        err = <optimized out>
        rpt = <optimized out>
        group_standalone = false
        thd = 0x7e6bb40009b8
        old_stage = {m_key = 0, m_name = 0x562c5fa27a27 "After apply log event", m_flags = 1615069400}
        events = <optimized out>
        skip_event_group = false
        gco = <optimized out>
#31 0x0000562c5f5cdc1d in pfs_spawn_thread (arg=0x7e6bec00b998) at /usr/src/debug/MariaDB-/src_0/storage/perfschema/pfs.cc:2201
        typed_arg = 0x7e6bec00b998
        user_arg = 0x7e6bec009c60
        pfs = <optimized out>
        user_start_routine = 0x562c5f296dd0 <handle_rpl_parallel_thread(void*)>
        klass = <optimized out>
#32 0x00007fefb3275ea5 in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#33 0x00007fefb140d9fd in clone () from /lib64/libc.so.6
No symbol table info available.

Comment by Marko Mäkelä [ 2022-06-21 ]

valerii, can you also get the following output from that crash (which is related to enforcing the innodb_open_files limit)?

frame 6
print *this
print *space

Furthermore, in the output of thread apply all backtrace, do you see any pointers to either object in any other thread?

Comment by Marko Mäkelä [ 2022-06-21 ]

Neither the being-closed tablespace object nor the name of the partitioned table occur in any other thread.

The value of space->n_pending in the core dump is CLOSING+1, which would seem to satisfy the assertion expression space->is_ready_to_close().

I think that we should try to reproduce this with astress test run with a small value of innodb_open_files and large numbers of partitions, on a non-debug executable. Once we have something reproducible, it should be easier to figure out what is going on.

Comment by Matthias Leich [ 2022-06-27 ]

I am able to replay the assert on
    origin/10.5 674842bee0a564a2d94e99c4e1319a716aa10aa9 2022-06-16T00:12:11+03:00
compiled without debug.
All attempts to replay with invoking 'rr' or using a build with debug had no luck.
 
RQG
===
# git clone https://github.com/mleich1/rqg --branch experimental RQG
#
# GIT_SHOW: HEAD -> experimental, origin/experimental a32979c6d789ddf0982e6a56af687565be00590c 2022-06-14T17:21:56+02:00
# rqg.pl  : Version 4.0.6 (2022-05)
#
# $RQG_HOME/rqg.pl \
# --duration=300 \
# --queries=10000000 \
# --no_mask \
# --seed=random \
# --rpl_mode=none \
# --max_gd_duration=1200 \
# --engine=InnoDB \
# --mysqld=--file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
# --mysqld=--innodb-lock-wait-timeout=50 \
# --mysqld=--sync-binlog=1 \
# --mysqld=--net_read_timeout=30 \
# --mysqld=--loose_innodb_lock_schedule_algorithm=fcfs \
# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
# --mysqld=--slave_net_timeout=60 \
# --mysqld=--loose-idle_readonly_transaction_timeout=0 \
# --mysqld=--innodb_open_files=15 \
# --mysqld=--plugin-load-add=file_key_management.so \
# --mysqld=--wait_timeout=28800 \
# --mysqld=--interactive_timeout=28800 \
# --mysqld=--lock-wait-timeout=86400 \
# --mysqld=--log_bin_trust_function_creators=1 \
# --mysqld=--loose-idle_write_transaction_timeout=0 \
# --mysqld=--connect_timeout=60 \
# --mysqld=--loose-max-statement-time=30 \
# --mysqld=--loose-idle_transaction_timeout=0 \
# --mysqld=--innodb-buffer-pool-size=8M \
# --mysqld=--loose-innodb_read_only_compressed=OFF \
# --mysqld=--log-bin \
# --mysqld=--log-output=none \
# --mysqld=--loose-table_lock_wait_timeout=50 \
# --mysqld=--net_write_timeout=60 \
# --mysqld=--innodb_page_size=8K \
# --mysqld=--loose-plugin-load-add=provider_lz4.so \
# --reporters=Backtrace,Deadlock1,ErrorLog \
# --validators=None \
# --grammar=MDEV-26293.yy \
# --threads=7 \
# <local settings>
 
In average several hundred RQG runs are required for one replay.
 
The simplified RQG grammar MDEV-26293 shows
alter:
    ALTER TABLE table_name_part alter_operation ;
 
alter_operation:
    ADD PARTITION (PARTITION p3 VALUES LESS THAN MAXVALUE) |
    ORDER BY `col_int_nokey` ;
 
create_part:
    CREATE TABLE IF NOT EXISTS table_name_part (
`col_int_nokey` INTEGER,
`col_int_key` INTEGER NOT NULL,
KEY (`col_int_key`)
) ENGINE = INNODB ;
 
exec_sql:
    alter |
    alter |
    alter |
    alter |
    alter |
    alter |
    create_part |
    create_part |
    create_part ;
 
query:
    exec_sql ;
 
query_init:
    ;
 
table_name_part:
    d |
    f |
    g |
    h |
    j |
    k |
    n |
    q |
    t |
    y ;
 
thread_connect:
    ;

Comment by Marko Mäkelä [ 2022-06-30 ]

I was unable to reproduce this locally. A core dump that mleich provided illustrated the problem. I think that the scenario was as follows:

  1. fil_space_t::try_to_close() had called fil_space_t::set_closing(), which returned 0, meaning that no references exist and no NEEDS_FSYNC flag was set.
  2. Another thread executed fil_space_t::acquire_low(), which incremented the n_pending from CLOSING to CLOSING+1.
  3. The assertion failed due to the unexpected reference.

My suggested fix is to avoid incrementing the reference count if the CLOSING flag is set and the caller is not holding fil_system.mutex. Incrementing the reference is OK if the NEEDS_FSYNC flag is set. If the reference was not incremented due to the CLOSING flag, fil_space_t::acquire() must acquire fil_system.mutex and try incrementing the reference again. Because fil_space_t::try_to_close() is holding fil_system.mutex, that fix should prevent this race condition.

Generated at Thu Feb 08 09:44:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.