[MDEV-16785] MariaDB server is running in 100% on one cpu Created: 2018-07-20  Updated: 2023-04-27

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.57
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Major
Reporter: Michael Graf Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: None
Environment:

Windows Server 2012 R2


Attachments: File h2318784.err     File my.ini     Zip Archive mysqld_minidump.zip     Microsoft Word show_full_processlist.csv     Text File show_full_processlist.txt     HTML File threads    
Issue Links:
Relates
relates to MDEV-11972 Assertion failure in sync_array_get_a... Closed
relates to MDEV-16264 Implement a common work queue for Inn... Closed

 Description   

MariaDB server is running regularly in an endless loop in which one cpu is consuming 100%.
An output of "show full processlist" showed no progress anymore.

An analysis of the created minidump shows that the thread with id 3574 consumed most of the cpu.
0:000> !runaway
User Mode Time
Thread Time
15:3574 0 days 5:59:41.546
18:43a4 0 days 0:02:40.625
23:290c 0 days 0:01:22.109
...



 Comments   
Comment by Elena Stepanova [ 2018-08-04 ]

Threads from the minidump: threads

However, the contents doesn't seem to correspond the described situation. According to the coredump, there is a lot of query-processing activity (numerous updates, deletes, inserts, selects), so the processlist surely shouldn't be empty.

Stack trace for the alleged culprit 3574 is not very impressive, either:

Callstack for Thread 16 (Thread Id: 13684 (0x3574)):
 Index  Function
--------------------------------------------------------------------------------
 1      mysqld.exe!sync_array_find_thread(sync_array_struct * arr=0x000000d78d9dccb0, unsigned long thread=4294967295)
 2      mysqld.exe!sync_array_print_long_waits(unsigned long * waiter=0x000000d82ab2fa48, const void * * sema=0x000000d82ab2fa58)
 3      mysqld.exe!srv_error_monitor_thread(void * arg=0x00007ffe4e7116a0)
 4      [External Code]

wlad, would you like to take a look?
Please feel free to re-assign or unassign if needed.

Comment by Vladislav Vaintroub [ 2018-08-04 ]

reassigning to Marko, looks like Innodb deadlock.
All threads are foreground. A potentially interesting detail is existence of a thread that hangs in ha_close_connection, from the experience. connection shutdown might be have some exciting locking issues

Comment by Marko Mäkelä [ 2019-04-16 ]

I see a possible bug in the InnoDB sync_array in threads:

Callstack for Thread 19 (Thread Id: 17316 (0x43a4)):
 Index  Function
--------------------------------------------------------------------------------
 1      [External Code]
 2      mysqld.exe!os_mutex_enter(os_mutex_struct * mutex=0x000000d78d9269c0)
 3      mysqld.exe!sync_array_reserve_cell(sync_array_struct * arr=0x000000d78d9dccb0, void * object=0x000000d7a43e5400, unsigned __int64 type=352, const char * file=0x00007ff77a2675d0, unsigned __int64 line=1357, unsigned __int64 * index=0x000000d82ae2eed0)
 4      mysqld.exe!rw_lock_s_lock_spin(rw_lock_struct * lock=0x000000d7a43e5400, unsigned __int64 pass=2, const char * file_name=0x00007ff77a2675d0, unsigned __int64 line=1357)
 5      mysqld.exe!rw_lock_s_lock_func(rw_lock_struct * lock=0x000000d7a43e5400, unsigned __int64 pass=2, const char * file_name=0x00007ff77a2675d0, unsigned __int64 line=1357)
 6      mysqld.exe!pfs_rw_lock_s_lock_func(rw_lock_struct * lock=0x000000d7a43e5400, unsigned __int64 pass=2, const char * file_name=0x00007ff77a2675d0, unsigned __int64 line=1357)
 7      mysqld.exe!buf_flush_page(buf_pool_struct * buf_pool=0x000000d79e5457d8, buf_page_struct * bpage=0x000000d7a43e5310, buf_flush flush_type)
 8      mysqld.exe!buf_flush_try_neighbors(unsigned __int64 space=0, unsigned __int64 offset=14058, buf_flush flush_type=BUF_FLUSH_LIST, unsigned __int64 n_flushed=124, unsigned __int64 n_to_flush=200)
 9      mysqld.exe!buf_flush_page_and_try_neighbors(buf_page_struct * bpage, buf_flush flush_type=BUF_FLUSH_LIST, unsigned __int64 n_to_flush=200, unsigned __int64 * count=0x000000d82ae2f0d0)
 10     mysqld.exe!buf_flush_flush_list_batch(buf_pool_struct * buf_pool=0x000000d79e5457d8, unsigned __int64 min_n=200, unsigned __int64 lsn_limit=18446744073709551615)
 11     mysqld.exe!buf_flush_batch(buf_pool_struct * buf_pool=0x000000d79e5457d8, buf_flush flush_type, unsigned __int64 min_n=200, unsigned __int64 lsn_limit)
 12     mysqld.exe!buf_flush_list(unsigned __int64 min_n=200, unsigned __int64 lsn_limit=18446744073709551615)
 13     mysqld.exe!srv_master_thread(void * arg=0x00007ffe4e7116a0)
 14     [External Code]
 
Callstack for Thread 20 (Thread Id: 17352 (0x43c8)):
 Index  Function
--------------------------------------------------------------------------------
 1      [External Code]
 2      mysqld.exe!os_cond_wait(_RTL_CONDITION_VARIABLE * cond, _RTL_CRITICAL_SECTION * mutex)
 3      mysqld.exe!os_event_wait_low(os_event_struct * event=0x000000d825aadb30, __int64 reset_sig_count=9)
 4      mysqld.exe!sync_array_wait_event(sync_array_struct * arr=0x000000d78d9dccb0, unsigned __int64 index=4)
 5      mysqld.exe!rw_lock_s_lock_spin(rw_lock_struct * lock=0x000000d7a0e3b9b8, unsigned __int64 pass=0, const char * file_name=0x00007ff77a261d28, unsigned __int64 line=671)
 6      mysqld.exe!rw_lock_s_lock_func(rw_lock_struct * lock=0x000000d7a0e3b9b8, unsigned __int64 pass=0, const char * file_name=0x00007ff77a261d28, unsigned __int64 line=671)
 7      mysqld.exe!pfs_rw_lock_s_lock_func(rw_lock_struct * lock=0x000000d7a0e3b9b8, unsigned __int64 pass=0, const char * file_name=0x00007ff77a261d28, unsigned __int64 line=671)
 8      mysqld.exe!mtr_s_lock_func(rw_lock_struct * lock=0x000000d7a0e3b9b8, const char * file, unsigned __int64 line, mtr_struct * mtr=0x000000d82af2ed70)
 9      mysqld.exe!btr_cur_search_to_nth_level(dict_index_struct * index=0x000000d7a0e3b928, unsigned __int64 level=0, const dtuple_struct * tuple=0x000000d79fc75918, unsigned __int64 mode=4, unsigned __int64 latch_mode=1, btr_cur_struct * cursor=0x000000d79fc73668, unsigned __int64 has_search_latch=0, const char * file=0x00007ff77a269b38, unsigned __int64 line=716, mtr_struct * mtr=0x000000d82af2ed70)

These threads are operating on unrelated buf_block_t::lock, but there appears to be a possible race condition in the sync_array element 0x000000d78d9dccb0 between sync_array_reserve_cell() and sync_array_wait_event(). It appears that we could be attempting to allocate a slot that is still in use.

The sync array is something that I have never worked on. I think that we should try to eliminate it, but I do not think that we can do it in a GA version.

Comment by Marko Mäkelä [ 2019-04-16 ]

There is quite a bit of nonzero-overhead abstraction (or obfuscation) going on. Apparently, there is only one sync_array_t object in existence: sync_primary_wait_array, and sync_primary_wait_array->protection is always SYNC_ARRAY_OS_MUTEX. It looks like sync_primary_wait_array->os_mutex is the contention point that adds insult to injury when some contention occurs inside InnoDB.

wlad points out that srv_error_monitor_thread() is continuously holding this mutex while gathering information about InnoDB mutex or rw-lock contention.

Unless InnoDB is actually hanging, I am afraid that there is nothing that we can do about this in a GA version.

Comment by Marko Mäkelä [ 2019-09-30 ]

The code could be refactored as part of MDEV-16264, which is replacing many InnoDB threads.

Generated at Thu Feb 08 08:31:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.