[MDEV-30775] Performance regression in fil_space_t::try_to_close() introduced in MDEV-23855 Created: 2023-03-02  Updated: 2023-05-24  Resolved: 2023-03-10

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.7, 10.5.8, 10.5.17, 10.11.2
Fix Version/s: 10.11.3, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Critical
Reporter: Claudio Nanni Assignee: Vladislav Lesin
Resolution: Fixed Votes: 1
Labels: regression-10.5

Attachments: Text File annotated-perf.txt     Text File show-global-variables.txt     File space_list.gdb     File test-MDEV-30775-10517.log     File test-MDEV-30775-10520.log    
Issue Links:
Duplicate
Relates
relates to MDEV-30829 InnoDB: Cannot close file <tablename>... Closed

 Description   

Several operations got much slower starting from 10.5.8, apparently related to file opening/closing in handler.cc when the innodb_open_files limit is reached.

The more tables the instance has, the more visible the slowdown; with some tens of thousands of tables it becomes problematic.

SHOW TABLE STATUS FROM db gets hundreds of times slower

mysqldump on 600 schemas x 223 empty tables goes from 2:45 minutes to 27 minutes

Increasing innodb_open_files (within the capping limits: open_files_limit, OS user limits, system limits) helps, but in some cases it means keeping hundreds of thousands of files open.
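For reference, the cap chain mentioned here can be sketched in my.cnf terms (values purely illustrative):

```ini
[mysqld]
# innodb_open_files is capped by open_files_limit, which in turn is
# capped by the OS per-user limit (ulimit -n) and system-wide limits.
open_files_limit  = 200000
innodb_open_files = 100000
```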

The change was introduced in CS 10.5.8

To reproduce, you need many thousands of tables (at least more than innodb_open_files) and to run mysqldump or SHOW TABLE STATUS FROM until the limit is reached; testing with and without an increased innodb_open_files will show the difference.
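A hypothetical standalone generator for such a setup (the db%d/t%d names and the schema layout are illustrative, not from this report); its output can be piped into the mysql client:

```cpp
// Prints DDL for `schemas` databases x `tables` empty InnoDB tables.
// Illustrative reproduction helper only, not part of the MariaDB tree.
#include <cassert>
#include <cstdio>

// Returns the number of statements printed.
static int emit_ddl(int schemas, int tables)
{
  int n= 0;
  for (int s= 1; s <= schemas; s++)
  {
    std::printf("CREATE DATABASE IF NOT EXISTS db%d;\n", s);
    n++;
    for (int t= 1; t <= tables; t++)
    {
      std::printf("CREATE TABLE IF NOT EXISTS db%d.t%d"
                  " (id INT PRIMARY KEY) ENGINE=InnoDB;\n", s, t);
      n++;
    }
  }
  return n;
}
```

Calling emit_ddl(600, 223) would reproduce the reporter's 600 schemas x 223 empty tables.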

I have found this if condition in handler.cc getting slower, even though its body is not even executed:

  if (unlikely((error= open(name, mode, test_if_locked))))
  {
    /* debug instrumentation added while investigating */
    ms= duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
    fprintf(stderr, "handler.cc:ha_open 3317 %" PRIu64 "\n", ms);
    if ((error == EACCES || error == EROFS) && mode == O_RDWR &&
        (table->db_stat & HA_TRY_READ_ONLY))
    {
      table->db_stat|= HA_READ_ONLY;
      error= open(name, O_RDONLY, test_if_locked);
    }
  }

That open() call goes to ha_heap.cc calling:

int ha_heap::open(const char *name, int mode, uint test_if_locked) ...



 Comments   
Comment by Vladislav Lesin [ 2023-03-06 ]

perf shows that most of the time is spent iterating over fil_system.space_list in fil_space_t::try_to_close(). The reason is that fil_node_open_file_low() tries to close files from the top of the list when the number of opened files is close to the limit. Because files are closed from the top of the list, fil_space_t::try_to_close() has to iterate over more and more closed files at the top of the list before it finds an opened file to close:

bool fil_space_t::try_to_close(bool print_info)                                 
{                                                                               
  ut_ad(mutex_own(&fil_system.mutex));                                          
  for (fil_space_t *space= UT_LIST_GET_FIRST(fil_system.space_list); space;     
       space= UT_LIST_GET_NEXT(space_list, space))                              
  {                                                                             
    ...                                                                         
    if (!node->is_open())                                                       
      continue;                                                                 
    ...                                                                         
  }                                                                             
...                                                                             
}

I think the easiest fix would be to move the closed node to the end of fil_system.space_list in fil_space_t::try_to_close().

On the other hand, fil_system.space_list is a FIFO list, so closed files will end up at the top of the list sooner or later (though mixed with opened nodes, which lets fil_space_t::try_to_close() do fewer iterations while searching for an opened node). So the better solution would be to split this list into separate lists of closed and opened nodes.
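The move-to-end idea can be sketched with a plain std::list (illustrative only; the real code uses UT_LIST and fil_space_t, and runs under fil_system.mutex):

```cpp
// Sketch of the proposed fix: after closing a node, splice it to the
// end of the list so later scans do not re-visit long runs of
// already-closed spaces at the front. Not the actual MariaDB code.
#include <cassert>
#include <list>

struct Space { int id; bool open; };

static std::list<Space> space_list;

// Close the first open space; move the closed element to the back.
static bool try_to_close()
{
  for (auto it= space_list.begin(); it != space_list.end(); ++it)
  {
    if (!it->open)
      continue;                     // the regression: O(n) skipping here
    it->open= false;
    space_list.splice(space_list.end(), space_list, it); // move to back
    return true;
  }
  return false;
}
```

std::list::splice relinks the node in constant time without copying, which is the same property UT_LIST manipulation has.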

Comment by Marko Mäkelä [ 2023-03-07 ]

The change looks good to me. This should be OK to push after some extra stress testing with innodb_open_files=10.

Comment by Matthias Leich [ 2023-03-10 ]

origin/st-10.5-MDEV-30775-try_to_close d08ff5f1c31818cba7af29a3f8139eaac1fc3e8c 2023-03-07T17:05:16+03:00
performed well in RQG testing.

Comment by Claudio Nanni [ 2023-03-10 ]

Tested the change with both SHOW TABLE STATUS and mysqldump both are very fast even with innodb_open_files=10
~133000 tables

Comment by Barry [ 2023-03-31 ]

On WordPress.com we have been bitten by this regression as well after upgrading from MariaDB 10.4.28 to 10.6.12. We have been testing 10.6.13 with the patch for this issue but it doesn't seem to resolve the problem. We still have very poor performance with all of the CPU time spent in `fil_space_t::try_to_close`. We have tried various values for `innodb_open_files` from 10 to 100000 and regardless of the value, once we exceed it, performance degrades significantly. Is there some additional information that would be helpful from our environment to help fully resolve this issue? I'm not really sure how the fix was tested as it doesn't seem to perform better in our tests.

Comment by Marko Mäkelä [ 2023-03-31 ]

Barry, I am sorry to read that. As you probably are aware, MariaDB duplicates some convoluted MySQL logic for adjusting innodb_open_files<10 at startup to something larger based on some other parameters. So, the practical minimum value indeed is innodb_open_files=10. A reasonable minimum value could be much larger.

There used to be a fil_system.LRU list that kept track of the most recently accessed InnoDB data files. The innodb_open_files limit was enforced by closing the least recently used file. Maintaining this list required fil_system.mutex to be acquired and released for each page I/O (if I remember correctly, 3 times for every page write operation), which made fil_system.mutex a performance bottleneck. In MDEV-23855 and MDEV-23399 the list was removed and replaced with something simpler. With those changes, we acquire fil_system.mutex once per several page writes, and it is no longer a bottleneck.

The LRU "replacement" (file handle closing) policy was replaced with something like a FIFO policy. claudio.nanni provided a test case in which this policy led to frequent closing and reopening of a busy file. vlad.lesin tweaked the FIFO policy a little, and the fix was extensively tested in the 10.5 branch. There were some conflicts when merging to the 10.6 branch. I think we had better run our test case on the 10.6 branch once more next week. If it does not reproduce the problem, we would like to get some form of executable test case from you.
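The policy trade-off described here can be sketched minimally (illustrative only, not MariaDB code): under LRU every page I/O must touch the list under fil_system.mutex, while FIFO needs no per-I/O maintenance but may close a busy file.

```cpp
// Contrast of the two replacement policies: the front of the list is
// always the next eviction victim. LRU protects busy files by moving
// them to the back on each access; FIFO never reorders, so a hot file
// eventually reaches the front and gets closed while still in use.
#include <algorithm>
#include <cassert>
#include <list>

using FileList= std::list<int>;   // file ids; front = next victim

// LRU: each page I/O "touches" the file. This per-I/O list maintenance
// is what required taking fil_system.mutex on every page I/O.
static void lru_touch(FileList &l, int id)
{
  auto it= std::find(l.begin(), l.end(), id);
  if (it != l.end())
    l.splice(l.end(), l, it);     // move to back: recently used
}

// Both policies evict from the front when over the open-files limit.
static int evict(FileList &l)
{
  int victim= l.front();
  l.pop_front();
  return victim;
}
```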

Write performance is important to us. With MDEV-26055 and MDEV-26827 in the upcoming 10.6.13 release, contention on buf_pool.mutex should be reduced as well. Those fixes could possibly play some role here, if the files are being closed while a log checkpoint is being executed.

Comment by Vladislav Lesin [ 2023-03-31 ]

Barry, we tested the 10.5 fix, and it showed good results with innodb_open_files=10 and ~133000 tables. Maybe we lost something while merging the fix to 10.6; I will check it. There is one more thing which could have an effect: did you run queries against INFORMATION_SCHEMA.INNODB_SYS_TABLESPACES or INFORMATION_SCHEMA.INNODB_TABLESPACES_ENCRYPTION?

Comment by Barry [ 2023-04-03 ]

Thanks for looking. We also now have a crash in 10.6.13 which seems related to this patch:

Server version: 10.6.13-MariaDB-log source revision: 0760ad3336521b85454b522390eb972a606def23
key_buffer_size=134217728
read_buffer_size=4194304
max_used_connections=258
max_threads=502
thread_count=258
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4256532 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f95386064d8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f95b4ffc598 thread_stack 0x40000
mysys/stacktrace.c:216(my_print_stacktrace)[0x558ac94f641e]
sql/signal_handler.cc:238(handle_fatal_signal)[0x558ac8fb8fe5]
??:0(__restore_rt)[0x7f9e69496140]
??:0(gsignal)[0x7f9e68fcdce1]
??:0(abort)[0x7f9e68fb7537]
os/os0file.cc:3043(os_file_handle_error_cond_exit(char const*, char const*, bool, bool) [clone .cold.120])[0x558ac8c4f3e6]
os/os0file.cc:1458(os_file_close_func(int))[0x558ac930f7d7]
include/os0file.inl:197(fil_node_t::close())[0x558ac942b6f0]
include/fil0fil.h:1549(fil_system_t::move_closed_last_to_space_list(fil_space_t*))[0x558ac942b898]
fil/fil0fil.cc:422(fil_node_open_file(fil_node_t*))[0x558ac942c197]
fil/fil0fil.cc:1102(fil_space_t::read_page0())[0x558ac942c918]
psi/mysql_thread.h:785(dict_load_table_one(st_::span<char const> const&, dict_err_ignore_t, std::deque<char const*, ut_allocator<char const*, true> >&))[0x558ac9418f76]
dict/dict0load.cc:2558(dict_sys_t::load_table(st_::span<char const> const&, dict_err_ignore_t))[0x558ac94193d3]
handler/ha_innodb.cc:13443(ha_innobase::delete_table(char const*))[0x558ac92bee89]
sql/handler.cc:577(hton_drop_table(handlerton*, char const*))[0x558ac8fbf08f]
sql/handler.cc:3138(ha_delete_table(THD*, handlerton*, char const*, st_mysql_const_lex_string const*, st_mysql_const_lex_string const*, bool))[0x558ac8fc4d7c]
sql/sql_table.cc:1469(mysql_rm_table_no_locks(THD*, TABLE_LIST*, st_mysql_const_lex_string const*, st_ddl_log_state*, bool, bool, bool, bool, bool, bool))[0x558ac8e2d297]
sql/sql_db.cc:1123(mysql_rm_db_internal(THD*, st_mysql_const_lex_string const*, bool, bool))[0x558ac8d5cd91]
sql/sql_parse.cc:5186(mysql_execute_command(THD*, bool))[0x558ac8da5bf4]
sql/sql_parse.cc:8038(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x558ac8d96681]
sql/sql_parse.cc:1867(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x558ac8da1551]
sql/sql_parse.cc:1409(do_command(THD*, bool))[0x558ac8da2c80]
sql/sql_connect.cc:1416(do_handle_one_connection(CONNECT*, bool))[0x558ac8ea1f77]
sql/sql_connect.cc:1324(handle_one_connection)[0x558ac8ea22b5]
perfschema/pfs.cc:2204(pfs_spawn_thread)[0x558ac92026ee]
??:0(start_thread)[0x7f9e6948aea7]
??:0(clone)[0x7f9e69091a2f]

Comment by Michael Widenius [ 2023-04-04 ]

I am a bit confused about one thing in the original report.
"That open() call goes to ha_heap.cc calling" .. "ha_heap::open()"

Heap does not open any files, so the number of possible open file descriptors should have no effect on this call.
As ha_heap::open() will only fail if there is not enough RAM, the body above should 'never' (assuming enough RAM) be executed.

What am I missing?

As Claudio reported, we have not been able to repeat this with 133000 tables.
Claudio, can you tell us how the tables were distributed in your test environment? (How many databases?)

Comment by Claudio Nanni [ 2023-04-04 ]

monty

> What am I missing?

In my step-by-step debug setting various breakpoints I saw that this test condition:

if (unlikely((error=open(name,mode,test_if_locked))))

Was calling:

int ha_heap::open(const char *name, int mode, uint test_if_locked) ...

The body of the if was not executed in my tests; only the condition was checked, which makes a call to open().

Of course it is possible I made a mistake, but I tested it many times and I can set it up again if needed.

The table distribution was around 600 identical schemas with 223 (empty) tables each.

Comment by Claudio Nanni [ 2023-04-04 ]

I quickly redid a test with 10.5.17 and 10.5.20, and I can confirm that 10.5.20 is free of the problem (see attached).

test-MDEV-30775-10520.log test-MDEV-30775-10517.log

Comment by Vladislav Lesin [ 2023-04-04 ]

Barry, I reproduced the performance regression on 10.6 and am now working on the fix.

Comment by Barry [ 2023-04-06 ]

Thank you! Please let me know if you want me to test anything. I have a somewhat easy way to reproduce.

Comment by Vladislav Lesin [ 2023-04-07 ]

Barry, I have found the reason and fixed it. The fix is under testing now. We can provide you with a custom build once the testing is done.

Comment by Barry [ 2023-04-07 ]

Thanks! We can build from a Github hash/branch so just let us know when/what to test. No need to provide us a custom build.

Comment by Vladislav Lesin [ 2023-04-11 ]

Barry, you can try https://github.com/MariaDB/server/tree/bb-10.6-MDEV-30775-try_to_close. It passed regression testing, but hasn't yet been tested with RQG.

Comment by Barry [ 2023-04-11 ]

Thanks. I tested this but it doesn't seem to fix the issue. It's easier and less disruptive for us to test on a replica vs a primary DB so I built MariaDB from your branch and upgraded a replica that was running 10.6.12. I set `innodb_open_files` to 10 so we can replicate the problem faster and started the replication IO and SQL threads. The replica can't open/close tables fast enough for replication to keep up (there aren't very many writes). I expect the same issue to happen on a primary as well. perf shows all CPU time spent in `fil_space_t::try_to_close`.

  96.09%  mariadbd  mariadbd            [.] fil_space_t::try_to_close
   0.46%  mariadbd  [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
   0.17%  mariadbd  [kernel.kallsyms]   [k] ipt_do_table
   0.07%  mariadbd  [kernel.kallsyms]   [k] copy_user_generic_string
   0.06%  mariadbd  [kernel.kallsyms]   [k] _aesni_enc4
   0.06%  mariadbd  mariadbd            [.] MYSQLparse
   0.05%  mariadbd  [kernel.kallsyms]   [k] native_read_msr
   0.04%  mariadbd  mariadbd            [.] crc32_pclmul
   0.04%  mariadbd  mariadbd            [.] Lex_input_stream::get_text
   0.04%  mariadbd  [kernel.kallsyms]   [k] clear_page_rep
   0.04%  mariadbd  mariadbd            [.] L_AES_GCM_decrypt_avx2_ghash_128
   0.04%  mariadbd  mariadbd            [.] rec_get_offsets_func
   0.03%  mariadbd  [kernel.kallsyms]   [k] memset
   0.03%  mariadbd  [kernel.kallsyms]   [k] kernel_fpu_begin_mask
   0.03%  mariadbd  mariadbd            [.] l_find
   0.03%  mariadbd  [kernel.kallsyms]   [k] native_write_msr
   0.03%  mariadbd  [kernel.kallsyms]   [k] update_sd_lb_stats.constprop.0
   0.03%  mariadbd  mariadbd            [.] buf_page_get_low
   0.02%  mariadbd  [kernel.kallsyms]   [k] crypt_convert
   0.02%  mariadbd  [kernel.kallsyms]   [k] crc32c_pcl_intel_update
   0.02%  mariadbd  [kernel.kallsyms]   [k] delay_halt_mwaitx
   0.02%  mariadbd  mariadbd            [.] my_b_vprintf
   0.02%  mariadbd  [kernel.kallsyms]   [k] __const_udelay
   0.02%  mariadbd  mariadbd            [.] btr_cur_t::search_leaf
   0.02%  mariadbd  libjemalloc.so.2    [.] free
   0.02%  mariadbd  [kernel.kallsyms]   [k] __list_del_entry_valid
   0.02%  mariadbd  mariadbd            [.] Lex_input_stream::unescape
   0.02%  mariadbd  [kernel.kallsyms]   [k] kmem_cache_free_bulk
   0.02%  mariadbd  mariadbd            [.] crc32c_3way
   0.02%  mariadbd  [kernel.kallsyms]   [k] read_tsc
   0.02%  mariadbd  [kernel.kallsyms]   [k] mlx5_eq_comp_int
   0.02%  mariadbd  libjemalloc.so.2    [.] malloc
   0.02%  mariadbd  mariadbd            [.] my_strnncollsp_simple
   0.01%  mariadbd  [kernel.kallsyms]   [k] megasas_build_and_issue_cmd_fusion
   0.01%  mariadbd  [kernel.kallsyms]   [k] kmem_cache_free
   0.01%  mariadbd  [kernel.kallsyms]   [k] kmem_cache_alloc
   0.01%  mariadbd  [kernel.kallsyms]   [k] tasklet_action_common.constprop.0
   0.01%  mariadbd  [kernel.kallsyms]   [k] syscall_return_via_sysret
   0.01%  mariadbd  mariadbd            [.] mysql_update

Comment by Claudio Nanni [ 2023-04-12 ]

Barry,

What type of operation is resulting in such a profile?
Is it a specific test or normal operations?

In my tests the fix works great: I just tested SHOW TABLE STATUS on 3200 schemas with 223 tables each (around 700k tables), and each execution is extremely fast (~0.1s).

real 4m22,851s
user 0m9,045s
sys 0m7,520s

Also mysqldump that was extremely slow became very fast.

Could you attach the output of SHOW GLOBAL VARIABLES, if it doesn't contain any sensitive information?

Comment by Barry [ 2023-04-13 ]

This is from a server running a single replication sql thread. Normal traffic, no specific test. 7090 schemas, 389072 tables in total.

SHOW GLOBAL VARIABLES attached with some host-specific information removed.

show-global-variables.txt

Comment by Marko Mäkelä [ 2023-04-13 ]

Barry, could you also please show more detail where fil_space_t::try_to_close() is spending the time? In perf report or perf top, hit 'a' for annotating the code (showing the instructions where most time is being spent). There is also some other command for toggling source code view, which could be useful.

I am not sure if it is related to this, but I have the feeling that it could be useful to have a larger fil_system.spaces.n_cells than the hard-coded 50767 when you have hundreds of thousands of InnoDB tablespaces. If you use partitioning, each partition or subpartition counts as a separate InnoDB table.
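A back-of-the-envelope sketch of that concern (the table counts are taken from this report; the helper itself is illustrative):

```cpp
// With a fixed number of hash cells, the expected chain length a lookup
// has to walk grows linearly with the number of tablespaces. Using the
// hard-coded 50767 cells mentioned above and Barry's 389072 tables,
// an average lookup walks a chain of roughly 7.7 entries.
#include <cassert>

static double avg_chain(double spaces, double cells)
{
  return spaces / cells;  // expected entries per hash bucket
}
```

With an appropriately sized table the average chain length would stay near 1, which is the point of making n_cells configurable or adaptive.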

Comment by Barry [ 2023-04-13 ]

Annotated perf attached.

annotated-perf.txt

Comment by Vladislav Lesin [ 2023-04-14 ]

We found bug MDEV-31049, which affects 10.6 with a small --innodb-open-files. The fix is under testing.

Comment by Vladislav Lesin [ 2023-04-14 ]

Barry, from the SHOW GLOBAL VARIABLES output: version_source_revision 4c79e15cc3716f69c044d4287ad2160da8101cdc. This corresponds to mariadb-10.6.12, while the latest commit of bb-10.6-MDEV-30775-try_to_close is 7c0af5c56994f37c45d40e45fa4e15743acaeb06.

Comment by Marko Mäkelä [ 2023-04-14 ]

In my experience, version_source_revision may not always (or at all) be updated by incremental builds. That might explain the finding. I did not try to match the disassembly of the annotated fil_space_t::try_to_close() with any source code.

Comment by Vladislav Lesin [ 2023-04-14 ]

According to the unfolded fil_space_t::try_to_close() in annotated-perf.txt, about 50% of the time was spent here:

       │     fil_space_t::try_to_close(bool):
       │ 40:   mov    0x3c(%rbx),%eax
 50.91 │       test   %eax,%eax
  0.59 │       je     122

This corresponds to the FIL_TYPE_TEMPORARY case in the following switch:

bool fil_space_t::try_to_close(bool print_info)                                 
{                                                                               
  mysql_mutex_assert_owner(&fil_system.mutex);                                  
  for (fil_space_t &space : fil_system.space_list)                              
  {                                                                             
    switch (space.purpose) {                                                    
    case FIL_TYPE_TEMPORARY:                                                    
      continue;                                                                 
    case FIL_TYPE_IMPORT:                                                       
      break;                                                                    
    case FIL_TYPE_TABLESPACE:                                                   
      if (!fil_is_user_tablespace_id(space.id))                                 
        continue;                                                               
    }                                                                           
...                                                                             
}

The following snippet:

  0.04 │ 80:   mov    0x48(%rbx),%r13
 20.70 │       test   %r13,%r13
  0.58 │       je     122
  0.00 │       cmpl   $0xffffffff,0x10(%r13)
 20.72 │       je     122

Corresponds to:

    if (!node)                                                                                                              
      /* fil_ibd_create() did not invoke fil_space_t::add() yet */              
      continue;                                                                 
    ...                                      
                                                                                
    if (!node->is_open())                                                       
      continue;

Comment by Vladislav Lesin [ 2023-04-14 ]

The fix addressed the case where fil_system.space_list contained many closed spaces at its top, and fil_space_t::try_to_close() had to iterate over them to find the first opened space. The above comment means that not only closed spaces can make fil_space_t::try_to_close() slow, but also a large number of temporary tables at the top of fil_system.space_list.

Comment by Marko Mäkelä [ 2023-04-14 ]

For the record, there should be only one fil_system.temp_space, but it may contain multiple files if innodb_temp_data_file_path specifies so. However, the fil_system.space_list should contain fil_system.temp_space only once. It might make sense to revise the data structures to have only FIL_TYPE_TABLESPACE in this list.

Comment by Vladislav Lesin [ 2023-04-17 ]

Barry, could you please confirm you are testing commit 7c0af5c56994f37c45d40e45fa4e15743acaeb06? BTW, the changes were pushed to 10.6, so you can use the latest 10.6 from the repository for testing.

Comment by Barry [ 2023-04-19 ]

Sorry for the delay and confusion - yes the tests were done on the correct version but since it didn't resolve the issue I switched back to 10.6.12 before I provided the output of SHOW GLOBAL VARIABLES. I can test again using the 10.6 branch.

Comment by Vladislav Lesin [ 2023-04-21 ]

Barry, could you please build the latest 10.6 (it must contain commit 0cca8166f3111901019dcd33747a1a1dfd9e66d1) with the -DCMAKE_BUILD_TYPE=RelWithDebInfo option, then test it, and, if performance is still low:

1. attach to the process with gdb, i.e. "gdb -p mariadbd_pid",

2. set gdb logging to file and turn off gdb pagination, i.e.

set logging on
set logging file 10.6-MDEV-30775-gdb.log
set pagination off

3. load the attached space_list.gdb :

source space_list.gdb

4. launch function from the above gdb file:

show_space_list

5. quit gdb (Ctrl+D) and share 10.6-MDEV-30775-gdb.log with us.

Comment by Barry [ 2023-05-05 ]

Sorry for the delayed reply. I can confirm latest 10.6 (fe89df42686fd41e986dc775e12ad6f3594d5bca) works fine for me. Thanks for all of the help resolving this issue!

Comment by Marko Mäkelä [ 2023-05-08 ]

Barry, thank you for bringing up the issue well in time for the fix to be included in the quarterly releases.

Generated at Thu Feb 08 10:18:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.