[MDEV-7943] pthread_getspecific() takes 0.76% in OLTP RO Created: 2015-04-09  Updated: 2015-06-19  Resolved: 2015-06-19

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 10.1
Fix Version/s: 10.1.6

Type: Bug Priority: Major
Reporter: Sergey Vojtovich Assignee: Sergey Vojtovich
Resolution: Fixed Votes: 0
Labels: None

Epic Link: Performance: micro optimizations
Sprint: 10.1.6-1

 Description   

Data comes from a Sandy Bridge system running sysbench OLTP RO with 1 thread against 1 table.

Call graphs:

-   0.76%  mysqld  libpthread-2.15.so   [.] pthread_getspecific
   - pthread_getspecific
      + 19.28% trx_is_interrupted(trx_t const*)
      + 8.56% net_real_write
      + 7.94% vio_io_wait
      + 5.19% execute_sqlcom_select(THD*, TABLE_LIST*)
      + 4.35% my_free
      + 3.82% String_list::append_str(st_mem_root*, char const*)
      + 3.70% my_real_read(st_net*, unsigned long*, char)
      + 3.26% Item_equal::add_const(Item*, Item*)
      + 3.11% MYSQLparse(THD*)
      + 3.04% make_select(TABLE*, unsigned long long, unsigned long long, Item*, bool, int*)
      + 2.62% Item_equal::Item_equal(Item*, Item*, bool)
      + 2.61% Item_func::fix_fields(THD*, Item**)
      + 2.41% get_best_combination(JOIN*)
      + 2.39% st_select_lex::init_query()
      + 2.16% check_simple_equality(Item*, Item*, Item*, COND_EQUAL*)
      + 1.80% Item_ident::Item_ident(Name_resolution_context*, char const*, char const*, char const*)
      + 1.79% build_equal_items(JOIN*, Item*, COND_EQUAL*, List<TABLE_LIST>*, bool, COND_EQUAL**, bool) [clone .constprop.262]
      + 1.77% mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_
      + 1.63% st_select_lex::add_joined_table(TABLE_LIST*)
      + 1.59% make_leaves_list(List<TABLE_LIST>&, TABLE_LIST*, bool, TABLE_LIST*)
      + 1.55% my_malloc
      + 1.44% DsMrr_impl::dsmrr_info_const(unsigned int, st_range_seq_if*, void*, unsigned int, unsigned int*, unsigned int*, Cost_estimate*)
      + 1.34% Item_bool_func2::Item_bool_func2(Item*, Item*)
      + 1.31% Item_int::Item_int(char const*, long long, unsigned int)
      + 1.17% st_select_lex::add_item_to_list(THD*, Item*)
      + 1.06% Eq_creator::create(Item*, Item*) const
      + 0.85% cmp_item::get_comparator(Item_result, Item*, charset_info_st const*)
      + 0.85% st_select_lex::save_leaf_tables(THD*)
      + 0.72% ha_innobase::multi_range_read_init(st_range_seq_if*, void*, unsigned int, unsigned int, st_handler_buffer*)
      + 0.71% Item_func::setup_args_and_comparator(THD*, Arg_comparator*)
      + 0.61% key_and(RANGE_OPT_PARAM*, SEL_ARG*, SEL_ARG*, unsigned int) [clone .part.152]
      + 0.60% get_quick_keys(PARAM*, QUICK_RANGE_SELECT*, st_key_part*, SEL_ARG*, unsigned char*, unsigned int, unsigned char*, unsigned int)
      + 0.56% Item_func_between::Item_func_between(Item*, Item*, Item*)
      + 0.52% sql_memdup(void const*, unsigned long)
      + 0.51% Item_cache::get_cache(Item const*, Item_result)

The most frequent caller is trx_is_interrupted()/thd_kill_level(): it calls current_thd unconditionally.
Note: it may be fixed in Monty's fastconnect tree.



 Comments   
Comment by Sergei Golubchik [ 2015-04-09 ]

One option would be to use thread-local variables in gcc. They might be faster (needs to be tested), and with macros one can easily hide the underlying implementation (getspecific or TLS) from the caller.
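A minimal sketch of the macro idea, not code from the MariaDB tree: `Session` is a hypothetical stand-in for THD, and `USE_PTHREAD_GETSPECIFIC`, `CURRENT_SESSION()` and `SET_CURRENT_SESSION()` are names invented for this illustration. Callers only see the two macros, so the storage mechanism can be switched at build time.

```cpp
#include <cassert>

// Hypothetical stand-in for the server's per-connection THD object.
struct Session { int id; };

#ifdef USE_PTHREAD_GETSPECIFIC
// POSIX thread-specific data: every lookup is a library call.
#include <pthread.h>

static pthread_key_t session_key;
static pthread_once_t session_key_once = PTHREAD_ONCE_INIT;

static void make_session_key() {
  int rc = pthread_key_create(&session_key, nullptr);
  assert(rc == 0);
  (void) rc;  // silence the unused warning when NDEBUG is defined
}

static inline Session *get_session() {
  pthread_once(&session_key_once, make_session_key);
  return static_cast<Session *>(pthread_getspecific(session_key));
}

static inline void set_session(Session *s) {
  pthread_once(&session_key_once, make_session_key);
  pthread_setspecific(session_key, s);
}

#define CURRENT_SESSION() (get_session())
#define SET_CURRENT_SESSION(s) (set_session(s))
#else
// Compiler-managed TLS: a lookup is a plain variable access.
static thread_local Session *tls_session = nullptr;

#define CURRENT_SESSION() (tls_session)
#define SET_CURRENT_SESSION(s) (tls_session = (s))
#endif
```

A thread would call `SET_CURRENT_SESSION(&session)` once at start-up and use `CURRENT_SESSION()` everywhere else. Whether the TLS variant is actually faster on a given platform still has to be measured.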

Comment by Sergey Vojtovich [ 2015-04-28 ]

serg, please review 3 patches for this task.

Comment by Sergey Vojtovich [ 2015-05-13 ]

serg, please also review the 3rd patch for this task.

Comment by Alexey Kopytov [ 2015-05-20 ]

Out of curiosity, what happened to the thread-local variables idea? Has it proved to be not fast enough to replace pthread_getspecific() calls?

Comment by Sergey Vojtovich [ 2015-05-20 ]

alexeykopytov, according to my study (with no good benchmarks, though), TLS should be faster than pthread_getspecific(), but still slower than passing function args.

So far we have reduced the number of pthread_getspecific() calls from ~1100 to ~300 per OLTP RO transaction. Alas, there are other workloads which won't benefit from this.

The plan is: pass THD through wherever possible, otherwise fall back to TLS for the cases where it is worth it.
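The difference between the two approaches can be sketched as follows. This is an illustration under assumptions, not the actual patch: `Session` stands in for THD, and `current_session()`, `is_interrupted()`, `scan_rows()` and the `tls_lookups` counter are hypothetical names. The counter exists only to show that the pass-through path performs no thread-local lookup.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for THD; only the kill flag matters here.
struct Session { bool killed; };

// Fallback storage for call sites that have no session in scope.
static thread_local Session *tls_session = nullptr;
// Counts fallback lookups, for illustration only.
static thread_local std::size_t tls_lookups = 0;

static inline Session *current_session() {
  ++tls_lookups;
  return tls_session;
}

// Preferred form: the caller already holds the session and passes it down.
static inline bool is_interrupted(const Session *s) {
  return s != nullptr && s->killed;
}

// Fallback form: one thread-local lookup per call.
static inline bool is_interrupted() {
  return is_interrupted(current_session());
}

// Checks for interruption once per row without touching thread-local state.
static std::size_t scan_rows(const Session *s, std::size_t rows) {
  assert(s != nullptr);  // callers on this path always own a session
  std::size_t done = 0;
  for (; done < rows; ++done)
    if (is_interrupted(s)) break;
  return done;
}
```

A per-row check such as `scan_rows(&session, n)` reads the kill flag through its argument alone, while the argument-less `is_interrupted()` stays available for code that cannot easily be given a session pointer.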

Comment by Alexey Kopytov [ 2015-05-20 ]

I see, thanks. I was asking because I considered the same idea for Percona Server a few years ago. Leveraging thread-local storage looked like low-hanging fruit for optimizing all those pthread_getspecific() call sites without introducing invasive code changes, but I never got around to evaluating it.

Comment by Sergey Vojtovich [ 2015-06-18 ]

serg, please review another patch for this bug:

[Commits] a5799f5: MDEV-7943 - pthread_getspecific() takes 0.76% in OLTP RO

Comment by Sergey Vojtovich [ 2015-06-19 ]

The number of pthread_getspecific() calls was reduced from ~1100 to 290. Further improvements (if any) will be done separately.
