[MDEV-14659] InnoDB scalability issue found in MariaDB code for complex 'select' queries on Arm platform Created: 2017-12-15  Updated: 2022-08-26  Resolved: 2022-08-26

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3.2
Fix Version/s: 10.6.0

Type: Bug Priority: Major
Reporter: Sandeep sethia Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: innodb, performance
Environment:

ubuntu 16.0.2


Attachments: PNG File thread.png    
Issue Links:
Blocks
is blocked by MDEV-24142 rw_lock_t has unnecessarily complex w... Closed
PartOf
is part of MDEV-14442 Optimization for ARM64 platform. Open
Relates
relates to MDEV-22850 Reduce buf_pool.page_hash latch conte... Closed
relates to MDEV-22871 Contention on the buf_pool.page_hash Closed
Epic Link: arm64 optimization
Sprint: 10.0.34

 Description   

Hi Sergey,

While testing the performance of complex queries, I see a huge degradation in performance as the number of client threads increases. I used the mysqlslap benchmark to evaluate the complex queries.

Background of tables: two tables, each populated with 4096 records, share a common column that is used to join them.

Sample mysqlslap command used:

mysqlslap -uroot --concurrency=24  --create-schema=test --no-drop --number-of-queries=500 --iterations=10 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category' -p

Sample output: the above command runs 24 client threads for a total of 500 queries and repeats the same operation 10 times. Number of clients running queries: 24. Average number of queries per client: 20.
Attached is the time taken in seconds for each thread count tested (for example 1, 8, 16 threads).

On profiling, I see that atomic operations such as cmpxchg are the hottest spots.

Arm platform:

   │                                                           MY_MEMORY_ORDER_RELAXED);                                          
       │             while (lock_copy > threshold) {                                                                                 
  0.00 │     ↓ b.le   a0cf14 <pfs_rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) [clone .constprop.112]+0x1
       │                     if (my_atomic_cas32_strong_explicit(&lock->lock_word,                                                    
  0.04 │       ldr    w2, [x29,#80]                                                                                                   
  8.68 │ f0:   ldaxr  w0, [x19]                                                                                                       
  0.00 │       cmp    w0, w2                                                                                                          
  0.01 │     ↓ b.ne   a0cefc <pfs_rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) [clone .constprop.112]+0x1
88.19 │       stxr   w3, w1, [x19]                                                                                                   
  0.00 │     ↑ cbnz   f0                                                                                                              
  2.29 │104: ↑ b.ne   a0ced4 <pfs_rw_lock_s_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) [clone .constprop.112]+0xd
       │     rw_lock_s_lock_low():                    
 
Samples: 15M of event 'cycles:ppp', Event count (approx.): 9845312975908
Overhead  Command  Shared Object        Symbol                                                                                        ◆
  32.58%  mysqld   mysqld               [.] pfs_rw_lock_s_lock_func                                                                   ▒
  26.85%  mysqld   mysqld               [.] row_search_mvcc                                                                           ▒
  18.42%  mysqld   mysqld               [.] pfs_rw_lock_s_unlock_func                                                                 ▒
  11.13%  mysqld   mysqld               [.] pfs_rw_lock_s_unlock_func   

I tried to relax the memory order from seq_cst to acquire/relaxed, which changed the generated code from ldaxr/stlxr to ldaxr/stxr, but it gave little benefit.

Basically, this is a low-level function that tries to lock an rw-lock in s-mode. It performs no spinning.

On Intel, I don't see this function being hot:

0.02%  mysqld   mysqld               [.] pfs_rw_lock_s_lock_func                                                                                                                         ▒
   0.02%  mysqld   mysqld               [.] buf_page_get_gen                                                                                                                                ▒
   0.02%  mysqld   mysqld               [.] page_cur_search_with_match_bytes                                                                                                                ▒
   0.02%  mysqld   mysqld               [.] row_search_mvcc                                                                                                                                 ▒
   0.02%  mysqld   mysqld               [.] pfs_rw_lock_s_lock_fun

Can we do away with the atomic operation, since these are select queries? If not, why can't we include spinning/pause in every lock function where we try to acquire the lock?

Sample code path

rw_lock_lock_word_decr(
	/*===================*/
	        rw_lock_t*     lock,          /*!< in/out: rw-lock */
	        ulint          amount,        /*!< in: amount to decrement */
	        lint           threshold)     /*!< in: threshold of judgement */
	{
	#ifdef INNODB_RW_LOCKS_USE_ATOMICS
	        lint local_lock_word;
	
	        os_rmb;
	        local_lock_word = lock->lock_word;
	        while (local_lock_word > threshold) {
	               if (os_compare_and_swap_lint(&lock->lock_word,
	                                           local_lock_word,
	                                           local_lock_word - amount)) {
	                       return(true);
	               }
	               local_lock_word = lock->lock_word;
	        }
	        return(false);
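
For readers who want to experiment with the loop outside the server, it can be restated with std::atomic. This is only a minimal sketch under stated assumptions: lock_word_decr and kUnlocked are hypothetical names, the starting value is illustrative, and std::atomic::compare_exchange_strong stands in for os_compare_and_swap_lint. The acquire/relaxed ordering corresponds to the relaxation experiment described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative "unlocked" value for the sketch; hypothetical, not taken
// from the server source quoted in this issue.
constexpr int32_t kUnlocked = 0x20000000;

// Try to subtract `amount` from `word` while it stays above `threshold`.
// Returns true if the decrement was applied, false if the current value
// does not allow it (the lock is held in a conflicting mode).
inline bool lock_word_decr(std::atomic<int32_t>& word, int32_t amount,
                           int32_t threshold = 0) {
  assert(amount > 0);
  int32_t seen = word.load(std::memory_order_relaxed);
  while (seen > threshold) {
    // On failure, compare_exchange_strong stores the current value into
    // `seen`, so no separate re-read is needed before the next attempt.
    if (word.compare_exchange_strong(seen, seen - amount,
                                     std::memory_order_acquire,
                                     std::memory_order_relaxed)) {
      return true;
    }
  }
  return false;
}
```

Every failed attempt repeats the exclusive load/store pair on the same cache line, which is where the stxr samples in the profile above accumulate.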



 Comments   
Comment by Sandeep sethia [ 2017-12-15 ]

rw_lock_lock_word_decr was wrongly copied from MySQL, as I was verifying the code path there. It is almost the same, so the issue is confirmed.

Comment by Sandeep sethia [ 2017-12-15 ]

I tried 4096k records in both tables and copied a few of them to one table so that the join can happen.

Comment by Sergey Vojtovich [ 2017-12-15 ]

Generally I agree that this code is far from perfect performance-wise. We could try optimising it, but it requires rw-locks refactoring.

A much simpler option is to try adding UT_RELAX_CPU() into this loop (or MY_RELAX_CPU() if that's recent 10.3).
Another simple option is to verify there's no false sharing.

But none of my guesses can explain why ARM is so much slower here.
Could you confirm database and CPU are warm enough when you start this test? Could you try increasing iterations e.g. up to 100 and see if it makes any difference?
Are you comparing MariaDB on Intel vs MariaDB on ARM? Or MySQL on Intel vs MariaDB on ARM?
Could you try making these tables 10x bigger (or 100x bigger)? Does it make any difference?
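
The "relax inside the loop" suggestion can be sketched as follows. This is a hedged illustration, not the server code: relax_cpu() is a hypothetical stand-in for UT_RELAX_CPU()/MY_RELAX_CPU(), the choice of pause on x86 and yield on AArch64 is an assumption of this sketch, and lock_word_decr_relaxed is an invented name.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for UT_RELAX_CPU()/MY_RELAX_CPU(): a hint that the
// current thread is busy-waiting. The per-architecture choice is an
// assumption of this sketch.
inline void relax_cpu() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_ia32_pause();
#elif (defined(__GNUC__) || defined(__clang__)) && defined(__aarch64__)
  asm volatile("yield" ::: "memory");
#else
  std::this_thread::yield();
#endif
}

// Same decrement loop as in the description, but with a relax hint after
// every failed compare-and-swap instead of retrying immediately.
inline bool lock_word_decr_relaxed(std::atomic<int32_t>& word, int32_t amount,
                                   int32_t threshold = 0) {
  assert(amount > 0);
  int32_t seen = word.load(std::memory_order_relaxed);
  while (seen > threshold) {
    if (word.compare_exchange_weak(seen, seen - amount,
                                   std::memory_order_acquire,
                                   std::memory_order_relaxed)) {
      return true;
    }
    relax_cpu();  // lost the race (or spurious failure): back off briefly
  }
  return false;
}
```

Whether such a hint helps depends on the CPU and on how many threads contend for the same word, so it is something to measure rather than assume.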

Comment by Sandeep sethia [ 2017-12-15 ]

I tried including the RELAX_CPU code in the loop, but found no improvement. I repeated the relax_cpu call 5, 10 and 50 times, but performance degraded. I also tried a COMPILER BARRIER in a while loop of 50 iterations to add a delay, but saw no benefit.

To answer your questions:

I ran it with 30 and 50 iterations, but the results are the same.

I am comparing MariaDB on Intel vs MariaDB on ARM, but the issue seems to exist in MySQL as well.

I feel cmpxchg is not working well on the ARM platform, so we need some other workaround.

Comment by Sergey Vojtovich [ 2017-12-15 ]

Thanks for confirming. Now I can think only of 2 options: false sharing and refactoring. Am I correct that cache line size on ARM is 128?

Comment by Sandeep sethia [ 2017-12-15 ]

Yes, it's 128, I believe.

Comment by Sergey Vojtovich [ 2017-12-16 ]

ssethia, could you check if this patch makes any difference so that we can exclude false sharing guess?

diff --git a/storage/innobase/include/sync0rw.h b/storage/innobase/include/sync0rw.h
index ae5f410..46c13a3 100644
--- a/storage/innobase/include/sync0rw.h
+++ b/storage/innobase/include/sync0rw.h
@@ -571,7 +571,9 @@ struct rw_lock_t
 #endif /* UNIV_DEBUG */
 {
        /** Holds the state of the lock. */
+       char pada[128];
        int32_t lock_word;
+       char padb[128];
 
        /** 1: there are waiters */
        int32_t waiters;
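
The same false-sharing experiment can be expressed with alignas instead of manual char padding. The following is a minimal sketch under stated assumptions: padded_lock is a hypothetical, reduced layout rather than the real rw_lock_t, and the 128-byte line size is the value assumed in this discussion, not something the code detects.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption taken from the discussion: 128-byte cache lines.
constexpr std::size_t kCacheLine = 128;

// Hypothetical reduced lock layout (not the real rw_lock_t): each hot
// field starts on its own cache line, so writes to one field do not
// invalidate the line that holds the other.
struct padded_lock {
  alignas(kCacheLine) std::atomic<int32_t> lock_word{0};
  alignas(kCacheLine) std::atomic<int32_t> waiters{0};
};

static_assert(alignof(padded_lock) == kCacheLine, "lock must be line-aligned");
static_assert(sizeof(padded_lock) == 2 * kCacheLine, "one line per field");

// Byte distance between the two fields of one lock instance.
inline std::size_t field_distance(const padded_lock& l) {
  const char* lo = reinterpret_cast<const char*>(&l.lock_word);
  const char* hi = reinterpret_cast<const char*>(&l.waiters);
  assert(hi > lo);
  return static_cast<std::size_t>(hi - lo);
}
```

Compared with pada/padb, this also aligns the start of the object, at the cost of making every lock instance two cache lines large.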

Comment by Sandeep sethia [ 2017-12-17 ]

I see a small improvement with the above patch, around 5-8%, but the issue still persists.

Comment by Sandeep sethia [ 2017-12-20 ]

A queued spin lock in userspace could be a solution, but I still need to try it. I tried some backoff in the while loop, but saw no benefit.
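
To make the queueing idea concrete, here is a minimal userspace sketch. Note that it uses a ticket lock, which is a simpler FIFO scheme than a full queued (MCS-style) spin lock, and it is an exclusive lock, not a reader-writer lock. ticket_lock and count_with_ticket_lock are hypothetical names introduced only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Ticket lock: each thread takes a ticket with one fetch_add and then only
// reads `serving_` until its turn comes, so waiters do not keep retrying a
// compare-and-swap on the lock word.
class ticket_lock {
  std::atomic<uint32_t> next_{0};     // next ticket to hand out
  std::atomic<uint32_t> serving_{0};  // ticket currently allowed to enter

 public:
  void lock() {
    const uint32_t my_ticket = next_.fetch_add(1, std::memory_order_relaxed);
    while (serving_.load(std::memory_order_acquire) != my_ticket) {
      std::this_thread::yield();
    }
  }
  void unlock() {
    // Only the lock holder writes serving_, so load + store is sufficient.
    serving_.store(serving_.load(std::memory_order_relaxed) + 1,
                   std::memory_order_release);
  }
};

// Hypothetical driver: `threads` workers each increment a plain counter
// `iterations` times under the lock and the final count is returned.
inline long count_with_ticket_lock(int threads, int iterations) {
  assert(threads > 0 && iterations >= 0);
  ticket_lock lock;
  long counter = 0;
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&lock, &counter, iterations] {
      for (int i = 0; i < iterations; ++i) {
        lock.lock();
        ++counter;
        lock.unlock();
      }
    });
  }
  for (auto& th : pool) th.join();
  return counter;
}
```

All waiters still poll the same serving_ word; an MCS-style queue would give each waiter its own location to spin on, which is the property that makes queued spin locks attractive under heavy contention.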

Comment by Marko Mäkelä [ 2020-07-15 ]

I wonder if MDEV-22871 could have addressed this. That was a large change, involving refactoring the buf_pool.page_hash and introducing a simpler variant of rw_lock_t for which the read-lock acquisition became a simple std::atomic::fetch_add().
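
As an illustration of what a fetch_add-based read-lock acquisition can look like, here is a minimal sketch. It is an assumption-laden example, not the class introduced by MDEV-22871: simple_rw_latch, kWriter and the method names are hypothetical, and waiting, fairness and instrumentation are omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical flag bit that is set while a writer holds the latch.
constexpr uint32_t kWriter = 1U << 31;

// Minimal latch sketch: the low bits of `word_` count readers.
class simple_rw_latch {
  std::atomic<uint32_t> word_{0};

 public:
  bool read_trylock() {
    // One unconditional atomic add instead of a compare-and-swap retry loop.
    if (!(word_.fetch_add(1, std::memory_order_acquire) & kWriter)) {
      return true;
    }
    word_.fetch_sub(1, std::memory_order_relaxed);  // writer present: undo
    return false;
  }
  void read_unlock() {
    const uint32_t prev = word_.fetch_sub(1, std::memory_order_release);
    assert((prev & ~kWriter) != 0);  // there must have been a reader
    (void)prev;
  }
  bool write_trylock() {
    uint32_t expected = 0;  // succeeds only with no readers and no writer
    return word_.compare_exchange_strong(expected, kWriter,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
  }
  void write_unlock() {
    // Clear only the writer bit; a reader may have a transient increment
    // in flight that it will undo itself.
    word_.fetch_and(~kWriter, std::memory_order_release);
  }
  uint32_t readers() const {
    return word_.load(std::memory_order_relaxed) & ~kWriter;
  }
};
```

A reader never has to retry because another reader changed the count in the meantime. How fetch_add itself is compiled depends on the target: without the ARMv8.1 LSE atomics it is still a load-exclusive/store-exclusive loop.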

Comment by Marko Mäkelä [ 2022-08-26 ]

I think that this was fixed by MDEV-24142 and some related work in MariaDB Server 10.6.0.
