  MariaDB Server / MDEV-14482

Cache line contention on ut_rnd_ulint_counter()

Details

    Description

      Hi Sergey,

      I see a cache coherency and interconnect traffic issue around the random number generator ut_rnd_ulint_counter() used for backoff in the spin lock.
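
      (For illustration, a minimal sketch of the contended pattern, with simplified names and constants that are not the actual InnoDB code: every spinning thread read-modify-writes a single global counter, so the cache line holding it ping-pongs between cores.)

      /* Simplified sketch only -- the real implementation lives in InnoDB's ut0rnd files. */
      static unsigned long rnd_counter = 65654363;   /* shared by all threads */

      unsigned long rnd_interval(unsigned long low, unsigned long high)
      {
          /* Every spinning thread updates the same global, so the cache
             line containing rnd_counter bounces between CPUs under load. */
          rnd_counter = rnd_counter * 1103515245 + 12345;
          return low == high ? low : low + rnd_counter % (high - low + 1);
      }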

      I was looking at the MySQL repository and see that there is a patch

      https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1

      intended to resolve this issue. I see a good amount of benefit: around 5-10% on the Arm platform and 3-5% on Intel for the sysbench OLTP write test. In general this looks to be a good idea.

      I also see that MySQL backed it out in 5.20 for performance reasons, because the change introduced a scalability issue (revert commit dedc8b3d567fbb92ce912f1559fe6a08b2857045).

      In version 8.0 it has been committed again as https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1 with some new changes.

      I am trying to port this to MariaDB, but there are a lot of code differences, so I cannot port it directly and provide a patch to test. The file my_thread_local.h is not available, and if I try to create it there are merge clashes with my_pthread.h.

      Please let me know your opinion; if required, I can share a patch soon with the above issues resolved.


          Activity

            ssethia Sandeep sethia created issue -
            elenst Elena Stepanova made changes -
            Fix Version/s 10.3 [ 22126 ]
            Assignee Sergey Vojtovich [ svoj ]

            svoj Sergey Vojtovich added a comment:

            MySQL switched from global RNG state to per-thread state using thread-local storage (TLS). But TLS performance as such is far from perfect. If my idea explained in MDEV-14374 proves to be faster, we won't need an RNG in the spin loop at all.
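
            (As a rough sketch of the per-thread idea -- not the actual MySQL patch -- the shared counter can be replaced by thread_local state, so spinning threads never write to a shared cache line. Constants and seeding below are illustrative.)

            #include <cstdint>

            /* Sketch only: per-thread xorshift32 state; no shared writes. */
            static inline uint32_t thread_local_rnd()
            {
                thread_local uint32_t state = 0x9E3779B9u;  /* any non-zero seed */
                state ^= state << 13;
                state ^= state >> 17;
                state ^= state << 5;
                return state;
            }

            static inline uint32_t rnd_interval_tls(uint32_t low, uint32_t high)
            {
                return low == high ? low : low + thread_local_rnd() % (high - low + 1);
            }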

            ssethia Sandeep sethia added a comment (edited):

            If I am correct, we need to test the lock word first in the enter() function. Currently it does:

            while (!try_lock()) {
                    ut_delay(ut_rnd_interval(0, max_delay));
                    if (++n_spins == max_spins) {
                            os_thread_yield();
                            max_spins += step;
                    }
            }

            instead of:

            do {
                    while (lock_word != UNLOCKED) { /* spin until the lock appears free */ }
            } while (!CAS(&lock_word, UNLOCKED, LOCKED)); /* actual atomic locking */

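            (A self-contained sketch of that test-and-test-and-set pattern in portable C++, using std::atomic rather than InnoDB's my_atomic wrappers; names are illustrative.)

            #include <atomic>

            /* Minimal TTAS spinlock sketch: the inner loop only reads the flag,
               so spinning cores share the cache line; the CAS is attempted only
               once the lock looks free. */
            class ttas_spinlock {
            public:
                void lock()
                {
                    for (;;) {
                        while (locked_.load(std::memory_order_relaxed)) {
                            /* optionally insert a CPU pause/yield hint here */
                        }
                        bool expected = false;
                        if (locked_.compare_exchange_weak(expected, true,
                                                          std::memory_order_acquire,
                                                          std::memory_order_relaxed))
                            return;
                    }
                }
                void unlock() { locked_.store(false, std::memory_order_release); }
            private:
                std::atomic<bool> locked_{false};
            };
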
            svoj Sergey Vojtovich made changes -
            Epic Link MDEV-14442 [ 64369 ]

            I plan to finish coding tomorrow, then we can try workable version that doesn't confuse anybody.

            svoj Sergey Vojtovich added a comment - I plan to finish coding tomorrow, then we can try workable version that doesn't confuse anybody.

            svoj Sergey Vojtovich added a comment:

            I did a few experiments with original 10.3 as of 7cb3520c0632ad912b309489ad86a90f9fc9bd0b. oltp_index_updates, 64 threads, ARM.

            vanilla: 143981 TPS
            mutex: 78897 TPS
            spin: 70085 TPS
            nospin: 64573 TPS
            nornd_max_delay: 132392 TPS
            nornd_1: 160053 TPS
            nornd_3: 148662 TPS
            nornd_10: 118859 TPS
            nornd_0: 74063 TPS

            Also, there's potential false sharing between srv_spin_wait_delay/srv_n_spin_wait_rounds and the statistics variables. I doubt it contributes much to the scalability issue, but it's still worth checking.
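
            (For reference, a minimal sketch of the usual way to avoid such false sharing -- the 64-byte line size is an assumption and the names are illustrative, not the actual server variables.)

            #include <cstdint>

            /* Sketch: keep read-mostly tuning knobs on their own cache line so
               that hot statistics counters updated by other threads cannot
               invalidate them. */
            struct alignas(64) spin_tuning {
                uint32_t spin_wait_delay;     /* analogous to srv_spin_wait_delay */
                uint32_t n_spin_wait_rounds;  /* analogous to srv_n_spin_wait_rounds */
            };

            struct alignas(64) spin_stats {
                uint64_t spin_waits;          /* statistics live on a separate line */
                uint64_t os_waits;
            };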

            mutex (use pthread_mutex instead of ib mutex)

            diff --git a/storage/innobase/include/ib0mutex.h b/storage/innobase/include/ib0mutex.h
            index 76f02cc..edbb434 100644
            --- a/storage/innobase/include/ib0mutex.h
            +++ b/storage/innobase/include/ib0mutex.h
            @@ -433,6 +433,7 @@ struct TTASEventMutex {
             		ut_a(m_lock_word == MUTEX_STATE_UNLOCKED);
             
             		m_event = os_event_create(sync_latch_get_name(id));
            +		pthread_mutex_init(&m_mutex, NULL);
             	}
             
             	/** This is the real desctructor. This mutex can be created in BSS and
            @@ -446,6 +447,7 @@ struct TTASEventMutex {
             		/* We have to free the event before InnoDB shuts down. */
             		os_event_destroy(m_event);
             		m_event = 0;
            +		pthread_mutex_destroy(&m_mutex);
             	}
             
             	/** Try and lock the mutex. Note: POSIX returns 0 on success.
            @@ -453,6 +455,7 @@ struct TTASEventMutex {
             	bool try_lock()
             		UNIV_NOTHROW
             	{
            +		return(pthread_mutex_trylock(&m_mutex) == 0);
             		int32 oldval = MUTEX_STATE_UNLOCKED;
             		return(my_atomic_cas32_strong_explicit(&m_lock_word, &oldval,
             						       MUTEX_STATE_LOCKED,
            @@ -464,6 +467,8 @@ struct TTASEventMutex {
             	void exit()
             		UNIV_NOTHROW
             	{
            +		pthread_mutex_unlock(&m_mutex);
            +		return;
             		if (my_atomic_fas32_explicit(&m_lock_word,
             					     MUTEX_STATE_UNLOCKED,
             					     MY_MEMORY_ORDER_RELEASE)
            @@ -485,6 +490,9 @@ struct TTASEventMutex {
             		uint32_t	line)
             		UNIV_NOTHROW
             	{
            +		pthread_mutex_lock(&m_mutex);
            +		return;
            +
             		uint32_t	n_spins = 0;
             		uint32_t	n_waits = 0;
             		const uint32_t	step = max_spins;
            @@ -553,6 +561,7 @@ struct TTASEventMutex {
             	}
             
             private:
            +	sys_mutex_t m_mutex;
             	/** Disable copying */
             	TTASEventMutex(const TTASEventMutex&);
             	TTASEventMutex& operator=(const TTASEventMutex&);
            

            spin (just spin, never delay, never go asleep)

            diff --git a/storage/innobase/include/ib0mutex.h b/storage/innobase/include/ib0mutex.h
            index 76f02cc..5ab6127 100644
            --- a/storage/innobase/include/ib0mutex.h
            +++ b/storage/innobase/include/ib0mutex.h
            @@ -490,6 +490,7 @@ struct TTASEventMutex {
             		const uint32_t	step = max_spins;
             
             		while (!try_lock()) {
            +			continue;
             			if (n_spins++ == max_spins) {
             				max_spins += step;
             				n_waits++;
            

            nospin (immediately go asleep, no delay)

            diff --git a/storage/innobase/include/ib0mutex.h b/storage/innobase/include/ib0mutex.h
            index 76f02cc..f7c9827 100644
            --- a/storage/innobase/include/ib0mutex.h
            +++ b/storage/innobase/include/ib0mutex.h
            @@ -490,7 +490,7 @@ struct TTASEventMutex {
             		const uint32_t	step = max_spins;
             
             		while (!try_lock()) {
            -			if (n_spins++ == max_spins) {
            +			if (1 || n_spins++ == max_spins) {
             				max_spins += step;
             				n_waits++;
             				os_thread_yield();
            

            nornd_X (different values to ut_delay, either hardcoded or max_delay)

            diff --git a/storage/innobase/include/ib0mutex.h b/storage/innobase/include/ib0mutex.h
            index 76f02cc..d48b213 100644
            --- a/storage/innobase/include/ib0mutex.h
            +++ b/storage/innobase/include/ib0mutex.h
            @@ -516,7 +516,7 @@ struct TTASEventMutex {
             					sync_array_wait_event(sync_arr, cell);
             				}
             			} else {
            -				ut_delay(ut_rnd_interval(0, max_delay));
            +				ut_delay(X);
             			}
             		}
            


            ssethia Sandeep sethia added a comment:

            Hi,

            Just to understand: from your comment you get the best numbers with

            nornd_1: 160053 TPS

            I am getting the same number for MariaDB and MySQL at the same thread level, for 20 tables with 1000000 records.

            But if I apply the patch https://github.com/mysql/mysql-server/commit/32b184cad3ebd68281fb0076a24a634d2f330aa1 to MySQL, I see a 10-13% increase in TPS. This is consistent across 64, 128 and 256 threads.

            What is vanilla in your comment?

            Also, right now the random value is given the same across all the threads. Should it be based on the contention of threads rather than on some function value?


            svoj Sergey Vojtovich added a comment:

            Vanilla is 10.3 as of 7cb3520c0632ad912b309489ad86a90f9fc9bd0b (the 10.3 branch as of today).

            Yes, I also wonder why I'm getting lower TPS compared to what you see (~144 000 vs ~160 000). Probably some difference in benchmark/settings/hardware? At least I have 100 000 records versus the 1 000 000 that you have. Same 20 tables, 64 threads.

            It would be nice if you could try the nornd_1 change on your side. It's not just about fixing the shared variable, but also about using a lower value for ut_delay specifically in the mutex.

            > Also, right now the random value is given the same across all the threads. Should it be based on the contention of threads rather than on some function value?

            I'm afraid I don't quite get this question. Could you elaborate? Do you mean all the threads get the same random value? Or do you mean the delay should be based on mutex contention?


            svoj Sergey Vojtovich added a comment:

            Just for the record, this article suggests that YIELD can be used in a spin loop: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489f/CIHEGBBF.html

            I wonder if we can use YIELD on ARM as a replacement for Intel's PAUSE.
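
            (For illustration, a minimal sketch of an architecture-specific "CPU relax" hint for a spin loop, assuming GCC/Clang inline assembly; the fallback branch is just a compiler barrier.)

            /* PAUSE on x86, YIELD on ARM/AArch64, otherwise a plain barrier. */
            static inline void cpu_relax()
            {
            #if defined(__x86_64__) || defined(__i386__)
                __asm__ __volatile__("pause");
            #elif defined(__aarch64__) || defined(__arm__)
                __asm__ __volatile__("yield");
            #else
                __asm__ __volatile__("" ::: "memory");
            #endif
            }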


            ssethia Sandeep sethia added a comment:

            I tried nornd_1 but I see a scalability issue in my numbers: the TPS is around 74k instead of the 160k I am getting with the defaults.

            My sysbench command for run and prepare is:

            ./bin/sysbench --test=share/sysbench/oltp_update_index.lua --max-requests=0 --max-time=50 --tables=20 --table_size=100000 --report-interval=10 --mysql-user=root --mysql-db=test --mysql_storage_engine=innodb --mysql-socket=/tmp/mysql.sock --num-threads=64 run

            and my cnf file is
            [mysqld]
            innodb_flush_neighbors = 0
            innodb_write_io_threads = 8
            loose-innodb_adaptive_hash_index_parts = 16
            loose-innodb_adaptive_hash_index_partitions = 16
            #innodb_spin_wait_delay = 200
            max_prepared_stmt_count=1048576
            innodb_buffer_pool_size = 50g
            max_connections = 5000
            max_user_connections = 5000
            innodb_use_native_aio = 1
            innodb_log_file_size = 2G
            innodb_flush_sync=0
            innodb_io_capacity =5000
            innodb_flush_log_at_trx_commit =0
            #innodb_concurrency_tickets =5000
            #innodb_thread_concurrency=128
            max_heap_table_size = 256M
            sort_buffer_size = 2M
            read_buffer_size =2M
            binlog_cache_size = 1M
            thread_cache_size = 500
            innodb_buffer_pool_instances=30
            #enforce-gtid-consistency = 1

            All the runs are happening in tmpfs.


            svoj Sergey Vojtovich added a comment:

            OK, so we need to be careful with this. I'll do more research on this.

            svoj Sergey Vojtovich made changes -
            Sprint 10.1.31 [ 225 ]
            svoj Sergey Vojtovich made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            svoj Sergey Vojtovich added a comment:

            marko, please review the fix for this bug: https://github.com/MariaDB/server/commit/62eeb1c760196207ac57338237a0d58df925847f
            svoj Sergey Vojtovich made changes -
            Assignee Sergey Vojtovich [ svoj ] Marko Mäkelä [ marko ]
            Status Confirmed [ 10101 ] In Review [ 10002 ]

            marko Marko Mäkelä added a comment:

            Looks OK to me.
            Side note: after this fix, ut_rnd_interval() is really only used in one place (if we ignore the uncovered fault injection fil_tablespace_iterate_failure). It could make sense to move the function to the only file where it is being used.

            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Sergey Vojtovich [ svoj ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            ssethia Sandeep sethia added a comment:

            Sergey,

            If the atomics on ARM/Intel get optimized, or for that matter get slower, I feel this may still need to be adjusted. I think what's really needed is to measure the distribution of the number of CPU cycles the mutexes actually need to be delayed, because "The argument gives the desired delay in microseconds on 100 MHz Pentium + Visual C++." is clearly not relevant any more.
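
            (One possible way to gather such a distribution -- a rough sketch under assumed names, not a proposal for the actual server code -- is to time how long each spin phase lasts before the try-lock succeeds and histogram it in log2 buckets.)

            #include <atomic>
            #include <chrono>
            #include <cstdint>

            /* Sketch: log2-bucket histogram of nanoseconds spent spinning before
               a try-lock succeeds; statically zero-initialized. */
            static std::atomic<uint64_t> spin_ns_hist[16];

            template <typename TryLock>
            void timed_spin(TryLock try_lock)
            {
                using clock = std::chrono::steady_clock;
                const auto start = clock::now();
                while (!try_lock()) { /* spin */ }
                const uint64_t ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    clock::now() - start).count();
                unsigned bucket = 0;
                for (uint64_t v = ns; v > 1 && bucket < 15; v >>= 1) ++bucket;
                spin_ns_hist[bucket].fetch_add(1, std::memory_order_relaxed);
            }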

            svoj Sergey Vojtovich made changes -
            Fix Version/s 10.3.5 [ 22905 ]
            Fix Version/s 10.3 [ 22126 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]

            svoj Sergey Vojtovich added a comment:

            ssethia, the argument is indeed not relevant. But I'm not completely sure what you're suggesting.

            Scalability is at least a three-dimensional thing: hardware, number of threads, and workload. Usually when we improve benchmark results at some intersection of these three, we inevitably make performance at another intersection worse. The only approach I'm aware of that gives a non-negative result at all intersections is removing cache line contention (mutexes, rwlocks, false sharing, shared variables) from the hot path.

            In the scope of this task we removed the RNG variable contention, which is perfectly in line with what I said above.

            If you have a specific suggestion for how exactly we should tune the mutex delay, I'm all for creating another JIRA task and discussing it separately.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2.31 [ 24017 ]
            Fix Version/s 10.4.0 [ 23115 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 83950 ] MariaDB v4 [ 153235 ]

            People

              svoj Sergey Vojtovich
              ssethia Sandeep sethia
              Votes: 0
              Watchers: 7

