  MariaDB Server / MDEV-16168

Performance regression on sysbench write benchmarks from 10.2 to 10.3

Details

    Description

      I noticed reduced performance when running a somewhat large update_no_index sysbench benchmark, comparing 10.2 to 10.3.
      Here is the setup:

      • Windows Azure VM with 16 virtual CPUs, 32GB RAM and SSD storage (I used the local temporary disk)

      Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
      Sockets: 1
      Virtual processors: 16

      • MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
        the relevant part of the my.ini file is as follows:

        [mysqld]
        back_log=500
        max_connections=4096
        max_prepared_stmt_count=500000
        table_open_cache=10000
        transaction_isolation=REPEATABLE-READ
        innodb_status_file=0
        innodb_buffer_pool_size=20G
        innodb_log_file_size=15G
        innodb_log_buffer_size=1000M 
        innodb-io-capacity=4000
        innodb-io-capacity-max=5000
        innodb_doublewrite=0
        innodb-page-cleaners=1
        innodb-buffer-pool-instances=1
        innodb_adaptive_flushing=0
        innodb_adaptive_flushing_lwm=10
        

      • sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

      The update-no-index test runs on a single large table (50 million rows), with the number of users ranging from 1 to 4096, in powers of 2.

      Loading table, sysbench prepare

      sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1  --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL  --num-threads=1 prepare
      

      sysbench run

      sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1  --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL  --num-threads=%N% run
      

      where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

      The test sleeps for 80 seconds between runs, and also between the load and the first test; as far as I can see this is enough for all InnoDB background IO activity to finish (MB/sec goes down to 0 on the disk where the test runs). Because innodb-flush-log-at-trx-commit is at its default (1), the test does not appear to be very CPU bound (sysbench and mysqld together use about 70% of all 16 available CPUs according to Windows measurements, i.e. over 10 CPUs are busy).
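      For reference, here is a minimal sketch of the driver loop, assuming sysbench 0.4 is on PATH and the 50-million-row table has already been loaded with the prepare command above; it only illustrates the procedure (thread counts in powers of two, 80-second nap before each run), it is not the actual harness I used:

      // Illustrative driver loop only (assumptions: sysbench 0.4 is on PATH and
      // the 50M-row table has already been created with the "prepare" command).
      #include <cstdlib>
      #include <chrono>
      #include <string>
      #include <thread>

      int main() {
        const std::string common =
            "sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost "
            "--mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql "
            "--mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex "
            "--oltp-read-only=off --max-requests=0 --oltp-point-selects=0 "
            "--oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 "
            "--oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 "
            "--oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform "
            "--mysql-socket=MySQL";

        for (int users = 1; users <= 4096; users *= 2) {
          // 80 second nap before each run so that InnoDB background IO from the
          // load or the previous run settles (MB/sec on the test disk drops to 0).
          std::this_thread::sleep_for(std::chrono::seconds(80));
          const std::string cmd =
              common + " --num-threads=" + std::to_string(users) + " run";
          std::system(cmd.c_str());   // TPS and other results go to stdout
        }
        return 0;
      }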

      Below are the results from the runs, for 10.2.14 and the recent 10.3.7.
      UPDATE: marko mentioned git commit 1a62c8a39647db00283b4e35fb6db3c7bc31e3ca as the version right before 10.3 switched to lockless, so
      I added a column for it too. It appears to be better than 10.3.7, but worse than 10.2.14.

      Users | TPS 10.2.14 | TPS 10.3.7 | TPS 10.3-pre-lockless
      1 | 1549.92 | 1547.75 | 1551.07
      2 | 2373.89 | 2209.05 | 2334.80
      4 | 4466.03 | 4276.46 | 4446.64
      8 | 9069.82 | 8587.59 | 9098.95
      16 | 16631.45 | 15719.70 | 16231.39
      32 | 28989.79 | 27482.44 | 27786.32
      64 | 36996.12 | 34843.23 | 35089.44
      128 | 37287.05 | 35172.49 | 36223.68
      256 | 38038.01 | 35896.52 | 36818.22
      512 | 38360.71 | 36567.07 | 37195.47
      1024 | 38265.25 | 36328.62 | 37253.98
      2048 | 39990.85 | 36328.62 | 38572.81
      4096 | 41988.71 | 39032.83 | 39606.38

      (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
      The regression is not as big as I thought initially (somehow I counted 15%, and it is more like 5%: at 64 users the drop is (36996.12 - 34843.23) / 36996.12 ≈ 5.8%, at 256 users about 5.6%), but it is consistent starting from 64 users and holds all the way up to 4096. So it needs investigating.
      It is odd, because the test itself is neither CPU bound nor IO bound; the bottleneck, in my understanding, is waiting for the log flush in log_write_up_to(). I did not try to make it CPU bound by relaxing durability; that might be another exercise, although with increased TPS purging could become more of a bottleneck and disturb the picture.

      I tried to do some initial profiling (based on CPU sampling), and this is what showed up.

      Below, in all places, "baseline" refers to 10.2.14 and "comparison" refers to 10.3.7.

      Exclusive samples comparison
      Individual functions (exclusive sample percentages, with a 1% difference threshold)

      Function | Delta | Baseline Value | Comparison Value
      ut_delay | 4.02 | 16.05 | 20.07
      l_find | 2.44 | 0.16 | 2.60
      PolicyMutex<TTASEventMutex<GenericPolicy> >::enter | 2.42 | 0.18 | 2.60
      TTASEventMutex<GenericPolicy>::enter | -4.42 | 4.42 | 0.00
      SleepConditionVariableCS | -5.41 | 28.70 | 23.29

      In short, there is more busy wait (ut_delay) and less lazy wait (SleepConditionVariableCS). There is also a new, quite visible function, l_find (now the 4th most expensive individual function), which seems to come from replacing the STL hash table with the custom lock-free hash table in InnoDB.
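      To put the ut_delay / SleepConditionVariableCS numbers in context, below is a rough, hypothetical sketch of the spin-then-block pattern that an InnoDB-style TTAS event mutex uses; it is not the actual TTASEventMutex/PolicyMutex code, and names such as spin_rounds and cpu_delay are made up for illustration. The spin phase burns CPU in a delay loop (which profiles as ut_delay), and only after the spin budget is exhausted does the thread fall back to blocking on an OS primitive (which on Windows ends up in SleepConditionVariableCS).

      // Hypothetical illustration of a test-and-test-and-set (TTAS) mutex with a
      // spin phase followed by a blocking wait; not the real InnoDB implementation.
      #include <atomic>
      #include <condition_variable>
      #include <mutex>

      class ttas_event_mutex {
        std::atomic<bool> locked{false};
        std::mutex m;                 // protects the condition variable below
        std::condition_variable cv;   // the "event" a blocked waiter sleeps on

        static void cpu_delay(unsigned iterations) {
          // Roughly what ut_delay() does: a busy loop that just burns cycles.
          for (volatile unsigned i = 0; i < iterations * 50; i = i + 1) {}
        }

      public:
        void enter(unsigned spin_rounds = 30, unsigned delay = 6) {
          for (unsigned round = 0; round < spin_rounds; round++) {
            // Test-and-test-and-set: read first, attempt the atomic swap only
            // when the lock looks free, to limit cache-line ping-pong.
            if (!locked.load(std::memory_order_relaxed) &&
                !locked.exchange(true, std::memory_order_acquire))
              return;               // acquired while spinning
            cpu_delay(delay);       // shows up in profiles as busy wait
          }
          // Spin budget exhausted: block until woken (on Windows this kind of
          // wait is what ends up in SleepConditionVariableCS).
          std::unique_lock<std::mutex> lk(m);
          cv.wait(lk, [this] {
            return !locked.exchange(true, std::memory_order_acquire);
          });
        }

        void exit() {
          {
            // Simplified: always take the internal mutex so a waiter that is
            // about to block cannot miss the wakeup.
            std::lock_guard<std::mutex> lk(m);
            locked.store(false, std::memory_order_release);
          }
          cv.notify_one();          // wake one blocked waiter, if any
        }
      };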
      Inclusive samples comparison
      Comparing inclusive samples with a 1% difference threshold gives the table below, which gives me a vague idea that something is more expensive in purge, and again that a lazy wait was replaced with a busy wait somewhere. Background threads probably use more CPU, for example the purge "coordinator" thread, while foreground threads use less of it (as shown by the decreased time in do_command, for example). A sketch of the lock-free transaction-hash iteration that shows up here (trx_sys_t::clone_oldest_view, rw_trx_hash_t::iterate, lf_hash_iterate, l_find) follows after the table.

      Function | Delta | Baseline Value | Comparison Value
      PolicyMutex<TTASEventMutex<GenericPolicy> >::enter | 28.69 | 2.53 | 31.22
      ut_delay | 4.02 | 16.05 | 20.07
      trx_purge | 3.81 | 5.67 | 9.48
      srv_do_purge | 3.80 | 5.70 | 9.49
      row_purge | 3.66 | 0.00 | 3.67
      row_purge_step | 3.47 | 0.71 | 4.18
      que_thr_step | 3.31 | 0.94 | 4.25
      srv_purge_coordinator_thread | 3.25 | 6.37 | 9.63
      trx_sys_t::clone_oldest_view | 3.18 | 0.00 | 3.18
      btr_cur_search_to_nth_level_func | 3.12 | 0.00 | 3.12
      row_purge_record_func | 2.94 | 0.00 | 2.94
      row_purge_upd_exist_or_extern_func | 2.77 | 0.00 | 2.77
      row_purge_reset_trx_id | 2.72 | 0.00 | 2.72
      que_run_threads_low | 2.70 | 1.73 | 4.43
      que_run_threads | 2.69 | 1.75 | 4.45
      l_find | 2.46 | 0.18 | 2.64
      ReadView::snapshot | 2.30 | 0.00 | 2.30
      rw_trx_hash_t::iterate | 2.27 | 0.00 | 2.27
      lf_hash_iterate | 2.25 | 0.00 | 2.25
      srv_task_execute | 2.12 | 1.95 | 4.08
      row_purge_reposition_pcur | 1.53 | 0.00 | 1.53
      row_search_on_row_ref | 1.33 | 0.00 | 1.33
      btr_pcur_open_low | 1.31 | 0.00 | 1.31
      btr_search_guess_on_hash | 1.31 | 1.54 | 2.85
      trx_purge_wait_for_workers_to_complete | 1.10 | 1.52 | 2.62
      trx_undo_assign_low | 1.07 | 0.00 | 1.07
      mtr_t::commit | 1.05 | 2.57 | 3.62
      mtr_t::Command::execute | 1.01 | 2.42 | 3.43
      srv_worker_thread | -1.09 | 9.84 | 8.75
      sync_array_wait_event | -1.29 | 9.69 | 8.40
      trx_write_serialisation_history | -1.34 | 2.62 | 1.28
      trx_commit_low | -1.39 | 3.15 | 1.76
      innobase_commit | -1.75 | 47.36 | 45.61
      trx_commit | -1.75 | 4.79 | 3.04
      ha_commit_one_phase | -1.76 | 47.55 | 45.78
      commit_one_phase_2 | -1.76 | 47.54 | 45.77
      trans_commit_stmt | -1.80 | 48.68 | 46.88
      ha_commit_trans | -1.80 | 48.65 | 46.84
      btr_cur_search_to_nth_level | -1.85 | 1.85 | 0.00
      innobase_commit_ordered_2 | -2.04 | 5.17 | 3.13
      trx_commit_for_mysql | -2.05 | 5.10 | 3.05
      innobase_commit_low | -2.05 | 5.11 | 3.06
      Prepared_statement::execute_loop | -2.05 | 74.58 | 72.52
      mysql_stmt_execute_common | -2.14 | 74.85 | 72.71
      mysqld_stmt_execute | -2.15 | 74.89 | 72.74
      mysql_execute_command | -2.21 | 73.03 | 70.81
      trx_undo_assign_undo | -2.24 | 2.24 | 0.00
      Prepared_statement::execute | -2.24 | 74.03 | 71.79
      threadpool_process_request | -2.32 | 81.71 | 79.38
      dispatch_command | -2.33 | 79.21 | 76.88
      do_command | -2.34 | 81.33 | 79.00
      tp_callback | -2.38 | 82.66 | 80.28
      srv_resume_thread | -3.33 | 7.94 | 4.61
      os_event::wait_low | -4.98 | 28.41 | 23.43
      SleepConditionVariableCS | -5.41 | 28.70 | 23.29
      TTASEventMutex<GenericPolicy>::enter | -30.49 | 30.49 | 0.00
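      For context on why trx_sys_t::clone_oldest_view, ReadView::snapshot, rw_trx_hash_t::iterate, lf_hash_iterate and l_find all show up on the 10.3 side with no 10.2 counterpart: after the lockless change, the set of active transactions lives in a lock-free hash, so building a read view (and finding the oldest view for purge) walks that hash with a callback instead of scanning a mutex-protected list, and each lookup step in the lock-free list is what l_find does. The sketch below only illustrates that access pattern under stated assumptions; the container, the trx_entry struct and the function names are hypothetical, not the actual MariaDB code.

      #include <cstdint>
      #include <limits>
      #include <mutex>
      #include <unordered_map>

      // Hypothetical stand-in for one element of the transaction hash.
      struct trx_entry {
        uint64_t trx_id;   // transaction id
        uint64_t no;       // serialisation number, 0 while still active
      };

      // Stand-in container: the real rw_trx_hash_t is a lock-free LF_HASH whose
      // per-bucket search is l_find(); a locked unordered_map keeps this sketch
      // short while preserving the "iterate all elements with a callback" shape.
      class rw_trx_hash_sketch {
        std::unordered_map<uint64_t, trx_entry> map;
        mutable std::mutex m;
      public:
        void insert(const trx_entry &t) {
          std::lock_guard<std::mutex> lk(m);
          map[t.trx_id] = t;
        }
        // Invoke the callback for every registered transaction. In the lock-free
        // version every such walk involves list searches, which is why l_find
        // becomes visible once each read view / purge round iterates like this.
        template <class F> void iterate(F &&callback) const {
          std::lock_guard<std::mutex> lk(m);
          for (const auto &kv : map) callback(kv.second);
        }
      };

      // Rough shape of the snapshot step: scan every transaction to compute the
      // smallest registered transaction id, standing in for the bookkeeping a
      // read view (or the purge coordinator's oldest view) needs.
      uint64_t snapshot_low_limit(const rw_trx_hash_sketch &rw_trx_hash) {
        uint64_t low_limit = std::numeric_limits<uint64_t>::max();
        rw_trx_hash.iterate([&](const trx_entry &t) {
          if (t.trx_id < low_limit) low_limit = t.trx_id;
        });
        return low_limit;
      }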

      Attachments

        1. 10.3.pre-lockless.txt
          29 kB
        2. 1bp-1pc-10.2.14.txt
          29 kB
        3. 1bp-1pc-10.3.7.txt
          29 kB
        4. winpmp_10_2_14.txt
          317 kB
        5. winpmp_10_3_7.txt
          265 kB

        Issue Links

          Activity

            wlad Vladislav Vaintroub created issue -
            wlad Vladislav Vaintroub made changes -
            Field Original Value New Value
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_connect_errors=5000
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            enable-named-pipe
            character_set_server=latin1
            collation_server=latin1_swedish_ci
            skip-character-set-client-handshake
            innodb_undo_log_truncate=off
            innodb_doublewrite=0
            #innodb-max-purge-lag=1000000
            #innodb_log_files_in_group=10
            #innodb_max_purge_lag_delay=300000
            #innodb_max_purge_lag=0
            #innodb-write-io-threads=32
            #innodb-page-size=4K
            [mariadb]
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            #innodb-undo-tablespaces=2
            #innodb_buffer_pool_dump_pct=100
            #innodb_max_dirty_pages_pct=99
            #innodb_max_dirty_pages_pct_lwm=80
            innodb_adaptive_flushing_lwm=10

            [client]
            protocol=pipe
            * sysbench 0.4
             
            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_connect_errors=5000
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            enable-named-pipe
            character_set_server=latin1
            collation_server=latin1_swedish_ci
            skip-character-set-client-handshake
            innodb_undo_log_truncate=off
            innodb_doublewrite=0
            #innodb-max-purge-lag=1000000
            #innodb_log_files_in_group=10
            #innodb_max_purge_lag_delay=300000
            #innodb_max_purge_lag=0
            #innodb-write-io-threads=32
            #innodb-page-size=4K
            [mariadb]
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            #innodb-undo-tablespaces=2
            #innodb_buffer_pool_dump_pct=100
            #innodb_max_dirty_pages_pct=99
            #innodb_max_dirty_pages_pct_lwm=80
            innodb_adaptive_flushing_lwm=10

            [client]
            protocol=pipe
            * sysbench 0.4
             
            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users||10.2.14||10.3.7
            |1|
            |2|
            |
            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users||10.2.14||10.3.7
            |1|
            |2|
            |
            |I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|
            wlad Vladislav Vaintroub made changes -
            Description |I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|
            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling and

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling and

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions
            Comparison Column Delta Baseline Value Comparison Value
            ut_delay 4.02 16.05 20.07
            l_find 2.44 0.16 2.60
            PolicyMutex<TTASEventMutex<GenericPolicy> >::enter 2.42 0.18 2.60
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions
            Comparison Column Delta Baseline Value Comparison Value
            ut_delay 4.02 16.05 20.07
            l_find 2.44 0.16 2.60
            PolicyMutex<TTASEventMutex<GenericPolicy> >::enter 2.42 0.18 2.60
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29

            I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions (threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| 4.02| 16.05| 20.07|
            |l_find| 2.44| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| 2.42|0.18| 2.60|
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29

            wlad Vladislav Vaintroub made changes -
            Description I noticed reduced performance numbers when running somewhat large update_no_index sysbench benchmark, when comparing 10.2 to 10.3
            Here is the setup

            * Windows Azure box with 16 virtual CPUs, 32GB RAM and SSD storage (I used Local temporary disk)
            * MariaDB 10.2.14 vs 10.3.7 (recent build of 4a5e23e257e229b548599133dbed5162af9df6d9)
            the relevant part of my.ini file is as follows
            {noformat}
            [mysqld]
            back_log=500
            max_connections=4096
            max_prepared_stmt_count=500000
            table_open_cache=10000
            transaction_isolation=REPEATABLE-READ
            innodb_status_file=0
            innodb_buffer_pool_size=20G
            innodb_log_file_size=15G
            innodb_log_buffer_size=1000M
            innodb-io-capacity=4000
            innodb-io-capacity-max=5000
            innodb_doublewrite=0
            innodb-page-cleaners=1
            innodb-buffer-pool-instances=1
            innodb_adaptive_flushing=0
            innodb_adaptive_flushing_lwm=10
            {noformat}
            * sysbench 0.4 (I mention it because current versions of sysbench do not run on Windows anymore, but we do not need anything from the current versions)

            The update-no-index run on a single large table (50 mio rows), with number of users ranging from 1 to 4096, in powers of 2

            *Loading table, sysbench prepare*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=1 prepare
            {noformat}
             
            *sysbench run*
            {noformat}
            sysbench --test=oltp --oltp-table-size=50000000 --mysql-host=localhost --mysql-db=sbtest --mysql-user=root --mysql-password= --db-driver=mysql --mysql-table-engine=innodb --max-time=300 --oltp-test-mode=complex --oltp-read-only=off --max-requests=0 --oltp-point-selects=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-index-updates=0 --oltp-delete-inserts=0 --oltp-non-index-updates=1 --oltp-skip-trx=on --oltp-dist-type=uniform --mysql-socket=MySQL --num-threads=%N% run
            {noformat}

            where %N% is 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

            The test takes a nap of 80 seconds between runs, and also between load and first test, and as far as I can see it is enough for all Innodb backgroup IO activity to finish (MB/sec goes down to 0 on a disk where test is ran). Due to innodb-flush-log-at-trx-commit being default (1), the test does not appear to be very CPU bound (sysbench and mysql use 70% of all available 16 CPUs, in windows measurement, i.e over 10 CPUs are busy)

            Below are the results from the runs, for 10.2.4 and recent 10.3.7

            ||Users|| TPS 10.2.14|| TPS 10.3.7
            |1|1549.92|1547.75|
            |2|2373.89|2209.05|
            |4|4466.03|4276.46|
            |8|9069.82|8587.59|
            |16|16631.45|15719.70|
            |32|28989.79|27482.44|
            |64|36996.12|34843.23|
            |128|37287.05|35172.49|
            |256|38038.01|35896.52|
            |512|38360.71|36567.07|
            |1024|38265.25|36328.62|
            |2048|39990.85|36328.62|
            |4096|41988.71|39032.83|

            (also see the graph https://docs.google.com/spreadsheets/d/1VqRYCwF4QATCKEKwxKuCcvRXKm2GWnICCdv4fSo4_IU/edit#gid=0)
            It is not as big as I thought initially (somehow I counted 15%, and it is more like 5%), but it is consistent starting from 64 users, and keeping until the 4K . So it needs investigating.
            It is odd because the test itself is not CPU bound, nor IO bound, the bottleneck in my understanding is waiting for flush in log_write_up_to() . I did not try to make it CPU bound by relaxing durability, this might be another exercise, however with increased TPS purging can become more of a bottleneck, and disturb the picture.

            I tried to do some initial profiling (based on CPU sampling, and this is what showed up)

            Individual functions (threshold 1% difference)

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |ut_delay| 4.02| 16.05| 20.07|
            |l_find| 2.44| 0.16 |2.60|
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter| 2.42|0.18| 2.60|
            TTASEventMutex<GenericPolicy>::enter -4.42 4.42 0.00
            SleepConditionVariableCS -5.41 28.70 23.29


            In short, there is more busy wait (_ut_delay_) and less lazy wait (_SleepConditionVariableCS_). There is also a new, quite visible function, l_find (now the 4th most expensive individual function), which seems to come from replacing the STL container with a custom lock-free hash table in InnoDB.
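
            As a mental model of that shift, here is a minimal sketch of a generic "spin first, block later" mutex (this is not the actual PolicyMutex/TTASEventMutex code; the class name and constants are invented for the illustration). The longer the bounded spin phase is relative to the blocking fallback, the more of the contention shows up in a CPU-sampling profile as busy waiting (the delay loop, i.e. the ut_delay analogue) rather than as time parked in a condition-variable wait (the SleepConditionVariableCS analogue).
            {noformat}
            // Illustrative sketch only, not InnoDB code: a test-and-test-and-set mutex
            // that busy-waits for a bounded number of rounds before blocking on a
            // condition variable.
            #include <atomic>
            #include <condition_variable>
            #include <cstdio>
            #include <mutex>
            #include <thread>
            #include <vector>

            class SpinThenBlockMutex {
             public:
              void lock() {
                // Phase 1: bounded busy wait (the ut_delay()-style spin).
                for (int round = 0; round < kSpinRounds; ++round) {
                  if (try_lock()) return;
                  delay(kDelayIterations);
                }
                // Phase 2: stop spinning and block until an unlock() wakes us up.
                std::unique_lock<std::mutex> guard(wait_mutex_);
                wait_cv_.wait(guard, [this] { return try_lock(); });
              }

              void unlock() {
                {
                  // Release under the same mutex the sleepers use, so a thread cannot
                  // miss the release between its predicate check and going to sleep.
                  std::lock_guard<std::mutex> guard(wait_mutex_);
                  locked_.store(false, std::memory_order_release);
                }
                wait_cv_.notify_one();
              }

             private:
              bool try_lock() {
                // "Test" first to avoid hammering the cache line, then "test-and-set".
                return !locked_.load(std::memory_order_relaxed) &&
                       !locked_.exchange(true, std::memory_order_acquire);
              }

              static void delay(int iterations) {
                // Portable stand-in for a PAUSE-instruction spin loop.
                for (int i = 0; i < iterations; ++i)
                  std::atomic_signal_fence(std::memory_order_seq_cst);
              }

              static constexpr int kSpinRounds = 30;        // invented tuning knob
              static constexpr int kDelayIterations = 200;  // invented tuning knob

              std::atomic<bool> locked_{false};
              std::mutex wait_mutex_;
              std::condition_variable wait_cv_;
            };

            int main() {
              SpinThenBlockMutex mu;
              long counter = 0;
              std::vector<std::thread> threads;
              for (int t = 0; t < 8; ++t)
                threads.emplace_back([&] {
                  for (int i = 0; i < 100000; ++i) {
                    mu.lock();
                    ++counter;  // the protected critical section
                    mu.unlock();
                  }
                });
              for (auto &th : threads) th.join();
              std::printf("counter=%ld\n", counter);  // expect 800000
              return 0;
            }
            {noformat}
            Raising kSpinRounds/kDelayIterations in this toy should move samples from the condition-variable wait into delay(), which is the same shape of change as in the two profiles above.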


            Comparing inclusive samples with a 1% difference threshold gives the table below. It gives me a vague idea that something is more expensive in purge, and again that a lazy wait was replaced with a busy wait somewhere: background threads, for example the purge coordinator thread, probably use more CPU, while foreground threads use less of it (as shown, for example, by the decreased time in do_command).

            ||Comparison Column|| Delta|| Baseline Value|| Comparison Value||
            |PolicyMutex<TTASEventMutex<GenericPolicy> >::enter|28.69|2.53|31.22|
            |ut_delay|4.02|16.05|20.07|
            |trx_purge|3.81|5.67|9.48|
            |srv_do_purge|3.80|5.70|9.49|
            |row_purge|3.66|0.00|3.67|
            |row_purge_step|3.47|0.71|4.18|
            |que_thr_step|3.31|0.94|4.25|
            |srv_purge_coordinator_thread|3.25|6.37|9.63|
            |trx_sys_t::clone_oldest_view|3.18|0.00|3.18|
            |btr_cur_search_to_nth_level_func|3.12|0.00|3.12|
            |row_purge_record_func|2.94|0.00|2.94|
            |row_purge_upd_exist_or_extern_func|2.77|0.00|2.77|
            |row_purge_reset_trx_id|2.72|0.00|2.72|
            |que_run_threads_low|2.70|1.73|4.43|
            |que_run_threads|2.69|1.75|4.45|
            |l_find|2.46|0.18|2.64|
            |ReadView::snapshot|2.30|0.00|2.30|
            |rw_trx_hash_t::iterate|2.27|0.00|2.27|
            |lf_hash_iterate|2.25|0.00|2.25|
            |srv_task_execute|2.12|1.95|4.08|
            |row_purge_reposition_pcur|1.53|0.00|1.53|
            |row_search_on_row_ref|1.33|0.00|1.33|
            |btr_pcur_open_low|1.31|0.00|1.31|
            |btr_search_guess_on_hash|1.31|1.54|2.85|
            |trx_purge_wait_for_workers_to_complete|1.10|1.52|2.62|
            |trx_undo_assign_low|1.07|0.00|1.07|
            |mtr_t::commit|1.05|2.57|3.62|
            |mtr_t::Command::execute|1.01|2.42|3.43|
            |srv_worker_thread|-1.09|9.84|8.75|
            |sync_array_wait_event|-1.29|9.69|8.40|
            |trx_write_serialisation_history|-1.34|2.62|1.28|
            |trx_commit_low|-1.39|3.15|1.76|
            |innobase_commit|-1.75|47.36|45.61|
            |trx_commit|-1.75|4.79|3.04|
            |ha_commit_one_phase|-1.76|47.55|45.78|
            |commit_one_phase_2|-1.76|47.54|45.77|
            |trans_commit_stmt|-1.80|48.68|46.88|
            |ha_commit_trans|-1.80|48.65|46.84|
            |btr_cur_search_to_nth_level|-1.85|1.85|0.00|
            |innobase_commit_ordered_2|-2.04|5.17|3.13|
            |trx_commit_for_mysql|-2.05|5.10|3.05|
            |innobase_commit_low|-2.05|5.11|3.06|
            |Prepared_statement::execute_loop|-2.05|74.58|72.52|
            |mysql_stmt_execute_common|-2.14|74.85|72.71|
            |mysqld_stmt_execute|-2.15|74.89|72.74|
            |mysql_execute_command|-2.21|73.03|70.81|
            |trx_undo_assign_undo|-2.24|2.24|0.00|
            |Prepared_statement::execute|-2.24|74.03|71.79|
            |threadpool_process_request|-2.32|81.71|79.38|
            |dispatch_command|-2.33|79.21|76.88|
            |do_command|-2.34|81.33|79.00|
            |tp_callback|-2.38|82.66|80.28|
            |srv_resume_thread|-3.33|7.94|4.61|
            |os_event::wait_low|-4.98|28.41|23.43|
            |SleepConditionVariableCS|-5.41|28.70|23.29|
            |TTASEventMutex<GenericPolicy>::enter|-30.49|30.49|0.00|


            wlad Vladislav Vaintroub made changes -
            Summary [Draft]Performance regression on sysbench write benchmarks from 10.2 to 10.3 → Performance regression on sysbench write benchmarks from 10.2 to 10.3
            wlad Vladislav Vaintroub made changes -
            Assignee Marko Mäkelä [ marko ]

            wlad Vladislav Vaintroub added a comment - Assigning to marko, the default InnoDB guru. svoj, you can jump in if you want to repeat the benchmark, or if you have any idea how to fix it. axel, same for you.


            marko Marko Mäkelä added a comment -

            Given that log_sys.mutex appears to be one of the contention points, I think that we should keep track of the average volume of redo log written per write transaction. More redo log writes should lead to more contention.

            There are two features in MariaDB 10.3 that could increase the writes: MDEV-12288 (resetting transaction identifiers after writes) and MDEV-11369 (instant ADD COLUMN). As far as I understand, neither of these changes should matter.

            Also, some mini-transactions for writing undo log were changed in the hope of simplifying crash recovery; one of the changes was essentially reverted in MDEV-15914 due to a performance regression.

            I would like to compare how these fields in SHOW ENGINE INNODB STATUS proceed:

            *excerpt of SHOW ENGINE INNODB STATUS output*
            {noformat}
            …
            ------------
            TRANSACTIONS
            ------------
            Trx id counter 34
            …
            ---
            LOG
            ---
            Log sequence number 1642672
            …
            {noformat}
            

            We should record both counts at the start and end of the test, calculate the difference and the ratio. For example, if these counts were 50 and 2000000 at the end of the test, the ratio would be (2000000-1642672)/(50-34) = 22333 bytes per transaction.
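
            A small script along these lines could automate that calculation; this is a hypothetical sketch (the function names are made up and not part of any MariaDB tooling), assuming the status output is captured as plain text before and after the run and that both counters are printed in decimal, as in the excerpt above.
            {noformat}
            import re

            def read_counters(status_text):
                """Extract the two counters from a SHOW ENGINE INNODB STATUS dump."""
                trx = int(re.search(r"Trx id counter (\d+)", status_text).group(1))
                lsn = int(re.search(r"Log sequence number (\d+)", status_text).group(1))
                return trx, lsn

            def redo_bytes_per_trx(status_before, status_after):
                """Redo log bytes written per transaction between two captures."""
                trx0, lsn0 = read_counters(status_before)
                trx1, lsn1 = read_counters(status_after)
                return (lsn1 - lsn0) / (trx1 - trx0)

            # The numbers from the example above:
            before = "Trx id counter 34\nLog sequence number 1642672\n"
            after = "Trx id counter 50\nLog sequence number 2000000\n"
            print(redo_bytes_per_trx(before, after))  # 22333.0
            {noformat}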

            If MariaDB 10.3 is generating significantly more redo log than 10.2, that could explain the increased log_sys.mutex contention.

            If purge threads are using more CPU than earlier, that could be due to svoj’s work of removing trx_sys.mutex and employing some lock-free data structures. It could be useful to test a 10.3 version right before switching to lock-free data structures.


            There was also this change to ut_delay() and some related discussion. On the Intel Skylake microarchitecture, the PAUSE instruction (REPZ NOP) takes more wall-clock time than on earlier ones. Maybe we are now burning more CPU on older microarchitectures?

marko Marko Mäkelä added a comment -
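
The per-PAUSE cost on a given machine can be estimated with a trivial loop such as the one below (an illustrative x86-only C++ sketch using the _mm_pause() intrinsic; it is not the server's ut_delay()):

// Rough estimate of the wall-clock cost of one PAUSE (REP NOP) on this CPU.
// On Skylake and later a single PAUSE is reported to take considerably
// longer than on older microarchitectures.
#include <chrono>
#include <cstdio>
#include <immintrin.h>

int main()
{
  constexpr long iterations = 100'000'000;
  const auto start = std::chrono::steady_clock::now();
  for (long i = 0; i < iterations; i++)
    _mm_pause();                       // the PAUSE instruction
  const auto stop = std::chrono::steady_clock::now();
  const double ns =
      std::chrono::duration<double, std::nano>(stop - start).count();
  std::printf("%.1f ns per PAUSE\n", ns / iterations);
  return 0;
}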
            danblack Daniel Black added a comment - - edited

Re ut_delay: I think I saw svoj test PTHREAD_MUTEX_ADAPTIVE_NP, and it wasn't good.

Maybe it is worth rethinking the adaptive delay in MariaDB towards a per-mutex mean/std approach rather than a straight mean (a rough sketch of the idea follows below). We're already wasting heaps of cycles in ut_delay, so a little extra maths isn't going to hurt too badly.
            https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;h=1519c142bd6ec5cc528c5921f28575d25c7fb664;hb=HEAD#l118

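
A rough sketch of that per-mutex mean/std idea (purely illustrative C++, not MariaDB code; the class name, the 2-sigma cut-off and the yield fallback are assumptions): keep running statistics of how many spin rounds past acquisitions needed, and stop spinning hard once the current attempt exceeds the mean plus a couple of standard deviations.

// Per-lock adaptive spin limit based on a running mean and standard
// deviation (Welford's online algorithm) instead of a plain mean.
// Contending threads may read slightly stale statistics; that is fine
// for a heuristic.
#include <atomic>
#include <cmath>
#include <cstdint>
#include <thread>
#include <immintrin.h>

class AdaptiveSpinLock
{
public:
  void lock()
  {
    const double limit = spin_limit();
    uint64_t spins = 0;
    while (locked_.exchange(true, std::memory_order_acquire))
    {
      if (double(++spins) > limit)
        std::this_thread::yield();   // past the budget: stop burning PAUSEs
      else
        _mm_pause();
    }
    record(spins);                   // runs under the lock: single writer
  }

  void unlock() { locked_.store(false, std::memory_order_release); }

private:
  double spin_limit() const
  {
    const uint64_t n = count_.load(std::memory_order_relaxed);
    if (n < 16)
      return 200;                    // placeholder fixed budget until enough samples
    const double var = m2_.load(std::memory_order_relaxed) / double(n - 1);
    return mean_.load(std::memory_order_relaxed) + 2.0 * std::sqrt(var);
  }

  void record(uint64_t spins)        // Welford update of mean and variance
  {
    const uint64_t n = count_.load(std::memory_order_relaxed) + 1;
    const double mean = mean_.load(std::memory_order_relaxed);
    const double delta = double(spins) - mean;
    const double new_mean = mean + delta / double(n);
    m2_.store(m2_.load(std::memory_order_relaxed)
                  + delta * (double(spins) - new_mean),
              std::memory_order_relaxed);
    mean_.store(new_mean, std::memory_order_relaxed);
    count_.store(n, std::memory_order_relaxed);
  }

  std::atomic<bool> locked_{false};
  std::atomic<uint64_t> count_{0};
  std::atomic<double> mean_{0}, m2_{0};
};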
            wlad Vladislav Vaintroub made changes -
Description

            wlad Vladislav Vaintroub made changes -
            Attachment 10.3.pre-lockless.txt [ 45641 ]
            wlad Vladislav Vaintroub made changes -
Description

            wlad Vladislav Vaintroub made changes -
Description

            wlad Vladislav Vaintroub added a comment - - edited

marko, I did some calculations of bytes per trx, guided by your hint, and this is what I got (surprisingly to me, perhaps, the number of bytes per transaction is not an integer):

            10.3.7
            trxid1 = 193362180
            LSN1 = 69634009621

            trxid2 = 215344539
            LSN2= 75578735333

            bytes per trx = 270.43

            10.2.14

            trxid1 = 11388
            LSN1= 13829598991
            trxid2 = 22135718
            LSN2= 19655598660

            bytes per trx = 263.32

The multiuser benchmark was run between point 1 and point 2.
            Hope that helps


            To reduce the redo log volume, it looks like we have to resume generating the redo log record types MLOG_UNDO_INIT and MLOG_UNDO_HDR_REUSE. The record MLOG_UNDO_INSERT was already restored in MDEV-15914.

            marko Marko Mäkelä added a comment -

            MLOG_UNDO_HDR_REUSE was removed by MDEV-12288. That one we cannot restore. It should be equivalent to MLOG_UNDO_HDR_CREATE in the redo log volume. Similarly, reintroducing MLOG_UNDO_ERASE_END ought to increase the redo log volume written, so we’d better not restore that.

            marko Marko Mäkelä added a comment -
            wlad Vladislav Vaintroub added a comment - - edited

            regarding "relax CPU" number of loops for "rep nop" being 200, https://github.com/MariaDB/server/commit/b3346c2f41d0f7bd0d919826592b37ae09fa41c8#diff-e877f572e9fd77c43d5e7af075f92e74R91 , this is something from me, I measured this as optimal back in 2008, on (probably) Core2 Duo. No idea if this still holds.


            The MLOG_UNDO_INIT record should occupy 3..12 bytes while the lower-level records (MLOG_2BYTES, MLOG_4BYTES) could easily consume 7 more bytes.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä made changes -

            I have now restored the MLOG_UNDO_INIT record, which I had removed in the hope that recovery would be simpler and faster. (But I failed to notice that a small increase in the number of redo log bytes written can hurt performance.)

            I developed a simple test for measuring the volume of redo log written by an update. Note that transaction IDs are incremented both at start and at commit.

            Due to MDEV-12288, it is clear that there will be more writes by the purge in 10.3. Before MDEV-12288, an UPDATE would not result in any purge activity, but after it, in 10.3, all DML will eventually result in the DB_TRX_ID column being zeroed out in the clustered index record. This will improve read performance at some cost to write performance.

            --source include/have_innodb.inc
            --source include/have_sequence.inc
             
            let $n_updates=1;
            let $n_rows=1;
             
            create table t1 (id serial, c2 char(120) default 'foobar') engine=innodb;
            #connect stop_purge, localhost, root;
            #START TRANSACTION WITH CONSISTENT SNAPSHOT;
            #connection default;
            eval insert into t1(id) select * from seq_1_to_$n_rows;
             
            --replace_regex /.*Log sequence number ([0-9]+).*/\1/
            let $lsn= `SHOW ENGINE INNODB STATUS`;
            let $lsn= `SELECT REGEXP_REPLACE('$lsn', 'InnoDB\\t*', '')`;
            --replace_regex /.*Trx id counter ([0-9]+).*/\1/
            let $trx= `SHOW ENGINE INNODB STATUS`;
            let $trx= `SELECT REGEXP_REPLACE('$trx', 'InnoDB\\t*', '')`;
             
            while ($n_updates) {
            update t1 set c2='boofar';
            update t1 set c2='foobar';
            dec $n_updates;
            }
             
            --replace_regex /.*Log sequence number ([0-9]+).*/\1/
            let $new_lsn= `SHOW ENGINE INNODB STATUS`;
            let $new_lsn= `SELECT REGEXP_REPLACE('$new_lsn', 'InnoDB\\t*', '')`;
            --replace_regex /.*Trx id counter ([0-9]+).*/\1/
            let $new_trx= `SHOW ENGINE INNODB STATUS`;
            let $new_trx= `SELECT REGEXP_REPLACE('$new_trx', 'InnoDB\\t*', '')`;
             
            eval SELECT ($new_lsn-$lsn)/($new_trx-$trx) `bytes/(transaction*2)`;
            drop table t1;
            

            In the above test, you can disable purge to observe its effect on the number of redo log bytes written. The test should probably run for a longer time for the purge to kick in.
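
            Outside of mtr, a minimal sketch of the same purge-disabling idea (assuming a passwordless root account) is to keep a background connection holding an old read view open for the duration of the test:

            mysql -uroot -e "START TRANSACTION WITH CONSISTENT SNAPSHOT; DO SLEEP(600);" &
            # purge (including the DB_TRX_ID reset) cannot proceed past this read view while it stays open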

            The effect of MDEV-12288 can also be disabled by removing the body of the function row_purge_reset_trx_id(). After that change, read-only workloads should be somewhat slower due to performing lookups for DB_TRX_ID that have been purged.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä added a comment - - edited

            I can confirm this in a memory-based benchmark:

            sysbench oltp_update_index.lua --tables=20 --table_size=100000 --rand-type=uniform --time=60 --report-interval=10 --threads=40
            

            and

            numactl --interleave=all \
            $MDIR/sql/mysqld --no-defaults --log-output=file --general-log=OFF --core-file --loose-debug-sync-timeout=300 --innodb \
              --datadir=$TDIR --lc-messages-dir=$MDIR/sql/share/ --socket=$MYSQL_SOCK \
              --loose-innodb_file_per_table=true \
              --loose-innodb-open-files=4000 \
              --loose-innodb_data_file_path=ibdata1:10M:autoextend \
              --loose-innodb_flush_log_at_trx_commit=0 \
              --loose-innodb_flush_method=O_DIRECT_NO_FSYNC \
              --loose-innodb-adaptive-hash-index-partitions=32 \
              --loose-innodb_log_buffer_size=256M \
              --loose-innodb_log_file_size=1900M \
              --loose-innodb_buffer_pool_size=50G \
              --loose-innodb-buffer-pool-instances=32 \
              --loose-innodb-thread-concurrency=0 \
              --loose-innodb_log_files_in_group=3 \
              --loose-innodb_read_io_threads=8 \
              --loose-innodb_write_io_threads=8 \
              --loose-innodb_io_capacity=8000 \
              --loose-innodb_flush_sync=0 \
              --table-open-cache=1400 \
              --table-open-cache-instances=64 \
              --max-connections=200 \
              --aria-checkpoint-interval=0 \
              --loose-performance-schema=off
            

            The transactions/second number is:

            version threads=1 threads=40
            10.2 a4e7800701d0764fe4cbb85b81d7c7cb54677334 10031.27 189489.50
            mariadb-10.3.4 9035.31 151952.76
            54c715acca8077d9a1694521bd34d6ef1505d7e1 9398.94 173670.23
            8d1d38f9539e2b0d30fdb9531d06b66c7662e9dd 9424.46 178949.21
            c7d0448797a2013079c3e0b7ca97f7bdc1138696 9489.61 178983.94
            5db9c6e490ea2c6b7997dff4cbee18274496dd2c 9593.49 98489.03
            dcc09afa634b0f525d2e8e7d15ef36e5fdd5d270 9373.73 100722.44
            af566d8a63f908c76d43957e790e861d944486b0 9188.53 104609.54
            mariadb-10.3.5 9625.65 95901.31
            10.3 b2f56085d4249e2cf3ed9f80d284b3ce837e9e29 10160.03 95248.29
            patched b2f56085d4249e2cf3ed9f80d284b3ce837e9e29 9555.82 82289.68

            Something has drastically changed between 10.3.4 and 10.3.5. Commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c is the biggest culprit. I still think that the idea of MDEV-15132 is good, but there must be something bad about this particular implementation.

            I chose commit af566d8a63f908c76d43957e790e861d944486b0 because it is in the middle of some code changes by svoj. There is some visible regression between that and mariadb-10.3.5, and not much improvement since then.

            The patch against the most recent 10.3 essentially disabled the resetting of the DB_TRX_ID (MDEV-12288). This resetting does create some purge activity and cause more writes, but it could also greatly help performance, as demonstrated above. (If DB_TRX_ID=0, we immediately know that there is no older version and the record is not implicitly locked.)

            diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
            index af8cd682288..8b22c90e0d1 100644
            --- a/storage/innobase/row/row0purge.cc
            +++ b/storage/innobase/row/row0purge.cc
            @@ -671,6 +671,7 @@ static
             void
             row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
             {
            +#if 0
             	ut_ad(rw_lock_own(dict_operation_lock, RW_LOCK_S));
             	/* Reset DB_TRX_ID, DB_ROLL_PTR for old records. */
             	mtr->start();
            @@ -729,6 +730,7 @@ row_purge_reset_trx_id(purge_node_t* node, mtr_t* mtr)
             	}
             
             	mtr->commit();
            +#endif
             }
             
             /***********************************************************//**
            diff --git a/storage/innobase/row/row0umod.cc b/storage/innobase/row/row0umod.cc
            index 47c515f70cf..3b5f37adf50 100644
            --- a/storage/innobase/row/row0umod.cc
            +++ b/storage/innobase/row/row0umod.cc
            @@ -385,6 +385,7 @@ row_undo_mod_clust(
             					   false, &mtr);
             		ut_ad(err == DB_SUCCESS
             		      || err == DB_OUT_OF_FILE_SPACE);
            +#if 0
             	} else if (!index->table->is_temporary() && node->new_trx_id) {
             		/* We rolled back a record so that it still exists.
             		We must reset the DB_TRX_ID if the history is no
            @@ -430,6 +431,7 @@ row_undo_mod_clust(
             						  sizeof reset_trx_id, &mtr);
             			}
             		}
            +#endif
             	} else {
             		goto func_exit;
             	}
            

            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Sergey Vojtovich [ svoj ]

            Surprisingly, removing the following supposed optimization

            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 26b857a75f3..218e913d264 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -1341,11 +1341,6 @@ trx_write_serialisation_history(
             #endif
             	const bool update_binlog_pos = trx->mysql_log_file_name
             		&& *trx->mysql_log_file_name;
            -	if (!update_binlog_pos
            -#ifdef WITH_WSREP
            -	    && !update_wsrep
            -#endif
            -	    ) return true;
             
             	buf_block_t* block = trx_sysf_get(mtr);
             #ifdef WITH_WSREP
            

            fixes the performance regression of 5db9c6e490ea2c6b7997dff4cbee18274496dd2c. With this optimization removed, we are always acquiring and releasing the TRX_SYS page latch. Replication and Galera are disabled, after all.

            This patch is not applicable to later versions, because MDEV-15158 later removed the write to the TRX_SYS page altogether. Maybe this page lock somehow served as useful throttling?

            marko Marko Mäkelä added a comment -

            I applied a similar unnecessary acquisition of the TRX_SYS page lock to the latest 10.3, but there it did reduce performance, as expected:

            diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
            index 11d83c96a92..8ded9a03e3d 100644
            --- a/storage/innobase/trx/trx0trx.cc
            +++ b/storage/innobase/trx/trx0trx.cc
            @@ -1120,6 +1120,7 @@ trx_write_serialisation_history(
             
             	mutex_exit(&rseg->mutex);
             
            +	trx_sysf_get(mtr);
             	MONITOR_INC(MONITOR_TRX_COMMIT_UNDO);
             
             	trx->mysql_log_file_name = NULL;
            

            I tried to remove a write that was added by MDEV-15158, but it did not help performance:

            diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
            index 32e050e9b91..3817ce8bcca 100644
            --- a/storage/innobase/trx/trx0purge.cc
            +++ b/storage/innobase/trx/trx0purge.cc
            @@ -254,8 +254,10 @@ trx_purge_add_undo_to_history(const trx_t* trx, trx_undo_t*& undo, mtr_t* mtr)
             			rseg_header + TRX_RSEG_HISTORY_SIZE,
             			hist_size + undo->size, MLOG_4BYTES, mtr);
             
            +#if 0
             		mlog_write_ull(rseg_header + TRX_RSEG_MAX_TRX_ID,
             			       trx_sys.get_max_trx_id(), mtr);
            +#endif
             	}
             
             	/* Before any transaction-generating background threads or the
            
            

            This write is performed while holding rseg->mutex, but there are 128 rollback segments, so I do not think it should matter with 40 concurrent users, if each transaction is assigned a different rollback segment.

            marko Marko Mäkelä added a comment -

            My 10.3 (b2f56085d4249e2cf3ed9f80d284b3ce837e9e29) results are not in line with what marko reported:

            [ 10s ] thds: 40 tps: 246050.74 qps: 246050.94 (r/w/o: 0.00/246050.94/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
            [ 20s ] thds: 40 tps: 266876.11 qps: 266876.11 (r/w/o: 0.00/266876.11/0.00) lat (ms,95%): 0.24 err/s: 0.00 reconn/s: 0.00
            [ 30s ] thds: 40 tps: 265013.72 qps: 265013.52 (r/w/o: 0.00/265013.52/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
            [ 40s ] thds: 40 tps: 267120.70 qps: 267120.70 (r/w/o: 0.00/267120.70/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
            [ 50s ] thds: 40 tps: 267445.25 qps: 267445.25 (r/w/o: 0.00/267445.25/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
            [ 60s ] thds: 40 tps: 265585.03 qps: 265585.03 (r/w/o: 0.00/265585.03/0.00) lat (ms,95%): 0.24 err/s: 0.00 reconn/s: 0.00
            [ 70s ] thds: 40 tps: 262178.67 qps: 262178.67 (r/w/o: 0.00/262178.67/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
            [ 80s ] thds: 40 tps: 265030.96 qps: 265030.96 (r/w/o: 0.00/265030.96/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
            [ 90s ] thds: 40 tps: 263325.84 qps: 263325.94 (r/w/o: 0.00/263325.94/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
            [ 100s ] thds: 40 tps: 249319.73 qps: 249319.63 (r/w/o: 0.00/249319.63/0.00) lat (ms,95%): 0.28 err/s: 0.00 reconn/s: 0.00
            [ 110s ] thds: 40 tps: 263994.66 qps: 263994.66 (r/w/o: 0.00/263994.66/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
            [ 120s ] thds: 40 tps: 263167.54 qps: 263167.54 (r/w/o: 0.00/263167.54/0.00) lat (ms,95%): 0.23 err/s: 0.00 reconn/s: 0.00
             
                transactions:                        31452722 (262087.56 per sec.)
                queries:                             31452722 (262087.56 per sec.)
            

            However, the 10.2 results are perfectly in line:

            [ 10s ] thds: 40 tps: 189485.74 qps: 189485.84 (r/w/o: 0.00/189485.84/0.00) lat (ms,95%): 0.34 err/s: 0.00 reconn/s: 0.00
            [ 20s ] thds: 40 tps: 198558.47 qps: 198558.37 (r/w/o: 0.00/198558.37/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 30s ] thds: 40 tps: 195900.53 qps: 195900.53 (r/w/o: 0.00/195900.53/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 40s ] thds: 40 tps: 196844.47 qps: 196844.47 (r/w/o: 0.00/196844.47/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 50s ] thds: 40 tps: 198759.66 qps: 198759.66 (r/w/o: 0.00/198759.66/0.00) lat (ms,95%): 0.32 err/s: 0.00 reconn/s: 0.00
            [ 60s ] thds: 40 tps: 197423.50 qps: 197423.50 (r/w/o: 0.00/197423.50/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 70s ] thds: 40 tps: 196168.50 qps: 196168.50 (r/w/o: 0.00/196168.50/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 80s ] thds: 40 tps: 196777.38 qps: 196777.38 (r/w/o: 0.00/196777.38/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 90s ] thds: 40 tps: 198326.91 qps: 198326.91 (r/w/o: 0.00/198326.91/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 100s ] thds: 40 tps: 196600.91 qps: 196600.91 (r/w/o: 0.00/196600.91/0.00) lat (ms,95%): 0.32 err/s: 0.00 reconn/s: 0.00
            [ 110s ] thds: 40 tps: 198558.49 qps: 198558.49 (r/w/o: 0.00/198558.49/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
            [ 120s ] thds: 40 tps: 197079.29 qps: 197079.29 (r/w/o: 0.00/197079.29/0.00) lat (ms,95%): 0.31 err/s: 0.00 reconn/s: 0.00
             
                transactions:                        23605935 (196703.44 per sec.)
                queries:                             23605935 (196703.44 per sec.)
            

            svoj Sergey Vojtovich added a comment -

            Maybe you need to run just a little bit longer, like marko and I did.

            wlad Vladislav Vaintroub added a comment -

            OTOH, marko ran for 2 minutes as well. I reported results from 300-second runs.

            wlad Vladislav Vaintroub added a comment -

            I used the following build configuration for both 10.2 and 10.3:

            cmake -DCONC_WITH_{UNITTEST,SSL}=OFF -DWITH_EMBEDDED_SERVER=OFF -DWITH_UNIT_TESTS=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPLUGIN_{TOKUDB,MROONGA,OQGRAPH,ROCKSDB,CONNECT}=NO -DPLUGIN_PERFSCHEMA=ON -DWITH_SAFEMALLOC=OFF -DWITH_SSL=bundled ..
            

            svoj used the following:

            cmake -DBUILD_CONFIG=mysql_release -DPLUGIN_PERFSCHEMA=NO -DPLUGIN_CONNECT=NO -DPLUGIN_MROONGA=NO -DPLUGIN_TOKUDB=NO -DPLUGIN_SPIDER=NO -DPLUGIN_OQGRAPH=NO -DWITH_JEMALLOC=system ..
            

            I retested commit 5db9c6e490ea2c6b7997dff4cbee18274496dd2c with the latter config:

            build threads=1 threads=40
            Marko 9593.49 98489.03
            Svoj 9272.23 217758.55

            I applied the patch to remove the optimization from trx_write_serialisation_history() (which counter-intuitively improved performance from about 100ktps to 178ktps using my config):

            build threads=1 threads=40
            Marko 9489.61 178983.94
            Svoj 9177.91 237928.43

            So, removing the optimization does help in both cases, but much more with my build configuration. This change is not relevant for solving the regression, because the code has been seriously rewritten since then, but it shows that performance is hard. Perhaps the explanation could be that the TRX_SYS page lock serves as some kind of useful throttle that reduces contention on the rseg->mutex? It would seem best to acquire and release rseg->mutex only once during commit, and to try to ensure that the transactions running in different threads do not get assigned the same rseg (of course, this is only possible for up to 128 concurrent threads).

            marko Marko Mäkelä added a comment -

            I tried modifying svoj’s config by removing -DWITH_JEMALLOC=system. Performance remained roughly the same:

            Svoj’s build without jemalloc threads=1 threads=40
            5db9c6e490ea2c6b7997dff4cbee18274496dd2c 9704.28 204079.81
            5db9c6e490ea2c6b7997dff4cbee18274496dd2c with patch 9355.49 237138.82

            So, jemalloc does help multi-client performance a little compared to GNU libc 2.27, but not that much, and the relative performance difference remains.

            A major difference between our build configurations could be that cmake/build_configurations/mysql_release.cmake uses -O3 while -DCMAKE_BUILD_TYPE=RelWithDebInfo only uses -O2. After replacing -O2 with -O3 in my configuration, I got the following:

            Marko’s build with -O3 threads=1 threads=40
            5db9c6e490ea2c6b7997dff4cbee18274496dd2c 10477.51 199604.51

            This suggests that there could be a clear regression from 10.2 to 10.3 when compiling with -O2. I hope that most distributed packages use -O3.
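
            A minimal sketch of forcing -O3 in an otherwise unchanged RelWithDebInfo build (assuming the usual GCC defaults of "-O2 -g -DNDEBUG" for that build type):

            cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo \
                  -DCMAKE_C_FLAGS_RELWITHDEBINFO="-O3 -g -DNDEBUG" \
                  -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O3 -g -DNDEBUG" ..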

            marko Marko Mäkelä added a comment -
            wlad Vladislav Vaintroub added a comment - - edited

            Distributed packages are compiled with -DBUILD_CONFIG=mysql_release.

            Still, I suggest looking at more than just threads=1 and threads=40; 40 threads might be a sweet spot on svoj's machine. I tried 1,...,4096, which is more comprehensive. I also suggest varying the durability settings (innodb_flush_log_at_trx_commit=1). innodb_flush_method=O_DIRECT_NO_FSYNC is also a trick that pretends your box has a battery-backed disk, which it probably does not. The same goes for dataset size, AHI vs no AHI, threadpool vs no threadpool, and so on.

            Seeing the full picture after running two tests of 2 minutes each is not easy. I would really start with a full comparison across different levels of concurrency, and then vary the other parameters if that does not show the regression. A sketch of such a sweep is below.
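
            A minimal sketch of such a concurrency sweep (SYSBENCH_OPTS is a placeholder for the same sysbench options as in the run command; the pause length is illustrative):

            for N in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096; do
              sysbench $SYSBENCH_OPTS --num-threads=$N run
              sleep 60   # let background I/O settle before the next concurrency level
            done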


            I had a systematic error, caused by not rebuilding from scratch. It looks like -DBUILD_CONFIG=mysql_release does use -O2 after all.

            My observations about 5db9c6e490ea2c6b7997dff4cbee18274496dd2c and its predecessor are repeatable. And I still see a significant performance regression with 10.3.5 or the latest 10.3. Here are some rough results from 3-minute runs:

            revision threads=40
            5db9c6e490ea2c6b7997dff4cbee18274496dd2c 230ktps
            af566d8a63f908c76d43957e790e861d944486b0 228346.51tps
            53cc9aa5be63b4dd03aba804aebaa3bd470dab9c 186491.66tps
            c0d5d7c0efb6d2b9961da7fb813ff000a5d3e4c5 210410.54tps
            bc7a1dc1fbd27e6064d3b40443fe242397668af7 225957.50tps
            ba125eca551e3f5faee9f6d7ebed290f04cfc8fd 143308.74tps
            b782971c58b5656820429b8ef3fae5fd82f5a0f7 137140.79tps

            There is some fluctuation in these numbers, so 10% variance could be ignored.

            Surprisingly, 53cc9aa5be63b4dd03aba804aebaa3bd470dab9c (MDEV-15104 - Remove trx_sys_t::rw_trx_ids) reduces performance compared to its parent af566d8a63f908c76d43957e790e861d944486b0. The next InnoDB commit c0d5d7c0efb6d2b9961da7fb813ff000a5d3e4c5 (MDEV-15104 - Remove trx_sys_t::serialisation_list) is better again. This is what I see with GCC 7.3.0; @svoj sees a steady improvement for these commits with GCC 5.4.0.

            Another interesting pair is bc7a1dc1fbd27e6064d3b40443fe242397668af7 and b782971c58b5656820429b8ef3fae5fd82f5a0f7. The latter commit fixed a correctness bug that was introduced by the former. There are quite a few changes in between, and there appears to be some regression already in the parent of the fix (ba125eca551e3f5faee9f6d7ebed290f04cfc8fd).

            Next I will try to find the regression between bc7a1dc1fbd27e6064d3b40443fe242397668af7 and ba125eca551e3f5faee9f6d7ebed290f04cfc8fd.
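
            One way to narrow that range down is git bisect driven by a benchmark script, as a sketch (assuming bc7a1dc1fbd27e6064d3b40443fe242397668af7 is the earlier, faster revision; the wrapper script name is hypothetical):

            git bisect start ba125eca551e3f5faee9f6d7ebed290f04cfc8fd bc7a1dc1fbd27e6064d3b40443fe242397668af7  # <bad> <good>
            git bisect run ./bench-threads40.sh   # the script should exit non-zero when it measures the regressed throughput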

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä added a comment - - edited
            revision threads=40
            bc7a1dc1fbd27e6064d3b40443fe242397668af7 225957.50tps
            d6ed077fc82c2f4d20895db9a6b16ca295f23c33 212136.14tps
            217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 197112.85 tps
            217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 without st_mem_root::name 214732.32tps
            eda142590f385b3d79e7b6a6e0cc13ba8fd25961 133523.27tps
            53476abce829f14b9d591a9bf160bc07cffaa768 127573.76tps

            Commit 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 is a merge from bb-10.2-ext, most notably adding st_mem_root::name. That is clearly costing some performance compared to its parent d6ed077fc82c2f4d20895db9a6b16ca295f23c33. I tried reverting the patch:

            git show d69642deddf4f8d51f8c6807361576451334f1f9|patch -Rp1
            # note: in 10.3, sql/rpl_handler.h was renamed to sql/rpl_rli.h
            # note: adjust a few init_alloc_root() calls that did not exist in bb-10.2-ext
            

            It seems that we must define st_mem_root::name in debug builds only.

            Still, the additional regression between 217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 and eda142590f385b3d79e7b6a6e0cc13ba8fd25961 has to be narrowed down. It is outside InnoDB; the InnoDB differences in that range are trivial integer type mismatch warning fixes and fixes of source code comments. The only other commits in this range are CTE fixes (MDEV-15162 and MDEV-15119).

            wlad Vladislav Vaintroub made changes -
            Attachment winpmp_10_3_7.txt [ 45665 ]
            revision threads=40
            217fc122c8e893aa48fa5dcfb3dbfcc99a2aa299 197112.85tps
            465979eabfe3acacc0566fa93402e449e1a9f476 229538.83tps
            f271100836d8a91a775894ec36b869a66a3145e5 205113.14tps
            6c279ad6a71c63cb595fde7c951aadb31c3dbebc 123927.89tps

            It is surprising that the performance improved in the middle, even though st_mem_root::name is still present. The CTE fixes seem innocent with respect to performance.

            It is also surprising that the MDEV-15091 size_t fix (commit 6c279ad6a71c63cb595fde7c951aadb31c3dbebc) is hurting so much. That seems to be the main source of regression (for my RAM-based toy benchmark).

            I would suggest introducing a compile-time option for enabling st_mem_root::name (disabled by default), and revisiting MDEV-15091.

            marko Marko Mäkelä added a comment -
            wlad Vladislav Vaintroub made changes -
            Attachment winpmp_10_2_14.txt [ 45666 ]
            svoj Sergey Vojtovich added a comment - - edited

            Here's what I get on 6c279ad6a71c63cb595fde7c951aadb31c3dbebc:

            [ 10s ] thds: 40 tps: 245242.59 qps: 245242.69 (r/w/o: 0.00/245242.69/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
            [ 20s ] thds: 40 tps: 254569.26 qps: 254569.16 (r/w/o: 0.00/254569.16/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
            [ 30s ] thds: 40 tps: 254972.73 qps: 254972.73 (r/w/o: 0.00/254972.73/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
            [ 40s ] thds: 40 tps: 252851.43 qps: 252851.53 (r/w/o: 0.00/252851.53/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
            [ 50s ] thds: 40 tps: 251944.58 qps: 251944.48 (r/w/o: 0.00/251944.48/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
            [ 60s ] thds: 40 tps: 253112.17 qps: 253112.17 (r/w/o: 0.00/253112.17/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
            [ 70s ] thds: 40 tps: 250768.06 qps: 250768.06 (r/w/o: 0.00/250768.06/0.00) lat (ms,95%): 0.27 err/s: 0.00 reconn/s: 0.00
            [ 80s ] thds: 40 tps: 246130.98 qps: 246130.98 (r/w/o: 0.00/246130.98/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
            [ 90s ] thds: 40 tps: 243231.00 qps: 243231.00 (r/w/o: 0.00/243231.00/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
            [ 100s ] thds: 40 tps: 245256.26 qps: 245256.26 (r/w/o: 0.00/245256.26/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
            [ 110s ] thds: 40 tps: 246891.01 qps: 246891.01 (r/w/o: 0.00/246891.01/0.00) lat (ms,95%): 0.26 err/s: 0.00 reconn/s: 0.00
            [ 120s ] thds: 40 tps: 246429.67 qps: 246429.87 (r/w/o: 0.00/246429.87/0.00) lat (ms,95%): 0.25 err/s: 0.00 reconn/s: 0.00
            SQL statistics:
                queries performed:
                    read:                            0
                    write:                           29916061
                    other:                           0
                    total:                           29916061
                transactions:                        29916061 (249278.75 per sec.)
                queries:                             29916061 (249278.75 per sec.)
            

            Sometimes I experience similar turbulence in benchmarks, which I haven't been able to explain so far. It usually goes away by itself, and I can't reliably analyse it.
            I guess that if marko re-ran the benchmark against 6c279ad6a71c63cb595fde7c951aadb31c3dbebc, e.g. tomorrow, he would see a totally different picture.

            wlad Vladislav Vaintroub added a comment - - edited

            I get the feeling that the original report is being completely ignored, and that everyone is now running ramdisk benchmarks, testing CPU caches and compiler optimizations, on 40 threads, on 40-core machines, for 2 minutes.


            You're right, we need to concentrate on the original report, which I couldn't reproduce locally, most probably due to known issues with Samsung SSDs being slow on heavy fsync.

            I will try to look into ramdisk performance more closely, with longer runs and more concurrent connections (like 1000).

            svoj Sergey Vojtovich added a comment -

            I agree that we must address the I/O issue, but at the same time it could be worth looking for CPU regressions, that is, approaching repeatable regressions from multiple angles. For what it is worth, I repeated some 180-second ramdisk benchmarks today:

            revision threads=40
            6c279ad6a71c63cb595fde7c951aadb31c3dbebc 130213.08tps
            6c279ad6a71c63cb595fde7c951aadb31c3dbebc (patch 1) 134518.90tps
            7ad9b3588f53672b3d2d9f345ec9ab7335b50206 104287.79tps
            7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 2) 112871.50tps
            7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1m+2) 109021.72tps
            7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1+2) 114488.47tps

            The baseline was 123927.89tps before the weekend, and now it is 130213.08tps, so there is indeed some turbulence in the figures even after I switched to always doing a full rebuild.

            Patch 1 reverts some struct or class member variables from size_t to a narrower type, such as uint or uint32. A minimal version of the patch (1m) is below:

            diff --git a/sql/sql_cache.h b/sql/sql_cache.h
            index 0ed45a9ed81..ca2c3a59ee7 100644
            --- a/sql/sql_cache.h
            +++ b/sql/sql_cache.h
            @@ -350,7 +350,7 @@ class Query_cache
               size_t min_allocation_unit, min_result_data_size;
               uint def_query_hash_size, def_table_hash_size;
               
            -  size_t mem_bin_num, mem_bin_steps;		// See at init_cache & find_bin
            +  uint mem_bin_num, mem_bin_steps;		// See at init_cache & find_bin
             
               bool initialized;
             
            

            In my test results above, this minimal patch appeared to cause a slight performance regression, but I think it can be explained as falling within the 5-10% fluctuation.

            Patch 2 removes st_mem_root::name and the name parameter of init_alloc_root():

            diff --git a/include/my_alloc.h b/include/my_alloc.h
            index 3ba8a02fc48..c9879125d96 100644
            --- a/include/my_alloc.h
            +++ b/include/my_alloc.h
            @@ -52,7 +52,9 @@ typedef struct st_mem_root
               unsigned int first_block_usage;
             
               void (*error_handler)(void);
            +#ifndef DBUG_OFF
               const char *name;
            +#endif
             } MEM_ROOT;
             
             #ifdef  __cplusplus
            diff --git a/include/my_sys.h b/include/my_sys.h
            index 13ab7b12320..fb18cd92e8a 100644
            --- a/include/my_sys.h
            +++ b/include/my_sys.h
            @@ -888,9 +888,16 @@ extern void my_free_lock(void *ptr);
             #define alloc_root_inited(A) ((A)->min_malloc != 0)
             #define ALLOC_ROOT_MIN_BLOCK_SIZE (MALLOC_OVERHEAD + sizeof(USED_MEM) + 8)
             #define clear_alloc_root(A) do { (A)->free= (A)->used= (A)->pre_alloc= 0; (A)->min_malloc=0;} while(0)
            +#ifndef DBUG_OFF
             extern void init_alloc_root(MEM_ROOT *mem_root, const char *name,
                                         size_t block_size, size_t pre_alloc_size,
                                         myf my_flags);
            +#else
            +extern void init_alloc_Root(MEM_ROOT *mem_root,
            +                            size_t block_size, size_t pre_alloc_size,
            +                            myf my_flags);
            +# define init_alloc_root(m,n,b,p,f) init_alloc_Root(m,b,p,f)
            +#endif
             extern void *alloc_root(MEM_ROOT *mem_root, size_t Size);
             extern void *multi_alloc_root(MEM_ROOT *mem_root, ...);
             extern void free_root(MEM_ROOT *root, myf MyFLAGS);
            diff --git a/mysys/my_alloc.c b/mysys/my_alloc.c
            index 7139466be17..fddc9118312 100644
            --- a/mysys/my_alloc.c
            +++ b/mysys/my_alloc.c
            @@ -54,9 +54,15 @@
                 Because of this, we store in MY_THREAD_SPECIFIC as bit 1 in block_size
             */
             
            +#ifndef DBUG_OFF
             void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
             		     size_t pre_alloc_size __attribute__((unused)),
                                  myf my_flags)
            +#else
            +void init_alloc_Root(MEM_ROOT *mem_root, size_t block_size,
            +		     size_t pre_alloc_size __attribute__((unused)),
            +                     myf my_flags)
            +#endif
             {
               DBUG_ENTER("init_alloc_root");
               DBUG_PRINT("enter",("root: %p  name: %s  prealloc: %zu", mem_root,
            @@ -72,7 +78,9 @@ void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
               mem_root->block_num= 4;			/* We shift this with >>2 */
               mem_root->first_block_usage= 0;
               mem_root->total_alloc= 0;
            +#ifndef DBUG_OFF
               mem_root->name= name;
            +#endif
             
             #if !(defined(HAVE_valgrind) && defined(EXTRA_DEBUG))
               if (pre_alloc_size)
            

            I would suggest including this patch in 10.3 in some form.

            With GCC 7.3.0 and the 3-minute CPU-bound ramdisk benchmark, we still appear to have a significant regression somewhere between 10.3.4 and 10.3.5. I will keep trying to track it down while svoj can concentrate on the originally reported regression.

            marko Marko Mäkelä added a comment - I agree that we must address the I/O issue, but at the same time it could be worth looking for CPU regressions, that is, approaching repeatable regressions from multiple angles. For what it is worth, I repeated some 180-second ramdisk benchmarks today:

              revision                                               threads=40
              6c279ad6a71c63cb595fde7c951aadb31c3dbebc               130213.08tps
              6c279ad6a71c63cb595fde7c951aadb31c3dbebc (patch 1)     134518.90tps
              7ad9b3588f53672b3d2d9f345ec9ab7335b50206               104287.79tps
              7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 2)     112871.50tps
              7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1m+2)  109021.72tps
              7ad9b3588f53672b3d2d9f345ec9ab7335b50206 (patch 1+2)   114488.47tps

            The baseline was 123927.89tps before the weekend, and now it is 130213.08tps, so there indeed is some turbulence in the figures even after I switched to always doing a full rebuild.

            Patch 1 reverts some struct or class member variables from size_t to a narrower type, such as uint or uint32. A minimal version of the patch (1m) is below:

              diff --git a/sql/sql_cache.h b/sql/sql_cache.h
              index 0ed45a9ed81..ca2c3a59ee7 100644
              --- a/sql/sql_cache.h
              +++ b/sql/sql_cache.h
              @@ -350,7 +350,7 @@ class Query_cache
                 size_t min_allocation_unit, min_result_data_size;
                 uint def_query_hash_size, def_table_hash_size;
              -  size_t mem_bin_num, mem_bin_steps;    // See at init_cache & find_bin
              +  uint mem_bin_num, mem_bin_steps;      // See at init_cache & find_bin
                 bool initialized;

            In my test results above, this minimal patch appeared to cause a slight performance regression, but I think it can be explained as falling within the 5% to 10% fluctuation between runs.

            Patch 2 removes st_mem_root::name and the name parameter of init_alloc_root():

              diff --git a/include/my_alloc.h b/include/my_alloc.h
              index 3ba8a02fc48..c9879125d96 100644
              --- a/include/my_alloc.h
              +++ b/include/my_alloc.h
              @@ -52,7 +52,9 @@ typedef struct st_mem_root
                 unsigned int first_block_usage;

                 void (*error_handler)(void);
              +#ifndef DBUG_OFF
                 const char *name;
              +#endif
               } MEM_ROOT;

               #ifdef __cplusplus
              diff --git a/include/my_sys.h b/include/my_sys.h
              index 13ab7b12320..fb18cd92e8a 100644
              --- a/include/my_sys.h
              +++ b/include/my_sys.h
              @@ -888,9 +888,16 @@ extern void my_free_lock(void *ptr);
               #define alloc_root_inited(A) ((A)->min_malloc != 0)
               #define ALLOC_ROOT_MIN_BLOCK_SIZE (MALLOC_OVERHEAD + sizeof(USED_MEM) + 8)
               #define clear_alloc_root(A) do { (A)->free= (A)->used= (A)->pre_alloc= 0; (A)->min_malloc=0;} while(0)
              +#ifndef DBUG_OFF
               extern void init_alloc_root(MEM_ROOT *mem_root, const char *name,
                                           size_t block_size, size_t pre_alloc_size,
                                           myf my_flags);
              +#else
              +extern void init_alloc_Root(MEM_ROOT *mem_root,
              +                            size_t block_size, size_t pre_alloc_size,
              +                            myf my_flags);
              +# define init_alloc_root(m,n,b,p,f) init_alloc_Root(m,b,p,f)
              +#endif
               extern void *alloc_root(MEM_ROOT *mem_root, size_t Size);
               extern void *multi_alloc_root(MEM_ROOT *mem_root, ...);
               extern void free_root(MEM_ROOT *root, myf MyFLAGS);
              diff --git a/mysys/my_alloc.c b/mysys/my_alloc.c
              index 7139466be17..fddc9118312 100644
              --- a/mysys/my_alloc.c
              +++ b/mysys/my_alloc.c
              @@ -54,9 +54,15 @@
                 Because of this, we store in MY_THREAD_SPECIFIC as bit 1 in block_size
               */
              +#ifndef DBUG_OFF
               void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
                                    size_t pre_alloc_size __attribute__((unused)),
                                    myf my_flags)
              +#else
              +void init_alloc_Root(MEM_ROOT *mem_root, size_t block_size,
              +                     size_t pre_alloc_size __attribute__((unused)),
              +                     myf my_flags)
              +#endif
               {
                 DBUG_ENTER("init_alloc_root");
                 DBUG_PRINT("enter",("root: %p  name: %s  prealloc: %zu", mem_root,
              @@ -72,7 +78,9 @@ void init_alloc_root(MEM_ROOT *mem_root, const char *name, size_t block_size,
                 mem_root->block_num= 4;                 /* We shift this with >>2 */
                 mem_root->first_block_usage= 0;
                 mem_root->total_alloc= 0;
              +#ifndef DBUG_OFF
                 mem_root->name= name;
              +#endif
               #if !(defined(HAVE_valgrind) && defined(EXTRA_DEBUG))
                 if (pre_alloc_size)

            I would suggest including this patch in 10.3 in some form.

            With GCC 7.3.0 and the 3-minute CPU-bound ramdisk benchmark, we still appear to have a significant regression somewhere between 10.3.4 and 10.3.5. I will keep trying to trace it down while Svoj can concentrate on the originally reported regression.
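            To make the intent of patch 2 above concrete, here is a standalone sketch (not MariaDB source; EXAMPLE_MEM_ROOT and its members are invented for illustration) showing how a member guarded by #ifndef DBUG_OFF is compiled out of release builds, shrinking every instance of a hot struct by a pointer plus padding:

              // Illustrative sketch, not MariaDB source: release builds define
              // DBUG_OFF, so the debug-only member disappears and the struct
              // shrinks, which is the effect patch 2 aims for with MEM_ROOT.
              #include <cstddef>
              #include <cstdio>

              struct EXAMPLE_MEM_ROOT          // invented name, stands in for MEM_ROOT
              {
                std::size_t block_size;        // always present
                unsigned int block_num;        // always present
              #ifndef DBUG_OFF
                const char *name;              // debug-only: compiled out with -DDBUG_OFF
              #endif
              };

              int main()
              {
                // Build once normally and once with -DDBUG_OFF to compare the sizes.
                std::printf("sizeof(EXAMPLE_MEM_ROOT) = %zu\n", sizeof(EXAMPLE_MEM_ROOT));
                return 0;
              }

            With such a change, every MEM_ROOT that is created, cleared and copied touches a few bytes less, which is the kind of small CPU saving these patches probe for.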

            I can still repeat the huge drop caused by commit 6c279ad6a71c63cb595fde7c951aadb31c3dbebc: from 207119.69tps to 130029.74tps. I narrowed it down to the sql_class.h and sql_class.cc changes in that commit, but when I reverted that part of the patch in the latest 10.3, I did not see any improvement.

            In my benchmark, I would see contention in lock_sys.mutex and log_sys.mutex. The situation should be improved by MDEV-16232 and MDEV-14425.

            Also, at low concurrency, it seems that purge is wasting resources and creating unnecessary contention. Perhaps the number of purge threads should be scaled automatically based on the number of purgeable records?

            marko Marko Mäkelä added a comment -
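            The preceding comment raises the idea of scaling purge concurrency with the number of purgeable records. Below is a minimal sketch of one possible heuristic; suggest_purge_threads, records_per_thread and the threshold value are invented for illustration and are not server code or a tuned policy.

              // Illustrative sketch only: pick a purge thread count from the
              // backlog of purgeable undo records, capped by the configured
              // maximum (in the spirit of innodb_purge_threads).
              #include <algorithm>
              #include <cstdint>

              unsigned suggest_purge_threads(std::uint64_t purgeable_records,
                                             unsigned configured_max)
              {
                // Below this backlog a single purge thread tends to create less
                // mutex contention than several mostly idle ones (arbitrary value).
                constexpr std::uint64_t records_per_thread = 100000;

                const std::uint64_t wanted = purgeable_records / records_per_thread + 1;
                return static_cast<unsigned>(
                    std::min<std::uint64_t>(wanted, configured_max));
              }

            A purge coordinator could re-evaluate such a value after each batch and park or wake worker threads accordingly, so that a server running at low concurrency does not pay for mostly idle purge workers.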
            serg Sergei Golubchik made changes -
            Affects Version/s 10.3 [ 22126 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.3 [ 22126 ]

            A bad merge of MDEV-14705 to MariaDB 10.3.6 causes frequent calls to service_manager_extend_timeout() from all purge threads. These calls were supposed to only occur during a slow shutdown. This will be fixed in MDEV-17003.

            Given that the performance regression in this ticket was reported on Windows, where the systemd interface is not present, this probably does not explain much of the regression. Also, I probably did not have libsystemd-dev installed when compiling the code for my benchmark runs.

            marko Marko Mäkelä added a comment -
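            For context, the intended behaviour described above (notifying the service manager only during a slow shutdown) amounts to a guard like the following sketch. All identifiers here are hypothetical stand-ins; the real code calls service_manager_extend_timeout(), whose exact signature is not reproduced here.

              // Illustrative sketch only: extend the systemd watchdog timeout
              // from purge solely while a slow shutdown is in progress.
              enum class purge_state { running, slow_shutdown };

              static void notify_service_manager(unsigned extra_seconds)
              {
                // Stub for the sketch; a real build would forward this to systemd
                // and compile to a no-op where systemd is absent (e.g. Windows).
                (void) extra_seconds;
              }

              void purge_batch_completed(purge_state state)
              {
                if (state != purge_state::slow_shutdown)
                  return;  // normal operation: no watchdog traffic from purge

                // A slow shutdown may legitimately spend a long time per batch,
                // so ask the service manager for more time instead of being
                // killed by its watchdog.
                notify_service_manager(30 /* seconds, arbitrary for the sketch */);
              }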
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Sergey Vojtovich [ svoj ] Marko Mäkelä [ marko ]

            By now we all know that MDEV-12288 caused this issue; the change simply arrived without being clearly announced at the time.

            wlad Vladislav Vaintroub added a comment -
            wlad Vladislav Vaintroub made changes -
            Resolution Date 2020-11-17 13:34:08
            wlad Vladislav Vaintroub made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Won't Fix [ 2 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 87225 ] MariaDB v4 [ 154359 ]

            People

              Assignee: marko Marko Mäkelä
              Reporter: wlad Vladislav Vaintroub
              Votes: 1
              Watchers: 8
