MariaDB Server / MDEV-34458

wait_for_read in buf_page_get_low hurts performance

Details

    Description

      I'm benchmarking 10.4 vs 10.11 using sysbench and the oltp_update_index workload.
      It is a rather large benchmark, with about 23GB data , which is run with

      sysbench --db-driver=mysql --range-size=100 --table-size=5000000 --tables=8 --threads=16 --events=0 --warmup-time=5 --time=600 oltp_update_index --rand-type=uniform

      I run it "cold", i,e with slow shutdown after the load, and without bufferpool loading, i.e

      innodb_fast_shutdown=0
      innodb_buffer_pool_load_at_startup=OFF
      innodb_buffer_pool_dump_at_shutdown=OFF
      

      One thing I noticed is a very slow warmup compared to 10.4: 10.4 starts at about 40K tps, while 10.11 starts at about 2K tps and reaches normal speed (~40-50K tps) only after about 140 seconds. During that time mysqld reads from the SSD at 80 MB/s in 10.4, and 8-10 times slower in 10.11.

      Note that this is still a mostly in-memory benchmark; the buffer pool is large enough.

      I noticed a particular stack that shows up often during warmup, but not afterwards, and I would attribute the slowness to it.

      NtDelayExecution
      RtlDelayExecution
      SleepEx
      std::this_thread::sleep_until<std::chrono::steady_clock,std::chrono::duration<__int64,std::ratio<1,1000000000> > >
      std::this_thread::sleep_for
      buf_page_get_low
      btr_latch_prev
      btr_cur_t::search_leaf
      row_ins_sec_index_entry_low
      row_ins_sec_index_entry
      row_upd_sec_index_entry
      row_upd_sec_step
      row_upd
      row_upd_step
      row_update_for_mysql
      ha_innobase::update_row
      handler::ha_update_row
      mysql_update
      mysql_execute_command
      Prepared_statement::execute
      Prepared_statement::execute_loop
      mysql_stmt_execute_common
      mysqld_stmt_execute
      dispatch_command
      do_command
      ....
      

      Note that Windows' Sleep cannot be shorter than 1 ms, and with the default timer resolution the sleeps would still amount to something like 14 ms.
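
      For reference, here is a small standalone C++ snippet (an illustration only, not MariaDB code) that measures what a nominally tiny sleep really costs; on Windows, with the default timer resolution of roughly 15.6 ms, a 100-microsecond request typically blocks for one or more milliseconds:

      // Standalone illustration (not MariaDB code): measure what a short
      // sleep really costs on this platform.
      #include <chrono>
      #include <cstdio>
      #include <thread>

      int main()
      {
        using clock = std::chrono::steady_clock;
        for (int i = 0; i < 5; i++)
        {
          auto start = clock::now();
          std::this_thread::sleep_for(std::chrono::microseconds(100));
          auto elapsed = std::chrono::duration_cast<std::chrono::microseconds>(
              clock::now() - start);
          std::printf("requested 100us, slept %lldus\n",
                      static_cast<long long>(elapsed.count()));
        }
        return 0;
      }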

      According to Marko, this sleeping is due to MDEV-33543, which is related to the change buffer and is not needed in 11.x anymore.
      UPD: If I revert 90b95c6149c72f43aa2324353a76f370d018a5ac (MDEV-33543), there is no warmup phase and performance is much better.


          Activity

            marko Marko Mäkelä added a comment

            The change buffer was removed in MDEV-29694 (11.0). I think that in the earliest maintained 11.x branch we should revert the MDEV-33543 change and also remove all references to BTR_MODIFY_PREV, which was only used by the change buffer.

            For 10.6 and 10.11, we will need a more complex solution. My intuition says that we could introduce a new mode value in buf_page_get_gen() and buf_page_get_low() to signal that the call comes from btr_latch_prev(), not for BTR_SEARCH_PREV but for the change-buffer-specific BTR_MODIFY_PREV.
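
            To make the suggestion concrete, here is a self-contained C++ toy sketch (the names Intent, Page and get_page are made up; this is not the real buf_page_get_gen()/buf_page_get_low() interface), under one plausible reading of the idea: the sleep-based wait from MDEV-33543 would apply only when the caller's intent is the change-buffer BTR_MODIFY_PREV case, while other callers would bail out and use ordered latching:

            // Toy model of the suggested caller-intent mode; not MariaDB code.
            #include <atomic>
            #include <chrono>
            #include <cstdio>
            #include <optional>
            #include <thread>

            // Why the caller wants the previous page; a real implementation
            // would be an extra mode in buf_page_get_gen()/buf_page_get_low().
            enum class Intent { Search, SearchPrev, ModifyPrevChangeBuffer };

            struct Page
            {
              std::atomic<bool> being_read{true};  // simulates a pending read
              int data= 0;
            };

            // Returns the page if it is usable; an empty optional tells the
            // caller to fall back to ordered latching instead of waiting here.
            std::optional<Page*> get_page(Page &p, Intent intent)
            {
              if (p.being_read.load())
              {
                if (intent != Intent::ModifyPrevChangeBuffer)
                  return std::nullopt;             // do not wait: bail out
                while (p.being_read.load())        // change-buffer case only
                  std::this_thread::sleep_for(std::chrono::microseconds(100));
              }
              return &p;
            }

            int main()
            {
              Page p;
              std::thread io([&] {                 // simulated read completion
                std::this_thread::sleep_for(std::chrono::milliseconds(5));
                p.data= 42;
                p.being_read.store(false);
              });

              if (!get_page(p, Intent::SearchPrev))
                std::puts("SearchPrev: read in progress, use ordered latching");

              if (auto pg= get_page(p, Intent::ModifyPrevChangeBuffer))
                std::printf("ModifyPrevChangeBuffer: waited, data=%d\n",
                            (*pg)->data);
              io.join();
              return 0;
            }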


            marko Marko Mäkelä added a comment

            I merged my suggested fix https://github.com/MariaDB/server/pull/3362 to 11.1. A solution for 10.6 and 10.11 will have to be found.

            wlad Vladislav Vaintroub added a comment - edited

            I reran oltp_update_index with smaller data (10GB) and still a small buffer pool (2GB).
            The effect of MDEV-33543 seems to be the same as with the large buffer pool: a warmup phase, then normal load. Below are time series (X axis = time, Y axis = transactions per second, red line = current 10.6, blue line = MDEV-33543 reverted).

            Even if the workload is IO-bound, the delay caused by the sleep in MDEV-33543 is only/mostly visible while an empty buffer pool is being loaded, and there is no visible regression once the buffer pool is full. (I suspect this has something to do with how reads are done, i.e. maybe it only happens for async reads, and those are not used with a full buffer pool.)

            I also ran the read-only workload oltp_point_select with exactly the same settings (10GB data, 2GB buffer pool), and there is no negative effect caused by MDEV-33543 for this workload, i.e. the numbers with and without 90b95c6149c72f43aa2324353a76f370d018a5ac are the same.

            debarun Debarun Banerjee added a comment - edited

            I have completed my initial analysis.

            1. The performance regression in load time is indeed caused by the deadlock fix from MDEV-33543; wlad's observation confirms it as well. The area of impact is actually wider, but it is most visible while the BP is being loaded initially via DMLs. Specifically, the response time could be impacted when:

            • DML: a pessimistic operation on an index, i.e. expanding/shrinking the index structure (split/merge)
            • The leaf pages are not found in the buffer pool, which is likely if the BP size is small

            2. MDEV-33543 fills a very serious gap in the current design. "Change buffering" is not required for such a hang: it is a general design issue where we try to acquire a lock without considering proper latch ordering. Assessing the overall impact of reverting the patch, I found that in almost all cases where we gain performance, we actually run the risk of a deadlock.

            Let's look at what the real issue is. It dates back to MDEV-30400, which introduced btr_cur_t::search_leaf(), replacing btr_cur_search_to_nth_level() for leaf page searches. There we try to optimize the pessimistic latch acquisition by trying to latch the previous page out of the normal latch order. The attempt is only a trial, and if the trial fails we fall back to ordered latching; there is no issue so far. However, before trying the latch we need to get the page from the buffer pool, and we use RW_NO_LATCH to get the page fixed in the BP without latching it (buf_page_get_gen). Again, this is fine as long as the page is in the BP. When the page is not in the BP, we would acquire an S latch on the page to wait for it to be read into the BP. This S latch acquisition under RW_NO_LATCH violates the latch order and causes the deadlock.
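
            As an analogy only (plain std::mutex instead of InnoDB page latches), the safe pattern described above looks roughly like this: probing the previous resource out of order is fine as long as it is a non-blocking trial, and on failure everything is released and re-acquired in the canonical order; what is not safe is to block out of order, which is effectively what the S latch wait did:

            // Lock-ordering analogy; not MariaDB code. "prev" and "curr" stand
            // in for the previous and current leaf pages; the canonical
            // acquisition order is prev before curr.
            #include <cstdio>
            #include <mutex>

            std::mutex prev_latch, curr_latch;

            void latch_prev_then_curr_safely()
            {
              // We already hold curr (acquired while descending the tree) and
              // now want prev, which is against the canonical order.
              std::unique_lock<std::mutex> curr(curr_latch);

              // Out-of-order attempt: only a non-blocking trial is safe here.
              std::unique_lock<std::mutex> prev(prev_latch, std::try_to_lock);
              if (!prev.owns_lock())
              {
                // Trial failed: release everything, redo in canonical order.
                curr.unlock();
                prev= std::unique_lock<std::mutex>(prev_latch); // may block
                curr= std::unique_lock<std::mutex>(curr_latch);
              }
              // Blocking on prev while still holding curr (e.g. waiting for a
              // page read under an S latch) is what can deadlock against a
              // thread latching in the canonical prev -> curr order.
              std::puts("holding prev and curr without deadlock risk");
            }

            // In this single-threaded demo the trial always succeeds; the
            // fallback branch matters only under real contention.
            int main() { latch_prev_then_curr_safely(); return 0; }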

            commit de4030e4d49805a7ded5c0bfee01cc3fd7623522
            Author: Marko Mäkelä <marko.makela@mariadb.com>
            Date:   Tue Jan 24 14:09:21 2023 +0200
                MDEV-30400 Assertion height == btr_page_get_level(...) on INSERT
             
                btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level()
                for searches to level=0 (the leaf level). We will never release
                parent page latches before acquiring leaf page latches. If we need to
                temporarily release the level=1 page latch in the BTR_SEARCH_PREV or
                BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the
                child node pointer so that we will land on the correct leaf page.
            

            There are other patches around this one, including MDEV-27058, that exposed the deadlock for change buffering.

            This deadlock was analyzed in MDEV-33543 and fixed by using the wait logic already present in buf_page_get_gen() instead of the S latch for RW_NO_LATCH. That wait logic is inferior to a normal S latch wait: it is simply a repeated 100-microsecond sleep (which can actually be longer, depending on the platform). The bug was seen in the "change buffering" code path, and the assumption was that this path would rarely be exercised. That judgement was not correct: the path is actually quite frequent and does impact performance when pages are not in the BP and are being loaded by DMLs expanding/shrinking large data.
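
            To illustrate why the sleep loop is inferior (a generic C++ toy, not the InnoDB code): a poll-and-sleep waiter pays at least one full sleep quantum per iteration, and on platforms with coarse sleep granularity the nominal 100 microseconds can become milliseconds, whereas a latch- or condition-variable-style wait is woken as soon as the read completes:

            // Generic illustration, not InnoDB code: waiting for a simulated
            // "page read" to finish by polling with a 100 us sleep versus
            // blocking until notified.
            #include <chrono>
            #include <condition_variable>
            #include <cstdio>
            #include <mutex>
            #include <thread>

            int main()
            {
              using namespace std::chrono;
              std::mutex m;
              std::condition_variable cv;
              bool read_done= false;

              auto start= steady_clock::now();
              std::thread io([&] {               // simulated asynchronous read
                std::this_thread::sleep_for(milliseconds(3));
                { std::lock_guard<std::mutex> g(m); read_done= true; }
                cv.notify_all();
              });

              // Waiter A: poll + sleep, like the RW_NO_LATCH wait logic.
              std::thread poller([&] {
                for (;;)
                {
                  { std::lock_guard<std::mutex> g(m); if (read_done) break; }
                  std::this_thread::sleep_for(microseconds(100)); // may round up
                }
                auto us= duration_cast<microseconds>(steady_clock::now() -
                                                     start).count();
                std::printf("poller noticed completion after %lld us\n",
                            (long long) us);
              });

              // Waiter B: block until notified, like waiting on an S latch.
              std::thread blocker([&] {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [&] { return read_done; });
                auto us= duration_cast<microseconds>(steady_clock::now() -
                                                     start).count();
                std::printf("blocker woke up after %lld us\n", (long long) us);
              });

              io.join(); poller.join(); blocker.join();
              return 0;
            }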

            However, it also implies that the general case could be vulnerable to the hang fixed by MDEV-33543 even without "change buffering" being involved. I have an idea for fixing the problem in a better way: while trying to get a page with RW_NO_LATCH for an "out of order" latch attempt, we could return from buf_page_get_gen() immediately (instead of waiting) if the page is currently being read in, and follow the ordered latching path.
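
            A minimal caller-side sketch of that proposal (toy code; ProbeResult, probe_prev_page and latch_pages_in_order are hypothetical names, not actual MariaDB functions): the RW_NO_LATCH probe reports that a read is in progress instead of sleeping, and the caller simply takes the ordered latching path, which may block but cannot deadlock:

            // Caller-side sketch of the proposed behaviour; not MariaDB code.
            #include <cstdio>

            enum class ProbeResult { Latched, NotInPool, ReadInProgress };

            // Hypothetical fast probe of the previous page (no waiting inside);
            // stands in for the RW_NO_LATCH buf_page_get_gen() call made by
            // btr_latch_prev().
            ProbeResult probe_prev_page() { return ProbeResult::ReadInProgress; }

            // Hypothetical slow path that releases and re-acquires the latches
            // in the canonical order; it may block but cannot deadlock.
            void latch_pages_in_order() { std::puts("ordered latching path"); }

            void latch_prev_for_pessimistic_operation()
            {
              switch (probe_prev_page())
              {
              case ProbeResult::Latched:
                std::puts("fast path: previous page latched out of order");
                break;
              case ProbeResult::NotInPool:      // page must be read first
              case ProbeResult::ReadInProgress: // do not sleep here
                latch_pages_in_order();
                break;
              }
            }

            int main() { latch_prev_for_pessimistic_operation(); return 0; }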

            I am working on the patch and will upload it soon. I hope it fixes the issue.


            debarun Debarun Banerjee added a comment

            monty, the page flusher maintains the percentage of dirty pages. The 90% setting is the allowed amount of dirty (modified but not yet flushed) pages in the BP. Flushing does not evict pages from the BP, so the percentage of pages present in the BP can be higher. We try to maintain a fixed set of "free pages" plus some non-data pages used for other purposes such as locks; apart from that, the rest of the pool contains data pages.

            However, there is always a possibility of hitting that code. The data pages also include undo pages, and in practice, under read/write workload stress with a lot of undo pages, we might still be forced to evict and reload pages and hit this part of the code. I think a lot depends on the scenario.
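
            As a rough worked example (assuming the 90% refers to innodb_max_dirty_pages_pct and a 16 KiB page size, which is my reading rather than something stated here): the 2 GB buffer pool used in the earlier test holds about 131072 pages, so the 90% setting only caps the dirty pages at roughly 118000; it does not limit the total number of resident pages (clean plus dirty, data plus undo), which is why evictions and reloads of data pages can still occur under pressure.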


            debarun Debarun Banerjee added a comment

            I have a tentative patch; I am still testing it.
            Branch: 10.6-MDEV-34458
            Review: https://github.com/MariaDB/server/pull/3382

            wlad, can you please help me validate whether the patch resolves the performance problem?
            thiru, can you please review the patch?

            wlad Vladislav Vaintroub added a comment - edited

            Thanks debarun, I checked, and the performance is back with the patch. Thank you!


            debarun Debarun Banerjee added a comment

            Thanks to thiru for the review.
            Thanks to wlad for identifying the issue and validation.


            People

              debarun Debarun Banerjee
              wlad Vladislav Vaintroub
              Votes: 1
              Watchers: 11

