MariaDB Server / MDEV-33699

Write statements get stuck during IO-bound insert benchmark with InnoDB

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.11.7
    • Fix Version/s: 10.11
    • Component/s: None
    • Environment: Ubuntu 22.04, server has 32 cores (cores, not HW threads), 128G of RAM and XFS with SW RAID 0 over 2 NVMe devices

    Description

      I was able to complete CPU-bound insert benchmark tests for all LTS releases from 10.2 through 11.4. But when I try an IO-bound run with MariaDB 10.11.7, sessions doing writes to InnoDB get stuck. By stuck I mean:

      • per SHOW PROCESSLIST, write statements (insert and delete) have been running for 42000+ seconds. These should finish in less than one second.
      • SHOW INNODB STATUS hangs right now (42000+ seconds into the problem)

      By CPU-bound I mean the database fits in the DBMS block cache. By IO-bound I mean the database is larger than memory.

      The benchmark is run with 24 clients and there are 3 connections/client – one for insert, one for delete, one for select. At the point where it gets stuck, the SHOW PROCESSLIST output shows 72 connections; the 48 insert and delete connections are stuck while the selects are proceeding.
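      A minimal sketch of how such stuck writers show up in information_schema.PROCESSLIST; the 60-second threshold below is only illustrative:

        -- List write statements that have been running far longer than expected.
        -- In the hang described above, TIME exceeds 42000 seconds.
        SELECT ID, USER, DB, TIME, STATE, LEFT(INFO, 80) AS stmt
          FROM information_schema.PROCESSLIST
         WHERE COMMAND = 'Query'
           AND TIME > 60
           AND (INFO LIKE 'insert%' OR INFO LIKE 'delete%')
         ORDER BY TIME DESC;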

      Also, while stuck, from "top" I see that the mariadbd process is busy with %CPU >= 400.

      I will upload the full my.cnf. For this test I set innodb_change_buffering=all. The previous test run used innodb_change_buffering=off and was too slow (it would have taken ~10 days while other runs take less than 1 day), so I killed that run and restarted with the change buffer enabled.
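      For reference, a sketch of checking and toggling that setting from SQL (the variable is documented as dynamic in 10.11; the my.cnf line is the equivalent static form):

        -- Check the current setting (this run uses innodb_change_buffering=all in my.cnf).
        SHOW GLOBAL VARIABLES LIKE 'innodb_change_buffering';

        -- The variable can also be changed at runtime between test runs,
        -- e.g. back to the 10.5+ default of 'none':
        SET GLOBAL innodb_change_buffering = 'none';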

      For this benchmark I have been using a big server (32 cores) with high-concurrency (24 clients) and a small server (8 cores) with low-concurrency (1 client). For IO-bound runs I have only seen this problem on the big server.

      The error log has these two messages that I don't recall seeing with upstream MySQL+InnoDB. They occur during the create index step of the benchmark, which is a few hours prior to the hang:

      2024-03-16  8:45:50 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 6490112 blocks are in use and 0 free. Consider increasing innodb_buffer_pool_size.
      2024-03-16 13:03:05 0 [Warning] InnoDB: Could not free any blocks in the buffer pool! 6490112 blocks are in use and 0 free. Consider increasing innodb_buffer_pool_size.
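      The block counts behind that warning can be tracked with the standard buffer pool status counters; a sketch (assuming the default 16KB page size, 6490112 blocks is roughly a 99G buffer pool):

        -- When Innodb_buffer_pool_pages_free stays at 0 while pages_total is
        -- fully in use, the warning above can fire.
        SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';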
      

      For an overview of the insert benchmark see here

      I attached 2 sets of PMP output. For each set there are two files – one grouped, one not grouped. From a quick browse of the stacks I assume the hang is related to the change buffer.

      Attachments

        1. 8core.ma1011.11d.flat
          49 kB
        2. 8core.ma1011.11d.hier
          8 kB
        3. dell32.err
          3 kB
        4. my.cnf
          2 kB
        5. pmp.grouped.240317_152037.txt
          14 kB
        6. pmp.grouped.240317_152426.txt
          12 kB
        7. pmp.not-grouped.240317_152037.txt
          194 kB
        8. pmp.not-grouped.240317_152426.txt
          197 kB
        9. show_process_list.txt
          10 kB
        10. show_process_list-1.txt
          10 kB

          Activity

            I added output from SHOW ENGINE INNODB STATUS that was run a few minutes prior to the start of the benchmark step that hangs; the hang occurs only a few minutes after that.

            mdcallag Mark Callaghan added a comment

            Thank you. The InnoDB change buffer has been removed in MariaDB Server 11.0 (MDEV-29694), and it was disabled by default in MariaDB Server 10.5 (MDEV-27734).

            Can you provide more complete stack traces, with the names of function parameters? The output of the GDB command thread apply all backtrace full would be useful as well, because then we should see which page latches are being held by each thread in their mtr_t::m_memo.

            Based on the currently available information it is difficult to say for sure, but this report could be a duplicate of MDEV-32489.

            The warning messages about the inability to free blocks from the buffer pool were recently added in MDEV-33053. The reason why I added them was that when I configured an unreasonably small buffer pool for a Sysbench workload, InnoDB would hang during the workload without issuing any warning message. The scenario was that all of the buffer pool was pinned by buffer-fixes or page latches held by SQL threads, and several of these threads were trying to allocate more pages from the buffer pool. A deadlock in that case is pretty much unavoidable.

            marko Marko Mäkelä added a comment
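            A sketch of observing the pinned-versus-free state described in that scenario from SQL while the server is wedged (requires the PROCESS privilege):

              -- FREE_BUFFERS near 0 together with pending reads and pending
              -- LRU/flush-list flushes is consistent with user threads holding
              -- buffer-fixes or page latches while waiting for free pages.
              SELECT POOL_ID, POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES,
                     MODIFIED_DATABASE_PAGES, PENDING_READS,
                     PENDING_FLUSH_LRU, PENDING_FLUSH_LIST
                FROM information_schema.INNODB_BUFFER_POOL_STATS;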

            It happened again, this time on a smaller server (8 cores). I also see this in the database error log:
            InnoDB: Could not free any blocks in the buffer pool! 648960 blocks are in use and 0 free. Consider increasing innodb_buffer_pool_size.

            The buffer pool is not too small. On the 8-core server it is 10G and on the larger server it was much larger. I never have hangs with upstream MySQL+InnoDB using a similar config.

            I added PMP output – both the full stack traces (8core.ma1011.11d.flat) and the aggregated stacks (8core.ma1011.11d.hier). I used "thread apply all bt full", but the build didn't use -O2 and -fno-omit-frame-pointer; I am not sure whether that is needed.


            mdcallag Mark Callaghan added a comment

            mdcallag, in MDEV-33053 we refactored the LRU eviction flushing so that only the buf_flush_page_cleaner thread will initiate LRU eviction and the user threads will wait for it to do its job. The high-level idea is similar to MDEV-26055 and MDEV-33699: acquire buf_pool.mutex less often and try to achieve more in a single critical section of buf_pool.mutex.

            Unfortunately, the MDEV-33053 fix caused some regressions. Thanks to debarun, there were some fixes in the August 2024 quarterly releases (such as 10.6.19, 10.11.9). One of them was MDEV-34265, which tries to ensure that buf_flush_page_cleaner() keeps going if we are about to run out of buffer pool. In some recent testing, we found that 10.6.18 would perform much worse than 10.6.19 in an I/O bound workload.

            Can you reproduce this anomaly with MariaDB 10.6.19 or 10.11.9?

            marko Marko Mäkelä added a comment
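            A low-overhead way to see whether user threads are waiting on the page cleaner for free pages is the stock wait counter; a sketch:

              -- A steadily climbing value means user threads had to wait for
              -- buf_flush_page_cleaner to free pages via LRU eviction.
              SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_wait_free';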

            Started tests now, will have results early next week

            mdcallag Mark Callaghan added a comment

            I used a 48-core server from Hetzner (ax162-s) with 20 clients, 200M rows/table (4000M rows in total) and one client per table. The benchmark is heavy on IO as the working set does not fit in cache for point queries (it does for range queries) and there is much write-back. There are up to 3 concurrent connections per client.

            I did tests with 10.6.18, 10.6.19, 10.11.8 and 10.11.9.

            I used two my.cnf variations, and both are here

            • my.cnf.cz11a_c32r128
            • my.cnf.cz11a_lwas4k_c32r128 - same as my.cnf.cz11a_c32r128 except adds innodb_log_write_ahead_size=4096

            The default for innodb_log_write_ahead_size on this server (checked as sketched after this list) was:

            • 8192 for 10.6.18, 10.6.19
            • not listed in SHOW GLOBAL VARIABLES for 10.11.8
            • 512 for 10.11.9
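            The check behind that list is just a variables lookup; a sketch, with the extra line that my.cnf.cz11a_lwas4k_c32r128 adds shown as a comment:

              -- Returns 8192 on 10.6.18/10.6.19, nothing on 10.11.8 (the variable
              -- is not present there) and 512 on 10.11.9, per the list above.
              SHOW GLOBAL VARIABLES LIKE 'innodb_log_write_ahead_size';

              -- my.cnf.cz11a_lwas4k_c32r128 additionally sets:
              --   innodb_log_write_ahead_size=4096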

            I used my.cnf.cz11a_lwas4k_c32r128 with 10.6.18, 10.6.19 and 10.11.9. I did not use it with 10.11.8 because innodb_log_write_ahead_size is parsed but ignored there, and I am not a fan of parsed-but-ignored options (see MDEV-33958).

            The results are here and summary is here

            tl;dr - I don't think I reproduced the anomaly

            The rest is a summary per benchmark step.

            l.i0 - load in PK order before secondary indexes are created

            • insert rate drops by ~15% from 10.6 to 10.11
            • there are unexpected reads from storage with 10.11.8 and 10.11.9 with my.cnf.cz11a_c32r128, see rps here
            • context switches per insert are almost 3X larger in 10.6 than 10.11, see cspq here. Usually that is a good sign (less mutex contention) but I don't think that is the case here and we have discussed this in the past.
            • CPU per insert is ~17% larger in 10.11 than 10.6, see cpupq here

            l.x - create secondary indexes; I usually ignore results for this step (l.i0 and l.x are sketched below)
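            Roughly what the l.i0 and l.x steps do, as a sketch against a hypothetical table (the real insert benchmark schema, row counts and batch sizes differ):

              -- l.i0: load rows in primary-key order, with no secondary indexes yet.
              CREATE TABLE ib_t1 (
                pk BIGINT NOT NULL PRIMARY KEY,
                k  BIGINT NOT NULL,
                c  VARCHAR(255) NOT NULL
              ) ENGINE=InnoDB;
              INSERT INTO ib_t1 VALUES (1, 42, 'a'), (2, 17, 'b'), (3, 99, 'c');  -- pk ascending

              -- l.x: create the secondary index(es) only after the load.
              ALTER TABLE ib_t1 ADD INDEX k_idx (k);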

            l.i1, l.i2 - random writes (deletes and inserts) where l.i1 does 50 writes/commit while l.i2 does 5 writes/commit (see the sketch after this list)

            • there might be small regressions here in 10.6.19, 10.11.8 and 10.11.9 but these steps have more variance in general
            • context switches per operation increase from 10.6 to 10.11, see cspq here
            • CPU per query is mixed, but might be larger in 10.11, see cpupq here
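            A sketch of the shape of one l.i1 transaction against the same hypothetical table (50 row changes per commit; l.i2 is the same with 5, and the delete/insert mix here is an assumption):

              START TRANSACTION;
              DELETE FROM ib_t1 WHERE pk = 1001;              -- random key
              INSERT INTO ib_t1 VALUES (900001, 123, 'x');    -- new random-ish row
              -- ... repeated until ~50 row changes have been made ...
              COMMIT;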

            qr100 - range queries with background writes (100 insert/s and 100 delete/s per client)

            • throughput is ~2% larger in 10.11.9 vs 10.6.19
            • the working set for reads is mostly cached as reads/query is ~0.05, see rpq here
            • CPU/query and context switches/query don't change much, see cpupq and cspq here

            qp100 - point queries with background writes (100 insert/s and 100 delete/s per client)

            • throughput is ~3% smaller in 10.11.9 vs 10.6.19
            • the working set for reads is not cached as reads/query is ~7, see rpq here
            • CPU/query and context switches/query are ~10% larger in 10.11.9 vs 10.6.19, see cpupq and cspq here

            qr500, qp500, qr1000, qp1000

            • MariaDB did not sustain the target background write rates for qr500, qp500 (target was 10k/s x 2) or for qr1000, qp1000 (target was 20k/s x 2), so it is harder to understand the results for qr500, qp500, qr1000 and qp1000.
            mdcallag Mark Callaghan added a comment

            I haven't reproduced this in a long time.
            I also haven't used 10.11.7 in a long time because I use 10.11.x where x is 8, 9 or 10.

            mdcallag Mark Callaghan added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: mdcallag Mark Callaghan
              Votes: 1
              Watchers: 8
