MariaDB Server / MDEV-26464

InnoDB: Failing assertion: UT_LIST_GET_LEN(buf_pool->flush_list) == 0 during shutdown


    Description

      origin/10.4 2b66cd249384221e2a892f655fe84bb58ddc31bd 2021-08-23T10:44:06+03:00
      

      Workflow:

      1. Start the DB server.
      2. Generate some initial data.
      3. 33 sessions run a concurrent DDL/DML mix; they disconnect after 300 s.
      4. Attempt a shutdown, which ends with:

        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1955505]2021-08-23  3:51:27 0 [Note] /data/Server_bin/10.4_asan/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1956164]2021-08-23  3:51:27 83 [Warning] Aborted connection 83 to db: 'mysql' user: 'root' host: 'localhost' (This connection closed normally)
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957093]2021-08-23  3:51:27 0 [Note] InnoDB: FTS optimize thread exiting.
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957281]2021-08-23  3:51:27 0 [Note] Event Scheduler: Purging the queue. 0 events
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957790]2021-08-23  3:51:29 0 [Note] InnoDB: Starting shutdown...
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957905]2021-08-23  3:51:29 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/vardir/1629713923/217/1/data/ib_buffer_pool
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957917]2021-08-23  3:51:29 0 [Note] InnoDB: Instance 0, restricted to 95 pages due to innodb_buf_pool_dump_pct=25
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1957929]2021-08-23  3:51:29 0 [Note] InnoDB: Buffer pool(s) dump completed at 210823  3:51:29
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1958065]2021-08-23 03:51:29 0x66b427d20700[rr 3469573 1958068]  InnoDB: Assertion failure in file /data/Server/10.4/storage/innobase/buf/buf0flu.cc line 3348
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1958070]InnoDB: Failing assertion: UT_LIST_GET_LEN(buf_pool->flush_list) == 0
        # 2021-08-23T03:52:21 [3465505] | [rr 3469573 1958072]InnoDB: We intentionally generate a memory trap.
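
      For orientation, the failing check is a shutdown-time invariant: once the
      final flush phase has completed, no buffer pool instance may still hold
      dirty pages. A simplified sketch of that invariant (illustrative only, not
      the literal buf0flu.cc source):

        /* Illustrative sketch: after the final shutdown flush, every
        buffer pool instance must have an empty list of dirty pages. */
        for (ulint i = 0; i < srv_buf_pool_instances; i++) {
          buf_pool_t* buf_pool = buf_pool_from_array(i);
          ut_a(UT_LIST_GET_LEN(buf_pool->flush_list) == 0);
        }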
        

      Per marko:
      ----------------
      1. A page was read before the shutdown:

      #3  0x0000560a62536e21 in buf_read_ibuf_merge_pages (sync=false, 
          space_ids=0x7e8e3789f710, page_nos=0x7e8e3789f6b0, n_stored=1)
          at /data/Server/10.4/storage/innobase/buf/buf0rea.cc:805
      #4  0x0000560a620290ba in ibuf_merge_pages (n_pages=0x7e8e3789fd90, sync=false)
          at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:2462
      #5  0x0000560a62029e41 in ibuf_merge (n_pages=0x7e8e3789fd90, sync=false)
          at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:2567
      #6  0x0000560a6202a407 in ibuf_merge_in_background (full=false)
          at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:2638
      #7  0x0000560a62316a0b in srv_master_do_active_tasks ()
          at /data/Server/10.4/storage/innobase/srv/srv0srv.cc:2147
      

      2. A change buffer merge happens after the shutdown process has begun (see the schematic interleaving after this list):

      #7  0x0000560a620d2562 in mtr_t::commit (this=0x2f2d612b0830)
          at /data/Server/10.4/storage/innobase/mtr/mtr0mtr.cc:452
      #8  0x0000560a6201eecd in btr_pcur_commit_specify_mtr (pcur=0x2f2d612b04c0, 
          mtr=0x2f2d612b0830)
          at /data/Server/10.4/storage/innobase/include/btr0pcur.ic:345
      #9  0x0000560a6201feb1 in ibuf_btr_pcur_commit_specify_mtr (
          pcur=0x2f2d612b04c0, mtr=0x2f2d612b0830)
          at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:322
      #10 0x0000560a62035092 in ibuf_merge_or_delete_for_page (block=0x5ac0422418b8, 
          page_id=..., zip_size=0)
          at /data/Server/10.4/storage/innobase/ibuf/ibuf0ibuf.cc:4572
      #11 0x0000560a624df00d in buf_page_io_complete (bpage=0x5ac0422418b8, 
          dblwr=true, evict=false)
          at /data/Server/10.4/storage/innobase/buf/buf0buf.cc:6201
      #12 0x0000560a62610488 in fil_aio_wait (segment=2)
          at /data/Server/10.4/storage/innobase/fil/fil0fil.cc:4375
      

      3. The bug does not seem to be a duplicate of MDEV-12630.
      4. MDEV-19514 (resolved 2019-10-11, fix version 10.5.0) should have fixed this problem.
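
      Put together, the two stack traces suggest the following interleaving
      (schematic only; the thread labels are approximate, the function names are
      taken from the traces above):

        master thread, before shutdown:
          srv_master_do_active_tasks()
            -> ibuf_merge_in_background()
              -> buf_read_ibuf_merge_pages()    (posts an asynchronous page read)

        shutdown thread:
          final flush completes; all flush lists are expected to stay empty

        I/O completion thread, after shutdown has begun:
          fil_aio_wait()
            -> buf_page_io_complete()
              -> ibuf_merge_or_delete_for_page()
                -> mtr_t::commit()              (re-dirties the page, putting it
                                                 back on buf_pool->flush_list)

        => UT_LIST_GET_LEN(buf_pool->flush_list) == 0 no longer holds when the
           shutdown assertion is evaluated.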

      RQG
      ------

      git clone https://github.com/mleich1/rqg --branch experimental RQG
       
      perl rqg.pl \
      --grammar=conf/mariadb/innodb_compression_encryption.yy \
      --gendata=conf/mariadb/innodb_compression_encryption.zz \
      --max_gd_duration=1800 \
      --mysqld=--loose-innodb-encryption-threads=7 \
      --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--connect_timeout=60 \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--net_read_timeout=30 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--lock-wait-timeout=86400 \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --no-mask \
      --queries=10000000 \
      --seed=random \
      --reporters=Backtrace \
      --reporters=ErrorLog \
      --reporters=Deadlock1 \
      --validators=None \
      --mysqld=--log_output=none \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --engine=InnoDB \
      --restart_timeout=240 \
      --mysqld=--plugin-load-add=file_key_management.so \
      --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
      --duration=300 \
      --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      --mysqld=--loose-innodb-sync-debug \
      --mysqld=--innodb_stats_persistent=on \
      --mysqld=--innodb_adaptive_hash_index=on \
      --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
      --mysqld=--loose-max-statement-time=30 \
      --threads=33 \
      --mysqld=--innodb-use-native-aio=0 \
      --rr=Extended \
      --rr_options=--chaos --wait \
      --mysqld=--innodb_page_size=64K \
      --mysqld=--innodb-buffer-pool-size=24M \
      --no_mask \
      --workdir=<local settings> \
      --vardir=<local settings> \
      --mtr-build-thread=<local settings> \
      --basedir1=<local settings> \
      --script_debug=_nix_
      

      sdp:/data/Results/1629713923/TBR-1163/dev/shm/vardir/1629713923/217/1/rr
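
      If the rr trace were still available, it could be replayed deterministically;
      an illustrative invocation (the trace directory is a placeholder for an
      actual mysqld trace under the rr/ path above):

        rr replay <trace directory>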

      Activity

            marko Marko Mäkelä added a comment:

            Sorry, I had neglected this for too long, and the rr replay trace is no longer available. We would need a new one if we still care to debug and fix this before 10.4 reaches its end of life.


            VitalyZyukin Vitaliy Zyukin added a comment:

            I've provided steps to reproduce this issue in the Oracle MySQL bug tracker: https://bugs.mysql.com/bug.php?id=85585
            Adding it here as well; hopefully it will help.


            marko Marko Mäkelä added a comment:

            VitalyZyukin, thank you. I suspect that this bug is specific to the MariaDB Server 10.4 release only (and to 10.2 and 10.3, which have already reached EOL). In those releases, the I/O layer is very similar to MySQL 5.7. In MariaDB Server 10.5, the I/O subsystem was rewritten, and there is only one buf_flush_page_cleaner thread and one buffer pool instance. Also, MDEV-19514 removed the function ibuf_merge_in_background() in 10.5.

            A reproducible test case would help a lot. Our internal InnoDB testing efforts are concentrated on later releases, mainly MariaDB Server 10.6 and later.

            A simple work-around would seem to be to set innodb_change_buffering=none. The default was changed in MDEV-27734, but only starting with 10.5.
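
            A minimal sketch of that work-around (innodb_change_buffering is a
            dynamic server variable, so it can be changed at runtime; add the same
            setting to my.cnf to make it persistent across restarts):

              SET GLOBAL innodb_change_buffering = none;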


            marko Marko Mäkelä added a comment:

            VitalyZyukin, based on your comment in the MySQL bug system, you reproduced this bug on MariaDB Server 10.4.10. (They used to remove, or make private, comments that mention MariaDB. Luckily that had not happened here yet.)

            Can you test a newer version? I think that this bug may have been fixed by MDEV-30422 in MariaDB Server 10.3.38 and 10.4.28.


            VitalyZyukin Vitaliy Zyukin added a comment:

            Just in case, I'm going to duplicate the content here:

            I was able to reproduce it on MariaDB 10.4.10, a couple of times already. It does not always happen on our hosts; it happens about once in 10 executions on average.

            The way I reproduced it on my end was to create 8 Python processes that each delete 100 records per DELETE query in one table.

            This procedure takes place while the database is being prepared for a shutdown, meaning replication is OFF and there is no traffic coming in or going out.

            Additional parameters:
            sql_log_bin=0 and innodb_flush_log_at_trx_commit=0

            The following code is capable of deleting ~5 million records in 2 minutes:

            ```
            import multiprocessing
            import mysql.connector

            # Fragments of a larger class: self.connect_args holds the connection
            # parameters; table_name, from_id and to_id come from the caller.
            ...
            func_args = list()
            for i in range(from_id, to_id, 100000):
                last_id = (i + 100000) if (i + 100000) < to_id else to_id
                func_args.append((table_name, i, last_id))

            with multiprocessing.Pool(8) as p:
                p.starmap(self.delete_range, func_args)

            def delete_range(self, table_name, from_id, to_id):
                conn = mysql.connector.connect(**self.connect_args)
                cursor = conn.cursor()
                cursor.execute("SET sql_log_bin=0;")
                cursor.execute("SET GLOBAL innodb_flush_log_at_trx_commit=0;")

                for i in range(from_id, to_id, 100):
                    last_id = (i + 100) if (i + 100) < to_id else to_id
                    # delete multiple ids at a time
                    cursor.execute("DELETE FROM {} WHERE id IN {};".format(table_name, tuple(range(i, last_id))))

                cursor.close()
                conn.close()
            ```
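
            (One connection per worker process is deliberate here: a MySQL
            Connector/Python connection cannot be shared across processes, so each
            delete_range worker opens and closes its own.)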

            On my end it is still the same failing assertion, UT_LIST_GET_LEN(buf_pool->flush_list) == 0, but at a different source location:

            ```
            2023-05-31 8:11:21 0 [Note] InnoDB: Starting shutdown...
            2023-05-31 08:11:28 0x7eb6d979d700 InnoDB: Assertion failure in file src/storage/innobase/buf/buf0flu.cc line 3353
            InnoDB: Failing assertion: UT_LIST_GET_LEN(buf_pool->flush_list) == 0
            InnoDB: We intentionally generate a memory trap.
            InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
            InnoDB: If you get repeated assertion failures or crashes, even
            InnoDB: immediately after the mysqld startup, there may be
            InnoDB: corruption in the InnoDB tablespace. Please refer to
            InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
            InnoDB: about forcing recovery.
            230531 8:11:28 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.

            To report this bug, see https://mariadb.com/kb/en/reporting-bugs

            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed,
            something is definitely wrong and this may fail.

            Server version: 10.4.10-MariaDB-log
            key_buffer_size=8388608
            read_buffer_size=16777216
            max_used_connections=29
            max_threads=65537
            thread_count=0
            It is possible that mysqld could use up to
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1208067437 K bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.

            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x49000
            <Path>/bin/mysqld(my_print_stacktrace+0x29)[0x55d43315d7b9]
            <Path>/bin/mysqld(handle_fatal_signal+0x4d0)[0x55d432c2e8f0]
            /lib64/libpthread.so.0(+0x118e0)[0x7f01a7fc38e0]
            /lib64/libc.so.6(gsignal+0x110)[0x7f01a715cca0]
            /lib64/libc.so.6(abort+0x148)[0x7f01a715e148]
            <Path>/bin/mysqld(+0x565705)[0x55d432965705]
            <Path>/bin/mysqld(+0xb43a06)[0x55d432f43a06]
            /lib64/libpthread.so.0(+0x744b)[0x7f01a7fb944b]
            /lib64/libc.so.6(clone+0x3f)[0x7f01a721852f]
            The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
            information that should help you find out what is causing the crash.
            Writing a core file...
            Working directory at /mysql/data
            Resource Limits:
            Limit Soft Limit Hard Limit Units
            Max cpu time unlimited unlimited seconds
            Max file size unlimited unlimited bytes
            Max data size unlimited unlimited bytes
            Max stack size 10485760 10485760 bytes
            Max core file size 0 0 bytes
            Max resident set unlimited unlimited bytes
            Max processes unlimited unlimited processes
            Max open files 65535 65535 files
            Max locked memory unlimited unlimited bytes
            Max address space unlimited unlimited bytes
            Max file locks unlimited unlimited locks
            Max pending signals 30446 30446 signals
            Max msgqueue size 819200 819200 bytes
            Max nice priority 0 0
            Max realtime priority 0 0
            Max realtime timeout unlimited unlimited us
            Core pattern: core
            ```


            VitalyZyukin Vitaliy Zyukin added a comment:

            Unfortunately, I do not have the resources to reproduce it on different MariaDB versions. My solution was to keep `innodb_flush_log_at_trx_commit` at 1, which slowed down the deletions.
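
            For example (illustrative; 1 is also the server default, and the
            variable is dynamic):

              SET GLOBAL innodb_flush_log_at_trx_commit = 1;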


            marko Marko Mäkelä added a comment:

            If you are running MariaDB Server 10.4.10 on some hosts, it could be a good idea to upgrade at least some of them to the latest 10.4, to see whether the bug is still reproducible.


            People

              Assignee: marko Marko Mäkelä
              Reporter: mleich Matthias Leich