  1. MariaDB Server
  2. MDEV-29141

InnoDB hangs some time after performing live VM migration

Details

    Description

      Hello,

      The first 10.6 release we used was 10.6.5. After upgrading to 10.6.5, whenever we perform live VM migrations we sometimes see that InnoDB will at some point no longer respond. We also see this behaviour in 10.6.8.

      We perform live VM migrations on XCP-ng (XenServer). We have done this for about 300 VMs, and about 40 had issues. The issues appeared anywhere from a few minutes to about 2 weeks after the migration (on less busy servers they usually manifested later).

      Queries remain stuck in the states Updating, Sending Data, Statistics, Filling Schema table, Commit and never complete.
      Running the command "SHOW ENGINE INNODB STATUS" hangs indefinitely and never gives any output.

      Restarting mysql does not work either; we have to use kill -9 in order to restart it. After that it works again.

      We have not noticed this on MariaDB 10.5 and 10.4.

      We have noticed this on single instances as well as instances running galera.

      I did make a gcore of one of the instances that has issues (it is almost 5GB). Perhaps I can do something with that, but I'm not sure what.

      Any ideas on what is wrong here?

          Activity

            nielsh Niels Hendriks added a comment - - edited

            Could this be related to https://github.com/MariaDB/server/commit/db0fde3f24b37cfac9a4125ce888f1650a20db7b ?

            I've been trying to see if I can pin this to a specific commit where it starts, but I'm having a difficult time understanding git bisect on the MariaDB repo. I keep ending up on wrong branches (suddenly building for 10.5 for example despite being on the 10.6 branch). I don't often use bisect so perhaps I am just using it wrong.

            Either way, so far I cannot reproduce this on the latest commit in the 10.6 branch. After checking out commit db0fde3f24b37cfac9a4125ce888f1650a20db7b specifically, I cannot reproduce it either. Looking at https://github.com/MariaDB/server/commits/10.6?after=654236c06d231461c66e2f3c5c4fd3b35cba3869+139&branch=10.6&qualified_name=refs%2Fheads%2F10.6 - it seems that commit a0e4853eff028fa9db9ba0421309e2bd1124ab26 comes just prior to db0fde3f24b37cfac9a4125ce888f1650a20db7b, but this compiles to MariaDB 10.5, so I'm probably doing something wrong there.

            I can reproduce it on commit 57e66dc7e60, which seems close to commit db0fde3f24b, where I so far cannot reproduce it.

            I'm curious whether the commit I linked seems related to this issue to you.

            marko Marko Mäkelä added a comment -

            nielsh, in gdb.txt I cannot find any occurrence of the string thread_routine. If that run reported InnoDB: Using liburing in the server error log, then the cause of that hang should be fixed in MDEV-28665.
            Also worth noting:

            git diff db0fde3f24b..57e66dc7e60 storage/innobase tpool
            git log 57e66dc7e60..db0fde3f24b

            These commands only report the fix of MDEV-28665. So, this report would indeed seem to be a duplicate of MDEV-28665. Thank you both for narrowing it down.

            marko Marko Mäkelä added a comment -

            danblack, thank you for reviewing the trx_sys code changes that I made in MDEV-25062. It was a plausible candidate for this hang. To answer your observations:

            • Yes, lock_print_info_summary() could use trx_sys.history_size_approx(). I would not object to a patch to optimize it.
            • No, trx_sys_t::history_size() cannot release any rseg.latch before acquiring them all, because we want an exact snapshot across all rollback segments.
            • The trx_sys_t::history_exceeds() is only used in SET GLOBAL innodb_max_purge_lag_wait, introduced in MDEV-16952 primarily for making tests more stable, but also to help users ‘prepare’ for a slow shutdown so that the actual shutdown will complete faster. By design, it requires an exact count, similar to trx_sys_t::history_size().
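
The difference between an exact snapshot and a looser count across rollback segments can be illustrated with a minimal C++ sketch. This is not MariaDB's implementation: Rseg, RsegArray, N_RSEGS, history_size_exact() and history_size_one_at_a_time() are hypothetical names invented for illustration, and std::mutex stands in for rseg.latch. The exact variant holds every latch at the same time while summing, so the total corresponds to one instant; the other variant latches one segment at a time, so segments it has already visited may change before it finishes.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for a rollback segment (not MariaDB's trx_rseg_t).
struct Rseg {
  std::mutex latch;              // protects history_size
  std::size_t history_size = 0;  // number of history entries in this segment
};

constexpr std::size_t N_RSEGS = 4;  // arbitrary size for the sketch
using RsegArray = std::array<Rseg, N_RSEGS>;

// Exact snapshot: acquire all latches before reading any counter, and
// release none until the sum is complete. All callers lock in the same
// array order, so two concurrent callers cannot deadlock each other.
std::size_t history_size_exact(RsegArray &rsegs) {
  for (Rseg &r : rsegs) r.latch.lock();
  std::size_t total = 0;
  for (Rseg &r : rsegs) total += r.history_size;
  for (Rseg &r : rsegs) r.latch.unlock();
  return total;
}

// Looser count: hold only one latch at a time. Each individual read is
// consistent, but the total may mix values from different moments.
std::size_t history_size_one_at_a_time(RsegArray &rsegs) {
  std::size_t total = 0;
  for (Rseg &r : rsegs) {
    std::lock_guard<std::mutex> guard(r.latch);
    total += r.history_size;
  }
  return total;
}
```

Both results can be out of date by the time the caller uses them, since all latches are released on return. What the exact variant adds is only that the returned total was true at some single moment, which the one-at-a-time variant does not guarantee under concurrent updates.
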
            danblack Daniel Black added a comment -

            Ok, I did the inexact counts in https://github.com/MariaDB/server/pull/2202

            I don't quite follow the need for exactness in trx_sys_t::history_size() / trx_sys_t::history_exceeds(): the accuracy is lost anyway, because the locks are released before the function returns.

            marko Marko Mäkelä added a comment - - edited

            Thank you, MDEV-29166 is now in. There are a few cases where we actually care about an exact trx_sys_t::history_size() count. I would not dare to change those without thinking about it very carefully. In the past, we had problems with the purge being unreliable, such as MDEV-11802.

            People

              marko Marko Mäkelä
              nielsh Niels Hendriks