  1. MariaDB Server
  2. MDEV-34641

MariaDB server hangs when the NTP server changes the system time during startup

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.6.16, 10.11.6
    • Fix Version/s: 10.6, 10.11
    • Component/s: Server
    • Environment: MLOS, which is based on CentOS 7.

      Linux MLOS-NSM 4.19.245-3.mlos3.x86_64 #1 SMP Wed Jul 19 06:01:56 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

    Description

      The MariaDB process becomes unresponsive within 1-3 days of startup in environments where NTP is enabled.

      This is seen specifically when the NTP server changes the system time in the middle of MariaDB server startup.

      Checking the process list on the DB shows all queries in a stuck state.

      Below is the strace output of the MariaDB server process:

      The issue is not seen on 10.6.14-MariaDB-log. The problems started appearing after we upgraded to 10.6.16-MariaDB-log and 10.11.6-MariaDB-log.

          Activity

            wlad Vladislav Vaintroub added a comment:

            Thread 62 is blocking because too many AIOs (all of the allocated slots) are in flight. Every slot is allocated before an asynchronous write and freed in the AIO write completion callback. I do not know whether libaio is to blame, but having no better idea, I think innodb_use_native_aio=0 would be worth trying.
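
A minimal sketch of the slot accounting described in the comment above, written in Python purely for illustration. The names `SlotPool`, `submit_write`, and `complete_one` are hypothetical and do not come from MariaDB's tpool code; the sketch only assumes what the comment states, namely that a slot is taken before each asynchronous write and returned by the completion callback. A short timeout stands in for the indefinite block so the example terminates.

```python
import threading


class SlotPool:
    """Hypothetical stand-in for a fixed-size cache of AIO slots."""

    def __init__(self, size):
        self._free = threading.Semaphore(size)

    def acquire(self, timeout=None):
        # True if a slot was obtained, False if the wait timed out.
        return self._free.acquire(timeout=timeout)

    def release(self):
        # Meant to be called from the write-completion callback.
        self._free.release()


def submit_write(pool, in_flight, timeout):
    """Take a slot before 'submitting' a write; report whether that worked."""
    if not pool.acquire(timeout=timeout):
        # Every slot is in flight. Without a timeout the caller would
        # simply block here, which is the state described for thread 62.
        return False
    in_flight.append(object())
    return True


def complete_one(pool, in_flight):
    """Simulate a completion callback: retire one write, free its slot."""
    in_flight.pop()
    pool.release()


pool = SlotPool(size=2)
in_flight = []

# Two writes fit; the third finds no free slot because nothing completed.
results = [submit_write(pool, in_flight, timeout=0.1) for _ in range(3)]

# Once a completion runs, a slot is free again and submission succeeds.
complete_one(pool, in_flight)
after_completion = submit_write(pool, in_flight, timeout=0.1)
```

If completions stop being delivered for any reason, no slot is ever released and every later submitter waits, which matches the symptom of all queries appearing stuck.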

            marko Marko Mäkelä added a comment:

            The Linux kernel version might be relevant here. I got the impression that Red Hat Enterprise Linux 7 reached its end of life some weeks or months ago. It could be that io_submit or io_getevents on the underlying Linux kernel version gets confused when the system time is changed. I see that tpool/aio_linux.cc invokes io_getevents without any timeout parameter. So, if there is a kernel bug, it should be deep inside the kernel, not in the system call interface.

            That being said, I remember that MDEV-32861 was fixed not too long ago, in MariaDB Server 10.6.17 and 10.11.7. That could be a simple explanation of this hang in 10.6.16 and 10.11.6.

            ullasram, can you please test whether MariaDB Server 10.6.17 or 10.11.7 avoids this hang?

            ullasram Ullas Ramakrishnan added a comment:

            Thanks @marko for the update. Let me upgrade to the mentioned version and get back to you. Please note that we haven't figured out a way to reproduce the issue consistently. It may take a few days for us to verify that the upgrade works.
            ullasram Ullas Ramakrishnan added a comment (edited):

            @marko The issue is not resolved by the 10.11.7 upgrade. We still see the DB hang on the upgraded setup.

            wlad Vladislav Vaintroub added a comment:

            It looks like this is actually MDEV-33669 (despite its description mentioning "mariabackup", it is not about mariabackup). There are not enough threads to process I/O completion requests, and this condition is "sticky". It is not fixed in 10.11.7, but it is fixed in 10.11.8.
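
The starvation described in the comment above can be illustrated with a generic thread pool. This is a hedged sketch, not MariaDB's tpool and not the MDEV-33669 fix: `run_scenario`, `writer`, and `completion` are hypothetical names, and the only assumption is that the task waiting for a completion and the task delivering it compete for the same workers. A finite wait replaces the indefinite one so the example finishes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_scenario(workers, wait_seconds):
    """Return True if the waiting task saw its completion in time."""
    done = threading.Event()

    def writer():
        # Occupies a worker while waiting for the completion to be handled.
        return done.wait(timeout=wait_seconds)

    def completion():
        # Stands in for the I/O completion callback.
        done.set()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        waiting = pool.submit(writer)
        pool.submit(completion)
        return waiting.result()


# One worker: the writer holds the only thread, so the completion cannot
# run until the writer gives up. With no timeout this would never resolve.
starved = run_scenario(workers=1, wait_seconds=0.2)

# Two workers: the completion runs alongside the writer and unblocks it.
served = run_scenario(workers=2, wait_seconds=5.0)
```

In the one-worker case the wait can only end by timing out, which is one way to read "sticky": once the pool is saturated with waiters, nothing inside the pool can make progress on its own.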

            People

              Assignee: danblack Daniel Black
              Reporter: ullasram Ullas Ramakrishnan
              Votes: 0
              Watchers: 6
