MariaDB Server / MDEV-24829

10.5.8 fails to start up on approx. 10% of Ubuntu Focal deployments

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 10.5.8
    • Fix Version/s: 10.5.9
    • Component/s: None
    • Environment: Ubuntu Focal, OpenStack CI environments running in KVM VMs; the Ubuntu Server cloud image also exhibits this.

    Description

      We run many CI jobs a day for openstack-ansible and deploy MariaDB 10.5.8 across several operating systems for each patch tested. Anecdotally, this has been happening on Ubuntu Focal since we upgraded the deployed version to 10.5.8.

      Roughly 10% of the jobs are currently failing because MariaDB fails to start up correctly. The service log is filled with:

      Jan 28 09:38:23 aio1 mariadbd[35395]: --Thread 140426648426240 has waited at ha_innodb.cc line 4704 for 391.00 seconds the semaphore:
      Jan 28 09:38:23 aio1 mariadbd[35395]: Mutex at 0x55cc9e14de40, Mutex LOCK_SYS created /home/buildbot/buildbot/build/mariadb-10.5.8/storage/innobase/lock/lock0lock.cc:461, lock var 2
      Jan 28 09:38:23 aio1 mariadbd[35395]: InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      Jan 28 09:38:23 aio1 mariadbd[35395]: InnoDB: Pending reads 0, writes 0
      Jan 28 09:38:23 aio1 mariadbd[35395]: InnoDB: ###### Diagnostic info printed to the standard error stream

      A complete dump of the service log is here: http://paste.openstack.org/show/802413/
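
      As a rough pointer for catching this earlier in CI, here is a minimal sketch (Python; the journalctl unit name and the 300-second threshold are assumptions, not part of the deployment code) that scans the service log for the long semaphore waits shown above:

      #!/usr/bin/env python3
      """Scan mariadbd service log output for InnoDB long semaphore waits.

      Illustrative only: it matches the "--Thread ... has waited at ... for N
      seconds the semaphore" lines quoted above.
      """
      import re
      import subprocess
      import sys

      # Matches e.g. "--Thread 140426648426240 has waited at ha_innodb.cc line 4704
      # for 391.00 seconds the semaphore:"
      SEMAPHORE_WAIT = re.compile(
          r"--Thread \d+ has waited at (\S+) line (\d+) for ([\d.]+) seconds the semaphore"
      )

      def stuck_waits(log_text, threshold_secs=300.0):
          """Yield (file, line, seconds) for every wait longer than threshold_secs."""
          for match in SEMAPHORE_WAIT.finditer(log_text):
              seconds = float(match.group(3))
              if seconds >= threshold_secs:
                  yield match.group(1), int(match.group(2)), seconds

      if __name__ == "__main__":
          # The unit name "mariadb" is an assumption; adjust for the deployment in use.
          log = subprocess.run(
              ["journalctl", "-u", "mariadb", "--no-pager", "-o", "cat"],
              capture_output=True, text=True, check=False,
          ).stdout
          hits = list(stuck_waits(log))
          for src_file, src_line, seconds in hits:
              print(f"long semaphore wait: {src_file}:{src_line} ({seconds:.0f}s)")
          sys.exit(1 if hits else 0)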

      This has been reproduced outside the CI environment, in a Focal virtual machine running the same deployment code. On the occasions when it gets stuck in this mutex lock state, restarting the service gets things working again.

      Any pointers on how to debug this further would be really helpful.


          Activity

            danblack Daniel Black added a comment -

            Thanks for the bug report. There was a gap of 4 minutes between startup and the InnoDB error messages. Was the system under load during this time?

            jrosser Jonathan Rosser added a comment - edited

            These CI tests are run on an 8-core VM and the deployment is done with a very extensive set of ansible playbooks, which continue with other components immediately after the mariadb service has been started. However, ansible is not known for high performance and I'd be surprised if much more than a single CPU core was maxed out at this time.

            When I managed to reproduce this on a local VM it was on a 'quiet' hypervisor, so not really suffering from noisy-neighbour VM or overloaded storage.


            jrosser Jonathan Rosser added a comment -

            This may or may not provide any insight, but the thing which fails is attempting to add the db users here: https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/tasks/galera_server_setup.yml#L16-L49

            This ansible module hangs forever and eventually the CI job is killed with a timeout. An example is here: http://paste.openstack.org/show/802491/

            I saw exactly the same thing locally: I was able to restart the mariadb service manually, kick off the ansible run again, and it continued without error.
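
            For completeness, the same check can be made without ansible; the sketch below (PyMySQL, with the socket path and root socket auth as assumptions rather than what the role actually uses) issues the same kind of CREATE USER statement with a timeout, so a hung server fails fast instead of blocking forever:

            #!/usr/bin/env python3
            """Probe whether the server accepts the kind of DDL the ansible
            mysql_user module would issue, with a timeout instead of hanging.

            Illustrative only: PyMySQL, the socket path and the credentials
            are assumptions.
            """
            import pymysql

            def try_create_user(timeout_secs=30):
                conn = pymysql.connect(
                    unix_socket="/var/run/mysqld/mysqld.sock",  # assumed default socket path
                    user="root",
                    connect_timeout=timeout_secs,
                    read_timeout=timeout_secs,   # give up instead of hanging forever
                    write_timeout=timeout_secs,
                )
                try:
                    with conn.cursor() as cur:
                        # Roughly what mysql_user does when adding a new account.
                        cur.execute("CREATE USER IF NOT EXISTS 'probe'@'localhost' IDENTIFIED BY 'probe'")
                        cur.execute("DROP USER 'probe'@'localhost'")
                    conn.commit()
                    return True
                finally:
                    conn.close()

            if __name__ == "__main__":
                try:
                    try_create_user()
                    print("server accepted DDL")
                except Exception as exc:  # a read timeout here is the hang being discussed
                    print(f"server appears stuck: {exc}")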


            marko Marko Mäkelä added a comment -

            Could this possibly be a duplicate of MDEV-24188? That could only be confirmed if you could produce stack traces of all running threads during the hang.
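
            For reference, one way to gather such traces (a sketch only; it assumes gdb and the mariadb debug symbol packages are installed on the affected host, and it simply drives gdb in batch mode against the hung process):

            #!/usr/bin/env python3
            """Dump stack traces of all mariadbd threads while the server is hung.

            Sketch only: assumes gdb and debug symbols for the mariadb packages
            are installed, and that this is run as root on the affected host.
            """
            import subprocess

            def mariadbd_pid():
                # pidof is assumed to be available (it is on Ubuntu Focal).
                out = subprocess.run(["pidof", "mariadbd"],
                                     capture_output=True, text=True, check=True)
                return int(out.stdout.split()[0])

            def dump_all_thread_backtraces(pid):
                """Attach gdb in batch mode and run 'thread apply all backtrace'."""
                result = subprocess.run(
                    ["gdb", "--batch", "-p", str(pid),
                     "-ex", "set pagination off",
                     "-ex", "thread apply all backtrace"],
                    capture_output=True, text=True, check=False,
                )
                return result.stdout

            if __name__ == "__main__":
                print(dump_all_thread_backtraces(mariadbd_pid()))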


            jrosser Jonathan Rosser added a comment -

            I've reproduced this in a local VM, here's a stack trace: http://paste.openstack.org/show/802502/

            I still have the VM so if the trace needs doing again/differently then I can do that.


            marko Marko Mäkelä added a comment -

            jrosser, thank you. Because the stack traces that you posted do not include any thread executing buf_page_create(), this cannot be MDEV-24188. Neither do they include anything pointing to InnoDB statistics, so this cannot be MDEV-24275 either.

            I leave it to jplindst to comment whether this could be related to MDEV-23536 or MDEV-23328 or some other Galera bugs.


            jplindst Jan Lindström (Inactive) added a comment -

            Based on the stack traces, this is related to MDEV-23328.


            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: jrosser Jonathan Rosser
              Votes: 0
              Watchers: 6

