[MDEV-24829] 10.5.8 fails to startup on approx 10% of ubuntu focal deployments Created: 2021-02-09  Updated: 2021-11-18  Resolved: 2021-02-16

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.5.8
Fix Version/s: 10.5.9

Type: Bug Priority: Major
Reporter: Jonathan Rosser Assignee: Jan Lindström (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Ubuntu Focal, openstack CI environments running in KVM VM, Ubuntu server cloud image also exhibits this.


Issue Links:
Problem/Incident
is caused by MDEV-23328 Server hang due to Galera lock confli... Closed
Relates
relates to MDEV-24188 Hang in buf_page_create() after reusi... Closed

 Description   

We run many CI jobs a day for openstack-ansible and deploy MariaDB 10.5.8 across several OS per patch tested. Anecdotally this has started happening on Ubuntu Focal since we upgraded the deployed version to 10.5.8.

Roughly 10% of the jobs are currently failing because MariaDB fails to startup correctly. The service log is filled with

{{Jan 28 09:38:23 aio1 mariadbd[35395]: --Thread 140426648426240 has waited at ha_innodb.cc line 4704 for 391.00 seconds the semaphore:
Jan 28 09:38:23 aio1 mariadbd[35395]: Mutex at 0x55cc9e14de40, Mutex LOCK_SYS created /home/buildbot/buildbot/build/mariadb-10.5.8/storage/innobase/lock/lock0lock.cc:461, lock var 2
Jan 28 09:38:23 aio1 mariadbd[35395]: InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
Jan 28 09:38:23 aio1 mariadbd[35395]: InnoDB: Pending reads 0, writes 0
Jan 28 09:38:23 aio1 mariadbd[35395]: InnoDB: ###### Diagnostic info printed to the standard error stream}}

A complete dump of the service log is here http://paste.openstack.org/show/802413/

This has been reproduced outside a CI environment in a focal virtual machine by running the same deployment code, and on the occasions when it is stuck in this mutex lock state, restarting the service gets things working again.

Any pointers on how to debug this further would be really helpful.



 Comments   
Comment by Daniel Black [ 2021-02-09 ]

Thanks for the bug report. There was 4 minutes between startup and the InnoDB error messages. Was the system under load during this time?

Comment by Jonathan Rosser [ 2021-02-09 ]

These CI tests are run on 8 core VM and the deployment is done with a very extensive set of ansible playbooks which will continue with other components immediately after the mariadb service has been started. However, ansible is not known for high performance and i'd be surprised if much more than a single CPU core was maxed out at this time.

When I managed to reproduce this on a local VM it was on a 'quiet' hypervisor, so not really suffering from noisy-neighbour VM or overloaded storage.

Comment by Jonathan Rosser [ 2021-02-10 ]

This may or may not be any insight but the thing which fails is attempting to add the db users here https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/tasks/galera_server_setup.yml#L16-L49

This ansible module hangs forever and eventually the CI job is killed with a timeout. An example would be this http://paste.openstack.org/show/802491/

I saw exactly the same locally, where I was able to restart the mariadb service manually, kick off the ansible again and it continued without error.

Comment by Marko Mäkelä [ 2021-02-10 ]

Could this possibly be a duplicate of MDEV-24188? That could only be confirmed if you could produce stack traces of all running threads during the hang.

Comment by Jonathan Rosser [ 2021-02-10 ]

I've reproduced this in a local VM, heres a stack trace http://paste.openstack.org/show/802502/

I still have the VM so if the trace needs doing again/differently then I can do that.

Comment by Marko Mäkelä [ 2021-02-15 ]

jrosser, thank you. Because the stack traces that you posted do not include any thread executing buf_page_create(), this cannot be MDEV-24188. Neither do they include anything pointing to InnoDB statistics, so this cannot be MDEV-24275 either.

I leave it to jplindst to comment whether this could be related to MDEV-23536 or MDEV-23328 or some other Galera bugs.

Comment by Jan Lindström (Inactive) [ 2021-02-16 ]

Based on stack this is related to MDEV-23328.

Generated at Thu Feb 08 09:32:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.