[MDEV-24829] 10.5.8 fails to startup on approx 10% of ubuntu focal deployments Created: 2021-02-09 Updated: 2021-11-18 Resolved: 2021-02-16 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Storage Engine - InnoDB |
| Affects Version/s: | 10.5.8 |
| Fix Version/s: | 10.5.9 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jonathan Rosser | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Ubuntu Focal, OpenStack CI environments running in KVM VMs; the Ubuntu server cloud image also exhibits this. |
||
| Issue Links: |
|
| Description |
|
We run many CI jobs a day for openstack-ansible, deploying MariaDB 10.5.8 across several operating systems per tested patch. Anecdotally, this started happening on Ubuntu Focal once we upgraded the deployed version to 10.5.8. Roughly 10% of the jobs are currently failing because MariaDB fails to start up correctly. The service log is filled with messages like:

{{Jan 28 09:38:23 aio1 mariadbd[35395]: --Thread 140426648426240 has waited at ha_innodb.cc line 4704 for 391.00 seconds the semaphore:}}

A complete dump of the service log is here: http://paste.openstack.org/show/802413/

This has been reproduced outside the CI environment, in a Focal virtual machine running the same deployment code. On the occasions when the server is stuck in this mutex wait state, restarting the service gets things working again. Any pointers on how to debug this further would be really helpful. |
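When scanning many such logs, the reported thread and wait time can be pulled out mechanically. A minimal sketch, assuming the log-line format shown above (the sample line is copied verbatim from this report):

```shell
# Hedged sketch: extract the thread id and wait time from an InnoDB
# semaphore-wait message of the form seen in the service log above.
line='Jan 28 09:38:23 aio1 mariadbd[35395]: --Thread 140426648426240 has waited at ha_innodb.cc line 4704 for 391.00 seconds the semaphore:'

# Pull the fields out with sed capture groups.
thread=$(echo "$line" | sed -n 's/.*--Thread \([0-9]*\) has waited.*/\1/p')
secs=$(echo "$line" | sed -n 's/.*for \([0-9.]*\) seconds.*/\1/p')
echo "thread=$thread waited=${secs}s"
# → thread=140426648426240 waited=391.00s
```

The same patterns can be fed to journalctl output to see whether the wait time keeps growing across the hang.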
| Comments |
| Comment by Daniel Black [ 2021-02-09 ] |
|
Thanks for the bug report. About four minutes elapsed between startup and the InnoDB error messages. Was the system under load during this time? |
| Comment by Jonathan Rosser [ 2021-02-09 ] |
|
These CI tests run on an 8-core VM, and the deployment is done with a very extensive set of Ansible playbooks which continue with other components immediately after the mariadb service has been started. However, Ansible is not known for high performance, and I'd be surprised if much more than a single CPU core was maxed out at this time. When I managed to reproduce this on a local VM it was on a 'quiet' hypervisor, so not really suffering from noisy-neighbour VMs or overloaded storage. |
| Comment by Jonathan Rosser [ 2021-02-10 ] |
|
This may or may not be any insight, but the thing which fails is attempting to add the DB users here: https://opendev.org/openstack/openstack-ansible-galera_server/src/branch/master/tasks/galera_server_setup.yml#L16-L49 This Ansible module hangs forever, and eventually the CI job is killed with a timeout; an example is here: http://paste.openstack.org/show/802491/ I saw exactly the same locally, where I was able to restart the mariadb service manually, kick off Ansible again, and it continued without error. |
| Comment by Marko Mäkelä [ 2021-02-10 ] |
|
Could this possibly be a duplicate of |
| Comment by Jonathan Rosser [ 2021-02-10 ] |
|
I've reproduced this in a local VM; here's a stack trace: http://paste.openstack.org/show/802502/ I still have the VM, so if the trace needs doing again or differently then I can do that. |
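For anyone needing to capture a comparable trace, a guarded sketch along these lines is commonly used (the gdb invocation is standard batch-mode attach; the process name mariadbd matches the log above, and the output path is an assumption):

```shell
#!/bin/sh
# Hedged sketch: capture a full backtrace of every thread in a running
# mariadbd with gdb. Attaching pauses the process briefly; run as root.
pid=$(pidof mariadbd 2>/dev/null || true)
if [ -n "$pid" ]; then
    # -batch runs the -ex commands and detaches automatically.
    gdb -p "$pid" -batch -ex "thread apply all bt full" \
        > /tmp/mariadbd-trace.txt 2>&1
    echo "trace written to /tmp/mariadbd-trace.txt"
else
    echo "mariadbd not running; skipping"
fi
```

Installing the mariadb-server debug symbol package first makes the resulting trace far more readable.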
| Comment by Marko Mäkelä [ 2021-02-15 ] |
|
jrosser, thank you. Because the stack traces that you posted do not include any thread executing buf_page_create(), this cannot be that issue. I leave it to jplindst to comment on whether this could be related to |
| Comment by Jan Lindström (Inactive) [ 2021-02-16 ] |
|
Based on the stack trace, this is related to |