We have multiple Galera clusters running in a multi-master setup, and we noticed that a "sleeping" system thread can hang the whole cluster.
When this system thread hangs, as shown in the screenshot, the whole Galera cluster comes to a standstill. Nothing can be written to the database.
We have a log that prints wsrep_last_committed, and it shows that one node's wsrep_last_committed is not moving. Did the wsrep plugin in Galera hang?
The h5 server is the one that is stuck. There is nothing in mysql.err showing any stack trace.
2022-08-18 06:10:04,862 INFO galera_alert line:93 galerastats on node xxx-h4:
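For reference, a minimal sketch of the kind of wsrep_last_committed check described above, assuming the mysql client can reach each node; the host names and loop are illustrative, not taken from the actual galera_alert script:
# Poll wsrep_last_committed on every node; a value that stops advancing
# while writes are in flight points at the stalled node.
for host in xxx-h4 xxx-h5; do
  seqno=$(mysql -h "$host" -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_last_committed'" | awk '{print $2}')
  echo "$(date '+%F %T') $host wsrep_last_committed=$seqno"
done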
Daniel Black added a comment - No, they are information only and used by gdb. A small bit of storage, but no impact on the running server and no replacement of code.
Khai Ping added a comment - @daniel,
Does this mean I do not need to install the debug-info packages, since my binary is not stripped?
/opt/sbin/mariadbd: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=893a0b4698fc39d184df3f3c32df693dfa008884, not stripped
When I tried gdb attach <pid>, I got these lines. Does that mean I need to install the debuginfo?
Reading symbols from /usr/lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
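As an aside, the full all-threads backtrace attached later in this ticket can be captured non-interactively. This is a generic gdb invocation, not something specific to this ticket; the output file name matches the later attachment:
# Attach to the running mariadbd, dump every thread's full backtrace,
# then detach without disturbing the server.
gdb --batch -p "$(pidof mariadbd)" \
    -ex "set pagination off" \
    -ex "thread apply all bt full" > mariadbd_full_bt_all_threads.txt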
Daniel Black added a comment - The binary is not technically stripped; however, a commonly used split-debug technique means the debug info isn't in the binary but in separate files, hence the debuginfo packages are still needed.
Missing debug information for the libraries MariaDB uses isn't a large impediment, as the fault is unlikely to be in those libraries. If in doubt, just include the generated gdb information.
If for some reason you feel uncomfortable with the detail in the gdb output, you can upload it privately to the ftp server.
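A quick way to verify the split-debug situation Daniel describes, using standard binutils; the binary path is taken from the comment above:
# A .gnu_debuglink section names the separate debug file a debuginfo
# package would install; count of .debug_* sections shows how much
# DWARF data is actually left inside the binary itself.
readelf -p .gnu_debuglink /opt/sbin/mariadbd
readelf -S /opt/sbin/mariadbd | grep -c '\.debug_'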
Khai Ping added a comment - @daniel, we are building our own MariaDB using the spec file; however, the debuginfo RPM is not generated for 10.6.5, though it is generated for 10.6.9.
Any idea what could be causing it?
Daniel Black added a comment -
> we are building our own mariadb using the spec file
Why? What is it?
> however the debuginfo rpm is not getting generated for 10.6.5, however it is getting generated for 10.6.9
> any idea what could be causing it?
No. I could guess the cmake version is different, but I can't think of a code change that would make this difference.
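If it helps, two hypothetical things to check when an rpmbuild stops emitting a -debuginfo subpackage; the path SPECS/mariadb.spec is an assumption, substitute your own spec:
# Debuginfo generation can be disabled outright in the spec,
# or lost because the build no longer compiles with debug info (-g).
grep -n 'debug_package' SPECS/mariadb.spec    # '%define debug_package %{nil}' disables it
grep -n 'CMAKE_BUILD_TYPE' SPECS/mariadb.spec # RelWithDebInfo keeps -g in the compile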
Jan Lindström (Inactive) added a comment - khaiping.loh Yes, that output would be more than useful. Please also provide the full error log. Can you try with a more recent version of MariaDB and the Galera library?
Khai Ping added a comment - @daniel, I have attached mariadbd_full_bt_all_threads.txt.
Is this issue resolved in MariaDB 10.6.12? I am referencing https://jira.mariadb.org/browse/MDEV-29684; it seems like it is fixed?
Daniel Black added a comment - Thank you. What analysis have you done that makes you think it is MDEV-29684?
This does have killed threads holding locks, so it is potentially the same, but a more complete look than I have time for now is required to be more definite.
Khai Ping added a comment -
Based on the release notes, https://mariadb.com/kb/en/mariadb-10-6-12-release-notes/, which mention "Fixes for cluster wide write conflict resolving".
In my environment, we only see this problem on multi-master Galera nodes. Our application can send writes to any of the Galera nodes, and under high concurrency there are bound to be Galera write conflicts. MDEV-29684 mentions the line "This requires multi-master testing".
Also, the first sentence of MDEV-29684 says "There are a number of bug reports of cluster wide conflict resolving related crashes or hangs." When this issue happens in our environment, nothing can be written any more; it is as though the cluster hung.
I appreciate your prompt response, and I hope the bt thread logs are helpful. That log was retrieved from the node that hung.
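The write-conflict pressure described above is measurable. These are standard Galera status counters, nothing ticket-specific; rising values under load confirm multi-master conflicts:
# Certification failures and brute-force aborts both count cluster-wide
# write conflicts seen by this node.
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_cert_failures','wsrep_local_bf_aborts')"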
Julien Fritsch added a comment - Automated message:
----------------------------
Since this issue has not been updated in 6 weeks, it's time to move it back to Stalled.
JiraAutomate added a comment - Automated message:
----------------------------
Since this issue has not been updated in 6 weeks, it's time to move it back to Stalled.
Jan Lindström added a comment - khaiping.loh Can you provide the full unedited error log from the node that hangs, SHOW PROCESSLIST, and SHOW ENGINE INNODB STATUS? Is the issue reproducible? If it is, can you provide steps to reproduce? The MariaDB server and Galera library versions used are quite old; please consider upgrading to more recent ones. More recent versions have fixes for cluster conflict hang cases, and this could be one of them.
Khai Ping added a comment - Jan Lindström, when the issue happens there are no errors in mysql.err. However, when we go into one of the nodes, the processlist shows a system_user thread stuck indefinitely.
Do you know which version has those fixes related to cluster hanging? Is it 10.6.15?
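A hypothetical way to isolate that stuck thread: Galera applier and replication threads report the user name "system user" in the processlist, so one can filter on it directly:
# Long-running 'system user' entries are the applier threads; a TIME
# that keeps growing with an unchanging STATE is the stuck one.
mysql -e "SELECT id, user, time, state, info FROM information_schema.PROCESSLIST WHERE user = 'system user' ORDER BY time DESC"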
Jan Lindström added a comment - khaiping.loh I looked at the stack trace and I can find selects there, but not that update clause. I do not see any real evidence that the server is hung. https://mariadb.com/kb/en/mariadb-10-6-12-release-notes/ contains the fix, but https://mariadb.com/kb/en/mariadb-10-6-17-release-notes/ is the latest and recommended release.
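For anyone repeating Jan's check: with "bt full" output, statement text often survives in frame locals, so a plain text search over the attachment is usually enough. The file name is from this ticket's earlier attachment:
# Look for any UPDATE statement captured in the thread backtraces.
grep -n -i 'update' mariadbd_full_bt_all_threads.txt | head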
Khai Ping added a comment - Jan Lindström, I noticed this changelog entry in 10.6.15 as well. Could this help too?
MariaDB stuck on starting commit state (waiting on commit order critical section) (MDEV-29293)
Looking at the performance regression in MDEV-33508, I do not think upgrading to 10.6.17 should be recommended?
Jan Lindström added a comment - khaiping.loh Yes, it would help, but then I did not see evidence that you are hitting it. I do not know how severe the performance regression is.
Khai Ping added a comment - @jan, I uploaded another set of stack traces from another cluster where one node hung. The logs cover three servers.
[^mariadb stacktrace.zip]
Jan Lindström added a comment - khaiping.loh I can't read those macOS files, but I strongly suspect https://jira.mariadb.org/browse/MDEV-29293, for which you would need to upgrade.
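Likely the archive was created with Finder's "Compress", which adds __MACOSX/ and ._* metadata entries that confuse other tools. A hypothetical repack with the command-line zip avoids them; the archive name is illustrative:
# Plain command-line zip adds no macOS resource-fork entries.
zip mariadb_stacktrace.zip mariadbd_full_bt_all_threads-*.log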
Khai Ping added a comment - @jan, I uploaded the non-zip files:
mariadbd_full_bt_all_threads-h12_1712676357.log
mariadbd_full_bt_all_threads-h15_1712676357.log
mariadbd_full_bt_all_threads-h14_1712676357.log
Khai Ping added a comment - @jan, thanks! We will proceed with the upgrade.
How about seeing this in the processlist? It seems to have caused the hang too. In this example, unfortunately, I do not have the stack trace.
ID,QUERY_ID,USER,DB,TIME,STATE,MEMORY_USED,MAX_MEMORY_USED,EXAMINED_ROWS,TID,INFO
276168,3416992,flask_user,None,4517,acquiring total order isolation,75568,75568,0,2831340,KILL CONNECTION ?
276167,3416991,flask_user,None,4517,acquiring total order isolation,74712,74712,0,2831239,KILL CONNECTION ?
276152,3416920,flask_user,None,4545,acquiring total order isolation,74712,74712,0,2831209,KILL CONNECTION ?
276141,3416909,flask_user,None,4566,acquiring total order isolation,74712,74712,0,2831188,KILL CONNECTION ?
When that happened, we noticed a lot of commit transactions were stuck:
277541,3422826,app_user,database_1,601,starting,83152,1033792,0,313531,COMMIT
277149,3421496,app_user,database_1,1501,starting,82080,1032720,0,2835445,COMMIT
276707,3420193,app_user,database_1,2401,starting,82080,1032720,0,2834972,COMMIT
276639,3420300,app_user,replication,2323,starting,82080,1032720,0,2834556,COMMIT
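A hypothetical watch query for the two symptoms shown above (sessions stuck acquiring total order isolation, and COMMITs stuck in 'starting'); the 60-second threshold is arbitrary:
# Surface long-stuck sessions in either of the two observed states.
mysql -e "SELECT id, user, time, state, LEFT(info, 60) AS info FROM information_schema.PROCESSLIST WHERE state IN ('acquiring total order isolation', 'starting') AND time > 60 ORDER BY time DESC"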
Seems to be related to MDEV-29293 as well.
Can you: