MariaDB Server / MDEV-29346

update_rows_log_event hung causing galera cluster failure

Details

    Description

      We run multiple Galera clusters in a multi-master setup, and noticed that a "sleeping" system thread can hang the whole cluster.

      When this system thread hangs, as shown in the screenshot, the whole Galera cluster comes to a standstill. Nothing can be written to the database.

      We have a log that prints "wsrep_last_committed"; it shows that one node's wsrep_last_committed is not advancing. Did the wsrep plugin in Galera hang?

      The h5 server is the one that is stuck. Nothing in mysql.err shows a stack trace.

      2022-08-18 06:10:04,862 INFO galera_alert line:93 galerastats on node xxx-h4: 
      2022-08-18 06:10:04,861 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "150", "wsrep_last_committed": "21383020", 
      2022-08-18 06:10:04,862 INFO galera_alert line:93 galerastats on node xxx-h5: 
      2022-08-18 06:10:04,862 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "590", "wsrep_last_committed": "21382990", 
      2022-08-18 06:10:04,863 INFO galera_alert line:93 galerastats on node xxx-h6: 
      2022-08-18 06:10:04,863 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "204", "wsrep_last_committed": "21383020", 
      ....
      ....
      2022-08-18 06:30:04,996 INFO galera_alert line:93 galerastats on node xxx-h4: 
      2022-08-18 06:30:04,996 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "170", "wsrep_last_committed": "21383020",
      2022-08-18 06:30:04,997 INFO galera_alert line:93 galerastats on node xxx-h5: 
      2022-08-18 06:30:04,997 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "643", "wsrep_last_committed": "21382990", 
      2022-08-18 06:30:04,997 INFO galera_alert line:93 galerastats on node xxx-h6: 
      2022-08-18 06:30:04,997 INFO galera_alert line:94 {'error': 0, 'payload': {'output': '{"Threads_connected": "228", "wsrep_last_committed": "21383020", 
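The comparison done by eye on the log above can be sketched in a few lines of Python. This is only an illustration, assuming the galera_alert log keeps the two-line shape shown here (a "galerastats on node" line followed by a payload line containing wsrep_last_committed); the function names are hypothetical and not part of any existing tool.

```python
import re
from collections import defaultdict

# Patterns match the two galera_alert line shapes quoted in this report.
NODE_RE = re.compile(r"galerastats on node (\S+?):")
SEQNO_RE = re.compile(r'"wsrep_last_committed": "(\d+)"')


def last_committed_by_node(lines):
    """Collect wsrep_last_committed samples per node, in log order."""
    samples = defaultdict(list)
    node = None
    for line in lines:
        match = NODE_RE.search(line)
        if match:
            node = match.group(1)
            continue
        match = SEQNO_RE.search(line)
        if match and node is not None:
            samples[node].append(int(match.group(1)))
            node = None  # wait for the next "on node" line
    return dict(samples)


def lagging_nodes(samples):
    """Map each node behind the cluster maximum to how far behind it is."""
    latest = {node: seqnos[-1] for node, seqnos in samples.items() if seqnos}
    if not latest:
        return {}
    head = max(latest.values())
    return {node: head - seqno for node, seqno in latest.items() if seqno < head}
```

Fed the lines above, this reports xxx-h5 as 30 seqnos behind the other two nodes. Because the whole cluster stalls, the healthy nodes stop advancing too, so comparing against the cluster maximum is more telling here than checking whether a single node's value moves.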
      

      The only way to "unbreak" the cluster is to stop the hung node, kill mariadb, and start the mariadb service again.

          Activity

            khaiping.loh Khai Ping added a comment -

            @jan, can you tell me how we can identify the issue from a stack trace?


            janlindstrom Jan Lindström added a comment -

            khaiping.loh If you find a thread executing the sql_kill function, it is MDEV-29293.

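As a rough aid for that check, the sketch below scans a saved backtrace for threads with a frame in sql_kill. It assumes the trace is in the text layout gdb prints for `thread apply all bt` ("Thread N (...)" headers followed by "#k" frame lines); the helper name is hypothetical, and a match is only a hint to look closer, not a confirmed diagnosis.

```python
import re

# gdb prints one "Thread N (...)" header before each thread's frames.
THREAD_RE = re.compile(r"^Thread (\d+) \(")


def threads_in_function(backtrace_text, function="sql_kill"):
    """Return gdb thread numbers whose stack has a frame naming `function`."""
    # Word boundaries keep e.g. sql_kill_user from matching sql_kill.
    frame_re = re.compile(r"\b" + re.escape(function) + r"\b")
    hits = []
    current = None
    for line in backtrace_text.splitlines():
        stripped = line.strip()
        match = THREAD_RE.match(stripped)
        if match:
            current = int(match.group(1))
            continue
        is_frame = stripped.startswith("#")
        if current is not None and is_frame and frame_re.search(stripped):
            if current not in hits:
                hits.append(current)
    return hits
```

Typical use would be `threads_in_function(open("bt.txt").read())`, where `bt.txt` is a placeholder for wherever the backtrace was saved.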
            khaiping.loh Khai Ping added a comment - - edited

            @jan, thanks!

            What about seeing this in the processlist? It seems to have caused the hang too. Unfortunately, I do not have a stack trace for this example.

            ID,QUERY_ID,USER,DB,TIME,STATE,MEMORY_USED,MAX_MEMORY_USED,EXAMINED_ROWS,TID,INFO
            276168,3416992,flask_user,None,4517,acquiring total order isolation,75568,75568,0,2831340,KILL CONNECTION ?
            276167,3416991,flask_user,None,4517,acquiring total order isolation,74712,74712,0,2831239,KILL CONNECTION ?
            276152,3416920,flask_user,None,4545,acquiring total order isolation,74712,74712,0,2831209,KILL CONNECTION ?
            276141,3416909,flask_user,None,4566,acquiring total order isolation,74712,74712,0,2831188,KILL CONNECTION ?

            When that happened, we noticed that a lot of COMMIT statements were stuck:

            277541,3422826,app_user,database_1,601,starting,83152,1033792,0,313531,COMMIT
            277149,3421496,app_user,database_1,1501,starting,82080,1032720,0,2835445,COMMIT
            276707,3420193,app_user,database_1,2401,starting,82080,1032720,0,2834972,COMMIT
            276639,3420300,app_user,replication,2323,starting,82080,1032720,0,2834556,COMMIT

            This seems to be related to MDEV-29293 as well.


            janlindstrom Jan Lindström added a comment -

            khaiping.loh Yes, this is an indication of MDEV-29293, which is fixed in more recent versions of MariaDB Server.

            khaiping.loh Khai Ping added a comment -

            @jan, thanks again!


            People

              janlindstrom Jan Lindström
              khaiping.loh Khai Ping