Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-29388

Galera cluster hanging

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.9
    • Fix Version/s: 10.6
    • Component/s: Galera, Server
    • Labels:
      None

      Description

      On a 3-nodes cluster, 2 nodes got completely stuck. Anything trying to connect would also get stuck. The third node had 200+ (or 500?) queries running, but stuck as well.

      Stopping those 2 servers got stuck to, had to kill the processes manually. Logs shows the server being disconnected from the cluster. Cluster size went from 3 to 1 as it should, but mariadb processes where stuck on those 2 servers until manually killed.

      Starting first node back after a minute got it to sync using an IST, the other node restarted after 5 minutes synced using an SST. As if nothing happened!?

      Digging into logs of the 3 servers shows nothing, no output at all during the whole period they where stuck.

      Anything I could do to track this down if it happens again?

      Server 1 log (stuck):

      2022-08-25  0:29:16 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000191
      2022-08-25  0:29:32 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000192 of size 134217696 bytes
      2022-08-25  0:29:39 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000192
      2022-08-25  0:30:38 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000193 of size 134217696 bytes
      2022-08-25  0:30:47 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000194 of size 134217696 bytes
      2022-08-25  0:30:56 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000193
      2022-08-25  0:30:57 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000194
      2022-08-25  0:32:23 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000195 of size 134217696 bytes
      2022-08-25  1:03:21 0 [Note] WSREP: declaring 19edde19-9435 at tcp://xxxx:4567 stable
      2022-08-25  1:03:21 0 [Note] WSREP: forgetting e26a09b5-b02e (tcp://xxxx:4567)
      2022-08-25  1:03:21 0 [Note] WSREP: Node 19edde19-9435 state prim
      2022-08-25  1:03:21 0 [Note] WSREP: view(view_id(PRIM,19edde19-9435,63) memb {
      2022-08-25  1:03:21 0 [Note] WSREP: save pc into disk
      2022-08-25  1:03:21 0 [Note] WSREP: forgetting e26a09b5-b02e (tcp://xxxx:4567)
      2022-08-25  1:03:21 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
      2022-08-25  1:03:21 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      2022-08-25  1:03:21 0 [Note] WSREP: STATE EXCHANGE: sent state msg: f2ffea9d-2400-11ed-9136-ef240a6d8bee
      2022-08-25  1:03:21 0 [Note] WSREP: STATE EXCHANGE: got state msg: f2ffea9d-2400-11ed-9136-ef240a6d8bee from 0 (ovh7.1check.com)
      2022-08-25  1:03:21 0 [Note] WSREP: STATE EXCHANGE: got state msg: f2ffea9d-2400-11ed-9136-ef240a6d8bee from 1 (ovh1.1check.com)
      2022-08-25  1:03:21 0 [Note] WSREP: Quorum results:
      2022-08-25  1:03:21 0 [Note] WSREP: Flow-control interval: [23, 23]
      2022-08-25  1:03:26 0 [Note] WSREP:  cleaning up e26a09b5-b02e (tcp://xxxx:4567)
      2022-08-25  1:06:00 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
      2022-08-25  1:06:00 0 [Note] WSREP: Shutdown replication
      2022-08-25  1:06:00 0 [Note] WSREP: Server status change synced -> disconnecting
      2022-08-25  1:06:00 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2022-08-25  1:06:00 0 [Note] WSREP: Closing send monitor...
      2022-08-25  1:06:00 0 [Note] WSREP: Closed send monitor.
      2022-08-25  1:06:00 0 [Note] WSREP: gcomm: terminating thread
      2022-08-25  1:06:00 0 [Note] WSREP: gcomm: joining thread
      2022-08-25  1:06:00 0 [Note] WSREP: gcomm: closing backend
      2022-08-25  1:06:01 0 [Note] WSREP: view(view_id(NON_PRIM,19edde19-9435,63) memb {
      2022-08-25  1:06:01 0 [Note] WSREP: PC protocol downgrade 1 -> 0
      2022-08-25  1:06:01 0 [Note] WSREP: view((empty))
      2022-08-25  1:06:01 0 [Note] WSREP: gcomm: closed
      2022-08-25  1:06:01 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2022-08-25  1:06:01 0 [Note] WSREP: Flow-control interval: [16, 16]
      2022-08-25  1:06:01 0 [Note] WSREP: Received NON-PRIMARY.
      2022-08-25  1:06:01 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 70213440)
      2022-08-25  1:06:01 0 [Note] WSREP: New SELF-LEAVE.
      2022-08-25  1:06:01 0 [Note] WSREP: Flow-control interval: [0, 0]
      2022-08-25  1:06:01 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
      2022-08-25  1:06:01 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 70213440)
      2022-08-25  1:06:01 0 [Note] WSREP: RECV thread exiting 0: Success
      2022-08-25  1:06:01 0 [Note] WSREP: recv_thread() joined.
      2022-08-25  1:06:01 0 [Note] WSREP: Closing replication queue.
      2022-08-25  1:06:01 0 [Note] WSREP: Closing slave action queue.
      2022-08-25  1:07:18 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 2c96698a-0fdf-11ed-90d6-7ecc8fa70984:70213386
      2022-08-25  1:07:18 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
      

      Server 2 log (stuck):

      2022-08-25  0:29:32 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000192 of size 134217696 bytes
      2022-08-25  0:29:39 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000192
      2022-08-25  0:30:38 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000193 of size 134217696 bytes
      2022-08-25  0:30:47 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000194 of size 134217696 bytes
      2022-08-25  0:30:56 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000193
      2022-08-25  0:30:57 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000194
      2022-08-25  0:32:23 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000195 of size 134217696 bytes
      2022-08-25  1:03:21 0 [Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown
      2022-08-25  1:03:21 0 [Note] WSREP: Shutdown replication
      2022-08-25  1:03:21 0 [Note] WSREP: Server status change synced -> disconnecting
      2022-08-25  1:03:21 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2022-08-25  1:03:21 0 [Note] WSREP: Closing send monitor...
      2022-08-25  1:03:21 0 [Note] WSREP: Closed send monitor.
      2022-08-25  1:03:21 0 [Note] WSREP: gcomm: terminating thread
      2022-08-25  1:03:21 0 [Note] WSREP: gcomm: joining thread
      2022-08-25  1:03:21 0 [Note] WSREP: gcomm: closing backend
      2022-08-25  1:03:21 0 [Note] WSREP: view(view_id(NON_PRIM,19edde19-9435,62) memb {
      2022-08-25  1:03:21 0 [Note] WSREP: PC protocol downgrade 1 -> 0
      2022-08-25  1:03:21 0 [Note] WSREP: view((empty))
      2022-08-25  1:03:21 0 [Note] WSREP: gcomm: closed
      2022-08-25  1:03:21 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      2022-08-25  1:03:21 0 [Note] WSREP: Flow-control interval: [16, 16]
      2022-08-25  1:03:21 0 [Note] WSREP: Received NON-PRIMARY.
      2022-08-25  1:03:21 0 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 70213433)
      2022-08-25  1:03:21 0 [Note] WSREP: New SELF-LEAVE.
      2022-08-25  1:03:21 0 [Note] WSREP: Flow-control interval: [0, 0]
      2022-08-25  1:03:21 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
      2022-08-25  1:03:21 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 70213433)
      2022-08-25  1:03:21 0 [Note] WSREP: RECV thread exiting 0: Success
      2022-08-25  1:03:21 0 [Note] WSREP: recv_thread() joined.
      2022-08-25  1:03:21 0 [Note] WSREP: Closing replication queue.
      2022-08-25  1:03:21 0 [Note] WSREP: Closing slave action queue.
      2022-08-25  1:08:05 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 2c96698a-0fdf-11ed-90d6-7ecc8fa70984:70213385
      2022-08-25  1:08:05 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
      2022-08-25  1:08:05 0 [Note] WSREP: wsrep_load(): Galera 26.4.12(r1eac5b64) by Codership Oy <info@codership.com> loaded successfully.
      2022-08-25  1:08:05 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
      

      Server 3 log (still alive):

      2022-08-25  0:30:56 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000389
      2022-08-25  0:30:57 0 [Note] WSREP: Deleted page /var/lib/mysql/galera.page.000390
      2022-08-25  0:32:23 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000391 of size 134217696 bytes
      2022-08-25  1:03:21 0 [Note] WSREP: declaring 3e14de76-94d4 at tcp://xxxx:4567 stable
      2022-08-25  1:03:21 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://xxxx:44208
      2022-08-25  1:03:21 0 [Note] WSREP: forgetting e26a09b5-b02e (tcp://xxxx:4567)
      2022-08-25  1:03:21 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7fd768845210
      2022-08-25  1:03:21 0 [Note] WSREP: Deferred close timer destruct
      2022-08-25  1:03:21 0 [Note] WSREP: Node 19edde19-9435 state prim
      2022-08-25  1:03:21 0 [Note] WSREP: view(view_id(PRIM,19edde19-9435,63) memb {
      2022-08-25  1:03:21 0 [Note] WSREP: save pc into disk
      2022-08-25  1:03:21 0 [Note] WSREP: forgetting e26a09b5-b02e (tcp://xxxx:4567)
      2022-08-25  1:03:21 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
      2022-08-25  1:03:21 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: f2ffea9d-2400-11ed-9136-ef240a6d8bee
      2022-08-25  1:03:21 0 [Note] WSREP: STATE EXCHANGE: sent state msg: f2ffea9d-2400-11ed-9136-ef240a6d8bee
      2022-08-25  1:03:21 0 [Note] WSREP: STATE EXCHANGE: got state msg: f2ffea9d-2400-11ed-9136-ef240a6d8bee from 0 (ovh7.1check.com)
      2022-08-25  1:03:21 0 [Note] WSREP: STATE EXCHANGE: got state msg: f2ffea9d-2400-11ed-9136-ef240a6d8bee from 1 (ovh1.1check.com)
      2022-08-25  1:03:21 0 [Note] WSREP: Quorum results:
      2022-08-25  1:03:21 0 [Note] WSREP: Flow-control interval: [23, 23]
      2022-08-25  1:03:21 17 [Note] WSREP: ####### processing CC 70213434, local, ordered
      2022-08-25  1:03:21 17 [Note] WSREP: ####### My UUID: 19edde19-2139-11ed-9435-e69b7d8e37e9
      2022-08-25  1:03:21 17 [Note] WSREP: Skipping cert index reset
      2022-08-25  1:03:21 17 [Note] WSREP: REPL Protocols: 10 (5)
      2022-08-25  1:03:21 17 [Note] WSREP: ####### Adjusting cert position: 70213433 -> 70213434
      2022-08-25  1:03:21 0 [Note] WSREP: Service thread queue flushed.
      2022-08-25  1:03:21 17 [Note] WSREP: ================================================
      2022-08-25  1:03:21 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2022-08-25  1:03:21 17 [Note] WSREP: Lowest cert index boundary for CC from group: 70213347
      2022-08-25  1:03:21 17 [Note] WSREP: Min available from gcache for CC from group: 70213347
      2022-08-25  1:03:26 0 [Note] WSREP:  cleaning up e26a09b5-b02e (tcp://xxxx:4567)
      2022-08-25  1:06:01 0 [Note] WSREP: Deferred close timer started for socket with remote endpoint: tcp://xxxx:36660
      2022-08-25  1:06:01 0 [Note] WSREP: forgetting 3e14de76-94d4 (tcp://xxxx:4567)
      2022-08-25  1:06:01 0 [Note] WSREP: Node 19edde19-9435 state prim
      2022-08-25  1:06:01 0 [Note] WSREP: view(view_id(PRIM,19edde19-9435,64) memb {
      2022-08-25  1:06:01 0 [Note] WSREP: save pc into disk
      2022-08-25  1:06:01 0 [Note] WSREP: forgetting 3e14de76-94d4 (tcp://xxxx:4567)
      2022-08-25  1:06:01 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
      2022-08-25  1:06:01 0 [Note] WSREP: Deferred close timer handle_wait Operation aborted. for 0x7fd7344f9160
      2022-08-25  1:06:01 0 [Note] WSREP: Deferred close timer destruct
      2022-08-25  1:06:01 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 523fb653-2401-11ed-8253-82473e2187ea
      2022-08-25  1:06:01 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 523fb653-2401-11ed-8253-82473e2187ea
      2022-08-25  1:06:01 0 [Note] WSREP: STATE EXCHANGE: got state msg: 523fb653-2401-11ed-8253-82473e2187ea from 0 (ovh7.1check.com)
      2022-08-25  1:06:01 0 [Note] WSREP: Quorum results:
      2022-08-25  1:06:01 0 [Note] WSREP: Flow-control interval: [16, 16]
      2022-08-25  1:06:01 17 [Note] WSREP: ####### processing CC 70213441, local, ordered
      2022-08-25  1:06:01 17 [Note] WSREP: ####### My UUID: 19edde19-2139-11ed-9435-e69b7d8e37e9
      2022-08-25  1:06:01 17 [Note] WSREP: Skipping cert index reset
      2022-08-25  1:06:01 17 [Note] WSREP: REPL Protocols: 10 (5)
      2022-08-25  1:06:01 17 [Note] WSREP: ####### Adjusting cert position: 70213440 -> 70213441
      2022-08-25  1:06:01 0 [Note] WSREP: Service thread queue flushed.
      2022-08-25  1:06:01 17 [Note] WSREP: ================================================
      2022-08-25  1:06:01 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2022-08-25  1:06:01 17 [Note] WSREP: Lowest cert index boundary for CC from group: 70213347
      2022-08-25  1:06:01 17 [Note] WSREP: Min available from gcache for CC from group: 70213347
      2022-08-25  1:06:06 0 [Note] WSREP:  cleaning up 3e14de76-94d4 (tcp://xxxx:4567)
      2022-08-25  1:06:17 16935047 [Warning] Aborted connection 16935047 to db: '_1check_mail_export' user: 'bo_1check' host: '127.0.0.1' (Got an error reading communication packets)
      2022-08-25  1:06:17 16935044 [Warning] Aborted connection 16935044 to db: '_1check_mail_export' user: 'bo_1check' host: '127.0.0.1' (Got an error reading communication packets)
      2022-08-25  1:07:12 0 [Note] WSREP: Created page /var/lib/mysql/galera.page.000392 of size 134217696 bytes
      2022-08-25  1:07:18 0 [Note] WSREP: (19edde19-9435, 'tcp://0.0.0.0:4567') connection established to 807a59e3-896d tcp://xxxx:4567
      2022-08-25  1:07:18 0 [Note] WSREP: (19edde19-9435, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
      2022-08-25  1:07:19 0 [Note] WSREP: declaring 807a59e3-896d at tcp://xxxx:4567 stable
      2022-08-25  1:07:19 0 [Note] WSREP: Node 19edde19-9435 state prim
      2022-08-25  1:07:19 0 [Note] WSREP: view(view_id(PRIM,19edde19-9435,65) memb {
      2022-08-25  1:07:19 0 [Note] WSREP: save pc into disk
      2022-08-25  1:07:19 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
      2022-08-25  1:07:19 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 80c6eee7-2401-11ed-a085-a2c1fb1fa522
      2022-08-25  1:07:19 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 80c6eee7-2401-11ed-a085-a2c1fb1fa522
      2022-08-25  1:07:19 0 [Note] WSREP: STATE EXCHANGE: got state msg: 80c6eee7-2401-11ed-a085-a2c1fb1fa522 from 0 (ovh7.1check.com)
      2022-08-25  1:07:19 0 [Note] WSREP: STATE EXCHANGE: got state msg: 80c6eee7-2401-11ed-a085-a2c1fb1fa522 from 1 (ovh1.1check.com)
      2022-08-25  1:07:19 0 [Note] WSREP: Quorum results:
      2022-08-25  1:07:19 0 [Note] WSREP: Flow-control interval: [23, 23]
      2022-08-25  1:07:19 17 [Note] WSREP: ####### processing CC 70224411, local, ordered
      2022-08-25  1:07:19 17 [Note] WSREP: ####### My UUID: 19edde19-2139-11ed-9435-e69b7d8e37e9
      2022-08-25  1:07:19 17 [Note] WSREP: Skipping cert index reset
      2022-08-25  1:07:19 17 [Note] WSREP: REPL Protocols: 10 (5)
      2022-08-25  1:07:19 17 [Note] WSREP: ####### Adjusting cert position: 70224410 -> 70224411
      2022-08-25  1:07:19 0 [Note] WSREP: Service thread queue flushed.
      2022-08-25  1:07:19 17 [Note] WSREP: ================================================
      2022-08-25  1:07:19 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2022-08-25  1:07:19 17 [Note] WSREP: Lowest cert index boundary for CC from group: 70213347
      2022-08-25  1:07:19 17 [Note] WSREP: Min available from gcache for CC from group: 70213347
      

        Attachments

          Activity

            People

            Assignee:
            jplindst Jan Lindström
            Reporter:
            ccounotte COUNOTTE CEDRIC
            Votes:
            2 Vote for this issue
            Watchers:
            5 Start watching this issue

              Dates

              Created:
              Updated:

                Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.