Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-9879

Locking issues similar to MDEV-8869

    XMLWordPrintable

Details

    Description

      A user is seeing locking issues similar to MDEV-8869 in MariaDB Galera Cluster 10.0.24, which contains the fix for MDEV-8869. Some of the symptoms the user is seeing are:

      • Long semaphore waits printed in MariaDB error log.
      • Nodes hang for long periods of time.
      • Nodes can't seem to keep up with network traffic (probably because mysqld is hung), so there is a lot of flapping, with nodes leaving and rejoining the cluster frequently.
      • While this is happening, sar doesn't really show a high amount of system activity.

      Here are some examples from error logs:

      160404 16:47:08 [Note] WSREP: Member 1.0 (node01) requested state transfer from '*any*'. Selected 4.0 (node05)(SYNCED) as donor.
      160404 16:47:08 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 22299377)
      160404 16:47:08 [Note] WSREP: Requesting state transfer: success after 2 tries, donor: 4
      160404 16:47:08 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 476845979 bytes
      160404 16:47:08 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 15/53 locked buffers
      160404 16:47:08 [Note] WSREP: Receiving IST: 121 writesets, seqnos 22299203-22299324
      160404 16:47:08 [Note] WSREP: 4.0 (node05): State transfer to 1.0 (node01) complete.
      160404 16:47:08 [Note] WSREP: Member 4.0 (node05) synced with group.
      InnoDB: Warning: a long semaphore wait:
      --Thread 139857691129600 has waited at log0log.cc line 2317 for 1663.0 seconds the semaphore:
      S-lock on RW-latch at 0x7f34ae046e58 '&log_sys->checkpoint_lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file log0log.cc line 2317
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/log/log0log.cc line 2086
      InnoDB: Warning: semaphore wait:
      --Thread 139857691129600 has waited at log0log.cc line 2317 for 1663.0 seconds the semaphore:
      S-lock on RW-latch at 0x7f34ae046e58 '&log_sys->checkpoint_lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file log0log.cc line 2317
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/log/log0log.cc line 2086
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending preads 0, pwrites 0
      160404 17:22:26 [Warning] WSREP: last inactive check more than PT1.5S ago (PT24M4.71841S), skipping check
      160404 17:22:26 [Note] WSREP: (9650bd48, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://ip2:4567 tcp://ip3:4567 tcp://ip4:4567 tcp://ip5:4567 
      160404 17:22:27 [Note] WSREP: (9650bd48, 'tcp://0.0.0.0:4567') reconnecting to 91737ffe (tcp://ip2:4567), attempt 0
      160404 17:22:27 [Note] WSREP: (9650bd48, 'tcp://0.0.0.0:4567') reconnecting to 9ff82a4e (tcp://ip3:4567), attempt 0
      160404 17:22:27 [Note] WSREP: (9650bd48, 'tcp://0.0.0.0:4567') reconnecting to a99e148d (tcp://ip4:4567), attempt 0
      160404 17:22:27 [Note] WSREP: (9650bd48, 'tcp://0.0.0.0:4567') reconnecting to ee8a5f69 (tcp://ip5:4567), attempt 0
      160404 17:22:28 [ERROR] WSREP: Writeset deserialization failed: : 93 (Protocol not supported)
      	 at galera/src/trx_handle.cpp:unserialize():261
      	 at galera/src/trx_handle.cpp:unserialize():268
      WS flags:      0
      Trx proto:     -1
      Trx source:    00000000-0000-0000-0000-000000000000
      Trx conn_id:   18446744073709551615
      Trx trx_id:    18446744073709551615
      Trx last_seen: -1
      160404 17:22:28 [ERROR] WSREP: got exception while reading ist stream: : 93 (Protocol not supported)
      	 at galera/src/trx_handle.cpp:unserialize():261
      	 at galera/src/trx_handle.cpp:unserialize():268
      160404 17:22:28 [ERROR] WSREP: RecordSet checksum does not match:
      computed: 9161146c 0ec51b37 1ba72de7 aae2001e
      found:    039dd37e 0cc814fe 72eb77d0 f66769ce: 22 (Invalid argument)
      	 at galerautils/src/gu_rset.cpp:checksum():392
      160404 17:22:28 [ERROR] WSREP: receiving IST failed, node restart required: Writeset checksum failed: 22 (Invalid argument)
      	 at galera/src/write_set_ng.hpp:checksum_fin():813
      160404 17:22:28 [ERROR] WSREP: failed trx: source: ee8a5f69-f786-11e5-9555-f7e9604eaa9f version: 3 local: 0 state: EXECUTING flags: 257 conn_id: 18152 trx_id: 22314288 seqnos (l: -1, g: 22299350, s: 22299349, d: 22299335, ts: 328879474354843)
      

      160404  1:23:00 [Note] WSREP: Prepared IST receiver, listening at: tcp://ip1:4568
      160404  1:23:00 [Note] WSREP: Member 0.0 (node01) requested state transfer from '*any*'. Selected 4.0 (node05)(SYNCED) as donor.
      160404  1:23:00 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 21921776)
      160404  1:23:00 [Note] WSREP: Requesting state transfer: success, donor: 4
      160404  1:23:00 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
      160404  1:23:00 [Note] WSREP: 4.0 (node05): State transfer to 0.0 (node01) complete.
      160404  1:23:00 [Note] WSREP: Member 4.0 (node05) synced with group.
      160404  1:23:00 [Note] WSREP: Receiving IST: 267 writesets, seqnos 21921509-21921776
      160404  1:23:02 [Note] WSREP: IST received: cca366dc-9938-11e5-9d4b-e7c47e7205aa:21921776
      160404  1:23:02 [Note] WSREP: 0.0 (node01): State transfer from 4.0 (node05) complete.
      160404  1:23:02 [Note] WSREP: Shifting JOINER -> JOINED (TO: 21921776)
      160404  1:23:02 [Note] WSREP: Member 0.0 (node01) synced with group.
      160404  1:23:02 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 21921776)
      160404  1:23:02 [Note] WSREP: Synchronized with group, ready for connections
      160404  1:23:02 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      160404  1:42:40 [Note] WSREP: Provider paused at cca366dc-9938-11e5-9d4b-e7c47e7205aa:21921909 (765946)
      160404  1:42:42 [Note] WSREP: resuming provider at 765946
      160404  1:42:42 [Note] WSREP: Provider resumed.
      InnoDB: Warning: a long semaphore wait:
      --Thread 140511877961472 has waited at row0ins.cc line 2757 for 690.00 seconds the semaphore:
      X-lock on RW-latch at 0x7fcafc547840 '&block->lock'
      a writer (thread id 140505694320384) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: 0
      Last time read locked in file not yet reserved line 0
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      InnoDB: Warning: a long semaphore wait:
      --Thread 140505492993792 has waited at log0log.cc line 2317 for 687.00 seconds the semaphore:
      S-lock on RW-latch at 0x7fcb75046e58 '&log_sys->checkpoint_lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file log0log.cc line 2317
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/log/log0log.cc line 2086
      InnoDB: Warning: semaphore wait:
      --Thread 140511877961472 has waited at row0ins.cc line 2757 for 691.00 seconds the semaphore:
      X-lock on RW-latch at 0x7fcafc547840 '&block->lock'
      a writer (thread id 140505694320384) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: 0
      Last time read locked in file not yet reserved line 0
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      InnoDB: Warning: semaphore wait:
      --Thread 140505492993792 has waited at log0log.cc line 2317 for 687.00 seconds the semaphore:
      S-lock on RW-latch at 0x7fcb75046e58 '&log_sys->checkpoint_lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file log0log.cc line 2317
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/log/log0log.cc line 2086
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending preads 0, pwrites 1
      160404 12:25:28 [Warning] WSREP: last inactive check more than PT1.5S ago (PT8M16.4901S), skipping check
      160404 12:25:28 [Note] WSREP: (81dcbf62, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://ip2:4567 tcp://ip3:4567 tcp://ip4:4567 tcp://ip5:4567 
      160404 12:25:29 [Note] WSREP: (81dcbf62, 'tcp://0.0.0.0:4567') reconnecting to ee8a5f69 (tcp://ip5:4567), attempt 0
      160404 12:25:29 [Note] WSREP: (81dcbf62, 'tcp://0.0.0.0:4567') reconnecting to 91737ffe (tcp://ip2:4567), attempt 0
      160404 12:25:29 [Note] WSREP: (81dcbf62, 'tcp://0.0.0.0:4567') reconnecting to 9ff82a4e (tcp://ip3:4567), attempt 0
      160404 12:25:29 [Note] WSREP: (81dcbf62, 'tcp://0.0.0.0:4567') reconnecting to a99e148d (tcp://ip4:4567), attempt 0
      

      160406 13:08:25 [Note] WSREP: Prepared IST receiver, listening at: tcp://ip2:4568
      160406 13:08:25 [Note] WSREP: Member 1.0 (node02) requested state transfer from '*any*'. Selected 3.0 (node04)(SYNCED) as donor.
      160406 13:08:25 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 23062904)
      160406 13:08:25 [Note] WSREP: Requesting state transfer: success, donor: 3
      160406 13:08:25 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 976567893 bytes
      160406 13:08:25 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 7/20 locked buffers
      160406 13:08:25 [Note] WSREP: 3.0 (node04): State transfer to 1.0 (node02) complete.
      160406 13:08:25 [Note] WSREP: Member 3.0 (node04) synced with group.
      160406 13:08:25 [Note] WSREP: Receiving IST: 110 writesets, seqnos 23062781-23062891
      160406 13:10:06 [Note] WSREP: IST received: cca366dc-9938-11e5-9d4b-e7c47e7205aa:23062891
      160406 13:10:06 [Note] WSREP: 1.0 (node02): State transfer from 3.0 (node04) complete.
      160406 13:10:06 [Note] WSREP: Shifting JOINER -> JOINED (TO: 23062928)
      160406 13:10:06 [Note] WSREP: Member 1.0 (node02) synced with group.
      160406 13:10:06 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 23062928)
      160406 13:10:31 [Note] WSREP: Synchronized with group, ready for connections
      160406 13:10:31 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      InnoDB: Warning: a long semaphore wait:
      --Thread 140531492501248 has waited at row0ins.cc line 2757 for 1052.0 seconds the semaphore:
      X-lock on RW-latch at 0x7fcec9cfbec0 '&block->lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file buf0flu.cc line 1073
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      160406 13:49:04 [Warning] WSREP: last inactive check more than PT1.5S ago (PT15M14.8588S), skipping check
      InnoDB: Warning: semaphore wait:
      --Thread 140531492501248 has waited at row0ins.cc line 2757 for 1052.0 seconds the semaphore:
      X-lock on RW-latch at 0x7fcec9cfbec0 '&block->lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file buf0flu.cc line 1073
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending preads 0, pwrites 1
      160406 13:49:05 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://ip1:4567 tcp://ip3:4567 tcp://ip4:4567 tcp://ip5:4567 
      160406 13:49:06 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to dfc18106 (tcp://ip1:4567), attempt 0
      160406 13:49:06 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to 9ff82a4e (tcp://ip3:4567), attempt 0
      160406 13:49:06 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to a99e148d (tcp://ip4:4567), attempt 0
      160406 13:49:06 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to 171dbfe9 (tcp://ip5:4567), attempt 0
      

      160406 15:51:07 [Note] WSREP: view(view_id(PRIM,171dbfe9,56) memb {
      	171dbfe9,0
      	91737ffe,0
      	9ff82a4e,0
      	a99e148d,0
      	dfc18106,0
      } joined {
      } left {
      } partitioned {
      })
      160406 15:51:07 [Note] WSREP: save pc into disk
      InnoDB: Warning: a long semaphore wait:
      --Thread 140531492501248 has waited at row0ins.cc line 2757 for 1654.0 seconds the semaphore:
      X-lock on RW-latch at 0x7fceebb48dc0 '&block->lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file buf0flu.cc line 1073
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      InnoDB: Warning: semaphore wait:
      --Thread 140531492501248 has waited at row0ins.cc line 2757 for 1654.0 seconds the semaphore:
      X-lock on RW-latch at 0x7fceebb48dc0 '&block->lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file buf0flu.cc line 1073
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      160406 16:18:38 [Note] WSREP: Flow-control interval: [16, 16]
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending preads 0, pwrites 1
      160406 16:18:39 [Note] WSREP: Received NON-PRIMARY.
      160406 16:18:39 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 23066454)
      160406 16:18:39 [Warning] WSREP: FLOW message from member 17179869184 in non-primary configuration. Ignored.
      160406 16:18:39 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
      160406 16:18:39 [Note] WSREP: Flow-control interval: [16, 16]
      160406 16:18:39 [Note] WSREP: Received NON-PRIMARY.
      160406 16:18:39 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 5
      160406 16:18:39 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
      160406 16:18:39 [Warning] WSREP: last inactive check more than PT1.5S ago (PT27M32.5022S), skipping check
      160406 16:18:39 [Note] WSREP: evs::proto(91737ffe, OPERATIONAL, view_id(REG,171dbfe9,56)) detected inactive node: 171dbfe9
      160406 16:18:39 [Note] WSREP: evs::proto(91737ffe, OPERATIONAL, view_id(REG,171dbfe9,56)) detected inactive node: 9ff82a4e
      160406 16:18:39 [Note] WSREP: evs::proto(91737ffe, OPERATIONAL, view_id(REG,171dbfe9,56)) detected inactive node: a99e148d
      160406 16:18:39 [Note] WSREP: evs::proto(91737ffe, OPERATIONAL, view_id(REG,171dbfe9,56)) detected inactive node: dfc18106
      160406 16:18:40 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to dfc18106 (tcp://ip1:4567), attempt 0
      160406 16:18:40 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to 9ff82a4e (tcp://ip3:4567), attempt 0
      160406 16:18:40 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to a99e148d (tcp://ip4:4567), attempt 0
      160406 16:18:40 [Note] WSREP: (91737ffe, 'tcp://0.0.0.0:4567') reconnecting to 171dbfe9 (tcp://ip5:4567), attempt 0
      

      160406 15:51:15 [Note] WSREP:  cleaning up 91737ffe (tcp://ip2:4567)
      InnoDB: Warning: a long semaphore wait:
      --Thread 140384540604160 has waited at row0ins.cc line 2757 for 1533.0 seconds the semaphore:
      X-lock on RW-latch at 0x7fac70f43140 '&block->lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file buf0flu.cc line 1073
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      InnoDB: Warning: semaphore wait:
      --Thread 140384540604160 has waited at row0ins.cc line 2757 for 1533.0 seconds the semaphore:
      X-lock on RW-latch at 0x7fac70f43140 '&block->lock'
      number of readers 0, waiters flag 0, lock_word: 100000
      Last time read locked in file buf0flu.cc line 1073
      Last time write locked in file /home/buildbot/buildbot/build/mariadb-10.0.24/storage/xtradb/ibuf/ibuf0ibuf.cc line 4787
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending preads 0, pwrites 0
      160406 16:37:46 [Warning] WSREP: last inactive check more than PT1.5S ago (PT22M36.0207S), skipping check
      160406 16:37:47 [Note] WSREP: (171dbfe9, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://ip1:4567 tcp://ip3:4567 tcp://ip4:4567 
      160406 16:37:48 [Note] WSREP: (171dbfe9, 'tcp://0.0.0.0:4567') reconnecting to dfc18106 (tcp://ip1:4567), attempt 0
      160406 16:37:48 [Note] WSREP: (171dbfe9, 'tcp://0.0.0.0:4567') reconnecting to 9ff82a4e (tcp://ip3:4567), attempt 0
      160406 16:37:48 [Note] WSREP: (171dbfe9, 'tcp://0.0.0.0:4567') reconnecting to a99e148d (tcp://ip4:4567), attempt 0
      

      Attachments

        Activity

          People

            jplindst Jan Lindström (Inactive)
            GeoffMontee Geoff Montee (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.