Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-14282

SIGHUP after rsync SST causes crash in my_hash_element

    XMLWordPrintable

Details

    Description

      A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

      2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required: 
      	Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
      	Local state: 00000000-0000-0000-0000-000000000000:-1
      2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3
      2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.
      2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/'  --defaults-file '/app/mysql/config/galera.cnf'  --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '
      2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst
      2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)
      2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3
      2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.
      2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)
      	 at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
      2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.
      2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)
      2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0
      2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)
      2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S
      2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off
      2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
      WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)
      db-g3-bin.000042
      ls: cannot access db-node1-binlog.*: No such file or directory
      WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)
      WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)
      2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.
      2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453
      2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.
      2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
      2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log
      2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log
      InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041
      2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.
      2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start
      2017-11-03 17:07:15 140515632003328 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235
      2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.
      2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started
      2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.
       
      Status information:
       
      Current dir: /app/mysql/galera/
      Running threads: 2  Stack size: 295936
      Current locks:
      lock: 0x7fc98da48788:
       
      lock: 0x7fc9cdc6a508:
       
      lock: 0x7fcc598b5520:
       
      lock: 0x7fcc598b5440:
       
      lock: 0x7fcc598b4480:
       
      lock: 0x7fcc598b43a0:
       
      lock: 0x7fcc598b42c0:
       
      lock: 0x7fcc598b4640:
       
      lock: 0x7fcc598b2880:
       
      lock: 0x7fcc598b5360:
       
      lock: 0x7fcc598b2f80:
       
      lock: 0x7fcc598b3220:
       
      lock: 0x7fcc598b2ea0:
       
      lock: 0x7fcc598b3140:
       
      lock: 0x7fcc598b2dc0:
       
      lock: 0x7fcc598b3060:
       
      lock: 0x7fcc598b35a0:
       
      lock: 0x7fcc598b3680:
       
      lock: 0x7fcc598b3760:
       
      lock: 0x7fcc598b2ce0:
       
      lock: 0x7fcc598b2a40:
       
      lock: 0x7fcc598b2c00:
       
      lock: 0x7fcc598b2960:
       
      lock: 0x7fcc598b2b20:
       
      lock: 0x7fcc598b3300:
       
      lock: 0x7fcc598b33e0:
       
      lock: 0x7fcc598b34c0:
       
      lock: 0x7fcc598b4fe0:
       
      lock: 0x7fcc598b5280:
       
      lock: 0x7fcc598b51a0:
       
      lock: 0x7fcc598b50c0:
       
      lock: 0x7fcc598b4c60:
       
      lock: 0x7fcc598b4b80:
       
      lock: 0x7fcc598b4aa0:
       
      lock: 0x7fcc598b49c0:
       
      lock: 0x7fcc598b48e0:
       
      lock: 0x7fcc598b4e20:
       
      lock: 0x7fcc598b4800:
       
      lock: 0x7fcc598b4720:
       
      lock: 0x7fcc598b4f00:
       
      lock: 0x7fcc598b4560:
       
      lock: 0x7fcc598b4100:
       
      lock: 0x7fcc598b41e0:
       
      lock: 0x7fcc598b4020:
       
      lock: 0x7fcc598b3f40:
       
      lock: 0x7fcc598b3bc0:
       
      lock: 0x7fcc598b3e60:
       
      lock: 0x7fcc598b3d80:
       
      lock: 0x7fcc598b3ae0:
       
      lock: 0x7fcc598b3ca0:
       
      lock: 0x7fcc598b3840:
       
      lock: 0x7fcc598b3920:
       
      lock: 0x7fcc598b3a00:
       
      lock: 0x7fcc598b4d40:
       
       
      Key caches:
      default
      Buffer_size:     134217728
      Block_size:           1024
      Division_limit:        100
      Age_threshold:         300
      Partitions:              0
      blocks used:             0
      not flushed:             0
      w_requests:              0
      writes:                  0
      r_requests:              0
      reads:                   0
       
       
      handler status:
      read_key:            0
      read_next:           0
      read_rnd             0
      read_first:          0
      write:               0
      delete               0
      update:              0
       
      Table status:
      Opened tables:          2
      Open tables:            2
      Open files:             8
      Open streams:           0
       
      Alarm status:
      Active alarms:   0
      Max used alarms: 0
      Next alarm time: 0
       
      Memory status:
      Non-mmapped space allocated from system: 0
      Number of free chunks:			 1
      Number of fastbin blocks:		 0
      Number of mmapped regions:		 0
      Space in mmapped regions:		 0
      Maximum total allocated space:		 0
      Space available in freed fastbin blocks: 0
      Total allocated space:			 0
      Total free space:			 0
      Top-most, releasable space:		 0
      Estimated memory (with thread stack):    591872
       
       
       
      Events status:
      LLA = Last Locked At  LUA = Last Unlocked At
      WOC = Waiting On Condition  DL = Data Locked
      The Event Scheduler is disabled
       
      2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.
      2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
      2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4
      171103 17:07:23 [ERROR] mysqld got signal 11 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.1.28-MariaDB
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=0
      max_threads=1002
      thread_count=10
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x48400
      mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]
      2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded
      2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table
      2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.
      Version: '10.1.28-MariaDB'  socket: '/app/mysql/temporaries/galera/mysqld.sock'  port: 3306  MariaDB Server
      2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.
      2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)
      mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]
      /lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]
      addr2line: 'mysqld': No such file
      mysqld(my_hash_element+0x13)[0x7fcc5900c023]
      mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]
      mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]
      mysqld(signal_hand+0x382)[0x7fcc5892c812]
      /lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]
      /lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]
      The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
      information that should help you find out what is causing the crash.
      

      I'm not sure if it could provide any hint as to the cause of the crash, but the donor's error log contained a lot of messages about long semaphore waits involving log_sys->checkpoint_lock during the time period in which it acted as donor:

      InnoDB: ###### Diagnostic info printed to the standard error stream
      WSREP: avoiding InnoDB self crash due to long semaphore wait of  > 600 seconds
      Server is processing SST donor operation, fatal_cnt now: 0InnoDB: Warning: a long semaphore wait:
      --Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
      S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
      a writer (thread id 140439973771008) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: 0
      Last time read locked in file log0log.cc line 2352
      Last time write locked in file log0log.cc line 2109
      Holder thread 0 file not yet reserved line 0
      InnoDB: Warning: semaphore wait:
      --Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
      S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
      a writer (thread id 140439973771008) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: 0
      Last time read locked in file log0log.cc line 2352
      Last time write locked in file log0log.cc line 2109
      Holder thread 0 file not yet reserved line 0
      InnoDB: Warning: Writer thread is waiting this semaphore:
      --Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
      S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
      a writer (thread id 140439973771008) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: 0
      Last time read locked in file log0log.cc line 2352
      Last time write locked in file log0log.cc line 2109
      Holder thread 0 file not yet reserved line 0
      

      In case it is relevant, the donor node was a replication master, but not a slave.

      After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

      Attachments

        Issue Links

          Activity

            People

              Elkin Andrei Elkin
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.