[MDEV-10041] Server crashes sporadically during bootstrap while running wsrep tests Created: 2016-05-07  Updated: 2016-11-07  Resolved: 2016-09-24

Status: Closed
Project: MariaDB Server
Component/s: Tests, wsrep
Affects Version/s: 10.1
Fix Version/s: 10.1.18

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Sergei Golubchik
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-7069 Fix buildbot failures in main server ... Stalled
Relates
relates to MDEV-8208 Sporadic SEGFAULT on startup Closed
Sprint: 10.2.1-5, 10.1.18

 Description   

http://buildbot.askmonty.org/buildbot/builders/kvm-deb-precise-amd64/builds/6218/steps/test_5/logs/stdio

wsrep.binlog_format                      w2 [ fail ]
        Test ended at 2016-05-06 22:29:12
 
CURRENT_TEST: wsrep.binlog_format
 
 
Failed to start mysqld.1
mysqltest failed but provided no output
 
 
 - saving '/run/shm/var/2/log/wsrep.binlog_format/' to '/run/shm/var/log/wsrep.binlog_format/'
 
Retrying test wsrep.binlog_format, attempt(2/3)...
 
worker[2] > Restart  - not started
***Warnings generated in error logs during shutdown after running tests: wsrep.binlog_format
 
160506 22:29:10 [ERROR] mysqld got signal 11 ;
Attempting backtrace. You can use the following information to find out

I can reproduce it by running

MTR_FEEDBACK_PLUGIN=1 perl mysql-test-run.pl  --verbose-restart --vardir="$(readlink -f /dev/shm/var)" --parallel=4 --force --retry=3 --max-save-core=0 --max-save-datadir=1 --suite=wsrep --max-test-fail=1

on a VM image, using the buildbot packages.
Sometimes it needs to be run more than once.

2016-05-08  0:14:59 140146525366144 [Note] /usr/sbin/mysqld (mysqld 10.1.14-MariaDB-1~precise) starting as process 10689 ...
2016-05-08  0:14:59 140146525366144 [Warning] Could not increase number of max_open_files to more than 1024 (request: 4162)
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Read nil XID from storage engines, skipping position init
2016-05-08  0:14:59 140146525366144 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2016-05-08  0:14:59 140146525366144 [Note] WSREP: wsrep_load(): Galera 25.3.15(r3578) by Codership Oy <info@codership.com> loaded successfully.
2016-05-08  0:14:59 140146525366144 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2016-05-08  0:14:59 140146525366144 [Warning] WSREP: Could not open state file for reading: '/run/shm/var/1/mysqld.1/data//grastate.dat'
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Passing config to GCS: base_dir = /run/shm/var/1/mysqld.1/data/; base_host = 127.0.0.1; base_port = 16001; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /run/shm/var/1/mysqld.1/data/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /run/shm/var/1/mysqld.1/data//galera.cache; gcache.page_size = 128M; gcache.size = 10M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignor
2016-05-08  0:14:59 140146366211840 [Note] WSREP: Service thread queue flushed.
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2016-05-08  0:14:59 140146525366144 [Note] WSREP: wsrep_sst_grab()
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Start replication
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2016-05-08  0:14:59 140146525366144 [Note] WSREP: protonet asio version 0
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Using CRC-32C for message checksums.
2016-05-08  0:14:59 140146525366144 [Note] WSREP: backend: asio
2016-05-08  0:14:59 140146525366144 [Warning] WSREP: access file(/run/shm/var/1/mysqld.1/data//gvwstate.dat) failed(No such file or directory)
2016-05-08  0:14:59 140146525366144 [Note] WSREP: restore pc from disk failed
2016-05-08  0:14:59 140146525366144 [Note] WSREP: GMCast version 0
2016-05-08  0:14:59 140146525366144 [Note] WSREP: (c1bd0214, 'tcp://0.0.0.0:16001') listening at tcp://0.0.0.0:16001
2016-05-08  0:14:59 140146525366144 [Note] WSREP: (c1bd0214, 'tcp://0.0.0.0:16001') multicast: , ttl: 1
2016-05-08  0:14:59 140146525366144 [Note] WSREP: EVS version 0
2016-05-08  0:14:59 140146525366144 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer ''
2016-05-08  0:14:59 140146525366144 [Note] WSREP: start_prim is enabled, turn off pc_recovery
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Node c1bd0214 state prim
2016-05-08  0:14:59 140146525366144 [Note] WSREP: view(view_id(PRIM,c1bd0214,1) memb {
        c1bd0214,0
} joined {
} left {
} partitioned {
})
2016-05-08  0:14:59 140146525366144 [Note] WSREP: save pc into disk
2016-05-08  0:14:59 140146525366144 [Note] WSREP: gcomm: connected
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2016-05-08  0:14:59 140146525366144 [Note] WSREP: Waiting for SST to complete.
2016-05-08  0:14:59 140146305394432 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
2016-05-08  0:14:59 140146305394432 [Note] WSREP: Starting new group from scratch: c1be5307-1498-11e6-ba3c-a618cc15550c
2016-05-08  0:14:59 140146305394432 [Note] WSREP: STATE_EXCHANGE: sent state UUID: c1be63c1-1498-11e6-bc0b-3a3f18654c93
2016-05-08  0:14:59 140146305394432 [Note] WSREP: STATE EXCHANGE: sent state msg: c1be63c1-1498-11e6-bc0b-3a3f18654c93
2016-05-08  0:14:59 140146305394432 [Note] WSREP: STATE EXCHANGE: got state msg: c1be63c1-1498-11e6-bc0b-3a3f18654c93 from 0 ()
2016-05-08  0:14:59 140146305394432 [Note] WSREP: Quorum results:
        version    = 3,
        component  = PRIMARY,
        conf_id    = 0,
        members    = 1/1 (joined/total),
        act_id     = 0,
        last_appl. = -1,
        protocols  = 0/7/3 (gcs/repl/appl),
        group UUID = c1be5307-1498-11e6-ba3c-a618cc15550c
2016-05-08  0:14:59 140146305394432 [Note] WSREP: Flow-control interval: [16, 16]
2016-05-08  0:14:59 140146305394432 [Note] WSREP: Restored state OPEN -> JOINED (0)
2016-05-08  0:14:59 140146296695552 [Note] WSREP: New cluster view: global state: c1be5307-1498-11e6-ba3c-a618cc15550c:0, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 3
2016-05-08  0:14:59 140146525366144 [Note] WSREP: SST complete, seqno: 0
2016-05-08  0:14:59 140146305394432 [Note] WSREP: Member 0.0 () synced with group.
2016-05-08  0:14:59 140146305394432 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
160508  0:14:59 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.14-MariaDB-1~precise
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=0
max_threads=1001
thread_count=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 388521 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0x7f7658412008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f765a3fbe10 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f76688438fb]
/usr/sbin/mysqld(handle_fatal_signal+0x475)[0x7f76683a2975]
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'partition' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'InnoDB' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'SEQUENCE' is disabled.
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)[0x7f76679a5cb0]
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'INNODB_SYS_DATAFILES' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'INNODB_SYS_TABLESTATS' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'INNODB_SYS_INDEXES' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'XTRADB_RSEG' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'INNODB_CMP_PER_INDEX' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'INNODB_TRX' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'CHANGED_PAGE_BITMAPS' is disabled.
2016-05-08  0:14:59 140146525366144 [Note] Plugin 'INNODB_FT_BEING_DELETED' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_LOCK_WAITS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_LOCKS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_TABLESPACES_ENCRYPTION' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'XTRADB_INTERNAL_HASH_TABLES' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_SYS_FIELDS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_CMPMEM_RESET' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'FEEDBACK' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_CMP' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_FT_INDEX_TABLE' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_SYS_TABLESPACES' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_MUTEXES' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_BUFFER_PAGE_LRU' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_SYS_FOREIGN_COLS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_CMP_RESET' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_BUFFER_POOL_STATS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_FT_INDEX_CACHE' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_SYS_FOREIGN' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_METRICS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_FT_DEFAULT_STOPWORD' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_CMPMEM' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_SYS_TABLES' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_SYS_COLUMNS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_FT_CONFIG' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_BUFFER_PAGE' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_CMP_PER_INDEX_RESET' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'XTRADB_READ_VIEW' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_SYS_SEMAPHORE_WAITS' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_CHANGED_PAGES' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_FT_DELETED' is disabled.
2016-05-08  0:15:00 140146525366144 [Note] Plugin 'INNODB_TABLESPACES_SCRUBBING' is disabled.
2016-05-08  0:15:00 140146525366144 [Warning] /usr/sbin/mysqld: unknown variable 'loose-feedback-debug-startup-interval=20'
2016-05-08  0:15:00 140146525366144 [Warning] /usr/sbin/mysqld: unknown variable 'loose-feedback-debug-first-interval=60'
2016-05-08  0:15:00 140146525366144 [Warning] /usr/sbin/mysqld: unknown variable 'loose-feedback-debug-interval=60'
2016-05-08  0:15:00 140146525366144 [Warning] /usr/sbin/mysqld: unknown variable 'loose-debug-sync-timeout=300'
/usr/sbin/mysqld(+0x97a47b)[0x7f766875247b]
/usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0xbd)[0x7f76683a580d]
2016-05-08  0:15:00 140146525366144 [Note] Server socket created on IP: '127.0.0.1'.
/usr/sbin/mysqld(_Z21ha_enable_transactionP3THDb+0x2f)[0x7f76683aa28f]
2016-05-08  0:15:00 140146296695552 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-05-08  0:15:00 140146296695552 [Note] WSREP: REPL Protocols: 7 (3, 2)
2016-05-08  0:15:00 140146366211840 [Note] WSREP: Service thread queue flushed.
2016-05-08  0:15:00 140146296695552 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2016-05-08  0:15:00 140146366211840 [Note] WSREP: Service thread queue flushed.
2016-05-08  0:15:00 140146296695552 [Note] WSREP: Synchronized with group, ready for connections
2016-05-08  0:15:00 140146296695552 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-05-08  0:15:00 140146525366144 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.1.14-MariaDB-1~precise'  socket: '/run/shm/var/tmp/1/mysqld.1.sock'  port: 16000  mariadb.org binary distribution
/usr/sbin/mysqld(_ZN3THD16init_for_queriesEv+0x94)[0x7f76682027a4]
/usr/sbin/mysqld(start_wsrep_THD+0x3d0)[0x7f7668343820]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a)[0x7f766799de9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f7665e0438d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): 
Connection ID (thread ID): 2
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file

#1  0x00007f76683a29af in handle_fatal_signal ()
#2  <signal handler called>
#3  0x00007f766875247b in ?? ()
#4  0x00007f76683a580d in ha_commit_trans(THD*, bool) ()
#5  0x00007f76683aa28f in ha_enable_transaction(THD*, bool) ()
#6  0x00007f76682027a4 in THD::init_for_queries() ()
#7  0x00007f7668343820 in start_wsrep_THD ()
#8  0x00007f766799de9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#9  0x00007f7665e0438d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x0000000000000000 in ?? ()



 Comments   
Comment by Nirbhay Choubey (Inactive) [ 2016-05-08 ]

This issue is not a recent regression. Looks like 10.1 is still affected by MDEV-8208.

Comment by Elena Stepanova [ 2016-06-05 ]

Recent example on 10.1:
wsrep.pool_of_threads
http://buildbot.askmonty.org/buildbot/builders/kvm-deb-trusty-amd64/builds/2888/steps/test_5/logs/stdio

Comment by Nirbhay Choubey (Inactive) [ 2016-06-06 ]

https://github.com/MariaDB/server/commit/f5b21aa858a97fb388e309fffbff0f2162c5368b

Comment by Sergei Golubchik [ 2016-06-27 ]

1. why does it crash?
2. The commit a2330c8 was, precisely, because old approach didn't work — it modified shared data concurrently, so you were fixing it over and over. Remember the last commit (70113ee)? Did you simply revert a2330c8 or you've fixed the old code somehow?

Comment by Nirbhay Choubey (Inactive) [ 2016-06-27 ]

1. why does it crash?

It crashed because one of the wsrep startup threads (rollbacker to be specific) can proceed
even when plugins were not fully initialized. This was fixed, but after last commit (a2330c8)
this issue got resurfaced.

The patch below easily repeats this issue (and also helps explain the problem).
https://gist.github.com/nirbhayc/5c453ac70f3cba5d5af1e2418dee3385

2. The commit a2330c8 was, precisely, because old approach didn't work — it modified shared data concurrently, so you were fixing it over and over. Remember the last commit (70113ee)? Did you simply revert a2330c8 or you've fixed the old code somehow?

The last patch before a2330c8 did actually fix this bug. I thought a2330c8 was a "better" way
to fix this issue, but unfortunately it did not fix the actual problem. I simply reverted a2330c8
and the issue does not repeat anymore.

Comment by Elena Stepanova [ 2016-08-27 ]

I have to add all wsrep tests to 10.1 unstable list because of this.

Comment by Sergei Golubchik [ 2016-09-22 ]

Thanks for the test case.

I don't think you need to revert a2330c8. It is correct and THD initialization is properly delayed until the server is ready. Your test case shows a different problem, related to the Aria engine hack ha_maria::implicit_commit. Normally engine's commit code can only be invoked when this engine has participated in the transaction and has registered itself to be called at commit/rollback time.

But ha_maria::implicit_commit is called unconditionally, so it has to be prepared to called anytime, and it wasn't. This is the fix:

--- a/storage/maria/ha_maria.cc
+++ b/storage/maria/ha_maria.cc
@@ -2839,9 +2839,10 @@ int ha_maria::implicit_commit(THD *thd, bool new_trn)
   int error;
   uint locked_tables;
   DYNAMIC_ARRAY used_tables;
+  extern my_bool plugins_are_initialized;
   
   DBUG_ENTER("ha_maria::implicit_commit");
-  if (!maria_hton || !(trn= THD_TRN))
+  if (!maria_hton || !plugins_are_initialized || !(trn= THD_TRN))
     DBUG_RETURN(0);
   if (!new_trn && (thd->locked_tables_mode == LTM_LOCK_TABLES ||
                    thd->locked_tables_mode == LTM_PRELOCKED_UNDER_LOCK_TABLES))

Comment by Nirbhay Choubey (Inactive) [ 2016-09-24 ]

http://lists.askmonty.org/pipermail/commits/2016-September/009912.html

Generated at Thu Feb 08 07:39:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.