[MDEV-12369] Crash on idle Galera node (libgalera_smm, libssl, libcrypto) Created: 2017-03-27  Updated: 2019-05-20  Resolved: 2019-05-20

Status: Closed
Project: MariaDB Server
Component/s: Galera, SSL
Affects Version/s: 10.1.22
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Ján Regeš
Assignee: Jan Lindström (Inactive)
Resolution: Won't Fix
Votes: 0
Labels: Crash, Galera
Environment:

Gentoo, kernel 4.4.6, 8x Intel Xeon E5-2643 0 @ 3.30GHz, 8GB RAM, 3 Galera nodes on the same stable network (no geo-replication)


Attachments: File 2017-03-27_ab-arbitrator-crash_my.cnf     Text File 2017-03-27_ab-binlog.txt    

 Description   

Hi,

This morning at 08:10:24, one node of our 3-node Galera cluster crashed.

All 3 nodes were nearly idle, averaging about 10 selects/s and 1 insert or update/s. CPU, RAM and I/O were all idle.

The crash log is attached below. If you need it, I can send my.cnf from all 3 nodes.

Thank you for your support.

170327  8:10:24 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.1.22-MariaDB
key_buffer_size=33554432
read_buffer_size=1048576
max_used_connections=5
max_threads=152
thread_count=8
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 347165 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x80000
/usr/sbin/mysqld(my_print_stacktrace+0x29)[0x5598d88bd3bf]
/usr/sbin/mysqld(handle_fatal_signal+0x327)[0x5598d84420b5]
/lib64/libpthread.so.0(+0x104f0)[0x7f8309da54f0]
/lib64/libc.so.6(+0x89a90)[0x7f830922aa90]
/usr/lib64/libcrypto.so.1.0.0(+0x10b81a)[0x7f830a2be81a]
/usr/lib64/libcrypto.so.1.0.0(BIO_write+0x6c)[0x7f830a2b7b19]
/usr/lib64/libssl.so.1.0.0(ssl3_write_pending+0x68)[0x7f830a610aef]
/usr/lib64/libssl.so.1.0.0(ssl3_dispatch_alert+0x3a)[0x7f830a612cf4]
/usr/lib64/libssl.so.1.0.0(ssl3_shutdown+0xa2)[0x7f830a60e9aa]
/usr/lib/galera/libgalera_smm.so(_ZN4asio3ssl6detail2ioINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS5_EEEENS1_11shutdown_opEEEmRT_RNS1_11stream_coreERKT0_RNS_10error_codeE+0x57)[0x7f82fe0f8737]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket12close_socketEv+0xc1)[0x7f82fe0e3719]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket5closeEv+0x15b)[0x7f82fe0e3e6b]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm6GMCast11erase_protoESt17_Rb_tree_iteratorISt4pairIKPKvPNS_6gmcast5ProtoEEE+0xbe)[0x7f82fe098e72]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm6GMCast13handle_failedEPNS_6gmcast5ProtoE+0x1ed)[0x7f82fe0a53d3]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm6GMCast9handle_upEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x67b)[0x7f82fe0a7d63]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm10Protostack8dispatchEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x47)[0x7f82fe0d6921]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm12AsioProtonet8dispatchERKPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x41)[0x7f82fe106549]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket14failed_handlerERKN4asio10error_codeERKSsi+0x2c4)[0x7f82fe0e1ba2]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket25read_completion_conditionERKN4asio10error_codeEm+0x260)[0x7f82fe0e22be]
/usr/lib/galera/libgalera_smm.so(_ZN4asio6detail7read_opINS_3ssl6streamINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS6_EEEEEEN5boost5arrayINS_14mutable_bufferELm1EEENSB_3_bi6bind_tImNSB_4_mfi3mf2ImN5gcomm13AsioTcpSocketERKNS_10error_codeEmEENSF_5list3INSF_5valueINSB_10shared_ptrISK_EEEEPFNSB_3argILi1EEEvEPFNSU_ILi2EEEvEEEEENSG_IvNSI_IvSK_SN_mEES11_EEEclESN_mi+0x4aa)[0x7f82fe0fbdea]
/usr/lib/galera/libgalera_smm.so(_ZN4asio3ssl6detail5io_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS5_EEEENS1_7read_opINS_6detail17consuming_buffersINS_14mutable_bufferEN5boost5arrayISC_Lm1EEEEEEENSA_7read_opINS0_6streamIS8_EESF_NSD_3_bi6bind_tImNSD_4_mfi3mf2ImN5gcomm13AsioTcpSocketERKNS_10error_codeEmEENSL_5list3INSL_5valueINSD_10shared_ptrISQ_EEEEPFNSD_3argILi1EEEvEPFNS10_ILi2EEEvEEEEENSM_IvNSO_IvSQ_ST_mEES17_EEEEEclESR_mi+0x1d8)[0x7f82fe0fafd8]
/usr/lib/galera/libgalera_smm.so(_ZN4asio6detail23reactive_socket_recv_opINS_17mutable_buffers_1ENS_3ssl6detail5io_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS8_EEEENS4_7read_opINS0_17consuming_buffersINS_14mutable_bufferEN5boost5arrayISE_Lm1EEEEEEENS0_7read_opINS3_6streamISB_EESH_NSF_3_bi6bind_tImNSF_4_mfi3mf2ImN5gcomm13AsioTcpSocketERKNS_10error_codeEmEENSN_5list3INSN_5valueINSF_10shared_ptrISS_EEEEPFNSF_3argILi1EEEvEPFNS12_ILi2EEEvEEEEENSO_IvNSQ_IvSS_SV_mEES19_EEEEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationESV_m+0xd5)[0x7f82fe0fc055]
/usr/lib/galera/libgalera_smm.so(_ZN4asio6detail13epoll_reactor16descriptor_state11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationERKNS_10error_codeEm+0x115)[0x7f82fe0ee85d]
/usr/lib/galera/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x3d5)[0x7f82fe0ee125]
/usr/lib/galera/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x253)[0x7f82fe107681]
/usr/lib/galera/libgalera_smm.so(_ZN9GCommConn3runEv+0xc0)[0x7f82fe11eaf0]
/usr/lib/galera/libgalera_smm.so(_ZN9GCommConn6run_fnEPv+0x9)[0x7f82fe123540]
/lib64/libpthread.so.0(+0x731c)[0x7f8309d9c31c]
/lib64/libc.so.6(clone+0x6d)[0x7f8309280ced]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
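
The libgalera_smm.so frames in the backtrace above are mangled C++ symbols. A minimal sketch for making them readable, assuming the c++filt tool from GNU binutils is installed (the helper name demangle_trace is hypothetical, and the sketch passes the text through unchanged if c++filt is missing):

```shell
#!/bin/sh
# demangle_trace (hypothetical helper name)
# Reads a backtrace on stdin and demangles any C++ symbols found in it.
# Falls back to copying the input unchanged when c++filt is not installed.
demangle_trace() {
    if command -v c++filt >/dev/null 2>&1; then
        c++filt
    else
        cat
    fi
}

# One frame copied from the crash log above.
printf '%s\n' \
    '/usr/lib/galera/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket5closeEv+0x15b)[0x7f82fe0e3e6b]' \
    | demangle_trace
```

With c++filt available, the symbol in that frame reads as gcomm::AsioTcpSocket::close(). Demangling only renames the frames; mapping the offsets to source lines still needs the debug symbols asked for in the comments.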



 Comments   
Comment by Daniel Black [ 2017-03-27 ]

A config file of the crashed node would be great, assuming all nodes are similar with respect to Galera configuration. This looks like a multicast network, right? Did other nodes crash in the same way? Was there anything odd in the logs of the other servers? Also, what Galera version do you have? What OpenSSL build? Do you have debug symbols for Galera and/or OpenSSL so that this stack trace can be mapped to line numbers? Can you tell what update/insert was happening (from the binary logs or otherwise)?

Comment by Ján Regeš [ 2017-03-27 ]

Hi Daniel,

I have attached my.cnf and the other requested information.

Config file: 2017-03-27_ab-arbitrator-crash_my.cnf
Galera in MariaDB 10.1.22: 3.17 (r447d194)
OpenSSL: 1.0.2k
Binlog from one of the other nodes, from around 08:10: 2017-03-27_ab-binlog.txt (after the crashed node was restarted it performed an SST, so its own binlogs were deleted)

I have no debug symbols for now. For further debugging, I will enable the "debug" USE flag for MariaDB in Gentoo and reinstall all 3 nodes.

The other 2 nodes (our names: master+slave) kept working fine after the third node (our name: arbitrator) crashed. Just for clarification, all 3 nodes are fully featured MariaDB data instances in a multi-master setup (there is no dummy "arbitrator node").

The log from our "master" server at the time of the third node's crash is below. It looks like an SSL error?

2017-03-27  8:10:24 140068744328960 [Warning] WSREP: read_completion_condition(): decryption failed or bad record mac (336130329: 'error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac')
2017-03-27  8:10:24 140068744328960 [Warning] WSREP: read_handler(): decryption failed or bad record mac (336130329: 'error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac')
2017-03-27  8:10:24 140068744328960 [Note] WSREP: (b423f2f1, 'ssl://0.0.0.0:4567') turning message relay requesting on, nonlive peers: ssl://10.234.4.35:4567
2017-03-27  8:10:25 140068744328960 [Note] WSREP: (b423f2f1, 'ssl://0.0.0.0:4567') reconnecting to 0399a74f (ssl://10.234.4.35:4567), attempt 0
2017-03-27  8:10:39 140068744328960 [Note] WSREP: evs::proto(b423f2f1, GATHER, view_id(REG,0399a74f,151)) suspecting node: 0399a74f
2017-03-27  8:10:39 140068744328960 [Note] WSREP: evs::proto(b423f2f1, GATHER, view_id(REG,0399a74f,151)) suspected node without join message, declaring inactive
2017-03-27  8:10:39 140068744328960 [Note] WSREP: declaring 9cd4d86f at ssl://10.234.1.38:4567 stable
2017-03-27  8:10:39 140068744328960 [Note] WSREP: Node 9cd4d86f state prim
2017-03-27  8:10:39 140068744328960 [Note] WSREP: view(view_id(PRIM,9cd4d86f,152) memb {
    9cd4d86f,0
    b423f2f1,0
} joined {
} left {
} partitioned {
    0399a74f,0
})
2017-03-27  8:10:39 140068744328960 [Note] WSREP: save pc into disk
2017-03-27  8:10:39 140068744328960 [Note] WSREP: forgetting 0399a74f (ssl://10.234.4.35:4567)
2017-03-27  8:10:39 140068744328960 [Note] WSREP: (b423f2f1, 'ssl://0.0.0.0:4567') turning message relay requesting off
2017-03-27  8:10:39 140068731741952 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2017-03-27  8:10:39 140068731741952 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2017-03-27  8:10:39 140068731741952 [Note] WSREP: STATE EXCHANGE: sent state msg: 1d1e4f7b-12b4-11e7-977d-6bdee4b70ea4
2017-03-27  8:10:39 140068731741952 [Note] WSREP: STATE EXCHANGE: got state msg: 1d1e4f7b-12b4-11e7-977d-6bdee4b70ea4 from 0 (ab_slave)
2017-03-27  8:10:39 140068731741952 [Note] WSREP: STATE EXCHANGE: got state msg: 1d1e4f7b-12b4-11e7-977d-6bdee4b70ea4 from 1 (ab_master)
2017-03-27  8:10:39 140068731741952 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 55,
    members    = 2/2 (joined/total),
    act_id     = 15978707,
    last_appl. = 15978649,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = cb18d9d7-dc8f-11e6-9b71-37b745222498
2017-03-27  8:10:39 140068731741952 [Note] WSREP: Flow-control interval: [23, 23]
2017-03-27  8:10:39 140070029170432 [Note] WSREP: New cluster view: global state: cb18d9d7-dc8f-11e6-9b71-37b745222498:15978707, view# 56: Primary, number of nodes: 2, my index: 1, protocol version 3
2017-03-27  8:10:39 140070029170432 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-03-27  8:10:39 140068798850816 [Note] WSREP: Service thread queue flushed.
2017-03-27  8:10:39 140070029170432 [Note] WSREP: Assign initial position for certification: 15978707, protocol version: 3
2017-03-27  8:10:39 140068798850816 [Note] WSREP: Service thread queue flushed.
2017-03-27  8:10:41 140068744328960 [Note] WSREP:  cleaning up 0399a74f (ssl://10.234.4.35:4567)
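
The number 336130329 in the two warnings above is the decimal form of the OpenSSL error code 1408F119 shown next to it. A small sketch for splitting such a code into its fields, assuming the OpenSSL 1.0.x packing of 8 bits for the library, 12 bits for the function and 12 bits for the reason (the helper name decode_ssl_err is hypothetical):

```shell
#!/bin/sh
# decode_ssl_err DECIMAL_CODE (hypothetical helper name)
# Prints the hex form of an OpenSSL error code plus its three fields,
# assuming the OpenSSL 1.0.x layout: 8-bit lib, 12-bit func, 12-bit reason.
decode_ssl_err() {
    e=$1
    printf 'hex=%08X lib=%d func=%d reason=%d\n' \
        "$e" $(( (e >> 24) & 0xFF )) $(( (e >> 12) & 0xFFF )) $(( e & 0xFFF ))
}

decode_ssl_err 336130329
# prints: hex=1408F119 lib=20 func=143 reason=281
```

Where the openssl command-line tool is installed, `openssl errstr 1408F119` turns the hex code back into its text form.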

Comment by Daniel Black [ 2017-03-27 ]

FEATURES=splitdebug

SSL seems a likely factor - the crash was in ssl3_dispatch_alert

Can you just validate your SSL certs with gnutls-cli or the openssl s_client/s_server against each other?

Having said that, it shouldn't crash regardless of the state of the certs.
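
A minimal sketch of such a check with the openssl command-line tool. The helper name check_cert_pair and port 4433 are assumptions for illustration; the file paths in the comments are the ones the reporter lists in the next comment.

```shell
#!/bin/sh
# check_cert_pair CA CERT KEY (hypothetical helper name)
# Succeeds only if CERT verifies against CA and KEY is the private key
# that belongs to CERT.
check_cert_pair() {
    ca=$1; cert=$2; key=$3
    # Chain check: does the certificate verify against the CA?
    openssl verify -CAfile "$ca" "$cert" >/dev/null 2>&1 || return 1
    # Pair check: the public key in the certificate must equal the
    # public key derived from the private key.
    cert_pub=$(openssl x509 -in "$cert" -noout -pubkey) || return 1
    key_pub=$(openssl pkey -in "$key" -pubout) || return 1
    [ "$cert_pub" = "$key_pub" ]
}

# Example call on one node:
#   check_cert_pair /etc/mysql/ssl/ca.crt /etc/mysql/ssl/server.crt \
#       /etc/mysql/ssl/server.key && echo "cert/key OK"
#
# Handshake test between two nodes (4433 is an arbitrary free port):
#   node A: openssl s_server -accept 4433 -cert /etc/mysql/ssl/server.crt \
#               -key /etc/mysql/ssl/server.key -CAfile /etc/mysql/ssl/ca.crt
#   node B: openssl s_client -connect <node-A-address>:4433 \
#               -CAfile /etc/mysql/ssl/ca.crt
```

A clean result here only rules out a broken certificate setup. It would not by itself explain the "bad record mac" warning or the crash.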

Comment by Ján Regeš [ 2017-03-28 ]

All 3 nodes share the same certificate, generated with the commands below. It looks fine.

In the past we used WAN replication, which was the reason for using SSL for replication traffic. Now all nodes are on the same LAN, so I will remove SSL.

# CA
openssl genrsa 2048 > /etc/mysql/ssl/ca.key
openssl req -sha256 -new -x509 -nodes -days 36500 -key /etc/mysql/ssl/ca.key -out /etc/mysql/ssl/ca.crt
 
# SERVER
openssl req -sha256 -newkey rsa:2048 -days 36500 -nodes -keyout /etc/mysql/ssl/server.key -out /etc/mysql/ssl/server.csr
openssl rsa -in /etc/mysql/ssl/server.key -out /etc/mysql/ssl/server.key
openssl x509 -sha256 -req -in /etc/mysql/ssl/server.csr -days 36500 -CA /etc/mysql/ssl/ca.crt -CAkey /etc/mysql/ssl/ca.key -set_serial 01 -out /etc/mysql/ssl/server.crt
 
# permissions to read *.key only for root and mysql
chown root:mysql /etc/mysql/ssl/*.key
chmod 0640 /etc/mysql/ssl/*.key

Comment by Jan Lindström (Inactive) [ 2019-05-20 ]

This does not look like a Galera problem but rather an external library problem.
