[MDEV-14036] Primary node of MariaDB 10.1 got signal 11 suddenly Created: 2017-10-10  Updated: 2019-05-20  Resolved: 2019-05-20

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.22
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: y-taka Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug Votes: 0
Labels: crash, galera
Environment:

RedHat Enterprise Linux 6.8


Attachments: Text File 20171013_processlist+innodb_status.txt     Zip Archive node1_info.zip     Zip Archive node2_info.zip     Zip Archive node3_info.zip    

 Description   

We have a MariaDB 10.1 Galera environment with three nodes.
We use it in a single-primary configuration, i.e. update queries are issued only to Node1.

In this cluster, Node1 suddenly goes down with signal 11.
This has already happened several times; the time of day and the queries being executed vary each time (i.e. there is no fixed trigger).

Server resources (memory, disk, CPU) have plenty of headroom,
and the application workload is also not high.



 Comments   
Comment by Elena Stepanova [ 2017-10-10 ]

From node1_info.zip:

2017-10-10 10:25:37 140543307295488 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-10-10 10:25:37 140543307295488 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-10-10 10:25:37 140543307295488 [Note] WSREP: Assign initial position for certification: 12938747, protocol version: 3
2017-10-10 10:25:37 140572056614656 [Note] WSREP: Service thread queue flushed.
2017-10-10 10:25:37 140571995793152 [Note] WSREP: Member 1.0 (Node3) requested state transfer from '*any*'. Selected 2.0 (Node2)(SYNCED) as donor.
2017-10-10 10:25:37 140571995793152 [Note] WSREP: 2.0 (Node2): State transfer to 1.0 (Node3) complete.
2017-10-10 10:25:37 140571995793152 [Note] WSREP: Member 2.0 (Node2) synced with group.
2017-10-10 10:25:37 140571995793152 [Note] WSREP: 1.0 (Node3): State transfer from 2.0 (Node2) complete.
2017-10-10 10:25:37 140571995793152 [Note] WSREP: Member 1.0 (Node3) synced with group.
2017-10-10 10:25:39 140572006283008 [Note] WSREP: (309f4f13, 'tcp://0.0.0.0:4567') turning message relay requesting off
171010 10:42:58 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.22-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=1001
max_threads=507
thread_count=39
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1254863 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fd2d8c12008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fd2c9ff40e0 thread_stack 0x48400
2017-10-10 10:43:32 140541433391872 [Warning] Aborted connection 20472620 to db: 'unconnected' user: 'openexchange' host: '172.29.188.162' (Got an error reading communication packets)
2017-10-10 10:43:34 140541875866368 [Warning] Aborted connection 20472625 to db: 'unconnected' user: 'openexchange' host: '172.29.188.162' (Got an error reading communication packets)

Not much to go on here.

Comment by Elena Stepanova [ 2017-10-10 ]

y-taka, could you please enable coredump creation on the nodes? Hopefully, when the problem occurs next time, the server will at least be able to produce more diagnostics, a stack trace, etc. Right now we really have nothing to work with.
Here are the instructions for enabling it:
https://mariadb.com/kb/en/library/how-to-produce-a-full-stack-trace-for-mysqld/

We don't expect you to use a debug binary in the production environment, but getting the stack trace even from a non-debug version might help.
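
For reference, the KB steps boil down to allowing core files at the OS level and asking mysqld to write one. A minimal sketch, assuming mysqld_safe is used and the config lives in /etc/my.cnf (paths and section layout on your system may differ):

  # /etc/my.cnf (sketch)
  [mysqld]
  core-file                      # dump core on crash

  [mysqld_safe]
  core-file-size=unlimited       # let mysqld_safe raise the core size limit

  # before (re)starting mysqld, as root:
  ulimit -c unlimited
  sysctl -w fs.suid_dumpable=2   # needed because mysqld switches users after starting as root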

Comment by y-taka [ 2017-10-11 ]

To Elena,

Thank you for the comments. We have already set the "core-file" parameter, but no coredump was created when the error happened.

Also, we do not use a debug binary.

Comment by Daniel Black [ 2017-10-11 ]

Check that core-file-size is unlimited, or LimitCORE for systemd (https://mariadb.com/kb/en/library/systemd/).
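
On a systemd system that would be a drop-in like the one below (a sketch; the file path is an example). Note that RHEL 6.8, as reported in this ticket, uses SysV init, so the ulimit route sketched above applies instead:

  # /etc/systemd/system/mariadb.service.d/core.conf (example path)
  [Service]
  LimitCORE=infinity

  # then reload and restart:
  systemctl daemon-reload
  systemctl restart mariadb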

Comment by y-taka [ 2017-10-11 ]

To Daniel,

I checked that core-file-size is "unlimited"; 9284 is the PID of the mysqld process.

  # cat /proc/9284/limits | grep core
    Max core file size            unlimited            unlimited            bytes
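
Even with an unlimited core file size, the kernel may pipe cores to an external handler instead of writing a file; on RHEL 6 this is typically abrt. A quick check (a sketch; the piped line is example output showing what an abrt setup looks like, and the replacement pattern is only an example):

  # cat /proc/9284/limits | grep core        # already verified above
  # cat /proc/sys/kernel/core_pattern
    |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e
  # sysctl -w kernel.core_pattern=/var/core/core.%e.%p   # write plain core files instead
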
Comment by y-taka [ 2017-10-13 ]

The problem (the signal 11 error) reproduced today.
So I executed "kill -11" on the mysqld process, and a core dump was created.

I also found some new things:

+ After the signal 11 occurred, I could still log in to MariaDB (but it took longer than usual)
+ I could execute "show processlist" and "show engine innodb status"
+ but I could not execute "use information_schema" or "use xxx_db" -> they hang

20171013_processlist+innodb_status.txt
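
A core captured this way can be inspected with gdb, as described in the KB article linked above. A minimal sketch (the binary and core paths are assumptions; adjust to your layout):

  # gdb /usr/sbin/mysqld /var/core/core.<pid>
  (gdb) set pagination off
  (gdb) thread apply all bt full    # full backtrace of every thread
  (gdb) quit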

Comment by Andrii Nikitin (Inactive) [ 2017-10-13 ]

The dmesg.log of the nodes indicates quite a few hardware/filesystem problems, with messages like:

mptscsih: ioc0: attempting task abort! (sc=ffff8806db5480c0)
sd 2:0:1:0: [sdb] CDB: Write(10): 2a 00 09 44 d4 78 00 00 08 00
mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff8806db5480c0)

Node 2 had such a problem right before the crash on Node 1:

Oct 10 10:22:44 bvea650824 kernel: mptscsih: ioc0: attempting task abort! (sc=ffff88092c146980)
Oct 10 10:22:44 bvea650824 kernel: sd 2:0:1:0: [sdb] CDB: Write(10): 2a 00 00 05 fc 48 00 00 18 00
Oct 10 10:22:46 bvea650824 kernel: mptscsih: ioc0: task abort: SUCCESS (rv=2002) (sc=ffff88092c146980)

So far it looks like hardware problems caused the instability; do you agree?
Is sdb the disk where mysqld stores its data? Could you provide the output of `mount` to clarify that?
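
To verify whether sdb itself is failing, checks along these lines could help (a sketch; smartctl comes from the smartmontools package and may need to be installed):

  # grep -c mptscsih /var/log/messages   # how often the abort messages occur
  # smartctl -a /dev/sdb                 # overall health status plus error counters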

Comment by y-taka [ 2017-10-16 ]

To Andrii,

Thanks for the feedback. I will investigate the hardware layer.
Below is the mount command output; "/data" is used by mysqld.

  # mount
    /dev/mapper/vg1-R on / type ext4 (rw,errors=panic)
    proc on /proc type proc (rw)
    sysfs on /sys type sysfs (rw)
    devpts on /dev/pts type devpts (rw,gid=5,mode=620)
    tmpfs on /dev/shm type tmpfs (rw)
    /dev/sda2 on /boot type ext4 (rw,errors=panic)
    /dev/mapper/vg1-data on /data type ext4 (rw,errors=panic)
    /dev/mapper/vg1-home on /home type ext4 (rw,errors=panic)
    /dev/mapper/vg2-ssd on /ssd type ext4 (rw,errors=panic)
    /dev/mapper/vg1-tmp on /tmp type ext4 (rw,errors=panic)
    /dev/mapper/vg1-var on /var type ext4 (rw,errors=panic)
    /dev/mapper/vg1-var_core on /var/core type ext4 (rw,errors=panic)
    /dev/mapper/vg1-var_crash on /var/crash type ext4 (rw,errors=panic)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

Comment by y-taka [ 2017-10-16 ]

To Andrii,

The problem node is "Node1", but the "mptscsih:" messages you quoted appear only in Node2's /var/log/messages.

When the signal 11 happened on Node1, Node2 worked without any problem.
So I think "mptscsih:" is not related to the signal 11... what do you think?

Comment by Andrii Nikitin (Inactive) [ 2017-10-18 ]

As mentioned earlier, I do see quite a few of those messages in the dmesg of Node 1 as well. That is a bad sign, and it is very likely what is causing the problems.

Please feel free to send the logs every time the issue occurs, and to upload a compressed core dump to the ticket or to ftp.askmonty.org/private, and we will try to find more hints.
