[MDEV-13950] mysqld_safe could not start Galera node after upgrade to 10.1.28 or 10.2.9 Created: 2017-09-29  Updated: 2020-08-25  Resolved: 2017-10-04

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.28, 10.2.9
Fix Version/s: 10.1.29, 10.2.10

Type: Bug Priority: Blocker
Reporter: sysdljr Assignee: Sachin Setiya (Inactive)
Resolution: Fixed Votes: 4
Labels: None
Environment:

CentOS 6.7


Attachments: File jessie.log    
Issue Links:
Duplicate
is duplicated by MDEV-14061 SElinuix preventing MariaDB + Galera ... Closed
is duplicated by MDEV-14063 mariadb102-server 10.2.9 cannot start... Closed
is duplicated by MDEV-14136 Cannot start a new galera cluster on ... Closed
Problem/Incident
is caused by MDEV-10767 /tmp/wsrep_recovery.${RANDOM} file cr... Closed
Sprint: 10.2.10

 Description   

Hi,
Today , We upgrade a node of three nodes galera from MariaDB 10.1.26 to 10.1.28 in test machines. after upgrade, start MariaDB server failed, but not error info in error log.

170929 11:10:25 mysqld_safe Starting mysqld daemon with databases from /data/gc/data
170929 11:10:25 mysqld_safe WSREP: Running position recovery with --disable-log-error  --pid-file='/data/gc/data/OG5300-FM
41-252-recover.pid'
2017-09-29 11:10:25 140515836905504 [Note] /usr/sbin/mysqld (mysqld 10.1.28-MariaDB) starting as process 13773 ...
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: The InnoDB memory heap is disabled
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Using Linux native AIO
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Using SSE crc32 instructions
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Initializing buffer pool, size = 8.0G
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Completed initialization of buffer pool
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Opened 4 undo tablespaces
2017-09-29 11:10:26 140515836905504 [Note] InnoDB: Highest supported file format is Barracuda.
2017-09-29 11:10:27 140515836905504 [Note] InnoDB: 128 rollback segment(s) are active.
2017-09-29 11:10:27 140515836905504 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log seque
nce number 26100895442
2017-09-29 11:10:27 140515836905504 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2017-09-29 11:10:27 140515836905504 [Note] Plugin 'FEEDBACK' is disabled.
2017-09-29 11:10:27 140515836905504 [Warning] Failed to create a socket for IPv6 '::': errno: 97.
2017-09-29 11:10:27 140515836905504 [Note] Server socket created on IP: '0.0.0.0'.
2017-09-29 11:10:27 140515836905504 [Note] WSREP: Recovered position: d1b940ba-608d-11e7-a924-ba9a2478cac0:10282408

if set wsrep_on=off, MariaDB server could start normal.
if downgrade to 10.1.26, set wsrep=on , MariaDB server could start normal too.

On other 10.1.26 node, start MariaDB service info at below, it look different as 10.1.28 node.
Why it has only very seldom wsrep info in 10.1.28 error log? please help confirm , thanks.

170929 11:12:19 mysqld_safe Starting mysqld daemon with databases from /data/gc/data
170929 11:12:19 mysqld_safe WSREP: Running position recovery with --log_error='/data/gc/data/wsrep_recovery.bxCldj' --pid-
file='/data/gc/data/OG5309-FM41-253-recover.pid'
2017-09-29 11:12:19 140448920930336 [Note] /usr/sbin/mysqld (mysqld 10.1.26-MariaDB) starting as process 28057 ...
170929 11:12:31 mysqld_safe WSREP: Recovered position d1b940ba-608d-11e7-a924-ba9a2478cac0:10282408
2017-09-29 11:12:31 140472818767904 [Note] /usr/sbin/mysqld (mysqld 10.1.26-MariaDB) starting as process 28107 ...
2017-09-29 11:12:31 140472818767904 [Note] WSREP: Read nil XID from storage engines, skipping position init
2017-09-29 11:12:31 140472818767904 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.
so'
2017-09-29 11:12:31 140472818767904 [Note] WSREP: wsrep_load(): Galera 25.3.20(r3703) by Codership Oy <info@codership.com>
 loaded successfully.
2017-09-29 11:12:31 140472818767904 [Note] WSREP: CRC-32C: using hardware acceleration.
2017-09-29 11:12:31 140472818767904 [Note] WSREP: Found saved state: d1b940ba-608d-11e7-a924-ba9a2478cac0:10282408, safe_t
o_bootsrap: 0
2017-09-29 11:12:31 140472818767904 [Note] WSREP: Recovering GCache ring buffer: version: 1, UUID: d1b940ba-608d-11e7-a924
-ba9a2478cac0, offset: 1428487245
2017-09-29 11:12:31 140472818767904 [Note] WSREP: GCache::RingBuffer initial scan (2147483688 bytes)... 0.0% (0 bytes) com
plete.
2017-09-29 11:12:32 140472818767904 [Note] WSREP: GCache::RingBuffer initial scan (2147483688 bytes)... 100.0% (2147483688
 bytes) complete.
2017-09-29 11:12:32 140472818767904 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 9309448-10282408
2017-09-29 11:12:32 140472818767904 [Note] WSREP: GCache::RingBuffer unused buffers scan (2147481783 bytes)... 0.0% (0 byt
es) complete.
2017-09-29 11:12:32 140472818767904 [Note] WSREP: GCache::RingBuffer unused buffers scan (2147481783 bytes)... 100.0% (214
7481783 bytes) complete.
2017-09-29 11:12:32 140472818767904 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): found 0/972961 locked buffers
2017-09-29 11:12:32 140472818767904 [Note] WSREP: GCache DEBUG: RingBuffer::recover(): used space: 2147481783/2147483648
[root@OG5309-FM41-253 11:13:06 log]# vi log.err
2017-09-27 11:57:05 140561910589184 [Note] WSREP: declaring 7478d999 at tcp://172.28.41.254:4567 stable
2017-09-27 11:57:05 140561910589184 [Note] WSREP: forgetting cf2f5d62 (tcp://172.28.41.252:4567)
2017-09-27 11:57:05 140561910589184 [Note] WSREP: Node 7478d999 state prim
2017-09-27 11:57:05 140561910589184 [Note] WSREP: view(view_id(PRIM,7478d999,26) memb {
        7478d999,0
        ca02a904,0
} joined {
} left {
} partitioned {
        cf2f5d62,0
})
2017-09-27 11:57:05 140561910589184 [Note] WSREP: save pc into disk
2017-09-27 11:57:05 140561910589184 [Note] WSREP: forgetting cf2f5d62 (tcp://172.28.41.252:4567)
2017-09-27 11:57:05 140561900099328 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2017-09-27 11:57:05 140561900099328 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2017-09-27 11:57:05 140561900099328 [Note] WSREP: STATE EXCHANGE: sent state msg: eceffe3e-a337-11e7-a36a-2f00347d9bdc
2017-09-27 11:57:05 140561900099328 [Note] WSREP: STATE EXCHANGE: got state msg: eceffe3e-a337-11e7-a36a-2f00347d9bdc from
 0 (node254)
2017-09-27 11:57:05 140561900099328 [Note] WSREP: STATE EXCHANGE: got state msg: eceffe3e-a337-11e7-a36a-2f00347d9bdc from
 1 (node253)
2017-09-27 11:57:05 140561900099328 [Note] WSREP: Quorum results:
        version    = 4,
        component  = PRIMARY,
        conf_id    = 25,
        members    = 2/2 (joined/total),
        act_id     = 10282408,
        last_appl. = 0,
        protocols  = 0/7/3 (gcs/repl/appl),
        group UUID = d1b940ba-608d-11e7-a924-ba9a2478cac0
2017-09-27 11:57:05 140561900099328 [Note] WSREP: Flow-control interval: [256, 256]
2017-09-27 11:57:05 140564255800064 [Note] WSREP: New cluster view: global state: d1b940ba-608d-11e7-a924-ba9a2478cac0:102
82408, view# 26: Primary, number of nodes: 2, my index: 1, protocol version 3
2017-09-27 11:57:05 140564255800064 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-09-27 11:57:05 140564255800064 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-09-27 11:57:05 140564255800064 [Note] WSREP: Assign initial position for certification: 10282408, protocol version: 3
2017-09-27 11:57:05 140561960924928 [Note] WSREP: Service thread queue flushed.
"log.err" 293L, 24946C
2017-09-29 11:12:47 140472818767904 [Note] WSREP: SST complete, seqno: 10282408
2017-09-29 11:12:47 140470464276224 [Note] WSREP: Member 0.0 (node253) synced with group.
2017-09-29 11:12:47 140470464276224 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 10282408)
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: The InnoDB memory heap is disabled
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: Compressed tables use zlib 1.2.3
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: Using Linux native AIO
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: Using SSE crc32 instructions
2017-09-29 11:12:47 140472818767904 [Note] InnoDB: Initializing buffer pool, size = 32.0G
2017-09-29 11:12:48 140472818767904 [Note] InnoDB: Completed initialization of buffer pool
2017-09-29 11:12:49 140472818767904 [Note] InnoDB: Opened 4 undo tablespaces
2017-09-29 11:12:49 140472818767904 [Note] InnoDB: Highest supported file format is Barracuda.
2017-09-29 11:12:50 140472818767904 [Note] InnoDB: 128 rollback segment(s) are active.
2017-09-29 11:12:50 140472818767904 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-82.1 started; log seque
nce number 26160484326
2017-09-29 11:12:50 140470474766080 [Note] WSREP: (1146e144, 'tcp://0.0.0.0:4567') connection to peer 1146e144 with addr t
cp://172.28.41.253:4567 timed out, no messages seen in PT3S
2017-09-29 11:12:50 140470474766080 [Note] WSREP: (1146e144, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-09-29 11:12:50 140472818767904 [Note] Plugin 'FEEDBACK' is disabled.
2017-09-29 11:12:50 140432369514240 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-09-29 11:12:50 7fb8f57ff700 InnoDB: Loading buffer pool(s) from /data/gc/data/ib_buffer_pool
2017-09-29 11:12:50 140472818767904 [Warning] Failed to create a socket for IPv6 '::': errno: 97.
2017-09-29 11:12:50 140472818767904 [Note] Server socket created on IP: '0.0.0.0'.
2017-09-29 11:12:50 140432232532736 [Note] Event Scheduler: scheduler thread started with id 4
2017-09-29 11:12:50 140472818767904 [Note] /usr/sbin/mysqld: ready for connections.
Version: '10.1.26-MariaDB'  socket: '/data/gc/data/mysql.sock'  port: 3309  MariaDB Server
2017-09-29 11:12:50 140470441208576 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-09-29 11:12:50 140470441208576 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-09-29 11:12:50 140470441208576 [Note] WSREP: Assign initial position for certification: 10282408, protocol version: 3
2017-09-29 11:12:50 140470525097728 [Note] WSREP: Service thread queue flushed.
2017-09-29 11:12:50 140470441208576 [Note] WSREP: Synchronized with group, ready for connections
2017-09-29 11:12:50 140470441208576 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-09-29 11:13:14 7fb8f57ff700 InnoDB: Buffer pool(s) load completed at 170929 11:13:14 



 Comments   
Comment by Andrii Nikitin (Inactive) [ 2017-09-29 ]

I am sorry you hit this problem.
I've confirmed that I can reproduce it in some environments and also confirmed that 10.1.28 starts successfully using script mysqld_safe from 10.1.26
So one of solutions may be to roll back recent bug fix MDEV-10767 in mysqld_safe :

$ diff 10.1.26/bin/mysqld_safe 10.1.28/bin/mysqld_safe
249c249
<   local wr_logfile=$(mktemp $DATADIR/wsrep_recovery.XXXXXX)
---
>   local wr_logfile=$(mktemp wsrep_recovery.XXXXXX)
267c267
<   local wr_options="--log_error='$wr_logfile' --pid-file='$wr_pidfile'"
---
>   local wr_options="--disable-log-error  --pid-file='$wr_pidfile'"
271c271
<   eval_log_error "$mysqld_cmd --wsrep_recover $wr_options"
---
>   eval_log_error "$mysqld_cmd --wsrep_recover $wr_options 2> $wr_logfile"

We are working on best workaround at the moment.

Comment by sysdljr [ 2017-09-29 ]

That's alright. Thank you very much for rapid reply and support.

Comment by Andrii Nikitin (Inactive) [ 2017-09-29 ]

This script reproduces the problem: https://github.com/AndriiNikitin/bugs/blob/master/MDEV-13950.sh
(needs git, m4, bash and galera packages installed; the script will download and unpack corresponding tar package).

This succeeds:

bash MDEV-13950.sh 10.1.26

This fails:

bash MDEV-13950.sh 10.1.28

Comment by Andrii Nikitin (Inactive) [ 2017-09-29 ]

see also debug mysqld_safe output in attached jessie.log

Comment by Andrii Nikitin (Inactive) [ 2017-09-29 ]

The problem is that in new change (which merged from galera_recovery script) log is expecting output from stderr, but it actually goes from stdout (because eval_log_error will append "2>&1" ). So correct fix shall be:

diff mysqld_safe.bug mysqld_safe.fixed
249c249
<   local wr_logfile=$(mktemp wsrep_recovery.XXXXXX)
---
>   local wr_logfile=$(mktemp -t wsrep_recovery.XXXXXX)
271c271
<   eval_log_error "$mysqld_cmd --wsrep_recover $wr_options 2> $wr_logfile"
---
>   eval_log_error "$mysqld_cmd --wsrep_recover $wr_options > $wr_logfile"

Comment by Otto Kekäläinen [ 2017-10-01 ]

I've seen this in production and I can confirm ever MariaDB Galera system that upgrades (no matter what distro) to 10.1.28 will permanently fail to start after the upgrade.

Comment by Sachin Setiya (Inactive) [ 2017-10-02 ]

A patch have been pushed into 10.1 which will solve this problem. But for 10.3 I am thinking of different solution , we have duplicate code in wsrep_recover_position same code in mysqld_safe and galera_recovery , correct way will be to call galera_recovery from mysqld_safe

Comment by Claudio Nanni [ 2017-10-02 ]

On 10.2.9 I can only start the galera node with wsrep_on=OFF.
When connected if I run SET GLOBAL wsrep_on=ON; nothing happens until I exit the mysql client, when mysqld disappears without messages.
Also the wsrep_recovery.xxxxxx files are written in the basedir instead of the datadir.

Comment by Sachin Setiya (Inactive) [ 2017-10-02 ]

If wsrep_on is off then mysqld_safe wont call wsrep_recover_position.

>Also the wsrep_recovery.xxxxxx files are written in the basedir instead of the datadir.
This is side effect of 10767.

Generated at Thu Feb 08 08:09:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.