[MDEV-11135] Adding first node to a busy cluster Created: 2016-10-25  Updated: 2018-01-09

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.1.17
Fix Version/s: 10.1

Type: Bug Priority: Major
Reporter: Michael Nguyen Assignee: Unassigned
Resolution: Unresolved Votes: 1
Labels: None
Environment:

root@mariadb-03:/etc/mysql/conf.d# dpkg -l | grep maria
ii libmariadbclient18 10.1.18+maria-1~wheezy amd64 MariaDB database client library
ii libmysqlclient18 10.1.18+maria-1~wheezy amd64 Virtual package to satisfy external depends
ii mariadb-client-10.1 10.1.18+maria-1~wheezy amd64 MariaDB database client binaries
ii mariadb-client-core-10.1 10.1.18+maria-1~wheezy amd64 MariaDB database core client binaries
ii mariadb-common 10.1.18+maria-1~wheezy all MariaDB database common files (e.g. /etc/mysql/conf.d/mariadb.cnf)
ii mariadb-server-10.1 10.1.18+maria-1~wheezy amd64 MariaDB database server binaries
ii mariadb-server-core-10.1 10.1.18+maria-1~wheezy amd64 MariaDB database core server files
ii mysql-common 10.1.18+maria-1~wheezy all MariaDB database common files (e.g. /etc/mysql/my.cnf)
root@mariadb-03:/etc/mysql/conf.d# uname -a
Linux mariadb-03 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u1 x86_64 GNU/Linux
root@mariadb-03:/etc/mysql/conf.d# lsb_release -sc
wheezy



 Description   

When starting the first JOINER node on a busy DONOR, after the SST is successfully complete, the IST process is started in order to complete the synchronization. It appears as if the IST fails if the galera.cache file is not large enough to hold all the changes that occurred between the start of the SST until the beginning of the IST. This causes the MariaDB instance to fail on the Joining Node and also causes the DONOR node to become Non-Primary:

On DONOR node innobackup.backup.log file:

xtrabackup: The latest check point (for incremental): '4321113633769'
xtrabackup: Stopping log copying thread.
.161025 16:14:59 >> log scanned up to (4321113633769)

161025 16:14:59 Executing UNLOCK TABLES
161025 16:14:59 All tables unlocked
161025 16:14:59 Backup created in directory '/tmp'
MySQL binlog position: filename 'binlog.000107', position '408', GTID of the last change '0-3-529,1-3-11806550'
161025 16:14:59 [00] Streaming backup-my.cnf
161025 16:14:59 [00] ...done
161025 16:14:59 [00] Streaming xtrabackup_info
161025 16:14:59 [00] ...done
xtrabackup: Transaction log of lsn (4320841669921) to (4321113633769) was copied.
161025 16:15:02 completed OK!

This shows the backup completed.

On the JOINER node:

Oct 25 16:15:02 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:02 140643296864000 [Note] WSREP: 0.0 (mariadb-03): State transfer to 1.0 (mariadb-plm-01) complete.
Oct 25 16:15:02 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:02 140643296864000 [Note] WSREP: Member 0.0 (mariadb-03) synced with group.
Oct 25 16:15:02 mariadb-plm-01 mysqld[70555]: WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20161025 16:15:02.889)
Oct 25 16:15:02 mariadb-plm-01 mysqld[70555]: WSREP_SST: [INFO] Total time on joiner: 0 seconds (20161025 16:15:02.893)
Oct 25 16:15:02 mariadb-plm-01 mysqld[70555]: WSREP_SST: [INFO] Removing the sst_in_progress file (20161025 16:15:02.896)
Oct 25 16:15:02 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:02 140677831268288 [Note] WSREP: SST complete, seqno: 1899034
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: The InnoDB memory heap is disabled
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: Compressed tables use zlib 1.2.8
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: Using Linux native AIO
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: Using SSE crc32 instructions
Oct 25 16:15:03 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:03 140677831268288 [Note] InnoDB: Initializing buffer pool, size = 30.0G
Oct 25 16:15:04 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:04 140677831268288 [Note] InnoDB: Completed initialization of buffer pool
Oct 25 16:15:04 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:04 140677831268288 [Note] InnoDB: Highest supported file format is Barracuda.
Oct 25 16:15:04 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:04 140677831268288 [Warning] InnoDB: The log sequence number in the ibdata files is higher than the log sequence number in the ib_logfiles! Are you sure you are using the right ib_logfiles to start up the database. Log sequence number in the ib_logfiles is 4320323667423, logsequence numbers stamped to ibdata file headers are between 4320836805942 and 4320836805942.
Oct 25 16:15:04 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:04 140677831268288 [Note] InnoDB: The log sequence numbers 4320836805942 and 4320836805942 in ibdata files do not match the log sequence number 4320323667423 in the ib_logfiles!
Oct 25 16:15:04 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:04 140677831268288 [Note] InnoDB: Database was not shutdown normally!
Oct 25 16:15:04 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:04 140677831268288 [Note] InnoDB: Starting crash recovery.
Oct 25 16:15:04 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:04 140677831268288 [Note] InnoDB: Reading tablespace information from the .ibd files...
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:06 140677831268288 [Note] InnoDB: Restoring possible half-written data pages
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:06 140677831268288 [Note] InnoDB: from the doublewrite buffer...
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: 2016-10-25 16:15:06 7ff21c2917c0 InnoDB: Error: page 7 log sequence number 4320605260426
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: InnoDB: is in the future! Current system log sequence number 4320323667423.
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: InnoDB: Your database may be corrupt or you may have copied the InnoDB
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: InnoDB: tablespace but not the InnoDB log files. See
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
Oct 25 16:15:06 mariadb-plm-01 mysqld[70555]: InnoDB: for more information.

Back on the DONOR server:

Oct 25 16:15:02 mariadb-03 mysqld: 2016-10-25 16:15:02 139885163833088 [Note] WSREP: 0.0 (mariadb-03): State transfer to 1.0 (mariadb-plm-01) complete.
Oct 25 16:15:02 mariadb-03 mysqld: 2016-10-25 16:15:02 139885163833088 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 1900097)
Oct 25 16:15:02 mariadb-03 mysqld: 2016-10-25 16:15:02 139885163833088 [Note] WSREP: Member 0.0 (mariadb-03) synced with group.
Oct 25 16:15:02 mariadb-03 mysqld: 2016-10-25 16:15:02 139885163833088 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1900097)
Oct 25 16:15:02 mariadb-03 mysqld: 2016-10-25 16:15:02 139919665761024 [Note] WSREP: Synchronized with group, ready for connections
Oct 25 16:15:02 mariadb-03 mysqld: 2016-10-25 16:15:02 139919665761024 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Oct 25 16:15:02 mariadb-03 mysqld: WSREP_SST: [INFO] Total time on donor: 0 seconds (20161025 16:15:02.833)
Oct 25 16:15:11 mariadb-03 mysqld: 2016-10-25 16:15:11 139885172225792 [Note] WSREP: (2604c0a4, 'tcp://0.0.0.0:4567') connection to peer 40e10823 with addr tcp://192.168.90.43:4567 timed out, no messages seen in PT3S
Oct 25 16:15:11 mariadb-03 mysqld: 2016-10-25 16:15:11 139885172225792 [Note] WSREP: (2604c0a4, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.90.43:4567
Oct 25 16:15:12 mariadb-03 mysqld: 2016-10-25 16:15:12 139885172225792 [Note] WSREP: (2604c0a4, 'tcp://0.0.0.0:4567') reconnecting to 40e10823 (tcp://192.168.90.43:4567), attempt 0
Oct 25 16:15:13 mariadb-03 mysqld: 2016-10-25 16:15:13 139885172225792 [Note] WSREP: evs::proto(2604c0a4, OPERATIONAL, view_id(REG,2604c0a4,2)) suspecting node: 40e10823
Oct 25 16:15:13 mariadb-03 mysqld: 2016-10-25 16:15:13 139885172225792 [Note] WSREP: evs::proto(2604c0a4, OPERATIONAL, view_id(REG,2604c0a4,2)) suspected node without join message, declaring inactive
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885172225792 [Note] WSREP: view(view_id(NON_PRIM,2604c0a4,2) memb

{ Oct 25 16:15:14 mariadb-03 mysqld: #0112604c0a4,0 Oct 25 16:15:14 mariadb-03 mysqld: }

joined

{ Oct 25 16:15:14 mariadb-03 mysqld: }

left

{ Oct 25 16:15:14 mariadb-03 mysqld: }

partitioned

{ Oct 25 16:15:14 mariadb-03 mysqld: #01140e10823,0 Oct 25 16:15:14 mariadb-03 mysqld: }

)
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885172225792 [Note] WSREP: view(view_id(NON_PRIM,2604c0a4,3) memb

{ Oct 25 16:15:14 mariadb-03 mysqld: #0112604c0a4,0 Oct 25 16:15:14 mariadb-03 mysqld: }

joined

{ Oct 25 16:15:14 mariadb-03 mysqld: }

left

{ Oct 25 16:15:14 mariadb-03 mysqld: }

partitioned

{ Oct 25 16:15:14 mariadb-03 mysqld: #01140e10823,0 Oct 25 16:15:14 mariadb-03 mysqld: }

)
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885163833088 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885163833088 [Note] WSREP: Flow-control interval: [16, 16]
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885163833088 [Note] WSREP: Received NON-PRIMARY.
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885163833088 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 1906499)
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139919665761024 [Note] WSREP: New cluster view: global state: b0728b95-9aae-11e6-982b-da4f6771abcd:1906499, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139919665761024 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885163833088 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885163833088 [Note] WSREP: Flow-control interval: [16, 16]
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139885163833088 [Note] WSREP: Received NON-PRIMARY.
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139919665761024 [Note] WSREP: New cluster view: global state: b0728b95-9aae-11e6-982b-da4f6771abcd:1906499, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139919665761024 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
Oct 25 16:15:14 mariadb-03 mysqld: 2016-10-25 16:15:14 139846180711168 [ERROR] Slave SQL: Node has dropped from cluster, Gtid 1-3-11814016, Internal MariaDB error code: 1047
Oct 25 16:15:15 mariadb-03 mysqld: 2016-10-25 16:15:15 139885172225792 [Note] WSREP: (2604c0a4, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://192.168.90.43:4567 timed out, no messages seen in PT3S



 Comments   
Comment by Michael Nguyen [ 2016-10-25 ]

On DONOR after failure:

MariaDB [(none)]> SHOW STATUS LIKE 'wsrep%';
---------------------------------------------------------------------------------------------+

Variable_name Value

---------------------------------------------------------------------------------------------+

wsrep_apply_oooe 0.000017
wsrep_apply_oool 0.000000
wsrep_apply_window 1.000017
wsrep_causal_reads 0
wsrep_cert_deps_distance 65.142420
wsrep_cert_index_size 119
wsrep_cert_interval 0.000200
wsrep_cluster_conf_id 18446744073709551615
wsrep_cluster_size 1
wsrep_cluster_state_uuid b0728b95-9aae-11e6-982b-da4f6771abcd
wsrep_cluster_status non-Primary
wsrep_commit_oooe 0.000000
wsrep_commit_oool 0.000000
wsrep_commit_window 1.000000
wsrep_connected ON
wsrep_desync_count 0
wsrep_evs_delayed 40e10823-9aea-11e6-abf4-02a331d75d56:tcp://192.168.90.43:4567:1
wsrep_evs_evict_list  
wsrep_evs_repl_latency 0/0/0/0/0
wsrep_evs_state OPERATIONAL
wsrep_flow_control_paused 0.000000
wsrep_flow_control_paused_ns 0
wsrep_flow_control_recv 0
wsrep_flow_control_sent 0
wsrep_gcomm_uuid 2604c0a4-9aea-11e6-a8ff-0f5358c2920b
wsrep_incoming_addresses 192.168.90.42:3306
wsrep_last_committed 1906499
wsrep_local_bf_aborts 0
wsrep_local_cached_downto 1731279
wsrep_local_cert_failures 0
wsrep_local_commits 175220
wsrep_local_index 0
wsrep_local_recv_queue 0
wsrep_local_recv_queue_avg 0.000000
wsrep_local_recv_queue_max 1
wsrep_local_recv_queue_min 0
wsrep_local_replays 0
wsrep_local_send_queue 0
wsrep_local_send_queue_avg 0.000147
wsrep_local_send_queue_max 2
wsrep_local_send_queue_min 0
wsrep_local_state 0
wsrep_local_state_comment Initialized
wsrep_local_state_uuid b0728b95-9aae-11e6-982b-da4f6771abcd
wsrep_protocol_version 7
wsrep_provider_name Galera
wsrep_provider_vendor Codership Oy <info@codership.com>
wsrep_provider_version 25.3.18(r3632)
wsrep_ready OFF
wsrep_received 1388
wsrep_received_bytes 11836
wsrep_repl_data_bytes 110808240
wsrep_repl_keys 1487649
wsrep_repl_keys_bytes 15931606
wsrep_repl_other_bytes 0
wsrep_replicated 175221
wsrep_replicated_bytes 144561473
wsrep_thread_count 2

---------------------------------------------------------------------------------------------+
58 rows in set (0.00 sec)

Comment by Andrii Nikitin (Inactive) [ 2017-07-15 ]

It may be kind of expected behaviour that single node becomes 'nonoperational' when the other node crashed according to http://galeracluster.com/documentation-webpages/twonode.html

There may be many reasons why Server on mariadb-plm-01 did exit 'ungracefully'. In any case provided messages from JOINER log indicate inconsistency in InnoDB files, which may be caused by various problems during backup restore. It is hard to tell for sure without logs covering whole problem period.

Please confirm if such explanation looks reasonable or if you are sure that troubleshooting is needed here.

Generated at Thu Feb 08 07:47:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.