[MDEV-15770] Three-node Galera cluster with MariaDB: the bootstrapped primary node is running, but the other two nodes are not able to recover the data from the first node and crash partway through recovery (e.g. after recovering 15 GB of data) Created: 2018-04-04  Updated: 2018-08-25

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.1.31
Fix Version/s: 10.1

Type: Bug Priority: Major
Reporter: Shrikant Anjankar Assignee: Seppo Jaakola
Resolution: Unresolved Votes: 1
Labels: None
Environment:

CentOS 7.4


Attachments: Text File azabnl-id05_errors.txt     Text File errors_azabnl-id01_chk1.txt     Text File mysql.error.log     File server.cnf    
Issue Links:
Relates
relates to MDEV-11035 Restore removed disallow-writes for G... Closed
relates to MDEV-10949 innodb_disallow_writes does not work ... Closed

 Description   

We have a three-node Galera cluster with MariaDB. The bootstrapped primary node is running, but the other two nodes are not able to recover the data from the first node and crash partway through recovery, e.g. after recovering 15 GB of data.
------------------------------------------

Service Status:
-------------------
[root@AZABIR-ID01 data]# systemctl status mariadb.service
● mariadb.service - MariaDB 10.1.31 database server
Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/mariadb.service.d
└─migrated-from-my.cnf-settings.conf
Active: failed (Result: exit-code) since Wed 2018-04-04 06:09:06 CEST; 56s ago
Docs: man:mysqld(8)
https://mariadb.com/kb/en/library/systemd/
Process: 25879 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= || VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ] && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=1/FAILURE)
Process: 25875 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)

Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] Loaded 'file_key_management.so' with offset 0x7f28081fb000
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: Started in read only mode
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: Using mutexes to ref count buffer pool pages
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: The InnoDB memory heap is disabled
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local sh[25879]: 2018-04-04 6:09:05 139810169485568 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for m...y barrier
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: Failed to start MariaDB 10.1.31 database server.
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: Unit mariadb.service entered failed state.
Apr 04 06:09:06 AZABIR-ID01.azure.cloud.corp.local systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@AZABIR-ID01 data]# /etc/init.d/mysql start
Starting mysql (via systemctl): Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
[FAILED]
=======================================

Log errors:

------------------------------------------------------------------------

2018-04-04 6:01:38 140400691869952 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2018-04-04 6:01:38 140400691869952 [Note] InnoDB: Starting final batch to recover 272 pages from redo log
2018-04-04 6:01:38 140400691869952 [ERROR] InnoDB: Trying to access page number 254 in space 30854 space name DB_GRPUK_P/cache_form, which is outside the tablespace bounds. Byte offset 0, len 16384 i/o type 10.
2018-04-04 06:01:38 7fb1955d7900 InnoDB: Assertion failure in thread 140400691869952 in file ha_innodb.cc line 22015
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
180404 6:01:38 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.31-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=2002
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4529069 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

stack_bottom = 0x0 thread_stack 0x48400
mysys/stacktrace.c:268(my_print_stacktrace)[0x55df5319a1ce]
sql/signal_handler.cc:168(handle_fatal_signal)[0x55df52cbdfb5]
sigaction.c:0(__restore_rt)[0x7fb1951ed5e0]
:0(__GI_raise)[0x7fb1934c61f7]
:0(__GI_abort)[0x7fb1934c78e8]
handler/ha_innodb.cc:22015(ib_logf(ib_log_level_t, char const*, ...))[0x55df52f0b13f]
fil/fil0fil.cc:5982(fil_io(unsigned long, bool, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, void*, void*, unsigned long*, trx_t*, bool))[0x55df530bb323]
buf/buf0rea.cc:263(buf_read_page_low(dberr_t*, bool, unsigned long, unsigned long, unsigned long, unsigned long, long, unsigned long, trx_t*, bool))[0x55df5307b0a0]
buf/buf0rea.cc:1119(buf_read_recv_pages(unsigned long, unsigned long, unsigned long, unsigned long const*, unsigned long))[0x55df5307eb66]
log/log0recv.cc:1857(recv_apply_hashed_log_recs(bool))[0x55df52f5d44f]
srv/srv0start.cc:2663(innobase_start_or_create_for_mysql())[0x55df52ff076e]
handler/ha_innodb.cc:4479(innobase_init(void*))[0x55df52f0f67d]
sql/handler.cc:521(ha_initialize_handlerton(st_plugin_int*))[0x55df52cc0264]
sql/sql_plugin.cc:1409(plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool))[0x55df52b47e70]
sql/sql_plugin.cc:1686(plugin_init(int*, char**, int))[0x55df52b4975a]
sql/mysqld.cc:5133(init_server_components())[0x55df52a9eb88]
sql/mysqld.cc:5722(mysqld_main(int, char**))[0x55df52aa2630]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb1934b2c05]
//sbin/mysqld(+0x39910d)[0x55df52a9610d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.



 Comments   
Comment by Sachin Setiya (Inactive) [ 2018-04-04 ]

Can you upload core dump ? And cnf file ?

Comment by Shrikant Anjankar [ 2018-04-04 ]

I have uploaded the server.cnf file. One node, 10.134.18.4, is running with bootstrap and is accessible; the other two nodes are not able to recover and sync the data — after about 6 GB of data transfer the MySQL service aborts.

A new error appeared. Please let us know if any option needs to be enabled during recovery.

History: all three nodes were running fine before Saturday night.
========================

[‎4/‎4/‎2018 4:52 PM] Kalmady, Sachin:
2018-04-03 18:52:51 140278145825536 [Note] WSREP: Prepared SST request: rsync|10.134.18.5:4444/rsync_sst
2018-04-03 18:52:51 140278145825536 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-04-03 18:52:51 140278145825536 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-04-03 18:52:51 140278145825536 [Note] WSREP: Assign initial position for certification: 204383674, protocol version: 3
2018-04-03 18:52:51 140277901264640 [Note] WSREP: Service thread queue flushed.
2018-04-03 18:52:51 140278145825536 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (53540047-107d-11e6-8b2a-9a31eea4d5df): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2018-04-03 18:52:51 140277842962176 [Note] WSREP: Member 1.0 (azabnl-id04) requested state transfer from 'any'. Selected 0.0 (azabnl-id03)(SYNCED) as donor.
2018-04-03 18:52:51 140277842962176 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 204383691)
2018-04-03 18:52:51 140278145825536 [Note] WSREP: Requesting state transfer: success, donor: 0
2018-04-03 18:52:51 140278145825536 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 53540047-107d-11e6-8b2a-9a31eea4d5df:204383674
2018-04-03 18:52:53 140277851354880 [Note] WSREP: (71e14758, 'tcp://0.0.0.0:4567') connection to peer 71e14758 with addr tcp://10.134.18.5:4567 timed out, no messages seen in PT3S
2018-04-03 18:52:54 140277851354880 [Note] WSREP: (71e14758, 'tcp://0.0.0.0:4567') turning message relay requesting off
===============
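The "Local state UUID (00000000-...) does not match group state UUID" warning in the log above means the joiner's saved state (grastate.dat) is zeroed, so incremental state transfer (IST) is impossible and a full SST is attempted instead. A minimal sketch of that check, run against a canned grastate.dat-style file (the path and contents here are assumptions modeled on Galera's saved-state format, not taken from this cluster):

```shell
# Illustrative only: a grastate.dat in the zeroed state the warning describes.
cat > /tmp/grastate.dat <<'EOF'
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
EOF
uuid=$(awk '/^uuid:/ {print $2}' /tmp/grastate.dat)
if [ "$uuid" = "00000000-0000-0000-0000-000000000000" ]; then
    echo "no usable local state: IST unavailable, full SST required"
fi
```

On a real node the file to inspect would live in the datadir (commonly /var/lib/mysql/grastate.dat).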

Comment by Marko Mäkelä [ 2018-04-04 ]

This looks awfully similar to fraggeln’s report in MDEV-11035 that rsync snapshot transfer does not work. Mariabackup works.
This would suggest that MDEV-11035 was not fixed correctly.

Comment by Shrikant Anjankar [ 2018-04-05 ]

Is there any specific workaround? The production cluster is running on only a single Galera MariaDB node; the other two nodes are not able to resync with the primary one and fail.

Comment by Shrikant Anjankar [ 2018-04-05 ]

We have tested MariaDB 10.2.14 on another Galera cluster with Galera version 25.3.22, but rsync still gets stuck after 9 GB. Please provide a solution.

Comment by Seppo Jaakola [ 2018-04-05 ]

The error log shows that rsync has failed earlier; given that, it is easy to understand that InnoDB startup will also fail. Therefore I doubt this relates to MDEV-11035.

For troubleshooting the rsync transfer issue, please attach the error log from the donor node as well.

Comment by Shrikant Anjankar [ 2018-04-10 ]

azabnl-id05 is the joiner/receiver node
azabnl-id01 is the donor node

Please check the logs and let us know if anything needs to change or if a new patch should be applied.

I want to share one success story:

When we got stuck with the resync method, we used the custom command below to sync data between two nodes and it succeeded, but this did not work in the other case.

#mysqld --basedir=/usr --datadir=/mnt/data --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --log-error=/mnt/data/AZABNL-ID04.XXXXXX.err --pid-file=/mnt/data/AZABNL-ID04.XXXXX.pid --wsrep_start_position=53540047-107d-11e6-8b2a-9a31eea4d5df:0
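Rather than hardcoding seqno 0 as in the command above, the last committed position can normally be recovered by running mysqld with --wsrep-recover, which logs a "Recovered position: <uuid>:<seqno>" line. A sketch of parsing that line into a start-position argument, using a canned example line (the UUID/seqno are copied from the logs in this report; the real step would run mysqld --wsrep-recover against the actual datadir):

```shell
# Canned example of the log line that a --wsrep-recover run emits.
line="[Note] WSREP: Recovered position: 53540047-107d-11e6-8b2a-9a31eea4d5df:204383674"
pos=${line##* }                      # strip everything up to the last space
echo "--wsrep_start_position=$pos"
```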

Comment by Shrikant Anjankar [ 2018-06-06 ]

Does anyone have any update regarding the three-node Galera syncing issue?

Comment by Jacques Amar [ 2018-08-25 ]

@shrikant, if you can afford the downtime, here's how I resorted to fixing it:

Basically, stop MariaDB on a running server.
Then do a manual rsync to the server that is out of sync and crashes because some files are too large. Make sure no older logs survive, as they will create problems. Ensure proper permissions across the copy.
Restart the already functioning server (it should just start).
Now restart the newly synced machine. It should just start syncing to the current state.
If it works, do it again, using the newly synced machine this time!

I had to do this. With 15 GB of data, downtime was a matter of minutes. If you only have one server running, you might have to use 'galera_new_cluster' to start the first server.

My Case:

rsync --progress --delete-after -Cave ssh <healthyserver>:/var/lib/mysql /var/lib

Hope this helps.

Generated at Thu Feb 08 08:23:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.