[MDEV-16037] Crash on Galera Node occurred during recovery Created: 2018-04-26  Updated: 2018-10-03  Resolved: 2018-10-03

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.1, 10.1.32, 10.3.6, 10.3
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 1
Labels: need_feedback
Environment:

CentOS 7.4 ; 2 Master-Master Galera Nodes


Attachments: Text File sequence_of_accuring_of_MDEV-16037.txt     Text File wsrep_recovery_log.txt    

 Description   

Crash on Galera Node occurred during recovery

The crash occurred on the second Node after recovery attempt

# service mysql start --wsrep_cluster_address=gcomm://192.168.104.196
Starting mysql (via systemctl):  Job for mariadb.service failed because the control process exited with error code. See "systemctl status mariadb.service" and "journalctl -xe" for details.
                                                           [FAILED]
[root@t4w6 oltpbench]#

# systemctl status mariadb.service
● mariadb.service - MariaDB 10.3.6 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: failed (Result: exit-code) since Thu 2018-04-26 11:52:43 EEST; 1h 15min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 1418 ExecStartPost=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
  Process: 3211 ExecStart=/usr/sbin/mysqld $MYSQLD_OPTS $_WSREP_NEW_CLUSTER $_WSREP_START_POSITION (code=killed, signal=ABRT)
  Process: 21926 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=1/FAILURE)
  Process: 21917 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 3211 (code=killed, signal=ABRT)
 
Apr 26 11:52:43 t4w6.xentio.lan sh[21926]: 2018-04-26 11:52:42 0 [ERROR] InnoDB: Cannot open datafile for read-only: './tpcc/STOCK.ibd' OS error: 71
Apr 26 11:52:43 t4w6.xentio.lan sh[21926]: 2018-04-26 11:52:42 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
Apr 26 11:52:43 t4w6.xentio.lan sh[21926]: 2018-04-26 11:52:42 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
Apr 26 11:52:43 t4w6.xentio.lan sh[21926]: 2018-04-26 11:52:42 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create dire...ate them.
Apr 26 11:52:43 t4w6.xentio.lan sh[21926]: 2018-04-26 11:52:42 0 [ERROR] InnoDB: Could not find a valid tablespace file for ``tpcc`.`STOCK``. Ple...he issue.
Apr 26 11:52:43 t4w6.xentio.lan sh[21926]: 2018-04-26 11:52:42 0 [Warning] InnoDB: Ignoring tablespace for `tpcc`.`STOCK` because it could not be opened.
Apr 26 11:52:43 t4w6.xentio.lan systemd[1]: mariadb.service: control process exited, code=exited status=1
Apr 26 11:52:43 t4w6.xentio.lan systemd[1]: Failed to start MariaDB 10.3.6 database server.
Apr 26 11:52:43 t4w6.xentio.lan systemd[1]: Unit mariadb.service entered failed state.
Apr 26 11:52:43 t4w6.xentio.lan systemd[1]: mariadb.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

root     26265  0.0  0.0 113260  1728 pts/1    S+   13:20   0:00 /bin/sh /usr/sbin/service mysql start --wsrep_cluster_address=gcomm://19                    2.168.104.196
root     26278  0.0  0.0 115520  1832 pts/1    S+   13:20   0:00 /bin/sh /etc/init.d/mysql start --wsrep_cluster_address=gcomm://192.168.                    104.196
root     26286  0.0  0.0 128428  1448 pts/1    S+   13:20   0:00 /bin/systemctl start mysql.service
mysql    26364 26.0  4.3 1406736 87392 ?       Sl   13:20   0:00 /usr/sbin/mysqld --user=mysql --wsrep_recover --disable-log-error
mysql    26457 33.0  0.9 146828 19020 ?        R    13:20   0:00 addr2line -C -f -e /usr/sbin/mysqld

root     26265  0.0  0.0 113260  1728 pts/1    S+   13:20   0:00 /bin/sh /usr/sbin/service mysql start --wsrep_cluster_address=gcomm://19                    2.168.104.196
root     26278  0.0  0.0 115520  1832 pts/1    S+   13:20   0:00 /bin/sh /etc/init.d/mysql start --wsrep_cluster_address=gcomm://192.168.                    104.196
root     26286  0.0  0.0 128428  1448 pts/1    S+   13:20   0:00 /bin/systemctl start mysql.service
mysql    26364 27.0  0.0      0     0 ?        Zl   13:20   0:00 [mysqld] <defunct>
mysql    26491  0.0  0.1 118380  2240 ?        S    13:20   0:00 addr2line -C -f -e /lib64/libc.so.6

stack_bottom = 0x7fbe1effcc70 thread_stack 0x49000
2018-04-26 11:52:42 0 [Note] InnoDB: 5.7.21 started; log sequence number 68953003122; transaction id 822399
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x561939cf3f1e]
/usr/sbin/mysqld(handle_fatal_signal+0x357)[0x561939796807]
sigaction.c:0(__restore_rt)[0x7fbe5acc35e0]
:0(__GI_raise)[0x7fbe591d01f7]
:0(__GI_abort)[0x7fbe591d18e8]
/usr/sbin/mysqld(+0xa9b463)[0x561939abd463]
/usr/sbin/mysqld(+0x4cb163)[0x5619394ed163]
ut/ut0ut.cc:793(ib::fatal::~fatal())[0x561939b6b121]
fil/fil0fil.cc:4357(fil_report_invalid_page_access(unsigned long, unsigned long, char const*, unsigned long, unsigned long, bool))[0x561939b1fe85]
fil/fil0fil.cc:4598(fil_io(IORequest const&, bool, page_id_t const&, page_size_t const&, unsigned long, unsigned long, void*, void*, bool))[0x561939afd9e6]
buf/buf0rea.cc:197(buf_read_page(page_id_t const&, page_size_t const&))[0x561939adc99e]
buf/buf0buf.cc:4385(buf_page_get_gen(page_id_t const&, page_size_t const&, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*))[0x561939a4249c]
btr/btr0cur.cc:1309(btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long))[0x561939a43a87]
include/btr0pcur.ic:456(btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.38])[0x561939a3e51f]
include/page0page.h:224(row_search_on_row_ref(btr_pcur_t*, unsigned long, dict_table_t const*, dtuple_t const*, mtr_t*))[0x561939a40a3b]
row/row0purge.cc:106(row_purge_reset_trx_id(purge_node_t*, mtr_t*))[0x561939a41563]
include/dyn0buf.h:198(dyn_buf_t<512ul>::erase())[0x5619399fef43]
row/row0purge.cc:1111(row_purge_step(que_thr_t*))[0x561939a70c7d]
pthread_create.c:0(start_thread)[0x7fbe5acbbe25]
/lib64/libc.so.6(clone+0x6d)[0x7fbe5929334d]



 Comments   
Comment by kilian [ 2018-05-02 ]

I can report the exact same issue with MariaDB-server-10.1.32-1.el7.centos.x86_64 on CentOS 7.4.

Any advice on how to proceed to recover from here?

Thanks!

Comment by Jan Lindström (Inactive) [ 2018-05-03 ]

To me this looks like a permissions problem:

2018-04-26 11:52:42 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2018-04-26 11:52:42 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2018-04-26 11:52:42 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
2018-04-26 11:52:42 0 [ERROR] InnoDB: Cannot open datafile for read-only: './tpcc/STOCK.ibd' OS error: 71

How did you create your database ?

Comment by Zdravelina Sokolovska (Inactive) [ 2018-05-03 ]

jplindst via oltpbench on 1 Node, and tpcc db was replicated by galera on other Nodes

Comment by Zdravelina Sokolovska (Inactive) [ 2018-05-03 ]

mysqld actually is started with wsrep_recover , then are seen InnoDB logs

2018-04-26 11:52:42 0 [Note] /usr/sbin/mysqld (mysqld 10.3.6-MariaDB) starting as process 22007 ...
2018-04-26 11:52:42 0 [Note] InnoDB: For Galera, using innodb_lock_schedule_algorithm=fcfs
2018-04-26 11:52:42 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-04-26 11:52:42 0 [Note] InnoDB: Uses event mutexes
2018-04-26 11:52:42 0 [Note] InnoDB: Compressed tables use zlib 1.2.7
2018-04-26 11:52:42 0 [Note] InnoDB: Using Linux native AIO
2018-04-26 11:52:42 0 [Note] InnoDB: Number of pools: 1
2018-04-26 11:52:42 0 [Note] InnoDB: Using generic crc32 instructions
2018-04-26 11:52:42 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2018-04-26 11:52:42 0 [Note] InnoDB: Completed initialization of buffer pool
2018-04-26 11:52:42 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-04-26 11:52:42 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=68953003113
2018-04-26 11:52:42 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2018-04-26 11:52:42 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2018-04-26 11:52:42 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.

Comment by Jan Lindström (Inactive) [ 2018-05-03 ]

winstone You need to be more specific, did you have all three nodes up and running when you run oltpbench or do you start other nodes after that database is created and Galera does sst ? Please give full steps to repeat from empty datadir.

jplindst , those are 2 Nodes -that's given in the environment description . Nodes were up and running, then data loading was done. In any ways I have already added sequence_of_occurring file. Finally mysqld is not started on the Node and it's found out that it's has been crashed
Note that the problem was not observed on 10.2 with the same databases.

Comment by kilian [ 2018-05-03 ]

For me, the crash happened when simply upgrading the MariaDB-server RPM from 1.0.29 to 1.0.32 on one of our 3-node Galera cluster.
That 1.0.32 node could never start `mysqld`, but all the files, directories and permissions were already in place and didn't change during the upgrade.

Reverting the RPMs back to 1.0.29 on that node allowed `mysqld` to start without any issue on the same node.

So it doesn't look like it's a permission of filesystem-related issue.

Could that be caused by the 1.0.29/1.0.32 version mix in the Galera cluster?

Comment by Jan Lindström (Inactive) [ 2018-08-07 ]

I would need configuration files and error logs from both joiner and donor. Do you use innodb_home_data_dir ?
Original reports looks like either permission problem or SST is looking files from incorrect location and they are not really copied as they should.

Generated at Thu Feb 08 08:25:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.