Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.2.12
-
Debian 9 fresh install
mariadb from debian repos
galera from debian repos
-
10.2.13
Description
Try to setup galera cluster from standalone mariadb server that has a few databases.
From existing mariadb server use /var/lib/mysql as is
on stock /etc/mysql/my.cnf
add
[galera]
- Mandatory settings
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="tam_cluster"
wsrep_cluster_address="gcomm://ip1, ip2, ip5"
wsrep_node_address="ip5"
wsrep_node_name="node_5"
wsrep_sst_method=rsync
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
run: galera_new_cluster
result: mysqld crashes
Attachments
- mdb_dbg.tgz
- 302 kB
- my.cnf
- 6 kB
- mysqld_debug.txt
- 17 kB
- mysqld.trace
- 9.42 MB
- mysqld.trace
- 1.76 MB
- mysqld.trace
- 1.82 MB
- mysqld.trace
- 88 kB
Activity
Better question is why wsrep==NULL when WSREP_PROVIDER_EXISTS is true
Jan 8 08:03:40 tam05 systemd[1]: Starting MariaDB 10.2.12 database server...
Jan 8 08:03:44 tam05 mysqld[1287]: WSREP: Failed to start mysqld for wsrep recovery: '2018-01-08 8:03:41 0 [Note] /usr/sbin/mysqld (mysqld 10.3.4-MariaDB-debug-log) starting as process 1488 ...
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: For Galera, using innodb_lock_schedule_algorithm=fcfs
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: !!!!!!!! UNIV_DEBUG switched on !!!!!!!!!
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Uses event mutexes
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Compressed tables use zlib 1.2.8
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Using Linux native AIO
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Number of pools: 1
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Using SSE2 crc32 instructions
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Initializing buffer pool, total size = 512M, instances = 1, chunk size = 128M
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:41 0 [Note] InnoDB: Completed initialization of buffer pool
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:42 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:42 0 [Note] InnoDB: The log sequence number 1018153280213 in the system tablespace does not match the log sequence number 1018153280524 in the ib_logfiles!
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:42 0 [Note] InnoDB: Last binlog file 'mysql-bin.000002', position 402911975
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] InnoDB: 128 out of 128 rollback segments are active.
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] InnoDB: Creating shared tablespace for temporary tables
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] InnoDB: Waiting for purge to start
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] InnoDB: 5.7.20 started; log sequence number 1018153280524
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
Jan 8 08:03:44 tam05 mysqld[1287]: 2018-01-08 8:03:43 0 [Note] Plugin 'FEEDBACK' is disabled.
Jan 8 08:03:44 tam05 mysqld[1287]: 180108 8:03:43 [ERROR] mysqld got signal 11 ;
Jan 8 08:03:44 tam05 mysqld[1287]: This could be because you hit a bug. It is also possible that this binary
Jan 8 08:03:44 tam05 mysqld[1287]: or one of the libraries it was linked against is corrupt, improperly built,
Jan 8 08:03:44 tam05 mysqld[1287]: or misconfigured. This error can also be caused by malfunctioning hardware.
Jan 8 08:03:44 tam05 mysqld[1287]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Jan 8 08:03:44 tam05 mysqld[1287]: We will try our best to scrape up some info that will hopefully help
Jan 8 08:03:44 tam05 mysqld[1287]: diagnose the problem, but since we have already crashed,
Jan 8 08:03:44 tam05 mysqld[1287]: something is definitely wrong and this may fail.
Jan 8 08:03:44 tam05 mysqld[1287]: Server version: 10.3.4-MariaDB-debug-log
Jan 8 08:03:44 tam05 mysqld[1287]: key_buffer_size=134217728
Jan 8 08:03:44 tam05 mysqld[1287]: read_buffer_size=2097152
Jan 8 08:03:44 tam05 mysqld[1287]: max_used_connections=0
Jan 8 08:03:44 tam05 mysqld[1287]: max_threads=702
Jan 8 08:03:44 tam05 mysqld[1287]: thread_count=6
Jan 8 08:03:44 tam05 mysqld[1287]: It is possible that mysqld could use up to
Jan 8 08:03:44 tam05 mysqld[1287]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4459921 K bytes of memory
Jan 8 08:03:44 tam05 mysqld[1287]: Hope that's ok; if not, decrease some variables in the equation.
Jan 8 08:03:44 tam05 mysqld[1287]: Thread pointer: 0x55a486fb3470
Jan 8 08:03:44 tam05 mysqld[1287]: Attempting backtrace. You can use the following information to find out
Jan 8 08:03:44 tam05 mysqld[1287]: where mysqld died. If you see no messages after this, something went
Jan 8 08:03:44 tam05 mysqld[1287]: terribly wrong...
Jan 8 08:03:44 tam05 mysqld[1287]: stack_bottom = 0x7ffe02b9ab90 thread_stack 0x49000
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(my_print_stacktrace+0x3d)[0x55a481de86ca]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(handle_fatal_signal+0x3bb)[0x55a481636f2a]
Jan 8 08:03:44 tam05 mysqld[1287]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f93adb3c0c0]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_Z12wsrep_commitP10handlertonP3THDb+0x202)[0x55a481567621]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(+0x90ecf0)[0x55a48163acf0]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_Z19ha_commit_one_phaseP3THDb+0xfe)[0x55a48163abec]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_Z15ha_commit_transP3THDb+0x53a)[0x55a48163a438]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_Z17trans_commit_stmtP3THD+0xc3)[0x55a4814d8c66]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_Z18close_mysql_tablesP3THD+0x26)[0x55a48130d437]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(+0x66b946)[0x55a481397946]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_Z11plugin_initPiPPci+0x966)[0x55a4813971fa]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(+0x568cc3)[0x55a481294cc3]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x5c4)[0x55a481295d08]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(main+0x20)[0x55a48128b1c0]
Jan 8 08:03:44 tam05 systemd[1]: mariadb.service: Control process exited, code=exited status=1
Jan 8 08:03:44 tam05 mysqld[1287]: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f93ac69b2b1]
Jan 8 08:03:44 tam05 mysqld[1287]: /usr/sbin/mysqld(_start+0x2a)[0x55a48128b09a]
Jan 8 08:03:44 tam05 mysqld[1287]: Trying to get some variables.
Jan 8 08:03:44 tam05 mysqld[1287]: Some pointers may be invalid and cause the dump to abort.
Jan 8 08:03:44 tam05 mysqld[1287]: Query (0x0):
Jan 8 08:03:44 tam05 mysqld[1287]: Connection ID (thread ID): 0
Jan 8 08:03:44 tam05 mysqld[1287]: Status: NOT_KILLED
Jan 8 08:03:44 tam05 mysqld[1287]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on
Jan 8 08:03:44 tam05 mysqld[1287]: The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
Jan 8 08:03:44 tam05 mysqld[1287]: information that should help you find out what is causing the crash.
Jan 8 08:03:44 tam05 mysqld[1287]: Writing a core file at /var/lib/mysql/
Jan 8 08:03:44 tam05 mysqld[1287]: Segmentation fault (core dumped)'
Jan 8 08:03:44 tam05 systemd[1]: Failed to start MariaDB 10.2.12 database server.
Jan 8 08:03:44 tam05 systemd[1]: mariadb.service: Unit entered failed state.
Jan 8 08:03:44 tam05 systemd[1]: mariadb.service: Failed with result 'exit-code'.
If it helps, that happens if I use
>galera_new_cluster
to bootstrap the first node.
If I bootstrap the first node with
>mysqld -umysql --wsrep-new-cluster
The first node comes up, other nodes can connect and sync.
But I cannot afterwards shut them down and restart.
The macro WSREP_PROVIDER_EXISTS compares a string, not the wsrep pointer:
(wsrep_provider && strncasecmp(wsrep_provider, WSREP_NONE, FN_REFLEN))
|
This string is not being updated atomically with wsrep. One example where this might fail is in the error handling of the function wsrep_provider_init().
Wouldn’t it be more reliable to just test wsrep instead of wsrep_provider?
Is either variable protected by any mutex? It does not seem so to me. What would serialize multiple concurrent executions of wsrep_provider_update(), for example?
In the longer mysqld.trace most references to wsrep are from thread T@1, which is the initial thread. The trace is similar to the earlier error log that leads to a crash.
Unfortunately most of the WSREP code is lacking DBUG_ENTER and DBUG_RETURN instrumentation. Because of this, it is hard to tell which subsystems init_server_components() actually initialized after wsrep_thr_init(). Some of the WSREP initialization is conditional.
jgnoss, one thing that you should be able to do in your setup is that you add more instrumentation to the server startup. Something like this, possibly also in other places:
diff --git a/sql/wsrep_mysqld.cc b/sql/wsrep_mysqld.cc
|
index 173e955e44b..cd6a3497999 100644
|
--- a/sql/wsrep_mysqld.cc
|
+++ b/sql/wsrep_mysqld.cc
|
@@ -796,6 +796,11 @@ void wsrep_thr_init()
|
|
void wsrep_init_startup (bool first)
|
{
|
+ sql_print_information("wsrep_init_startup (%d, %s, %s, %s)",
|
+ first, wsrep_provider, wsrep_sst_method,
|
+ wsrep_cluster_address
|
+ ? wsrep_cluster_address : "(none)");
|
+
|
if (wsrep_init()) unireg_abort(1);
|
|
wsrep_thr_lock_init(
|
@@ -829,6 +834,7 @@ void wsrep_init_startup (bool first)
|
wsrep_create_appliers(1);
|
|
if (first && !wsrep_sst_wait()) unireg_abort(1);// wait until SST is completed
|
+ sql_print_information("wsrep_init_startup completed");
|
}
|
|
|
These messages should be displayed in the server error log with a Note prefix. If these messages do not appear in the log, then this function is for some reason not being called at all, and you should instead add some output to the two callers in the file sql/mysqld.cc.
One way to get more information is to set wsrep_debug=ON; Could you do that?
The trace I added right now is with
wsrep_debug=ON
in my.cnf
What else I did:
on IRC we talked about a database that isn't accessible inside that server.
I used the original mysqld and started that server in standalone
then I dropped that database.
changed config to include galera stuff
run
galera_new_cluster
because I lost trace of all changes I made in my.cnf
I'll attach the last version I used
jgnoss, you mentioned that you would upload the first 128 KiB of the system tablespace, but I cannot find that file either attached to this issue or on ftp.mariadb.com/uploads. (I suspect that Galera is looking for something in the first pages, and will not find it there. I’d want to transplant the content from those pages to an empty database, hoping to reproduce the problem.)
It seems to me that the output that is generated by wsrep_debug=ON would land in the main server error log, not in the DBUG trace log. I did not notice any additional WSREP messages compared to the previous mysqld.trace file.
Can you find out which arguments the wrapper script is passing to mysqld? To do that, I would suggest the following:
sudo su -
|
mv /usr/sbin/mysqld /usr/sbin/mysqld-real
|
cat > /usr/sbin/mysqld << EOF
|
#!/bin/sh
|
echo mysqld "$@" >> /tmp/mysqld-real-cmd.txt
|
exec /usr/sbin/mysqld-real "$@"
|
EOF
|
chmod +x /usr/sbin/mysqld
|
This wrapper script should append a line with the arguments to the file /tmp/mysqld-real-cmd.txt each time the server is invoked, and then invoke the server. Aimed with this information, it should hopefully be possible to start the server with similar arguments in gdb. (And maybe the problem is somehow related to the arguments that are being passed.)
I could not repeat the assertion. I started standalone systemd mariadb server, created few databases and InnoDB tables, stopped mariadb, run galera_new_cluster after configuration change. All started fine for me.
Hi Jan,
same here, if I start the galera cluster from a fresh db with a few db's in it, it works fine.
Problem occurs if I start the galera on an existing db, I have.
The db I try is a 260G+ standalone db, upgraded from mysql to mariadb.
Now I try to initiate the first node on that db and that is, where it crashes.
What version of mysql db you have ? Did it contain also galera or tc.log ?
mysql was 5.7 when I upgraded to mariadb 10.2.12.
The server in question actually is running productive on mariadb 10.2.12
sorry, the production server runs
Server version: 10.2.11-MariaDB-10.2.11+maria~jessie-log mariadb.org binary distribution
commit 446b3d356218bff06efe4b3f5df89595cdfe8284
Author: Jan Lindström <jan.lindstrom@mariadb.com>
Date: Tue Jan 30 17:40:40 2018 +0200
MDEV-14875: galera_new_cluster crashes mysqld when existing server contains databases
Fortify wsrep_hton so that wsrep calls are not done to NULL-pointers.
first output of
journalctl -xe
Tell me what I should do to get information you need
Sorry, I'm on the wrong release.
I just git pulled, forgot to checkout the right version and ended up on 10.3.5
Now it's building on HEAD 446b3d35621
Sorry for that
Can I have full error log? Something must go wrong at wsrep initialization.
here the files from an actual try to bootstrap the cluster
on my debian 9 install if I use galera_new_cluster the log goes to syslog
so the attached file has the relevant output from syslog
@jan: would it be of help to chat on freenode #maria ?
Yes, it would help, what files you have on datadir ? You should not have galera state files when you create a new cluster, why it can't create normal error log?
It seems that MariaDB 10.2.13 has fixed this issue:
jan@jan-laptop-asus:/$ sudo /usr/local/bin/galera_new_cluster
|
jan@jan-laptop-asus:/$ sudo systemctl status mariadb
|
● mariadb.service - MariaDB 10.2.13 database server
|
Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled
|
Drop-In: /etc/systemd/system/mariadb.service.d
|
└─migrated-from-my.cnf-settings.conf
|
Active: active (running) since Tue 2018-02-06 16:41:22 EET; 1 day 18h ago
|
Docs: man:mysqld(8)
|
https://mariadb.com/kb/en/library/systemd/
|
Main PID: 2316 (mysqld)
|
Status: "Taking your SQL requests now..."
|
Tasks: 35 (limit: 4915)
|
CGroup: /system.slice/mariadb.service
|
└─2316 /usr/sbin/mysqld --wsrep_start_position=00000000-0000-0000-0000-00000
|
|
Warning: Journal has been rotated since unit was started. Log output is incomplete or u
|
The SIGSEGV occurs because the null pointer wsrep is being dereferenced in the member function call:
{
thd->query());
}
}
Why is the function wsrep_commit() being called at all if wsrep==NULL?