|
kernel, I see wsrep_notify_status in the call stack, which suggests to me that Galera replication is being used. Can you confirm that you are using Galera?
|
|
No, I do not use Galera. The [mysqld] config does not contain any wsrep settings. wsrep_ready and wsrep_connected is OFF
|
|
kernel, thank you. The lines before mysql_alter_table in the stack trace output look corrupted to me. Can you please try to follow the advice in https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ and provide a stack trace of the crash from gdb? That should be reliable.
We could get even more useful information if you attached the debugger to the server process before executing the crashing statement. Something like this:
sudo gdb -p $(pgrep -x mariadbd)
|
And then in gdb, execute the following:
set print frame-arguments all
|
handle SIGUSR1 nostop noprint
|
handle SIGUSR2 nostop noprint
|
handle SIGWAITING nostop noprint
|
handle SIGLWP nostop noprint
|
handle SIGPIPE nostop
|
handle SIGALRM nostop
|
handle SIGHUP nostop
|
handle SIGTERM nostop noprint
|
handle SIG32 nostop noprint
|
handle SIG33 nostop noprint
|
Then, once you have executed a crashing statement, you should get the (gdb) prompt again. In that prompt, execute the following:
set log file log.txt
|
set height 0
|
set log on
|
thread apply all backtrace
|
quit
|
And then attach the log.txt.
|
|
Used core file for log.txt . Hopefully that's enough. If not then I can do it on a live server, but this is not desirable.
|
|
kernel, unfortunately, I do not see any occurrence of "inno" in log.txt . Most of the function names that are displayed as ?? probably refer to InnoDB functions, which are exactly the interesting part for finding this bug.
|
|
Compiled MariaDB for Debugging (https://mariadb.com/kb/en/compiling-mariadb-for-debugging/) .
Right after launch
2021-01-15 14:25:52 0x7f4a5cf97700 InnoDB: Assertion failure in file /media/server/storage/innobase/handler/ha_innodb.cc line 1863
|
...
|
linux/raise.c:51(__GI_raise)[0x7f5717e1efb7]
|
stdlib/abort.c:81(__GI_abort)[0x7f5717e20921]
|
ut/ut0new.cc:84(ut_new_boot())[0x558e1a5aa272]
|
handler/ha_innodb.cc:1866(innobase_get_cset_width(unsigned long, unsigned int*, unsigned int*))[0x558e1a2fee08]
|
include/data0type.ic:83(dtype_get_mblen(unsigned long, unsigned long, unsigned int*, unsigned int*))[0x558e1a68e69c]
|
dict/dict0mem.cc:755(dict_mem_fill_column_struct(dict_col_t*, unsigned long, unsigned long, unsigned long, unsigned long))[0x558e1a6928a2]
|
dict/dict0mem.cc:342(dict_mem_table_add_col(dict_table_t*, mem_block_info_t*, char const*, unsigned long, unsigned long, unsigned long))[0x558e1a690f9b]
|
dict/dict0load.cc:1653(dict_load_column_low(dict_table_t*, mem_block_info_t*, dict_col_t*, unsigned long*, char const**, unsigned char const*, unsigned long*))[0x558e1a68672f]
|
dict/dict0load.cc:1818(dict_load_columns(dict_table_t*, mem_block_info_t*))[0x558e1a686f79]
|
dict/dict0load.cc:2934(dict_load_table_one(table_name_t const&, dict_err_ignore_t, std::deque<char const*, ut_allocator<char const*, true> >&))[0x558e1a68b019]
|
dict/dict0load.cc:2739(dict_load_table(char const*, dict_err_ignore_t))[0x558e1a68a17a]
|
dict/dict0load.cc:3176(dict_load_table_on_id(unsigned long, dict_err_ignore_t))[0x558e1a68be7a]
|
dict/dict0dict.cc:222(dict_table_open_on_id_low(unsigned long, dict_err_ignore_t, bool))[0x558e1a66b1f3]
|
dict/dict0dict.cc:953(dict_table_open_on_id(unsigned long, bool, dict_table_op_t, THD*, MDL_ticket**))[0x558e1a66cb14]
|
row/row0purge.cc:932(row_purge_parse_undo_rec(purge_node_t*, unsigned char*, que_thr_t*, bool*))[0x558e1a4f33f7]
|
row/row0purge.cc:1105(row_purge(purge_node_t*, unsigned char*, que_thr_t*))[0x558e1a4f3d32]
|
row/row0purge.cc:1159(row_purge_step(que_thr_t*))[0x558e1a4f3ee4]
|
que/que0que.cc:947(que_thr_step(que_thr_t*))[0x558e1a45d0b2]
|
que/que0que.cc:1009(que_run_threads_low(que_thr_t*))[0x558e1a45d358]
|
que/que0que.cc:1051(que_run_threads(que_thr_t*))[0x558e1a45d5ba]
|
trx/trx0purge.cc:1302(trx_purge(unsigned long, bool))[0x558e1a56cb3f]
|
srv/srv0srv.cc:2010(srv_do_purge(unsigned long*))[0x558e1a53c246]
|
srv/srv0srv.cc:2100(purge_coordinator_callback_low())[0x558e1a53c7d8]
|
srv/srv0srv.cc:2130(purge_coordinator_callback(void*))[0x558e1a53c8dc]
|
tpool/task_group.cc:56(tpool::task_group::execute(tpool::task*))[0x558e1a75bd11]
|
tpool/task.cc:56(tpool::task::execute())[0x558e1a75c122]
|
tpool/tpool_generic.cc:516(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0x558e1a754f79]
|
bits/invoke.h:73(void std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void
|
(tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&))[0x558e1a7587c1]
|
bits/invoke.h:95(std::__invoke_result<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>::type std::__invoke<void (tpool::thread_po
|
ol_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(void (tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::wor
|
ker_data*&&))[0x558e1a757981]
|
7/thread:234(decltype (__invoke((_S_declval<0ul>)(), (_S_declval<1ul>)(), (_S_declval<2ul>)())) std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), t
|
pool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul>(std::_Index_tuple<0ul, 1ul, 2ul>))[0x558e1a75b9b3]
|
7/thread:243(std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator()())[0x558e1a75b938]
|
7/thread:186(std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_r
|
un())[0x558e1a75b908]
|
??:0(std::error_code::default_error_condition() const)[0x7f57188446df]
|
nptl/pthread_create.c:463(start_thread)[0x7f5718d1b6db]
|
I don't think it has anything to do with the original bug report, however I don’t know what to do next.
|
|
kernel, I see a common pattern that between __GI_abort and valid-looking stack trace, we have some garbage. That assertion failure is indeed unrelated to the original report. Is that really from MariaDB 10.5.8? I do not see ut_error in the line ha_innobase::1863 or in the function innodb_get_cset_width. I realize that it must probably be the following line (you did not include the assertion expression, which should have been displayed in the preceding line):
This means that some table is referring to a non-existent character set collation.
This crash should go away if you set innodb_force_recovery=2 to disable the purge of transaction history. But, that will cause the undo logs to grow indefinitely. If you used gdb, you should be able to determine the name of the problematic table, and you could work around this crash by deleting the .ibd file before restarting the server.
This observation suggests that your data directory could be corrupted in some way and that the original report might be the result of such corruption.
There are many possible causes of corruption, starting from hardware (I would recommend sudo memtester), an error in the backup workflow, scripts or tools, or a bug in the database server.
In your position, I might consider backing up the database contents with mysqldump and restoring the logical dump to a fresh database.
In any case, I hope that you can try to reproduce the originally reported issue on a freshly initialized server. If the crash is the result of some earlier corruption, it could be hard to analyze it without having a copy of the database.
|
|
Thanks. Repaired table. Download mariadb from https://downloads.mariadb.org/mariadb/10.5.8/
After restart mariadb started up fine. Here is the backtrace on mysql_update alter table; log.txt
|
|
kernel, thank you. Your trace seems to include output from two thread apply all backtrace commands, because I see two Thread 1 there. The first output is for the characterset-collation issue that I suppose you already resolved:
Thread 1 (Thread 0x7ef72c99b700 (LWP 30196)):
|
…
|
#6 0x000055abd186b272 in ut_dbg_assertion_failed (expr=0x55abd1fdcc23 "cset == 0", file=0x55abd1fdb3d8 "/media/server/storage/innobase/handler/ha_innodb.cc", line=1863)
|
at /media/server/storage/innobase/ut/ut0dbg.cc:60
|
#7 0x000055abd15bfe08 in innobase_get_cset_width (cset=51, mbminlen=0x7ef72c999130, mbmaxlen=0x7ef72c999134) at /media/server/storage/innobase/handler/ha_innodb.cc:1863
|
…
|
#13 0x000055abd194c019 in dict_load_table_one (name=@0x7ef72c99a260: {m_name = 0x7ef71c00a618 "p0_001a_fb/17_users", …
|
So, at least one column in the table p0_001a_fb.17_users is referring to an unknown charset-collation code 51. That does not look like a built-in collation code, but it could be a loadable one, and your configuration is probably missing that dynamic collation now?
The second trace is for the originally reported crash. It seems to occur while executing some upgrade script, possibly related to MDEV-14637?
Thread 1 (Thread 0x7fc4b60a9700 (LWP 31632)):
|
…
|
#5 0x000055c271baa5bf in ut_dbg_assertion_failed (expr=expr@entry=0x55c27280ab1f "table->get_ref_count() == 0",
|
file=file@entry=0x55c27280a478 "/home/buildbot/buildbot/build/storage/innobase/row/row0merge.cc", line=line@entry=4324)
|
at /home/buildbot/buildbot/build/storage/innobase/ut/ut0dbg.cc:60
|
#6 0x000055c2722b6e3f in row_merge_drop_table (trx=trx@entry=0x7fd1880fa8c8, table=<optimized out>) at /home/buildbot/buildbot/build/storage/innobase/row/row0merge.cc:4324
|
#7 0x000055c27220e6d5 in ha_innobase::commit_inplace_alter_table (this=<optimized out>, altered_table=<optimized out>, ha_alter_info=<optimized out>, commit=<optimized out>)
|
at /home/buildbot/buildbot/build/storage/innobase/handler/handler0alter.cc:11323
|
…
|
#20 0x000055c271cd5d2a in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x7fc390000c18,
|
packet=packet@entry=0x7fc390495639 "if @have_innodb then\n \n delete from innodb_index_stats where length(table_name) > 64;\n delete from innodb_table_stats where length(table_name) > 64;\n \n alter table innodb_index_stats modify last_"..., packet_length=packet_length@entry=564, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false)
|
at /home/buildbot/buildbot/build/sql/sql_parse.cc:1873
|
elenst, we do not test much DDL on the InnoDB persistent statistics tables, do we? (This is somewhat related to MDEV-15020.)
|
|
If you mean innodb_stats_persistent, it is ON by default, and the vast majority of our non-MTR tests are run with default InnoDB settings – both mine and, I expect, those which mleich owns.
|
The first output is for the characterset-collation issue that I suppose you already resolved
Yes, the first one can be ignored.
It seems to occur while executing some upgrade script
Yes, it occur while executing mariadb_upgrade (debian run it on every mariadb start). However call "alter table innodb_index_stats modify last_update timestamp not null default current_timestamp on update current_timestamp, modify table_name varchar(199)" manually leads to the same effect.
|
|
elenst, do our stress tests include executing ALTER TABLE mysql.innodb_index_stats concurrently with normal DDL or DML workload? I would think that it is somewhat of a special case. But, it should not be that hard to repeat in a test. The reference-count violation could be caused this relatively innocent-looking concurrent access, to retrieve the persistent statistics for a table that is being opened for executing a SELECT statement:
Thread 142 (Thread 0x7fc4b634c700 (LWP 31623)):
|
…
|
#8 PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (name=0x55c272807840 "/home/buildbot/buildbot/build/storage/innobase/que/que0que.cc", line=1129, n_delay=<optimized out>,
|
n_spins=<optimized out>, this=<optimized out>) at /home/buildbot/buildbot/build/storage/innobase/include/ib0mutex.h:590
|
#9 que_eval_sql (info=info@entry=0x7fc3a4086150,
|
sql=sql@entry=0x55c2728224a0 "PROCEDURE FETCH_STATS () IS\nfound INT;\nDECLARE FUNCTION fetch_table_stats_step;\nDECLARE FUNCTION fetch_index_stats_step;\nDECLARE CURSOR table_stats_cur IS\n SELECT\n n_rows,\n clustered_index_size,\n "..., reserve_dict_mutex=reserve_dict_mutex@entry=true, trx=trx@entry=0x7fd1880fdb20)
|
at /home/buildbot/buildbot/build/storage/innobase/que/que0que.cc:1129
|
#10 0x000055c2723eb0bb in dict_stats_fetch_from_ps (table=0x7fc3a4685940) at /home/buildbot/buildbot/build/storage/innobase/dict/dict0stats.cc:3091
|
#11 dict_stats_update (table=table@entry=0x7fc3a45d5e90, stats_upd_option=<optimized out>) at /home/buildbot/buildbot/build/storage/innobase/dict/dict0stats.cc:3316
|
#12 0x000055c2721f6804 in dict_stats_init (table=<optimized out>) at /home/buildbot/buildbot/build/storage/innobase/include/dict0stats.ic:165
|
#13 ha_innobase::open (this=0x7fc3a467dc30, name=0x7fc3a467dbd8 "./p0_001a_f3/46_forums") at /home/buildbot/buildbot/build/storage/innobase/handler/ha_innodb.cc:5707
|
That thread is waiting for dict_sys.mutex while probably already holding a reference to the statistics table. The mutex is probably held by the crashing ALTER TABLE thread.
|
|
> do our stress tests include executing ALTER TABLE mysql.innodb_index_stats concurrently with normal DDL or DML workload
Not mine. Altering system tables is not a general practice, I don't expect such bugs would have had any value if found and filed internally.
|
|
elenst, I tried to reproduce the crash with the following, but with no success in 19×100 attempts:
--source include/have_innodb.inc
|
CREATE TABLE t (a INT) ENGINE=InnoDB STATS_PERSISTENT=1;
|
--source include/restart_mysqld.inc
|
connect (con,localhost,root);
|
send ALTER TABLE mysql.innodb_index_stats FORCE, ALGORITHM=INPLACE;
|
connection default;
|
SELECT * FROM t;
|
connection con;
|
reap;
|
disconnect con;
|
connection default;
|
DROP TABLE t;
|
Hopefully you will be able to reproduce it by applying the idea further. I double-checked that the crash occurs in a table rebuild (which the FORCE is requesting) while trying to commit the operation.
And yes, if MDEV-15020 had been implemented, then the ha_innobase::open() would not be able to get that far, because there would be an MDL conflict for the statistics table name.
kernel, maybe it would be a good idea to let the upgrade script complete before allowing clients to connect to the server.
|
maybe it would be a good idea to let the upgrade script complete before allowing clients to connect to the server.
Unfortunately I cannot afford this. mariadb_upgrade takes a very long time, at least tens of minutes. I have hundreds of thousands of databases.
Sooner or later I will do it using replication either by executing only this request without competing ones.
|
|
I can share core dump if it's will be helpfull. One can be openned with a binary from https://downloads.mariadb.org/mariadb/10.5.8/ (x86_64).
|
|
I believe that this bug has been fixed by MDEV-25919 in MariaDB 10.6.5. This is a duplicate of MDEV-23670 and MDEV-19539.
It is not feasible to port the fix to earlier release series.
|