[MDEV-24307] Crashes in ALTER TABLE IMPORT TABLESPACE, Assertion dict0dict.cc, table->get_ref_count() == 0 Created: 2020-11-30  Updated: 2024-01-16

Status: Needs Feedback
Project: MariaDB Server
Component/s: Backup, Data Definition - Alter Table, Storage Engine - InnoDB
Affects Version/s: 10.5.5
Fix Version/s: 10.5

Type: Bug Priority: Major
Reporter: Daniel Nilsson Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Red Hat Enterprise Linux release 8.2 (Ootpa). x86 64-bit.


Attachments: Text File case1_gdb.txt     Text File case1_log.log     Text File case2_gdb.txt     Text File case2_log.log     File mdev24307_py_mariadb_reproduce.py     Text File mdev24307_reproduce.py    
Issue Links:
Relates
relates to MDEV-31515 mariadb crash while importing tablespace Closed

 Description   

I am replicating some tables daily between two machines using .ibd file transfer, one schema at a time, and I randomly get crashes when doing this.

On source machine:

  • FLUSH TABLES ... FOR EXPORT;
  • Copy the .ibd files
  • UNLOCK TABLES;

On target:

  • LOCK TABLES ... LOW_PRIORITY WRITE;

On target for each table:

  • ALTER TABLE <schema>.<tbl> DISCARD TABLESPACE;
  • Copy ibd file in place then set proper permissions
  • ALTER TABLE xxx IMPORT TABLESPACE; ( <== This is where it always breaks, and on varying tables)

On target:

  • UNLOCK TABLES;
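
The procedure above can be sketched as the SQL statement sequence each side runs. This is a minimal illustrative sketch, not the reporter's actual code: table and schema names are placeholders, and the file copies happen between the statements, as the comments note.

```python
# A minimal sketch of the transfer procedure, expressed as the SQL statement
# sequence each side runs. Table and schema names are placeholders; the
# actual file copies happen between the statements, as noted in the comments.

def source_statements(tables):
    """Statements run on the source; the .ibd files are copied while the lock is held."""
    return [
        f"FLUSH TABLES {', '.join(tables)} FOR EXPORT",
        # ... copy the .ibd files here, while the tables are locked ...
        "UNLOCK TABLES",
    ]

def target_statements(schema, tables):
    """Statements run on the target; the crash is reported at the IMPORT step."""
    stmts = [f"LOCK TABLES {', '.join(f'{schema}.{t} LOW_PRIORITY WRITE' for t in tables)}"]
    for t in tables:
        stmts.append(f"ALTER TABLE {schema}.{t} DISCARD TABLESPACE")
        # ... copy the .ibd file into place and set proper permissions ...
        stmts.append(f"ALTER TABLE {schema}.{t} IMPORT TABLESPACE")  # <== reported crash point
    stmts.append("UNLOCK TABLES")
    return stmts
```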

Example log below. I tried to follow the procedure from the docs, but I might be overlooking something. I do things like dropping indexes on large tables at the source when reloading, for speed, but the crash also arises on tables where I do not do this, so I do not see how it would be related to a specific table modification at the source. Both machines run the same version and the same OS.

My workaround for now is simply to make another try after the db service restarts itself, and then it seems to work. It keeps me afloat, but it's a really ugly "solution", as I have plenty of services connected to this provider machine. I want a "hotswap" of all data at once, at a specific trigger once a day when it is fully cooked at the source; that's why I'm using this approach instead of a Galera cluster setup.

Thank you so much in advance for looking into this. Let me know what more data I could provide to ease the review.

Best Regards
/Daniel

From log

2020-11-30 09:08:51 0x7fc8644e0700  InnoDB: Assertion failure in file /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.5.5/storage/innobase/dict/dict0dict.cc line 1918
InnoDB: Failing assertion: table->get_ref_count() == 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
201130  9:08:51 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.5-MariaDB
key_buffer_size=20971520
read_buffer_size=131072
max_used_connections=7
max_threads=153
thread_count=7
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 357211 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fc064000c58
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fc8644dfbd8 thread_stack 0x49000
??:0(my_print_stacktrace)[0x55a3ca399dee]
??:0(handle_fatal_signal)[0x55a3c9e24ec5]
sigaction.c:0(__restore_rt)[0x7fc8762dedd0]
:0(__GI_raise)[0x7fc87407d70f]
:0(__GI_abort)[0x7fc874067b25]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a3c9b0ac23]
??:0(Wsrep_server_service::log_dummy_write_set(wsrep::client_state&, wsrep::ws_meta const&))[0x55a3c9b18b1c]
??:0(wsrep_notify_status(wsrep::server_state::state, wsrep::view const*))[0x55a3ca0f29fa]
??:0(mysql_discard_or_import_tablespace(THD*, TABLE_LIST*, bool))[0x55a3c9cc4f77]
??:0(Sql_cmd_discard_import_tablespace::execute(THD*))[0x55a3c9d2b02d]
??:0(mysql_execute_command(THD*))[0x55a3c9c38fed]
??:0(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x55a3c9c2bc62]
??:0(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x55a3c9c360fe]
??:0(do_command(THD*))[0x55a3c9c3724f]
??:0(do_handle_one_connection(CONNECT*, bool))[0x55a3c9d26bc4]
??:0(handle_one_connection)[0x55a3c9d26f7d]
??:0(MyCTX_nopad::finish(unsigned char*, unsigned int*))[0x55a3ca043d5a]
pthread_create.c:0(start_thread)[0x7fc8762d42de]
:0(__GI___clone)[0x7fc874141e83]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fc0640109d0): ALTER TABLE `dwf`.`billing` IMPORT TABLESPACE
Connection ID (thread ID): 4
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /b001/dwfdb-data/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        unlimited            unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             192044               192044               processes
Max open files            16384                16384                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       192044               192044               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h %e



 Comments   
Comment by Marko Mäkelä [ 2021-01-13 ]

dene, are you using Galera replication? The stack trace seems to hint at that, but I wanted to double-check.
Would you be able to provide an executable script (with some dummy data) for reproducing this crash?

Comment by Daniel Nilsson [ 2021-01-13 ]

Hi Marko,

No, Galera is not used; I also checked that it's not enabled by mistake on either machine.

I will simplify my code, make sure I can provoke the crash with that sample, and provide a small self-contained Python script. I'll be back.

Br /Daniel

Comment by Marko Mäkelä [ 2021-01-14 ]

dene, thanks, that sounds good.
Please see also MDEV-24579 for my suggestions on how to get more useful stack traces. The built-in stack trace dump (which covers the current thread only) is known to produce some garbage stack frames, and it omits the function parameter values.

Comment by Daniel Nilsson [ 2021-01-15 ]

Thanks, I will review that ticket before reporting the next trace. I have built a script that reproduces exactly what I do in my service core, using some representative sample data (approximate row sizes and the same indexes) but with synchronous calls to keep it clear, and I verified it transaction by transaction. It does not break even when running on the production installation where I have the issue. I analyzed the logs, and there are days where the db crashes after just two small tables. So I will break out the smallest scenario, then build a service (running Twisted Matrix like production, with the same but stripped code) to replicate my real scenario and see if I can find out where things go bad.

mdev24307_reproduce.py

Above is my current test script (you would run init, backup, restore on it, then try to provoke the crash with restore or even --loop xxx; when restoring I cannot get a failure), so you can see the flow. It does not cause issues even when replicating from one machine to another, so don't spend time trying to reproduce based on this.

Comment by Daniel Black [ 2021-02-01 ]

dene, thanks for the script and information.

I tried on 10.5-73c43ee9ed067dd22a85f7f8241e33b14be7dd37 (pre 10.5.9 release) and 10.4-542d769ea1a22a7a6a87c9fe76ff911a162ade44 (latest as of now) and no assert or segfault was detected.

(my test environment Fedora 33 builds with clang-11.0.0-2.fc33)

I made small modifications to your script to use https://mariadb.com/docs/appdev/connector-python/ (attached), as I didn't have Python 2 and MySQLdb didn't seem to have been updated to support it.

If your data is ok to communicate to a limited number of developers, there is an ftp server https://mariadb.com/kb/en/meta/mariadb-ftp-server/ where a tar.xz of the datadir/coredump with the binary could be uploaded.

The core pattern in your original bug report indicates you have coredumpctl installed, so you might still have the original recorded coredump available.

Check with coredumpctl list. If it is still available, run coredumpctl [dump|debug] /usr/sbin/mysqld.

Use gdb for a full backtrace:

set logging on
set pagination off
set print frame-arguments all
thread apply all bt full
set logging off

Use the `debug` option and take the output from gdb.txt (from https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/#getting-backtraces-with-gdb-on-linux). Note this will probably contain your table names, but not contents.

Comment by Daniel Nilsson [ 2021-02-14 ]

Hi, I now have a table that always causes the receiving machine to crash, but with a different exception.

I made fresh installs of MariaDB 10.5.8 on both sides and started with fresh datadirs on both machines. I have exactly the same table definitions in both places.

If I export the .ibd file on machine 1, discard the tablespace on the same machine, copy the same file back, and import it, it works fine (so all on the same machine).

The same works fine on machine 2 when exporting and importing locally. But if I export the table on machine 1 and import it on machine 2, it breaks. Attached are case1_gdb.txt for this and the MariaDB logfile from the crash.

Also attached is a gdb log for one case of the initially reported issue (case2_gdb.txt) and its logfile (which did not provide debugging info that time; it is the only one I have visible in coredumpctl right now).

case2_log.log case2_gdb.txt case1_log.log case1_gdb.txt

Best Regards
/Daniel

Comment by Marko Mäkelä [ 2021-11-10 ]

dene, sorry, I only just noticed your update. Unfortunately, no debugging symbols were installed, so neither case1_gdb.txt nor case2_gdb.txt shows any names and values of function parameters.

The most recent release (10.5.13) included several fixes to ALTER TABLE…IMPORT TABLESPACE. Does 10.5.13 work for you?

Comment by Marko Mäkelä [ 2023-12-14 ]

dene, did you try a more recent version of MariaDB?

Comment by Daniel Nilsson [ 2023-12-16 ]

marko

Hi there,

Short answer: no, I have not so far.

I ended up using mydumper/myloader compiled into the backend container, loading into shadow schemas of the "provider". Once the load is done, we lock, drop the old tables, and move the new tables into place, which takes a split second. This has worked well; it takes an hour for our 250M rows in total, which is not ideal but stable.
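
The shadow-schema swap described here maps naturally onto MariaDB's multi-table RENAME TABLE, which applies several renames in one atomic statement. A hypothetical sketch (schema and table names are illustrative, not from the real service, and this variant renames the old tables aside instead of dropping them inline):

```python
# Hypothetical sketch of a shadow-schema swap: the freshly loaded shadow
# tables are promoted and the current tables retired in one atomic
# RENAME TABLE statement. Names are illustrative placeholders.

def swap_statement(schema, shadow_schema, tables):
    """Build one atomic RENAME TABLE that promotes the shadow tables."""
    pairs = []
    for t in tables:
        pairs.append(f"{schema}.{t} TO {schema}.{t}_old")     # retire current table
        pairs.append(f"{shadow_schema}.{t} TO {schema}.{t}")  # promote shadow copy
    return "RENAME TABLE " + ", ".join(pairs)
```

The `_old` tables can then be dropped separately, keeping the visible switchover to the split second mentioned above.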

I today have dev, qual and prod setups, and it's all containerized, so it's good timing to retry this. Development is now on MDB 11.2.2.

Let me schedule some time to have a go at it and come back. I actually kept the replication mode using .ibd transfer in my code, so it should be easy to test. I need to revisit, but it should just be a configuration change plus reinstating the NFS share to test it.

Br /Daniel

Comment by Marko Mäkelä [ 2024-01-16 ]

Hi dene, in MariaDB 11.2 you could use a cleaner workflow that was introduced in MDEV-26137. That is, you would additionally copy the .frm file from the source server, and avoid executing CREATE TABLE or ALTER TABLE…DISCARD TABLESPACE on the destination.
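
Under that workflow, the per-table file set copied from the source (while FLUSH TABLES ... FOR EXPORT is held there) would be the .frm, the .cfg written by the FLUSH, and the .ibd. A small illustrative sketch with hypothetical paths, not a definitive implementation:

```python
# Sketch of the per-table file set for the MDEV-26137-style transfer:
# copy the .frm, the .cfg (written by FLUSH TABLES ... FOR EXPORT) and
# the .ibd from the source datadir. Paths are illustrative placeholders.

from pathlib import Path

def files_to_copy(datadir, schema, table):
    """Per-table files to transfer from the source datadir."""
    base = Path(datadir) / schema / table
    return [base.with_suffix(ext) for ext in (".frm", ".cfg", ".ibd")]
```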

Generated at Thu Feb 08 09:29:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.