[MDEV-12396] MariaDB Server SEGV crash when importing tablespace for a large table (4GB) Created: 2017-03-29  Updated: 2020-02-24  Resolved: 2020-02-24

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.1.22
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Rafael Gallastegui Assignee: Marko Mäkelä
Resolution: Incomplete Votes: 1
Labels: need_feedback
Environment:

CentOS 6 - 8GB memory


Issue Links:
Duplicate
duplicates MDEV-12419 IMPORT should not look up tablespace ... Closed
Relates
relates to MDEV-11759 Encryption code in MariaDB 10.1/10.2 ... Closed
relates to MDEV-12253 Buffer pool blocks are accessed after... Closed
relates to MDEV-12602 InnoDB: Failing assertion: space->n_p... Closed
relates to MDEV-13103 InnoDB: Deal with page_compressed pag... Closed
relates to MDEV-15032 Encrypted database created on big end... Closed

 Description   

I'm trying to import a database by bringing over the .ibd files from another server (they were saved after the tables were flushed and locked), running "alter table X discard tablespace" for each table, and then copying the file over and running "alter table X import tablespace". I disable foreign key checks prior to importing each table.

The sequence of import statements works correctly for most tables. When it reaches a table that is about 4GB in size, but with large records, I get a SEGV crash. This happens consistently for the same table. The weird thing is that a 6GB (but "thin") table works OK.

Error in log is:

2017-03-29 9:25:56 139871748651776 [Note] InnoDB: Sync to disk
2017-03-29 9:26:00 139871748651776 [Note] InnoDB: Sync to disk - done!
2017-03-29 9:26:00 139871748651776 [Note] InnoDB: Phase I - Update all pages
2017-03-29 9:27:18 139871748651776 [Note] InnoDB: Sync to disk
2017-03-29 9:27:21 139871748651776 [Note] InnoDB: Sync to disk - done!
2017-03-29 9:27:22 139871748651776 [Note] InnoDB: Phase III - Flush changes to disk
2017-03-29 9:27:22 139871748651776 [Note] InnoDB: Phase IV - Flush complete
170329 9:27:22 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.22-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=1
max_threads=502
thread_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1233714 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f3622a42008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f366de47110 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f366e984fbb]
/usr/sbin/mysqld(handle_fatal_signal+0x4d5)[0x7f366e4dfbe5]
/lib64/libpthread.so.0(+0xf7e0)[0x7f366dae17e0]
/usr/sbin/mysqld(+0x7a4fd0)[0x7f366e6b5fd0]
/usr/sbin/mysqld(+0x7fa283)[0x7f366e70b283]
/usr/sbin/mysqld(+0x72c2cd)[0x7f366e63d2cd]
/usr/sbin/mysqld(+0x73327d)[0x7f366e64427d]
/usr/sbin/mysqld(_Z34mysql_discard_or_import_tablespaceP3THDP10TABLE_LISTb+0xe8)[0x7f366e3ef2f8]
/usr/sbin/mysqld(_ZN33Sql_cmd_discard_import_tablespace7executeEP3THD+0xc8)[0x7f366e436d88]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x152d)[0x7f366e36b03d]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x324)[0x7f366e372934]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x22a3)[0x7f366e3753b3]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x151)[0x7f366e375931]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x17f)[0x7f366e4348bf]
/usr/sbin/mysqld(handle_one_connection+0x47)[0x7f366e4349f7]
/usr/sbin/mysqld(+0x960c8d)[0x7f366e871c8d]
/lib64/libpthread.so.0(+0x7aa1)[0x7f366dad9aa1]
/lib64/libc.so.6(clone+0x6d)[0x7f366bfbdaad]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f364a42c020): is an invalid pointer
Connection ID (thread ID): 230
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

We think the query pointer is invalid, but we will try to print it anyway.
Query: ALTER TABLE RTODATA_T IMPORT TABLESPACE

170329 09:27:22 mysqld_safe Number of processes running now: 0
170329 09:27:22 mysqld_safe mysqld restarted

From the output, the memory-usage calculation says MariaDB could use up to a little more than 1GB of memory, which should be OK since this server has 8GB.

Am I doing something wrong? I tried searching for this kind of crash but couldn't find much.



 Comments   
Comment by Rafael Gallastegui [ 2017-03-29 ]

My config file:

#
# This group is read by both the client and the server.
# Use it for options that affect everything.
#
[client-server]
#
# These groups are read by MariaDB server.
# Use it for options that only the server (but not clients) should see
#
# See the examples of server my.cnf files in /usr/share/mysql/
#
# this is read by the standalone daemon and embedded servers
[server]
tmpdir = /var/lib/mysql/tmpdir
 
# Disable binary logs, since we don't have replication running here
 
#log-bin = binlog
#binlog_format = mixed
server_id = 1
#log-basename = server1
expire-logs-days = 7
# this is only for the mysqld standalone daemon
[mysqld]
transaction-isolation = READ-COMMITTED
ignore_db_dirs=lost+found
performance-schema=on
# Set host name for reports
report_host = '172.16.14.241'
default_time_zone='America/Chicago'
# Set default character set and collation
character_set_server = 'latin1'
collation-server = 'latin1_bin'
max_allowed_packet = 512M
# Corresponding value for replication
slave_max_allowed_packet = 512M
# log file location
log_error=/var/log/mysqld.log
# directory for the data
datadir=/var/lib/mysql/
# Max number of opened connections
max_connections=500
# Perf tuning
thread_cache_size=512
# Increased from the default value to handle large number of tables
table_open_cache=30000
# Max number of open files. It has to correlate to an OS change
open_files_limit = 102400
# INNODB OPTIONS
# Buffer Pool Size - (IMPORTANT - DEPENDS on MEMORY!!! - 70 to 80% of memory is recommended)
#innodb_buffer_pool_size = 4G
# Use one file per table in the schema
innodb_file_per_table=1
# lock timeout
innodb_lock_wait_timeout = 300
# Perf improvement to decrease flushing
innodb_log_file_size=512M
# Decreases I/O to disk for larger transaction
innodb_log_buffer_size=64M
innodb_buffer_pool_instances=16
innodb_log_files_in_group=2
# Recommended twice CPUs plus number of disks
innodb_thread_concurrency=10
# Next 2 Recommended to ensure data consistency
innodb_flush_method = O_DIRECT
innodb_doublewrite=1
## Avoid statistics update when doing some ops, e.g. show tables
innodb_stats_on_metadata=0
default_storage_engine=innodb

Comment by Rafael Gallastegui [ 2017-03-29 ]

Don't know if it's relevant, but the .ibd files came from a 10.1.19 server.

Comment by John W Smith [ 2017-04-03 ]

I can confirm that the same happens on a Galera cluster in our environment after upgrading to 10.1.22 (ius.io repos on CentOS 7). Unfortunately, this also causes a cascading failure in the cluster, and all nodes go down. Downgrading to 10.1.21 seems to have resolved the issue for us.

Comment by Marko Mäkelä [ 2017-05-09 ]

This could be a regression caused by MDEV-11759, which was released in MariaDB 10.1.22. In MariaDB 10.1.23 there are multiple fixes to encryption; look at the related issues listed in MDEV-12602.

Comment by Marko Mäkelä [ 2017-05-09 ]

This report could be a duplicate of MDEV-12419 which was fixed as part of MDEV-12253.
Does the large import work with MariaDB 10.1.23?

Comment by Marko Mäkelä [ 2017-05-17 ]

On a somewhat related note, Valgrind reported an error for a memcpy() operation that is executed when importing encrypted files:
http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/10056/steps/test/logs/stdio

encryption.innodb-discard-import-change 'cbc,innodb' w2 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2017-05-17 07:42:11
line
==31112== Thread 27:
==31112== Source and destination overlap in memcpy(0x11bb8000, 0x11bb8000, 16384)
==31112==    at 0x4C2C51C: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1018)
==31112==    by 0x10048A5: fil_iterate(fil_iterator_t const&, buf_block_t*, PageCallback&) (fil0fil.cc:6047)
==31112==    by 0x1005392: fil_tablespace_iterate(dict_table_t*, unsigned long, PageCallback&) (fil0fil.cc:6283)
==31112==    by 0xE03456: row_import_for_mysql(dict_table_t*, row_prebuilt_t*) (row0import.cc:3496)

I do not think that the buffers should legitimately overlap in any case here:

			if (encrypted) {
				memcpy(writeptr + (i * size),
					row_compressed ? block->page.zip.data :
					block->frame, size);
			}

I was unable to repeat this failure locally.
The arithmetic done on n_bytes in fil_iterate() looks a bit suspicious to me. Even more alarming is this code:

		byte*		io_buffer = iter.io_buffer;
		block->frame = io_buffer;
 
		if (row_compressed) {
			block->page.zip.data = block->frame + UNIV_PAGE_SIZE;
			io_buffer = block->page.zip.data;
		}

Could block->page.zip.data point past the end of the iter.io_buffer array, overlapping with writeptr (which points to iter.crypt_io_buffer)?

Comment by Marko Mäkelä [ 2018-03-19 ]

I pushed some clean-up to 10.1 so that page checksums will be validated in one function.
I did not find anything that could explain this crash.

Comment by Marko Mäkelä [ 2018-03-21 ]

I merged the clean-up to 10.2, so it will be in both 10.1.32 and 10.2.14 (and 10.3.6). The main functional difference should be that IMPORT no longer fills the error log with an unbounded number of messages about retrying partial page reads when you try to import an .ibd file that is shorter than innodb_page_size. For 10.2, there is also the difference that IMPORT will now continue to read and write multiple pages at a time, as it used to. page_compressed tables will grow during import; it looks like hole-punching has not been implemented yet.

The originally reported crash ought to occur somewhere in the final page flush, after some final adjustments are made through the buffer pool.

rgallastpsi, I wonder if you have enabled innodb_use_mtflush, which is known to cause occasional hangs (MDEV-10843); maybe it could cause crashes as well?
In MariaDB 10.2, an alternative to innodb_use_mtflush was provided by innodb_page_cleaners, and accordingly, innodb_use_mtflush was deprecated in 10.2 (MDEV-13674) and removed in 10.3 (MDEV-13690).

Comment by Marko Mäkelä [ 2020-01-24 ]

rgallastpsi, I just noticed that this ticket is still open.
We made significant improvements to InnoDB table encryption in MariaDB, especially in 10.1.23 and 10.2.6, but also in later releases. Do you still experience this kind of problem with more recent versions?

Generated at Thu Feb 08 07:57:23 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.