[MDEV-29013] ER_KEY_NOT_FOUND/lock timeout upon online alter with long unique indexes Created: 2022-07-02  Updated: 2023-08-16  Resolved: 2023-08-16

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Virtual Columns
Affects Version/s: N/A
Fix Version/s: 11.2.1

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Nikita Malyavin
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-16329 Engine-independent online ALTER TABLE Closed
Relates
relates to MDEV-28808 Test MDEV-16329 (ALTER ONLINE TABLE) ... Stalled
relates to MDEV-29021 ALTER TABLE fails when a stored virtu... Closed

 Description   

ALTERing a table with Long Unique blobs ends up with ER_KEY_NOT_FOUND (and assertion failure)

--source include/have_innodb.inc
--source include/have_debug_sync.inc
 
CREATE TABLE t1 (b TEXT, UNIQUE(b)) ENGINE=InnoDB;
INSERT INTO t1 VALUES ('foo'),('bar');
--send
  set debug_sync= 'now wait_for downgraded';
 
--connect (con1,localhost,root,,test)
set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for goforit';
--send
  ALTER TABLE t1 ADD c INT, ALGORITHM=COPY, LOCK=NONE;
 
--connection default
--reap
DELETE FROM t1;
set debug_sync= 'now signal goforit';
 
--connection con1
--reap
 
DROP TABLE t1;
set debug_sync= reset;

bb-10.10-MDEV-16329 daf879d80d

mysqltest: At line 20: query 'reap' failed: ER_KEY_NOT_FOUND (1032): Can't find record in 't1'

2022-07-03  0:59:18 5 [ERROR] mariadbd: Can't find record in 't1'
2022-07-03  0:59:18 5 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 313, Internal MariaDB error code: 1032
2022-07-03  0:59:19 5 [Warning] Could not remove temporary table: './test/#sql-alter-155a5e-5', error: 0
mariadbd: /data/src/preview-10.10-online-alter-gcov/storage/innobase/trx/trx0trx.cc:377: void trx_t::free(): Assertion `!n_mysql_tables_in_use' failed.
220703  0:59:19 [ERROR] mysqld got signal 6 ;

A further variation with NULLs instead of values (otherwise identical test case) ends with a sudden timeout (quick timeout, not correlating with timeout settings) and same assertion failure:

mysqltest: At line 22: query 'DROP TABLE t1' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction

In this case, no SQL thread errors in the log, but the assertion failure is still there:

Version: '10.10.0-MariaDB-debug-log'  socket: '/mnt8t/src/preview-10.10-online-alter-gcov/mysql-test/var/tmp/mysqld.1.sock'  port: 16000  Source distribution
mariadbd: /data/src/preview-10.10-online-alter-gcov/storage/innobase/trx/trx0trx.cc:377: void trx_t::free(): Assertion `!n_mysql_tables_in_use' failed.
220703  1:02:02 [ERROR] mysqld got signal 6 ;

A bit longer test case causes the same unexpected timeout error and a different assertion failure:

--source include/have_innodb.inc
--source include/have_debug_sync.inc
 
CREATE TABLE t1 (a TEXT, UNIQUE(a)) ENGINE=InnoDB;
CREATE TABLE t2 (b TEXT, UNIQUE(b)) ENGINE=InnoDB;
INSERT INTO t2 VALUES (NULL),(NULL);
--send
  set debug_sync= 'now wait_for downgraded';
 
--connect (con1,localhost,root,,test)
set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for goforit';
--send
ALTER TABLE t2 ADD COLUMN c INT, ALGORITHM=COPY, LOCK=NONE;
 
--connection default
--reap
DELETE FROM t2;
set debug_sync= 'now signal goforit';
 
--connection con1
--reap
 
--connection default
ALTER TABLE t2 FORCE;
ALTER TABLE t1 FORCE;
 
DROP TABLE t1, t2;
set debug_sync= reset;

connection default;
ALTER TABLE t2 FORCE;
Warnings:
Warning	1205	Lock wait timeout exceeded; try restarting transaction
ALTER TABLE t1 FORCE;
bug.dupkey3a 'innodb'                    [ fail ]
        Test ended at 2022-07-03 01:04:03
 
CURRENT_TEST: bug.dupkey3a
mysqltest: At line 25: query 'ALTER TABLE t1 FORCE' failed: <Unknown> (2013): Lost connection to server during query

mariadbd: /data/src/preview-10.10-online-alter-gcov/storage/innobase/row/row0mysql.cc:2675: dberr_t row_rename_table_for_mysql(const char*, const char*, trx_t*, bool): Assertion `err != DB_DUPLICATE_KEY' failed.
220703  1:04:02 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.10.0-MariaDB-debug-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_threads=153
thread_count=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63909 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f7898000db8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f78c0458e18 thread_stack 0x49000
mysys/stacktrace.c:212(my_print_stacktrace)[0x55cc7c8e24e1]
sql/signal_handler.cc:223(handle_fatal_signal)[0x55cc7b93f566]
sigaction.c:0(__restore_rt)[0x7f78ce3b1140]
linux/raise.c:51(__GI_raise)[0x7f78cdee2ce1]
stdlib/abort.c:81(__GI_abort)[0x7f78cdecc537]
intl/loadmsgcat.c:509(get_sysdep_segment_value)[0x7f78cdecc40f]
:0(__GI___assert_fail)[0x7f78cdedb662]
row/row0mysql.cc:2679(row_rename_table_for_mysql(char const*, char const*, trx_t*, bool))[0x55cc7c3a6ce3]
handler/ha_innodb.cc:13701(innobase_rename_table(trx_t*, char const*, char const*, bool))[0x55cc7c06255c]
handler/ha_innodb.cc:14100(ha_innobase::rename_table(char const*, char const*))[0x55cc7c065c91]
sql/handler.cc:5327(handler::ha_rename_table(char const*, char const*))[0x55cc7b9636c9]
sql/sql_table.cc:5126(mysql_rename_table(handlerton*, st_mysql_const_lex_string const*, st_mysql_const_lex_string const*, st_mysql_const_lex_string const*, st_mysql_const_lex_string const*, st_mysql_const_unsigned_lex_string*, unsigned int))[0x55cc7b447532]
sql/sql_table.cc:11083(mysql_alter_table(THD*, st_mysql_const_lex_string const*, st_mysql_const_lex_string const*, Table_specification_st*, TABLE_LIST*, Alter_info*, unsigned int, st_order*, bool, bool))[0x55cc7b468520]
sql/sql_alter.cc:552(Sql_cmd_alter_table::execute(THD*))[0x55cc7b5bb2f1]
sql/sql_parse.cc:5996(mysql_execute_command(THD*, bool))[0x55cc7b25a2a7]
sql/sql_parse.cc:8036(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x55cc7b265c10]
sql/sql_parse.cc:1894(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool))[0x55cc7b23cf11]
sql/sql_parse.cc:1407(do_command(THD*, bool))[0x55cc7b23a0ea]
sql/sql_connect.cc:1418(do_handle_one_connection(CONNECT*, bool))[0x55cc7b5a621b]
sql/sql_connect.cc:1312(handle_one_connection)[0x55cc7b5a5a22]
perfschema/pfs.cc:2203(pfs_spawn_thread)[0x55cc7beb4113]
nptl/pthread_create.c:478(start_thread)[0x7f78ce3a5ea7]
x86_64/clone.S:97(__GI___clone)[0x7f78cdfa4def]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f7898015460): ALTER TABLE t1 FORCE
 
Connection ID (thread ID): 4
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /dev/shm/var_auto_oLAo/mysqld.1/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             385885               385885               processes 
Max open files            1024                 1024                 files     
Max locked memory         12659513344          12659513344          bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       385885               385885               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
 
----------SERVER LOG END-------------



 Comments   
Comment by Nikita Malyavin [ 2022-07-04 ]

The first test case from original description is fixed is fixed in MDEV-29021:

--source include/have_innodb.inc
--source include/have_debug_sync.inc
 
CREATE TABLE t1 (a CHAR(3), b CHAR(3) AS (a)) ENGINE=InnoDB;
INSERT INTO t1 (a) VALUES ('foo'),('bar');
--send
  set debug_sync= 'now wait_for downgraded';
 
--connect (con1,localhost,root,,test)
set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for goforit';
--send
  ALTER TABLE t1 ADD c INT, ALGORITHM=COPY, LOCK=NONE;
 
--connection default
--reap
DELETE FROM t1;
set debug_sync= 'now signal goforit';
 
--connection con1
--reap
 
DROP TABLE t1;
set debug_sync= reset;

Comment by Nikita Malyavin [ 2022-07-05 ]

Please review bb-10.10-MDEV-29021 up to 3e2d2978

Comment by Elena Stepanova [ 2022-07-08 ]

The above patch also fixes this:

--source include/have_debug_sync.inc
 
create table t1 (b blob);
insert into t1 values ('foo'),('bar');
--send
  set debug_sync= 'now wait_for downgraded';
 
--connect (con_alter,localhost,root,,test)
set debug_sync= 'alter_table_online_downgraded signal downgraded wait_for goforit';
--send
  alter table t1 add unique(b), algorithm=copy, lock=none;
 
--connection default
--reap
insert into t1 values ('qux'),('foo');
set debug_sync= 'now signal goforit';
 
--connection con_alter
--reap
select * from t1;
show create table t1;
 
# Cleanup
drop table t1;
set debug_sync= reset;

On current bb-10.10-MDEV-16329 the above succeeds, and the table ends up with a unique key and duplicate values inside. On bb-10.10-MDEV-29021 branch ALTER reasonably fails with ER_DUP_ENTRY.

Comment by Nikita Malyavin [ 2022-07-13 ]

elenst I have included your case into MDEV-29021 patch

Generated at Thu Feb 08 10:05:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.