[MDEV-15872] Crash in online ALTER TABLE…ADD PRIMARY KEY after instant ADD COLUMN…NULL Created: 2018-04-15  Updated: 2018-08-15  Resolved: 2018-08-14

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3.9
Fix Version/s: 10.3.10

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: crash, instant, online-ddl, regression
Environment:

Linux Ubuntu 17.10 but most probably not important.


Attachments: File MDEV-15872.tgz    
Issue Links:
Problem/Incident
is caused by MDEV-16365 Setting a column NOT NULL fails to re... Closed

 Description   

https://api.travis-ci.org/v3/job/364729376/log.txt

10.3 8334aced009e9748e17ce697191e00c9f14e78b9

#3  <signal handler called>
#4  0x00007fe726dd1d06 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x0000000000e081ef in mem_heap_dup (heap=0x7fe6b429c340, data=0x7fe7150040ae, len=4294967294) at /home/travis/src/storage/innobase/include/mem0mem.h:306
#6  0x0000000000f51294 in row_log_table_get_pk_col (col=0x7fe6e40ec050, ifield=0x7fe6d8079b30, dfield=0x7fe6b429c410, heap=0x7fe6b429c340, rec=0x7fe71500409a "", offsets=0x7fe7244a5e90, i=6, page_size=..., max_len=3072) at /home/travis/src/storage/innobase/row/row0log.cc:1187
#7  0x0000000000f51a91 in row_log_table_get_pk (rec=0x7fe71500409a "", index=0x7fe6e40b05f8, offsets=0x7fe7244a5e90, sys=0x0, heap=0x7fe7244a56f8) at /home/travis/src/storage/innobase/row/row0log.cc:1326
#8  0x0000000000f4f460 in row_log_table_delete (rec=0x7fe71500409a "", index=0x7fe6e40b05f8, offsets=0x7fe7244a5e90, sys=0x0) at /home/travis/src/storage/innobase/row/row0log.cc:694
#9  0x000000000104f05a in btr_cur_del_mark_set_clust_rec (block=0x7fe711400e00, rec=0x7fe71500409a "", index=0x7fe6e40b05f8, offsets=0x7fe7244a5e90, thr=0x7fe6b423ac40, entry=0x7fe6b42a0f28, mtr=0x7fe7244a6210) at /home/travis/src/storage/innobase/btr/btr0cur.cc:5064
#10 0x0000000000fa15d9 in row_upd_del_mark_clust_rec (node=0x7fe6b423a8d8, index=0x7fe6e40b05f8, offsets=0x7fe7244a5e90, thr=0x7fe6b423ac40, referenced=0, foreign=false, mtr=0x7fe7244a6210) at /home/travis/src/storage/innobase/row/row0upd.cc:2979
#11 0x0000000000fa1f4b in row_upd_clust_step (node=0x7fe6b423a8d8, thr=0x7fe6b423ac40) at /home/travis/src/storage/innobase/row/row0upd.cc:3161
#12 0x0000000000fa2447 in row_upd (node=0x7fe6b423a8d8, thr=0x7fe6b423ac40) at /home/travis/src/storage/innobase/row/row0upd.cc:3282
#13 0x0000000000fa2936 in row_upd_step (thr=0x7fe6b423ac40) at /home/travis/src/storage/innobase/row/row0upd.cc:3426
#14 0x0000000000f4061b in row_update_for_mysql (prebuilt=0x7fe6b423a0c8) at /home/travis/src/storage/innobase/row/row0mysql.cc:1882
#15 0x0000000000df0097 in ha_innobase::delete_row (this=0x7fe6b429c5e8, record=0x7fe6b436e8b8 "\377") at /home/travis/src/storage/innobase/handler/ha_innodb.cc:8957
#16 0x0000000000acf3e1 in handler::ha_delete_row (this=0x7fe6b429c5e8, buf=0x7fe6b436e8b8 "\377") at /home/travis/src/sql/handler.cc:6285
#17 0x0000000000c623cd in TABLE::delete_row (this=0x7fe6b4334e00) at /home/travis/src/sql/sql_delete.cc:253
#18 0x0000000000c5f78b in mysql_delete (thd=0x7fe6b40155c0, table_list=0x7fe6b402c060, conds=0x0, order_list=0x7fe6b4019df8, limit=9, options=549755813888, result=0x0) at /home/travis/src/sql/sql_delete.cc:748
#19 0x00000000007dc725 in mysql_execute_command (thd=0x7fe6b40155c0) at /home/travis/src/sql/sql_parse.cc:4918
#20 0x00000000007e6364 in mysql_parse (thd=0x7fe6b40155c0, rawbuf=0x7fe6b402bf38 "DELETE FROM `oltp1` LIMIT 9 /* QNO 44526 CON_ID 15 */", length=53, parser_state=0x7fe7244a8450, is_com_multi=false, is_next_command=false) at /home/travis/src/sql/sql_parse.cc:7991
#21 0x00000000007d38d8 in dispatch_command (command=COM_QUERY, thd=0x7fe6b40155c0, packet=0x7fe6b4032841 "DELETE FROM `oltp1` LIMIT 9 /* QNO 44526 CON_ID 15 */ ", packet_length=54, is_com_multi=false, is_next_command=false) at /home/travis/src/sql/sql_parse.cc:1841
#22 0x00000000007d230b in do_command (thd=0x7fe6b40155c0) at /home/travis/src/sql/sql_parse.cc:1386
#23 0x00000000009338df in do_handle_one_connection (connect=0x497d3a0) at /home/travis/src/sql/sql_connect.cc:1402
#24 0x0000000000933661 in handle_one_connection (arg=0x497d3a0) at /home/travis/src/sql/sql_connect.cc:1308
#25 0x00007fe727860e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#26 0x00007fe726d7c2ed in clone () from /lib/x86_64-linux-gnu/libc.so.6
#27 0x0000000000000000 in ?? ()

Some pointers may be invalid and cause the dump to abort.
Query (0x7fe6b402bf38): DELETE FROM `oltp1` LIMIT 9 /* QNO 44526 CON_ID 15 */
Connection ID (thread ID): 15
Status: NOT_KILLED
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on

experimental 4feb78e17fbecf7fc7f2847b49b8c66b54879629

perl /home/travis/rqg/runall-new.pl --duration=350 --threads=6 --redefine=conf/mariadb/instant_add.yy --reporters=Backtrace,ErrorLog,Deadlock --engine=InnoDB --seed=1523411054 --basedir=/home/travis/server --mysqld=--max-statement-time=30 --mysqld=--loose-debug_assert_on_not_freed_memory=0 --grammar=conf/mariadb/oltp.yy --gendata=conf/mariadb/oltp.zz --duration=300 --mysqld=--innodb-flush-log-at-trx-commit=0 --vardir=/home/travis/logs/vardir1_4

Could not reproduce so far.



 Comments   
Comment by Matthias Leich [ 2018-07-23 ]

I have an already simplified RQG test which replays at least the begin of the stacktrace above.

| Version: '10.3.9-MariaDB-log' ...
| 2018-07-23 16:13:07 18 [Note] Detected table cache mutex contention at instance 1: 23% waits. Additional table cache instance activated. Number of instances after activation: 2.
| 180723 16:18:56 [ERROR] mysqld got signal 11 ;
... 
# 2018-07-23T16:19:02 [60510] | Query (0x7f2d40011640): INSERT INTO t1 (col1,col2,col3,col4) VALUES (NULL,NULL,NULL,REPEAT(CAST( NULL AS CHAR(1)), @fill_amount) ), (NULL,NULL,NULL,REPEAT(CAST( NULL AS CHAR(1)), @fill_amount) )  /* E_R Thread4 QNO 381662 CON_ID 18 */
| Connection ID (thread ID): 18
| Status: NOT_KILLED
 
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=11) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62
[Current thread is 1 (Thread 0x7f2dd580e700 (LWP 60606))]
#3  <signal handler called>
#4  0x000055ddd7b21373 in memcpy (__len=4294967295, __src=0x0, __dest=0x7f2d4003af58) at /usr/include/x86_64-linux-gnu/bits/string3.h:53
#5  mem_heap_dup (len=4294967295, data=0x0, heap=<optimized out>) at storage/innobase/include/mem0mem.h:306
#6  row_log_table_get_pk_col (log=0x7f2d4c05cb00, log=0x7f2d4c05cb00, max_len=3072, page_size=..., i=<optimized out>, offsets=0x7f2d4003f110, rec=0x7f2dcbe7c07e <error: Cannot access memory at address 0x7f2dcbe7c07e>, heap=<optimized out>, dfield=0x7f2d4003af10, ifield=0x7f2d4c05c118) at storage/innobase/row/row0log.cc:1170
#7  row_log_table_get_pk (rec=0x7f2dcbe7c07e <error: Cannot access memory at address 0x7f2dcbe7c07e>, index=0x7f2d44052f90, offsets=offsets@entry=0x7f2d4003f110, sys=sys@entry=0x0, heap=heap@entry=0x7f2dd580b468) at storage/innobase/row/row0log.cc:1307
#8  0x000055ddd7b21592 in row_log_table_delete (rec=rec@entry=0x7f2dcbe7c07e <error: Cannot access memory at address 0x7f2dcbe7c07e>, index=index@entry=0x7f2d44052f90, offsets=0x7f2d4003f110, sys=sys@entry=0x0) at storage/innobase/row/row0log.cc:718
#9  0x000055ddd81326c9 in row_undo_ins_remove_clust_rec (node=node@entry=0x7f2d4005b180) at storage/innobase/row/row0uins.cc:120
#10 0x000055ddd813355c in row_undo_ins (node=<optimized out>, thr=<optimized out>) at storage/innobase/row/row0uins.cc:613
#11 0x000055ddd7fec679 in row_undo (thr=0x7f2d4005afb8, node=0x7f2d4005b180) at storage/innobase/row/row0undo.cc:303
#12 row_undo_step (thr=thr@entry=0x7f2d4005afb8) at storage/innobase/row/row0undo.cc:362
#13 0x000055ddd7f97376 in que_thr_step (thr=0x7f2d4005afb8) at storage/innobase/que/que0que.cc:1044
#14 que_run_threads_low (thr=0x7f2d4005afb8) at storage/innobase/que/que0que.cc:1108
#15 que_run_threads (thr=<optimized out>) at storage/innobase/que/que0que.cc:1148
#16 0x000055ddd8028e9a in trx_rollback_to_savepoint_low (trx=trx@entry=0x7f2dd58a4008, savept=savept@entry=0x7f2dd58a4230) at storage/innobase/trx/trx0roll.cc:145
#17 0x000055ddd80291c0 in trx_rollback_to_savepoint (savept=0x7f2dd58a4230, trx=0x7f2dd58a4008) at storage/innobase/trx/trx0roll.cc:193
#18 trx_rollback_last_sql_stat_for_mysql (trx=trx@entry=0x7f2dd58a4008) at storage/innobase/trx/trx0roll.cc:326
#19 0x000055ddd7f08292 in innobase_rollback (hton=<optimized out>, thd=0x7f2d400009a8, rollback_trx=<optimized out>) at storage/innobase/handler/ha_innodb.cc:4745
#20 0x000055ddd7dac0c1 in ha_rollback_trans (thd=thd@entry=0x7f2d400009a8, all=all@entry=false) at sql/handler.cc:1721
#21 0x000055ddd7cc4a57 in trans_rollback_stmt (thd=thd@entry=0x7f2d400009a8) at sql/transaction.cc:583
#22 0x000055ddd7bebc72 in mysql_execute_command (thd=thd@entry=0x7f2d400009a8) at sql/sql_parse.cc:6328
#23 0x000055ddd7bf40f9 in mysql_parse (thd=thd@entry=0x7f2d400009a8, rawbuf=<optimized out>, length=210, parser_state=parser_state@entry=0x7f2dd580d650, is_com_multi=<optimized out>, is_next_command=<optimized out>) at sql/sql_parse.cc:8073
#24 0x000055ddd7bf68d4 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x7f2d400009a8, packet=packet@entry=0x7f2d40009349 "INSERT INTO t1 (col1,col2,col3,col4) VALUES (NULL,NULL,NULL,REPEAT(CAST( NULL AS CHAR(1)), @fill_amount) ), (NULL,NULL,NULL,REPEAT(CAST( NULL AS CHAR(1)), @fill_amount) )  /* E_R Thread4 QNO 381662 CON_ID 18 */ ", packet_length=packet_length@entry=211, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at sql/sql_parse.cc:1847
#25 0x000055ddd7bf7546 in do_command (thd=0x7f2d400009a8) at sql/sql_parse.cc:1391
#26 0x000055ddd7cb8114 in do_handle_one_connection (connect=connect@entry=0x55dddb3681f8) at sql/sql_connect.cc:1402
#27 0x000055ddd7cb8294 in handle_one_connection (arg=0x55dddb3681f8) at sql/sql_connect.cc:1308

The likelihood to replay has become better because of the simplification but
nevertheless 16 - 32 trials are required.
Replay on
10.3 commit 141a5b24843c626f545a0c2f49013b54da194c0d compiled without debug
The server claims to be a 10.3.9-MariaDB-log.
Never a replay on
10.2 commit d0d073b1aaa30997307bc7aa686d3715f8c22da0 compiled without debug
The server claims to be a 10.2.17-MariaDB

I will upload the files required for replay.

Comment by Matthias Leich [ 2018-08-13 ]

Replay again with
10.3 commit 5e7496e2eaf01d3be79fca3f199a8d3e597af208 2018-08-12T14:52:37+02:00
The server claims to be a 10.3.9-MariaDB-debug-log.

Instructions for replay attempts
---------------------------------------
git clone https://github.com/mleich1/rqg --branch experimental RQG_mleich1
cd RQG_mleich1
tar xvzf <Path_to_MDEV-15872.tgz>
Edit MDEV-15872.sh so that BASEDIR points to your binaries
./MDEV-15872.sh

Comment by Marko Mäkelä [ 2018-08-14 ]

PRIMARY KEY columns are supposed to be NOT NULL, but here we seem to be trying to copy a NULL value:

#5  mem_heap_dup (len=4294967295, data=0x0, heap=<optimized out>) at storage/innobase/include/mem0mem.h:306
#6  row_log_table_get_pk_col (log=0x7f2d4c05cb00, log=0x7f2d4c05cb00, max_len=3072, page_size=..., i=<optimized out>, offsets=0x7f2d4003f110, rec=0x7f2dcbe7c07e <error: Cannot access memory at address 0x7f2dcbe7c07e>, heap=<optimized out>, dfield=0x7f2d4003af10, ifield=0x7f2d4c05c118) at storage/innobase/row/row0log.cc:1170

The problem should be that this code for handling instantly added columns is not checking if the initial default value of the instantly added column was NULL:

		field = static_cast<const byte*>(
			log->defaults->fields[n_default_cols].data);
		len = log->defaults->fields[i - DATA_N_SYS_COLS].len;

I believe that the scenario must be something like this:

  1. Instantly add a column that allows NULL value.
  2. Insert some data, with implicit NULL in the instantly added column
  3. Execute DELETE (or a ROLLBACK of an INSERT) while ALTER TABLE…ADD PRIMARY KEY is in progress.

Unfortunately, I was unable to trigger this. I am getting "Invalid use of NULL value" flagged for the ALTER TABLE before the DELETE gets a chance to be logged.

Comment by Marko Mäkelä [ 2018-08-14 ]

A necessary condition that was present in MDEV-11369 and was modified in MDEV-14168 was inadvertently deleted by MDEV-16365, which was a follow-up fix of MDEV-14168.

The test requires ALTER IGNORE. Here is a deterministic test (which crashes the server) and the fix:

diff --git a/mysql-test/suite/innodb/t/instant_alter_debug.test b/mysql-test/suite/innodb/t/instant_alter_debug.test
index 5a73ed65d72..ff1650ea191 100644
--- a/mysql-test/suite/innodb/t/instant_alter_debug.test
+++ b/mysql-test/suite/innodb/t/instant_alter_debug.test
@@ -232,6 +232,30 @@ INSERT INTO t1 SET a=1;
 INSERT INTO t1 SET a=2,b=3,c=4;
 SET DEBUG_SYNC = 'now SIGNAL logged';
 
+connection ddl;
+reap;
+
+connection default;
+SET DEBUG_SYNC = RESET;
+SELECT * FROM t1;
+
+--echo #
+--echo # MDEV-15872 Server crashed in row_log_table_delete
+--echo #
+ALTER TABLE t1 ADD COLUMN d INT, ALGORITHM=INSTANT;
+UPDATE t1 SET d=1;
+
+connection ddl;
+SET DEBUG_SYNC='row_log_table_apply1_before SIGNAL copied WAIT_FOR logged';
+send ALTER IGNORE TABLE t1 DROP PRIMARY KEY, ADD PRIMARY KEY (a,d);
+
+connection default;
+SET DEBUG_SYNC = 'now WAIT_FOR copied';
+BEGIN;
+INSERT INTO t1 SET a=3;
+ROLLBACK;
+SET DEBUG_SYNC = 'now SIGNAL logged';
+
 connection ddl;
 reap;
 disconnect ddl;
diff --git a/storage/innobase/row/row0log.cc b/storage/innobase/row/row0log.cc
index 386bd2786a0..a6e35744446 100644
--- a/storage/innobase/row/row0log.cc
+++ b/storage/innobase/row/row0log.cc
@@ -1164,7 +1164,6 @@ row_log_table_get_pk_col(
 	field = rec_get_nth_field(rec, offsets, i, &len);
 
 	if (len == UNIV_SQL_NULL) {
-
 		if (!log->allow_not_null) {
 			return(DB_INVALID_NULL);
 		}
@@ -1173,6 +1172,9 @@ row_log_table_get_pk_col(
 
 		field = static_cast<const byte*>(
 			log->defaults->fields[n_default_cols].data);
+		if (!field) {
+			return(DB_INVALID_NULL);
+		}
 		len = log->defaults->fields[i - DATA_N_SYS_COLS].len;
 	}
 

Comment by Matthias Leich [ 2018-08-15 ]

10.3 + the MDEV-15782 patch "survived"

  • 1.5 hours high load with ~ 110 RQG tests in parallel
  • in sum ~ 900 RQG tests (replay setup and grammar, 300s duration)
    without any bad effect showing up
    == The bug had his "chance" and seems to be fixed.
Generated at Thu Feb 08 08:24:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.