Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-16962

Assertion `!error || !ot_ctx.can_recover_from_failed_open()' failed in open_purge_table upon concurrent ALTER and FLUSH

Details

    Description

      Note: Run with --repeat=N --mem. N can vary a lot, there are periods when it fails for me every time within 5 attempts, and sometimes it takes over 100. Luckily the test is very fast when it runs in shm, so set a really big number for starters.

      --source include/have_innodb.inc
       
      CREATE TABLE t1 ( 
        pk SERIAL,
        c VARCHAR(128),
        d DATE,
        vd DATE AS (d) VIRTUAL,
        PRIMARY KEY(pk),
        KEY(vd,c)
      ) ENGINE=InnoDB;
      INSERT IGNORE INTO t1 (pk,c) VALUES (1,'foo');
       
      --connect (con1,localhost,root,,test)
      REPLACE INTO t1 (pk,c) VALUES (1,'bar');
       
      --connection default
      --send
        ALTER TABLE t1 ADD FULLTEXT KEY(c), ALGORITHM=COPY;
      --connection con1
      FLUSH TABLES;
       
      # Cleanup
      --disconnect con1
      --connection default
      --reap
      DROP TABLE t1;
      

      10.3 8716bb3b72b

      mysqld: /data/src/10.3/sql/sql_class.cc:4672: TABLE* open_purge_table(THD*, const char*, size_t, const char*, size_t): Assertion `!error || !ot_ctx.can_recover_from_failed_open()' failed.
      180818  1:17:07 [ERROR] mysqld got signal 6 ;
       
      #7  0x00007f5a635fdee2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
      #8  0x00005602b08e90ab in open_purge_table (thd=0x7f5a28000b00, db=0x7f5a48ff70a0 "test", dblen=4, tb=0x7f5a48ff7240 "t1", tblen=2) at /data/src/10.3/sql/sql_class.cc:4672
      #9  0x00005602b0f6f72b in innodb_acquire_mdl (thd=0x7f5a28000b00, table=0x7f5a0c1502f8) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:20738
      #10 0x00005602b0f6f926 in innodb_find_table_for_vc (thd=0x7f5a28000b00, table=0x7f5a0c1502f8) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:20793
      #11 0x00005602b0f6fe99 in innobase_allocate_row_for_vcol (thd=0x7f5a28000b00, index=0x7f5a0c196e88, heap=0x7f5a48ff7670, table=0x7f5a48ff7680, record=0x7f5a48ff7688, storage=0x7f5a48ff7678) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:20947
      #12 0x00005602b111b01e in row_vers_build_clust_v_col (row=0x7f5a28014108, clust_index=0x7f5a0c133158, index=0x7f5a0c196e88, heap=0x7f5a28014030, vcol_info=0x5602b4adfd08) at /data/src/10.3/storage/innobase/row/row0vers.cc:465
      #13 0x00005602b111c1b0 in row_vers_old_has_index_entry (also_curr=true, rec=0x7f5a5d3fc07f "", mtr=0x7f5a48ff7ab0, index=0x7f5a0c196e88, ientry=0x7f5a28014698, roll_ptr=16888498624468088, trx_id=0, vcol_info=0x5602b4adfd08) at /data/src/10.3/storage/innobase/row/row0vers.cc:959
      #14 0x00005602b10d09f8 in row_purge_poss_sec (node=0x5602b4adfb68, index=0x7f5a0c196e88, entry=0x7f5a28014698, sec_pcur=0x7f5a48ff8050, sec_mtr=0x7f5a48ff82b0, is_tree=false) at /data/src/10.3/storage/innobase/row/row0purge.cc:346
      #15 0x00005602b10d16cb in row_purge_remove_sec_if_poss_leaf (node=0x5602b4adfb68, index=0x7f5a0c196e88, entry=0x7f5a28014698) at /data/src/10.3/storage/innobase/row/row0purge.cc:597
      #16 0x00005602b10d1bb3 in row_purge_remove_sec_if_poss (node=0x5602b4adfb68, index=0x7f5a0c196e88, entry=0x7f5a28014698) at /data/src/10.3/storage/innobase/row/row0purge.cc:710
      #17 0x00005602b10d1dc0 in row_purge_del_mark (node=0x5602b4adfb68) at /data/src/10.3/storage/innobase/row/row0purge.cc:784
      #18 0x00005602b10d32d9 in row_purge_record_func (node=0x5602b4adfb68, undo_rec=0x5602b4ae0118 "", thr=0x5602b4adf998, updated_extern=false) at /data/src/10.3/storage/innobase/row/row0purge.cc:1178
      #19 0x00005602b10d3616 in row_purge (node=0x5602b4adfb68, undo_rec=0x5602b4ae0118 "", thr=0x5602b4adf998) at /data/src/10.3/storage/innobase/row/row0purge.cc:1245
      #20 0x00005602b10d39c4 in row_purge_step (thr=0x5602b4adf998) at /data/src/10.3/storage/innobase/row/row0purge.cc:1331
      #21 0x00005602b1057484 in que_thr_step (thr=0x5602b4adf998) at /data/src/10.3/storage/innobase/que/que0que.cc:1046
      #22 0x00005602b10576b7 in que_run_threads_low (thr=0x5602b4adf998) at /data/src/10.3/storage/innobase/que/que0que.cc:1108
      #23 0x00005602b10578a8 in que_run_threads (thr=0x5602b4adf998) at /data/src/10.3/storage/innobase/que/que0que.cc:1148
      #24 0x00005602b11248fb in srv_task_execute () at /data/src/10.3/storage/innobase/srv/srv0srv.cc:2437
      #25 0x00005602b1124aa0 in srv_worker_thread (arg=0x0) at /data/src/10.3/storage/innobase/srv/srv0srv.cc:2485
      #26 0x00007f5a652d4494 in start_thread (arg=0x7f5a48ff9700) at pthread_create.c:333
      #27 0x00007f5a636ba93f in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      Note: The provided test case does not fail on 10.2 for me (still fails on 10.3), however the assertion failure has been observed on 10.2 as well, so I'm adding it to the list of affected/to-be-fixed versions.

      Example of 10.2 failure, ab1ce2204e959bea596817494e932754ab5cbe88

      Version: '10.2.19-MariaDB-debug-log'  socket: '/home/travis/logs/vardir/mysql.sock'  port: 19300  Source distribution
      mysqld: /home/travis/src/sql/sql_class.cc:4404: TABLE* open_purge_table(THD*, const char*, size_t, const char*, size_t): Assertion `!error || !ot_ctx.can_recover_from_failed_open()' failed.
      181020  9:19:58 [ERROR] mysqld got signal 6 ;
       
      #3  0x00007f98a6eaeca2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
      #4  0x000056466a9f376c in open_purge_table (thd=0x7f98a0002260, db=0x7f98860e34e0 "test", dblen=4, tb=0x7f98860e3680 "t5", tblen=2) at /home/travis/src/sql/sql_class.cc:4404
      #5  0x000056466aedd338 in innodb_acquire_mdl (thd=0x7f98a0002260, table=0x7f98a05809b8) at /home/travis/src/storage/innobase/handler/ha_innodb.cc:21417
      #6  0x000056466aedd53b in innodb_find_table_for_vc (thd=0x7f98a0002260, table=0x7f98a05809b8) at /home/travis/src/storage/innobase/handler/ha_innodb.cc:21472
      #7  0x000056466aeddac3 in innobase_allocate_row_for_vcol (thd=0x7f98a0002260, index=0x7f98a0208ac8, heap=0x7f98860e3af0, table=0x7f98860e3b00, record=0x7f98860e3b08, storage=0x7f98860e3af8) at /home/travis/src/storage/innobase/handler/ha_innodb.cc:21627
      #8  0x000056466b06235f in row_vers_build_clust_v_col (row=0x7f98a0080e00, clust_index=0x7f98a009e898, index=0x7f98a0208ac8, heap=0x7f98a0080c80, vcol_info=0x56466d18b820) at /home/travis/src/storage/innobase/row/row0vers.cc:470
      #9  0x000056466b063431 in row_vers_old_has_index_entry (also_curr=true, rec=0x7f989436f2c1 ' ' <repeats 49 times>, mtr=0x7f98860e3f60, index=0x7f98a0208ac8, ientry=0x7f98a0724af8, roll_ptr=9851624223149860, trx_id=10079, vcol_info=0x56466d18b820) at /home/travis/src/storage/innobase/row/row0vers.cc:966
      #10 0x000056466b01d7ee in row_purge_poss_sec (node=0x56466d18b680, index=0x7f98a0208ac8, entry=0x7f98a0724af8, sec_pcur=0x7f98860e4510, sec_mtr=0x7f98860e4770, is_tree=false) at /home/travis/src/storage/innobase/row/row0purge.cc:346
      #11 0x000056466b01e42a in row_purge_remove_sec_if_poss_leaf (node=0x56466d18b680, index=0x7f98a0208ac8, entry=0x7f98a0724af8) at /home/travis/src/storage/innobase/row/row0purge.cc:604
      #12 0x000056466b01e90a in row_purge_remove_sec_if_poss (node=0x56466d18b680, index=0x7f98a0208ac8, entry=0x7f98a0724af8) at /home/travis/src/storage/innobase/row/row0purge.cc:717
      #13 0x000056466b01eb11 in row_purge_del_mark (node=0x56466d18b680) at /home/travis/src/storage/innobase/row/row0purge.cc:791
      #14 0x000056466b01f763 in row_purge_record_func (node=0x56466d18b680, undo_rec=0x56466d91ce50 "", thr=0x56466d18b5c8, updated_extern=false) at /home/travis/src/storage/innobase/row/row0purge.cc:1102
      #15 0x000056466b01f9df in row_purge (node=0x56466d18b680, undo_rec=0x56466d91ce50 "", thr=0x56466d18b5c8) at /home/travis/src/storage/innobase/row/row0purge.cc:1163
      #16 0x000056466b01fd1b in row_purge_step (thr=0x56466d18b5c8) at /home/travis/src/storage/innobase/row/row0purge.cc:1249
      #17 0x000056466afb0468 in que_thr_step (thr=0x56466d18b5c8) at /home/travis/src/storage/innobase/que/que0que.cc:1049
      #18 0x000056466afb0663 in que_run_threads_low (thr=0x56466d18b5c8) at /home/travis/src/storage/innobase/que/que0que.cc:1111
      #19 0x000056466afb080d in que_run_threads (thr=0x56466d18b5c8) at /home/travis/src/storage/innobase/que/que0que.cc:1151
      #20 0x000056466b06b608 in srv_task_execute () at /home/travis/src/storage/innobase/srv/srv0srv.cc:2538
      #21 0x000056466b06b756 in srv_worker_thread (arg=0x0) at /home/travis/src/storage/innobase/srv/srv0srv.cc:2585
      #22 0x00007f98a7a70184 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
      #23 0x00007f98a6f7cffd in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      Attachments

        Issue Links

          Activity

            serg, the assertion failure here seems to be local to open_purge_table(). InnoDB should be OK if opening the table fails; it should assume that the table is no longer available, and that the old undo log records can be discarded.

            It looks like we should seriously come up with some InnoDB data format change that removes the need to compute virtual columns during the purge of history.
            Ideally virtual column values should only be computed once, at the SQL layer. (Unfortunately, InnoDB’s SET NULL or CASCADE handling related to FOREIGN KEY still requires InnoDB to evaluate indexed virtual column values in connected tables.)

            Currently, purge evaluates indexed virtual columns inside row_purge_poss_sec(), which is the only caller of row_vers_old_has_index_entry(also_curr=true, …). This happens only when there is only one version of the clustered index record, and it is not delete-marked.

            In the test case, there must exist exactly 1 record in clustered index before the ALTER TABLE statement, with pk=1. The column c would be updated. The ‘only one version’ condition should only hold before the REPLACE statement is executed, or (in 10.3 MDEV-12288) after we have reset the DB_TRX_ID,DB_ROLL_PTR in the clustered index record as part of purge.

            It looks like the stack trace must be for the delete-marked record (vd=NULL,c='foo',pk=1) in the secondary index.

            The virtual column evaluation could be avoided in this case, because no base column of an indexed virtual column is being updated. There is INDEX(vd,c), but only the persistent column c is being updated, and vd would remain unchanged.

            The task in purge is to find out if the delete-marked record in a secondary index corresponds to any still visible version in the clustered index, that is, if the secondary index record can be removed. In this case, we could also notice a mismatch in the persistent column c already, so we should not bother to evaluate any virtual columns.

            In the general case, I believe that the only way to avoid evaluating virtual columns in purge is to change the format of indexes on virtual columns, so that these index entries will include DB_TRX_ID and possibly DB_ROLL_PTR. This would also simplify MVCC and implicit locking. This should probably be an option for all secondary indexes in InnoDB.

            marko Marko Mäkelä added a comment - serg , the assertion failure here seems to be local to open_purge_table() . InnoDB should be OK if opening the table fails; it should assume that the table is no longer available, and that the old undo log records can be discarded. It looks like we should seriously come up with some InnoDB data format change that removes the need to compute virtual columns during the purge of history. Ideally virtual column values should only be computed once, at the SQL layer. (Unfortunately, InnoDB’s SET NULL or CASCADE handling related to FOREIGN KEY still requires InnoDB to evaluate indexed virtual column values in connected tables.) Currently, purge evaluates indexed virtual columns inside row_purge_poss_sec() , which is the only caller of row_vers_old_has_index_entry(also_curr=true, …) . This happens only when there is only one version of the clustered index record, and it is not delete-marked. In the test case, there must exist exactly 1 record in clustered index before the ALTER TABLE statement, with pk=1 . The column c would be updated. The ‘only one version’ condition should only hold before the REPLACE statement is executed, or (in 10.3 MDEV-12288 ) after we have reset the DB_TRX_ID,DB_ROLL_PTR in the clustered index record as part of purge. It looks like the stack trace must be for the delete-marked record (vd=NULL,c='foo',pk=1) in the secondary index. The virtual column evaluation could be avoided in this case, because no base column of an indexed virtual column is being updated. There is INDEX(vd,c) , but only the persistent column c is being updated, and vd would remain unchanged. The task in purge is to find out if the delete-marked record in a secondary index corresponds to any still visible version in the clustered index, that is, if the secondary index record can be removed. In this case, we could also notice a mismatch in the persistent column c already, so we should not bother to evaluate any virtual columns. 
In the general case, I believe that the only way to avoid evaluating virtual columns in purge is to change the format of indexes on virtual columns, so that these index entries will include DB_TRX_ID and possibly DB_ROLL_PTR . This would also simplify MVCC and implicit locking. This should probably be an option for all secondary indexes in InnoDB.

            julien.fritsch the problem is that it is a bug in 10.2+. What'd we do in 10.2-10.5 with it?

            nikitamalyavin Nikita Malyavin added a comment - julien.fritsch the problem is that it is a bug in 10.2+. What'd we do in 10.2-10.5 with it?

            So we are having a three thread race condition with a deadlock backoff
            For visuals, the situation can be described by the picture attached.

            More precisely, the following happens:
            T1: NOCOPY ALTER TABLE begins, and eventually it holds MDL_SHARED_NO_WRITE lock;
            T2: FLUSH TABLES begins. it sets share->tdc->flushed = true
            T3: purge on a record with virtual column begins. it is going to open a table. MDL_SHARED_READ lock is acquired therefore.
            Since share->tdc->flushed is set, it waits for a TDC purge end.
            T1: is going to elevate MDL LOCK to exclusive and therefore has to set other waiters to back off.
            T3: receives VICTIM status, reports a DEADLOCK, sets OT_BACKOFF_AND_RETRY to Open_table_context::m_action

            nikitamalyavin Nikita Malyavin added a comment - So we are having a three thread race condition with a deadlock backoff For visuals, the situation can be described by the picture attached. More precisely, the following happens: T1: NOCOPY ALTER TABLE begins, and eventually it holds MDL_SHARED_NO_WRITE lock; T2: FLUSH TABLES begins. it sets share->tdc->flushed = true T3: purge on a record with virtual column begins. it is going to open a table. MDL_SHARED_READ lock is acquired therefore. Since share->tdc->flushed is set, it waits for a TDC purge end. T1: is going to elevate MDL LOCK to exclusive and therefore has to set other waiters to back off. T3: receives VICTIM status, reports a DEADLOCK, sets OT_BACKOFF_AND_RETRY to Open_table_context::m_action

            here the deterministic test goes

            --source include/have_innodb.inc
             
            CREATE TABLE t1 (
              pk SERIAL,
              c VARCHAR(128),
              d DATE,
              vd DATE AS (d) VIRTUAL,
              PRIMARY KEY(pk),
              KEY(vd,c)
            ) ENGINE=InnoDB;
            INSERT IGNORE INTO t1 (pk,c) VALUES (1,'foo');
             
            --connect (con1,localhost,root,,test)
            set global innodb_debug_sync="after_open_table_mdl_shared wait_for purge";
            REPLACE INTO t1 (pk,c) VALUES (1,'bar');
             
            --connection default
            set debug_sync="alter_table_before_rename_result_table wait_for purge";
            --send
              ALTER TABLE t1 ADD FULLTEXT KEY(c), ALGORITHM=COPY;
            --connection con1
            set debug_sync="after_flush_unlock signal purge";
            FLUSH TABLES;
             
            # Cleanup
            --disconnect con1
            --connection default
            --reap
            DROP TABLE t1;
            
            

            nikitamalyavin Nikita Malyavin added a comment - here the deterministic test goes --source include/have_innodb.inc   CREATE TABLE t1 ( pk SERIAL, c VARCHAR (128), d DATE , vd DATE AS (d) VIRTUAL, PRIMARY KEY (pk), KEY (vd,c) ) ENGINE=InnoDB; INSERT IGNORE INTO t1 (pk,c) VALUES (1, 'foo' );   --connect (con1,localhost,root,,test) set global innodb_debug_sync= "after_open_table_mdl_shared wait_for purge" ; REPLACE INTO t1 (pk,c) VALUES (1, 'bar' );   --connection default set debug_sync= "alter_table_before_rename_result_table wait_for purge" ; --send ALTER TABLE t1 ADD FULLTEXT KEY (c), ALGORITHM=COPY; --connection con1 set debug_sync= "after_flush_unlock signal purge" ; FLUSH TABLES;   # Cleanup --disconnect con1 --connection default --reap DROP TABLE t1;
            nikitamalyavin Nikita Malyavin added a comment - sanja Please review the fix: https://github.com/MariaDB/server/commit/cd63fa047994b58d070c5f9db76397b23cd3261c

            OK to push

            sanja Oleksandr Byelkin added a comment - OK to push

            Removing the remaining effort since this issue is closed.

            julien.fritsch Julien Fritsch added a comment - Removing the remaining effort since this issue is closed.

            People

              nikitamalyavin Nikita Malyavin
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.