[MDEV-32578] row_merge_fts_doc_tokenize() handles FTS plugin parser inconsistently Created: 2023-10-25  Updated: 2023-10-29  Resolved: 2023-10-27

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2
Fix Version/s: 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: upstream-fix

Issue Links:
Blocks
blocks MDEV-32579 Merge new release of InnoDB 5.7.44 to... Closed
Relates
relates to MDEV-10267 Add "ngram" support to MariaDB Open

 Description   

The recent release of MySQL 8.0.35 includes the following change:
Bug#35432973 InnoDB: processing single character tokens with FTS parser plugin

When a tokenizer plugin interface was added to MySQL 5.7, fts_tokenize_ctx::processed_len got a second meaning, which is only partly implemented in row_merge_fts_doc_tokenize().



 Comments   
Comment by Marko Mäkelä [ 2023-10-26 ]

I found a rather simple test case that will crash MySQL 8.0.34 but not MySQL 5.7.43. The code has been refactored between MySQL 5.7 and 8.0. I will need to investigate what exactly has changed between the two versions.

Comment by Marko Mäkelä [ 2023-10-27 ]

In MySQL 5.7.44, with or without the fix, the only way I can reproduce the crash is by applying the following patch:

diff --git a/storage/innobase/row/row0ftsort.cc b/storage/innobase/row/row0ftsort.cc
index 36ce6eb6cca..03149b68a21 100644
--- a/storage/innobase/row/row0ftsort.cc
+++ b/storage/innobase/row/row0ftsort.cc
@@ -468,7 +468,7 @@ row_merge_fts_doc_tokenize(
 	row_merge_buf_t* buf;
 	dfield_t*	field;
 	fts_string_t	t_str;
-	ibool		buf_full = FALSE;
+	ibool		buf_full = TRUE;
 	byte		str_buf[FTS_MAX_WORD_LEN + 1];
 	ulint		data_size[FTS_NUM_AUX_INDEX];
 	ulint		n_tuple[FTS_NUM_AUX_INDEX];

However, this will also crash when the fix is present:

mysql-5.7.44 with the above patch

2023-10-27 11:28:35 0x7f4c437fe6c0  InnoDB: Assertion failure in thread 139965526697664 in file row0ftsort.cc line 837
InnoDB: Failing assertion: t_ctx.rows_added[t_ctx.buf_used]

In MySQL 8.0, the buffer size calculation is quite different. ddl::Context::scan_buffer_size will allocate a buffer by dividing innodb_sort_buffer_size (default: 1MiB) by the number of threads (2 in this case) and index partitions (hard-coded as 6 in the file format). These 54613 bytes will then be passed on to key_buffer.m_buffer_size. It seems that the intention was to round this up to some multiple of 4096 bytes, but that did not happen.
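To illustrate the rounding that apparently did not happen, here is a minimal sketch of rounding the 54613-byte figure quoted above up to a multiple of 4096 bytes. The helper names round_up_to_multiple() and is_multiple_of() are hypothetical and are not taken from the MySQL or MariaDB source code; only the two numbers come from this comment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a MySQL/MariaDB function):
// true if size is an exact multiple of the given granularity.
inline bool is_multiple_of(std::size_t size, std::size_t multiple)
{
  assert(multiple != 0);
  return size % multiple == 0;
}

// Hypothetical helper (not a MySQL/MariaDB function):
// round size up to the next multiple of the given granularity.
inline std::size_t round_up_to_multiple(std::size_t size, std::size_t multiple)
{
  assert(multiple != 0);
  return (size + multiple - 1) / multiple * multiple;
}
```

With these helpers, is_multiple_of(54613, 4096) is false, and round_up_to_multiple(54613, 4096) yields 57344 (14 * 4096), which is the kind of size one would expect if the rounding had been applied.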

Whatever I tried, I was unable to reproduce a crash in MySQL 5.7 with an SQL test case that crashes MySQL 8.0. I think that I will have to apply the MySQL 5.7 fix more or less as is, without adding a test case. An additional challenge is that we do not actually have an n-gram tokenizer in MariaDB (MDEV-10267). We only have tests for a simple_parser.
