[MDEV-20621] FULLTEXT INDEX activity causes InnoDB hang Created: 2019-09-18  Updated: 2020-09-30  Resolved: 2019-10-25

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.1.41, 10.2, 10.3, 10.4
Fix Version/s: 10.2.28, 10.1.42, 10.3.19, 10.4.9

Type: Bug Priority: Critical
Reporter: Stevo Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 0
Labels: None
Environment:

Cloudlinux 7.7
Google Cloud Compute Engine


Attachments: HTML File gdb    
Issue Links:
Problem/Incident
causes MDEV-20987 InnoDB fails to start when fts table ... Closed
causes MDEV-23856 fts_optimize_wq accessed after shutdo... Closed
Relates
relates to MDEV-19529 InnoDB hang on DROP FULLTEXT INDEX Closed

 Description   

We are experiencing technical difficulties with the latest MariaDB 10.1.41-MariaDB.
This is happening on only one server, although we have others with the same system package versions.

The database freezes and does not accept new connections.
The error log fills with repeated messages such as:

InnoDB: Warning: a long semaphore wait:
--Thread 140300680931072 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:
Mutex at 0x7f9e26c112e8 '&dict_sys->mutex', lock var 1
Last time reserved by thread 140300697716480 in file not yet reserved line 0, waiters flag 1
InnoDB: Warning: semaphore wait:
--Thread 140300680931072 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:
Mutex at 0x7f9e26c112e8 '&dict_sys->mutex', lock var 1
Last time reserved by thread 140300697716480 in file not yet reserved line 0, waiters flag 1

We can provide more error log data, but not in public.



 Comments   
Comment by Stevo [ 2019-09-18 ]

Bonus tip: it looks like this always happens approximately 1 hour and 30 minutes after the service is started.

Comment by Marko Mäkelä [ 2019-09-19 ]

Novkovski, I would like to see the stack traces of all threads when this happens.

gdb -ex "set pagination 0" -ex "thread apply all backtrace" --batch -p $(pgrep -x mysqld)

I do not have any idea what could cause this to happen 1 hour and 30 minutes after the service has been started. Could it be due to some external monitoring or maintenance activity? In MDEV-13983 we have something similar, possibly related to InnoDB deadlock detection or to LOCK_thread_count being acquired in order to dump information about a conflicting transaction.

Comment by Stevo [ 2019-09-19 ]

It happened again after approx. 1 hour and 30-40 minutes.
Here is the output: gdb

Comment by Marko Mäkelä [ 2019-09-19 ]

Has the gdb output been sanitized in some way, or are the debugging symbols missing? I do not see any parameters to function calls, or source code line numbers.

It would be helpful if you could save a core dump that you could then analyze with commands I provide. We need to find out which thread is holding dict_sys->mutex. That should be relatively easy: print/x *dict_sys->mutex should reveal the thread identifier in hexadecimal. You can use thread find 0x… to find the thread. Finally, switch to that thread and issue backtrace. But I would likely still need a complete backtrace of all threads.

Note that you can upload any confidential files to ftp.mariadb.com.

Comment by Stevo [ 2019-09-20 ]

I'm not much of a Linux technical person, so please guide me through whatever I need to do.
It happened again: once MariaDB waits for some locks, the whole server stops working and the mysql_error.log files fill up.
I have temporarily set up a cron job to restart MariaDB every hour, which works around the issue.

Comment by Stevo [ 2019-09-20 ]

Isn't the latest change a fix for this issue?
https://github.com/MariaDB/server/commit/8a79fa0e4d0385818da056f7a4a39fde95d62fe3

Comment by Marko Mäkelä [ 2019-09-30 ]

Novkovski, thank you for noticing. Yes, your gdb output suggests that your report could be a duplicate of MDEV-19529.

There are many problems with the InnoDB fulltext search implementation, and there are not many useful regression tests. We also have some other fixes in the works that have not gone through stress tests (or code review) yet.

Comment by Thirunarayanan Balathandayuthapani [ 2019-10-01 ]

This is a different issue from MDEV-19529: that one dealt with a wait between ALTER and fts_optimize_thread, while this one involves a wait between srv_master_thread and fts_optimize_thread.
I will work on it. Thanks for raising the report, Novkovski.

Comment by Matthias Leich [ 2019-10-16 ]

Results of RQG testing on bb-10.2-thiru commit 0b91f74906c8dcbcc1dac486fcc66c1e9c0c603a
- > 1500 RQG tests were executed
There was a surprisingly low fraction of failing tests.
All asserts/crashes are already covered by open bugs in JIRA except one
- mysqld: sql/sql_list.h:684: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
  happening during shutdown of the server
- per Thiru: unlikely that it's caused by the changes in bb-10.3-thiru
- occurring only once == attempts to replay it on actual 10.2 have too low a chance
https://jira.mariadb.org/browse/MDEV-20843

Comment by Marko Mäkelä [ 2019-10-17 ]

This is a welcome step in the right direction, but I think that this needs some more work.

First of all, the in_queue flag should not be stored in a bit-field that shares storage with other bit-fields protected by a different mutex.

I would suggest using bool, and documenting the possible state transitions carefully. We might consider using atomic memory access.

Second, in 10.1, fts_optimize_init() is not adding tables to the queue, while in 10.2 it is doing that. I’d like to see a 10.1 patch that does this. It should also avoid the unnecessary use of std::vector.

Third, fts_optimize_remove_table() should assert !table->fts->in_queue in the end.

Comment by Marko Mäkelä [ 2019-10-18 ]

At the end of fts_optimize_remove_table(), the fts_optimize_wq->mutex acquisition and release around the debug assertion should be inside ut_d(), to avoid unnecessary operations on the release build.

I saw a redundant sync_table = mem_heap_alloc(…) call whose result was immediately overwritten by sync_table = table;

In fts_optimize_new_table() the assignment slot->running = false is redundant because of a preceding memset() call.

If fts_slots can be accessed by multiple threads, then we should extend some mutex hold time. It could be that it is only being accessed by a single thread.

Should we call fts_init_index() already on ha_innobase::open()? Otherwise, it seems that FTS-indexed columns could be updated before any fulltext search is performed (and ha_innobase::ft_init_ext() is called). Could that lead to some updates being missed by the fulltext indexes?

Finally, please check the following for differences in white-space or comments, and try to fix those:

diff -I^@@ <(git show origin/bb-10.1-thiru storage/innobase) <(git show origin/bb-10.1-thiru storage/xtradb/)
git show origin/bb-10.2-thiru|diff -I^@@ - <(git show origin/bb-10.1-thiru storage/innobase)

Comment by Marko Mäkelä [ 2019-10-22 ]

Thanks, this looks OK. I made a suggestion to declare fts_optimize_wq without static scope, to avoid having to add trivial non-inline accessor functions.

Comment by Matthias Leich [ 2019-10-25 ]

I tested the tree bb-10.2-thiru commit ce813ca178e499ab2171978bf0140537cb9ca612 which contains
patches for the current MDEV.
There were no asserts/crashes that do not also occur in the current
10.2 commit 28098420317bc2efe082df799c917babde879242.
So from my point of view the MDEV-20621 patch is OK.
