[MDEV-20621] FULLTEXT INDEX activity causes InnoDB hang - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.1.41, 10.2(EOL), 10.3(EOL), 10.4(EOL)
Fix Version/s: 10.2.28, 10.1.42, 10.3.19, 10.4.9
Component/s: Server, Storage Engine - InnoDB
Labels:
None
Environment:
Cloudlinux 7.7
Google Cloud Compute Engine

Description

We are experiencing technical difficulties with the latest MariaDB, 10.1.41.
This happens on only one server, although we have others with the same system package versions.

The database freezes and does not accept new connections.
The error log fills with messages such as:

InnoDB: Warning: a long semaphore wait:

--Thread 140300680931072 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:

Mutex at 0x7f9e26c112e8 '&dict_sys->mutex', lock var 1

Last time reserved by thread 140300697716480 in file not yet reserved line 0, waiters flag 1

InnoDB: Warning: semaphore wait:

--Thread 140300680931072 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:

Mutex at 0x7f9e26c112e8 '&dict_sys->mutex', lock var 1

Last time reserved by thread 140300697716480 in file not yet reserved line 0, waiters flag 1

We can provide more error log data, but not in public.

Attachments


gdb
829 kB
2019-09-19 12:27

Issue Links

causes

MDEV-20987 InnoDB fails to start when fts table has FK relation

Closed

MDEV-23856 fts_optimize_wq accessed after shutdown of FTS Optimize thread

Closed

relates to

MDEV-19529 InnoDB hang on DROP FULLTEXT INDEX

Closed

Activity


Stevo created issue - 2019-09-18 15:57

Stevo added a comment - 2019-09-18 15:58 - edited

Bonus tip: It looks like this always happens approximately 1 hour and 30 minutes after the service is started.


Elena Stepanova made changes - 2019-09-18 16:39

Field	Original Value	New Value
Fix Version/s		10.1 [ 16100 ]
Assignee		Marko Mäkelä [ marko ]

Marko Mäkelä added a comment - 2019-09-19 05:00

Novkovski, I would like to see the stack traces of all threads when this happens.

gdb -ex "set pagination 0" -ex "thread apply all backtrace" --batch -p $(pgrep -x mysqld)

I do not have any idea what could cause this to happen 1 hour and 30 minutes after the service has been started. Could it be due to some external monitoring or maintenance activity? In ~~MDEV-13983~~ we have something similar, possibly related to InnoDB deadlock detection, or to LOCK_thread_count being acquired in order to dump information about a conflicting transaction.


Stevo made changes - 2019-09-19 12:27

Attachment

gdb [ 48981 ]

Stevo added a comment - 2019-09-19 12:27

It happened again after approx. 1 hour and 30-40 minutes.
Here is the output: gdb


Marko Mäkelä added a comment - 2019-09-19 14:19

Has gdb been sanitized in some way, or are the debugging symbols missing? I do not see any parameters to function calls, or source code line numbers.

It would be helpful if you could save a core dump that you could analyze based on commands provided by me. We need to find out which thread is holding dict_sys->mutex. That should be relatively easy: print/x *dict_sys->mutex should reveal the thread identifier in hexadecimal. You can use thread find 0x… to find the thread. Finally, switch to that thread and issue backtrace. However, I would probably still need a complete backtrace of all threads.

Note that you can upload any confidential files to ftp.mariadb.com.
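
The error log in the description prints the last holder of the mutex as a decimal thread ID, while the gdb steps above work with a hexadecimal identifier. The following is a minimal C++ sketch of that conversion, assuming the thread named in the "Last time reserved by thread" line is the one of interest. The helper names `extract_holder_thread` and `thread_id_to_hex` are hypothetical and are not part of MariaDB or gdb.

```cpp
#include <cstdint>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper: pull the decimal thread ID out of a
// "Last time reserved by thread N" line from the InnoDB error log.
// Returns false when the line does not contain that phrase.
bool extract_holder_thread(const std::string& line, std::uint64_t& thread_id) {
    static const std::regex pattern(R"(reserved by thread (\d+))");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return false;
    }
    thread_id = std::stoull(match[1].str());
    return true;
}

// Hypothetical helper: format the ID as 0x-prefixed lowercase hexadecimal,
// the notation used when searching the gdb thread list.
std::string thread_id_to_hex(std::uint64_t thread_id) {
    std::ostringstream out;
    out << "0x" << std::hex << thread_id;
    return out.str();
}
```

For the log line quoted in the description, 140300697716480 converts to 0x7f9a4d3fc700, which is the value to search for in the thread list of the gdb output.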


Stevo added a comment - 2019-09-20 12:31

I'm not a very technical Linux person, so please guide me through whatever I need to do.
It happened again, and once MariaDB waits for some locks, the whole server stops working and the mysql_error.log files fill up.
I have temporarily set up a cron job to restart MariaDB every hour, which works around the issue.


Stevo added a comment - 2019-09-20 17:40

Isn't the latest change a fix for this issue?
https://github.com/MariaDB/server/commit/8a79fa0e4d0385818da056f7a4a39fde95d62fe3


Marko Mäkelä added a comment - 2019-09-30 15:52 - edited

Novkovski, thank you for noticing. Yes, your gdb output suggests that your report could be a duplicate of ~~MDEV-19529~~.

There are many problems with the InnoDB fulltext search implementation, and there are not many useful regression tests. We also have some other fixes in the works that have not gone through stress tests (or code review) yet.


Marko Mäkelä made changes - 2019-09-30 15:52

Link

This issue relates to ~~MDEV-19529~~ [ ~~MDEV-19529~~ ]

Marko Mäkelä made changes - 2019-09-30 15:52

Assignee

Marko Mäkelä [ marko ]

Thirunarayanan Balathandayuthapani [ thiru ]

Thirunarayanan Balathandayuthapani added a comment - 2019-10-01 07:45

This is a different issue from ~~MDEV-19529~~, because that one dealt with a wait between ALTER and fts_optimize_thread, whereas this one deals with a wait between srv_master_thread and fts_optimize_thread.
I will work on it. Thanks for raising the report, Novkovski.


Thirunarayanan Balathandayuthapani made changes - 2019-10-01 07:45

Status

Open [ 1 ]

Confirmed [ 10101 ]

Thirunarayanan Balathandayuthapani made changes - 2019-10-11 14:01

Status

Confirmed [ 10101 ]

In Progress [ 3 ]

Thirunarayanan Balathandayuthapani made changes - 2019-10-11 14:01

Assignee	Thirunarayanan Balathandayuthapani [ thiru ]	Marko Mäkelä [ marko ]
Status	In Progress [ 3 ]	In Review [ 10002 ]

Matthias Leich added a comment - 2019-10-16 10:52 - edited

Results of RQG testing on bb-10.2-thiru commit 0b91f74906c8dcbcc1dac486fcc66c1e9c0c603a

- More than 1500 RQG tests were executed

There was a surprisingly low fraction of failing tests.

All asserts/crashes are already covered by open bugs in JIRA except one

- mysqld: sql/sql_list.h:684: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.

  happening during shutdown of the server

- per Thiru: Unlikely that it is caused by the changes in bb-10.3-thiru

- occurring only once, so attempts to replay it on the current 10.2 have too low a chance of success

https://jira.mariadb.org/browse/MDEV-20843


Matthias Leich made changes - 2019-10-16 13:42

Comment

[ A comment with security level 'Developers' was removed. ]

Marko Mäkelä added a comment - 2019-10-17 11:29

This is a welcome step in the right direction, but I think that this needs some more work.

First of all, the in_queue should not be stored in a bit-field that is shared with other bit-fields that are protected by a different mutex.

I would suggest using bool, and documenting the possible state transitions carefully. We might consider using atomic memory access.

Second, in 10.1, fts_optimize_init() is not adding tables to the queue, while in 10.2 it is doing that. I’d like to see a 10.1 patch that does this. It should also avoid the unnecessary use of std::vector.

Third, fts_optimize_remove_table() should assert !table->fts->in_queue at the end.
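
The first point of this review can be illustrated with a minimal C++ sketch. It is not the InnoDB code: the struct `fts_state_sketch`, the fields `flag_a` and `flag_b`, and the functions `mark_enqueued` and `mark_dequeued` are hypothetical names chosen for illustration, and using `std::atomic<bool>` is only one of the options mentioned in the comment.

```cpp
#include <atomic>

// Hypothetical stand-in for per-table FTS state; not the actual InnoDB struct.
struct fts_state_sketch {
    // Bit-fields share one storage unit, so writing either one rewrites the
    // whole unit. Both must therefore be protected by the same mutex.
    unsigned flag_a : 1;
    unsigned flag_b : 1;

    // Separate object, so it can be updated without touching the bit-fields.
    // State transitions:
    //   false -> true   when the table is added to the optimize queue
    //   true  -> false  when the table is removed from the queue
    std::atomic<bool> in_queue;

    fts_state_sketch() : flag_a(0), flag_b(0), in_queue(false) {}
};

// Returns true only for the caller that performed the false -> true transition.
bool mark_enqueued(fts_state_sketch& fts) {
    return !fts.in_queue.exchange(true);
}

// Returns true only if the table was in the queue before the call.
bool mark_dequeued(fts_state_sketch& fts) {
    return fts.in_queue.exchange(false);
}
```

Because `exchange` returns the previous value, each helper reports whether it actually changed the state, which makes a double enqueue or double removal easy to detect in an assertion.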


Marko Mäkelä made changes - 2019-10-17 11:29

Assignee	Marko Mäkelä [ marko ]	Thirunarayanan Balathandayuthapani [ thiru ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Thirunarayanan Balathandayuthapani made changes - 2019-10-18 11:16

Assignee	Thirunarayanan Balathandayuthapani [ thiru ]	Marko Mäkelä [ marko ]
Status	Stalled [ 10000 ]	In Review [ 10002 ]

Marko Mäkelä made changes - 2019-10-18 11:59

Fix Version/s		10.2 [ 14601 ]
Fix Version/s		10.3 [ 22126 ]
Fix Version/s		10.4 [ 22408 ]
Affects Version/s		10.2 [ 14601 ]
Affects Version/s		10.3 [ 22126 ]
Affects Version/s		10.4 [ 22408 ]

Marko Mäkelä added a comment - 2019-10-18 12:06

At the end of fts_optimize_remove_table(), the fts_optimize_wq->mutex acquisition and release around the debug assertion should be inside ut_d(), to avoid unnecessary operations on the release build.
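
In InnoDB, ut_d() compiles its argument only in debug builds. The following minimal sketch shows the same pattern with a hypothetical `DEBUG_ONLY` macro keyed on the standard `NDEBUG` symbol; the types `work_queue_sketch` and `table_sketch` and the function `remove_table` are illustrative assumptions, not the actual InnoDB definitions.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical equivalent of a debug-only wrapper: the statement is compiled
// in debug builds and disappears when NDEBUG is defined (release builds).
#ifndef NDEBUG
# define DEBUG_ONLY(stmt) stmt
#else
# define DEBUG_ONLY(stmt)
#endif

// Hypothetical stand-ins for the work queue and the table.
struct work_queue_sketch { std::mutex mutex; };
struct table_sketch { bool in_queue = false; };

void remove_table(work_queue_sketch& wq, table_sketch& table) {
    {
        std::lock_guard<std::mutex> guard(wq.mutex);
        table.in_queue = false;  // the actual removal work would go here
    }
    // Debug-only consistency check: lock, assert, unlock. In a release
    // build all three lines compile to nothing, so no mutex is touched.
    DEBUG_ONLY(wq.mutex.lock());
    assert(!table.in_queue);
    DEBUG_ONLY(wq.mutex.unlock());
}
```

Wrapping the lock and unlock in the same macro as the assertion keeps the release build free of a mutex round trip whose only purpose is to guard a check that is not compiled in.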

I saw a redundant sync_table = mem_heap_alloc(…) call whose result was immediately overwritten by sync_table=table;

In fts_optimize_new_table() the assignment slot->running = false is redundant because of a preceding memset() call.

If fts_slots can be accessed by multiple threads, then we should extend some mutex hold time. It could be that it is only being accessed by a single thread.

Should we call fts_init_index() already on ha_innobase::open()? Otherwise, it seems that FTS-indexed columns could be updated before any fulltext search is performed (and ha_innobase::ft_init_ext() is called). Could that lead to some updates being missed by the fulltext indexes?

Finally, please check the following for differences in white-space or comments, and try to fix those:

diff -I^@@ <(git show origin/bb-10.1-thiru storage/innobase) <(git show origin/bb-10.1-thiru storage/xtradb/)

git show origin/bb-10.2-thiru|diff -I^@@ - <(git show origin/bb-10.1-thiru storage/innobase)


Marko Mäkelä made changes - 2019-10-18 12:06

Assignee	Marko Mäkelä [ marko ]	Thirunarayanan Balathandayuthapani [ thiru ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Thirunarayanan Balathandayuthapani made changes - 2019-10-22 10:35

Status

Stalled [ 10000 ]

In Progress [ 3 ]

Thirunarayanan Balathandayuthapani made changes - 2019-10-22 10:35

Assignee	Thirunarayanan Balathandayuthapani [ thiru ]	Marko Mäkelä [ marko ]
Status	In Progress [ 3 ]	In Review [ 10002 ]

Marko Mäkelä added a comment - 2019-10-22 13:22

Thanks, this looks OK. I made a suggestion to declare fts_optimize_wq without static scope, to avoid having to add trivial non-inline accessor functions.


Marko Mäkelä made changes - 2019-10-22 13:22

Assignee	Marko Mäkelä [ marko ]	Thirunarayanan Balathandayuthapani [ thiru ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Matthias Leich added a comment - 2019-10-25 10:29

I tested the tree bb-10.2-thiru commit ce813ca178e499ab2171978bf0140537cb9ca612, which contains the patches for this MDEV.

There were no asserts/crashes that do not also occur in the current 10.2 commit 28098420317bc2efe082df799c917babde879242.

So from my point of view the MDEV-20621 patch is ok.


Marko Mäkelä made changes - 2019-10-25 14:44

issue.field.resolutiondate

2019-10-25 14:44:39.0

2019-10-25 14:44:39.204

Marko Mäkelä made changes - 2019-10-25 14:44

Fix Version/s		10.1.42 [ 23407 ]
Fix Version/s		10.2.28 [ 23910 ]
Fix Version/s		10.3.19 [ 23908 ]
Fix Version/s		10.4.9 [ 23906 ]
Fix Version/s	10.2 [ 14601 ]
Fix Version/s	10.1 [ 16100 ]
Fix Version/s	10.3 [ 22126 ]
Fix Version/s	10.4 [ 22408 ]
Resolution		Fixed [ 1 ]
Status	Stalled [ 10000 ]	Closed [ 6 ]

Marko Mäkelä made changes - 2019-10-25 14:48

Summary

Locking issue freezes MariaDB

FULLTEXT INDEX activity causes InnoDB hang

Marko Mäkelä made changes - 2019-11-06 06:05

Link

This issue causes ~~MDEV-20987~~ [ ~~MDEV-20987~~ ]

Thirunarayanan Balathandayuthapani made changes - 2020-09-30 10:44

Link

This issue causes ~~MDEV-23856~~ [ ~~MDEV-23856~~ ]

Sergei Golubchik made changes - 2021-12-06 21:50

Workflow

MariaDB v3 [ 99761 ]

MariaDB v4 [ 156762 ]

People

Assignee: Thirunarayanan Balathandayuthapani

Reporter: Stevo

Votes: 0

Watchers: 5

Dates

Created: 2019-09-18 15:57

Updated: 2020-09-30 10:44

Resolved: 2019-10-25 14:44
