Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-20621

FULLTEXT INDEX activity causes InnoDB hang

Details

    Description

      We are experiencing technical difficulties with the latest MariaDB 10.1.41-MariaDB.
      This is only happening on one server while we have more with the same system package versions.

      The database is freezing and does not accept new connections.
      The error_log shows so much error data eg:

      InnoDB: Warning: a long semaphore wait:
      --Thread 140300680931072 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:
      Mutex at 0x7f9e26c112e8 '&dict_sys->mutex', lock var 1
      Last time reserved by thread 140300697716480 in file not yet reserved line 0, waiters flag 1
      InnoDB: Warning: semaphore wait:
      --Thread 140300680931072 has waited at dict0dict.cc line 984 for 241.00 seconds the semaphore:
      Mutex at 0x7f9e26c112e8 '&dict_sys->mutex', lock var 1
      Last time reserved by thread 140300697716480 in file not yet reserved line 0, waiters flag 1
      

      We can provide more error log data but not in a public.

      Attachments

        Issue Links

          Activity

            Novkovski Stevo created issue -
            Novkovski Stevo added a comment - - edited

            Bonus tip: Looks like is always happening approximately after 1 hour and 30 minutes after the service is started.

            Novkovski Stevo added a comment - - edited Bonus tip: Looks like is always happening approximately after 1 hour and 30 minutes after the service is started.
            elenst Elena Stepanova made changes -
            Field Original Value New Value
            Fix Version/s 10.1 [ 16100 ]
            Assignee Marko Mäkelä [ marko ]

            Novkovski, I would like to see the stack traces of all threads when this happens.

            gdb -ex "set pagination 0" -ex "thread apply all backtrace" --batch -p $(pgrep -x mysqld)
            

            I do not have any idea what could cause this to happen 1 hour and 30 minutes after the service has been started. Could it be due to some external monitoring or maintenance activity? In MDEV-13983 we have something similar, possibly related to InnoDB deadlock detection or LOCK_thread_count that it is being acquired in order to dump information about a conflicting transaction.

            marko Marko Mäkelä added a comment - Novkovski , I would like to see the stack traces of all threads when this happens. gdb -ex "set pagination 0" -ex "thread apply all backtrace" --batch -p $(pgrep -x mysqld) I do not have any idea what could cause this to happen 1 hour and 30 minutes after the service has been started. Could it be due to some external monitoring or maintenance activity? In MDEV-13983 we have something similar, possibly related to InnoDB deadlock detection or LOCK_thread_count that it is being acquired in order to dump information about a conflicting transaction.
            Novkovski Stevo made changes -
            Attachment gdb [ 48981 ]
            Novkovski Stevo added a comment -

            It happened again after approx. 1 hour and 30-40 minutes.
            Here is the output: gdb

            Novkovski Stevo added a comment - It happened again after approx. 1 hour and 30-40 minutes. Here is the output: gdb

            Has gdb been sanitized in some way, or are the debugging symbols missing? I do not see any parameters to function calls, or source code line numbers.

            It would be helpful if you could save a core dump that you could analyze based on commands provided by me. We need to find out which thread is holding dict_sys->mutex. That should be relatively easy: print/x *dict_sys->mutex should reveal the thread identifier in hexadecimal. You can use thread find 0x… to find the thread. Finally, switch to that thread and issue backtrace. But, I would probably likely still need a complete backtrace of all threads.

            Note that you can upload any confidential files to ftp.mariadb.com.

            marko Marko Mäkelä added a comment - Has gdb been sanitized in some way, or are the debugging symbols missing? I do not see any parameters to function calls, or source code line numbers. It would be helpful if you could save a core dump that you could analyze based on commands provided by me. We need to find out which thread is holding dict_sys->mutex . That should be relatively easy: print/x *dict_sys->mutex should reveal the thread identifier in hexadecimal. You can use thread find 0x… to find the thread. Finally, switch to that thread and issue backtrace . But, I would probably likely still need a complete backtrace of all threads. Note that you can upload any confidential files to ftp.mariadb.com.
            Novkovski Stevo added a comment -

            I`m not so Linux technical person so please guide me whatever I need to make.
            It happened again and once MariaDB waits for some locks, the whole server does not work and mysql_error.log files get filled.
            I have temporary set up a cron job to restart MariaDB every hour witch fixes the issue.

            Novkovski Stevo added a comment - I`m not so Linux technical person so please guide me whatever I need to make. It happened again and once MariaDB waits for some locks, the whole server does not work and mysql_error.log files get filled. I have temporary set up a cron job to restart MariaDB every hour witch fixes the issue.
            Novkovski Stevo added a comment - Isnt the latest change a fix to this issue? https://github.com/MariaDB/server/commit/8a79fa0e4d0385818da056f7a4a39fde95d62fe3
            marko Marko Mäkelä added a comment - - edited

            Novkovski, thank you for noticing. Yes, your gdb output suggests that your report could be a duplicate of MDEV-19529.

            There are many problems with the InnoDB fulltext search implementation, and there are not many useful regression tests. We also have some other fixes in the works that have not gone through stress tests (or code review) yet.

            marko Marko Mäkelä added a comment - - edited Novkovski , thank you for noticing. Yes, your gdb output suggests that your report could be a duplicate of MDEV-19529 . There are many problems with the InnoDB fulltext search implementation, and there are not many useful regression tests. We also have some other fixes in the works that have not gone through stress tests (or code review) yet.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]

            It is different issue from MDEV-19529 because it dealt with alter and fts_optimize_thread wait. But this one deals with srv_master_thread and fts_optimize_thread wait.
            Will work on it. Thanks for raising the report Novkovski

            thiru Thirunarayanan Balathandayuthapani added a comment - It is different issue from MDEV-19529 because it dealt with alter and fts_optimize_thread wait. But this one deals with srv_master_thread and fts_optimize_thread wait. Will work on it. Thanks for raising the report Novkovski
            thiru Thirunarayanan Balathandayuthapani made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Marko Mäkelä [ marko ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            mleich Matthias Leich added a comment - - edited

            Results of RQG testing on bb-10.2-thiru commit 0b91f74906c8dcbcc1dac486fcc66c1e9c0c603a
            - > 1500 RQG tests were executed
            There was some surprising low fraction of failing tests.
            All asserts/crashes are already covered by open bugs in JIRA except one
            - mysqld: sql/sql_list.h:684: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed.
              happening during shutdown of the server
            - per Thiru: Unlikely that its caused by the changes in bb-10.3-thiru
            - occuring only once ==  Attempts to replay that on actual 10.2 have a too low chance
            https://jira.mariadb.org/browse/MDEV-20843
            

            mleich Matthias Leich added a comment - - edited Results of RQG testing on bb-10.2-thiru commit 0b91f74906c8dcbcc1dac486fcc66c1e9c0c603a - > 1500 RQG tests were executed There was some surprising low fraction of failing tests. All asserts/crashes are already covered by open bugs in JIRA except one - mysqld: sql/sql_list.h:684: void ilink::assert_linked(): Assertion `prev != 0 && next != 0' failed. happening during shutdown of the server - per Thiru: Unlikely that its caused by the changes in bb-10.3-thiru - occuring only once == Attempts to replay that on actual 10.2 have a too low chance https://jira.mariadb.org/browse/MDEV-20843
            mleich Matthias Leich made changes -
            Comment [ A comment with security level 'Developers' was removed. ]

            This is a welcome step to the right direction, but I think that this needs some more work.

            First of all, the in_queue should not be stored in a bit-field that is shared with other bit-fields that are protected by a different mutex.

            I would suggest to use bool, and to document the possible state transitions carefully. We might consider using atomic memory access.

            Second, in 10.1, fts_optimize_init() is not adding tables to the queue, while in 10.2 it is doing that. I’d like to see a 10.1 patch that does this. It should also avoid the unnecessary use of std::vector.

            Third, fts_optimize_remove_table() should assert !table->fts->in_queue in the end.

            marko Marko Mäkelä added a comment - This is a welcome step to the right direction, but I think that this needs some more work. First of all, the in_queue should not be stored in a bit-field that is shared with other bit-fields that are protected by a different mutex. I would suggest to use bool , and to document the possible state transitions carefully. We might consider using atomic memory access. Second, in 10.1, fts_optimize_init() is not adding tables to the queue, while in 10.2 it is doing that. I’d like to see a 10.1 patch that does this. It should also avoid the unnecessary use of std::vector . Third, fts_optimize_remove_table() should assert !table->fts->in_queue in the end.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Marko Mäkelä [ marko ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Affects Version/s 10.2 [ 14601 ]
            Affects Version/s 10.3 [ 22126 ]
            Affects Version/s 10.4 [ 22408 ]

            At the end of fts_optimize_remove_table(), the fts_optimize_wq->mutex acquisition and release around the debug assertion should be inside ut_d(), to avoid unnecessary operations on the release build.

            I saw a redundant sync_table = mem_heap_alloc(…) call whose result was immediately overwritten by {{sync_table=table;}

            In fts_optimize_new_table() the assignment slot->running = false is redundant because of a preceding memset() call.

            If fts_slots can be accessed by multiple threads, then we should extend some mutex hold time. It could be that it is only being accessed by a single thread.

            Should we call fts_init_index() already on ha_innobase::open()? Otherwise, it seems that FTS-indexed columns could be updated before any fulltext search is performed (and ha_innobase::ft_init_ext() is called). Could that lead to some updates being missed by the fulltext indexes?

            Finally, please check the following for differences in white-space or comments, and try to fix those:

            diff -I^@@ <(git show origin/bb-10.1-thiru storage/innobase) <(git show origin/bb-10.1-thiru storage/xtradb/)
            git show origin/bb-10.2-thiru|diff -^@@ - <(git show origin/bb-10.1-thiru storage/innobase)
            

            marko Marko Mäkelä added a comment - At the end of fts_optimize_remove_table() , the fts_optimize_wq->mutex acquisition and release around the debug assertion should be inside ut_d() , to avoid unnecessary operations on the release build. I saw a redundant sync_table = mem_heap_alloc(…) call whose result was immediately overwritten by {{sync_table=table;} In fts_optimize_new_table() the assignment slot->running = false is redundant because of a preceding memset() call. If fts_slots can be accessed by multiple threads, then we should extend some mutex hold time. It could be that it is only being accessed by a single thread. Should we call fts_init_index() already on ha_innobase::open() ? Otherwise, it seems that FTS-indexed columns could be updated before any fulltext search is performed (and ha_innobase::ft_init_ext() is called). Could that lead to some updates being missed by the fulltext indexes? Finally, please check the following for differences in white-space or comments, and try to fix those: diff -I^@@ <(git show origin/bb-10.1-thiru storage/innobase) <(git show origin/bb-10.1-thiru storage/xtradb/) git show origin/bb-10.2-thiru|diff -^@@ - <(git show origin/bb-10.1-thiru storage/innobase)
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Marko Mäkelä [ marko ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            Thanks, this looks OK. I made a suggestion to declare fts_optimize_wq) without static scope, to avoid having to add trivial non-inline accessor functions.

            marko Marko Mäkelä added a comment - Thanks, this looks OK. I made a suggestion to declare fts_optimize_wq ) without static scope, to avoid having to add trivial non- inline accessor functions.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            I tested the tree bb-10.2-thiru commit ce813ca178e499ab2171978bf0140537cb9ca612 which contains
            patches for the current MDEV.
            There were no asserts/crashes which do not occur in actual
            10.2 commit 28098420317bc2efe082df799c917babde879242
            too.
            So from my point of view the MDEV-20621 patch is ok.
            

            mleich Matthias Leich added a comment - I tested the tree bb-10.2-thiru commit ce813ca178e499ab2171978bf0140537cb9ca612 which contains patches for the current MDEV. There were no asserts/crashes which do not occur in actual 10.2 commit 28098420317bc2efe082df799c917babde879242 too. So from my point of view the MDEV-20621 patch is ok.
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2019-10-25 14:44:39.0 2019-10-25 14:44:39.204
            marko Marko Mäkelä made changes -
            Fix Version/s 10.1.42 [ 23407 ]
            Fix Version/s 10.2.28 [ 23910 ]
            Fix Version/s 10.3.19 [ 23908 ]
            Fix Version/s 10.4.9 [ 23906 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.1 [ 16100 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            marko Marko Mäkelä made changes -
            Summary Locking issue freezes MariaDB FULLTEXT INDEX activity causes InnoDB hang
            marko Marko Mäkelä made changes -
            thiru Thirunarayanan Balathandayuthapani made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 99761 ] MariaDB v4 [ 156762 ]

            People

              thiru Thirunarayanan Balathandayuthapani
              Novkovski Stevo
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.