MariaDB Server
MDEV-16467

MariaDB crashes because of "long semaphore wait" after migrating from 10.1 to 10.3

Details

    Description

      We have a Windows server that has been running MariaDB 10.1 successfully for over a year. The server remains mostly idle for long times with some read access, but occasionally there are transactions that add data (about 500k rows per commit). There can be up to 10 such transactions (one per database) at the same time and during those times the server is under quite some load (the code processing the data resides on the same server as the database).

      When trying to use MariaDB 10.3, during those load times the database crashes and logs "[FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung."

      I figure it could be related to MariaDB switching from XtraDB to InnoDB; on the mailing list it was suggested that I file a bug.

      I'm attaching the error log. I cannot attach a minidump as-is because upon inspection it contained sensitive data that I'm not allowed to disclose. If absolutely necessary, I can try to reproduce the problem with random test data, but this might take a few days.

      The mailing list entry mentioned above can be found here: https://lists.launchpad.net/maria-discuss/msg05139.html

      Regards,
      Tom.

      Attachments

        1. error_mysqld.log
          635 kB
        2. gdb_core.180561.txt
          5 kB
        3. gdb_core.64996.txt
          39 kB
        4. gdb_short.txt
          9.18 MB
        5. gdb.180561.txt.gz
          519 kB
        6. mariadb_clean.err
          1.17 MB
        7. my.ini
          1 kB
        8. uniqstack.txt
          17 kB

        Issue Links

          Activity

            tdcf Tom Deory created issue -
            tdcf Tom Deory added a comment -

            Additional info: the error log starts from a clean state (i.e. a MariaDB 10.3 instance with no data previously in it). It contains the whole log from starting the database up to the crash.

            wlad Vladislav Vaintroub added a comment - - edited

            It is strange that you got sensitive data from those minidumps; they contain hardly anything apart from call stacks. But anyway, we need all threads' call stacks, not the single thread's call stack that the error log provides.

            There are different ways to get them, using the mysqld.dmp that you have created.

            One way is to execute

            cdb.exe -y "C:\Program Files\MariaDB 10.3\bin" -i "C:\Program Files\MariaDB 10.3\bin" -lines -z C:\path\to\mysqld.dmp -c "~*kc;q!"

            (you might need to fix the -y and -i paths so they point to the directory containing mysqld.pdb and mysqld.exe; the path given after -z should be the full path to mysqld.dmp)

            cdb.exe is part of Debugging Tools for Windows, which is part of the Windows SDK. After I installed it, cdb.exe is located under
            C:\Program Files (x86)\Windows Kits\10\Debuggers\x64

            More info on Debugging Tools for Windows:
            https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/

            Another way to dump all threads' call stacks is to open the minidump in Visual Studio, click the "Debug with Native Only" button and then execute

            Debug.ListCallStack /AllThreads in the "Command Window"

            So, could you please try to get all threads' call stacks and attach that, if you are not comfortable sharing the minidump.

            wlad Vladislav Vaintroub made changes -
            Field Original Value New Value
            Labels: crash innodb semaphore → crash innodb need_feedback semaphore
            JensVanDeynse Jens Van Deynse made changes -
            Attachment error_mysqld.log [ 45797 ]
            JensVanDeynse Jens Van Deynse added a comment - - edited

            Hello

            We experience the exact same issue:

            • We updated our database server (1 of 25+ database servers) from MariaDB 10.1 to MariaDB 10.3
            • The first days everything went as planned
            • This weekend MariaDB intentionally crashed itself twice (night between Friday and Saturday & night between Sunday and Monday)
            • The crashes were at exactly the same moment (00:20)
            • After analyzing our server (with a lot of databases) we saw that 105 (small) events are scheduled at 00:00 each night, and in the logs it's obvious that it's around that time that something goes wrong -> the first semaphore wait indication is at 00:04:03 with a 241-second wait (see the query sketch after this list)
            • It's our opinion that the events have something to do with the crash, but this wasn't an issue in MariaDB 10.1
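
            The query sketch referenced above - a rough way to list the scheduled events per schema (columns are from information_schema.EVENTS):

            SELECT EVENT_SCHEMA, EVENT_NAME, STATUS, STARTS, INTERVAL_VALUE, INTERVAL_FIELD, LAST_EXECUTED
            FROM information_schema.EVENTS
            ORDER BY EVENT_SCHEMA, EVENT_NAME;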

            Our environment on the other hand is completely different:

            • Virtualised
            • Linux server -> CentOS 7.5.1804 (Core)
            • 32GB RAM
            • 8 CPUs
            • SSD
            • 18GB innodb_buffer_pool_size

            I attached our logs in the hope that you could find some useful information.
            If you need any more information, please just ask and I'll see if I can find it

            Regards
            Jens

            sklasing Scott Klasing added a comment -

            We have the identical problem on MDB 10.2.14.
            The deadlocks occur only during slave replication, due to competing identical insert on duplicate key update statements for the same database. This never occurred in 10.1.18 or 10.1.22.

            LMK if I need to file a separate bug, since this report is based on 10.3 and ours is on 10.2.14.

            binlog_wait_commit on the master was 2 and we have since set it to 0, but the problem still occurs on the slave. On the slave side we lowered the slave_parallel_workers ... threads from 16 to 8 and now to 4.

            Servers are 40 CPUs, 384 GB RAM, buffer pools 260 GB, disks 33 TB. These servers handle extremely high-volume mass inserts on duplicate update with NO problems on the master. We saw occasional deadlocks on the slave even in 10.1.22, but MDB always correctly handled the deadlocks. Since the problems were not occurring on the master, but only on the slave during parallelism, this is considered an MDB issue as opposed to a code issue. Once upgraded to 10.2.14 this problem continues to occur about every 4 days.

            sklasing Scott Klasing added a comment -

            Should have also reported we are on CentOS 6.9.x.

            On one of our other servers we are also seeing slave insert on duplicate key update deadlocks for the same database, also from the parallel processing, but it has yet to cause a semaphore wait > 900 seconds. Instead the server load and thread count climb (load > 1000, threads upwards of 2k), and the only SHOW ENGINE INNODB STATUS output of significance is the following message, indicating a very high-level mutex is jammed.

            --Thread 140079041885952 has waited at srv0srv.cc line 952 for 0.00 seconds the semaphore:
            Mutex at 0x7f9c38aadd10, Mutex SRV_SYS created srv0srv.cc:1032, lock var 0

            wlad Vladislav Vaintroub added a comment - - edited

            I'm afraid #me-too does not help much in these cases.
            If something deadlocks, taking stack traces from all threads and attaching them to the bug helps. If you're on Windows, you can also add core-file under [mysqld] and attach the mysqld.dmp to the bug report.
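
            For example, on Windows something along these lines in my.ini (location depends on your installation) should make mysqld write mysqld.dmp into the data directory on a crash:

            [mysqld]
            core-file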

            sklasing Scott Klasing added a comment -

            Thanks, will set up for the stack traces from all threads and attach when the problem reoccurs. Only have the mysql-error.log at this point in time.

            sklasing Scott Klasing added a comment -

            FYI, everything is set up for the thread stack traces; now waiting for the crash, which is usually about 3-4 days out. We bumped up the slave parallel threads to see if we can trigger the problem sooner.

            sklasing Scott Klasing added a comment -

            Noting that since we set the DB up for the core dump of all threads the problem has not reoccurred; one thought is that the core dump configuration slows down the DB just enough to properly handle deadlocks. As repeated earlier, we have replication configured for optimistic processing; it appears to still be too aggressive, since the problem only occurs with parallel slave processing. Setting it to conservative causes replication to immediately fall behind.
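
            For reference, the settings being toggled here look roughly like this in MariaDB (slave_parallel_threads and slave_parallel_mode; values are illustrative only, and changing them requires the slave threads to be stopped first):

            STOP SLAVE;
            SET GLOBAL slave_parallel_threads = 4;          -- was 16, then 8, now 4
            SET GLOBAL slave_parallel_mode = 'optimistic';  -- 'conservative' makes replication fall behind
            START SLAVE;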

            sklasing Scott Klasing added a comment -

            We finally have a core dump, all 360 GB of it compressed, along with the mysql-error.log, SHOW VARIABLES output, etc. The plan is to send it to ftp.askmonty.org/private. Does MariaDB have an established NDA process, since our company requires one prior to my sending the core dump?


            wlad Vladislav Vaintroub added a comment -

            sklasing, usually there is no need to send the whole dump; a backtrace of all threads' stacks should normally be enough.

            To produce it, start gdb with the core dump, execute "thread apply all bt", and attach the output to the bug report.
            See also https://mariadb.com/kb/en/library/how-to-produce-a-full-stack-trace-for-mysqld/#using-the-core-file
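
            For example, interactively (binary and core paths below are placeholders - adjust them to your installation):

            gdb /usr/sbin/mysqld /path/to/core
            (gdb) set pagination off
            (gdb) set logging file all_threads_bt.txt
            (gdb) set logging on
            (gdb) thread apply all bt
            (gdb) quit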

            sklasing Scott Klasing made changes -
            Attachment gdb_core.64996.txt [ 45878 ]
            Attachment gdb_core.180561.txt [ 45879 ]
            sklasing Scott Klasing added a comment -

            Thanks Vlad,
            I am attaching the backtrace outputs of two core dumps that occurred on Sunday/Monday this week.
            gdb_core.64996.txt gdb_core.180561.txt

            sklasing Scott Klasing added a comment -

            Also noting that after the last crash, when the system restarted, it immediately reported: 2018-07-10 8:44:40 139680525514496 [Note] Detected table cache mutex contention at instance 1: 21% waits. Additional table cache instance activated. Number of instances after activation: 2.

            We noted the cache hit rate was only 27%, so we have since quadrupled the cache size to 64k, and our current hit rate is now 79%. There are at least 16k databases on this archive server with approx. 6 tables each.

            Since then no crashes but a bit too early to correlate the mod.


            wlad Vladislav Vaintroub added a comment -

            Hmm, are those complete stack traces? How did you collect them?
            The reason I'm asking is that I'm missing a couple of expected threads - the thread that does "poll" to handle new connections, InnoDB background threads, and some other background threads - and most interestingly, I'm missing the thread that holds the mutex. It is not possible that all threads are waiting for the same mutex; there must be one that holds it.

            elenst Elena Stepanova added a comment - - edited

            For all threads' full stack trace:

            gdb --batch --eval-command="thread apply all bt full" <path to the binary> <path to the coredump> > <path-to-the-resulting-file>
            

            It will create a text file at <path-to-the-resulting-file> location. Even with full stack and 8600 threads, it shouldn't be very big, but if it turns out to be more than 10 Mb, you'll need to compress it before attaching to the issue.

            sklasing Scott Klasing made changes -
            Attachment gdb.180561.txt.gz [ 45884 ]
            sklasing Scott Klasing added a comment -

            Sorry for the delay, folks. Even the above-supplied --batch command stream would not write the log file. Ended up repeating the original with "set pagination off", which appeared to do the trick. LMK if you need anything else and I will try to be more responsive. gdb.180561.txt.gz

            elenst Elena Stepanova made changes -
            Labels: crash innodb need_feedback semaphore → crash innodb semaphore
            elenst Elena Stepanova made changes -
            Attachment gdb_short.txt [ 45890 ]

            elenst Elena Stepanova added a comment -

            Thanks.
            I've also attached gdb_short.txt – it's extracted from the archive above, for the initial quick look without unpacking.

            elenst Elena Stepanova made changes -
            Fix Version/s 10.3 [ 22126 ]
            Assignee Vladislav Vaintroub [ wlad ]

            wlad Vladislav Vaintroub added a comment -

            tdcf, we have not heard from you for a while now. It could be a completely different case of deadlock from the one sklasing is experiencing.

            wlad Vladislav Vaintroub made changes -
            Attachment uniqstack.txt [ 45892 ]

            wlad Vladislav Vaintroub added a comment -

            Added https://jira.mariadb.org/secure/attachment/45892/uniqstack.txt with stack aggregation, so it is easier to look at.

            There is high contention on the LOCK_thread_count mutex (_L_lock_995): 8412 new connections would like to acquire it outside of the thread cache, 144 new connections from the thread cache, 4 from "show processlist", and 1 replication thread that does DeadlockChecker::print.
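
            For illustration only (this is not the exact tool used to produce uniqstack.txt), the aggregation can be done with a small script along these lines, which groups threads from a "thread apply all bt" dump by identical call stacks:

            #!/usr/bin/env python3
            # Illustration: collapse a gdb "thread apply all bt" dump into unique stacks,
            # so thousands of identical waiters show up as one entry with a count.
            # Usage: python3 uniqstack.py gdb_all_threads.txt
            import re
            import sys
            from collections import defaultdict

            def unique_stacks(path):
                stacks = defaultdict(list)          # stack signature -> thread headers
                header, frames = None, []
                with open(path, errors="replace") as f:
                    for line in f:
                        if line.startswith("Thread "):
                            if header is not None:
                                stacks[tuple(frames)].append(header)
                            header, frames = line.strip(), []
                        elif re.match(r"#\d+\s", line):
                            # keep only the function name; addresses differ per thread
                            m = re.search(r" in ([A-Za-z_~][\w:~<>]*)", line)
                            frames.append(m.group(1) if m else line.split()[1])
                if header is not None:
                    stacks[tuple(frames)].append(header)
                return stacks

            if __name__ == "__main__":
                for stack, threads in sorted(unique_stacks(sys.argv[1]).items(),
                                             key=lambda kv: -len(kv[1])):
                    print("%d thread(s):" % len(threads))
                    for fn in stack:
                        print("    " + fn)
                    print()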

            sklasing Scott Klasing added a comment -

            Folks, we have not had a crash for over a week now.

            After the last crash the error log displayed this message AFTER it restarted
            2018-07-10 8:44:40 139680525514496 [Note] Detected table cache mutex contention at instance 1: 21% waits. Additional table cache instance activated. Number of instances after activation: 2.

            As a result we checked the open table cache hit rate, which was 27%. We then raised open_table_cache from 16k to 64k, since we have easily 150k tables / 15k databases on this instance.

            My theory is that when the competing inserts on duplicate key update deadlock during slave replication, MDB has to clear cache entries for the new tables, and perhaps that caused the 900-second semaphore delay.

            Hope this helps.
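
            For anyone checking the same thing: the hit rate can be estimated as hits / (hits + misses) from the table open cache counters (status/variable names below are the ones I believe apply to 10.2+):

            SHOW GLOBAL STATUS LIKE 'Table_open_cache%';
            SHOW GLOBAL VARIABLES LIKE 'table_open_cache%';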

            wlad Vladislav Vaintroub added a comment - - edited

            There is no contention on table cache mutex in the log.
            The message
            [Note] Detected table cache mutex contention at instance 1: 21% waits. Additional table cache instance activated. Number of instances after activation: 2.

            is not a warning, it is a note. A heuristic is used to decide how table cache instances are automatically sized, and this message comes from that heuristic.


            marko Marko Mäkelä added a comment -

            In uniqstack.txt I see that ibuf_delete_rec() is unnecessarily calling lock_update_delete(). But that should not be the reason for the hang.

            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            JensVanDeynse, while I did not see any sign of MDEV-14637 in uniqstack.txt, it is possible that you are hitting that issue. It has been observed as a hang between two purge threads. It might also be possible during a ROLLBACK, but as I have not created a test case, I cannot say for sure.


            JensVanDeynse Jens Van Deynse added a comment -

            @Marko Mäkelä, this could be the case.
            We tried observing what was wrong using gdb, and although we couldn't see much, we saw that it was hung between multiple DROP commands.
            This was because of a script that drops and recreates the databases. Since then we changed the timing of these scripts, which fixed the issue for some time, but this morning it crashed again after two weeks. We can't keep gdb running, since this eventually also causes the server to crash.

            wlad Vladislav Vaintroub made changes -
            Assignee: Vladislav Vaintroub [ wlad ] → Sergey Vojtovich [ svoj ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]

            svoj Sergey Vojtovich added a comment -

            Looking at gdb_short.txt (which is by far the most useful attachment), the culprit is:

            Thread 61 (
            Thread 0x7f5a100d0700 (LWP 184312)):
            #0  0x00007f5a174c1334 in __lll_lock_wait () from /lib64/libpthread.so.0
            #1  0x00007f5a174bc60e in _L_lock_995 () from /lib64/libpthread.so.0
            #2  0x00007f5a174bc576 in pthread_mutex_lock () from /lib64/libpthread.so.0
            #3  0x00007f5a17def3fd in inline_mysql_mutex_lock (that=0x7f5a18dd9680, src_line=Unhandled dwarf expression opcode 0xf3
            #4  inline_mysql_mutex_lock (that=0x7f5a18dd9680, src_line=Unhandled dwarf expression opcode 0xf3
            #5  0x00007f5a17df1773 in thd_get_error_context_description (thd=0x7f13f00009a8, buffer=0x7f5a100caf80 "\240\260\f\020Z\177", length=1024, max_query_len=3000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/sql_class.cc:540
            #6  0x00007f5a18141cf4 in innobase_mysql_print_thd (f=0x7f5ba7906950, thd=Unhandled dwarf expression opcode 0xf3
            #7  0x00007f5a181846be in DeadlockChecker::print (trx=0x7f59823329c0, max_query_len=3000) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7238
            #8  0x00007f5a18185b7f in DeadlockChecker::notify (this=0x7f5a100cb460, lock=0x7f13e4025b28) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7384
            #9  0x00007f5a18186815 in DeadlockChecker::search (this=0x7f5a100cb460) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7500
            #10 0x00007f5a18188a35 in DeadlockChecker::check_and_resolve (lock=0x7f13e4025c80, trx=0x7f59823356d0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/lock/lock0lock.cc:7635
            #11 0x00007f5a1818a535 in lock_rec_enqueue_waiting (c_lock=0x7f5a1a3de938, type_mode=2563, block=Unhandled dwarf expression opcode 0xf3
            #12 0x00007f5a1818aca7 in lock_rec_insert_check_and_lock (flags=Unhandled dwarf expression opcode 0xf3
            #13 0x00007f5a182b5de0 in btr_cur_ins_lock_and_undo (flags=0, cursor=0x7f5a100cbb60, entry=0x7f11501973e0, thr=0x7f04587d12f0, mtr=Unhandled dwarf expression opcode 0xf3
            #14 0x00007f5a182b8844 in btr_cur_optimistic_insert (flags=0, cursor=0x7f5a100cbb60, offsets=0x7f5a100cbb28, heap=0x7f5a100cbac0, entry=0x7f11501973e0, rec=0x7f5a100cbb20, big_rec=0x7f5a100cbab0, n_ext=0, thr=0x7f04587d12f0, mtr=0x7f5a100cc660) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/btr/btr0cur.cc:3031
            #15 0x00007f5a181fb3e3 in row_ins_clust_index_entry_low (flags=0, mode=2, index=0x7f1090a8e740, n_uniq=Unhandled dwarf expression opcode 0xf3
            #16 0x00007f5a181fcfde in row_ins_clust_index_entry (index=0x7f1090a8e740, entry=0x7f11501973e0, thr=0x7f04587d12f0, n_ext=0, dup_chk_only=Unhandled dwarf expression opcode 0xf3
            #17 0x00007f5a181ff38a in row_ins_index_entry (thr=Unhandled dwarf expression opcode 0xf3
            #18 row_ins_index_entry_step (thr=Unhandled dwarf expression opcode 0xf3
            #19 row_ins (thr=Unhandled dwarf expression opcode 0xf3
            #20 row_ins_step (thr=Unhandled dwarf expression opcode 0xf3
            #21 0x00007f5a1820e0de in row_insert_for_mysql (mysql_rec=0x7f1091cdb150 "0\220\375\221\020\177", prebuilt=0x7f04587c3cf0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/row/row0mysql.cc:1404
            #22 0x00007f5a1814fb4f in ha_innobase::write_row (this=0x7f045ac412f0, record=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/storage/innobase/handler/ha_innodb.cc:8295
            #23 0x00007f5a17fab8bf in handler::ha_write_row (this=0x7f045ac412f0, buf=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/handler.cc:5971
            #24 0x00007f5a184a650a in ha_partition::write_row (this=0x7f045ac3a7f0, buf=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/ha_partition.cc:4189
            #25 0x00007f5a17fab8bf in handler::ha_write_row (this=0x7f045ac3a7f0, buf=0x7f045ac450e0 "\016") at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/handler.cc:5971
            #26 0x00007f5a17e07aa0 in write_record (thd=0x7f14080009a8, table=0x7f045ac3d958, info=0x7f5a100cdcc0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/sql_insert.cc:1654
            #27 0x00007f5a17e0db7c in mysql_insert (thd=0x7f14080009a8, table_list=Unhandled dwarf expression opcode 0xf3
            #28 0x00007f5a17e21e39 in mysql_execute_command (thd=0x7f14080009a8) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/sql_parse.cc:4430
            #29 0x00007f5a17e277ca in mysql_parse (thd=0x7f14080009a8, rawbuf=Unhandled dwarf expression opcode 0xf3
            #30 0x00007f5a18076734 in Query_log_event::do_apply_event (this=0x7f13dc9ee968, rgi=0x7f14202019a0, query_arg=0x7f14221178c0 "INSERT INTO  app_12289_archive.rollup_hourly (month, hrl_hour, hrl_hash, hrl_tracker, hrl_network, hrl_site, hrl_creative, hrl_geo_country, hrl_campaign, hrl_install_count, hrl_install_matched_to_imp_"..., q_len_arg=Unhandled dwarf expression opcode 0xf3
            #31 0x00007f5a17db89a3 in Log_event::apply_event (this=0x7f13dc9ee968, rgi=0x7f14202019a0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/log_event.h:1452
            #32 0x00007f5a17dabc83 in apply_event_and_update_pos_apply (ev=0x7f13dc9ee968, thd=0x7f14080009a8, rgi=0x7f14202019a0, reason=Unhandled dwarf expression opcode 0xf3
            #33 0x00007f5a17f2834e in rpt_handle_event (qev=0x7f1421dd4da8, rpt=Unhandled dwarf expression opcode 0xfa
            #34 0x00007f5a17f2bb5c in handle_rpl_parallel_thread (arg=0x7f1420007eb0) at /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.14/sql/rpl_parallel.cc:1300
            #35 0x00007f5a174baaa1 in start_thread () from /lib64/libpthread.so.0
            #36 0x00007f5a15bd2bcd in clone () from /lib64/libc.so.6
            

            This should've been fixed since 10.3.3 by

            commit c2118a08b144c95cd4d88a080eaa70abd49f3740
            Author: Monty <monty@mariadb.org>
            Date:   Thu Dec 7 21:28:00 2017 +0200
             
                Move all kill mutex protection to LOCK_thd_kill
             
                LOCK_thd_data was used to protect both THD data and
                ensure that the THD is not deleted while it was in use
             
                This patch moves the THD delete protection to LOCK_thd_kill,
                which already protects the THD for kill.
             
                The benefits are:
                - More well defined what LOCK_thd_data protects
                - LOCK_thd_data usage is now much simpler and easier to verify
                - Less chance of deadlocks in SHOW PROCESS LIST as there is less
                  chance of interactions between mutexes
                - Remove not needed LOCK_thread_count from
                  thd_get_error_context_description()
                - Fewer mutex taken for thd->awake()
             
                Other things:
                - Don't take mysys->var mutex in show processlist to check if thread
                  is kill marked
                - thd->awake() now automatically takes the LOCK_thd_kill mutex
                  (Simplifies code)
                - Apc uses LOCK_thd_kill instead of LOCK_thd_data
            

            Specifically by "- Remove not needed LOCK_thread_count from thd_get_error_context_description()"

            10.2 is missing this fix, but it can be easily backported.

            Unfortunately there's not enough data to analyze the 10.3 deadlock. A dump of all threads' call stacks is missing.

            svoj Sergey Vojtovich made changes -
            svoj Sergey Vojtovich added a comment - - edited

            Given that the original 10.3.7 mariadb_clean.err contains ADD FOREIGN KEY:

            insert into `Project` (`dataVersion`, `name`, `sortIndex`, `area_id`, `customer_id`, `id`) values (0, 'Basic Course', 0, NULL, NULL, 'c2aa6a6f-a612-4b23-8993-5ffdb09a4d27')
            insert into `Project` (`dataVersion`, `name`, `sortIndex`, `area_id`, `customer_id`, `id`) values (0, 'Merits', 0, NULL, NULL, 'bd31bd37-cd4a-4ae6-b27d-6cc1df3bb039')
            insert into `Project` (`dataVersion`, `name`, `sortIndex`, `area_id`, `customer_id`, `id`) values (0, 'Station 14', 0, NULL, NULL, 'e316509e-123b-49bb-8ff8-55b51a5131be')
            insert into `Customer` (`dataVersion`, `name`, `area_id`, `id`) values (0, '****', NULL, 'f1f611de-2964-432a-a670-22cc0ff7c8fb')
            alter table `Project` add constraint `FKr9hirbv06ir3se75hy9f9urpo` foreign key (`area_id`) references `Area` (`id`), add constraint `FKfdttvh81kur1ulwxu1t4vst6n` foreign key (`customer_id`) references `Customer` (`id`)
            

            It feels like a similar problem was fixed by MDEV-16465 and MDEV-16060.


            svoj Sergey Vojtovich added a comment -

            The 10.3 issue should've been fixed already.
            The 10.2 issue is fixed by backporting one of the 10.3 fixes to 10.1 and 10.2.

            svoj Sergey Vojtovich made changes -
            Fix Version/s 10.1.41 [ 23406 ]
            Fix Version/s 10.2.25 [ 23408 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status: Open [ 1 ] → Closed [ 6 ]
            GeoffMontee Geoff Montee (Inactive) made changes -
            Fix Version/s 10.4.6 [ 23412 ]
            Fix Version/s 10.3.16 [ 23410 ]
            GeoffMontee Geoff Montee (Inactive) made changes -
            Fix Version/s 10.3.16 [ 23410 ]
            Fix Version/s 10.4.6 [ 23412 ]
            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            The original report was something specific to 10.3.7.
            The backport of the cleanup from 10.3.3 to 10.1.41 and 10.2.25 effectively reverted the attempted fix of MDEV-11896, which caused the hang in 10.0.32, 10.1.26, 10.2.8, 10.3.1. A deeper analysis of this is in MDEV-13983.

            serg Sergei Golubchik made changes -
            marko Marko Mäkelä made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels: crash innodb semaphore → ServiceNow crash innodb semaphore
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels: ServiceNow crash innodb semaphore → 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z crash innodb semaphore
            serg Sergei Golubchik made changes -
            Labels: 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z crash innodb semaphore → crash innodb semaphore
            serg Sergei Golubchik made changes -
            Workflow: MariaDB v3 [ 87791 ] → MariaDB v4 [ 154500 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 201658
            Zendesk active tickets 201658

            People

              Assignee: svoj Sergey Vojtovich
              Reporter: tdcf Tom Deory
              Votes: 2
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved:
