MariaDB Server / MDEV-13498

DELETE with CASCADE constraints takes long time and then crashes the instance with Signal 6

Details

    Description

      Running DELETE on a table with a single row, which causes a cascade constraint to trigger, hangs for a long time (about 20 minutes in the reported case) and then crashes the instance with:

      InnoDB: ###### Diagnostic info printed to the standard error stream
      2017-08-10 19:08:35 140481422710528 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      170810 19:08:35 [ERROR] mysqld got signal 6 ;

      To reproduce:

      1) Import attached dump

      2) Execute:

      DELETE FROM BRc273a13dfda91b1add89;

      3) Wait ~20 minutes

      On 10.1 the same completes successfully in 15 seconds.
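
      For illustration, a minimal sketch of the kind of schema involved. The names here are hypothetical – the attached dump uses generated names such as BRc273a13dfda91b1add89, and its referencing tables hold hundreds of thousands of rows:

      CREATE TABLE parent (id INT PRIMARY KEY) ENGINE=InnoDB;
      CREATE TABLE child (
        id INT PRIMARY KEY,
        parent_id INT,
        FOREIGN KEY (parent_id) REFERENCES parent (id) ON DELETE CASCADE
      ) ENGINE=InnoDB;
      -- Deleting the single parent row triggers the cascade into child:
      DELETE FROM parent;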

      Attachments

        Issue Links

          Activity

            claudio.nanni Claudio Nanni created issue -
            claudio.nanni Claudio Nanni made changes -
            claudio.nanni Claudio Nanni made changes -
            Attachment test-case.sql.gz [ 43971 ]
            claudio.nanni Claudio Nanni made changes -
            Attachment test-case.sql.gz [ 43972 ]
            alice Alice Sherepa made changes -
            Affects Version/s 10.2.2 [ 22013 ]
            alice Alice Sherepa made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            alice Alice Sherepa added a comment - - edited

            Thanks for the report!
            Reproducible on versions >=10.2.2
            error log (10.2.7):

            ============================
            InnoDB: ###### Diagnostic info printed to the standard error stream
            2017-08-11 14:50:48 140259762132736 [Warning] InnoDB: A long semaphore wait:
            --Thread 140259778918144 has waited at srv0srv.cc line 2144 for 923.00 seconds the semaphore:
            X-lock (wait_ex) on RW-latch at 0x48f2a10 created in file dict0dict.cc line 1112
            a writer (thread id 140259778918144) has reserved it in mode  wait exclusive
            number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
            Last time read locked in file row0ins.cc line 1469
            Last time write locked in file srv0srv.cc line 2144
            2017-08-11 14:50:48 140259762132736 [Note] InnoDB: A semaphore wait:
            --Thread 140259778918144 has waited at srv0srv.cc line 2144 for 923.00 seconds the semaphore:
            X-lock (wait_ex) on RW-latch at 0x48f2a10 created in file dict0dict.cc line 1112
            a writer (thread id 140259778918144) has reserved it in mode  wait exclusive
            number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
            Last time read locked in file row0ins.cc line 1469
            Last time write locked in file srv0srv.cc line 2144
            InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
            InnoDB: Pending reads 0, writes 0
             
            =====================================
            2017-08-11 14:51:06 0x7f90c4cc4700 INNODB MONITOR OUTPUT
            =====================================
            Per second averages calculated from the last 20 seconds
            -----------------
            BACKGROUND THREAD
            -----------------
            srv_master_thread loops: 197 srv_active, 0 srv_shutdown, 3187 srv_idle
            srv_master_thread log flush and writes: 3383
            ----------
            SEMAPHORES
            ----------
            OS WAIT ARRAY INFO: reservation count 73807
            --Thread 140259787310848 has waited at buf0flu.cc line 1248 for 0.00 seconds the semaphore:
            SX-lock on RW-latch at 0x7f90cbf04540 created in file buf0buf.cc line 1471
            a writer (thread id 140260423001856) has reserved it in mode  exclusive
            number of readers 0, waiters flag 1, lock_word: 0
            Last time read locked in file row0ins.cc line 1475
            Last time write locked in file row0upd.cc line 3077
            --Thread 140259778918144 has waited at srv0srv.cc line 2144 for 941.00 seconds the semaphore:
            X-lock (wait_ex) on RW-latch at 0x48f2a10 created in file dict0dict.cc line 1112
            a writer (thread id 140259778918144) has reserved it in mode  wait exclusive
            number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
            Last time read locked in file row0ins.cc line 1469
            Last time write locked in file srv0srv.cc line 2144
            OS WAIT ARRAY INFO: signal count 172129
            RW-shared spins 0, rounds 150450, OS waits 12867
            RW-excl spins 0, rounds 777537, OS waits 10007
            RW-sx spins 97496, rounds 1930979, OS waits 45357
            Spin rounds per wait: 150450.00 RW-shared, 777537.00 RW-excl, 19.81 RW-sx
            ------------
            TRANSACTIONS
            ------------
            Trx id counter 8856
            Purge done for trx's n:o < 8854 undo n:o < 0 state: running but idle
            History list length 12
            LIST OF TRANSACTIONS FOR EACH SESSION:
            ---TRANSACTION 421735400322840, not started
            0 lock struct(s), heap size 1136, 0 row lock(s)
            ---TRANSACTION 8855, ACTIVE 995 sec updating or deleting
            mysql tables in use 3, locked 3
            27689 lock struct(s), heap size 3317968, 1563936 row lock(s), undo log entries 264206
            MySQL thread id 9, OS thread handle 0x7f90ecb06700, query id 620 localhost root updating
            DELETE FROM BRc273a13dfda91b1add89
            --------
            FILE I/O
            --------
            I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
            I/O thread 1 state: waiting for completed aio requests (log thread)
            I/O thread 2 state: waiting for completed aio requests (read thread)
            I/O thread 3 state: waiting for completed aio requests (read thread)
            I/O thread 4 state: waiting for completed aio requests (read thread)
            I/O thread 5 state: waiting for completed aio requests (read thread)
            I/O thread 6 state: waiting for completed aio requests (write thread)
            I/O thread 7 state: waiting for completed aio requests (write thread)
            I/O thread 8 state: waiting for completed aio requests (write thread)
            I/O thread 9 state: waiting for completed aio requests (write thread)
            Pending normal aio reads: [0, 0, 0, 0] , aio writes: [0, 0, 0, 0] ,
             ibuf aio reads:, log i/o's:, sync i/o's:
            Pending flushes (fsync) log: 0; buffer pool: 0
            20376 OS file reads, 68064 OS file writes, 15463 OS fsyncs
            0.00 reads/s, 0 avg bytes/read, 13.20 writes/s, 7.65 fsyncs/s
            -------------------------------------
            INSERT BUFFER AND ADAPTIVE HASH INDEX
            -------------------------------------
            Ibuf: size 1, free list len 546, seg size 548, 723 merges
            merged operations:
             insert 0, delete mark 176754, delete 0
            discarded operations:
             insert 0, delete mark 0, delete 0
            Hash table size 34679, node heap has 4 buffer(s)
            Hash table size 34679, node heap has 4 buffer(s)
            Hash table size 34679, node heap has 9 buffer(s)
            Hash table size 34679, node heap has 1 buffer(s)
            Hash table size 34679, node heap has 1 buffer(s)
            Hash table size 34679, node heap has 1 buffer(s)
            Hash table size 34679, node heap has 208 buffer(s)
            Hash table size 34679, node heap has 3 buffer(s)
            0.00 hash searches/s, 0.00 non-hash searches/s
            ---
            LOG
            ---
            Log sequence number 4418308559
            Log flushed up to   4418298159
            Pages flushed up to 4418284524
            Last checkpoint at  4384550662
            0 pending log flushes, 0 pending chkp writes
            5927 log i/o's done, 2.90 log i/o's/second
            ----------------------
            BUFFER POOL AND MEMORY
            ----------------------
            Total large memory allocated 137560064
            Dictionary memory allocated 118116
            Buffer pool size   8192
            Free buffers       0
            Database pages     7759
            Old database pages 2844
            Modified db pages  2
            Percent of dirty pages(LRU & free pages): 0.026
            Max dirty pages percent: 75.000
            Pending reads 0
            Pending writes: LRU 0, flush list 2, single page 0
            Pages made young 27108, not young 3400054
            2.80 youngs/s, 0.00 non-youngs/s
            Pages read 20338, created 24551, written 58406
            0.00 reads/s, 1.85 creates/s, 8.45 writes/s
            Buffer pool hit rate 1000 / 1000, young-making rate 4 / 1000 not 0 / 1000
            Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
            LRU len: 7759, unzip_LRU len: 0
            I/O sum[6879]:cur[151], unzip sum[0]:cur[0]
            --------------
            ROW OPERATIONS
            --------------
            0 queries inside InnoDB, 0 queries in queue
            0 read views open inside InnoDB
            Process ID=8894, Main thread ID=140259778918144, state: enforcing dict cache limit
            Number of rows inserted 1678464, updated 0, deleted 988977, read 4
            0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
            Number of system rows inserted 0, updated 0, deleted 0, read 0
            0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
            ----------------------------
            END OF INNODB MONITOR OUTPUT
            ============================
            InnoDB: ###### Diagnostic info printed to the standard error stream
            2017-08-11 14:51:18 140259762132736 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
            170811 14:51:18 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed, 
            something is definitely wrong and this may fail.
             
            Server version: 10.2.7-MariaDB
            key_buffer_size=134217728
            read_buffer_size=131072
            max_used_connections=1
            max_threads=153
            thread_count=7
            It is possible that mysqld could use up to 
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467207 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x49000
            /10.2.7/bin/mysqld(my_print_stacktrace+0x2e)[0xddc02e]
            /10.2.7/bin/mysqld(handle_fatal_signal+0x444)[0x7d5984]
            /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f90f1b2f390]
            /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f90f08b8428]
            /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f90f08ba02a]
            /10.2.7/bin/mysqld[0xba63e9]
            /10.2.7/bin/mysqld[0xb4f1b0]
            /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f90f1b256ba]
            /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f90f098a3dd]
            The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
            information that should help you find out what is causing the crash.
            

            elenst Elena Stepanova added a comment - - edited

            There is no need to wait for 20 minutes: the innodb-fatal-semaphore-wait-threshold=N option (600 seconds by default) controls the fatal wait, so setting it to a lower value makes the failure happen earlier. Conversely, if you set it to a bigger value, you won't get the failure at all – the query doesn't actually hang, it is just extremely slow. Given enough time (~20 min on my machine), it eventually deletes everything and succeeds; it still throws intermediate "long semaphore" warnings with increasing wait times, they are just not fatal if the threshold is high. If another connection using READ UNCOMMITTED checks the contents of the referencing tables, it is obvious that rows keep being deleted. The statement does have to cascade-delete ~800K rows in two tables, but that is no excuse for taking 20 minutes – on unaffected versions it finishes in about a minute.

            An MTR-friendly test case is attached. It's pretty much the same test as in the initial attachment, only unrelated tables are removed, the contents re-formatted for MTR, and tables are renamed to t1, t2, t3. Run it with --mysqld=--innodb-fatal-semaphore-wait-threshold=60 to get the failure in reasonable time.
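
            While the DELETE runs, a second connection can verify, as described above, that rows keep being deleted – a sketch, using the t2/t3 table names from the MTR version of the test:

            SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
            SELECT COUNT(*) FROM t2;  -- repeat: the count keeps dropping while the cascade progresses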

            The test case can be reduced, but since it takes different time on different builds and machines, it's unclear how much data is safe to remove until the test becomes unreliable.

            10.2 is affected, including the latest build. 10.1, 5.5, MySQL 5.6, MySQL 5.7 are not affected.

            5.7.18 release

            MySQL [test]> DELETE FROM BRc273a13dfda91b1add89;
            Query OK, 1 row affected (1 min 10.11 sec)
            

            5.7.19 release

            MySQL [test]> DELETE FROM BRc273a13dfda91b1add89;
            Query OK, 1 row affected (1 min 11.06 sec)
            

            5.6.37 release

            MySQL [test]> DELETE FROM BRc273a13dfda91b1add89;
            Query OK, 1 row affected (1 min 15.41 sec)
            

            10.1.26 release

            MariaDB [test]> DELETE FROM BRc273a13dfda91b1add89;
            Query OK, 1 row affected (32.03 sec)
            

            10.2.8 pre-release

            MariaDB [test]> DELETE FROM BRc273a13dfda91b1add89;
            Query OK, 1 row affected (18 min 23.64 sec)
            

            elenst Elena Stepanova made changes -
            Component/s Storage Engine - InnoDB [ 10129 ]
            Fix Version/s 10.2 [ 14601 ]
            Affects Version/s 10.2 [ 14601 ]
            Assignee Marko Mäkelä [ marko ]
            elenst Elena Stepanova made changes -
            Attachment mdev13498.test.gz [ 43978 ]
            claudio.nanni Claudio Nanni added a comment - - edited

            I need to wait around 20 minutes even with default 600 seconds for innodb_fatal_semaphore_wait_threshold.

            MariaDB [i17]> select now(); DELETE FROM BRc273a13dfd; select now();
            +---------------------+
            | now()               |
            +---------------------+
            | 2017-08-12 23:22:03 |
            +---------------------+
            1 row in set (0.00 sec)
            

            ERROR 2013 (HY000): Lost connection to MySQL server during query
            ERROR 2006 (HY000): MySQL server has gone away
            No connection. Trying to reconnect...
            ERROR 2002 (HY000): Can't connect to MySQL server on '127.0.0.1' (115)
            ERROR: Can't connect to the server
             
            unknown [i17]> 
            unknown [i17]> Bye
            [claudio@tossanc ~]$ date
            Sat Aug 12 23:44:28 CEST 2017
            

            If you look at SHOW ENGINE INNODB STATUS just before the crash, the transaction has been active for close to the elapsed wall-clock time, while the semaphore wait is shorter than that but already well over 600 seconds:

            SEMAPHORES
            ----------
            OS WAIT ARRAY INFO: reservation count 58240
            --Thread 139979397363456 has waited at srv0srv.cc line 2144 for 951.00 seconds the semaphore:
            X-lock (wait_ex) on RW-latch at 0xb628ee950 created in file dict0dict.cc line 1112
            a writer (thread id 139979397363456) has reserved it in mode  wait exclusive
            number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
            Last time read locked in file row0ins.cc line 1469
            Last time write locked in file srv0srv.cc line 2144
            OS WAIT ARRAY INFO: signal count 58121
            RW-shared spins 0, rounds 1723, OS waits 684
            RW-excl spins 0, rounds 92894, OS waits 3775
            RW-sx spins 53576, rounds 1606747, OS waits 53534
            Spin rounds per wait: 1723.00 RW-shared, 92894.00 RW-excl, 29.99 RW-sx
            ------------
            TRANSACTIONS
            ------------
            Trx id counter 12284
            Purge done for trx's n:o < 12284 undo n:o < 0 state: running but idle
            History list length 0
            LIST OF TRANSACTIONS FOR EACH SESSION:
            ---TRANSACTION 421457560813080, not started
            0 lock struct(s), heap size 1136, 0 row lock(s)
            ---TRANSACTION 421457560812184, not started
            0 lock struct(s), heap size 1136, 0 row lock(s)
            ---TRANSACTION 12275, ACTIVE 1322 sec updating or deleting
            mysql tables in use 3, locked 3
            27677 lock struct(s), heap size 3317968, 1563932 row lock(s), undo log entries 405753
            MySQL thread id 11, OS thread handle 0x7f503c04f700, query id 333 localhost 127.0.0.1 root updating
            DELETE FROM BRc273a13dfd
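
            Another way to watch the stuck transaction's progress, without reading the full monitor output, is the standard information_schema.INNODB_TRX table – a sketch:

            SELECT trx_id, trx_state,
                   TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS active_sec,
                   trx_rows_locked, trx_rows_modified
            FROM information_schema.INNODB_TRX;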

            elenst Elena Stepanova added a comment - - edited

            I need to wait around 20 minutes even with default 600 seconds for innodb_fatal_semaphore_wait_threshold.

            Right, because the problem, and hence the counting, apparently doesn't start right away, and also timing is not absolutely precise (that is, InnoDB doesn't abort immediately when the timeout is exceeded).
            Still, if you decrease the threshold, you'll get the problem faster – still not exactly in N seconds if you set the value to N, but in N+something.
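
            Note that the threshold is not dynamic – it has to be set at server startup (e.g. --innodb-fatal-semaphore-wait-threshold=60); the active value can be checked with:

            SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold';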

            marko Marko Mäkelä made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]

            marko Marko Mäkelä added a comment -

            I did not observe a hang, but only slowness, when running the test case on a debug binary.

            sudo perf top -g
            

            pointed me to a very costly std::find() call in a condition that was being evaluated even though the Galera cluster is not enabled!
            There is a similar condition in 10.1, but using a simpler predicate. The std::find() call was introduced in MariaDB 10.2.2 by a commit whose purpose was to merge changes between MySQL 5.7.9 and 5.7.14.
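
            (To double-check that Galera really was disabled during the test – on builds compiled without WITH_WSREP the variable simply does not exist:)

            SHOW GLOBAL VARIABLES LIKE 'wsrep_on';  -- expect OFF (or an empty result) when Galera is not enabled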
            I would propose the following patch. It avoids the costly std::find() call when the Galera cluster is not enabled, and it also removes some redundant comparisons.

            @@ -2414,29 +2414,24 @@ row_upd_sec_index_entry(
             		row_ins_sec_index_entry() below */
             		if (!rec_get_deleted_flag(
             			    rec, dict_table_is_comp(index->table))) {
            -
            -#ifdef WITH_WSREP
            -			que_node_t *parent = que_node_get_parent(node);
            -#endif
             			err = btr_cur_del_mark_set_sec_rec(
             				flags, btr_cur, TRUE, thr, &mtr);
             			if (err != DB_SUCCESS) {
             				break;
             			}
             #ifdef WITH_WSREP
            -			if (err == DB_SUCCESS && !referenced                  &&
            -			    !(parent && que_node_get_type(parent) ==
            -				QUE_NODE_UPDATE                               &&
            -			      (std::find(((upd_node_t*)parent)->cascade_upd_nodes->begin(),
            -                                         ((upd_node_t*)parent)->cascade_upd_nodes->end(),
            -                                         node) ==
            -                               ((upd_node_t*)parent)->cascade_upd_nodes->end()))  &&
            -                            foreign
            -                        ) {
            -				ulint*	offsets =
            -					rec_get_offsets(
            -						rec, index, NULL, ULINT_UNDEFINED,
            -						&heap);
            +			upd_node_t *parent = static_cast<upd_node_t*>(
            +				que_node_get_parent(node));
            +			if (!referenced && foreign
            +			    && que_node_get_type(parent) == QUE_NODE_UPDATE
            +			    && wsrep_on(trx->mysql_thd)
            +			    && std::find(parent->cascade_upd_nodes->begin(),
            +					 parent->cascade_upd_nodes->end(),
            +					 node)
            +			    == parent->cascade_upd_nodes->end()) {
            +				ulint*	offsets = rec_get_offsets(
            +					rec, index, NULL, ULINT_UNDEFINED,
            +					&heap);
             
             				err = wsrep_row_upd_check_foreign_constraints(
             					node, &pcur, index->table,
            @@ -2450,19 +2445,19 @@ row_upd_sec_index_entry(
             				case DB_DEADLOCK:
             					if (wsrep_debug) {
             						ib::warn() << "WSREP: sec index FK check fail for deadlock"
            -							   << " index " << index->name()
            -							   << " table " << index->table->name.m_name;
            +							   << " index " << index->name
            +							   << " table " << index->table->name;
             					}
             					break;
             				default:
            -					ib::error() << "WSREP: referenced FK check fail: " << err
            -						    << " index " << index->name()
            -						    << " table " << index->table->name.m_name;
            +					ib::error() << "WSREP: referenced FK check fail: " << ut_strerr(err)
            +						    << " index " << index->name
            +						    << " table " << index->table->name;
             
             					break;
             				}
            -			}
             #endif /* WITH_WSREP */
            +			}
             		}
             
             		ut_ad(err == DB_SUCCESS);
            

            Even with this patch applied, the test case takes a very long time (several minutes, and counting) on a non-debug server. So, I wonder if further fixes are needed.

            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Jan Lindström [ jplindst ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            marko Marko Mäkelä added a comment -

            bb-10.2-marko
            My commit comment describes why I believe that these conditions were not only slow but also wrong.
            I do not know if the introduced predicate wsrep_must_process_fk() is correct; I hope seppo can review it and point me to the tests that should exercise this code.


            marko Marko Mäkelä added a comment -

            My observation that the predicate was not only slow but also possibly wrong could explain MDEV-13246 (Stale rows despite ON DELETE CASCADE constraint).

            There was a compilation error in my patch when WITH_WSREP was not enabled. Revised patch on bb-10.2-marko

            marko Marko Mäkelä made changes -

            jplindst Jan Lindström (Inactive) added a comment -

            OK to push with the change discussed on Slack.

            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Marko Mäkelä [ marko ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            marko Marko Mäkelä added a comment -

            The change that was discussed: pick up the !wsrep_thd_is_BF() condition from the Galera MW-369 FK fixes, which was missing from MariaDB 10.2.
            Side note: MariaDB 10.2 may also be missing some other changes from the Galera 5.7 branch.

            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2.8 [ 22544 ]
            Fix Version/s 10.2 [ 14601 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            claudio.nanni Claudio Nanni added a comment -

            Now it's much faster on a single instance, but if Galera is enabled it behaves exactly as before, taking so long that it will eventually render the cluster unusable.

            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            claudio.nanni, you are right, the Galera logic was not fixed yet. I filed MDEV-13678 to track that fix.

            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 82030 ] MariaDB v4 [ 152596 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 110990

            People

              Assignee: marko Marko Mäkelä
              Reporter: claudio.nanni Claudio Nanni
              Votes: 0
              Watchers: 5

