MariaDB Server / MDEV-30780

Parallel slave hangs after hitting an error

Details

    Description

      After a parallel worker thread hits an error that must stop the slave,
      show slave status displays the error but still reports Yes for the slave running status, e.g.:

      show slave status\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: 172.31.15.61
                         Master_User: db02replication
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mysql-bin.028940
                 Read_Master_Log_Pos: 1050157656
                      Relay_Log_File: relay-bin.000134
                       Relay_Log_Pos: 964684321
               Relay_Master_Log_File: mysql-bin.028938
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: 
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: 
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 1062
                          Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 964684022
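
The symptom in this output (a non-zero Last_Errno while Slave_SQL_Running is still Yes) can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes the vertical (\G) output has been captured as text, and the helper names parse_status and looks_stuck_after_error are hypothetical. The SQL thread may show Yes for a short moment while it is stopping, so a positive result is only a hint and should be confirmed over time.

```python
def parse_status(text):
    """Parse the vertical (\\G) output of SHOW SLAVE STATUS into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("*"):
            continue  # skip blanks and the "1. row" separator
        # Split on the first colon only: Last_Error values contain colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def looks_stuck_after_error(fields):
    """True when an error is recorded but the SQL thread still reports Yes."""
    errno = int(fields.get("Last_Errno") or 0)
    return errno != 0 and fields.get("Slave_SQL_Running") == "Yes"


# Shortened sample taken from the status output in this report.
sample = """
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes
    Last_Errno: 1062
    Exec_Master_Log_Pos: 964684022
"""
status = parse_status(sample)
print(looks_stuck_after_error(status))
```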
      

      However, instead of exiting as expected, the slave threads may hang, like this:

      +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
      | Id   | User         | Host               | db   | Command      | Time  | State                                         | Info             | Progress |
      +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
      |    5 | system user  |                    | NULL | Slave_IO     | 51160 | Waiting for master to send event              | NULL             |    0.000 |
      |   19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep        |     5 |                                               | NULL             |    0.000 |
      |   61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep        |    10 |                                               | NULL             |    0.000 |
      | 2394 | system user  |                    | NULL | Slave_worker | 50852 | closing tables                                | NULL             |    0.000 |
      | 2395 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2396 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2397 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2398 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2399 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2400 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2401 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2402 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2403 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2404 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
      | 2405 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
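      | 2393 | system user  |                    | NULL | Slave_SQL    | 50860 | Waiting for room in worker thread event queue | NULL             |    0.000 |
      +------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+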

      Slave_SQL may also hang in a different state.
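
A process list like the one above can be screened for this pattern automatically. The sketch below is an illustration only: it assumes the ASCII table has been captured as text, the function names are hypothetical, and the min_seconds threshold is an arbitrary assumption rather than a value from the report.

```python
def parse_processlist(table_text):
    """Parse the ASCII table printed by SHOW PROCESSLIST into a list of dicts."""
    rows, header = [], None
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells          # first | row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def long_running_workers(rows, min_seconds=3600):
    """Return Slave_worker rows whose Time is at least min_seconds."""
    return [r for r in rows
            if r["Command"] == "Slave_worker" and int(r["Time"]) >= min_seconds]


# Three rows shortened from the process list in this report.
sample = """
| Id   | User         | Host               | db   | Command      | Time  | State                                         | Info | Progress |
| 19   | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep        | 5     |                                               | NULL | 0.000    |
| 2394 | system user  |                    | NULL | Slave_worker | 50852 | closing tables                                | NULL | 0.000    |
| 2395 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL | 0.000    |
"""
suspects = long_running_workers(parse_processlist(sample))
for row in suspects:
    print(row["Id"], row["State"])
```

One worker sitting in "closing tables" while all the others wait for a prior transaction is the combination shown in this report. As noted above, Slave_SQL may hang in a different state, so the sketch looks only at the worker rows.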

      Analysis showed that the worker in the "closing tables" state was trapped in an endless loop
      in mark_start_commit_inner, iterating over already garbage-collected items, including rgi->gco itself.
      The cause of this belated access was identified as possible out-of-order group commit
      in the error branch.
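
The general hazard can be illustrated with a toy model. This is not the server's code, and the report does not state exactly how the loop arose. The Node and Pool classes and the walk_successors function are hypothetical, and the way the cycle is produced here (a node returned to a free list a second time, out of order) is an assumption chosen only to make the hazard visible.

```python
class Node:
    """Toy stand-in for a list element linked to its successor via `next`."""
    def __init__(self, name):
        self.name = name
        self.next = None


class Pool:
    """LIFO free list that reuses `next` as its own link field."""
    def __init__(self):
        self.free_head = None

    def release(self, node):
        node.next = self.free_head
        self.free_head = node


def walk_successors(start, max_steps=8):
    """Follow `next` from start; give up after max_steps to expose a cycle."""
    visited, node = [], start.next
    while node is not None:
        if len(visited) == max_steps:
            return visited, False  # guard tripped: the walk would not end
        visited.append(node.name)
        node = node.next
    return visited, True


# Live chain: a -> b. Walking it terminates normally.
a, b = Node("gco_a"), Node("gco_b")
a.next = b
live_result = walk_successors(a)

# Both nodes are recycled, then `a` is released again, late and out of order.
pool = Pool()
pool.release(a)  # a.next = None, free list: a
pool.release(b)  # b.next = a,    free list: b -> a
pool.release(a)  # a.next = b,    free list now cycles: a -> b -> a

# A holder of the stale reference to `a` now walks in circles.
stale_result = walk_successors(a)
print(live_result)
print(stale_result[1])
```

Without the max_steps guard the second walk would never return, which is the kind of behaviour a thread spinning over recycled items would show from the outside.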

      The issue applies to both the conservative and optimistic modes.
      A patch, to be committed soon, fixes this case by reinforcing group_commit_orderer-based ordering for errored-out workers.

          Activity

            danblack Daniel Black added a comment -

            forke, sorry only just saw your message.

            Is the problem still occurring? Are you running on 10.11.3 now?

            Was the 10.11.2+changes build compiled with debug symbols (file mariadbd doesn't show "stripped"), or are the debug info packages of 10.11.3 installed?

            If so, can you obtain a backtrace on the running mariadbd instance? ref.

            forke Marcin Wanat added a comment -

            danblack I tried this commit on 10.11.2 (before 10.11.3 was released) and the issue was NOT fixed. After upgrading to 10.11.3 the issue is resolved, so probably some more changes were required to resolve it. The problem is that 10.11.3 has a critical bug, MDEV-31234, that causes UNDO logs to grow indefinitely.

            As a result, neither of these releases is usable in production: 10.11.2 hangs replication for no reason and 10.11.3 grows UNDO logs indefinitely.

            danblack Daniel Black added a comment -

            Thanks for confirming, forke. Your notes in MDEV-31234, while not responded to, have been read.

            RagulR Ragul added a comment -

            Hi Andrei Elkin, do we have a procedure to reproduce this issue and test it?
            We are facing a similar issue, which can't be reproduced on demand but occurs at random.

            Elkin Andrei Elkin added a comment -

            RagulR howdy, I've just replied on the mailing list that your stacks do not fit this case. Please find there what would be good to do next. Thank you and good luck! Andrei


            People

              Elkin Andrei Elkin