[MDEV-30780] parallel slave hangs after hit an error Created: 2023-03-03  Updated: 2023-07-12  Resolved: 2023-03-16

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4, 10.5, 10.6, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.11.3, 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Critical
Reporter: Andrei Elkin Assignee: Andrei Elkin
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Duplicate
Relates
relates to MDEV-31052 Parallel slave hangs with binlog_alte... Open
relates to MDEV-31448 Killing a replica thread awaiting its... Closed

 Description   

After a parallel worker thread hits an error that must be error-stopping the slave
show slave status does display the error along with YES of the slave running status, e.g

show slave status\G
 
*************************** 1. row ***************************
 
                Slave_IO_State: Waiting for master to send event
 
                   Master_Host: 172.31.15.61
 
                   Master_User: db02replication
 
                   Master_Port: 3306
 
                 Connect_Retry: 60
 
               Master_Log_File: mysql-bin.028940
 
           Read_Master_Log_Pos: 1050157656
 
                Relay_Log_File: relay-bin.000134
 
                 Relay_Log_Pos: 964684321
 
         Relay_Master_Log_File: mysql-bin.028938
 
              Slave_IO_Running: Yes
 
             Slave_SQL_Running: Yes
 
               Replicate_Do_DB: 
 
           Replicate_Ignore_DB: 
 
            Replicate_Do_Table: 
 
        Replicate_Ignore_Table: 
 
       Replicate_Wild_Do_Table: 
 
   Replicate_Wild_Ignore_Table: 
 
                    Last_Errno: 1062
 
                    Last_Error: Could not execute Write_rows_v1 event on table pingtree.campaignOutboundDupeEmail; Duplicate entry '877-damien_cunningham88@outlook.com' for key 'codePrimaryEmail', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.028938, end_log_pos 964684486
 
                  Skip_Counter: 0
 
           Exec_Master_Log_Pos: 964684022

Slave threads however instead of expected exiting may hang like

+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
| Id   | User         | Host               | db   | Command      | Time  | State                                         | Info             | Progress |
+------+--------------+--------------------+------+--------------+-------+-----------------------------------------------+------------------+----------+
|    5 | system user  |                    | NULL | Slave_IO     | 51160 | Waiting for master to send event              | NULL             |    0.000 |
|   19 | mariadbadmin | 172.31.15.18:58548 | NULL | Sleep        |     5 |                                               | NULL             |    0.000 |
|   61 | mariadbadmin | 172.31.15.18:46002 | NULL | Sleep        |    10 |                                               | NULL             |    0.000 |
| 2394 | system user  |                    | NULL | Slave_worker | 50852 | closing tables                                | NULL             |    0.000 |
| 2395 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2396 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2397 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2398 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2399 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2400 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2401 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2402 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2403 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2404 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2405 | system user  |                    | NULL | Slave_worker | 50852 | Waiting for prior transaction to start commit | NULL             |    0.000 |
| 2393 | system user  |                    | NULL | Slave_SQL    | 50860 | Waiting for room in worker thread event queue | NULL             |    0.000 |

Slave_SQL may also hang in a different state.

Upon analysis it turned out that closing tables worker got entrapped in endless looping
in mark_start_commit_inner across already garbage-collected items including rgi->gco itself.
The reason of the belated access is identified as possible out-of-order group committing
in the error branch.

The issue applies to both the conservative and optimistic modes.
A patch, to be committed soon, fixes the case to reinforce group_commit_orderer-based order for errored-out workers.



 Comments   
Comment by Andrei Elkin [ 2023-03-07 ]

The patch is pushed to bb-10.4-andrei.

Comment by Rick Pizzi [ 2023-03-11 ]

Elkin I just hit this bug with a client which has CONSERVATIVE parallel replication.
Will the patch cover that case as well?

MariaDB [(none)]> show processlist;
+---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+
| Id      | User                       | Host                 | db           | Command      | Time    | State                                         | Info             | Progress |
+---------+----------------------------+----------------------+--------------+--------------+---------+-----------------------------------------------+------------------+----------+
|      15 | system user                |                      | NULL         | Slave_IO     | 7336114 | Waiting for master to send event              | NULL             |    0.000 |
|      17 | system user                |                      | NULL         | Slave_worker |  123681 | closing tables                                | NULL             |    0.000 |
|      18 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |
|      19 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |
|      20 | system user                |                      | NULL         | Slave_worker |  123681 | Waiting for prior transaction to start commit | NULL             |    0.000 |
|      16 | system user                |                      | NULL         | Slave_SQL    |  123685 | Reading event from the relay log              | NULL             |    0.000 |
| 1319544 | skysql_maxscale_db00006746 | 10.103.64.226:46248  | NULL         | Sleep        |       1 |                                               | NULL             |    0.000 |
| 1640887 | readonly_user              | 35.188.30.0:62489    | NULL         | Sleep        |     112 |                                               | NULL             |    0.000 |
| 1640888 | readonly_user              | 35.188.30.0:62491    | chariot_case | Sleep        |     112 |                                               | NULL             |    0.000 |
| 1644053 | readonly_user              | 35.188.30.0:65440    | chariot_case | Sleep        |      83 |                                               | NULL             |    0.000 |
| 1650211 | root                       | localhost            | NULL         | Query        |   55897 | Killing slave                                 | stop slave       |    0.000 |

Comment by Andrei Elkin [ 2023-03-11 ]

rpizzi thanks for the heads up. I confirm the issue is more general that was originally stated in the summary. I've updated it now. The patch is at review and will be available fairly soon.

Comment by Brandon Nesterenko [ 2023-03-13 ]

Approved (discussed patch on Slack).

Comment by Marcin Wanat [ 2023-04-11 ]

Why 10.11 is not included in Fix Version? It it not affected or it took a bit more work to patch this version too?

Comment by Brandon Nesterenko [ 2023-04-11 ]

forke It was accidentally missed while closing the ticket, it will be fixed in 10.11.3.

Comment by Marcin Wanat [ 2023-04-26 ]

The issue is probably not fully fixed.

We are experiencing this issue on 10.11.2 under high load in conservative mode. Master has binlog commit wait count = 10 and slaves have 12 replication workers, replication is ROW based. Master have 100k+ updates/inserts per second and few DROP/CREATE tables (we are often using CREATE LIKE+RENAME TABLE+DROP as faster alternative to truncate large tables under high load to reduce latency for client).

I have manually replaced rpl_rli.cc, rpl_parallel.h and rpl_parallel.cc from 10.11.2 with latest files from github 10.11 branch (it has merged this patch) and recompiled from source.
Replication can still hang the same way, but not that often as previously.

EDIT:
There is one change. It does not have "Closing tables" state, but all workerd are NULL or waiting:

MariaDB [(none)]> SHOW PROCESSLIST;
+---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+
| Id      | User        | Host                      | db      | Command      | Time | State                                         | Info             | Progress |
+---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+
|       4 | system user |                           | NULL    | Slave_IO     | 5845 | Waiting for master to send event              | NULL             |    0.000 |
|       6 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|       7 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
|       9 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
|       8 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
|      10 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|      11 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|      12 | system user |                           | NULL    | Slave_worker |  311 | Waiting for prior transaction to commit       | NULL             |    0.000 |
|      13 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|      14 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|      15 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|      16 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|      17 | system user |                           | NULL    | Slave_worker |  311 | NULL                                          | NULL             |    0.000 |
|       5 | system user |                           | NULL    | Slave_SQL    |  318 | Waiting for room in worker thread event queue | NULL             |    0.000 |
| 2524245 | root        | localhost                 | NULL    | Query        |    0 | starting                                      | SHOW PROCESSLIST |    0.000 |
+---------+-------------+---------------------------+---------+--------------+------+-----------------------------------------------+------------------+----------+

Comment by Daniel Black [ 2023-05-30 ]

forke, sorry only just saw your message.

Is the problem still occurring? Are you running on 10.11.3 now?

Was the 10.11.2+changes compiled with debug symbols (file mariadbd doesn't show "stripped"), or debug info packages of 10.11.3 installed.

If so, can you obtain a backtrace on the running mariadbd instance? ref.

Comment by Marcin Wanat [ 2023-05-30 ]

danblack I was trying this commit on 10.11.2 (before 10.11.3 was released) and the issue was NOT fixed. But after upgrading to 10.11.3 the issue is resolved, so probably some more changes was required to resolve this issue. The problem is that in 10.11.3 there is critical bug MDEV-31234 that result in UNDO logs to grow indefinitely.

This way none of these releases are usable in production. 10.11.2 hangs replication for no reason and 10.11.3 grow UNDO logs indefinitely.

Comment by Daniel Black [ 2023-05-31 ]

Thanks for confirming forke, your notes in MDEV-31234 while not responded to have been read.

Comment by Ragul [ 2023-06-02 ]

Hi Andrei Elkin, Do we have any procedures to reproduce this issue and test it?
Facing similar issue which can't be reproduced but occurs at random.

Comment by Andrei Elkin [ 2023-06-02 ]

RagulR howdy, I've just replied in the mailing list that your stacks do not fit to this case. Please find there what would be good to do next. Thank you and good luck! Andrei

Generated at Thu Feb 08 10:18:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.