[MDEV-30780] Parallel slave hangs after hitting an error Created: 2023-03-03 Updated: 2023-07-12 Resolved: 2023-03-16 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.4, 10.5, 10.6, 10.8, 10.9, 10.10, 10.11 |
| Fix Version/s: | 10.11.3, 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Andrei Elkin | Assignee: | Andrei Elkin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Description |
|
After a parallel worker thread hits an error that should error-stop the slave, the slave threads may hang instead of exiting as expected.
Slave_SQL may also hang in a different state. Upon analysis it turned out that a worker closing tables became trapped in an endless loop. The issue applies to both the conservative and optimistic modes. |
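The two apply modes mentioned above are selected via the `slave_parallel_mode` system variable. As a minimal sketch (the variable names are the real MariaDB ones; the thread count is an arbitrary example, not a recommendation), a replica can be switched between the modes like this:

```sql
-- Illustrative only: choose the parallel apply mode discussed above.
-- Both variables require the replica to be stopped before they can change.
STOP SLAVE;
SET GLOBAL slave_parallel_mode    = 'conservative';  -- or 'optimistic'
SET GLOBAL slave_parallel_threads = 12;              -- example value
START SLAVE;
```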
| Comments |
| Comment by Andrei Elkin [ 2023-03-07 ] |
|
The patch is pushed to bb-10.4-andrei. |
| Comment by Rick Pizzi [ 2023-03-11 ] |
|
Elkin I just hit this bug with a client that runs CONSERVATIVE parallel replication. |
| Comment by Andrei Elkin [ 2023-03-11 ] |
|
rpizzi thanks for the heads up. I confirm the issue is more general than originally stated in the summary; I've updated it now. The patch is under review and will be available fairly soon. |
| Comment by Brandon Nesterenko [ 2023-03-13 ] |
|
Approved (discussed patch on Slack). |
| Comment by Marcin Wanat [ 2023-04-11 ] |
|
Why is 10.11 not included in Fix Version/s? Is it not affected, or did it take a bit more work to patch this version too? |
| Comment by Brandon Nesterenko [ 2023-04-11 ] |
|
forke It was accidentally missed while closing the ticket; it will be fixed in 10.11.3. |
| Comment by Marcin Wanat [ 2023-04-26 ] |
|
The issue is probably not fully fixed. We are experiencing it on 10.11.2 under high load in conservative mode. The master has binlog_commit_wait_count = 10 and the slaves have 12 replication workers; replication is ROW based. The master takes 100k+ updates/inserts per second plus a few DROP/CREATE TABLE statements (we often use CREATE TABLE LIKE + RENAME TABLE + DROP as a faster alternative to truncating large tables under high load, to reduce latency for clients). I have manually replaced rpl_rli.cc, rpl_parallel.h and rpl_parallel.cc in 10.11.2 with the latest files from the GitHub 10.11 branch (which has this patch merged) and recompiled from source. EDIT: |
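The fast-truncate pattern described in the comment above can be sketched as follows; the table name `events` is hypothetical, and the sketch assumes nothing writes to the table between the swap and the resumption of traffic:

```sql
-- Hypothetical sketch of the CREATE LIKE + RENAME TABLE + DROP pattern
-- mentioned above; `events` is a made-up table name.
CREATE TABLE events_new LIKE events;   -- empty clone with the same schema
RENAME TABLE events TO events_old,
             events_new TO events;     -- single atomic swap
DROP TABLE events_old;                 -- discard the old data afterwards
```

The point of the pattern, as the comment notes, is that writers only ever see the brief atomic RENAME, while the costly removal of the large old table happens separately.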
| Comment by Daniel Black [ 2023-05-30 ] |
|
forke, sorry, I only just saw your message. Is the problem still occurring? Are you running 10.11.3 now? Was the 10.11.2+changes build compiled with debug symbols (`file mariadbd` doesn't show "stripped"), or are the debuginfo packages for 10.11.3 installed? If so, can you obtain a backtrace of the running mariadbd instance? |
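One common way to capture the requested backtrace (an illustrative command, assuming gdb is installed and mariadbd was built with debug symbols as described above) is to attach gdb briefly to the running process:

```sh
# Illustrative: dump a backtrace of every thread of the running server.
# Attaching pauses mariadbd briefly; run with the same privileges as the server.
gdb --batch -p "$(pidof mariadbd)" -ex "thread apply all bt full" > mariadbd_backtrace.txt
```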
| Comment by Marcin Wanat [ 2023-05-30 ] |
|
danblack I was trying this commit on 10.11.2 (before 10.11.3 was released) and the issue was NOT fixed. But after upgrading to 10.11.3 the issue is resolved, so some more changes were probably required to fix it. The problem is that 10.11.3 has the critical bug MDEV-31234, which makes UNDO logs grow indefinitely. So neither of these releases is usable in production: 10.11.2 hangs replication for no reason and 10.11.3 grows UNDO logs indefinitely. |
| Comment by Daniel Black [ 2023-05-31 ] |
|
Thanks for confirming, forke; your notes in |
| Comment by Ragul [ 2023-06-02 ] |
|
Hi Andrei Elkin, do we have any procedure to reproduce this issue and test it? |
| Comment by Andrei Elkin [ 2023-06-02 ] |
|
RagulR howdy, I've just replied on the mailing list that your stack traces do not fit this case. Please see there for what would be good to do next. Thank you and good luck! Andrei |