[MDEV-15660] Problematic BINLOG events from master Created: 2018-03-26 Updated: 2019-02-05 Resolved: 2019-02-05 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication, Server |
| Affects Version/s: | 10.0.25, 10.1.32 |
| Fix Version/s: | 10.0.38 |
| Type: | Bug | Priority: | Major |
| Reporter: | MG | Assignee: | Andrei Elkin |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | binlog |
| Attachments: |
| Issue Links: |
| Description |
|
Had a master, within the same binary log, start generating BINLOG events that had no effect on the slaves. By default for a new session, binlog_format=ROW and tx_isolation='READ-COMMITTED'. Newly created slaves would see the same problem: BINLOG events that ran did not affect rows or cause an error. The same would occur in a PITR scenario, piping mysqlbinlog into `mysql` with or without --binary-mode. The simplest statement that did not replicate was: mysql <<< "replace into replmon.hb (id, hb) values (1, now())" If I created a session and, before issuing the above, changed tx_isolation to REPEATABLE-READ with binlog_format=STATEMENT, the command would replicate correctly. I suspect a simple restart would have helped, but the packages were upgraded from 10.0.25 to 10.0.32 in hopes of avoiding the issue again. It may be a second bug, beyond the strange events being created, that slaves neither modified rows nor raised an error on these BINLOG commands. I tried 10.0 slaves of course, but had the same experience with 10.2.13 (non-enterprise) in replication or PITR via mysqlbinlog. |
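The reporter's check can be written out as a small session script (a sketch based on the description above; `replmon.hb` is the reporter's heartbeat table, host assignments are implied by the ticket):

```sql
-- On the master (session defaults: binlog_format=ROW, READ-COMMITTED):
REPLACE INTO replmon.hb (id, hb) VALUES (1, NOW());

-- On an affected slave, the row never changes and no error is raised:
SELECT id, hb FROM replmon.hb WHERE id = 1;

-- Workaround observed by the reporter: switch the session to statement
-- logging before issuing the write, and it replicates correctly.
SET SESSION tx_isolation = 'REPEATABLE-READ';
SET SESSION binlog_format = 'STATEMENT';
REPLACE INTO replmon.hb (id, hb) VALUES (1, NOW());
```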
| Comments |
| Comment by MG [ 2018-03-26 ] |
|
Please see ftp.askmonty.org:/private/. I do still have the binary logs, but will not be able to share them in their entirety. |
| Comment by Jan Hugo Prins [ 2018-04-20 ] |
|
I can confirm that this issue also exists with MariaDB 10.1.9. We have done several tests, but the only thing that fixed the issue was rebooting the master server. The following things were tested to see if they would fix the issue:
In the end we decided to take the downtime and restart the MariaDB process on the master node. At first we thought this worked, but then we found something else. The backup we created to restore the slave was made with xtrabackup; the tool works just fine, but it looks like the backup was corrupt. We created this backup before we restarted MariaDB, and the only thing we can think of is that the backup was taken from a corrupted memory state of the master, which resulted in a corrupted backup. After creating a new backup once MariaDB was restarted, we were able to use it to start our slaves again. This new backup was created using a DB-wide lock and a filesystem snapshot. |
| Comment by Jan Hugo Prins [ 2018-06-12 ] |
|
I had the issue again on one of my clusters, but this time everything was replicating fine except for some databases that were created in the last couple of weeks. I am resyncing this cluster now using a snapshot of the master, and upgrading to 10.1.32 at the same time. Time will tell whether the problem manifests itself again. Is there any progress on this bug? |
| Comment by MG [ 2018-06-13 ] |
|
@Jan perhaps you are using binlog_format=MIXED? I am trying to guess why you would have some affected databases but not all. It could be that the newer databases use SQL that is deemed "unsafe" for STATEMENT logging, and thus ROW events are generated (seemingly incorrectly). Also, the xtrabackup would almost certainly have worked if it had been taken after your service restart, but the problematic events happening during/after the backup (before the restart) would have left the slave with missing records or errors like HA_ERR_KEY_NOT_FOUND. |
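A quick way to check which format a session actually logs under, and to see which event types a given binlog contains (a fragment to run on the master; the binlog file name is a placeholder):

```sql
-- What format is in effect globally and for this session?
SELECT @@global.binlog_format, @@session.binlog_format;

-- List event types in a binlog: Annotate_rows/Table_map/Write_rows indicate
-- row-based logging; Query events carrying the statement text indicate SBR.
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 20;
```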
| Comment by Reinder Cuperus [ 2018-06-13 ] |
|
(colleague of Jan here)
|
| Comment by Jan Hugo Prins [ 2018-09-12 ] |
|
We have updated all clusters to 10.1.32. One of the clusters has had several occurrences of this problem over the last couple of weeks, roughly once a week. We have also seen that one table within a database can replicate just fine while a different table in the same database has stopped. The replication of a table stops at an undefined moment, and over a longer period of time several tables stop replicating at random times. |
| Comment by Reinder Cuperus [ 2018-11-19 ] |
|
We are currently in a situation where this bug (or one very similar to it) is hitting one of our clusters every 7-10 days. Last week we upgraded this cluster to 10.3.9. Unfortunately, the upgrade from 10.1.32 to 10.3.9 did not fix the problem. Around Nov 15 07:50 the replication started failing. Our setup is as follows:
| The output from journalctl -u mariadb shows some curious repetitive entries:
| Summarizing this output over a few days gives this pattern:
| I can demonstrate the nature of the problem by running this code on the master (it's not a reproduction):
| On the slave, the content of this table is:
| I have made:
| I find it interesting that the first INSERT shows up in the relay log of the slave but isn't applied by the slave. Also interesting is that only a restart of the MariaDB process on the master fixes the problem. A RESET SLAVE on the slave, or a RESET MASTER on the master, will not fix the situation. If more information is needed to debug this problem, let me know. |
| Comment by Andrei Elkin [ 2018-11-19 ] |
|
In your case we see errors from the slave I/O (receiver) thread like
| Could you clarify whether the reconnect succeeded so that the slave resumed automatically? If that is the case, perhaps the error could be caused by |
| Comment by MG [ 2018-11-19 ] |
|
This looks to me like the same problem originally reported (randomly, RBR events stop affecting the slave while SBR events continue to affect the slave; RBR events likewise have no effect via mysqlbinlog | mysql).
Yes, this is the problem (two problems, really):
1) The master generates problematic RBR events that have no effect (e.g. the event traverses the master binary log and gets into the slave relay log, but the SQL thread silently ignores it; or you run mysqlbinlog | mysql on a backup and it does not modify rows). In my case, after a certain point, all RBR events were impotent until the master was restarted, and this abrupt breakage happened within a single binary log.
2) The slave / BINLOG() event is impotent, but there is no error such as "no rows modified" (such a check may be too expensive to implement).
Regarding "Slave I/O thread: Failed reading log event": this is probably due to STOP ALL SLAVES, though I was not familiar with the syntax nor did I test it. I guess your backup takes two hours. It seems strange to need STOP ALL SLAVES when taking a backup versus STOP SLAVE, or to just use a non-blocking backup like mariabackup. This is based on my quick reading about STOP ALL SLAVES, which seems to write a binlog event that causes a slave to stop via the binary log (keep in mind that you use circular replication, so this event probably replicates back and may not consider server-id). Again, all of this is speculation on my part. |
| Comment by Reinder Cuperus [ 2018-11-20 ] |
The slave connects normally. I made some more observations on these log entries today:
| Then I see these log entries on the passive master (logdb02):
| Also, the tcp connection from logdb01 to logdb02:3306 gets replaced by another tcp connection. |
| Comment by Reinder Cuperus [ 2018-11-20 ] |
|
@MG STOP ALL SLAVES works like STOP SLAVE, but for all slave connections when you have multi-source replication. I don't see anything in the documentation about it writing binlog events. Stopping slave activity makes it possible to take a single point-in-time backup using parallel mysqldump processes (one for each table). It also prevents locks from the mysqldump processes interacting oddly with the locks taken by the slave threads. |
| Comment by MG [ 2018-11-20 ] |
|
Got it, thanks. I was looking at a different implementation of STOP ALL SLAVES (not MariaDB's). That one is just for MSR / replication channels, stopping each of the connections to the various masters when run on a given slave. |
| Comment by Andrei Elkin [ 2018-11-21 ] |
|
If I understood the case correctly, it is an 'impotent' Write_rows_log_event out of I see you could not reproduce it. But it seems related to the fact that this Write event immediately follows the CREATE. |
| Comment by MG [ 2018-11-21 ] |
|
In my case, I saw the problem:
| In either case, no rows would be affected by Write_rows/Update_rows/Delete_rows events (e.g. there is an Annotate_rows event), but statement-logged events (e.g. they carry SET TIMESTAMP / SET session sql_mode) such as DDL would replicate/apply via mysqlbinlog. This workload does use a lot of DDL (CREATE/TRUNCATE/DROP), most often via stored procedures. Within the SP you would see the CREATE/DROP effects on the slave, but the table would remain empty (the INSERT logged after Table_map/Write_rows had no effect). |
| Comment by Andrei Elkin [ 2018-11-22 ] |
|
MG, so any parallel-slave issue is ruled out. I am still trying to build a picture of what happened.
Can you confirm that not only DDLs but DMLs as well are potent? |
| Comment by Reinder Cuperus [ 2018-11-22 ] |
|
I have not tried UPDATEs, but I can confirm that INSERTs and DELETEs work when using statement-based replication. I restarted logdb01 on Tuesday and resynchronised everything on logdb02 again. I expect replication to fail again next week. If you have any suggestions for instrumentation/debugging/sniffing/etc. that I might enable on these servers, I can try to add those. The current slave settings on the servers:
| logdb02 (passive master):
|
| Comment by Reinder Cuperus [ 2018-11-29 ] |
|
As predicted, replication from logdb01 to logdb02 failed yesterday. Because it is so predictable, I have the opportunity to add extra monitoring. |
| Comment by MG [ 2018-12-20 ] |
|
@Andrei Elkin From my initial comment in this report:
|
| Comment by Andrei Elkin [ 2019-01-15 ] |
|
I put it neutrally as 'relates', while being sure it's a duplicate. |
| Comment by Andrei Elkin [ 2019-01-15 ] |
|
mg, reinder: by all the symptoms this case is a duplicate of
Keeping the ticket active until you have confirmed. |
| Comment by Reinder Cuperus [ 2019-01-16 ] |
|
@Andrei I think it's plausible that the bug described in
1.9b table_ids is absurdly high for only 4 days, so we went investigating what the reason could be. We suspect that the table_open_cache is too small for this server:
| So increasing the variable table_open_cache should prevent the bug from manifesting itself on this server (and help a little with performance). |
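The reported growth rate lines up neatly with the observed failure interval: 1.9 billion table_ids in 4 days is roughly 5,500 per second, which crosses a 32-bit boundary in about 9 days, squarely inside the 7-10 day failure cycle reported earlier. A quick back-of-the-envelope check (my own arithmetic, not from the ticket):

```python
# Back-of-the-envelope: how fast does table_id grow on this server, and
# when would it exceed UINT32_MAX, the boundary suspected in the parent bug?
ids_in_4_days = 1_900_000_000           # "1.9b table_ids ... for only 4 days"
seconds = 4 * 86_400
rate = ids_in_4_days / seconds          # ids allocated per second
days_to_wrap = (2**32) / rate / 86_400  # days until table_id exceeds 2**32
print(f"{rate:.0f} ids/s, 32-bit boundary after {days_to_wrap:.1f} days")
# prints: 5498 ids/s, 32-bit boundary after 9.0 days
```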
| Comment by MG [ 2019-01-16 ] |
|
@Andrei - This does seem to be a duplicate of
|
| Comment by Reinder Cuperus [ 2019-01-23 ] |
|
@Andrei - Can you confirm that the patch only needs to be applied on the slave? Because if I read the bug report in |
| Comment by Andrei Elkin [ 2019-01-23 ] |
|
reinder, yes, it's the slave side, as you can see from the patches committed to the parent bug. Although |
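A plausible sketch of the slave-side failure mode, based on my reading of the table_id discussion above (this is an illustration, not MariaDB's actual code; the function and variable names are hypothetical): the master assigns 64-bit table_ids, and if the slave stores or compares them through a 32-bit field, the id carried by a Rows event no longer matches the cached Table_map id once the counter passes 2**32, so row events are silently skipped.

```python
# Illustrative sketch (hypothetical names): a Rows event is applied only if
# its table_id matches a previously seen Table_map event. If the cached id
# was stored in a 32-bit field, ids above 2**32 never match again.
MASK32 = (1 << 32) - 1

def find_mapped_table(table_map_ids, rows_event_id):
    # Buggy comparison: the cached side was truncated to 32 bits on storage.
    return rows_event_id in {tid & MASK32 for tid in table_map_ids}

small_id = 42                 # below the boundary: lookup works
big_id = 2**32 + 17           # past the boundary: lookup silently fails
print(find_mapped_table([small_id], small_id))  # True
print(find_mapped_table([big_id], big_id))      # False: event skipped
```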
| Comment by Reinder Cuperus [ 2019-02-01 ] |
|
We didn't succeed in building an installable rpm package for our CentOS 7 machines. When you have a release-candidate build of the 10.3 series, I'd like to test it. In the meantime we have succeeded in eliminating a few causes of wildly increasing table_ids, and we see no more replication problems. |
| Comment by Andrei Elkin [ 2019-02-01 ] |
|
reinder, the bugfix is coming soon with new releases; see https://jira.mariadb.org/secure/Dashboard.jspa. Thank you for your practical recommendation: increasing the variable table_open_cache should prevent the bug from manifesting itself on this server (and help a little with performance). |
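The recommended mitigation can be applied and verified like this (a fragment; the value 16384 is an arbitrary example, to be tuned to the workload):

```sql
-- A fast-growing Opened_tables counter relative to table_open_cache means
-- cache churn, which also burns through table_ids quickly.
SHOW GLOBAL STATUS LIKE 'Opened_tables';
SHOW GLOBAL VARIABLES LIKE 'table_open_cache';

-- Raise the cache at runtime (example value); also set it in my.cnf so the
-- change survives a restart.
SET GLOBAL table_open_cache = 16384;
```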
| Comment by Andrei Elkin [ 2019-02-05 ] |
|
Closed as duplicate. |
| Comment by Andrei Elkin [ 2019-02-05 ] |
|
Fix cce2b45c8f5 is pushed. |