[MDEV-9670] server_id mysteriously set to 0 in binlog causes GTID & multisource replication to break Created: 2016-03-02 Updated: 2024-01-17 Resolved: 2017-07-02 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.0.22 |
| Fix Version/s: | 10.0.32 |
| Type: | Bug | Priority: | Major |
| Reporter: | Geoff Montee (Inactive) | Assignee: | Michael Widenius |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | gtid, multisource, replication | ||
| Issue Links: |
|
| Sprint: | 10.2.7-1 | ||||||||||||||||||||||||
| Description |
|
A user experienced some strange GTID out-of-order sequence number errors, which caused very strange symptoms, and fixing the problems required the server process to be killed. The user has the following replication topology, with 8 servers total in 2 data centers (4 nodes per DC). Node1 in each data center is a master, and the two masters replicate from each other using circular replication. The other three nodes in each data center are slaves with read-only enabled. The gtid_domain_id and server_id values are the following: DC1: DC2: Somehow, DC1-Node1 (server_id 7715) received an event from DC2-Node1 (server_id 7725) that had a server_id value of 0:
This caused replication to break:
Some things to note here:
The other events within the transaction do have the proper server_id set:
When I look at the same event in the binary log of DC2-Node1, it looks like that server has the correct server_id for the transaction:
The user was able to skip this transaction on DC1-Node1 by stopping the slave, setting sql_slave_skip_counter to 1, and then starting the slave. Shortly after DC1-Node1 ran into the out-of-order sequence GTID error, DC2-Node2 ran into an out-of-order sequence GTID error for a different transaction:
This transaction looks normal in DC1-Node1's binary log, and it only appears once, so it seems strange that it would trigger this error:
Not only did this break replication on DC2-Node1, but the error log also hinted at a more serious problem:
After this, we noticed the following issues:
The user is not using parallel replication. The user is using multi-source replication. |
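The check that produces these out-of-order errors can be restated as a small sketch. This is a simplified illustration, not MariaDB server code: a MariaDB GTID is a domain_id-server_id-seq_no triple, and within one replication domain sequence numbers are expected to keep increasing, so a repeated or stale seq_no (for example, after the GTID state is confused by a server_id 0 event) trips the check. The server_id values 7715 and 7725 below are taken from the report.

```python
from typing import NamedTuple

class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int

def parse_gtid(text: str) -> Gtid:
    # MariaDB GTID text form: "domain_id-server_id-seq_no", e.g. "1-7715-100"
    domain, server, seq = (int(part) for part in text.split("-"))
    return Gtid(domain, server, seq)

def out_of_order(gtids):
    """Return the GTIDs a slave would flag as out-of-order (simplified model)."""
    highest = {}  # domain_id -> highest seq_no seen so far
    rejected = []
    for gtid in (parse_gtid(g) for g in gtids):
        if gtid.seq_no <= highest.get(gtid.domain_id, 0):
            rejected.append(gtid)
        else:
            highest[gtid.domain_id] = gtid.seq_no
    return rejected
```

Running it over a healthy stream returns nothing; a repeated sequence number in the same domain is flagged.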
| Comments |
| Comment by Elena Stepanova [ 2016-03-02 ] | ||||||||||||||||||||||||||||||||||||||
|
support issue | ||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2016-03-02 ] | ||||||||||||||||||||||||||||||||||||||
|
Server id 0 is never a valid value, but without some indication of where it came from, it is hard to say more. It is mentioned that gtid_ignore_duplicates and multi-source replication are in use. | ||||||||||||||||||||||||||||||||||||||
| Comment by Geoff Montee (Inactive) [ 2016-03-02 ] | ||||||||||||||||||||||||||||||||||||||
|
Hi knielsen,
The binary logs are available for you to look at, if that would help. Information on how to access the logs was in the email that I sent to the engineering mailing list yesterday. If any other information would help, please let me know.
The user is not using multi-source replication in the sense that they have multiple replication sources configured. Node1 in each data center is only replicating from each other, so each server only has one master. When I said that they are using multi-source replication, I only meant that their replication was configured using the new named connection syntax added for multi-source replication in MariaDB 10.0. For example, you can see that the connections are named dc1master and dc2master in the SHOW SLAVE STATUS output above. I do not know why gtid_ignore_duplicates is enabled. It may have been enabled in hopes that it would somehow fix this strange problem, since the user is running into the problem fairly often. | ||||||||||||||||||||||||||||||||||||||
| Comment by Matt Neth [ 2017-05-25 ] | ||||||||||||||||||||||||||||||||||||||
|
I uploaded some error logs/core dump/binlogs to ftp.askmonty.org/private/ to try to help with this bug:

5-24-totalcrash.tar.gz - The MySQL process totally crashed, and when it restarted, all of the slave connections were stopped with the out-of-order error.

5-24-totalcrash-M3.tar.gz - Same issue as above, but this happened on a different server that is not using multisource replication; it just has one slave connection. There's a README in the tarball with more info as well. This one also has a core dump and binary in it. | ||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2017-06-21 ] | ||||||||||||||||||||||||||||||||||||||
|
I've looked through the latest logs. Alas only 5-24-totalcrash.tar.gz contains some interesting data. | ||||||||||||||||||||||||||||||||||||||
| Comment by Matt Neth [ 2017-06-21 ] | ||||||||||||||||||||||||||||||||||||||
|
I uploaded MDEV9670-m4-errorlog-large to the FTP, which is the entire error output from the day that 5-24-totalcrash.tar.gz happened; let me know if that helps. | ||||||||||||||||||||||||||||||||||||||
| Comment by Matt Neth [ 2017-06-21 ] | ||||||||||||||||||||||||||||||||||||||
|
I also uploaded MDEV9952-6-8-outoforder-M4.tar.gz and MDEV9952-6-8-outoforder-M2.tar.gz a few days ago for Monty. The out-of-order error happened on both M2 and M4 at the same time, and the M4 error log output showed the server id 0 bug as well. Both of those have binlogs/relay logs/error logs, so hopefully they can provide some good info. | ||||||||||||||||||||||||||||||||||||||
| Comment by Kristian Nielsen [ 2017-06-21 ] | ||||||||||||||||||||||||||||||||||||||
|
If it is about partial logging of a group (the last part?) and server_id=0, I would suggest looking at MDEV-10863. That bug can cause replication to start in the middle of an event group. Not sure if this could cause server_id=0, though; hard to say... | ||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2017-06-21 ] | ||||||||||||||||||||||||||||||||||||||
|
Kristian, hello. Thanks for the pointer, but if the referred ticket (MDEV-10863, "parallel replication tries to continue from wrong position") is entirely about parallel replication, then it can't relate. To share more with you: the server_id 0 group is completed with a COMMIT Query_log_event, not an Xid_log_event. And in one of the two such cases, the zero-server_id group was followed by the correct server_id one, so the total number of output events (as seen in the binlog) is equal to the number of events in the input single group (as seen in the relay log). Andrei | ||||||||||||||||||||||||||||||||||||||
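Andrei's observation can be restated as a small check over a reduced event model. This is a hypothetical sketch, not server code: events here are just (type, server_id) pairs, event groups open with a GTID event, InnoDB transactions normally close with an Xid event, and the anomalous groups show a server_id 0 header and a COMMIT Query terminator instead.

```python
def find_suspect_groups(events):
    """Scan a simplified stream of (event_type, server_id) pairs and return
    groups matching the anomaly described above: a GTID header carrying
    server_id 0, or a group closed by a COMMIT Query event rather than the
    Xid event that normally ends an InnoDB transaction."""
    groups = []
    current = None
    for event_type, server_id in events:
        if event_type == "Gtid":
            # A GTID event opens a new event group.
            current = {"server_id": server_id, "terminator": None}
        elif event_type in ("Xid", "Query(COMMIT)") and current is not None:
            # Either terminator closes the current group.
            current["terminator"] = event_type
            groups.append(current)
            current = None
    return [g for g in groups
            if g["server_id"] == 0 or g["terminator"] == "Query(COMMIT)"]
```

Fed a normal Xid-terminated group followed by a server_id 0 group ending in COMMIT, only the latter is flagged.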
| Comment by Kristian Nielsen [ 2017-06-21 ] | ||||||||||||||||||||||||||||||||||||||
|
Hm, I see. So it sounds like something in the code is somehow flushing the binlog cache to the binlog in the middle of the event group, in a completely incorrect way that does not create the proper GTID event, nor the correct commit event. Can't think of what it could be, unfortunately... | ||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2017-06-22 ] | ||||||||||||||||||||||||||||||||||||||
|
Yes, this premature flushing cause has been my main suspect. The sources are quite explicit about where server_id is set to zero, and in our case I would point to just two places in next_event() which generate fake Rotate or Gtid_list events (this is crucial, as it explains the absence of any server_id=0 events in the relay log). As to the dilemma I left in my previous comment above,
which eventually got rolled back by Innodb/server recovery:
Notice that 1831 rows to undo is somewhat close to the number of missed row events (to gain the same total as the input). | ||||||||||||||||||||||||||||||||||||||
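Since the fake events never appear in the relay log but do show up in the binlog, one practical way to hunt for them in a dump: `mysqlbinlog` prints each event header with a `server id N  end_log_pos M` field, so a small filter (an illustrative helper, not part of MariaDB tooling) can flag any header carrying server id 0.

```python
import re

# Event headers in `mysqlbinlog` text output look like:
#   #170524 12:34:56 server id 0  end_log_pos 1234 ...
HEADER = re.compile(r"server id (\d+)\s+end_log_pos (\d+)")

def server_id_zero_events(dump_text):
    """Return (end_log_pos, line) for every event header with server id 0."""
    hits = []
    for line in dump_text.splitlines():
        match = HEADER.search(line)
        if match and int(match.group(1)) == 0:
            hits.append((int(match.group(2)), line.strip()))
    return hits
```

Piping `mysqlbinlog mysql-bin.00NNNN` output through this quickly locates the anomalous positions to inspect by hand.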
| Comment by Andrei Elkin [ 2017-06-22 ] | ||||||||||||||||||||||||||||||||||||||
|
Matt, thanks for uploading MDEV9670-m4-errorlog-large, a more informative piece of the log. It was already helpful. Yet do you have
? The newer file, alas, starts from this line, so we can't possibly see what might have caused an apparent premature commit of the 1st "part" of this split event group. Ideally it would be great to see all reported incidents in the error log since the time the slave service was started, and for all connections of the multi-source slave server, as we can't yet exclude interference from different sources. | ||||||||||||||||||||||||||||||||||||||
| Comment by Matt Neth [ 2017-06-22 ] | ||||||||||||||||||||||||||||||||||||||
|
That's all that was in the log from around that time. That was the first entry in the log for that day and for that crash. I went ahead and uploaded the full current errorlog to the FTP as MDEV9670-full-errorlog-M4. | ||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2017-06-23 ] | ||||||||||||||||||||||||||||||||||||||
|
There is some promising news. I've discovered one way to split a group into two in the way we needed. I can't yet confirm whether this is actually possible in any real execution, but the slave binlog shows it. The input:
and the output in the slave binlog:
Now I am looking into how to create a real execution. | ||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2017-07-02 ] | ||||||||||||||||||||||||||||||||||||||
|
Problem was that in a circular replication setup, a fake Gtid_list_log_event could be generated in the middle of an event group, splitting the group and producing the server_id 0 events seen above. The fix was to not generate any Gtid_list_log_events in the middle of a transaction. | ||||||||||||||||||||||||||||||||||||||
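The shape of the fix can be sketched with a toy model (hypothetical names, not the actual server source): injecting a fake Gtid_list event is only permitted between event groups, never while one is open.

```python
class BinlogStream:
    """Toy model of the guard behind the fix: a fake Gtid_list event may
    only be injected between event groups, never inside one. Class and
    method names are invented for illustration."""

    def __init__(self):
        self.in_event_group = False
        self.events = []

    def feed(self, event_type):
        # A GTID event opens a group; Xid or COMMIT closes it.
        if event_type == "Gtid":
            self.in_event_group = True
        self.events.append(event_type)
        if event_type in ("Xid", "Query(COMMIT)"):
            self.in_event_group = False

    def try_inject_gtid_list(self):
        # The fix: refuse to split an open event group with a fake event.
        if self.in_event_group:
            return False
        self.events.append("Gtid_list")
        return True
```

With the guard in place, the injection attempted mid-group is refused and succeeds only once the group has been closed.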
| Comment by Michael Widenius [ 2017-07-02 ] | ||||||||||||||||||||||||||||||||||||||
|
Fix pushed. | ||||||||||||||||||||||||||||||||||||||
| Comment by Sachin Setiya (Inactive) [ 2018-07-17 ] | ||||||||||||||||||||||||||||||||||||||
|
The test case fails if run with --repeat. It fails with this error:
| ||||||||||||||||||||||||||||||||||||||
| Comment by Sachin Setiya (Inactive) [ 2018-07-17 ] | ||||||||||||||||||||||||||||||||||||||
|
--debug-sync-timeout=5 can be passed as an mtr argument to make the failure appear earlier.
|