[MDEV-9952] Strange out-of-order sequence errors on slave, unreleased table locks and possibly SQL thread position corruption Created: 2016-04-19 Updated: 2018-07-14 Resolved: 2018-07-14 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.0.22 |
| Fix Version/s: | 10.0.32 |
| Type: | Bug | Priority: | Major |
| Reporter: | Geoff Montee (Inactive) | Assignee: | Andrei Elkin |
| Resolution: | Fixed | Votes: | 4 |
| Labels: | gtid, replication | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
The same user who experienced an earlier, related issue hit this problem again. The latest error showed a series of out-of-order sequence number errors in a short period of time. Each time, the user attempted to skip the problem, and shortly after each attempt another out-of-order sequence number error occurred. For example:
I've noticed some strange things about the latest failure at 10:45:21: 1.) When the slave stops due to an out-of-order sequence number with existing GTID 35-7735-937161, the slave is stopped at master_log_name='mariadb-bin.007022' master_log_pos='97244987', GTID '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'
The strange thing is that GTID '35-7735-937161' is not present in the binary log until much later: master_log_name='mariadb-bin.007022' master_log_pos='97246276'
If the slave supposedly failed executing an event at master_log_name='mariadb-bin.007022' master_log_pos='97246276', why would it stop all the way back at master_log_name='mariadb-bin.007022' master_log_pos='97244987'?

2.) When the slave stops due to an out-of-order sequence number with existing GTID 35-7735-937161, the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937160,11-7511-6574751,17-7725-102939943,15-7715-29236025'. Notice that the sequence number for gtid_domain_id 35 is 937160, which is less than the supposedly out-of-order sequence number 937161. If the highest sequence number for the domain was only 937160 at the time, why would a sequence number of 937161 cause an out-of-order error?

These items made me wonder whether these strange observations also hold for the first failure, which happened at 10:33, so I backtracked in the binlogs to look into it. 1.) When the slave stops due to an out-of-order sequence number with existing GTID '35-7735-937125', it stops at position master_log_name='mariadb-bin.007022', master_log_pos='95365805'
This position does actually seem to be correct, since the GTID '35-7735-937125' is at that position:
2.) When the slave stops due to out-of-order sequence number with existing GTID '35-7735-937125', the GTID position when it stops is '21-7521-7596336,0-7725-126,35-7735-937124,11-7511-6574751,17-7725-102939928,15-7715-29236017'. Notice here that the sequence number for gtid_domain_id 35 is only 937124, so this out-of-order sequence error also seems to be unusual. Additionally, I can see that the transaction with the sequence number 937124 is directly preceding the failing one in the binlog:
Let's see if observation #1 above holds for the other two failures at 10:44:43 and 10:44:57. The failure at 10:44:43 involved GTID '35-7735-937137', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='95860806'. This position also seems to be correct, since the GTID '35-7735-937137' is at the position:
The failure at 10:44:57 involved GTID '35-7735-937151', and the slave thread stopped at master_log_name='mariadb-bin.007022', master_log_pos='96808225'. This GTID doesn't appear until position 96809133, a bit later than position 96808225.
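Discrepancies like these can be cross-checked by asking the server for the GTID state at a given binlog offset and listing the events around it. A sketch, reusing the file name and stop offset from the 10:45:21 failure above (the offset passed to SHOW BINLOG EVENTS must land on a valid event boundary):

```sql
-- Which GTID position corresponds to this binlog offset?
SELECT BINLOG_GTID_POS('mariadb-bin.007022', 97244987);

-- List events starting at the reported stop position, to see whether
-- the GTID named in the error really sits there or only appears later.
SHOW BINLOG EVENTS IN 'mariadb-bin.007022' FROM 97244987 LIMIT 20;
```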
These observations make me wonder if the SQL thread's position was corrupted after the 10:44:57 failure. I know of at least one open bug that could be related. The user does not have parallel replication enabled. Some other observations:
|
| Comments |
| Comment by Geoff Montee (Inactive) [ 2016-04-19 ] | |||||||||||||
|
support issue | |||||||||||||
| Comment by Kristian Nielsen [ 2016-04-19 ] | |||||||||||||
|
> If the highest sequence number for the domain was only 937160 at the time, why would a sequence number of 937161 cause an out-of-order sequence number?

There seems to be some confusion here. The out-of-order error is not about the slave's current GTID position. If gtid_strict_mode is set, a commit on the server is first checked against what is already in the server's own binary log. So the error occurs when trying to replicate a GTID whose sequence number is not higher than one already in the slave's own binlog for that domain. It is not a bug in itself to get this out-of-order error, it is a feature to catch setups that would otherwise silently break the GTID ordering. | |||||||||||||
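The check described here can be demonstrated in isolation on a throwaway server with the binlog enabled (a sketch; the table, domain id, and sequence numbers are made up for illustration):

```sql
SET GLOBAL gtid_strict_mode = 1;
SET SESSION gtid_domain_id = 35;
CREATE TABLE t1 (a INT) ENGINE=InnoDB;

-- First transaction binlogs GTID 35-<server_id>-937161.
SET SESSION gtid_seq_no = 937161;
INSERT INTO t1 VALUES (1);

-- Forcing the same sequence number again fails with
-- ER_GTID_STRICT_OUT_OF_ORDER (1950), because 937161 is already
-- in this server's own binlog for domain 35.
SET SESSION gtid_seq_no = 937161;
INSERT INTO t1 VALUES (2);
```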
| Comment by Geoff Montee (Inactive) [ 2016-04-19 ] | |||||||||||||
I understand that. However, data from this particular gtid_domain_id (35) is only coming from data replicated from dc2master. The only way that a GTID from this domain should end up in the slave's binlog is if it replicates it from dc2master's binlog.
Thanks, but I understand what the feature is for. It just does not seem to be functioning as expected in this case.
Well, I'm not entirely sure that the GTIDs are in the slave's binlog, because as I mentioned, they should only be there if they came from dc2master's binlog. However, I will ask for the binlogs from this slave, so I can check. Any thoughts on why the GTIDs in the out-of-order sequence errors starting at 10:44:57 aren't even at the position in the binlog that the SQL thread supposedly encounters the error at? Or why the whole system sometimes hangs when this happens, and the only way to un-hang it is to kill mysqld? | |||||||||||||
| Comment by Kristian Nielsen [ 2016-04-19 ] | |||||||||||||
|
> Any thoughts on why the GTIDs in the out-of-order sequence errors starting at 10:44:57 aren't even at the position in the binlog that the SQL thread supposedly encounters the error at?

It's very hard to say from the information given. Try pasting the relevant parts of the binlog and the exact error messages.

> Or why the whole system sometimes hangs when this happens, and the only way to un-hang it is to kill mysqld?

Server hang should always be a bug (and there seems to be complaints about hangs elsewhere as well).

Hm, one thought. SQL_SLAVE_SKIP_COUNTER is used; I wonder if the slave IO thread was running at the time it was set. So this might be related. Try to check if the IO thread was running when SQL_SLAVE_SKIP_COUNTER was set. | |||||||||||||
| Comment by Geoff Montee (Inactive) [ 2016-04-21 ] | |||||||||||||
|
There are 3 masters in this topology: dc2-master2 <===> dc2-master <===> dc1-master
I've looked through the binlogs from all 3 masters, and I've noticed the following pattern each time that this error has occurred: 1.) When dc1-master encounters the out-of-order sequence number error about a particular GTID, the transaction with that GTID in dc1-master's binlog actually corresponds to a subset of the preceding transaction from dc2-master and dc2-master2. For example, when dc1-master got this error:
The transaction with GTID 35-7735-937125 in dc1-master's binlog actually corresponds to a subset of the transaction with GTID 35-7735-937124 from dc2-master's and dc2-master2's binlogs. 2.) When dc1-master replicates a transaction with the wrong GTID, as explained above, the original GTID of the transaction is not present in the log at all. For example, when dc1-master mistakenly replicated transaction with GTID 35-7735-937124 as GTID 35-7735-937125, the transaction's original GTID (35-7735-937124) is not present anywhere in dc1-master's log. | |||||||||||||
| Comment by Kristian Nielsen [ 2016-04-21 ] | |||||||||||||
|
This sounds a lot like the problem with setting sql_slave_skip_counter while the slave threads are running. The server really isn't supposed to allow this; it can cause all sorts of problems, possibly including corrupting internal server state. I need to get hold of Monty and find out what the intention was with the patch that introduced this. As I said, you can try to see if it helps to ensure manually that the slave threads are stopped before setting sql_slave_skip_counter. I wonder if this also causes the strange server_id=0 in another bug report? | |||||||||||||
| Comment by Geoff Montee (Inactive) [ 2016-04-21 ] | |||||||||||||
I do see in the logs that the I/O thread was started about 2 seconds prior to setting sql_slave_skip_counter:
I will let these users know that they should only set sql_slave_skip_counter when both replication threads are stopped.
If it can cause problems like this, then I agree that it sounds quite dangerous. Sometimes this problem causes this user's system to lock up so badly that they have to kill mysqld.
I submitted that bug report on behalf of the same users, so it's probably pretty likely that that problem has the same root cause as this one. Thanks for the assistance! | |||||||||||||
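The safe procedure being recommended here would look something like this (a sketch; with multi-source replication the connection would first be selected via default_master_connection):

```sql
STOP SLAVE;                              -- stop both the IO and SQL threads
SET GLOBAL sql_slave_skip_counter = 1;   -- skip exactly one event group
START SLAVE;
SHOW SLAVE STATUS\G                      -- confirm both threads are running again
```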
| Comment by Geoff Montee (Inactive) [ 2016-04-22 ] | |||||||||||||
|
These users say that when they use sql_slave_skip_counter, they always execute it in the following way:
So it seems that the I/O thread should have been stopped when they encountered this problem. I believe the 2 second difference in timestamps in the above output might just be due to delays in the threads starting up. Can you think of anything else that might lead to problems like this? | |||||||||||||
| Comment by Kristian Nielsen [ 2016-04-25 ] | |||||||||||||
|
It still seems the most likely explanation that SQL_SLAVE_SKIP_COUNTER was set while one of the slave threads was running. Note that the topology given requires special actions on the part of the user to keep the GTID state consistent. | |||||||||||||
| Comment by Kristian Nielsen [ 2016-04-25 ] | |||||||||||||
|
Hm, actually SQL_SLAVE_SKIP_COUNTER only requires that the SQL thread be stopped. I was thinking of other variables, like gtid_slave_pos, slave_parallel_threads, slave_domain_parallel_threads, and gtid_ignore_duplicates. Those variables have the bug that the server allows changing them with the IO thread running, even though this can corrupt the server state in some cases. | |||||||||||||
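To be safe with any of the variables listed here, both threads of every connection can be stopped first (a sketch; the gtid_slave_pos value is hypothetical):

```sql
STOP ALL SLAVES;                               -- stops IO and SQL threads on every connection
SET GLOBAL gtid_slave_pos = '35-7735-937160';  -- hypothetical position for illustration
SET GLOBAL gtid_ignore_duplicates = 1;
START ALL SLAVES;
```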
| Comment by Geoff Montee (Inactive) [ 2016-04-25 ] | |||||||||||||
dc2-master2 is actually a relatively new addition to the topology. The topology used to just have two masters that replicated like this: dc2-master <===> dc1-master. They were still getting similar issues with that topology.
I don't see any mention of this requirement in the documentation for any of these variables, except for slave_parallel_threads. Should a note be added about that until the bug you're referring to has been fixed? https://mariadb.com/kb/en/mariadb/gtid/#gtid_slave_pos https://mariadb.com/kb/en/mariadb/gtid/#gtid_ignore_duplicates In this particular case, the users said that they haven't dynamically changed any of these variables. | |||||||||||||
| Comment by Michaël de groot [ 2016-05-09 ] | |||||||||||||
|
Some more information: the user does not experience the other reported problem. Both of the issues occurred while doing bulk data loading. I'm investigating now to find similarities and will keep you up to date. Perhaps you have some ideas on where to look and what information to gather? | |||||||||||||
| Comment by Kristian Nielsen [ 2016-05-10 ] | |||||||||||||
|
Maybe some corner cases in error handling, or some replication filtering? gtid-ignore-duplicates is a very complicated setup, so a lot of care will be required on the part of the user to make sure everything is right. But stuff like getting server_id=0 or partial transactions really shouldn't happen, that really needs to be fixed... | |||||||||||||
| Comment by Michaël de groot [ 2016-05-10 ] | |||||||||||||
|
There are no replication filters turned on. What do you think about this part in the error log: Does this mean the slave sql thread has tables locked when it is stopped due to gtid-safe-mode? How can that happen? I would assume safe mode checks are enforced before it locks any tables. I think these table locks might be causing another issue, leaving the server unusable until a hard restart. | |||||||||||||
| Comment by Kristian Nielsen [ 2016-05-10 ] | |||||||||||||
|
It sounds like something is wrong internally in the server. Just like server_id=0 in GTIDs and partial transactions in the binlog does... | |||||||||||||
| Comment by Michaël de groot [ 2016-05-10 ] | |||||||||||||
|
I agree. That doesn't bring us closer to fixing the issue though. The masters are a bit out of sync. I know there was at least 1 user error, investigating if this issue causes more inconsistencies. Can this change the situation in any way? | |||||||||||||
| Comment by Michaël de groot [ 2016-05-10 ] | |||||||||||||
|
knielsen: I was wrong, there ARE replication filters turned on:
I have general query log of 1 of the occurrences of the issue and it contained many times, around the moment of the issue:
Do you think the issue might have to do with replication filters? If so, can you elaborate? | |||||||||||||
| Comment by Kristian Nielsen [ 2016-05-11 ] | |||||||||||||
|
To repeat myself:

> But without a way to reproduce, only wild guesses are possible.

For this particular wild guess: filtering is one way that complexity and corner cases can appear, and a bug might lead to incorrect filtering of parts of transactions. Filtering, multi-source, multi-path gtid_ignore_duplicates, sql_slave_skip_counter, ... this is a horribly complex setup; it really requires having a test system where problems can be reproduced, analysed, and reduced to a test case. | |||||||||||||
| Comment by Michaël de groot [ 2016-05-11 ] | |||||||||||||
|
The sql_slave_skip_counter was just set to work around this bug; it is not a part of the normal setup. We have a test system, and I can set it up with the same topology. | |||||||||||||
| Comment by Michaël de groot [ 2016-05-11 ] | |||||||||||||
|
I have just found that this issue often results in another issue where some tables get locked indefinitely. The only fix for this is to restart the server. Would a core dump of this be of any help? | |||||||||||||
| Comment by Matt Neth [ 2017-04-27 ] | |||||||||||||
|
I'm having this happen as well, and I'm not using replication filters anywhere, so if I can provide any helpful information, let me know. I've also seen the server ID 0 event thing happen as well! Our three DBs are 10.0.30 using multi-source replication where each one replicates to and from the others, and we have gtid_strict_mode=1 and gtid_ignore_duplicates=1. One DB is a physical server here at our office datacenter, one is an Amazon EC2 instance, and one is a Linode. The issue seems to show up as a slave thread being stuck waiting for a table lock, as well as direct inserts to the master being stuck. Doing a 'show engine innodb status' during this time shows no deadlocks, and there appears to be nothing relevant in the slow query log. | |||||||||||||
| Comment by Matt Neth [ 2017-05-08 ] | |||||||||||||
|
This issue seems to manifest in three different ways for us. I've never seen a DEADLOCK section during any of these errors when I check SHOW ENGINE INNODB STATUS.

For #1, I'm still waiting to get a core dump from right after the out-of-order error shows up. This tends to appear in the error log when the out-of-order happens (but not every time):

For #2, I uploaded a core dump to ftp.askmonty.org/private/ called MDEV9952.slave-table-lock-core.tar.gz.

For #3, I'm still waiting to get a core dump of this as well. I did get an strace when this happened this morning, and the output for the specific thread ID of the stuck query was this:

Our mysql config is as follows, with each server having a different server_id, domain_id, and auto_increment_offset: | |||||||||||||
| Comment by Matt Neth [ 2017-05-16 ] | |||||||||||||
|
I uploaded MDEV9952.slave-table-lock-mysqld.tar.gz to ftp.askmonty.org/private/ which is the mysqld binary for the core dump file I uploaded before: MDEV9952.slave-table-lock-core.tar.gz | |||||||||||||
| Comment by Michaël de groot [ 2017-05-16 ] | |||||||||||||
|
The customer on whose behalf this issue was added reported that the issue disappeared after they stopped writing to multiple nodes. Maybe that helps you, mneth1, to work around it, and knielsen to reproduce/fix it. | |||||||||||||
| Comment by Kristian Nielsen [ 2017-05-16 ] | |||||||||||||
|
This looks like the relevant part of the attached stacktrace:
So the SQL thread is stuck on a table lock, and in turn STOP SLAVE is stuck waiting for the SQL thread to stop. So if there is no genuine table lock held, then I wonder if this is an MDL (metadata lock) issue instead. It still doesn't really seem to help to understand what causes the problem, unfortunately. | |||||||||||||
| Comment by Matt Neth [ 2017-05-16 ] | |||||||||||||
|
Hm yeah, when these issues occur I've never seen a deadlock listed in SHOW ENGINE INNODB STATUS or in information_schema.innodb_locks/innodb_lock_waits. | |||||||||||||
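For reference, row-lock waits (as opposed to deadlocks) can be inspected on 10.0 with the InnoDB information_schema tables. A sketch; if the hang is on a metadata lock rather than an InnoDB row lock, this will return nothing:

```sql
-- Map blocked transactions to the transactions holding the locks.
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;
```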
| Comment by Michaël de groot [ 2017-05-16 ] | |||||||||||||
|
mneth1, if you can reproduce this, could you try to enable the metadata lock info plugin and see if anything shows up there? Link: https://mariadb.com/kb/en/mariadb/metadata_lock_info/ | |||||||||||||
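Enabling and querying the plugin is a two-step affair (a sketch):

```sql
INSTALL SONAME 'metadata_lock_info';   -- one-time, needs a privileged account

-- While the slave thread (or a direct query) is stuck:
SELECT * FROM information_schema.METADATA_LOCK_INFO;
```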
| Comment by Matt Neth [ 2017-05-17 ] | |||||||||||||
|
The mysql process on one of our masters died by itself today, so I wasn't able to see the contents of metadata_lock_info. However, after mysqld restarted following the crash, the slave threads were stopped with the out-of-order error AND the transaction it stopped on had the server ID 0 bug mentioned earlier.

The new core was uploaded to the FTP server with the filename: MDEV9952-core-crash.tar.gz

I attached the error log from the time of the crash to this bug as 5-17-crash-errorlog, and the full backtrace as 5-17-crash-gdb-bt | |||||||||||||
| Comment by Matt Neth [ 2017-05-18 ] | |||||||||||||
|
@michaeldg Here's the output from the metadata lock table when we had some stuff stuck at waiting for table lock. Normally when this happens it's just the slave thread stuck, this time there were direct queries stuck in addition to the slave thread being stuck: THREAD_ID,LOCK_MODE,LOCK_DURATION,LOCK_TYPE,TABLE_SCHEMA,TABLE_NAME | |||||||||||||
| Comment by Kristian Nielsen [ 2017-05-18 ] | |||||||||||||
|
What does "Global read lock" in there mean? | |||||||||||||
| Comment by Matt Neth [ 2017-05-18 ] | |||||||||||||
|
I'm not sure how to find what would be holding the FTWRL (FLUSH TABLES WITH READ LOCK); none of our stuff is doing that intentionally, as far as I can find. | |||||||||||||
| Comment by Michael Widenius [ 2017-05-29 ] | |||||||||||||
|
I have looked at the latest uploaded logs in detail. Here are the questions I have asked on IRC.

Error from the updated logs:

170523 8:00:04 [ERROR] Master 'aws-woprm5': Slave SQL: An attempt was made to binlog GTID 2-2-58631956 which would create an out-of-order sequence number with existing GTID 2-2-58631956, and gtid strict mode is enabled. Gtid 2-2-58631956, Internal MariaDB error code: 1950

The strange thing is that the slave does not appear to be running strict mode (as far as I can tell). How can the stop position be less? This is a ring setup, so I assume the slave can get the gtid twice, but there is no indication of this in the logs.

knielsen> montywi: The slave has position 2-2-...955, it tries to replicate 2-2-...956, but that fails because slave binlog also has 2-2-...956. So replication fails and rolls back, and stop position is one step before, 2-2-...955. Seems ok.

How can the slave have ...955 if the binlog has ...956? With binlog, do you mean that the slave has written ...956 to its binlog?

I am looking at the master log, and strangely enough it is all in the same log, so there can't be any master crash in between. I would have assumed that the scenario would be like the following:
Looking at the logs, I found something very strange: on the master, you have GTIDs 2-2-58631954, then ...955, and then ...957. It looks (hard to verify) as if the slave has applied ...955 as ...956, while ...956 is missing on the master entirely. I tried to check whether master event ...955 is the same as slave event ...956.

I wonder under which circumstances a GTID would be missing on the master. I assume that on the slave things can be different after the crash. However, I can't understand how the master can have one number missing, and I also don't understand how the slave could originally complain about a duplicate.

Could some of that be explained by the user having restarted replication manually? | |||||||||||||
| Comment by Matt Neth [ 2017-05-29 ] | |||||||||||||
|
A couple of bits of info that may help: gtid_strict_mode is definitely enabled on all of them. The way we fix the out-of-order issue that is stopping replication is by temporarily disabling gtid_strict_mode, NOT by using sql_slave_skip_counter. | |||||||||||||
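That workaround presumably amounts to something like the following (a sketch, not quoted from the report; note that unlike skipping the event, this accepts the out-of-order GTID into the slave's own binlog):

```sql
SET GLOBAL gtid_strict_mode = 0;
START SLAVE;
-- ...wait for the offending transaction to replicate...
SET GLOBAL gtid_strict_mode = 1;
```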
| Comment by Elena Stepanova [ 2017-06-18 ] | |||||||||||||
|
Since all cases of the out-of-order problem seem to involve M2 (server ID 2) in one way or another, it might be important to have the cnf file from it. We've got the M3 and M4 configs; are any of them identical to M2's, apart from server/domain IDs? | |||||||||||||
| Comment by Matt Neth [ 2017-06-19 ] | |||||||||||||
|
M2's is identical to M3's except for server/domain ID and report-host. The out-of-order error has happened on M3 and M4 as well, but the last few times have been on M2. | |||||||||||||
| Comment by Ulrich Moser (Inactive) [ 2017-11-30 ] | |||||||||||||
|
Hi, since the server ids are different I would not expect an out-of-order error here. | |||||||||||||
| Comment by Kristian Nielsen [ 2017-11-30 ] | |||||||||||||
|
> Since server ids are different I would not expect an out-of-order error here.

No, different server ids do not mean that out-of-order is allowed. On the contrary, out-of-order sequence numbers between different server ids within the same domain are exactly what the check rejects. To allow out-of-order (as in a master-master ring), you need different gtid_domain_ids. See the docs for more. | |||||||||||||
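Applying that advice to a two-master ring would look roughly like this (a sketch; the domain ids are illustrative, and in practice they would be set in each server's my.cnf):

```sql
-- On master A (accepts its own writes):
SET GLOBAL gtid_domain_id = 1;

-- On master B (also accepts its own writes):
SET GLOBAL gtid_domain_id = 2;
```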
| Comment by Ulrich Moser (Inactive) [ 2017-11-30 ] | |||||||||||||
|
Thanks Kristian. | |||||||||||||
| Comment by Mannoj Kumar Saravanan [ 2018-04-19 ] | |||||||||||||
|
Hi Team,

A & B are a Galera cluster with domain_id=4100. A and B are masters of E, using the multi-source replication model, with wsrep-gtid-mode=on. This is working with A/B replicating to D & E, but not in the other direction: when I write to D & E it doesn't replicate to A & B. Checking B's slave status, I get the error below.

Last_SQL_Error: An attempt was made to binlog GTID 0-195981401-8 which would create an out-of-order sequence number with existing GTID 0-195981401-8, and gtid strict mode is enabled.

I have a different domain_id for each cluster, as mentioned by Kristian Nielsen in https://jira.mariadb.org/browse/MDEV-9952. Please let me know if I'm missing something. | |||||||||||||
| Comment by Andrei Elkin [ 2018-07-11 ] | |||||||||||||
|
The fix referenced above covers this scenario. Considering all that, as well as the fact that the report was made against a version predating the 9670 fixes, the case should be closed as a duplicate. | |||||||||||||
| Comment by Andrei Elkin [ 2018-07-13 ] | |||||||||||||
|
Reverting it back to active to verify a scenario in which the IO thread reconnects to the master. | |||||||||||||
| Comment by Andrei Elkin [ 2018-07-14 ] | |||||||||||||
|
Verified a possibility to reconnect to the master of a circular setup while one of the GTIDs has been only partially replicated. |