[MDEV-17803] Row-based event is not applied when table map id is greater 32 bit int Created: 2018-11-22 Updated: 2019-05-29 Resolved: 2019-05-29 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.1.26 |
| Fix Version/s: | 10.4.2, 10.1.38, 10.0.38, 10.2.22, 10.3.13 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maik Pätzold | Assignee: | Andrei Elkin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | replication | ||
| Environment: |
2x XEN PV Guest with Debian 9 with a Master/Slave Replication Setup
DB01 Guest Hosted on:
DB02 Guest Hosted on:
Each Guest with: |
||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
Because of privacy reasons I can't share as much data as I'd like to, but hopefully you can help anyway. In February 2018 we migrated a customer from a MySQL 5.5 Master/Master to a MariaDB 10.1 Master/Master setup:

1. Stop replication on one of the MySQL masters, create a dump via mysqldump and note the replication position

From that point on we had a MariaDB Master/Master "Cluster" setup working.

At 2018-10-23 10:55:59 replication stopped on DB01 because of a foreign key constraint violation. At and before this point in time the cluster IP was pointing to DB02, so DB02 can be treated as MASTER and DB01 as SLAVE.

For a better understanding we can assume the following tables: Order. All with auto_increment ids as primary key, and the following foreign keys (with NULL allowed and as default): Order.address_id -> Address.id

The constraint violation happened for Order_item.order_id, so the corresponding Order did not exist. We had a look at DB02: as expected, the Order and everything related to it exists on the "Master", and, as likewise expected given the issue on DB01, none of it exists there. So we started inspecting the corresponding binlog events on DB02:

Replication on DB01 stopped with exec master position at 68765489, complaining about the insert at 68767005...

Somehow I did not copy up to the Order_item insert, but it was right behind 2467 and the failed one.

So we tried to insert the missing Order manually and continue replication... but this only worked until the next Order/Order_item insert, so we had a persistent problem.

1. Stop DB02, make a snapshot and restart DB02

From this point we had one master and two slaves. The slaves caught up and everything was working...

At 2018-11-14 17:42:33 replication stopped on both slaves, DB01 and DB03, at the exact same position and with a statement similar to the first problem (missing Order for INSERT INTO Order_item).

So from our understanding the entire problem is somehow related to a row-based replication event which is somehow skipped or silently dropped or something like that? Do you have any ideas what could cause a problem like this? |
| Comments |
| Comment by Maik Pätzold [ 2018-12-18 ] | ||
|
Update: after the last replication outage (2018-11-14) we added more slaves as described above. So we had the following machines:
Additionally, we changed the binlog format to ROW instead of MIXED. At 2018-12-11 we had another outage, and this time it's even worse... :
any ideas? | ||
| Comment by Andrei Elkin [ 2018-12-20 ] | ||
|
Thanks for reporting, Maik! It's difficult to grasp your setup. If I got it right (as you say 'master-master' and 'cluster'), the two servers replicated crisscross, DB01 <-> DB02? If it's really a crisscross setup, I naturally wonder how it's guaranteed that local data modifications do not conflict with replication. It would also be helpful to provide
Good luck, Andrei | ||
| Comment by Maik Pätzold [ 2018-12-20 ] | ||
|
Hey Andrei, almost right... But after the first replication problem we moved to Master/Slave (by taking a snapshot of the last master (DB02) and configuring it as a slave...). So since 23rd October we no longer have a "crisscross" setup... | ||
| Comment by Maik Pätzold [ 2018-12-20 ] | ||
|
I've attached db03.txt, which contains the requested outputs from one of the MariaDB slaves (DB03) which is no longer applying changes, as well as db02.my.conf (master my.cnf) and db03.my.cnf (slave my.cnf). | ||
| Comment by Maik Pätzold [ 2018-12-20 ] | ||
|
Do you have any idea what the difference is between MariaDB 10.1 and MySQL 5.6 regarding replication that could cause such behaviour? | ||
| Comment by Andrei Elkin [ 2018-12-20 ] | ||
|
> 3. Configure each server with a different auto_increment_offset (1 and 2) and set auto_increment_increment to 2.

Sure, in such a case auto-increment INSERTs are okay, but there are UPDATEs according to your snippet. However, as you clarified, the missed-record error has now occurred in the master-slave setup. A part of your comments starts with

> The Replication on DB01 stopped with exec master position at 68765489

to say the error must have happened at applying

but I could not get the meaning of the 2nd part of the comments:

> and complaining the insert at 68767005

which refers to a following event. I need your explanation. I could imagine that the error actually happened at the 2nd INSERT of 68767005, and you found in your analysis that it's the 1st at 68765489 that did not do its job. But I can't read it this way. | ||
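The offset/increment scheme quoted in step 3 above is a standard way to keep auto-increment keys disjoint in a master/master pair. A minimal my.cnf sketch of the idea (section placement assumed; only the two offset values come from the comment above):

```ini
# DB01 my.cnf -- hypothetical excerpt
[mysqld]
auto_increment_offset    = 1
auto_increment_increment = 2   # DB01 generates ids 1, 3, 5, ...

# DB02 my.cnf -- hypothetical excerpt
[mysqld]
auto_increment_offset    = 2
auto_increment_increment = 2   # DB02 generates ids 2, 4, 6, ...
```

As noted above, this only de-conflicts INSERTs; it does nothing for concurrent UPDATEs to the same rows.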
| Comment by Andrei Elkin [ 2018-12-20 ] | ||
|
> whats the difference between MariaDB 10.1 and MySQL 5.6 regarding replication Generally we have few differences, but I don't see how any of them can relate to your case. | ||
| Comment by Maik Pätzold [ 2018-12-20 ] | ||
|
i think you read it right... | ||
| Comment by Andrei Elkin [ 2018-12-20 ] | ||
|
> the second comment (insert at 68767005) stopped the replication with a foreign key constraint violation

That means the BEGIN GTID 0-12-569302375 transaction (gtid group) to which this insert belongs. If you have a slave "frozen" after this error, I could try to help you unroll what might have happened. | ||
| Comment by wy [ 2019-01-10 ] | ||
|
table_id overflow? | ||
| Comment by Maik Pätzold [ 2019-01-10 ] | ||
|
2019-01-08 we hit another replication outage... | ||
| Comment by wy [ 2019-01-10 ] | ||
|
Maik | ||
| Comment by Maik Pätzold [ 2019-01-10 ] | ||
|
ok, for me it sounds really close to our problem... Last events which were applied on the slave: #190108 7:51:30 server id 12 end_log_pos 62304235 Table_map: `supremeshops`.`supr_send_cloud` mapped to number 4294743012
And the problem begins with:
From that point we have a table_id > 2^32 - 1 ... | ||
| Comment by Maik Pätzold [ 2019-01-10 ] | ||
|
So if I got it right, the problem is slave-related, and we could prove that this is our problem by starting a fixed slave and attaching it to the binlog position with the silently dropped insert. The fixed slave should then replicate again, right? | ||
| Comment by Andrei Elkin [ 2019-01-10 ] | ||
|
mapa Your latest data points to | ||
| Comment by Maik Pätzold [ 2019-01-10 ] | ||
|
Elkin, I've changed the uint table_id in sql/table.h to ulong table_id and reattached the slave as described above -> works | ||
| Comment by Maik Pätzold [ 2019-01-10 ] | ||
|
Elkin, do you know if the table_id is maybe increasing faster because we have ~800k tables in one database? | ||
| Comment by Andrei Elkin [ 2019-01-10 ] | ||
|
mapa Right, the reason is the short type of TABLE_LIST::table_id. Well done! The table id in TABLE_SHARE::table_map_id changes with each ALTER. And it's 8 bytes. | ||
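To illustrate why the narrowed id leads to silently skipped events rather than an error, here is a toy model of the slave applier's bookkeeping (hypothetical structures, not MariaDB's actual classes): the Table_map event registers the table under the truncated id, and the following Rows event, carrying the full id, then finds no mapped table.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy model of the slave-side applier (names are illustrative only).
struct Applier {
  std::map<uint64_t, std::string> open_tables;

  // Pre-fix behaviour: the 64-bit id from the Table_map event is
  // narrowed to 32 bits before being remembered.
  void on_table_map(uint64_t id, const std::string &name) {
    open_tables[static_cast<uint32_t>(id)] = name;
  }

  // The Rows event carries the full id; with no matching entry the
  // event is treated as "table not mapped" and dropped without error.
  bool rows_event_applies(uint64_t id) const {
    return open_tables.count(id) != 0;
  }
};
```

With an id of 2^32 + 42 the map entry lands under key 42, so the subsequent Rows event for the full id finds nothing and the INSERT never reaches the table, which matches the missing Order rows observed above.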
| Comment by Maik Pätzold [ 2019-01-10 ] | ||
|
Just to be sure... is there something more to change? https://github.com/mysql/mysql-server/commit/c9c7bda021b8d4d7aa997f24e301df5b76dfe215 Maybe it's a good idea to use a typedef for table_id to have the same type across the entire codebase, like MySQL does? | ||
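A minimal sketch of the typedef idea (all names here are hypothetical, not MariaDB's; the linked MySQL commit goes further and wraps the id in a dedicated Table_id class):

```cpp
#include <cstdint>

// One project-wide alias keeps every holder of a table map id at the
// same 64-bit width, so no individual field can silently narrow it.
typedef uint64_t table_map_id_t;  // hypothetical name

struct TableListFixed {         // stand-in for TABLE_LIST
  table_map_id_t table_id;      // was a 32-bit `uint` before the fix
};

struct TableShareSketch {       // stand-in for TABLE_SHARE
  table_map_id_t table_map_id;  // already 8 bytes on the master side
};

static_assert(sizeof(table_map_id_t) == 8, "table map ids must be 64-bit");
```

A shared alias turns a future narrowing into a deliberate, searchable cast instead of an accidental assignment.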
| Comment by Andrei Elkin [ 2019-01-10 ] | ||
|
sachin.setiya.007, there has been activity on two tickets - | ||
| Comment by Andrei Elkin [ 2019-01-10 ] | ||
|
mapa, sorry, we seem to have been working on the same thing without coordination. I appreciate your help! | ||
| Comment by Maik Pätzold [ 2019-01-11 ] | ||
|
Elkin, that's just my quick and dirty change that I made to test/prove that this is the reason for our problem... and so it is. I thought I should contribute it because I did not expect that you would find time to start on that topic so fast. | ||
| Comment by Andrei Elkin [ 2019-01-11 ] | ||
|
mapa We always love to have contributions. ruochen and yourself did a great job, thank you both! Yes, please close the PR; my "internal" patch is being reviewed. | ||
| Comment by Andrei Elkin [ 2019-05-29 ] | ||
|
Patch cce2b45c8f5b32 was actually pushed back in January. |