Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version: 10.1.26
Environment
2x XEN PV Guest with Debian 9 with a Master/Slave Replication Setup
DB01 Guest Hosted on:
XEN 4.8.4
2x AMD Opteron 6136 (total of 16 Cores)
2x10G Network (LACP)
SSD (4x Enterprise SSD in RAID10 dedicated to this Guest) SAN with 8x10G Network
DB02 Guest Hosted on:
XEN 4.8.3
2x AMD Opteron 6370P (total of 32 Cores)
2x10G Network (LACP)
SSD (4x Enterprise SSD in RAID10 dedicated to this Guest) SAN with 4x10G Network
Each Guest with:
8 vCPUs
32 GB of RAM
300 GB disk (entirely mounted as /)
8 GB swap (mounted from a file in /)
MariaDB 10.1.26 from Debian Repo
~12 Databases with ~150GB disk space used
1 Database with ~700k tables
Description
For privacy reasons I can't share as much data as I'd like to, but hopefully you can help anyway.
To give the whole story, here is what happened:
In February 2018 we migrated a customer from a MySQL 5.5 Master/Master to a MariaDB 10.1 Master/Master setup.
Because the customer has one big database with around 700k tables and a dump takes several days, we did the migration with the following steps:
1. Stop replication on one of the MySQL masters, create a dump via mysqldump and note the replication position.
2. Set up a MariaDB server named DB01 and import the dump from step 1.
3. Configure DB01 as a slave of the opposite MySQL server from step 1, with the positions from step 1.
4. Let DB01 catch up and run as a slave.
5. Stop all machines except DB01.
6. Note the master position + file from DB01 and stop DB01.
7. Create a snapshot of DB01.
8. Configure the snapshot as a new server DB02 (change name/network/server id/increment offset/…).
9. Start DB01 and DB02.
10. Configure DB02 as a slave of DB01 with the data from step 6 (roughly as in the sketch after this list).
11. Configure DB01 as a slave of DB02 (possible because no one works on DB02 yet, by noting the master pos/file from DB02).
12. Configure keepalived so we have a cluster IP.
13. Migrate all applications to the ClusterIP from step 12.
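Steps 3, 10 and 11 boil down to the usual CHANGE MASTER TO statement on the new slave. A minimal sketch, with placeholder host, user and coordinates instead of our real values:

-- run on the machine that should become the slave;
-- the log file/position are the ones noted on the master
CHANGE MASTER TO
  MASTER_HOST     = 'db01.example.local',  -- hypothetical host name
  MASTER_USER     = 'repl',                -- hypothetical replication user
  MASTER_PASSWORD = '***',
  MASTER_LOG_FILE = 'mysql-bin.001234',    -- placeholder file
  MASTER_LOG_POS  = 4;                     -- placeholder position
START SLAVE;
SHOW SLAVE STATUS\G  -- verify Slave_IO_Running / Slave_SQL_Running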
From that point on we had a working MariaDB Master/Master "cluster" setup…
And this setup worked without any changes for roughly 8 months, until the 23rd of October.
At 2018-10-23 10:55:59 replication stopped on DB01 because of a foreign key constraint violation. At and before this point in time the ClusterIP was pointing to DB02, so DB02 can be treated as the MASTER and DB01 as the SLAVE. For a better understanding we can assume the following tables:
Order
Address
Order_item
All with auto_increment ids as primary keys, and the following foreign keys (with NULL allowed and as the default; a simplified schema sketch follows the list):
Order.address_id -> Address.id
Order.delivery_address_id -> Address.id
Order_item.order_id -> Order.id
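For illustration only, a minimal sketch of the assumed structure (names simplified; the real tables of course have more columns):

CREATE TABLE Address (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE `Order` (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  address_id INT UNSIGNED NULL DEFAULT NULL,
  delivery_address_id INT UNSIGNED NULL DEFAULT NULL,
  FOREIGN KEY (address_id) REFERENCES Address (id),
  FOREIGN KEY (delivery_address_id) REFERENCES Address (id)
) ENGINE=InnoDB;

CREATE TABLE Order_item (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  order_id INT UNSIGNED NULL DEFAULT NULL,
  FOREIGN KEY (order_id) REFERENCES `Order` (id)
) ENGINE=InnoDB;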
The constraint violation happened on Order_item.order_id, so the corresponding Order row did not exist on DB01. We had a look at DB02: as expected, the Order and all related rows exist on the "master", and, as also expected because of the issue, none of them exist on DB01. So we started inspecting the corresponding binlog events on DB02 (columns: Log_name | Pos | Event_type | Server_id | End_log_pos | Info):
mysql-bin.001407 | 68765085 | Table_map | 12 | 68765287 | table_id: 4297936365 (Order) |
mysql-bin.001407 | 68765287 | Table_map | 12 | 68765489 | table_id: 4297936365 (Order) |
mysql-bin.001407 | 68765489 | Write_rows_v1 | 12 | 68765660 | table_id: 4297936365 flags: STMT_END_F |
mysql-bin.001407 | 68765660 | Xid | 12 | 68765687 | COMMIT /* xid=1845363106 */ |
mysql-bin.001407 | 68765687 | Gtid | 12 | 68765725 | BEGIN GTID 0-12-569302375 |
mysql-bin.001407 | 68765725 | Intvar | 12 | 68765753 | INSERT_ID=501448 |
mysql-bin.001407 | 68765753 | Query | 12 | 68766164 | INSERT INTO Address ... |
mysql-bin.001407 | 68766164 | Query | 12 | 68766977 | UPDATE `Order` SET ..., `address_id`='501448', `delivery_address_id`='501448' WHERE Order.id='471850' |
mysql-bin.001407 | 68766977 | Intvar | 12 | 68767005 | INSERT_ID=1329576 |
mysql-bin.001407 | 68767005 | Query | 12 | 68767678 | INSERT INTO Order_item ... |
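For reference, a listing like the one above can be produced with SHOW BINLOG EVENTS; roughly (file name and start offset taken from the excerpt, the exact invocation may have differed):

SHOW BINLOG EVENTS IN 'mysql-bin.001407' FROM 68765085 LIMIT 10;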
Replication on DB01 stopped with the exec master position at 68765489, complaining about the insert at 68767005…
On DB01 the relay log events showed the following (this is a copy from after we tried to recover the slave via RESET SLAVE and reattaching; columns as above):
mysqld-relay-bin.000002 | 537 | Gtid | 12 | 68765085 | BEGIN GTID 0-12-569302374 |
mysqld-relay-bin.000002 | 249 | Rotate | 12 | 0 | mysql-bin.001407;pos=68765047 |
mysqld-relay-bin.000002 | 292 | Format_desc | 12 | 0 | Server ver: 10.1.26-MariaDB-0+deb9u1, Binlog ver: 4 |
mysqld-relay-bin.000002 | 537 | Gtid | 12 | 68765085 | BEGIN GTID 0-12-569302374 |
mysqld-relay-bin.000002 | 575 | Table_map | 12 | 68765287 | table_id: 4297936365 (Order) |
mysqld-relay-bin.000002 | 777 | Table_map | 12 | 68765489 | table_id: 4297936365 (Order) |
mysqld-relay-bin.000002 | 979 | Write_rows_v1 | 12 | 68765660 | table_id: 4297936365 flags: STMT_END_F |
mysqld-relay-bin.000002 | 1150 | Xid | 12 | 68765687 | COMMIT /* xid=1845363106 */ |
mysqld-relay-bin.000002 | 1177 | Gtid | 12 | 68765725 | BEGIN GTID 0-12-569302375 |
mysqld-relay-bin.000002 | 1215 | Intvar | 12 | 68765753 | INSERT_ID=501448 |
mysqld-relay-bin.000002 | 1243 | Query | 12 | 68766164 | INSERT INTO `Address` ... |
mysqld-relay-bin.000002 | 1654 | Query | 12 | 68766977 | UPDATE `Order` SET ..., `address_id`='501448', `delivery_address_id`='501448' WHERE Order.id='471850' |
mysqld-relay-bin.000002 | 2467 | Intvar | 12 | 68767005 | INSERT_ID=1329576 |
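For reference, a relay-log listing like the one above can be produced with SHOW RELAYLOG EVENTS on DB01; a rough sketch (the exact offset/limit here is an assumption):

SHOW RELAYLOG EVENTS IN 'mysqld-relay-bin.000002' LIMIT 15;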
Somehow I did not copy up to the Order_item insert, but it was right after position 2467 and it was the one that failed.
Things to mention:
- the missing Order is replicated row-based
- while inspecting more of the logs we noticed that some row-based events (and this applies to the missing Order) have no Format_desc right before the event (as others do)
So we tried to insert the missing Order manually and continue replication… but this only worked until the next Order/Order_item insert. So we had a persistent problem…
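The manual fix attempt was essentially of this form (a sketch; the id and column values are placeholders based on the UPDATE event shown above, not the full real row):

-- on the broken slave DB01
STOP SLAVE;
-- re-create the missing parent row so the pending Order_item insert can apply
INSERT INTO `Order` (id, address_id, delivery_address_id)
VALUES (471850, NULL, NULL);
START SLAVE;
SHOW SLAVE STATUS\G  -- check Last_SQL_Error and Seconds_Behind_Master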
Because we couldn't fix the replication, we decided to rebuild the "cluster" with the following steps:
1. Stop DB02, make a snapshot and restart DB02.
2. Overwrite DB01 with the snapshot from step 1 and configure network/name/server_id and so on…
3. Create a new DB03 from the snapshot, like in step 2.
4. Start DB01 with networking disabled.
5. Note the master file/position from DB01 (see the sketch after this list), enable the network, and reattach DB01 to DB02 as a slave with this info.
6. Attach DB03 as a second slave to DB02 with the info from step 5.
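Presumably the point of step 5 is that DB01, being restored from DB02's snapshot, carries a copy of DB02's binlogs, so its own coordinates correspond to DB02's position at snapshot time. A minimal sketch, under that assumption:

-- on DB01, freshly restored from the snapshot and started without network access
SHOW MASTER STATUS;  -- note File and Position
-- then point DB01 and DB03 at DB02 with CHANGE MASTER TO using exactly
-- this file/position (same pattern as the earlier CHANGE MASTER sketch)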
From this point we had one master and 2 slaves. The slaves caught up and all was working…
for around 3 weeks, until the 14th of November.
At 2018-11-14 17:42:33 replication stopped on both slaves, DB01 and DB03, at the exact same position and with a similar statement as in the first problem (missing Order row for an INSERT INTO Order_item).
So from our understanding the entire problem is somehow related to a row-based replication event which is somehow skipped or silently dropped or something like that?
Do you have any ideas what could cause a problem like this?
Attachments
Issue Links
- is duplicated by:
  - MDEV-15660 Problematic BINLOG events from master (Closed)
  - MDEV-18175 table id overflow causes replication out of sync (Closed)
- relates to:
  - MDEV-15660 Problematic BINLOG events from master (Closed)