Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version: 10.1.26
Environment
2x XEN PV Guest with Debian 9 with a Master/Slave Replication Setup
DB01 Guest Hosted on:
XEN 4.8.4
2x AMD Opteron 6136 (total of 16 Cores)
2x10G Network (LACP)
SSD (4x Enterprise SSD in RAID10 dedicated to this Guest) SAN with 8x10G Network
DB02 Guest Hosted on:
XEN 4.8.3
2x AMD Opteron 6370P (total of 32 Cores)
2x10G Network (LACP)
SSD (4x Enterprise SSD in RAID10 dedicated to this Guest) SAN with 4x10G Network
Each Guest with:
8 vCPUs
32 GB of RAM
300 GB disk (entirely mounted as /)
8 GB swap (mounted from a file in /)
MariaDB 10.1.26 from Debian Repo
~12 Databases with ~150GB disk space used
1 Database with ~700k tables
Description
For privacy reasons I can't share as much data as I'd like to, but hopefully you can help anyway.
To give the whole story, here is what happened:
In February 2018 we migrated a customer from a MySQL 5.5 Master/Master to a MariaDB 10.1 Master/Master setup.
Because the customer has one big database with around 700k tables and a dump takes several days, we did the migration with the following steps:
1. Stop replication on one of the MySQL masters, create a dump via mysqldump and note the replication position.
2. Set up a MariaDB server named DB01 and import the dump from step 1.
3. Configure DB01 as a slave of the opposite MySQL server from step 1, with the positions from step 1.
4. Let DB01 catch up and run as a slave.
5. Stop all machines except DB01.
6. Note the master position + file from DB01 and stop DB01.
7. Create a snapshot of DB01.
8. Configure the snapshot as a new server DB02 (change name/network/server id/increment offset/…).
9. Start DB01 and DB02.
10. Configure DB02 as a slave of DB01 with the data from step 6 (roughly as in the sketch after this list).
11. Configure DB01 as a slave of DB02 (possible because no one works on DB02 yet, by noting the master pos/file from DB02).
12. Configure keepalived so we have a cluster IP.
13. Migrate all applications to the ClusterIP from step 12.
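Steps 3, 10 and 11 boil down to the usual CHANGE MASTER TO statement on the new slave. A minimal sketch, with placeholder host, user and coordinates instead of our real values:

-- run on the machine that should become the slave;
-- the log file/position are the ones noted on the master
CHANGE MASTER TO
  MASTER_HOST     = 'db01.example.local',  -- hypothetical host name
  MASTER_USER     = 'repl',                -- hypothetical replication user
  MASTER_PASSWORD = '***',
  MASTER_LOG_FILE = 'mysql-bin.001234',    -- placeholder file
  MASTER_LOG_POS  = 4;                     -- placeholder position
START SLAVE;
SHOW SLAVE STATUS\G  -- verify Slave_IO_Running / Slave_SQL_Running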
From that point on we had a working MariaDB Master/Master "cluster" setup…
And this setup worked without any changes for roughly 8 months, until the 23rd of October.
At 2018-10-23 10:55:59 replication stopped on DB01 because of a foreign key constraint violation. At and before this point in time the ClusterIP was pointing to DB02, so DB02 can be treated as the MASTER and DB01 as the SLAVE. For a better understanding we can assume the following tables:
Order
Address
Order_item
All with auto_increment ids as primary keys, and the following foreign keys (with NULL allowed and as the default; a simplified schema sketch follows the list):
Order.address_id -> Address.id
Order.delivery_address_id -> Address.id
Order_item.order_id -> Order.id
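For illustration only, a minimal sketch of the assumed structure (names simplified; the real tables of course have more columns):

CREATE TABLE Address (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE `Order` (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  address_id INT UNSIGNED NULL DEFAULT NULL,
  delivery_address_id INT UNSIGNED NULL DEFAULT NULL,
  FOREIGN KEY (address_id) REFERENCES Address (id),
  FOREIGN KEY (delivery_address_id) REFERENCES Address (id)
) ENGINE=InnoDB;

CREATE TABLE Order_item (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  order_id INT UNSIGNED NULL DEFAULT NULL,
  FOREIGN KEY (order_id) REFERENCES `Order` (id)
) ENGINE=InnoDB;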
The constraint violation happened on Order_item.order_id, so the corresponding Order row did not exist on DB01. We had a look at DB02: as expected, the Order and all related rows exist on the "master", and, as also expected because of the issue, none of them exist on DB01. So we started inspecting the corresponding binlog events on DB02 (columns: Log_name | Pos | Event_type | Server_id | End_log_pos | Info):
mysql-bin.001407 | 68765085 | Table_map | 12 | 68765287 | table_id: 4297936365 (Order) |
mysql-bin.001407 | 68765287 | Table_map | 12 | 68765489 | table_id: 4297936365 (Order) |
mysql-bin.001407 | 68765489 | Write_rows_v1 | 12 | 68765660 | table_id: 4297936365 flags: STMT_END_F |
mysql-bin.001407 | 68765660 | Xid | 12 | 68765687 | COMMIT /* xid=1845363106 */ |
mysql-bin.001407 | 68765687 | Gtid | 12 | 68765725 | BEGIN GTID 0-12-569302375 |
mysql-bin.001407 | 68765725 | Intvar | 12 | 68765753 | INSERT_ID=501448 |
mysql-bin.001407 | 68765753 | Query | 12 | 68766164 | INSERT INTO Address ... |
mysql-bin.001407 | 68766164 | Query | 12 | 68766977 | UPDATE `Order` SET ..., `address_id`='501448', `delivery_address_id`='501448' WHERE Order.id='471850' |
mysql-bin.001407 | 68766977 | Intvar | 12 | 68767005 | INSERT_ID=1329576 |
mysql-bin.001407 | 68767005 | Query | 12 | 68767678 | INSERT INTO Order_item ... |
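For reference, a listing like the one above can be produced with SHOW BINLOG EVENTS; roughly (file name and start offset taken from the excerpt, the exact invocation may have differed):

SHOW BINLOG EVENTS IN 'mysql-bin.001407' FROM 68765085 LIMIT 10;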
Replication on DB01 stopped with the exec master position at 68765489, complaining about the insert at 68767005…
On DB01 the relay log events showed the following (this is a copy from after we tried to recover the slave via RESET SLAVE and reattaching; columns as above):
mysqld-relay-bin.000002 | 537 | Gtid | 12 | 68765085 | BEGIN GTID 0-12-569302374 |
mysqld-relay-bin.000002 | 249 | Rotate | 12 | 0 | mysql-bin.001407;pos=68765047 |
mysqld-relay-bin.000002 | 292 | Format_desc | 12 | 0 | Server ver: 10.1.26-MariaDB-0+deb9u1, Binlog ver: 4 |
mysqld-relay-bin.000002 | 537 | Gtid | 12 | 68765085 | BEGIN GTID 0-12-569302374 |
mysqld-relay-bin.000002 | 575 | Table_map | 12 | 68765287 | table_id: 4297936365 (Order) |
mysqld-relay-bin.000002 | 777 | Table_map | 12 | 68765489 | table_id: 4297936365 (Order) |
mysqld-relay-bin.000002 | 979 | Write_rows_v1 | 12 | 68765660 | table_id: 4297936365 flags: STMT_END_F |
mysqld-relay-bin.000002 | 1150 | Xid | 12 | 68765687 | COMMIT /* xid=1845363106 */ |
mysqld-relay-bin.000002 | 1177 | Gtid | 12 | 68765725 | BEGIN GTID 0-12-569302375 |
mysqld-relay-bin.000002 | 1215 | Intvar | 12 | 68765753 | INSERT_ID=501448 |
mysqld-relay-bin.000002 | 1243 | Query | 12 | 68766164 | INSERT INTO `Address` ... |
mysqld-relay-bin.000002 | 1654 | Query | 12 | 68766977 | UPDATE `Order` SET ..., `address_id`='501448', `delivery_address_id`='501448' WHERE Order.id='471850' |
mysqld-relay-bin.000002 | 2467 | Intvar | 12 | 68767005 | INSERT_ID=1329576 |
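For reference, a relay-log listing like the one above can be produced with SHOW RELAYLOG EVENTS on DB01; a rough sketch (the exact offset/limit here is an assumption):

SHOW RELAYLOG EVENTS IN 'mysqld-relay-bin.000002' LIMIT 15;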
Somehow I did not copy up to the Order_item insert, but it was right after position 2467 and it was the one that failed.
Things to mention:
- the missing Order is replicated row-based
- while inspecting more of the logs we noticed that some row-based events (and this applies to the missing Order) have no Format_desc right before the event (as others do)
So we tried to insert the missing Order manually and continue replication… but this only worked until the next Order/Order_item insert. So we had a persistent problem…
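The manual fix attempt was essentially of this form (a sketch; the id and column values are placeholders based on the UPDATE event shown above, not the full real row):

-- on the broken slave DB01
STOP SLAVE;
-- re-create the missing parent row so the pending Order_item insert can apply
INSERT INTO `Order` (id, address_id, delivery_address_id)
VALUES (471850, NULL, NULL);
START SLAVE;
SHOW SLAVE STATUS\G  -- check Last_SQL_Error and Seconds_Behind_Master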
Because we couldn't fix the replication, we decided to rebuild the "cluster" with the following steps:
1. Stop DB02, make a snapshot and restart DB02.
2. Overwrite DB01 with the snapshot from step 1 and configure network/name/server_id and so on…
3. Create a new DB03 from the snapshot, like in step 2.
4. Start DB01 with networking disabled.
5. Note the master file/position from DB01 (see the sketch after this list), enable the network, and reattach DB01 to DB02 as a slave with this info.
6. Attach DB03 as a second slave to DB02 with the info from step 5.
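Presumably the point of step 5 is that DB01, being restored from DB02's snapshot, carries a copy of DB02's binlogs, so its own coordinates correspond to DB02's position at snapshot time. A minimal sketch, under that assumption:

-- on DB01, freshly restored from the snapshot and started without network access
SHOW MASTER STATUS;  -- note File and Position
-- then point DB01 and DB03 at DB02 with CHANGE MASTER TO using exactly
-- this file/position (same pattern as the earlier CHANGE MASTER sketch)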
From this point we had one master and 2 slaves. The slaves caught up and all was working…
for around 3 weeks, until the 14th of November.
At 2018-11-14 17:42:33 replication stopped on both slaves, DB01 and DB03, at the exact same position and with a similar statement as in the first problem (missing Order row for an INSERT INTO Order_item).
So from our understanding the entire problem is somehow related to a row-based replication event which is somehow skipped or silently dropped or something like that?
Do you have any ideas what could cause a problem like this?
Attachments
Issue Links
- is duplicated by:
  - MDEV-15660 Problematic BINLOG events from master (Closed)
  - MDEV-18175 table id overflow causes replication out of sync (Closed)
- relates to:
  - MDEV-15660 Problematic BINLOG events from master (Closed)