Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17803

Row-based event is not applied when table map id is greater 32 bit int

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.1.26
    • Fix Version/s: 10.4.2, 10.1.38, 10.0.38, 10.2.22, 10.3.13
    • Component/s: Replication
    • Labels:
    • Environment:

      Description

      Because of privacy reasons I can’t share as much data i’d like to, but hopefully you can help anyway.
      To have the whole Story… what happened:

      In February 2018 we migrated a customer from a MySQL 5.5 Master/Master to a MariaDB 10.1 Master/Master Setup.
      Because the Customer have one Big Database with around 700k Tables and a Dump took several days, we have done the migration by the following Steps:

      1. Stop Replication on one of the MySQL Master, create a Dump via mysqldump and note replication position
      2. Setup MariaDB Server named DB01 and import Dump from 1.
      3. Configure DB01 as a Slave of the opposite MySQL Server from 1. With the positions from 1.
      4. Let DB01 Catch up and run as slave
      5. Stop All Maschines, except DB01.
      6. Note master position + file from and stop DB01.
      6. Create a snapshot of DB01
      7. Configure the snapshot as new server DB02 (change name/network/server id/increment offset/…)
      8. Start DB01 and DB02.
      9. Configure DB02 as Slave of DB01 With data from 6.
      10. Configure DB01 As Slave from DB02 (because no one works on it by note master pos/file from DB02)
      11. Configure keepalived so we have a cluster ip
      12. Migrate all applications the the ClusterIP from 11.

      From that Point we had a MariaDB Master/Master „Cluster“ Setup working….
      And this Setup worked without any changes for round about 8 Months, until 23th of October.

      At 2018-10-23 10:55:59 Replication stopped on DB01 because of a foreign key constraint violation. At and before this point in time the ClusterIP was Pointing to DB02, so DB02 could be treated as MASTER and DB01 as SLAVE. For a better understanding we could assume the following tables:

      Order
      Address
      Order_item

      All with auto_increment id’s as primary key, and the following foreign keys (with null allowed and as default):

      Order.address_id -> Address.id
      Order.delivery_address_id -> Address.id
      Order_item.order_id -> Order.id

      The Constraint violation happened for the Order_item.order_id, So the corresponding Order does not exist. We had a look on DB02 -> as expected the Order and all related exists on the „Master“ and as well expected because of the issue on DB01 none of it exists. So we started inspecting the corresponding binlog events on DB02:

      mysql-bin.001407 68765085 Table_map             12     68765287 table_id: 4297936365 (Order)
      mysql-bin.001407 68765287 Table_map             12     68765489 table_id: 4297936365 (Order) 
      mysql-bin.001407 68765489 Write_rows_v1         12     68765660 table_id: 4297936365 flags: STMT_END_F
      mysql-bin.001407 68765660 Xid                   12     68765687 COMMIT /* xid=1845363106 */
      mysql-bin.001407 68765687 Gtid                  12     68765725 BEGIN GTID 0-12-569302375
      mysql-bin.001407 68765725 Intvar                12     68765753 INSERT_ID=501448
      mysql-bin.001407 68765753 Query                 12     68766164 INSERT INTO Address ...
      mysql-bin.001407 68766164 Query                 12     68766977 UPDATE `Order` SET ..., `address_id`='501448', `delivery_address_id`='501448' WHERE Order.id='471850'
      mysql-bin.001407 68766977 Intvar                12     68767005 INSERT_ID=1329576        
      mysql-bin.001407 68767005 Query 12 68767678 INSERT INTO Order_item ...

      The Replication on DB01 stopped with exec master position at 68765489 and complaining the insert at 68767005...
      On DB01 the binlog events showed (this is a copy after we tried to recover the slave by reset slave and reattach):

      mysqld-relay-bin.000002 537 Gtid 12 68765085 BEGIN GTID 0-12-569302374
      mysqld-relay-bin.000002 249 Rotate 12 0 mysql-bin.001407;pos=68765047
      mysqld-relay-bin.000002 292 Format_desc 12 0 Server ver: 10.1.26-MariaDB-0+deb9u1, Binlog ver: 4
      mysqld-relay-bin.000002 537 Gtid 12 68765085 BEGIN GTID 0-12-569302374
      mysqld-relay-bin.000002 575 Table_map 12 68765287 table_id: 4297936365 (Order)
      mysqld-relay-bin.000002 777 Table_map 12 68765489 table_id: 4297936365 (Order)
      mysqld-relay-bin.000002 979 Write_rows_v1 12 68765660 table_id: 4297936365 flags: STMT_END_F
      mysqld-relay-bin.000002 1150 Xid 12 68765687 COMMIT /* xid=1845363106 */
      mysqld-relay-bin.000002 1177 Gtid 12 68765725 BEGIN GTID 0-12-569302375
      mysqld-relay-bin.000002 1215 Intvar 12 68765753 INSERT_ID=501448
      mysqld-relay-bin.000002 1243 Query 12 68766164 INSERT INTO `Address` ...
      mysqld-relay-bin.000002 1654 Query 12 68766977 UPDATE `Order` SET ..., `address_id`='501448', `delivery_address_id`='501448' WHERE Order.id='471850'
      mysqld-relay-bin.000002 2467 Intvar 12 68767005 INSERT_ID=1329576

      somehow i did not copied until the Order_item insert, but it was right behind 2467 and the failed one.
      Things to mention:

      • the missing Order is replicated row based
      • while inspecting more of the logs we noticed that some row-based events (and this applies to the missing Order) has no Format_desc right before the event (as others have)

      So we tried to insert the missing Order manually and continue replication... but this only worked until the next order/order_item insert. So we had a persistent problem...
      Because we couldn't fix the replication we decided to rebuild the "cluster" by the following steps:

      1. Stop DB02, make make a Snapshot and restart DB02
      2. Overwrite DB01 with the Snapshot from 1. And configure network/name/server_id and so on...
      3. Create a new DB03 within the Snapshot like in 2.
      4. Start DB01 with network disabled
      5. Note master file/position from DB01, enabled network and reattach to DB02 as a slave with this infos
      6. Attach DB03 as a second Slave to DB02 with the infos from 5.

      From this Point we had a Master and 2 Slaves. The Slaves catched up and all was working...
      For around 3 weeks until 14th of November.

      At 2018-11-14 17:42:33 Replication stopped on both, DB01 and DB03 Slaves with the exact same position and similar Statement as the first Problem (Missing Order for INSERT INTO Order_item)

      So from our understanding the entire Problem is somehow related to the row-based replication event which is somehow skipped or silent dropped or something like that?

      Do you have any ideas what could cause a problem like this?

        Attachments

        1. db02.my.cnf
          4 kB
        2. db03.my.cnf
          2 kB
        3. db03.txt
          520 kB

          Issue Links

            Activity

              People

              Assignee:
              Elkin Andrei Elkin
              Reporter:
              mapa Maik Pätzold
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: