[MDEV-30077] Replication locked after migration of 1 server from 10.4.18 to 10.4.25 Created: 2022-11-23  Updated: 2023-01-13  Resolved: 2023-01-13

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4.25
Fix Version/s: N/A

Type: Bug Priority: Blocker
Reporter: JP Pozzi Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: replication
Environment:

CENTOS 7.9


Issue Links:
Relates
relates to MDEV-29621 Replica stopped by locks on sequence Closed

 Description   

Hello,
After migrating from 10.4.18 to 10.4.25, replication lags.
When I look at metadata_lock_info I see 2 "MDL_BACKUP_COMMIT" and 2 "MDL_BACKUP_TRANS_DML" locks.
Those locks seem to come from the slave threads ...
If I kill the threads, the slave processes are stopped.
If I restart the slaves, they work for some time (high IO rate), then the MDL_BACKUP locks come back and replication lags again.
It is very urgent as it is on a production system.

The migration to 10.4.25 was OK on integration and pre-production systems and even on some other smaller production systems.
That system handles 30000 to 50000 updates per minute.



 Comments   
Comment by JP Pozzi [ 2022-11-23 ]

Lag is now > 10000 seconds.

Comment by JP Pozzi [ 2022-11-23 ]

select * from metadata_lock_info

+-----------+-------------------+---------------+-------------------------------+--------------+---------------------------------+
| THREAD_ID | LOCK_MODE         | LOCK_DURATION | LOCK_TYPE                     | TABLE_SCHEMA | TABLE_NAME                      |
+-----------+-------------------+---------------+-------------------------------+--------------+---------------------------------+
| 5035      | MDL_BACKUP_COMMIT | NULL          | Backup lock                   |              |                                 |
| 5034      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote_poi   | CALC_MOIS_12                    |
| 5032      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote_poi   | CALC_MOIS_12                    |
| 5024      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote_poi   | CALC_MOIS_12                    |
| 5021      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote_poi   | CALC_MOIS_12                    |
| 5020      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote_poi   | CALC_MOIS_12                    |
| 5022      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote_poi   | CALC_MOIS_12                    |
| 5018      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_DEVICE_DATA              |
| 5028      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_POI_LAST_SESSION         |
| 5023      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_POI_LAST_SESSION         |
| 5025      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_POI_LAST_SESSION         |
| 5031      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote       | CALC_QUANTIEME_13               |
| 5033      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote       | CALC_QUANTIEME_13               |
| 5027      | MDL_SHARED        | NULL          | Stored function metadata lock | coyote       | CALC_QUANTIEME_13               |
| 5035      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5031      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5028      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5032      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5034      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5033      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5024      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5023      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5025      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5029      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5027      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5030      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5018      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5022      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5020      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5021      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | mysql        | gtid_slave_pos                  |
| 5035      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_POI_SESSION              |
| 5029      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_POI_SESSION              |
| 5030      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_POI_SESSION              |
| 5031      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote       | FILE_REQUEST                    |
| 5027      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote       | FILE_REQUEST                    |
| 5033      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote       | COYOTE_STAT_USER_FCD            |
| 5034      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT       |
| 5032      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT       |
| 5024      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT       |
| 5021      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT       |
| 5020      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT       |
| 5022      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT       |
| 5034      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT_HISTO |
| 5032      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT_HISTO |
| 5024      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT_HISTO |
| 5021      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT_HISTO |
| 5020      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT_HISTO |
| 5022      | MDL_SHARED_WRITE  | NULL          | Table metadata lock           | coyote_poi   | COYOTE_VERSION_BDD_CLIENT_HISTO |
+-----------+-------------------+---------------+-------------------------------+--------------+---------------------------------+
48 rows in set (0.000 sec)
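
A listing like this is easier to read when grouped by thread, since it shows at a glance which thread holds the backup lock and what else that thread has open. The following is only a sketch: it hard-codes a handful of rows copied from the output above (the LOCK_DURATION column is dropped), and the helper name `summarize_locks` is hypothetical, not something MariaDB provides.

```python
from collections import defaultdict

# A few rows copied from the metadata_lock_info output above:
# (THREAD_ID, LOCK_MODE, LOCK_TYPE, TABLE_SCHEMA, TABLE_NAME)
SAMPLE_ROWS = [
    (5035, "MDL_BACKUP_COMMIT", "Backup lock", "", ""),
    (5035, "MDL_SHARED_WRITE", "Table metadata lock", "mysql", "gtid_slave_pos"),
    (5035, "MDL_SHARED_WRITE", "Table metadata lock", "coyote_poi", "COYOTE_POI_SESSION"),
    (5029, "MDL_SHARED_WRITE", "Table metadata lock", "mysql", "gtid_slave_pos"),
    (5029, "MDL_SHARED_WRITE", "Table metadata lock", "coyote_poi", "COYOTE_POI_SESSION"),
    (5030, "MDL_SHARED_WRITE", "Table metadata lock", "mysql", "gtid_slave_pos"),
    (5030, "MDL_SHARED_WRITE", "Table metadata lock", "coyote_poi", "COYOTE_POI_SESSION"),
]


def summarize_locks(rows):
    """Group lock rows by thread id and collect the threads holding a backup lock."""
    by_thread = defaultdict(list)
    backup_holders = set()
    for thread_id, mode, lock_type, schema, name in rows:
        # Backup locks have no schema/table, so fall back to the lock type.
        target = f"{schema}.{name}" if schema else lock_type
        by_thread[thread_id].append((mode, target))
        if mode.startswith("MDL_BACKUP_"):
            backup_holders.add(thread_id)
    return dict(by_thread), backup_holders


by_thread, backup_holders = summarize_locks(SAMPLE_ROWS)
for thread_id in sorted(by_thread):
    marker = " (holds a backup lock)" if thread_id in backup_holders else ""
    print(f"thread {thread_id}{marker}:")
    for mode, target in by_thread[thread_id]:
        print(f"  {mode} on {target}")
```

In this sample, thread 5035 holds MDL_BACKUP_COMMIT while also writing to mysql.gtid_slave_pos and coyote_poi.COYOTE_POI_SESSION, the same table that threads 5029 and 5030 have open.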

Comment by JP Pozzi [ 2022-11-23 ]

After a STOP SLAVE + START SLAVE, replication works for some time (minutes) and then locks up again.
After the stop/start, IO is about 800 MB/sec; once locked, IO is < 10 MB/sec.
Lag is still increasing.

Comment by JP Pozzi [ 2022-11-23 ]

We downgraded the system to 10.4.18 and the lag is shrinking rapidly: 4 minutes less every minute.
I'm curious to understand this phenomenon ...
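
As a rough illustration of that catch-up rate: if the lag shrinks by 4 minutes for every minute of wall-clock time, the remaining catch-up time is the lag divided by that rate. The sketch below assumes a starting lag of 10000 seconds (the last figure reported above; the actual lag at the time of the downgrade is not stated) and a constant rate. The function name is hypothetical.

```python
def minutes_to_catch_up(lag_seconds, lag_reduction_min_per_min):
    """Wall-clock minutes until the lag reaches zero, assuming a constant catch-up rate."""
    if lag_reduction_min_per_min <= 0:
        raise ValueError("the replica is not catching up")
    lag_minutes = lag_seconds / 60
    return lag_minutes / lag_reduction_min_per_min


# Assumed starting lag: 10000 s. Reported rate: 4 minutes of lag cleared per minute.
estimate = minutes_to_catch_up(10_000, 4)
print(f"about {estimate:.0f} minutes to catch up")
```

Under those assumptions the replica would need a bit over 40 minutes to catch up completely.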

Comment by Angelique Sklavounos (Inactive) [ 2022-11-23 ]

Hi jppo

I will try to reproduce based on this ticket and MDEV-29621 (I am guessing the systems are the same, apart from the MariaDB version of course), but if you still have the error and binary logs from 10.4.25 that would be helpful.

Thank you.

Comment by JP Pozzi [ 2022-11-24 ]

Hello,
That cluster is one of our production systems. The preceding problem was with 10.4.26 on a pre-production cluster, so we chose to upgrade to 10.4.25, which was OK on pre-production systems and which is OK on two smaller production clusters.
So the problem was a surprise for us, as all other systems are OK with 10.4.25!
I can do some tests on the pre-production systems, which are less busy (not 50000+ transactions/minute).
Regards
JP P

Comment by Angelique Sklavounos (Inactive) [ 2022-12-06 ]

Hi jppo

The output from show processlist would provide more information, as well as the binary and error logs and show slave status output to see if there is a bug and if it can be reproduced.

Comment by Angelique Sklavounos (Inactive) [ 2022-12-13 ]

Hi jppo

Since it will not be possible to reproduce this on your end, if you could provide general production binary logs that you think would cause the problem in 10.4.25, we could try to see what the difference is between 10.4.18 and 10.4.25. Please use the FTP server.

Also, is only one master receiving queries, or are both masters receiving queries?

Comment by JP Pozzi [ 2022-12-23 ]

Hello,
I am coming back after a long time ... with no news. The binary logs are deleted after 4 to 5 days, as they are huge (65 to 100 files a day).
Only one machine receives updates; the second one is used read-only by some applications, but that is 1% of the activity.
The clients access the machines through virtual IPs managed by a High Availability system, with a higher priority on the "001" system; the other, "002", uses a lower priority to avoid conflicts.
I think we will stop the 10.4 migration and begin tests to move to a newer version that is "compatible" at the replication level, as our systems run 24 hours a day, 365 days a year. We have been doing this kind of migration since the "MySQL" days.
Which version will be perfectly compatible at the replication level with 10.4.18?

Comment by JP Pozzi [ 2023-01-13 ]

Hello,

Is the replication process compatible between major versions?
From 10.4.18, to which version could we migrate without replication problems?

Thanks in advance

JP P

Comment by Angelique Sklavounos (Inactive) [ 2023-01-13 ]

Yes, replication should be compatible between major versions. Please consult:
https://mariadb.com/kb/en/upgrading-between-major-mariadb-versions/
https://mariadb.com/kb/en/replication-overview/#cross-version-replication-compatibility

If you run into an issue which you believe is a bug, please report it by following these guidelines: https://mariadb.com/kb/en/reporting-bugs/
