[MDEV-29621] Replica stopped by locks on sequence Created: 2022-09-23  Updated: 2023-07-26  Resolved: 2023-04-28

Status: Closed
Project: MariaDB Server
Component/s: Binary Protocol, Replication
Affects Version/s: 10.3.36, 10.4.26, 10.6.11
Fix Version/s: 10.11.3, 10.3.39, 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Blocker
Reporter: JP Pozzi Assignee: Andrei Elkin
Resolution: Fixed Votes: 1
Labels: None
Environment:

Centos 7.9


Attachments: HTML File CONFIG_CLUSTER     HTML File INSERT_1     HTML File INSERT_2     File README.first     HTML File SEQ_AND_TABLES     File server.cnf    
Issue Links:
Blocks
Problem/Incident
causes MDEV-31779 Server crash in Rows_log_event::updat... In Progress
Relates
relates to MDEV-30077 Replication locked after migration of... Closed

 Description   

Hello,
We encounter a problem while upgrading from 10.4.18 to 10.4.26.
All was OK on "single" servers for dev, integration but when we upgrade pre-production servers (cluster) we get a big problem.
Our clusters are built with one "master" machine and a "slave" machine.
The master works well and all work is done, but on the replica the slave process is blocked with locks.
We use some tables with sequences to have an "unique" numbering scheme for 2 or 3 tables and the replica block when receiving the data for the sequence and a corresponding table.
One thing made me wonder, on the master the sequence locks are "MDL SHARED_WRITE" and on the replica it locks with "MDL_EXCLUSIVE", the depending table is locked with "MDL SHARED WRITE".
I can kill the locked process but the replica locks again for the next insert.
We can not upgrade the production servers as many tables are using sequences.
We do not have any problems with that kind of cluster since the version 10.2.
The binlog is using "mixed" format.

Any help will be welcome ...

PS : the production servers are working at 10000 to 45000 transactions/ minute and the replica's lag is between 0 and 2 seconds.

JP P



 Comments   
Comment by JP Pozzi [ 2022-09-28 ]

Hello,

Our upgrade process is blocked, we cannot upgrade our production servers as their security is based on replication.
Is there anything we can do to get past this problem or help resolve it ?

Comment by Angelique Sklavounos (Inactive) [ 2022-09-29 ]

Hi jppo

To try to recreate this, could you please provide more information on the following:

  1. The clusters (topology)
  2. my.cnf files
  3. SHOW CREATE TABLE on the tables with sequences whose later INSERTs cause the slave to lock
  4. The queries with the INSERTs that cause the locks

Thank you.

Comment by JP Pozzi [ 2022-09-30 ]

Hello,

I add two files (INSERT_1, INSERT_2) used to test the applications.
If you needs more I can get some more.

Regards
JP P

Comment by Angelique Sklavounos (Inactive) [ 2022-10-03 ]

Thanks, jppo.

In the file SEQ_AND_TABLES, I am a bit confused by table SUIVI_V2 and its key SUPPRESSION when there is no column SUPPRESSION defined for that table - I cannot create this table on its own.

I'm not able to recreate this simply with creating the sequence and table insertion, so could you also please send...

  1. output of the command used for to see the metadata locks?
  2. output of SHOW PROCESSLIST (if not the above)
  3. error and SQL logs before and up to the lock occurring (if possible?) (If they are too big, you can upload them to the private folder of this FTP server: https://mariadb.com/kb/en/meta/mariadb-ftp-server/)

Also, is the problem seen with only one cluster? If not, how many clusters are there, and how are they connected together (from server.cnf I assume Galera is not used)?

Thank you.

Comment by JP Pozzi [ 2022-10-03 ]

Hello,

Yes the index on "SUPPRESSION" colums is irrelevant.
The SQL used to show METADATA_LOCK_INFO :
select *
from information_schema.METADATA_LOCK_INFO
where length(TABLE_NAME) > 0
and TABLE_SCHEMA not in ('mysql');*

The problem is "old", so The server is notr able to activate the save (binlogs are maximum 3 days old) :
Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.

Processlist is very small as the slave proceeses are stopped :

4 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
1 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
3 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
2 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
5 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
7 event_scheduler localhost NULL Daemon 445538 Waiting on empty queue NULL 0.000
72150 replic_user 10.216.51.62:41696 NULL Binlog Dump 41987 Master has sent all binlog to slave; waiting for binlog to be updated NULL 0.000
79637 admin_coyote localhost mysql Query 0 Init show processlist 0.000

---------------------------------------------------------------------------------------------
If nexessary I can restore the data from the other node and connect the slave, but the database is quite big and it is a very long process. I can restore for today as I will not have enough time until 18:00.
I can do it tomorrow if it helps to solve the problem.

Regards
JP Pozzi

Comment by JP Pozzi [ 2022-10-03 ]

Hello,

I made the second machine in SYNC (without restoring) and I get the difference between the two servers :
Server 1 (the "master") :
---------------------------------------------------------------------------------------+

THREAD_ID LOCK_MODE LOCK_DURATION LOCK_TYPE TABLE_SCHEMA TABLE_NAME

---------------------------------------------------------------------------------------+

185148 MDL_SHARED_WRITE NULL Table metadata lock coyote id_suivi
185148 MDL_SHARED_WRITE NULL Table metadata lock coyote LOCAL_APPAREIL_M

---------------------------------------------------------------------------------------+
Sur le serveur 2 (the slave which is not working) :
---------------------------------------------------------------------------------------+

THREAD_ID LOCK_MODE LOCK_DURATION LOCK_TYPE TABLE_SCHEMA TABLE_NAME

---------------------------------------------------------------------------------------+

80803 MDL_EXCLUSIVE NULL Table metadata lock coyote id_suivi
80803 MDL_SHARED_WRITE NULL Table metadata lock coyote LOCAL_APPAREIL_M

---------------------------------------------------------------------------------------+

I hope it will help !

Regards

JP Pozzi

Comment by Angelique Sklavounos (Inactive) [ 2022-10-03 ]

Hi jppo,

Thanks for that. While it is active, could you please also do SHOW PROCESSLIST on both servers? And also SHOW SLAVE STATUS and SHOW VARIABLES?

The SQL and/or binary logs that show the queries that produce the locks would be useful as well, if available.

Thank you.

Comment by JP Pozzi [ 2022-10-03 ]

Hello,

I restart the slave and get a lock after 1 second, here is the log :
---------------------------------------------------------------------------------------+

THREAD_ID LOCK_MODE LOCK_DURATION LOCK_TYPE TABLE_SCHEMA TABLE_NAME

---------------------------------------------------------------------------------------+

81843 MDL_EXCLUSIVE NULL Table metadata lock coyote id_suivi
81843 MDL_SHARED_WRITE NULL Table metadata lock coyote LOCAL_APPAREIL_M
81845 MDL_SHARED_WRITE NULL Table metadata lock coyote LOCAL_APPAREIL_M

---------------------------------------------------------------------------------------+

The processlist shows the threads 81843 and 81845 "Waiting for prior transaction to commit" :
------------------------------------------------------------------------------------------------------------------------------------------------------------

Id User Host db Command Time State Info Progress

------------------------------------------------------------------------------------------------------------------------------------------------------------

4 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
1 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
3 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
2 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
5 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
7 event_scheduler localhost NULL Daemon 457721 Waiting on empty queue NULL 0.000
72150 replic_user 10.216.51.62:41696 NULL Binlog Dump 54169 Master has sent all binlog to slave; waiting for binlog to be updated NULL 0.000
81833 admin_coyote localhost mysql Query 0 Init show processlist 0.000
81840 system user   NULL Slave_IO 125 Waiting for master to send event NULL 0.000
81842 system user   NULL Slave_worker 125 Waiting for work from SQL thread NULL 0.000
81843 system user   NULL Slave_worker 123 Waiting for prior transaction to commit NULL 0.000
81845 system user   NULL Slave_worker 123 Waiting for prior transaction to commit NULL 0.000
81844 system user   NULL Slave_worker 123 Waiting for table metadata lock NULL 0.000
81841 system user   NULL Slave_SQL 123 Waiting for room in worker thread event queue NULL 0.000

------------------------------------------------------------------------------------------------------------------------------------------------------------

Regards
JP Pozzi

Comment by JP Pozzi [ 2022-10-03 ]

And the "show processlist" on the master :
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Id User Host db Command Time State Info Progress

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
2 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
3 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
4 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
5 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
7 event_scheduler localhost NULL Daemon 313594 Waiting on empty queue NULL 0.000
174975 system user   NULL Slave_IO 54442 Waiting for master to send event NULL 0.000
174977 system user   NULL Slave_worker 54442 Waiting for work from SQL thread NULL 0.000
174978 system user   NULL Slave_worker 54442 Waiting for work from SQL thread NULL 0.000
174979 system user   NULL Slave_worker 54442 Waiting for work from SQL thread NULL 0.000
174980 system user   NULL Slave_worker 54442 Waiting for work from SQL thread NULL 0.000
174976 system user   NULL Slave_SQL 54442 Slave has read all relay log; waiting for the slave I/O thread to update it NULL 0.000
185091 coyotadmin zvr-lmin001.preprod.abcdef.local:33240 coyote Sleep 60   NULL 0.000
185092 coyotadmin zvr-lmin001.preprod.abcdef.local:33242 coyote Sleep 0   NULL 0.000
185094 coyotadmin zvr-lmin001.preprod.abcdef.local:33250 coyote Sleep 4   NULL 0.000
185118 coyotadmin zvr-lmin001.preprod.abcdef.local:33642 coyote Sleep 16   NULL 0.000
185121 coyotadmin zvr-lmin001.preprod.abcdef.local:33654 coyote Sleep 0   NULL 0.000
185122 coyotadmin zvr-lmin001.preprod.abcdef.local:33662 coyote Sleep 0   NULL 0.000
185123 coyotadmin zvr-lmin001.preprod.abcdef.local:33664 coyote Sleep 0   NULL 0.000
185124 coyotadmin zvr-lmin001.preprod.abcdef.local:33672 coyote Sleep 0   NULL 0.000
185125 coyotadmin zvr-lmin001.preprod.abcdef.local:33678 coyote Sleep 0   NULL 0.000
185126 coyotadmin zvr-lmin001.preprod.abcdef.local:33680 coyote Sleep 0   NULL 0.000
185127 coyotadmin zvr-lmin001.preprod.abcdef.local:33682 coyote Sleep 0   NULL 0.000
185128 coyotadmin zvr-lmin001.preprod.abcdef.local:33694 coyote Sleep 0   NULL 0.000
185129 coyotadmin zvr-lmin001.preprod.abcdef.local:33698 coyote Query 0 Commit INSERT INTO CENTRE_POI_STAT_TEMPS_REEL ( ID_RADAR_M , PAYS , TYPE_POI , TYPE_INSERTION , IC 0.000
185131 coyotadmin zvr-lmin001.preprod.abcdef.local:33738 coyote Sleep 0   NULL 0.000
185132 coyotadmin zvr-lmin001.preprod.abcdef.local:33756 coyote Sleep 0   NULL 0.000
185135 coyotadmin zvr-lmin001.preprod.abcdef.local:33814 coyote Sleep 1   NULL 0.000
185136 coyotadmin zvr-lmin001.preprod.abcdef.local:33816 coyote Sleep 0   NULL 0.000
185137 coyotadmin zvr-lmin001.preprod.abcdef.local:33818 coyote Query 0 Commit INSERT INTO CENTRE_POI_STAT_TEMPS_REEL ( ID_RADAR_M , PAYS , TYPE_POI , TYPE_INSERTION , IC 0.000
185146 coyotadmin zvr-lmin001.preprod.abcdef.local:33968 coyote Sleep 0   NULL 0.000
185148 coyotadmin zvr-lmin001.preprod.abcdef.local:33990 coyote Query 0 Commit INSERT INTO CENTRE_POI_ZONE_RT_ELEMENT(ID_ZONE_RT, VITESSE, CAP, ZONE_LAT_A, ZONE_LON_A, ZONE_LAT_B, 0.000
185149 coyotadmin zvr-lmin001.preprod.abcdef.local:33992 coyote Sleep 0   NULL 0.000
185161 coyotadmin zvr-lrts032.preprod.abcdef.local:56712 coyote Sleep 39   NULL 0.000
185162 coyotadmin zvr-lrts032.preprod.abcdef.local:56718 coyote_poi Sleep 39   NULL 0.000
185163 coyotadmin zvr-lrts032.preprod.abcdef.local:56720 coyote Sleep 39   NULL 0.000
185168 coyotadmin zvr-lrts031.preprod.abcdef.local:59140 coyote Sleep 34   NULL 0.000
185169 coyotadmin zvr-lrts031.preprod.abcdef.local:59146 coyote_poi Sleep 34   NULL 0.000
185170 coyotadmin zvr-lrts031.preprod.abcdef.local:59148 coyote Sleep 34   NULL 0.000
185188 coyotadmin zvr-lrts032.preprod.abcdef.local:56754 coyote_poi Sleep 32   NULL 0.000
185192 coyotadmin zvr-lmin001.preprod.abcdef.local:34538 coyote Sleep 32   NULL 0.000
185216 coyotadmin 10.218.60.4:59070 coyote Sleep 26   NULL 0.000
185248 coyotadmin zvr-lmin001.preprod.abcdef.local:34628 coyote Sleep 18   NULL 0.000
185255 coyotadmin zvr-lmin001.preprod.abcdef.local:34634 coyote Sleep 16   NULL 0.000
185259 coyotadmin zvr-lrts031.preprod.abcdef.local:59210 coyote_poi Sleep 71   NULL 0.000
185299 coyotadmin zvr-lrts031.preprod.abcdef.local:59256 coyote_poi Sleep 71   NULL 0.000
185403 coyotadmin zvr-lrts032.preprod.abcdef.local:56948 coyote_poi Sleep 273   NULL 0.000
188133 coyotadmin zvr-lrts032.preprod.abcdef.local:59496 coyote Sleep 16   NULL 0.000
220686 coyotadmin zvr-lrts031.preprod.abcdef.local:36126 coyote Sleep 5   NULL 0.000
231455 coyotadmin zvr-lrts032.preprod.abcdef.local:43828 coyote_poi Sleep 46   NULL 0.000
234726 replic_user 10.216.51.61:56070 NULL Binlog Dump 397 Master has sent all binlog to slave; waiting for binlog to be updated NULL 0.000
235245 coyotadmin 10.221.131.229:59476 coyote Sleep 19   NULL 0.000
235298 coyotadmin zvr-lrts032.preprod.abcdef.local:47490 coyote Sleep 47   NULL 0.000
235313 coyotadmin zvr-lmin001.preprod.abcdef.local:60328 coyote Sleep 32   NULL 0.000
235338 admin_coyote localhost mysql Query 0 Init show processlist 0.000

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
55 rows in set (0,000 sec)

That'sall folks.

Comment by JP Pozzi [ 2022-10-03 ]

Slave state :

*************************** 1. row ***************************
               Connection_name: lmdb001
               Slave_SQL_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                Slave_IO_State: Waiting for master to send event
                   Master_Host: 10.216.51.61
                   Master_User: replic_user
                   Master_Port: 3306
                 Connect_Retry: 20
               Master_Log_File: mysql-bin.000021
           Read_Master_Log_Pos: 385
                Relay_Log_File: relay-bin-lmdb001.000002
                 Relay_Log_Pos: 684
         Relay_Master_Log_File: mysql-bin.000021
              Slave_IO_Running: Yes
             Slave_SQL_Running: Yes
               Replicate_Do_DB: centre,centre_eu,centre_poi,geocentre,LOG_TRAMES,mysql,test,traffic
           Replicate_Ignore_DB: 
            Replicate_Do_Table: 
        Replicate_Ignore_Table: 
       Replicate_Wild_Do_Table: 
   Replicate_Wild_Ignore_Table: 
                    Last_Errno: 0
                    Last_Error: 
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 385
               Relay_Log_Space: 995
               Until_Condition: None
                Until_Log_File: 
                 Until_Log_Pos: 0
            Master_SSL_Allowed: No
            Master_SSL_CA_File: 
            Master_SSL_CA_Path: 
               Master_SSL_Cert: 
             Master_SSL_Cipher: 
                Master_SSL_Key: 
         Seconds_Behind_Master: 0
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error: 
                Last_SQL_Errno: 0
                Last_SQL_Error: 
   Replicate_Ignore_Server_Ids: 
              Master_Server_Id: 201
                Master_SSL_Crl: 
            Master_SSL_Crlpath: 
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos: 0-201-834653345
       Replicate_Do_Domain_Ids: 
   Replicate_Ignore_Domain_Ids: 
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
              Slave_DDL_Groups: 3
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
          Retried_transactions: 0
            Max_relay_log_size: 1073741824
          Executed_log_entries: 62
     Slave_received_heartbeats: 7583
        Slave_heartbeat_period: 30.000
                Gtid_Slave_Pos: 0-202-860080803

Comment by JP Pozzi [ 2022-10-03 ]

The mariadb error log does not show anything about the problem and is very short.

Comment by JP Pozzi [ 2022-10-03 ]

List of variables sent to ftp as : MDEV-29621_variables.txt

Comment by Angelique Sklavounos (Inactive) [ 2022-10-03 ]

Thanks jppo. Would it be possible to get the relay log to see what the slave is trying to execute?

Comment by JP Pozzi [ 2022-10-03 ]

Hello,
Relay log transfered as MDEV-29621_relay-bin-lmdb002.000002.gz
Regards

Comment by Brandon Nesterenko [ 2022-10-03 ]

Hi jppo!

Good observation with the MDL lock difference. It looks like this may be a side-effect of MDEV-28487 (fixed in 10.4.26). Prior to the MDEV-28487 fix, sequence updates in row format would always be logged in minimal mode for binlog_row_image. Now that it can use full mode (the default), the MDL lock mode uses exclusive locking, which may be causing your issue. I'm going to continue to look deeper, but in the mean time, if your data meets the requirements of using minimal mode for binlog_row_image, switching to that may bypass your issue. Alternatively you can try 10.4.25.

The requirements, for easy reference:

Note that to safely change this setting from the default, the table
being replicated to must contain identical primary key definitions,
and columns must be present, in the same order, and use the same
data types as the original table. If these conditions are not met,
matches may not be correctly determined and updates and deletes
may diverge on the replica, with no warnings or errors returned.

Comment by JP Pozzi [ 2022-10-04 ]

Hello,

Thanks for the explanations.
I will try with 10.4.25, but I don't know how to get the server package for Centos7 X86-64.

Regards

Comment by JP Pozzi [ 2022-10-04 ]

Hello,
I try to downgrade the "slave" server to 10.4.25 with no result :
The lowks are always here with same characteristics :
{ HEAD_ID | LOCK_MODE | LOCK_DURATION | LOCK_TYPE | TABLE_SCHEMA | TABLE_NAME |
------------------------------------------------------------------------------------------+

50 MDL_EXCLUSIVE NULL Table metadata lock coyote id_suivi
50 MDL_SHARED_WRITE NULL Table metadata lock coyote LOCAL_APPAREIL_M

------------------------------------------------------------------------------------------+}}
And the slave is always "Waiting for room in worker thread event queue".

Will try to downgrade the master.

Regards
JP P

Comment by Brandon Nesterenko [ 2022-10-05 ]

Great! Thanks for the update. I'll continue working to fix the deadlock under this ticket, so you'll know which future release you can upgrade to next.

Comment by Brandon Nesterenko [ 2022-10-05 ]

Hi jppo,

Did you ever see the deadlock happen from the sequence id_alert? The relay log you provided only has updates using the sequence id_alert (I didn't see id_suivi used anywhere in that relay log). I recreated a similar event stream from that uploaded relay log; however could not recreate the deadlock. Do you have any binary or relay logs with the events that caused the deadlock, i.e. inserts that use the id_suivi sequence? A log that would come from 10.4.25 is fine. It also looks like the tables you are inserting into have triggers, would you be able to show them as well?

Thanks!

  • Brandon
Comment by JP Pozzi [ 2022-11-23 ]

Hello Brandon,
I had forgotten that problem as I was very busy on other subjects ...
It is very difficult to get that kind of information as the systems are working at a very high rate (30000 to 50000 update rows every minute and a little less inserts).

Comment by Andrei Elkin [ 2023-03-10 ]

In another customer report the following picture is presented.
A slave worker (thread id 125982 below) requests IX type of Global-read-lock which must be in conflict with
other worker locks, most probably with X type one (125979).
MDL locks and the processlist are as the following.

 
+-----------+-------------------------+---------------+---------------------+--------------+-------------------+
| THREAD_ID | LOCK_MODE               | LOCK_DURATION | LOCK_TYPE           | TABLE_SCHEMA | TABLE_NAME        |
+-----------+-------------------------+---------------+---------------------+--------------+-------------------+
|    125982 | MDL_INTENTION_EXCLUSIVE | NULL          | Global read lock    |              |                   |
|    125979 | MDL_INTENTION_EXCLUSIVE | NULL          | Commit lock         |              |                   |
|    125981 | MDL_INTENTION_EXCLUSIVE | NULL          | Commit lock         |              |                   |
|    125980 | MDL_INTENTION_EXCLUSIVE | NULL          | Commit lock         |              |                   |
|    403322 | MDL_SHARED_READ         | NULL          | Table metadata lock | mysql        | user              |
|    125981 | MDL_SHARED_WRITE        | NULL          | Table metadata lock | iwork        | pl                |
|    125978 | MDL_SHARED_READ         | NULL          | Table metadata lock | mysql        | gtid_slave_pos    |
|    125979 | MDL_SHARED_WRITE        | NULL          | Table metadata lock | mysql        | gtid_slave_pos    |
|    125981 | MDL_SHARED_WRITE        | NULL          | Table metadata lock | mysql        | gtid_slave_pos    |
|    125980 | MDL_SHARED_WRITE        | NULL          | Table metadata lock | mysql        | gtid_slave_pos    |
|    125981 | MDL_SHARED_WRITE        | NULL          | Table metadata lock | iwork        | audit_            |
|    125980 | MDL_SHARED_WRITE        | NULL          | Table metadata lock | iwork        | trade_explain     |
|    125979 | MDL_EXCLUSIVE           | NULL          | Table metadata lock | iwork        | seq_trade_explain |
+-----------+-------------------------+---------------+---------------------+--------------+-------------------+
 
+--------+--------------------+--------------------+-------+--------------+---------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
| Id     | User               | Host               | db    | Command      | Time    | State                                         | Info                                                                                                                                                                                                     | Progress |
+--------+--------------------+--------------------+-------+--------------+---------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+
|      1 | system user        |                    | NULL  | Daemon       |    NULL | InnoDB purge coordinator                      | NULL                                                                                                                                                                                                     |    0.000 |
|      2 | system user        |                    | NULL  | Daemon       |    NULL | InnoDB purge worker                           | NULL                                                                                                                                                                                                     |    0.000 |
|      4 | system user        |                    | NULL  | Daemon       |    NULL | InnoDB purge worker                           | NULL                                                                                                                                                                                                     |    0.000 |
|      3 | system user        |                    | NULL  | Daemon       |    NULL | InnoDB purge worker                           | NULL                                                                                                                                                                                                     |    0.000 |
|      5 | system user        |                    | NULL  | Daemon       |    NULL | InnoDB shutdown handler                       | NULL                                                                                                                                                                                                     |    0.000 |
| 125977 | system user        |                    | NULL  | Slave_IO     | 1779166 | Waiting for master to send event              | NULL                                                                                                                                                                                                     |    0.000 |
| 125980 | system user        |                    | NULL  | Slave_worker |   20315 | Waiting for prior transaction to commit       | NULL                                                                                                                                                                                                     |    0.000 |
| 125979 | system user        |                    | iwork | Slave_worker |   20315 | Waiting for prior transaction to commit       | COMMIT                                                                                                                                                                                                   |    0.000 |
| 125982 | system user        |                    | NULL  | Slave_worker |   20315 | Waiting for table metadata lock               | NULL                                                                                                                                                                                                     |    0.000 |
| 125981 | system user        |                    | NULL  | Slave_worker |   20315 | Waiting for prior transaction to commit       | NULL                                                                                                                                                                                                     |    0.000 |
| 125978 | system user        |                    | NULL  | Slave_SQL    |   20323 | Waiting for room in worker thread event queue | NULL                                                                                                                                                                                                     |    0.000 |
| 403322 | monuser            | 10.67.21.126:59496 | NULL  | Sleep        |     115 |                                               | NULL                                                                                                                                                                                                     |    0.000 |
| 405701 | tmtb_tableau_iwork | 10.68.22.59:58046  | iwork | Query        |   11282 | Waiting for table metadata lock               | SELECT TABLE_NAME, TABLE_COMMENT, TABLE_TYPE, TABLE_SCHEMA FROM ( SELECT * FROM INFORMATION_SCHEMA.TABLES  WHERE TABLE_SCHEMA LIKE 'iwork' AND ( TABLE_TYPE='BASE TABLE' OR TABLE_TYPE='VIEW' ) ) TABLES |    0.000 |
| 405792 | tmtb_tableau_iwork | 10.68.22.59:33726  | iwork | Query        |   10633 | Waiting for table metadata lock               | SELECT TABLE_NAME, TABLE_COMMENT, TABLE_TYPE, TABLE_SCHEMA FROM ( SELECT * FROM INFORMATION_SCHEMA.TABLES  WHERE TABLE_SCHEMA LIKE 'iwork' AND ( TABLE_TYPE='BASE TABLE' OR TABLE_TYPE='VIEW' ) ) TABLES |    0.000 |
| 406462 | tmtb_tableau_iwork | 10.68.22.6:56774   | iwork | Query        |    6400 | Waiting for table metadata lock               | SELECT TABLE_NAME, TABLE_COMMENT, TABLE_TYPE, TABLE_SCHEMA FROM ( SELECT * FROM INFORMATION_SCHEMA.TABLES  WHERE TABLE_SCHEMA LIKE 'iwork' AND ( TABLE_TYPE='BASE TABLE' OR TABLE_TYPE='VIEW' ) ) TABLES |    0.000 |
| 411908 | tmtb_tableau_iwork | 10.68.22.6:48418   | iwork | Query        |    1648 | Waiting for table metadata lock               | SELECT TABLE_NAME, TABLE_COMMENT, TABLE_TYPE, TABLE_SCHEMA FROM ( SELECT * FROM INFORMATION_SCHEMA.TABLES  WHERE TABLE_SCHEMA LIKE 'iwork' AND ( TABLE_TYPE='BASE TABLE' OR TABLE_TYPE='VIEW' ) ) TABLES |    0.000 |
| 412662 | tmtb_tableau_iwork | 10.68.22.59:36440  | iwork | Query        |     743 | Waiting for table metadata lock               | SELECT TABLE_NAME, TABLE_COMMENT, TABLE_TYPE, TABLE_SCHEMA FROM ( SELECT * FROM INFORMATION_SCHEMA.TABLES  WHERE TABLE_SCHEMA LIKE 'iwork' AND ( TABLE_TYPE='BASE TABLE' OR TABLE_TYPE='VIEW' ) ) TABLES |    0.000 |
| 413116 | myoper             | localhost          | NULL  | Query        |       0 | Init                                          | show full processlist                                                                                                                                                                                    |    0.000 |
+--------+--------------------+--------------------+-------+--------------+---------+-----------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+

For further analysis it's important to know what is the transaction that the slave worker 125982 is on.
The transaction must be next to the last executed one on slave (gtid_slave_pos).
When this sort of hang occurs next time we will need to know
1. all details of at least slave_parallel_workers + 1 transactions starting off gtid_slave_pos.
mysqlbinlog -v for those or the whole pertinent master binlog file should be very helpful.
with the n of at least the number of slave workers, and
2. table definitions of the transaction range tables.

Comment by Andrei Elkin [ 2023-03-14 ]

Binlog and schema are provided.

Comment by Andrei Elkin [ 2023-03-18 ]

The 1st part of the fixes proves the fixes with the NEW master -> NEW slave.
It's ready for review at bb-10.4-andrei.

You can start on reviewing while the 2nd part that provides OLD -> NEW is at construction/testing.

Comment by Brandon Nesterenko [ 2023-03-29 ]

Finished first round of review, left a few notes on commits bae30af and bdeaee3

Comment by Brandon Nesterenko [ 2023-04-12 ]

Yes, I reviewed the patch already. It was mostly in good standing, and I imagine should be in the next release.

Comment by Brandon Nesterenko [ 2023-04-20 ]

Approved the latest commit

Comment by Angelique Sklavounos (Inactive) [ 2023-04-27 ]

Testing with the fixes looked okay for pushing.

Generated at Thu Feb 08 10:10:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.