  1. MariaDB Server
  2. MDEV-24296

Stuck replication - Slave SQL thread is blocked by Update_rows_log_event::find_row(-1)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 10.4.12
    • Fix Version/s: N/A
    • Component/s: Replication
    • Labels:
    • Environment:
      OS RHEL 7.6, kernel: 3.10.0-957.el7.x86_64 #1 SMP Thu Oct 4 20:48:51 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

      Description

      Replication is stuck: the Slave SQL thread is blocked in the "Update_rows_log_event::find_row(-1)" state while executing an SQL command.

      Environment parameters of the Primary and Replica servers
      OS: RHEL 7.6
      Kernel: 3.10.0-957.el7.x86_64
      MariaDB version: MariaDB-server-10.4.12-1.el7
      DB size: 6.1TB
      Binary log format: mixed

      State before the issue and on the day of the issue:

      • replication had been running without problems for several weeks
      • replication was stopped for several days due to SW maintenance on the Primary side
      • nothing changed on the Primary DB; new data were still being inserted or modified
      • replication was to be restarted after several days
      • on the Primary DB there were 13 GB of changes in binlogs waiting for replication
      • replication was started (start slave;)
      • replication began by downloading changes from the Primary binlogs to the Replica relay logs
      • the first changes were applied properly (replication executed several positions from the relay log)
      • but after some time the replication stopped executing changes from the relay logs
      • the processlist shows that the Slave_SQL command is in state: Update_rows_log_event::find_row(-1)
      • stopping and starting the slave did not resolve the issue
      • killing this Slave SQL command and then starting the slave did not resolve the issue; the same state appeared again
      • execution of the relay logs stopped at binlog position 902554004
      • extract from the slave status command

        Master_Log_File: bin.000098
        Read_Master_Log_Pos: 575635222
        Exec_Master_Log_Pos: 902554004
        Seconds_Behind_Master: 1911134
        Slave_SQL_Running_State: Update_rows_log_event::find_row(-1)
        Slave_DDL_Groups: 0
        Slave_Non_Transactional_Groups: 0
        Slave_Transactional_Groups: 1
        

      • at relay log position 902554004 there is the following SQL ALTER TABLE command

        ...
        #201102 16:50:53 server id 71  end_log_pos 902553870 CRC32 0x8c017e3b   GTID 0-71-7704 ddl
        /*!100001 SET @@session.gtid_seq_no=7704*//*!*/;
        # at 109931178
        #201102 16:50:53 server id 71  end_log_pos 902554004 CRC32 0xa0601914   Query   thread_id=25616 exec_time=1     error_code=0
        SET TIMESTAMP=1604332253/*!*/;
        ALTER TABLE values MODIFY COLUMN node varchar(9) NOT NULL
        /*!*/;
        # at 109931312
        #201102 16:51:45 server id 71  end_log_pos 902554046 CRC32 0xf9ccef43   GTID 0-71-7705 trans
        ...
        

      • the size of the processed table is 175MB
      • at the same time as this replication was started, another replication between two other DB servers was also started with the same parameters and version, and it shows no problem; the only difference is that the DB size in that environment is 500GB
      • I found that a similar issue was reported in MDEV-20398, but it should be fixed in versions 10.4.8+ and I am using 10.4.12
      • no issues were found in the MariaDB log file
      • I am not aware of any unsafe SQL commands being used with mixed binary logging instead of the safest row-based binary logging
      • in the first two comments below I am sending further info: full slave status, full processlist, full InnoDB engine status, global status, global variables and show create table info
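
      To put the numbers from the slave status extract above into perspective, here is a minimal Python sketch that parses the "Key: value" lines and converts Seconds_Behind_Master into days. The helper name parse_status is hypothetical and exists only for this illustration; the values are copied from the extract in this report.

```python
# Values copied verbatim from the slave status extract in this report.
STATUS_EXTRACT = """
Master_Log_File: bin.000098
Read_Master_Log_Pos: 575635222
Exec_Master_Log_Pos: 902554004
Seconds_Behind_Master: 1911134
Slave_SQL_Running_State: Update_rows_log_event::find_row(-1)
"""


def parse_status(text):
    """Turn 'Key: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so '::' inside the value is untouched.
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value
    return fields


status = parse_status(STATUS_EXTRACT)
lag_days = int(status["Seconds_Behind_Master"]) / 86400  # seconds per day

print(f"lag: {lag_days:.1f} days")
print(f"SQL thread state: {status['Slave_SQL_Running_State']}")
```

      With the values above, the reported lag works out to roughly 22 days.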

      Do you see any issue in the relay log?
      Do you have any tip on what to check next?
      Do you have any idea what the reason for the issue might be?
      Could the issue be caused by some hiccup that damaged the binlog, the relay log, or some internal DB record of the processed command?
      Is it possible that MDEV-20398 is not fixed in version 10.4.12, or have I hit a different cause with the same symptom?
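
      As a possible aid for reading the relay log extract, the following Python sketch lists the event headers from the mysqlbinlog output quoted in this report and looks up which event begins at a given position. It rests on the assumption that end_log_pos marks the offset where the next event starts, so an Exec_Master_Log_Pos of 902554004 would point at the event after the ALTER TABLE; this reading is not confirmed anywhere in this report. The helper names parse_events and event_starting_at are hypothetical.

```python
import re

# Event header lines copied from the mysqlbinlog extract in this report.
BINLOG_EXTRACT = """\
#201102 16:50:53 server id 71  end_log_pos 902553870 CRC32 0x8c017e3b   GTID 0-71-7704 ddl
#201102 16:50:53 server id 71  end_log_pos 902554004 CRC32 0xa0601914   Query   thread_id=25616 exec_time=1     error_code=0
#201102 16:51:45 server id 71  end_log_pos 902554046 CRC32 0xf9ccef43   GTID 0-71-7705 trans
"""

# Matches a header line and captures end_log_pos plus the event description.
HEADER = re.compile(
    r"^#\d{6}\s+[\d:]+\s+server id \d+\s+end_log_pos (\d+)\s+CRC32 \S+\s+(.*)$"
)


def parse_events(text):
    """Return a list of (end_log_pos, description) tuples (hypothetical helper)."""
    events = []
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            events.append((int(match.group(1)), match.group(2).strip()))
    return events


def event_starting_at(events, pos):
    """Return the description of the event that follows the one ending at pos.

    Assumption: an event's end_log_pos is the start offset of the next event.
    """
    for (end_pos, _), (_, next_desc) in zip(events, events[1:]):
        if end_pos == pos:
            return next_desc
    return None


events = parse_events(BINLOG_EXTRACT)
print(event_starting_at(events, 902554004))
```

      Under that assumption, the lookup returns the header of the transaction with GTID 0-71-7705 and not the ALTER TABLE itself, so the events inside that transaction may be worth inspecting as well.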

      Thank you

            People

            Assignee:
            alice Alice Sherepa
            Reporter:
            TomasK Tomas Kucera
            Votes:
            0
            Watchers:
            2
