MariaDB Server / MDEV-20247

Replication hangs with "preparing" and never starts

Details

    Description

      Hi,

      maybe first the setup: given are two FreeBSD nodes running MariaDB 10.4.6 in a master <-> master replication setup. Care has been taken that both nodes never insert into the same DB, so that's no problem.
      Additionally there's a slave running Devuan ASCII (Debian Jessie), also running MariaDB 10.4.6, installed from the repository you provide.
      The slave isn't running 24x7 and is primarily used for offsite backups.

      The Devuan installation is fresh; since you no longer provide an init script and expect everyone to be running systemd, the init script from a previous version has been taken.
      Once the installation was completed, the following steps were taken (additionally on the masters):
      renaming the 'root' user to another one, dumping the mysql.user event and correcting the definer to match the new 'root' user.
      At one of the masters everything except the mysql DB has been dumped with:

      mysqldump -u blah -p --master-data=1 --databases db1 db2 db3 > dump.sql
      

      That dump.sql has been copied to the slave and source'd there, including the "CHANGE MASTER" statements. The host IP, username and password have been set and a "START SLAVE" has been triggered.
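
      The steps above amount to roughly the following on the slave (a sketch; the host and user come from this report, the password is a placeholder, and the binlog coordinates come from the --master-data=1 header inside dump.sql):

      -- dump.sql already contains CHANGE MASTER TO MASTER_LOG_FILE/MASTER_LOG_POS
      -- because it was taken with --master-data=1:
      SOURCE dump.sql;

      -- Supply the connection details that are not part of the dump:
      CHANGE MASTER TO
        MASTER_HOST = 'master_host_ip',
        MASTER_USER = 'local-dbrep',
        MASTER_PASSWORD = '...';

      START SLAVE;
      SHOW SLAVE STATUS \G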

      Then the fun starts as replication on the slave never does anything:

      *************************** 1. row ***************************
                      Slave_IO_State: NULL
                         Master_Host: master_host_ip
                         Master_User: local-dbrep
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mysql-bin.000084
                 Read_Master_Log_Pos: 1917045
                      Relay_Log_File: slave-relay.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File: mysql-bin.000084
                    Slave_IO_Running: Preparing
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: mysql,mysql
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: db1.session,db2.session
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 0
                          Last_Error: 
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 1917045
                     Relay_Log_Space: 512
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 0
                      Last_SQL_Error: 
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 1
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: No
                         Gtid_IO_Pos: 
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      The log reports that the slave has connected to the master; at the master, the log reports a connection from the slave.
      A SHOW FULL PROCESSLIST reports:

      *************************** 1. row ***************************
            Id: 1
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge worker
          Info: NULL
      Progress: 0.000
      *************************** 2. row ***************************
            Id: 2
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge worker
          Info: NULL
      Progress: 0.000
      *************************** 3. row ***************************
            Id: 3
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge worker
          Info: NULL
      Progress: 0.000
      *************************** 4. row ***************************
            Id: 4
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge coordinator
          Info: NULL
      Progress: 0.000
      *************************** 5. row ***************************
            Id: 5
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB shutdown handler
          Info: NULL
      Progress: 0.000
      *************************** 6. row ***************************
            Id: 42
          User: my_new_root_user
          Host: localhost
            db: NULL
       Command: Query
          Time: 0
         State: Init
          Info: show full processlist
      Progress: 0.000
      *************************** 7. row ***************************
            Id: 43
          User: system user
          Host:
            db: NULL
       Command: Connect
          Time: 690
         State: NULL
          Info: NULL
      Progress: 0.000
      *************************** 8. row ***************************
            Id: 44
          User: system user
          Host:
            db: NULL
       Command: Slave_SQL
          Time: 690
         State: Slave has read all relay log; waiting for the slave I/O thread to update it
          Info: NULL
      Progress: 0.000
      8 rows in set (0.000 sec)
      

      No replication happens at the slave at all, while everything between the two masters works fine.

      Manually issuing a "STOP SLAVE;" at the slave hangs and nothing happens. Additionally, while the slave is still running, a "mysqlcheck mysql --auto-repair --optimize" hangs at the "mysql.gtid_slave_pos" table.

      Additionally, trying to shut down MariaDB via "mysqladmin shutdown" gets stuck; strace doesn't show anything, and the processlist shows "flushing relay log". After being kept running for 30 minutes, the process was killed manually.

      The entire setup has been working with MariaDB 10.1.2 and it's guaranteed to work with MySQL 8.0 without any problem at all.

      Attachments

        1. gdb_out_1047.txt
          54 kB
        2. my.cnf
          2 kB
        3. mysqld.err
          0.4 kB
        4. slave_db.err
          3 kB
        5. slave_my.cnf
          2 kB
        6. trace
          11 kB

        Issue Links

          Activity

            phobos Phobos added a comment - - edited

            Well,

            sorry for the confusion. I've updated my master servers to 10.4.6, then the slave, and then the slave within a few days to 10.4.7, and then noticed the problem. So yes, chances are good it was introduced with 10.4.7.
            Since your Debian repos don't provide previous versions, I haven't been able to downgrade.
            About the thread concurrency mentioned above: I've been able to reproduce the problem; once the option is given a value != 0, the problem starts to happen.
            Setting it to 0 or commenting it out, everything starts working again as expected.

            Changing this setting was only required on the slave; there has been no configuration change at the master servers.
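
            For reference, the workaround described above corresponds to the following fragment in the slave's my.cnf (a sketch; 0 restores the default, i.e. unlimited InnoDB thread concurrency):

            [mysqld]
            # 0 = unlimited (the default); any value != 0 triggered the hang on 10.4.7
            innodb_thread_concurrency = 0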

            atom Ahmet added a comment - - edited

            Hey phobos,
            my problem occurred going from 10.4.6 to 10.4.7. After a downgrade to 10.4.6, everything worked well again.

            BTW there is a source you can use. See: https://downloads.mariadb.org/mariadb/10.4.6/
            You can just add:

            deb [arch=amd64,ppc64el,arm64] http://mirrors.n-ix.net/mariadb/mariadb-10.4.6/repo/ubuntu/ xenial main
            

            to /etc/apt/sources.list. After that, you can downgrade with:

            apt install mariadb-server=1:10.4.6+maria~xenial mariadb-server-10.4=1:10.4.6+maria~xenial mariadb-client=1:10.4.6+maria~xenial mariadb-client-10.4=1:10.4.6+maria~xenial mariadb-server-core-10.4=1:10.4.6+maria~xenial mariadb-client-core-10.4=1:10.4.6+maria~xenial mysql-common=1:10.4.6+maria~xenial mariadb-client=1:10.4.6+maria~xenial libmariadb3=1:10.4.6+maria~xenial
            

            And finally, don't forget to hold MariaDB:

            apt-mark hold mariadb-server*
            

            In my case, replication is working again. But an update to 10.4.7 is not an option for me until the problem is fixed.


            lukealexander Luke Alexander added a comment -

            Just to add more weight to this: we recently upgraded our slave from 10.4.6 to 10.4.7 on Ubuntu Xenial, actually in the hope of fixing another issue (https://jira.mariadb.org/browse/MDEV-20115). Instead we noticed replication lag steadily increasing; both slave SQL/IO threads appeared to be OK (not stuck on `Preparing`), but the lag continued to increase. I had a hunch it was likely to do with the upgrade and then stumbled across this post. After downgrading the slave, replication started flowing again. I've attached a gdb output of the process in case it's of use: gdb_out_1047.txt
            alice Alice Sherepa added a comment -

            mtr test:

            --source include/have_innodb.inc
             
            --connect (master1,127.0.0.1,root,,,$SERVER_MYPORT_1)
            --connect (master2,127.0.0.1,root,,,$SERVER_MYPORT_2)
             
            #setup master-master replication
            --connection master1
             
            ALTER TABLE mysql.gtid_slave_pos engine=innodb;
            GRANT REPLICATION SLAVE ON *.* TO 'root'@'127.0.0.1';
            --replace_result $SERVER_MYPORT_2 MYPORT_2
            eval CHANGE MASTER TO master_port=$SERVER_MYPORT_2, master_host='127.0.0.1', master_user='root';
            start slave;
            --source include/wait_for_slave_to_start.inc
             
            --connection master2
             
            ALTER TABLE mysql.gtid_slave_pos engine=innodb;
            GRANT REPLICATION SLAVE ON *.* TO 'root'@'127.0.0.1';
            --replace_result $SERVER_MYPORT_1 MYPORT_1
            eval CHANGE MASTER TO master_port=$SERVER_MYPORT_1, master_host='127.0.0.1', master_user='root';
            start slave;
            --source include/wait_for_slave_to_start.inc
             
            --connection master1
            stop slave;
            --source include/wait_for_slave_to_stop.inc
             
            # restart server with innodb-thread-concurrency!=0
            --connection master2
            -- let $restart_parameters=--innodb-thread-concurrency=8
            -- source include/restart_mysqld.inc
             
            start slave;
             
            --source include/wait_for_slave_sql_to_start.inc
            --vertical_results
            show slave status;
            --source include/wait_for_slave_io_to_start.inc 
             
            stop slave;
            --source include/wait_for_slave_to_stop.inc
            --disconnect master1
            --disconnect master2
            

            !include suite/rpl/my.cnf
             
            [mysqld.1]
            log-bin=master-bin
             
            [mysqld.2]
            log-bin=master-bin
             
            [ENV]
            SERVER_MYPORT_1= @mysqld.1.port
            SERVER_MYPORT_2= @mysqld.2.port
            

            # restart: --innodb-thread-concurrency=8
            start slave;
            include/wait_for_slave_sql_to_start.inc
            show slave status;
            Slave_IO_State	NULL
            Master_Host	127.0.0.1
            Master_User	root
            Master_Port	16000
            Connect_Retry	60
            Master_Log_File	master-bin.000001
            Read_Master_Log_Pos	661
            Relay_Log_File	slave-relay-bin.000003
            Relay_Log_Pos	4
            Relay_Master_Log_File	master-bin.000001
            Slave_IO_Running	Preparing
            Slave_SQL_Running	Yes
            
            


            faust Faustin Lammler added a comment -

            Thanks thiru for your report!

            Just a little clarification regarding the init script:
            > ... since you no longer provide an init script and expect everyone is running systemd, the init script from a previous version has been taken.

            We still provide an init script in `/usr/share/mysql`:
            See https://github.com/MariaDB/server/commit/a4cc6fb91f3855e0ed803a6e1762440cfcf8cb5c

            People

              thiru Thirunarayanan Balathandayuthapani
              phobos Phobos
              Votes: 9
              Watchers: 23

