MariaDB Server / MDEV-20247

Replication hangs with "preparing" and never starts

Details

    Description

      Hi,

      maybe first the setup: given are two FreeBSD nodes running MariaDB 10.4.6 in a master <-> master replication setup. Care has been taken that both nodes never insert into the same DB, so that's no problem.
      Additionally there's a slave running Devuan ASCII (Debian Jessie), also running MariaDB 10.4.6, installed from the repository you provide.
      The slave isn't running 24x7 and is primarily used for offsite backups.

      The Devuan installation is fresh; since you no longer provide an init script and expect everyone to be running systemd, the init script from a previous version has been taken.
      Once the installation was completed, the following steps were taken (additionally on the masters):
      renaming the 'root' user to another one, dumping the mysql.user event and correcting the definer to match the new 'root' user.
      At one of the masters everything except the mysql DB has been dumped with:

      mysqldump -u blah -p --master-data=1 --databases db1 db2 db3 > dump.sql
      

      That dump.sql has been copied to the slave and source'd there, including the "CHANGE MASTER" statements. The host IP, username and password have been set and a "START SLAVE" has been triggered.
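
      The steps above amount to roughly the following on the slave (a sketch; the host and user come from this report, the password is a placeholder, and the binlog coordinates come from the --master-data=1 header inside dump.sql):

      -- dump.sql already contains CHANGE MASTER TO MASTER_LOG_FILE/MASTER_LOG_POS
      -- because it was taken with --master-data=1:
      SOURCE dump.sql;

      -- Supply the connection details that are not part of the dump:
      CHANGE MASTER TO
        MASTER_HOST = 'master_host_ip',
        MASTER_USER = 'local-dbrep',
        MASTER_PASSWORD = '...';

      START SLAVE;
      SHOW SLAVE STATUS \G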

      Then the fun starts as replication on the slave never does anything:

      *************************** 1. row ***************************
                      Slave_IO_State: NULL
                         Master_Host: master_host_ip
                         Master_User: local-dbrep
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mysql-bin.000084
                 Read_Master_Log_Pos: 1917045
                      Relay_Log_File: slave-relay.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File: mysql-bin.000084
                    Slave_IO_Running: Preparing
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: mysql,mysql
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: db1.session,db2.session
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 0
                          Last_Error: 
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 1917045
                     Relay_Log_Space: 512
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 0
                      Last_SQL_Error: 
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 1
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: No
                         Gtid_IO_Pos: 
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      The log reports that the slave has connected to the master; at the master, the log reports a connection from the slave.
      A SHOW FULL PROCESSLIST reports:

      *************************** 1. row ***************************
            Id: 1
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge worker
          Info: NULL
      Progress: 0.000
      *************************** 2. row ***************************
            Id: 2
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge worker
          Info: NULL
      Progress: 0.000
      *************************** 3. row ***************************
            Id: 3
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge worker
          Info: NULL
      Progress: 0.000
      *************************** 4. row ***************************
            Id: 4
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB purge coordinator
          Info: NULL
      Progress: 0.000
      *************************** 5. row ***************************
            Id: 5
          User: system user
          Host:
            db: NULL
       Command: Daemon
          Time: NULL
         State: InnoDB shutdown handler
          Info: NULL
      Progress: 0.000
      *************************** 6. row ***************************
            Id: 42
          User: my_new_root_user
          Host: localhost
            db: NULL
       Command: Query
          Time: 0
         State: Init
          Info: show full processlist
      Progress: 0.000
      *************************** 7. row ***************************
            Id: 43
          User: system user
          Host:
            db: NULL
       Command: Connect
          Time: 690
         State: NULL
          Info: NULL
      Progress: 0.000
      *************************** 8. row ***************************
            Id: 44
          User: system user
          Host:
            db: NULL
       Command: Slave_SQL
          Time: 690
         State: Slave has read all relay log; waiting for the slave I/O thread to update it
          Info: NULL
      Progress: 0.000
      8 rows in set (0.000 sec)
      

      No replication happens at the slave at all, while everything between the two masters works fine.

      Manually issuing a "STOP SLAVE;" at the slave hangs and nothing happens. Additionally, while the slave is still running, a "mysqlcheck mysql --auto-repair --optimize" hangs at the "mysql.gtid_slave_pos" table.

      Additionally, trying to shut down MariaDB via "mysqladmin shutdown" gets stuck; strace doesn't show anything, and the processlist shows "flushing relay log". After being kept running for 30 minutes, the process was killed manually.

      The entire setup has been working with MariaDB 10.1.2 and it's guaranteed to work with MySQL 8.0 without any problem at all.

      Attachments

        1. gdb_out_1047.txt
          54 kB
        2. my.cnf
          2 kB
        3. mysqld.err
          0.4 kB
        4. slave_db.err
          3 kB
        5. slave_my.cnf
          2 kB
        6. trace
          11 kB

        Issue Links

          Activity

            phobos Phobos added a comment - - edited

            Well,

            sorry for the confusion. I've updated my master servers to 10.4.6, then the slave, and then the slave within a few days to 10.4.7, and then noticed the problem. So yes, chances are good it was introduced with 10.4.7.
            Since your Debian repos don't provide previous versions, I haven't been able to downgrade.
            About the thread concurrency mentioned above: I've been able to reproduce the problem; once the option is given a value != 0, the problem starts to happen.
            Setting it to 0 or commenting it out, everything starts working again as expected.

            Changing this setting was only required on the slave; there has been no configuration change at the master servers.
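
            For reference, the workaround described above corresponds to the following fragment in the slave's my.cnf (a sketch; 0 restores the default, i.e. unlimited InnoDB thread concurrency):

            [mysqld]
            # 0 = unlimited (the default); any value != 0 triggered the hang on 10.4.7
            innodb_thread_concurrency = 0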

            atom Ahmet added a comment - - edited

            Hey phobos,
            my problem occurred going from 10.4.6 to 10.4.7. After a downgrade to 10.4.6, everything worked well again.

            BTW there is a source you can use. See: https://downloads.mariadb.org/mariadb/10.4.6/
            You can just add:

            deb [arch=amd64,ppc64el,arm64] http://mirrors.n-ix.net/mariadb/mariadb-10.4.6/repo/ubuntu/ xenial main
            

            to /etc/apt/sources.list. After that, you can downgrade with:

            apt install mariadb-server=1:10.4.6+maria~xenial mariadb-server-10.4=1:10.4.6+maria~xenial mariadb-client=1:10.4.6+maria~xenial mariadb-client-10.4=1:10.4.6+maria~xenial mariadb-server-core-10.4=1:10.4.6+maria~xenial mariadb-client-core-10.4=1:10.4.6+maria~xenial mysql-common=1:10.4.6+maria~xenial mariadb-client=1:10.4.6+maria~xenial libmariadb3=1:10.4.6+maria~xenial
            

            And finally, don't forget to hold MariaDB:

            apt-mark hold mariadb-server*
            

            In my case, replication is working again. But an update to 10.4.7 is not an option for me until the problem is fixed.


            lukealexander Luke Alexander added a comment -

            Just to add more weight to this: we recently upgraded our slave from 10.4.6 to 10.4.7 on Ubuntu Xenial, actually in the hope of fixing another issue (https://jira.mariadb.org/browse/MDEV-20115). Instead we noticed replication lag steadily increasing; both slave SQL/IO threads appeared to be OK (not stuck on `Preparing`), but the lag continued to increase. I had a hunch it was likely to do with the upgrade and then stumbled across this post. After downgrading the slave, replication started flowing again. I've attached a gdb output of the process in case it's of use: gdb_out_1047.txt
            alice Alice Sherepa added a comment -

            mtr test:

            --source include/have_innodb.inc
             
            --connect (master1,127.0.0.1,root,,,$SERVER_MYPORT_1)
            --connect (master2,127.0.0.1,root,,,$SERVER_MYPORT_2)
             
            #setup master-master replication
            --connection master1
             
            ALTER TABLE mysql.gtid_slave_pos engine=innodb;
            GRANT REPLICATION SLAVE ON *.* TO 'root'@'127.0.0.1';
            --replace_result $SERVER_MYPORT_2 MYPORT_2
            eval CHANGE MASTER TO master_port=$SERVER_MYPORT_2, master_host='127.0.0.1', master_user='root';
            start slave;
            --source include/wait_for_slave_to_start.inc
             
            --connection master2
             
            ALTER TABLE mysql.gtid_slave_pos engine=innodb;
            GRANT REPLICATION SLAVE ON *.* TO 'root'@'127.0.0.1';
            --replace_result $SERVER_MYPORT_1 MYPORT_1
            eval CHANGE MASTER TO master_port=$SERVER_MYPORT_1, master_host='127.0.0.1', master_user='root';
            start slave;
            --source include/wait_for_slave_to_start.inc
             
            --connection master1
            stop slave;
            --source include/wait_for_slave_to_stop.inc
             
            # restart server with innodb-thread-concurrency!=0
            --connection master2
            -- let $restart_parameters=--innodb-thread-concurrency=8
            -- source include/restart_mysqld.inc
             
            start slave;
             
            --source include/wait_for_slave_sql_to_start.inc
            --vertical_results
            show slave status;
            --source include/wait_for_slave_io_to_start.inc 
             
            stop slave;
            --source include/wait_for_slave_to_stop.inc
            --disconnect master1
            --disconnect master2
            

            !include suite/rpl/my.cnf
             
            [mysqld.1]
            log-bin=master-bin
             
            [mysqld.2]
            log-bin=master-bin
             
            [ENV]
            SERVER_MYPORT_1= @mysqld.1.port
            SERVER_MYPORT_2= @mysqld.2.port
            

            # restart: --innodb-thread-concurrency=8
            start slave;
            include/wait_for_slave_sql_to_start.inc
            show slave status;
            Slave_IO_State	NULL
            Master_Host	127.0.0.1
            Master_User	root
            Master_Port	16000
            Connect_Retry	60
            Master_Log_File	master-bin.000001
            Read_Master_Log_Pos	661
            Relay_Log_File	slave-relay-bin.000003
            Relay_Log_Pos	4
            Relay_Master_Log_File	master-bin.000001
            Slave_IO_Running	Preparing
            Slave_SQL_Running	Yes
            
            


            faust Faustin Lammler added a comment -

            Thanks thiru for your report!

            Just a little clarification regarding the init script:
            > ... since you no longer provide an init script and expect everyone is running systemd, the init script from a previous version has been taken.

            We still provide an init script in `/usr/share/mysql`:
            See https://github.com/MariaDB/server/commit/a4cc6fb91f3855e0ed803a6e1762440cfcf8cb5c

            People

              thiru Thirunarayanan Balathandayuthapani
              phobos Phobos
              Votes: 9
              Watchers: 23

