[MDEV-29006] Mariadb replication hangs Created: 2022-07-02  Updated: 2022-09-25

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.6.8
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Yevgeny Kosarzhevsky Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Centos 7


Attachments: Text File mariadb_gdb.txt     Text File mariadb_gdb2.txt     Text File mariadb_gdb3.txt    

 Description   

Hello.

I have random hangs of replication on slave servers.
I cannot kill SQL thread or restart mariadb when the issue arises.

Here is some strace output. Nothing else happens.

[pid 29054] futex(0x556d11c326f0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 33579] futex(0x556d10554d04, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1215783, {tv_sec=1656738020, tv_nsec=74607000}, 0xffffffff <unfinished ...>
[pid 29054] <... futex resumed>)        = 0
[pid 29054] clock_gettime(CLOCK_REALTIME, {tv_sec=1656738019, tv_nsec=721983121}) = 0
[pid 29054] futex(0x556d11c3fa04, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 874299, {tv_sec=1656738079, tv_nsec=721983121}, 0xffffffff <unfinished ...>
[pid 33579] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid 33579] clock_gettime(CLOCK_REALTIME, {tv_sec=1656738020, tv_nsec=74849558}) = 0
[pid 33579] futex(0x556d10554d40, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 33579] futex(0x556d10554d04, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1215785, {tv_sec=1656738020, tv_nsec=474915000}, 0xffffffff <unfinished ...>
[pid 34951] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid 34951] tgkill(33578, 33711, SIGUSR1) = 0
[pid 33711] <... futex resumed>)        = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 34951] futex(0x556d131be2c8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 33711] --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_TKILL, si_pid=33578, si_uid=995} ---
[pid 34951] <... futex resumed>)        = 0
[pid 33711] rt_sigreturn({mask=[HUP INT QUIT PIPE ALRM TERM TSTP]} <unfinished ...>
[pid 34951] futex(0x556d131be334, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 801, {tv_sec=1656738022, tv_nsec=109540000}, 0xffffffff <unfinished ...>
[pid 33711] <... rt_sigreturn resumed>) = -1 EINTR (Interrupted system call)
[pid 33711] futex(0x7f7798006b78, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 33579] <... futex resumed>)        = -1 ETIMEDOUT (Connection timed out)
[pid 33579] clock_gettime(CLOCK_REALTIME, {tv_sec=1656738020, tv_nsec=475173531}) = 0
[pid 33579] futex(0x556d10554d40, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 33579] futex(0x556d10554d04, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1215787, {tv_sec=1656738020, tv_nsec=721893000}, 0xffffffff^Cstrace: Process 33578 detached



 Comments   
Comment by Sergei Golubchik [ 2022-07-08 ]

could you instead of strace connect with gdb and run thread apply all bt full ?

Comment by Yevgeny Kosarzhevsky [ 2022-07-10 ]

Attached gdb output

Comment by Yevgeny Kosarzhevsky [ 2022-07-17 ]

When this happens, I get system process in 'Closing tables' status:

MariaDB [(none)]> show processlist;
+---------+-------------+-----------+------+-----------+-------+----------------+------------------+----------+
| Id      | User        | Host      | db   | Command   | Time  | State          | Info             | Progress |
+---------+-------------+-----------+------+-----------+-------+----------------+------------------+----------+
|       5 | system user |           | NULL | Slave_SQL | 49559 | closing tables | NULL             |    0.000 |
| 4975818 | root        | localhost | NULL | Query     |     0 | starting       | show processlist |    0.000 |
+---------+-------------+-----------+------+-----------+-------+----------------+------------------+----------+
2 rows in set (0.000 sec)

After trying to kill this process it's showing as killed:

MariaDB [(none)]> kill 5;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [(none)]> show processlist;
+---------+-------------+-----------+------+---------+-------+----------------+------------------+----------+
| Id      | User        | Host      | db   | Command | Time  | State          | Info             | Progress |
+---------+-------------+-----------+------+---------+-------+----------------+------------------+----------+
|       5 | system user |           | NULL | Killed  | 49581 | closing tables | NULL             |    0.000 |
| 4975818 | root        | localhost | NULL | Query   |     0 | starting       | show processlist |    0.000 |
+---------+-------------+-----------+------+---------+-------+----------------+------------------+----------+
2 rows in set (0.000 sec)

Shutdown command or term signal don't terminate the process. It keeps running. After term signal I get a single line in logs:

[Note] /usr/sbin/mariadbd (initiated by: unknown): Normal shutdown

Then I have to kill it with KILL signal as nothing happens for hours.

Comment by Sergei Golubchik [ 2022-07-25 ]

Did your attached gdb output correspond to that moment when the connection id 5 (which is a Slave_SQL thread, according to the first show processlist) is closing tables?

In the gdb output the slave sql thread (Thread 229 in gdb) doesn't seem to be closing tables, it's applying a Write_rows_log_event.

Comment by Yevgeny Kosarzhevsky [ 2022-07-27 ]

Hello.

Here is newly collected gdb output with all debug symbols loaded.
It represents this issue. mariadb_gdb2.txt

Slave_SQL_Running_State: closing tables

# mysqladmin processlist
+----------+-------------+-----------------------+----+-------------+---------+---------------------------------------------------------------+------------------+----------+
| Id       | User        | Host                  | db | Command     | Time    | State                                                         | Info             | Progress |
+----------+-------------+-----------------------+----+-------------+---------+---------------------------------------------------------------+------------------+----------+
| 5        | system user |                       |    | Slave_IO    | 4819484 | Waiting for master to send event                              |                  | 0.000    |
| 6        | system user |                       |    | Slave_SQL   | 28409   | closing tables                                                |                  | 0.000    |
| 442      | replication | 192.168.156.157:52836 |    | Binlog Dump | 4819314 | Master has sent all binlog to slave; waiting for more updates |                  | 0.000    |
| 12170878 | root        | localhost             |    | Query       | 0       | starting                                                      | show processlist | 0.000    |
+----------+-------------+-----------------------+----+-------------+---------+---------------------------------------------------------------+------------------+----------+

Comment by Yevgeny Kosarzhevsky [ 2022-07-27 ]

After I had sent TERM to mariadbd, it stayed in sleep state. I add one more output reflecting this.
mariadb_gdb3.txt

Comment by Yevgeny Kosarzhevsky [ 2022-09-25 ]

After downgrading to 10.6.7 the issue isn't observed anymore.

Generated at Thu Feb 08 10:05:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.