[MDEV-25611] RESET MASTER still causes the server to hang - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.5, 10.6, 10.11, 11.1(EOL), 11.2(EOL), 11.4, 11.5(EOL), 11.6(EOL)
Fix Version/s: 11.4.4
Component/s: Replication
Labels:

Description

10.6 2ceadb390309f9c29786bb5ec2a6459dfe88344e
versioning.rpl 'innodb,row,trx_id' w3 [ fail ] timeout after 900 seconds
Test ended at 2021-05-06 15:23:30

Test case timeout after 900 seconds

== D:/winx64-debug/build/mysql-test/var/3/log/rpl.log ==
include/master-slave.inc

== D:/winx64-debug/build/mysql-test/var/3/tmp/analyze-timeout-mysqld.2.err ==
SHOW PROCESSLIST;
Id User Host db Command Time State Info Progress
16 root localhost:38509 NULL Query 0 starting SHOW PROCESSLIST 0.000

== D:/winx64-debug/build/mysql-test/var/3/tmp/analyze-timeout-mysqld.1.err ==
SHOW PROCESSLIST;
Id User Host db Command Time State Info Progress
15 root localhost:37368 test Query 960 starting RESET MASTER 0.000
17 root localhost:38510 NULL Query 0 starting SHOW PROCESSLIST 0.000

mysqltest failed but provided no output

Among the stack traces, no InnoDB code is executing (only buf_flush_page_cleaner() is in an untimed wait). The culprit for the hang seems to be a thread that is executing RESET MASTER:

10.6 2ceadb390309f9c29786bb5ec2a6459dfe88344e
server!pthread_cond_wait(struct _RTL_CONDITION_VARIABLE * cond = 0x00007fff`622137a8, struct _RTL_CRITICAL_SECTION * mutex = 0x00007fff`62213778) [D:\winx64-debug\build\src\mysys\my_wincond.c @ 91]
server!psi_cond_wait(struct st_mysql_cond * that = 0x00007fff`622137a8, struct st_mysql_mutex * mutex = 0x00007fff`62213778, char * file = 0x00007fff`60671dc8 "D:\winx64-debug\build\src\sql\log.cc", unsigned int line = 0x10f8) [D:\winx64-debug\build\src\mysys\my_thr_init.c @ 596]
server!inline_mysql_cond_wait(struct st_mysql_cond * that = 0x00007fff`622137a8, struct st_mysql_mutex * mutex = 0x00007fff`62213778, char * src_file = 0x00007fff`60671dc8 "D:\winx64-debug\build\src\sql\log.cc", unsigned int src_line = 0x10f8) [D:\winx64-debug\build\src\include\mysql\psi\mysql_thread.h @ 1070]
server!MYSQL_BIN_LOG::reset_logs(class THD * thd = 0x00000227`de152788, bool create_new_log = true, struct rpl_gtid * init_state = 0x00000000`00000000, unsigned int init_state_len = 0, unsigned long next_log_number = 0) [D:\winx64-debug\build\src\sql\log.cc @ 4345]
server!reset_master(class THD * thd = 0x00000227`de152788, struct rpl_gtid * init_state = 0x00000000`00000000, unsigned int init_state_len = 0, unsigned long next_log_number = 0) [D:\winx64-debug\build\src\sql\sql_repl.cc @ 3966]
server!reload_acl_and_cache(class THD * thd = 0x00000227`de152788, unsigned int64 options = 0x80, struct TABLE_LIST * tables = 0x00000000`00000000, int * write_to_binlog = 0x00000063`731edb74) [D:\winx64-debug\build\src\sql\sql_reload.cc @ 362]
server!mysql_execute_command(class THD * thd = 0x00000227`de152788) [D:\winx64-debug\build\src\sql\sql_parse.cc @ 5452]
server!mysql_parse(class THD * thd = 0x00000227`de152788, char * rawbuf = 0x00000227`de1ca690 "--- memory read error at address 0x00000227`de1ca690 ---", unsigned int length = 0xc, class Parser_state * parser_state = 0x00000063`731ee950) [D:\winx64-debug\build\src\sql\sql_parse.cc @ 8018]
server!dispatch_command(enum_server_command command = COM_QUERY (0n3), class THD * thd = 0x00000227`de152788, char * packet = 0x00000227`ddfceec9 "--- memory read error at address 0x00000227`ddfceec9 ---", unsigned int packet_length = 0xc, bool blocking = false) [D:\winx64-debug\build\src\sql\sql_parse.cc @ 1899]
server!do_command(class THD * thd = 0x00000227`de152788, bool blocking = false) [D:\winx64-debug\build\src\sql\sql_parse.cc @ 1406]
server!threadpool_process_request(class THD * thd = 0x00000227`de152788) [D:\winx64-debug\build\src\sql\threadpool_common.cc @ 394]
server!tp_callback(struct TP_connection * c = 0x00000227`ddf80e30) [D:\winx64-debug\build\src\sql\threadpool_common.cc @ 203]
server!tp_callback(struct _TP_CALLBACK_INSTANCE * instance = 0x00000063`731ef968, void * context = 0x00000227`ddf80e30) [D:\winx64-debug\build\src\sql\threadpool_win.cc @ 279]
server!io_completion_callback(struct _TP_CALLBACK_INSTANCE * instance = 0x00000063`731ef968, void * context = 0x00000227`ddf80e30, void * overlapped = 0x00000227`ddf80ea0, unsigned long io_result = 0, unsigned int64 nbytes = 0x11, struct _TP_IO * io = 0x00000227`de05a5e0) [D:\winx64-debug\build\src\sql\threadpool_win.cc @ 299]

I think that also 10.5 must be affected by this. On older server versions, it is probably harder to reproduce this, because we never observed ~~MDEV-24302~~ on older versions than 10.5. This could correlate with scheduling improvements in 10.5 (~~MDEV-23855~~) and 10.6 (~~MDEV-20612~~, ~~MDEV-25404~~); earlier, there could have been more scattered calls to write InnoDB log that would wake up RESET MASTER from its wait.

Attachments

Issue Links

causes

MDEV-35197 Wrong locking order of LOCK_log and LOCK_global_system_variables, potential deadlock of server

Closed

is blocked by

MDEV-24302 RESET MASTER hangs as Innodb does not report on binlog checkpoint

Closed

is duplicated by

MDEV-34827 Various issues when using RESET MASTER and innodb_flush_log_at_timeout

Closed

relates to

MDEV-14462 Confusing error message: ib_logfiles are too small for innodb_thread_concurrency=0

Closed

MDEV-25948 Remove log_flush_task

Closed

MDEV-18959 Engine transaction recovery through persistent binlog

Stalled

MDEV-24660 MYSQL_BIN_LOG::cleanup(): Assertion `b->xid_count == 0' failed in MYSQL_BIN_LOG::cleanup on SHUTDOWN

Closed

(2 relates to)

Activity

Ascending order - Click to sort in descending order

View 19 older comments

Roel Van de Paar added a comment - 2024-06-11 07:36 - edited

I found the following today:

# mysqld options required: --log-bin

SET GLOBAL innoDB_flush_log_at_timeout=600;

CREATE TEMPORARY TABLE t (c INT) ENGINE=InnoDB;

RESET MASTER;  # Always "hangs" here

And/or

# mysqld options required: --log-bin

SET GLOBAL innoDB_flush_log_at_timeout=600;

CREATE TABLE t (c INT) ENGINE=InnoDB;

RESET MASTER;  # Occassionally "hangs" here, though often completes immediately and then...

RESET MASTER;  # ...occassionally "hangs" here, or...

RESET MASTER;  # ...here

# If it still does not hang, wait one second, and retry the command: it will hang

I can try and transform this into an MTR testcase if needed.
"hangs" : as long as innoDB_flush_log_at_timeout is set to:

10.11.8 64314d3094ef5eb52223c613632292be96bd7732 (Optimized)
10.11.8-opt>RESET MASTER; # Always hangs here
Query OK, 0 rows affected (9 min 59.897 sec)

CLI direct replay works as shown. No replication needed. Only a single server instance is needed.

Also see ~~MDEV-34827~~ which has more findings about this area (now marked as duplicate)

Roel Van de Paar added a comment - 2024-06-11 07:36 - edited I found the following today: # mysqld options required: --log-bin SET GLOBAL innoDB_flush_log_at_timeout=600; CREATE TEMPORARY TABLE t (c INT ) ENGINE=InnoDB; RESET MASTER; # Always "hangs" here And/or # mysqld options required: --log-bin SET GLOBAL innoDB_flush_log_at_timeout=600; CREATE TABLE t (c INT ) ENGINE=InnoDB; RESET MASTER; # Occassionally "hangs" here, though often completes immediately and then ... RESET MASTER; # ...occassionally "hangs" here, or ... RESET MASTER; # ...here # If it still does not hang, wait one second , and retry the command: it will hang I can try and transform this into an MTR testcase if needed. "hangs" : as long as innoDB_flush_log_at_timeout is set to: 10.11.8 64314d3094ef5eb52223c613632292be96bd7732 (Optimized) 10.11.8-opt>RESET MASTER; # Always hangs here Query OK, 0 rows affected (9 min 59.897 sec) CLI direct replay works as shown. No replication needed. Only a single server instance is needed. Also see MDEV-34827 which has more findings about this area (now marked as duplicate)

Kristian Nielsen added a comment - 2024-09-18 12:21 - edited

Here is a proposed fix for this bug:

https://github.com/MariaDB/server/commit/f8e77f4553640426bdff1e46ce60f07e549651a6

commit f8e77f4553640426bdff1e46ce60f07e549651a6 (HEAD -> knielsen_mdev30889_max_global_memory_used, origin/knielsen_mdev30889_max_global_memory_used)

Author: Kristian Nielsen <knielsen@knielsen-hq.org>

Date:   Wed Sep 18 14:19:52 2024 +0200

    MDEV-25611: RESET MASTER causes the server to hang

    RESET MASTER waits for storage engines to reply to a binlog checkpoint

    requests. If this response is delayed for a long time for some reason, then

    RESET MASTER can hang.

    Fix this by forcing a log sync in all engines just before waiting for the

    checkpoint reply.

    (Waiting for old checkpoint responses is needed to preserve durability of

    any commits that were synced to disk in the to-be-deleted binlog but not yet

    synced in the engine.)

    Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>

diff --git a/sql/log.cc b/sql/log.cc

index 6a59dd1f062..4929a6baf0c 100644

--- a/sql/log.cc

+++ b/sql/log.cc

@@ -4435,6 +4435,9 @@ bool MYSQL_BIN_LOG::reset_logs(THD *thd, bool create_new_log,

     mark_xids_active(current_binlog_id, 1);

     do_checkpoint_request(current_binlog_id);

+    /* Flush all engine logs to force checkpoint responses to come through. */

+    ha_flush_logs();

     /* Now wait for all checkpoint requests and pending unlog() to complete. */

     mysql_mutex_lock(&LOCK_xid_list);

     for (;;)

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc

index 34fc2904cc8..666e6667009 100644

--- a/storage/innobase/handler/ha_innodb.cc

+++ b/storage/innobase/handler/ha_innodb.cc

@@ -1583,7 +1583,7 @@ innobase_start_trx_and_assign_read_view(

 @return false */

 static bool innobase_flush_logs(handlerton*)

-  if (!srv_read_only_mode && srv_flush_log_at_trx_commit)

+  if (!srv_read_only_mode)

     /* Write any outstanding redo log. Durably if

     innodb_flush_log_at_trx_commit=1. */

     log_buffer_flush_to_disk(srv_flush_log_at_trx_commit == 1);

Kristian Nielsen added a comment - 2024-09-18 12:21 - edited Here is a proposed fix for this bug: https://github.com/MariaDB/server/commit/f8e77f4553640426bdff1e46ce60f07e549651a6 commit f8e77f4553640426bdff1e46ce60f07e549651a6 (HEAD -> knielsen_mdev30889_max_global_memory_used, origin/knielsen_mdev30889_max_global_memory_used) Author: Kristian Nielsen <knielsen@knielsen-hq.org> Date: Wed Sep 18 14:19:52 2024 +0200 MDEV-25611: RESET MASTER causes the server to hang RESET MASTER waits for storage engines to reply to a binlog checkpoint requests. If this response is delayed for a long time for some reason, then RESET MASTER can hang. Fix this by forcing a log sync in all engines just before waiting for the checkpoint reply. (Waiting for old checkpoint responses is needed to preserve durability of any commits that were synced to disk in the to-be-deleted binlog but not yet synced in the engine.) Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org> diff --git a/sql/log.cc b/sql/log.cc index 6a59dd1f062..4929a6baf0c 100644 --- a/sql/log.cc +++ b/sql/log.cc @@ -4435,6 +4435,9 @@ bool MYSQL_BIN_LOG::reset_logs(THD *thd, bool create_new_log, mark_xids_active(current_binlog_id, 1); do_checkpoint_request(current_binlog_id); + /* Flush all engine logs to force checkpoint responses to come through. */ + ha_flush_logs(); + /* Now wait for all checkpoint requests and pending unlog() to complete. */ mysql_mutex_lock(&LOCK_xid_list); for (;;) diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index 34fc2904cc8..666e6667009 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -1583,7 +1583,7 @@ innobase_start_trx_and_assign_read_view( @return false */ static bool innobase_flush_logs(handlerton*) { - if (!srv_read_only_mode && srv_flush_log_at_trx_commit) + if (!srv_read_only_mode) /* Write any outstanding redo log. Durably if innodb_flush_log_at_trx_commit=1. */ log_buffer_flush_to_disk(srv_flush_log_at_trx_commit == 1);

Kristian Nielsen added a comment - 2024-09-19 06:48

I suggest to fix this from 11.4. The bug has minor user impact, unlikely to cause a problem in real life, and not worth the risk to introduce regressions in old GA versions.

For testing <11.4 versions, implement testing so that it does not combine RESET MASTER with high value of innoDB_flush_log_at_timeout; the hang does not trigger with standard settings of this variable as then the innodb redo log is flushed regularly.

Kristian Nielsen added a comment - 2024-09-19 06:48 I suggest to fix this from 11.4. The bug has minor user impact, unlikely to cause a problem in real life, and not worth the risk to introduce regressions in old GA versions. For testing <11.4 versions, implement testing so that it does not combine RESET MASTER with high value of innoDB_flush_log_at_timeout; the hang does not trigger with standard settings of this variable as then the innodb redo log is flushed regularly.

Andrei Elkin added a comment - 2024-09-20 13:06

knielsen, agreed. Thanks for the contribution!

Andrei Elkin added a comment - 2024-09-20 13:06 knielsen , agreed. Thanks for the contribution!

Kristian Nielsen added a comment - 2024-09-27 13:11

Pushed to 11.4

Kristian Nielsen added a comment - 2024-09-27 13:11 Pushed to 11.4

MariaDB Server

RESET MASTER still causes the server to hang

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration