[MDEV-24625] Failure to open binary log does not cause a fatal error, but leads to further assorted sporadic problems Created: 2021-01-19  Updated: 2023-10-02

Status: Stalled
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.4
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Brandon Nesterenko
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-30409 SIGSEGV in __memmove_avx_unaligned_er... Open

 Description   

Reported for 10.4 because that's the branch I was dealing with. I don't have information whether other versions behave the same way.

The initial subject of investigation was a cluster of assorted sporadic assertion failures happening in concurrent tests in a specific environment with binary logging enabled.

All failures were united by the fact that they happened immediately or very soon after an error like this appeared in the error log:

2021-01-18 20:22:41 15 [ERROR] Could not use mysql-bin for logging (error 2). Turning logging off for the whole duration of the MariaDB server process. To turn it on again: fix the cause, shutdown the MariaDB server and restart it.
2021-01-18 20:22:41 15 [ERROR] Could not use ./mysql-bin.000002 for logging (error 0). Turning logging off for the whole duration of the MariaDB server process. To turn it on again: fix the cause, shutdown the MariaDB server and restart it.

The root cause, in this case, turned out to be an error in wolfSSL code which led to a failing call to my_random_bytes in MYSQL_BIN_LOG::open which, in turn, led to the binlog opening error. However, while looking into it, I found the handling of binlog opening errors on MariaDB side be rather questionable.

The ERROR records are written as a part of error handling in MYSQL_BIN_LOG::open, after which the function returns with non-success.
Turning binary logging off can be a pretty big deal, people don't enable it without a reason. Yet, MariaDB takes the situation rather lightly. There are about 15 points of failure in MYSQL_BIN_LOG::open alone, so it's far from a unique situation; and all of them seemingly lead to the same outcome, and no fatal errors or even debug assertions happen at this point.

However, shortly after this various problems start happening. As a side effect, at least the following assertion failures occur sporadically, interchangeably, and sometimes simultaneously:

mysqld: /data/src/10.4-bug/sql/sql_class.cc:6560: int THD::binlog_write_row(TABLE*, bool, const uchar*): Assertion `((__builtin_expect(((WSREP_ON_) != 0),0) && this->variables.wsrep_on) && wsrep_emulate_bin_log) || mysql_bin_log.is_open()' failed.
 
#7  0x00007f84d7d4af36 in __GI___assert_fail (assertion=0x55f48fdf7338 "((__builtin_expect(((WSREP_ON_) != 0),0) && this->variables.wsrep_on) && wsrep_emulate_bin_log) || mysql_bin_log.is_open()", file=0x55f48fdf53d0 "/data/src/10.4-bug/sql/sql_class.cc", line=6560, function=0x55f48fdf72b8 "int THD::binlog_write_row(TABLE*, bool, const uchar*)") at assert.c:101
#8  0x000055f48f09ce82 in THD::binlog_write_row (this=0x7f8450000d90, table=0x7f84500c86a0, is_trans=false, record=0x7f84500ca158 "\200\b") at /data/src/10.4-bug/sql/sql_class.cc:6560
#9  0x000055f48f48cb39 in Write_rows_log_event::binlog_row_logging_function (thd=0x7f8450000d90, table=0x7f84500c86a0, is_transactional=false, before_record=0x0, after_record=0x7f84500ca158 "\200\b") at /data/src/10.4-bug/sql/log_event.h:4806
#10 0x000055f48f485005 in binlog_log_row_internal (table=0x7f84500c86a0, before_record=0x0, after_record=0x7f84500ca158 "\200\b", log_func=0x55f48f48ca84 <Write_rows_log_event::binlog_row_logging_function(THD*, TABLE*, bool, unsigned char const*, unsigned char const*)>) at /data/src/10.4-bug/sql/handler.cc:6422
#11 0x000055f48f485110 in binlog_log_row (table=0x7f84500c86a0, before_record=0x0, after_record=0x7f84500ca158 "\200\b", log_func=0x55f48f48ca84 <Write_rows_log_event::binlog_row_logging_function(THD*, TABLE*, bool, unsigned char const*, unsigned char const*)>) at /data/src/10.4-bug/sql/handler.cc:6442
#12 0x000055f48f486847 in handler::ha_write_row (this=0x7f84500c9968, buf=0x7f84500ca158 "\200\b") at /data/src/10.4-bug/sql/handler.cc:6760
#13 0x000055f48f0bfc0b in write_record (thd=0x7f8450000d90, table=0x7f84500c86a0, info=0x7f84d46b1670) at /data/src/10.4-bug/sql/sql_insert.cc:2060
#14 0x000055f48f0f8120 in read_sep_field (thd=0x7f8450000d90, info=..., table_list=0x7f8450012300, fields_vars=..., set_fields=..., set_values=..., read_info=..., enclosed=..., skip_lines=0, ignore_check_option_errors=true) at /data/src/10.4-bug/sql/sql_load.cc:1164
#15 0x000055f48f0f65d0 in mysql_load (thd=0x7f8450000d90, ex=0x7f8450012268, table_list=0x7f8450012300, fields_vars=..., set_fields=..., set_values=..., handle_duplicates=DUP_ERROR, ignore=true, read_file_from_client=false) at /data/src/10.4-bug/sql/sql_load.cc:669
#16 0x000055f48f10d1cc in mysql_execute_command (thd=0x7f8450000d90) at /data/src/10.4-bug/sql/sql_parse.cc:4981

mysqld: /data/src/10.4-bug/mysys/mf_iocache.c:598: _my_b_write: Assertion `Count >= rest_length' failed.
 
#7  0x00007fb1910def36 in __GI___assert_fail (assertion=0x55d12c50507f "Count >= rest_length", file=0x55d12c504e38 "/data/src/10.4-bug/mysys/mf_iocache.c", line=598, function=0x55d12c5055e0 <__PRETTY_FUNCTION__.14595> "_my_b_write") at assert.c:101
#8  0x000055d12c05abab in _my_b_write (info=0x55d12cc117c0 <mysql_bin_log+928>, Buffer=0x7fb17974cb90 "\252\304\005`\242o", Count=19) at /data/src/10.4-bug/mysys/mf_iocache.c:598
#9  0x000055d12c05903f in my_b_write (info=0x55d12cc117c0 <mysql_bin_log+928>, Buffer=0x7fb17974cb90 "\252\304\005`\242o", Count=19) at /data/src/10.4-bug/include/my_sys.h:548
#10 0x000055d12c05ce6a in my_b_safe_write (info=0x55d12cc117c0 <mysql_bin_log+928>, Buffer=0x7fb17974cb90 "\252\304\005`\242o", Count=19) at /data/src/10.4-bug/mysys/mf_iocache.c:1854
#11 0x000055d12b926bf9 in Log_event_writer::write_internal (this=0x7fb17974cc80, pos=0x7fb17974cb90 "\252\304\005`\242o", len=19) at /data/src/10.4-bug/sql/log_event.cc:1600
#12 0x000055d12b926eec in Log_event_writer::encrypt_and_write (this=0x7fb17974cc80, pos=0x7fb17974cb90 "\252\304\005`\242o", len=19) at /data/src/10.4-bug/sql/log_event.cc:1646
#13 0x000055d12b9271b7 in Log_event_writer::write_header (this=0x7fb17974cc80, pos=0x7fb17974cb90 "\252\304\005`\242o", len=19) at /data/src/10.4-bug/sql/log_event.cc:1689
#14 0x000055d12b9276cd in Log_event::write_header (this=0x7fb17974cd70, event_data_length=19) at /data/src/10.4-bug/sql/log_event.cc:1782
#15 0x000055d12b934322 in Gtid_log_event::write (this=0x7fb17974cd70) at /data/src/10.4-bug/sql/log_event.cc:8061
#16 0x000055d12b6dd371 in Log_event_writer::write (this=0x7fb17974cc80, ev=0x7fb17974cd70) at /data/src/10.4-bug/sql/log_event.h:5236
#17 0x000055d12b90d95f in MYSQL_BIN_LOG::write_event (this=0x55d12cc11420 <mysql_bin_log>, ev=0x7fb17974cd70, cache_data=0x0, file=0x55d12cc117c0 <mysql_bin_log+928>) at /data/src/10.4-bug/sql/log.cc:5373
#18 0x000055d12b365ba1 in MYSQL_BIN_LOG::write_event (this=0x55d12cc11420 <mysql_bin_log>, ev=0x7fb17974cd70) at /data/src/10.4-bug/sql/log.h:820
#19 0x000055d12b9100a4 in MYSQL_BIN_LOG::write_gtid_event (this=0x55d12cc11420 <mysql_bin_log>, thd=0x7fb11c000d90, standalone=true, is_transactional=false, commit_id=0) at /data/src/10.4-bug/sql/log.cc:6100
#20 0x000055d12b910dd6 in MYSQL_BIN_LOG::write (this=0x55d12cc11420 <mysql_bin_log>, event_info=0x7fb17974d1a0, with_annotate=0x0) at /data/src/10.4-bug/sql/log.cc:6420
#21 0x000055d12b3fcd7c in THD::binlog_query (this=0x7fb11c000d90, qtype=THD::STMT_QUERY_TYPE, query_arg=0x7fb11c3cbd40 "CREATE DEFINER=`rqg`@`localhost` PROCEDURE IF NOT EXISTS `sp_1068769`()\nBEGIN INSERT INTO seq8 VALUES (-64, 113, 33309, -7346215417171410944, 6, 61, 0, 0) ; FLUSH PRIVILEGES ; END", query_len=179, is_trans=false, direct=false, suppress_use=false, errcode=0) at /data/src/10.4-bug/sql/sql_class.cc:7226
#22 0x000055d12b98e516 in Sp_handler::sp_create_routine (this=0x55d12ca030f0 <sp_handler_procedure>, thd=0x7fb11c000d90, sp=0x7fb11c66eab8) at /data/src/10.4-bug/sql/sp.cc:1465

mysqld: /data/src/10.4-bug/sql/log.cc:7988: void MYSQL_BIN_LOG::trx_group_commit_leader(MYSQL_BIN_LOG::group_commit_entry*): Assertion `is_open()' failed.
 
#7  0x00007ff5632cbf36 in __GI___assert_fail (assertion=0x558f7dc4e79a "is_open()", file=0x558f7dc4c82c "/data/src/10.4-bug/sql/log.cc", line=7988, function=0x558f7dc506f0 "void MYSQL_BIN_LOG::trx_group_commit_leader(MYSQL_BIN_LOG::group_commit_entry*)") at assert.c:101
#8  0x0000558f7d22f42d in MYSQL_BIN_LOG::trx_group_commit_leader (this=0x558f7e52b420 <mysql_bin_log>, leader=0x7ff54c717e80) at /data/src/10.4-bug/sql/log.cc:7988
#9  0x0000558f7d22ed17 in MYSQL_BIN_LOG::write_transaction_to_binlog_events (this=0x558f7e52b420 <mysql_bin_log>, entry=0x7ff54c717e80) at /data/src/10.4-bug/sql/log.cc:7819
#10 0x0000558f7d22e066 in MYSQL_BIN_LOG::write_transaction_to_binlog (this=0x558f7e52b420 <mysql_bin_log>, thd=0x7ff4a4000d90, cache_mngr=0x7ff4a4349fe0, end_ev=0x7ff54c718020, all=true, using_stmt_cache=true, using_trx_cache=true) at /data/src/10.4-bug/sql/log.cc:7440
#11 0x0000558f7d21cdf9 in binlog_flush_cache (thd=0x7ff4a4000d90, cache_mngr=0x7ff4a4349fe0, end_ev=0x7ff54c718020, all=true, using_stmt=true, using_trx=true) at /data/src/10.4-bug/sql/log.cc:1788
#12 0x0000558f7d21d43d in binlog_commit_flush_xid_caches (thd=0x7ff4a4000d90, cache_mngr=0x7ff4a4349fe0, all=true, xid=3034) at /data/src/10.4-bug/sql/log.cc:1896
#13 0x0000558f7d234cde in MYSQL_BIN_LOG::log_and_order (this=0x558f7e52b420 <mysql_bin_log>, thd=0x7ff4a4000d90, xid=3034, all=true, need_prepare_ordered=false, need_commit_ordered=true) at /data/src/10.4-bug/sql/log.cc:9779
#14 0x0000558f7d0ee59e in ha_commit_trans (thd=0x7ff4a4000d90, all=true) at /data/src/10.4-bug/sql/handler.cc:1646
#15 0x0000558f7cf22719 in trans_commit_implicit (thd=0x7ff4a4000d90) at /data/src/10.4-bug/sql/transaction.cc:301
#16 0x0000558f7cd80eb6 in mysql_execute_command (thd=0x7ff4a4000d90) at /data/src/10.4-bug/sql/sql_parse.cc:3745
#17 0x0000558f7cd8f087 in mysql_parse (thd=0x7ff4a4000d90, rawbuf=0x7ff4a4012128 "ALTER /* QNO 476 CON_ID 22 */ ONLINE IGNORE TABLE `t8_Aria` DROP FOREIGN KEY e", length=78, parser_state=0x7ff54c7195a0, is_com_multi=false, is_next_command=false) at /data/src/10.4-bug/sql/sql_parse.cc:7958

mysqld: /home/elenst/src/10.4/sql/log.cc:5976: int MYSQL_BIN_LOG::flush_and_set_pending_rows_event(THD*, Rows_log_event*, bool): Assertion `((thd && (__builtin_expect(((WSREP_ON_) != 0),0) && thd->variables.wsrep_on)) && wsrep_emulate_bin_log) || mysql_bin_log.is_open()' failed.
 
#7  0x00007f8db228c0d2 in __assert_fail () from /lib64/libc.so.6
#8  0x00005622dac87d39 in MYSQL_BIN_LOG::flush_and_set_pending_rows_event (this=0x5622dc2cff00 <mysql_bin_log>, thd=0x7f8d08000af0, event=0x7f8d080fd3e0, is_transactional=false) at /home/elenst/src/10.4/sql/log.cc:5976
#9  0x00005622da78552e in THD::binlog_prepare_pending_rows_event<Write_rows_log_event> (this=0x7f8d08000af0, table=0x7f8d080334c0, serv_id=111, needed=18, is_transactional=false, hint=0x0) at /home/elenst/src/10.4/sql/sql_class.cc:6419
#10 0x00005622da77d28b in THD::binlog_write_row (this=0x7f8d08000af0, table=0x7f8d080334c0, is_trans=false, record=0x7f8d080f9bb8 "\370\006\002") at /home/elenst/src/10.4/sql/sql_class.cc:6589
#11 0x00005622dab67d4e in Write_rows_log_event::binlog_row_logging_function (thd=0x7f8d08000af0, table=0x7f8d080334c0, is_transactional=false, before_record=0x0, after_record=0x7f8d080f9bb8 "\370\006\002") at /home/elenst/src/10.4/sql/log_event.h:4806
#12 0x00005622dab602f3 in binlog_log_row_internal (table=0x7f8d080334c0, before_record=0x0, after_record=0x7f8d080f9bb8 "\370\006\002", log_func=0x5622dab67ca0 <Write_rows_log_event::binlog_row_logging_function(THD*, TABLE*, bool, unsigned char const*, unsigned char const*)>) at /home/elenst/src/10.4/sql/handler.cc:6422
#13 0x00005622dab603f5 in binlog_log_row (table=0x7f8d080334c0, before_record=0x0, after_record=0x7f8d080f9bb8 "\370\006\002", log_func=0x5622dab67ca0 <Write_rows_log_event::binlog_row_logging_function(THD*, TABLE*, bool, unsigned char const*, unsigned char const*)>) at /home/elenst/src/10.4/sql/handler.cc:6442
#14 0x00005622dab61c2f in handler::ha_write_row (this=0x7f8d08034328, buf=0x7f8d080f9bb8 "\370\006\002") at /home/elenst/src/10.4/sql/handler.cc:6760
#15 0x00005622da79f9c1 in write_record (thd=0x7f8d08000af0, table=0x7f8d080334c0, info=0x7f8d9c550630) at /home/elenst/src/10.4/sql/sql_insert.cc:1736
#16 0x00005622da7d7635 in read_sep_field (thd=0x7f8d08000af0, info=..., table_list=0x7f8d08012060, fields_vars=..., set_fields=..., set_values=..., read_info=..., enclosed=..., skip_lines=0, ignore_check_option_errors=false) at /home/elenst/src/10.4/sql/sql_load.cc:1164
#17 0x00005622da7d5b09 in mysql_load (thd=0x7f8d08000af0, ex=0x7f8d08011fc8, table_list=0x7f8d08012060, fields_vars=..., set_fields=..., set_values=..., handle_duplicates=DUP_REPLACE, ignore=false, read_file_from_client=false) at /home/elenst/src/10.4/sql/sql_load.cc:671

Probably there are many more of them, and these are just most frequent or most obvious. And sometimes none of these occur, and the concurrent test "successfully" reaches the end.

It is not in itself surprising that all these things fail when the binary log previously failed to open. However, the behavior is largely inconsistent.


- if all these are debug-only errors which can never affect release builds, then they are bogus assertions which should be eased;

- assertions with a complex condition, where the check for binary log being open is only a part of it, should probably be revised and maybe split;

- assertions which only check that the binary log is open should probably be replaced or at least accompanied by a debug assertion in MYSQL_BIN_LOG::open error handling, so that the initial problem is caught timely and deterministically, instead of relying on aftermath checks;

- finally, if failures upon these checks indicate a real probable or inevitable non-debug problem, then maybe the whole idea of turning off binary logging at runtime should be reconsidered. After all, it's better if the server shuts down gracefully right upon binlog opening failure, than crash or corrupt data later at random times in an obscure way.



 Comments   
Comment by Elena Stepanova [ 2021-01-19 ]

While the initial encryption failure is unrelated to binary logging problems, or even to MariaDB code in general, I don't want notes about it to be lost, as it's a real problem which can affect real users if they are unlucky enough to have a similar environment. It can also help us investigate similar problems later.

From slack (abridged, the subject is wolfSSL submodule in 10.4 branch):

elenst wrote:

it looks like a rather simple race between 2 threads in Hash_gen

thread 140345438934784: in Hash_DRBG_Generate: calling Hash_gen with reseedCtr 10651, lastBlock -1383005185, matchCount 0
thread 140345438934784: in Hash_gen: i is 0, len is 1
thread 140345438934784: in Hash_gen: calling XMEMCPY
thread 140345438934784: in Hash_gen: after XMEMCPY: drbg->reseedCtr is 10651, drbg->lastBlock is -1383005185, checkBlock is 1229563600, drbg->matchCount is 0
 
thread 140345439241984: in Hash_DRBG_Generate: calling Hash_gen with reseedCtr 10651, lastBlock -1383005185, matchCount 0
thread 140345439241984: in Hash_gen: i is 0, len is 1
 
thread 140345438934784: in Hash_gen: set matchCount to 0 and drbg->lastBlock to 1229563600
thread 140345438934784: in Hash_gen: returning 0
 
thread 140345439241984: in Hash_gen: calling XMEMCPY
thread 140345439241984: in Hash_gen: after XMEMCPY: drbg->reseedCtr is 10651, drbg->lastBlock is 1229563600, checkBlock is 1229563600, drbg->matchCount is 0
thread 140345439241984: in Hash_gen: drbg->reseedCtr is 10651, drbg->lastBlock is 1229563600, checkBlock is 1229563600, setting matchCount to 1 and len is now 2
 
thread 140345438934784: in Hash_DRBG_Generate: Hash_gen returned SUCCESS, reseed increased to 10652
thread 140345438934784: wc_RNG_GenerateBlock: RETURNING: 0
 
thread 140345439241984: in Hash_gen: i is 1, len is 2
thread 140345439241984: in Hash_gen: calling XMEMCPY
thread 140345439241984: in Hash_gen: after XMEMCPY: drbg->reseedCtr is 10652, drbg->lastBlock is 1229563600, checkBlock is 1229563600, drbg->matchCount is 1
thread 140345439241984: in Hash_gen: drbg->matchCount is 1, drbg->reseedCtr is 10652, drbg->lastBlock is 1229563600, returning DRBG_CONT_FAILURE 3
thread 140345439241984: in Hash_DRBG_Generate: Hash_gen returned FAILURE: 3
thread 140345439241984: wc_RNG_GenerateBlock: RNG_FAILURE_E: -199, set rng->status to DRBG_FAILED: 2
thread 140345439241984: wc_RNG_GenerateBlock: RETURNING: -199

The problem seems to be that drbg->lastBlock is updated in one thread concurrently with the check checkBlock == drbg->lastBlock in another thread, so the equality can be true; once it's true, matchCount gets increased, the loop is made to repeat, and then it almost inevitably fails.
What's worse, once it has failed, rng->status is set to non-OK, and it is forever, it's not reset anymore, and all further calls to wc_RNG_GenerateBlock fail very quickly.

initGlobalRNG is 1, I don't have WOLFSSL_SMALL_STACK

serg wrote:

so wolfssl uses a global shared state for all wolfSSL_RAND_bytes invocations
and doesn't protect it with a mutex
I've reported an issue on github
we can add a mutex around RAND_bytes
but the caller doesn't know whether RAND_bytes uses a global or a local state

The issue at github: https://github.com/wolfSSL/wolfssl/issues/3657

Comment by Andrei Elkin [ 2021-10-08 ]

MDEV-17856 Port binlog_error_action from MySQL must be relevant.

Generated at Thu Feb 08 09:31:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.