[MDEV-32024] Galera library 26.4.16 fails with every server version
Created: 2023-08-27  Updated: 2023-10-24  Resolved: 2023-10-24

Status: Closed
Project: MariaDB Server
Component/s: Galera, Tests
Affects Version/s: 10.4, 10.5, 10.6, 10.10, 10.11, 11.0, 11.1
Fix Version/s: 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2, 11.3.1

Type: Bug
Priority: Critical
Reporter: Elena Stepanova
Assignee: Julius Goryavsky
Resolution: Fixed
Votes: 1
Labels: Test_disabled

Attachments: fix_10_4_2.diff, fix_11_1.diff
Issue Links:
Blocks: is blocked by MDEV-32051 Failed to insert streaming client (Closed)
Issue split: split to MDEV-32561 WSREP FSM failure: no such a transiti... (Closed)

 Description   

MariaDB tests in CI mainly run with the latest Galera library from the mariadb-4.x branch, as requested by the Galera developers.

The recently pushed 26.4.16 library fails everywhere, with a variety of errors, and thus makes the server CI unusable.

errors

galera_sr.GCF-1060 'innodb'              w2 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2023-08-23 03:55:04
line
2023-08-23  3:54:54 25 [Warning] WSREP: Failed to insert streaming client 25
2023-08-23  3:54:54 25 [Warning] WSREP: Failed to insert streaming client 25
2023-08-23  3:54:54 25 [Warning] WSREP: Failed to insert streaming client 25
2023-08-23  3:54:54 25 [Warning] WSREP: Failed to insert streaming client 25

crashes

galera.galera_sequences 'innodb'         w2 [ retry-fail ]
        Test ended at 2023-08-24 07:27:36
 
CURRENT_TEST: galera.galera_sequences
mysqltest: At line 178: query 'INSERT INTO t1(b) values (2)' failed with wrong errno 2013: 'Lost connection to MySQL server during query', instead of 0...

wrong results

wsrep.wsrep_provider_plugin_defaults 'innodb' w2 [ retry-fail ]
        Test ended at 2023-08-24 14:25:01
 
CURRENT_TEST: wsrep.wsrep_provider_plugin_defaults
--- /usr/share/mariadb-test/suite/wsrep/r/wsrep_provider_plugin_defaults.result	2023-08-24 07:55:52.000000000 +0000
+++ /dev/shm/var/2/log/wsrep_provider_plugin_defaults.reject	2023-08-24 14:25:01.342887420 +0000
@@ -10,7 +10,7 @@
 'wsrep_provider_signal',
 'wsrep_provider_gmcast_listen_addr');
 COUNT(*)
-83
+84
 SELECT * FROM INFORMATION_SCHEMA.SYSTEM_VARIABLES
 WHERE VARIABLE_NAME LIKE 'wsrep_provider_%' AND VARIABLE_NAME NOT IN (
 'wsrep_provider',
@@ -998,6 +998,21 @@
 READ_ONLY	NO
 COMMAND_LINE_ARGUMENT	REQUIRED
 GLOBAL_VALUE_PATH	NULL
+VARIABLE_NAME	WSREP_PROVIDER_PROTONET_BACKEND
+SESSION_VALUE	NULL
+GLOBAL_VALUE	asio
+GLOBAL_VALUE_ORIGIN	COMPILE-TIME
+DEFAULT_VALUE	asio
+VARIABLE_SCOPE	GLOBAL
+VARIABLE_TYPE	VARCHAR
+VARIABLE_COMMENT	Wsrep provider option
+NUMERIC_MIN_VALUE	NULL
+NUMERIC_MAX_VALUE	NULL
+NUMERIC_BLOCK_SIZE	NULL
+ENUM_VALUE_LIST	NULL
+READ_ONLY	YES
+COMMAND_LINE_ARGUMENT	REQUIRED
+GLOBAL_VALUE_PATH	NULL
 VARIABLE_NAME	WSREP_PROVIDER_PROTONET_VERSION
 SESSION_VALUE	NULL
 GLOBAL_VALUE	0
 
mysqltest: Result length mismatch

And so on; this is not a full list.

Please check and fix it, and please at least run Galera tests locally before pushing something to main.



 Comments   
Comment by Sergei Golubchik [ 2023-09-25 ]

FYI, I've fixed galera.galera_as_slave_gtid_myisam in 10.10 and wsrep.wsrep_provider_plugin_defaults in 11.0

Comment by Sergei Golubchik [ 2023-09-25 ]

janlindstrom, I don't understand how you can fix a crash with changes in the test.

Comment by Sergei Golubchik [ 2023-09-25 ]

And, please, when fixing sequences, apply the following patch too (I cannot do it while the test is crashing):

--- a/sql/sql_table.cc
+++ b/sql/sql_table.cc
@@ -5324,7 +5324,7 @@ bool wsrep_check_sequence(THD* thd, const sequence_defini>
     if (db_type != DB_TYPE_INNODB)
     {
       my_error(ER_NOT_SUPPORTED_YET, MYF(0),
-               "Galera cluster does support only InnoDB sequences");
+               "non-InnoDB sequences in Galera cluster");
       return(true);
     }
 
@@ -5335,8 +5335,7 @@ bool wsrep_check_sequence(THD* thd, const sequence_defini>
         seq->cache)
     {
       my_error(ER_NOT_SUPPORTED_YET, MYF(0),
-               "In Galera if you use CACHE you should set INCREMENT BY 0"
-              " to behave correctly in a cluster");
+               "CACHE without INCREMENT BY 0 in Galera cluster");
       return(true);
     }
 

Currently it prints:

ERROR 42000: This version of MariaDB doesn't yet support 'Galera cluster does support only InnoDB sequences'

which looks rather silly.
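
The effect of the patch can be illustrated with a minimal sketch. It assumes the server substitutes the argument into the template text visible in the error output above; `render` is a hypothetical helper written for this illustration, not server code, and it omits the `ERROR 42000: ` prefix added by the client.

```python
# Template text copied from the error output quoted in this comment.
# The %-substitution below only imitates what my_error() appears to do.
ER_NOT_SUPPORTED_YET = "This version of MariaDB doesn't yet support '%s'"


def render(feature: str) -> str:
    """Hypothetical helper: build the message text a user would see."""
    return ER_NOT_SUPPORTED_YET % feature


# Arguments before the patch: full sentences that clash with the template.
old_args = [
    "Galera cluster does support only InnoDB sequences",
    "In Galera if you use CACHE you should set INCREMENT BY 0"
    " to behave correctly in a cluster",
]

# Arguments after the patch: noun phrases that complete the template.
new_args = [
    "non-InnoDB sequences in Galera cluster",
    "CACHE without INCREMENT BY 0 in Galera cluster",
]

for old, new in zip(old_args, new_args):
    print("before:", render(old))
    print("after: ", render(new))
```

With the patched arguments the rendered text reads as one sentence, because the argument names the unsupported feature instead of restating the rule.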

Comment by Julius Goryavsky [ 2023-09-25 ]

janlindstrom, at the moment we have a server crash and a lost connection, not just a hang in the test:

CURRENT_TEST: galera.galera_sequences
mysqltest: At line 236: query 'INSERT INTO t1(b) values (1),(2),(3),(4),(5),(6),(7),(8),(9)' failed with wrong errno 2013: 'Lost connection to MySQL server during query', instead of 0...
 
The result from queries just before the failure was:
< snip >
@@auto_increment_offset
2
SET SESSION wsrep_sync_wait=0;
connection node_1;
connection node_2;
connection node_1;
DROP SEQUENCE t;
DROP TABLE t1;
CREATE SEQUENCE t INCREMENT BY 0 NOCACHE ENGINE=INNODB;
DROP SEQUENCE t;
CREATE SEQUENCE t INCREMENT BY 1 CACHE=20 ENGINE=INNODB;
ERROR 42000: This version of MariaDB doesn't yet support 'In Galera if you use CACHE you should set INCREMENT BY 0 to behave correctly in a cluster'
CREATE SEQUENCE t INCREMENT BY 0 CACHE=20 ENGINE=INNODB;
CREATE TABLE t1(a int not null primary key default nextval(t), b int) engine=innodb;
connection node_2;
# Wait DDL to replicate
connection node_1;
SET SESSION wsrep_sync_wait=0;
connection node_2;
SET SESSION wsrep_sync_wait=0;
 
More results from queries before failure can be found in /dev/shm/var/1/log/galera_sequences.log

and:

WSREP_SST: [INFO] rsync SST completed on donor (20230925 06:53:55.214)
2023-09-25  6:53:55 0 [Note] WSREP: Donor monitor thread ended with total time 2 sec
2023-09-25  6:53:55 0 [Note] WSREP: (492dcd42-89d1, 'tcp://0.0.0.0:16002') turning message relay requesting off
2023-09-25  6:53:56 0 [Note] WSREP: async IST sender served
2023-09-25  6:53:56 0 [Note] WSREP: 1.0 (centos74-amd64): State transfer from 0.0 (centos74-amd64) complete.
2023-09-25  6:53:56 0 [Note] WSREP: Member 1.0 (centos74-amd64) synced with group.
2023-09-25  6:53:58 0 [Note] WSREP: Member 1.0 (centos74-amd64) desyncs itself from group
2023-09-25  6:53:58 0 [Note] WSREP: Member 1.0 (centos74-amd64) resyncs itself to group.
2023-09-25  6:53:58 0 [Note] WSREP: Member 1.0 (centos74-amd64) synced with group.
2023-09-25  6:53:58 0 [Note] WSREP: Member 1.0 (centos74-amd64) desyncs itself from group
2023-09-25  6:53:58 0 [Note] WSREP: Member 1.0 (centos74-amd64) resyncs itself to group.
2023-09-25  6:53:58 0 [Note] WSREP: Member 1.0 (centos74-amd64) synced with group.
2023-09-25  6:53:58 1 [ERROR] Slave SQL: Error 'Unknown table 'test.t1'' on query. Default database: 'test'. Query: 'DROP TABLE t1', Internal MariaDB error code: 1051
2023-09-25  6:53:58 1 [Warning] WSREP: Ignoring error 'Unknown table 'test.t1'' on query. Default database: 'test'. Query: 'DROP TABLE t1', Error_code: 1051
2023-09-25  6:53:58 1 [ERROR] Slave SQL: Error 'Unknown SEQUENCE: 'test.sq2'' on query. Default database: 'test'. Query: 'DROP SEQUENCE sq2', Internal MariaDB error code: 4091
2023-09-25  6:53:58 1 [Warning] WSREP: Ignoring error 'Unknown SEQUENCE: 'test.sq2'' on query. Default database: 'test'. Query: 'DROP SEQUENCE sq2', Error_code: 4091
2023-09-25  6:53:58 17 [ERROR] WSREP: FSM: no such a transition REPLICATING -> COMMITTED
230925  6:53:58 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.32-MariaDB-log source revision: 3ac25b480055e7e99e46a958c04f9ffb7a6d68cf
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_threads=153
thread_count=10
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63557 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x563c9cbbc808
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f6d85e8dc40 thread_stack 0x49000
mysys/stacktrace.c:175(my_print_stacktrace)[0x563c9301d4de]
sql/signal_handler.cc:238(handle_fatal_signal)[0x563c92a6d687]
sigaction.c:0(__restore_rt)[0x7f6d8d4165e0]
/lib64/libc.so.6(gsignal+0x37)[0x7f6d8c86b1f7]
/lib64/libc.so.6(abort+0x148)[0x7f6d8c86c8e8]
src/fsm.hpp:56(galera::FSM<galera::TrxHandle::State, galera::TrxHandle::Transition>::shift_to(galera::TrxHandle::State, int))[0x7f6d892e4cda]
src/replicator_smm.cpp:1423(galera::ReplicatorSMM::commit_order_leave(galera::TrxHandleSlave&, wsrep_buf const*))[0x7f6d892f44bb]
detail/shared_count.hpp:371(galera_commit_order_leave)[0x7f6d892e0468]
/usr/sbin/mysqld(_ZN5wsrep18wsrep_provider_v2618commit_order_leaveERKNS_9ws_handleERKNS_7ws_metaERKNS_14mutable_bufferE+0x91)[0x563c930ab001]
src/wsrep_provider_v26.cpp:969(wsrep::wsrep_provider_v26::commit_order_leave(wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::mutable_buffer const&))[0x563c930a4ee0]
src/transaction.cpp:579(wsrep::transaction::ordered_commit())[0x563c92b5aae9]
sql/log.cc:7822(MYSQL_BIN_LOG::queue_for_group_commit(MYSQL_BIN_LOG::group_commit_entry*))[0x563c92b6001c]
sql/log.cc:7480(MYSQL_BIN_LOG::write_transaction_to_binlog(THD*, binlog_cache_mngr*, Log_event*, bool, bool, bool))[0x563c92b604b0]
sql/log.cc:516(binlog_cache_mngr::reset(bool, bool))[0x563c92b6066d]
sql/log.cc:1814(binlog_commit_flush_stmt_cache(THD*, bool, binlog_cache_mngr*))[0x563c92b60894]
sql/log.cc:2091(binlog_rollback(handlerton*, THD*, bool))[0x563c92b60a7f]
sql/handler.cc:1956(ha_rollback_trans(THD*, bool))[0x563c92a70f6b]
sql/handler.cc:1747(ha_commit_trans(THD*, bool))[0x563c92a71c94]
sql/transaction.cc:438(trans_commit_stmt(THD*))[0x563c9297121f]
sql/sql_class.h:4028(THD::get_stmt_da())[0x563c92871242]
sql/sql_parse.cc:8013(mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool))[0x563c9287903b]
sql/sql_class.h:4028(THD::get_stmt_da())[0x563c928798a6]
sql/sql_parse.cc:1843(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool))[0x563c9287c77e]
sql/sql_parse.cc:1379(do_command(THD*))[0x563c9287ce22]
sql/sql_connect.cc:1420(do_handle_one_connection(CONNECT*))[0x563c92962512]
sql/sql_connect.cc:1326(handle_one_connection)[0x563c929625fd]
perfschema/pfs.cc:1872(pfs_spawn_thread)[0x563c92cef3ed]
pthread_create.c:0(start_thread)[0x7f6d8d40ee25]
/lib64/libc.so.6(clone+0x6d)[0x7f6d8c92e34d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x563c9ccdc020): INSERT INTO t1(b) values (1),(2),(3),(4),(5),(6),(7),(8),(9)
 
Connection ID (thread ID): 17
Status: KILL_QUERY
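
The failure mode in the log and backtrace above (an unregistered state transition reported from `shift_to`, followed by `abort`) can be sketched as a table-driven state machine. This is an illustration only: the names `REPLICATING` and `COMMITTED` come from the log, while the other states, the contents of `ALLOWED`, and the classes are assumptions made for the example and do not reproduce Galera's actual transition table.

```python
from enum import Enum, auto


class State(Enum):
    REPLICATING = auto()
    CERTIFYING = auto()   # assumed intermediate state
    APPLYING = auto()     # assumed intermediate state
    COMMITTING = auto()   # assumed intermediate state
    COMMITTED = auto()


# Hypothetical transition table: only pairs listed here are legal.
ALLOWED = {
    (State.REPLICATING, State.CERTIFYING),
    (State.CERTIFYING, State.APPLYING),
    (State.APPLYING, State.COMMITTING),
    (State.COMMITTING, State.COMMITTED),
}


class InvalidTransition(RuntimeError):
    """Raised where the library in the backtrace logs an error and aborts."""


class FSM:
    def __init__(self, initial: State) -> None:
        self.state = initial

    def shift_to(self, target: State) -> None:
        # Reject any transition that is not registered in the table.
        if (self.state, target) not in ALLOWED:
            raise InvalidTransition(
                f"FSM: no such a transition {self.state.name} -> {target.name}"
            )
        self.state = target


def try_shift(fsm: FSM, target: State) -> str:
    """Return 'ok' on success, or the error text on an illegal transition."""
    try:
        fsm.shift_to(target)
        return "ok"
    except InvalidTransition as exc:
        return str(exc)


# A transaction still in REPLICATING is asked to jump straight to COMMITTED.
print(try_shift(FSM(State.REPLICATING), State.COMMITTED))
```

The sketch raises an exception where the backtrace shows a process abort, so the check stays testable; in both cases the caller has skipped states that the table requires.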

Comment by Jan Lindström [ 2023-10-20 ]

https://github.com/MariaDB/server/pull/2793

Comment by Julius Goryavsky [ 2023-10-24 ]

Fix merged with head revision: https://github.com/MariaDB/server/commit/e913f4e11e1e519196f276d7c5689f653e724547
Remaining part moved to separate task: https://jira.mariadb.org/browse/MDEV-32561
