[MDEV-31988] galera_partition test: assertion due to unallowed state transition Created: 2023-08-16  Updated: 2023-09-14  Resolved: 2023-09-12

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: None
Fix Version/s: 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Critical
Reporter: Julius Goryavsky Assignee: Julius Goryavsky
Resolution: Fixed Votes: 1
Labels: None

Attachments: File gp.diff    
Issue Links:
Relates
relates to MDEV-27847 SIGABRT in wsrep::server_state::state... Stalled

 Description   

galera_partition test fails intermittently (~30% of the time) when testing on the 10.6-enterprise-rebase branch due to unallowed state transition:

2023-08-16 16:02:13 2 [Note] WSREP: Server panda synced with group
2023-08-16 16:02:13 2 [Note] WSREP: server panda state change: joined -> synced
2023-08-16 16:02:13 2 [Note] WSREP: Server status change joined -> synced
2023-08-16 16:02:13 2 [Note] WSREP: Synchronized with group, ready for connections
2023-08-16 16:02:13 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-08-16 16:02:13 2 [Note] WSREP: Detected STR version: 1, req_len: 153, req: STRv1
2023-08-16 16:02:13 2 [Note] WSREP: Cert index preload: 3 -> 3
2023-08-16 16:02:13 0 [Note] WSREP: 1.1 (panda): State transfer to 3.3 (ns367620) complete.
2023-08-16 16:02:13 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 3)
2023-08-16 16:02:13 0 [Note] WSREP: Processing event queue:... -nan% (0/0 events) complete.
2023-08-16 16:02:13 0 [Note] WSREP: Member 1.1 (ns367620) synced with group.
2023-08-16 16:02:13 0 [Note] WSREP: Processing event queue:...100.0% (1/1 events) complete.
2023-08-16 16:02:13 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3)
2023-08-16 16:02:13 22 [Note] WSREP: assigned new next trx id: 25
2023-08-16 16:02:13 22 [Note] WSREP: wsrep_after_statement for 22 client_state exec  client_mode local trans_state aborted
2023-08-16 16:02:13 22 [Note] WSREP: wsrep_thd_binlog_reset
2023-08-16 16:02:13 22 [Note] WSREP: assigned new next trx id: 26
2023-08-16 16:02:13 22 [Note] WSREP: wsrep_commit_empty for 22 client_state exec client_mode local trans_state executing sql SELECT VARIABLE_NAME, VARIABLE_VALUE
              FROM INFORMATION_SCHEMA.GLOBAL_STATUS
              WHERE VARIABLE_NAME = 'wsrep_ready'
2023-08-16 16:02:13 22 [Note] WSREP: wsrep_after_statement for 22 client_state exec  client_mode local trans_state aborted
2023-08-16 16:02:13 22 [Note] WSREP: wsrep_thd_binlog_reset
2023-08-16 16:02:13 22 [Note] WSREP: wsrep_after_statement for 22 client_state exec  client_mode local trans_state aborted
2023-08-16 16:02:13 22 [Note] WSREP: wsrep_thd_binlog_reset
2023-08-16 16:02:13 22 [Note] WSREP: assigned new next trx id: 27
2023-08-16 16:02:13 2 [Note] WSREP: server panda state change: synced -> donor
2023-08-16 16:02:13 2 [Note] WSREP: Server status change synced -> donor
2023-08-16 16:02:13 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2023-08-16 16:02:13 0 [Note] WSREP: async IST sender starting to serve tcp://127.0.0.1:16014 sending 3-3, preload starts from 3
2023-08-16 16:02:13 0 [Note] WSREP: Donor monitor thread started to monitor
2023-08-16 16:02:13 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '127.0.0.1:16015/rsync_sst' --local-port 16001 --socket '/home/panda/maria-10.6/build/mysql-test/var/tmp/mysqld.2.sock' --progress 0 --datadir '/home/panda/maria-10.6/build/mysql-test/var/mysqld.2/data/' --defaults-file '/home/panda/maria-10.6/build/mysql-test/var/my.cnf' --defaults-group-suffix '.2' --gtid '7ac8f618-3c3d-11ee-94fc-13bdcf9cdd3a:3' --gtid-domain-id 0 --binlog-index 'mysqld-bin.index' --mysqld-args --defaults-group-suffix=.2 --defaults-file=/home/panda/maria-10.6/build/mysql-test/var/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --loose-enable-partition --core-file --loose-debug-sync-timeout=300'
2023-08-16 16:02:13 0 [Note] WSREP: IST sender 3 -> 3
2023-08-16 16:02:13 2 [Note] WSREP: sst_donor_thread signaled with 0
2023-08-16 16:02:13 13 [Note] WSREP: Server panda synced with group
2023-08-16 16:02:13 13 [Warning] WSREP: server: panda unallowed state transition: donor -> synced
 
mariadbd: /home/panda/maria-10.6/wsrep-lib/src/server_state.cpp:1368: void wsrep::server_state::state(wsrep::unique_lock<wsrep::mutex>&, state): Assertion `0' failed.
230816 16:02:13 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.6.15-10-MariaDB-debug-log source revision: a58082bb6a7efb490aafbed5e264168b743c0a9d
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=14
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63896 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f1e44000dc8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f1e96b1bb90 thread_stack 0x49000
WSREP_SST: [INFO] rsync SST started on donor (20230816 16:02:13.195)
WSREP_SST: [INFO] Using stunnel for SSL encryption: CA: '/home/panda/maria-10.6/mysql-test/std_data/cacert.pem', CAPATH='', ssl-mode='VERIFY_CA' (20230816 16:02:13.312)
2023-08-16 16:02:13 0 [Note] WSREP: Flushing tables for SST...
mysys/stacktrace.c:215(my_print_stacktrace)[0x559e8ac50d9f]
sql/signal_handler.cc:241(handle_fatal_signal)[0x559e8a2f4435]
libc_sigaction.c:0(__restore_rt)[0x7f1ecb577520]
nptl/pthread_kill.c:44(__pthread_kill_implementation)[0x7f1ecb5cba7c]
posix/raise.c:27(__GI_raise)[0x7f1ecb577476]
stdlib/abort.c:81(__GI_abort)[0x7f1ecb55d7f3]
intl/loadmsgcat.c:1177(_nl_load_domain)[0x7f1ecb55d71b]
/lib/x86_64-linux-gnu/libc.so.6(+0x39e96)[0x7f1ecb56ee96]
src/server_state.cpp:1371(wsrep::server_state::state(std::unique_lock<wsrep::mutex>&, wsrep::server_state::state))[0x559e8ae3c76c]
src/server_state.cpp:1095(wsrep::server_state::on_sync())[0x559e8ae3af59]
src/wsrep_provider_v26.cpp:531((anonymous namespace)::synced_cb(void*))[0x559e8ae527d7]
2023-08-16 16:02:14 0 [Note] WSREP: 0.1 (ns367620): State transfer to 2.2 (ns367620) complete.
2023-08-16 16:02:14 0 [Note] WSREP: Member 0.1 (ns367620) synced with group.
src/replicator_smm.cpp:3066(galera::ReplicatorSMM::process_sync(long))[0x7f1ec644c904]
src/gcs_action_source.cpp:144(galera::GcsActionSource::dispatch(void*, gcs_action const&, bool&))[0x7f1ec649da0e]
src/gcs_action_source.cpp:192(galera::GcsActionSource::process(void*, bool&))[0x7f1ec649dfa9]
src/replicator_smm.cpp:402(galera::ReplicatorSMM::async_recv(void*))[0x7f1ec643bbe8]
src/wsrep_provider.cpp:291(galera_recv)[0x7f1ec6412b06]
src/wsrep_provider_v26.cpp:858(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x559e8ae53b8e]
sql/wsrep_thd.cc:57(wsrep_replication_process(THD*, void*))[0x559e8a714c20]
2023-08-16 16:02:15 0 [Note] WSREP: 2.2 (ns367620): State transfer from 0.1 (ns367620) complete.
2023-08-16 16:02:15 0 [Note] WSREP: Member 2.2 (ns367620) synced with group.
sql/wsrep_mysqld.cc:4034(start_wsrep_THD(void*))[0x559e8a70007d]
perfschema/pfs.cc:2203(pfs_spawn_thread)[0x559e8a659630]
nptl/pthread_create.c:442(start_thread)[0x7f1ecb5c9b43]
x86_64/clone3.S:83(__clone3)[0x7f1ecb65ba00]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 13
Status: NOT_KILLED



 Comments   
Comment by Jan Lindström [ 2023-08-17 ]

I could not reproduce this issue:

jan@jan-HP-ZBook-15u-G5:~/work/mariadb/10.6e/mysql-test$ ./mtr galera.galera_partition --repeat=30 --enable-disabled
Logging: ./mtr  galera.galera_partition --repeat=30 --enable-disabled
VS config: 
vardir: /home/jan/work/mariadb/10.6e/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/jan/work/mariadb/10.6e/mysql-test/var'...
Checking supported features...
MariaDB Version 10.6.15-10-MariaDB-debug
 - SSL connections supported
 - binaries are debug compiled
 - binaries built with wsrep patch
Collecting tests...
 - galera.galera_partition will be run although it's been disabled
   due to 'MENT-1922 assertion due to unallowed state transition'
Installing system database...
 
==============================================================================
 
TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
 
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
galera.galera_partition 'innodb'         [ pass ]   3729
galera.galera_partition 'innodb'         [ 2 pass ]   3917
galera.galera_partition 'innodb'         [ 3 pass ]   3748
galera.galera_partition 'innodb'         [ 4 pass ]   3914
galera.galera_partition 'innodb'         [ 5 pass ]   3818
galera.galera_partition 'innodb'         [ 6 pass ]   3690
galera.galera_partition 'innodb'         [ 7 pass ]   3791
galera.galera_partition 'innodb'         [ 8 pass ]   3853
galera.galera_partition 'innodb'         [ 9 pass ]   3849
galera.galera_partition 'innodb'         [ 10 pass ]   3865
galera.galera_partition 'innodb'         [ 11 pass ]   3801
galera.galera_partition 'innodb'         [ 12 pass ]   3803
galera.galera_partition 'innodb'         [ 13 pass ]   3832
galera.galera_partition 'innodb'         [ 14 pass ]   3901
galera.galera_partition 'innodb'         [ 15 pass ]   3794
galera.galera_partition 'innodb'         [ 16 pass ]   3825
galera.galera_partition 'innodb'         [ 17 pass ]   3824
galera.galera_partition 'innodb'         [ 18 pass ]   3938
galera.galera_partition 'innodb'         [ 19 pass ]   3776
galera.galera_partition 'innodb'         [ 20 pass ]   3884
galera.galera_partition 'innodb'         [ 21 pass ]   3875
galera.galera_partition 'innodb'         [ 22 pass ]   3975
galera.galera_partition 'innodb'         [ 23 pass ]   3852
galera.galera_partition 'innodb'         [ 24 pass ]   3975
galera.galera_partition 'innodb'         [ 25 pass ]   3811
galera.galera_partition 'innodb'         [ 26 pass ]   3841
galera.galera_partition 'innodb'         [ 27 pass ]   3914
galera.galera_partition 'innodb'         [ 28 pass ]   3924
galera.galera_partition 'innodb'         [ 29 pass ]   3818
galera.galera_partition 'innodb'         [ 30 pass ]   3870
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 115.407 of 154 seconds executing testcases
 
Completed: All 30 tests were successful.

Comment by Jan Lindström [ 2023-08-17 ]

I do not think this is related to MDEV-27847 because test does not change wsrep_cluster_address.

Comment by Jan Lindström [ 2023-08-17 ]

I propose a fix but if this does not help I need more information i.e. error logs and test stdout. Again do git am < file_name to apply. This fix might be good to put 10.4+ (CS/ES)

Comment by Jan Lindström [ 2023-08-24 ]

https://github.com/MariaDB/server/pull/2734

Comment by Julius Goryavsky [ 2023-09-12 ]

Fixed, https://github.com/MariaDB/server/commit/fee138a1233d7a74658eaf2a830a3c11da654253

Generated at Thu Feb 08 10:27:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.