[MDEV-25199] galera.galera_pc_recovery fails to start up Created: 2021-03-19  Updated: 2024-01-05

Status: Stalled
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4, 10.5, 10.6
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Sergei Golubchik
Resolution: Unresolved Votes: 0
Labels: hang, recovery

Issue Links:
Blocks
blocks MDEV-30172 Galera test case cleanup Stalled

 Description   

The test galera.galera_pc_recovery fails to start up. Due to a deficiency of mtr, this failure is not flagged on buildbot. Locally, I can repeat this at least starting with 10.4 (not sure about 10.2 and 10.3).

worker[48] mysql-test-run: WARNING: Timeout after mysql-test-run waited 180 seconds for the process [mysqld.2 - pid: 792954, winpid: 792954] to create a pid file.
worker[48] mysql-test-run: *** ERROR: Failed to start mysqld mysqld.2 with command /dev/shm/10.4/sql/mysqld
galera.galera_sst_mysqldump 'debug,innodb' w38 [ pass ]  141742
 
Only  445  of 446 completed.
--------------------------------------------------------------------------
The servers were restarted 152 times
Spent 3429.559 of 231 seconds executing testcases
 
Completed: Failed 2/361 tests, 99.45% were successful.
 
Failing test(s): galera.galera_bf_lock_wait galera_sr.galera-features#56

In var/48/log/mysqld.1.err there is evidence of this test starting up:

CURRENT_TEST: galera.galera_pc_recovery
2021-03-19 11:20:36 0 [Warning] WSREP: option --wsrep-causal-reads is deprecated
2021-03-19 11:20:36 0 [Note] /dev/shm/10.4/sql/mysqld (mysqld 10.4.19-MariaDB-debug-log) starting as process 776028 ...
2021-03-19 11:20:36 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32222)
2021-03-19 11:20:36 0 [Warning] Changed limits: max_open_files: 1024  max_connections: 151 (was 151)  table_cache: 421 (was 2000)
2021-03-19 11:20:36 0 [Warning] WSREP: wsrep_sst_receive_address is set to '127.0.0.1:16964' which makes it impossible for another host to reach this one. Please set it to the address which this node can be connected at by other cluster members.
2021-03-19 11:20:36 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 00000000-0000-0000-0000-000000000000:-1
2021-03-19 11:20:36 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2021-03-19 11:20:36 0 [Note] WSREP: wsrep_load(): Galera 4.7(rbac8171) by Codership Oy <info@codership.com> loaded successfully.
2021-03-19 11:20:36 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2021-03-19 11:20:36 0 [Warning] WSREP: Could not open state file for reading: '/dev/shm/10.4/mysql-test/var/48/mysqld.1/data//grastate.dat'

In var/log/stdout.log there is no occurrence of galera.galera_pc_recovery.

The mysqld.2.err ends in the following:

2021-03-19 11:20:59 2 [Note] WSREP: Starting rollbacker thread 2
2021-03-19 11:20:59 0 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.50143S), skipping check

The last lines in mysqld.1.err were sent before those timestamps:

2021-03-19 11:20:40 0 [Note] WSREP: async IST sender starting to serve tcp://127.0.0.1:16966 sending 2-2, preload starts from 2
2021-03-19 11:20:40 0 [Note] WSREP: IST sender 2 -> 2
2021-03-19 11:20:40 0 [Note] WSREP: Donor monitor thread started to monitor
2021-03-19 11:20:40 0 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '127.0.0.2:16967/rsync_sst' --socket '/dev/shm/10.4/mysql-test/var/tmp/48/mysqld.1.sock' --datadir '/dev/shm/10.4/mysql-test/var/48/mysqld.1/data/' --defaults-file '/dev/shm/10.4/mysql-test/var/48/my.cnf' --defaults-group-suffix '.1' --gtid '5d019f5e-8894-11eb-a1b7-2b3fae821768:2' --gtid-domain-id '0' --binlog-index 'mysqld-bin.index' --mysqld-args --defaults-group-suffix=.1 --defaults-file=/dev/shm/10.4/mysql-test/var/48/my.cnf --log-output=file --innodb --innodb-cmpmem --innodb-cmp-per-index --innodb-trx --innodb-locks --innodb-lock-waits --innodb-metrics --innodb-buffer-pool-stats --innodb-buffer-page --innodb-buffer-page-lru --innodb-sys-columns --innodb-sys-fields --innodb-sys-foreign --innodb-sys-foreign-cols --innodb-sys-indexes --innodb-sys-tables --innodb-sys-virtual --core-file --loose-debug-sync-timeout=300'
2021-03-19 11:20:40 2 [Note] WSREP: sst_donor_thread signaled with 0
2021-03-19 11:20:40 0 [Note] WSREP: Flushing tables for SST...
2021-03-19 11:20:40 0 [Note] WSREP: pause
2021-03-19 11:20:40 0 [Note] WSREP: Provider paused at 5d019f5e-8894-11eb-a1b7-2b3fae821768:2 (5)
2021-03-19 11:20:40 0 [Note] WSREP: Server paused at: 2
2021-03-19 11:20:40 0 [Note] WSREP: Tables flushed.
2021-03-19 11:20:41 0 [Note] WSREP: resume
2021-03-19 11:20:41 0 [Note] WSREP: resuming provider at 5
2021-03-19 11:20:41 0 [Note] WSREP: Provider resumed.
2021-03-19 11:20:41 0 [Note] WSREP: SST sent: 5d019f5e-8894-11eb-a1b7-2b3fae821768:2
2021-03-19 11:20:41 0 [Note] WSREP: Server status change donor -> joined
2021-03-19 11:20:41 0 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-03-19 11:20:41 0 [Note] WSREP: 0.0 (jyty): State transfer to 1.0 (jyty) complete.
2021-03-19 11:20:41 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 2)
2021-03-19 11:20:41 0 [Note] WSREP: Member 0.0 (jyty) synced with group.
2021-03-19 11:20:41 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2)
2021-03-19 11:20:41 2 [Note] WSREP: Server jyty synced with group
2021-03-19 11:20:41 2 [Note] WSREP: Server status change joined -> synced
2021-03-19 11:20:41 2 [Note] WSREP: Synchronized with group, ready for connections
2021-03-19 11:20:41 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2021-03-19 11:20:41 0 [Note] WSREP: Donor monitor thread ended with total time 1 sec
2021-03-19 11:20:42 0 [Note] WSREP: (5cfe4bac-b52e, 'tcp://0.0.0.0:16962') turning message relay requesting off
2021-03-19 11:20:45 0 [Note] WSREP: async IST sender served
2021-03-19 11:20:45 0 [Note] WSREP: 1.0 (jyty): State transfer from 0.0 (jyty) complete.
2021-03-19 11:20:45 0 [Note] WSREP: Member 1.0 (jyty) synced with group.

The test ended at 11:24:16.



 Comments   
Comment by Jan Lindström (Inactive) [ 2021-04-14 ]

Yurchenko Can you have a look of this. First server does not even start, it could be that mtr is not ready for such test or there is a bug on test/code.

Comment by Anel Husakovic [ 2021-04-14 ]

"can’t count" error: Take any test (or 20 tests), put "foo" in their opt or cnf file, and here you go

Comment by Alexey [ 2021-09-22 ]

This looks like a bug in the MTR framework, specifically it prevents one of the following statements of the test to have effect:

--exec echo "restart: --wsrep-start-position=$galera_wsrep_start_position1 --wsrep_cluster_address=gcomm://127.0.0.1:$NODE_GALERAPORT_1,127.0.0.1:$NODE_GALERAPORT_2" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
--exec echo "restart: --wsrep-start-position=$galera_wsrep_start_position2 --wsrep_cluster_address=gcomm://127.0.0.1:$NODE_GALERAPORT_1,127.0.0.1:$NODE_GALERAPORT_2" > $MYSQLTEST_VARDIR/tmp/mysqld.2.expect

Which of those statements fails to have effect (a node is not restarted) is up to a race, so if you put a sleep between those, you will have a consistent failure in the second.

The bug was introduced in the merge commit e841957416e9287d1e9b2e32c952d6d0c1a2e2ed which in turn brings in a bunch of MTR-related commits from 10.2, see c4f0133444d2a867ee79f5c16b9d7829a05861cd..7fe351aba4e4e17e1be2599fa36ff73a0ce103f9

10.2 does not have `galera_pc_recovery` test, so that allowed the bug to sneak in.

At this point I believe it would be most efficient to assign this issue to the author of these changes (Sergei) or whoever else is skilled in MTR internals.

Comment by Jan Lindström [ 2024-01-05 ]

serg Please refer comment from Alexey.

Generated at Thu Feb 08 09:35:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.