[MDEV-32974] Member fails to join due to old seqno in GTID - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 11.0.1, 11.0.2, 11.0.3, 11.0.4, 10.11
Fix Version/s: 10.11.8, 11.0.6, 11.1.5, 11.2.4, 11.4.2
Component/s: Galera, Galera SST, Storage Engine - InnoDB
Labels:
- regression
- upgrade
Environment:
bitnami/mariadb-galera:11.0.4-debian-11-r0 in k8s cluster

Description

We are unable to find the old seqno anywhere except in the `ibdata1` file of the donor(by searching for the hex of it). But not sure how to get rid of this old seqno.
What we did try is starting db-0 with fresh volume by removing K8s PVC for it, so it does a SST but that fails as well.

Logs from member db-0:

[Note] WSREP: SST received

[Note] WSREP: Server status change joiner -> initializing

[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

[Note] mysqld: Aria engine: starting recovery

recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 2 1(0.0 seconds);

[Note] mysqld: Aria engine: recovery done

[Note] InnoDB: Compressed tables use zlib 1.2.11

[Note] InnoDB: Number of transaction pools: 1

[Note] InnoDB: Using crc32 + pclmulqdq instructions

[Note] InnoDB: Using Linux native AIO

[Note] InnoDB: Initializing buffer pool, total size = 2.000GiB, chunk size = 32.000MiB

[Note] InnoDB: Completed initialization of buffer pool

[Note] InnoDB: File system buffers for log disabled (block size=512 bytes)

[Note] InnoDB: End of log at LSN=180964319

[Note] InnoDB: Resizing redo log from 12.016KiB to 96.000MiB; LSN=180964319

[Note] InnoDB: File system buffers for log disabled (block size=512 bytes)

[Note] InnoDB: Reinitializing innodb_undo_tablespaces= 3 from 0

[Note] InnoDB: Data file .//undo001 did not exist: new to be created

[Note] InnoDB: Setting file .//undo001 size to 10.000MiB

[Note] InnoDB: Database physically writes the file full: wait...

[Note] InnoDB: Data file .//undo002 did not exist: new to be created

[Note] InnoDB: Setting file .//undo002 size to 10.000MiB

[Note] InnoDB: Database physically writes the file full: wait...

[Note] InnoDB: Data file .//undo003 did not exist: new to be created

[Note] InnoDB: Setting file .//undo003 size to 10.000MiB

[Note] InnoDB: Database physically writes the file full: wait...

[Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.

[Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...

[Note] InnoDB: File './ibtmp1' size is now 12.000MiB.

[Note] InnoDB: log sequence number 180964319; transaction id 73558

[Note] InnoDB: Loading buffer pool(s) from /bitnami/mariadb/data/ib_buffer_pool

[Note] InnoDB: Cannot open '/bitnami/mariadb/data/ib_buffer_pool' for reading: No such file or directory

[Note] Plugin 'FEEDBACK' is disabled.

[Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.

[Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.

[Note] Recovering after a crash using mysql-bin

[Note] Starting table crash recovery...

[Note] Crash table recovery finished.

[Note] Server socket created on IP: '0.0.0.0'.

[Warning] 'proxies_priv' entry '@% root@db-0' ignored in --skip-name-resolve mode.

[Note] WSREP: wsrep_init_schema_and_SR (nil)

[Note] WSREP: Server initialized

[Note] WSREP: Server status change initializing -> initialized

[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

[Note] WSREP: Recovered position from storage: 6aa53efc-db72-11ec-880f-a282ce494905:117376

[Note] WSREP: Starting applier thread 6

[Note] WSREP: Starting applier thread 7

[Note] WSREP: Starting applier thread 8

[Note] WSREP: Recovered view from SST:

  id: 6aa53efc-db72-11ec-880f-a282ce494905:128934

  status: primary

  protocol_version: 4

  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO

  final: no

  own_index: 0

  members(3):

	0: 46250027-95c8-11ee-9d94-06eb82413a07, db

	1: 4ed771b5-941f-11ee-82bf-56ec952abcf6, db

	2: bfa6e688-941f-11ee-97c5-d323f02068c1, db

[ERROR] WSREP: sst_received failed: SST script passed bogus GTID: 6aa53efc-db72-11ec-880f-a282ce494905:117376. Preceding view GTID: 6aa53efc-db72-11ec-880f-a282ce494905:128934

[Note] WSREP: SST received: 00000000-0000-0000-0000-000000000000:-1

[Note] WSREP: Joiner monitor thread ended with total time 14 sec

[ERROR] WSREP: Application received wrong state:

	Received: 00000000-0000-0000-0000-000000000000

	Required: 6aa53efc-db72-11ec-880f-a282ce494905

[ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required.

[Note] WSREP: ReplicatorSMM::abort()

[Note] WSREP: Closing send monitor...

[Note] WSREP: Closed send monitor.

[Note] WSREP: gcomm: terminating thread

[Note] WSREP: gcomm: joining thread

[Note] WSREP: gcomm: closing backend

[Note] /opt/bitnami/mariadb/sbin/mysqld: ready for connections.

Version: '11.0.3-MariaDB-log'  socket: '/opt/bitnami/mariadb/tmp/mysql.sock'  port: 3306  Source distribution

[Note] WSREP: view(view_id(NON_PRIM,46250027-9d94,6636) memb {

	46250027-9d94,0

} joined {

} left {

} partitioned {

	4ed771b5-82bf,0

	bfa6e688-97c5,0

})

[Note] WSREP: PC protocol downgrade 1 -> 0

[Note] WSREP: view((empty))

[Note] WSREP: gcomm: closed

[Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

[Note] WSREP: Flow-control interval: [128, 160]

[Note] WSREP: Received NON-PRIMARY.

[Note] WSREP: Shifting JOINER -> OPEN (TO: 128934)

[Note] WSREP: New SELF-LEAVE.

[Note] WSREP: Flow-control interval: [0, 0]

[Note] WSREP: Received SELF-LEAVE. Closing connection.

[Note] WSREP: Shifting OPEN -> CLOSED (TO: 128934)

[Note] WSREP: RECV thread exiting 0: Success

[Note] WSREP: recv_thread() joined.

[Note] WSREP: Closing replication queue.

[Note] WSREP: Closing slave action queue.

[Note] WSREP: /opt/bitnami/mariadb/sbin/mysqld: Terminated.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

datadir.tgz
3.67 MB
2024-03-01 06:47
node2_before_prepare.tgz
818 kB
2024-03-01 06:59
node2_after_prepare.tgz
812 kB
2024-03-01 07:01
node1.cnf
0.8 kB
2024-03-27 06:40
node2.cnf
0.8 kB
2024-03-27 06:40

Issue Links

is caused by

MDEV-19229 Allow innodb_undo_tablespaces to be changed after database creation

Closed

MDEV-29986 Set innodb_undo_tablespaces=3 by default

Closed

Activity

Ascending order - Click to sort in descending order

Ihtisham ul Haq created issue - 2023-12-08 13:00

Ihtisham ul Haq made changes - 2023-12-08 15:20

Field	Original Value	New Value
Description	After upgrading initially from 10.6->10.11->11.0.3 and now to 11.0.4 we see that 2 of the members start up without any issues but the 3rd member(db-0) fails to start up due to an old sequence no.(117376) that i believe is being passed on from the donor. Logs from member db-0: {code:bash} [Note] WSREP: SST received [Note] WSREP: Server status change joiner -> initializing [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. [Note] mysqld: Aria engine: starting recovery recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 2 1(0.0 seconds); [Note] mysqld: Aria engine: recovery done [Note] InnoDB: Compressed tables use zlib 1.2.11 [Note] InnoDB: Number of transaction pools: 1 [Note] InnoDB: Using crc32 + pclmulqdq instructions [Note] InnoDB: Using Linux native AIO [Note] InnoDB: Initializing buffer pool, total size = 2.000GiB, chunk size = 32.000MiB [Note] InnoDB: Completed initialization of buffer pool [Note] InnoDB: File system buffers for log disabled (block size=512 bytes) [Note] InnoDB: End of log at LSN=180964319 [Note] InnoDB: Resizing redo log from 12.016KiB to 96.000MiB; LSN=180964319 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes) [Note] InnoDB: Reinitializing innodb_undo_tablespaces= 3 from 0 [Note] InnoDB: Data file .//undo001 did not exist: new to be created [Note] InnoDB: Setting file .//undo001 size to 10.000MiB [Note] InnoDB: Database physically writes the file full: wait... [Note] InnoDB: Data file .//undo002 did not exist: new to be created [Note] InnoDB: Setting file .//undo002 size to 10.000MiB [Note] InnoDB: Database physically writes the file full: wait... [Note] InnoDB: Data file .//undo003 did not exist: new to be created [Note] InnoDB: Setting file .//undo003 size to 10.000MiB [Note] InnoDB: Database physically writes the file full: wait... [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active. [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ... [Note] InnoDB: File './ibtmp1' size is now 12.000MiB. [Note] InnoDB: log sequence number 180964319; transaction id 73558 [Note] InnoDB: Loading buffer pool(s) from /bitnami/mariadb/data/ib_buffer_pool [Note] InnoDB: Cannot open '/bitnami/mariadb/data/ib_buffer_pool' for reading: No such file or directory [Note] Plugin 'FEEDBACK' is disabled. [Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files. [Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files. [Note] Recovering after a crash using mysql-bin [Note] Starting table crash recovery... [Note] Crash table recovery finished. [Note] Server socket created on IP: '0.0.0.0'. [Warning] 'proxies_priv' entry '@% root@db-0' ignored in --skip-name-resolve mode. [Note] WSREP: wsrep_init_schema_and_SR (nil) [Note] WSREP: Server initialized [Note] WSREP: Server status change initializing -> initialized [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. [Note] WSREP: Recovered position from storage: 6aa53efc-db72-11ec-880f-a282ce494905:117376 [Note] WSREP: Starting applier thread 6 [Note] WSREP: Starting applier thread 7 [Note] WSREP: Starting applier thread 8 [Note] WSREP: Recovered view from SST: id: 6aa53efc-db72-11ec-880f-a282ce494905:128934 status: primary protocol_version: 4 capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO final: no own_index: 0 members(3): 0: 46250027-95c8-11ee-9d94-06eb82413a07, db 1: 4ed771b5-941f-11ee-82bf-56ec952abcf6, db 2: bfa6e688-941f-11ee-97c5-d323f02068c1, db [ERROR] WSREP: sst_received failed: SST script passed bogus GTID: 6aa53efc-db72-11ec-880f-a282ce494905:117376. Preceding view GTID: 6aa53efc-db72-11ec-880f-a282ce494905:128934 [Note] WSREP: SST received: 00000000-0000-0000-0000-000000000000:-1 [Note] WSREP: Joiner monitor thread ended with total time 14 sec [ERROR] WSREP: Application received wrong state: Received: 00000000-0000-0000-0000-000000000000 Required: 6aa53efc-db72-11ec-880f-a282ce494905 [ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required. [Note] WSREP: ReplicatorSMM::abort() [Note] WSREP: Closing send monitor... [Note] WSREP: Closed send monitor. [Note] WSREP: gcomm: terminating thread [Note] WSREP: gcomm: joining thread [Note] WSREP: gcomm: closing backend [Note] /opt/bitnami/mariadb/sbin/mysqld: ready for connections. Version: '11.0.3-MariaDB-log' socket: '/opt/bitnami/mariadb/tmp/mysql.sock' port: 3306 Source distribution [Note] WSREP: view(view_id(NON_PRIM,46250027-9d94,6636) memb { 46250027-9d94,0 } joined { } left { } partitioned { 4ed771b5-82bf,0 bfa6e688-97c5,0 }) [Note] WSREP: PC protocol downgrade 1 -> 0 [Note] WSREP: view((empty)) [Note] WSREP: gcomm: closed [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1 [Note] WSREP: Flow-control interval: [128, 160] [Note] WSREP: Received NON-PRIMARY. [Note] WSREP: Shifting JOINER -> OPEN (TO: 128934) [Note] WSREP: New SELF-LEAVE. [Note] WSREP: Flow-control interval: [0, 0] [Note] WSREP: Received SELF-LEAVE. Closing connection. [Note] WSREP: Shifting OPEN -> CLOSED (TO: 128934) [Note] WSREP: RECV thread exiting 0: Success [Note] WSREP: recv_thread() joined. [Note] WSREP: Closing replication queue. [Note] WSREP: Closing slave action queue. [Note] WSREP: /opt/bitnami/mariadb/sbin/mysqld: Terminated. {code}	After upgrading initially from 10.6->10.11->11.0.3 and now to 11.0.4 we see that 2 of the members start up without any issues but the 3rd member(db-0) fails to start up due to an old sequence no.(117376) that i believe is being passed on from the donor. We are unable to fine the old seqno any where except in the `ibdata1` file of the donor(by searching for the hex of it). And not sure how to get rid of this old seqno. What we did try is starting db-0 with fresh volume by removing K8s PVC for it, so it does a SST but that fails as well. Logs from member db-0: {code:bash} [Note] WSREP: SST received [Note] WSREP: Server status change joiner -> initializing [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. [Note] mysqld: Aria engine: starting recovery recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 2 1(0.0 seconds); [Note] mysqld: Aria engine: recovery done [Note] InnoDB: Compressed tables use zlib 1.2.11 [Note] InnoDB: Number of transaction pools: 1 [Note] InnoDB: Using crc32 + pclmulqdq instructions [Note] InnoDB: Using Linux native AIO [Note] InnoDB: Initializing buffer pool, total size = 2.000GiB, chunk size = 32.000MiB [Note] InnoDB: Completed initialization of buffer pool [Note] InnoDB: File system buffers for log disabled (block size=512 bytes) [Note] InnoDB: End of log at LSN=180964319 [Note] InnoDB: Resizing redo log from 12.016KiB to 96.000MiB; LSN=180964319 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes) [Note] InnoDB: Reinitializing innodb_undo_tablespaces= 3 from 0 [Note] InnoDB: Data file .//undo001 did not exist: new to be created [Note] InnoDB: Setting file .//undo001 size to 10.000MiB [Note] InnoDB: Database physically writes the file full: wait... [Note] InnoDB: Data file .//undo002 did not exist: new to be created [Note] InnoDB: Setting file .//undo002 size to 10.000MiB [Note] InnoDB: Database physically writes the file full: wait... [Note] InnoDB: Data file .//undo003 did not exist: new to be created [Note] InnoDB: Setting file .//undo003 size to 10.000MiB [Note] InnoDB: Database physically writes the file full: wait... [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active. [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ... [Note] InnoDB: File './ibtmp1' size is now 12.000MiB. [Note] InnoDB: log sequence number 180964319; transaction id 73558 [Note] InnoDB: Loading buffer pool(s) from /bitnami/mariadb/data/ib_buffer_pool [Note] InnoDB: Cannot open '/bitnami/mariadb/data/ib_buffer_pool' for reading: No such file or directory [Note] Plugin 'FEEDBACK' is disabled. [Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files. [Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files. [Note] Recovering after a crash using mysql-bin [Note] Starting table crash recovery... [Note] Crash table recovery finished. [Note] Server socket created on IP: '0.0.0.0'. [Warning] 'proxies_priv' entry '@% root@db-0' ignored in --skip-name-resolve mode. [Note] WSREP: wsrep_init_schema_and_SR (nil) [Note] WSREP: Server initialized [Note] WSREP: Server status change initializing -> initialized [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification. [Note] WSREP: Recovered position from storage: 6aa53efc-db72-11ec-880f-a282ce494905:117376 [Note] WSREP: Starting applier thread 6 [Note] WSREP: Starting applier thread 7 [Note] WSREP: Starting applier thread 8 [Note] WSREP: Recovered view from SST: id: 6aa53efc-db72-11ec-880f-a282ce494905:128934 status: primary protocol_version: 4 capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO final: no own_index: 0 members(3): 0: 46250027-95c8-11ee-9d94-06eb82413a07, db 1: 4ed771b5-941f-11ee-82bf-56ec952abcf6, db 2: bfa6e688-941f-11ee-97c5-d323f02068c1, db [ERROR] WSREP: sst_received failed: SST script passed bogus GTID: 6aa53efc-db72-11ec-880f-a282ce494905:117376. Preceding view GTID: 6aa53efc-db72-11ec-880f-a282ce494905:128934 [Note] WSREP: SST received: 00000000-0000-0000-0000-000000000000:-1 [Note] WSREP: Joiner monitor thread ended with total time 14 sec [ERROR] WSREP: Application received wrong state: Received: 00000000-0000-0000-0000-000000000000 Required: 6aa53efc-db72-11ec-880f-a282ce494905 [ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required. [Note] WSREP: ReplicatorSMM::abort() [Note] WSREP: Closing send monitor... [Note] WSREP: Closed send monitor. [Note] WSREP: gcomm: terminating thread [Note] WSREP: gcomm: joining thread [Note] WSREP: gcomm: closing backend [Note] /opt/bitnami/mariadb/sbin/mysqld: ready for connections. Version: '11.0.3-MariaDB-log' socket: '/opt/bitnami/mariadb/tmp/mysql.sock' port: 3306 Source distribution [Note] WSREP: view(view_id(NON_PRIM,46250027-9d94,6636) memb { 46250027-9d94,0 } joined { } left { } partitioned { 4ed771b5-82bf,0 bfa6e688-97c5,0 }) [Note] WSREP: PC protocol downgrade 1 -> 0 [Note] WSREP: view((empty)) [Note] WSREP: gcomm: closed [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1 [Note] WSREP: Flow-control interval: [128, 160] [Note] WSREP: Received NON-PRIMARY. [Note] WSREP: Shifting JOINER -> OPEN (TO: 128934) [Note] WSREP: New SELF-LEAVE. [Note] WSREP: Flow-control interval: [0, 0] [Note] WSREP: Received SELF-LEAVE. Closing connection. [Note] WSREP: Shifting OPEN -> CLOSED (TO: 128934) [Note] WSREP: RECV thread exiting 0: Success [Note] WSREP: recv_thread() joined. [Note] WSREP: Closing replication queue. [Note] WSREP: Closing slave action queue. [Note] WSREP: /opt/bitnami/mariadb/sbin/mysqld: Terminated. {code}
Environment	bitnami/mariadb-galera:11.0.4-debian-11-r0	bitnami/mariadb-galera:11.0.4-debian-11-r0 in k8s cluster

Ihtisham ul Haq made changes - 2023-12-08 15:22

Description

After upgrading initially from 10.6->10.11->11.0.3 and now to 11.0.4 we see that 2 of the members start up without any issues but the 3rd member(db-0) fails to start up due to an old sequence no.(117376) that i believe is being passed on from the donor.

We are unable to fine the old seqno any where except in the `ibdata1` file of the donor(by searching for the hex of it). And not sure how to get rid of this old seqno.
What we did try is starting db-0 with fresh volume by removing K8s PVC for it, so it does a SST but that fails as well.

Logs from member db-0:

{code:bash}
[Note] WSREP: SST received
[Note] WSREP: Server status change joiner -> initializing
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] mysqld: Aria engine: starting recovery
recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 2 1(0.0 seconds);
[Note] mysqld: Aria engine: recovery done
[Note] InnoDB: Compressed tables use zlib 1.2.11
[Note] InnoDB: Number of transaction pools: 1
[Note] InnoDB: Using crc32 + pclmulqdq instructions
[Note] InnoDB: Using Linux native AIO
[Note] InnoDB: Initializing buffer pool, total size = 2.000GiB, chunk size = 32.000MiB
[Note] InnoDB: Completed initialization of buffer pool
[Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
[Note] InnoDB: End of log at LSN=180964319
[Note] InnoDB: Resizing redo log from 12.016KiB to 96.000MiB; LSN=180964319
[Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
[Note] InnoDB: Reinitializing innodb_undo_tablespaces= 3 from 0
[Note] InnoDB: Data file .//undo001 did not exist: new to be created
[Note] InnoDB: Setting file .//undo001 size to 10.000MiB
[Note] InnoDB: Database physically writes the file full: wait...
[Note] InnoDB: Data file .//undo002 did not exist: new to be created
[Note] InnoDB: Setting file .//undo002 size to 10.000MiB
[Note] InnoDB: Database physically writes the file full: wait...
[Note] InnoDB: Data file .//undo003 did not exist: new to be created
[Note] InnoDB: Setting file .//undo003 size to 10.000MiB
[Note] InnoDB: Database physically writes the file full: wait...
[Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
[Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
[Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
[Note] InnoDB: log sequence number 180964319; transaction id 73558
[Note] InnoDB: Loading buffer pool(s) from /bitnami/mariadb/data/ib_buffer_pool
[Note] InnoDB: Cannot open '/bitnami/mariadb/data/ib_buffer_pool' for reading: No such file or directory
[Note] Plugin 'FEEDBACK' is disabled.
[Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
[Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
[Note] Recovering after a crash using mysql-bin
[Note] Starting table crash recovery...
[Note] Crash table recovery finished.
[Note] Server socket created on IP: '0.0.0.0'.
[Warning] 'proxies_priv' entry '@% root@db-0' ignored in --skip-name-resolve mode.
[Note] WSREP: wsrep_init_schema_and_SR (nil)
[Note] WSREP: Server initialized
[Note] WSREP: Server status change initializing -> initialized
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: Recovered position from storage: 6aa53efc-db72-11ec-880f-a282ce494905:117376
[Note] WSREP: Starting applier thread 6
[Note] WSREP: Starting applier thread 7
[Note] WSREP: Starting applier thread 8
[Note] WSREP: Recovered view from SST:
  id: 6aa53efc-db72-11ec-880f-a282ce494905:128934
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(3):
0: 46250027-95c8-11ee-9d94-06eb82413a07, db
1: 4ed771b5-941f-11ee-82bf-56ec952abcf6, db
2: bfa6e688-941f-11ee-97c5-d323f02068c1, db

[ERROR] WSREP: sst_received failed: SST script passed bogus GTID: 6aa53efc-db72-11ec-880f-a282ce494905:117376. Preceding view GTID: 6aa53efc-db72-11ec-880f-a282ce494905:128934
[Note] WSREP: SST received: 00000000-0000-0000-0000-000000000000:-1
[Note] WSREP: Joiner monitor thread ended with total time 14 sec
[ERROR] WSREP: Application received wrong state:
Received: 00000000-0000-0000-0000-000000000000
Required: 6aa53efc-db72-11ec-880f-a282ce494905
[ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required.
[Note] WSREP: ReplicatorSMM::abort()
[Note] WSREP: Closing send monitor...
[Note] WSREP: Closed send monitor.
[Note] WSREP: gcomm: terminating thread
[Note] WSREP: gcomm: joining thread
[Note] WSREP: gcomm: closing backend
[Note] /opt/bitnami/mariadb/sbin/mysqld: ready for connections.
Version: '11.0.3-MariaDB-log' socket: '/opt/bitnami/mariadb/tmp/mysql.sock' port: 3306 Source distribution
[Note] WSREP: view(view_id(NON_PRIM,46250027-9d94,6636) memb {
46250027-9d94,0
} joined {
} left {
} partitioned {
4ed771b5-82bf,0
bfa6e688-97c5,0
})
[Note] WSREP: PC protocol downgrade 1 -> 0
[Note] WSREP: view((empty))
[Note] WSREP: gcomm: closed
[Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
[Note] WSREP: Flow-control interval: [128, 160]
[Note] WSREP: Received NON-PRIMARY.
[Note] WSREP: Shifting JOINER -> OPEN (TO: 128934)
[Note] WSREP: New SELF-LEAVE.
[Note] WSREP: Flow-control interval: [0, 0]
[Note] WSREP: Received SELF-LEAVE. Closing connection.
[Note] WSREP: Shifting OPEN -> CLOSED (TO: 128934)
[Note] WSREP: RECV thread exiting 0: Success
[Note] WSREP: recv_thread() joined.
[Note] WSREP: Closing replication queue.
[Note] WSREP: Closing slave action queue.
[Note] WSREP: /opt/bitnami/mariadb/sbin/mysqld: Terminated.
{code}

After upgrading initially from 10.6->10.11->11.0.3 and now to 11.0.4 we see that 2 of the members start up without any issues but the 3rd member(db-0) fails to start up due to an old sequence no.(117376) that i believe is being passed on from the donor.

We are unable to find the old seqno anywhere except in the `ibdata1` file of the donor(by searching for the hex of it). But not sure how to get rid of this old seqno.
What we did try is starting db-0 with fresh volume by removing K8s PVC for it, so it does a SST but that fails as well.

Logs from member db-0:

{code:bash}
[Note] WSREP: SST received
[Note] WSREP: Server status change joiner -> initializing
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] mysqld: Aria engine: starting recovery
recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 2 1(0.0 seconds);
[Note] mysqld: Aria engine: recovery done
[Note] InnoDB: Compressed tables use zlib 1.2.11
[Note] InnoDB: Number of transaction pools: 1
[Note] InnoDB: Using crc32 + pclmulqdq instructions
[Note] InnoDB: Using Linux native AIO
[Note] InnoDB: Initializing buffer pool, total size = 2.000GiB, chunk size = 32.000MiB
[Note] InnoDB: Completed initialization of buffer pool
[Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
[Note] InnoDB: End of log at LSN=180964319
[Note] InnoDB: Resizing redo log from 12.016KiB to 96.000MiB; LSN=180964319
[Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
[Note] InnoDB: Reinitializing innodb_undo_tablespaces= 3 from 0
[Note] InnoDB: Data file .//undo001 did not exist: new to be created
[Note] InnoDB: Setting file .//undo001 size to 10.000MiB
[Note] InnoDB: Database physically writes the file full: wait...
[Note] InnoDB: Data file .//undo002 did not exist: new to be created
[Note] InnoDB: Setting file .//undo002 size to 10.000MiB
[Note] InnoDB: Database physically writes the file full: wait...
[Note] InnoDB: Data file .//undo003 did not exist: new to be created
[Note] InnoDB: Setting file .//undo003 size to 10.000MiB
[Note] InnoDB: Database physically writes the file full: wait...
[Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
[Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
[Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
[Note] InnoDB: log sequence number 180964319; transaction id 73558
[Note] InnoDB: Loading buffer pool(s) from /bitnami/mariadb/data/ib_buffer_pool
[Note] InnoDB: Cannot open '/bitnami/mariadb/data/ib_buffer_pool' for reading: No such file or directory
[Note] Plugin 'FEEDBACK' is disabled.
[Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
[Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
[Note] Recovering after a crash using mysql-bin
[Note] Starting table crash recovery...
[Note] Crash table recovery finished.
[Note] Server socket created on IP: '0.0.0.0'.
[Warning] 'proxies_priv' entry '@% root@db-0' ignored in --skip-name-resolve mode.
[Note] WSREP: wsrep_init_schema_and_SR (nil)
[Note] WSREP: Server initialized
[Note] WSREP: Server status change initializing -> initialized
[Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
[Note] WSREP: Recovered position from storage: 6aa53efc-db72-11ec-880f-a282ce494905:117376
[Note] WSREP: Starting applier thread 6
[Note] WSREP: Starting applier thread 7
[Note] WSREP: Starting applier thread 8
[Note] WSREP: Recovered view from SST:
  id: 6aa53efc-db72-11ec-880f-a282ce494905:128934
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(3):
0: 46250027-95c8-11ee-9d94-06eb82413a07, db
1: 4ed771b5-941f-11ee-82bf-56ec952abcf6, db
2: bfa6e688-941f-11ee-97c5-d323f02068c1, db

[ERROR] WSREP: sst_received failed: SST script passed bogus GTID: 6aa53efc-db72-11ec-880f-a282ce494905:117376. Preceding view GTID: 6aa53efc-db72-11ec-880f-a282ce494905:128934
[Note] WSREP: SST received: 00000000-0000-0000-0000-000000000000:-1
[Note] WSREP: Joiner monitor thread ended with total time 14 sec
[ERROR] WSREP: Application received wrong state:
Received: 00000000-0000-0000-0000-000000000000
Required: 6aa53efc-db72-11ec-880f-a282ce494905
[ERROR] WSREP: Application state transfer failed. This is unrecoverable condition, restart required.
[Note] WSREP: ReplicatorSMM::abort()
[Note] WSREP: Closing send monitor...
[Note] WSREP: Closed send monitor.
[Note] WSREP: gcomm: terminating thread
[Note] WSREP: gcomm: joining thread
[Note] WSREP: gcomm: closing backend
[Note] /opt/bitnami/mariadb/sbin/mysqld: ready for connections.
Version: '11.0.3-MariaDB-log' socket: '/opt/bitnami/mariadb/tmp/mysql.sock' port: 3306 Source distribution
[Note] WSREP: view(view_id(NON_PRIM,46250027-9d94,6636) memb {
46250027-9d94,0
} joined {
} left {
} partitioned {
4ed771b5-82bf,0
bfa6e688-97c5,0
})
[Note] WSREP: PC protocol downgrade 1 -> 0
[Note] WSREP: view((empty))
[Note] WSREP: gcomm: closed
[Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
[Note] WSREP: Flow-control interval: [128, 160]
[Note] WSREP: Received NON-PRIMARY.
[Note] WSREP: Shifting JOINER -> OPEN (TO: 128934)
[Note] WSREP: New SELF-LEAVE.
[Note] WSREP: Flow-control interval: [0, 0]
[Note] WSREP: Received SELF-LEAVE. Closing connection.
[Note] WSREP: Shifting OPEN -> CLOSED (TO: 128934)
[Note] WSREP: RECV thread exiting 0: Success
[Note] WSREP: recv_thread() joined.
[Note] WSREP: Closing replication queue.
[Note] WSREP: Closing slave action queue.
[Note] WSREP: /opt/bitnami/mariadb/sbin/mysqld: Terminated.
{code}

Ranjan Ghosh added a comment - 2023-12-08 18:53

I have the same problem and I didn't even update from a very old version. Somehow I cannot start the second node anymore even if I wipe /var/lib/mysql completely on the joining node. The RSync-SST finishes but then I get: "SST script passed bogus GTID" and the node fails. How can I reset this and get out of this situation?

Ranjan Ghosh added a comment - 2023-12-08 18:53 I have the same problem and I didn't even update from a very old version. Somehow I cannot start the second node anymore even if I wipe /var/lib/mysql completely on the joining node. The RSync-SST finishes but then I get: "SST script passed bogus GTID" and the node fails. How can I reset this and get out of this situation?

Ihtisham ul Haq added a comment - 2023-12-08 21:00

What worked was bootstrapping the cluster. And then manual SST for the affected node(probably the SST on startup would also work but we didn't try).

We have other databases which have this issue as well(with 11.0.3) and we thought upgrading to 11.0.4 might fix it due this change. But the issue can be seen even with the change, and we don't want to upgrade/restart those due to risk of running in to the same issue.

Ihtisham ul Haq added a comment - 2023-12-08 21:00 What worked was bootstrapping the cluster. And then manual SST for the affected node(probably the SST on startup would also work but we didn't try). We have other databases which have this issue as well(with 11.0.3) and we thought upgrading to 11.0.4 might fix it due this change . But the issue can be seen even with the change, and we don't want to upgrade/restart those due to risk of running in to the same issue.

Ihtisham ul Haq made changes - 2023-12-08 21:07

Priority

Major [ 3 ]

Critical [ 2 ]

Ihtisham ul Haq added a comment - 2023-12-11 12:44

@RanjanGhosh Have you been able to reproduce this issue, by any chance?

Ihtisham ul Haq added a comment - 2023-12-11 12:44 @RanjanGhosh Have you been able to reproduce this issue, by any chance?

Ranjan Ghosh added a comment - 2023-12-11 15:31

@ihti: I haven't been able to reproduce it but your tip to bootstrap the whole cluster was also what worked for me. I needed to shut down all nodes and start booting them up one-by-one from scratch. It's quite surprising because it's the first time I can think of that could not be solved by delete /var/lib/mysql on the second node if the first node seems to be running without any problems. Or put differently: You cannot somehow "see" that the first node is in a weird state that doesn't allow the second node to come on. Everything seems normal. The first node is running. It accepts queries etc. And then you try to start the second node and it just won't work. Even if you delete /var/lib/mysql completely - it is impossible to bring up the second node. I tried it multiple times. Only after restarting/bootstrapping the first node, things start to work again.

Googling didn't immediately yield an answer for me so I hope people will find this now looking for:

SST script passed bogus GTID

Ranjan Ghosh added a comment - 2023-12-11 15:31 @ihti: I haven't been able to reproduce it but your tip to bootstrap the whole cluster was also what worked for me. I needed to shut down all nodes and start booting them up one-by-one from scratch. It's quite surprising because it's the first time I can think of that could not be solved by delete /var/lib/mysql on the second node if the first node seems to be running without any problems. Or put differently: You cannot somehow "see" that the first node is in a weird state that doesn't allow the second node to come on. Everything seems normal. The first node is running. It accepts queries etc. And then you try to start the second node and it just won't work. Even if you delete /var/lib/mysql completely - it is impossible to bring up the second node. I tried it multiple times. Only after restarting/bootstrapping the first node, things start to work again. Googling didn't immediately yield an answer for me so I hope people will find this now looking for: SST script passed bogus GTID

Ihtisham ul Haq made changes - 2023-12-13 09:25

Priority

Critical [ 2 ]

Blocker [ 1 ]

Ihtisham ul Haq added a comment - 2023-12-19 15:56

Steps to reproduce the issue:

1. Run mariadb-galera cluster with 3 peers in version 10.11.6
We use docker images bitnami/mariadb-galera:10.11.6-debian-11-r0
```
myname="$(hostname -f)";
export MARIADB_GALERA_NODE_NAME="$myname"
export MARIADB_EXTRA_FLAGS="--wsrep-sst-receive-address=$myname"
export MARIADB_GALERA_CLUSTER_ADDRESS="gcomm://$LIST_OF_PEERS
/opt/bitnami/scripts/mariadb-galera/entrypoint.sh /opt/bitnami/scripts/mariadb-galera/run.sh
```
2. Create some test entries in the database.
3. Stop one peer and upgrade it to 11.0.4 (e.g. docker image bitnami/mariadb-galera:11.0.4-debian-11-r1)
4. Create some more entries in the database.
5. Start the updated peer, it will fail.

Ihtisham ul Haq added a comment - 2023-12-19 15:56 Steps to reproduce the issue: 1. Run mariadb-galera cluster with 3 peers in version 10.11.6 We use docker images bitnami/mariadb-galera:10.11.6-debian-11-r0 ``` myname="$(hostname -f)"; export MARIADB_GALERA_NODE_NAME="$myname" export MARIADB_EXTRA_FLAGS="--wsrep-sst-receive-address=$myname" export MARIADB_GALERA_CLUSTER_ADDRESS="gcomm://$LIST_OF_PEERS /opt/bitnami/scripts/mariadb-galera/entrypoint.sh /opt/bitnami/scripts/mariadb-galera/run.sh ``` 2. Create some test entries in the database. 3. Stop one peer and upgrade it to 11.0.4 (e.g. docker image bitnami/mariadb-galera:11.0.4-debian-11-r1) 4. Create some more entries in the database. 5. Start the updated peer, it will fail.

Max Lamprecht added a comment - 2024-01-05 14:49 - edited

We have narrowed it down to commit: https://github.com/MariaDB/server/commit/44dce3b2077e64a1efc607668d0d7b42a7c4ee78

If we set innodb_undo_tablespaces to 0 the start works.

Max Lamprecht added a comment - 2024-01-05 14:49 - edited We have narrowed it down to commit: https://github.com/MariaDB/server/commit/44dce3b2077e64a1efc607668d0d7b42a7c4ee78 If we set innodb_undo_tablespaces to 0 the start works.

Sergei Golubchik made changes - 2024-01-09 14:18

Priority

Blocker [ 1 ]

Critical [ 2 ]

Sergei Golubchik made changes - 2024-01-09 14:18

Assignee

Marko Mäkelä [ marko ]

Sergei Golubchik made changes - 2024-01-09 14:19

Fix Version/s		11.0 [ 28320 ]
Fix Version/s		11.1 [ 28549 ]
Fix Version/s		11.2 [ 28603 ]

Max Lamprecht added a comment - 2024-01-12 11:53

We figured out that we are not able to change the innodb_undo_tablespaces setting online to 3 with the mariabackup wsrep_sst method. This is because with mariabackup there is no clean shutdown of the innodb donor.

We used wsrep_sst_method=mysqldump as a workaround to migrate our galera clusters to the new default

Max Lamprecht added a comment - 2024-01-12 11:53 We figured out that we are not able to change the innodb_undo_tablespaces setting online to 3 with the mariabackup wsrep_sst method. This is because with mariabackup there is no clean shutdown of the innodb donor. We used wsrep_sst_method=mysqldump as a workaround to migrate our galera clusters to the new default

Marko Mäkelä added a comment - 2024-01-12 14:33

Thank you for the detailed bug report, including clear steps to reproduce.

~~MDEV-29986~~ indeed changed innodb_undo_tablespaces=3 by default. This was possible thanks to ~~MDEV-19229~~, which allows the parameter to be changed at server startup, provided that the undo logs are logically empty. Usually this is guaranteed by a prior shutdown with innodb_fast_shutdown=0.

It is unclear to me why the following error would be caused by a change of innodb_undo_tablespaces:

[ERROR] WSREP: sst_received failed: SST script passed bogus GTID: 6aa53efc-db72-11ec-880f-a282ce494905:117376. Preceding view GTID: 6aa53efc-db72-11ec-880f-a282ce494905:128934

InnoDB does not store any GTID information in its internal data structures, such as undo logs. For normal replication, there is a table mysql.gtid_slave_pos, which is in InnoDB format by default. I am not familiar with the GTID and how it plays with Galera.

For Galera, we store a wsrep_checkpoint in a rollback segment header. The storage format was last changed in MariaDB Server 10.3, related to ~~MDEV-15132~~ and ~~MDEV-15158~~. In ~~MDEV-29986~~, in the function srv_undo_tablespaces_reinit(), we made effort to preserve this information. Maybe there is some flaw in this logic?

I’m assigning this to the Galera developers for root cause analysis.

Marko Mäkelä added a comment - 2024-01-12 14:33 Thank you for the detailed bug report, including clear steps to reproduce. MDEV-29986 indeed changed innodb_undo_tablespaces=3 by default. This was possible thanks to MDEV-19229 , which allows the parameter to be changed at server startup, provided that the undo logs are logically empty. Usually this is guaranteed by a prior shutdown with innodb_fast_shutdown=0 . It is unclear to me why the following error would be caused by a change of innodb_undo_tablespaces : [ERROR] WSREP: sst_received failed: SST script passed bogus GTID: 6aa53efc-db72-11ec-880f-a282ce494905:117376. Preceding view GTID: 6aa53efc-db72-11ec-880f-a282ce494905:128934 InnoDB does not store any GTID information in its internal data structures, such as undo logs. For normal replication, there is a table mysql.gtid_slave_pos , which is in InnoDB format by default. I am not familiar with the GTID and how it plays with Galera. For Galera, we store a wsrep_checkpoint in a rollback segment header. The storage format was last changed in MariaDB Server 10.3, related to MDEV-15132 and MDEV-15158 . In MDEV-29986 , in the function srv_undo_tablespaces_reinit() , we made effort to preserve this information. Maybe there is some flaw in this logic? I’m assigning this to the Galera developers for root cause analysis.

Marko Mäkelä made changes - 2024-01-12 14:33

Link

This issue is caused by ~~MDEV-29986~~ [ ~~MDEV-29986~~ ]

Marko Mäkelä made changes - 2024-01-12 14:33

Component/s		Galera [ 10124 ]
Component/s		Galera SST [ 10121 ]
Component/s		Storage Engine - InnoDB [ 10129 ]
Fix Version/s		11.3 [ 28565 ]
Fix Version/s		11.4 [ 29301 ]
Affects Version/s		11.0.2 [ 28706 ]
Affects Version/s		11.0.1 [ 28548 ]
Assignee	Marko Mäkelä [ marko ]	Seppo Jaakola [ seppo ]
Labels		regression upgrade

Jan Lindström made changes - 2024-02-29 12:13

Status

Open [ 1 ]

Confirmed [ 10101 ]

Jan Lindström made changes - 2024-02-29 12:13

Attachment		node1.err [ 73220 ]
Attachment		node2.err [ 73221 ]

Jan Lindström made changes - 2024-02-29 12:15

Attachment		binlog.000002 [ 73222 ]
Attachment		binlog.000003 [ 73223 ]
Attachment		binlog.index [ 73224 ]
Attachment		donor_galera_info [ 73225 ]
Attachment		mariabackup.move.log [ 73226 ]
Attachment		mariabackup.prepare.log [ 73227 ]
Attachment		mariadb_backup_galera_info [ 73228 ]
Attachment		xtrabackup_binlog_pos_innodb [ 73229 ]

Jan Lindström added a comment - 2024-03-01 06:20

marko I tested with 2-node cluster so that node1 uses 10.11 and node2 uses 11.3. I can reproduce the issue i.e. node2 does not join the cluster. I can't reproduce if wsrep_sst_method=rsync or if innodb-undo-tablespaces=3 used on 10.11. Therefore, either mariabackup or server behaves incorrectly on upgrade 10.11 --> 11.3 and it seems to be related on fact that innodb-undo-tablespaces=3 by default on 11.3. From logs I see that joiner does send correct gtid to donor and this information is sent correctly to joiner. However, for some reason either on mariabackup --prepare or next server startup wrong gtid is read from storage. From extended logs I can see that donor has correct gtid stored in storage at the point when mariabackup --backup is done. I'm not familiar how this creation of undo log tablespaces work on upgrade but based on my analysis something goes wrong.

Jan Lindström added a comment - 2024-03-01 06:20 marko I tested with 2-node cluster so that node1 uses 10.11 and node2 uses 11.3. I can reproduce the issue i.e. node2 does not join the cluster. I can't reproduce if wsrep_sst_method=rsync or if innodb-undo-tablespaces=3 used on 10.11. Therefore, either mariabackup or server behaves incorrectly on upgrade 10.11 --> 11.3 and it seems to be related on fact that innodb-undo-tablespaces=3 by default on 11.3. From logs I see that joiner does send correct gtid to donor and this information is sent correctly to joiner. However, for some reason either on mariabackup --prepare or next server startup wrong gtid is read from storage. From extended logs I can see that donor has correct gtid stored in storage at the point when mariabackup --backup is done. I'm not familiar how this creation of undo log tablespaces work on upgrade but based on my analysis something goes wrong.

Jan Lindström made changes - 2024-03-01 06:21

Assignee

Seppo Jaakola [ seppo ]

Marko Mäkelä [ marko ]

Marko Mäkelä added a comment - 2024-03-01 06:27

janlindstrom, can you provide rr replay traces that cover the snapshot transfer and the subsequent failure?

Marko Mäkelä added a comment - 2024-03-01 06:27 janlindstrom , can you provide rr replay traces that cover the snapshot transfer and the subsequent failure?

Marko Mäkelä made changes - 2024-03-01 06:27

Assignee

Marko Mäkelä [ marko ]

Jan Lindström [ JIRAUSER53125 ]

Jan Lindström made changes - 2024-03-01 06:42

Attachment

binlog.000002 [ 73222 ]