[MDEV-27475] Internal MariaDB error code: 1677 causes Galera Cluster to hang Created: 2022-01-12  Updated: 2023-05-12  Resolved: 2023-05-12

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.5.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Jean-Louis Dupond Assignee: Jan Lindström
Resolution: Incomplete Votes: 0
Labels: galera
Environment:

Debian Buster



 Description   

On a 3-node Galera cluster we had an important issue.
One of the nodes started giving issues with an InnoDB table that was corrupt.

To fix this, we decided to take out 1 node, and enable mysqldump SST to get it back in sync.
This way that synced node would not sync the InnoDB tablespace corruptions with it as it does with mariabackup. And mysqldump on the table just worked fine.

Now when the sync completed, MariaDB couldn't apply a statement on the newly synced node, and started to lock the whole cluster.
Writes were hanging in Commit state on all nodes.
Restarting mariadb on the other nodes did hang on shutdown forever.
We finally decided to kill all nodes, and bootstrap the first node again to bring everything back.
But ofcourse, this should never happen

The relevant logs:
2022-01-11 20:00:12 20 [Note] WSREP: SST setting local position to dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201,0-1-23343303 current 00000000-0000-0000-0000-000000000000:-1
2022-01-11 20:00:12 20 [Note] WSREP: SST received
2022-01-11 20:00:12 20 [Note] WSREP: Server status change joiner -> joined
2022-01-11 20:00:12 20 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-01-11 20:00:12 20 [Note] WSREP: Recovered position from storage: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
2022-01-11 20:00:12 20 [Note] WSREP: Recovered view from SST:
id: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
status: primary
protocol_version: 4
capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
final: no
own_index: 2
members(3):
0: 3279978f-72e1-11ec-bb44-1a42bcfa4247, cust-db002
1: 49b7b512-67db-11ec-b003-935f043b4930, cust-db001
2: af861566-72f2-11ec-ad10-77dfd16d0b73, cust-db003

2022-01-11 20:00:12 20 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2022-01-11 20:00:12 13666 [Note] WSREP: Recovered cluster id dd6f969e-55fb-11eb-9e2b-cf04bf47000f
2022-01-11 20:00:12 20 [Note] WSREP: SST received: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
2022-01-11 20:00:12 20 [Note] WSREP: SST succeeded for position dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
2022-01-11 20:00:12 20 [Note] WSREP: wsrep_start_position set to 'dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201,0-1-23343303'
2022-01-11 20:00:12 5 [Note] WSREP: Installed new state from SST: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
2022-01-11 20:00:12 5 [Note] WSREP: Cert. index preload up to 40776201
2022-01-11 20:00:12 0 [Note] WSREP: ####### IST applying starts with 40776202
2022-01-11 20:00:12 0 [Note] WSREP: ####### IST current seqno initialized to 40776162
2022-01-11 20:00:12 0 [Note] WSREP: Receiving IST... 0.0% ( 0/40 events) complete.
2022-01-11 20:00:12 0 [Note] WSREP: IST preload starting at 40776162
2022-01-11 20:00:12 0 [Note] WSREP: Service thread queue flushed.
2022-01-11 20:00:12 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:40776161, protocol version: 5
2022-01-11 20:00:12 0 [Note] WSREP: REPL Protocols: 10 (5)
2022-01-11 20:00:12 0 [Note] WSREP: ####### Adjusting cert position: 40776200 -> 40776201
2022-01-11 20:00:12 0 [Note] WSREP: Service thread queue flushed.
2022-01-11 20:00:12 0 [Note] WSREP: Lowest cert index boundary for CC from preload: 40776162
2022-01-11 20:00:12 0 [Note] WSREP: Min available from gcache for CC from preload: 40776162
2022-01-11 20:00:12 0 [Note] WSREP: Receiving IST...100.0% (40/40 events) complete.
2022-01-11 20:00:12 5 [Note] WSREP: IST received: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
2022-01-11 20:00:12 5 [Note] WSREP: Lowest cert index boundary for CC from sst: 40776162
2022-01-11 20:00:12 5 [Note] WSREP: Min available from gcache for CC from sst: 40776162
2022-01-11 20:00:12 0 [Note] WSREP: 2.0 (cust-db003): State transfer from 1.0 (cust-db001) complete.
2022-01-11 20:00:12 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 40810268)
2022-01-11 20:00:12 0 [Note] WSREP: 1.0 (cust-db001): State transfer to 2.0 (cust-db003) complete.
2022-01-11 20:00:16 0 [Note] WSREP: Member 1.0 (cust-db001) synced with group.
2022-01-11 20:00:20 5 [ERROR] Slave SQL: Column 46 of table 'cust.t_member' cannot be converted from type 'timestamp' to type 'int(11)', Internal MariaDB error code: 1677
2022-01-11 20:00:20 5 [Warning] WSREP: Event 15 Write_rows_v1 apply failed: 3, seqno 40776671
2022-01-11 20:00:20 0 [Note] WSREP: Member 2(cust-db003) initiates vote on dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776671,cddba1600e18e4f4:
2022-01-11 20:00:20 0 [Note] WSREP: Member 0(cust-db002) responds to vote on dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776671,0000000000000000: Success
2022-01-11 20:00:20 0 [Warning] WSREP: Received bogus VOTE message: 40776671.0, from node 3279978f-72e1-11ec-bb44-1a42bcfa4247, expected > 40810261. Ignoring.
2022-01-11 20:00:20 0 [Note] WSREP: Member 1(cust-db001) responds to vote on dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776671,0000000000000000: Success
2022-01-11 20:00:20 0 [Warning] WSREP: Received bogus VOTE message: 40776671.0, from node 49b7b512-67db-11ec-b003-935f043b4930, expected > 40810256. Ignoring.

The strange thing is nothing was changed on that table that day.
2022-01-11 20:00:20 5 [ERROR] Slave SQL: Column 46 of table 'cust.t_member' cannot be converted from type 'timestamp' to type 'int(11)', Internal MariaDB error code: 1677
So why would it give an error on this?

And the table that causes MariaDB to crash in the afternoon that was corrupt is not the same table as stated in the conversion error.



 Comments   
Comment by Jean-Louis Dupond [ 2022-01-12 ]

Table looks like this:
`aa` varchar(200) DEFAULT NULL,
`ab` varchar(200) DEFAULT NULL,
`ac` int(11) DEFAULT NULL,
`ad` varchar(160) DEFAULT NULL,
`ae` bigint(10) DEFAULT NULL,
`af` varchar(200) DEFAULT NULL,
`ag` bigint(10) DEFAULT NULL,
`ah` bigint(10) DEFAULT NULL,
`ai` bigint(10) DEFAULT NULL,
`aj` bigint(10) DEFAULT NULL,
`ak` bigint(10) DEFAULT NULL,
`al` bigint(10) DEFAULT NULL,
`am` varchar(1600) DEFAULT NULL,
`an` bigint(10) NOT NULL,
`ao` int(11) DEFAULT NULL,
`ap` int(11) DEFAULT NULL,
`aq` varchar(800) DEFAULT NULL,
`ar` varchar(400) DEFAULT NULL,
`as` varchar(80) DEFAULT NULL,
`at` datetime DEFAULT NULL,
`au` bigint(10) DEFAULT NULL,
`av` bigint(10) DEFAULT NULL,
`aw` varchar(3200) DEFAULT NULL,
`ax` int(11) DEFAULT NULL,
`ay` int(11) DEFAULT NULL,
`az` datetime DEFAULT NULL,
`ba` int(11) DEFAULT NULL,
`bb` int(11) DEFAULT NULL,
`bc` double DEFAULT NULL,
`bd` bigint(10) DEFAULT NULL,
`be` int(11) DEFAULT NULL,
`bf` bigint(10) DEFAULT NULL,
`bg` int(11) DEFAULT NULL,
`bh` varchar(1600) DEFAULT NULL,
`bi` datetime DEFAULT NULL,
`bj` varchar(200) DEFAULT NULL,
`bk` varchar(200) DEFAULT NULL,
`bl` int(11) DEFAULT NULL,
`bm` int(11) DEFAULT NULL,
`bn` varchar(200) DEFAULT NULL,
`bo` int(11) DEFAULT NULL,
`bp` varchar(500) DEFAULT NULL,
`bq` bigint(10) DEFAULT NULL,
`br` datetime DEFAULT NULL,
`bs` datetime DEFAULT NULL,
`bt` datetime DEFAULT NULL,
`bu` int(11) DEFAULT NULL,
PRIMARY KEY (`an`),
UNIQUE KEY `UI_MEMBER_ap_an` (`ap`,`an`),
KEY `NUI_MEMBER_al` (`al`),
KEY `XIF28T_MEMBER` (`aj`),
KEY `XIF27T_MEMBER` (`ae`),
KEY `XIF26T_MEMBER` (`ah`),
KEY `XIF24T_MEMBER` (`ag`),
KEY `XIF21T_MEMBER` (`ak`),
KEY `NQ_T_MEMBER_au` (`au`),
KEY `XIF25T_MEMBER` (`ai`),
KEY `NUI_MEMBER_aa` (`aa`),
KEY `NUI_MEMBER_ab` (`ab`),
CONSTRAINT `FK_MEMBER_al` FOREIGN KEY (`al`) REFERENCES `t_al` (`ID`),
CONSTRAINT `t_member_ibfk_1` FOREIGN KEY (`ak`) REFERENCES `t_cc` (`ID`),
CONSTRAINT `t_member_ibfk_2` FOREIGN KEY (`ag`) REFERENCES `t_dd` (`ID`),
CONSTRAINT `t_member_ibfk_3` FOREIGN KEY (`ai`) REFERENCES `t_cc` (`ID`),
CONSTRAINT `t_member_ibfk_4` FOREIGN KEY (`ah`) REFERENCES `t_ee` (`ID`),
CONSTRAINT `t_member_ibfk_5` FOREIGN KEY (`ae`) REFERENCES `t_cc` (`ID`),
CONSTRAINT `t_member_ibfk_6` FOREIGN KEY (`aj`) REFERENCES `t_cc` (`ID`),
CONSTRAINT `t_member_ibfk_7` FOREIGN KEY (`an`) REFERENCES `t_ff` (`ID`)

On the db003 (which was just synced), the wresp thread was Sleeping and showed:

5 system user   NULL Sleep 3073 wsrep applied write set INSERT INTO T_MEMBER (aa, ab, ac, ad, ae, af, ag, ah, ai, aj, ak, al, am, ao, ap, aq, ar, as, at, au, av, aw, ax, ay, az, ba,bb, bd, be,bf,bg,bl,bm,bn, bq, an) VALUES ('David', 'W.', 0, NULL, NULL, NULL, 1, 2, NULL, NULL, NULL, 445447, NULL, 0, -1, NULL, NULL, NULL, '2022-01-11 16:27:16.144', 1084, -1, NULL, 0, 0, NULL, 0, 1, NULL, 0, -1, 0, 1 ,1,NULL,NULL,445744) 0.000
Comment by Jan Lindström [ 2023-04-11 ]

dupondje With given table structure, I could not even create table as there are syntax errors. However, I have few questions:

  • Why do you use mysqldump ? While it is supported, there are issues using timestamp format and different timezones. I suggest using rsync or mariabackup
  • Why this server has different type on that column compared to what is expected (timestamp)?
  • Can you provide output of mysqldump?
Generated at Thu Feb 08 09:53:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.