MDEV-27475 (MariaDB Server)

Internal MariaDB error code: 1677 causes Galera Cluster to hang

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 10.5.12
    • Fix Version/s: N/A
    • Component/s: Galera, Galera SST
    • Environment: Debian Buster

    Description

      We hit a serious problem on a 3-node Galera cluster.
      One of the nodes started reporting errors on a corrupt InnoDB table.

      To fix this, we decided to take one node out of the cluster and re-sync it using mysqldump SST.
      Unlike mariabackup, a mysqldump-based sync would not carry the InnoDB tablespace corruption over to the synced node. A plain mysqldump of the affected table worked fine.

      Once the sync completed, MariaDB could not apply a replicated write on the newly synced node, and the whole cluster locked up.
      Writes hung in the Commit state on all nodes.
      Restarting MariaDB on the other nodes hung forever at shutdown.
      We finally decided to kill all nodes and bootstrap the first node again to bring everything back.
      But of course, this should never happen.

      The relevant logs:
      2022-01-11 20:00:12 20 [Note] WSREP: SST setting local position to dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201,0-1-23343303 current 00000000-0000-0000-0000-000000000000:-1
      2022-01-11 20:00:12 20 [Note] WSREP: SST received
      2022-01-11 20:00:12 20 [Note] WSREP: Server status change joiner -> joined
      2022-01-11 20:00:12 20 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2022-01-11 20:00:12 20 [Note] WSREP: Recovered position from storage: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
      2022-01-11 20:00:12 20 [Note] WSREP: Recovered view from SST:
      id: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
      status: primary
      protocol_version: 4
      capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
      final: no
      own_index: 2
      members(3):
      0: 3279978f-72e1-11ec-bb44-1a42bcfa4247, cust-db002
      1: 49b7b512-67db-11ec-b003-935f043b4930, cust-db001
      2: af861566-72f2-11ec-ad10-77dfd16d0b73, cust-db003

      2022-01-11 20:00:12 20 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      2022-01-11 20:00:12 13666 [Note] WSREP: Recovered cluster id dd6f969e-55fb-11eb-9e2b-cf04bf47000f
      2022-01-11 20:00:12 20 [Note] WSREP: SST received: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
      2022-01-11 20:00:12 20 [Note] WSREP: SST succeeded for position dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
      2022-01-11 20:00:12 20 [Note] WSREP: wsrep_start_position set to 'dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201,0-1-23343303'
      2022-01-11 20:00:12 5 [Note] WSREP: Installed new state from SST: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
      2022-01-11 20:00:12 5 [Note] WSREP: Cert. index preload up to 40776201
      2022-01-11 20:00:12 0 [Note] WSREP: ####### IST applying starts with 40776202
      2022-01-11 20:00:12 0 [Note] WSREP: ####### IST current seqno initialized to 40776162
      2022-01-11 20:00:12 0 [Note] WSREP: Receiving IST... 0.0% ( 0/40 events) complete.
      2022-01-11 20:00:12 0 [Note] WSREP: IST preload starting at 40776162
      2022-01-11 20:00:12 0 [Note] WSREP: Service thread queue flushed.
      2022-01-11 20:00:12 0 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:40776161, protocol version: 5
      2022-01-11 20:00:12 0 [Note] WSREP: REPL Protocols: 10 (5)
      2022-01-11 20:00:12 0 [Note] WSREP: ####### Adjusting cert position: 40776200 -> 40776201
      2022-01-11 20:00:12 0 [Note] WSREP: Service thread queue flushed.
      2022-01-11 20:00:12 0 [Note] WSREP: Lowest cert index boundary for CC from preload: 40776162
      2022-01-11 20:00:12 0 [Note] WSREP: Min available from gcache for CC from preload: 40776162
      2022-01-11 20:00:12 0 [Note] WSREP: Receiving IST...100.0% (40/40 events) complete.
      2022-01-11 20:00:12 5 [Note] WSREP: IST received: dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776201
      2022-01-11 20:00:12 5 [Note] WSREP: Lowest cert index boundary for CC from sst: 40776162
      2022-01-11 20:00:12 5 [Note] WSREP: Min available from gcache for CC from sst: 40776162
      2022-01-11 20:00:12 0 [Note] WSREP: 2.0 (cust-db003): State transfer from 1.0 (cust-db001) complete.
      2022-01-11 20:00:12 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 40810268)
      2022-01-11 20:00:12 0 [Note] WSREP: 1.0 (cust-db001): State transfer to 2.0 (cust-db003) complete.
      2022-01-11 20:00:16 0 [Note] WSREP: Member 1.0 (cust-db001) synced with group.
      2022-01-11 20:00:20 5 [ERROR] Slave SQL: Column 46 of table 'cust.t_member' cannot be converted from type 'timestamp' to type 'int(11)', Internal MariaDB error code: 1677
      2022-01-11 20:00:20 5 [Warning] WSREP: Event 15 Write_rows_v1 apply failed: 3, seqno 40776671
      2022-01-11 20:00:20 0 [Note] WSREP: Member 2(cust-db003) initiates vote on dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776671,cddba1600e18e4f4:
      2022-01-11 20:00:20 0 [Note] WSREP: Member 0(cust-db002) responds to vote on dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776671,0000000000000000: Success
      2022-01-11 20:00:20 0 [Warning] WSREP: Received bogus VOTE message: 40776671.0, from node 3279978f-72e1-11ec-bb44-1a42bcfa4247, expected > 40810261. Ignoring.
      2022-01-11 20:00:20 0 [Note] WSREP: Member 1(cust-db001) responds to vote on dd6f969e-55fb-11eb-9e2b-cf04bf47000f:40776671,0000000000000000: Success
      2022-01-11 20:00:20 0 [Warning] WSREP: Received bogus VOTE message: 40776671.0, from node 49b7b512-67db-11ec-b003-935f043b4930, expected > 40810256. Ignoring.
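
      To make the seqno mismatch in the last few lines easier to see, here is a minimal sketch that pulls the failed apply and the "bogus VOTE" warnings out of a log excerpt and prints how far the voted seqno lies below the value each message says it expected to exceed. The names `summarize_votes` and `SAMPLE` are illustrative assumptions and not part of MariaDB or Galera; the regular expressions are written only against the exact message wording quoted above and may not match other versions.

```python
import re

# Patterns match only the message wording seen in the excerpt above.
APPLY_FAILED = re.compile(r"apply failed: \d+, seqno (\d+)")
BOGUS_VOTE = re.compile(
    r"Received bogus VOTE message: (\d+)\.\d+, "
    r"from node ([0-9a-f-]+), expected > (\d+)"
)


def summarize_votes(log_text):
    """Return (failed_seqnos, bogus_votes) found in an error log excerpt."""
    failed = [int(m.group(1)) for m in APPLY_FAILED.finditer(log_text)]
    bogus = [
        {
            "seqno": int(m.group(1)),
            "node": m.group(2),
            "expected_gt": int(m.group(3)),
        }
        for m in BOGUS_VOTE.finditer(log_text)
    ]
    return failed, bogus


# Three lines copied verbatim from the log excerpt in this report.
SAMPLE = """\
2022-01-11 20:00:20 5 [Warning] WSREP: Event 15 Write_rows_v1 apply failed: 3, seqno 40776671
2022-01-11 20:00:20 0 [Warning] WSREP: Received bogus VOTE message: 40776671.0, from node 3279978f-72e1-11ec-bb44-1a42bcfa4247, expected > 40810261. Ignoring.
2022-01-11 20:00:20 0 [Warning] WSREP: Received bogus VOTE message: 40776671.0, from node 49b7b512-67db-11ec-b003-935f043b4930, expected > 40810256. Ignoring.
"""

failed, bogus = summarize_votes(SAMPLE)
print("failed seqnos:", failed)
for vote in bogus:
    gap = vote["expected_gt"] - vote["seqno"]
    print(f"{vote['node']}: voted seqno {vote['seqno']} is {gap} below the expected bound")
```

      On the lines above this reports gaps of 33590 and 33585, so the vote for seqno 40776671 refers to a position well below what the other members reported as expected. Whether that gap is what stalled the cluster is only a guess on my side.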

      The strange thing is that nothing was changed on that table that day:
      2022-01-11 20:00:20 5 [ERROR] Slave SQL: Column 46 of table 'cust.t_member' cannot be converted from type 'timestamp' to type 'int(11)', Internal MariaDB error code: 1677
      So why would it give an error on this?

      Also, the corrupt table that caused MariaDB to crash that afternoon is not the table named in the conversion error.

          People

            Assignee: janlindstrom Jan Lindström
            Reporter: dupondje Jean-Louis Dupond