[MDEV-4404] Galera Node throws "Could not read field" error and drops out of cluster Created: 2013-04-17  Updated: 2021-07-12  Resolved: 2014-11-18

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 5.5.29-galera
Fix Version/s: 5.5.41-galera

Type: Bug Priority: Major
Reporter: Matthew Wheeler Assignee: Nirbhay Choubey (Inactive)
Resolution: Cannot Reproduce Votes: 3
Labels: galera
Environment:

CentOS release 6.3



 Description   

After a couple of days of running, one node in a 2-node cluster (with arbitrator) will error out with "Could not read field" (error 1610) followed by "Could not execute Update_rows event" (error 1030).

The other node continued running. The field exists in the table. Nodes were initialized using the xtrabackup SST method.

130417 13:54:15 [ERROR] Slave SQL: Could not read field 'UPDT_DB_DTTM' of table 'GLOBAL.CLIENT_LAST_OPEN_PLATFORM', Error_code: 1610
130417 13:54:15 [ERROR] Slave SQL: Could not execute Update_rows event on table GLOBAL.CLIENT_LAST_OPEN_PLATFORM; Got error 1610 from storage engine, Error_code: 1030; handler error No Error!; the event's master log FIRST,
 end_log_pos 272, Error_code: 1030
130417 13:54:15 [Warning] WSREP: RBR event 2 Update_rows apply warning: 1610, 222480722
130417 13:54:15 [ERROR] WSREP: Failed to apply trx: source: 6b76b9ab-a623-11e2-0800-0bcb2fe17662 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 7818 trx_id: 8694488863 seqnos (l: 60013290, g: 222480722, s: 222480721
, d: 222480682, ts: 1366221255138895484)
130417 13:54:15 [ERROR] WSREP: Failed to apply app buffer: seqno: 222480722, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():53
         at galera/src/replicator_smm.cpp:apply_trx_ws():120
130417 13:54:15 [ERROR] WSREP: Node consistency compromized, aborting...
130417 13:54:15 [Note] WSREP: Closing send monitor...
130417 13:54:15 [Note] WSREP: Closed send monitor.
130417 13:54:15 [Note] WSREP: gcomm: terminating thread
130417 13:54:15 [Note] WSREP: gcomm: joining thread
130417 13:54:15 [Note] WSREP: gcomm: closing backend
130417 13:54:15 [Note] WSREP: view(view_id(NON_PRIM,382a82e8-a3d4-11e2-0800-2753044f506b,25) memb {
        382a82e8-a3d4-11e2-0800-2753044f506b,
} joined {
} left {
} partitioned {
        6b76b9ab-a623-11e2-0800-0bcb2fe17662,
        ebe9df82-9e2c-11e2-0800-65ae0b1b9cdc,
})
130417 13:54:15 [Note] WSREP: view((empty))
130417 13:54:15 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130417 13:54:15 [Note] WSREP: gcomm: closed
130417 13:54:15 [Note] WSREP: Flow-control interval: [16, 16]
130417 13:54:15 [Note] WSREP: Received NON-PRIMARY.
130417 13:54:15 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 222480778)
130417 13:54:15 [Note] WSREP: Received self-leave message.
130417 13:54:15 [Note] WSREP: Flow-control interval: [0, 0]
130417 13:54:15 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130417 13:54:15 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 222480778)
130417 13:54:15 [Note] WSREP: RECV thread exiting 0: Success
130417 13:54:15 [Note] WSREP: recv_thread() joined.
130417 13:54:15 [Note] WSREP: Closing slave action queue.
130417 13:54:15 [Note] WSREP: /usr/sbin/mysqld: Terminated.
130417 13:54:16 mysqld_safe Number of processes running now: 0
130417 13:54:16 mysqld_safe WSREP: not restarting wsrep node automatically
130417 13:54:16 mysqld_safe mysqld from pid file /database/data/4c-maria-02.pid ended

From our settings:

# logs and replication
log-bin=mysql-bin
binlog-format=ROW
max_binlog_cache_size=1024G
 
# Galera Settings
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://10.200.0.10
wsrep_cluster_name='client01_cluster'
wsrep_node_name='4c-maria-02'
wsrep_slave_threads=24
wsrep_retry_autocommit=10
wsrep_sst_method=xtrabackup
wsrep_sst_auth=galera:AhwFVAahpZfh8BVG
 
 
# innodb (xtradb) settings
default_storage_engine=InnoDB
innodb_file_per_table
innodb_file_format=barracuda
innodb_log_file_size=2000M
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=2
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1

MariaDB-galera was installed from repos.



 Comments   
Comment by Matthew Wheeler [ 2013-04-17 ]

BTW - this has happened twice in the last week. Each time we reinitialized the affected node. It was a different node each time, and a different table and field each time.

Comment by Elena Stepanova [ 2013-04-22 ]

Hi Matthew,

Did GRA_* log files with this event appear in the datadir when the failure happened? Could you please provide them, along with the complete error log?
If the binary log and general log are available, they might help too (they're probably big, but you could archive and upload them to ftp.askmonty.org).

Thanks.

Comment by Matthew Wheeler [ 2013-04-22 ]

Yep, I have the logs available for both times this has happened. We are running on a ZFS file system, and I took snapshots of the data and log directories after each incident.

I'll send the .err, the GRA_, and the last binary log file (mysql-bin.000.). If there is anything else you need, let me know.

Where should I upload them on the FTP site?

In the .err file you'll see that we were trying to do some bulk loads. For some reason this is not working for us at all, even with data sets as small as 10 records: they load into the node the statement is run on, but the replication does not happen. You'll see the error there. I'll submit this as a separate issue; I just wanted to explain what was going on before the crash. We were testing that some time before, and it involved separate tables, so I don't think it affected this.

Thanks.

Comment by Elena Stepanova [ 2013-04-22 ]

>> Where should i upload them to on the ftp site?
The 'private' folder at ftp.askmonty.org is for uploads. Please put MDEV-4404 somewhere in the archive name so that we can find it.

Comment by Matthew Wheeler [ 2013-04-22 ]

OK. Files uploaded: MDEV-4404.tar.gz. It contains two folders with the files from each occurrence of this problem.

Thanks for your help.

Comment by Ryo Tagami [ 2013-04-23 ]

I had a similar problem with a 2-node MariaDB Galera Cluster, where one node dies with the following error (the other node continues to work).

130423 1:08:01 [ERROR] Slave SQL: Could not read field 'inventory_link' of table 'zabbix.items', Error_code: 1610
130423 1:08:01 [ERROR] Slave SQL: Could not execute Update_rows event on table zabbix.items; Got error 1610 from storage engine, Error_code: 1030; handler error No Error!; the event's master log FIRST, end_log_pos 19692, Error_code: 1030
130423 1:08:01 [Warning] WSREP: RBR event 48 Update_rows apply warning: 1610, 658709
130423 1:08:01 [ERROR] WSREP: Failed to apply trx: source: af2f0dcd-ab00-11e2-0800-0b205d0005d5 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 28 trx_id: 27751540 seqnos (l: 64548, g: 658709, s: 658708, d: 658708, ts: 1366646881273136683)
130423 1:08:01 [ERROR] WSREP: Failed to apply app buffer: seqno: 658709, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():53
at galera/src/replicator_smm.cpp:apply_trx_ws():120
130423 1:08:01 [ERROR] WSREP: Node consistency compromized, aborting...

I've uploaded a tarball containing the GRA_*.log, my.cnf, the binary log and the error log as MDEV-4404-2.tar.bz2 to ftp.askmonty.org/private .
I hope this helps.

Comment by Elena Stepanova [ 2013-04-23 ]

Matthew, Ryo, thank you.

Hi Seppo,

Could it be another manifestation of MDEV-4328?

Comment by Matthew Wheeler [ 2013-05-13 ]

We just had this same error happen again on an otherwise stable cluster. We have effectively been using the cluster as master-slave: reads only from one node, and reads and writes on the other. The same error occurred on the "slave" node.

I have snapshots of the system after the crash if you would like another example of this.

Comment by Aleksey Sanin (Inactive) [ 2013-05-21 ]

Same here:

[root@devdb01 ~]# cat /var/log/mysqld/error.log
130521 1:24:46 [ERROR] Slave SQL: Could not read field 'version' of table '_phpunit_1369097057.accounts', Error_code: 1610
130521 1:24:46 [ERROR] Slave SQL: Could not execute Update_rows event on table _phpunit_1369097057.accounts; Got error 1610 from storage engine, Error_code: 1030; handler error No Error!; the event's master log FIRST, end_log_pos 1008, Error_code: 1030
130521 1:24:46 [Warning] WSREP: RBR event 6 Update_rows apply warning: 1610, 99885084
130521 1:24:46 [ERROR] WSREP: Failed to apply trx: source: 2e5275ca-9025-11e2-0800-c71fc764db36 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 177600 trx_id: 359777583 seqnos (l: 2450385, g: 99885084, s: 99885083, d: 99885082, ts: 1369223354944731000)
130521 1:24:46 [ERROR] WSREP: Failed to apply app buffer: seqno: 99885084, status: WSREP_FATAL
at galera/src/replicator_smm.cpp:apply_wscoll():53
at galera/src/replicator_smm.cpp:apply_trx_ws():120
130521 1:24:46 [ERROR] WSREP: Node consistency compromized, aborting...
130521 1:24:46 [Note] WSREP: Closing send monitor...
130521 1:24:46 [Note] WSREP: Closed send monitor.
130521 1:24:46 [Note] WSREP: gcomm: terminating thread
130521 1:24:46 [Note] WSREP: gcomm: joining thread
130521 1:24:46 [Note] WSREP: gcomm: closing backend
130521 1:24:46 [Note] WSREP: view(view_id(NON_PRIM,22b01156-c0df-11e2-0800-301a01976742,12) memb {
22b01156-c0df-11e2-0800-301a01976742,
} joined {
} left {
} partitioned {
2e5275ca-9025-11e2-0800-c71fc764db36,
e7b1cabc-9024-11e2-0800-bbd99c93b31f,
})
130521 1:24:46 [Note] WSREP: view((empty))
130521 1:24:46 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130521 1:24:46 [Note] WSREP: gcomm: closed
130521 1:24:46 [Note] WSREP: Flow-control interval: [16, 16]
130521 1:24:46 [Note] WSREP: Received NON-PRIMARY.
130521 1:24:46 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 99885088)
130521 1:24:46 [Note] WSREP: Received self-leave message.
130521 1:24:46 [Note] WSREP: Flow-control interval: [0, 0]
130521 1:24:46 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130521 1:24:46 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 99885088)
130521 1:24:46 [Note] WSREP: RECV thread exiting 0: Success
130521 1:24:46 [Note] WSREP: recv_thread() joined.
130521 1:24:46 [Note] WSREP: Closing slave action queue.
130521 1:24:46 [Note] WSREP: /usr/sbin/mysqld: Terminated.

Comment by Matthew Wheeler [ 2013-06-12 ]

Happened again today.

Took a snapshot and can post any/all of the logs needed to help diagnose.

It has been almost a month since the last crash. We are running with one box as a read-only "slave", which seems to lengthen the time between crashes.

Thanks.

Comment by Seppo Jaakola [ 2013-08-20 ]

The error is thrown when unpacking a replicated ROW event in rpl_record.cc:unpack_row():

        pack_ptr= f->unpack(f->ptr, pack_ptr, row_buffer_end, metadata);
        DBUG_PRINT("debug", ("field: %s; metadata: 0x%x;"
                             " pack_ptr: 0x%lx; pack_ptr': 0x%lx; bytes: %d",
                             f->field_name, metadata,
                             (ulong) old_pack_ptr, (ulong) pack_ptr,
                             (int) (pack_ptr - old_pack_ptr)));
        if (!pack_ptr)
        {
          rli->report(ERROR_LEVEL, ER_SLAVE_CORRUPT_EVENT,
                      "Could not read field '%s' of table '%s.%s'",
                      f->field_name, table->s->db.str,
                      table->s->table_name.str);
          DBUG_RETURN(ER_SLAVE_CORRUPT_EVENT);
        }

The check for pack_ptr and the throwing of the error are MariaDB-specific code.

Comment by Seppo Jaakola [ 2013-08-20 ]

Analyzed Ryo's logs from the zabbix database.
The GRA* file contains the full buffer of replication events that caused the crash, so I was expecting to see a corrupted file. However, mysqlbinlog has no problem parsing the file. The file contains 318 replication events; event #48 caused the crash when unpacking field #47 ('inventory_link'). In the mysqlbinlog output this event looks fine.
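As an aside, mysqlbinlog cannot read a raw GRA_*.log directly, because these apply-failure dumps contain only the row events without the binlog file header. A sketch of the usual workaround is below; the file names are placeholders, and the 120-byte header length is an assumption that depends on the server version:

```shell
# GRA_*.log files lack the binlog magic bytes and format-description event,
# so mysqlbinlog rejects them as-is. Prepend a header taken from a real
# binlog produced by the same server version (header length varies by
# version; 120 bytes here is illustrative only).
head -c 120 mysql-bin.000001 > gra-header.bin
cat gra-header.bin GRA_1_2.log > GRA_1_2.binlog
mysqlbinlog --base64-output=decode-rows -vv GRA_1_2.binlog
```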

Comment by Seppo Jaakola [ 2013-08-20 ]

Did an experiment with an MGC 5.5.32 node, which was modified to read replication events from the zabbix GRA_* file. It turns out that MGC can process the full file (all 318 events). The updates from the GRA file do not succeed, as my database is empty, but there is nevertheless no problem parsing the events.

So, all in all, it looks like this issue is concurrency-related: something interferes with the parsing of replication events.

Comment by Seppo Jaakola [ 2013-08-20 ]

There is no clear explanation for this symptom: the node crashed on a corrupted replication event, yet the same replication events, as analyzed now, appear to be valid.

I added a log message to show more information about the replication event after the corruption has been detected. This log message will be in the next 5.5.32 release; the actual fix for this bug must wait for the release after 5.5.32.

Log message pushed in revision: http://bazaar.launchpad.net/~maria-captains/maria/maria-5.5-galera/revision/3413

Comment by Matthew Wheeler [ 2013-09-25 ]

I am happy to report that this happened again, now after we have updated to 5.5.32 with the extra logging. I have a snapshot of both nodes in the cluster. Please let me know which files you wish me to send you.

You have never seen a bunch of programmers and admins so happy over a crash!!

Comment by Seppo Jaakola [ 2013-09-27 ]

Matthew, this is interesting indeed! Please upload your MySQL error logs from both nodes. If you have a GRA_*.log file related to the crash, it will be needed as well. If the information is sensitive (the GRA file contains the transaction data in plain text), you can email it directly to me (seppo.jaakola@codership.com).

We recently tracked a problem related to binlog event annotation processing. If you have binlog_annotate_row_events enabled, it may affect the issue you are facing.
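For reference, a quick way to check whether annotation events are enabled; this is a sketch assuming a local `mysql` client with sufficient privileges, and the connection options are placeholders:

```shell
# Prints the current value of binlog_annotate_row_events
# (OFF by default in MariaDB 5.5).
mysql -u root -p -e "SHOW GLOBAL VARIABLES LIKE 'binlog_annotate_row_events'"
```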

Comment by Matthew Wheeler [ 2013-09-27 ]

No sensitive info but I emailed the files to you anyway. Let me know if you didn't get them.

I included our cnf files. We do not have binlog_annotate_row_events set, and I don't think it is enabled by default.

Let me know if you need anything else.

Thanks again.

Comment by Matthew Wheeler [ 2014-10-17 ]

After the last couple of updates we have not seen this issue again.

This could be closed.

Comment by Elena Stepanova [ 2014-10-17 ]

nirbhay_c,

Could you please confirm it should be closed, and close if it should be?

Comment by Eugene Pankov [ 2021-07-12 ]

I've just seen this today with MariaDB 10.2.30, without Xtrabackup and with `binlog_annotate_row_events`. All nodes in the cluster shut themselves down the moment one of them ran out of /tmp space.

The log mentions an existing VARCHAR field.

Log: https://gist.githubusercontent.com/Eugeny/e14e39e24db07a1c258565480a835fc8/raw/2a6b6259632696dc761d18cbc53d670998af8343/galera.log

Generated at Thu Feb 08 06:56:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.