[MDEV-15261] WSREP: Failed to report last committed Created: 2018-02-09  Updated: 2019-12-12  Resolved: 2019-12-12

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.18, 10.2.14
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Rodney Antonio Ramos Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug Votes: 1
Labels: None
Environment:

3 nodes running mysql Ver 15.1 Distrib 10.1.18-MariaDB, for Linux (x86_64) using readline 5.1.

  • Hosts: Physical Hosts running CentOS 7.4.1708 (Core), 32 CPUs, 64 GB RAM

One node is the master and the others are slaves.

  1. MariaDB Configuration:

[mysqld]
datadir=/mysql/data/mysql
socket=/mysql/data/mysql/mysql.sock
open_files_limit = 65535
max_connections = 500
user=mysql
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_provider_options="gcache.size=50G; gcache.page_size=1024M; gcs.fc_limit=1000000; gcs.fc_master_slave=yes"
wsrep_cluster_name="zabbixdb_cluster"
wsrep_cluster_address="gcomm://10.1.1.1,10.1.1.2,10.1.1.3"
wsrep_on=ON
wsrep_node_name=AAAAAA
wsrep_slave_threads=32

wsrep_sst_auth=root:xxxxxxx
wsrep_sst_method=xtrabackup

innodb_autoextend_increment = 256
innodb_buffer_pool_instances = 52
innodb_buffer_pool_size = 52G
innodb_concurrency_tickets = 5000
innodb_file_per_table = 1
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M
innodb_log_files_in_group = 4
innodb_old_blocks_time = 1000
innodb_open_files = 2048
innodb_stats_on_metadata = OFF
innodb_lock_wait_timeout = 50
innodb_io_capacity = 2000

optimizer_switch = 'index_condition_pushdown=off'

large-pages
binlog-row-event-max-size = 8192
character_set_server = utf8
collation_server = utf8_bin
expire_logs_days = 1
join_buffer_size = 262144
max_allowed_packet = 32M
max_connect_errors = 10000
max_heap_table_size = 134217728
query_cache_type = 0
query_cache_size = 0
slow-query-log = ON
table_open_cache = 2048
thread_cache_size = 64
tmp_table_size = 134217728
wait_timeout = 86400


Issue Links:
Relates
relates to MDEV-17550 Improve WSREP's "Failed to report las... Open

 Description   

Every node in the cluster is logging warnings similar to this:

[Warning] WSREP: Failed to report last committed 211651504, -4 (Interrupted system call)

At this point, one node starts the flow-control mechanism and my application stops.

Running the myq_status tool, I can see that the "Queue down" value on one node reaches the fc_limit and flow control starts.

I must stop MariaDB on this node for the application to come back up.

I have one master node and two slave nodes. I could see this behavior on the slave nodes only.

I couldn't find out why one node is starting flow control. In /var/log/messages I can see only the "WSREP: Failed to report last committed" messages. Nothing more.

Any suggestion?



 Comments   
Comment by Zdravelina Sokolovska (Inactive) [ 2018-02-16 ]

Hello rodneyra, are you receiving the same warning messages with v10.1.30?

Comment by Rodney Antonio Ramos [ 2018-02-16 ]

Hello winstone! I'm using v10.1.18.

The my.cnf is in the "Environment" header.

I'm trying to upgrade to v10.2.13 in my test environment first, but I'm having some difficulties.

Do you think it would be a good idea to upgrade to 10.1.31 first?

My production environment is very big, almost 750 GB of data, and I must be very careful when making any change.

Thanks!

Comment by Rodney Antonio Ramos [ 2018-02-20 ]

Hello winstone!

I'm planning to upgrade my MariaDB to the 10.1.31 release on March 3rd.

Do you think that is a good idea, or should I upgrade straight to the 10.2.13 release?

Thanks!

Comment by Zdravelina Sokolovska (Inactive) [ 2018-02-20 ]

Hello rodneyra, it would be better to wait for the next version before upgrading, due to a current problem in both 10.1.31 and 10.2.13.

Comment by Rodney Antonio Ramos [ 2018-02-21 ]

Sorry, winstone. I didn't understand.

Should I upgrade to the 10.1.30 release?

Aren't the 10.1.31 and 10.2.13 releases stable?

Thanks.

Comment by Rodney Antonio Ramos [ 2018-06-18 ]

I've upgraded to 10.1.31 and the problem is the same.

The wsrep_local_recv_queue starts to increase on one node and I don't know why.

There are no error or warning log messages anymore.

At the moment, my wsrep_local_recv_queue holds more than 4 million write-sets and the Galera cluster does not apply them.

Can someone help me? There is nothing in the log, even with debug enabled.
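
For anyone tracking the same symptom, here is a minimal sketch of how the receive queue and flow-control counters could be summarized from a status dump. It assumes the input is the two-column, tab-separated output of `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"`; the function names and the `queue_threshold` default are hypothetical and chosen only for illustration.

```python
def parse_wsrep_status(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict of strings."""
    status = {}
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) == 2:
            status[parts[0]] = parts[1]
    return status


def summarize(status, queue_threshold=1000):
    """Return plain-text observations; the threshold is an arbitrary assumption."""
    notes = []
    queue = int(status.get("wsrep_local_recv_queue", 0))
    sent = int(status.get("wsrep_flow_control_sent", 0))
    paused = float(status.get("wsrep_flow_control_paused", 0.0))
    if queue > queue_threshold:
        notes.append(f"receive queue holds {queue} write-sets")
    if sent > 0:
        notes.append(f"this node sent {sent} flow-control pause messages")
    if paused > 0.0:
        notes.append(f"replication was paused {paused:.0%} of the time")
    return notes


if __name__ == "__main__":
    # Made-up sample values, not taken from the reporter's cluster.
    sample = (
        "wsrep_local_recv_queue\t4000000\n"
        "wsrep_flow_control_paused\t0.85\n"
        "wsrep_flow_control_sent\t12\n"
    )
    for note in summarize(parse_wsrep_status(sample)):
        print(note)
```

Running it against each node in turn may show which node is sending the pause messages, that is, which node is the slow applier.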

Comment by Zdravelina Sokolovska (Inactive) [ 2018-09-17 ]

Enabling wsrep debug does not give any useful information either.
1. The logs need to provide more detail about the [Warning] WSREP: Failed to report last committed message.
2. The docs also need descriptions of the warning codes, for example:
-107 (Transport endpoint is not connected)
-110 (Connection timed out)
-77 (File descriptor in bad state)
-4 (Interrupted system call)
and suggested actions per code, as the issue is related to
reduced performance and service degradation of the entire cluster.
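
Until such documentation exists, the warnings can at least be counted per code. The following is a rough sketch that assumes the log lines follow the format quoted in the description ("Failed to report last committed <seqno>, <code> (<message>)"); the regular expression and the `tally_codes` helper are hypothetical, not part of any MariaDB or Galera tooling.

```python
import re
from collections import Counter

# Pattern derived from the warning text quoted in this issue.
WARNING_RE = re.compile(
    r"WSREP: Failed to report last committed (\d+), (-?\d+) \(([^)]*)\)"
)


def tally_codes(lines):
    """Count warnings per (code, message) pair; non-matching lines are skipped."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(int(match.group(2)), match.group(3))] += 1
    return counts


if __name__ == "__main__":
    # The first line is the one from the description; the second is unrelated filler.
    log = [
        "[Warning] WSREP: Failed to report last committed 211651504, -4 (Interrupted system call)",
        "[Note] some unrelated line",
    ]
    for (code, message), count in tally_codes(log).most_common():
        print(code, message, count)
```

Feeding it the lines of /var/log/messages would show whether one code dominates, which is the kind of detail a per-code action list would need.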

Comment by brianr [ 2018-10-04 ]

I'm on MariaDB 10.2.18 and have experienced the exact same thing since upgrading from 10.2.9.
(Galera-3: 25.3.24 (Ubuntu 16.04))
Any suggestions on how to troubleshoot?

I see approximately 100 warnings daily, with either (-110) or (-77), predominantly the former.

It's crippling the cluster performance: each node has 4 CPUs, which are fighting to serve the mysqld --wsrep_start_position=.... processes and the abundance of threads they create (one process with 146 threads).
(I have to ask... Does it create a thread for each session?)

The galeracluster.com support pages offer just about zero options to investigate, so I hope this approach kicks me in a better direction?

Best regards
Brian

Comment by Geoff Montee (Inactive) [ 2019-06-12 ]

The "Failed to report last committed" messages don't necessarily mean that something is wrong. See MDEV-17550.

If your cluster is under a lot of load and you are experiencing performance problems, then you may be seeing the effects of flow control. See the following to find out how to configure flow control parameters:

http://galeracluster.com/documentation-webpages/documentation/managing-fc.html
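
As a starting point for reviewing those parameters, the sketch below pulls the flow-control settings out of a wsrep_provider_options string such as the one in the Environment section. The `parse_provider_options` helper is hypothetical; it only splits the semicolon-separated `key=value` pairs and does not validate them against Galera.

```python
def parse_provider_options(value):
    """Split a wsrep_provider_options string into a dict of option -> value."""
    options = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        options[key.strip()] = val.strip()
    return options


if __name__ == "__main__":
    # Value copied from the reporter's my.cnf.
    raw = (
        "gcache.size=50G; gcache.page_size=1024M; "
        "gcs.fc_limit=1000000; gcs.fc_master_slave=yes"
    )
    opts = parse_provider_options(raw)
    # Print only the flow-control (gcs.fc_*) settings.
    for name in sorted(opts):
        if name.startswith("gcs.fc_"):
            print(name, "=", opts[name])
```

With the reporter's configuration this prints gcs.fc_limit = 1000000 and gcs.fc_master_slave = yes. A limit that high means a node's receive queue can grow to a million write-sets before it asks the cluster to pause.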

Generated at Thu Feb 08 08:19:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.