Details
-
Bug
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
10.4.10
-
Linux 4.19.91-gentoo #1 SMP x86_64 AMD EPYC 7451 AMD CPU, GNU/Linux
Description
While trying to debug (mis)behavior of Galera cluster (galera-26.4.3 with mariadb-10.4.10) faced with following issues:
1) Enabling debug options requires node restart
2) Important information (connection details) is missing from logs even in debug mode
3) Some constants are mentioned in logs and documentation without explanation of their meaning and values
Original situation
Log is flooded with messages like
2020-01-07 10:09:15 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354549, -110 (Connection timed out) |
2020-01-07 10:09:17 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354560, -110 (Connection timed out) |
2020-01-07 10:09:23 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354612, -110 (Connection timed out) |
2020-01-07 10:09:47 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354986, -110 (Connection timed out) |
2020-01-07 10:09:48 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354995, -110 (Connection timed out) |
So, some connection to either client or one of cluster nodes is failing with timeout and attempt to get details on what and where is failing was taken with setting
SET GLOBAL wsrep_debug=1; |
SET GLOBAL wsrep_provider_options='debug=1'; |
From this point, several things we got are different from what was expected.
1. Minor issue: changes to wsrep_provider_options:
Expected behabior:
output of
show variables like 'wsrep_provider_options' \G |
Value: base_dir = /var/lib/mysql/; base_host = ***.***.***.36; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no;... |
changes to
show variables like 'wsrep_provider_options' \G |
Value: base_dir = /var/lib/mysql/; base_host = ***.***.***.36; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = yes;... |
Actual behavior:
wsrep_provider_options don't change, because according to https://mariadb.com/kb/en/galera-cluster-system-variables/#wsrep_provider_options it is not dynamic, so enabling this option requires node restart - an unwanted action.
2. Main problem: debug output.
As setting GLOBAL wsrep_debug=1 enabled partially debug output, additional information was logged. But it is useless as even in debug mode connection details not printed.
Actual behavior:
Logged lines are:
--
|
2020-01-07 10:48:45 1380127 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 24348367 not found for release |
2020-01-07 10:48:45 1380127 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 18446744073709551615 not found for release |
2020-01-07 10:48:45 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait |
2020-01-07 10:48:45 0 [Note] [Debug] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():251: send monitor wait timed out, waited for PT1S |
2020-01-07 10:48:45 0 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6386924, -110 (Connection timed out) |
2020-01-07 10:48:45 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait |
2020-01-07 10:48:45 1380132 [Note] WSREP: assigned new next trx id: 24348385 |
--
|
2020-01-07 10:48:46 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait |
2020-01-07 10:48:46 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait |
2020-01-07 10:48:46 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait |
2020-01-07 10:48:46 0 [Note] [Debug] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():251: send monitor wait timed out, waited for PT1S |
2020-01-07 10:48:46 0 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6386935, -110 (Connection timed out) |
2020-01-07 10:48:46 1380646 [Note] WSREP: assigned new next trx id: 24348515 |
2020-01-07 10:48:46 1380646 [Note] WSREP: wsrep_commit_empty(1380646) |
--
|
2020-01-07 10:48:48 1380681 [Note] WSREP: wsrep_commit_empty(1380681) |
2020-01-07 10:48:48 1380681 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 24348732 not found for release |
2020-01-07 10:48:48 1380681 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 18446744073709551615 not found for release |
2020-01-07 10:48:48 0 [Note] [Debug] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():251: send monitor wait timed out, waited for PT2S |
2020-01-07 10:48:48 0 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6386951, -110 (Connection timed out) |
2020-01-07 10:48:48 1281515 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait |
2020-01-07 10:48:48 1281515 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait |
These lines refer to gcs/src/gcs_core.cpp, but miss any detail about connection that timed out: was it unix socket ot TCP, local or from some client or from another node in the cluster? No any way to find what connection times out and what causes these warning messages flooding logs.
Expected behavior:
Connection details (socket name/local IP/remote IP) are printed into logs with debug mode on.
3. Additional question.
Documentation about parameters seems to be lacking details.
Actual state:
Constants PT2S/PT1S/PT3S and so on are not mentioned in documentation, so the only guess that 'P' is either for 'Percona' or 'Precise', 'T' is for 'Time' or 'Timeout' and 1S/2S/3S is value in seconds (have seen also examples of 'PT1M' that is probably 1 minute somewhere).
Expected state:
Page https://mariadb.com/kb/en/wsrep_provider_options/ should either explain these mentioned constants or have a link to explanation.