Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-21431

Insufficient debugging information

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.4.10
    • Fix Version/s: 10.4
    • Component/s: Debug, Galera
    • Labels:
      None
    • Environment:
      Linux 4.19.91-gentoo #1 SMP x86_64 AMD EPYC 7451 AMD CPU, GNU/Linux

      Description

      While trying to debug (mis)behavior of Galera cluster (galera-26.4.3 with mariadb-10.4.10) faced with following issues:

      1) Enabling debug options requires node restart
      2) Important information (connection details) is missing from logs even in debug mode
      3) Some constants are mentioned in logs and documentation without explanation of their meaning and values

      Original situation

      Log is flooded with messages like

      2020-01-07 10:09:15 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354549, -110 (Connection timed out)
      2020-01-07 10:09:17 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354560, -110 (Connection timed out)
      2020-01-07 10:09:23 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354612, -110 (Connection timed out)
      2020-01-07 10:09:47 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354986, -110 (Connection timed out)
      2020-01-07 10:09:48 0 [Warning] WSREP: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6354995, -110 (Connection timed out)
      

      So, some connection to either client or one of cluster nodes is failing with timeout and attempt to get details on what and where is failing was taken with setting

      SET GLOBAL wsrep_debug=1;
      SET GLOBAL wsrep_provider_options='debug=1';
      

      From this point, several things we got are different from what was expected.

      1. Minor issue: changes to wsrep_provider_options:

      Expected behabior:
      output of

      show variables like 'wsrep_provider_options' \G
      Value: base_dir = /var/lib/mysql/; base_host = ***.***.***.36; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no;...
      

      changes to

      show variables like 'wsrep_provider_options' \G
      Value: base_dir = /var/lib/mysql/; base_host = ***.***.***.36; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = yes;...
      

      Actual behavior:
      wsrep_provider_options don't change, because according to https://mariadb.com/kb/en/galera-cluster-system-variables/#wsrep_provider_options it is not dynamic, so enabling this option requires node restart - an unwanted action.

      2. Main problem: debug output.

      As setting GLOBAL wsrep_debug=1 enabled partially debug output, additional information was logged. But it is useless as even in debug mode connection details not printed.

      Actual behavior:

      Logged lines are:

      --
      2020-01-07 10:48:45 1380127 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 24348367 not found for release
      2020-01-07 10:48:45 1380127 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 18446744073709551615 not found for release
      2020-01-07 10:48:45 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait
      2020-01-07 10:48:45 0 [Note] [Debug] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():251: send monitor wait timed out, waited for PT1S
      2020-01-07 10:48:45 0 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6386924, -110 (Connection timed out)
      2020-01-07 10:48:45 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait
      2020-01-07 10:48:45 1380132 [Note] WSREP: assigned new next trx id: 24348385
      --
      2020-01-07 10:48:46 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait
      2020-01-07 10:48:46 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait
      2020-01-07 10:48:46 1380370 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait
      2020-01-07 10:48:46 0 [Note] [Debug] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():251: send monitor wait timed out, waited for PT1S
      2020-01-07 10:48:46 0 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6386935, -110 (Connection timed out)
      2020-01-07 10:48:46 1380646 [Note] WSREP: assigned new next trx id: 24348515
      2020-01-07 10:48:46 1380646 [Note] WSREP: wsrep_commit_empty(1380646)
      --
      2020-01-07 10:48:48 1380681 [Note] WSREP: wsrep_commit_empty(1380681)
      2020-01-07 10:48:48 1380681 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 24348732 not found for release
      2020-01-07 10:48:48 1380681 [Note] [Debug] WSREP: galera/src/wsrep_provider.cpp:galera_release():770: trx 18446744073709551615 not found for release
      2020-01-07 10:48:48 0 [Note] [Debug] WSREP: gcs/src/gcs_sm.hpp:_gcs_sm_enqueue_common():251: send monitor wait timed out, waited for PT2S
      2020-01-07 10:48:48 0 [Warning] WSREP: galera/src/galera_service_thd.cpp:thd_func():60: Failed to report last committed be****1e-2e04-11ea-b073-bb328****238:6386951, -110 (Connection timed out)
      2020-01-07 10:48:48 1281515 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait
      2020-01-07 10:48:48 1281515 [Note] [Debug] WSREP: gcs/src/gcs_core.cpp:core_msg_send_retry():307: Backend requested wait
      

      These lines refer to gcs/src/gcs_core.cpp, but miss any detail about connection that timed out: was it unix socket ot TCP, local or from some client or from another node in the cluster? No any way to find what connection times out and what causes these warning messages flooding logs.

      Expected behavior:

      Connection details (socket name/local IP/remote IP) are printed into logs with debug mode on.

      3. Additional question.

      Documentation about parameters seems to be lacking details.

      Actual state:

      Constants PT2S/PT1S/PT3S and so on are not mentioned in documentation, so the only guess that 'P' is either for 'Percona' or 'Precise', 'T' is for 'Time' or 'Timeout' and 1S/2S/3S is value in seconds (have seen also examples of 'PT1M' that is probably 1 minute somewhere).

      Expected state:

      Page https://mariadb.com/kb/en/wsrep_provider_options/ should either explain these mentioned constants or have a link to explanation.

        Attachments

          Activity

            People

            Assignee:
            jplindst Jan Lindström
            Reporter:
            euglorg Eugene
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Dates

              Created:
              Updated: