[MDEV-28070] mysqd signal 11 - Thread pointer: 0x7f7910005768 - BF lock wait long for Created: 2022-03-15  Updated: 2022-05-15  Resolved: 2022-05-15

Status: Closed
Project: MariaDB Server
Component/s: wsrep
Affects Version/s: 10.5.15, 10.5
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Jaroslav Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 0
Labels: None


 Description   

It seems that our mariadb dies on it's own with following on latest 10.5.15 and 10.5.13 and takes a while to properly start again
Is this some kind of bug?
The setup is multi-master (master/master/master) using galera/wsrep.
The config is:

  galera.cnf: |
    [galera]
    user = mysql
    bind-address = 0.0.0.0
    default_storage_engine = InnoDB
    binlog_format = ROW
    innodb_autoinc_lock_mode = 2
    innodb_flush_log_at_trx_commit = 0
    query_cache_size = 0
    query_cache_type = 0
    binlog_cache_size = 61440
 
    # MariaDB Galera settings
    wsrep_on=ON
    wsrep_provider=/usr/lib/galera/libgalera_smm.so
    wsrep_sst_method=rsync
    wsrep_slave_threads=8
    wsrep_sync_wait=7
 
    # Cluster settings (automatically updated)
    wsrep_cluster_address=gcomm://
    wsrep_cluster_name=mysql
    wsrep_node_address=127.0.0.1
  mariadb.cnf: "[client]\ndefault-character-set = utf8\n[mysqld]\ncore-file\nunix_socket
    = OFF\nperformance_schema = ON\ncharacter-set-server = utf8\ncollation-server
    = utf8_general_ci\nignore-db-dirs = lost+found \nmax_connections = 250\ninteractive_timeout
    = 450 \nwait_timeout = 450\ntable_definition_cache = 2100\n# InnoDB tuning\ninnodb_buffer_pool_size
    = 7000MB\ninnodb_log_file_size = 1600MB\n"

2022-03-14  1:15:00 680302 [Warning] Aborted connection 680302 to db: 'storage' user: 'root' host: 'PROPERHOST' (Got an error reading communication packets)
220314  1:16:14 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.15-MariaDB-1:10.5.15+maria~focal
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=59
max_threads=252
thread_count=68
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 685818 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f7910005768
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f7930198d58 thread_stack 0x49000
mysqld(my_print_stacktrace+0x32)[0x55d8aa63ed32]
mysqld(handle_fatal_signal+0x485)[0x55d8aa086995]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7f7b439d23c0]
Printing to addr2line failed
mysqld(_ZN13st_join_table17save_explain_dataEP20Explain_table_accessybPS_+0x2df)[0x55d8a9ecfa5f]
mysqld(_ZN4JOIN24save_explain_data_internEP13Explain_querybbbPKc+0xbc2)[0x55d8a9ed1762]
mysqld(_ZN4JOIN17save_explain_dataEP13Explain_querybbbb+0x1f8)[0x55d8a9ed1a28]
mysqld(_ZN4JOIN13build_explainEv+0x83)[0x55d8a9ed1b03]
mysqld(_ZN4JOIN8optimizeEv+0x9a)[0x55d8a9edc41a]
mysqld(_ZN13st_select_lex31optimize_unflattened_subqueriesEb+0x12e)[0x55d8a9e49b9e]
mysqld(_ZN4JOIN15optimize_stage2Ev+0x1644)[0x55d8a9ed6e84]
mysqld(_ZN4JOIN14optimize_innerEv+0x1bbf)[0x55d8a9eda03f]
mysqld(_ZN4JOIN8optimizeEv+0xc3)[0x55d8a9edc443]
mysqld(_Z12mysql_selectP3THDP10TABLE_LISTR4ListI4ItemEPS4_jP8st_orderS9_S7_S9_yP13select_resultP18st_select_lex_unitP13st_select_lex+0xb7)[0x55d8a9edc517]
mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x157)[0x55d8a9edcf57]
mysqld(+0x76b6c1)[0x55d8a9e696c1]
mysqld(_Z21mysql_execute_commandP3THD+0x436b)[0x55d8a9e7842b]
mysqld(_ZN18Prepared_statement7executeEP6Stringb+0x465)[0x55d8a9e89a25]
mysqld(_ZN18Prepared_statement12execute_loopEP6StringbPhS2_+0x89)[0x55d8a9e89bf9]
mysqld(+0x78cab5)[0x55d8a9e8aab5]
mysqld(_Z19mysqld_stmt_executeP3THDPcj+0x30)[0x55d8a9e8acf0]
mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x22d7)[0x55d8a9e71957]
mysqld(_Z10do_commandP3THD+0x11c)[0x55d8a9e733bc]
mysqld(_Z24do_handle_one_connectionP7CONNECTb+0x421)[0x55d8a9f7b391]
mysqld(handle_one_connection+0x5d)[0x55d8a9f7b80d]
mysqld(+0xbe540f)[0x55d8aa2e340f]
2022-03-14  1:16:15 678169 [Warning] Aborted connection 678169 to db: 'scheduler' user: 'root' host: 'REDACTED' (Got timeout reading communication packets)
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7f7b439c6609]
2022-03-14  1:16:51 680400 [Warning] Aborted connection 680400 to db: 'unconnected' user: 'user' host: 'REDACTED' (Got an error reading communication packets)
2022-03-14  1:17:01 680403 [Warning] Aborted connection 680403 to db: 'unconnected' user: 'user' host: 'REDACTED' (Got an error reading communication packets)
2022-03-14  1:17:05 0 [Note] InnoDB: WSREP: BF lock wait long for trx:57511324 query: update volume set placement_properties = ? where id = ?^�.b
2022-03-14  1:17:05 0 [Note] InnoDB: WSREP: BF lock wait long for trx:57511325 query: insert into volume(created_at, updated_at, uuid, external_uuid, vendor_id, vendor_subnet_id, allocation_id, pool_id, creation_token, display_name, external_name, service_level, security_style, snap_reserve, snapshot_directory, quota_in_bytes, protocol_types, restricted_actions, life_cycle_state, life_cycle_state_details, account_id, is_data_protection, clone_snapshot_id, storage_class, kerberos_enabled, throughput, network_proximity_info, network_proximity, storage_availability_zone, availability_zone, placement_properties, ldap_enabled, smb_share_settings, cool_access, coolness_period, is_data_store, regional_ha, unix_permissions, is_quota_enabled, constituent_volumes_per_aggregate, ontap_volume_style, enable_subvolumes, encrypted) select now(), now(), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? from dual where not exists (select id from volume where deleted_at is null and c
2022-03-14  1:17:05 0 [Note] InnoDB: WSREP: BF lock wait long for trx:57511327 query: insert into volume(created_at, updated_at, uuid, external_uuid, vendor_id, vendor_subnet_id, allocation_id, pool_id, creation_token, display_name, external_name, service_level, security_style, snap_reserve, snapshot_directory, quota_in_bytes, protocol_types, restricted_actions, life_cycle_state, life_cycle_state_details, account_id, is_data_protection, clone_snapshot_id, storage_class, kerberos_enabled, throughput, network_proximity_info, network_proximity, storage_availability_zone, availability_zone, placement_properties, ldap_enabled, smb_share_settings, cool_access, coolness_period, is_data_store, regional_ha, unix_permissions, is_quota_enabled, constituent_volumes_per_aggregate, ontap_volume_style, enable_subvolumes, encrypted) select now(), now(), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? from dual where not exists (select id from volume where deleted_at is null and c
2022-03-14  1:17:06 0 [Note] InnoDB: WSREP: BF lock wait long for trx:57511324 query: update volume set placement_properties = ? where id = ?^�.b
2022-03-14  1:17:06 0 [Note] InnoDB: WSREP: BF lock wait long for trx:57511325 query: insert into volume(created_at, updated_at, uuid, external_uuid, vendor_id, vendor_subnet_id, allocation_id, pool_id, creation_token, display_name, external_name, service_level, security_style, snap_reserve, snapshot_directory, quota_in_bytes, protocol_types, restricted_actions, life_cycle_state, life_cycle_state_details, account_id, is_data_protection, clone_snapshot_id, storage_class, kerberos_enabled, throughput, network_proximity_info, network_proximity, storage_availability_zone, availability_zone, placement_properties, ldap_enabled, smb_share_settings, cool_access, coolness_period, is_data_store, regional_ha, unix_permissions, is_quota_enabled, constituent_volumes_per_aggregate, ontap_volume_style, enable_subvolumes, encrypted) select now(), now(), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? from dual where not exists (select id from volume where deleted_at is null and c
2022-03-14  1:17:06 0 [Note] InnoDB: WSREP: BF lock wait long for trx:57511327 query: insert into volume(created_at, updated_at, uuid, external_uuid, vendor_id, vendor_subnet_id, allocation_id, pool_id, creation_token, display_name, external_name, service_level, security_style, snap_reserve, snapshot_directory, quota_in_bytes, protocol_types, restricted_actions, life_cycle_state, life_cycle_state_details, account_id, is_data_protection, clone_snapshot_id, storage_class, kerberos_enabled, throughput, network_proximity_info, network_proximity, storage_availability_zone, availability_zone, placement_properties, ldap_enabled, smb_share_settings, cool_access, coolness_period, is_data_store, regional_ha, unix_permissions, is_quota_enabled, constituent_volumes_per_aggregate, ontap_volume_style, enable_subvolumes, encrypted) select now(), now(), ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? from dual where not exists (select id from volume where deleted_at is null and c



 Comments   
Comment by Jan Lindström (Inactive) [ 2022-03-15 ]

jaroslav Can you please provide full error log, table structure for volume, node config files. Can you easily reproduce the issue? What topology you use master<->slave or is it multi-master? Stack trace does not immediately look Galera related but those BF long waits are, but they could be harmless. Can you use --wsrep-debug=1 setting?

Comment by Jaroslav [ 2022-03-15 ]

We don't have simply way to reproduce it. It happens during some of stress tests when we put higher amount of calls to database and it goes simply down after a while with above issue.
Ive added additional details to description now.

In meantime I'm going to try to reproduce using wsrep_debug=1 option in our config.

Generated at Thu Feb 08 09:57:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.