[MDEV-10939] innodb.innodb_monitor failed with a timeout in buildbot on valgrind Created: 2016-10-02  Updated: 2020-07-16  Resolved: 2020-07-16

Status: Closed
Project: MariaDB Server
Component/s: Tests
Affects Version/s: 10.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Cannot Reproduce Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-12052 Shutdown crash presumably due to mast... Closed
relates to MDEV-7069 Fix buildbot failures in main server ... Stalled

 Description   

It's the first time it ever happened, according to the cross-reference.
The test normally takes 6-7 seconds, this time it timed out after 9000 seconds, so it's not just the builder being especially slow.
http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/9387/steps/test/logs/stdio

innodb.innodb_monitor 'innodb_plugin'    w1 [ fail ]  timeout after 9000 seconds
        Test ended at 2016-10-01 05:24:30
 
Test case timeout after 9000 seconds
 
== /mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/1/log/innodb_monitor.log == 
select name from information_schema.innodb_metrics
where status != IF(name like "log%", 'enabled', 'disabled');
name
set global innodb_monitor_enable="os_%a_fs_ncs";
set global innodb_monitor_enable="os%pending%";
select name, status from information_schema.innodb_metrics
where name like "os%";
name	status
os_data_reads	disabled
os_data_writes	disabled
os_data_fsyncs	enabled
os_pending_reads	enabled
os_pending_writes	enabled
os_log_bytes_written	disabled
os_log_fsyncs	disabled
os_log_pending_fsyncs	enabled
os_log_pending_writes	enabled
set global innodb_monitor_enable="";
ERROR 42000: Variable 'innodb_monitor_enable' can't be set to the value of ''
set global innodb_monitor_enable="_";
 
 == /mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/1/tmp/analyze-timeout-mysqld.1.err ==
mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local MySQL server through socket '/mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/tmp/1/mysqld.1.sock' (111 "Connection refused")



 Comments   
Comment by Marko Mäkelä [ 2017-02-16 ]

There is not much in the server error log either:

Version: '10.0.28-MariaDB-debug'  socket: '/mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/tmp/1/mysqld.1.sock'  port: 16000  Source distribution
2016-10-01 02:52:59 40cc950 [Info] InnoDB: the file format in the system tablespace is now set to Barracuda.
2016-10-01 02:52:59 40cc950 [Info] InnoDB: the file format in the system tablespace is now set to Antelope.
CURRENT_TEST: innodb.innodb_gis
CURRENT_TEST: innodb.innodb_information_schema
CURRENT_TEST: innodb.innodb_monitor
161001  5:23:38 [ERROR] mysqld got signal 6 ;

9000 seconds is 2½ hours, which roughly matches the time stamp difference above (02:52:49 + 2½ hours = 05:22:49).

Even though the server was killed with -ABRT on the timeout, this was a Valgrind run, and Valgrind would require a special measures to dump a core of the instrumented process.
Apparently the hang occurs while executing the following SQL:

# Reset counters only in "module_metadata" module
set global innodb_monitor_disable = module_metadata;

The subsystem MONITOR_MODULE_METADATA that this symbol refers to appears to be completely unused, even in MySQL 5.7.

If the hang is related to the SQL statement (it could be something else too), then the most likely cause would be something in the function innodb_monitor_update(). That function is sending some output to the client connection and some to the server error log, but it is not using any synchronization primitives explicitly.

There is a monitor_mutex that serializes access to all monitor counters when HAVE_ATOMIC_BUILTINS_64 is not set. Maybe some other thread was stuck while holding this mutex? Does the build configuration on this platform define HAVE_ATOMIC_BUILTINS_64?

Comment by Marko Mäkelä [ 2017-02-16 ]

The build log indicates that monitor_mutex should not be a culprit:

-- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS
-- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS - Success
-- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_BYTE
-- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_BYTE - Success
-- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_64
-- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_64 - Success
-- Performing Test HAVE_IB_GCC_SYNC_SYNCHRONISE
-- Performing Test HAVE_IB_GCC_SYNC_SYNCHRONISE - Success
-- Performing Test HAVE_IB_GCC_ATOMIC_THREAD_FENCE
-- Performing Test HAVE_IB_GCC_ATOMIC_THREAD_FENCE - Failed
-- Performing Test HAVE_IB_GCC_ATOMIC_TEST_AND_SET
-- Performing Test HAVE_IB_GCC_ATOMIC_TEST_AND_SET - Failed
-- Performing Test HAVE_IB_ATOMIC_PTHREAD_T_GCC
-- Performing Test HAVE_IB_ATOMIC_PTHREAD_T_GCC - Success

Comment by Marko Mäkelä [ 2017-02-16 ]

MDEV-12052 reports an assertion failure during shutdown, possibly related to a lost signal. A lost signal could theoretically hang the whole server.

Generated at Thu Feb 08 07:46:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.