MariaDB Server / MDEV-10939

innodb.innodb_monitor failed with a timeout in buildbot on valgrind

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 10.0 (EOL)
    • Fix Version/s: N/A
    • Component/s: Tests
    • Labels: None

    Description

      According to the cross-reference, this is the first time the failure has ever happened.
      The test normally takes 6-7 seconds; this time it timed out after 9000 seconds, so it is not just the builder being especially slow.
      http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/9387/steps/test/logs/stdio

      innodb.innodb_monitor 'innodb_plugin'    w1 [ fail ]  timeout after 9000 seconds
              Test ended at 2016-10-01 05:24:30
       
      Test case timeout after 9000 seconds
       
      == /mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/1/log/innodb_monitor.log == 
      select name from information_schema.innodb_metrics
      where status != IF(name like "log%", 'enabled', 'disabled');
      name
      set global innodb_monitor_enable="os_%a_fs_ncs";
      set global innodb_monitor_enable="os%pending%";
      select name, status from information_schema.innodb_metrics
      where name like "os%";
      name	status
      os_data_reads	disabled
      os_data_writes	disabled
      os_data_fsyncs	enabled
      os_pending_reads	enabled
      os_pending_writes	enabled
      os_log_bytes_written	disabled
      os_log_fsyncs	disabled
      os_log_pending_fsyncs	enabled
      os_log_pending_writes	enabled
      set global innodb_monitor_enable="";
      ERROR 42000: Variable 'innodb_monitor_enable' can't be set to the value of ''
      set global innodb_monitor_enable="_";
       
       == /mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/1/tmp/analyze-timeout-mysqld.1.err ==
      mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local MySQL server through socket '/mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/tmp/1/mysqld.1.sock' (111 "Connection refused")
      


          Activity

            There is not much in the server error log either:

            Version: '10.0.28-MariaDB-debug'  socket: '/mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/tmp/1/mysqld.1.sock'  port: 16000  Source distribution
            2016-10-01 02:52:59 40cc950 [Info] InnoDB: the file format in the system tablespace is now set to Barracuda.
            2016-10-01 02:52:59 40cc950 [Info] InnoDB: the file format in the system tablespace is now set to Antelope.
            CURRENT_TEST: innodb.innodb_gis
            CURRENT_TEST: innodb.innodb_information_schema
            CURRENT_TEST: innodb.innodb_monitor
            161001  5:23:38 [ERROR] mysqld got signal 6 ;
            

            9000 seconds is 2½ hours, which roughly matches the timestamp difference above (02:52:59 + 2½ hours = 05:22:59).

            Even though the server was killed with -ABRT on the timeout, this was a Valgrind run, and Valgrind would require special measures to dump a core of the instrumented process.
            Apparently the hang occurs while executing the following SQL:

            # Reset counters only in "module_metadata" module
            set global innodb_monitor_disable = module_metadata;
            

            The subsystem MONITOR_MODULE_METADATA that this symbol refers to appears to be completely unused, even in MySQL 5.7.

            If the hang is related to the SQL statement (it could be something else too), then the most likely cause would be something in the function innodb_monitor_update(). That function sends some output to the client connection and some to the server error log, but it does not use any synchronization primitives explicitly.

            There is a monitor_mutex that serializes access to all monitor counters when HAVE_ATOMIC_BUILTINS_64 is not set. Maybe some other thread was stuck while holding this mutex? Does the build configuration on this platform define HAVE_ATOMIC_BUILTINS_64?

            Marko Mäkelä (marko) added a comment
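
            To make the monitor_mutex concern concrete, here is a minimal sketch of the fallback path described above. The names are hypothetical stand-ins, not MariaDB's actual code: the point is only that without 64-bit atomic builtins every counter update funnels through one global mutex, so a single stuck holder would also stall any SET GLOBAL innodb_monitor_* statement that touches the counters.

            #include <pthread.h>
            #include <stdint.h>

            /* Hypothetical stand-ins for the InnoDB monitor counter machinery. */
            static int64_t metadata_counter;                /* one monitor counter   */
            static pthread_mutex_t monitor_mutex =          /* global fallback mutex */
                PTHREAD_MUTEX_INITIALIZER;

            /* Increment a counter. With 64-bit atomic builtins the update is
               lock-free; without them every caller serializes on monitor_mutex,
               so a thread stuck while holding the mutex blocks everyone else,
               including the thread executing the SET GLOBAL statement. */
            static void monitor_inc(int64_t *counter)
            {
            #ifdef HAVE_ATOMIC_BUILTINS_64
                __sync_add_and_fetch(counter, 1);
            #else
                pthread_mutex_lock(&monitor_mutex);
                ++*counter;
                pthread_mutex_unlock(&monitor_mutex);
            #endif
            }

            int main(void)
            {
                monitor_inc(&metadata_counter);
                return 0;
            }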

            The build log indicates that monitor_mutex should not be a culprit:

            -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS
            -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS - Success
            -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_BYTE
            -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_BYTE - Success
            -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_64
            -- Performing Test HAVE_IB_GCC_ATOMIC_BUILTINS_64 - Success
            -- Performing Test HAVE_IB_GCC_SYNC_SYNCHRONISE
            -- Performing Test HAVE_IB_GCC_SYNC_SYNCHRONISE - Success
            -- Performing Test HAVE_IB_GCC_ATOMIC_THREAD_FENCE
            -- Performing Test HAVE_IB_GCC_ATOMIC_THREAD_FENCE - Failed
            -- Performing Test HAVE_IB_GCC_ATOMIC_TEST_AND_SET
            -- Performing Test HAVE_IB_GCC_ATOMIC_TEST_AND_SET - Failed
            -- Performing Test HAVE_IB_ATOMIC_PTHREAD_T_GCC
            -- Performing Test HAVE_IB_ATOMIC_PTHREAD_T_GCC - Success
            

            Marko Mäkelä (marko) added a comment
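
            For reference, the HAVE_IB_GCC_ATOMIC_BUILTINS_64 probe essentially checks that the compiler can emit 64-bit __sync operations. A simplified stand-in for that kind of test program (not the exact CMake source) could look like this:

            #include <stdint.h>

            /* If this compiles and exits with 0, 64-bit GCC atomic builtins are
               usable, and InnoDB can update its monitor counters lock-free
               instead of taking monitor_mutex around every increment. */
            int main(void)
            {
                int64_t x = 0;

                __sync_add_and_fetch(&x, 10);                   /* atomic x += 10  */
                if (!__sync_bool_compare_and_swap(&x, 10, 123))
                    return 1;                                   /* CAS must succeed */

                return (x == 123) ? 0 : 1;
            }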

            MDEV-12052 reports an assertion failure during shutdown, possibly related to a lost signal. A lost signal could theoretically hang the whole server.

            Marko Mäkelä (marko) added a comment
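
            To illustrate the lost-signal failure mode in general terms (generic pthread code, not MariaDB's): a waiter that blocks on a condition variable without re-checking its predicate in a loop sleeps forever if the wake-up arrives before the wait begins or is otherwise lost, which is exactly the kind of bug that can hang a whole server.

            #include <pthread.h>
            #include <stdbool.h>

            static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
            static bool work_done = false;

            /* Broken waiter: if the signal is delivered before pthread_cond_wait()
               is entered (or is consumed elsewhere), this blocks forever even
               though work_done is already true. */
            void broken_wait(void)
            {
                pthread_mutex_lock(&mtx);
                pthread_cond_wait(&cond, &mtx);   /* no predicate check */
                pthread_mutex_unlock(&mtx);
            }

            /* Robust waiter: re-check the predicate in a loop, so a lost or
               spurious wake-up cannot cause an indefinite hang. */
            void robust_wait(void)
            {
                pthread_mutex_lock(&mtx);
                while (!work_done)
                    pthread_cond_wait(&cond, &mtx);
                pthread_mutex_unlock(&mtx);
            }

            /* Signaller: set the predicate and signal under the mutex. */
            void signal_done(void)
            {
                pthread_mutex_lock(&mtx);
                work_done = true;
                pthread_cond_signal(&cond);
                pthread_mutex_unlock(&mtx);
            }

            int main(void)
            {
                /* Single-threaded demo of the safe pattern: the predicate is set
                   before the wait, so robust_wait() returns immediately. */
                signal_done();
                robust_wait();
                return 0;
            }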

            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Elena Stepanova (elenst)
              Votes: 0
              Watchers: 2
