Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-10804

main.stat_tables_par fails sporadically in buildbot

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.2(EOL)
    • 10.2.6
    • Tests
    • None

    Description

      http://buildbot.askmonty.org/buildbot/builders/p8-rhel71-bintar-debug/builds/1194/steps/test/logs/stdio

      main.stat_tables_par                     w1 [ fail ]
              Test ended at 2016-09-13 02:56:52
       
      CURRENT_TEST: main.stat_tables_par
      --- /home/buildbot/maria-slave/power8-vlp06-bintar-debug/build/mysql-test/r/stat_tables_par.result	2016-09-12 23:22:49.532112558 -0400
      +++ /home/buildbot/maria-slave/power8-vlp06-bintar-debug/build/mysql-test/r/stat_tables_par.reject	2016-09-13 02:56:52.463254631 -0400
      @@ -176,7 +176,6 @@
       dbt3_s001	lineitem	i_l_orderkey_quantity	2	1.0404
       dbt3_s001	lineitem	i_l_partkey	1	30.0250
       dbt3_s001	lineitem	i_l_receiptdate	1	2.6477
      -dbt3_s001	lineitem	i_l_shipdate	1	2.6500
       dbt3_s001	lineitem	i_l_suppkey	1	600.5000
       dbt3_s001	lineitem	i_l_suppkey_partkey	1	30.0250
       dbt3_s001	lineitem	i_l_suppkey_partkey	2	8.5786
      @@ -191,7 +190,6 @@
       dbt3_s001	lineitem	i_l_orderkey_quantity	2	1.0404
       dbt3_s001	lineitem	i_l_partkey	1	30.0250
       dbt3_s001	lineitem	i_l_receiptdate	1	2.6477
      -dbt3_s001	lineitem	i_l_shipdate	1	2.6500
       dbt3_s001	lineitem	i_l_suppkey	1	600.5000
       dbt3_s001	lineitem	i_l_suppkey_partkey	1	30.0250
       dbt3_s001	lineitem	i_l_suppkey_partkey	2	8.5786
       
      mysqltest: Result length mismatch
      

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova added a comment - Still happens. Most recent occurrence: http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/2081/steps/test/logs/stdio
            danblack Daniel Black added a comment -

            The diffs are in the results of these two (identical) queries: https://github.com/MariaDB/server/blob/10.2/mysql-test/t/stat_tables_par.test#L147..L154

            Seems to be a sparatic error however I did have it once.

            The removal of the disable_

            {result_log|warnings}

            for connection1/2 before these statements will probably provide the insight required. Both analyze statements should predictably return:

            Table  Op      Msg_type        Msg_text
            dbt3_s001.lineitem     analyze status  Engine-independent statistics collected
            dbt3_s001.lineitem     analyze status  OK
            

            I also noticed the following timeout which is further down in the test:

            | Id | User        | Host      | db        | Command | Time | State                                   | Info                                                                                                                               | Progress |
            | 15 | root        | localhost | dbt3_s001 | Query   |   98 | Waiting for table level lock            | analyze table lineitem persistent for all                                                                                          |    0.000 |
            | 16 | root        | localhost | dbt3_s001 | Query   |   98 | debug sync point: statistics_read_start | select * from mysql.index_stats, lineitem where index_name= 'i_l_shipdate' and l_orderkey=1 and l_partkey=68 order by prefix_arity |    0.000 |
            

            danblack Daniel Black added a comment - The diffs are in the results of these two (identical) queries: https://github.com/MariaDB/server/blob/10.2/mysql-test/t/stat_tables_par.test#L147..L154 Seems to be a sparatic error however I did have it once. The removal of the disable_ {result_log|warnings} for connection1/2 before these statements will probably provide the insight required. Both analyze statements should predictably return: Table Op Msg_type Msg_text dbt3_s001.lineitem analyze status Engine-independent statistics collected dbt3_s001.lineitem analyze status OK I also noticed the following timeout which is further down in the test: | Id | User | Host | db | Command | Time | State | Info | Progress | | 15 | root | localhost | dbt3_s001 | Query | 98 | Waiting for table level lock | analyze table lineitem persistent for all | 0.000 | | 16 | root | localhost | dbt3_s001 | Query | 98 | debug sync point: statistics_read_start | select * from mysql.index_stats, lineitem where index_name= 'i_l_shipdate' and l_orderkey=1 and l_partkey=68 order by prefix_arity | 0.000 |

            Original bug was fixed as part of MDEV-3891, but the test case had a bug that cause a random failure in debug builds while running tests.

            monty Michael Widenius added a comment - Original bug was fixed as part of MDEV-3891 , but the test case had a bug that cause a random failure in debug builds while running tests.
            monty Michael Widenius added a comment - - edited

            The reason for this is that the following code almost always causes a deadlock that will eventually be resolved with a timeout:

            connection con1;
            set debug_sync='statistics_update_start SIGNAL parker WAIT_FOR go1 EXECUTE 1';
            set debug_sync='thr_multi_lock_after_thr_lock SIGNAL go2 EXECUTE 2';
            use dbt3_s001;
            --send analyze table lineitem persistent for all

            connection con2;
            set debug_sync='open_and_process_table WAIT_FOR parker';
            set debug_sync='statistics_read_start SIGNAL go1 WAIT_FOR go2';
            use dbt3_s001;
            --send select * from mysql.index_stats, lineitem where index_name= 'i_l_shipdate' and l_orderkey=1 and l_partkey=68 order by prefix_arity;

            The deadlock happens because the SELECT will first run and take a read lock for index_stats, then the analyze statement will run and tries to take a write lock on table_stats, column_stats and index_stats.
            This will block as SELECT already have the lock.

            The tests works 'sometimes' depending on lock order in memory, as thr_multi_lock_after_thr_lock was done after the first lock.
            Another problem is that the sync point thr_multi_lock_after_thr_lock was hit multiple times by the analyze
            thread, which caused the sync point to become inactive before it was time to use it.

            Can be fixed by adding a new sync point just before all thr locks and ensuring that thr_multi_lock_before_thr_lock is not triggered for the first call.

            This should work as the bug fix for MDEV-3891 will notice that the select should not use statistics for this query and the analyze will continue after the select finishes.

            The attached patch fixes these issues

            monty Michael Widenius added a comment - - edited The reason for this is that the following code almost always causes a deadlock that will eventually be resolved with a timeout: connection con1; set debug_sync='statistics_update_start SIGNAL parker WAIT_FOR go1 EXECUTE 1'; set debug_sync='thr_multi_lock_after_thr_lock SIGNAL go2 EXECUTE 2'; use dbt3_s001; --send analyze table lineitem persistent for all connection con2; set debug_sync='open_and_process_table WAIT_FOR parker'; set debug_sync='statistics_read_start SIGNAL go1 WAIT_FOR go2'; use dbt3_s001; --send select * from mysql.index_stats, lineitem where index_name= 'i_l_shipdate' and l_orderkey=1 and l_partkey=68 order by prefix_arity; The deadlock happens because the SELECT will first run and take a read lock for index_stats, then the analyze statement will run and tries to take a write lock on table_stats, column_stats and index_stats. This will block as SELECT already have the lock. The tests works 'sometimes' depending on lock order in memory, as thr_multi_lock_after_thr_lock was done after the first lock. Another problem is that the sync point thr_multi_lock_after_thr_lock was hit multiple times by the analyze thread, which caused the sync point to become inactive before it was time to use it. Can be fixed by adding a new sync point just before all thr locks and ensuring that thr_multi_lock_before_thr_lock is not triggered for the first call. This should work as the bug fix for MDEV-3891 will notice that the select should not use statistics for this query and the analyze will continue after the select finishes. The attached patch fixes these issues

            Fixed by moving DEBUG_SYNC's to make things predictable and adjusting test

            This also reduce the time for the test from 300 seconds to a few seconds

            monty Michael Widenius added a comment - Fixed by moving DEBUG_SYNC's to make things predictable and adjusting test This also reduce the time for the test from 300 seconds to a few seconds

            People

              monty Michael Widenius
              elenst Elena Stepanova
              Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.